Generators

Overview

What are generators?

Generators are Tonic's core configuration item for assigning a transformation to sensitive data--protecting privacy while enabling utility.

Tonic offers a variety of generators suited to different types of data. When considering which generators to use, it's helpful to be familiar with the following options, which are shared by many of Tonic's generators:

  • Consistency: For ensuring the same input value maps to the same output value--across multiple columns, tables, and databases (even of different types)

  • Linking: For columns that have an interdependency or correlation that should be maintained in the output

  • Differential Privacy: For ensuring the output does not reveal anything attributable to a specific member of the input dataset.
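As a rough illustration of the Consistency option, the sketch below derives each output from a keyed hash of the input, so the same input always yields the same output. The function name `consistent_pick`, the secret key, and the replacement pool are hypothetical; this is one way to obtain the property, not a description of Tonic's implementation.

```python
import hashlib
import hmac

# Hypothetical replacement pool and secret key; neither comes from Tonic.
REPLACEMENTS = ["Samantha", "Bob", "Priya", "Miguel", "Chen"]
SECRET_KEY = b"example-secret"

def consistent_pick(value: str, pool=REPLACEMENTS, key: bytes = SECRET_KEY) -> str:
    """Deterministically map an input value to one entry in the pool."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]

# The same input maps to the same output wherever it appears.
first = consistent_pick("Jane")
second = consistent_pick("Jane")
```

Because the output depends only on the input and the key, the mapping holds across columns, tables, and databases that share the key.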

List of Generators

Address

Generates a random address-like string. This can be applied to various parts of an address string. For example, just the street address, just the city, or a full address string. The full list of address component options available is:

  • Building Number

  • Cardinal Direction

  • City

  • City Prefix

  • City Suffix

  • Country

  • Country Code

  • County

  • Direction

  • Full Address

  • Latitude

  • Longitude

  • Ordinal Direction

  • Secondary Address

  • State

  • State Abbr

  • Street Address

  • Street Suffix

  • Street Name

  • Zip Code

The following address components can be linked to ensure that multiple address generators on different columns produce valid sets of values:

  • City

  • State

  • State Abbr

  • Zip Code

  • Latitude

  • Longitude

The Address generator also supports Consistency.

Algebraic

The algebraic generator identifies the algebraic relationship between 3 or more numeric values (at least one non-integer) and generates new values to match. If a relationship cannot be found it defaults to the categorical generator. This generator can be linked with another algebraic generator.

Alphanumeric String Key

Generates unique alphanumeric strings of the same length as the input. This generator can be made consistent.

Array JSON Mask

Runs a generator on values that match a user-specified JSONPath.

Array Character Scramble

This generator replaces letters with random other letters and numbers with random other numbers; punctuation and whitespace are preserved. This generator securely masks letters and numbers; there is no way to recover the original data. This generator can be made consistent.

ASCII Key

Generates unique alphanumeric strings based on any printable ASCII characters. The length of the string is not preserved. This generator can be made consistent.

Categorical

A categorical generator produces the same values, at the same frequencies, as the underlying data. Another way to think of this is that it shuffles the existing values within a field, maintaining the values and their frequencies while disassociating them from other pieces of data. This generator can be linked to maintain relationships between multiple columns and can be made differentially private.
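The shuffling view of the categorical generator can be sketched in a few lines. The helper `shuffle_column` and the sample data are hypothetical, and the sketch ignores linking and differential privacy.

```python
import random
from collections import Counter

def shuffle_column(values, seed=None):
    """Return the same values in a random order; frequencies are unchanged."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

plans = ["free", "pro", "free", "enterprise", "free", "pro"]
masked = shuffle_column(plans, seed=7)

# Value counts are identical before and after the shuffle.
same_frequencies = Counter(masked) == Counter(plans)
```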

Character Scramble

This generator replaces letters with random other letters and numbers with random other numbers. Punctuation, whitespace, and mathematical symbols are preserved. This generator securely masks letters and numbers; there is no way to recover the original data. This generator can be made consistent.
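A minimal sketch of this behavior follows. The function name `character_scramble` is hypothetical, and two simplifications are assumptions rather than documented behavior: letter case is kept, and non-ASCII characters are passed through unchanged.

```python
import random
import string

def character_scramble(text: str, rng=None) -> str:
    """Replace ASCII letters and digits at random; keep everything else."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # punctuation, whitespace, and symbols are preserved
    return "".join(out)

source_text = "Order #A12-b7 (x + 3)"
scrambled = character_scramble(source_text, random.Random(5))
```

Each replacement is drawn independently of the input character, so the output carries no information that could be used to recover the original letters and digits.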

Character Substitution

Random character replacement that preserves formatting (spaces, capitalization, and punctuation). Characters are replaced with other characters from within the same Unicode Block. This generator is implicitly consistent as every occurrence of a character will always map to the same substitute character. As such, it can be used to preserve a join between two text columns, such as a join on a name or email.
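To show why a fixed per-character mapping preserves joins, here is a small sketch. It is restricted to ASCII letters and digits for brevity, whereas the generator described above works within Unicode blocks; `build_substitution_table` and `substitute` are hypothetical names.

```python
import random
import string

def build_substitution_table(seed: int) -> dict:
    """Build one fixed mapping per character class (lowercase, uppercase, digits)."""
    rng = random.Random(seed)
    table = {}
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase, string.digits):
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        table.update(zip(alphabet, shuffled))
    return table

def substitute(text: str, table: dict) -> str:
    """Apply the table; characters outside it (spaces, '@', '.') are kept."""
    return "".join(table.get(ch, ch) for ch in text)

table = build_substitution_table(seed=42)
left = substitute("jane.doe@example.com", table)   # e.g. a users.email column
right = substitute("jane.doe@example.com", table)  # e.g. an orders.email column
```

Because both columns pass through the same table, equal inputs remain equal after masking, which is what keeps the join intact.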

Company Name

Generates a random company-name-like string. This generator can be made consistent.

Conditional

Apply different generators to rows conditionally based on any value in this table.

Constant

Generates a single value (based on user input) that is used to mask all values in the column. Accepts a value compatible with the field type, e.g. string, numeric, date, etc.

Continuous

Generates a continuous distribution using a normal distribution to fit the underlying data. This generator can be linked to other Continuous Generators to create multi-variate distributions and can be partitioned by other columns. The continuous generator can be made differentially private.
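The basic idea of fitting a normal distribution and sampling from it can be sketched as follows. This is a single-column illustration only; it does not cover linking, partitioning, or differential privacy, and `fit_and_sample` is a hypothetical helper.

```python
import random
import statistics

def fit_and_sample(values, n, seed=None):
    """Fit a normal distribution to the values and draw n new samples."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

salaries = [52000.0, 61000.0, 58500.0, 70250.0, 49900.0]
synthetic = fit_and_sample(salaries, n=5, seed=1)
```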

Cross Table Sum

The cross table sum generator links columns in two tables. This column will be the sum of the values in another column. There is no preview for this generator as the sums cannot be computed until the other table is generated.

Custom Categorical

A categorical generator that selects from values you provide. This generator can be made consistent.

Date Truncation

Truncates dates to a specified date part.
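As an illustration, truncation to a date part might look like the sketch below. The set of supported parts ("year", "month", "day") and the name `truncate_date` are assumptions made for this example.

```python
from datetime import datetime

def truncate_date(ts: datetime, date_part: str) -> datetime:
    """Zero out every component finer than the requested date part."""
    if date_part == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if date_part == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if date_part == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported date part: {date_part}")

truncated = truncate_date(datetime(2021, 6, 15, 13, 45, 10), "month")
```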

Email

This generator scrambles characters while preserving formatting and keeping the '@' and '.' characters. This generator securely masks letters and numbers; there is no way to recover the original data. There are two optional parameters:

  • Email Domain: Enter a domain that all output values will use, e.g. to ensure that all generated values end in '@mycompany.com'. The section before the '@' will be scrambled.

  • Excluded Email Domain: Any emails including this domain will not be masked. Useful for maintaining internal/testing emails which are not considered sensitive.

This generator can be made consistent.
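The two optional parameters can be sketched as follows. The function and parameter names are hypothetical, and treating "including this domain" as an exact, case-insensitive match on the part after the '@' is an assumption.

```python
import random
import string

def _scramble(text, rng):
    """Swap letters for letters and digits for digits; keep everything else."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # '@', '.', and other symbols pass through
    return "".join(out)

def mask_email(email, email_domain=None, excluded_domain=None, rng=None):
    rng = rng or random.Random()
    local, sep, domain = email.partition("@")
    if excluded_domain and domain.lower() == excluded_domain.lower():
        return email  # excluded domains are left unmasked
    if email_domain:
        return _scramble(local, rng) + "@" + email_domain  # fixed output domain
    return _scramble(local, rng) + sep + _scramble(domain, rng)

kept = mask_email("qa@internal.test", excluded_domain="internal.test")
forced = mask_email("jane.doe@corp.io", email_domain="mycompany.com")
plain = mask_email("jane.doe@corp.io")
```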

Events

Generates timestamps fitting an event distribution. Link columns to create a sequence of events across multiple columns. This generator can be partitioned by other columns.

File Name

This generator scrambles characters while preserving formatting and keeping the file extension intact. This generator securely masks letters and numbers; there is no way to recover the original data. This generator can be made consistent.

Find and Replace

This generator replaces all instances of the find string with the replace string. If "Use regex" is enabled, use backslash ( \ ) as the escape character.

HIPAA Address

This generator can be used to generate cities, states, and zip codes that follow HIPAA guidelines for safe harbor.

Zip Codes

When generating zip codes, we examine the underlying zip code in the column and replace the last two digits with 0. If the zip code is in a low population area, as designated by the current census, we instead replace all digits in the zip code with 0.
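The zip code rule can be sketched as below, in line with the 30305 to 30300 example given in the Cities section, where the first three digits are kept. The prefix set here is an illustrative placeholder, not an authoritative list; a real implementation would load the low-population prefixes from current census data. The name `hipaa_zip` is hypothetical, and the input is assumed to be a five-digit string.

```python
# Illustrative placeholder only; not an authoritative list of
# low-population 3-digit zip prefixes.
LOW_POPULATION_PREFIXES = {"036", "059", "102"}

def hipaa_zip(zip_code: str, low_population=LOW_POPULATION_PREFIXES) -> str:
    """Keep the first three digits unless the area is low population."""
    prefix = zip_code[:3]
    if prefix in low_population:
        return "00000"      # low population: zero out every digit
    return prefix + "00"    # otherwise keep the 3-digit prefix only

masked_zip = hipaa_zip("30305")
masked_low_population = hipaa_zip("03601")
```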

Cities

When a zip code column has not been linked, we choose a random city in the United States. When a zip code has already been added to the link, however, we choose a city at random that has at least some overlap with the zip code.

For example, if the original city and zip code were (Atlanta, 30305), we would replace the zip code with 30300. Many cities contain zip codes beginning with 303, such as Atlanta, Decatur, Chamblee, Hapeville, Dunwoody, and College Park. One of these cities is chosen at random, so the final value might be (Chamblee, 30300).

If the original zip code is designated as a low population area, we choose a random city within the state, but only if the user has linked a State column. If they have not, we choose a random city anywhere in the United States.

States

HIPAA guidelines allow information at the state level to be kept. Therefore, we pass these values through unchanged.

Other address parts

All other address parts are generated randomly, so their values are not influenced at all by the underlying value in the column.

Hostname

Generates random host names, based on the English language. This generator can be made consistent.

Integer Key

Generates integer values between 0 and 2^32 - 1. Input values must also fall within the range 0 to 2^31 - 1. This generator can be made consistent.

IP Address

Generates a random IP address formatted string. This generator supports both IPv4 and IPv6 addresses through an option to specify the ratio of IPv4 vs IPv6 addresses in the output. This generator can be made consistent.

JSON Mask

Runs a generator on values that match a user-specified JSONPath.

MAC Address

Generates a random MAC address formatted string. This generator can be made consistent.

Name

Generates a random name string from a dictionary of first and last names. When applying this generator a name format must be selected from the following options:

  • First (Note: Also commonly used for standalone "Middle Name" fields)

  • Last

  • First Last

  • First Middle Last

  • Last, First

  • Last, First Middle

The Name generator supports Consistency, which can be used in a few ways:

  • Name columns made consistent to another column: Making "First Name" and "Last Name" columns consistent to another unique identifier column for an individual. For example, this can be used to ensure that every row with user_id = 123 whose name is "John Smith" will map to "Bob Jones".

  • A name column consistent to itself: In this case the same input name always maps to the same output, e.g. "Jane" in the source always maps to "Samantha" in the output.

  • In both of the above scenarios, using the Consistency option will also ensure that separate name component columns (First, Middle, Last) appropriately match a separate "Full Name" column.

Null

Generates NULL values to fill the rows of the specified column.

Numeric String Key

Generates unique numeric strings of the same length as the input, i.e. makes use of format preserving encryption. This is a Key generator that ensures unique outputs. This generator can only be applied to columns which contain strings that are numeric. This generator can be made consistent.

Passthrough

Default option, does not mask data.

Phone

Generates a random phone number that matches the country/region of the input phone number while maintaining the format, e.g. (123) 456-7890 vs. 123-456-7890. If the input is not a valid phone number, numeric characters will be randomly replaced. The "replace invalid numbers" option can be used to alter this behavior to ensure the output is a valid phone number. By default these are US phone numbers. Generated numbers will pass Google's libphonenumber verification as long as the input is a valid phone number or the "replace invalid numbers" option is used. This generator can be made consistent.

Random Boolean

Generates a random boolean value. This generator can be linked with other random boolean generators.

Random Double

Generates a random double number between the specified min and max. For this generator, the "min" is inclusive and the "max" is exclusive.

Random Hash

Generates a random hash string.

Random Integer

Returns a random integer between the specified min and max. For this generator, the "min" is inclusive and the "max" is exclusive.
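The inclusive "min" and exclusive "max" correspond to a half-open range, which Python's `randrange` models directly. The wrapper name `random_integer` is hypothetical.

```python
import random

def random_integer(minimum: int, maximum: int, rng=None) -> int:
    """Return an integer n with minimum <= n < maximum."""
    rng = rng or random.Random()
    return rng.randrange(minimum, maximum)  # maximum itself is never returned

# With min=1 and max=4, the only possible outputs are 1, 2, and 3.
rng = random.Random(0)
observed = {random_integer(1, 4, rng) for _ in range(1000)}
```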

Random Timestamp

Generates random dates, times, and timestamps.

Random UUID

Generates a random new UUID string.

Regex Mask

Uses regular expressions to parse strings and replace specified substrings with output of other generators. Parts of string to be replaced are specified inside unnamed top-level capture groups. In the case that multiple regular expressions match a given string, the first defined regular expression (and the sub-generators it specifies) will be applied.

For example, if a cell contained the string ProductId:123-BuyerId:234, it's possible to capture the substrings 123 and 234 with the regular expression ^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$. This captures the two occurrences of three-digit numbers in the pattern ProductId:xxx-BuyerId:xxx, making it possible to define a sub-generator on neither, either, or both of these captured substrings.

We could also define a second, broader expression that matches more cell values: ^(\w*).(\d*).(\w*).(\d*)$. This captures pairs of words ((\w*)) and numbers ((\d*)) whenever a single character of any value separates them, rather than requiring the more specific pattern of the first expression.

The first expression defined that matches the cell (in our example, ^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$) has its associated sub-generators applied, even if multiple expressions match and even if no sub-generators are defined on the matched expression. Defining multiple expressions allows for attaching completely different sets of sub-generators to a given cell depending on its value. For a reference on regular expressions in C#, see here.
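The first-match-wins behavior can be sketched with the two expressions from the example. This uses Python's `re` module purely for illustration (the generator itself follows C# regular expression syntax). The `regex_mask` function, the rule format, and the stand-in sub-generators are hypothetical, and returning the value unchanged when nothing matches is an assumption.

```python
import re

def regex_mask(value, rules):
    """rules: list of (pattern, [sub-generator or None for each capture group])."""
    for pattern, generators in rules:
        match = re.match(pattern, value)
        if match is None:
            continue
        # The first matching pattern wins, even if it defines no sub-generators.
        out, cursor = [], 0
        for group_index, generator in enumerate(generators, start=1):
            start, end = match.span(group_index)
            out.append(value[cursor:start])  # text between capture groups
            captured = match.group(group_index)
            out.append(generator(captured) if generator else captured)
            cursor = end
        out.append(value[cursor:])
        return "".join(out)
    return value  # no pattern matched (assumed behavior)

rules = [
    # Specific pattern: mask only the BuyerId digits.
    (r"^ProductId:([0-9]{3})-BuyerId:([0-9]{3})$", [None, lambda s: "000"]),
    # Broader pattern: mask the two words, keep the numbers.
    (r"^(\w*).(\d*).(\w*).(\d*)$", [lambda s: "X", None, lambda s: "Y", None]),
]

specific = regex_mask("ProductId:123-BuyerId:234", rules)
broad = regex_mask("Order:77-Buyer:9", rules)
```

The first input matches both expressions, but only the first expression's sub-generators run; the second input falls through to the broader expression.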

Sequential Integer

Generates a column of unique integer values. The starting value is 0, and each subsequent value increases by 1. This generator can be linked with other sequential integer generators.

Shipping Container

Generates values of ISO 6346 compliant shipping container codes. All generated codes are in the freight category ("U"). This generator can be made consistent.

SIN

Generates a new valid Canadian Social Insurance Number that preserves formatting (non-digit characters). This generator can be made consistent.

SSN

Generates a new valid United States Social Security Number. This generator can be made consistent.

Timestamp Shift Generator

Shifts timestamps by a random amount of a specific unit of time, within a set range. Includes the following configuration options:

  • Date Part: The unit of time of the minimum and maximum shift (e.g. days).

  • Minimum Shift: The lower bound the value can be shifted from the original value.

  • Maximum Shift: The upper bound the value can be shifted from the original value.

This generator can be made consistent.
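The three configuration options map naturally onto a small function. In this sketch the name `shift_timestamp` is hypothetical, the date part must be a keyword accepted by `datetime.timedelta` (such as "days", "hours", or "minutes"), and treating both bounds as inclusive is an assumption.

```python
import random
from datetime import datetime, timedelta

def shift_timestamp(ts, date_part="days", minimum_shift=-3, maximum_shift=3, rng=None):
    """Shift ts by a random whole number of date_part units within the range."""
    rng = rng or random.Random()
    amount = rng.randint(minimum_shift, maximum_shift)  # both ends inclusive
    return ts + timedelta(**{date_part: amount})

original = datetime(2021, 6, 15, 12, 0, 0)
shifted = shift_timestamp(original, "days", -3, 3)
```

A consistent variant could derive `amount` from a keyed hash of the input instead of a random draw, so that the same timestamp always receives the same shift.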

Smart Linking

Uses deep neural networks for high-fidelity data mimicking — see here.

Unique Email

Generates unique e-mail addresses by replacing the username with a randomly generated GUID and masking the domain with a character scramble. This generator only guarantees uniqueness if the underlying column is unique. This generator can be made consistent.

URL

This is a substitution cipher that preserves formatting and keeps the URL scheme and top-level domain intact. This mask is not secure.

UUID Key

Generates UUIDs on Primary Key columns. All FK columns referencing this column will automatically have their UUID values masked as well.

XML Mask

Runs a generator on values that match a user-specified XPath.