Tonic
Common Usage

Names

De-identifying individuals' names can be accomplished in several ways in Tonic. The method selected largely depends on the specifics of the use case, including the required realism of the output and privacy needs. The following are a few of the generator options and how and why they might be used:
    Name Generator: Can provide complete privacy by randomly returning a name from a dictionary of primarily Westernized names, unrelated to the original value (unless making use of Consistency). The output is realistic because the values returned are real names.
    Categorical Generator: By "shuffling" all of the values in the field, it ensures that not only is the output composed of realistic looking names, but that the output is actually composed of the names in the original data set. This can be beneficial if the original data contains, for example, names common to a particular region that should be maintained. When used with the Differential Privacy option, it ensures the output is secure from re-identification. However, if the source data set is small or most names appear only once, Tonic may prevent you from using this option.
    Custom Categorical: Enables you to input your own dictionary of values which will be included in the output at the same frequency that they occur in the input list.
    Character Scramble: Randomly replaces characters with other characters, so while the output does not provide realistic looking names, it provides a high level of privacy that prevents recovering the original data. It preserves whitespace, punctuation (e.g. hyphenated names), and upper/lower casing. Because it is a character-level replacement, it also preserves the length of the input string.
    Character Substitution: Similar to Character Scramble, except that a single character mapping is used throughout the generated data. This reduces the privacy level, but ensures consistency and uniqueness. Additionally, this generator supports additional Unicode blocks so that the output characters more closely match the input, which may be helpful if the input includes names with characters outside of the basic Latin (a-z, A-Z, etc.) characters.
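To make the difference between the last two options concrete, here is a minimal Python sketch of the two ideas. It is not Tonic's implementation: the function names (scramble, make_substitution_table, substitute) are hypothetical, and for brevity it only handles ASCII letters and digits, whereas the Character Substitution generator described above covers additional Unicode blocks.

```python
import random
import string


def scramble(value: str, rng: random.Random) -> str:
    """Replace each ASCII letter or digit with a random one of the same class.

    Whitespace, punctuation, and (in this sketch) non-ASCII characters pass
    through unchanged, so length, casing pattern, and hyphens are preserved.
    """
    out = []
    for ch in value:
        if ch.isascii() and ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.isascii() and ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isascii() and ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


def make_substitution_table(rng: random.Random) -> dict:
    """Build one fixed, one-to-one character mapping to reuse for every value."""
    table = {}
    for alphabet in (string.ascii_uppercase, string.ascii_lowercase, string.digits):
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        table.update(str.maketrans(alphabet, "".join(shuffled)))
    return table


def substitute(value: str, table: dict) -> str:
    """Apply the shared mapping; equal inputs always give equal outputs."""
    return value.translate(table)


if __name__ == "__main__":
    rng = random.Random()
    table = make_substitution_table(rng)
    for name in ["Mary-Anne O'Neil", "Mary-Anne O'Neil", "John Smith"]:
        print(name, "|", scramble(name, rng), "|", substitute(name, table))
```

In this sketch, scrambling draws fresh random characters on every call, so repeated inputs generally produce different outputs. Substitution reuses one mapping, so repeated inputs map to the same output and distinct inputs stay distinct, which is the consistency and uniqueness trade-off noted above.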

Dates/Events/Timestamps

It's common to have rows of data with multiple date/timestamp fields which have a logical dependency (e.g. START_DATE and END_DATE). In this case a randomly generated date is not viable as it could produce a nonsensical output where a secondary event occurs chronologically out of order. There are a few generator options to handle these scenarios:
    Timestamp Shift Generator (with Consistency): The problem described above can be solved by ensuring that two (or more) timestamps are randomly shifted by the same amount rather than independently from one another. The use of the Consistency option is key here. Take for example a row of data representing an individual identified by a primary key of PERSON_ID with START_DATE and END_DATE columns. Each of these columns can have a timestamp shift applied within the desired range with PERSON_ID as the "consistent to" column. Any time the same PERSON_ID value is encountered the dates will be shifted by the same amount.
    Events Generator: The “Events” generator can be applied to multiple date columns on the same table and they can be linked together to follow the underlying distribution of dates. For more information see our blog post here: https://www.tonic.ai/blog/simulating-event-pipelines
    Date Truncation Generator: This generator can solve the described problem in some cases. It can be configured to truncate the input to the Year, Month, Day, Hour, Minute, or Second. It guarantees that a secondary event will not occur before a primary event, but truncation may result in both having the same date/timestamp. Whether this generator is suitable depends heavily on the typical time separation between the two events relative to the truncation option, as well as on whether truncation provides an adequate level of privacy for the particular use case.
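The consistent shift and truncation behaviors described above can be illustrated with a short Python sketch. This is an assumption-laden illustration, not how Tonic implements its generators: the names consistent_shift, shift_row, truncate, and SECRET_KEY are hypothetical, and deriving the offset from a keyed hash of PERSON_ID is just one way to get the same shift every time the same key value is seen.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

SECRET_KEY = b"example-secret"  # hypothetical; stands in for a per-workspace secret


def consistent_shift(key_value, max_days: int = 30, secret: bytes = SECRET_KEY) -> timedelta:
    """Derive a shift in [-max_days, +max_days] that depends only on key_value."""
    digest = hmac.new(secret, str(key_value).encode("utf-8"), hashlib.sha256).digest()
    span_seconds = 2 * max_days * 86400 + 1
    offset = int.from_bytes(digest[:8], "big") % span_seconds - max_days * 86400
    return timedelta(seconds=offset)


def shift_row(row: dict, key_column: str, date_columns: list, max_days: int = 30) -> dict:
    """Shift every listed date column by the same amount, keyed on key_column."""
    shift = consistent_shift(row[key_column], max_days)
    shifted = dict(row)
    for col in date_columns:
        shifted[col] = row[col] + shift
    return shifted


_ORDER = ["year", "month", "day", "hour", "minute", "second", "microsecond"]
_RESET = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0, "microsecond": 0}


def truncate(ts: datetime, unit: str) -> datetime:
    """Keep fields down to `unit` and reset every finer-grained field."""
    keep = _ORDER.index(unit) + 1
    return ts.replace(**{field: _RESET[field] for field in _ORDER[keep:]})


if __name__ == "__main__":
    row = {
        "PERSON_ID": 42,
        "START_DATE": datetime(2021, 1, 5, 9, 0),
        "END_DATE": datetime(2021, 3, 1, 17, 30),
    }
    out = shift_row(row, "PERSON_ID", ["START_DATE", "END_DATE"])
    # The interval between the two events is unchanged by the shared shift.
    print(out["END_DATE"] - out["START_DATE"] == row["END_DATE"] - row["START_DATE"])
    # Truncating both dates to the year collapses them to the same value.
    print(truncate(row["START_DATE"], "year"), truncate(row["END_DATE"], "year"))
```

Because both columns receive the same offset, their order and the interval between them are preserved. Truncation also never reverses the order, but as the second print shows, a coarse unit can make two distinct events identical.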