Using the Link block

Data anonymization is important for privacy protection and for legal and ethical compliance. It lets organizations and engineers use and share data securely, supports development and testing, and mitigates the risks associated with data processing. Anonymized data can be shared with researchers, analysts, and third parties without compromising privacy. Developers and testers, in turn, often need realistic data to verify that applications and systems work properly, and anonymization gives them that data without exposing real user information. nxs-data-anonymizer works on these principles.

nxs-data-anonymizer is a tool developed to anonymize PostgreSQL and MySQL/MariaDB/Percona database dumps. It is beneficial for development teams or projects that need to work with production and test databases while ensuring security and preventing leaks.

One of the key elements of working with anonymized data is maintaining its integrity and consistency. This requirement became the basis for the link block in our tool. The idea arose after one of our users contacted us on the Telegram channel with a specific case.

The case was as follows: the data had to be transformed so that user X, who appears in various tables of the database, would be mapped to A, user Y to M, and so on. The values A and M can be randomly generated or defined as static values through filters, but in either case they must be consistent for X and Y throughout the entire database.

Was this possible without the link block? Yes, with the built-in nxs-data-anonymizer filters of the command type. Was it convenient? Not really: implementing such a solution took a lot of work. We won't go into the details of that approach, because the link block now handles this case directly.

The link block in the nxs-data-anonymizer configuration is used to create consistent data in different cells of the database. It ensures that cells with the same data before anonymization will have the same data after anonymization. A block can contain multiple tables and columns, and a common rule is applied to create new values in it.
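The idea behind this guarantee can be sketched in a few lines of Python. This is only an illustration of the principle, not the tool's actual implementation: the LinkedReplacer class, the logins and payments tables, and the value range are all hypothetical.

```python
import random


class LinkedReplacer:
    """Hypothetical helper that remembers the replacement chosen for each original value."""

    def __init__(self, generate):
        self._generate = generate  # callable that produces a new value
        self._mapping = {}         # original value -> anonymized value

    def replace(self, original):
        # Generate a replacement only the first time a value is seen;
        # every later occurrence reuses the stored one.
        if original not in self._mapping:
            self._mapping[original] = self._generate()
        return self._mapping[original]


rng = random.Random(0)
replacer = LinkedReplacer(lambda: rng.randrange(1, 1_000_000))

# Two made-up tables that both refer to users X and Y.
logins = [{"user": "X"}, {"user": "Y"}, {"user": "X"}]
payments = [{"user": "Y"}, {"user": "X"}]

anon_logins = [{"user": replacer.replace(row["user"])} for row in logins]
anon_payments = [{"user": replacer.replace(row["user"])} for row in payments]
```

Because both tables go through the same replacer, every occurrence of X receives one value and every occurrence of Y receives another, no matter which table it appears in.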

Beyond that user's case, here is why and when such a feature is useful. In a database with user information, the link block can ensure that user IDs in different tables (e.g., Orders and Contact Information) remain consistent after anonymization, so orders stay correctly linked to anonymized users. The same principle applies to data in any sector: Fintech, Foodtech, Medtech, etc. In healthcare databases, a patient ID can appear in multiple tables (e.g., patient histories, appointments, prescriptions). The link block ensures that the anonymized patient ID is the same in all of these tables, preserving the relationships between the data.

Block’s config

Let's move on to the block itself and its role in the configuration. Each element of the link list has the following properties:
value (inside rule): The value used to replace each cell in the linked columns. Depending on the type, this can be a Go template or a shell command.
unique (inside rule): If set to true, the generated value is guaranteed to be unique across all columns specified in the link element.
with: The tables and columns to be linked.

The configuration will look as follows:

security:
  policy:
    tables: skip
    columns: skip

link:
- rule:
    value: "{{ randInt 1 50 }}"
    unique: true
  with:
    authors:
    - id
    posts:
    - author_id

filters:
  authors:
    columns:
      first_name:
        value: "{{- randAlphaNum 20 -}}"
      last_name:
        value: "{{- randAlphaNum 20 -}}"
      birthdate:
        value: "1999-12-31"
      added:
        value: "2000-01-01 12:00:00"
  posts:
    columns:
      id:
        value: "{{ randInt 1 100 }}"
        unique: true

In our example, the id column in the authors table is linked to the author_id column in the posts table.
The order of tables in the dump does not affect data replacement: once a value has been replaced while processing one table, it is not generated again when the next table is processed. The replacement already produced for the linked column is reused instead.
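A small Python sketch can illustrate why the order of tables does not matter. It assumes a single in-memory mapping shared by all linked columns, which is a simplification and not a description of the tool's internals. The anonymize and links_preserved functions, the sample rows, and the counter used in place of a random generator are all made up for the example.

```python
import itertools


def anonymize(dump, linked_columns, next_value):
    """Replace linked columns table by table, in the order the dump lists them."""
    mapping = {}  # original value -> replacement, shared by all linked columns
    result = {}
    for table, rows in dump:
        anonymized_rows = []
        for row in rows:
            new_row = dict(row)
            for column in linked_columns.get(table, []):
                original = row[column]
                if original not in mapping:
                    mapping[original] = next_value()
                new_row[column] = mapping[original]
            anonymized_rows.append(new_row)
        result[table] = anonymized_rows
    return result


def links_preserved(authors, posts, result):
    """Check that every post still points at the author it pointed at originally."""
    new_id = {a["id"]: r["id"] for a, r in zip(authors, result["authors"])}
    return all(
        anon["author_id"] == new_id[orig["author_id"]]
        for orig, anon in zip(posts, result["posts"])
    )


# Made-up sample rows.
authors = [{"id": 1}, {"id": 2}]
posts = [
    {"id": 10, "author_id": 2},
    {"id": 11, "author_id": 1},
    {"id": 12, "author_id": 2},
]
linked = {"authors": ["id"], "posts": ["author_id"]}


def run(dump):
    counter = itertools.count(100)  # deterministic stand-in for a random generator
    return anonymize(dump, linked, lambda: next(counter))


authors_first = run([("authors", authors), ("posts", posts)])
posts_first = run([("posts", posts), ("authors", authors)])
```

The concrete replacement values differ between the two runs, but in both of them each post keeps pointing at the same author as before.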

The security block allows you to skip the anonymization of tables and columns that are not described. This is useful in cases where we need the original data for further work or if the data is not sensitive.
The rule block specifies that a random value between 1 and 50 should be generated for the associated columns.
The unique property ensures that the generated value is unique for all specified columns.
The with block lists the tables and columns to be linked. In our case, the id column in the authors table and the author_id column in the posts table will receive the same replacement for the same original value after anonymization.
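One practical consequence of combining unique with a narrow range is that the range must contain at least as many values as there are distinct originals. The Python sketch below illustrates this with a hypothetical unique_value helper. The retry loop and the error on exhaustion are assumptions made for the example, not the tool's documented behavior. It draws integers from 1 to 49 with randrange(1, 50), since this article does not state whether the template's upper bound is inclusive.

```python
import random


def unique_value(generate, used, max_attempts=1000):
    """Draw values until one has not been used yet (hypothetical retry strategy)."""
    for _ in range(max_attempts):
        candidate = generate()
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("could not generate a unique value; the range may be exhausted")


rng = random.Random(42)
used = set()

# 20 distinct replacements drawn from the integers 1..49.
values = [unique_value(lambda: rng.randrange(1, 50), used) for _ in range(20)]
```

If a database has more distinct author IDs than the range can supply, a generator like this eventually runs out of free values, so it is worth choosing a range comfortably larger than the number of rows.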

The remaining columns described in filters do not need to be linked to each other, so their values can be anonymized with random or specific static values.

Consider the following database example.

[Tables “authors” and “posts” before anonymization]

[Tables “authors” and “posts” after anonymization]

To summarize, here are the guarantees and benefits the link block provides:

Maintaining data integrity and security:
When anonymizing sensitive data, it is important to preserve the relationships between tables. For example, if a user ID in one table references another table, the link block ensures that the anonymized user ID is the same in both. This matters for complying with data protection rules, under which sensitive data must be anonymized without losing its value for testing and analysis.
Consistency across tables for better data management:
The link block keeps linked data linked, which applications that rely on consistency across multiple tables require. This is important for testing and development environments where application behavior must be tested on realistic data. Because a replacement applied to one column is automatically applied to the linked columns, no manual updates are needed to keep them in sync.
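One way to confirm these guarantees on an anonymized dump is a simple referential integrity check, sketched below in Python. The orphaned_references function and the sample rows are hypothetical. In practice the same check could be run as a SQL query against the restored database.

```python
def orphaned_references(parent_rows, parent_key, child_rows, child_key):
    """Return child rows whose reference does not match any parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in parent_ids]


# Made-up anonymized rows where the link was applied consistently.
authors = [{"id": 17}, {"id": 42}]
posts = [{"id": 5, "author_id": 42}, {"id": 8, "author_id": 17}]
consistent = orphaned_references(authors, "id", posts, "author_id")

# Made-up rows where author_id was anonymized independently of authors.id.
broken_posts = [{"id": 5, "author_id": 99}, {"id": 8, "author_id": 17}]
broken = orphaned_references(authors, "id", broken_posts, "author_id")
```

An empty result means every post still refers to an existing author. Any rows returned point to references that were anonymized without a link.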

The link block in nxs-data-anonymizer maintains data consistency and referential integrity during anonymization. It saves time and reduces the chance of errors that would distort the resulting data, which makes the tool well suited to teams that need to secure data without compromising its integrity and usability.

We would love to hear your opinions and the needs of the community. This will help us develop and improve the tool and understand how useful its features are to you. What other development opportunities do you see? What options would you add? We're open to any questions and comments in the Telegram chat, here in the comments, or on GitHub!
