Guided Data Access Patterns: A Deal Breaker for Data Platforms
Dr. Malte Polley
Posted on May 15, 2024
I am now sure that even the best tech stack with the best people using it is of no use to an organization if it is not clear how access to the data and the tech stack works. This blog post should therefore be seen as a supplement to the two previous ones, which were more technical. In the first one, I talked about a Zero-ETL approach for integrating Snowflake and Amazon S3, and in the second one about integrating Matillion for complex data transformation and data loading processes.
This blog post is more like a diary entry and addresses the two main problems that I have experienced. However, there are also possible solutions, which I will talk about as well.
No Use Case, No KPIs, No Success
Every use case is about a goal. Yeah, right. I would always prioritize the use case. Simply integrating systems does not help any organization. Even more clearly: a cloud environment just costs money and has no benefit if you don't start from the use case. So ask yourself: what is your use case, or what is your requester's use case?
This doesn't just help with the implementation of a project. It also helps you in the evaluation when it comes to project completion. Therefore: no use case, no KPI, no success assessment.
Even worse: if you only think about the system, you may end up with a data swamp and fail to create a logical and meaningful storage structure for the synergies you want to unlock for later use cases. And I'm sure you want those synergies. But one thing at a time.
The Issues: Organizational and Technical Ones
If we have a use case, various questions may arise. These can be: Who gives me access to the data? What does the data look like? Does it need to be prepared further so that I can use it for my use case? Is there perhaps already some preparatory work that I can build on? And there are many more.
Technical questions also arise, especially if you need to deal with compliance requirements. How does the data have to be encrypted? How do I get test data for staging? Which fields contain PII-relevant data? How do these have to be treated technically? How do we integrate networks? And these are just a few examples.
Addressing the Issues
So how can we deal with the questions productively? We design clear access patterns to our data platform. These access patterns address precisely the organizational and technical problems.
What do I mean by access pattern? In principle, it means a sequence of well-defined steps involving different participants. Beyond naming those involved, the access patterns clearly state who must do what. In addition, we need technical requirements that are operationalized as precisely as possible. Finally, the documentation needs to be expanded to complete the project. This is the only way to create synergies. And by that I mean documentation that is useful for new use cases and for data governance requirements.
Let’s continue with a technical sub-component of a prototypical data platform and a suggestion for a system that addresses both our governance and synergy requirements. But first, let's start with a process definition.
The Predefined Process
Our use case is to be implemented on the data platform. Our organization has already defined a process for new data products. Every box in the figure is well defined in terms of who is in charge of what. We are very lucky in this case; this will probably not always be the case.
We start with a description of the exact requirements, written by the requester. With this, he can consult the existing data catalog. For this to work, the data catalog must contain not only technical information but also semantic information. What does that mean? Technical information consists of the source and the data that can be taken from it. This includes, for example, column names or attributes (such as tags) that are read from AWS. However, not only the AWS databases or S3 buckets should be part of this information set, but also the tables and structures that can be found in the data warehouse of the data platform or at their storage locations. In summary, the data catalog should be able to technically represent all data artifacts.
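To make the technical side a bit more tangible, here is a minimal sketch of how such technical metadata could be collected from the AWS Glue Data Catalog with boto3. The region and the database name are placeholders, and a real pipeline would of course push the result into the catalog rather than print it.

```python
import boto3

# Sketch: collect technical metadata (tables, columns, parameters) from the
# AWS Glue Data Catalog so it can be fed into a central data catalog entry.
# Region and database name are placeholders.
glue = boto3.client("glue", region_name="eu-central-1")


def collect_technical_metadata(database_name: str) -> list[dict]:
    """Return a simple list of table descriptions for a given Glue database."""
    entries = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            storage = table.get("StorageDescriptor", {})
            entries.append(
                {
                    "table": table["Name"],
                    "location": storage.get("Location"),
                    "columns": [
                        {"name": col["Name"], "type": col.get("Type")}
                        for col in storage.get("Columns", [])
                    ],
                    # free-form attributes such as classification or custom tags
                    "parameters": table.get("Parameters", {}),
                }
            )
    return entries


if __name__ == "__main__":
    for entry in collect_technical_metadata("sales_raw"):  # hypothetical database
        print(entry["table"], [col["name"] for col in entry["columns"]])
```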
That helps, but it's not everything, because we still lack information. Which fields are PII-relevant? Who is the technical contact and who is the business contact for the data? What is this data used for? In which existing data products is it used? Questions of this type aim at semantic information.
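To make the distinction concrete, a semantic catalog entry could look roughly like the following sketch. The field names are purely illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field


# Illustrative record for semantic metadata next to the technical metadata;
# the field names are examples, not a fixed schema.
@dataclass
class SemanticMetadata:
    technical_contact: str                 # who operates the source or pipeline
    business_contact: str                  # who owns the data from a business perspective
    purpose: str                           # what the data is used for
    pii_fields: list[str] = field(default_factory=list)        # PII-relevant columns
    used_in_products: list[str] = field(default_factory=list)  # downstream data products


example = SemanticMetadata(
    technical_contact="data-platform-team@example.com",
    business_contact="sales-controlling@example.com",
    purpose="Monthly revenue reporting",
    pii_fields=["customer_name", "email"],
    used_in_products=["churn_dashboard"],
)
```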
From a governance perspective, the data catalog should also show which versions of a data pipeline have existed and how it has changed over time. It is also important to see which fields are actually being transferred or transformed, and how.
In summary: the data catalog enables a holistic view of all data artifacts and their use in an organization. Without this system component, requesters cannot decide whether a source system is already connected or whether an existing connection needs to be expanded. If our requester is lucky, the required sources are already available and he can start directly with data product development, which must inevitably end with documentation in the data catalog. If he is unlucky, the source system first has to be integrated, and without proper process descriptions and technical blueprints that can be a long journey lasting months.
The Technical Description of Sub-Components: Make Your GRCS Team and Yourself Happy
To show an example of what I mean by technical operationalization or a blueprint, let's look at an AWS service: Amazon S3. Widely used, and still often not sufficiently configured. However, we managed to agree on the following technical blueprint with our group-wide Governance, Risk, Security and Compliance team.
Each bucket is accessed exclusively from a VPC via a VPC endpoint, in this example by AWS Glue. The bucket's access control list contains no entries and blocks new ones; the same applies to the account-wide public access settings for Amazon S3. In addition, every access is recorded via access logging. Access is governed by a bucket policy, which also enforces that the bucket can only be reached via TLS 1.2 or higher. The bucket also has a dedicated KMS customer managed key, which is rotated annually. Finally, the consuming resource requires a role that is only allowed to read from and write to this bucket and its key.
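As a sketch of how such a blueprint could be operationalized, the following boto3 snippet applies the main points: Block Public Access, default encryption with a dedicated customer managed KMS key, access logging, and a bucket policy that restricts access to a VPC endpoint and to TLS 1.2 or higher. Bucket names, the KMS key ARN and the VPC endpoint ID are placeholders, and the annual key rotation and the dedicated consumer role are omitted here for brevity.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "my-data-bucket"                                            # placeholder
LOG_BUCKET = "my-access-log-bucket"                                  # placeholder
KMS_KEY_ARN = "arn:aws:kms:eu-central-1:123456789012:key/EXAMPLE"    # placeholder
VPC_ENDPOINT_ID = "vpce-0123456789abcdef0"                           # placeholder

# Block public ACLs and public bucket policies (mirrored at account level).
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Default encryption with the bucket's dedicated customer managed KMS key.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY_ARN,
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Server access logging into a dedicated log bucket.
s3.put_bucket_logging(
    Bucket=BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {"TargetBucket": LOG_BUCKET, "TargetPrefix": f"{BUCKET}/"}
    },
)

# Bucket policy: deny access outside the VPC endpoint and below TLS 1.2.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": VPC_ENDPOINT_ID}},
        },
        {
            "Sid": "DenyOldTls",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"NumericLessThan": {"s3:TlsVersion": "1.2"}},
        },
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```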
This is only meant as an example; please look at all AWS best practices for an organization-wide configuration. But if we have such a description, we can carry out the deployment against a requirements specification and use it accordingly in an audit. It also makes it possible to keep up with AWS over time by adding new measures and removing old ones where necessary. Individual cases no longer need to be discussed, because these blueprints have global validity. And of course, this process must be carried out for all sub-components of the data platform.
The Data Catalog: The DataHub Project
If you are new to the data catalog field, you will quickly come across AWS Glue on the AWS platform. Unfortunately, AWS Glue does not offer a semantic data catalog, and you have to give users access via the AWS console, which is difficult to understand, especially for business users.
There are several commercial providers, but I would definitely recommend the DataHub Project. DataHub is an open-source metadata platform that serves as an extensible data catalog and supports data discovery, data observability, and federated governance to address the complexity of the data ecosystem. The catalog enables the combination of technical, operational, and business metadata to provide a 360-degree view of data entities. DataHub also makes it possible to pre-enrich important metadata using shift-left practices and to respond to changes in real time.
The deployment guide uses Kubernetes by default, but the cluster can also be hosted on AWS with managed services and the Elastic Container Service.
The DataHub landing page can be secured via SSO and is responsive. A search helps to find data artifacts and the metadata about them.
Each asset has a detailed view, shown here in excerpts. In addition to the table definition, the most common table queries are displayed, as are the individual data fields with tags and example values.
The lineage view then shows how the data artifacts depend on each other and which fields are used. In the example, the product starts in S3 and ends in Power BI via Snowflake. All this information is imported into the catalog system via ingestions, and there are quite a few of them available.
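As an illustration, such an ingestion can also be triggered programmatically with DataHub's Python SDK instead of a YAML recipe. The following is only a sketch: the Glue source, region and GMS endpoint are assumptions, and the exact config keys depend on your DataHub version and installed source plugins.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Sketch of a programmatic DataHub ingestion: pull technical metadata from the
# AWS Glue Data Catalog and push it into DataHub via the REST sink.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {"aws_region": "eu-central-1"},       # placeholder region
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS endpoint
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the ingestion reported errors
```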
The only thing the system can't do (yet), and which I am missing, is alerting on updates coming in through ingestions: for example, when far too few new values have arrived through an integration, or when the values differ significantly from previous ones.
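Until something like this is available natively, a simple workaround is a post-ingestion check outside DataHub, for example comparing the record count of the latest run against recent history. The following is just a sketch of that idea; where the counts come from and what threshold makes sense are left open.

```python
from statistics import mean


def ingestion_looks_suspicious(counts_history: list[int], latest_count: int,
                               min_ratio: float = 0.5) -> bool:
    """Return True if the latest run delivered far fewer records than usual."""
    if not counts_history:
        return False  # nothing to compare against yet
    return latest_count < min_ratio * mean(counts_history)


# Example: ~1,200 records against a history of ~10,000 per run should trigger an alert.
if ingestion_looks_suspicious([9800, 10100, 10050], latest_count=1200):
    print("Ingestion anomaly detected - notify the data platform team")
```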
Final Thoughts
I have tried to show, as simply as possible, what is necessary to create access patterns to a data platform. I am convinced that in addition to a good technical foundation and an up-to-date description of the data platform, its capabilities and possibilities, a good platform team is also necessary to work in a customer-oriented manner. This is the only way to transform the people and your organization to act in a data-driven way.