What is a Data Catalogue, when do you need one, and what does it do?
According to Google, the planet’s population generates a staggering 300 million terabytes of data daily, with this number continually on the rise, especially due to the recent surge in IoT adoption.
For a typical enterprise organization, data comes in various forms – logs, metrics, user transactions, web browser tracking information – ranging from structured to unstructured, time-series to relational, and more.
How does one make sense of this data deluge and ensure it’s accessible, searchable, and usable? Enter metadata. Creating metadata for each dataset is essential, and a data catalogue acts as an index and search function for this metadata.
The concept of data catalogues has ancient roots, dating back to the Library of Alexandria, a Greek institution founded in Ptolemaic Egypt, where scrolls and manuscripts were organized systematically.
The Greeks used diverse classifications to organize their knowledge, sorting information by subject, geography, author, chronological order, and more. However, the complete classification remains unknown as the Library faced destruction over the centuries from fires, wars, and conquests. In a similar vein, ancient Greece introduced not only democracy for its people but also the democratization of data, aiming to make it easily discoverable and accessible for everyone.
How to implement a Data Catalogue in your organisation
The Library of Alexandria, according to some historians, contained hundreds of thousands of scrolls and had its own catalogue system for structuring the Data so that it could be easily searched. If the ancient Greeks could catalogue ALL their knowledge about the world, could we do the same with our world’s Data?
It turns out that we are already doing it. Look at Google. What is it if not a massive global Data Catalogue?
Thankfully, a typical organisation doesn’t have to deal with Google-scale problems when working on its Datasets. The methodologies for developing Data Catalogues aren’t new; they have been studied and explored for centuries. Data in a catalogue is typically organised along three dimensions:
Vertically on Domains
Horizontally on lineage
In all other directions via the knowledge graph
Let’s dig a bit deeper into each of these.
In the context of Data Catalogues, the domain structure represents your organisation as a hierarchy of capabilities. The term “domains” should not be mistaken for the domains of software engineering’s Domain-Driven Design (DDD). A Domain in this context is a capability and a building block in the catalogue of knowledge inside an organisation.
The process of splitting an organisation’s Data into Domains can be difficult, but you don’t have to do it all at once. Start with the biggest, most well-defined Capabilities and split them further into subcategories, creating a hierarchical classification, until you cover the Data Sources you are working with right now.
The DNS tree is a perfect example of hierarchical classification. DNS-style notation also works well for naming Domains.
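To make the DNS analogy concrete, here is a minimal sketch of a Domain hierarchy built from DNS-style names. The organisation name `acme` and all Domain names are hypothetical, and a nested dictionary is just one convenient way to hold the tree.

```python
def build_domain_tree(domain_names):
    """Build a nested dict from DNS-style names (most general label last)."""
    tree = {}
    for name in domain_names:
        node = tree
        # DNS reads right to left: "payments.finance.acme" -> acme > finance > payments
        for label in reversed(name.split(".")):
            node = node.setdefault(label, {})
    return tree


def subdomains(tree, name):
    """Return the direct children of a Domain, sorted alphabetically."""
    node = tree
    for label in reversed(name.split(".")):
        node = node[label]
    return sorted(node)


# Hypothetical Domains: start with the big Capabilities, refine as needed.
domains = [
    "finance.acme",
    "payments.finance.acme",
    "invoicing.finance.acme",
    "sales.acme",
    "crm.sales.acme",
]
tree = build_domain_tree(domains)
print(subdomains(tree, "finance.acme"))
```

Because each name carries its full path, a new subdomain can be added later without touching the Domains that already exist.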
Lineage is the flow and transformation of data as it moves through the various stages of a data pipeline or system. When visualised, it provides a comprehensive view of how data is sourced, processed, transformed, and consumed across different components or processes within an organization’s data architecture.
Schema of the Camunda 7 internal Database
Data lineage is crucial for understanding the origins of data, ensuring data quality, and meeting regulatory compliance requirements.
In Data Catalogues, Lineage serves two main purposes:
It shows how Data is being used throughout the company. For example, it allows a Chief Information Officer (CIO) and a Chief Security Officer (CSO) to address the problem of Personally Identifiable Information (PII) and to keep control over other sensitive information by observing where and how it is being consumed.
It enables consumers of the Data to see where their Data is coming from and why their reports or Data Products break. Without transparency in upstream lineage and formal Data Contracts (to be covered in future posts), the successful implementation of Data Products is practically impossible.
The lineage function is usually automatic and is provided by the underlying Data Platform and/or the Data Catalogue implementation.
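Both purposes come down to walking a directed graph. The sketch below is an illustration only: the dataset names and edges are hypothetical, and a real catalogue would populate the graph automatically rather than by hand.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
LINEAGE = {
    "crm.customers": ["staging.customers_clean"],
    "web.clickstream": ["staging.sessions"],
    "staging.customers_clean": ["marts.customer_360"],
    "staging.sessions": ["marts.customer_360", "marts.traffic_report"],
    "marts.customer_360": ["reports.churn_dashboard"],
}


def downstream(edges, source):
    """Every dataset that directly or indirectly consumes `source` (breadth-first)."""
    seen, queue = set(), deque([source])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(edges, target):
    """Every dataset that `target` directly or indirectly depends on."""
    reverse = {}
    for parent, children in edges.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


# Purpose 1: where does PII from the CRM end up?
print(sorted(downstream(LINEAGE, "crm.customers")))
# Purpose 2: which sources could have broken this report?
print(sorted(upstream(LINEAGE, "marts.traffic_report")))
```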
As a third dimension of the Data Catalogue, we can use a knowledge graph: a structured representation of knowledge that captures the relationships and connections between different entities. To put it simply, it should answer the question “What is this Data?”. It creates a context for the Data and a holistic view of the relationships between different classes or categories. A knowledge graph links all parts of your data catalogue together and should provide a single logical structure for all your knowledge.
Conceptual diagram of a knowledge graph
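One common way to represent a knowledge graph is as a set of subject-predicate-object triples. The sketch below assumes that representation; the entities and predicate names are hypothetical, and production catalogues typically use a dedicated graph store rather than a Python list.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = [
    ("orders", "is_a", "Table"),
    ("orders", "belongs_to_domain", "sales.acme"),
    ("orders", "owned_by", "Sales Team"),
    ("orders", "has_column", "orders.customer_email"),
    ("orders.customer_email", "is_a", "Column"),
    ("orders.customer_email", "represents", "EmailAddress"),
    ("EmailAddress", "is_a", "PersonalData"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def describe(triples, entity):
    """Answer 'what is this Data?' by collecting everything known about an entity."""
    facts = {}
    for _, predicate, obj in query(triples, subject=entity):
        facts.setdefault(predicate, []).append(obj)
    return facts


print(describe(TRIPLES, "orders"))
```

The same structure can hold Domains and lineage as further predicates, which is how the graph ties the other dimensions together.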
4th Dimension – Data Classification
While not strictly mandatory for the creation of a Data Catalogue, this dimension holds critical importance for any organization handling Personally Identifiable Information (PII) or sensitive data.
Implementing data confidentiality classifications such as “Official,” “Secret,” and “Top Secret,” alongside evaluating data sensitivity, including the presence of Personally Identifiable Information (PII), is a crucial step in organizing data for government organizations. Incorporating these considerations into the Data Catalogue not only ensures security but also addresses the challenges of compliance.
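As a rough illustration, classification can be stored as metadata on each column and rolled up to the dataset. The levels below mirror the ones named above. The rule that a dataset inherits the level of its most sensitive column is our assumption for this sketch, not a requirement of any particular standard, and the column names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidentiality(IntEnum):
    OFFICIAL = 1
    SECRET = 2
    TOP_SECRET = 3


@dataclass(frozen=True)
class ColumnEntry:
    name: str
    confidentiality: Confidentiality
    contains_pii: bool = False


def dataset_classification(columns):
    """Assumed rule: a dataset is as sensitive as its most sensitive column."""
    level = max(c.confidentiality for c in columns)
    has_pii = any(c.contains_pii for c in columns)
    return level, has_pii


def can_read(clearance, columns):
    """A reader needs clearance at or above the dataset's level."""
    level, _ = dataset_classification(columns)
    return clearance >= level


# Hypothetical columns of a single dataset.
columns = [
    ColumnEntry("order_id", Confidentiality.OFFICIAL),
    ColumnEntry("customer_email", Confidentiality.OFFICIAL, contains_pii=True),
    ColumnEntry("risk_score", Confidentiality.SECRET),
]
level, has_pii = dataset_classification(columns)
print(level.name, has_pii)
```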
Practical implementation of Data Catalogues. A quick glance at a few of the products available
Enough theory, let’s have a quick look at the landscape of Data Catalogue products available.
Pretty much every Data vendor offers their own flavour of Data Catalogue. The role of any Data Catalogue Product is not only to support all the above dimensions but also to have a mechanism to continuously classify the new Data to make it searchable and keep the catalogue up to date.
Data Catalogues employ two primary mechanisms to keep their indexes current: Pull and Push. Pull methods use a “crawler”, akin to Google’s approach to indexing the internet. Push mechanisms, on the other hand, are commonly employed where metadata is streamed directly into the Data Catalogue in real time.
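The difference between the two mechanisms is mostly about who initiates the update. Here is a toy sketch under that assumption; the `Catalogue` and `FakeWarehouse` classes and the event shape are hypothetical and do not correspond to any vendor’s API.

```python
class Catalogue:
    def __init__(self):
        self.entries = {}  # dataset name -> metadata dict

    def upsert(self, name, metadata):
        self.entries[name] = metadata

    def crawl(self, source):
        """Pull: the catalogue asks the source what it currently holds."""
        for name, metadata in source.list_datasets():
            self.upsert(name, metadata)

    def on_metadata_event(self, event):
        """Push: the source tells the catalogue as soon as something changes."""
        self.upsert(event["dataset"], event["metadata"])


class FakeWarehouse:
    """Hypothetical stand-in for a real data source."""

    def __init__(self, tables):
        self.tables = tables

    def list_datasets(self):
        return list(self.tables.items())


catalogue = Catalogue()

# Pull: a scheduled crawl picks up whatever exists at crawl time.
catalogue.crawl(FakeWarehouse({"orders": {"columns": ["id"]}}))

# Push: a schema change arrives as an event, without waiting for the next crawl.
catalogue.on_metadata_event(
    {"dataset": "orders", "metadata": {"columns": ["id", "total"]}}
)
print(catalogue.entries["orders"])
```

A crawler is simpler to bolt onto existing sources, while push keeps the catalogue fresher at the cost of every producer having to emit events.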
We have chosen to explore the easiest-to-try options provided by AWS and Azure, as the majority of MetaOps clients use one of these platforms. To add an intriguing dimension, we are also digging into the offerings of an independent vendor, Data.world.
AWS Glue Data Catalog
In the AWS ecosystem, the Data Catalogue is one of the features of AWS Glue. It attempts to create an overarching structure around a number of AWS Data-supporting services such as S3, RDS, Redshift and Athena.
Similar to numerous other AWS services, AWS Glue Data Catalog integrates seamlessly with the existing AWS ecosystem. It comes with built-in crawlers and classifiers supporting major AWS Data Services.
When dealing with non-AWS Data Sources, JDBC crawlers in AWS Glue offer robust support for popular products such as Postgres or MariaDB. Writing classifiers becomes straightforward when you’re familiar with your data’s structure. For stacks beyond the common ones, it’s advisable to check for existing solutions for your DataSource, or you might need to build a custom solution.
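For a feel of what a JDBC crawler involves, the sketch below builds the arguments for the Glue `create_crawler` call in boto3 and starts the crawler. The crawler name, IAM role ARN, catalogue database, connection name and include path are all placeholders, and we assume a Glue connection to the Postgres instance has already been created.

```python
def jdbc_crawler_request(name, role_arn, database, connection, path):
    """Build keyword arguments for the Glue `create_crawler` API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Glue Data Catalog database to write tables into
        "Targets": {
            "JdbcTargets": [
                # Path follows the database/schema/table pattern; % is a wildcard.
                {"ConnectionName": connection, "Path": path}
            ]
        },
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are placeholders for illustration.
request = jdbc_crawler_request(
    name="postgres-shop-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="shop_catalogue",
    connection="postgres-shop-connection",
    path="shop/public/%",
)
print(request["Targets"])
```

Calling `create_and_start_crawler(request)` against a real account would create the crawler and trigger a first run; in practice you would usually attach a schedule rather than start it by hand.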
Tracking data lineage in AWS is not as straightforward. Rather than relying on a single centralized Catalogue Service, it is achieved by combining several AWS products.
In this respect, the AWS version of the Data Catalog resembles a set of Lego building blocks that stick together rather than a standalone product. As always, AWS plays to its strength of integrating its own services.
High-level view of the AWS Glue Data Catalogue
Purview governance portal on Azure
Azure Purview was released in General Availability in September 2021 and was re-branded as Microsoft Purview the next year.
It provides a unified data governance service that helps you manage your on-premises, multi-cloud, and software-as-a-service (SaaS) data. It gives the organisation a holistic map of its data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage.
It supports data sources from the main Data vendors on the market, along with most Azure services.
It has separate roles for Curators and Security Administrators to look after Data Security and provides an interface for Data consumers to find reliable sources of data in your organisation.
Microsoft Purview structure
Compared to established Data vendors like Informatica, it’s a new kid on the block, but it ticks the basic Catalogue requirements and is easy enough to adopt if your Data is already in the Azure Cloud. Microsoft continues to actively develop Purview, so the current state is likely to change.
Data.world is a knowledge graph-based Data Catalogue with a strong emphasis on the searchability of the Data and on AI assistance in the search function. The product is cloud-native and integrates with the majority of the Data Products available on the market.
Out of the box it provides data discovery with AI-assisted search, contextual results, auto-enrichment, and extensive data lineage. The search function allows you to ask natural-language questions of your data, including follow-up inquiries, to ensure users find the right information.
data.world lineage view
Other key features of the Data.world solution are:
collaboration capabilities to help streamline workflows and enable knowledge sharing between data producers and users;
the ability to automatically organize, aggregate and present metadata in a format for easy use and sharing between collaborators; and
support for both virtualized and federated access to data, with built-in data governance controls.
All of this, along with SOC 2 Type II and HIPAA compliance, makes data.world a viable candidate for enterprise organisations.
Have you got a Data Catalogue implemented at your company? Are you considering implementing one? Do get in touch, we would love to hear your story and perhaps we can make sense of your Data together!