Speed up Data Catalog Implementation with Automation and AI

Nazar Labunets
Published in Ataccama
3 min read · Jun 29, 2020

There is no doubt that crowdsourcing is a great data catalog capability. After all, it enables teams and departments to make their tribal knowledge of particular data available to everyone. However, crowdsourcing works best as a second step, after a catalog has been populated and enriched with as much business metadata as possible.

With that in mind, what businesses really need is to automate the generation of business metadata. Why? Because the traditional process of populating a data catalog, which relies on crowdsourcing business metadata, is long and tedious. It’s worth a closer look.

The Challenging Task of Crowdsourcing Metadata

Organizations that want to introduce a data catalog generally assign the task of populating it to a small team of data stewards, who are then responsible for launching a ready-to-use solution for the rest of the organization or department. As anyone who has tried it will know, this is far easier said than done.

Most data catalog solutions connect to data sources and import technical metadata. What that usually means is thousands of tables and columns named something like XDS_E2121_A_32. These naming conventions don’t make the task easy for the data stewards. End users and data owners sometimes know what is really inside those data sets, but they are often reluctant to create and maintain metadata for them.

At the same time, existing metadata documentation in the form of, say, Excel spreadsheets is often not easily available. In other cases, it is outdated, incomplete, or inconsistent with other departments’ documentation for the same data source. Database information schemas are no better: they are unlikely to have been updated since the day they were created.

Therefore, it is up to the data stewards to make inquiries and populate the catalog with relevant, up-to-date business metadata, such as business domains, classifications, and data quality indicators. A task such as this takes anywhere from months to years to complete. Effectively, data stewards face what writers call “the fear of the blank page.” Where to start, and how to get from 0 to 1?

Even if we imagine that, after a year of painstaking manual population, the data catalog is finally “complete,” let’s not forget the crucial task of keeping metadata up to date. After all, data is dynamic: schemas change, data comes and goes, and human error is, to some extent, unavoidable. How, then, to keep up with the pace of change in data and metadata?

If crowdsourcing alone were the answer, it would require an unprecedented data culture, processes, and accountability. It’s clear that this is not the case in the world today, and perhaps never will be.

So, is there a better, simpler way to introduce a data catalog solution into an organization and keep it useful afterward? Fortunately, yes. The answer lies in a new realm that will soon revolutionize data cataloging: automation.
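
As a rough illustration of what such automation could look like, the Python sketch below suggests a business label for a cryptically named column by looking at a sample of its values rather than its name. It is only a sketch under stated assumptions: the patterns, the labels, the 80% threshold, and the function name suggest_label are all hypothetical, and it does not describe how any particular data catalog product works.

```python
import re

# Hypothetical value patterns mapped to business labels (illustrative only).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}


def suggest_label(values, threshold=0.8):
    """Suggest a business label for a column from a sample of its values.

    Returns the label whose pattern matches the largest share of non-empty
    values, or None if no pattern reaches the (assumed) threshold.
    """
    sample = [v for v in values if v]
    if not sample:
        return None

    best_label, best_share = None, 0.0
    for label, pattern in PATTERNS.items():
        share = sum(1 for v in sample if pattern.match(v)) / len(sample)
        if share > best_share:
            best_label, best_share = label, share

    return best_label if best_share >= threshold else None


# Example: a column whose name says nothing about its content.
columns = {
    "XDS_E2121_A_32": ["jane@example.com", "joe@example.org", ""],
    "XDS_E2121_A_33": ["2020-06-29", "2020-01-15"],
    "XDS_E2121_A_34": ["foo", "bar"],
}
suggestions = {name: suggest_label(vals) for name, vals in columns.items()}
```

A data steward would still review suggestions like these; the point is that they start from a pre-filled page rather than a blank one.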

Learn what 9 data governance, data quality, and data catalog population processes can be automated by reading the full article at ataccama.com

Nazar Labunets
Ataccama

Effective communication: images and words at Ataccama.