A lightweight alternative to Amundsen for your dbt project

Louis Guitton

Data Engineer

Software Architect

Algolia

dbt

Python

If you've been using dbt for a little while, chances are your project has more than 50 models. Chances are more than 10 people are building dashboards based on those models.

In the best case, self-service analytics users are coming to you with repeting questions about what model to use when. In the worst case, they are taking business decisions using the wrong model.

In this post, I will show you how you can build a lightweight metadata search engine on top of your dbt metadata to answer all these questions. I hope to show you that data governance, data lineage, and data discovery don't need to be complicated topics and that you can get started today on those roadmaps with my lightweight open source solution.

LIVE DEMO: https://dbt-metadata-utils.guitton.co

Data Governance is Ripe

In his recent post The modern data stack: past, present, and future, Tristan Handy - the CEO of Fishtown Analytics (the company behind dbt) - was writing:

He later also points out that dbt has its own lightweight governance interface: dbt Docs. They are a great starting point and might be enough for a while. However, as time goes by, your dbt project will outgrow its clothes. The search in dbt Docs is Regex only, and you might find its relevancy going down with a growing number of models. This can become important for Data Analysts building dashboards and looking for the right model but also for Data Engineers looking to "pull the thread" when debugging a model. Those use cases can be summarised with the two following "Jobs to be done":

Data discovery can solve 2 'Jobs to be Done'

When I want to build a dashboard, but I don’t know which table to use, help me search through the available models, so I can be confident in my conclusions.

When I am debugging a data model, but I don’t know where to start, help me get data engineering context, so I can be faster to a solution.

These days, the solution to those two problems seems to be rolling out "heavyweight" tools like Amundsen. As Paco Nathan writes p.115 of the book Data Teams by Jesse Anderson (you can find my review of the book here):

Amundsen and other heavyweight tools are the go-to solution for data discovery

Those tools come on top of an already complex stack of tools that data teams need to operate. What if we wanted a lightweight solution instead, like dbt Docs?

The Features of Amundsen and other Metadata Engines

In his great Teardown of Data Discovery Platforms, Eugene Yan summarizes really well the features of Amundsen and other metadata engines. He splits them in 3 categories: features to find data, features to understand data and features to use data.

Architecture of your friendly neighbourhood metadata engine

Its friendly UI with a familiar search UX is one of the key factors behind Amundsen's success. But another one is its modular architecture, which is already being reused by other metadata open source projects like the project whale (previously called metaframe).

We can further split the 3 categories of features into 10 features of varying implementation difficulty. Those features have also varying returns, not represented here.

Taxonomy of 10 features from metadata engines, cost opinions are my own

The key thing to realise is that Lyft might have spent a 15⭐️-cost on Amundsen to assemble all those features. But what if we wanted to build a 3⭐️-cost metadata engine? What features and technologies would you pick?

2021-02-09 Update: The Ground paper from Rise labs

In the seminal paper Ground: A Data Context Service - RISE Lab, Rise labs have outlined those features with much better terminology that I wasn't aware of at the time of first writing this post: the ABC of Metadata

Application - information on how the data should be interpreted

Behavior - information on how the data is created and used

Change - information on the frequency and types of updates to the data

A Lightweight Alternative to Amundsen

Although it's possible that the feature completeness (everything is in one place) makes the USP of Amundsen and others, I want to make the case for a more lightweight approach.

Documentation tools go stale easily. Or at least in situations where they are not tied with data modeling code. dbt has proven with dbt Docs that data people want to document their code (hi team 😁). We were just waiting for a tool simple and integrated enough for the culture of Data Governance to blossom. It reminds me of those DevOps books showing that the solution is not the tooling but rather the culture (if you're curious check out The Phoenix Project).

Additionally, dbt sources are a great way to make raw data explicitly labeled. The dbt graph documents data lineage for you at the table level and I will leverage later that graph to propagate tags with no additional work.

In other words, with schemas, descriptions and data lineage, dbt Docs covers the category

Features to Understand

from the above diagram. So what is missing from dbt Docs to rival with Amundsen? Only a way to sublime the work that is already happening in your dbt repository. And that is Search.

Algolia market themselves as a 'flexible search platform'

A good search engine will cover the Features to Find category. Fortunately, we don't need to build a search engine. This is where we will use Algolia's free tier in addition to some static HTML and JS files to build our lightweight data discovery and metadata engine. Algolia's free tier allows you for 10k search requests and 10k records per month. Given that for us 1 record = 1 dbt model, and 1 search request = 1 data request from a user, my guess is that the free tier will cover our needs for a while.

Note: if you're worried that Algolia isn't open source, consider using the project typesense.

How to get at least one feature in the Features to Use category? Well, a dbt project is tracked in version control, so by parsing git's metadata, we can for example know each model's owner.

More generally, to extend our lightweight metadata engine, we would add metadata sources and develop parsers to collect and organise that metadata. We would then index that metadata in our search engine. Examples of metadata sources are:

dbt artifacts (_See my post on how to parse dbt artifacts _)

git metadata

BI tool metadata database (e.g. who queries what, who curates what)

data warehouse admin views (e.g. for Redshift: stl_insert, svv_table_info, stl_query, predicate columns)

...

What does good Search look like

Search is going to be key if our metadata engine is to rival with Amundsen, so let's look at Amundsen's docs. We know from their architecture page that they use ElasticSearch under the hood. And we can also read that we will need a ranking mechanism to order our dbt models by relevancy:

A bit further in the docs, we learn that Amundsen has three search indices and that the search bar uses multi-index search against those indices:

We even get examples for searchable attributes for the documents in the tables index:

Presumably, there's not much point in reverse engineering an open source project, so I'll spare you the rest: it also supports search-as-you-type and faceted search (applying filters).

Putting it together in the dbt-metadata-utils repository

I have assembled a couple of scripts in the (work in progress) repository called dbt-metadata-utils. I will walk through a couple of key parts here, but feel free to check out the full code there, and if you want to use it on your own project, hit me up.

All you will need is:

your already existing dbt project in a git repository locally

clone dbt-metadata-utils on the same machine than your dbt project

create one Algolia account (and API key)

create one Algolia app inside that account

run the commands laid out later

For the dbt project, we will use one of the example projects listed on the dbt docs: the jaffle_shop codebase.

I had no clue about Jaffles, and then I used dbt

Create an environment file in which you will need to fill in the values from the Algolia dashboard:

And then run the 4 make commands:

If you navigate to https://localhost:3000, you should see a UI that looks like this:

I didn't dwell on details, but our metadata engine's features are:

search as you type by table name, table descriptions, the model's folder in the dbt codebase, or its sources

uses DAG algorithms to propagate tags using the loader and sources keys from the dbt .yml files

faceted search by those tags

ranking by degree-centrality, and by boosting dbt models that are in a mart or have a docs description

enrich the tables documents with git metadata parsed from the git repository using the python git client

advanced search using dynamic filtering: if you enter a query with a loader (e.g. "airflow payments"), it will use rules to filter documents with loader=airflow

Conclusion

LIVE DEMO: https://dbt-metadata-utils.guitton.co

There you have it! A lightweight data governance tool on top of dbt artifacts and Algolia. I hope this showed you that data governance doesn't need to be a complicated topic, and that by using a knowledge graph of metadata, you can get a head start on your roadmap.

Leave a star on the github project, and let me know your thoughts on twitter. I enjoyed building this project and writing this post because it lies at the intersection of three of my areas of interest: NLP, Analytics and Engineering. I cover those three topics in other places on my blog.