DBT Anti-patterns
Model dependency hell
In application code, we have code dependency hell, in DBT, we have model dependency hell. As team sizes grow, it make sense to split our models into multiple projects. Sometimes we might create multiple models from a single table to split into multiple tables. Without careful pruning, we can end up with 1k+ models and quickly cause data engineers confusion as to what the purpose of each one is.
This also leads to broken changes downstream if not controlled. A core underlying model gets used by a sales report and then that in turn gets used by a marketing function. We've now created a dependency chain that can quickly break down.
Data modelling can often require careful planning and architecture. Without it we can quickly build technical debt unwittingly.
Not using DBT test efficiently
DBT test can be used to check schemas, ensure row counts etc. It's important to ensure consistency, debugging of the code, and document the expected behaviour. Like most tools, they can become issues in the wrong hands.
High level languages use constructs such as mocks to imitate behaviour for tests, this doesn't apply to databases unfortunately. Keeping tests relatively high level and abstract is the key here. Don't tie it too closely to the data if not required.
For example, testing expected statuses is something that should probably stay away from DBT models if not required. I'm not a huge fan of TDD as it often feels overkill. Likewise in DBT tests, don't over do it!
Overusing/Misusing SQL
SQL is a great tool to interact with a database. It doesn't have the same capabilities and abstractions as high level, general purpose languages. This means that if you're using things like recursion and branching logic everywhere in your CTEs (Common Table Expressions), please think about the poor soul who may look at this next.
Not using a linter like SQLFluff
SQLFluff is a really nice linter for DBT and SQL. Inconsistent styles, mistakes in SQL, compile failures can be difficult to track. DBT brings a modern approach to data transformation but needs a nice linter like SQLFluff to make it properly consistent. Highly recommended and we use it regularly. This should be an essential component to keeping your code consistent and is a common pattern in the software development world.
Performance in large modelled systems
So you've built your models. The system has evolved. The migration is nearly done. But it's huge.
Copying data into snapshots, into transformed tables, into normalised tables, into staging tables, phew.
The models underneath can be heavyweights. Be prudent and plan ahead. Start with a simple end to end flow and prove your migration approach. Your future self will thank you.
Related Posts
finance
Sep 4, 2024
For any financial organisation, being able to access all relevant client data quickly is not just a competitive advantage in the current market - it’s an absolute necessity for the company’s survival.
fivetran
Jun 11, 2024
Incremental sync and truncating the data in raw tables can improve the performance of your syncs.