DBT Anti-patterns

Model dependency hell

In application code, we have code dependency hell, in DBT, we have model dependency hell. As team sizes grow, it make sense to split our models into multiple projects. Sometimes we might create multiple models from a single table to split into multiple tables. Without careful pruning, we can end up with 1k+ models and quickly cause data engineers confusion as to what the purpose of each one is.

This also leads to broken changes downstream if not controlled. A core underlying model gets used by a sales report and then that in turn gets used by a marketing function. We've now created a dependency chain that can quickly break down.

Data modelling can often require careful planning and architecture. Without it we can quickly build technical debt unwittingly.

Not using DBT test efficiently

DBT test can be used to check schemas, ensure row counts etc. It's important to ensure consistency, debugging of the code, and document the expected behaviour. Like most tools, they can become issues in the wrong hands.

High level languages use constructs such as mocks to imitate behaviour for tests, this doesn't apply to databases unfortunately. Keeping tests relatively high level and abstract is the key here. Don't tie it too closely to the data if not required.

For example, testing expected statuses is something that should probably stay away from DBT models if not required. I'm not a huge fan of TDD as it often feels overkill. Likewise in DBT tests, don't over do it!

Overusing/Misusing SQL

SQL is a great tool to interact with a database. It doesn't have the same capabilities and abstractions as high level, general purpose languages. This means that if you're using things like recursion and branching logic everywhere in your CTEs (Common Table Expressions), please think about the poor soul who may look at this next.

SELECT state, 
  CASE {% for k, v in var("state_lookup").items() %} WHEN state in ({{ v|csl }}) THEN {{ k }}{% endfor %} 
  CASE {% for k, v in var("location_lookup").items() %} WHEN location in ({{ v|csl }}) THEN {{ k }}{% endfor %} 
  CASE {% for k, v in var("street_lookup").items() %} WHEN street in ({{ v|csl }}) THEN {{ k }}{% endfor %} 
  CASE {% for k, v in var("country_lookup").items() %} WHEN country in ({{ v|csl }}) THEN {{ k }}{% endfor %}   
  ELSE NULL 
    END AS payments FROM {{ ref('my_table') }}

Not using a linter like SQLFluff

SQLFluff is a really nice linter for DBT and SQL. Inconsistent styles, mistakes in SQL, compile failures can be difficult to track. DBT brings a modern approach to data transformation but needs a nice linter like SQLFluff to make it properly consistent. Highly recommended and we use it regularly. This should be an essential component to keeping your code consistent and is a common pattern in the software development world.

Performance in large modelled systems

So you've built your models. The system has evolved. The migration is nearly done. But it's huge.

Copying data into snapshots, into transformed tables, into normalised tables, into staging tables, phew.

The models underneath can be heavyweights. Be prudent and plan ahead. Start with a simple end to end flow and prove your migration approach. Your future self will thank you.

Too many models, exasperating tests, unreadable code and poor performance all hamper DBT projects

Jim Collins

Subscribe for the latest blogs and news updates!

finance

Sep 4, 2024

Streamlining Customer Data Onboarding: A Key to Success for Financial Organisations

For any financial organisation, being able to access all relevant client data quickly is not just a competitive advantage in the current market - it’s an absolute necessity for the company’s survival.

fivetran

Jun 11, 2024

Migrating Between Fivetran and Airbyte - A Two-Way Journey

Incremental sync and truncating the data in raw tables can improve the performance of your syncs.