dbt | 10 tips for getting started

12.02.2022 | 10 min Read
Category: Data Engineering | Tag: #dbt

1. Follow best practices, and ask the forum when in doubt

The most important thing you do is decide to use dbt. The second most important is to start on the right foot by reading up on what constitutes best practice. Fortunately, dbt Labs, the company behind dbt, is eager to share its knowledge and has published a solid list of tips and tricks for how to proceed: dbt best practices

In addition to being quite good at documenting online, dbt also has – by virtue of being both widely used and open source – a vibrant community found on Slack. Here you will find forums for all kinds of needs, and you can expect to quickly receive responses from product developers or experienced users.

Figure: The dbt Slack channels are a great source of information and discussion.

2. Define the development standard

A natural next step, and one so important that it is mentioned here even though it can also be found via the point above, is to define the development standard before you even begin a dbt implementation. This is especially important in larger teams, but also very useful when you are the sole developer, as it forces a number of decisions that ensure consistency in what is being done.

It is not uncommon for things to evolve over time: new developers join, and alternative ways of doing things are adopted. If you let this happen unchecked, you will quickly end up with a project full of spaghetti architecture.

By first thinking through and documenting a development standard to be followed, it becomes easier to develop and perform code reviews, and it also becomes easier for new developers to get up to speed on the project. When the standard becomes concrete in this way, it also becomes easier to reassess it regularly as new experiences are gained. Note: remember point 10 about refactoring when the standard is changed.

A good starting point for creating your own development standard is to base it on the one dbt has made available: dbt style guide

In addition, The Zen of Python is eternally relevant.

3. Design the layer structure in the architecture early

Just as you should have the standard ready early, you should also have a clear plan for the different layers in the architecture. This is not just about how many there are and how they should be named; perhaps the most important thing is to determine what role they should play and how the layers should interact.

Regardless of how many layers you end up with, it is very good practice, as dbt recommends, to let the first layer be a pure standardisation layer that maps 1:1 to the source tables. This way, you ensure that renaming, data type conversion, data quality handling, and other simple operations are done in a single place that all subsequent models can benefit from.

Beyond this, it can be good practice to have as a baseline rule that each layer should primarily fetch data from the preceding layer, as this ensures you avoid unclear cross-dependencies within a layer.
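The baseline rule above can be checked mechanically. Below is a minimal Python sketch, assuming the layers are identified by model name prefixes. The prefixes stg_, int_ and mart_ and all model names are hypothetical and only serve as illustration. It scans each model's SQL for ref() calls and reports references that do not point to the immediately preceding layer.

```python
import re

# Hypothetical layer prefixes, ordered from the first layer to the last.
LAYERS = ["stg_", "int_", "mart_"]

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def layer_index(model_name):
    """Return the position of the model's layer in LAYERS, or None if unknown."""
    for index, prefix in enumerate(LAYERS):
        if model_name.startswith(prefix):
            return index
    return None


def find_layer_violations(models):
    """Return (model, referenced_model) pairs that break the baseline rule.

    `models` maps a model name to its SQL text. A reference is accepted only
    when it points to the immediately preceding layer.
    """
    violations = []
    for name, sql in sorted(models.items()):
        own_layer = layer_index(name)
        if own_layer is None:
            continue  # Models outside the naming convention are not checked.
        for referenced in REF_PATTERN.findall(sql):
            referenced_layer = layer_index(referenced)
            if referenced_layer is None or referenced_layer != own_layer - 1:
                violations.append((name, referenced))
    return violations


example_models = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "int_orders_enriched": "select * from {{ ref('stg_orders') }}",
    "mart_sales": "select * from {{ ref('int_orders_enriched') }}",
    # Skips the intermediate layer, so it is reported.
    "mart_shortcut": "select * from {{ ref('stg_orders') }}",
}

print(find_layer_violations(example_models))
# [('mart_shortcut', 'stg_orders')]
```

Since the rule says "primarily", the findings are best read as candidates for discussion in a code review rather than as hard errors.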

4. Have a clear development process with code reviews

Once you have started development and have the development standard and architecture in place, you then need to set the frameworks and processes to ensure everything is adhered to. There are many simple but effective measures you can take here. When it comes to new developers, you should already have documented what you have decided in the steps above, making it easy to read up on. In addition, it is a good exercise to practise pair programming to some extent in the beginning.

Beyond this, you should systematise the process by defining pull request templates and setting up branch policies that ensure a code reviewer is automatically assigned when a pull request is created. CI/CD processes will also help here.

5. Always think DRY

Unlike more traditional, graphical ETL tools, a code-based tool like dbt provides opportunities to do things more in line with classical best practices from software development. DRY is an extremely important principle here – don’t repeat yourself. In other words, if you need to solve the same problem multiple times, make sure you build structures and functions that allow the code to be reused, and that you only need to maintain it in one place.

In dbt, there are (at least) three different components you can leverage here:

Utilise the layer structure: An effective layer structure, where renaming, data type conversion, and data quality handling are done early and in one place only, means that all subsequent dependencies avoid having to solve the same problem again and again.
Macros: A key component in dbt is Jinja. Here you can build macros to solve a specific problem. This can be very simple things for standardising data type conversion, filtering, or key generation, or far more advanced things. The possibilities are (almost) endless, and for more advanced users, this also opens up the ability to override built-in functionality if needed.
Variables: dbt also has good support for defining variables at different levels that you can use in model definitions or as part of macros.

6. Use a tool like poetry or devcontainer for development

If you want to scale dbt development, potentially with many developers spread across multiple teams, you need everyone to get the development environment up and running quickly and efficiently, and it must work exactly the same for everyone. In addition, you need the ability to make changes to the development environment for everyone more or less simultaneously when needed, for example when dbt needs to be updated to a new version.

There are several ways to achieve this. One good approach is to use poetry: you define the development environment as part of the dbt repository, and the package manager poetry takes care of creating a virtual environment and installing the same packages at the same versions for everyone.

An alternative is to use devcontainers. In the same way, you define the development environment as part of the dbt repository, and Docker ensures that all necessary components are installed identically everywhere.

In both cases, the development experience is exactly the same as if you had installed everything locally (provided you have enough memory available, as devcontainers can be heavy on resources).

7. Document and implement tests at the same time as code implementation

It is easy for documentation and testing to be deprioritised when you feel the deadline approaching and you still have a lot to do. But in dbt, it is so simple to do both that you really have no excuse for not doing both at the same time as you write the code.

Take the table eksempel as a quick example. The documentation dbt generates for it shows a table description and a column description with the inferred data type, as well as two tests defined for the column: that it should be unique and that it should not contain null values.

Figure: The documentation that dbt generates is highly practical and well-organised.

And what does it take to achieve this? A mere 9 lines of YAML:

models:
  - name: eksempel
    description: This is an example of a table
    columns:
      - name: kolonne_1
        description: This is the first column in the table
        tests:
          - unique
          - not_null

Once you have done this, everything will appear in the dbt documentation. If you run dbt build, dbt will first run the model to create it and then run both defined tests to see whether they pass. Alternatively, you can run the tests on their own with the command dbt test.
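To make it concrete what the two tests check: a dbt test is a query that returns the failing rows, and it passes when no rows come back. The Python below is only an analogy of that logic, not how dbt implements it (dbt compiles the tests to SQL and runs them in the database). The column values are made up for illustration.

```python
from collections import Counter


def not_null_failures(values):
    """Return the values that violate not_null (every None in the column)."""
    return [value for value in values if value is None]


def unique_failures(values):
    """Return each non-null value that occurs more than once, sorted."""
    counts = Counter(value for value in values if value is not None)
    return sorted(value for value, count in counts.items() if count > 1)


# Hypothetical contents of kolonne_1 in the table eksempel.
kolonne_1 = [1, 2, 2, None, 3]

print(not_null_failures(kolonne_1))  # [None]
print(unique_failures(kolonne_1))    # [2]
```

Both lists are non-empty here, so both tests would fail for this data. Note that the uniqueness check ignores nulls, which is why the two tests are usually combined on key columns.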

dbt ships with four built-in generic tests: unique, not_null, accepted_values, and relationships. But remember that you can always define your own as needed.

8. Take advantage of the fact that dbt makes CI/CD easy

State

dbt has a wealth of features that make it suitable for CI/CD processes, and this should be utilised.

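A very good example of this is state. dbt has the ability to save its state, meaning the condition the codebase was in after a run; this is stored in the artifacts each run writes, including a manifest that describes the project. On the next run, you can point dbt at those saved artifacts with the --state flag, compare them with the current codebase, and choose to run only what is new or changed since last time. The example below does this and also runs all downstream dependencies (note the + after state:modified), which is often required during deployment. The path is a placeholder for wherever you keep the artifacts from the previous run.

dbt run --select state:modified+ --state path/to/previous/artifacts --full-refresh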

This was just one example of using dbt’s command line syntax, but it is extremely flexible and it can be a useful exercise to become intimately familiar with it.
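To build some intuition for what a selection like state:modified+ does, here is a conceptual Python sketch. It is not dbt's implementation: dbt compares against its own saved artifacts and takes more than the raw SQL text into account. The sketch assumes a saved state consisting of one checksum per model and a simple parent-to-children mapping; all model names are hypothetical.

```python
import hashlib


def checksum(sql):
    """Return a stable fingerprint of a model's SQL text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def modified_models(saved_state, current_code):
    """Return the models that are new or whose SQL differs from the saved state."""
    return {
        name
        for name, sql in current_code.items()
        if name not in saved_state or checksum(sql) != saved_state[name]
    }


def with_descendants(selected, children):
    """Expand a selection with everything downstream (the trailing +)."""
    result = set(selected)
    stack = list(selected)
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


# Hypothetical project: checksums saved after the previous run...
saved_state = {
    "stg_orders": checksum("select 1"),
    "int_orders": checksum("select 2"),
    "mart_sales": checksum("select 3"),
    "stg_customers": checksum("select 4"),
}
# ...and the codebase as it looks now, with one model edited.
current_code = {
    "stg_orders": "select 1",
    "int_orders": "select 2 -- changed",
    "mart_sales": "select 3",
    "stg_customers": "select 4",
}
# Which models read from which (parent -> children).
children = {"stg_orders": ["int_orders"], "int_orders": ["mart_sales"]}

changed = modified_models(saved_state, current_code)
print(sorted(changed))                              # ['int_orders']
print(sorted(with_descendants(changed, children)))  # ['int_orders', 'mart_sales']
```

Only the edited model and what depends on it are selected; the untouched staging models are left alone, which is what makes this kind of selection attractive in a deployment pipeline.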

Documentation

Another opportunity you should take advantage of is building and deploying the documentation every time you deploy to another environment, so that you always have up-to-date documentation for your environments. It then becomes a simple entry point for both developers and analysts, who can be confident that what they see in the documentation matches what they see in the database.

Building the documentation is just another variant of dbt's command line capabilities: dbt docs generate. The generated files can then be deployed to the desired location. The result is just a static website and can therefore be served from, for example, a data lake.

9. See it as a maturity journey

It can initially seem overwhelming to take in everything you should think about in connection with a dbt project, but there is no reason for that. Things rarely go very wrong in dbt – you never alter the source data, and a full re-run of all models is just a dbt run --full-refresh away. This means you can safely see it as a maturity journey where you build incrementally, and as you mature, you can adopt more advanced functionality.

Macros and CI/CD processes make life easier, but you can manage perfectly well without them. Start simple, solve the first needs, and take it from there.

10. Refactor often

And a final but very important tip – refactor often! As mentioned, things will evolve over time; you mature, gain new experiences, and find smarter ways of doing things. That is the way it should be.

But when this happens, you must not fall into the trap of starting to do things in a new way without simultaneously changing the old – or at minimum creating a plan for when it will be done. That is the recipe for accumulating technical debt and a future spaghetti architecture.

dbt makes refactoring very easy. A project is just plain text files, which can be changed quickly. If you want to make larger changes in batch, VS Code, for example, can perform relatively advanced search-and-replace operations, or you can write Python code to do the job. You will see the changes in version control and can run the dbt project to check that everything compiles and runs as expected. If you make a mistake, a test can often catch it, and you have the change history in the git repository and can revert if needed.
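As an illustration of the Python route, here is a minimal sketch of a batch rename across all .sql files in a models folder. The folder layout, file names, and column names are hypothetical, and the demonstration runs against a temporary directory so nothing real is touched.

```python
import re
import tempfile
from pathlib import Path


def rename_identifier(models_dir, old, new):
    """Replace whole-word occurrences of `old` with `new` in every .sql file.

    Returns the sorted list of files that were changed.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    changed = []
    for path in sorted(Path(models_dir).rglob("*.sql")):
        text = path.read_text(encoding="utf-8")
        # A function replacement keeps `new` literal, even with backslashes.
        updated = pattern.sub(lambda _match: new, text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed


# Demonstration on a throwaway directory with two hypothetical models.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models"
    models.mkdir()
    (models / "stg_orders.sql").write_text(
        "select cust_id, cust_id_old from raw_orders\n", encoding="utf-8"
    )
    (models / "stg_items.sql").write_text(
        "select item_id from raw_items\n", encoding="utf-8"
    )

    changed_files = [p.name for p in rename_identifier(models, "cust_id", "customer_id")]
    result_sql = (models / "stg_orders.sql").read_text(encoding="utf-8")

print(changed_files)  # ['stg_orders.sql']
print(result_sql)     # select customer_id, cust_id_old from raw_orders
```

The word-boundary match leaves cust_id_old untouched, which is the kind of detail that is easy to get wrong with a plain search-and-replace. Review the resulting diff in git before committing, and run the project afterwards.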

So do not be afraid to do things in a new way – extensive changes are not that extensive in dbt.