A Dockerized dbt Workflow
How to set up dbt using Docker containers with an RPC server for Dagster
Lately, I have been playing around quite a bit with Dagster, trying to understand how good of a replacement it would be for Airflow. I might write a more comprehensive post about it, but one interesting thing I came across in my testing was the integration with dbt. There are two ways to make Dagster work together with dbt: pointing to a dbt repository for running dbt commands or interacting with the dbt RPC server. I liked the idea of taking advantage of the RPC server as it meant I could decouple my implementation of dbt and Dagster. In this post, I will show you how you can dockerize a dbt RPC server while also maintaining a nice workflow for developing dbt models locally.
It should be noted, however, that the dbt RPC server will soon (end of 2022) be deprecated, with the new dbt server as its replacement. I have high expectations for the dbt server, not just for this kind of implementation but for taking dbt metrics to a whole new level. Still, I see value in using the RPC server for the time being and migrating to the new server later on.
Dockerized dbt RPC server
When integrating dbt and Dagster, my first step was to dockerize the dbt RPC server. I searched for existing dbt Docker images but wasn't satisfied with any of them, so I decided to write my own Dockerfile. Because dbt is such a simple (but powerful) piece of software, this was extremely straightforward: I simply needed a Python environment to install the dbt package.
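A minimal Dockerfile along these lines could look something like the following; the base image, working directory, and host/port settings here are assumptions on my part, not the exact file from my project:

# Sketch of a dbt RPC server image; base image and paths are assumptions
FROM python:3.9-slim

WORKDIR /app

# Install dbt and the RPC server
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the dbt project, including profiles.yml, into the image
COPY . /app
COPY profiles.yml /root/.dbt/profiles.yml

EXPOSE 8580

# Serve the RPC server on all interfaces so it is reachable from outside the container
ENTRYPOINT ["dbt-rpc", "serve", "--host", "0.0.0.0", "--port", "8580"]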
The requirements.txt file contains the dependencies needed, including the dbt package. This will, of course, depend on the database adapter you are using, but for a PostgreSQL implementation, this would work:
dbt-postgres==1.0.0
dbt-rpc==0.1.1
For Snowflake you can use:
dbt-snowflake==1.0.0
dbt-rpc==0.1.1
Anyway, you get the idea!
Another thing to consider is the profiles.yml. As you can see from the Dockerfile, I'm copying a profiles.yml file from my dbt project into the container's /root/.dbt/ directory. Now, there are a few ways to handle these credentials for the data warehouse.
The first option is to store the credentials inside the profiles.yml file itself but keep it out of version control. This means that when the repository is first cloned, contributors need to create their own profiles.yml. However, I prefer another method: using environment variables in the profiles.yml. In dbt, you can use the env_var macro (reference) to insert environment variables into the YAML. My profiles.yml would then look something like this for a PostgreSQL implementation:
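Roughly like the sketch below; the profile name, schema, and environment variable names are placeholders for illustration rather than the exact ones from my project:

# Sketch of a PostgreSQL profiles.yml using env_var; names are placeholders
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('POSTGRES_HOST') }}"
      port: 5432
      user: "{{ env_var('POSTGRES_USER') }}"
      password: "{{ env_var('POSTGRES_PASSWORD') }}"
      dbname: "{{ env_var('POSTGRES_DB') }}"
      schema: analytics
      threads: 4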
For a Snowflake implementation it would look something like this:
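Again as a sketch, with placeholder names:

# Sketch of a Snowflake profiles.yml using env_var; names are placeholders
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      schema: analytics
      threads: 4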
Now, this means that when we run the Docker container, we need to define these environment variables. We can either do this when starting the container or through docker-compose. To give this a try, we can first build the image:
docker build -t dbt .
Now we can run a container from that image:
docker run -p 8581:8580 --name dbt_rpc -v "/$(pwd):/app" --env-file ./.env -it dbt
Note that with this command we are passing a .env file that contains the credentials referenced above in the profiles.yml. As you can see, we are also mapping the container's port 8580 to the host's port 8581, which means our RPC server will be available at localhost:8581/jsonrpc.
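For reference, the .env file simply lists those variables; assuming the placeholder names from the PostgreSQL profile above, it could look like:

POSTGRES_HOST=my-database-host
POSTGRES_USER=dbt_user
POSTGRES_PASSWORD=supersecret
POSTGRES_DB=analytics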
Once this is running, you can test it either by using the Dagster integration or by sending requests to the server. As an example, you can send the following status request:
curl --location --request POST 'localhost:8581/jsonrpc' \
--header 'Content-Type: application/json' \
--data-raw '{
  "jsonrpc": "2.0",
  "method": "status",
  "id": "000000000"
}'
If everything is working, you should receive back a 200 response with the status of ready.
Improving the development workflow
As a data manager and analytics engineer myself, an ergonomic development experience is extremely important to me. Developing new dbt models and modifying existing ones should be as straightforward as possible. One thing I noticed when implementing the dbt RPC server to work with Dagster is that it completely broke my development workflow. Running and testing my models meant I had to run Dagster as an interface for dbt. So every time I need to develop dbt models, where previously I would just run a simple dbt run --select my_model+, I now have to start Dagster. Don't get me wrong, I love Dagster's job parametrisation and it's super powerful, but I need something better for simply running and testing my models.
The solution I came up with is another Docker container. This container runs exactly the same image as the RPC server, but instead of running the dbt-rpc serve command, it simply runs a bash console. We can then use this console interactively and run dbt commands in it.
To achieve this, we can keep the same image that we built before, but simply change the entrypoint to execute a bash console:
docker run --name dbt_console -v "/$(pwd):/app" --env-file ./.env --entrypoint /bin/bash -it dbt
Notice that we are getting rid of the port mapping, changing the name to dbt_console, and adding the --entrypoint flag so that the container starts a bash console instead of the RPC server. Doing this will immediately open a bash console, and we can check that everything is working by running any dbt command, such as dbt debug. We can use this console to run, test, and iterate on our models.
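For example, a typical session inside the console might look like this (my_model is just a placeholder model name):

dbt debug                     # check the connection and the profiles.yml setup
dbt run --select my_model+    # run a model and everything downstream of it
dbt test --select my_model+   # test the same selection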
Composing everything together
This is nice: we now have two different Docker containers, one for the RPC server to be used with Dagster and another that provides a console we can use to run our dbt models. However, we can tie this all together using docker-compose. The resulting docker-compose file is extremely simple; it includes two services, one for the RPC server and one for the console.
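A sketch of what that docker-compose.yml could look like; the service names and the stdin/tty settings that keep the console container alive are my assumptions:

# Sketch of a docker-compose.yml with both services
version: "3.8"

services:
  dbt_rpc:
    build: .
    container_name: dbt_rpc
    ports:
      - "8581:8580"
    volumes:
      - .:/app
    env_file:
      - .env

  dbt_console:
    build: .
    container_name: dbt_console
    entrypoint: /bin/bash
    stdin_open: true   # keep stdin open so we can docker exec into a live container
    tty: true
    volumes:
      - .:/app
    env_file:
      - .env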
You can see that I also point to an env_file that helps me manage the credentials in the profiles.yml.
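With that file in place, both containers can be started in the background with:

docker-compose up -d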
To enter the dbt console simply run:
docker exec -it dbt_console /bin/bash