Building Comprehensive DBT Documentation

2025, Oct 14

The Challenge

When working as the sole data engineer maintaining a DBT project with multiple dashboards and dozens of models, I realized I needed comprehensive documentation that would allow:

  • Team members to work independently when I’m unavailable
  • New engineers to onboard quickly
  • Clear guidelines to prevent common mistakes

The existing documentation was scattered across README files and tribal knowledge. I needed a single source of truth.

The Solution

I created a layered documentation system with:

  1. Quick Reference Summary - Get answers in 30 seconds
  2. Detailed Guides - Deep dives into specific topics
  3. Troubleshooting Playbooks - Step-by-step recovery procedures

The documentation focuses on the why behind decisions, not just the how.

Key Documentation Sections

Medallion Architecture

We use the Bronze → Silver → Gold pattern, but I documented a critical decision point that isn’t obvious:

SELECT * vs Explicit Column Selection in Bronze Layer

| Approach | Advantages | Disadvantages |
|---|---|---|
| SELECT * | ✅ Fast to write<br>✅ Automatically gets new columns | ❌ Blocks app team from dropping unused columns<br>❌ Less control over ingested data |
| Explicit Columns | ✅ Doesn't block app column deletions<br>✅ Clear documentation of used fields | ❌ Must manually add new columns<br>❌ Slower initial development |

The Trade-off: We started with SELECT * for speed, but plan to migrate to explicit columns once models stabilize. This prevents us from blocking the application team when they need to drop deprecated columns.
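To illustrate the explicit-column style, here's a minimal sketch of a bronze model; the source name app_db, the table orders, and the column names are hypothetical, not our actual schema:

-- models/bronze/bronze_orders.sql (hypothetical names)
-- Explicit columns: new app columns are ignored until added here,
-- and dropping an unused app column does not break this model.
SELECT
    id,
    customer_id,
    status,
    created_at
FROM {{ source('app_db', 'orders') }}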

Local Development Setup

Most teams only document Docker-based workflows. I added a local development path that’s 3x faster for iteration:

Docker (Production-like):

colima start
make all

Local (Fast Development):

# One-time setup
pip install dbt-postgres==1.9.0
dbt deps

# Run models
dbt run --target local
dbt run --target local --select +silver_marketplace_report  # With dependencies
dbt run --target local --select silver_marketplace_report+  # With dependents
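One prerequisite the snippet above assumes: --target local needs a matching target in profiles.yml. A minimal sketch, with hypothetical connection details (my_project stands in for the profile name declared in dbt_project.yml):

# ~/.dbt/profiles.yml
my_project:
  target: local
  outputs:
    local:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user          # hypothetical
      password: dbt_password  # hypothetical
      dbname: analytics
      schema: transform
      threads: 4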

Key Detail: The + graph-operator syntax for dependency selection isn’t obvious at first:

  • +model_name - Run model AND everything it depends on
  • model_name+ - Run model AND everything that depends on it
  • +model_name+ - Run model with both upstream and downstream
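Before running a wide selection, dbt list (covered in the command table below) can preview exactly which models a graph operator will pick up, without executing anything; silver_marketplace_report is the same example model as above:

# Dry-run preview of upstream and downstream models
dbt list --target local --select +silver_marketplace_report+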

DBT Commands Explained

Many engineers know dbt run, but other commands are underutilized:

| Command | What It Does | When to Use |
|---|---|---|
| dbt run | Executes models (builds tables/views) | Run transformations |
| dbt compile | Compiles to SQL without executing | Debug SQL before running |
| dbt test | Runs data quality tests | Validate data |
| dbt build | Runs models AND tests together | Full pipeline in one command |
| dbt debug | Tests connection and config | Troubleshoot connection issues |
| dbt list | Shows what would run (dry-run) | Preview impact before running |
The dbt compile + cat target/compiled/... workflow is invaluable for debugging complex Jinja logic.
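A minimal sketch of that workflow; the exact path under target/compiled/ depends on the project name and model folder structure, so this one is illustrative:

# Render the Jinja to plain SQL without touching the warehouse
dbt compile --select silver_marketplace_report

# Inspect the rendered SQL (path varies by project and folder layout)
cat target/compiled/my_project/models/silver/silver_marketplace_report.sql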

Testing Strategy

This section addresses a critical production issue we faced:

The Problem: Tests can pass in PR checks (which run against an empty test database) but fail after merge when they hit real data, blocking the pipeline until the test is fixed.

The Recovery Process:

  1. Comment out the failing test in schema.yml (see the sketch after this list)
  2. Investigate in SQL Lab with diagnostic queries
  3. Fix the model SQL
  4. Test in SQL Lab first (verify uniqueness/constraints)
  5. Uncomment the test once validated
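A minimal sketch of step 1, with a hypothetical model and column name; the failing test stays in the file, commented out, so it's easy to restore in step 5:

# models/silver/schema.yml
version: 2

models:
  - name: silver_my_table
    columns:
      - name: order_id
        tests:
          # - unique   # temporarily disabled: fails on real data, under investigation
          - not_null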

Manual Testing Queries (Run BEFORE Adding Tests):

-- Check for duplicate rows
SELECT
    column1,
    column2,
    COUNT(*) AS duplicate_count
FROM transform.silver_my_table
GROUP BY column1, column2
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

-- Check for NULL values
SELECT
    COUNT(*) AS total_rows,
    COUNT(column_name) AS non_null_rows,
    COUNT(*) - COUNT(column_name) AS null_count
FROM transform.silver_my_table;

Key Insight: Always test models in SQL Lab with real data before adding DBT tests. Don’t rely on empty test databases.

Critical Lessons Learned

1. Referencing Data

| What | Syntax | Use Case |
|---|---|---|
| DBT Model | {{ ref('model_name') }} | Reference other bronze/silver/gold models (creates dependency graph) |
| Source Table | {{ source('source_name', 'table_name') }} | Reference raw database tables (defined in sources.yml) |

Never use direct table names - always use ref() or source() to maintain the dependency graph.
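A short sketch showing both forms in context; the model and source names are hypothetical:

-- models/silver/silver_orders.sql (hypothetical names)
SELECT
    o.id,
    o.status,
    c.segment
FROM {{ ref('bronze_orders') }} AS o                 -- another DBT model
LEFT JOIN {{ source('app_db', 'customers') }} AS c   -- raw table from sources.yml
    ON o.customer_id = c.id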

2. Macros Are Advanced

Macros use advanced Jinja syntax and should be approached carefully:

-- Calling a macro is different from referencing models
-- (my_macro is a placeholder for a macro defined in macros/)
{{ my_macro('some_argument') }}

-- Example: using dbt_utils.generate_surrogate_key
{{ dbt_utils.generate_surrogate_key(['column1', 'column2']) }} AS row_key
Document where to find examples (macros/ directory) rather than trying to explain all Jinja features.
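For orientation, this is roughly what a macro definition in macros/ looks like; cents_to_dollars is a hypothetical example, not one of our project's macros:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimal_places=2) %}
    ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}

-- Usage inside a model:
-- SELECT {{ cents_to_dollars('amount_cents') }} AS amount_dollars FROM ...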

3. SQL Style Matters

Standardizing on 4-space indentation and ALL CAPS keywords makes code reviews faster and reduces cognitive load:

SELECT
    column1,
    column2,
    CASE
        WHEN condition THEN value
        ELSE other_value
    END AS calculated_column
FROM {{ ref('silver_my_table') }}
WHERE status = 'Active'

4. Infrastructure Documentation

Don’t forget to document where things live:

DBT Repository: dbt-project

  • All models, tests, macros, documentation
  • Deployed via CI/CD

Infrastructure Configuration: projects repository

  • Helm charts for deployments
  • Configuration values
  • Test in staging first, then promote to production

Documentation Structure

The final structure provides multiple entry points:

docs/analytics-stack-guide/
├── README.md                          # Quick reference summary
├── 01-overview.md                     # Architecture & data flow
├── 02-dbt-setup.md                    # Installation & commands
├── 03-integration.md                  # Dashboard integration
├── 04-common-tasks.md                 # Step-by-step workflows
├── 05-troubleshooting.md              # Problem → Solution guide
├── 06-reference.md                    # Command cheat sheet
└── caching-slides.md                  # Visual presentations

This can be found in the superset-dbt-workflow repo, which I’m polishing and will make public soon.

Auto-generated documentation:

dbt docs generate builds the auto-generated documentation site, including lineage, exposures, and metadata for every model:

dbt docs generate

dbt docs serve then serves it locally on localhost:8080:

dbt docs serve

I host the generated docs on an S3 static page at https://data.domain.io/, and I included our specific instructions from analytics-stack-guide as a sub-topic there.
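Publishing to S3 can be a one-liner; a minimal sketch, assuming a hypothetical bucket name and already-configured AWS credentials (dbt docs generate writes the static site into target/):

# Upload only the files the docs site needs (hypothetical bucket)
aws s3 sync target/ s3://my-dbt-docs-bucket/ \
    --exclude "*" --include "index.html" --include "catalog.json" --include "manifest.json"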

This isn’t perfect documentation, but I’ve found it useful, and I keep adding new tricks and ideas.


Full documentation available in the docs/analytics-stack-guide/ directory of the sparkle-dbt repository.