# Building Comprehensive DBT Documentation

## Table of Contents
- The Challenge
- The Solution
- Key Documentation Sections
- Critical Lessons Learned
- Documentation Structure
## The Challenge
When working as the sole data engineer maintaining a DBT project with multiple dashboards and dozens of models, I realized I needed comprehensive documentation that would allow:
- Team members to work independently when I’m unavailable
- New engineers to onboard quickly
- Clear guidelines to prevent common mistakes
The existing documentation was scattered across README files and tribal knowledge. I needed a single source of truth.
## The Solution
I created a layered documentation system with:
- Quick Reference Summary - Get answers in 30 seconds
- Detailed Guides - Deep dives into specific topics
- Troubleshooting Playbooks - Step-by-step recovery procedures
The documentation focuses on the why behind decisions, not just the how.
## Key Documentation Sections

### Medallion Architecture
We use the Bronze → Silver → Gold pattern, but I documented a critical decision point that isn’t obvious:
#### `SELECT *` vs. Explicit Column Selection in the Bronze Layer
| Approach | Advantages | Disadvantages |
|---|---|---|
| `SELECT *` | ✅ Fast to write<br>✅ Automatically gets new columns | ❌ Blocks app team from dropping unused columns<br>❌ Less control over ingested data |
| Explicit columns | ✅ Doesn't block app column deletions<br>✅ Clear documentation of used fields | ❌ Must manually add new columns<br>❌ Slower initial development |
The Trade-off: We started with `SELECT *` for speed, but plan to migrate to explicit columns once the models stabilize. This prevents us from blocking the application team when they need to drop deprecated columns.
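To make the trade-off concrete, here is a sketch of the same bronze model written both ways (the source and column names are illustrative; each option would be the entire body of the model file):

```sql
-- Option 1: SELECT * (current approach). New columns flow in
-- automatically, but every app column becomes a dependency.
SELECT *
FROM {{ source('app_db', 'orders') }}

-- Option 2: explicit columns (planned migration target). Documents
-- exactly which fields downstream models use; new columns must be
-- added here by hand.
SELECT
    order_id,
    customer_id,
    order_total,
    created_at
FROM {{ source('app_db', 'orders') }}
```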
### Local Development Setup
Most teams only document Docker-based workflows. I added a local development path that’s 3x faster for iteration:
Docker (Production-like):

```bash
colima start
make all
```
Local (Fast Development):

```bash
# One-time setup
pip install dbt-postgres==1.9.0
dbt deps

# Run models
dbt run --target local
dbt run --target local --select +silver_marketplace_report  # With dependencies
dbt run --target local --select silver_marketplace_report+  # With dependents
```
Key Insight: dbt's `+` graph-operator syntax for dependency selection wasn't obvious:
- `+model_name` - Runs the model AND everything it depends on (upstream)
- `model_name+` - Runs the model AND everything that depends on it (downstream)
- `+model_name+` - Runs the model with both upstream and downstream
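Combining both operators rebuilds the full lineage in one command:

```bash
# Run the model plus everything upstream AND downstream of it
dbt run --target local --select +silver_marketplace_report+
```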
### DBT Commands Explained

Many engineers know `dbt run`, but other commands are underutilized:
| Command | What It Does | When to Use |
|---|---|---|
| `dbt run` | Executes models (builds tables/views) | Run transformations |
| `dbt compile` | Compiles models to SQL without executing | Debug SQL before running |
| `dbt test` | Runs data quality tests | Validate data |
| `dbt build` | Runs models AND tests together | Full pipeline in one command |
| `dbt debug` | Tests connection and config | Troubleshoot connection issues |
| `dbt list` | Shows what would run (dry run) | Preview impact before running |
The `dbt compile` + `cat target/compiled/...` workflow is invaluable for debugging complex Jinja logic.
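A minimal sketch of that debugging loop (the project and model names are placeholders; the exact compiled path depends on your project layout):

```bash
# Render one model to plain SQL without executing it
dbt compile --select silver_my_table

# Inspect the compiled output dbt wrote under target/compiled/
cat target/compiled/my_project/models/silver/silver_my_table.sql
```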
### Testing Strategy
This section addresses a critical production issue we faced:
The Problem: Tests can pass in PR checks (which run against an empty test database) but fail after merge when they hit real data, blocking the entire pipeline.
The Recovery Process (see the `schema.yml` sketch after this list):
1. Comment out the failing test in `schema.yml`
2. Investigate in SQL Lab with diagnostic queries
3. Fix the model SQL
4. Test in SQL Lab first (verify uniqueness/constraints)
5. Uncomment the test once validated
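A minimal `schema.yml` sketch of step 1 (the model, column, and test names are hypothetical):

```yaml
version: 2

models:
  - name: silver_my_table
    columns:
      - name: order_id
        tests:
          # Temporarily disabled while we investigate duplicates in production:
          # - unique
          - not_null
```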
Manual Testing Queries (Run BEFORE Adding Tests):

```sql
-- Check for duplicate rows
SELECT
    column1,
    column2,
    COUNT(*) AS duplicate_count
FROM transform.silver_my_table
GROUP BY column1, column2
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

-- Check for NULL values
SELECT
    COUNT(*) AS total_rows,
    COUNT(column_name) AS non_null_rows,
    COUNT(*) - COUNT(column_name) AS null_count
FROM transform.silver_my_table;
```
Key Insight: Always test models in SQL Lab with real data before adding DBT tests. Don’t rely on empty test databases.
## Critical Lessons Learned

### 1. Referencing Data
| What | Syntax | Use Case |
|---|---|---|
| DBT model | `{{ ref('model_name') }}` | Reference other bronze/silver/gold models (creates the dependency graph) |
| Source table | `{{ source('source_name', 'table_name') }}` | Reference raw database tables (defined in `sources.yml`) |

Never use direct table names - always use `ref()` or `source()` to maintain the dependency graph.
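A short sketch showing both in one model (the model, source, and column names are made up):

```sql
-- ref() points at another dbt model; source() points at a raw table
-- declared in sources.yml. Both register edges in the dependency graph.
SELECT
    o.order_id,
    o.order_total,
    c.customer_name
FROM {{ ref('bronze_orders') }} AS o
LEFT JOIN {{ source('app_db', 'customers') }} AS c
    ON o.customer_id = c.id
```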
### 2. Macros Are Advanced
Macros use advanced Jinja syntax and should be approached carefully. Calling a macro is different from referencing models: the macro call renders to SQL at compile time. Document where to find examples (the `macros/` directory) rather than trying to explain all Jinja features.
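As one example, a minimal sketch using the dbt_utils package (the model and column names are illustrative; in dbt_utils 1.0+ the macro is `generate_surrogate_key`, in older versions `surrogate_key`):

```sql
-- The macro call is Jinja: it compiles down to a hash expression,
-- unlike ref()/source(), which resolve to relation names.
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} AS order_key,
    customer_id,
    order_date
FROM {{ ref('bronze_orders') }}
```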
### 3. SQL Style Matters
Standardizing on 4-space indentation and ALL CAPS keywords makes code reviews faster and reduces cognitive load:
```sql
SELECT
    column1,
    column2,
    CASE
        WHEN condition THEN value
        ELSE other_value
    END AS calculated_column
FROM {{ ref('silver_my_table') }}
WHERE status = 'Active'
```
### 4. Infrastructure Documentation
Don’t forget to document where things live:
DBT Repository (`dbt-project`):
- All models, tests, macros, documentation
- Deployed via CI/CD

Infrastructure Configuration (`projects` repository):
- Helm charts for deployments
- Configuration values
- Test in staging first, then promote to production
## Documentation Structure
The final structure provides multiple entry points:
```
docs/analytics-stack-guide/
├── README.md             # Quick reference summary
├── 01-overview.md        # Architecture & data flow
├── 02-dbt-setup.md       # Installation & commands
├── 03-integration.md     # Dashboard integration
├── 04-common-tasks.md    # Step-by-step workflows
├── 05-troubleshooting.md # Problem → Solution guide
├── 06-reference.md       # Command cheat sheet
└── caching-slides.md     # Visual presentations
```
This structure lives in the superset-dbt-workflow repo, which I'm polishing and will make public soon.
Auto-generated documentation:

`dbt docs generate` builds the auto-generated documentation site, including lineage, exposures, and metadata for all models:

```bash
dbt docs generate
```

`dbt docs serve` serves it locally at localhost:8080:

```bash
dbt docs serve
```
Our docs are hosted as an S3 static page at https://data.domain.io/, which links to the auto-generated docs; I also included our specific instructions from the analytics-stack-guide as a sub-topic.
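For reference, one possible publishing flow, assuming an S3 bucket configured for static website hosting (the bucket name below is a placeholder):

```bash
# Build the static docs site into target/
dbt docs generate

# Upload only the files the docs page needs
aws s3 sync target/ s3://my-dbt-docs-bucket/ \
    --exclude "*" \
    --include "index.html" \
    --include "catalog.json" \
    --include "manifest.json"
```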
This isn't perfect documentation, but I've found it useful, and I keep adding new tricks and ideas.
Full documentation is available in the `docs/analytics-stack-guide/` directory of the sparkle-dbt repository.