Claude Instructions
Project Overview
DataBuild is a bazel-based data build system. Key files:
- `databuild.proto` - System interfaces
- `manifesto.md` - Project philosophy
- `core-concepts.md` - Core concepts
Build & Test
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh
# Run all core unit tests
./scripts/bb_test_all
# Remote testing
./scripts/bb_remote_test_all
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
Event Validation
Tests ensure CLI and Service emit identical build events:
- Build request events: Orchestration lifecycle (received, planning, executing, completed)
- Job events: Job execution tracking
- Partition events: Partition build status
CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service
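The comparison the tests perform can be pictured with a minimal sketch. The event shape used here (dicts with `category` and `type` keys) and the `scheduled` job event type are assumptions for illustration only, not the actual DataBuild event schema; the real checks live in the shell scripts listed above.

```python
from collections import Counter


def summarize_events(events):
    """Count events by (category, type) so two runs can be compared."""
    return Counter((event["category"], event["type"]) for event in events)


def diff_event_summaries(cli_events, service_events):
    """Return {key: (cli_count, service_count)} for every key that differs."""
    cli = summarize_events(cli_events)
    service = summarize_events(service_events)
    return {
        key: (cli[key], service[key])  # a Counter yields 0 for missing keys
        for key in sorted(set(cli) | set(service))
        if cli[key] != service[key]
    }


# Hypothetical event logs: the Service run is missing the job event.
cli_run = [
    {"category": "build_request", "type": "received"},
    {"category": "build_request", "type": "planning"},
    {"category": "job", "type": "scheduled"},
]
service_run = cli_run[:2]

mismatches = diff_event_summaries(cli_run, service_run)
```

An empty `mismatches` dict would correspond to a passing test; any entry names an event kind whose count differs between the two paths.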
Running Individual Tests
# Test basic graph
tests/end_to_end/simple_test.sh \
    examples/basic_graph/bazel-bin/basic_graph.build \
    examples/basic_graph/bazel-bin/basic_graph.service
# Test podcast reviews (must be run from the example's own directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
    bazel-bin/podcast_reviews_graph.build \
    bazel-bin/podcast_reviews_graph.service
Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities
Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases
DataBuild Job Architecture
Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)
Unified Job Binary Pattern
Jobs use a single binary with subcommands:
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
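To make the pattern concrete, here is a self-contained sketch of such a binary. The bodies of `handle_config` and `handle_exec`, the `out` parameter (added so the functions can be exercised without a subprocess), the empty `inputs` list, and the example partition ref are all assumptions for illustration; the config shape follows the one described under Job Configuration Requirements.

```python
import io
import json
import sys


def handle_config(partition_refs, out=sys.stdout):
    """Write one job config per requested partition ref as JSON."""
    configs = [
        {
            "outputs": [{"str": ref}],
            "inputs": [],        # illustrative: this toy job has no inputs
            "args": [ref],       # kept non-empty so the exec step runs
            "env": {"PARTITION_REF": ref},
        }
        for ref in partition_refs
    ]
    json.dump({"configs": configs}, out)


def handle_exec(args, out=sys.stdout):
    """Stand-in for the real work: report which args were received."""
    out.write("building " + " ".join(args) + "\n")


def main(argv, out=sys.stdout):
    """Dispatch on the first argument, mirroring the config/exec split."""
    if len(argv) < 2 or argv[1] not in ("config", "exec"):
        raise SystemExit("usage: job (config|exec) [args...]")
    if argv[1] == "config":
        handle_config(argv[2:], out)
    else:
        handle_exec(argv[2:], out)


# Exercise the "config" subcommand in-process with a hypothetical ref.
buffer = io.StringIO()
main(["job", "config", "reviews/date=2024-01-01"], out=buffer)
config = json.loads(buffer.getvalue())
```

A real job binary would presumably end by calling `main(sys.argv)`; that call is left out here so the sketch can be imported and tested without side effects.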
Job Configuration Requirements
CRITICAL: Job configs must include non-empty args for execution to work:
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
For jobs with "args": [], only the config function is called during execution; exec never runs.
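Because an empty `args` list fails silently, a small check on the emitted config can catch it early. This is a sketch only: `find_empty_args` is a hypothetical helper, not part of DataBuild, and it assumes the config dict has the shape shown above.

```python
def find_empty_args(config):
    """Return the indexes of job configs whose "args" is missing or empty."""
    return [
        index
        for index, job_config in enumerate(config.get("configs", []))
        if not job_config.get("args")
    ]


# Hypothetical configs: one valid, one that would skip its exec step.
good = {"configs": [{"outputs": [{"str": "a"}], "inputs": [], "args": ["a"], "env": {}}]}
bad = {"configs": [{"outputs": [{"str": "b"}], "inputs": [], "args": [], "env": {}}]}

good_problems = find_empty_args(good)
bad_problems = find_empty_args(bad)
```

A job's config handler could run such a check before printing its JSON and fail loudly if the returned list is non-empty.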
DataBuild Execution Flow
- Planning Phase: DataBuild calls `.cfg` targets to get job configurations
- Execution Phase: DataBuild calls main job targets, which pipe config to exec
- Job Resolution: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants
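The three steps above can be modeled in a few lines. This is a toy, in-process model under stated assumptions: the real system invokes Bazel targets rather than Python callables, and the `plan` and `execute` functions, the one-config-call-per-ref simplification, and the example partition ref are all hypothetical.

```python
def plan(partition_refs, lookup, config_steps):
    """Planning phase: resolve each ref to a base job, then ask its config step."""
    planned = []
    for ref in partition_refs:
        job = lookup(ref)  # base job label, e.g. "//:job_name"
        planned.append((job, config_steps[job](ref)))
    return planned


def execute(planned, exec_steps):
    """Execution phase: hand each job's config to that job's exec step."""
    return [exec_steps[job](config) for job, config in planned]


# Toy stand-ins for the lookup binary and a job's two subcommands.
def lookup(ref):
    return "//:job_name"


def toy_config(ref):
    return {"args": [ref], "env": {"PARTITION_REF": ref}}


def toy_exec(config):
    return "built " + config["env"]["PARTITION_REF"]


results = execute(
    plan(["some/partition"], lookup, {"//:job_name": toy_config}),
    {"//:job_name": toy_exec},
)
```

The model keys both phases on the base job label, which mirrors why job lookup must return `//:job_name` rather than a `.cfg` variant.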
Graph Configuration
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
Job Lookup Pattern
def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
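A fuller sketch of the same lookup, routing several partition ref shapes to different jobs. The ref formats and job labels in `JOB_PATTERNS` are invented for illustration; only the function name and the convention of returning base job targets come from the pattern above.

```python
import re

# Hypothetical routing table: (compiled pattern, base job label).
JOB_PATTERNS = [
    (re.compile(r"^reviews/date=\d{4}-\d{2}-\d{2}$"), "//:extract_reviews_job"),
    (re.compile(r"^summary/date=\d{4}-\d{2}-\d{2}$"), "//:summarize_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job target for a partition ref, never a .cfg variant."""
    for pattern, job_label in JOB_PATTERNS:
        if pattern.match(partition_ref):
            return job_label
    raise ValueError(f"No job found for: {partition_ref}")


reviews_job = lookup_job_for_partition("reviews/date=2024-01-01")
summary_job = lookup_job_for_partition("summary/date=2024-01-01")
```

Keeping the patterns in one ordered table makes it easy to see which job owns which refs; the first match wins, so more specific patterns should come first.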
Common Pitfalls
- Empty args: Jobs with `"args": []` won't execute properly
- Wrong target refs: Job lookup must return base targets, not `.cfg` variants
- Missing partition refs: All outputs must be addressable via partition references
Documentation
We use plans and designs in the plans directory to anchor most large-scale efforts. Plans should be good bets, though not necessarily exhaustive. Critically, we update each plan after the work is completed, or after significant progress toward completion.