Claude Instructions
Project Overview
DataBuild is a Bazel-based data build system. Key files:
- databuild.proto - System interfaces
- manifesto.md - Project philosophy
- core-concepts.md - Core concepts
Build & Test
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh
# Run all core unit tests
./scripts/bb_test_all
# Remote testing
./scripts/bb_remote_test_all
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
Test Suite Structure
- tests/end_to_end/simple_test.sh - Basic CLI vs Service validation
- tests/end_to_end/podcast_simple_test.sh - Podcast reviews CLI vs Service validation
- tests/end_to_end/basic_graph_test.sh - Comprehensive basic graph testing
- tests/end_to_end/podcast_reviews_test.sh - Comprehensive podcast testing
Event Validation
Tests ensure CLI and Service emit identical build events:
- Build request events: Orchestration lifecycle (received, planning, executing, completed)
- Job events: Job execution tracking
- Partition events: Partition build status
CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service
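The validation rule above (fail when event counts or types differ) can be sketched in a few lines of Python. This is an illustration only, not the logic of the shell test scripts: the event shape (a dict with a "type" key) and the name event_type_diff are assumptions made for this example.

```python
from collections import Counter


def event_type_diff(cli_events, service_events):
    """Return {event_type: (cli_count, service_count)} for every mismatched type.

    Assumes each event is a dict with a "type" key (hypothetical shape).
    """
    cli_counts = Counter(event["type"] for event in cli_events)
    service_counts = Counter(event["type"] for event in service_events)
    all_types = sorted(set(cli_counts) | set(service_counts))
    return {
        event_type: (cli_counts[event_type], service_counts[event_type])
        for event_type in all_types
        if cli_counts[event_type] != service_counts[event_type]
    }
```

An empty result means the two paths emitted the same event types the same number of times; any entry in the result would count as a test failure.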
Running Individual Tests
# Test basic graph
tests/end_to_end/simple_test.sh \
examples/basic_graph/bazel-bin/basic_graph.build \
examples/basic_graph/bazel-bin/basic_graph.service
# Test podcast reviews (must be run from examples/podcast_reviews)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
bazel-bin/podcast_reviews_graph.build \
bazel-bin/podcast_reviews_graph.service
Project Structure
- databuild/ - Core system (Rust/Proto)
- examples/ - Example implementations
- scripts/ - Build utilities
Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases
DataBuild Job Architecture
Job Target Structure
Each DataBuild job creates three Bazel targets:
- job_name.cfg - Configuration target (calls binary with "config" subcommand)
- job_name.exec - Execution target (calls binary with "exec" subcommand)
- job_name - Main job target (pipes config output to exec input)
Unified Job Binary Pattern
Jobs use a single binary with subcommands:
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        sys.exit(f"Unknown command: {command}")
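To make the pattern concrete, here is a self-contained sketch that runs both subcommands in one process and feeds the config output into exec, mimicking what the main job target does. Several details are assumptions for illustration: that exec reads the config JSON from standard input, the handler bodies, the empty "inputs" list, and the example partition ref.

```python
import io
import json
import sys


def handle_config(args, stdout):
    """Write one job config per partition ref as JSON (hypothetical shape)."""
    config = {
        "configs": [
            {
                "outputs": [{"str": ref}],
                "inputs": [],  # assumption: this example job has no inputs
                "args": [ref],  # kept non-empty
                "env": {"PARTITION_REF": ref},
            }
            for ref in args
        ]
    }
    json.dump(config, stdout)


def handle_exec(args, stdin):
    """Read the config JSON from stdin and return the partition refs it names."""
    config = json.load(stdin)
    return [out["str"] for entry in config["configs"] for out in entry["outputs"]]


def main(argv, stdin=None, stdout=None):
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    command = argv[1]  # "config" or "exec"
    if command == "config":
        return handle_config(argv[2:], stdout)
    if command == "exec":
        return handle_exec(argv[2:], stdin)
    raise SystemExit(f"Unknown command: {command}")


# Simulate the main job target: config output becomes exec input.
pipe = io.StringIO()
main(["job", "config", "reviews/2024-01-01"], stdout=pipe)
pipe.seek(0)
built = main(["job", "exec"], stdin=pipe)
```

Passing the streams as parameters keeps the handlers testable without spawning a process; a real job binary would simply call main(sys.argv).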
Job Configuration Requirements
CRITICAL: Job configs must include non-empty args for execution to work:
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
If a job's config has "args": [], only its config function is called during execution; exec is never invoked.
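Because an empty args list fails silently, a small check before emitting the config can catch it early. The helper below is a sketch and not part of DataBuild; the name find_empty_args is hypothetical, and it relies only on the "configs" and "args" keys shown above.

```python
def find_empty_args(config):
    """Return the indexes of entries in config["configs"] with missing or empty "args"."""
    return [
        index
        for index, entry in enumerate(config.get("configs", []))
        if not entry.get("args")
    ]
```

A config handler could call this just before printing its JSON and exit with an error if the returned list is non-empty.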
DataBuild Execution Flow
- Planning Phase: DataBuild calls .cfg targets to get job configurations
- Execution Phase: DataBuild calls main job targets, which pipe config to exec
- Job Resolution: Job lookup returns base job names (e.g., //:job_name), not .cfg variants
Graph Configuration
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
Job Lookup Pattern
def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
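With more than one job, the same pattern extends naturally to a routing table. The version below is a runnable sketch; the partition ref formats and the target names //:extract_reviews_job and //:summary_job are invented for illustration and do not come from the project.

```python
import re

# Hypothetical routing table: (compiled pattern, base job target).
JOB_PATTERNS = [
    (re.compile(r"^reviews/date=\d{4}-\d{2}-\d{2}$"), "//:extract_reviews_job"),
    (re.compile(r"^summary/.+$"), "//:summary_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job target for the first pattern that matches the ref."""
    for pattern, target in JOB_PATTERNS:
        if pattern.match(partition_ref):
            return target  # Base job target, never a .cfg variant
    raise ValueError(f"No job found for: {partition_ref}")
```

Anchoring each pattern with ^ and $ keeps one job's refs from accidentally matching another job's pattern.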
Common Pitfalls
- Empty args: Jobs with "args": [] won't execute properly
- Wrong target refs: Job lookup must return base targets, not .cfg variants
- Missing partition refs: All outputs must be addressable via partition references
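The wrong-target-ref pitfall is easy to guard against in a lookup binary. The helper below is a hypothetical sketch, not a DataBuild API; rejecting .exec as well as .cfg is an assumption, since the text above only names .cfg variants.

```python
def check_lookup_target(target):
    """Raise ValueError if a lookup result is a .cfg or .exec variant, else return it."""
    for suffix in (".cfg", ".exec"):
        if target.endswith(suffix):
            base = target[: -len(suffix)]
            raise ValueError(
                f"{target!r} is a {suffix} variant; return {base!r} instead"
            )
    return target
```

Wrapping every return value of the lookup function in this check turns a confusing downstream failure into an immediate, descriptive error.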