# Claude Instructions

## Project Overview
DataBuild is a bazel-based data build system. Key files:
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts

## Build & Test
```bash
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:

### Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing

### Event Validation
Tests ensure CLI and Service emit identical build events:
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
- **Job events**: Job execution tracking
- **Partition events**: Partition build status

### CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service

### Running Individual Tests
```bash
# Test basic graph
tests/end_to_end/simple_test.sh \
    examples/basic_graph/bazel-bin/basic_graph.build \
    examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
    bazel-bin/podcast_reviews_graph.build \
    bazel-bin/podcast_reviews_graph.service
```

## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Unified Job Binary Pattern
Jobs use a single binary with subcommands:
```python
def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
```

### Job Configuration Requirements
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```

Jobs with `"args": []` will only have their config function called during execution, not exec.

### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants

### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern
```python
def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```

### Common Pitfalls
- **Empty args**: Jobs with `"args": []` won't execute properly
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references
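
The pitfalls above can be caught with a small pre-flight check before a build runs. The following is a minimal sketch and not part of DataBuild: `find_config_problems` and `is_base_target` are hypothetical helper names, and the rule that every output carries a non-empty `str` field is an assumption drawn from the config shape shown in this document.

```python
def find_config_problems(config: dict) -> list:
    """Return a list of problem descriptions for a job config dict.

    Hypothetical helper: the checks mirror the pitfalls listed in this
    document, not any validation that DataBuild itself performs.
    """
    problems = []
    for i, job in enumerate(config.get("configs", [])):
        # Pitfall: empty or missing "args" means exec is never reached.
        if not job.get("args"):
            problems.append(f"configs[{i}]: 'args' is missing or empty")
        # Assumption: each output looks like {"str": "<partition ref>"}.
        outputs = job.get("outputs", [])
        if not outputs or any(not out.get("str") for out in outputs):
            problems.append(f"configs[{i}]: output without a partition ref")
    return problems


def is_base_target(target: str) -> bool:
    """True when a lookup result names a base job target, not .cfg/.exec."""
    return not target.endswith((".cfg", ".exec"))
```

A job's `config` output could be parsed with `json.loads` and passed to `find_config_problems`, while `is_base_target` could be applied to whatever a job lookup function returns; both helpers only inspect plain Python values, so they can run in a unit test without Bazel.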