# Claude Instructions

## Project Overview

DataBuild is a bazel-based data build system. Key files:

- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts

## Build & Test

```bash
# Run all tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## Project Structure

- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## Key Components

- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases

## DataBuild Job Architecture

### Job Target Structure

Each DataBuild job creates three Bazel targets:

- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Unified Job Binary Pattern

Jobs use a single binary with subcommands:

```python
def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
```

### Job Configuration Requirements

**CRITICAL**: Job configs must include non-empty `args` for execution to work:

```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```

Jobs with `"args": []` will only have their config function called during execution, not exec.

### DataBuild Execution Flow

1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants

### Graph Configuration

```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern

```python
def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```

### Common Pitfalls

- **Empty args**: Jobs with `"args": []` won't execute properly
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references
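
The pitfalls above can be checked against a minimal, self-contained sketch that combines the unified binary pattern, a config with non-empty `args`, and a lookup that returns a base target. The partition-ref format (`reports/date=YYYY-MM-DD`), the empty `inputs` list, and the helper name `build_config` are assumptions made for illustration; they are not taken from the DataBuild sources, and the `exec` branch is only a placeholder for real work.

```python
import json
import re
import sys

# Hypothetical partition-ref format and job label, for illustration only.
PARTITION_PATTERN = re.compile(r"^reports/date=\d{4}-\d{2}-\d{2}$")
JOB_LABEL = "//:job_name"


def lookup_job_for_partition(partition_ref: str) -> str:
    """Route a partition ref to a base job target, never a `.cfg` variant."""
    if PARTITION_PATTERN.match(partition_ref):
        return JOB_LABEL
    raise ValueError(f"No job found for: {partition_ref}")


def build_config(partition_refs: list[str]) -> dict:
    """Build one config entry per requested partition ref."""
    return {
        "configs": [
            {
                "outputs": [{"str": ref}],
                "inputs": [],  # assumption: this job has no upstream inputs
                "args": [ref],  # non-empty, so the exec step is not skipped
                "env": {"PARTITION_REF": ref},
            }
            for ref in partition_refs
        ]
    }


def main(argv: list[str]) -> int:
    """Dispatch on the subcommand and return a process exit code."""
    if len(argv) < 2:
        print("usage: job (config|exec) [args...]", file=sys.stderr)
        return 2
    command, rest = argv[1], argv[2:]
    if command == "config":
        # Emit the job configuration as JSON on stdout.
        json.dump(build_config(rest), sys.stdout)
        sys.stdout.write("\n")
        return 0
    if command == "exec":
        # Placeholder for the actual work; a real job would write its outputs here.
        print(f"exec called with args: {rest}", file=sys.stderr)
        return 0
    print(f"unknown subcommand: {command}", file=sys.stderr)
    return 2
```

In a real job binary, `main` would typically be wired up with `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard. Returning an exit code instead of calling `sys.exit` inside `main` keeps the dispatch logic easy to test.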