# Claude Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts

## Build & Test
```bash
# Run all tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not run `bazel test //examples/basic_graph/...` directly; it will not work.
```

## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Unified Job Binary Pattern
Jobs use a single binary with subcommands:
```python
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])    # Perform actual work
    else:
        sys.exit(f"Unknown command: {command}")  # Reject anything else
```

### Job Configuration Requirements
**CRITICAL**: Every job config must include a non-empty `args` list for execution to work:
```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```

If a job's config has `"args": []`, only its `config` subcommand is called during execution; its `exec` subcommand never runs.
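
As a minimal sketch of a `config` handler that satisfies this requirement, the function below builds one config entry per requested partition ref and always emits a non-empty `args` list. It assumes the `config` subcommand receives partition refs as its arguments. The helper `build_config`, the empty `inputs` list, and the choice to pass the partition ref as the single argument are illustrative assumptions, not part of DataBuild.

```python
import json


def build_config(partition_refs):
    """Build a job configuration dict with one entry per partition ref.

    Hypothetical helper: assumes each requested ref maps to one config entry.
    """
    configs = []
    for partition_ref in partition_refs:
        configs.append({
            "outputs": [{"str": partition_ref}],
            "inputs": [],  # assumption: this example job has no upstream inputs
            "args": [partition_ref],  # non-empty, so the exec step runs
            "env": {"PARTITION_REF": partition_ref},
        })
    return {"configs": configs}


def handle_config(argv):
    """Print the configuration JSON for the given partition refs to stdout."""
    print(json.dumps(build_config(argv)))
```

Keeping the dict construction separate from the printing makes the config easy to unit test without capturing stdout.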

### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants
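
The flow above can be mimicked in plain Python to make the data flow concrete. This is a simulation under stated assumptions, not DataBuild's implementation: `lookup`, `config_fn`, `exec_fn`, and `build` are hypothetical in-process stand-ins for the lookup binary, a `.cfg` target, the `exec` subcommand, and the main job target.

```python
import json


def lookup(partition_ref):
    # Stand-in for job resolution: returns a base job label, never a `.cfg` variant.
    return "//:job_name"


def config_fn(partition_refs):
    # Stand-in for a `.cfg` target: returns configuration JSON as text.
    return json.dumps({"configs": [
        {"outputs": [{"str": ref}], "inputs": [], "args": [ref],
         "env": {"PARTITION_REF": ref}}
        for ref in partition_refs
    ]})


def exec_fn(config_json):
    # Stand-in for the `exec` subcommand: reads the piped config and
    # reports which output partitions it would have built.
    built = []
    for cfg in json.loads(config_json)["configs"]:
        built.extend(out["str"] for out in cfg["outputs"])
    return built


def build(partition_ref):
    job = lookup(partition_ref)                # job resolution
    config_json = config_fn([partition_ref])   # planning: obtain the config
    outputs = exec_fn(config_json)             # execution: pipe config into exec
    return job, outputs
```

In this sketch, `build("some/partition")` returns the resolved base label together with the list of partitions the simulated exec step produced.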

### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern
```python
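import re

# Hypothetical pattern; replace with one matching the partition refs this job produces
pattern = re.compile(r"^my_dataset/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")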
```
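
When a graph contains several jobs, the same pattern extends naturally to a table of patterns. The sketch below shows one possible layout. The patterns and the labels `//:job1` and `//:job2` are made-up examples, and `ROUTES` is an assumed name.

```python
import re

# Hypothetical routing table: (compiled pattern, base job target).
ROUTES = [
    (re.compile(r"^raw/"), "//:job1"),
    (re.compile(r"^reports/"), "//:job2"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job target for the first pattern that matches."""
    for pattern, job_label in ROUTES:
        if pattern.match(partition_ref):
            return job_label
    raise ValueError(f"No job found for: {partition_ref}")
```

Because the first match wins, more specific patterns should come before more general ones in the table.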

### Common Pitfalls
- **Empty args**: Jobs with `"args": []` never reach `exec`; only their `config` function runs
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references
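
These pitfalls can be caught before a build with a small check. The helper below is a hypothetical sketch, not a DataBuild API: `check_job_config` and its `lookup` parameter are assumed names, and it treats a label ending in `.cfg` or `.exec` as a wrong target reference.

```python
def check_job_config(config, lookup):
    """Return a list of human-readable problems found in a job config dict.

    `lookup` is any callable that maps a partition ref to a job label and
    raises ValueError when no job matches (an assumption of this sketch).
    """
    problems = []
    for i, cfg in enumerate(config.get("configs", [])):
        # Pitfall 1: empty or missing args means exec will not run.
        if not cfg.get("args"):
            problems.append(f"configs[{i}]: empty args, exec will not run")
        for out in cfg.get("outputs", []):
            ref = out.get("str")
            # Pitfall 3: every output must resolve to some job.
            try:
                label = lookup(ref)
            except ValueError:
                problems.append(f"configs[{i}]: no job found for output {ref!r}")
                continue
            # Pitfall 2: lookup must return the base target.
            if label.endswith((".cfg", ".exec")):
                problems.append(
                    f"configs[{i}]: lookup returned {label!r}, expected a base target"
                )
    return problems
```

An empty return value means none of the three pitfalls was detected for that config.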