databuild/CLAUDE.md
2025-07-12 13:56:23 -07:00

# Claude Instructions
## Project Overview
DataBuild is a Bazel-based data build system. Key files:
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts
## Tenets
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not return "unknown" results when a parse or match fails; these cases should always throw.
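
As an illustration of the second tenet, here is a minimal Python sketch. The `JobStatus` enum and the `parse_status` function are hypothetical names chosen for this example, not taken from the DataBuild codebase. The point is only that an unrecognized value raises instead of mapping to an "unknown" placeholder:

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical status values, for illustration only.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


def parse_status(raw: str) -> JobStatus:
    """Parse a status string, raising instead of returning an 'unknown' member."""
    try:
        return JobStatus(raw)
    except ValueError:
        # Fail loudly with context rather than inventing a fallback value.
        raise ValueError(f"Unrecognized job status: {raw!r}") from None
```

Note that the enum deliberately has no `UNKNOWN` member, so callers cannot silently carry a failed parse forward.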
## Build & Test
```bash
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh
# Run all core unit tests
./scripts/bb_test_all
# Remote testing
./scripts/bb_remote_test_all
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```
## End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
### Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
### Event Validation
Tests ensure CLI and Service emit identical build events:
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
- **Job events**: Job execution tracking
- **Partition events**: Partition build status
### CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service
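
The validation rule can be sketched in Python as follows. This is an assumption-laden illustration: the actual tests are shell scripts, and representing events as dicts with a `"type"` key (along with the function names) is hypothetical. It shows one way to fail when event counts or types differ between the two paths:

```python
from collections import Counter


def event_type_counts(events):
    """Count events by their (assumed) 'type' field."""
    return Counter(event["type"] for event in events)


def assert_events_aligned(cli_events, service_events):
    """Raise AssertionError if the two paths differ in event types or counts."""
    cli_counts = event_type_counts(cli_events)
    service_counts = event_type_counts(service_events)
    if cli_counts != service_counts:
        raise AssertionError(
            f"Event mismatch: CLI={dict(cli_counts)} Service={dict(service_counts)}"
        )
```

Comparing counters ignores event ordering, which matches the stated check on counts and types only.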
### Running Individual Tests
```bash
# Test basic graph
tests/end_to_end/simple_test.sh \
    examples/basic_graph/bazel-bin/basic_graph.build \
    examples/basic_graph/bazel-bin/basic_graph.service
# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
    bazel-bin/podcast_reviews_graph.build \
    bazel-bin/podcast_reviews_graph.service
```
## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities
## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases
## DataBuild Job Architecture
### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)
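
The naming convention above can be expressed as a small Python helper. The `job_targets` function is hypothetical and exists only to make the relationship between the three labels concrete; it is not part of the DataBuild rules:

```python
def job_targets(job_label: str) -> dict:
    """Return the three target labels for a base job label such as '//:job_name'."""
    if job_label.endswith((".cfg", ".exec")):
        # A derived target was passed where a base job label is expected.
        raise ValueError(f"Expected a base job label, got: {job_label}")
    return {
        "config": f"{job_label}.cfg",
        "exec": f"{job_label}.exec",
        "main": job_label,
    }
```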
### Unified Job Binary Pattern
Jobs use a single binary with subcommands:
```python
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        raise ValueError(f"Unknown command: {command}")
```
### Job Configuration Requirements
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```
Jobs with `"args": []` will only have their config function called during execution; the exec function never runs.
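
One way to catch this early is to check the config before emitting it. The following is a minimal sketch; `validate_job_config` is a hypothetical helper, not an existing DataBuild function, and it assumes the config shape shown above:

```python
def validate_job_config(config: dict) -> None:
    """Raise ValueError if any entry in config['configs'] has missing or empty args."""
    for index, job_config in enumerate(config["configs"]):
        # A missing key and an empty list are both treated as errors.
        if not job_config.get("args"):
            raise ValueError(f"configs[{index}] must include non-empty 'args'")
```

Calling this at the end of the config handler turns a silent skip of the exec step into an immediate failure.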
### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants
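
The config-to-exec hand-off can be sketched in-process as below. This is a simplified illustration under stated assumptions: real jobs run as separate Bazel targets, and the `run_job` helper, the `"build"` argument, and passing the config as a JSON string are all hypothetical stand-ins for the actual piping mechanism:

```python
import json


def handle_config(args):
    """Config step: return the job configuration as a JSON string."""
    partition_ref = args[0]
    config = {
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],
            "args": ["build"],  # hypothetical value; must be non-empty
            "env": {"PARTITION_REF": partition_ref},
        }]
    }
    return json.dumps(config)


def handle_exec(config_json):
    """Exec step: consume the config JSON and return the partition refs built."""
    config = json.loads(config_json)
    built = []
    for job_config in config["configs"]:
        # Real work would happen here; this sketch only records the outputs.
        built.extend(output["str"] for output in job_config["outputs"])
    return built


def run_job(partition_ref):
    """Mimic the main job target: feed the config output into exec."""
    return handle_exec(handle_config([partition_ref]))
```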
### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```
### Job Lookup Pattern
```python
import re

pattern = re.compile(r"^my_dataset/")  # Hypothetical pattern, for illustration

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```
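
When a graph has several jobs, the same pattern extends to a routing table. The sketch below is illustrative only: the partition ref formats and job target names in `ROUTES` are hypothetical, not taken from the example graphs. It keeps the two properties that matter here, returning base job targets and raising when nothing matches:

```python
import re

# Hypothetical routes: each pattern maps partition refs to a base job target.
ROUTES = [
    (re.compile(r"^reviews/date=\d{4}-\d{2}-\d{2}$"), "//:extract_reviews_job"),
    (re.compile(r"^summaries/date=\d{4}-\d{2}-\d{2}$"), "//:summarize_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job target for a partition ref, or raise if none matches."""
    for pattern, job_target in ROUTES:
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```

The first matching route wins, so more specific patterns should be listed before more general ones.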
### Common Pitfalls
- **Empty args**: Jobs with `"args": []` won't execute properly
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references
- **Not adding new generated files to OpenAPI outs**: Bazel hermeticity requires every output file to be declared, so when OpenAPI code generation creates new files, add them explicitly to the target's `outs` field.
## Documentation
We use plans and designs in the [plans](./plans/) directory to anchor most large-scale efforts. We create plans that are good bets, though not necessarily exhaustive, then (and this is critical) we update them after the work is completed, or after significant progress towards completion.