# Claude Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts

## Tenets

- We are building for the future, and choose to do "the right thing" rather than take shortcuts to get unstuck. If you get stuck, pause and ask for help or input.
- In addition, do not add "unknown" results when parses or matches fail; these cases should always throw.

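The second tenet can be illustrated with a minimal sketch. The `JobStatus` enum and `parse_job_status` function are hypothetical names invented for this example, not part of DataBuild; the point is only that an unrecognized value raises instead of mapping to an "unknown" fallback.

```python
from enum import Enum


class JobStatus(Enum):
    """Hypothetical status values, used only for illustration."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def parse_job_status(raw: str) -> JobStatus:
    """Parse a status string, raising on anything unrecognized."""
    try:
        return JobStatus(raw)
    except ValueError as err:
        # No JobStatus.UNKNOWN fallback: surface the bad input immediately.
        raise ValueError(f"Unrecognized job status: {raw!r}") from err
```

Callers then never have to handle a silent "unknown" branch; a bad value stops the run at the point where it was parsed.
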
## Build & Test
```bash
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency.

### Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing

### Event Validation
Tests ensure CLI and Service emit identical build events:
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
- **Job events**: Job execution tracking
- **Partition events**: Partition build status

### CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service

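The validation rule above (fail when event counts or types differ) can be sketched in a few lines. This is an illustration only, not the logic of the actual shell tests: it assumes each event is a dict with a `"type"` key, and the function names are hypothetical.

```python
from collections import Counter


def count_event_types(events):
    """Count events by their "type" field (an assumed event shape)."""
    return Counter(event["type"] for event in events)


def assert_events_aligned(cli_events, service_events):
    """Raise AssertionError if per-type event counts differ between paths."""
    cli_counts = count_event_types(cli_events)
    service_counts = count_event_types(service_events)
    if cli_counts != service_counts:
        # Counter returns 0 for missing keys, so absent types compare cleanly.
        differing = sorted(
            t for t in set(cli_counts) | set(service_counts)
            if cli_counts[t] != service_counts[t]
        )
        raise AssertionError(f"Event mismatch for types: {differing}")
```

Comparing per-type counts catches both a missing event type and a duplicated event, which a simple total count would miss.
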
### Running Individual Tests
```bash
# Test basic graph
tests/end_to_end/simple_test.sh \
    examples/basic_graph/bazel-bin/basic_graph.build \
    examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
    bazel-bin/podcast_reviews_graph.build \
    bazel-bin/podcast_reviews_graph.service
```

## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Unified Job Binary Pattern
Jobs use a single binary with subcommands:
```python
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        raise ValueError(f"Unknown command: {command}")
```

### Job Configuration Requirements
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```

For jobs with `"args": []`, only the config function is called during execution; exec is never invoked.

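One way to guard against the empty-`args` problem is to reject it when the config is built. The sketch below is an assumption-laden illustration: `build_job_config` is a hypothetical helper, the partition ref string is made up, and `inputs` is left empty only to keep the example self-contained.

```python
import json


def build_job_config(partition_ref, args):
    """Build a config dict in the shape shown above, rejecting empty args."""
    if not args:
        # Fail at config time rather than silently skipping exec later.
        raise ValueError(f"Job config for {partition_ref!r} has empty args")
    return {
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],  # Assumption: no upstream inputs in this sketch
            "args": list(args),
            "env": {"PARTITION_REF": partition_ref},
        }]
    }


if __name__ == "__main__":
    # Hypothetical partition ref, for illustration only.
    print(json.dumps(build_job_config("reviews/date=2024-01-01", ["run"])))
```

Raising here follows the project tenet of throwing on invalid input instead of producing a config that appears valid but never runs.
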
### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants

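The flow above can be modeled in-process as a rough sketch. This is not how DataBuild is implemented (the real system invokes Bazel targets and resolves dependencies); it assumes lookup runs before configuration, and `run_build`, `config_fns`, and `exec_fns` are hypothetical names standing in for the `.cfg` and exec steps.

```python
from typing import Callable, Dict, List


def run_build(
    partition_refs: List[str],
    lookup: Callable[[str], str],
    config_fns: Dict[str, Callable[[str], dict]],
    exec_fns: Dict[str, Callable[[dict], None]],
) -> List[str]:
    """Plan every requested partition, then execute the planned jobs in order."""
    # Planning phase: resolve each ref to a base job label, then get its config.
    plan = []
    for ref in partition_refs:
        job = lookup(ref)  # Base label such as "//:job_name", never a .cfg variant
        plan.append((job, config_fns[job](ref)))

    # Execution phase: hand each config to the matching exec step.
    executed = []
    for job, config in plan:
        exec_fns[job](config)
        executed.append(job)
    return executed
```

Because both dictionaries are keyed by the base job label, a lookup that returned a `.cfg` variant would fail with a `KeyError` during planning.
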
### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern
```python
def lookup_job_for_partition(partition_ref: str) -> str:
    # `pattern` stands for a compiled regex matching this job's partition refs
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```

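For a graph with several jobs, the same pattern extends naturally to a routing table. The following is a minimal sketch: the regexes, partition ref formats, and job labels in `JOB_PATTERNS` are all hypothetical and would need to match a real graph's naming.

```python
import re

# Hypothetical routing table: partition-ref pattern -> base job label.
JOB_PATTERNS = [
    (re.compile(r"^reviews/date=\d{4}-\d{2}-\d{2}$"), "//:extract_reviews_job"),
    (re.compile(r"^summary/date=\d{4}-\d{2}-\d{2}$"), "//:daily_summary_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job label for a partition ref, or raise if none matches."""
    for pattern, job_label in JOB_PATTERNS:
        if pattern.match(partition_ref):
            return job_label
    # No "unknown" fallback: an unroutable ref is an error.
    raise ValueError(f"No job found for: {partition_ref}")
```

Keeping the labels in one table makes it easy to check that every entry is a base target and none ends in `.cfg`.
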
### Common Pitfalls
- **Empty args**: Jobs with `"args": []` won't execute properly
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references

## Documentation

We use plans and designs in the [plans](./plans/) directory to anchor most large-scale efforts. We create plans that are good bets, though not necessarily exhaustive, and then (this is critical) we update them after the work is completed or after significant progress toward completion.