# Claude Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts

## Build & Test
```bash
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency.

### Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing

### Event Validation
Tests ensure CLI and Service emit identical build events:
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
- **Job events**: Job execution tracking
- **Partition events**: Partition build status

### CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service

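The validation rule above can be sketched as a comparison of event-type counts. This is a minimal illustration, not the test suite's actual implementation: `compare_event_types` and the event-type strings are hypothetical, and the real tests are the shell scripts under `tests/end_to_end/`.

```python
from collections import Counter


def compare_event_types(cli_events, service_events):
    """Return {event_type: (cli_count, service_count)} for every mismatch.

    Both arguments are lists of event-type strings (hypothetical format).
    An empty result means the two paths emitted the same types and counts.
    """
    cli_counts = Counter(cli_events)
    service_counts = Counter(service_events)
    return {
        event_type: (cli_counts[event_type], service_counts[event_type])
        for event_type in sorted(set(cli_counts) | set(service_counts))
        if cli_counts[event_type] != service_counts[event_type]
    }
```

A test would fail whenever the returned dictionary is non-empty, and its contents show which event types diverged.
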
### Running Individual Tests
```bash
# Test basic graph
tests/end_to_end/simple_test.sh \
    examples/basic_graph/bazel-bin/basic_graph.build \
    examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (must be run from examples/podcast_reviews)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
    bazel-bin/podcast_reviews_graph.build \
    bazel-bin/podcast_reviews_graph.service
```

## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls binary with "config" subcommand)
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

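The naming convention above can be illustrated with a small helper. This is a sketch only: `job_targets` is a hypothetical function, not part of DataBuild, and it assumes the suffixes are appended directly to the base job label.

```python
def job_targets(job_label):
    """Map a base job label to the three targets described above."""
    return {
        "config": f"{job_label}.cfg",  # runs the binary with "config"
        "exec": f"{job_label}.exec",   # runs the binary with "exec"
        "main": job_label,             # pipes config output to exec input
    }
```

For example, `job_targets("//:job_name")["config"]` evaluates to `"//:job_name.cfg"`.
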
### Unified Job Binary Pattern
Jobs use a single binary with subcommands:
```python
import sys

# handle_config and handle_exec are the job's own functions
def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        sys.exit(f"Unknown command: {command}")
```

### Job Configuration Requirements
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
```python
config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}
```

For a job with `"args": []`, only the config function is called during execution; exec never runs.

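Because an empty `args` list skips exec without an obvious error, it can help to check configs before they are emitted. The sketch below is one way to do that; `find_empty_args` is a hypothetical helper, not a DataBuild API, and it assumes the config shape shown above.

```python
def find_empty_args(config):
    """Return indexes of entries in config["configs"] whose args are missing or empty."""
    return [
        index
        for index, job_config in enumerate(config.get("configs", []))
        if not job_config.get("args")
    ]
```

A config handler could call this before printing its JSON and raise if the returned list is non-empty.
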
### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
3. **Job Resolution**: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants

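The first two phases can be pictured as two calls into the same binary, with the config JSON handed from one to the other. The sketch below simulates that in-process under simplifying assumptions: the handler signatures and bodies and the `run_job` function are hypothetical stand-ins, and real DataBuild invokes Bazel targets rather than Python functions.

```python
import json


def handle_config(partition_ref):
    """Stand-in for the "config" subcommand: return job configuration JSON."""
    return json.dumps({
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],
            "args": [partition_ref],  # non-empty, so exec will run
            "env": {"PARTITION_REF": partition_ref},
        }]
    })


def handle_exec(config_json):
    """Stand-in for the "exec" subcommand: read the config, report outputs built."""
    config = json.loads(config_json)
    return [out["str"] for job in config["configs"] for out in job["outputs"]]


def run_job(partition_ref):
    """Stand-in for the main job target: pipe config output into exec input."""
    return handle_exec(handle_config(partition_ref))
```

Calling `run_job` with a partition reference returns the list of output references that the simulated exec step would build.
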
### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern
```python
import re

pattern = re.compile(r"^my_dataset/")  # Hypothetical; use your own ref scheme

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```

### Common Pitfalls
- **Empty args**: Jobs with `"args": []` won't execute properly
- **Wrong target refs**: Job lookup must return base targets, not `.cfg` variants
- **Missing partition refs**: All outputs must be addressable via partition references

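
The wrong-target-ref pitfall can be caught with a one-line check on whatever the lookup returns. This is a sketch: `is_base_job_target` is a hypothetical helper, and rejecting `.exec` labels as well as `.cfg` labels is an assumption that goes beyond what the list above states.

```python
def is_base_job_target(label):
    """True if label names a base job target rather than a .cfg or .exec variant."""
    return not label.endswith((".cfg", ".exec"))
```

A lookup binary could assert this on its return value before printing it.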