# Agent Instructions
## Project Overview
DataBuild is a bazel-based data build system. Key files:
- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- Component designs - design docs for specific aspects or components of databuild:
  - [Core build](docs/design/core-build.md) - How the core semantics of databuild work and are implemented.
  - [Build event log](docs/design/build-event-log.md) - How the build event log works and is accessed.
  - [Service](docs/design/service.md) - How the databuild HTTP service and web app are designed.
  - [Glossary](docs/design/glossary.md) - Centralized description of key terms.
  - [Graph specification](docs/design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
  - [Deploy strategies](docs/design/deploy-strategies.md) - Different strategies for deploying databuild applications.
  - [Wants](docs/design/wants.md) - How triggering works in databuild applications.
  - [Why databuild?](docs/design/why-databuild.md) - Why to choose databuild instead of other, better-established orchestration solutions.

Please reference these for any related work, as they capture the project's key technical biases and direction.
## Architecture Pattern
DataBuild implements **Orchestrated State Machines** - a pattern where the application core is composed of:
- **Type-safe state machines** for domain entities (Want, JobRun, Partition)
- **Dependency graphs** expressing relationships between entities
- **Orchestration logic** that coordinates state transitions based on dependencies
This architecture provides compile-time correctness, observability through event sourcing, and clean separation between entity behavior and coordination logic. See [`docs/orchestrated-state-machines.md`](docs/orchestrated-state-machines.md) for the full theory and implementation patterns.
**Key implications for development:**
- Model entities as explicit state machines with type-parameterized states
- Use consuming methods for state transitions (enforces immutability; see the sketch below)
- Emit events to BEL for all state changes (observability)
- Centralize coordination logic in the Orchestrator (separation of concerns)
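
A minimal sketch of the pattern in Rust (illustrative only; the type and method names here are not DataBuild's actual API): the entity's state is a type parameter, and transitions consume `self`, so holding onto a stale state is a compile error.

```rust
use std::marker::PhantomData;

// Marker types for the states of a hypothetical Want-like entity.
struct Pending;
struct Building;
struct Satisfied;

struct Want<State> {
    partition_ref: String,
    _state: PhantomData<State>,
}

impl Want<Pending> {
    fn new(partition_ref: impl Into<String>) -> Self {
        Want { partition_ref: partition_ref.into(), _state: PhantomData }
    }

    // Consuming `self` means the Pending value can no longer be used after this call.
    fn start_build(self) -> Want<Building> {
        // In DataBuild, a transition like this would also emit a BEL event.
        Want { partition_ref: self.partition_ref, _state: PhantomData }
    }
}

impl Want<Building> {
    fn mark_satisfied(self) -> Want<Satisfied> {
        Want { partition_ref: self.partition_ref, _state: PhantomData }
    }
}

fn main() {
    let want = Want::<Pending>::new("reports/2024-01-01");
    let building = want.start_build();    // `want` is moved; reusing it will not compile
    let done = building.mark_satisfied(); // only Want<Building> exposes this method
    println!("satisfied: {}", done.partition_ref);
}
```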
## Tenets
- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a superpower, and investing in it speeds up the flywheel of development and user value.
- **CLI/Service Interchangeability**: Both the CLI and service must produce identical artifacts (BEL events, logs, metrics, outputs) in the same locations. Users should be able to build with one interface and query/inspect results from the other seamlessly. This principle applies to all DataBuild operations, not just builds.
- The BEL represents real things that happen: job run processes that are started or fail, requests from the user, dep misses, etc. (see the illustrative sketch below).
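
The kinds of occurrences the BEL records might be modeled along these lines (a hypothetical sketch; the real event definitions live in `databuild.proto`):

```rust
// Hypothetical event variants mirroring the tenet above, not the actual schema.
#[derive(Debug)]
enum BuildEvent {
    BuildRequested { partition_ref: String },
    JobRunStarted { job_run_id: String },
    JobRunFailed { job_run_id: String, reason: String },
    DepMiss { partition_ref: String, missing_dep: String },
}

fn main() {
    // Each event corresponds to something that actually happened, never to a guess.
    let events = vec![
        BuildEvent::BuildRequested { partition_ref: "reports/2024-01-01".into() },
        BuildEvent::JobRunStarted { job_run_id: "run-42".into() },
        BuildEvent::DepMiss {
            partition_ref: "reports/2024-01-01".into(),
            missing_dep: "raw/2024-01-01".into(),
        },
        BuildEvent::JobRunFailed { job_run_id: "run-42".into(), reason: "dep miss".into() },
    ];
    for event in &events {
        println!("{event:?}");
    }
}
```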
## Build & Test
```bash
# Build all databuild components
bazel build //...
# Run databuild unit tests
bazel test //...
# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```
## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities
## DataBuild Job Architecture
### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)
### Graph Configuration
```python
databuild_graph(
name = "my_graph",
jobs = [":job1", ":job2"], # Reference base job targets
lookup = ":job_lookup", # Binary that routes partition refs to jobs
)
```
### Job Lookup Pattern
```python
import re

# Example pattern (hypothetical); real routing depends on your partition ref scheme.
pattern = re.compile(r"^reports/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```
## Notes / Tips
- Rust dependencies are managed via rules_rust, so new dependencies should be added in the `MODULE.bazel` file.