# Agent Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:
- [`DESIGN.md`](./DESIGN.md) - Overall design of DataBuild
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- Component designs - design docs for specific aspects or components of DataBuild:
  - [Core build](docs/design/core-build.md) - How the core semantics of DataBuild work and are implemented
  - [Build event log](docs/design/build-event-log.md) - How the build event log works and is accessed
  - [Service](docs/design/service.md) - How the DataBuild HTTP service and web app are designed
  - [Glossary](docs/design/glossary.md) - Centralized description of key terms
  - [Graph specification](docs/design/graph-specification.md) - The libraries that enable more succinct declaration of DataBuild applications than the core Bazel-based interface
  - [Deploy strategies](docs/design/deploy-strategies.md) - Different strategies for deploying DataBuild applications
  - [Wants](docs/design/wants.md) - How triggering works in DataBuild applications
  - [Why databuild?](docs/design/why-databuild.md) - Why to choose DataBuild over other, better-established orchestration solutions

Consult these docs for any related work; they indicate the project's key technical biases and direction.

## Architecture Pattern

DataBuild implements **Orchestrated State Machines** - a pattern where the application core is composed of:
- **Type-safe state machines** for domain entities (Want, JobRun, Partition)
- **Dependency graphs** expressing relationships between entities
- **Orchestration logic** that coordinates state transitions based on dependencies

This architecture provides compile-time correctness, observability through event sourcing, and a clean separation between entity behavior and coordination logic. See [`docs/orchestrated-state-machines.md`](docs/orchestrated-state-machines.md) for the full theory and implementation patterns.

**Key implications for development:**
- Model entities as explicit state machines with type-parameterized states
- Use consuming methods for state transitions (enforces immutability)
- Emit events to the build event log (BEL) for all state changes (observability)
- Centralize coordination logic in the Orchestrator (separation of concerns)

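
The first three implications above can be sketched in Rust as follows. This is a minimal illustration, not DataBuild's actual code: the state names (`Idle`, `Building`, `Successful`), the `Event` enum, and the `Vec<Event>` standing in for the BEL are all assumptions made for the example.

```rust
use std::marker::PhantomData;

// Hypothetical state markers for a Want; the real state set is defined
// in the design docs, not here.
#[derive(Debug)]
pub struct Idle;
#[derive(Debug)]
pub struct Building;
#[derive(Debug)]
pub struct Successful;

/// Hypothetical event type standing in for entries appended to the BEL.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    WantBuilding { want_id: String },
    WantSuccessful { want_id: String },
}

/// A Want whose current state is encoded in its type parameter.
#[derive(Debug)]
pub struct Want<S> {
    pub id: String,
    _state: PhantomData<S>,
}

impl Want<Idle> {
    pub fn new(id: &str) -> Self {
        Want { id: id.to_string(), _state: PhantomData }
    }

    /// Consumes the idle want, so the old state cannot be used again,
    /// and records the transition in the event log.
    pub fn start_building(self, log: &mut Vec<Event>) -> Want<Building> {
        log.push(Event::WantBuilding { want_id: self.id.clone() });
        Want { id: self.id, _state: PhantomData }
    }
}

impl Want<Building> {
    /// Only a building want can succeed; other states lack this method.
    pub fn succeed(self, log: &mut Vec<Event>) -> Want<Successful> {
        log.push(Event::WantSuccessful { want_id: self.id.clone() });
        Want { id: self.id, _state: PhantomData }
    }
}
```

Because `start_building` takes `self` by value, calling it twice on the same `Want<Idle>` fails to compile, and `succeed` does not exist on `Want<Idle>` at all, so invalid transitions are rejected at compile time rather than at runtime.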
## Tenets

- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a superpower, and investment in it speeds up the flywheel for development and user value.
- **CLI/Service Interchangeability**: Both the CLI and service must produce identical artifacts (BEL events, logs, metrics, outputs) in the same locations. Users should be able to build with one interface and query/inspect results from the other seamlessly. This principle applies to all DataBuild operations, not just builds.
- The BEL represents real things that happen: job run processes that are started or fail, requests from the user, dependency misses, etc.

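
The tenet about "unknown" results can be illustrated with a small parsing sketch. It assumes that "throw" maps to returning an `Err` in Rust; the `JobRunStatus` enum, its variants, the string spellings, and `ParseStatusError` are hypothetical names invented for this example, not DataBuild types.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical status enum. There is deliberately no `Unknown` variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobRunStatus {
    Queued,
    Running,
    Succeeded,
    Failed,
}

/// Error returned when the input matches no known status.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseStatusError(pub String);

impl fmt::Display for ParseStatusError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unrecognized job run status: {:?}", self.0)
    }
}

impl std::error::Error for ParseStatusError {}

impl FromStr for JobRunStatus {
    type Err = ParseStatusError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "queued" => Ok(JobRunStatus::Queued),
            "running" => Ok(JobRunStatus::Running),
            "succeeded" => Ok(JobRunStatus::Succeeded),
            "failed" => Ok(JobRunStatus::Failed),
            // Fail loudly instead of mapping to a catch-all value.
            other => Err(ParseStatusError(other.to_string())),
        }
    }
}
```

With no catch-all variant, every `match` on `JobRunStatus` stays exhaustive over real states, and a malformed input surfaces as an error at the parse site instead of propagating silently.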
## Build & Test
```bash
# Build all databuild components
bazel build //...

# Run databuild unit tests
bazel test //...

# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets, including:
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",  # Binary that routes partition refs to jobs
)
```

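### Job Lookup Pattern
```python
import re

# Hypothetical pattern for illustration; use the partition ref format your job produces.
pattern = re.compile(r"^my_dataset/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    # Fail loudly when no job matches rather than returning a placeholder.
    raise ValueError(f"No job found for: {partition_ref}")
```
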
## Notes / Tips
- Rust dependencies are managed via `rules_rust`, so add new dependencies in the `MODULE.bazel` file.