# Agent Instructions

## Project Overview
DataBuild is a bazel-based data build system. Key files:
- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- Component designs - design docs for specific aspects or components of databuild:
  - [Core build](docs/design/core-build.md) - How the core semantics of databuild work and are implemented
  - [Build event log](docs/design/build-event-log.md) - How the build event log works and is accessed
  - [Service](docs/design/service.md) - How the databuild HTTP service and web app are designed.
  - [Glossary](docs/design/glossary.md) - Centralized description of key terms.
  - [Graph specification](docs/design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
  - [Deploy strategies](docs/design/deploy-strategies.md) - Different strategies for deploying databuild applications.
  - [Wants](docs/design/wants.md) - How triggering works in databuild applications.
  - [Why databuild?](docs/design/why-databuild.md) - Why to choose databuild instead of other, better-established orchestration solutions.

Please reference these for any related work, as they indicate the key technical bias/direction of the project.

## Architecture Pattern

DataBuild implements **Orchestrated State Machines** - a pattern where the application core is composed of:
- **Type-safe state machines** for domain entities (Want, JobRun, Partition)
- **Dependency graphs** expressing relationships between entities
- **Orchestration logic** that coordinates state transitions based on dependencies

This architecture provides compile-time correctness, observability through event sourcing, and clean separation between entity behavior and coordination logic. See [`docs/orchestrated-state-machines.md`](docs/orchestrated-state-machines.md) for the full theory and implementation patterns.

**Key implications for development:**
- Model entities as explicit state machines with type-parameterized states
- Use consuming methods for state transitions (enforces immutability)
- Emit events to the BEL for all state changes (observability)
- Centralize coordination logic in the Orchestrator (separation of concerns)
- If an entity has a `status` field (or similar), it should have a state machine with type-safe transitions that governs it

## Tenets

- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a super-power, and investment in it speeds up the flywheel for development and user value.
- **CLI/Service Interchangeability**: Both the CLI and service must produce identical artifacts (BEL events, logs, metrics, outputs) in the same locations. Users should be able to build with one interface and query/inspect results from the other seamlessly. This principle applies to all DataBuild operations, not just builds.
- The BEL represents real things that happen: job run processes that are started or fail, requests from the user, dep misses, etc.

## Build & Test

```bash
# Build all databuild components
bazel build //...

# Run databuild unit tests
bazel test //...

# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```

## Project Structure

- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities

## DataBuild Job Architecture

### Job Target Structure
Each DataBuild job creates three Bazel targets:
- `job_name.exec` - Execution target (calls binary with "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)

### Graph Configuration
```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)
```

### Job Lookup Pattern
```python
import re

# Placeholder pattern for illustration; use the partition refs your job actually produces.
pattern = re.compile(r"^my_dataset/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
```

## Notes / Tips
- Rust dependencies are implemented via rules_rust, so new dependencies should be added in the `MODULE.bazel` file.
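
As a further illustration of the Job Lookup Pattern section, the sketch below routes several partition ref shapes to different jobs while following the tenet that failed matches should throw rather than return an "unknown" result. The partition ref formats (`raw_events/<date>`, `daily_summary/<date>`), the job labels, and the `JOB_PATTERNS` table are all hypothetical names chosen for the example, not part of DataBuild.

```python
import re

# Hypothetical routing table: (compiled pattern, base job target label).
# Neither the partition ref formats nor the labels come from DataBuild itself.
JOB_PATTERNS = [
    (re.compile(r"^raw_events/\d{4}-\d{2}-\d{2}$"), "//:ingest_events"),
    (re.compile(r"^daily_summary/\d{4}-\d{2}-\d{2}$"), "//:summarize_events"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    """Return the base job target for a partition ref, or raise if none matches."""
    for pattern, job_label in JOB_PATTERNS:
        if pattern.match(partition_ref):
            return job_label
    # No "unknown" fallback: an unroutable partition ref is treated as an error.
    raise ValueError(f"No job found for: {partition_ref}")
```

Under these assumptions, `lookup_job_for_partition("raw_events/2024-01-15")` returns the base job target `//:ingest_events`, and a ref that matches no pattern raises `ValueError` instead of silently producing a placeholder.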