diff --git a/CLAUDE.md b/CLAUDE.md
index 869bf7b..cbf8413 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,64 +2,42 @@
 
 ## Project Overview
 DataBuild is a bazel-based data build system. Key files:
+- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
 - [`databuild.proto`](databuild/databuild.proto) - System interfaces
-- [`manifesto.md`](manifesto.md) - Project philosophy
-- [`core-concepts.md`](core-concepts.md) - Core concepts
+- Component designs - design docs for specific aspects or components of databuild:
+  - [Core build](./design/core-build.md) - How the core semantics of databuild works and are implemented
+  - [Build event log](./design/build-event-log.md) - How the build event log works and is accessed
+  - [Service](./design/service.md) - How the databuild HTTP service and web app are designed.
+  - [Glossary](./design/glossary.md) - Centralized description of key terms.
+  - [Graph specification](./design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
+  - [Observability](./design/observability.md) - How observability is systematically achieved throughout databuild applications.
+  - [Deploy strategies](./design/deploy-strategies.md) - Different strategies for deploying databuild applications.
+  - [Triggers](./design/triggers.md) - How triggering works in databuild applications.
+  - [Why databuild?](./design/why-databuild.md) - Why to choose databuild instead of other better established orchestration solutions.
+
+Please reference these for any related work, as they indicate key technical bias/direction of the project.
 
 ## Tenets
 
+- Declarative over imperative wherever possible/reasonable.
 - We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
-  - In addition, do not add "unknown" results when parses or matches fail - these should always throw.
+- Do not add "unknown" results when parses or matches fail - these should always throw.
+- Compile time correctness is a super-power, and investment in it speeds up flywheel for development and user value.
 
 ## Build & Test
 ```bash
-# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
+# Build all databuild components
+bazel build //...
+
+# Run databuild unit tests
+bazel test //...
+
+# Run end-to-end tests (validates CLI vs Service consistency)
 ./run_e2e_tests.sh
 
-# Run all core unit tests
-./scripts/bb_test_all
-
-# Remote testing
-./scripts/bb_remote_test_all
-
 # Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
 ```
 
-## End-to-End Testing
-The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
-
-### Test Suite Structure
-- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
-- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
-- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
-- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
-
-### Event Validation
-Tests ensure CLI and Service emit identical build events:
-- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
-- **Job events**: Job execution tracking
-- **Partition events**: Partition build status
-
-### CLI vs Service Event Alignment
-Recent improvements ensure both paths emit identical events:
-- CLI: Enhanced with orchestration events to match Service behavior
-- Service: HTTP API orchestration events + core build events
-- Validation: Tests fail if event counts or types differ between CLI and Service
-
-### Running Individual Tests
-```bash
-# Test basic graph
-tests/end_to_end/simple_test.sh \
-    examples/basic_graph/bazel-bin/basic_graph.build \
-    examples/basic_graph/bazel-bin/basic_graph.service
-
-# Test podcast reviews (run from correct directory)
-cd examples/podcast_reviews
-../../tests/end_to_end/podcast_simple_test.sh \
-    bazel-bin/podcast_reviews_graph.build \
-    bazel-bin/podcast_reviews_graph.service
-```
-
 ## Project Structure
 - `databuild/` - Core system (Rust/Proto)
 - `examples/` - Example implementations
@@ -89,21 +67,6 @@ def main():
         handle_exec(sys.argv[2:])  # Perform actual work
 ```
 
-### Job Configuration Requirements
-**CRITICAL**: Job configs must include non-empty `args` for execution to work:
-```python
-config = {
-    "configs": [{
-        "outputs": [{"str": partition_ref}],
-        "inputs": [...],
-        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
-        "env": {"PARTITION_REF": partition_ref}
-    }]
-}
-```
-
-Jobs with `"args": []` will only have their config function called during execution, not exec.
-
 ### DataBuild Execution Flow
 1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
 2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
diff --git a/design/build-event-log.md b/design/build-event-log.md
index d08bb4f..2151a8a 100644
--- a/design/build-event-log.md
+++ b/design/build-event-log.md
@@ -6,6 +6,7 @@ status summary, job run statistics, etc.
 
 ## Architecture
 - Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) / [CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
+- BELs are only ever written to by graph processes (e.g. CLI or service), not the jobs themselves.
 - BEL uses only two types of tables:
   - The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
   - Type-specific event tables (e.g. task even, partition event, build request event, etc).
diff --git a/design/why-databuild.md b/design/why-databuild.md
new file mode 100644
index 0000000..acbe347
--- /dev/null
+++ b/design/why-databuild.md
@@ -0,0 +1,11 @@
+
+# Why DataBuild?
+
+(work in progress)
+
+Why?
+- Orchestration logic changes all the time, better to not write it directly
+- Declarative -> Compile time correctness (e.g. can detect when no job produces a partition pattern)
+- Compartmentalized jobs + data deps -> Simplicity and compartmentalization of complexity
+- Bazel based -> Easy to deploy, maintain, and update
+