Update design docs and claude.md

Stuart Axelbrooke 2025-07-26 20:41:00 -07:00
parent e32fea0d58
commit 3c67d5cb82
3 changed files with 35 additions and 60 deletions


@@ -2,64 +2,42 @@
## Project Overview
DataBuild is a bazel-based data build system. Key files:
- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
- [`manifesto.md`](manifesto.md) - Project philosophy
- [`core-concepts.md`](core-concepts.md) - Core concepts
- Component designs - design docs for specific aspects or components of databuild:
- [Core build](./design/core-build.md) - How the core semantics of databuild work and are implemented
- [Build event log](./design/build-event-log.md) - How the build event log works and is accessed
- [Service](./design/service.md) - How the databuild HTTP service and web app are designed.
- [Glossary](./design/glossary.md) - Centralized description of key terms.
- [Graph specification](./design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
- [Observability](./design/observability.md) - How observability is systematically achieved throughout databuild applications.
- [Deploy strategies](./design/deploy-strategies.md) - Different strategies for deploying databuild applications.
- [Triggers](./design/triggers.md) - How triggering works in databuild applications.
- [Why databuild?](./design/why-databuild.md) - Why choose databuild over other, better-established orchestration solutions.
Please reference these for any related work, as they indicate key technical bias/direction of the project.
## Tenets
- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- In addition, do not add "unknown" results when parses or matches fail - these should always throw.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a super-power, and investment in it speeds up the flywheel for development and user value.
## Build & Test
```bash
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
# Build all databuild components
bazel build //...
# Run databuild unit tests
bazel test //...
# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh
# Run all core unit tests
./scripts/bb_test_all
# Remote testing
./scripts/bb_remote_test_all
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```
## End-to-End Testing
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
### Test Suite Structure
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
### Event Validation
Tests ensure CLI and Service emit identical build events:
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
- **Job events**: Job execution tracking
- **Partition events**: Partition build status
### CLI vs Service Event Alignment
Recent improvements ensure both paths emit identical events:
- CLI: Enhanced with orchestration events to match Service behavior
- Service: HTTP API orchestration events + core build events
- Validation: Tests fail if event counts or types differ between CLI and Service
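The validation rule above (fail when event counts or types differ) can be sketched as a small comparison helper. This is an illustrative sketch only, not the project's test code: the function names and the `"event_type"` key are hypothetical, and events are assumed to be available as plain dicts.

```python
from collections import Counter


def event_type_counts(events):
    """Count events per type. Each event is a dict with a hypothetical "event_type" key."""
    return Counter(event["event_type"] for event in events)


def diff_event_counts(cli_events, service_events):
    """Return {event_type: (cli_count, service_count)} for every type whose counts differ."""
    cli = event_type_counts(cli_events)
    svc = event_type_counts(service_events)
    # Counter returns 0 for missing keys, so a type emitted by only one path shows up here.
    return {
        event_type: (cli[event_type], svc[event_type])
        for event_type in sorted(set(cli) | set(svc))
        if cli[event_type] != svc[event_type]
    }
```

An empty result means the two paths agree; a test would fail on any non-empty result and print the differing types.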
### Running Individual Tests
```bash
# Test basic graph
tests/end_to_end/simple_test.sh \
examples/basic_graph/bazel-bin/basic_graph.build \
examples/basic_graph/bazel-bin/basic_graph.service
# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
bazel-bin/podcast_reviews_graph.build \
bazel-bin/podcast_reviews_graph.service
```
## Project Structure
- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
@@ -89,21 +67,6 @@ def main():
handle_exec(sys.argv[2:]) # Perform actual work
```
### Job Configuration Requirements
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
```python
config = {
"configs": [{
"outputs": [{"str": partition_ref}],
"inputs": [...],
"args": ["some_arg"], # REQUIRED: Cannot be empty []
"env": {"PARTITION_REF": partition_ref}
}]
}
```
For jobs with `"args": []`, only the config function is called during execution; exec is never run.
### DataBuild Execution Flow
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec


@@ -6,6 +6,7 @@ status summary, job run statistics, etc.
## Architecture
- Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) /
[CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
- BELs are only ever written to by graph processes (e.g. CLI or service), not the jobs themselves.
- BEL uses only two types of tables:
- The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
- Type-specific event tables (e.g. task event, partition event, build request event, etc.).
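As a rough illustration of the two-kinds-of-table layout, here is a minimal SQLite sketch with one root event table and one type-specific table. All table names, column names, and helper functions are assumptions made for this example; the actual BEL schema and storage backend are not shown in this document.

```python
import sqlite3

# Hypothetical schema: a root table for every event, plus one type-specific table.
SCHEMA = """
CREATE TABLE build_events (
    event_id         TEXT PRIMARY KEY,
    timestamp        INTEGER NOT NULL,
    event_type       TEXT NOT NULL,
    message          TEXT,
    build_request_id TEXT,
    partition_ref    TEXT
);
CREATE TABLE partition_events (
    event_id      TEXT PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status        TEXT NOT NULL
);
"""


def append_partition_event(conn, event_id, timestamp, partition_ref, status, message=""):
    """Append-only write: one row in the root table and one in the type-specific table."""
    with conn:  # commit both inserts together, or roll both back on error
        conn.execute(
            "INSERT INTO build_events (event_id, timestamp, event_type, message, partition_ref) "
            "VALUES (?, ?, 'partition', ?, ?)",
            (event_id, timestamp, message, partition_ref),
        )
        conn.execute(
            "INSERT INTO partition_events (event_id, partition_ref, status) VALUES (?, ?, ?)",
            (event_id, partition_ref, status),
        )


def latest_partition_status(conn, partition_ref):
    """Read side: derive current state from the event history instead of mutating rows."""
    row = conn.execute(
        "SELECT p.status FROM partition_events p "
        "JOIN build_events e ON e.event_id = p.event_id "
        "WHERE p.partition_ref = ? ORDER BY e.timestamp DESC LIMIT 1",
        (partition_ref,),
    ).fetchone()
    return row[0] if row else None
```

In keeping with the event-sourcing approach, writes only ever append rows, and current state is computed by querying the history.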

design/why-databuild.md (new file, 11 lines)

@@ -0,0 +1,11 @@
# Why DataBuild?
(work in progress)
Why?
- Orchestration logic changes all the time, so it is better not to write it directly
- Declarative -> Compile time correctness (e.g. can detect when no job produces a partition pattern)
- Compartmentalized jobs + data deps -> Simplicity and compartmentalization of complexity
- Bazel based -> Easy to deploy, maintain, and update
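To make the "no job produces a partition pattern" check concrete, here is a minimal sketch. It assumes, purely for illustration, that each job declares its outputs as regular expressions; DataBuild's real pattern syntax and declaration mechanism are not described here, and `unproduced_partitions` is a hypothetical helper.

```python
import re


def unproduced_partitions(job_output_patterns, wanted_partitions):
    """Return the wanted partition refs that no job's declared output pattern matches.

    job_output_patterns maps a job name to a list of regex patterns (an assumed format).
    """
    compiled = [
        re.compile(pattern)
        for patterns in job_output_patterns.values()
        for pattern in patterns
    ]
    return [
        ref
        for ref in wanted_partitions
        if not any(regex.fullmatch(ref) for regex in compiled)
    ]
```

Because the job declarations are data rather than imperative code, a check like this can run before any job executes and reject a graph that asks for a partition nothing can build.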