Update design docs and claude.md
parent e32fea0d58
commit 3c67d5cb82
3 changed files with 35 additions and 60 deletions
CLAUDE.md (83 lines changed)
    @@ -2,64 +2,42 @@

     ## Project Overview
     DataBuild is a bazel-based data build system. Key files:
    +- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
     - [`databuild.proto`](databuild/databuild.proto) - System interfaces
    -- [`manifesto.md`](manifesto.md) - Project philosophy
    -- [`core-concepts.md`](core-concepts.md) - Core concepts
    +- Component designs - design docs for specific aspects or components of databuild:
    +- [Core build](./design/core-build.md) - How the core semantics of databuild work and are implemented
    +- [Build event log](./design/build-event-log.md) - How the build event log works and is accessed
    +- [Service](./design/service.md) - How the databuild HTTP service and web app are designed.
    +- [Glossary](./design/glossary.md) - Centralized description of key terms.
    +- [Graph specification](./design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
    +- [Observability](./design/observability.md) - How observability is systematically achieved throughout databuild applications.
    +- [Deploy strategies](./design/deploy-strategies.md) - Different strategies for deploying databuild applications.
    +- [Triggers](./design/triggers.md) - How triggering works in databuild applications.
    +- [Why databuild?](./design/why-databuild.md) - Why to choose databuild instead of other better-established orchestration solutions.
    +
    +Please reference these for any related work, as they indicate key technical bias/direction of the project.

     ## Tenets

    +- Declarative over imperative wherever possible/reasonable.
     - We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
    -- In addition, do not add "unknown" results when parses or matches fail - these should always throw.
    +- Do not add "unknown" results when parses or matches fail - these should always throw.
    +- Compile time correctness is a super-power, and investment in it speeds up the flywheel for development and user value.

     ## Build & Test
     ```bash
    -# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
    +# Build all databuild components
    +bazel build //...
    +
    +# Run databuild unit tests
    +bazel test //...
    +
    +# Run end-to-end tests (validates CLI vs Service consistency)
     ./run_e2e_tests.sh

    -# Run all core unit tests
    -./scripts/bb_test_all
    -
    -# Remote testing
    -./scripts/bb_remote_test_all
    -
     # Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
     ```

    -## End-to-End Testing
    -The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
    -
    -### Test Suite Structure
    -- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
    -- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
    -- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
    -- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
    -
    -### Event Validation
    -Tests ensure CLI and Service emit identical build events:
    -- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
    -- **Job events**: Job execution tracking
    -- **Partition events**: Partition build status
    -
    -### CLI vs Service Event Alignment
    -Recent improvements ensure both paths emit identical events:
    -- CLI: Enhanced with orchestration events to match Service behavior
    -- Service: HTTP API orchestration events + core build events
    -- Validation: Tests fail if event counts or types differ between CLI and Service
    -
    -### Running Individual Tests
    -```bash
    -# Test basic graph
    -tests/end_to_end/simple_test.sh \
    -  examples/basic_graph/bazel-bin/basic_graph.build \
    -  examples/basic_graph/bazel-bin/basic_graph.service
    -
    -# Test podcast reviews (run from correct directory)
    -cd examples/podcast_reviews
    -../../tests/end_to_end/podcast_simple_test.sh \
    -  bazel-bin/podcast_reviews_graph.build \
    -  bazel-bin/podcast_reviews_graph.service
    -```
    -
     ## Project Structure
     - `databuild/` - Core system (Rust/Proto)
     - `examples/` - Example implementations
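One of the tenets added in the hunk above says that failed parses or matches should throw rather than produce an "unknown" result. A minimal Python sketch of that idea follows. The `JobStatus` enum, its members, and `parse_job_status` are hypothetical names chosen for illustration; they are not taken from the DataBuild codebase.

```python
from enum import Enum


class JobStatus(Enum):
    """Hypothetical status values, used only for this illustration."""

    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"


def parse_job_status(raw: str) -> JobStatus:
    """Parse a status string, raising instead of returning an 'unknown' value."""
    try:
        return JobStatus(raw)
    except ValueError:
        valid = ", ".join(status.value for status in JobStatus)
        raise ValueError(
            f"unrecognized job status {raw!r}; expected one of: {valid}"
        ) from None
```

Because there is no `UNKNOWN` member, a caller cannot silently carry a bad value forward; the failure surfaces at the point where the input was parsed.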
    @@ -89,21 +67,6 @@ def main():
     handle_exec(sys.argv[2:])  # Perform actual work
     ```

    -### Job Configuration Requirements
    -**CRITICAL**: Job configs must include non-empty `args` for execution to work:
    -```python
    -config = {
    -    "configs": [{
    -        "outputs": [{"str": partition_ref}],
    -        "inputs": [...],
    -        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
    -        "env": {"PARTITION_REF": partition_ref}
    -    }]
    -}
    -```
    -
    -Jobs with `"args": []` will only have their config function called during execution, not exec.
    -
     ### DataBuild Execution Flow
     1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
     2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
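The two-phase flow described in the hunk above (a planning call that returns job configurations, then an execution call that does the work) can be sketched as a single job entry point. This is only an illustration under stated assumptions: the `config` and `exec` mode names and the `handle_config` helper are hypothetical, and the config shape is copied from the removed "Job Configuration Requirements" example. Only `handle_exec(sys.argv[2:])` appears in the diff itself.

```python
import json


def handle_config(partition_refs):
    """Planning phase: describe how each requested partition would be built."""
    return {
        "configs": [
            {
                "outputs": [{"str": ref}],
                "inputs": [],
                "args": [ref],  # non-empty, as the removed section required
                "env": {"PARTITION_REF": ref},
            }
            for ref in partition_refs
        ]
    }


def handle_exec(args):
    """Execution phase: perform the actual work (here it is only reported)."""
    return [f"built {ref}" for ref in args]


def main(argv):
    """Dispatch on a mode argument; the mode names are assumptions."""
    mode, rest = argv[1], argv[2:]
    if mode == "config":
        print(json.dumps(handle_config(rest)))
    elif mode == "exec":
        for line in handle_exec(rest):
            print(line)
    else:
        # Unrecognized modes raise rather than being ignored.
        raise ValueError(f"unknown mode {mode!r}; expected 'config' or 'exec'")
```

A script using this sketch would end with `main(sys.argv)` after `import sys`, so that `job config <ref>` prints the JSON configuration and `job exec <ref>` performs the build.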
    @@ -6,6 +6,7 @@ status summary, job run statistics, etc.
     ## Architecture
     - Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) /
     [CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
    +- BELs are only ever written to by graph processes (e.g. CLI or service), not the jobs themselves.
     - BEL uses only two types of tables:
     - The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
     - Type-specific event tables (e.g. task event, partition event, build request event, etc.).
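The two-table layout described in the hunk above (one root event table plus type-specific tables) can be sketched with SQLite. Every table name, column name, and helper function below is an assumption made for illustration; the design doc names only the kinds of fields involved, not a concrete schema.

```python
import sqlite3

# Hypothetical schema: one root table plus one type-specific table.
SCHEMA = """
CREATE TABLE build_events (
    event_id         TEXT PRIMARY KEY,
    timestamp        INTEGER NOT NULL,
    message          TEXT NOT NULL,
    event_type       TEXT NOT NULL,
    build_request_id TEXT,
    partition_ref    TEXT
);
CREATE TABLE partition_events (
    event_id      TEXT PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status        TEXT NOT NULL
);
"""


def append_partition_event(conn, event_id, timestamp, partition_ref, status, message=""):
    """Append one event to both tables; existing rows are never updated."""
    with conn:  # commits both inserts together, or rolls both back
        conn.execute(
            "INSERT INTO build_events VALUES (?, ?, ?, 'partition', NULL, ?)",
            (event_id, timestamp, message, partition_ref),
        )
        conn.execute(
            "INSERT INTO partition_events VALUES (?, ?, ?)",
            (event_id, partition_ref, status),
        )


def latest_partition_status(conn, partition_ref):
    """Derive current state by reading the log; returns None if no events exist."""
    row = conn.execute(
        """
        SELECT p.status
        FROM partition_events AS p
        JOIN build_events AS e ON e.event_id = p.event_id
        WHERE p.partition_ref = ?
        ORDER BY e.timestamp DESC, e.rowid DESC
        LIMIT 1
        """,
        (partition_ref,),
    ).fetchone()
    return row[0] if row else None
```

In the spirit of event sourcing, the sketch only ever inserts rows, and the current status of a partition is computed from the log rather than stored separately.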
design/why-databuild.md (11 lines changed)
Normal file
    @@ -0,0 +1,11 @@
    +
    +# Why DataBuild?
    +
    +(work in progress)
    +
    +Why?
    +- Orchestration logic changes all the time, better to not write it directly
    +- Declarative -> Compile time correctness (e.g. can detect when no job produces a partition pattern)
    +- Compartmentalized jobs + data deps -> Simplicity and compartmentalization of complexity
    +- Bazel based -> Easy to deploy, maintain, and update
    +
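The "can detect when no job produces a partition pattern" point in the file above can be illustrated with a small check that runs when a graph is declared, before any job executes. This is a sketch under assumptions: the glob-style patterns, the job names, and both function names are hypothetical, and Python can only approximate a compile-time check by validating at definition time.

```python
from fnmatch import fnmatchcase


def find_unproduced(job_outputs, required_refs):
    """Return the required partition refs that no job's output pattern matches."""
    patterns = [p for job_patterns in job_outputs.values() for p in job_patterns]
    return [
        ref
        for ref in required_refs
        if not any(fnmatchcase(ref, pattern) for pattern in patterns)
    ]


def check_graph(job_outputs, required_refs):
    """Fail fast if any required partition has no producing job."""
    missing = find_unproduced(job_outputs, required_refs)
    if missing:
        raise ValueError(f"no job produces: {', '.join(missing)}")
```

For example, with jobs declared as `{"extract": ["reviews/date=*"]}`, calling `check_graph` with a required ref of `podcasts/all` raises immediately instead of failing midway through a build.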