docs: Remove event sourcing section, add data build services discussion
parent fbdbe0d4dd
commit dcc5471a29
1 changed file with 2 additions and 15 deletions
manifesto.md (17 changed lines)
@@ -153,21 +153,8 @@ Note: Data services are missing below. They could be "libraries" for data, e.g.
# Appendix

## Is this some rotation of event sourcing?

Event sourcing aspires to be honest about the different kinds of real data captured by, and internal to, the system.

Event sourcing's core tenets, when applied to DataBuild, offer a powerful paradigm for managing data lineage, reproducibility, and system evolution (a minimal sketch follows the list):

1. **All Changes are Events:** Every significant action or change in the data lifecycle (e.g., raw data ingestion, job execution, schema modification, quality check failure/success) is captured as an immutable event. This event log becomes the definitive history and single source of truth. For DataBuild, this means a job run isn't just a task, but an event that records its inputs, configuration, and outputs.
2. **State is Derived from Events:** The current state of any data asset (like a partition's content or metadata) is a result of applying all relevant historical events in order. This allows DataBuild to reconstruct the state of the system or any specific dataset at any point in time, which is invaluable for debugging, auditing, and understanding data evolution.
3. **Immutability:** Events, once recorded, are never deleted or modified. If a mistake occurs, a new compensating event is recorded to correct it. This ensures a complete and auditable trail, aligning with DataBuild's need for robust lineage tracking.
4. **Temporal Queries:** The event log naturally supports querying the state of data as it was at any previous point. This is powerful for DataBuild when analyzing how a dataset has changed or why a job produced different results over time.
5. **Decoupling and Projections:** Different parts of the DataBuild system (e.g., the data catalog, job scheduler, lineage tracker) can consume the event stream independently and build their own "projections" or views of the data they need. This promotes modularity and allows different components to evolve separately. For instance, the Data Catalog becomes a projection of partition creation/update events.
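
The first four tenets can be made concrete with a short sketch. This is a minimal illustration only, under assumed event shapes; `PartitionWritten`, `PartitionInvalidated`, and `state_at` are hypothetical names, not DataBuild's actual types:

```python
# Minimal event-sourcing sketch; event shapes are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)   # frozen: an event is immutable once recorded
class PartitionWritten:
    at: datetime
    partition: str
    content_hash: str

@dataclass(frozen=True)   # a compensating event; history is never edited
class PartitionInvalidated:
    at: datetime
    partition: str
    reason: str

Event = PartitionWritten | PartitionInvalidated

def state_at(log: list[Event], partition: str, as_of: datetime) -> dict:
    """Derive a partition's state by replaying all relevant events in order."""
    state = {"exists": False, "content_hash": None}
    for e in sorted(log, key=lambda e: e.at):
        if e.partition != partition or e.at > as_of:
            continue   # the as_of cutoff is what gives temporal queries for free
        if isinstance(e, PartitionWritten):
            state = {"exists": True, "content_hash": e.content_hash}
        else:   # PartitionInvalidated
            state = {"exists": False, "content_hash": None}
    return state
```

Fixing a bad partition means appending a compensating or superseding event, never editing the log, and calling `state_at` with an earlier `as_of` reconstructs exactly what the system believed at that moment.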

By adopting these tenets, DataBuild can achieve a highly auditable, reproducible, and resilient system. The "Data Catalog" and "Job Log" would effectively become materialized views or projections derived from this fundamental event stream, providing robust foundations for data reconciliation, lineage tracking, and understanding the history of every data asset.
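
Continuing the sketch above, the catalog-as-projection idea might look like the following; the `DataCatalog` class here is an assumption for illustration, not DataBuild's actual component:

```python
# A projection: the catalog holds no independent truth of its own; it is a
# fold over the event stream and can be rebuilt from scratch by replaying it.
class DataCatalog:
    def __init__(self) -> None:
        self.latest: dict[str, str] = {}   # partition -> newest content hash

    def apply(self, e: Event) -> None:
        if isinstance(e, PartitionWritten):
            self.latest[e.partition] = e.content_hash
        elif isinstance(e, PartitionInvalidated):
            self.latest.pop(e.partition, None)
```

A Job Log, scheduler, or lineage tracker would each consume the same stream with their own `apply`, which is what lets the components evolve independently.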

## Why not data build services?

Currently DataBuild handles all data materialization by spawning a new process that runs a described program to completion. Most programs take at least a couple of seconds to start up, and some are much worse (e.g. Python loading particularly heavy runtime dependencies). It's easy to imagine DataBuild instead talking to data build app servers that are already live, interacting via RPC calls, as sketched below.
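
As a sketch of what that interaction could look like: the `materialize` entry point is a hypothetical name, and stdlib XML-RPC is used purely for brevity; a real deployment would more likely use gRPC or similar.

```python
# build_server.py -- sketch of a long-lived "data build app server".
# Heavy runtime deps are imported once at startup, not on every job run.
from xmlrpc.server import SimpleXMLRPCServer

# import heavy_runtime_dep   # e.g. the slow Python imports, paid only once

def materialize(partition_ref: str) -> dict:
    """Hypothetical entry point: build one partition, report the result."""
    # ... actual build logic for partition_ref goes here ...
    return {"partition": partition_ref, "status": "ok"}

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(materialize)
server.serve_forever()   # stays warm; DataBuild drives it with one RPC per job
```

DataBuild would then issue something like `xmlrpc.client.ServerProxy("http://localhost:8000").materialize("events/2024-01-01")` instead of forking a fresh process. The evident tradeoff: a warm server skips startup cost but can accumulate state across builds, which cuts against the isolation a fresh process gives each job.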

Example use case:

# What would the pros and cons of this model be? AI!