From fbdbe0d4dd2d8ded0b20bec473202ed99f9fc45b Mon Sep 17 00:00:00 2001
From: "Stuart Axelbrooke (aider)"
Date: Sun, 25 May 2025 15:06:38 -0700
Subject: [PATCH] docs: Explain event sourcing tenets relevant to DataBuild

---
 manifesto.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/manifesto.md b/manifesto.md
index ff70bba..e5fb144 100644
--- a/manifesto.md
+++ b/manifesto.md
@@ -157,7 +157,15 @@ Note: Data services are missing below. They could be "libraries" for data, e.g.
 
 Event sourcing aspires to be honest about the different real data captured by and internal to the system.
 
-# Describe event sourcing's worldview/tenets as relevant to databuild AI!
+Event sourcing's core tenets, when applied to DataBuild, offer a powerful paradigm for managing data lineage, reproducibility, and system evolution:
+
+1. **All Changes are Events:** Every significant action or change in the data lifecycle (e.g., raw data ingestion, job execution, schema modification, quality check failure/success) is captured as an immutable event. This event log becomes the definitive history and single source of truth. For DataBuild, this means a job run isn't just a task, but an event that records its inputs, configuration, and outputs.
+2. **State is Derived from Events:** The current state of any data asset (like a partition's content or metadata) is a result of applying all relevant historical events in order. This allows DataBuild to reconstruct the state of the system or any specific dataset at any point in time, which is invaluable for debugging, auditing, and understanding data evolution.
+3. **Immutability:** Events, once recorded, are never deleted or modified. If a mistake occurs, a new compensating event is recorded to correct it. This ensures a complete and auditable trail, aligning with DataBuild's need for robust lineage tracking.
+4. **Temporal Queries:** The event log naturally supports querying the state of data as it was at any previous point. This is powerful for DataBuild when analyzing how a dataset has changed or why a job produced different results over time.
+5. **Decoupling and Projections:** Different parts of the DataBuild system (e.g., the data catalog, job scheduler, lineage tracker) can consume the event stream independently and build their own "projections" or views of the data they need. This promotes modularity and allows different components to evolve separately. For instance, the Data Catalog becomes a projection of partition creation/update events.
+
+By adopting these tenets, DataBuild can achieve a highly auditable, reproducible, and resilient system. The "Data Catalog" and "Job Log" would effectively become materialized views or projections derived from this fundamental event stream, providing robust foundations for data reconciliation, lineage tracking, and understanding the history of every data asset.
 
 ## Why not data build services?
 
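Illustration (not part of the patch above): the added documentation describes event replay and projections abstractly. The following is a minimal Python sketch of those ideas, assuming nothing about DataBuild's actual codebase or API; all names here (`PartitionWritten`, `PartitionInvalidated`, `DataCatalog`, `replay`, the example paths) are invented for the example.

```python
# Illustrative sketch only, not DataBuild's actual API: an append-only event log
# and a "data catalog" projection whose state is derived purely by replaying events.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class PartitionWritten:
    """Immutable event: a job run produced (or replaced) a partition."""
    partition: str      # e.g. "events/2025-05-25" (hypothetical partition key)
    job_run_id: str
    output_path: str


@dataclass(frozen=True)
class PartitionInvalidated:
    """Compensating event: a prior write was found to be bad; nothing is deleted."""
    partition: str
    reason: str


Event = object  # union of the event types above, kept loose for brevity


@dataclass
class DataCatalog:
    """A projection: current partition -> location, derived entirely from events."""
    live_partitions: Dict[str, str] = field(default_factory=dict)

    def apply(self, event: Event) -> None:
        # The projection never mutates history; it only folds events into a view.
        if isinstance(event, PartitionWritten):
            self.live_partitions[event.partition] = event.output_path
        elif isinstance(event, PartitionInvalidated):
            self.live_partitions.pop(event.partition, None)


def replay(events: List[Event]) -> DataCatalog:
    """State is derived from events: fold the full log (or any prefix of it,
    for a point-in-time view) into the catalog projection."""
    catalog = DataCatalog()
    for event in events:
        catalog.apply(event)
    return catalog


if __name__ == "__main__":
    log: List[Event] = [
        PartitionWritten("events/2025-05-25", "run-1", "s3://bucket/v1"),
        PartitionInvalidated("events/2025-05-25", "failed quality check"),
        PartitionWritten("events/2025-05-25", "run-2", "s3://bucket/v2"),
    ]
    print(replay(log).live_partitions)      # current state: points at s3://bucket/v2
    print(replay(log[:1]).live_partitions)  # temporal query: state after the first event
```

Replaying a prefix of the log yields the catalog as it was at that point in time, which is the temporal-query property the patch describes; the projection itself holds no authoritative state and can be discarded and rebuilt from the log at any time.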