If you look at the BEL (build event log) definition, you'll see that it has two components: the literal serialized event stream, and the build state, a projection of the events into objects (e.g. via a reducer):
pub struct BuildEventLog<S: BELStorage + Debug> {
    pub storage: S,
    pub state: BuildState,
}
storage holds the literal events that happened: a job run being launched, a want being requested, a job run finishing and producing some number of partitions, etc. state answers questions about the state of the world that results from the serial occurrence of the recorded events, like "is the partition x/y/z live?" and "why hasn't partition a/b/c been built yet?" state is essentially the thing responsible for system consistency.
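To make the "projection via reducer" idea concrete, here is a minimal sketch of state as a fold over the event stream. The `Event`, `PartitionStatus`, and `SketchState` types and the `replay` function are hypothetical stand-ins invented for illustration; they are not the actual `BuildState` or event definitions in databuild.

```rust
use std::collections::HashMap;

/// Hypothetical event type; the real databuild events are assumed to be richer.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    WantRequested { partition: String },
    JobRunLaunched { job_id: u64, partitions: Vec<String> },
    JobRunSucceeded { job_id: u64 },
}

/// Hypothetical partition lifecycle used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionStatus {
    Wanted,
    Building,
    Live,
}

/// Stand-in for `BuildState`: objects projected from the events.
#[derive(Debug, Default)]
struct SketchState {
    partitions: HashMap<String, PartitionStatus>,
    /// Partitions each in-flight job run is expected to produce.
    running: HashMap<u64, Vec<String>>,
}

impl SketchState {
    /// The reducer: consume one event and return the next state.
    fn apply(mut self, event: &Event) -> Self {
        match event {
            Event::WantRequested { partition } => {
                // Do not downgrade a partition that is already building or live.
                self.partitions
                    .entry(partition.clone())
                    .or_insert(PartitionStatus::Wanted);
            }
            Event::JobRunLaunched { job_id, partitions } => {
                for p in partitions {
                    self.partitions.insert(p.clone(), PartitionStatus::Building);
                }
                self.running.insert(*job_id, partitions.clone());
            }
            Event::JobRunSucceeded { job_id } => {
                // Only a known, in-flight job run can make partitions live.
                if let Some(parts) = self.running.remove(job_id) {
                    for p in parts {
                        self.partitions.insert(p, PartitionStatus::Live);
                    }
                }
            }
        }
        self
    }

    /// Answers "is the partition x/y/z live?".
    fn is_live(&self, partition: &str) -> bool {
        self.partitions.get(partition) == Some(&PartitionStatus::Live)
    }
}

/// Rebuild state from scratch by folding the reducer over the stored events.
fn replay(events: &[Event]) -> SketchState {
    events.iter().fold(SketchState::default(), SketchState::apply)
}
```

Under these assumptions, replaying a want, a launch, and a success for x/y/z yields a state in which `is_live("x/y/z")` is true, while replaying only the first two events leaves the partition in `Building`.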
Most of the code in this project is in calculating next states for build state objects: determining wants that can have jobs run to satisfy them, updating partitions to live after a job run succeeds, etc. Can we formalize this into a composition of state machines to simplify the codebase, achieve more compile-time safety, and potentially unlock greater concurrency as a byproduct?
CPN (colored Petri net) concurrency can be described succinctly: if the workloads touch disjoint places, they can be run concurrently. This seems to overwhelmingly be the case for the domain databuild is interested in, where a single "data service" is traditionally responsible for producing partitions in a given dataset. Another huge benefit of using a CPN framing for databuild is that it separates state updates/consistency from all the stuff that connects to it.
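The disjoint-places rule is small enough to sketch directly. The `Transition` type, the `transition` helper, and the place names used with it are hypothetical, chosen only to illustrate the check; a real CPN would also carry token colors and guards, which this sketch ignores.

```rust
use std::collections::HashSet;

/// Hypothetical unit of work, described only by the places it touches.
#[derive(Debug)]
struct Transition {
    name: String,
    places: HashSet<String>,
}

/// Convenience constructor for the sketch.
fn transition(name: &str, places: &[&str]) -> Transition {
    Transition {
        name: name.to_string(),
        places: places.iter().map(|p| p.to_string()).collect(),
    }
}

/// Two transitions may run concurrently if they share no place.
fn can_run_concurrently(a: &Transition, b: &Transition) -> bool {
    a.places.is_disjoint(&b.places)
}

/// Greedily pick a batch of mutually disjoint transitions, in input order.
/// Anything that overlaps an already-claimed place waits for a later batch.
fn concurrent_batch(candidates: &[Transition]) -> Vec<&Transition> {
    let mut claimed: HashSet<&str> = HashSet::new();
    let mut batch = Vec::new();
    for t in candidates {
        if t.places.iter().all(|p| !claimed.contains(p.as_str())) {
            claimed.extend(t.places.iter().map(String::as_str));
            batch.push(t);
        }
    }
    batch
}
```

With one transition per dataset, every candidate lands in the same batch; a transition that also touches another dataset's place is held back until that place is free. The greedy order is a simplification, not a claim about how databuild would schedule.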
Appendix
Partition Collisions?
Random thought: we also have this lingering question of "what if unrelated wants collide in the partition space", specifically for a paradigm where job runs produce multiple partitions based on their parameterization. This framing may also give us the confidence to just cancel the later of the colliding jobs and have it reschedule (how would the partitions differ?). Or, given that we update partition building status when a job is scheduled, we could be confident that we never get into that situation: at the later want grouping stage (pre job scheduling), the second want would see the conflicting partition as building, thanks to the earlier job having been started. Probably worth constructing a literal situation for this to war game it, or implementing a literal integration test.
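As a starting point for that war game, here is a minimal sketch of the second option: partitions are marked as building at schedule time, so a later, overlapping want is deferred rather than launched. `Scheduler`, `ScheduleOutcome`, and `try_schedule` are hypothetical names, and the sketch assumes a job run's output partitions are known before it is scheduled.

```rust
use std::collections::HashSet;

/// Hypothetical scheduler state: partitions currently being built.
#[derive(Debug, Default)]
struct Scheduler {
    building: HashSet<String>,
}

#[derive(Debug, PartialEq)]
enum ScheduleOutcome {
    /// The job run was scheduled and its outputs are now marked building.
    Scheduled,
    /// The job run was held back because some outputs are already building.
    Deferred { conflicts: Vec<String> },
}

impl Scheduler {
    /// Try to schedule a job run that would produce `outputs`.
    /// Nothing is marked as building unless the whole run can be scheduled.
    fn try_schedule(&mut self, outputs: &[&str]) -> ScheduleOutcome {
        let mut conflicts: Vec<String> = outputs
            .iter()
            .filter(|p| self.building.contains(**p))
            .map(|&p| p.to_string())
            .collect();
        conflicts.sort();

        if conflicts.is_empty() {
            self.building.extend(outputs.iter().map(|&p| p.to_string()));
            ScheduleOutcome::Scheduled
        } else {
            ScheduleOutcome::Deferred { conflicts }
        }
    }
}
```

A literal scenario: a first want schedules a job run producing ds/a and ds/b, then an unrelated want tries to schedule a run producing ds/b and ds/c. Under these assumptions the second call returns `Deferred` with ds/b as the conflict and leaves ds/c untouched, so a narrower run for ds/c alone could still be scheduled.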