databuild/docs/narrative/partition-identity.md at 022868b7b074b305d01a1a5df37d88be0317d7a8

Stuart Axelbrooke f388f4d86d WIP I guess

2025-10-11 11:13:27 -07:00

The core question: what defines partition identity?

Hive partitioning is great, but making every source of variation or dimension of config a partition column would be ridiculous for real applications.
- Why would it be ridiculous? Examples:
  - We use many config dims: targeting goal, modeling strategy, per-modeling strategy config (imagine flattening a struct tree), etc. Then we add another dimension to the config for an experiment. Is that a "new dataset"?
    - "all config goes through partitions" means that every config change has to be a dataset or partition pattern change.
    - Maybe this is fine, actually, as long as authors make sensible defaults a non-problem? (e.g. think of schema evolution)
  - In cases where config is internally resolved by the job,

Who's to say the partition isn't a struct itself? s3://my/dataset/date=2025-01-01 could also be {"dataset": "s3://my/dataset", "date": "2025-01-01"}, or even {"kind": "icecream", "meta": {"toppings": ["sprinkles"]}} (though a clear requirement is that databuild does not parse partition refs)

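
One way to picture the struct idea while keeping the "databuild does not parse partition refs" requirement: the job author serializes the struct to a canonical string, and everything downstream only ever compares strings. This is a minimal sketch, not databuild's API; `struct_ref` is a hypothetical helper and canonical JSON is just one assumed encoding.

```python
import json


def struct_ref(fields: dict) -> str:
    """Serialize a struct-shaped partition ref to a canonical string.

    Hypothetical helper: sorted keys and fixed separators mean the same
    struct always yields the same string, so refs can be compared by plain
    string equality without anyone parsing them.
    """
    return json.dumps(fields, sort_keys=True, separators=(",", ":"))


# The Hive-style path and the struct form are both just opaque strings.
hive_ref = "s3://my/dataset/date=2025-01-01"
ref_a = struct_ref({"dataset": "s3://my/dataset", "date": "2025-01-01"})
ref_b = struct_ref({"date": "2025-01-01", "dataset": "s3://my/dataset"})

# Key order in the author's dict does not change the ref's identity.
assert ref_a == ref_b
```

The trade-off is that two refs that "mean" the same partition but are spelled differently (for example the path form versus the struct form) are distinct identities unless the author normalizes them first.
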
Should we consider the bazel config approach, where identity is the whole set of config, manifest as a hash? (not human-readable)
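
To make the hash option concrete, here is a rough sketch of identity as a digest over the whole config. It only illustrates the general idea and makes no claim about how Bazel actually computes its configuration hashes; `config_identity` and the config keys are hypothetical.

```python
import hashlib
import json


def config_identity(config: dict) -> str:
    """Return a SHA-256 hex digest of the canonical JSON form of a config.

    Hypothetical helper: any change to any config dimension, including
    adding a new one, produces a different identity.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative config with a nested, per-strategy section.
base = {
    "targeting_goal": "ctr",
    "modeling": {"strategy": "gbm", "params": {"depth": 6}},
}

# Adding a dimension for an experiment changes the identity...
experiment = {**base, "experiment": "new_feature"}
assert config_identity(base) != config_identity(experiment)

# ...while reordering keys does not.
reordered = {"modeling": base["modeling"], "targeting_goal": "ctr"}
assert config_identity(base) == config_identity(reordered)
```

This shows both sides of the question: nothing needs to be flattened into partition columns, but the resulting identity is a 64-character hex string that says nothing to a human reader.
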
