diff --git a/plans/python-dsl.md b/plans/python-dsl.md
index 4a8aecd..9205a50 100644
--- a/plans/python-dsl.md
+++ b/plans/python-dsl.md
@@ -92,6 +92,149 @@ Different parts of the system need different guarantees:
 - Query planning algorithms for data pipelines
 - Time-travel and what-if analysis capabilities
 
+### Python as the Pattern Language
+
+A crucial realization: we don't need a new DSL for partition patterns - Python itself is the DSL! Python's built-in string handling and pattern matching provide nearly everything needed:
+
+#### Built-in Pattern Matching (Python 3.10+)
+```python
+def parse_partition_ref(partition_ref):
+    # f-strings are not valid `case` patterns, so match on the "/"-separated segments.
+    match partition_ref.split("/"):
+        case ["raw", source, date] if source in ("api", "ftp"):
+            return {"source": source, "date": date}
+        case ["model", name, version] if version[:1] == "v" and version[1:].isdigit():
+            return {"name": name, "version": int(version[1:])}
+```
+
+#### Format-String Templates as Bidirectional Patterns
+```python
+from parse import parse  # Third-party `parse` package (pip install parse): the inverse of format()
+
+# Define pattern
+pattern = "s3://bucket/{table}/date={date}/hour={hour:02d}"
+
+# Extract (string → values)
+result = parse(pattern, "s3://bucket/events/date=2024-01-01/hour=13")
+# result.named == {'table': 'events', 'date': '2024-01-01', 'hour': 13}
+
+# Generate (values → string)
+pattern.format(table='events', date='2024-01-01', hour=13)
+```
+
+#### Native Python Pattern Types
+- **Glob patterns**: `raw/*/2024-01-*` using `fnmatch` or `pathlib`
+- **Regular expressions**: Named groups for complex patterns
+- **Type annotations**: `Annotated[RawPartition, "all sources for date"]`
+- **String templates**: Standard library `string.Template`
+
+#### Why This Is Beautiful
+1. **No new syntax** - Developers already know these patterns
+2. **Full IDE support** - Autocomplete, refactoring, type checking
+3. **Composable** - Mix glob, regex, and format templates as needed
+4. **Testable** - Just Python strings and functions
+5. **Bidirectional** - The same format template serves for matching and generating
+
+The key insight: Python's string formatting is already a pattern language. We don't need to build a DSL - we need to embrace Python as the DSL.
+
+### Rust as an Alternative DSL
+
+While Python offers maximum expressiveness, Rust presents an intriguing alternative that could provide compile-time guarantees without sacrificing elegance:
+
+#### Procedural Macros for Job Definition
+```rust
+#[job]
+fn process_daily(
+    date: Date,
+    #[partition("raw/{source}/{date}")] source: Partition,
+) -> Result<Partition> {  // `?` below requires a Result return type
+    let raw_data = source.load()?;
+    transform(raw_data, date).save()
+}
+
+#[job(resources = "memory=4G,cpu=2")]
+fn aggregate_weekly(
+    week: Week,
+    #[partition("processed/[date:date+7]")] daily: PartitionSet,
+) -> Result<Partition> {
+    daily.load_all()?.aggregate().save()
+}
+```
+
+#### Type-Safe Pattern Matching
+```rust
+// Compile-time pattern validation using const generics
+// (`&'static str` const parameters currently require nightly features such as `adt_const_params`)
+pub struct Partition<const PATTERN: &'static str> {
+    _phantom: PhantomData<()>,
+}
+
+// Pattern validated at compile time (`is_valid_pattern` must be a `const fn`)
+impl<const PATTERN: &'static str> Partition<PATTERN> {
+    const VALIDATE: () = {
+        assert!(is_valid_pattern(PATTERN));
+    };
+}
+```
+
+#### Runtime Dependency Resolution with Compile-Time Structure
+```rust
+#[job(
+    // Pattern template at compile time
+    inputs = "raw/{source}/{date}",
+    // But resolution happens at runtime
+)]
+impl ProcessDaily {
+    fn resolve_dependencies(&self, args: &JobArgs) -> Result<Vec<String>> {
+        let date = &args.date;
+        let mut deps = vec![];
+
+        // Dynamic resolution based on runtime state
+        for source in self.catalog.available_sources(date)? {
+            deps.push(format!("raw/{}/{}", source, date));
+        }
+
+        // Conditional dependencies
+        if date.weekday() == Weekday::Monday {
+            deps.push(format!("raw/{}/{}", "archive", date.previous_friday()));
+        }
+
+        Ok(deps)
+    }
+}
+```
+
+#### Why a Rust DSL Is Compelling
+1. **Zero-cost abstractions** - Macros expand at compile time, so the DSL layer itself adds no runtime overhead
+2. **Compile-time validation** - Pattern syntax checked at compile time
+3. **Type safety** - Full type checking across job boundaries
+4. **Single language** - If the engine is Rust, no polyglot complexity
+5. **IDE support** - Autocomplete, refactoring, and go-to-definition (e.g. via rust-analyzer)
+
+#### The Best of Both: Compile-Time Structure, Runtime Flexibility
+```rust
+// Macro provides syntax sugar and compile-time checks
+#[derive(DataBuildJob)]
+struct ComplexJob;
+
+impl ComplexJob {
+    const INPUT_TEMPLATE: &'static str = "raw/{source}/{date}";
+
+    // But actual resolution is dynamic
+    fn resolve_config(&self, args: &JobArgs, ctx: &Context) -> Result<JobConfig> {
+        // Full runtime flexibility for dependency calculation
+        let sources = ctx.catalog.list_sources(&args.date)?;
+        let inputs: Vec<String> = sources.iter()
+            .map(|s| format!("raw/{}/{}", s, args.date))
+            .collect();
+
+        Ok(JobConfig { inputs, outputs: vec![...] })
+    }
+}
+```
+
+This Rust approach offers an interesting middle ground: stronger guarantees than Python while maintaining expressiveness, and seamless integration with a Rust-based execution engine.
+
 ### Philosophical Questions
 - Is orchestration a fundamental need or an implementation detail?
 - Can we achieve both expressiveness and correctness?
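
As a rough illustration of the "bidirectional pattern" claim in the Python section of this patch, the sketch below uses only the standard library: a `str.format`-style template is compiled into a regex with named groups for extraction, and the same template generates the string back. The helper names `compile_template` and `extract` are hypothetical, not part of the plan or any library, and the sketch assumes named fields that each match a single path segment (format specs such as `:02d` are ignored).

```python
import re
from fnmatch import fnmatch
from string import Formatter
from typing import Optional


def compile_template(template: str) -> re.Pattern:
    """Turn a str.format-style template into a regex with one named group per field.

    Hypothetical helper: each named field matches one "/"-free segment;
    format specs and conversions are ignored.
    """
    parts = []
    for literal, field, _spec, _conv in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field:  # None when the template ends with literal text
            parts.append(f"(?P<{field}>[^/]+)")
    return re.compile("".join(parts))


def extract(template: str, ref: str) -> Optional[dict]:
    """String -> values: return the captured fields, or None if `ref` does not match."""
    m = compile_template(template).fullmatch(ref)
    return m.groupdict() if m else None


TEMPLATE = "raw/{source}/{date}"

# Extract values from a concrete partition reference.
fields = extract(TEMPLATE, "raw/api/2024-01-01")
print(fields)  # {'source': 'api', 'date': '2024-01-01'}

# Generate the reference back from the same template (round trip).
print(TEMPLATE.format(**fields))  # raw/api/2024-01-01

# Glob-style matching with the standard library, as listed under the native pattern types.
print(fnmatch("raw/api/2024-01-15", "raw/*/2024-01-*"))  # True
```

One design choice worth noting: restricting each field to `[^/]+` keeps matching unambiguous for path-like references, at the cost of not supporting fields that span several segments.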