Add more DSL thoughts

Stuart Axelbrooke 2025-07-23 17:57:03 -07:00
parent 58c57332e1
commit eeef8b6444


@ -92,6 +92,144 @@ Different parts of the system need different guarantees:
- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities
### Python as the Pattern Language
A crucial realization: we don't need a new DSL for partition patterns - Python itself is the DSL! Python's native string handling capabilities provide everything needed:
#### Built-in Pattern Matching (Python 3.10+)
```python
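# f-strings are not valid `case` patterns, so split the reference first.
def parse_partition(partition_ref: str) -> dict | None:
    match partition_ref.split("/"):
        case ["raw", source, date] if source in ("api", "ftp"):
            return {"source": source, "date": date}
        case ["model", name, version] if version.startswith("v") and version[1:].isdigit():
            return {"name": name, "version": int(version[1:])}
    return None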
```
#### Format-String Templates as Bidirectional Patterns
```python
from parse import parse  # Third-party package (pip install parse); inverse of format()

# Define pattern
pattern = "s3://bucket/{table}/date={date}/hour={hour:02d}"

# Extract (string → values)
result = parse(pattern, "s3://bucket/events/date=2024-01-01/hour=13")
result.named  # {'table': 'events', 'date': '2024-01-01', 'hour': 13}

# Generate (values → string)
pattern.format(table='events', date='2024-01-01', hour=13)
```
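If depending on the third-party `parse` package is undesirable, a similar round trip can be approximated with the standard library alone. The sketch below is illustrative only: the helper `template_to_regex` is hypothetical, it ignores format specs such as `:02d` (so extracted values stay strings), and it assumes field values never contain `/`.

```python
import re
from string import Formatter


def template_to_regex(template: str) -> re.Pattern:
    """Compile a str.format-style template into a regex with named groups.

    Hypothetical helper: format specs and conversions are ignored, and every
    field is assumed to match one or more characters other than '/'.
    """
    parts = []
    for literal, field, _spec, _conv in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field is not None:
            parts.append(f"(?P<{field}>[^/]+)")
    return re.compile("".join(parts))


template = "s3://bucket/{table}/date={date}/hour={hour}"
regex = template_to_regex(template)

# Extract (string -> values)
match = regex.fullmatch("s3://bucket/events/date=2024-01-01/hour=13")
values = match.groupdict()

# Generate (values -> string); the round trip reproduces the input
rebuilt = template.format(**values)
```

Restricting fields to `[^/]+` keeps each placeholder inside one path segment, which is a design choice rather than a requirement; a looser character class would allow nested paths at the cost of ambiguous matches.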
#### Native Python Pattern Types
- **Glob patterns**: `raw/*/2024-01-*` using `fnmatch` or `pathlib`
- **Regular expressions**: Named groups for complex patterns
- **Type annotations**: `Annotated[RawPartition, "all sources for date"]`
- **String templates**: Standard library `string.Template`
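A brief sketch of how three of these standard-library options might look side by side. The partition names and the `RAW_RE` pattern are made up for illustration. Note that `fnmatch` lets `*` match any characters, including `/`.

```python
import re
from fnmatch import fnmatch
from string import Template

# Hypothetical partition references, for illustration only
partitions = [
    "raw/api/2024-01-01",
    "raw/ftp/2024-01-15",
    "raw/api/2024-02-01",
    "model/churn/v3",
]

# Glob: select every raw partition from January 2024.
january = [p for p in partitions if fnmatch(p, "raw/*/2024-01-*")]

# Regex with named groups: extract fields from a matching reference.
RAW_RE = re.compile(r"raw/(?P<source>[^/]+)/(?P<date>\d{4}-\d{2}-\d{2})")
fields = RAW_RE.fullmatch("raw/api/2024-01-01").groupdict()

# string.Template: generate a reference from values.
ref = Template("raw/$source/$date").substitute(source="ftp", date="2024-01-15")
```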
#### Why This Is Beautiful
1. **No new syntax** - Developers already know these patterns
2. **Full IDE support** - Autocomplete, refactoring, type checking
3. **Composable** - Mix glob, regex, f-strings as needed
4. **Testable** - Just Python strings and functions
5. **Bidirectional** - Same pattern for matching and generating
The key insight: Python's string formatting is already a pattern language. We don't need to build a DSL - we need to embrace Python as the DSL.
### Rust as an Alternative DSL
While Python offers maximum expressiveness, Rust presents an intriguing alternative that could provide compile-time guarantees without sacrificing elegance:
#### Procedural Macros for Job Definition
```rust
// Hypothetical syntax: `#[job]` and `#[partition]` are proposed macros, and string
// const generics such as `Partition<T, "...">` are not available on stable Rust.
#[job]
fn process_daily(
    date: Date,
    #[partition("raw/{source}/{date}")] source: Partition<RawData>,
) -> Result<Partition<ProcessedData, "processed/{date}">> {
    let raw_data = source.load()?;
    transform(raw_data, date).save()
}

#[job(resources = "memory=4G,cpu=2")]
fn aggregate_weekly(
    week: Week,
    #[partition("processed/[date:date+7]")] daily: PartitionSet<ProcessedData>,
) -> Result<Partition<WeeklyAggregate, "aggregated/week={week}">> {
    daily.load_all()?.aggregate().save()
}
```
#### Type-Safe Pattern Matching
```rust
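// Compile-time pattern validation using const generics.
// `&'static str` const parameters require unstable nightly features
// (such as `adt_const_params`); they do not compile on stable Rust.
use std::marker::PhantomData;

pub struct Partition<T, const PATTERN: &'static str> {
    _phantom: PhantomData<T>,
}

impl<T, const PATTERN: &'static str> Partition<T, PATTERN> {
    // `is_valid_pattern` must be a `const fn` (not shown here).
    const VALIDATE: () = assert!(is_valid_pattern(PATTERN));

    pub fn new() -> Self {
        // Associated consts are evaluated lazily, so reference VALIDATE
        // to force the check wherever this type is constructed.
        let () = Self::VALIDATE;
        Self { _phantom: PhantomData }
    }
}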
```
#### Runtime Dependency Resolution with Compile-Time Structure
```rust
#[job(
    // Pattern template at compile time
    inputs = "raw/{source}/{date}",
    // But resolution happens at runtime
)]
impl ProcessDaily {
    // Returns Result so that `?` can propagate catalog errors
    fn resolve_dependencies(&self, args: &JobArgs) -> Result<Vec<String>> {
        let date = &args.date;
        let mut deps = vec![];
        // Dynamic resolution based on runtime state
        for source in self.catalog.available_sources(date)? {
            deps.push(format!("raw/{}/{}", source, date));
        }
        // Conditional dependencies
        if date.weekday() == Weekday::Monday {
            deps.push(format!("raw/archive/{}", date.previous_friday()));
        }
        Ok(deps)
    }
}
```
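For comparison with the Python option discussed earlier, the same runtime resolution logic might look like the sketch below. The `catalog` mapping, the `resolve_dependencies` signature, and the reading of `previous_friday()` as "three days before a Monday" are all assumptions made for illustration, not part of any existing API.

```python
from datetime import date, timedelta


def resolve_dependencies(catalog: dict[date, list[str]], day: date) -> list[str]:
    """Return input partition refs for `day` (hypothetical catalog shape)."""
    # Dynamic resolution based on runtime state
    deps = [f"raw/{source}/{day.isoformat()}" for source in catalog.get(day, [])]

    # Conditional dependency: on Mondays, also read the previous Friday's archive
    if day.weekday() == 0:  # Monday
        friday = day - timedelta(days=3)
        deps.append(f"raw/archive/{friday.isoformat()}")
    return deps


monday = date(2024, 1, 1)  # 2024-01-01 was a Monday
deps = resolve_dependencies({monday: ["api", "ftp"]}, monday)
```

The logic is the same in both languages; what the Rust version would add is that the `inputs` template is checked when the job is compiled rather than when it first runs.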
#### Why a Rust DSL Is Compelling
1. **Zero-cost abstractions** - Macros expand at compile time, so the DSL layer adds no runtime overhead of its own
2. **Compile-time validation** - Pattern syntax checked at compile time
3. **Type safety** - Full type checking across job boundaries
4. **Single language** - If the engine is Rust, no polyglot complexity
5. **IDE support** - Completion, refactoring, and go-to-definition through rust-analyzer
#### The Best of Both: Compile-Time Structure, Runtime Flexibility
```rust
// Macro provides syntax sugar and compile-time checks
#[derive(DataBuildJob)]
struct ComplexJob;

// Constants and methods belong in an impl block, not in the struct body
impl ComplexJob {
    const INPUT_TEMPLATE: &'static str = "raw/{source}/{date}";

    // But actual resolution is dynamic
    fn resolve_config(&self, args: &JobArgs, ctx: &Context) -> Result<JobConfig> {
        // Full runtime flexibility for dependency calculation
        let sources = ctx.catalog.list_sources(&args.date)?;
        let inputs = sources.iter()
            .map(|s| format!("raw/{}/{}", s, args.date))
            .collect();
        Ok(JobConfig { inputs, outputs: vec![...] })
    }
}
```
This Rust approach offers an interesting middle ground: stronger guarantees than Python while maintaining expressiveness, and seamless integration with a Rust-based execution engine.
### Philosophical Questions
- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?