Add more DSL thoughts
parent 58c57332e1
commit eeef8b6444
1 changed files with 138 additions and 0 deletions

@@ -92,6 +92,144 @@ Different parts of the system need different guarantees:

- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities

### Python as the Pattern Language

A crucial realization: we don't need a new DSL for partition patterns, because Python itself can serve as the DSL. Its built-in string handling, plus a small third-party library such as `parse` for reversing format strings, covers the patterns we need:

#### Built-in Pattern Matching (Python 3.10+)

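```python
def parse_partition_ref(partition_ref: str) -> dict | None:
    # f-strings are not valid `case` patterns, so split the ref and
    # match on its path segments instead.
    match partition_ref.split("/"):
        case ["raw", source, date] if source in ("api", "ftp"):
            return {"source": source, "date": date}
        case ["model", name, version] if version[:1] == "v" and version[1:].isdigit():
            return {"name": name, "version": int(version[1:])}
    return None
```
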
#### Format-String Templates as Bidirectional Patterns

```python
from parse import parse  # third-party `parse` package: the inverse of format()

# Define pattern
pattern = "s3://bucket/{table}/date={date}/hour={hour:02d}"

# Extract (string → values)
result = parse(pattern, "s3://bucket/events/date=2024-01-01/hour=13")
result.named
# {'table': 'events', 'date': '2024-01-01', 'hour': 13}

# Generate (values → string)
pattern.format(table='events', date='2024-01-01', hour=13)
# 's3://bucket/events/date=2024-01-01/hour=13'
```

#### Native Python Pattern Types

- **Glob patterns**: `raw/*/2024-01-*` using `fnmatch` or `pathlib`
- **Regular expressions**: Named groups for complex patterns
- **Type annotations**: `Annotated[RawPartition, "all sources for date"]`
- **String templates**: Standard library `string.Template`
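As a rough illustration of the four options listed above, the sketch below exercises each one using only the standard library. The `RawPartition` class, the `daily_job` function, and the sample partition reference are hypothetical names chosen for the example, not part of any existing API.

```python
import fnmatch
import re
from string import Template
from typing import Annotated, get_type_hints

ref = "raw/api/2024-01-15"  # hypothetical partition reference

# Glob: note that fnmatch's "*" also matches "/" characters.
is_january = fnmatch.fnmatchcase(ref, "raw/*/2024-01-*")

# Regular expression with named groups for extraction.
RAW_RE = re.compile(r"raw/(?P<source>[^/]+)/(?P<date>\d{4}-\d{2}-\d{2})")
m = RAW_RE.fullmatch(ref)
fields = m.groupdict() if m else None

# string.Template for generation.
generated = Template("raw/$source/$date").substitute(source="api", date="2024-01-15")


# Type annotations: attach a pattern as metadata and read it back.
class RawPartition:
    """Hypothetical marker type for raw partitions."""


def daily_job(inputs: Annotated[RawPartition, "raw/*/{date}"]) -> None:
    """Hypothetical job; only its annotation matters here."""


hints = get_type_hints(daily_job, include_extras=True)
pattern_metadata = hints["inputs"].__metadata__

print(is_january, fields, generated, pattern_metadata)
```

Because `fnmatch` lets `*` cross `/` boundaries, a regex with `[^/]+` groups is the safer choice when a wildcard must stay within a single path segment.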
#### Why This Is Beautiful

1. **No new syntax** - Developers already know these patterns
2. **Full IDE support** - Autocomplete, refactoring, type checking
3. **Composable** - Mix glob, regex, f-strings as needed
4. **Testable** - Just Python strings and functions
5. **Bidirectional** - Same pattern for matching and generating

The key insight: Python's string formatting is already a pattern language. We don't need to build a DSL - we need to embrace Python as the DSL.
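To make the bidirectional claim concrete, here is a minimal sketch that derives a matcher from a format-string template using only the standard library. The `PartitionPattern` class is a hypothetical helper, and it assumes every field is named, carries no format spec, and never contains a `/`. Typed fields such as `{hour:02d}` would still need something like the `parse` library.

```python
import re
from string import Formatter
from typing import Optional


class PartitionPattern:
    """Hypothetical helper: one template used for matching and generating."""

    def __init__(self, template: str) -> None:
        self.template = template
        parts = []
        # Formatter().parse yields (literal_text, field_name, spec, conversion).
        for literal, field, _spec, _conv in Formatter().parse(template):
            parts.append(re.escape(literal))
            if field is not None:
                # Assumption: a field value never spans a "/" separator.
                parts.append(f"(?P<{field}>[^/]+)")
        self._regex = re.compile("".join(parts))

    def match(self, ref: str) -> Optional[dict]:
        """Return the extracted fields as strings, or None if ref does not match."""
        m = self._regex.fullmatch(ref)
        return m.groupdict() if m else None

    def format(self, **values: str) -> str:
        """Generate a partition reference from field values."""
        return self.template.format(**values)


raw = PartitionPattern("raw/{source}/{date}")
values = raw.match("raw/api/2024-01-15")
round_trip = raw.format(**values)
print(values, round_trip)
```

The same template string drives both directions, so a round trip through `match` and `format` returns the original reference for any string the pattern accepts.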
### Rust as an Alternative DSL

While Python offers maximum expressiveness, Rust presents an intriguing alternative that could provide compile-time guarantees without sacrificing elegance:

#### Procedural Macros for Job Definition

```rust
// Sketch of a proposed API: `#[job]` and `#[partition]` are hypothetical macros.
#[job]
fn process_daily(
    date: Date,
    #[partition("raw/{source}/{date}")] source: Partition<RawData>,
) -> Result<Partition<ProcessedData, "processed/{date}">> {
    let raw_data = source.load()?;
    transform(raw_data, date).save()
}

#[job(resources = "memory=4G,cpu=2")]
fn aggregate_weekly(
    week: Week,
    #[partition("processed/[date:date+7]")] daily: PartitionSet<ProcessedData>,
) -> Result<Partition<WeeklyAggregate, "aggregated/week={week}">> {
    daily.load_all()?.aggregate().save()
}
```

#### Type-Safe Pattern Matching

```rust
use std::marker::PhantomData;

// Compile-time pattern validation using const generics.
// Note: `&'static str` const parameters are not accepted on stable Rust;
// this sketch assumes the unstable `adt_const_params` feature.
pub struct Partition<T, const PATTERN: &'static str> {
    _phantom: PhantomData<T>,
}

impl<T, const PATTERN: &'static str> Partition<T, PATTERN> {
    // Evaluated at compile time, but only once `VALIDATE` is referenced,
    // e.g. from a constructor. `is_valid_pattern` must be a `const fn`.
    const VALIDATE: () = {
        assert!(is_valid_pattern(PATTERN));
    };
}
```

#### Runtime Dependency Resolution with Compile-Time Structure

```rust
#[job(
    // Pattern template at compile time
    inputs = "raw/{source}/{date}",
    // But resolution happens at runtime
)]
impl ProcessDaily {
    // Returns a Result so that `?` can propagate catalog errors
    fn resolve_dependencies(&self, args: &JobArgs) -> Result<Vec<String>> {
        let date = &args.date;
        let mut deps = vec![];

        // Dynamic resolution based on runtime state
        for source in self.catalog.available_sources(date)? {
            deps.push(format!("raw/{}/{}", source, date));
        }

        // Conditional dependencies
        if date.weekday() == Weekday::Monday {
            deps.push(format!("raw/{}/{}", "archive", date.previous_friday()));
        }

        Ok(deps)
    }
}
```

#### Why a Rust DSL Is Compelling

1. **Zero-cost abstractions** - Macros expand at compile time, so the DSL layer itself adds no runtime overhead
2. **Compile-time validation** - Pattern syntax checked at compile time
3. **Type safety** - Full type checking across job boundaries
4. **Single language** - If the engine is written in Rust, there is no polyglot complexity
5. **IDE support** - Autocomplete, refactoring, and go-to-definition

#### The Best of Both: Compile-Time Structure, Runtime Flexibility

```rust
// Macro provides syntax sugar and compile-time checks
#[derive(DataBuildJob)]
struct ComplexJob;

// Associated consts and methods belong in an `impl` block, not the struct body
impl ComplexJob {
    const INPUT_TEMPLATE: &'static str = "raw/{source}/{date}";

    // But actual resolution is dynamic
    fn resolve_config(&self, args: &JobArgs, ctx: &Context) -> Result<JobConfig> {
        // Full runtime flexibility for dependency calculation
        let sources = ctx.catalog.list_sources(&args.date)?;
        let inputs = sources.iter()
            .map(|s| format!("raw/{}/{}", s, args.date))
            .collect();

        Ok(JobConfig { inputs, outputs: vec![...] })
    }
}
```

This Rust approach offers an interesting middle ground: stronger guarantees than Python while maintaining expressiveness, and seamless integration with a Rust-based execution engine.

### Philosophical Questions

- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?