# Simple Python DSL Example

This example demonstrates how to use DataBuild's Python DSL to define a simple data processing pipeline.
## Overview

The example defines a basic 3-stage data processing pipeline:

1. **IngestRawData**: Ingests raw data for a specific date
2. **ProcessData**: Processes the raw data into a processed format
3. **CreateSummary**: Creates summary statistics from processed data
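
The three stages form a linear dependency chain. As a rough illustration only, the sketch below models that chain in plain Python. It does not use DataBuild's DSL: the `Stage` class and `execution_order` function are hypothetical, and the `processed_data` partition prefix is an assumed name (`raw_data` and `summary` come from the partition references used in this README).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a pipeline job; not a DataBuild class."""
    name: str
    output: str          # partition prefix this stage produces
    inputs: tuple = ()   # partition prefixes this stage reads


# Stage names come from this README; "processed_data" is an assumed prefix.
STAGES = [
    Stage("IngestRawData", output="raw_data"),
    Stage("ProcessData", output="processed_data", inputs=("raw_data",)),
    Stage("CreateSummary", output="summary", inputs=("processed_data",)),
]


def execution_order(stages):
    """Return stage names ordered so every stage follows its producers."""
    produced, ordered, pending = set(), [], list(stages)
    while pending:
        ready = [s for s in pending if all(i in produced for i in s.inputs)]
        if not ready:
            raise ValueError("cycle or missing producer in stage definitions")
        for stage in ready:
            ordered.append(stage.name)
            produced.add(stage.output)
            pending.remove(stage)
    return ordered


if __name__ == "__main__":
    print(execution_order(STAGES))
```

The ordering depends only on declared inputs and outputs, so the stages can be listed in any order and still resolve to ingest, then process, then summarize.
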
## Files

- `simple_graph.py`: Python DSL definition of the data pipeline
- `BUILD.bazel`: Bazel build configuration
- `MODULE.bazel`: Bazel module configuration for dependencies
## Usage

### Generate DSL Targets

The DSL generator creates Bazel targets from the Python DSL definition:

```bash
bazel run //:simple_graph.generate
```

This generates Bazel targets in the `generated/` directory.
### Build Individual Jobs

```bash
# Build a specific job
bazel build //:ingest_raw_data

# Build all jobs
bazel build //:simple_graph
```
### Analyze the Graph

```bash
# Analyze what jobs would run for specific partitions
bazel run //:simple_graph.analyze -- "summary/date=2024-01-01"
```
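
Conceptually, analysis starts from a requested partition and works upstream to find the jobs that must run. The sketch below illustrates that idea in plain Python and is not the analyzer's actual implementation or output format. The `PRODUCERS` table, `parse_partition`, and `plan` are hypothetical names, `processed_data` is an assumed partition prefix, and the sketch assumes every stage is partitioned by the same keys as the requested partition.

```python
# Hypothetical producer table: partition prefix -> (job name, upstream prefixes).
# "processed_data" is an assumed name; the other prefixes appear in this README.
PRODUCERS = {
    "raw_data": ("IngestRawData", []),
    "processed_data": ("ProcessData", ["raw_data"]),
    "summary": ("CreateSummary", ["processed_data"]),
}


def parse_partition(ref):
    """Split a reference such as 'summary/date=2024-01-01' into prefix and keys."""
    prefix, _, rest = ref.partition("/")
    keys = dict(part.split("=", 1) for part in rest.split("/") if part)
    return prefix, keys


def plan(ref):
    """List (job, partition) pairs needed to build ref, upstream jobs first."""
    prefix, keys = parse_partition(ref)
    job, upstream = PRODUCERS[prefix]
    # Assumption: upstream partitions share the requested partition's keys.
    suffix = "/".join(f"{k}={v}" for k, v in keys.items())
    steps = []
    for dep in upstream:
        steps.extend(plan(f"{dep}/{suffix}"))
    steps.append((job, ref))
    return steps


if __name__ == "__main__":
    for job, partition in plan("summary/date=2024-01-01"):
        print(f"{job} -> {partition}")
```

Under these assumptions, requesting `summary/date=2024-01-01` yields all three jobs, while requesting `raw_data/date=2024-01-01` yields only `IngestRawData`.
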
### Run the Graph

```bash
# Build specific partitions
bazel run //:simple_graph.build -- "summary/date=2024-01-01"
```
## Cross-Workspace Usage

This example can be consumed from external workspaces by adding DataBuild as a dependency in your `MODULE.bazel`:

```starlark
bazel_dep(name = "databuild", version = "0.0")
local_path_override(
    module_name = "databuild",
    path = "path/to/databuild",
)
```

Then you can reference and extend this example:

```python
from databuild.dsl.python.dsl import DataBuildGraph

# Import and extend the simple graph
```
## Testing

To test that the DSL generator works correctly:

```bash
# Test the DSL generation
bazel run //:simple_graph.generate

# Verify generated files exist
ls generated/

# Test job lookup
bazel run //:job_lookup -- "raw_data/date=2024-01-01"
```