# Podcast Reviews Example

This is an example data application that produces text insights from podcast review data. It is made up of six datasets:

- Raw reviews `(date, podcast, text, rating)`
- Podcasts `(podcast, title, category)`
- Categorized review text `(date, category, podcast, text)`
- Phrase models `(date, category, hash, ngram, score)`
- Podcast phrase stats `(date, category, podcast, ngram, count, rating)`
- Podcast daily summary `(date, category, podcast, phrase_stats, recent_reviews)`

```mermaid
flowchart LR
    raw_reviews[(Raw Reviews)] & podcasts[(Podcasts)] --> categorize_text --> categorized_texts[(Categorized Texts)]
    categorized_texts --> phrase[Phrase Modeling] --> phrase_models[(Phrase Models)]
    phrase_models & raw_reviews --> phrase_stats --> podcast_phrase_stats[(Podcast Phrase Stats)]
    podcast_phrase_stats & raw_reviews --> calc_summary --> podcast_daily_summary[(Podcast Daily Summary)]
```

## Input Data

Download the dataset from [Kaggle](https://www.kaggle.com/datasets/thoughtvector/podcastreviews/versions/28?select=database.sqlite) and place it at `examples/podcast_reviews/data/ingest/database.sqlite`.

## `phrase` Dependency

This example relies on [`soaxelbrooke/phrase`](https://github.com/soaxelbrooke/phrase) for phrase extraction. Check its [releases](https://github.com/soaxelbrooke/phrase/releases) for a suitable binary for your platform.

## Building Output Partitions

### CLI Build
Use the DataBuild CLI to build specific partitions:

```bash
bazel build //:podcast_reviews_graph.build
# Builds bazel-bin/podcast_reviews_graph.build

# Build raw reviews for a specific date
bazel-bin/podcast_reviews_graph.build "reviews/date=2020-01-01"

# Build categorized reviews
bazel-bin/podcast_reviews_graph.build "categorized_reviews/category=Technology/date=2020-01-01"

# Build phrase models
bazel-bin/podcast_reviews_graph.build "phrase_models/category=Technology/date=2020-01-01"

# Build daily summaries
bazel-bin/podcast_reviews_graph.build "daily_summaries/category=Technology/date=2020-01-01"

# Build all podcasts data
bazel-bin/podcast_reviews_graph.build "podcasts/all"
```
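
When several partitions are needed, the references can be generated rather than typed by hand. The following is only a sketch and not part of DataBuild: the helper name `partition_ref` is hypothetical, `Comedy` is an assumed category value, and the loop prints the build commands (a dry run) instead of executing them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of DataBuild): format a partition reference
# of the form DATASET/category=CATEGORY/date=DATE.
partition_ref() {
  printf '%s/category=%s/date=%s\n' "$1" "$2" "$3"
}

# Assumed example values, for illustration only.
categories=("Technology" "Comedy")
build_date="2020-01-01"

# Dry run: print one build command per category instead of running it.
for category in "${categories[@]}"; do
  ref="$(partition_ref daily_summaries "$category" "$build_date")"
  echo "bazel-bin/podcast_reviews_graph.build \"$ref\""
done
```

To actually run the builds, the `echo` line could be replaced with a direct call to `bazel-bin/podcast_reviews_graph.build "$ref"`.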

### Service Build
Use the Build Graph Service for HTTP API access:

```bash
# Start the service
bazel build //:podcast_reviews_graph.service
bazel-bin/podcast_reviews_graph.service

# Submit build request for reviews
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["reviews/date=2020-01-01"]}'

# Submit build request for daily summary (builds entire pipeline)
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'

# Check build status
curl http://localhost:8080/api/v1/builds/BUILD_REQUEST_ID

# Get partition status
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/status

# Get partition events
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/events

# Analyze build graph (planning only)
curl -X POST http://localhost:8080/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'
```
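
The partition status and events URLs above carry the partition reference percent-encoded (`/` becomes `%2F`, `=` becomes `%3D`). A minimal sketch of producing that encoding in bash follows; the `urlencode` function is a hypothetical helper, not something the service provides, and it assumes ASCII-only partition references.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percent-encode a string for use as one URL path segment.
# Unreserved characters (letters, digits, ".", "~", "_", "-") pass through.
# Assumes ASCII input, which holds for the partition references shown here.
urlencode() {
  local input="$1" output="" char i
  local LC_ALL=C
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:i:1}"
    case "$char" in
      [a-zA-Z0-9.~_-]) output+="$char" ;;
      *) output+="$(printf '%%%02X' "'$char")" ;;
    esac
  done
  printf '%s\n' "$output"
}

ref="reviews/date=2020-01-01"
base_url="http://localhost:8080/api/v1"

# Print the status request for this partition (dry run, no request is sent).
echo "curl ${base_url}/partitions/$(urlencode "$ref")/status"
```

For the reference above, the encoded segment is `reviews%2Fdate%3D2020-01-01`, matching the hand-written URLs in the previous block.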

### Partition Reference Patterns

The following partition reference patterns are supported:

- `reviews/date=YYYY-MM-DD` - Raw reviews for a specific date
- `podcasts/all` - All podcasts metadata
- `categorized_reviews/category=CATEGORY/date=YYYY-MM-DD` - Categorized reviews
- `phrase_models/category=CATEGORY/date=YYYY-MM-DD` - Phrase models
- `phrase_stats/category=CATEGORY/date=YYYY-MM-DD` - Phrase statistics
- `daily_summaries/category=CATEGORY/date=YYYY-MM-DD` - Daily summaries (the "output")
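
The patterns above can be checked before a reference is submitted. The sketch below is illustrative only: `is_valid_partition_ref` is a hypothetical function, the set of characters allowed in `CATEGORY` (anything except `/` and `=`) is an assumption, and the date is checked for shape only, not calendar validity.

```bash
#!/usr/bin/env bash
# Hypothetical validator: returns 0 if the reference matches one of the
# documented patterns, 1 otherwise.
is_valid_partition_ref() {
  local ref="$1"
  local date_re='[0-9]{4}-[0-9]{2}-[0-9]{2}'
  local category_re='[^/=]+'   # assumption: categories contain no "/" or "="
  local datasets_re='(categorized_reviews|phrase_models|phrase_stats|daily_summaries)'

  # Full regexes are kept in variables, the usual idiom for [[ =~ ]].
  local reviews_re="^reviews/date=${date_re}\$"
  local categorized_re="^${datasets_re}/category=${category_re}/date=${date_re}\$"

  [[ "$ref" == "podcasts/all" ]] && return 0
  [[ "$ref" =~ $reviews_re ]] && return 0
  [[ "$ref" =~ $categorized_re ]] && return 0
  return 1
}

# Example: the last reference is missing the "date=" key and is rejected.
for ref in \
  "reviews/date=2020-01-01" \
  "daily_summaries/category=Technology/date=2020-01-01" \
  "reviews/2020-01-01"; do
  if is_valid_partition_ref "$ref"; then
    echo "ok: $ref"
  else
    echo "invalid: $ref"
  fi
done
```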