databuild/examples/podcast_reviews/README.md
2025-07-07 19:20:45 -07:00

Podcast Reviews Example

This is an example data application that produces text insights from podcast review data. It is made up of six datasets:

  • Raw reviews (date, podcast, text, rating)
  • Podcasts (podcast, title, category)
  • Categorized review text (date, category, podcast, text)
  • Phrase models (date, category, hash, ngram, score)
  • Podcast phrase stats (date, category, podcast, ngram, count, rating)
  • Podcast daily summary (date, category, podcast, phrase_stats, recent_reviews)

The datasets are connected as follows (Mermaid flowchart source):

flowchart LR
    raw_reviews[(Raw Reviews)] & podcasts[(Podcasts)] --> categorize_text --> categorized_texts[(Categorized Texts)]
    categorized_texts --> phrase[Phrase Modeling] --> phrase_models[(Phrase Models)]
    phrase_models & raw_reviews --> phrase_stats --> podcast_phrase_stats[(Podcast Phrase Stats)]
    podcast_phrase_stats & raw_reviews --> calc_summary --> podcast_daily_summary[(Podcast Daily Summary)]
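
The edges of the flowchart can be walked to see which datasets a given dataset depends on. The following is a minimal sketch, not part of DataBuild: the function names `upstream_of` and `all_upstream` are hypothetical, and the dataset names are the flowchart node ids transcribed by hand, with the job nodes left out.

```sh
#!/bin/sh
# Direct upstream datasets, transcribed from the flowchart edges above.
# Job nodes (categorize_text, phrase, phrase_stats, calc_summary) are omitted.
upstream_of() {
  case $1 in
    categorized_texts)     echo "raw_reviews podcasts" ;;
    phrase_models)         echo "categorized_texts" ;;
    podcast_phrase_stats)  echo "phrase_models raw_reviews" ;;
    podcast_daily_summary) echo "podcast_phrase_stats raw_reviews" ;;
    *)                     echo "" ;;  # source datasets have no upstream
  esac
}

# Print every dataset needed before $1 can be built, one per line, deduplicated.
all_upstream() {
  for dep in $(upstream_of "$1"); do
    echo "$dep"
    all_upstream "$dep"
  done | LC_ALL=C sort -u
}
```

Calling `all_upstream podcast_daily_summary` lists the five other datasets, which is why building the daily summary exercises the whole pipeline.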

Input Data

Get the data from here, and put it at examples/podcast_reviews/data/ingest/database.sqlite.
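
Before building, it can help to confirm that the file is in place and looks like a SQLite database. The sketch below is an assumption-laden convenience, not part of DataBuild: `check_ingest_db` is a hypothetical helper, the default path is taken relative to the repository root, and the check only compares the first 15 bytes against the standard SQLite file header.

```sh
#!/bin/sh
# Default location from this README, relative to the repository root (assumed).
DB_PATH="examples/podcast_reviews/data/ingest/database.sqlite"

# Succeeds if the file exists, is non-empty, and starts with the SQLite header.
check_ingest_db() {
  db=${1:-$DB_PATH}
  if [ ! -s "$db" ]; then
    echo "missing or empty: $db" >&2
    return 1
  fi
  # SQLite database files begin with the string "SQLite format 3" plus a NUL byte.
  header=$(head -c 15 "$db")
  [ "$header" = "SQLite format 3" ]
}
```

Usage: `check_ingest_db && echo "input data looks OK"`.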

phrase Dependency

This example relies on soaxelbrooke/phrase for phrase extraction. Check its releases for a binary that matches your platform.

Building Output Partitions

CLI Build

Use the DataBuild CLI to build specific partitions:

# Build the CLI binary (output: bazel-bin/podcast_reviews_graph.build)
bazel build //:podcast_reviews_graph.build

# Build raw reviews for a specific date
bazel-bin/podcast_reviews_graph.build "reviews/date=2020-01-01"

# Build categorized reviews
bazel-bin/podcast_reviews_graph.build "categorized_reviews/category=Technology/date=2020-01-01"

# Build phrase models
bazel-bin/podcast_reviews_graph.build "phrase_models/category=Technology/date=2020-01-01"

# Build daily summaries
bazel-bin/podcast_reviews_graph.build "daily_summaries/category=Technology/date=2020-01-01"

# Build all podcasts data
bazel-bin/podcast_reviews_graph.build "podcasts/all"
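
To build the same dataset for several categories, the CLI call can be wrapped in a loop. This is a sketch under assumptions: `category_ref` and `build_daily_summaries` are hypothetical helper names, the CLI is invoked once per partition exactly as in the examples above, and any category other than Technology is a placeholder. Setting `DRY_RUN=1` prints the commands instead of running them.

```sh
#!/bin/sh
# Print the partition ref for a category-partitioned dataset.
# usage: category_ref DATASET CATEGORY DATE
category_ref() {
  printf '%s/category=%s/date=%s\n' "$1" "$2" "$3"
}

# Build daily summaries for one date across several categories.
# usage: build_daily_summaries DATE CATEGORY...
build_daily_summaries() {
  build_date=$1
  shift
  for category in "$@"; do
    ref=$(category_ref daily_summaries "$category" "$build_date")
    if [ "${DRY_RUN:-0}" = 1 ]; then
      # Dry run: show the command that would be executed.
      printf 'bazel-bin/podcast_reviews_graph.build "%s"\n' "$ref"
    else
      # Stop at the first failing build.
      bazel-bin/podcast_reviews_graph.build "$ref" || return 1
    fi
  done
}
```

Usage: `DRY_RUN=1 build_daily_summaries 2020-01-01 Technology` prints the single command shown in the daily summaries example above.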

Service Build

Use the Build Graph Service for HTTP API access:

# Build and start the service (the requests below assume it listens on localhost:8080)
bazel build //:podcast_reviews_graph.service
bazel-bin/podcast_reviews_graph.service

# Submit build request for reviews
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["reviews/date=2020-01-01"]}'

# Submit build request for daily summary (builds entire pipeline)
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'

# Check build status (replace BUILD_REQUEST_ID with the ID of the build request to inspect)
curl http://localhost:8080/api/v1/builds/BUILD_REQUEST_ID

# Get partition status (the partition ref is URL-encoded: "/" becomes %2F, "=" becomes %3D)
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/status

# Get partition events
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/events

# Analyze build graph (planning only)
curl -X POST http://localhost:8080/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'
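
The partition endpoints above take the partition ref as a single URL path segment, so `/` and `=` have to be percent-encoded. A minimal sketch of that encoding in plain shell follows; `urlencode` and `partition_status_url` are hypothetical helper names, the base URL is the one used in the curl examples, and the encoder assumes ASCII input.

```sh
#!/bin/sh
# Percent-encode a string for use as one URL path segment (ASCII input only).
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;                      # unreserved: keep as-is
      *)               out=$out$(printf '%%%02X' "'$c") ;; # everything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Status URL for a partition ref, following the endpoint shape shown above.
partition_status_url() {
  printf 'http://localhost:8080/api/v1/partitions/%s/status\n' "$(urlencode "$1")"
}
```

Usage: `curl "$(partition_status_url "reviews/date=2020-01-01")"` issues the same request as the partition status example above.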

Partition Reference Patterns

The following partition reference patterns are supported:

  • reviews/date=YYYY-MM-DD - Raw reviews for a specific date
  • podcasts/all - All podcasts metadata
  • categorized_reviews/category=CATEGORY/date=YYYY-MM-DD - Categorized reviews
  • phrase_models/category=CATEGORY/date=YYYY-MM-DD - Phrase models
  • phrase_stats/category=CATEGORY/date=YYYY-MM-DD - Phrase statistics
  • daily_summaries/category=CATEGORY/date=YYYY-MM-DD - Daily summaries (the final output of the pipeline)
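
The patterns above can be checked mechanically before a ref is passed to the CLI or the service. The sketch below is not DataBuild's own validation: `is_partition_ref` is a hypothetical helper, CATEGORY is assumed to be any non-empty string without a slash, and dates are checked for shape only, not calendar validity.

```sh
#!/bin/sh
# Succeeds if $1 matches one of the partition reference patterns listed above.
is_partition_ref() {
  date_re='[0-9]{4}-[0-9]{2}-[0-9]{2}'
  by_category='(categorized_reviews|phrase_models|phrase_stats|daily_summaries)'
  # -x anchors the whole alternation to the full line.
  printf '%s\n' "$1" | grep -Eqx \
    "podcasts/all|reviews/date=${date_re}|${by_category}/category=[^/]+/date=${date_re}"
}
```

Usage: `is_partition_ref "phrase_stats/category=Technology/date=2020-01-01" && echo ok`.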