Data Extractor

Orchestrates data collection and synchronization between local storage, Google Cloud Storage, and Drupal.org API.

Purpose

Manages the complete data pipeline from Drupal.org to your local and cloud storage systems.

Key Operations

Extract

Downloads missing data pages from the Drupal.org API:

  • Compares local vs. remote data
  • Fetches only missing pages
  • Handles pagination automatically
  • Saves data locally as JSON files
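The extract steps above can be sketched as a small diffing routine. This is a minimal illustration, not the project's actual implementation: the `page-{n}.json` filename layout and the function names are assumptions for the example.

```python
import json
from pathlib import Path

def local_pages(data_dir: Path) -> set[int]:
    """Page numbers already saved locally as page-{n}.json (hypothetical layout)."""
    return {int(p.stem.split("-")[1]) for p in data_dir.glob("page-*.json")}

def missing_pages(have: set[int], total_pages: int) -> list[int]:
    """Compare local vs. remote: pages still to fetch from the API."""
    return sorted(set(range(total_pages)) - have)

def save_page(data_dir: Path, page: int, payload: dict) -> None:
    """Persist one fetched API page as a local JSON file."""
    data_dir.mkdir(parents=True, exist_ok=True)
    (data_dir / f"page-{page}.json").write_text(json.dumps(payload))
```

Because only the pages in `missing_pages(...)` are requested, re-running extract after an interruption resumes where it left off instead of re-downloading everything.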

Sync

Synchronizes data between local and cloud storage:

  • Uploads local files to Google Cloud Storage
  • Downloads missing files from cloud
  • Ensures consistency across storage systems
  • Uses parallel transfers for speed
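A sync pass reduces to two set differences plus a thread pool for the transfers. The sketch below assumes hypothetical `upload`/`download` callables standing in for the real Google Cloud Storage client calls:

```python
from concurrent.futures import ThreadPoolExecutor

def sync_plan(local: set[str], cloud: set[str]) -> tuple[set[str], set[str]]:
    """Return (files to upload, files to download): each side's missing files."""
    return local - cloud, cloud - local

def run_parallel(transfer, names, workers: int = 8) -> None:
    """Apply a transfer callable (upload or download) to each file concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(transfer, names))  # list() surfaces any raised exceptions
```

After `run_parallel(upload, to_upload)` and `run_parallel(download, to_download)` both complete, the two listings are identical, which is the consistency guarantee sync provides.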

Common Commands

# Download missing data from Drupal.org
make cli extract project       # Single resource
make cli extract-all           # All resources

# Sync local ↔ cloud storage
make cli sync project          # Single resource
make cli sync-all              # All resources

Storage Structure

  • Local: data/external/{resource}/ - JSON files organized by page
  • Cloud: gs://{bucket}/{resource}/ - Mirror of local structure
  • BigQuery: {dataset}.{resource} - Processed data tables
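Because the cloud bucket mirrors the local layout, mapping between the two is a pure string operation. A sketch, assuming the `page-{n}.json` naming from the local store (the helper names are illustrative, not the project's API):

```python
def local_path(resource: str, page: int) -> str:
    """Local JSON file for one API page of a resource."""
    return f"data/external/{resource}/page-{page}.json"

def cloud_uri(bucket: str, resource: str, page: int) -> str:
    """Same file's location in the mirrored Google Cloud Storage bucket."""
    return f"gs://{bucket}/{resource}/page-{page}.json"
```

Keeping the two layouts identical means sync never needs a lookup table: each local file's cloud destination is derived from its path.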

When to Use

  • Extract: When you need fresh data from Drupal.org
  • Sync: When local and cloud storage are out of sync
  • Both: During initial setup or major data updates