# Data Extractor

Orchestrates data collection and synchronization between local storage, Google Cloud Storage, and the Drupal.org API.
## Purpose

Manages the complete data pipeline from Drupal.org to your local and cloud storage systems.
## Key Operations

### Extract

Downloads missing data pages from the Drupal.org API:
- Compares local vs. remote data
- Fetches only missing pages
- Handles pagination automatically
- Saves data locally as JSON files
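The compare-then-fetch logic above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation; the `page-{n}.json` naming and the function names are assumptions made for the example.

```python
import json
from pathlib import Path


def missing_pages(local_dir: Path, total_pages: int) -> list[int]:
    """Compare local JSON files against the expected page range and
    return the page numbers that still need to be fetched."""
    present = {int(p.stem.split("-")[-1]) for p in local_dir.glob("page-*.json")}
    return [n for n in range(total_pages) if n not in present]


def save_page(local_dir: Path, page: int, payload: dict) -> Path:
    """Persist one API page as a JSON file in the local layout."""
    local_dir.mkdir(parents=True, exist_ok=True)
    path = local_dir / f"page-{page}.json"
    path.write_text(json.dumps(payload))
    return path
```

With this shape, an extract run only calls the API for the page numbers returned by `missing_pages`, which is what makes repeated runs cheap.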
### Sync

Synchronizes data between local and cloud storage:
- Uploads local files to Google Cloud Storage
- Downloads missing files from cloud
- Ensures consistency across storage systems
- Uses parallel transfers for speed
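The sync steps above amount to a two-way set difference followed by parallel transfers. A hedged sketch, assuming file sets can be listed on both sides (the function names and the thread-pool choice are illustrative, not the tool's actual API):

```python
from concurrent.futures import ThreadPoolExecutor


def plan_sync(local: set[str], remote: set[str]) -> tuple[set[str], set[str]]:
    """Files present only locally are uploaded; files present only
    in the bucket are downloaded. Files on both sides are left alone."""
    to_upload = local - remote
    to_download = remote - local
    return to_upload, to_download


def run_parallel(transfer, names, workers: int = 8):
    """Run one transfer callable per file name on a thread pool,
    preserving input order in the returned results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transfer, sorted(names)))
```

In practice `transfer` would wrap the upload or download call for one object; the pool is what gives the parallel-transfer speedup mentioned above.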
## Common Commands

```shell
# Download missing data from Drupal.org
make cli extract project   # Single resource
make cli extract-all       # All resources

# Sync local ↔ cloud storage
make cli sync project      # Single resource
make cli sync-all          # All resources
```
## Storage Structure

- Local: `data/external/{resource}/` (JSON files organized by page)
- Cloud: `gs://{bucket}/{resource}/` (mirror of the local structure)
- BigQuery: `{dataset}.{resource}` (processed data tables)
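The three locations above follow one naming scheme per resource, so they can be derived from a single helper. A small sketch; the bucket and dataset values shown in the test are placeholders, not real project names:

```python
def storage_paths(resource: str, bucket: str, dataset: str) -> dict[str, str]:
    """Build the local, cloud, and BigQuery locations for one resource,
    mirroring the layout described above (illustrative, not the tool's API)."""
    return {
        "local": f"data/external/{resource}/",
        "cloud": f"gs://{bucket}/{resource}/",
        "bigquery": f"{dataset}.{resource}",
    }
```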
## When to Use
- Extract: When you need fresh data from Drupal.org
- Sync: When local and cloud storage are out of sync
- Both: During initial setup or major data updates