Data Extractor

Orchestrates data collection and synchronization between local storage, Google Cloud Storage, and Drupal.org API.

Purpose

Manages the complete data pipeline from Drupal.org to your local and cloud storage systems.

Key Operations

Extract

Downloads missing data pages from the Drupal.org API:

  • Compares local vs. remote data
  • Fetches only missing pages
  • Handles pagination automatically
  • Saves data locally as JSON files
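The extract steps above can be sketched as a small diffing routine. This is a minimal illustration, not the project's actual implementation: the `page-{n}.json` filename layout and the function names are assumptions for the example.

```python
import json
from pathlib import Path

def local_pages(data_dir: Path) -> set[int]:
    """Page numbers already saved locally as page-{n}.json (hypothetical layout)."""
    return {int(p.stem.split("-")[1]) for p in data_dir.glob("page-*.json")}

def missing_pages(have: set[int], total_pages: int) -> list[int]:
    """Compare local vs. remote: pages still to fetch from the API."""
    return sorted(set(range(total_pages)) - have)

def save_page(data_dir: Path, page: int, payload: dict) -> None:
    """Persist one fetched API page as a local JSON file."""
    data_dir.mkdir(parents=True, exist_ok=True)
    (data_dir / f"page-{page}.json").write_text(json.dumps(payload))
```

Because only the pages in `missing_pages(...)` are requested, re-running extract after an interruption resumes where it left off instead of re-downloading everything.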

Sync

Synchronizes data between local and cloud storage:

  • Uploads local files to Google Cloud Storage
  • Downloads missing files from cloud
  • Ensures consistency across storage systems
  • Uses parallel transfers for speed
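A sync pass reduces to two set differences plus a thread pool for the transfers. The sketch below assumes hypothetical `upload`/`download` callables standing in for the real Google Cloud Storage client calls:

```python
from concurrent.futures import ThreadPoolExecutor

def sync_plan(local: set[str], cloud: set[str]) -> tuple[set[str], set[str]]:
    """Return (files to upload, files to download): each side's missing files."""
    return local - cloud, cloud - local

def run_parallel(transfer, names, workers: int = 8) -> None:
    """Apply a transfer callable (upload or download) to each file concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(transfer, names))  # list() surfaces any raised exceptions
```

After `run_parallel(upload, to_upload)` and `run_parallel(download, to_download)` both complete, the two listings are identical, which is the consistency guarantee sync provides.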

Common Commands

# Download missing data from Drupal.org
make cli extract project       # Single resource
make cli extract-all           # All resources

# Sync local ↔ cloud storage
make cli sync project          # Single resource
make cli sync-all              # All resources

Storage Structure

  • Local: data/external/{resource}/ - JSON files organized by page
  • Cloud: gs://{bucket}/{resource}/ - Mirror of local structure
  • BigQuery: {dataset}.{resource} - Processed data tables
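Because the cloud bucket mirrors the local layout, mapping between the two is a pure string operation. A sketch, assuming the `page-{n}.json` naming from the local store (the helper names are illustrative, not the project's API):

```python
def local_path(resource: str, page: int) -> str:
    """Local JSON file for one API page of a resource."""
    return f"data/external/{resource}/page-{page}.json"

def cloud_uri(bucket: str, resource: str, page: int) -> str:
    """Same file's location in the mirrored Google Cloud Storage bucket."""
    return f"gs://{bucket}/{resource}/page-{page}.json"
```

Keeping the two layouts identical means sync never needs a lookup table: each local file's cloud destination is derived from its path.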

When to Use

  • Extract: When you need fresh data from Drupal.org
  • Sync: When local and cloud storage are out of sync
  • Both: During initial setup or major data updates