Drupal Data Fetcher

[Screenshot: GCS bucket populated with data fetched from Drupal.org]

Drupal Data Fetcher is a Python toolkit for downloading, synchronizing, and managing open data from the Drupal.org API, with support for both local storage and Google Cloud Storage. It is designed for researchers, data engineers, and anyone interested in large-scale analysis of Drupal.org data.

Features

  • Fetches all major Drupal.org datasets (projects, issues, releases, users, etc.)
  • Handles API rate limits and retries automatically (see the sketch after this list)
  • Parallel upload/download to Google Cloud Storage
  • CLI for easy automation and scripting
  • Extensible and well-documented Python classes
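
The rate-limit handling is built into the toolkit's fetchers. As a rough illustration of the pattern (not the toolkit's actual code; the fetch_json helper and the backoff parameters are made up for this example), a minimal retry loop against the Drupal.org REST API looks like this:

import time
import requests

API_URL = "https://www.drupal.org/api-d7/node.json"  # Drupal.org REST endpoint

def fetch_json(params, max_retries=5, backoff=2.0):
    # Retry on rate limiting (429) and transient server errors (5xx),
    # sleeping exponentially longer between attempts.
    for attempt in range(max_retries):
        response = requests.get(API_URL, params=params, timeout=30)
        if response.status_code == 200:
            return response.json()
        if response.status_code in (429, 500, 502, 503):
            time.sleep(backoff * 2 ** attempt)
            continue
        response.raise_for_status()  # fail fast on anything else
    raise RuntimeError(f"giving up on {params} after {max_retries} attempts")

# Example: first page of module projects.
page = fetch_json({"type": "project_module", "page": 0})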

Quickstart

1. Local setup

  • Set up the local environment:
# Install dependencies
make install
# Activate the virtual environment
source .venv/bin/activate

2. Cloud setup

  • Edit the .env file with your credentials:
# Basic project info.
PROJECT_ID=your-gcp-project-id # e.g. drucom
PROJECT_NAME='Drupal Data Fetcher'
# User/Bot in charge of publishing data to cloud.
SERVICE_ACCOUNT=drupal-data-fetcher
GOOGLE_APPLICATION_CREDENTIALS=/path/to/keys/drupal-data-fetcher.json
# Cloud Storage bucket.
# e.g. choose-a-globally-unique-name
BUCKET_ID=your-gcs-bucket-name
BUCKET_REGION=europe-west4 # e.g. us-central1
# Cloud BigQuery dataset.
BIGQUERY_DATASET_ID=drupal
# (optional) Billing account - set spending limits on it for safety.
# Find yours: `gcloud billing accounts list`
BILLING_ACCOUNT_ID=000000-111111-000001
  • Run the setup script:
# 1. Make sure your .env file is configured
# 2. Load environment variables
source .env  # or direnv reload

# 3. Run the setup script
./script/setup-gcp.sh

# NB: You can revert everything later with the cleanup script:
# ./script/cleanup-gcp.sh
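
For reference, here is a minimal Python sketch of the two core resources the script provisions, using the official google-cloud-storage and google-cloud-bigquery clients. This is an approximation for illustration only; the real setup-gcp.sh likely also creates the service account, keys, and IAM bindings:

import os
from google.cloud import bigquery, storage

# Values come from the .env file loaded above.
project_id = os.environ["PROJECT_ID"]

# Create the Cloud Storage bucket in the configured region.
storage.Client(project=project_id).create_bucket(
    os.environ["BUCKET_ID"], location=os.environ["BUCKET_REGION"]
)

# Create the BigQuery dataset that deploy-all loads tables into.
client = bigquery.Client(project=project_id)
dataset = bigquery.Dataset(f"{project_id}.{os.environ['BIGQUERY_DATASET_ID']}")
dataset.location = "EU"  # assumption: colocate with a European bucket
client.create_dataset(dataset, exists_ok=True)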

3. Extract data

First, pull down whatever data already exists in your GCS bucket, so you only fetch what is missing:

# Sync a single resource
make cli sync project
# Sync all resources
make cli sync-all
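
Under the hood this amounts to mirroring blobs between the bucket and a local data directory. A minimal sketch of the download direction, assuming a data/ layout that mirrors blob names (the actual CLI may organize files differently):

import os
from google.cloud import storage

client = storage.Client()

# Download any blob not present locally yet (the "sync down" direction).
for blob in client.list_blobs(os.environ["BUCKET_ID"], prefix="project/"):
    local_path = os.path.join("data", blob.name)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)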

Then fetch missing pages from Drupal.org:

# Extract a single resource
make cli extract project
# Extract all resources
make cli extract-all
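
Extraction pages through the API until it runs out of results; api-d7 list responses carry a next link while more pages remain, which gives a natural stopping condition. A simplified sketch (retry handling omitted; the one-JSON-file-per-page layout is an assumption, not necessarily what the toolkit writes):

import json
import os
import requests

API_URL = "https://www.drupal.org/api-d7/node.json"
os.makedirs("data/project", exist_ok=True)

page = 0
while True:
    response = requests.get(API_URL, params={"type": "project_module", "page": page}, timeout=30)
    response.raise_for_status()
    data = response.json()
    with open(f"data/project/page-{page}.json", "w") as fh:
        json.dump(data, fh)
    if not data.get("next"):  # the last page carries no "next" link
        break
    page += 1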

Next, deploy existing data to BigQuery to create tables:

make cli deploy-all
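
Loading local files into BigQuery is a standard load job. A minimal sketch, assuming the records have been flattened to newline-delimited JSON (the file path and table name below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
with open("data/project/project.ndjson", "rb") as fh:
    job = client.load_table_from_file(
        fh, "your-gcp-project-id.drupal.project", job_config=job_config
    )
job.result()  # block until the load job finishes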

Finally, sync everything again to push newly fetched data to GCP:

make cli sync-all

4. Incremental updates

Now you can use the incremental update command to keep your data current:

make data

This command fetches only the latest changes from Drupal.org and updates your BigQuery tables efficiently.
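
The api-d7 endpoint accepts sort and direction query parameters, so an incremental pass can walk results newest-first and stop at the first record it has already seen. A hedged sketch of that idea (last_run stands for whatever timestamp your previous run recorded; this is not the toolkit's actual implementation):

import requests

API_URL = "https://www.drupal.org/api-d7/node.json"
last_run = 1700000000  # Unix timestamp of the previous run (placeholder)

fresh, page = [], 0
while True:
    response = requests.get(
        API_URL,
        params={"type": "project_module", "sort": "changed", "direction": "DESC", "page": page},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    batch = [node for node in data["list"] if int(node["changed"]) > last_run]
    fresh.extend(batch)
    # Stop once a page contains records older than the last run,
    # or when there are no more pages.
    if len(batch) < len(data["list"]) or not data.get("next"):
        break
    page += 1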

Documentation

Supported Resources

Resource      Description
project       Drupal modules, themes, distributions
issue         Bug reports, feature requests
release       Software releases and versions
user          User profiles and activity
forum         Forum posts and discussions
organization  Companies and groups
changenotice  API change notifications
casestudy     Case studies and examples
event         Events and meetups
term          Taxonomy terms
vocabulary    Taxonomy vocabularies

Testing

Run the test suite:

make test

License

MIT License. See LICENSE file.