Data Transformer

Converts raw JSON data into optimized Parquet files ready for analytics.

Purpose

Transforms complex, nested JSON data from Drupal.org into clean, flat tables suitable for BigQuery analysis.

Key Operations

Data Processing Pipeline

  1. Merge: Combines all JSON files for a resource
  2. Flatten: Converts nested structures to flat columns
  3. Clean: Removes duplicate rows and all-empty columns
  4. Optimize: Reduces memory usage and file size
  5. Save: Creates compressed Parquet files

Flattening Strategy

Flattens complex nested data into delimited strings using two separators:

  • | - Joins components within a single object
  • || - Joins items in arrays/lists

Example:

# Original nested data
{"images": [{"url": "pic1.jpg", "alt": "Photo 1"}, {"url": "pic2.jpg", "alt": "Photo 2"}]}

# Flattened result
"url:pic1.jpg|alt:Photo 1||url:pic2.jpg|alt:Photo 2"
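One way the separator scheme could be implemented is a small recursive helper; this is a sketch of the idea, not the project's actual function, and it assumes each object field is rendered as a key:value pair.

```python
def flatten_value(value) -> str:
    # Arrays/lists: flatten each item, then join with "||"
    if isinstance(value, list):
        return "||".join(flatten_value(item) for item in value)
    # Objects: render "key:value" components, joined with "|"
    if isinstance(value, dict):
        return "|".join(f"{k}:{flatten_value(v)}" for k, v in value.items())
    # Scalars: plain string form
    return str(value)
```

For the images example above, `flatten_value(record["images"])` yields a single string cell, which keeps the table flat while remaining splittable later on the same separators.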

Common Commands

# Transform single resource
make cli transform project

# Transform all resources
make cli transform-all

Data Flow

Input: data/external/{resource}/ (JSON files)
Output: data/raw/{resource}.parquet (Optimized Parquet file)

When to Use

  • After extraction: When you have fresh JSON data from Drupal.org
  • Before deployment: To prepare data for BigQuery loading
  • After data updates: When the source JSON data has changed

Benefits

  • Faster queries: Parquet is optimized for analytics
  • Smaller storage: Compression reduces file sizes
  • Better performance: Columnar format ideal for BigQuery
  • Preserved structure: Nested data relationships maintained