Data Transformer
Converts raw JSON data into optimized Parquet files ready for analytics.
Purpose
Transforms complex, nested JSON data from Drupal.org into clean, flat tables suitable for BigQuery analysis.
Key Operations
Data Processing Pipeline
- Merge: Combines all JSON files for a resource
- Flatten: Converts nested structures to flat columns
- Clean: Removes duplicates and empty columns
- Optimize: Reduces memory usage and file size
- Save: Creates compressed Parquet files (a sketch of these steps follows below)
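As a rough illustration, the five steps could look like the following pandas-based sketch. This is a minimal outline, assuming pandas and pyarrow are available; the transform_resource function, the file layout, and the column handling are illustrative, not the project's actual code.

import glob
import json
import pandas as pd

def transform_resource(input_dir: str, output_path: str) -> None:
    # Merge: combine every JSON file for the resource
    # (each file is assumed to hold a JSON array of records)
    records = []
    for path in sorted(glob.glob(f"{input_dir}/*.json")):
        with open(path) as fh:
            records.extend(json.load(fh))

    # Flatten: expand nested objects into dotted columns; arrays are collapsed
    # into strings (the | / || separator scheme is sketched in the next section)
    df = pd.json_normalize(records)
    for col in df.columns:
        df[col] = df[col].map(
            lambda v: "||".join(map(str, v)) if isinstance(v, list) else v
        )

    # Clean: drop duplicate rows and columns that are entirely empty
    df = df.drop_duplicates().dropna(axis=1, how="all")

    # Optimize: downcast numeric columns to the smallest type that fits
    for col in df.select_dtypes(include="number").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    # Save: write a compressed Parquet file
    df.to_parquet(output_path, compression="snappy", index=False)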
Flattening Strategy
Handles complex nested data by collapsing it into delimited strings with two separators:
- | (single pipe): Joins components within a single object
- || (double pipe): Joins items in arrays/lists
Example:
# Original nested data
{"images": [{"url": "pic1.jpg", "alt": "Photo 1"}, {"url": "pic2.jpg", "alt": "Photo 2"}]}
# Flattened result
"pic1.jpg|alt:Photo 1||pic2.jpg|alt:Photo 2"
Common Commands
# Transform single resource
make cli transform project
# Transform all resources
make cli transform-all
Data Flow
Input: data/external/{resource}/ (JSON files)
Output: data/raw/{resource}.parquet (Optimized Parquet file)
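Tied to the transform_resource sketch above, this flow could be invoked roughly as follows; the resource name "project" and the helper itself are assumptions, not the project's actual API.

resource = "project"
transform_resource(
    input_dir=f"data/external/{resource}",      # JSON files from extraction
    output_path=f"data/raw/{resource}.parquet"  # optimized Parquet file for BigQuery
)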
When to Use
- After extraction: When you have fresh JSON data from Drupal.org
- Before deployment: To prepare data for BigQuery loading
- Data updates: When the raw JSON data has changed
Benefits
- Faster queries: Parquet is optimized for analytics
- Smaller storage: Compression reduces file sizes
- Better performance: Columnar format ideal for BigQuery
- Preserved structure: Nested data relationships maintained