# Data Loader

Handles Google Cloud Storage operations and BigQuery table management.
## Purpose
Manages the final step of the data pipeline: uploading processed data to Google Cloud and creating BigQuery tables for analytics.
## Key Operations

### Cloud Storage
- Upload: Transfers Parquet files to Google Cloud Storage (see the sketch after this list)
- Download: Retrieves files from cloud storage
- Chunked uploads: Splits large files into smaller parts so transfers stay memory-efficient and resumable
- Retry logic: Retries failed transfers automatically
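A minimal sketch of how the upload path might look with the `google-cloud-storage` client; the bucket name, function name, and chunk size here are illustrative assumptions, not values from this project:

```python
from google.api_core.retry import Retry
from google.cloud import storage


def upload_parquet(bucket_name: str, resource: str, local_path: str) -> None:
    """Upload a local Parquet file to gs://{bucket_name}/{resource}/{resource}.parquet."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{resource}/{resource}.parquet")
    # Setting chunk_size makes the client use a resumable, chunked upload,
    # which keeps memory usage flat for large files (must be a multiple of 256 KiB).
    blob.chunk_size = 8 * 1024 * 1024  # 8 MiB per chunk (illustrative value)
    # Retry transient failures for up to two minutes before giving up.
    blob.upload_from_filename(local_path, retry=Retry(deadline=120))


# Hypothetical usage:
# upload_parquet("my-data-bucket", "projects", "data/raw/projects.parquet")
```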
### BigQuery
- Table creation: Creates tables automatically from Parquet files
- Data loading: Loads Parquet data into BigQuery tables
- Schema inference: Detects the data structure automatically (Parquet files carry their own schema)
- Dataset management: Creates datasets if they don't exist (see the sketch after this list)
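A sketch of the BigQuery side with the `google-cloud-bigquery` client, under the assumption that tables are loaded directly from the Parquet files in Cloud Storage; all project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery


def deploy_resource(project: str, dataset: str, resource: str, bucket: str) -> None:
    client = bigquery.Client(project=project)
    # Create the dataset if it does not already exist.
    client.create_dataset(f"{project}.{dataset}", exists_ok=True)
    # Parquet files embed their schema, so BigQuery infers the table
    # structure from the file itself.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace on re-deploy
    )
    uri = f"gs://{bucket}/{resource}/{resource}.parquet"
    table_id = f"{project}.{dataset}.{resource}"
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()


# Hypothetical usage:
# deploy_resource("my-project", "analytics", "projects", "my-data-bucket")
```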
## Common Commands

```bash
# Deploy single resource to BigQuery
make cli deploy project

# Deploy all resources to BigQuery
make cli deploy-all
```
## Data Flow

1. Parquet files are created from the JSON data
2. The files are uploaded to Google Cloud Storage
3. The data is loaded into BigQuery tables
4. The tables are ready for SQL queries and analytics (see the example below)
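Once the load jobs finish, the tables are immediately queryable. A hypothetical sanity check, with placeholder project, dataset, and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT COUNT(*) AS row_count FROM `my-project.analytics.projects`"
for row in client.query(query).result():
    print(f"Loaded {row.row_count} rows")
```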
## Storage Locations

- Local: `data/raw/{resource}.parquet`
- Cloud Storage: `gs://{bucket}/{resource}/{resource}.parquet`
- BigQuery: `{project}.{dataset}.{resource}table`
## When to Use
- Initial setup: Create all BigQuery tables from existing data
- After data updates: When new Parquet files are generated
- Schema changes: When the data structure is modified