Data Loader

Handles Google Cloud Storage operations and BigQuery table management.

Purpose

Manages the final step of the data pipeline: uploading processed Parquet files to Google Cloud Storage and loading them into BigQuery tables for analytics.

Key Operations

Cloud Storage

  • Upload: Transfers local Parquet files to a Google Cloud Storage bucket
  • Download: Retrieves files from Cloud Storage back to the local filesystem
  • Chunked uploads: Streams large files in fixed-size chunks so transfers are resumable and memory use stays bounded
  • Retry logic: Automatically retries transient failures (see the sketch after this list)
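
A minimal sketch of these operations using the google-cloud-storage Python client. The function names, the 8 MiB chunk size, and the explicit retry policy are illustrative assumptions, not the loader's actual code:

from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

def upload_parquet(bucket_name: str, resource: str, local_path: str) -> None:
    """Upload a local Parquet file to gs://{bucket}/{resource}/{resource}.parquet."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{resource}/{resource}.parquet")
    # Setting a chunk size makes the upload resumable: large files are sent
    # in pieces rather than one request, keeping memory use bounded.
    blob.chunk_size = 8 * 1024 * 1024  # 8 MiB; must be a multiple of 256 KiB
    blob.upload_from_filename(local_path, retry=DEFAULT_RETRY)

def download_parquet(bucket_name: str, resource: str, local_path: str) -> None:
    """Download a Parquet file from Cloud Storage to a local path."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{resource}/{resource}.parquet")
    blob.download_to_filename(local_path)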

BigQuery

  • Table creation: Creates tables automatically when loading Parquet files
  • Data loading: Loads Parquet data from Cloud Storage into BigQuery tables
  • Schema inference: Derives the table schema from the Parquet files' embedded metadata
  • Dataset management: Creates the target dataset if it doesn't exist (see the sketch after this list)
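
A sketch of the load step with the google-cloud-bigquery Python client; the function name and the WRITE_TRUNCATE disposition are assumptions for illustration:

from google.cloud import bigquery

def load_parquet_table(project: str, dataset: str, resource: str, bucket: str) -> None:
    """Load gs://{bucket}/{resource}/{resource}.parquet into {project}.{dataset}.{resource}."""
    client = bigquery.Client(project=project)
    # Create the target dataset if it does not already exist.
    client.create_dataset(f"{project}.{dataset}", exists_ok=True)
    # For Parquet, BigQuery reads the schema from the file itself,
    # so no explicit schema definition is needed.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    uri = f"gs://{bucket}/{resource}/{resource}.parquet"
    job = client.load_table_from_uri(
        uri, f"{project}.{dataset}.{resource}", job_config=job_config
    )
    job.result()  # Block until the load job finishes.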

Common Commands

# Deploy a single resource (here, the "project" resource) to BigQuery
make cli deploy project

# Deploy all resources to BigQuery
make cli deploy-all

Data Flow

  1. Parquet files created from JSON data
  2. Uploaded to Google Cloud Storage
  3. Loaded into BigQuery tables
  4. Ready for SQL queries and analytics (end-to-end sketch below)
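
How the steps fit together, reusing the upload_parquet and load_parquet_table sketches above; deploy is a hypothetical name, not the actual CLI entry point:

from google.cloud import bigquery

def deploy(resource: str, bucket: str, project: str, dataset: str) -> None:
    local_path = f"data/raw/{resource}.parquet"             # step 1 output
    upload_parquet(bucket, resource, local_path)            # step 2
    load_parquet_table(project, dataset, resource, bucket)  # step 3
    # Step 4: the table is now queryable like any other BigQuery table.
    rows = bigquery.Client(project=project).query(
        f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{resource}`"
    ).result()
    print(f"{resource}: {next(iter(rows)).n} rows loaded")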

Storage Locations

  • Local: data/raw/{resource}.parquet
  • Cloud Storage: gs://{bucket}/{resource}/{resource}.parquet
  • BigQuery: {project}.{dataset}.{resource} table
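
These naming conventions could be centralized in a small helper; locations is a hypothetical name based on the paths listed above:

def locations(resource: str, bucket: str, project: str, dataset: str) -> dict[str, str]:
    """Return the three canonical locations for a resource's data."""
    return {
        "local": f"data/raw/{resource}.parquet",
        "cloud_storage": f"gs://{bucket}/{resource}/{resource}.parquet",
        "bigquery": f"{project}.{dataset}.{resource}",
    }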

When to Use

  • Initial setup: Create all BigQuery tables from existing data
  • After data updates: When new Parquet files are generated
  • Schema changes: When data structure is modified