Data Loader

Handles Google Cloud Storage operations and BigQuery table management.

Purpose

Manages the final step of the data pipeline: uploading processed Parquet files to Google Cloud Storage and loading them into BigQuery tables for analytics.

Key Operations

Cloud Storage

  • Upload: Transfers local Parquet files to a Google Cloud Storage bucket
  • Download: Retrieves files from Cloud Storage back to the local filesystem
  • Chunked uploads: Streams large files in fixed-size chunks so transfers are resumable and memory use stays bounded
  • Retry logic: Automatically retries transient failures (see the sketch after this list)
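
A minimal sketch of these operations using the google-cloud-storage Python client. The function names, the 8 MiB chunk size, and the explicit retry policy are illustrative assumptions, not the loader's actual code:

from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

def upload_parquet(bucket_name: str, resource: str, local_path: str) -> None:
    """Upload a local Parquet file to gs://{bucket}/{resource}/{resource}.parquet."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{resource}/{resource}.parquet")
    # Setting a chunk size makes the upload resumable: large files are sent
    # in pieces rather than one request, keeping memory use bounded.
    blob.chunk_size = 8 * 1024 * 1024  # 8 MiB; must be a multiple of 256 KiB
    blob.upload_from_filename(local_path, retry=DEFAULT_RETRY)

def download_parquet(bucket_name: str, resource: str, local_path: str) -> None:
    """Download a Parquet file from Cloud Storage to a local path."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{resource}/{resource}.parquet")
    blob.download_to_filename(local_path)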

BigQuery

  • Table creation: Creates tables automatically when loading Parquet files
  • Data loading: Loads Parquet data from Cloud Storage into BigQuery tables
  • Schema inference: Derives the table schema from the Parquet files' embedded metadata
  • Dataset management: Creates the target dataset if it doesn't exist (see the sketch after this list)
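
A sketch of the load step with the google-cloud-bigquery Python client; the function name and the WRITE_TRUNCATE disposition are assumptions for illustration:

from google.cloud import bigquery

def load_parquet_table(project: str, dataset: str, resource: str, bucket: str) -> None:
    """Load gs://{bucket}/{resource}/{resource}.parquet into {project}.{dataset}.{resource}."""
    client = bigquery.Client(project=project)
    # Create the target dataset if it does not already exist.
    client.create_dataset(f"{project}.{dataset}", exists_ok=True)
    # For Parquet, BigQuery reads the schema from the file itself,
    # so no explicit schema definition is needed.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    uri = f"gs://{bucket}/{resource}/{resource}.parquet"
    job = client.load_table_from_uri(
        uri, f"{project}.{dataset}.{resource}", job_config=job_config
    )
    job.result()  # Block until the load job finishes.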

Common Commands

# Deploy a single resource (here, the "project" resource) to BigQuery
make cli deploy project

# Deploy all resources to BigQuery
make cli deploy-all

Data Flow

  1. Parquet files created from JSON data
  2. Uploaded to Google Cloud Storage
  3. Loaded into BigQuery tables
  4. Ready for SQL queries and analytics (end-to-end sketch below)
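
How the steps fit together, reusing the upload_parquet and load_parquet_table sketches above; deploy is a hypothetical name, not the actual CLI entry point:

from google.cloud import bigquery

def deploy(resource: str, bucket: str, project: str, dataset: str) -> None:
    local_path = f"data/raw/{resource}.parquet"             # step 1 output
    upload_parquet(bucket, resource, local_path)            # step 2
    load_parquet_table(project, dataset, resource, bucket)  # step 3
    # Step 4: the table is now queryable like any other BigQuery table.
    rows = bigquery.Client(project=project).query(
        f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{resource}`"
    ).result()
    print(f"{resource}: {next(iter(rows)).n} rows loaded")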

Storage Locations

  • Local: data/raw/{resource}.parquet
  • Cloud Storage: gs://{bucket}/{resource}/{resource}.parquet
  • BigQuery: {project}.{dataset}.{resource} table
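
These naming conventions could be centralized in a small helper; locations is a hypothetical name based on the paths listed above:

def locations(resource: str, bucket: str, project: str, dataset: str) -> dict[str, str]:
    """Return the three canonical locations for a resource's data."""
    return {
        "local": f"data/raw/{resource}.parquet",
        "cloud_storage": f"gs://{bucket}/{resource}/{resource}.parquet",
        "bigquery": f"{project}.{dataset}.{resource}",
    }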

When to Use

  • Initial setup: Create all BigQuery tables from existing data
  • After data updates: When new Parquet files are generated
  • Schema changes: When data structure is modified