GCP & Azure Fundamentals for Multi-Cloud

GCP Data & AI Services: BigQuery, Pub/Sub & Vertex AI

Google's data and AI services are often considered best-in-class. Understanding these services is crucial for architect interviews at data-centric companies.

BigQuery: Serverless Data Warehouse

BigQuery is GCP's flagship analytics service and a key differentiator.

Architecture & Key Features

What Makes BigQuery Unique:

  • Serverless: No infrastructure management
  • Separation of compute and storage: Pay for what you query
  • Columnar storage: Optimized for analytics
  • Dremel execution engine: Massively parallel query execution
  • Petabyte scale: Handle massive datasets

Pricing Models

| Model | Best For | Pricing |
| --- | --- | --- |
| On-demand | Variable workloads | $6.25/TB scanned |
| Flat-rate (Editions) | Predictable workloads | $2,000/100 slots/month |
| Autoscaling | Variable with baseline | Baseline + burst slots |
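A quick back-of-envelope calculation makes the trade-off concrete. Using the list prices above (a sketch only; real Editions pricing is billed per slot-hour and varies by edition and region), you can find the monthly scan volume at which capacity pricing beats on-demand:

```python
# Hedged sketch: break-even between BigQuery on-demand and flat-rate
# pricing, using the list prices quoted above.
ON_DEMAND_PER_TB = 6.25       # USD per TB scanned (on-demand)
FLAT_RATE_PER_MONTH = 2000.0  # USD per 100 slots per month

def monthly_on_demand_cost(tb_scanned: float) -> float:
    """On-demand cost for a month in which tb_scanned TB are scanned."""
    return tb_scanned * ON_DEMAND_PER_TB

# Flat-rate wins once monthly scanning exceeds this volume.
break_even_tb = FLAT_RATE_PER_MONTH / ON_DEMAND_PER_TB
print(f"Break-even: {break_even_tb:.0f} TB scanned per month")  # 320 TB
```

Roughly 320 TB scanned per month is the crossover point, which is why on-demand suits variable or exploratory workloads and capacity pricing suits heavy, predictable ones.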

Interview Question: BigQuery vs Redshift

Q: "When would you recommend BigQuery over Amazon Redshift?"

A:

| Factor | BigQuery | Redshift |
| --- | --- | --- |
| Management | Serverless (no clusters) | Cluster management |
| Scaling | Automatic, instant | Manual/elastic resize (Serverless option available) |
| Pricing | Per TB scanned | Per node hour |
| Best For | Variable/ad-hoc queries | Predictable, steady workloads |
| Streaming | Native ($0.05/GB) | Kinesis integration required |
| ML Integration | BigQuery ML built-in | SageMaker integration |

Choose BigQuery when:

  • Unknown or variable query patterns
  • Team wants zero ops overhead
  • Need built-in ML capabilities
  • Real-time streaming analytics required

BigQuery Best Practices

Cost Optimization:

-- Use partitioning and clustering to reduce scanned data
CREATE TABLE `myproject.mydataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;

-- Preview query cost before running: the query validator in the console
-- shows bytes to be scanned; set "Maximum bytes billed" under
-- "More" → "Query Settings" to cap the cost of a runaway query

Performance Optimization:

  • Partition by date/timestamp columns
  • Cluster by high-cardinality filter columns
  • Avoid SELECT * (specify columns)
  • Use materialized views for common aggregations
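The "avoid SELECT *" advice follows directly from columnar storage: on-demand billing charges per byte scanned, and BigQuery reads only the columns a query references. A rough estimate (the column widths and row count below are hypothetical, and real tables compress):

```python
# Hedged sketch: why SELECT * is expensive in columnar storage.
# Scanned bytes ~ rows * total width of the columns referenced.
ON_DEMAND_PER_TB = 6.25  # USD per TB scanned (list price)

def scan_cost_usd(rows: int, column_bytes: list[int]) -> float:
    """Approximate on-demand cost of scanning the given columns."""
    bytes_scanned = rows * sum(column_bytes)
    return bytes_scanned / 2**40 * ON_DEMAND_PER_TB

# Hypothetical event table: bytes per row for each column
table = {"user_id": 8, "event_timestamp": 8, "payload": 500}
rows = 10_000_000_000  # 10 billion rows (assumed)

full = scan_cost_usd(rows, list(table.values()))  # SELECT *
narrow = scan_cost_usd(rows, [table["user_id"], table["event_timestamp"]])

print(f"SELECT *: ${full:.2f}  vs two columns: ${narrow:.2f}")
```

With a wide payload column, the narrow query scans roughly 3% of the bytes of SELECT *, and the cost shrinks proportionally.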

Cloud Pub/Sub: Messaging & Streaming

Google's managed messaging service, similar to AWS SNS + SQS combined.

Key Characteristics

| Feature | Pub/Sub | AWS Equivalent |
| --- | --- | --- |
| Model | Publish-subscribe | SNS + SQS combined |
| Ordering | Optional (per-key) | SQS FIFO |
| Retention | 7 days default (configurable to 31) | 14 days max (SQS) |
| Dead Letter | Supported | Supported |
| Push/Pull | Both | SNS push, SQS pull |

Pub/Sub Architecture Patterns

Event-Driven Architecture:

Publishers → Topic → Subscriptions → Subscribers
                    ├── Pull Subscription → Cloud Functions
                    ├── Push Subscription → Cloud Run
                    └── BigQuery Subscription → BigQuery (direct)

BigQuery Subscriptions (unique to GCP): write messages from a topic directly into a BigQuery table without any pipeline code.

Interview Question: Message Ordering

Q: "How do you guarantee message ordering in Pub/Sub?"

A: Use ordering keys:

# Publisher
from google.cloud import pubsub_v1

# Message ordering must be enabled on the publisher client,
# otherwise publishing with an ordering_key raises an error
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(
        enable_message_ordering=True
    )
)
topic_path = publisher.topic_path(project, topic)

# Messages with the same ordering_key are delivered in publish order
publisher.publish(
    topic_path,
    data=b"message",
    ordering_key="user-123",  # all user-123 messages stay in order
)

Important: Ordering must also be enabled on the subscription, and the guarantee is per-subscription, per-ordering-key. Messages with different keys may arrive out of order relative to each other.
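A toy in-memory model (not the real Pub/Sub client) makes the guarantee concrete: each ordering key behaves like its own FIFO queue, while keys are independent of one another.

```python
# Toy model of per-key ordering -- illustrative only, not the Pub/Sub API.
from collections import defaultdict

class ToyOrderedTopic:
    def __init__(self):
        # One FIFO queue per ordering key; no ordering across keys.
        self._queues = defaultdict(list)

    def publish(self, ordering_key: str, data: str) -> None:
        self._queues[ordering_key].append(data)

    def deliver(self, ordering_key: str) -> list[str]:
        """Messages for a single key come back in publish order."""
        return list(self._queues[ordering_key])

topic = ToyOrderedTopic()
topic.publish("user-123", "login")
topic.publish("user-456", "login")
topic.publish("user-123", "purchase")

print(topic.deliver("user-123"))  # ['login', 'purchase'] -- per-key order held
```

Interleaving between `user-123` and `user-456` is unspecified, which is exactly the trade-off that lets Pub/Sub scale horizontally across keys.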

Dataflow: Stream & Batch Processing

GCP's managed Apache Beam service.

When to Use Dataflow

| Scenario | Dataflow | BigQuery |
| --- | --- | --- |
| Real-time transformation | Yes | Limited (streaming inserts) |
| Complex windowing | Yes | No |
| Cross-service ETL | Yes | Limited |
| ML inference pipeline | Yes | BigQuery ML only |
| Cost at scale | Higher | Lower for pure analytics |

Common Dataflow Patterns

Streaming ETL:

Pub/Sub → Dataflow (transform, enrich, window) → BigQuery

Batch Processing:

Cloud Storage (CSV/JSON) → Dataflow → BigQuery/Bigtable
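To build intuition for the "complex windowing" that Dataflow adds in these patterns, here is a plain-Python sketch of a one-minute tumbling window. (Real pipelines use the Apache Beam SDK; this only illustrates the windowing arithmetic Beam performs for you.)

```python
# Plain-Python sketch of tumbling-window aggregation, the kind of
# per-window grouping a Beam/Dataflow streaming job applies.
from collections import defaultdict

WINDOW_SECONDS = 60  # one-minute tumbling windows

def tumbling_window_counts(events):
    """events: iterable of (epoch_seconds, user_id) pairs.
    Returns event counts keyed by (window_start, user_id)."""
    counts = defaultdict(int)
    for ts, user in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        counts[(window_start, user)] += 1
    return dict(counts)

events = [(0, "a"), (59, "a"), (60, "a"), (61, "b")]
print(tumbling_window_counts(events))
# {(0, 'a'): 2, (60, 'a'): 1, (60, 'b'): 1}
```

Events at t=0 and t=59 land in the same window; t=60 starts a new one. Beam adds what this sketch omits: event-time watermarks, late-data handling, and sliding or session windows.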

Vertex AI: Unified ML Platform

Google's managed ML platform, competing with AWS SageMaker.

Vertex AI Components

| Component | Purpose | AWS Equivalent |
| --- | --- | --- |
| Workbench | Managed notebooks | SageMaker Studio |
| Training | Custom model training | SageMaker Training |
| Prediction | Model serving | SageMaker Endpoints |
| Pipelines | ML workflow orchestration | SageMaker Pipelines |
| Feature Store | Feature management | SageMaker Feature Store |
| Model Garden | Pre-trained models | SageMaker JumpStart |
| Gemini API | Foundation models | Bedrock |

Interview Question: Vertex AI vs SageMaker

Q: "What are the strengths of Vertex AI compared to SageMaker?"

A:

Vertex AI Strengths:

  • Tighter BigQuery integration (direct training from tables)
  • AutoML more mature (Google's ML heritage)
  • Gemini models for generative AI
  • Simpler pricing model
  • Better integration with data stack (BigQuery, Dataflow)

SageMaker Strengths:

  • Larger ecosystem of built-in algorithms
  • More deployment options (edge, batch, async)
  • Better multi-account governance
  • More mature MLOps features
  • Wider third-party integration

Data Architecture Decision Tree

Analytics/BI workload?
  ├── Yes → BigQuery (serverless, cost-effective)
  └── Need real-time transformation? → Dataflow + BigQuery

Messaging/Events?
  ├── Simple pub/sub → Pub/Sub
  ├── Need strong ordering → Pub/Sub with ordering keys
  └── Direct to BigQuery → BigQuery Subscription

ML/AI?
  ├── Tabular data in BigQuery → BigQuery ML
  ├── Custom training → Vertex AI Training
  └── Foundation models → Gemini API / Model Garden
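For interview practice, the decision tree above can be encoded as a simple lookup. This is purely illustrative (the workload labels are hypothetical shorthand; real architecture decisions weigh cost, latency, and team skills):

```python
# Hedged sketch: the decision tree above as a lookup function.
def recommend_gcp_service(workload: str) -> str:
    """Map a shorthand workload description to the service from the tree."""
    table = {
        "analytics": "BigQuery",
        "real-time transformation": "Dataflow + BigQuery",
        "simple pub/sub": "Pub/Sub",
        "ordered events": "Pub/Sub with ordering keys",
        "events to warehouse": "BigQuery Subscription",
        "tabular ml": "BigQuery ML",
        "custom training": "Vertex AI Training",
        "foundation models": "Gemini API / Model Garden",
    }
    return table.get(workload.lower(), "needs deeper analysis")

print(recommend_gcp_service("analytics"))  # BigQuery
```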

Pro Tip: GCP's data services are deeply integrated. A common pattern is: Pub/Sub → Dataflow → BigQuery → Vertex AI. This integration is stronger than AWS equivalents.

Next, we'll explore Azure core services.
