GCP & Azure Fundamentals for Multi-Cloud
GCP Data & AI Services: BigQuery, Pub/Sub & Vertex AI
Google's data and AI services are often considered best-in-class. Understanding these services is crucial for architect interviews at data-centric companies.
BigQuery: Serverless Data Warehouse
BigQuery is GCP's flagship analytics service and a key differentiator.
Architecture & Key Features
What Makes BigQuery Unique:
- Serverless: No infrastructure management
- Separation of compute and storage: Pay for what you query
- Columnar storage: Optimized for analytics
- Dremel execution engine: Massively parallel query execution
- Petabyte scale: Handle massive datasets
Pricing Models
| Model | Best For | Pricing |
|---|---|---|
| On-demand | Variable workloads | $6.25/TB scanned |
| Capacity (Editions) | Predictable workloads | Per slot-hour, with optional commitments (replaced legacy flat-rate) |
| Autoscaling | Variable with baseline | Baseline + burst slots |
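The on-demand vs. reserved-capacity trade-off comes down to a break-even calculation. A minimal sketch, assuming the $6.25/TB on-demand list price from the table and a hypothetical $2,000/month reserved-capacity budget:

```python
# Break-even sketch for on-demand vs. reserved capacity.
# $6.25/TB is the on-demand price from the table above; the
# $2,000/month capacity budget is a hypothetical illustration.
ON_DEMAND_PER_TB = 6.25
MONTHLY_CAPACITY_BUDGET = 2000.0

def on_demand_cost(tb_scanned: float) -> float:
    """On-demand monthly cost: pay only for bytes scanned."""
    return tb_scanned * ON_DEMAND_PER_TB

def breakeven_tb() -> float:
    """TB scanned per month where reserved capacity matches on-demand."""
    return MONTHLY_CAPACITY_BUDGET / ON_DEMAND_PER_TB

print(on_demand_cost(100))  # 625.0 -> on-demand wins below break-even
print(breakeven_tb())       # 320.0 TB/month
```

Below the break-even volume, on-demand is cheaper; above it, reserved capacity caps your spend.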
Interview Question: BigQuery vs Redshift
Q: "When would you recommend BigQuery over Amazon Redshift?"
A:
| Factor | BigQuery | Redshift |
|---|---|---|
| Management | Serverless (no clusters) | Cluster management |
| Scaling | Automatic, instant | Manual/elastic resize (Redshift Serverless narrows the gap) |
| Pricing | Per TB scanned | Per node hour |
| Best For | Variable/ad-hoc queries | Predictable, steady workloads |
| Streaming | Native ($0.05/GB) | Kinesis integration required |
| ML Integration | BigQuery ML built-in | SageMaker integration |
Choose BigQuery when:
- Unknown or variable query patterns
- Team wants zero ops overhead
- Need built-in ML capabilities
- Real-time streaming analytics required
BigQuery Best Practices
Cost Optimization:
-- Use partitioning to reduce scanned data
CREATE TABLE `myproject.mydataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;
-- Preview query cost before running
-- Click "More" → "Query Settings" → "Maximum bytes billed"
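The same cost preview is available programmatically via the `google-cloud-bigquery` client's dry-run mode. A sketch, where the project and SQL are placeholders:

```python
PRICE_PER_TB = 6.25  # on-demand list price, per the pricing table above

def bytes_to_usd(total_bytes: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """Convert a scanned-bytes estimate into on-demand dollars."""
    return total_bytes / 10**12 * price_per_tb

def estimate_query_cost(sql: str, project: str) -> float:
    """Dry-run a query: BigQuery validates it and reports the bytes it
    would scan, without executing it or billing anything."""
    from google.cloud import bigquery  # requires google-cloud-bigquery
    client = bigquery.Client(project=project)
    job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
    return bytes_to_usd(job.total_bytes_processed)
```

Running the dry-run before an expensive query is a cheap guardrail; combined with `maximum_bytes_billed`, it prevents accidental full-table scans.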
Performance Optimization:
- Partition by date/timestamp columns
- Cluster by high-cardinality filter columns
- Avoid SELECT * (specify columns)
- Use materialized views for common aggregations
Cloud Pub/Sub: Messaging & Streaming
Google's managed messaging service, similar to AWS SNS + SQS combined.
Key Characteristics
| Feature | Pub/Sub | AWS Equivalent |
|---|---|---|
| Model | Publish-subscribe | SNS + SQS combined |
| Ordering | Optional (per-key) | SQS FIFO |
| Retention | 7 days default (configurable to 31) | 14 days max (SQS) |
| Dead Letter | Supported | Supported |
| Push/Pull | Both | SNS push, SQS pull |
Pub/Sub Architecture Patterns
Event-Driven Architecture:
Publishers → Topic → Subscriptions → Subscribers
├── Pull Subscription → Cloud Functions
├── Push Subscription → Cloud Run
└── BigQuery Subscription → BigQuery (direct)
BigQuery Subscriptions (unique to GCP): Write messages directly to BigQuery without code.
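Creating one takes only an admin call with a `bigquery_config` on the subscription. A sketch that builds the request body (project, topic, and table names are placeholders):

```python
def bigquery_subscription_request(project: str, topic: str,
                                  sub: str, table: str) -> dict:
    """Build a create_subscription request that streams messages straight
    into a BigQuery table (table format: "project.dataset.table")."""
    return {
        "name": f"projects/{project}/subscriptions/{sub}",
        "topic": f"projects/{project}/topics/{topic}",
        "bigquery_config": {"table": table, "write_metadata": True},
    }

# Pass the dict to pubsub_v1.SubscriberClient().create_subscription(request=...)
req = bigquery_subscription_request(
    "my-proj", "events", "events-to-bq", "my-proj.analytics.events"
)
print(req["bigquery_config"]["table"])  # my-proj.analytics.events
```

With `write_metadata=True`, Pub/Sub also writes message attributes and publish timestamps alongside the payload.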
Interview Question: Message Ordering
Q: "How do you guarantee message ordering in Pub/Sub?"
A: Use ordering keys:
# Publisher -- ordering must be enabled on the client itself
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(
        enable_message_ordering=True
    )
)
topic_path = publisher.topic_path(project, topic)

# Messages with the same ordering_key are delivered in publish order
publisher.publish(
    topic_path,
    data=b"message",
    ordering_key="user-123",  # all user-123 messages in order
)
Important: Ordering is per-subscription, per-ordering-key, and the subscription itself must be created with message ordering enabled. Messages with different keys may arrive in any relative order.
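The guarantee can be pictured with a toy model: group deliveries by ordering key, and only the within-key sequence is fixed. A pure-Python illustration of the semantics (not the Pub/Sub API):

```python
from collections import defaultdict

def per_key_order(published):
    """Group (ordering_key, payload) pairs: within a key, publish order is
    preserved; across keys, Pub/Sub makes no relative-order guarantee."""
    per_key = defaultdict(list)
    for key, payload in published:
        per_key[key].append(payload)
    return dict(per_key)

published = [
    ("user-123", "login"), ("user-456", "login"),
    ("user-123", "purchase"), ("user-123", "logout"),
]
print(per_key_order(published)["user-123"])  # ['login', 'purchase', 'logout']
```

However user-456's messages interleave, user-123 always sees login before purchase before logout.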
Dataflow: Stream & Batch Processing
GCP's managed Apache Beam service.
When to Use Dataflow
| Scenario | Dataflow | BigQuery |
|---|---|---|
| Real-time transformation | Yes | Limited (streaming inserts) |
| Complex windowing | Yes | No |
| Cross-service ETL | Yes | Limited |
| ML inference pipeline | Yes | BigQuery ML only |
| Cost at scale | Higher | Lower for pure analytics |
Common Dataflow Patterns
Streaming ETL:
Pub/Sub → Dataflow (transform, enrich, window) → BigQuery
Batch Processing:
Cloud Storage (CSV/JSON) → Dataflow → BigQuery/Bigtable
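The "complex windowing" that sets Dataflow apart can be illustrated with its simplest case, fixed windows. The sketch below mimics Beam's fixed-window assignment in plain Python (a conceptual model, not the Beam API):

```python
from collections import defaultdict

def fixed_windows(events, window_secs=60):
    """Assign (timestamp, value) events to fixed, non-overlapping windows
    keyed by window start -- the simplest Beam/Dataflow windowing strategy."""
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - (ts % window_secs)
        windows[start].append(value)
    return dict(windows)

events = [(5, "a"), (61, "b"), (119, "c"), (120, "d")]
print(fixed_windows(events))  # {0: ['a'], 60: ['b', 'c'], 120: ['d']}
```

Beam generalizes this to sliding and session windows, plus triggers and late-data handling, which plain BigQuery SQL does not offer on streams.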
Vertex AI: Unified ML Platform
Google's managed ML platform, competing with AWS SageMaker.
Vertex AI Components
| Component | Purpose | AWS Equivalent |
|---|---|---|
| Workbench | Managed notebooks | SageMaker Studio |
| Training | Custom model training | SageMaker Training |
| Prediction | Model serving | SageMaker Endpoints |
| Pipelines | ML workflow orchestration | SageMaker Pipelines |
| Feature Store | Feature management | SageMaker Feature Store |
| Model Garden | Pre-trained models | SageMaker JumpStart |
| Gemini API | Foundation models | Bedrock |
Interview Question: Vertex AI vs SageMaker
Q: "What are the strengths of Vertex AI compared to SageMaker?"
A:
Vertex AI Strengths:
- Tighter BigQuery integration (direct training from tables)
- AutoML more mature (Google's ML heritage)
- Gemini models for generative AI
- Simpler pricing model
- Better integration with data stack (BigQuery, Dataflow)
SageMaker Strengths:
- Larger ecosystem of built-in algorithms
- More deployment options (edge, batch, async)
- Better multi-account governance
- More mature MLOps features
- Wider third-party integration
Data Architecture Decision Tree
Analytics/BI workload?
└── Yes → BigQuery (serverless, cost-effective)
    └── Need real-time transformation? → Dataflow + BigQuery

Messaging/Events?
├── Simple pub/sub → Pub/Sub
├── Need strong ordering → Pub/Sub with ordering keys
└── Direct to BigQuery → BigQuery Subscription

ML/AI?
├── Tabular data in BigQuery → BigQuery ML
├── Custom training → Vertex AI Training
└── Foundation models → Gemini API / Model Garden
Pro Tip: GCP's data services are deeply integrated. A common pattern is: Pub/Sub → Dataflow → BigQuery → Vertex AI. This integration is stronger than AWS equivalents.
Next, we'll explore Azure core services.