GCP & Azure Fundamentals for Multi-Cloud
GCP Data & AI Services: BigQuery, Pub/Sub & Vertex AI
Google's data and AI services are often considered best-in-class. Understanding these services is crucial for architect interviews at data-centric companies.
BigQuery: Serverless Data Warehouse
BigQuery is GCP's flagship analytics service and a key differentiator.
Architecture & Key Features
What Makes BigQuery Unique:
- Serverless: No infrastructure management
- Separation of compute and storage: Pay for what you query
- Columnar storage: Optimized for analytics
- Dremel execution engine: Massively parallel query execution
- Petabyte scale: Handle massive datasets
Pricing Models
| Model | Best For | Pricing |
|---|---|---|
| On-demand | Variable workloads | $6.25/TB scanned |
| Capacity (Editions) | Predictable workloads | Per slot-hour, with optional commitments (replaced legacy flat-rate) |
| Autoscaling | Variable with baseline | Baseline + burst slots |
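The on-demand vs. reserved-capacity trade-off comes down to a break-even calculation. A minimal sketch, assuming the $6.25/TB on-demand list price from the table and a hypothetical $2,000/month reserved-capacity budget:

```python
# Break-even sketch for on-demand vs. reserved capacity.
# $6.25/TB is the on-demand price from the table above; the
# $2,000/month capacity budget is a hypothetical illustration.
ON_DEMAND_PER_TB = 6.25
MONTHLY_CAPACITY_BUDGET = 2000.0

def on_demand_cost(tb_scanned: float) -> float:
    """On-demand monthly cost: pay only for bytes scanned."""
    return tb_scanned * ON_DEMAND_PER_TB

def breakeven_tb() -> float:
    """TB scanned per month where reserved capacity matches on-demand."""
    return MONTHLY_CAPACITY_BUDGET / ON_DEMAND_PER_TB

print(on_demand_cost(100))  # 625.0 -> on-demand wins below break-even
print(breakeven_tb())       # 320.0 TB/month
```

Below the break-even volume, on-demand is cheaper; above it, reserved capacity caps your spend.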
Interview Question: BigQuery vs Redshift
Q: "When would you recommend BigQuery over Amazon Redshift?"
A:
| Factor | BigQuery | Redshift |
|---|---|---|
| Management | Serverless (no clusters) | Cluster management |
| Scaling | Automatic, instant | Manual/elastic resize (Redshift Serverless narrows the gap) |
| Pricing | Per TB scanned | Per node hour |
| Best For | Variable/ad-hoc queries | Predictable, steady workloads |
| Streaming | Native ($0.05/GB) | Kinesis integration required |
| ML Integration | BigQuery ML built-in | SageMaker integration |
Choose BigQuery when:
- Unknown or variable query patterns
- Team wants zero ops overhead
- Need built-in ML capabilities
- Real-time streaming analytics required
BigQuery Best Practices
Cost Optimization:
-- Use partitioning to reduce scanned data
CREATE TABLE `myproject.mydataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;
-- Preview query cost before running
-- Click "More" → "Query Settings" → "Maximum bytes billed"
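The same cost preview is available programmatically via the `google-cloud-bigquery` client's dry-run mode. A sketch, where the project and SQL are placeholders:

```python
PRICE_PER_TB = 6.25  # on-demand list price, per the pricing table above

def bytes_to_usd(total_bytes: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """Convert a scanned-bytes estimate into on-demand dollars."""
    return total_bytes / 10**12 * price_per_tb

def estimate_query_cost(sql: str, project: str) -> float:
    """Dry-run a query: BigQuery validates it and reports the bytes it
    would scan, without executing it or billing anything."""
    from google.cloud import bigquery  # requires google-cloud-bigquery
    client = bigquery.Client(project=project)
    job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
    return bytes_to_usd(job.total_bytes_processed)
```

Running the dry-run before an expensive query is a cheap guardrail; combined with `maximum_bytes_billed`, it prevents accidental full-table scans.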
Performance Optimization:
- Partition by date/timestamp columns
- Cluster by high-cardinality filter columns
- Avoid SELECT * (specify columns)
- Use materialized views for common aggregations
Cloud Pub/Sub: Messaging & Streaming
Google's managed messaging service, similar to AWS SNS + SQS combined.
Key Characteristics
| Feature | Pub/Sub | AWS Equivalent |
|---|---|---|
| Model | Publish-subscribe | SNS + SQS combined |
| Ordering | Optional (per-key) | SQS FIFO |
| Retention | 7 days default (configurable to 31) | 14 days max (SQS) |
| Dead Letter | Supported | Supported |
| Push/Pull | Both | SNS push, SQS pull |
Pub/Sub Architecture Patterns
Event-Driven Architecture:
Publishers → Topic → Subscriptions → Subscribers
├── Pull Subscription → Cloud Functions
├── Push Subscription → Cloud Run
└── BigQuery Subscription → BigQuery (direct)
BigQuery Subscriptions (unique to GCP): Write messages directly to BigQuery without code.
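Creating one takes only an admin call with a `bigquery_config` on the subscription. A sketch that builds the request body (project, topic, and table names are placeholders):

```python
def bigquery_subscription_request(project: str, topic: str,
                                  sub: str, table: str) -> dict:
    """Build a create_subscription request that streams messages straight
    into a BigQuery table (table format: "project.dataset.table")."""
    return {
        "name": f"projects/{project}/subscriptions/{sub}",
        "topic": f"projects/{project}/topics/{topic}",
        "bigquery_config": {"table": table, "write_metadata": True},
    }

# Pass the dict to pubsub_v1.SubscriberClient().create_subscription(request=...)
req = bigquery_subscription_request(
    "my-proj", "events", "events-to-bq", "my-proj.analytics.events"
)
print(req["bigquery_config"]["table"])  # my-proj.analytics.events
```

With `write_metadata=True`, Pub/Sub also writes message attributes and publish timestamps alongside the payload.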
Interview Question: Message Ordering
Q: "How do you guarantee message ordering in Pub/Sub?"
A: Use ordering keys:
# Publisher -- ordering must be enabled on the client itself
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(
        enable_message_ordering=True
    )
)
topic_path = publisher.topic_path(project, topic)

# Messages with the same ordering_key are delivered in publish order
publisher.publish(
    topic_path,
    data=b"message",
    ordering_key="user-123",  # all user-123 messages in order
)
Important: Ordering is per-subscription, per-ordering-key, and the subscription itself must be created with message ordering enabled. Messages with different keys may arrive in any relative order.
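The guarantee can be pictured with a toy model: group deliveries by ordering key, and only the within-key sequence is fixed. A pure-Python illustration of the semantics (not the Pub/Sub API):

```python
from collections import defaultdict

def per_key_order(published):
    """Group (ordering_key, payload) pairs: within a key, publish order is
    preserved; across keys, Pub/Sub makes no relative-order guarantee."""
    per_key = defaultdict(list)
    for key, payload in published:
        per_key[key].append(payload)
    return dict(per_key)

published = [
    ("user-123", "login"), ("user-456", "login"),
    ("user-123", "purchase"), ("user-123", "logout"),
]
print(per_key_order(published)["user-123"])  # ['login', 'purchase', 'logout']
```

However user-456's messages interleave, user-123 always sees login before purchase before logout.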
Dataflow: Stream & Batch Processing
GCP's managed Apache Beam service.
When to Use Dataflow
| Scenario | Dataflow | BigQuery |
|---|---|---|
| Real-time transformation | Yes | Limited (streaming inserts) |
| Complex windowing | Yes | No |
| Cross-service ETL | Yes | Limited |
| ML inference pipeline | Yes | BigQuery ML only |
| Cost at scale | Higher | Lower for pure analytics |
Common Dataflow Patterns
Streaming ETL:
Pub/Sub → Dataflow (transform, enrich, window) → BigQuery
Batch Processing:
Cloud Storage (CSV/JSON) → Dataflow → BigQuery/Bigtable
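The "complex windowing" that sets Dataflow apart can be illustrated with its simplest case, fixed windows. The sketch below mimics Beam's fixed-window assignment in plain Python (a conceptual model, not the Beam API):

```python
from collections import defaultdict

def fixed_windows(events, window_secs=60):
    """Assign (timestamp, value) events to fixed, non-overlapping windows
    keyed by window start -- the simplest Beam/Dataflow windowing strategy."""
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - (ts % window_secs)
        windows[start].append(value)
    return dict(windows)

events = [(5, "a"), (61, "b"), (119, "c"), (120, "d")]
print(fixed_windows(events))  # {0: ['a'], 60: ['b', 'c'], 120: ['d']}
```

Beam generalizes this to sliding and session windows, plus triggers and late-data handling, which plain BigQuery SQL does not offer on streams.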
Vertex AI: Unified ML Platform
Google's managed ML platform, competing with AWS SageMaker.
Vertex AI Components
| Component | Purpose | AWS Equivalent |
|---|---|---|
| Workbench | Managed notebooks | SageMaker Studio |
| Training | Custom model training | SageMaker Training |
| Prediction | Model serving | SageMaker Endpoints |
| Pipelines | ML workflow orchestration | SageMaker Pipelines |
| Feature Store | Feature management | SageMaker Feature Store |
| Model Garden | Pre-trained models | SageMaker JumpStart |
| Gemini API | Foundation models | Bedrock |
Interview Question: Vertex AI vs SageMaker
Q: "What are the strengths of Vertex AI compared to SageMaker?"
A:
Vertex AI Strengths:
- Tighter BigQuery integration (direct training from tables)
- AutoML more mature (Google's ML heritage)
- Gemini models for generative AI
- Simpler pricing model
- Better integration with data stack (BigQuery, Dataflow)
SageMaker Strengths:
- Larger ecosystem of built-in algorithms
- More deployment options (edge, batch, async)
- Better multi-account governance
- More mature MLOps features
- Wider third-party integration
Data Architecture Decision Tree
Analytics/BI workload?
└── Yes → BigQuery (serverless, cost-effective)
    └── Need real-time transformation? → Dataflow + BigQuery

Messaging/Events?
├── Simple pub/sub → Pub/Sub
├── Need strong ordering → Pub/Sub with ordering keys
└── Direct to BigQuery → BigQuery Subscription

ML/AI?
├── Tabular data in BigQuery → BigQuery ML
├── Custom training → Vertex AI Training
└── Foundation models → Gemini API / Model Garden
Pro Tip: GCP's data services are deeply integrated. A common pattern is: Pub/Sub → Dataflow → BigQuery → Vertex AI. This integration is stronger than AWS equivalents.
Next, we'll explore Azure core services.