Building a Reliable Logging Infrastructure from Scratch
December 27, 2025
TL;DR
- A reliable logging infrastructure is essential for debugging, observability, and compliance.
- Centralized log collection and structured formats (like JSON) make analysis far easier.
- Use log shippers (e.g., Fluentd, Logstash) to aggregate logs from multiple services.
- Prioritize security: encrypt logs in transit and control access tightly.
- Build for scale — from local development to multi-region production systems.
What You’ll Learn
- How to design a logging infrastructure that scales from a small team to enterprise level.
- The differences between various logging architectures (agent-based, sidecar, centralized).
- How to set up collection, transport, storage, and visualization layers.
- How to use Python’s modern logging configuration in production.
- Common pitfalls, performance considerations, and security best practices.
Prerequisites
You should be comfortable with:
- Basic system administration (Linux or containerized environments)
- Cloud environments (AWS, GCP, or Azure)
- Basic Python knowledge for code examples
Logs are the breadcrumbs of your systems — they tell the story of what’s happening inside your services. Whether you’re debugging a failing API, monitoring performance, or auditing user actions, a well-designed logging infrastructure is your best ally.
However, logging isn’t just about writing messages to a file. In modern distributed systems, logs come from containers, microservices, serverless functions, and edge devices. Without a structured, scalable approach, logs quickly become noise.
This post walks you through setting up a robust, secure, and scalable logging infrastructure — the kind that powers large-scale production systems.
Understanding Logging Infrastructure
A logging infrastructure typically has four layers:
- Collection – Gathering logs from applications, containers, and systems.
- Transport – Shipping logs to a central location.
- Storage – Indexing and storing logs efficiently.
- Analysis & Visualization – Searching, alerting, and deriving insights.
Here’s a high-level view of how these layers interact:
graph TD
A[Applications] --> B[Log Shipper]
B --> C[Central Log Collector]
C --> D[Storage Backend]
D --> E[Visualization / Query Layer]
Each layer can be implemented with different tools — for example, Fluent Bit for collection, Kafka for transport, Elasticsearch for storage, and Kibana or Grafana for visualization.
Step-by-Step: Building a Logging Pipeline
Let’s build a minimal but production-ready logging pipeline using open-source components.
Step 1: Generate Structured Logs
The first step is to ensure that your applications produce structured logs. JSON is the most common format because it’s machine-readable and easy to parse.
Here’s a Python example using the built-in logging module with dictConfig() for structured output:
import logging
import logging.config
import json
LOGGING_CONFIG = {
    'version': 1,
    'formatters': {
        'json': {
            'format': ('{"timestamp": "%(asctime)s", "level": "%(levelname)s", '
                       '"message": "%(message)s", "module": "%(module)s"}')
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'json'
        }
    },
    'root': {
        'handlers': ['console'],
        'level': 'INFO'
    }
}
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)
logger.info("User login successful", extra={"user_id": 42})
Output:
{"timestamp": "2025-03-04 12:45:21,123", "level": "INFO", "message": "User login successful", "module": "auth"}
Structured logs make it trivial for downstream systems (like Elasticsearch or Loki) to parse and index data.
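One caveat: the format-string approach above drops fields passed through extra, so the user_id in the example never reaches the output. A custom formatter that serializes the record with json.dumps can carry selected extras along. Here is a minimal sketch, assuming you only need a handful of known extra fields; the JsonFormatter class and field names are illustrative, not part of the standard library:
import json
import logging

class JsonFormatter(logging.Formatter):
    """Illustrative formatter: one JSON object per record, including selected extras."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Anything passed via `extra=` becomes an attribute on the record.
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User login successful", extra={"user_id": 42})
Packages such as python-json-logger implement the same idea more completely (arbitrary extras, exception info, and so on).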
Step 2: Collect Logs with Fluent Bit
Fluent Bit is a lightweight log shipper that collects and forwards logs from multiple sources.
Example configuration (fluent-bit.conf):
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  json

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch
    Port    9200
    Index   app-logs
Start Fluent Bit as a container:
docker run -v $(pwd)/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro \
  -v /var/log/app:/var/log/app:ro \
  fluent/fluent-bit:latest
This configuration tails all .log files under /var/log/app (mounted into the container), parses them as JSON, and sends them to Elasticsearch. The Host elasticsearch value assumes the Fluent Bit container can resolve the Elasticsearch host, for example over a shared Docker network.
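For Fluent Bit to have something to tail, the application needs to write its JSON lines to the path configured above. A minimal sketch, assuming /var/log/app/ exists and is writable; RotatingFileHandler is from the standard library:
import logging
import logging.handlers

# Write JSON lines to the same path the [INPUT] tail section watches.
handler = logging.handlers.RotatingFileHandler(
    "/var/log/app/app.log",
    maxBytes=10 * 1024 * 1024,  # rotate at ~10 MB so individual files stay manageable
    backupCount=5,
)
handler.setFormatter(logging.Formatter(
    '{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}'
))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment processed")
Fluent Bit's tail input is designed to follow rotated files (see its Rotate_Wait option), so application-side rotation and shipping can coexist.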
Step 3: Store Logs in Elasticsearch
Elasticsearch is a distributed, full-text search engine well suited to log storage[^1]. It indexes each JSON field, enabling fast queries and aggregations.
Example query:
curl -X GET "http://localhost:9200/app-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match": { "message": "error" }
}
}'
Example output (abridged):
{
"hits": {
"total": 3,
"hits": [
{"_source": {"message": "database connection error"}},
{"_source": {"message": "timeout error"}},
{"_source": {"message": "authentication error"}}
]
}
}
Step 4: Visualize with Kibana
Kibana is the visualization layer for Elasticsearch. It lets you build dashboards, set alerts, and visualize trends.
For example, you can create a dashboard showing:
- Error rates over time
- Top 10 endpoints by latency
- User actions per region
This is where logs turn into actionable insights.
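Under the hood, a panel such as "error rates over time" corresponds to an Elasticsearch date_histogram aggregation. Here is a rough sketch of the equivalent query issued from Python with the requests library, assuming the timestamp field is mapped as a date and log levels are indexed in a level field:
import requests

query = {
    "size": 0,  # only the aggregation buckets are needed, not the matching documents
    "query": {"match": {"level": "ERROR"}},
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "1h"}
        }
    },
}
resp = requests.post("http://localhost:9200/app-logs/_search", json=query, timeout=10)
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])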
Comparison: Common Logging Architectures
| Architecture Type | Description | Pros | Cons |
|---|---|---|---|
| Agent-based | Each host runs a log collector (e.g., Fluent Bit). | Simple, scalable, fault-tolerant. | Requires agent maintenance. |
| Sidecar pattern | Each container has a dedicated log collector. | Isolation per service, good for Kubernetes. | More resource usage. |
| Centralized collector | Logs are streamed to a central service (e.g., syslog). | Easier management. | Single point of failure if not replicated. |
| Serverless logging | Cloud-native logging (e.g., AWS CloudWatch). | No infrastructure to manage. | Vendor lock-in, limited flexibility. |
When to Use vs When NOT to Use
| Scenario | Use Logging Infrastructure | Avoid / Use Alternatives |
|---|---|---|
| Multi-service distributed apps | ✅ Centralized logging is essential | ❌ Local logs only |
| Compliance or audit requirements | ✅ Structured, immutable logs | ❌ Ephemeral logs |
| Small local projects | ❌ Overkill; simple file logs suffice | ✅ Simpler setup |
| Serverless-only workloads | ⚙️ Use managed logging (CloudWatch, Stackdriver) | ❌ Self-managed stack |
Real-World Example: Netflix and Observability
Netflix uses structured logging and centralized ingestion pipelines to handle high-volume telemetry data, enabling engineers to trace user requests across multiple services efficiently.
Common Pitfalls & Solutions
| Pitfall | Description | Solution |
|---|---|---|
| Unstructured logs | Hard to parse and index. | Use JSON or key-value formats. |
| Log overload | Too much noise, expensive storage. | Implement log levels and retention policies. |
| Missing context | Logs lack correlation IDs. | Include request IDs or trace IDs in every log (see the sketch after this table). |
| Security leaks | Sensitive data in logs. | Sanitize logs before shipping. |
| Slow queries | Elasticsearch indices too large. | Use index lifecycle management. |
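For the missing-context pitfall, one common pattern is a logging filter backed by a contextvars.ContextVar: set the correlation ID once at the start of a request (for example in middleware), and every log record on that code path picks it up automatically. A minimal sketch; the request_id naming is illustrative:
import contextvars
import logging

# Holds the correlation ID for the current request or task.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter(
    '{"request_id": "%(request_id)s", "level": "%(levelname)s", "message": "%(message)s"}'
))
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id_var.set("req-7f3a")  # typically done once per incoming request
logger.info("fetching user profile")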
Performance Considerations
- Batching: Shippers like Fluent Bit batch logs before sending, reducing network overhead (an application-level batching sketch follows at the end of this section).
- Compression: Use gzip or zstd for log transport to cut bandwidth.
- Indexing strategy: Rotate indices daily or hourly to improve query performance.
- Retention policy: Archive older logs to S3 or Glacier for cost control.
Benchmarks commonly show that batching and compression can significantly reduce ingestion latency in I/O-bound workloads[^1].
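Batching can also happen inside the application, before records ever reach the shipper. The standard library's MemoryHandler buffers records in memory and flushes them to a target handler in bulk; a small sketch follows, with capacity and flushLevel values chosen purely for illustration:
import logging
import logging.handlers

# The real destination for records (here, the file that Fluent Bit tails).
target = logging.FileHandler("/var/log/app/app.log")

# Buffer up to 200 records; flush early if a record is ERROR or worse.
buffered = logging.handlers.MemoryHandler(
    capacity=200,
    flushLevel=logging.ERROR,
    target=target,
)

logger = logging.getLogger("app")
logger.addHandler(buffered)
logger.setLevel(logging.INFO)

for i in range(500):
    logger.info("processed item %d", i)  # written to disk in batches
buffered.flush()  # flush whatever remains, e.g. at shutdown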
Security Considerations
Security in logging is often overlooked. Follow these best practices:
- Encrypt in transit – Use TLS for all log transport.
- Encrypt at rest – Enable disk encryption for Elasticsearch or S3 buckets.
- Access control – Use role-based access control (RBAC) for log viewers.
- Mask sensitive data – Never log passwords, tokens, or PII (see the redaction sketch after this list).
- Audit logging – Keep immutable audit trails for compliance.
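Masking is easiest to enforce with a logging filter that redacts known patterns before a record is formatted. A minimal sketch; the regex below only covers an illustrative token=... pattern, and real deployments need patterns tuned to their own data:
import logging
import re

# Illustrative pattern: mask anything that looks like "token=<value>".
TOKEN_PATTERN = re.compile(r"(token=)\S+")

class RedactionFilter(logging.Filter):
    def filter(self, record):
        record.msg = TOKEN_PATTERN.sub(r"\1[REDACTED]", str(record.msg))
        return True

handler = logging.StreamHandler()
handler.addFilter(RedactionFilter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("callback received with token=abc123secret")
# -> callback received with token=[REDACTED]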
Scalability Insights
As your infrastructure grows:
- Horizontal scaling: Add more log shippers and storage nodes.
- Partitioning: Split logs by service or region.
- Queue buffering: Use Kafka or AWS Kinesis between collection and storage.
- Caching: Cache frequent queries in Kibana.
Large-scale systems often adopt a multi-tier pipeline: Fluent Bit → Kafka → Elasticsearch[^2]. This decouples ingestion from storage, improving resilience.
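For reference, pushing application logs into such a buffer from Python might look like the sketch below, using the kafka-python package (an assumption; confluent-kafka is a common alternative) and an illustrative app-logs topic:
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    # Serialize each log record dict to JSON bytes before sending.
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

producer.send("app-logs", {"level": "ERROR", "message": "database connection error"})
producer.flush()  # block until buffered messages are delivered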
Testing Logging Infrastructure
Testing ensures reliability under load.
Unit Testing Log Output
import logging
from io import StringIO

def test_json_logging():
    stream = StringIO()
    handler = logging.StreamHandler(stream)
    formatter = logging.Formatter('{"msg": "%(message)s"}')
    handler.setFormatter(formatter)
    logger = logging.getLogger('test')
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info('hello')
    assert 'hello' in stream.getvalue()
Integration Testing
- Simulate high log volume and verify ingestion speed.
- Test Elasticsearch queries for correctness.
- Validate retention and deletion policies.
Error Handling Patterns
When log shippers fail or storage is unavailable:
- Retry with backoff – Avoid overwhelming downstream systems.
- Fallback to local disk – Buffer logs temporarily.
- Dead letter queues – Capture malformed logs.
Example Fluent Bit retry configuration (Fluent Bit retries failed chunks with a backoff schedule by default; setting Retry_Limit to False removes the cap on retry attempts):
[OUTPUT]
    Name        es
    Match       *
    Retry_Limit False
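The same ideas apply when a service ships logs to a remote endpoint itself: retry with exponential backoff, then fall back to a local spool file that can be replayed later. A rough sketch; the endpoint URL and spool path are illustrative:
import json
import time
import requests

def ship_log(record, url="http://collector:9200/app-logs/_doc",
             spool_path="/var/spool/app-logs.jsonl", retries=3):
    """Try to deliver one log record, backing off between attempts and spooling to disk on failure."""
    for attempt in range(retries):
        try:
            requests.post(url, json=record, timeout=2).raise_for_status()
            return True
        except requests.RequestException:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s exponential backoff
    # Fallback / dead letter: append the record locally for later replay.
    with open(spool_path, "a") as spool:
        spool.write(json.dumps(record) + "\n")
    return False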
Monitoring & Observability Tips
- Use Prometheus metrics for log pipeline health.
- Monitor ingestion rate, error rate, and queue depth.
- Set alerts for missing logs (silence detection).
- Visualize pipeline latency in Grafana.
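If part of the pipeline lives in your own code (shipping, spooling, retries), it can expose its health the same way the shippers do. A small sketch using the prometheus_client package (an assumption; the metric names are illustrative):
from prometheus_client import Counter, Gauge, start_http_server

# Metrics a Prometheus server can scrape to watch pipeline health.
LOGS_SHIPPED = Counter("log_records_shipped", "Log records delivered downstream")
LOGS_FAILED = Counter("log_records_failed", "Log records that could not be delivered")
QUEUE_DEPTH = Gauge("log_queue_depth", "Records currently buffered locally")

start_http_server(9100)  # expose /metrics on port 9100

def record_shipment(success, queued):
    (LOGS_SHIPPED if success else LOGS_FAILED).inc()
    QUEUE_DEPTH.set(queued)
Silence detection then becomes an alert on the shipped counter failing to increase over some window.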
Common Mistakes Everyone Makes
- Logging too much debug data in production.
- Forgetting to rotate or archive logs.
- Ignoring structured formats.
- Mixing stdout and file logs inconsistently.
- Not testing under production-like load.
Troubleshooting Guide
| Problem | Possible Cause | Fix |
|---|---|---|
| Logs missing from Kibana | Fluent Bit misconfiguration | Check output match rules |
| High Elasticsearch CPU | Oversized shards | Reduce shard count |
| Duplicate logs | Multiple collectors reading same file | Use unique tags |
| Delayed logs | Network congestion | Enable batching and compression |
| Sensitive data exposure | Incomplete masking | Apply regex filters |
Try It Yourself Challenge
- Set up a local Elasticsearch + Kibana stack.
- Configure Fluent Bit to ship logs from a Python app.
- Add a correlation ID to each log entry.
- Create a Kibana dashboard showing error trends.
Key Takeaways
Logging is not an afterthought — it’s your system’s memory.
- Start with structured logs.
- Centralize collection and analysis.
- Secure and scale your pipeline.
- Continuously monitor and optimize.
FAQ
Q1: Should I log everything?
No. Log strategically — focus on errors, warnings, and key business events.
Q2: How long should I retain logs?
Depends on compliance needs. Typically 30–90 days for operational logs, longer for audits.
Q3: What’s the difference between metrics and logs?
Metrics are aggregated numeric data; logs are detailed event records.
Q4: How do I handle logs from Kubernetes?
Use DaemonSets with Fluent Bit or Fluentd to collect container logs.
Q5: How do I reduce storage costs?
Compress, archive, and apply retention policies.
Next Steps
- Experiment with Grafana Loki as a lightweight alternative to Elasticsearch.
- Explore OpenTelemetry for unified observability.
- Automate log ingestion tests in CI/CD.
Footnotes
[^1]: Elasticsearch Documentation – https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
[^2]: Fluent Bit Official Docs – https://docs.fluentbit.io/manual/