Building a Reliable Logging Infrastructure from Scratch
December 27, 2025
TL;DR
- A reliable logging infrastructure is essential for debugging, observability, and compliance.
- Centralized log collection and structured formats (like JSON) make analysis far easier.
- Use log shippers (e.g., Fluentd, Logstash) to aggregate logs from multiple services.
- Prioritize security: encrypt logs in transit and control access tightly.
- Build for scale — from local development to multi-region production systems.
What You’ll Learn
- How to design a logging infrastructure that scales from a small team to enterprise level.
- The differences between various logging architectures (agent-based, sidecar, centralized).
- How to set up collection, transport, storage, and visualization layers.
- How to use Python’s modern logging configuration in production.
- Common pitfalls, performance considerations, and security best practices.
Prerequisites
You should be comfortable with:
- Basic system administration (Linux or containerized environments)
- Cloud environments (AWS, GCP, or Azure)
- Basic Python knowledge for code examples
Logs are the breadcrumbs of your systems — they tell the story of what’s happening inside your services. Whether you’re debugging a failing API, monitoring performance, or auditing user actions, a well-designed logging infrastructure is your best ally.
However, logging isn’t just about writing messages to a file. In modern distributed systems, logs come from containers, microservices, serverless functions, and edge devices. Without a structured, scalable approach, logs quickly become noise.
This post walks you through setting up a robust, secure, and scalable logging infrastructure — the kind that powers large-scale production systems.
Understanding Logging Infrastructure
A logging infrastructure typically has four layers:
- Collection – Gathering logs from applications, containers, and systems.
- Transport – Shipping logs to a central location.
- Storage – Indexing and storing logs efficiently.
- Analysis & Visualization – Searching, alerting, and deriving insights.
Here’s a high-level view of how these layers interact:
graph TD
A[Applications] --> B[Log Shipper]
B --> C[Central Log Collector]
C --> D[Storage Backend]
D --> E[Visualization / Query Layer]
Each layer can be implemented with different tools — for example, Fluent Bit for collection, Kafka for transport, Elasticsearch for storage, and Kibana or Grafana for visualization.
Step-by-Step: Building a Logging Pipeline
Let’s build a minimal but production-ready logging pipeline using open-source components.
Step 1: Generate Structured Logs
The first step is to ensure that your applications produce structured logs. JSON is the most common format because it’s machine-readable and easy to parse.
Here’s a Python example using the built-in logging module with dictConfig() for structured output:
import logging
import logging.config
import json
LOGGING_CONFIG = {
    'version': 1,
    'formatters': {
        'json': {
            'format': ('{"timestamp": "%(asctime)s", "level": "%(levelname)s", '
                       '"message": "%(message)s", "module": "%(module)s"}')
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'json'
        }
    },
    'root': {
        'handlers': ['console'],
        'level': 'INFO'
    }
}
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)
logger.info("User login successful", extra={"user_id": 42})
Output:
{"timestamp": "2025-03-04 12:45:21,123", "level": "INFO", "message": "User login successful", "module": "auth"}
Structured logs make it trivial for downstream systems (like Elasticsearch or Loki) to parse and index data.
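One caveat: the format-string approach above drops fields passed through extra, so the user_id in the example never reaches the output. A custom formatter that serializes the record with json.dumps can carry selected extras along. Here is a minimal sketch, assuming you only need a handful of known extra fields; the JsonFormatter class and field names are illustrative, not part of the standard library:
import json
import logging

class JsonFormatter(logging.Formatter):
    """Illustrative formatter: one JSON object per record, including selected extras."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Anything passed via `extra=` becomes an attribute on the record.
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User login successful", extra={"user_id": 42})
Packages such as python-json-logger implement the same idea more completely (arbitrary extras, exception info, and so on).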
Step 2: Collect Logs with Fluent Bit
Fluent Bit is a lightweight log shipper that collects and forwards logs from multiple sources.
Example configuration (fluent-bit.conf):
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  json

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch
    Port    9200
    Index   app-logs
Start Fluent Bit as a container:
docker run -v $(pwd)/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro \
  -v /var/log/app:/var/log/app:ro \
  fluent/fluent-bit:latest
This configuration tails all .log files under /var/log/app (mounted into the container), parses them as JSON, and sends them to Elasticsearch. The Host elasticsearch value assumes the Fluent Bit container can resolve the Elasticsearch host, for example over a shared Docker network.
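For Fluent Bit to have something to tail, the application needs to write its JSON lines to the path configured above. A minimal sketch, assuming /var/log/app/ exists and is writable; RotatingFileHandler is from the standard library:
import logging
import logging.handlers

# Write JSON lines to the same path the [INPUT] tail section watches.
handler = logging.handlers.RotatingFileHandler(
    "/var/log/app/app.log",
    maxBytes=10 * 1024 * 1024,  # rotate at ~10 MB so individual files stay manageable
    backupCount=5,
)
handler.setFormatter(logging.Formatter(
    '{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}'
))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("payment processed")
Fluent Bit's tail input is designed to follow rotated files (see its Rotate_Wait option), so application-side rotation and shipping can coexist.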
Step 3: Store Logs in Elasticsearch
Elasticsearch is a distributed, full-text search engine well suited to log storage[^1]. It indexes each JSON field, enabling fast queries and aggregations.
Example query:
curl -X GET "http://localhost:9200/app-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match": { "message": "error" }
}
}'
Example output (abridged):
{
"hits": {
"total": 3,
"hits": [
{"_source": {"message": "database connection error"}},
{"_source": {"message": "timeout error"}},
{"_source": {"message": "authentication error"}}
]
}
}
Step 4: Visualize with Kibana
Kibana is the visualization layer for Elasticsearch. It lets you build dashboards, set alerts, and visualize trends.
For example, you can create a dashboard showing:
- Error rates over time
- Top 10 endpoints by latency
- User actions per region
This is where logs turn into actionable insights.
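Under the hood, a panel such as "error rates over time" corresponds to an Elasticsearch date_histogram aggregation. Here is a rough sketch of the equivalent query issued from Python with the requests library, assuming the timestamp field is mapped as a date and log levels are indexed in a level field:
import requests

query = {
    "size": 0,  # only the aggregation buckets are needed, not the matching documents
    "query": {"match": {"level": "ERROR"}},
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "1h"}
        }
    },
}
resp = requests.post("http://localhost:9200/app-logs/_search", json=query, timeout=10)
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])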
Comparison: Common Logging Architectures
| Architecture Type | Description | Pros | Cons |
|---|---|---|---|
| Agent-based | Each host runs a log collector (e.g., Fluent Bit). | Simple, scalable, fault-tolerant. | Requires agent maintenance. |
| Sidecar pattern | Each container has a dedicated log collector. | Isolation per service, good for Kubernetes. | More resource usage. |
| Centralized collector | Logs are streamed to a central service (e.g., syslog). | Easier management. | Single point of failure if not replicated. |
| Serverless logging | Cloud-native logging (e.g., AWS CloudWatch). | No infrastructure to manage. | Vendor lock-in, limited flexibility. |
When to Use vs When NOT to Use
| Scenario | Use Logging Infrastructure | Avoid / Use Alternatives |
|---|---|---|
| Multi-service distributed apps | ✅ Centralized logging is essential | ❌ Local logs only |
| Compliance or audit requirements | ✅ Structured, immutable logs | ❌ Ephemeral logs |
| Small local projects | ❌ Overkill; simple file logs suffice | ✅ Simpler setup |
| Serverless-only workloads | ⚙️ Use managed logging (CloudWatch, Stackdriver) | ❌ Self-managed stack |
Real-World Example: Netflix and Observability
Netflix uses structured logging and centralized ingestion pipelines to handle high-volume telemetry data, enabling engineers to trace user requests across multiple services efficiently.
Common Pitfalls & Solutions
| Pitfall | Description | Solution |
|---|---|---|
| Unstructured logs | Hard to parse and index. | Use JSON or key-value formats. |
| Log overload | Too much noise, expensive storage. | Implement log levels and retention policies. |
| Missing context | Logs lack correlation IDs. | Include request IDs or trace IDs in every log (see the sketch after this table). |
| Security leaks | Sensitive data in logs. | Sanitize logs before shipping. |
| Slow queries | Elasticsearch indices too large. | Use index lifecycle management. |
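For the missing-context pitfall, one common pattern is a logging filter backed by a contextvars.ContextVar: set the correlation ID once at the start of a request (for example in middleware), and every log record on that code path picks it up automatically. A minimal sketch; the request_id naming is illustrative:
import contextvars
import logging

# Holds the correlation ID for the current request or task.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter(
    '{"request_id": "%(request_id)s", "level": "%(levelname)s", "message": "%(message)s"}'
))
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id_var.set("req-7f3a")  # typically done once per incoming request
logger.info("fetching user profile")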
Performance Considerations
- Batching: Shippers like Fluent Bit batch logs before sending, reducing network overhead (an application-level batching sketch follows at the end of this section).
- Compression: Use gzip or zstd for log transport to cut bandwidth.
- Indexing strategy: Rotate indices daily or hourly to improve query performance.
- Retention policy: Archive older logs to S3 or Glacier for cost control.
Benchmarks commonly show that batching and compression can significantly reduce ingestion latency in I/O-bound workloads[^1].
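Batching can also happen inside the application, before records ever reach the shipper. The standard library's MemoryHandler buffers records in memory and flushes them to a target handler in bulk; a small sketch follows, with capacity and flushLevel values chosen purely for illustration:
import logging
import logging.handlers

# The real destination for records (here, the file that Fluent Bit tails).
target = logging.FileHandler("/var/log/app/app.log")

# Buffer up to 200 records; flush early if a record is ERROR or worse.
buffered = logging.handlers.MemoryHandler(
    capacity=200,
    flushLevel=logging.ERROR,
    target=target,
)

logger = logging.getLogger("app")
logger.addHandler(buffered)
logger.setLevel(logging.INFO)

for i in range(500):
    logger.info("processed item %d", i)  # written to disk in batches
buffered.flush()  # flush whatever remains, e.g. at shutdown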
Security Considerations
Security in logging is often overlooked. Follow these best practices:
- Encrypt in transit – Use TLS for all log transport.
- Encrypt at rest – Enable disk encryption for Elasticsearch or S3 buckets.
- Access control – Use role-based access control (RBAC) for log viewers.
- Mask sensitive data – Never log passwords, tokens, or PII (see the redaction sketch after this list).
- Audit logging – Keep immutable audit trails for compliance.
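Masking is easiest to enforce with a logging filter that redacts known patterns before a record is formatted. A minimal sketch; the regex below only covers an illustrative token=... pattern, and real deployments need patterns tuned to their own data:
import logging
import re

# Illustrative pattern: mask anything that looks like "token=<value>".
TOKEN_PATTERN = re.compile(r"(token=)\S+")

class RedactionFilter(logging.Filter):
    def filter(self, record):
        record.msg = TOKEN_PATTERN.sub(r"\1[REDACTED]", str(record.msg))
        return True

handler = logging.StreamHandler()
handler.addFilter(RedactionFilter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("callback received with token=abc123secret")
# -> callback received with token=[REDACTED]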
Scalability Insights
As your infrastructure grows:
- Horizontal scaling: Add more log shippers and storage nodes.
- Partitioning: Split logs by service or region.
- Queue buffering: Use Kafka or AWS Kinesis between collection and storage.
- Caching: Cache frequent queries in Kibana.
Large-scale systems often adopt a multi-tier pipeline: Fluent Bit → Kafka → Elasticsearch[^2]. This decouples ingestion from storage, improving resilience.
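For reference, pushing application logs into such a buffer from Python might look like the sketch below, using the kafka-python package (an assumption; confluent-kafka is a common alternative) and an illustrative app-logs topic:
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    # Serialize each log record dict to JSON bytes before sending.
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

producer.send("app-logs", {"level": "ERROR", "message": "database connection error"})
producer.flush()  # block until buffered messages are delivered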
Testing Logging Infrastructure
Testing ensures reliability under load.
Unit Testing Log Output
import logging
from io import StringIO

def test_json_logging():
    stream = StringIO()
    handler = logging.StreamHandler(stream)
    formatter = logging.Formatter('{"msg": "%(message)s"}')
    handler.setFormatter(formatter)
    logger = logging.getLogger('test')
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info('hello')
    assert 'hello' in stream.getvalue()
Integration Testing
- Simulate high log volume and verify ingestion speed.
- Test Elasticsearch queries for correctness.
- Validate retention and deletion policies.
Error Handling Patterns
When log shippers fail or storage is unavailable:
- Retry with backoff – Avoid overwhelming downstream systems.
- Fallback to local disk – Buffer logs temporarily.
- Dead letter queues – Capture malformed logs.
Example Fluent Bit retry configuration (Fluent Bit retries failed chunks with a backoff schedule by default; setting Retry_Limit to False removes the cap on retry attempts):
[OUTPUT]
    Name        es
    Match       *
    Retry_Limit False
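The same ideas apply when a service ships logs to a remote endpoint itself: retry with exponential backoff, then fall back to a local spool file that can be replayed later. A rough sketch; the endpoint URL and spool path are illustrative:
import json
import time
import requests

def ship_log(record, url="http://collector:9200/app-logs/_doc",
             spool_path="/var/spool/app-logs.jsonl", retries=3):
    """Try to deliver one log record, backing off between attempts and spooling to disk on failure."""
    for attempt in range(retries):
        try:
            requests.post(url, json=record, timeout=2).raise_for_status()
            return True
        except requests.RequestException:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s exponential backoff
    # Fallback / dead letter: append the record locally for later replay.
    with open(spool_path, "a") as spool:
        spool.write(json.dumps(record) + "\n")
    return False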
Monitoring & Observability Tips
- Use Prometheus metrics for log pipeline health.
- Monitor ingestion rate, error rate, and queue depth.
- Set alerts for missing logs (silence detection).
- Visualize pipeline latency in Grafana.
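If part of the pipeline lives in your own code (shipping, spooling, retries), it can expose its health the same way the shippers do. A small sketch using the prometheus_client package (an assumption; the metric names are illustrative):
from prometheus_client import Counter, Gauge, start_http_server

# Metrics a Prometheus server can scrape to watch pipeline health.
LOGS_SHIPPED = Counter("log_records_shipped", "Log records delivered downstream")
LOGS_FAILED = Counter("log_records_failed", "Log records that could not be delivered")
QUEUE_DEPTH = Gauge("log_queue_depth", "Records currently buffered locally")

start_http_server(9100)  # expose /metrics on port 9100

def record_shipment(success, queued):
    (LOGS_SHIPPED if success else LOGS_FAILED).inc()
    QUEUE_DEPTH.set(queued)
Silence detection then becomes an alert on the shipped counter failing to increase over some window.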
Common Mistakes Everyone Makes
- Logging too much debug data in production.
- Forgetting to rotate or archive logs.
- Ignoring structured formats.
- Mixing stdout and file logs inconsistently.
- Not testing under production-like load.
Troubleshooting Guide
| Problem | Possible Cause | Fix |
|---|---|---|
| Logs missing from Kibana | Fluent Bit misconfiguration | Check output match rules |
| High Elasticsearch CPU | Oversized shards | Reduce shard count |
| Duplicate logs | Multiple collectors reading same file | Use unique tags |
| Delayed logs | Network congestion | Enable batching and compression |
| Sensitive data exposure | Incomplete masking | Apply regex filters |
Try It Yourself Challenge
- Set up a local Elasticsearch + Kibana stack.
- Configure Fluent Bit to ship logs from a Python app.
- Add a correlation ID to each log entry.
- Create a Kibana dashboard showing error trends.
Key Takeaways
Logging is not an afterthought — it’s your system’s memory.
- Start with structured logs.
- Centralize collection and analysis.
- Secure and scale your pipeline.
- Continuously monitor and optimize.
FAQ
Q1: Should I log everything?
No. Log strategically — focus on errors, warnings, and key business events.
Q2: How long should I retain logs?
Depends on compliance needs. Typically 30–90 days for operational logs, longer for audits.
Q3: What’s the difference between metrics and logs?
Metrics are aggregated numeric data; logs are detailed event records.
Q4: How do I handle logs from Kubernetes?
Use DaemonSets with Fluent Bit or Fluentd to collect container logs.
Q5: How do I reduce storage costs?
Compress, archive, and apply retention policies.
Next Steps
- Experiment with Grafana Loki as a lightweight alternative to Elasticsearch.
- Explore OpenTelemetry for unified observability.
- Automate log ingestion tests in CI/CD.
Footnotes
[^1]: Elasticsearch Documentation – https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
[^2]: Fluent Bit Official Docs – https://docs.fluentbit.io/manual/