Log aggregation centralizes log streams from hundreds of services into a single searchable store, solving the "SSH to every host" problem. The ELK Stack (Elasticsearch + Logstash + Kibana) is among the most widely deployed self-managed solutions, while Grafana Loki reduces storage costs by indexing only a small set of labels rather than the full text of each line, much as Prometheus labels metric series. Cloud-native options (CloudWatch Logs on AWS, Azure Monitor Logs with Log Analytics workspaces, and Google Cloud Logging) collect logs from their own managed services with little or no agent setup. Datadog Log Management and Splunk offer enterprise-grade indexing with SIEM capabilities at significantly higher cost.
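
The difference between full-text indexing and label-only indexing can be illustrated with a toy model. This is a minimal sketch, not how Elasticsearch or Loki are implemented: the sample `logs`, the `inverted` and `streams` structures, and the `search_labels` helper are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (labels, log line) pairs.
logs = [
    ({"service": "api", "level": "error"}, "timeout calling payments for user 42"),
    ({"service": "api", "level": "info"}, "request served in 12 ms"),
    ({"service": "worker", "level": "error"}, "timeout reading queue"),
]

# Full-text style: one index entry per distinct token across all lines.
inverted = defaultdict(set)
for doc_id, (_labels, line) in enumerate(logs):
    for token in line.lower().split():
        inverted[token].add(doc_id)

# Label style: one index entry per distinct label set; lines are kept as
# unindexed chunks attached to that stream.
streams = defaultdict(list)
for labels, line in logs:
    streams[frozenset(labels.items())].append(line)


def search_labels(selector, needle):
    """Pick streams whose labels match, then scan their lines for a substring."""
    hits = []
    for key, lines in streams.items():
        if set(selector.items()) <= key:
            hits.extend(line for line in lines if needle in line)
    return hits


print(len(inverted), "token entries vs", len(streams), "stream entries")
print(search_labels({"level": "error"}, "timeout"))
```

In this toy model the token index grows with the vocabulary of the log text, while the label index grows only with the number of distinct label combinations; the price is that a text match has to scan the selected chunks at query time.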

Key Points

  • ELK Stack: Logstash/Filebeat ship logs → Elasticsearch indexes them in an inverted index → Kibana provides KQL (Kibana Query Language) search and dashboards; Elasticsearch requires careful shard sizing (Elastic's guidance is roughly 10 to 50 GB per shard)
  • Grafana Loki uses label-based indexing (service, env, level) and stores log chunks in object storage (S3/GCS); drastically cheaper than Elasticsearch for high-volume log retention
  • Log shipping agents: Fluentd and Fluent Bit (CNCF, low-memory), Filebeat (Elastic), Promtail (Loki); run as DaemonSets in Kubernetes
  • Retention tiers: hot (fast SSD, 7–30 days) → warm (slow disk, 30–90 days) → cold (S3 Glacier / object storage, years); Elasticsearch ILM policies or object-store lifecycle rules automate transitions
  • CloudWatch Logs Insights uses a purpose-built, pipe-delimited query language (commands such as fields, filter, stats, sort) with field auto-discovery; use one Log Group per service + environment so retention and cost can be set and attributed per group
  • Log-based metrics: transform log patterns into counter metrics (e.g., count of ERROR logs per service) to drive alerts without full-text search overhead
  • Cardinality trap: avoid using high-cardinality values (user IDs, request IDs) as Loki labels, because each distinct label combination creates a separate stream; put those values in the log body instead
  • GDPR/compliance: log retention policies must enforce deletion timelines; use log scrubbing pipelines to remove PII before long-term archival
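
Two of the points above, log-based metrics and PII scrubbing, can be sketched as small pipeline stages. This is a minimal illustration under stated assumptions: records are hypothetical `(service, level, message)` tuples, the only PII handled is email addresses matched by a simplified regex, and `scrub` and `error_counts` are illustrative names rather than functions from any log-shipping tool.

```python
import re
from collections import Counter

# Simplified email pattern (an assumption; real scrubbers cover many more PII types).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(message: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("<redacted-email>", message)


def error_counts(records):
    """Derive a per-service counter of ERROR records without indexing the text."""
    counts = Counter()
    for service, level, _message in records:
        if level == "ERROR":
            counts[service] += 1
    return counts


# Hypothetical input records.
records = [
    ("checkout", "ERROR", "payment failed for alice@example.com"),
    ("checkout", "INFO", "cart updated"),
    ("checkout", "ERROR", "payment gateway timeout"),
    ("search", "ERROR", "index unavailable"),
]

# Scrub before archival, then derive the metric from the scrubbed stream.
scrubbed = [(service, level, scrub(msg)) for service, level, msg in records]
metrics = error_counts(scrubbed)

print(scrubbed[0][2])
print(dict(metrics))
```

An alert can then be driven by the counter values (for example, a threshold on the `checkout` count per interval) instead of a recurring full-text query.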

Real-World Example

Airbnb migrated from ELK to a hybrid model using Kafka as a durable log bus with Elasticsearch for hot search and S3 + Presto for cold analytical queries, cutting log storage cost by 60% while maintaining sub-second search on recent logs.