How to Set Up the LGTM Stack with Prometheus and OpenTelemetry

Introduction

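Modern applications generate vast amounts of logs, metrics, and traces, and monitoring and troubleshooting them effectively requires an observability stack that handles all three. The LGTM stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics), combined with Prometheus and OpenTelemetry, provides that coverage. In this guide, Prometheus collects and stores metrics; Mimir can be added later as horizontally scalable, long-term storage for Prometheus metrics.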

In this guide, we will explore each component of the stack, its use cases, and how they work together.


Understanding the Components

1. Loki (Log Aggregation)

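Loki is a log aggregation system designed for efficiency. Unlike full-text indexing systems such as Elasticsearch in the ELK stack, Loki indexes only a small set of labels (metadata such as job or app) rather than the log content itself, and filters log content at query time. This keeps it lightweight and cost-effective.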

  • Use Cases:

    • Centralized log management
    • Debugging application failures
    • Searching logs efficiently with LogQL
  • Example LogQL Queries:

    {job="nginx"} |= "error"
    {app="myapp"} | json | request_time > 500
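The same LogQL queries can also be run from a script through Loki's HTTP API. The sketch below uses only the Python standard library to call the /loki/api/v1/query_range endpoint and flatten the returned log streams. It handles log queries (which return "streams" results), not metric queries; the address http://loki:3100 is an assumption (3100 is Loki's default HTTP port), and the helper names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Hypothetical address; 3100 is Loki's default HTTP port
LOKI_URL = "http://loki:3100"

def build_query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int,
                          limit: int = 100) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,        # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"

def extract_lines(response: dict) -> list:
    """Flatten a 'streams' result into (timestamp_ns, labels, line), newest first."""
    rows = []
    for stream in response["data"]["result"]:
        for ts, line in stream["values"]:
            rows.append((int(ts), stream["stream"], line))
    return sorted(rows, key=lambda row: row[0], reverse=True)

def query_loki(logql: str, minutes: int = 15, base_url: str = LOKI_URL) -> list:
    """Run a LogQL log query over the last `minutes` minutes."""
    end_ns = time.time_ns()
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    url = build_query_range_url(base_url, logql, start_ns, end_ns)
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_lines(json.load(response))

# Usage (requires a reachable Loki):
# for ts, labels, line in query_loki('{job="nginx"} |= "error"'):
#     print(ts, labels, line)
```

Building the URL and parsing the response are kept separate from the network call so each piece can be checked without a running Loki.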

2. Prometheus (Metrics Collection)

Prometheus is a monitoring system and time-series database that periodically scrapes (pulls) metrics from system and application endpoints and stores them for querying with PromQL.

  • Use Cases:

    • Monitor system performance (CPU, memory, disk, network)
    • Track API latency and error rates
    • Alert on abnormal conditions
  • Example PromQL Queries:

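    # CPU usage: fraction of CPU time spent in user mode, per instance
    sum by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])) / sum by (instance) (rate(node_cpu_seconds_total[5m]))
    
    # Memory usage: fraction of total memory in use
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
    
    # HTTP request latency: 95th percentile over the last 5 minutes
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))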
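Latency histograms are exported as cumulative _bucket counters, one per upper bound (the le label). PromQL's histogram_quantile() estimates a percentile from them by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified, illustrative re-implementation of that idea in plain Python, not Prometheus' own code; it assumes positive bucket bounds and monotonically increasing counts, and the bucket values are made up.

```python
import math

def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with math.inf (the le="+Inf" bucket). Assumes positive bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical request-duration buckets in seconds (cumulative counts)
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))   # 0.1
print(histogram_quantile(0.95, buckets))  # 0.5
```

Because values inside a bucket are assumed to be evenly spread, the result is only as precise as the bucket boundaries; choosing bounds close to your latency targets gives more useful percentiles.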

3. Tempo (Distributed Tracing)

Tempo is a distributed tracing backend that stores traces, letting you follow a single request as it flows across microservices and pinpoint where it slowed down or failed.

  • Use Cases:

    • Find slow API requests
    • Debug request failures across services
    • Monitor service dependencies
  • Example in Grafana:

    • Trace ID lookup for debugging slow requests
    • Service dependency visualization
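Tempo can assemble spans from different services into one trace because each service forwards the same trace ID along with the request, commonly in the W3C Trace Context traceparent header (version, trace ID, parent span ID, and flags, in lowercase hex). The sketch below parses that header and builds the URL for Tempo's trace-by-ID HTTP endpoint (/api/traces/<traceID>). It only accepts version 00 headers, and the Tempo address (http://tempo:3200, using Tempo's default HTTP port) is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT_RE = re.compile(r"^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

class TraceParent(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or uses all-zero IDs."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    _, trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    # Bit 0 of the flags byte is the "sampled" flag
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))

def tempo_trace_url(trace_id: str, base_url: str = "http://tempo:3200") -> str:
    """URL for fetching one trace by ID from Tempo's HTTP API (address is hypothetical)."""
    return f"{base_url}/api/traces/{trace_id}"

# Example header value from the W3C Trace Context specification
tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
if tp is not None:
    print(tp.trace_id, tp.sampled)
    print(tempo_trace_url(tp.trace_id))
```

The extracted trace ID is the same value you would paste into Grafana's trace ID lookup when debugging a slow request.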

4. Grafana (Visualization & Alerting)

Grafana is a visualization and alerting platform that queries Loki, Prometheus, and Tempo as data sources and displays their data side by side.

  • Use Cases:

    • Create interactive dashboards
    • Set up alerts based on logs and metrics
    • Correlate logs, metrics, and traces
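Correlating signals in Grafana starts with registering each backend as a data source. The sketch below builds request bodies for Grafana's HTTP API (POST /api/datasources) and adds a Loki derived field that turns a trace ID found in a log line into a link to the matching trace in Tempo. The service addresses, uid values, token, and the assumption that logs are JSON lines with a trace_id field are all hypothetical; the same settings can also be provisioned from YAML files instead of the API.

```python
import json
import urllib.request

# Hypothetical service addresses; each port is that project's default HTTP port
PROMETHEUS_URL = "http://prometheus:9090"
LOKI_URL = "http://loki:3100"
TEMPO_URL = "http://tempo:3200"

def datasource_payloads() -> list:
    """Request bodies for Grafana's POST /api/datasources endpoint."""
    prometheus = {"name": "Prometheus", "type": "prometheus", "uid": "prometheus",
                  "url": PROMETHEUS_URL, "access": "proxy", "isDefault": True}
    tempo = {"name": "Tempo", "type": "tempo", "uid": "tempo",
             "url": TEMPO_URL, "access": "proxy"}
    loki = {
        "name": "Loki", "type": "loki", "uid": "loki",
        "url": LOKI_URL, "access": "proxy",
        "jsonData": {
            # Derived field: extract a trace ID from each log line and link it to Tempo.
            # Assumes JSON log lines containing "trace_id": "<hex id>".
            "derivedFields": [{
                "name": "TraceID",
                "matcherRegex": r'"trace_id":\s*"(\w+)"',
                "datasourceUid": tempo["uid"],
                "url": "${__value.raw}",  # Grafana variable holding the matched value
            }],
        },
    }
    return [prometheus, loki, tempo]

def create_datasource(grafana_url: str, token: str, payload: dict) -> dict:
    """POST one data source definition to Grafana using a service account token."""
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

# Usage (requires a reachable Grafana; URL and token are placeholders):
# for payload in datasource_payloads():
#     create_datasource("http://grafana:3000", "<service-account-token>", payload)
```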

5. OpenTelemetry (Instrumentation & Telemetry)

OpenTelemetry is a vendor-neutral set of APIs, SDKs, and a Collector that provides a unified way to generate, collect, and export logs, metrics, and traces.

  • Use Cases:

    • Instrument applications with little or no code change using auto-instrumentation
    • Process and export data to Loki, Prometheus, and Tempo through the OpenTelemetry Collector
    • Reduce storage costs by sampling traces
  • Example: Sending Traces with Python

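    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    
    # service.name identifies this app in Tempo and Grafana; "my-service" is a placeholder
    provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))
    
    # Export spans in batches over OTLP/gRPC to the OpenTelemetry Collector
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))
    
    # Register the provider globally before creating tracers
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)
    
    with tracer.start_as_current_span("test-span"):
        print("Hello from OpenTelemetry!")
    
    # Flush any buffered spans before the process exits
    provider.shutdown()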
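Sampling works best when every service makes the same keep-or-drop decision for a given trace. One common approach, which the OpenTelemetry SDK's TraceIdRatioBased sampler follows, is to compare the lower 64 bits of the trace ID against a threshold derived from the sampling rate. The sketch below re-implements that idea in plain Python purely to show the mechanics; in an application you would instead pass the SDK's own sampler, for example TracerProvider(sampler=TraceIdRatioBased(0.1)) with TraceIdRatioBased imported from opentelemetry.sdk.trace.sampling.

```python
import random

# Only the lower 64 bits of the 128-bit trace ID take part in the decision
TRACE_ID_LIMIT = (1 << 64) - 1

def sampling_bound(rate: float) -> int:
    """Map a sampling rate in [0, 1] to a threshold on the lower 64 bits."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return round(rate * (TRACE_ID_LIMIT + 1))

def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic head sampling: a given trace ID always gets the same answer."""
    return (trace_id & TRACE_ID_LIMIT) < sampling_bound(rate)

# Simulate 10,000 random 128-bit trace IDs at a 10% rate
ids = [random.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in ids)
print(f"kept {kept} of {len(ids)} traces")  # roughly 1,000; the exact count varies per run
```

Because the decision depends only on the trace ID, services that share a trace and a sampling rate agree without coordinating, so a sampled trace is kept end to end instead of arriving in Tempo with gaps.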

How LGTM + Prometheus + OpenTelemetry Work Together

| Service | Type | Purpose | Use Case |
|---|---|---|---|
| Loki | Logs | Aggregation & search | Centralized logging, debugging |
| Prometheus | Metrics | Time-series database | Monitor system health, alerting |
| Tempo | Tracing | Distributed tracing | Debugging service failures |
| Grafana | Visualization | Dashboards & alerts | Correlation of data |
| OpenTelemetry | Instrumentation | Telemetry pipeline | Unifying logs, metrics, traces |

Real-World Use Case Example

Scenario: Monitoring a Microservices-Based E-commerce Platform

  1. API Debugging:

    • Loki: Search for errors in checkout logs.
    • Prometheus: Analyze API response times.
    • Tempo: Trace slow transactions.
  2. Server Health Monitoring:

    • Prometheus: Check CPU/memory usage.
    • Grafana: Create alert rules for high usage.
    • Loki: Investigate logs for crashes.
  3. User Transactions Tracking:

    • OpenTelemetry: Capture request flow.
    • Tempo: Identify bottlenecks.
    • Loki: Retrieve error logs.
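Jumping from a slow trace in Tempo to the matching error logs in Loki is easiest when every log line carries the trace ID. The sketch below uses Python's standard logging module to emit one JSON object per line with a trace_id field, which LogQL's json parser can then filter on (for example, {app="myapp"} | json | trace_id="0af7651916cd43dd8448eb211c80319c"). The logger name and trace ID are placeholders; with the OpenTelemetry SDK, the current trace ID can be obtained as format(trace.get_current_span().get_span_context().trace_id, "032x").

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so LogQL's json parser can read it."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is supplied via `extra=`; empty when no trace is active
            "trace_id": getattr(record, "trace_id", ""),
        }
        return json.dumps(entry)

logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Placeholder trace ID; in a traced request it would come from the active span
logger.error("payment declined", extra={"trace_id": "0af7651916cd43dd8448eb211c80319c"})
```

Writing logs to stdout or stderr as JSON lines also keeps the collection side simple, since a log shipper only has to forward lines and attach labels.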

Deployment & Next Steps

You can deploy LGTM, Prometheus, and OpenTelemetry using Docker, Kubernetes, or NixOS. Here are some next steps:

✅ Deploy with Kubernetes using Helm.
✅ Enable alerting in Prometheus/Grafana.
✅ Use Fluent Bit for log collection.

Would you like a step-by-step setup guide? Let me know in the comments! 🚀