Reference Architecture for Observability using Cloud Native Tools
Implementing Observability with OpenTelemetry, Grafana Tempo, Loki, and Prometheus.
Introduction
Observability is a critical aspect of modern software systems, enabling developers and operators to gain insights into the behaviour and performance of their applications. With the rise of microservices and distributed systems, it has become increasingly important to have a unified approach to observability. OpenTelemetry is an open-source observability framework that provides a standardized way to collect, process, and export telemetry data.
In the previous blog, we discussed the background of observability. In this blog, we explore a reference architecture for observability built on OpenTelemetry, Grafana Tempo, Loki, and Prometheus, and conclude with a look at a future reference architecture.
Observability Tools Landscape
Many observability tools are available, both open source and proprietary. The Cloud Native Computing Foundation (CNCF) publishes "The CNCF End User Technology Radar", a guide for evaluating cloud native technologies on behalf of the CNCF End User Community. The summary of its September 2020 edition on observability:
The most commonly adopted tools are open source.
There’s no consolidation in the observability space.
Prometheus and Grafana are frequently used together.
Since Prometheus and Grafana are so widely adopted, it's better to take these tools as a starting point.
OpenTelemetry
OpenTelemetry is an open-source project that provides a set of tools and APIs for instrumenting, collecting, and exporting telemetry data from software systems. Here are some of the benefits of using OpenTelemetry:
Standardization: OpenTelemetry provides a vendor-neutral standard for telemetry data collection and export. This means that the same instrumentation approach carries across many languages, frameworks, and cloud providers.
Observability: OpenTelemetry makes it easier to monitor and debug complex distributed systems by providing a unified view of telemetry data across services and infrastructure.
Flexibility: OpenTelemetry is highly customizable and allows you to define your own metrics, traces, and logs. You can also choose from a variety of data exporters and integrations.
Interoperability: OpenTelemetry is designed to work seamlessly with other observability tools and platforms, such as Grafana, Prometheus, and Jaeger.
Community-driven: OpenTelemetry is developed by a community of experts and enthusiasts who are committed to improving observability in software systems. This means that the project is constantly evolving and improving based on feedback and contributions from its users.
Overall, OpenTelemetry provides a powerful and flexible platform for collecting and analyzing telemetry data, which can help you improve the performance, reliability, and scalability of your software systems.
At the time of writing, however, only the tracing signal is ready for production. To begin with, we can therefore use OpenTelemetry for traces only and rely on other tools for metrics and logs.
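To make the tracing signal more concrete, here is a minimal sketch of how trace context travels between services. It assumes the W3C Trace Context `traceparent` header, which OpenTelemetry uses for propagation by default. The helper names `new_traceparent` and `child_traceparent` are hypothetical and exist only for illustration; in a real service the OpenTelemetry SDK does this work for you.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        # Unparseable header: fall back to starting a fresh trace.
        return new_traceparent()
    _version, trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print("incoming:", incoming)
    print("outgoing:", outgoing)
```

Every service that handles the request keeps the same trace ID and only changes the span ID, which is what lets a tracing backend stitch the individual spans back into one trace.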
Prometheus
Prometheus is an open-source monitoring system that collects metrics data from instrumented applications and stores it in a time-series database. It provides a powerful query language, PromQL, that allows users to query and analyze their metrics data.
Prometheus also integrates with other monitoring tools such as Grafana, Alertmanager, and OpenTelemetry. With OpenTelemetry, Prometheus can consume metrics data from the OpenTelemetry Collector, allowing users to consolidate metrics from different sources in a single backend.
Loki
Loki is a horizontally-scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to handle high volumes of log data and allows users to search and analyze logs using a powerful query language, LogQL.
Loki integrates with Prometheus and other data sources like OpenTelemetry to enable users to store, query, and visualize their logs alongside their metrics data. Loki indexes only the labels attached to each log stream and stores the log content itself in compressed chunks, which keeps the index small and makes large volumes of log data practical to search and analyze.
Grafana Tempo
Grafana Tempo is an open-source distributed tracing system designed for high-scale environments. It provides a unified approach to tracing across multiple systems and enables users to analyze and diagnose issues in real-time.
Grafana Tempo integrates with OpenTelemetry and other tracing systems like Jaeger and Zipkin to provide a unified view of distributed traces. It also integrates with Prometheus and other data sources to enable users to correlate their tracing data with their metrics and logs data.
Reference Architecture 1.0
Now that we've introduced the key components of an observability platform, let's take a look at how they fit together in a reference architecture:
OpenTelemetry agents are deployed alongside your applications and infrastructure, capturing telemetry data and forwarding it to the appropriate backends.
Grafana Tempo is used to collect and store tracing data, allowing you to quickly identify performance issues and optimize your applications.
Loki is used to collect and store log data, providing a detailed view of what's happening in your system.
Prometheus is used to collect and store metrics data, allowing you to monitor the performance of your applications and infrastructure.
Grafana is used to visualize and analyze the data collected by Tempo, Loki, and Prometheus, providing a comprehensive view of your system's performance.
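The data flow described above can be sketched as a simple routing table: each telemetry signal goes to exactly one backend. This is a toy model of the architecture, not a real collector; the `Telemetry` class and the `route` and `fan_out` functions are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field

# Signal type -> backend, as laid out in Reference Architecture 1.0.
BACKENDS = {"traces": "tempo", "logs": "loki", "metrics": "prometheus"}


@dataclass(frozen=True)
class Telemetry:
    signal: str  # "traces", "logs", or "metrics"
    payload: dict = field(default_factory=dict)


def route(item: Telemetry) -> str:
    """Return the backend that should receive this telemetry item."""
    try:
        return BACKENDS[item.signal]
    except KeyError:
        raise ValueError(f"unknown signal type: {item.signal!r}") from None


def fan_out(batch: list[Telemetry]) -> dict[str, list[Telemetry]]:
    """Group a mixed batch of telemetry by destination backend."""
    grouped: dict[str, list[Telemetry]] = {}
    for item in batch:
        grouped.setdefault(route(item), []).append(item)
    return grouped


if __name__ == "__main__":
    batch = [
        Telemetry("traces", {"name": "GET /cart"}),
        Telemetry("logs", {"line": "cart loaded"}),
        Telemetry("metrics", {"name": "http_requests_total"}),
    ]
    for backend, items in fan_out(batch).items():
        print(backend, len(items))
```

Grafana then sits on top of all three backends as the single place to query and visualize them.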
Reference Architecture 2.0
In Reference Architecture 1.0, the three pillars of observability (metrics, logs, and traces) are treated as silos. However, relying on any one of these pillars in isolation can limit our understanding of a system's behavior and performance.
For example, if we only look at metrics, we may miss important contextual information about what led to a particular metric value. Similarly, if we only examine logs, we may struggle to identify patterns and trends in the data. Traces can provide a more detailed view of individual requests, but without context from metrics and logs, it can be challenging to understand the overall behavior of the system.
Correlating telemetry using events helps to address these limitations by bringing together data from multiple sources and allowing us to view them in context.
By correlating events across different telemetry sources, we can more easily understand the relationships between them and identify patterns and trends in the data. This can help us to diagnose problems more quickly and improve the overall performance and reliability of our systems.
Therefore, while the three pillars of observability are useful in isolation, they are much more powerful when used together and correlated with events to provide a more comprehensive view of system behavior and performance.
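One common way to correlate the three signals, assumed here rather than prescribed by this architecture, is a shared trace ID: log records carry it as a field and metrics carry it as an exemplar. The sketch below groups records from all three sources by that ID. The record shapes and the function `correlate_by_trace_id` are hypothetical.

```python
from collections import defaultdict


def correlate_by_trace_id(
    spans: list[dict],
    logs: list[dict],
    exemplars: list[dict],
) -> dict[str, dict[str, list[dict]]]:
    """Group spans, log records, and metric exemplars by their trace_id."""
    correlated: dict[str, dict[str, list[dict]]] = defaultdict(
        lambda: {"spans": [], "logs": [], "exemplars": []}
    )
    sources = (("spans", spans), ("logs", logs), ("exemplars", exemplars))
    for kind, records in sources:
        for record in records:
            trace_id = record.get("trace_id")
            if trace_id:  # records without a trace ID cannot be correlated
                correlated[trace_id][kind].append(record)
    return dict(correlated)


if __name__ == "__main__":
    view = correlate_by_trace_id(
        spans=[{"trace_id": "abc", "name": "POST /checkout", "ms": 950}],
        logs=[
            {"trace_id": "abc", "line": "payment gateway timeout"},
            {"line": "heartbeat"},  # no trace ID, so it is left out
        ],
        exemplars=[{"trace_id": "abc", "metric": "request_duration", "value": 0.95}],
    )
    print(view["abc"])
```

Starting from a slow request in a latency metric, the exemplar leads to the trace, and the trace ID in turn leads to the exact log lines written while that request was handled.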
Conclusion
By using these tools together, you can build a highly effective observability platform that provides real-time insights into your applications and infrastructure. With OpenTelemetry as the backbone, you can easily integrate with other tools and technologies, ensuring that your platform is flexible and future-proof. So, whether you're building a new observability platform from scratch or looking to improve an existing one, OpenTelemetry, Grafana Tempo, Loki, and Prometheus provide a powerful combination that can help you achieve your goals.
References
https://opentelemetry.io/docs/
https://grafana.com/blog/2022/02/23/introducing-exemplar-support-in-grafana-cloud-tightly-coupling-traces-to-your-metrics/
https://logz.io/learn/opentelemetry-guide/#overview
https://dzone.com/refcardz/getting-started-with-opentelemetry#section-3
https://www.slideshare.net/kbrockhoff/opentelemetry-for-architects
https://radar.cncf.io/2020-09-observability