Cracking Performance Issues in Microservices with Distributed Tracing

This new discipline, attributed to Google, helps pinpoint where failures occur and what causes poor performance.

Microservices

Microservices architecture is the new norm for building products these days. An application made up of hundreds of independent services enables teams to work independently and accelerate development. However, such highly distributed applications are also harder to monitor.

When hundreds of services are traversed to satisfy a single request, it becomes difficult to investigate system issues. For example, if a customer request returns a failure code, how can we determine which service from the hundreds involved caused the error? Or if the customer request is suddenly very slow to respond, how can we pinpoint the source of the performance degradation?

Logs have long been the most established tool for root cause analysis in monolithic systems. But in a microservices architecture, we are faced with hundreds or thousands of services, each serving hundreds or thousands of different requests per second. That means the log entries for a given request are scattered across numerous log files, which makes it hard to find the relevant ones for that specific request, let alone put them together according to the execution flow. Other methods, such as inspecting stack traces or setting breakpoints, are also not applicable in these cases, because they operate within a single process while the request spans many.

Distributed Tracing Fundamentals

This has given rise to the new discipline of Distributed Tracing. Many attribute the discipline to Google, which released a monumental research paper in 2010 based on its experience building Dapper, its in-house Distributed Tracing system designed to meet Google's unprecedented scale. According to OpenTracing.org:

“Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”

With Distributed Tracing, our application reports tracing data for each service and operation that is invoked as part of the request execution. This data, reported in units called spans, is collected by the analytics backend, ordered by causality, and then visualized, typically as a Gantt chart. In the example below, we can see a trace, starting with the HTTP GET /dispatch operation invoked on the frontend service and then flowing through a series of services and operations to fulfill the request.
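To make the span model more concrete, here is a minimal sketch in Python of how a backend might order spans by causality and render them as a text Gantt chart. It is an illustration under stated assumptions, not how any particular tracing backend works: the Span fields, the helper names, and all services, operations and timings other than the frontend's HTTP GET /dispatch are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span record; real systems carry more fields.
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int


def order_by_causality(spans):
    """Return (depth, span) pairs: parents before children, siblings by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    ordered = []

    def visit(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            ordered.append((depth, span))
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return ordered


def render_gantt(spans, scale_ms=50):
    """Render one line per span; each '#' covers scale_ms milliseconds."""
    lines = []
    for depth, span in order_by_causality(spans):
        label = f"{'  ' * depth}{span.service}: {span.operation}"
        offset = " " * (span.start_ms // scale_ms)
        bar = "#" * max(1, span.duration_ms // scale_ms)
        lines.append(f"{label:<40}|{offset}{bar}")
    return "\n".join(lines)


# Made-up spans, listed in arrival order rather than causal order.
SPANS = [
    Span("c", "b", "mysql", "SQL SELECT", 50, 300),
    Span("a", None, "frontend", "HTTP GET /dispatch", 0, 700),
    Span("d", "a", "driver", "FindNearest", 400, 250),
    Span("b", "a", "customer", "HTTP GET /customer", 20, 350),
]

if __name__ == "__main__":
    print(render_gantt(SPANS))
```

The key idea is that each span carries a reference to its parent, so the backend can rebuild the call tree even when spans arrive out of order from different services.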

In a trace view like this one, it’s easy to see where most of the time is spent and to spot potential performance inefficiencies, such as a series of sequential calls that, if run concurrently, could reduce the overall request latency.
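The effect of turning sequential calls into concurrent ones can be sketched in a few lines of Python with asyncio. This is only an illustration: the call names and latencies are made up, asyncio.sleep stands in for real network calls, and the approach assumes the calls are independent of one another.

```python
import asyncio
import time


async def call_service(name, latency_s):
    """Stand-in for a downstream call; sleeps instead of doing network I/O."""
    await asyncio.sleep(latency_s)
    return name


async def run_sequentially(calls):
    # Each call waits for the previous one: total time is roughly the sum.
    return [await call_service(name, latency) for name, latency in calls]


async def run_concurrently(calls):
    # All calls are in flight together: total time is roughly the slowest call.
    return await asyncio.gather(
        *(call_service(name, latency) for name, latency in calls)
    )


def timed(coro_fn, calls):
    """Run coro_fn(calls) to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(calls))
    return result, time.perf_counter() - start


# Hypothetical downstream calls with made-up latencies (in seconds).
CALLS = [("route-1", 0.05), ("route-2", 0.05), ("route-3", 0.05)]

if __name__ == "__main__":
    _, sequential = timed(run_sequentially, CALLS)
    _, concurrent = timed(run_concurrently, CALLS)
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

In a trace, the sequential version would show up as a staircase of spans one after another, while the concurrent version would show overlapping spans under the same parent.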

Distributed Tracing is Growing in Popularity

Distributed Tracing has been growing in popularity in recent years. According to the DevOps Pulse 2022 survey, 47% of respondents use distributed tracing in one form or another. In fact, Distributed Tracing adoption has increased steadily over the last three years, per DevOps Pulse’s yearly results. Moreover, among the 2022 survey respondents who do not yet use it, 70% stated they intend to start using it in the coming one or two years.

Open Source Plays a Key Role in Distributed Tracing

Open source plays an important role in this domain. The most popular Distributed Tracing tool is Jaeger. According to DevOps Pulse, Jaeger is used by over 32% of distributed tracing practitioners. Jaeger was developed by Uber for its own hyperscale needs and was later open-sourced. Today Jaeger is a graduated project in the Cloud Native Computing Foundation (CNCF), the organization that hosts Kubernetes, Prometheus and other prominent open-source projects.

Another important open source project in this space is OpenTelemetry, which is used for generating and collecting tracing data, alongside metrics and logs, in a unified and standard manner. OpenTelemetry provides APIs and SDKs for instrumenting applications in various programming languages, as well as a Collector for collecting the telemetry data from applications and from infrastructure components. OpenTelemetry is also a CNCF project; it reached general availability for Distributed Tracing in 2021 and is considered ready for production use.

Tracing definitely stands out as a central tool for monitoring microservices and distributed systems, augmenting logs and metrics. I covered this topic recently at the ContainerDays 2022 conference; you can find the lecture recording here.


Dotan Horovits lives at the intersection of technology, product and innovation. With over 20 years in the hi-tech industry as a software developer, a solutions architect and a product manager, he brings a wealth of knowledge in cloud computing, big data solutions, DevOps practices and more.

Horovits is an avid advocate of open source software, open standards and communities. He is also an advocate of the Cloud Native Computing Foundation (CNCF), an organizer of the CNCF Tel-Aviv meetup group, a podcaster at OpenObservability Talks, and a blogger, among other roles. Working as the principal developer advocate at Logz.io, Horovits evangelizes observability in IT systems using popular open source projects such as Prometheus, OpenSearch, Jaeger and OpenTelemetry.