Observability: How Adobe improves performance and MTTR using Epsagon and Fastly logs
We don’t often do guest posts on our blog, but when our partners at Epsagon — a company that gives visibility into microservices — reached out to tell a story about how they worked with Adobe to use Fastly to reduce mean time to repair, it was too good an opportunity to pass up.
Working together, Epsagon and Adobe’s Project Helix team built a very cool integration that uses clever parsing of VCL to generate tracing statements showing what variables have been created, updated, or deleted at every stage of a request and response in our platform. Then, using the flexibility of our real-time log streaming feature, the tracing statements are sent off to Epsagon — despite the fact that we had no formalized logging integration with them.
Once inside the Epsagon system, these logs show an incredibly clear overview of the whole request and response cycle. Back in the day, with huge monolithic apps, it was easy to diagnose issues by just tailing your Apache access log, but nowadays — as architectures get more complicated and involve serverless components and microservices, as well as multiple third-party services and cloud providers — you need much more sophisticated tools.
What excited us most was that this project not only highlighted the need for different tools in more modern software stacks but also validated our approach of building a platform that our customers can themselves build on. Plus, it makes us very happy when our partners work together, especially when those collaborations produce something this great.
So without further ado, here’s Lars Trieloff from Adobe and Ran Ribenzaft from Epsagon with their take on this project.
Building distributed applications with microservices patterns makes teams far more agile at delivering and maintaining software, even when they rely on managed SaaS providers that augment their applications with services such as caching, payments, user management, storage, and more.
However, it’s getting much harder to monitor and troubleshoot an application composed of tens or hundreds of services and components, especially when some of them are not even managed or controlled by you.
In the following post, we’ll present the strategies for observability, Adobe’s use case that includes Fastly, OpenWhisk, and more, and the way we’ve gained observability using Epsagon.
Project Helix is a research project at Adobe that explores the ability to create, manage, and deliver great content management system (CMS) and digital experiences using Adobe I/O Runtime, Adobe’s Apache OpenWhisk-based serverless runtime, which allows you to run custom code.
Project Helix uses serverless functions (or “actions,” as OpenWhisk calls them) for asynchronous content processing such as updating search indexes and synchronous rendering of web pages. In order to overcome the inherent latency in these synchronous operations, Fastly is used as an edge platform that performs request processing, image optimization, caching, and edge-side processing. Adobe chose Fastly because they had a very successful evaluation showing great results, and the Fastly team provided, and still provides, fantastic support.
The current team challenges are:
Understanding the flow of a specific request, including Fastly, OpenWhisk, Microsoft APIs, Google APIs, GitHub, and any other third-party APIs
Exploring end-to-end performance bottlenecks or issues
Pinpointing and analyzing events across the whole stack (e.g. looking for a specific event according to a user with a bad experience)
The distributed and transient nature of serverless applications makes traditional, agent-based tracing and extensive logging impractical. While Fastly provides excellent real-time visibility, neither it nor other platforms such as OpenWhisk are meant to provide dashboards and data visualization, so Project Helix was looking for an observability solution.
Observability in modern applications
Observability is the key to answering the challenges above. Instead of simple, or traditional, monitoring, observability brings metrics, logging, and tracing together under the same roof.
When done right, observability can help teams understand their production workloads, customer journeys, overall user experience, and performance issues. But how do you accomplish good observability? The answer is correlation.
Observability is more than the sum of its components. Getting a 5xx error threshold alert from a metric without a correlation to logs is almost meaningless. The engineer would need to correlate the data manually to understand (debug) the issue. And now how will you do that across 10 different services?
Correlating metrics, logs, and traces gives us a complete picture of what happened (metrics), why it happened (logs), and where it happened (traces). Doing so is not a simple task, and that’s why we chose Epsagon.
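To make the idea of correlation concrete, here is a minimal sketch in Python. It assumes every metric sample, log line, and trace span carries a shared `request_id` field. That field name and the record shapes are illustrative assumptions, not Epsagon's data model.

```python
from collections import defaultdict


def correlate(metrics, logs, spans):
    """Group records from three telemetry sources by a shared request id.

    Each record is assumed to be a dict with a "request_id" key
    (a hypothetical field chosen for this illustration).
    """
    by_request = defaultdict(lambda: {"metrics": [], "logs": [], "spans": []})
    sources = (("metrics", metrics), ("logs", logs), ("spans", spans))
    for kind, records in sources:
        for record in records:
            by_request[record["request_id"]][kind].append(record)
    return dict(by_request)


# Made-up sample data: one failing request seen from three angles.
sample_metrics = [{"request_id": "r1", "status": 503}]
sample_logs = [{"request_id": "r1", "message": "upstream timeout"}]
sample_spans = [{"request_id": "r1", "service": "github", "duration_ms": 5000}]

picture = correlate(sample_metrics, sample_logs, sample_spans)
```

With the records grouped this way, a 5xx alert for request `r1` leads directly to the log line explaining it and the span showing where the time went, with no manual cross-referencing across services.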
Observability using Epsagon
Let’s tackle the challenges we mentioned before one by one.
Understanding the journey of our customers can be really tricky. It is distributed, it is asynchronous, and some of the data comes from third-party APIs.
The Epsagon tracing library (in this case for Node.js) is responsible for collecting the calls and payloads of everything that occurs. It wraps the OpenWhisk actions and instruments calls at runtime (for example, HTTP calls), from which it extracts the body, headers, and any other metadata.
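The wrapping technique can be illustrated with a small Python sketch. This is not the Epsagon library or its API: `traced`, the span fields, and the in-memory `collected` list are hypothetical stand-ins that only show how a wrapper can record timing and errors around an OpenWhisk-style `main(params)` action.

```python
import functools
import time


def traced(name, sink):
    """Wrap an action so each invocation appends a span dict to `sink`.

    `name` and `sink` are hypothetical parameters for this sketch; a real
    tracing library would ship spans to a backend instead of a list.
    """
    def decorator(action):
        @functools.wraps(action)
        def wrapper(params):
            span = {"name": name, "error": None}
            start = time.perf_counter()
            try:
                return action(params)
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                span["error"] = repr(exc)
                raise
            finally:
                span["duration_ms"] = (time.perf_counter() - start) * 1000.0
                sink.append(span)
        return wrapper
    return decorator


collected = []


@traced("render-page", collected)
def main(params):
    # An OpenWhisk-style action: takes a dict, returns a dict.
    return {"statusCode": 200, "body": "Hello " + params.get("owner", "world")}
```

Calling `main({"owner": "adobe"})` returns the action's normal result and leaves one span in `collected`, so the action code itself needs no tracing logic.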
With Fastly's real-time logging and powerful programmability features, we could build an instrumentation library that integrated well with Epsagon. Before a new version of VCL (Fastly's DSL for HTTP processing) is activated, a serverless action in Project Helix parses the VCL code, identifies critical state transitions and header values, and injects log statements that, at runtime, send tracing data to Epsagon. Fastly's real-time logging collects, batches, formats (as JSON), and ships the data off to Epsagon's HTTPS endpoint.
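As a rough illustration of the injection step, the Python sketch below scans VCL line by line and adds a `log` statement after each subroutine entry and each HTTP header assignment. It is a simplified stand-in, not the Project Helix implementation: it uses regular expressions rather than a real VCL parser, and the logging endpoint name `epsagon` is an assumption.

```python
import re

# Matches a subroutine opening such as: sub vcl_recv {
SUB_RE = re.compile(r"^\s*sub\s+(vcl_\w+)\s*\{\s*$")
# Matches a header assignment such as: set req.http.X-Owner = "adobe";
SET_RE = re.compile(
    r"^(\s*)set\s+((?:req|bereq|beresp|resp|obj)\.http\.[\w-]+)\s*=.*;\s*$"
)


def inject_tracing(vcl, endpoint="epsagon"):
    """Return VCL with a log statement added after each state transition
    (subroutine entry) and each header assignment.

    `endpoint` is the name of a configured logging endpoint; "epsagon"
    is a hypothetical name used for this sketch.
    """
    prefix = f'log "syslog " req.service_id " {endpoint} :: '
    out = []
    for line in vcl.splitlines():
        out.append(line)
        sub = SUB_RE.match(line)
        if sub:
            out.append(f'  {prefix}enter {sub.group(1)}";')
            continue
        assign = SET_RE.match(line)
        if assign:
            indent, target = assign.groups()
            # Log the header name and its value at runtime.
            out.append(f'{indent}{prefix}set {target}=" {target};')
    return "\n".join(out)


sample_vcl = 'sub vcl_recv {\n  set req.http.X-Owner = "adobe";\n}'
instrumented = inject_tracing(sample_vcl)
```

Running the transformation before activation, as the post describes, keeps the hand-written VCL free of tracing noise while every deployed version still emits the same trace events.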
Fusing these events, traces, metrics, logs, and other parameters yields the Epsagon flow, which tells the story of the request:
An example trace flow of Helix in Epsagon
Next, we want to know the overall performance of our application so that we can improve the customer experience. In such a distributed, asynchronous system, it is almost impossible to understand performance from logs, metrics, or a graph. A timeline is much more suitable:
An example trace timeline of Helix in Epsagon
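To show why a timeline makes bottlenecks easier to spot, here is a small Python sketch that renders spans as text bars on a shared time axis. The span names, timings, and field names (`start_ms`, `end_ms`) are made up for illustration and do not reflect Epsagon's data format.

```python
def render_timeline(spans, width=40):
    """Render spans as text bars scaled to `width` columns.

    Each span is assumed to be a dict with "name", "start_ms", and
    "end_ms" integer fields (hypothetical names for this sketch).
    """
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    total = max(end - start, 1)  # avoid division by zero
    lines = []
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        # Integer arithmetic keeps the bar positions exact.
        offset = (span["start_ms"] - start) * width // total
        length = max(1, (span["end_ms"] - span["start_ms"]) * width // total)
        bar = " " * offset + "#" * length
        lines.append(span["name"].ljust(12) + "|" + bar)
    return "\n".join(lines)


sample_spans = [
    {"name": "fastly", "start_ms": 0, "end_ms": 100},
    {"name": "openwhisk", "start_ms": 10, "end_ms": 90},
    {"name": "github", "start_ms": 20, "end_ms": 60},
]
timeline = render_timeline(sample_spans, width=10)
```

Nested bars show which downstream call accounts for most of the enclosing request's duration, which is hard to read off a list of log lines.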
Ultimately, we wanted to pinpoint and aggregate events across the whole stack.
Using Epsagon's automatic indexing of payloads, we can filter calls down to specific parameters, such as owner and repository. This lets us troubleshoot only the relevant events and understand the performance (requests, errors, latency) of that subset:
Filtering specific events in Epsagon
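The filtering idea can be sketched in a few lines of Python. The event fields (`owner`, `repo`, `status`, `latency_ms`) and the sample values are assumptions for illustration. Epsagon performs this kind of indexing and aggregation itself, and the sketch only shows the underlying logic.

```python
from statistics import mean


def summarize(events, **filters):
    """Filter events by exact-match fields and summarize the subset.

    Returns request count, error count (status >= 500), and average
    latency. Field names are hypothetical, chosen for this sketch.
    """
    matching = [
        e for e in events
        if all(e.get(key) == value for key, value in filters.items())
    ]
    return {
        "requests": len(matching),
        "errors": sum(1 for e in matching if e["status"] >= 500),
        "avg_latency_ms": (
            mean(e["latency_ms"] for e in matching) if matching else None
        ),
    }


sample_events = [
    {"owner": "adobe", "repo": "helix", "status": 200, "latency_ms": 120},
    {"owner": "adobe", "repo": "helix", "status": 503, "latency_ms": 80},
    {"owner": "other", "repo": "site", "status": 200, "latency_ms": 40},
]
summary = summarize(sample_events, owner="adobe", repo="helix")
```

Narrowing to one owner and repository before aggregating means that a single noisy tenant does not hide inside global averages.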
In modern applications that are composed of microservices, multiple frameworks, third-party APIs, and other SaaS applications, it becomes clear that observability is a key daily consideration for the engineering team.
Having a good observability solution in place, along with real-time insights from an edge cloud platform like Fastly, helps teams troubleshoot issues more efficiently and focus on delivering great products and experiences. Additionally, choosing a managed solution such as Epsagon removes the heavy lifting and constant maintenance that engineering teams would rather avoid.