You monitor distributed systems and log data, but what good does it do if you can’t observe an actual problem when there is an issue? The reality is, you’re drowning in log data and monitoring only gives you a high-level overview of a problem after it’s occurred.
Monitoring is the activity of observing the state of a system over time. It uses instrumentation for problem detection, resolution, and continuous improvement. Monitoring alerts are reactive: they tell you when a known issue has already occurred (e.g., your available memory is too low or you need more compute).
Monitoring provides automated checks that you can execute against a distributed system to confirm that none of the failure conditions you predicted have occurred. Monitoring these known quantities is important, but the practice has limitations, chief among them that you are only looking for known issues. This raises an important question: what about the problems that you didn’t predict?
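To make the limitation concrete, here is a minimal sketch of a monitoring check in Python. The metric names, thresholds, and the `run_checks` helper are hypothetical and chosen only for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    # Returns True when the known problem is present in the snapshot.
    predicate: Callable[[dict], bool]


def run_checks(snapshot: dict, checks: List[Check]) -> List[str]:
    """Return the names of the checks whose failure condition is present."""
    return [c.name for c in checks if c.predicate(snapshot)]


# Hypothetical thresholds: each one encodes a problem someone predicted.
CHECKS = [
    Check("low_memory", lambda s: s["free_memory_mb"] < 512),
    Check("high_cpu", lambda s: s["cpu_percent"] > 90),
]

# A hypothetical reading from one host.
snapshot = {"free_memory_mb": 300, "cpu_percent": 45, "open_connections": 12000}

alerts = run_checks(snapshot, CHECKS)
print(alerts)  # ['low_memory']
```

The snapshot also carries an unusually high `open_connections` value, but no check was written for it, so it raises no alert. That is the gap the question above points at.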
Observability goes beyond monitoring, enabling the proactive introspection of distributed systems for greater operational visibility. Observability allows you to ask open-ended questions and gives you the data you need to explore and find answers. In short, observability gives you the information you need to make better decisions using real data.
Collecting log data and monitoring your systems are both important. But, as John Fahi at logz.io argues, both “are only giving you one-dimensional fragments of a complete picture. They are words or sentences of chapters in a story that is your environment. Once you assemble enough fragments and organize them in a way to glean actionable knowledge of your environment, you are creating that insight.” Observability is about generating a deep understanding of what should be changed to improve your environment.
Observability is about providing context. Cindy Sridharan describes this as providing “highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.” She notes that it’s still not possible to predict every single failure mode a system could potentially run into, and as such it is important that we build systems that we can debug, armed with evidence and not conjecture.
With observability, you can ask new questions about unknowns, and if you’ve followed observability best practices, you should have the data you need to answer those questions (and a clear path to access that data) within your collected logs. The beauty of observability is that you get an understanding of how a distributed system is working on the inside by reviewing data on the outside. This will provide the insights and dexterity you need to manage your distributed systems as they grow in complexity.
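As a rough illustration of asking a new question of data you already collected, the sketch below assumes your logs are emitted as one JSON object per line. The field names (`service`, `region`, `status`, `latency_ms`) and the `count_by` helper are hypothetical.

```python
import json
from collections import Counter

# Hypothetical structured log lines, one JSON object per line.
RAW_LOGS = """\
{"service": "checkout", "region": "eu-west", "status": 500, "latency_ms": 840}
{"service": "checkout", "region": "us-east", "status": 200, "latency_ms": 95}
{"service": "search", "region": "eu-west", "status": 200, "latency_ms": 120}
{"service": "checkout", "region": "eu-west", "status": 500, "latency_ms": 910}
"""

events = [json.loads(line) for line in RAW_LOGS.splitlines() if line.strip()]


def count_by(events, field, where=lambda e: True):
    """Count events by an arbitrary field, optionally filtered by a predicate."""
    return Counter(e[field] for e in events if where(e))


# A question nobody wrote an alert for: where are the server errors coming from?
errors_by_region = count_by(events, "region", where=lambda e: e["status"] >= 500)
print(dict(errors_by_region))  # {'eu-west': 2}
```

Because every event carries its context as named fields, the same helper can answer a different question, such as errors by service, without changing what is collected.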
In order to have a foundation for observability, you need three things:
Logs: Logs are a verbose representation of events that have happened. Logs tell a linear story about an event, and you typically search them using string processing and regular expressions. A common challenge with logs is that if you haven’t properly indexed something, it will be difficult to find due to the sheer volume of log data.
Traces: A trace captures a user’s journey through your application. Traces provide end-to-end visibility and are useful when you need to identify which components cause system errors, find performance bottlenecks, or monitor flow through modules.
Metrics: Metrics are numeric measurements, captured either at a point in time or aggregated over intervals. These data points could be counters, gauges, etc. Because they typically summarize data over intervals, they sometimes sacrifice the details of an individual event in order to present data that is easier to assimilate.
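The three signals above can be sketched together using only the Python standard library. This is an illustration under stated assumptions, not a recommended implementation: the `span` helper, the in-memory `counters` and `spans` stores, and the `handle_request` function are all hypothetical, and a real system would normally use a dedicated instrumentation library and ship the data to a backend.

```python
import json
import logging
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

counters = defaultdict(int)  # metrics: simple in-memory counters
spans = []                   # traces: timed steps that share one trace_id


@contextmanager
def span(trace_id, name):
    """Record how long a named step took as part of a trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name, "duration_ms": duration_ms})


def handle_request(user_id):
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_request"):
        with span(trace_id, "charge_card"):
            time.sleep(0.01)  # stand-in for a call to a downstream service
        counters["requests_total"] += 1  # metric: cheap, aggregated count
        # Log: a verbose, structured record of one event, linked to its trace.
        log.info(json.dumps({"event": "order_placed", "user_id": user_id,
                             "trace_id": trace_id}))
    return trace_id


trace_id = handle_request("user-42")
```

Putting the same `trace_id` in the log line and in the spans is what lets you move from an aggregated metric to the individual request behind it.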
The best way to guarantee observability is to build it into your code as you write it. By focusing on observability during your development process, your developers will: