Performance monitoring gets a lot of discussion, and although security monitoring shares many of its themes, it also differs in some crucial respects. I thought it would be worthwhile to look at some of those differences.
Let’s start with understanding what performance monitoring entails at its core. Imagine a customer who is running a SaaS application and has thousands of virtual machines (VMs) running a Kubernetes cluster with hundreds of microservices.
A key responsibility for the DevOps team charged with maintaining this revenue-generating application is obviously to keep it running all the time!
To achieve high uptime, the team has to ensure there are no single points of failure, provide enough capacity with some room for transient fluctuations, implement failover plans, and monitor error rates to ensure they stay within safe bounds. Ultimately, the goal is to meet the SaaS service level agreement (SLA) as defined by business needs.
Even with all that planning, Murphy’s Law applies. In a large distributed system, many things will go wrong from time to time. There is no way to avoid it, hence the need for performance monitoring. But given the above, it makes sense for DevOps teams to focus on any failures that impact whatever SLA is in place. Events with higher impact call for higher priority and faster responses. And if something doesn’t affect the SLA, then it can be put aside and dealt with later.
In short, a lot of DevOps’ performance monitoring firepower gets focused on discovering things that move the needle and directly affect the SLA.
Let’s consider some examples and see if they move the needle for our customer’s SaaS app.
- If usual TCP traffic is around 1 billion connections per hour, does an extra 100 connections to a command and control server during that hour make any difference?
- Is attaching an extra security group to a VM which allows for more ports to be open something you should be concerned about?
- An administrator running some processes on a single VM.
- Making 100 more API calls from a strange location, using an IAM credential that already makes millions of API calls per hour.
- Creating a new IAM user, and launching a new set of very expensive VMs in a different region on AWS.
None of the examples above moves the needle for our SaaS app. They do not disrupt the existing application’s SLA, so they are unlikely to be the focus of whatever performance monitoring is in place. The fact that they do not disrupt existing applications is not an accident. Hackers have learned from experience to try very hard not to move the needle. If they do, they get caught faster, and survival of the fittest applies. Most attackers will stay below your performance monitoring tool’s radar for a long period of time.
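To put a number on the first example, here is a minimal sketch of a volume-based alert of the kind a performance monitoring setup might use. The 1 billion and 100 figures come from the example above; the 5% threshold and the function names are assumptions made purely for illustration, not a description of any particular tool.

```python
# Minimal sketch of a volume-based alert. The threshold is an assumed value.
BASELINE_CONNECTIONS_PER_HOUR = 1_000_000_000  # figure from the example above
EXTRA_C2_CONNECTIONS = 100                     # figure from the example above
ALERT_THRESHOLD = 0.05                         # assumption: alert on a 5% deviation


def relative_deviation(observed: int, baseline: int) -> float:
    """Return the fractional change of observed relative to baseline."""
    return abs(observed - baseline) / baseline


def volume_alert_fires(observed: int, baseline: int,
                       threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the deviation is large enough to trip a volume-based alert."""
    return relative_deviation(observed, baseline) >= threshold


observed = BASELINE_CONNECTIONS_PER_HOUR + EXTRA_C2_CONNECTIONS
deviation = relative_deviation(observed, BASELINE_CONNECTIONS_PER_HOUR)
print(f"relative deviation: {deviation:.1e}")
print("alert fires:", volume_alert_fires(observed, BASELINE_CONNECTIONS_PER_HOUR))
```

The extra connections change the hourly total by one ten-millionth, so any threshold loose enough to tolerate normal fluctuations will never notice them.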
All of the above examples are also possible indicators of compromise. And a good security monitoring tool absolutely will find them. In short, security monitoring is largely about finding needles in the proverbial haystack.
Finding such issues requires observing the smallest of changes and analyzing them to see if they fit any known patterns. An obvious challenge with that approach is that small changes are quite common. Therefore, a good security monitoring tool has to look at a very large volume of small changes to make sense of them. In my previous blog post, I explained how Lacework does just that.
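As a rough illustration of checking individual small changes against known patterns, consider the sketch below. It is not how Lacework or any specific product works; the event shape, the known-bad destination list, the per-credential region history, and all names and addresses are hypothetical.

```python
from dataclasses import dataclass

# All names and values below are illustrative assumptions.
KNOWN_BAD_DESTINATIONS = {"203.0.113.7"}        # hypothetical threat-intel list
USUAL_REGIONS = {"ci-deployer": {"us-east-1"}}  # hypothetical per-credential history


@dataclass(frozen=True)
class Event:
    kind: str    # "connection" or "api_call"
    actor: str   # VM name or IAM credential name
    detail: str  # destination IP for connections, region for API calls


def findings(events):
    """Yield (event, reason) for every event that matches a known pattern."""
    for event in events:
        if event.kind == "connection" and event.detail in KNOWN_BAD_DESTINATIONS:
            yield event, "connection to known-bad destination"
        elif event.kind == "api_call":
            usual = USUAL_REGIONS.get(event.actor, set())
            if event.detail not in usual:
                yield event, "API call from unusual region"


events = [
    Event("connection", "vm-0412", "198.51.100.20"),
    Event("connection", "vm-0412", "203.0.113.7"),
    Event("api_call", "ci-deployer", "us-east-1"),
    Event("api_call", "ci-deployer", "ap-southeast-2"),
]

flagged = list(findings(events))
for event, reason in flagged:
    print(f"{event.actor}: {reason} ({event.detail})")
```

Note that every event is inspected on its own, no matter how little it contributes to overall volume. That is what makes the approach expensive at scale, and why it differs from aggregate, threshold-style checks.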
The data processing required to focus on small changes has to be designed very differently from the processing used to discover things that break SLAs. The vast majority of security issues do not break SLAs, and the bulk of things that break SLAs are not security issues. Because of all this, if you only have a performance monitoring tool, now would be a great time to also invest in a powerful security monitoring platform.