At Go Daddy, we measure a lot of things, including up-time, latency, time to provision, and a whole slew of other data. We also keep track of various metrics in the Information Security (InfoSec) department, from blocked attacks per day to virus detections per workstation. We do a lot of “what happened” measurements. We even have our own internal Reputation Feed based on the events we collect from our security gear. Not only does this help us understand our current posture, it also tells us whether a change we made is making a difference.
What it doesn’t help us with is looking forward and addressing threats before they become a problem in our network. For example, if a single IP begins exhibiting malicious activity on our network, it increments a counter in our metrics indicating we detected or blocked it. That information then leads to some type of investigation by our SOC team or triggers an automation tool that takes action for us.
In a network our size, you lose a lot of detail in the swath of event data that we collect. It can be difficult to find the little nugget that will help us root out bad guys before they do bad things. This really is the ultimate goal for InfoSec. If we could get Tom Cruise and his pre-cogs to work IT Security, well, we’d all at least sleep better… for a while.
One thing we have always wanted to measure and act upon is how quickly a threat is getting worse or better. If we can determine a threat is going to be a problem in 24 hours, why not address it now? While I was reading Security Metrics, trying to come up with some new metrics for us to measure, I thought back to a physics class and a potential solution: Acceleration.
We currently have an escalating method of stopping brute force attacks within our network. If an attacker fails to log in x times within y minutes, we block that attacker for a designated period of time. If they continue, we increase the amount of time. This works great. However, an attacker can eventually ramp up to millions of failed log in attempts within a few days, and we have seen them do it. In such cases, this method is not optimal and appears to be only a minor inconvenience to attackers with compute resources.
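For illustration, here’s a minimal sketch of that kind of escalating block logic in Python. The thresholds (5 failures in 10 minutes, a 5-minute base block that doubles on each repeat offense) and names are my own placeholders, not our production settings:

```python
from collections import defaultdict, deque
import time

# Placeholder thresholds, not production values.
MAX_FAILURES = 5        # x failed log ins...
WINDOW_SECONDS = 600    # ...within y minutes (10 here)
BASE_BLOCK_SECONDS = 300

failures = defaultdict(deque)   # ip -> timestamps of recent failures
block_count = defaultdict(int)  # ip -> how many times we've blocked it

def record_failure(ip, now=None):
    """Record a failed log in; return a block duration in seconds if the
    IP should be blocked, doubling the penalty on each repeat offense."""
    now = time.time() if now is None else now
    q = failures[ip]
    q.append(now)
    # Drop failures that have fallen out of the sliding window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_FAILURES:
        q.clear()
        block_count[ip] += 1
        return BASE_BLOCK_SECONDS * 2 ** (block_count[ip] - 1)
    return None
```

The sliding window keeps only recent failures, so a slow trickle of mistakes from a real customer never trips the block; only the escalating penalty punishes repeat offenders.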
We want to get more predictive and aggressive about keeping attackers at bay.
Let’s say we want to calculate the acceleration of an IP address based on all of the alerts we have for it over the past five days. If we had 10 alerts in day 1, our starting velocity is V0 = 10 events per day, and over the five days we have 2,000 alerts for the IP, an average of V = 2,000 / 5 = 400 events per day. Our equation would be a = (V − V0) / t = (400 − 10) / 5 = 78 events per day, per day. That number is interesting as we can see the IP is clearly accelerating its attacks. But, where will it be in 36 days if we let it go?
V = V0 + at = 10 + (78 × 36) = 2,818 events per day, or about 117 events per hour. This may seem like a low number of events. But, when you consider that we collect 80K events per second, it can be hard to weed out those events. When we apply acceleration to failed log ins, these numbers get very big, very quickly.
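Assuming velocity is the five-day average (2,000 / 5 = 400 events per day) and a = (V − V0) / t, the arithmetic is only a few lines of Python (the function names are just for illustration):

```python
def acceleration(total_events, days, day_one_events):
    """Average velocity minus starting velocity, over the window in days."""
    v_avg = total_events / days
    return (v_avg - day_one_events) / days

def projected_velocity(day_one_events, accel, days_out):
    """v = v0 + a*t: projected events per day after days_out more days."""
    return day_one_events + accel * days_out

a = acceleration(2000, 5, 10)        # 78 events per day, per day
v36 = projected_velocity(10, a, 36)  # 2,818 events per day
per_hour = v36 / 24                  # ~117 events per hour
```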
Let’s work backwards from our example case.
For Attacker A, we recorded 1,500,000 failed log in attempts over five days with 500 failed log in attempts on day one. That is an average velocity of V = 1,500,000 / 5 = 300,000 failed log ins per day, which gives a = (300,000 − 500) / 5 = 59,900. Wow! ~60,000 failed log ins per day, per day. That’s way too high for my taste. Let’s say our threshold is 250 failed log ins, measured in 12 hour increments, or 500 per day. How many hits do we allow before I say, “Enough,” and have a piece of code take action? If the attacks do not accelerate beyond that threshold pace, it will take 3,000 days for us to get to 1,500,000 failed log ins. That is not efficient for even the most laissez-faire of attackers.
On day 1, Attacker A has broken the threshold with 500 failed log ins. But, I don’t want to snap to judgment. This could be a customer with a misconfigured script. On day 2, Attacker A has 1,500 failed log ins. Hmm… that’s getting worse. But, again, that could be the same customer’s misconfigured script. On day 3, Attacker A has 2,000 failed log ins. Now that’s starting to look a lot less like a real customer. Attacker A now has 4,000 failed log ins at a velocity of 1,333 failed log ins per day and is definitely headed in the wrong direction. We can now take action and greatly slow down this Attacker.
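That day-by-day decision could be sketched like this; the daily threshold, the grace period before we judge, and the function name are all illustrative placeholders:

```python
DAILY_THRESHOLD = 500  # 250 failed log ins per 12-hour increment
GRACE_DAYS = 3         # don't snap to judgment on a misconfigured script

def should_block(daily_failures):
    """Scan day-by-day failure counts; return (day, velocity) once the
    running velocity breaks the threshold after the grace period."""
    for day in range(1, len(daily_failures) + 1):
        total = sum(daily_failures[:day])
        velocity = total / day  # failed log ins per day so far
        if day >= GRACE_DAYS and velocity > DAILY_THRESHOLD:
            return day, velocity
    return None

# Attacker A's first three days: blocks on day 3 at ~1,333 per day.
print(should_block([500, 1500, 2000]))
```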
Of course, your mileage may vary depending on how you implement this in your environments. But, I’m finding that determining how quickly a threat is becoming a problem, and dealing with it quickly, frees up time and resources within our network to deal with the smaller issues that may be happening.
How does acceleration look in your network? Does it help bubble up anything interesting to the top of the “Take Care Of” list?