Blue Team Challenge

30 October 2018

There are a number of extremely difficult challenges in running a successful Blue Team, or defensive security operations team. These range in magnitude from simply keeping track of everything that is going on, to building better soft skills and relationships with interdependent teams (think networking, infrastructure, etc.), all the way to the fact that one missed clue could lead to a serious breach. Added to these challenges is the fact that most blue teams are designed around zombie console jockeys keeping "eyeballs on glass," staring at mind-numbing alerts for their entire shift. These factors combine to create a toxic soup of stress, ineffectiveness, and ultimately failure.

The Right Way to Blue Team

One of the first things a blue team should recognize is that they are going to fail. The next breach is, in fact, inevitable. The measure of a blue team is often framed in terms of whether or not they stop a breach, which is an unquantifiable metric of success. Without collaborating with the bad guys it's impossible to tell whether a blue team actually performed effectively and stopped a breach. A good purple team exercise can simulate this (a red team attempts to break in while the blue team tries to stop them, and both sides capture lessons learned and identify gaps), but even then it is only a point-in-time measure of success rather than evidence of overall success.

If the blue team is doomed to eventually fail, how then do we measure blue team success? I propose that the measure of an effective blue team should be based on how they handle a breach, not whether or not they stop one. Of course every effort should be made to stop breaches, using threat models, defense in depth, and other tools of the trade, but since these are never 100% effective it becomes critical to develop metrics that demonstrate success in the face of adversity.

How a blue team responds to a breach can be measured in any number of ways. How long did it take to recognize the breach? What was the containment timeframe? How did after action reviews lead to improvements and help to prevent future breaches? Were there refinements that could be made to runbooks, detection strategies, or training? What other gaps were identified as a result of the breach? And so forth.
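
As a minimal sketch of what tracking two of these response metrics might look like, assuming each incident is simply recorded with a start, detection, and containment timestamp (the field names here are illustrative, not a real schema):

    # A minimal sketch of breach-response metrics, assuming incidents are
    # recorded with timestamps for compromise, detection, and containment.
    # Field names are illustrative, not a real incident schema.
    from dataclasses import dataclass
    from datetime import datetime
    from statistics import mean

    @dataclass
    class Incident:
        started: datetime    # best estimate of initial compromise
        detected: datetime   # when the blue team recognized the breach
        contained: datetime  # when the breach was contained

    def response_metrics(incidents):
        """Return mean time to detect and mean time to contain, in hours."""
        mttd = mean((i.detected - i.started).total_seconds() for i in incidents) / 3600
        mttc = mean((i.contained - i.detected).total_seconds() for i in incidents) / 3600
        return {"mean_time_to_detect_hours": mttd,
                "mean_time_to_contain_hours": mttc}

    incidents = [
        Incident(datetime(2018, 9, 1, 2, 0), datetime(2018, 9, 1, 9, 30), datetime(2018, 9, 1, 14, 0)),
        Incident(datetime(2018, 10, 5, 11, 0), datetime(2018, 10, 5, 12, 15), datetime(2018, 10, 5, 16, 45)),
    ]
    print(response_metrics(incidents))

The point isn't the specific numbers; it's that trends in detection and containment time are something a team can own and improve, breach after breach.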

Vital to being able to collect this information is a blame-free after action, or root cause analysis. In this exercise, parties involved in the breach gather to discuss what went right, what went wrong, and what could be improved, without casting blame or pointing fingers. The after action must be a safe space where everyone feels free to contribute and no one feels judged. Far too often these exercises turn into mud-slinging sessions, which are unhelpful and depress contribution. After a breach the entire organization needs to recognize the breach as an organizational failure, not an individual or group failure. Thus the after action should be a team-building exercise to determine how the entire organization can do better.

Trust is an essential, and often missing, ingredient in this entire process. If leadership doesn't trust their security teams, if security managers don't trust their engineers and analysts, or if other trust gaps exist, a tendency towards blame and suspicion creeps in. When a trust vacuum exists, blue team members become incentivized to hide data and suspicions rather than bring them forward. For instance, suppose an analyst spots an anomaly during their shift but dismisses it as a false positive, and then, in the time between that shift and their next, begins to suspect they drew the wrong conclusion. They face a dilemma. Should the analyst bring the incident to the attention of their manager? Should they investigate further? If they find they made the wrong call and there was a security event, does raising that issue paint the analyst as a failure? In an ideal environment data should always be open and exposed to everyone, but in far too many environments blue team members are incentivized to hide information because open information invites scrutiny, and if scrutiny leads to retribution then it's safer to hide mistakes than to examine them openly.

Keeping Track

Blue teaming, and especially security operations, is all about keeping track of security events and priorities. In a typical security operations center the staff face challenges ranging from things as innocuous as users forgetting passwords to things as complex as distributed denial of service attacks, and everything in between. Not only that, but the blue team must often collaborate to accomplish goals and hand off initiatives, tasks, and even whole projects to other team members or other shift workers. This logistical nightmare invites mistakes. Every human-driven process is a potential point of failure. People are bad at keeping track of complex, competing tasks with shifting priorities. It's all too easy to forget about a low-priority task when juggling multiple high-priority tasks and then fail to recall it when bandwidth becomes available.

Using automated systems to track tasks and initiatives in a blue team is vital. It is important for managers to be able to track how heavily subscribed their team members are, for team members to track their backlog and accomplishments, and for the team to hand off tasks while ensuring consistency of work. While the exact methodology of resource and task tracking doesn't necessarily matter, it is critical that the tools chosen meet all the needs of the blue team. Any gap between the tooling and the requirements invites disaster since, again, relying on humans to track complex tasks over time, even if done accurately 99.9% of the time, will eventually introduce an error, and each error could be costly to the organization.
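
To make the idea concrete, here is a toy sketch (names and fields entirely hypothetical) of the kind of structured record such tooling should capture: an owner, a priority, and an auditable handoff trail, so nothing depends on an analyst's memory at shift change. A real team would get this from a ticketing or case management system rather than writing it themselves.

    # A toy sketch of structured task tracking for a blue team: each task
    # carries an owner, a priority, and a handoff history so shift changes
    # leave a trail. Names and fields are illustrative only.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class BlueTeamTask:
        title: str
        owner: str
        priority: int                      # 1 = highest
        status: str = "open"
        history: list = field(default_factory=list)

        def hand_off(self, new_owner, note=""):
            """Record a shift handoff instead of relying on memory."""
            self.history.append((datetime.utcnow(), self.owner, new_owner, note))
            self.owner = new_owner

    def open_backlog(tasks):
        """Open tasks ordered by priority, so low-priority work is never lost."""
        return sorted((t for t in tasks if t.status == "open"), key=lambda t: t.priority)

    task = BlueTeamTask("Review anomalous VPN logins", owner="analyst_a", priority=3)
    task.hand_off("analyst_b", note="Needs follow-up with the network team")
    print(open_backlog([task])[0].title)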

Organizations that under-resource their blue teams with respect to tools are setting themselves up for disaster. This situation is inexcusable given the multitude of Open Source products that blue teams can use, from purpose-built tools like MozDef to general purpose tools like MediaWiki. There are excellent commercial tools for security event tracking, ticketing, automation, and orchestration as well. Any organization that starves its blue team for resources, or forces them onto poorly fitting "common solutions," is inviting disaster and frustration.

Eyeballs on Glass is a Horror Show

Personally, I find that the common refrain that organizations need "eyeballs on glass" always conjures up that famous scene in the movie Blade Runner demonstrating the failure of biometric authentication. The idea behind "eyeballs on glass" is that people are always watching security logs, even at 3AM, so that they can instantly respond to security events and double encrypt the firewall virtual blockchain to lock the hackers out. The idea is flawed for a number of reasons, so I'll only spend time on the most obvious ones.

People don't function well staring at the same interface for long periods of time. Attention drifts, and there is a wealth of academic research demonstrating how wrong-headed it is to expect people to do dull work for long periods of time and still be effective. If the desire is truly to pursue this model, the industry should at least take its lessons from air traffic control, not IT operations.

Furthermore, I would argue that there's no need for 24/7 human monitoring of security logs. If a task is dull, repetitive, and mindless, it really should be done by a computer. Automation allows you to scale blue team work and provide consistent results over time. If you're relying on a human gatekeeper then results will vary depending on how caffeinated that person is, whether or not they're distracted by personal problems, or heaven forbid whether they're on a bio-break. Organizations should reserve their humans for higher order concerns. Simply trawling through endless logs looking for security events should be automated, or done as a hunt exercise, not part of daily operations. Ideally algorithms should sift through logs looking for items of true interest that need to be investigated by a human being who can bring the inductive reasoning necessary to solve mysteries that a computer algorithm cannot.
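
As a trivial illustration of handing the dull part to a machine, the sketch below (log format, threshold, and IP addresses are all hypothetical) counts failed logins per source and surfaces only the outliers, leaving the human to investigate the handful of items of true interest rather than the entire log stream.

    # A trivial illustration of automating dull log review: count failed
    # logins per source and surface only the outliers for a human to
    # investigate. The log format and threshold are hypothetical.
    import re
    from collections import Counter

    FAILED_LOGIN = re.compile(r"Failed password for .* from (?P<src>\d+\.\d+\.\d+\.\d+)")
    THRESHOLD = 20  # arbitrary; tune to your environment

    def suspicious_sources(log_lines, threshold=THRESHOLD):
        """Return source IPs whose failed-login count meets or exceeds the threshold."""
        counts = Counter()
        for line in log_lines:
            match = FAILED_LOGIN.search(line)
            if match:
                counts[match.group("src")] += 1
        return {src: n for src, n in counts.items() if n >= threshold}

    sample = ["Oct 30 03:12:01 sshd[999]: Failed password for root from 203.0.113.7 port 55555 ssh2"] * 25
    print(suspicious_sources(sample))   # only this source crosses the threshold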

Another item to note is that security response in the middle of the night, or any off hours, is going to be messy. Off hours system owners aren't available for consultation, necessary teams aren't in the office to discuss strategies, and users may not be around to validate corrective measures. Unless your security fixes are tied to a configuration management system that can easily be rolled back you probably don't want your security team making changes and responding to security events without warning and when the rest of the organization is asleep. In the same way organizations shouldn't tolerate sudden IT infrastructure changes without proper coordination and collaboration, making sudden security changes can result in disaster. In most cases organizations are better off waiting to make changes when responding to a security incident until normal business hours when stakeholders can be informed, weigh in, and help with logistics like communications. Of course there's the nightmare scenario of rapidly spreading ransomware, but I would argue that even in those situations humans cannot respond rapidly enough to halt the malware and that work should, if effectiveness is desired, be automated.
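
To illustrate what automating that ransomware case might mean, the rough sketch below watches for a burst of suspicious file renames that looks like encryption in progress and calls an isolation hook. The event feed, thresholds, and the isolate_host() function are placeholders for whatever EDR or configuration management integration an organization actually has, not a recommendation of a specific product or detection logic.

    # A rough sketch of automated ransomware containment: if a host renames
    # an unusual number of files to a suspicious extension within a short
    # window, quarantine it. isolate_host() and the event feed are
    # placeholders for a real EDR / configuration management integration.
    from collections import defaultdict, deque
    from datetime import datetime, timedelta

    WINDOW = timedelta(seconds=60)
    RENAME_LIMIT = 500                     # arbitrary threshold
    SUSPECT_EXTENSIONS = {".locked", ".encrypted", ".crypt"}

    recent = defaultdict(deque)            # host -> timestamps of suspicious renames

    def isolate_host(host):
        """Placeholder: call your EDR or network gear to quarantine the host."""
        print(f"[{datetime.utcnow().isoformat()}] isolating {host}")

    def handle_rename_event(host, new_path, when):
        """Feed each file-rename event here; isolate the host on a burst."""
        if not any(new_path.endswith(ext) for ext in SUSPECT_EXTENSIONS):
            return
        events = recent[host]
        events.append(when)
        while events and when - events[0] > WINDOW:
            events.popleft()
        if len(events) >= RENAME_LIMIT:
            isolate_host(host)
            events.clear()

The key design point is that the containment action is pre-approved and reversible, so the machine can act at 3AM while the humans review the decision during business hours.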

Conclusion

Of course, every security team must be tailored to the organization it serves. Some enterprises run 24 hours a day, locally or globally, and may require around-the-clock security teams simply to handle day-to-day security operations. The vast majority of organizations do not, however, need the generally accepted notion of what a blue team should be. The industry's slavish devotion to "conventional wisdom" indisputably results in bad solutions for many organizations, but almost nowhere more so than with blue team organization and structure.

It is vital to pair approaches to blue team with quantifiable requirements for an organization and to recognize that one-size-fits-all will not work the vast majority of the time. The bespoke blue team suits the organization it serves, and the organization sets the team up for success through realistic requirements, trust, and resources. By taking a pragmatic approach to security, organizations stand to make measurable gains and avoid the failures of investment and return that plague so many entities today.