eBook

Threat modeling, incident command, & DDoS mitigation

In any sufficiently complex system, failure is impossible to rule out. Whether you’re fighting literal fires or keeping your application online, it’s critical you have a plan in place when things go south. In this ebook, you’ll learn key strategies for finding and mitigating threats from veteran incident responders and security experts. This metaplaybook draws from real-world experiences and tactics, offering your team the tools necessary to implement your own response to impending threats.

Read this exclusive security ebook to learn:

  • How to effectively find and evaluate all threats (both external and internal)
  • Tips and tricks from veteran incident responders (including firefighters)
  • How to identify and mitigate DDoS attacks
  • Tools for empowering your teams during a crisis

About the authors

Maarten Van Horenbeeck, VP of Security Engineering, Fastly
Dr. Jose Nazario, Director of Security Research, Fastly
Jonathan Foote, Senior Security Architect, Fastly
Anna MacLachlan, Content Marketing Manager, Fastly



Threat modeling, incident command, & DDoS mitigation

Introduction

In any sufficiently complex system, failure is impossible to rule out, whether you’re fighting literal fires or striving to keep your applications online. Failure can have a wide-ranging impact, from undermining end users’ trust to lost revenue. Fastly operates a large internetwork and a global application environment responsible for 1 terabit per second of internet traffic; as a result, we face many threats, from both a reliability and a security perspective. As an organization, we’ve deliberately put in place a robust system of threat modeling, incident command, and DDoS mitigation that allows us to rapidly identify and mitigate threats that come our way. We have these measures in place to ensure maximum team efficiency during incidents, with the overarching goal of working quickly and carefully to minimize risk and keep our customers online.

We also recognize the challenges that go with this, including communication, support, and proper delegation — making sure your sphere of responsibilities is taken care of while resisting the urge to jump in yourself, and empowering your teams to make smart decisions under pressure. We’ve designed our response to incidents to make the most of our team’s expertise, with the cornerstone belief that the people we hire are inherently intelligent and aware of their roles within the team, and are thus empowered to make decisions effectively.

In Sources of Power, Gary Klein studies groups from all walks of life — from firefighters and army generals to doctors and chess grandmasters — analyzing how they make critical decisions under pressure. While we’ll explore his numerous findings in the next few chapters, one key takeaway is that we must trust our teams to make good decisions, having empowered them with the appropriate tools to do so. Instead of trying to construct fail-proof systems (they don’t exist), we should trust the inherent competence of those systems’ operators. Part of building that trust is making sure these operators “have the tools to maintain situation awareness throughout the incident”1 — these tools include having a threat modeling system in place to identify threats before they happen, and an incident command structure to address threats when they do arise (as they will).

That said, you’re not going to have a runbook for every scenario — you need to have a process for coming up with solutions for unexpected situations, and you can’t be paralyzed when making a time-sensitive decision. Klein’s studies of emergency responders showed that they don’t compare various options against each other, but quickly evaluate courses of action by imagining how they might be carried out.2 There is no rigid and careful analysis when fighting a fire — such measures would paralyze responders while they carefully weighed each option, wasting valuable time. Rather, firefighters evaluate incidents as they arise, opting for the best course of action without analyzing all available actions — “the emphasis is on being poised to act rather than being paralyzed until all the evaluations have been completed.”3

There’s no one way to approach incidents within your organization. Don’t think of this text as a playbook to be followed rigorously, but rather as a metaplaybook: it’s a way to think about how threats are identified and mitigated under pressure, without having to follow a step-by-step guide that would waste valuable time and resources. We put these structures in place at Fastly to empower our teams to work quickly to mitigate threats, not be paralyzed by the decision-making process. By providing the proper tools and training, we can fully trust our teams to make effective decisions as threats (like DDoS attacks) arise. Read on to learn how.

Chapter 1: Threat modeling

Before your business is faced with an incident, it’s critical to have a plan in place — threat modeling empowers you to find and evaluate threats before they happen, setting your team up for success and protecting your organization.

Threat modeling is a procedure for optimizing security by identifying objectives and vulnerabilities, and then defining countermeasures to prevent, or mitigate the effects of, threats to the system.c1.1 By imagining how a scenario might play out beforehand, we equip ourselves to address threats under pressure. As Klein puts it, mental simulation makes for effective decision making, helping us “generate expectancies by providing a preview of events as they might unfold and letting us run through a course of action in our minds so we can prepare for it.”c1.2 For example, it’s how one rescue commander determined the best course of action for removing an unconscious man from a crashed car: he noticed the impact to the posts that supported the car’s roof, and imagined lifting the roof off the car, instead of using the jaws of life, which would have been difficult due to the damaged doors. The team proceeded with the course of action he envisioned, ultimately leading to a successful rescue.c1.3

For businesses, threat modeling early in the development of a new product or feature can help ensure security is considered throughout the product lifecycle. By building things with security in mind, you can ensure it’s baked into everything you do and produce. Taking a proactive approach and imagining possible threats before they arise not only results in more securely designed systems, preventing vulnerabilities from being included in a release (and having to fix them later), but also prepares you for failure when it happens.

At Fastly, we define a threat as a potential or actual adverse event, either malicious (such as a denial-of-service attack) or incidental (such as the failure of a storage device), that could compromise or disrupt our service (and potentially affect our customers). For example, we evaluated various potential threats in preparation for the 2016 presidential election. Due to the high-profile nature of the event, we expected to see anything from surges in legitimate traffic for our customers reporting the news — The New York Times ended up seeing an 8,371% increase in traffic as readers rushed to check results on election nightc1.4 — to attacks with ideological motivations.

Our threat modeling involved imagining the various scenarios, and how we would react, as well as asking questions such as, “How do we determine legitimate versus nefarious traffic?” and “What will our approach be if one of our customers is DDoSed?” Outlining the possible scenarios before a major cultural event enables us to prepare for both recognizable and new situations in advance, putting the proper tools and methods in place before a crisis occurs — like firefighters ensuring they have the proper equipment before they receive the emergency call.

Some basic goals for threat modeling are to:

  • Identify threats to the target, and thereby to the overall security of the system and the company as a whole.
  • Create living artifacts that can be used in subsequent threat modeling activities and for future reference.
  • Be efficient: strive to use resources judiciously.
  • Develop a threat model as early as possible and refine the model regularly, either tying it to part of the development process or scheduling periodic refinements.
  • Be collaborative: keep an open line of communication, and encourage finding threats and issues — taking special care to avoid negativity toward others and the work they have done.

With our vulnerability handling processes taken into account, we’ll use a simple four-step process along the lines of those described by Shostack (2014).c1.5 (A minimal code sketch of this loop follows the list.)

  1. Model system
  2. Find threats
  3. Address threats
  4. Validate
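
To make the loop concrete, here is a minimal sketch in Python of how one iteration might be captured as plain data plus a driver function. The class and field names are our own illustration, not a prescribed tool or Fastly’s internal tooling.

    from dataclasses import dataclass, field

    @dataclass
    class Threat:
        summary: str           # simple, threat-focused description
        impact: str            # what the threat could result in
        risk: str              # e.g., "high", "medium", "low"
        action: str = ""       # what we're going to do about it
        validated: bool = False

    @dataclass
    class ThreatModel:
        system: str                                      # the system being modeled
        model_notes: list = field(default_factory=list)  # DFDs, text, lists
        threats: list = field(default_factory=list)

    def iterate(model, new_notes, found_threats):
        """One pass: model the system, find threats, address them, validate."""
        model.model_notes.extend(new_notes)            # 1. model system
        model.threats.extend(found_threats)            # 2. find threats
        for t in model.threats:
            if not t.action:                           # 3. address threats
                t.action = f"Open a tracking ticket for: {t.summary}"
            t.validated = bool(t.action and t.risk)    # 4. validate
        return model

    tm = iterate(ThreatModel("billing API"),
                 ["DFD-0 drafted"],
                 [Threat("DDoS against the public API",
                         "customers cannot complete purchases", "high")])

In practice the “address” step is a ticket in your task tracker rather than a string, but the shape of the loop is the same.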

Note that while we’ll move through these in order when describing each step, threats in the real world often don’t emerge (and aren’t addressed) in a linear fashion. We learn more about the system and the threats to it as they’re happening, so iterating and jumping around the steps makes for effective decision making.c1.6

Identifying and addressing threats to a system is never done — new threats will always arise. While you should aim to be as thorough as possible, it’s better to prioritize covering the whole of the target system and finding risky or time-sensitive threats. Burning all available resources on taking a very detailed look at a subset of a system while completely ignoring other aspects can be tempting, but it does little to raise assurance in the overall security of the system. For example, an emergency responder might prioritize putting out a fire, while failing to observe a structural fault in the roof of the building — if the roof collapsed, it would trap all those inside and render any firefighting moot.

At Fastly, this sort of “rabbit holing” may manifest itself in various ways, two of which include dwelling only on the sophisticated attacker (a malicious insider, or a nation-state actor) while ignoring the pedestrian, opportunistic attacker, or focusing on only one component of the system.

Identifying threats: what could go wrong?

Instead of drilling down on a single point of failure, use brainstorming techniques to go broad, aiming for coverage of the system, its dependencies, and its avenues of interaction. Surface as many threats as possible, then review them and identify themes or common elements. From there you can begin to factor in each threat’s likelihood of being realized and its impact if it is.

The threat modeling exercises should take into account not only where threats may arise and how to mitigate them, but also how to detect these threats and make signs of them obvious and indelible. How might your organization observe the system for signs of a threat and determine if it was successful? These steps must be a part of any threat modeling exercise. Keep “What could go wrong?” in mind when searching for and evaluating potential threats.

Brain dump

Have system experts or engineers who have been at the organization for a while describe any known threats to the system. Note: this activity may elicit threats to other systems, and these should certainly be recorded (just outside of the threat model for this system). Once the well of known issues runs dry, consider expanding the discussion to include known issues with other aspects of the system (performance, maintenance, etc.) — these often have security implications as well. The topic could be expanded further to include any “big fix” other members of your organization might have in mind. Broader discussions may elicit more threats, may reveal upcoming mitigations (or emerging threats), and could provide insight into how you could knock out swathes of issues at once and feed larger security architecture decisions. And while traditional brainstorming isn’t always the most effective, research from Northwestern University showed that “individuals are better at divergent thinking — thinking broadly to generate a diverse set of ideas — whereas groups are better at convergent thinking — selecting which ideas are worth pursuing.”c1.7 Have your team members bring their own potential threats to the table, and then work together to figure out which ones are worth further examination.

Start with external entities

Anything that interacts with the system that’s being modeled is an external entity. This can include humans, applications, or other systems. Walk through each external entity and discuss how they interact with the system in detail. Try to make sure all aspects of the interaction are covered — be creative. As part of this discussion, ask “What could go wrong?” and record the results. For example, a doctor may take into account environmental factors such as diet and living conditions when attempting to diagnose a patient, instead of focusing solely on a congenital disorder to explain the patient’s symptoms. This doctor would ask herself, “What possible factors could lead to chronic migraines?” and go from there.

List assets

In order to identify sensitive areas of the system, we’ll list assets subject to threats that the system stores or handles. If you find yourself making a super long list of assets as part of threat modeling, it may be worthwhile to consider practical aspects — if all of the files that you are listing have the same exposure and are handled in the same manner, you could consider grouping them. For example, instead of listing each and every house in the path of a fire, emergency responders might group them by neighborhood. Just be careful that you don’t overlook something that creates (or amplifies) a threat, such as a gas station among the houses in the fire’s path. Whether or not you use this technique, recording a list of assets is generally a worthwhile activity — a catalog of assets is valuable for not only the threat model but operations, maintenance, security monitoring, and more. It’s a lasting output of this exercise. Be sure to record this list alongside the rest of the threat model.

STRIDE per interaction

STRIDE is a threat classification model developed by Microsoft for thinking about computer security threats,c1.8 and stands for:

  • Spoofing of user identity
  • Tampering
  • Repudiation
  • Information disclosure
  • Denial of service
  • Elevation of privilege

The idea is to walk through each data flow that crosses a trust boundary and take time to think of threats associated with it — although brainstorming on each of the STRIDE threat classes might not be necessary for all threat modeling activities, it’s an effective way to generate ideas if you’re stuck.
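
To make “STRIDE per interaction” concrete, here is a small Python sketch that walks every data flow crossing a trust boundary and emits one brainstorming prompt per STRIDE class. The flows listed are hypothetical examples for illustration, not a description of any real system.

    STRIDE = [
        "spoofing of user identity",
        "tampering",
        "repudiation",
        "information disclosure",
        "denial of service",
        "elevation of privilege",
    ]

    # (source, destination, crosses_trust_boundary) -- hypothetical flows
    data_flows = [
        ("end user", "public API", True),
        ("public API", "backend service", True),
        ("backend service", "config database", False),
    ]

    def stride_prompts(flows):
        """Yield a brainstorming prompt per STRIDE class for each boundary-crossing flow."""
        for src, dst, crosses_boundary in flows:
            if not crosses_boundary:
                continue  # only flows that cross trust boundaries get the full treatment
            for threat_class in STRIDE:
                yield f"{src} -> {dst}: what could go wrong via {threat_class}?"

    for prompt in stride_prompts(data_flows):
        print(prompt)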

Look at the issue tracker

This is probably best done outside of a threat modeling meeting, but looking through existing issues and supporting docs may reveal a host of known issues. Note that many issues that may seem benign to an employee working on a particular product may later reveal themselves to be security issues upon critical inspection. For example, a developer who discovers a crash may not realize the crash can be triggered by an attacker and is therefore security-related.

Get creative

There are lots of other ways to brainstorm threats: card games like Microsoft’s “Elevation of Privilege” gamec1.9 use a point system that lets you challenge other team members by becoming your opponent’s biggest threat, and walking through MITRE’s voluminous attack treesc1.10 helps you understand how adversaries operate by offering “a comprehensive dictionary and classification taxonomy of known attacks that can be used by analysts, developers, testers, and educators to advance community understanding and enhance defenses.”c1.11

Threat modeling teams

Threat modeling teams at your organization should, at a minimum, include a member of the security team who understands threats to the technology that comprises the analyzed system, and a system expert who has a thorough understanding of the system’s use, design, and implementation. Gaining the security team’s perspective can help the system experts understand well-known threats, while having the system experts’ perspective can help the security team discover latent issues with security implications that may not be obvious to an outsider.

Often covering all of this expertise requires more than two people, and in general, threat modeling benefits from having more than one expert on each of the required topics to encourage brainstorming, fact-checking, and offering an alternate perspective, such as having two firefighters accustomed to fighting forest fires and two from urban settings.

Providing a questionnaire of standard questions to the system experts prior to the meeting can surface information that helps everyone prepare — we record the details around threats in JIRA, our task management system. Our team reviews these details, using them to get a better understanding of the scope, the system, and possible threats prior to the actual threat modeling activity.

Storing threat models

Sometimes the simplest, most obvious tracking techniques are the most effective for brainstorming and creating threat models: whiteboards, for example, are a common and easy-to-update medium for everyone to use. At Fastly, we store the threat model in a wiki page after the initial meeting to make sure all the details are easily referenceable by our security team, and access to these pages is restricted to the security team and those responsible for the system to avoid excessive chatter. Additionally, our issue tracker is updated to include any actions that must be taken to address identified threats, so we’re aware not only of what needs to be done but also of who’s responsible.

In general, each threat model should include headings for:

  • Overview
    • High-level description of the system
    • Links to salient external docs
    • “Last updated” including who participated in threat modeling activities
  • System model
  • Threats

Modeling the system

Modeling the system is often tricky for companies with engineering teams that like to move quickly — common software artifacts are generally limited to the code, READMEs, and some wiki pages (if you’re lucky).

At Fastly, we take the traditional approach of using data flow diagrams (DFDs), text, and lists to model system activity. A DFD is a graphical representation of the “flow” of data through an information system, modeling its process aspects.c1.12 DFDs have simple, defined semantics and model the flow of data through the system in terms of behavior, allowing them to be annotated with trust boundaries (places where data flows from one trust level to another). According to Microsoft, “It is the action of applying the trust boundary that changes the DFD into a threat model. It’s important that you get these right, because that is where your team will focus its attention when discussing threats going forward.”c1.13

Order isn’t important, but often it’s easiest to create data flow diagrams based on use cases or activities of the system. We recommend including “passive” activities so that you don’t forget about non-manifest (latent) data flows in a system. For firefighters, these might include building structural deficiencies and wind direction, as well as more blatant factors such as the fire itself.

A DFD (or set of DFDs) should capture:

  • External entities: users, external systems out of scope for the model, etc.
  • Processes: behavioral (not static) components, the “code in motion” for the system
  • Data stores: important databases, files, etc.
  • Trust boundaries: places where there are different levels of trust on each side. For example, you’d trust internal traffic flowing through your network more than you would traffic from the public internet.

[Image: threat-modeling-ddos-mitigation-2]
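
The same information a diagram conveys can also be kept in a lightweight, text-based form next to the wiki page, which makes it easy to diff and update. The sketch below is one possible shape for that record in Python; the component names are invented for illustration.

    dfd = {
        "external_entities": ["unauthenticated user", "authenticated user", "administrator"],
        "processes": ["public API", "auth service", "backend worker"],
        "data_stores": ["customer database", "config files", "secret store"],
        # data flows recorded as (source, destination, protocol)
        "data_flows": [
            ("unauthenticated user", "public API", "HTTPS"),
            ("public API", "auth service", "internal RPC"),
            ("auth service", "customer database", "TLS"),
        ],
        # trust boundaries: boundary name -> components on the less-trusted side
        "trust_boundaries": {
            "public internet / edge": ["unauthenticated user", "authenticated user"],
            "edge / internal network": ["public API"],
        },
    }

    def flows_crossing_boundaries(model):
        """Return flows whose source sits on the less-trusted side of a boundary."""
        less_trusted = {c for members in model["trust_boundaries"].values() for c in members}
        return [flow for flow in model["data_flows"] if flow[0] in less_trusted]

    print(flows_crossing_boundaries(dfd))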

DFD guidance

Exactly what you capture should be gauged based on the level of abstraction of the discussion (and subsequently, the DFD) and practicality. Some basic guidelines:

  • You should be able to put a DFD on a presentation slide and tell a story about it — stories help us organize the cognitive world.c1.14 By linking ideas, concepts, objects, and relationships into stories, we’re able to imagine how a threat might play out. Keep it relatively simple. Remember, the point is to find and address threats, not create something you can compile and run.
  • You can functionally decompose DFDs into additional diagrams: if you go further than one level deep (or two at the most), you’re probably going overboard for our purposes. Here’s one way to do this:
    • Create a top-level “Context” DFD (AKA “DFD-0”) that captures the system as a whole. Be as abstract as necessary, then:
      • Create per-use-case DFDs that capture detailed flows, and/or
      • Break out any processes from the “Context” DFD that have trust boundaries cut through them (or are otherwise interesting)

Each DFD component should have a text description included with it, which should focus on security-relevant aspects of the component. This list isn’t exhaustive (and some of the items may be irrelevant to your system), but here are some examples to get the juices flowing:

  • External user:
    • Consider any RBAC, external threats, worries, etc.
    • Examples: authenticated users, unauthenticated users, trusted administrators
  • Processes:
    • Consider security-relevant functions, explain what it does, list key tech or known problems, why you should (or shouldn’t) worry about it, etc.
    • Examples: network daemons, APIs, backend services
  • Data stores:
    • Consider assets, key tech, protocols, etc.
    • Examples: database servers, configuration files, secret management systems
  • Trust boundaries:
    • Key tech, protocols, trust levels on either side of it, etc.
    • Examples: internal networks, external networks, localhost, protocols used to cross boundaries

Storing DFDs

Represent DFD graphics in a format that ensures the diagram can be:

  • Included in the threat model as a picture for quick reference
  • Stored alongside the rest of the threat model for future edits

Don’t artificially limit what you want to include with a DFD, but for the sake of consistency we stick to the basic representation for conventional DFD components:

  • Process: Circle
  • Process group: Double circle
  • Data store: Traditional database
  • Data flow: Directed arrow
  • Trust boundary: Red dotted line

To make future edits easier, we link the text associated with each component in a wiki table in the same section as the diagram.

Exit criteria

We can consider system modeling to be complete when the threat modeling team, which must include system experts, feels all salient aspects of the system have been represented in the data flow diagram and associated notes, and key stakeholders and experts have signed off. It’s important that no one feels pressured to approve until they’re comfortable with the model — otherwise, you risk going live with known flaws that only a few people know about (and know how to defend against). If a flaw has been surfaced and acknowledged by everyone, a wider audience knows about it. The risks may still be acceptable, but that wider audience may also bring context that adjusts the acceptance level and makes the flaw a show stopper.

Recording threats

After the system has been modeled and threats have been discovered, it’s important to record them to ensure they’re tracked and addressed appropriately. For the presidential election, we looked at attacks on integrity (content alteration, injection, or deletion) and on availability (DDoS attacks against publicly facing content services and backend APIs), and we determined which assets and services mattered more at certain points than others (such as GOTV, or get-out-the-vote, versus fundraising), in an effort to prevent blackouts or misinformation efforts from succeeding.

When recording your own threats, be sure to include:

  • The threat: a simple, threat-focused summary of the issue
    • A DDoS attack on a major news site
  • The impact: a quick discussion of what the threat could result in
    • Prevents readers from accessing the news and staying informed on the election results
  • The risk and quick rationale:
    • Risk assessment can be based on different criteria including existing standards (e.g. the Common Vulnerability Scoring Systemc1.15), mitigating factors, active exploitation, and potential impact on both users and the business
    • Be sure to summarize any key assumptions so we can revisit this threat if they change — e.g., a DDoS attack will target a US-based news site (versus UK-based)
    • Capture detail as needed, considering that this may be fleshed out as part of a JIRA ticket
  • Action (what we’re going to do about it)
    • Alert the proper teams and ensure we have the right tools in place (see following chapters for more on incident management and DDoS mitigation)

In general it’s a good idea to include addressed threats in the threat model itself for future iterations and to make sure assumptions aren’t broken. However, if exhaustively listing every threat that comes up would exhaust your resources before the threat model is finished, just write down the ones you’re unsure of. Once the initial model is created, it will be easy to add to the list later.

This list of threats should be included in a table alongside the rest of the threat model. Avoid adding too much detail here, however — the canonical location of risk and any other analysis associated with the threat should be confined to the task itself (as listed in your task management system).
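
As an illustration, a single recorded threat using the fields above might look like the sketch below. The values echo the election example; the ticket reference is a made-up placeholder, since the detailed analysis lives in the task tracker.

    threat_record = {
        "threat": "DDoS attack on a major news site",
        "impact": "Prevents readers from accessing the news and staying "
                  "informed on the election results",
        "risk": {
            "level": "high",
            "rationale": "high-profile event; active attack campaigns expected",
            "assumptions": ["the attack targets a US-based news site"],
        },
        "action": "Alert the proper teams and confirm mitigation tooling is in place",
        "ticket": "TRACKER-0000",  # placeholder; risk analysis stays in the ticket
    }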

Exit criteria

We can consider finding an individual threat to be complete when:

  • The threat modeling team has systematically covered all of the elements in the system model and thought critically in the context of each of them
  • All brainstormed threats have the associated Threat, Impact, and Risk recorded in the threat model

Addressing threats

This part is pretty easy from a process perspective: if there is something to do, we’ll create an issue in JIRA and triage it. In other cases, addressing threats might require more analysis, or something we haven’t thought of.

Some of the workflows associated with issue types don’t have a process associated with them — we often learn as we go. Regardless:

  • Ensure that a ticket is created for each threat that must be addressed, and is appropriately assigned.
  • Ensure the ticket links back to the threat model wiki page so we remember to update the threat model (and have an easy way to do it).

This step is complete when any threats that have a non-negligible risk have been documented and assigned in your task management system.

Completing an iteration of threat modeling

At Fastly, an iteration of threat modeling is complete when:

  • We’ve completed all of the steps above
  • All salient info is recorded to the wiki
  • All threats have associated JIRA issues
  • We have a rough idea of when we might revisit a threat model

For example, with the 2016 presidential election we actually included a handful of external stakeholders, including media, candidates, and even US-CERT and ISACs, in our threat-finding endeavors. In preparation for this major global event — and the possible threats that went with it — we looked not only at our own systems and the risks we face but also at external threat campaigns and their likelihood.

As we said before, finding threats is never done: it’s important to consistently revise the threat modeling process based on feedback from the threat modeling team and things you learn along the way. In the next chapter, we’ll apply our threat modeling to incident command — during which the threats that we’ve evaluated actually come into play, and we have to respond under time pressure.

Chapter 2: Incident command

Now that we’ve found and evaluated possible threats, it’s time to apply our methodologies to incidents in the real world. Incident command is the natural evolution of any good threat modeling activity: threat modeling helps you evaluate potential threats, and incident command empowers you to deal with them when they materialize. As an organization, we deliberately put in place a robust system that allows us to rapidly identify, mitigate, and contain incidents, and that ensures effective communication flows both within the company and with our customers. In this chapter, we’ll discuss the challenges a large global network faces, the protocols that we found helpful, and how you can apply them to your own organization.

Where to find inspiration

When you start developing a program, it’s always good to look at how other types of people have solved similar problems. As engineers, we often tend to specialize and think the power to solve a particular problem is in our hands. When we do that, we can forget a few things: Will we be able to ramp up engineers quickly enough? Will our partners, such as network providers, be prepared and ready to help us when we need them? How do we know whether people on the front line have the time and space to take care of basic needs (such as sleeping, eating, and de-stressing) during a prolonged incident?

It doesn’t take very long before you realize established systems must already exist elsewhere: there’s the Incident Command System (ICS) originally developed to address issues in the inter-agency response to California and Arizona wildfires, for instance, and the gold-silver-bronze command structure used by emergency services in the United Kingdom.c2.1 In addition, Mark Imbriaco’s “Incident Response at Heroku” talk from Surge 2011c2.2 was a huge inspiration for our initial framework.

While technology has its own unique characteristics (like the ability to automate responses), many of the issues faced by these other responders still affect us today, such as communication between teams, making difficult decisions under pressure, and establishing hierarchy during an incident. There was no need to reinvent the wheel: we took some of the best practices of those who came before us when developing our own incident command system.

Understand what you’re defending against

Another thing to understand well is the type of issues you’re likely to face. Emergency responders determine the types of incidents they might encounter depending on their region and environmental factors — responders in Montana will face different issues in the summer (wildfires) versus winter (blizzards, power outages). We refer to our system as the Incident Response Framework (IRF), which came to be a catch-all for any issue that could directly cause customer impact, such as site downtime. Over time, as we professionalized, the system started specializing as well, and there are now specific plans in place covering smaller issues that may not yet cause customer impact but may have the potential to do so in the future. In addition, a specific Security Incident Response Plan (SIRP) was developed that triggers on any issue that may have security repercussions.

We’ve engaged the IRF for security vulnerabilities that required immediate remediation, customer-visible outages, and issues with critical systems that support our network, such as our monitoring and instrumentation tooling.

When engaged, we identify the severity of an issue based on customer impact and business risk, as well as the length of time the issue has manifested itself (see the severity matrix, below). Based on the severity, the team owning the affected service may be paged, or an organization-wide incident commander may be allocated.

Identify the issue

Identifying an issue is critical — we have to know what we’re dealing with. Within Fastly, we have multiple mechanisms in place to monitor service-related issues, including open source monitoring tools such as Ganglia, log aggregation tools such as Graylog and Elasticsearch, and several custom-built alerting and reporting tools. Ensuring events from each of these services make it to our service owners and incident commanders in a timely manner is critical so we can mitigate or avoid any customer impact.

Severity matrix (severity / app delivery impact / business operations impact / scope of impact):

  • SEV0: Critical / Critical / All sites affected
  • SEV1: Critical / Critical / Multiple sites affected, or a single site unavailable or suffering from severe degradation
  • SEV2: Major / Major / Multiple sites affected, or a single site intermittently available or suffering from minor degradation
  • SEV3: Minor / Minor / Single site affected, or limited customer impact
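
A matrix like this can also be encoded so that tooling and humans classify incidents the same way. The function below is an illustrative sketch with simplified inputs; it flattens some of the judgment calls the real matrix leaves to the incident commander, and is not Fastly’s actual paging logic.

    def classify_severity(scope, degradation):
        """Rough reading of the severity matrix above.

        scope: "all_sites", "multiple_sites", or "single_site"
        degradation: "unavailable", "severe", "intermittent", "minor", or "limited"
        """
        if scope == "all_sites":
            return "SEV0"
        if degradation in ("unavailable", "severe"):
            return "SEV1"
        if degradation in ("intermittent", "minor"):
            return "SEV2"
        return "SEV3"

    print(classify_severity("single_site", "severe"))   # SEV1
    print(classify_severity("single_site", "limited"))  # SEV3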

Every team owning a critical service at Fastly maintains a pager rotation, and receives reports regarding their own services directly. Engineering teams, however, are empowered to classify and re-classify events as needed to ensure pager rotations do not become too onerous. Most of them develop their own integration that ensures we don’t suffer from alert fatigue on pageable events.
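
These integrations vary by team, but one common anti-fatigue pattern is to page on the first occurrence of an alert and suppress identical repeats within a time window. Here is a minimal sketch of that idea; the alert key and window are hypothetical, not how any particular Fastly team does it.

    import time

    class AlertDeduplicator:
        """Page on the first occurrence of an alert key; suppress repeats within a window."""

        def __init__(self, window_seconds=600):
            self.window = window_seconds
            self.last_paged = {}  # alert key -> timestamp of last page

        def should_page(self, alert_key, now=None):
            now = time.time() if now is None else now
            last = self.last_paged.get(alert_key)
            if last is not None and now - last < self.window:
                return False  # paged recently; suppress to avoid alert fatigue
            self.last_paged[alert_key] = now
            return True

    dedup = AlertDeduplicator(window_seconds=600)
    print(dedup.should_page("edge-cache:5xx-rate", now=0))    # True  -> page
    print(dedup.should_page("edge-cache:5xx-rate", now=120))  # False -> suppressed
    print(dedup.should_page("edge-cache:5xx-rate", now=900))  # True  -> window expired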

Events that do not lead to significant impact but could indicate wider problems over time are reviewed on a regular basis depending on their criticality, rather than leading to immediate action 24/7. As we covered last chapter, these can be recorded as part of modeling threats, for future exploration (and so they won’t distract from the incident at hand).

Ramp up the right people

At Fastly, we hire and develop teams with care and intention — bringing the right group of people together is critical for well-thought-out decisions under pressure. We make roles and functions clear beforehand, and empower individuals to make use of their own expertise and trust their instincts — a combination which makes for efficient teams who make effective decisions.c2.3 And, it’s important to remember the human element — we’ve seen attacks that have required all hands on deck, and it’s critical that we don’t burn people out. Engineers are humans too, and it’s important to account for basic needs — such as eating and sleeping — when faced with an incident.

Each team within Fastly designates a particular individual as being on call during a specific time slot. These people are primed to know that they will need to be more available and should keep their phones nearby. In addition, most teams that are critical for live services have a secondary engineer on call, who is also aware of his or her responsibility to jump in in case of a major incident. Beyond that, we maintain a few critical people on call for incidents that grow wider than a single team or have customer impact. The role of these individuals is different — they know they won’t be troubleshooting the issue directly, but they take on a number of critical roles that help ensure mistakes are minimized and the investigation progresses as quickly as possible. They will:

  • Coordinate actions across multiple responders
  • Alert and update internal stakeholders, and update customers on our status — or designate a specific person to do so
  • Evaluate the high-level issue and understand its impact
  • Consult with team experts on necessary actions
  • Call off or delay other activities that may impact resolution of the incident

Incident commander is not a role someone is ready to tackle when they’re new to an organization. We select incident commanders based on their ability to understand Fastly’s system architecture from the top down, so they have a mental model of how systems interrelate and impact each other — if you recall from last chapter, mental simulations and creating stories based on interrelated factors help decision makers envision the best course of action under pressure. Incident commanders are also well-versed in the structure of teams, so they know the right people to speak with and can reliably engage them. Finally, they’re excellent communicators, and are able to maintain a cool, calm, and structured approach to incident response.

Above all, the incident commander is just that — a commander. They have the responsibility of ensuring the response is coordinated and running smoothly. To accomplish this, they must be empowered to delegate responsibilities and task people. They should be comfortable in this capacity and confident in their own abilities and their team. In Sources of Power, Gary Klein describes how experienced teams are well aware of the various roles and functions — and know the consequences when those break down.c2.4 Take for example the fireground commander who learned to bring a milk box with him to fires: when he first became a commander, he’d often leave his post to help put out the fire or offer medical attention. But, whenever he left his post his crew members couldn’t find him when they needed a decision made, and wasted valuable time searching for him. Over time, his roles and responsibilities as a leader became clear to him, and he learned to keep a foot on the milk box to keep himself at his post. As Klein puts it, “He had realized the functions he served for his team and where he fits into the jobs of everyone else.”c2.5

A specific (positive) issue many incidents run into is volunteers — people who see a building on fire are often eager to help. The problem is, not everyone is equipped with the skill set for rescuing people, offering medical attention, or putting out fires. The same applies to your organization: when an incident is taking place, many of your employees will understandably want to help, even if they have no direct responsibility to be involved. When not properly managed, this can sometimes have negative effects: the environment can get overly chatty, or it’s not clear who has picked up specific work. We’ve learned that removing people from the incident is often counterproductive — it demotivates people who want to work. Instead, we try to find opportunities to manage these volunteers, and either have them work on less critical items or expand our scope of investigation beyond what we’d typically look at. This work happens in a different room from the main incident and is often coordinated by someone other than the main incident commander, but results are continually relayed back by a single individual.

Communicate your status

Communication is critical, both in how we communicate incidents internally as well as to our customers. Poor communication — or worse, the lack thereof — leads to confusion and inefficiency, two things we can’t afford when working quickly to assess incidents and keep our customers online. Both the method of how we communicate and what is communicated in these updates are important to consider — the tools you choose establish a framework for efficient incident response going forward, and what you communicate leads to fast, effective decision making.

Within Fastly, we use Slack and email as typical communication channels, and we use an external statuspage hosted by Statuspage.io to communicate status updates to our customers to avoid any circular dependencies (i.e., if our ability to deliver web assets was in any way impacted, we’d still be able to communicate to customers). Our goal is to quickly publish status notifications to keep our customers informed — as with internal processes, poor communication leads to confusion. By keeping our customers in the know, we help inspire trust while we work to mitigate.

Interestingly, some of the services we rely on are also Fastly customers. This means we can’t necessarily depend on them being online during every type of incident affecting our service. As a result, we’ve built up various backups, from our own IRC network to phone chains to alternative messaging tools, to ensure we can still communicate when primary systems are unavailable. We also worked with some critical vendors to ensure our instance of their service ran on versions of their product that were not hosted behind Fastly, again to avoid circular dependencies.

Over time, we learned that people at different levels of the company need different information about an incident. In security incidents in particular, we assemble a specific group of executives who need to be briefed on very specific qualities of the incident — whether customer information was disclosed, or whether any critical systems were compromised.

Hence we’ve developed our processes to ensure incident commanders know what needs to be communicated, and to whom. During large incidents, quite often the incident commander will delegate ownership of communication to a dedicated resource to avoid over- or under-communicating an incident, which can erode the trust our customers place in us or lead to bad decisions.

Always improve

Each incident, as minor as it may seem, is logged in our incident management ticketing system. Within 24 hours after the incident, the incident commander will work with her or his team to develop an incident report, which is widely shared across the organization. We leverage the knowledge of the wider group involved to ensure it is as accurate as possible.

During this process, we use the time-proven “Five whys,” a technique developed by Sakichi Toyoda of Toyota fame. The idea is simple, and while there are no concrete rules, for every incident you ask why it took place, and for every answer you come up with, you ask the same question again. Once you’ve asked this question enough times, usually about five, you get to the actual root cause of the issue. The technique is helpful in two ways: the intermediate answers give us ideas about what we can do to mitigate a future incident, or monitor for it more effectively, while the final answer tells us the underlying problem we likely need to address.
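
As a worked illustration (the incident and answers below are invented, not drawn from one of our reports), a five-whys chain can be recorded as simple question-and-answer pairs, with the intermediate answers feeding mitigation and monitoring ideas and the final answer pointing at the root cause.

    five_whys = [
        ("Why did customers see elevated errors?",
         "An origin health check flapped and traffic failed over incorrectly."),
        ("Why did the health check flap?",
         "Its timeout was shorter than the origin's worst-case response time."),
        ("Why was the timeout too short?",
         "It was copied from another service with a faster origin."),
        ("Why was it copied without review?",
         "There is no checklist for tuning health checks per service."),
        ("Why is there no checklist?",
         "Health-check configuration was never made part of the launch process."),
    ]

    mitigation_ideas = [answer for _, answer in five_whys[:-1]]  # intermediate answers
    root_cause = five_whys[-1][1]                                # the underlying problem
    print("Root cause:", root_cause)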

The root cause and each issue that hampered either the identification or response to the incident will receive its own ticket. Incidents that have unresolved tickets are tracked on a weekly basis in an incident review until all stakeholders are sufficiently assured that the right actions have been taken to prevent recurrence.

Incidents provide essential learning opportunities, often leading to new projects; brittle systems are often identified during these processes, and the additional visibility the organization gains often leads to the development of replacements or improvements. In the next chapter, we’ll take a look at one of the common incidents we see — a distributed denial of service (DDoS) attack — and how to mitigate.

Chapter 3: DDoS mitigation

Distributed denial of service (DDoS) attacks are a pretty nasty topic. Fundamentally, a DDoS is an attempt to make a machine or network resource unavailable to its intended users:c3.1 an attack on your infrastructure, customers, and employees — on the entire being of your company.

In the last 20 years, DDoS attacks have become front-page news; they’re often tied to high-profile events — everything from the Olympics and the Super Bowl to elections, political parties, and news coverage is targeted. DDoSes are an easy, accessible way for people to achieve their aims — whether that’s putting a competitor out of business or silencing somebody. It’s all about making a statement in a very visible, impactful way.

Because Fastly offers a large network of globally distributed points of presence (POPs), we’re in a unique position to track global traffic patterns, and we defend our customers against attacks on a daily basis. For context, here’s what normal traffic looks like:

[Image: threat-modeling-ddos-mitigation-3]

And here’s a DDoS attack – you have traffic coming from a lot of different sources and going where it shouldn’t:

[Image: threat-modeling-ddos-mitigation-4]

Here’s another view — a medium-sized DDoS, at about 160 Gbps:

[Image: threat-modeling-ddos-mitigation-5]

You can see the DDoS begin probing about two hours before the full attack, and then the floodgates open.

DDoSes can be divided into types:

  • Economically rational attacks, during which you get the bitcoin ransom letter. These are actually the easiest to deal with, because all you need to do is make it more expensive for the attacker to attack you than the expected gain from the ransom (a back-of-the-envelope sketch of this tradeoff follows the list).
  • Hate-based or ideologically motivated attacks. The worst attacks are economically irrational. These attackers fall into two categories:
    • Individual(s) without much budget; they might hate you, or they might hate one of your customers, or they might hate the city of New York and randomly pick a site that has the name New York in it.
    • A nation state with political or ideological motivations, such as the 2007 Russian cyber attacks on Estonia.c3.2
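
The economics of the first category can be made concrete with a back-of-the-envelope comparison. All of the numbers below are invented placeholders; the point is simply that the attacker keeps going only while the expected ransom exceeds what the attack costs them, so mitigation that raises their cost (or outlasts their budget) removes the incentive.

    def attack_is_rational(expected_ransom, attacker_cost_per_day, days_needed):
        """Back-of-the-envelope: is the expected ransom worth the attacker's spend?"""
        return expected_ransom > attacker_cost_per_day * days_needed

    # Placeholder numbers purely for illustration.
    print(attack_is_rational(expected_ransom=5_000,
                             attacker_cost_per_day=400, days_needed=3))   # True: worth it to them
    print(attack_is_rational(expected_ransom=5_000,
                             attacker_cost_per_day=400, days_needed=20))  # False: you've outlasted them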

An unmitigated DDoS attack has debilitating consequences from both a technical and an emotional standpoint. In this chapter, we’ll look at both.

DDoS: A brief history

The rise in popularity of Internet of Things (IoT) devices — like DVRs, IP cameras, and NVRs (network video recorders) — has set the stage for massive IoT botnet attacks. The largest and perhaps most famous of these were the recent Mirai attacksc3.3 against security journalist Brian Krebsc3.4 and DNS provider Dyn.c3.5 The connections that poorly secured IoT devices have to big networks, plus an uptick in bitcoin-enabled extortion, provide ample and easy opportunities for attackers.

The nature and complexity of DDoS attacks can vary, however, in terms of whether you’re being attacked at the application layer, in the kernel, or on the network, and whether you’re being attacked directly or indirectly. Attacks can also change as they’re happening, as attackers try to evade defenses.c3.6

Here’s how varied and complex DDoS attacks can be:

[Image: threat-modeling-ddos-mitigation-6]

How did we get here? Here’s a look back at DDoS attack traffic evolution:

  • In the late 90s, attacks were relatively unsophisticated, taking advantage of simply misconfigured networks to exhaust their targets’ CPUs by overwhelming servers with inbound requests — e.g., SYN floods, ICMP (Smurf) floods, etc. Attacks like Teardrop were designed to exploit kernel bugs, causing kernels to crash. This was also the time of Tribe Flood Network, actors like Mixter in Israel and Mafiaboy in Montreal, and the CNN / Yahoo attacks.
  • As early attack methods and vulnerabilities were patched, attacks in the early 2000s went straight for bandwidth consumption. Prototype malware such as the worms Code Red and Nimda aimed to fill the pipes, overwhelming origin servers with requests. These methods led to the first mitigation services and devices designed to protect businesses.
  • The late 2000s brought TCP state machine issues, such as state holding, in which you keep a TCP connection in an unproductive state, exhausting kernel resources without doing anything useful and eating away at new connection slots. Defensive mechanisms became much more sophisticated — you could, for example, use TCP state machines to lock up your adversary’s TCP state machines. Further, as the world began to converge on sending everything over HTTP or HTTPS, network service layer filtering became impossible, introducing the requirement of wire-speed packet inspection.
  • By the early 2010s, what was old was new again: web application bugs (e.g., RefRef with SQL injection to lock up the backend, preventing the front end from talking to the database), DNS query floods, GET floods, UDP amplification, and cache busting attacks that employ random payloads to thwart application caches and increase the workload on the backend servers.

Throughout this time, the theme has remained the same: the attacker wishes to overwhelm the victim’s resources, and malware or exploits provide leverage. The victim must expend time and effort to drop attack traffic and maintain legitimate traffic. Each side expends a different amount of effort to achieve its aims, with the defender typically paying more than the attacker.

Keeping an eye on DDoS: Research + Network

At Fastly, we keep track of DDoS activity and methodology both by observing traffic patterns on our network and by conducting research. This two-fold approach keeps us apprised of new methods as they arise, helping us contribute to the larger security community while protecting our customers.

Research: The anatomy of an attack

Our security research team uses honeypots — a mechanism set up to detect, deflect, or counteract attempts at unauthorized use of information systems, in this case specifically a modified version of the open source tool Cowriec3.7 — to keep track of who’s probing our systems,c3.8 and what they’re trying to do. Last year, we tested how an array of unsecured IoT devices would perform on the open web, with some alarming results:c3.9

  • On average, an IoT device was infected with malware and had launched an attack within 6 minutes of being exposed to the internet.
  • Over the span of a day, IoT devices were probed for vulnerabilities 800 times per hour by attackers from across the globe.
  • Over the span of a day, we saw an average of over 400 login attempts per device, an average of one attempt every 5 minutes; 66 percent of them on average were successful.

These are not sophisticated attacks, but they’re effective — especially when you have 20,000 different bots at your disposal, trying to brute force their way in. Unless you change device passwords and get them off the network, attackers are going to use them to break in and take over. It’s like the 90s all over again — this is a huge backslide in terms of securing the internet.

Using honeypots to identify botnets is nothing new,c3.10 and remains a valuable method to identify attack capabilities used in the wild. By studying the malware used in these attack tools, defenders can identify attack traffic characteristics and apply that in remediation. For example, knowing the random query strings that a bot could use during an HTTP flood can be useful to selectively filter those attack bots from legitimate traffic.
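
As a simplified illustration of that idea (the patterns below are invented, not signatures from any real bot), a defender who has extracted the attack tool’s query-string format from captured malware can drop matching requests while letting normal traffic through:

    import re

    # Hypothetical patterns recovered from analyzing an attack tool's request generator.
    BOT_QUERY_PATTERNS = [
        re.compile(r"^[a-z0-9]{16}=[a-z0-9]{16}$"),  # 16-character random key=value pairs
        re.compile(r"^cachebust=\d{13}$"),           # millisecond-timestamp cache busting
    ]

    def looks_like_flood_request(query_string):
        """Return True if the query string matches a known attack-tool pattern."""
        return any(pattern.match(query_string) for pattern in BOT_QUERY_PATTERNS)

    print(looks_like_flood_request("q7f3k9d2m1x8c4v0=a1b2c3d4e5f6a7b8"))  # True  -> drop
    print(looks_like_flood_request("page=2&sort=newest"))                 # False -> allow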

At 650 Gbps, the aforementioned Mirai attack on Brian Krebs is the largest-known DDoS to date. Krebs, a former Washington Post reporter, created his “Krebs on Security” blog to explore stories on cybercrime. He began covering the Mirai attacks, which at the time were under the radar — his reporting efforts revealed who was behind the attacks, and in retaliation they started attacking his blog, ultimately taking it offline. A few days later he wrote about the democratization of censorship:c3.11 this isn’t the first time people have used denial of service attacks to try to silence someone they disagreed with. The sheer scale of the attack led to dialogue that wasn’t happening before, including asking questions like, “Are we thinking about these things in the right way?” The Krebs attacks forced the discussion.

NetEng: Protecting the network

Although it doesn’t compare to the 650 Gbps attack on Krebs, in March 2016 Fastly’s network weathered what was at the time the largest DDoS we’d seen. At 150 million packets per second (PPS) and over 200 Gbps, the shape-shifting attack was a mix of UDP, TCP ACK, and TCP SYN floods, with internet-wide effects. We saw upstream backbone congestion as well as elevated TCP retransmission — when you see this out of your network without actually saturating your own links, a backbone provider may be experiencing issues, or the source of the congestion could be even further upstream. Because we saw significant retransmits while our own network wasn’t congested, we concluded that an upstream link was congested and not all of the attack traffic was reaching us. So, in all reality, the attack was quite likely larger than 200 Gbps.
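
As a rough sanity check on those figures (our own arithmetic, not from the incident report), dividing the bit rate by the packet rate gives the implied average packet size, which comes out small, consistent with a flood dominated by small TCP control packets mixed with larger UDP payloads:

    bits_per_second = 200e9      # "over 200 Gbps"
    packets_per_second = 150e6   # "150 million packets per second"

    avg_packet_bytes = bits_per_second / packets_per_second / 8
    print(f"~{avg_packet_bytes:.0f} bytes per packet on average")  # ~167 bytes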

Minutes after the attack started, we engaged our incident command process, as well as our security response team to observe the attack characteristics we highlighted above. As we emphasized last chapter, communications (both external and internal) are critical: our next step was to post a customer-facing status, which we continued to update throughout the event. This particular attack was uniquely long-running, even expanding into our Asia POPs the next day. Our IC team used bifurcation and isolation techniques throughout to keep attackers away from critical IP addresses and POPs within our network. We left the mitigations in place, and the next day the attackers gave up.

In an ongoing effort to monitor attack trends and patterns, we pull data from Arbor Networks to track peak attack size by month. As you can see, the March 2016 attack — with a huge jump to about 600 Gbps — didn’t exactly follow the trend:

[Image: threat-modeling-ddos-mitigation-7]

Source: Arbor Networks’ Worldwide Infrastructure Security Report

The red trend line is what you’re expecting and planning for. Because that attack was far bigger than we expected, it forced the botnet tracking and DDoS countermeasure community to start talking in different terms: we had to re-evaluate mitigation bandwidth capacity, and think about multiple, concurrent large-scale (500+ Gbps) attacks. The conversation became global, spanning between providers, CERTs, and other members of the cyber defense community all around the world.

A DDoS retrospective + lessons learned

Our processes during the March 2016 attack were two-fold:

  • Incident command ensured business continuity — ongoing CDN reliability and availability.
  • Meanwhile, our security response team engaged the security community to identify flow sources, bad actors, malware, and attack methods and capabilities that might be coming at us in the future.

As a result, we learned some valuable lessons, both in terms of what went well and future areas to work on:

  • Pre-planned bifurcation techniques proved invaluable in shortening time to mitigate, empowering us to react quickly. The type of threat modeling we discussed in the first chapter helped pave the way, giving us the agility to split traffic and buy room to maneuver, and empowering our teams to know when to invoke those techniques (and what the consequences are).
  • Mitigation options were enhanced by a well-designed IP addressing architecture.
  • A DDoS can often mask other system availability events — since they’re so noisy and all-consuming, your team might be entirely focused on the DDoS while there’s something else going on.
  • Separation of infrastructure and customer IP addressing, as well as DNS-based dependencies. This is a best practice for any provider: you want to be able to keep your infrastructure distinct from customer addresses and be able to reach your equipment if you need to make changes.
  • Continued threat intelligence gathering to understand future TTP (tactics, techniques, and procedures) vectors.
  • Emphasis on team health for long-running events: given the duration of the incident, it was critical to consider shift rotations and food. No one can work for 24 hours straight.

Although all these lessons were valuable, it’s the last one that we’ll focus on for the rest of this chapter.

The emotional stages of a DDoS attack

As we discussed last chapter, it’s important not to forget the human element when your business is attacked. You have all hands on deck and adrenaline is running high, making things like sleeping and eating fall by the wayside. There’s an emotional impact as well. Our experiences with DDoS have brought to light the debilitating effects they can have on people, including:

  • Hunger. People become hungry and forget to eat – everyone is working hard for a long time in a high adrenaline environment.
  • Sadness and worry. This is not like normal outages. There’s no timeline; no one actually knows when it’s going to end. People get scared, and sometimes believe it’s an existential threat to the organization.
  • Confusion. There’s no clear reason behind a DDoS — no one tripped a cable; no process failed.
  • Anger. A DDoS is not actually blameless, despite popular opinion. It’s not internal; it’s caused because there’s someone attacking you. Somewhere on the other side are people trying to make your life miserable, which is very different from a normal outage.
  • Coming to terms. Your entire organization (not just the tech team) is under attack. It’s not a process failure, and it’s not your fault.

These attacks can take a significant emotional toll on individuals, and it’s critical you take this into account when mitigating incidents and allocating shifts.

Just mitigate

As outlined above, the reason for a DDoS can be political or economic. Sometimes someone wants to prove that they can take you offline. You don’t actually know — you have no idea why you’re being DDoS’ed. So don’t guess; there’s no point. Similarly, you can’t afford to take the time to perform attack tool forensics beyond a rudimentary stage; that takes too much time, a precious commodity during an availability attack. Just mitigate — that’s all you can do.

To prepare for (and weather) a DDoS:

  • Find the weak spots in your architecture ahead of time. The attack doesn’t have to be huge — if there’s a page on your website that does a thousand database calls, and an attacker hits that page, your website goes down. You’ll need to find the weak points to prepare in advance for whatever comes your way. These weak spots may not be what’s under direct attack but rather what’s carrying the traffic: consider every router and switch between the internet and your target servers; at times these have crashed in the critical path despite not being directly targeted.
  • Run drills to train your teams on how they should react. We often get calls from people who are getting threatened with DDoS, and they’re very emotional because they’re scared. If you run drills, you at least know that you have a predefined set of actions to take. These drills may take the form of a tabletop exercise, or a live fire exercise from a reputable partner (and with the consent of your network provider).
  • Model your risk. Model what happens if different parts of your infrastructure are under attack. As outlined in Chapter 1, have a clear threat modeling process in place to find threats before they find you (and a plan in place for when they do).
  • Configure (and test) the system. It’s great to have a mitigation solution in place, but make sure you test it before an actual incident — you’ll sleep better.
  • Take care of your people. You need to plan for team sleep and food schedules — the rotation of people becomes even more important because you don’t have a timeline. After eight hours, people need to sleep because 44 hours later they’re not going to function. Remember the human element, and force people offline to rest. As we outlined last chapter, it’s important to communicate: spend a lot of time explaining to the organization what’s going on — much more so than you would for a normal event.
  • Ask for help. The internet is generally helpful, and there are partners out there who can support you. Build those relationships ahead of time, knowing who you might reach out to and what you can reasonably ask for. Grow your network, and leverage it.

And then, don’t panic. Things will be OK. Help is out there — the internet is helpful, carriers are helpful. By having plans in place before a crisis occurs — threat modeling, and a well-thought-out incident command structure — you can equip your team with the tools you need to react and make effective decisions under pressure. If you prepare, you can absolutely survive a DDoS (or whatever fire comes your way).


1 Klein, 283
2 Klein, 30
3 Klein, 30

c1.1 http://searchsecurity.techtarget.com/definition/threat-modeling
c1.2 Klein, 89
c1.3 Klein, 46-47
c1.4 https://www.fastly.com/blog/election-day-2016
c1.5 http://threatmodelingbook.com/
c1.6 Klein, 129
c1.7 https://www.inc.com/teresa-torres/why-brainstorming-doesnt-work-and-what-to-do-instead.html
c1.8 https://en.wikipedia.org/wiki/STRIDE_(security)
c1.9 https://www.microsoft.com/en-us/SDL/adopt/eop.aspx
c1.10 https://attack.mitre.org/wiki/Main_Page
c1.11 https://capec.mitre.org/
c1.12 https://en.wikipedia.org/wiki/Data_flow_diagram
c1.13 https://technet.microsoft.com/en-us/security/hh855044.aspx
c1.14 Klein, 177
c1.15 https://www.first.org/cvss

c2.1 https://en.wikipedia.org/wiki/Gold–silver–bronze_command_structure
c2.2 http://files.meetup.com/2331301/incident_response_101.pdf
c2.3 Klein, 245
c2.4 Klein 243
c2.5 Klein 243

c3.1 https://en.wikipedia.org/wiki/Denial-of-service_attack
c3.2 https://en.wikipedia.org/wiki/2007_cyberattacks_on_Estonia
c3.3 https://www.wired.com/2016/11/web-shaking-mirai-botnet-splintering-also-evolving/
c3.4 https://krebsonsecurity.com/2016/11/akamai-on-the-record-krebsonsecurity-attack/
c3.5 http://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
c3.6 https://www.eecis.udel.edu/~sunshine/publications/ccr.pdf
c3.7 https://github.com/micheloosterhof/cowrie
c3.8 https://en.wikipedia.org/wiki/Honeypot_(computing)
c3.9 https://www.fastly.com/blog/anatomy-iot-botnet-attack
c3.10 https://www.virusbulletin.com/conference/vb2006/abstracts/botnet-tracking-techniques-and-tools
c3.11 https://krebsonsecurity.com/2016/09/the-democratization-of-censorship/