Effective Adversary Emulation
Jeremy Mill
Posted on November 27, 2023
Abstract
So you’ve built an amazing suite of security tools that provide defense in depth. Have you ever actually tested them? Have you tested them in a controlled manner that fully examines their performance using real-world scenarios? Here I'll describe an effective method for adversary emulation designed for small and medium-sized security teams. I'll describe how to build a test plan, execute it safely, and evaluate the results. I'll pull it all together by describing how to fit this testing methodology into your security program, keep it up to date, and use it to drive changes that make your organization genuinely safer.
Video
I've also given the contents of this blog as a talk at BSidesCT and VetSecCon. You can view a recording of the BSidesCT talk here.
Intro
Far too often, those of us responsible for security fail to test our tools effectively. For example, we may only test our tools during a proof of concept: comparing two or three tools against each other with a limited set of testing criteria, in an untuned state, and with a limited amount of time to run the test. Or we may test our tools well, but only individually. The problem with testing only during the proof of concept is that we miss the context of running a tool tuned for our environment, where not just our selection criteria matter but likely every feature the tool supports. When we test only one tool, we miss the full context of our environment; we make assumptions about what is covered by other tools and never put it all together to evaluate those assumptions. This is akin to having unit tests with no integration tests.
This process also differs from other adversary emulation frameworks and processes in that it does not attempt to perfectly emulate any single adversary. That approach doesn't make sense for several reasons, the first of which is that your org doesn't face a single threat, and the threat actors emulated by other frameworks change their tactics, techniques, and procedures over time and based on the organization they're targeting.
The testing process I describe is a MITRE ATT&CK driven, regularly scheduled process that tests the chosen security stack of an organization. It can be thought of as an integration or clear box test for a security program.
The process follows the following steps:
1) Scope
2) Tactics, Techniques, and Procedures (TTPs)
3) Design
4) Weigh
5) Execute & Score
6) Analyze & Plan
7) Implement
The Process
1. Scope
In the scoping step you must decide what you will, and will not, test. This sounds simple and straightforward, but in reality it's quite hard to get right. Select a scope too large and the test becomes unwieldy and too hard to execute. Select a scope too small and you fail to effectively test the program and identify gaps.
Let's use the following (massively oversimplified) picture of an IT environment for an organization that hosts some services on the open internet:
If you attempt to test the red outlined items, the test will be entirely too large. You will be bogged down in test development and you won't be able to dig deep into the findings in later stages. Instead, you should try to scope to something closer to the purple or teal squares, where purple tests the endpoints up to the authentication system and teal tests the cloud infrastructure against something like the Log4j or MOVEit zero days. An additional square could be drawn here from Endpoint to K8s, which would represent an exploited developer workstation or an insider threat to your production systems.
When selecting your scope you should also keep in mind your data classification and crown jewels analysis to make sure that you are testing the systems most critical to protecting the data your organization most wants to protect.
2. Tactics, Techniques, and Procedures
Tactics, Techniques, and Procedures (TTPs) are "the behaviors, methods, or patterns of activity used by a threat actor, or group of threat actors". With your scope in hand it is time to research how threat actors attack the systems you have selected to test. After all, it makes no sense to learn about attacks on hypervisors if you haven't decided to test your virtualization infrastructure.
Collecting TTPs is part of this step, but collection should never really stop. I suggest keeping a running notebook of good blogs and posts that may apply to your environment and the systems you are tasked to protect. Think of this step as a selection from your ever-growing collection of TTPs rather than a one-off collection effort.
There are many different sources you can use to build your library of TTPs. Some of the ones I use regularly are:
- https://www.mandiant.com/resources/blog
- https://www.sentinelone.com/blog/
- https://attack.mitre.org/groups/
- Infosec Mastodon (formerly infosec Twitter)
  - I'm partial to infosec.exchange
There are also numerous paid threat intelligence feeds that are fantastic and, if you have access to them, can be used to supplement public sources.
Finally, while collecting TTPs you want to make sure that you are building a collection across the full exploit chain. During the test you (may) need to test everything from initial access to exfiltration, and you'll need specific TTPs for each. Also, remember that what you care most about is detecting behaviors: just because you can't find a public example of a threat actor using rsync for exfiltration doesn't mean that you can't use it.
3. Design
With a well-defined scope and a set of TTPs chosen, it's time to design a test. The best way to do this is to start with a narrative. Think about how you would describe an interesting case study or incident analysis to a colleague or friend (assuming your friends are as nerdy as mine). An example may be:
A threat actor built a malicious, typo-squatted NPM package and published it. A developer accidentally installed it. The malicious package downloaded a stager, which downloaded a C2 payload. The attackers stole the developer's Okta auth cookies and used them to move laterally into the organization's cloud infrastructure.
With the narrative written you can break it down into individual steps, adding additional details. This can be done in a flowchart like the following:
It is also useful to map out the infrastructure diagram. For this narrative it may look something like this:
The creation of these two diagrams may feel unnecessary but they are very useful to share among the team performing the test to ensure that everyone is on the same page. They are also incredibly useful for sharing with other key stakeholders when informing them that you are going to conduct the test.
3.1 Collection
Step 3 also includes the collection of any tools you decide to acquire or build as part of your design. If you say you're going to use Cobalt Strike as your C2, you need to acquire a copy of it during this step of the process. If you say you're going to build a custom stager, you must build it during this stage as well.
As a note, this step also includes education. If you're selecting an open source tool like responder or sliver, you should educate yourself during this step on how to use the tool before proceeding to steps 4 and 5. You should not be learning "on the job" during the execution step.
4. Weigh
Now that you have a well-defined narrative that combines your scope and your TTPs, it's time to break it down further. You will do this by 1) defining detections and then 2) weighing them. The output of this step is a table with all of the detections you expect to see and their relative weights.
4.1. Detections
For each step in the design from step 3, you want to think about what should be seen by your security tools. Let's start with a very simple step from the sample design: the download of the typo-squatted NPM package. You probably want to see:
- The download of the file on the network security tool (firewall, SASE, etc)
- The network connection on the EDR if it supports it
- The creation of the file on disk from the EDR
That's at least 3 rows on the table you are creating in this step. A massively truncated version of the output of step 4.1 is:
Action | Tool | Event |
---|---|---|
Dev downloads NPM | Firewall | GET request |
Dev downloads NPM | EDR | netconn/GET request |
Dev downloads NPM | EDR | file creation |
... | ... | ... |
NPM pulls stager | Firewall | GET request |
NPM pulls stager | EDR | netconn/GET request |
Stager executes | EDR | Execution from memory |
... | ... | ... |
Attacker uses Okta cookie | Okta | Suspicious behavior |
... | ... | ... |
I will refer to each row in this table as an Action, Tool, Event tuple, or ATE tuple going forward.
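If you prefer to build this table in code rather than a spreadsheet, a lightweight structure is all you need. Here's a minimal sketch in Python; the class and field names are my own, not part of any tool or standard:

```python
from dataclasses import dataclass

@dataclass
class ATE:
    """One Action, Tool, Event row from the detection table."""
    action: str  # the step from the design, e.g. "Dev downloads NPM"
    tool: str    # the security tool you expect to observe it
    event: str   # the event you expect that tool to produce

# A few rows from the truncated sample table above
detections = [
    ATE("Dev downloads NPM", "Firewall", "GET request"),
    ATE("Dev downloads NPM", "EDR", "netconn/GET request"),
    ATE("Dev downloads NPM", "EDR", "file creation"),
    ATE("Stager executes", "EDR", "Execution from memory"),
]
```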
4.2. Weigh
Not all steps are equally critical to detect or prevent. For example, let's say persistence is achieved by creating a scheduled task or modifying an existing one. Scheduled task creation and modification events happen all the time, making them a much lower fidelity alerting signal than, say, dumping LSASS. You want to identify the most critical items so you can treat them as more important when you analyze the results after the test execution is complete.
You do this by assigning a weight to each one of the rows you built in the previous step. These weights are subjective so the exact range you use doesn't matter, but the recommended scale is a range of 1 to 10 where 1 is the lowest fidelity/lowest criticality and 10 is "this absolutely must be detected or prevented and alerted on".
The sample from 4.1 with weights added looks like this:
Action | Tool | Event | Weight |
---|---|---|---|
Dev downloads NPM | Firewall | GET request | 2 |
Dev downloads NPM | EDR | netconn/GET request | 2 |
Dev downloads NPM | EDR | file creation | 4 |
... | ... | ... | ... |
NPM pulls stager | Firewall | GET request | 3 |
NPM pulls stager | EDR | netconn/GET request | 3 |
Stager executes | EDR | Execution from memory | 7 |
... | ... | ... | ... |
Attacker uses Okta cookie | Okta | Suspicious behavior | 10 |
... | ... | ... | ... |
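If you are tracking the table in code, it can also be worth enforcing the agreed scale so a stray 0 or 20 doesn't sneak into a row. A small, self-contained sketch (the helper name and row format are mine):

```python
def validate_weight(weight: int) -> int:
    """Weights are subjective, but keep them on the agreed 1-10 scale."""
    if not 1 <= weight <= 10:
        raise ValueError(f"weight {weight} is outside the 1-10 scale")
    return weight

# (action, tool, event, weight) rows from the sample table above
weighted_rows = [
    ("Dev downloads NPM", "Firewall", "GET request", validate_weight(2)),
    ("Stager executes", "EDR", "Execution from memory", validate_weight(7)),
    ("Attacker uses Okta cookie", "Okta", "Suspicious behavior", validate_weight(10)),
]
```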
5. Execute and Score
Time to hang up your white hat (off white?) and put on your black one: it's time to play bad guy and run the test. As you perform the test you should carefully note what you did and when you did it, down to the second if you can. You'll thank yourself later when you're digging through logs to find out why something didn't alert like you expected it to.
Scoring is subjective and given on a scale from 0 up to the weight assigned to the ATE, based on the performance of the tool. If a tool does exactly what you think it should have, up to and including blocking the attack chain (more on that in a bit), it gets a score equal to its weight. If the tool absolutely misses the mark, logs nothing, and doesn't get in the way of compromising the system, it gets a 0. A tool may also get any score in between: maybe you expected it to generate an alert but all it did was log, which might be a score of 2 out of a weight of 6. The "why" of a failure does not matter at this step in the process.
Scoring may be performed during the test or after all of the steps are completed. You may find it easier to have one set of engineers perform the test while communicating with a second set who do the scoring as the test proceeds.
Important: When one of the security tools does its job and blocks a step in your design, the test IS NOT OVER. If you stopped as soon as one of your tools was successful, you would not be testing the defense in depth of the security tool stack. Instead, that Action, Tool, Event tuple gets a full score, you allow-list the network connection, behavior, or signature in the tool, and you continue the test.
After the test has been completed and scores are assigned, you will have a scored ATE table that may look like this:
Action | Tool | Event | Weight | Score |
---|---|---|---|---|
Dev downloads NPM | Firewall | GET request | 2 | 2 |
Dev downloads NPM | EDR | netconn/GET request | 2 | 2 |
Dev downloads NPM | EDR | file creation | 4 | 3 |
... | ... | ... | ... | ... |
NPM pulls stager | Firewall | GET request | 3 | 3 |
NPM pulls stager | EDR | netconn/GET request | 3 | 1 |
Stager executes | EDR | Execution from memory | 7 | 2 |
... | ... | ... | ... | ... |
Attacker uses Okta cookie | Okta | Suspicious behavior | 10 | 1 |
... | ... | ... | ... | ... |
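Two details of this step are easy to capture programmatically if you keep your notes in code: the per-second timestamps for each action you take, and the rule that a score can never exceed its weight. A minimal sketch, with function and field names of my own choosing:

```python
from datetime import datetime, timezone

test_log = []  # (timestamp, description) pairs

def record_action(description: str) -> None:
    """Note what was done and exactly when, for later log correlation."""
    test_log.append((datetime.now(timezone.utc).isoformat(), description))

def score_ate(weight: int, score: int) -> int:
    """A score runs from 0 (complete miss) up to the ATE's weight (full marks)."""
    if not 0 <= score <= weight:
        raise ValueError(f"score {score} must be between 0 and the weight {weight}")
    return score

record_action("Executed typo-squatted NPM install on the dev workstation")
stager_score = score_ate(weight=7, score=2)  # EDR logged the execution but never alerted
```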
Note on Deviations
Deviations from the test plan should be minimized. You want the steps of this process to be repeatable so you can test the fixes you implement in step 7. Small deviations, such as changing obfuscation techniques, should be annotated and allowed, but large deviations, such as switching from one C2 infrastructure to another, should be avoided. This is also because you have (hopefully) communicated your test plan to other key stakeholders, and you don't want to deviate materially from what you have told them and gotten their approval on.
6. Analyze and Plan
Unfortunately, the fun "pretend to be a bad guy" step is over. But fortunately, the "reason we all get paid" step is about to begin. Like step 4, this step has two sub-phases: Analyze and Plan.
6.1 Analyze
The first step in your analysis is to calculate the delta for each scored ATE, defined as the Score subtracted from the Weight (delta = weight - score). The higher the delta, the more critical the failure is in the security system. The sample table, now with calculated deltas, is below:
Action | Tool | Event | Weight | Score | Delta |
---|---|---|---|---|---|
Dev downloads NPM | Firewall | GET request | 2 | 2 | 0 |
Dev downloads NPM | EDR | netconn/GET request | 2 | 2 | 0 |
Dev downloads NPM | EDR | file creation | 4 | 3 | 1 |
... | ... | ... | ... | ... | ... |
NPM pulls stager | Firewall | GET request | 3 | 3 | 0 |
NPM pulls stager | EDR | netconn/GET request | 3 | 1 | 2 |
Stager executes | EDR | Execution from memory | 7 | 2 | 5 |
... | ... | ... | ... | ... | ... |
Attacker uses Okta cookie | Okta | Suspicious behavior | 10 | 1 | 9 |
... | ... | ... | ... | ... | ... |
Criticality of failures is defined by the magnitude of the delta (a short sketch of this calculation follows the list):
- None: 0
- Low: 1-3
- Medium: 4-6
- High: 7-8
- Critical: 9-10
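The delta and criticality bucketing are simple arithmetic, so they are easy to compute across the whole table. A short sketch using the scale above (the row format is mine):

```python
# (action, tool, event, weight, score) rows from the scored sample table
scored_rows = [
    ("Dev downloads NPM", "EDR", "file creation", 4, 3),
    ("Stager executes", "EDR", "Execution from memory", 7, 2),
    ("Attacker uses Okta cookie", "Okta", "Suspicious behavior", 10, 1),
]

def criticality(delta: int) -> str:
    """Map a weight-minus-score delta onto the criticality scale above."""
    if delta == 0:
        return "None"
    if delta <= 3:
        return "Low"
    if delta <= 6:
        return "Medium"
    if delta <= 8:
        return "High"
    return "Critical"

# Sort by descending delta so the most critical failures land at the top of your analysis
for action, tool, event, weight, score in sorted(scored_rows, key=lambda r: r[4] - r[3]):
    delta = weight - score
    print(f"{action} | {tool} | {event} | delta={delta} | {criticality(delta)}")
```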
After calculating the deltas you should perform a root cause analysis starting from the most critical of the failures. Ask yourself questions like: Why did the tool not log? Why didn't the alert fire? Why didn't it end up in the SIEM to be correlated? Why wasn't that TLS intercepted?
6.2 Plan
At this stage in the process you have everything you need to determine what to work on first and why. Next you need to determine the correct path to fixing it. There are often hard decisions in this step. Sometimes it's as easy as fixing a typo in a custom alert; other times it's having a difficult discussion with the vendor of a tool or your MSSP.
You should be turning all of these plans into tickets in your work tracking system with appropriate criticalities assigned. After all, if it's not a ticket, will it ever get done? Grouping these tickets together as an adversary emulation epic (assuming Jira) helps keep everything more organized.
7. Implement
Every other step means nothing if you don't do something with the outputs. It's time to pick up the tickets you created and organized by criticality in step 6 and fix them. Your definition of done for a ticket MUST include testing the fix. You should never assume a change is successful without proving it.
Repeat
The cybersecurity environment we operate in is changing rapidly and the threats faced today are not the threats we will face tomorrow. As a result, the value of the outputs of this test begins to depreciate as soon as they are created. Adversary emulation tests should be performed annually or semi-annually in order to ensure that we are keeping up with the rate of change of attackers.
It is also important to vary your scope to ensure that you are testing all the different avenues that you can be attacked and all of the different security tools and log sources that you have in your environment.
Your first test
If you've made it this far into this guide I hope I've convinced you that this process is worth performing at your organization and you're excited to get started. It is advisable to keep two things in mind for your very first test:
1) Start Small
2) Keep. It. Simple.
It is incredibly easy to over-scope your first test (and all tests, really), so you should start small. Smaller than you probably think you need to, in fact. For many small to medium-sized organizations, a good first test narrative may be something like:
A threat actor emails a user with a link to an encrypted zip file with malware in it. The user downloads the zip and executes the malware. The malware persists via a scheduled task and spawns a reverse shell. The attacker uses it to steal files from the local computer.
This narrative has only one endpoint in scope and makes no attempt to move laterally. But nearly all of us can think of a user in our organization (maybe an executive) for whom this happening would be a major event.
It's also worth noting that this narrative has nothing crazy in it. No process hollowing, module stomping, dynamic call stacks, or anything like that. That's because you probably don't need ANYTHING fancy to identify some failures in your security tools. You get fancy as you mature; you START with the basics like simple reverse shells, SSH, PowerShell, the basics in Metasploit, etc. Keep. It. Simple.
Resourcing
The resourcing timelines here are best estimates based on experience. They can vary widely depending on the size and skill sets of the security professionals on your organization's team(s).
- Steps 1 and 2 often take one to two business days to fully select your scope and TTPs and produce the documentation.
- Step 3 varies significantly based on the outputs of steps 1 and 2 and the team's familiarity with the tools and processes selected. It can range from one business day to a week or longer.
- Step 4 can often be completed in one business day or shorter.
- Step 5 often takes 1-2 business days. Assigning additional engineers to test execution often results in a longer timeline, but has the benefit of educating members of the team and growing their skill sets. It is most often worth the added time.
- Step 6 can vary depending on the complexity of the findings. It can range from half a business day to several days.
- Step 7 is open-ended depending on the backlog and capacity of the security team and the severity of the findings.
Resources
Some resources I've found useful in the performance of these tests are:
- Sliver C2 (my preferred): https://github.com/BishopFox/sliver
- Covenant C2: https://github.com/cobbr/Covenant
- Bring your own vulnerable driver (BYOVD): https://www.loldrivers.io/
- Metasploit: https://github.com/rapid7/metasploit-framework
- Revshells: https://www.revshells.com/
- PayloadsAllTheThings: https://github.com/swisskyrepo/PayloadsAllTheThings
- Atomic red team: https://github.com/redcanaryco/atomic-red-team
And so so many others.
Thanks
- Huge thanks to CryptoJones for editing and review of the post: https://infosec.exchange/@CryptoJones
- Markdown table generator: https://www.tablesgenerator.com/markdown_tables#
- Cover Photo: Photo by https://unsplash.com/@alain_pham