Developing a Cloud & Container Incident Response Plan

An incident response plan is critical to eliminating or reducing the impact of security incidents. Without a well-thought-out plan, it is nearly impossible to manage complex incidents that affect multiple services and teams in a high-stress situation.

Even if you already have an incident response plan and automated configuration compliance in place, they will not do you and your team any good unless you keep them up to date. The best way to keep your plan and systems current is to update them and practice regularly in peacetime. Consistent training and chaos simulations help teams stay prepared by making incident response a proactive activity rather than a reactive one.

There’s no definitive standard for cloud incident response plans, but we recommend that your plan cover the five stages below:

Preparation: Often regarded as the most crucial part of incident response, preparation is the foundation everything else depends on. The preparation stage reduces “what if” moments and helps teams make practiced decisions. An on-call schedule with multiple rotations, escalations that reach the correct responders, runbooks, practice sessions, and extensive documentation are all part of this stage.
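To make the idea of rotations and escalations concrete, here is a minimal Python sketch. It assumes fixed-length shifts that cycle through a roster, and an escalation policy in which each step becomes active after a waiting period if the page is still unacknowledged. The names Rotation, EscalationStep, and responders_to_page are hypothetical and do not come from any particular on-call tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Rotation:
    """A named on-call rotation whose members take fixed-length shifts in turn."""
    name: str
    members: tuple[str, ...]
    start: datetime    # when the first member's first shift began
    shift: timedelta   # length of one shift

    def on_call(self, at: datetime) -> str:
        # Count whole shifts elapsed since the start, then wrap around the roster.
        elapsed_shifts = (at - self.start) // self.shift
        return self.members[elapsed_shifts % len(self.members)]


@dataclass(frozen=True)
class EscalationStep:
    """One level of an escalation policy (hypothetical structure)."""
    rotation: Rotation
    wait: timedelta    # time after the first page before this step is notified


def responders_to_page(steps, paged_at, now):
    """Return everyone who should have been paged by `now` if nobody has acknowledged."""
    waited = now - paged_at
    return [step.rotation.on_call(now) for step in steps if waited >= step.wait]
```

With a primary and a secondary rotation, the secondary responder would be added to the list only once the primary has failed to acknowledge within the configured wait. Real schedules also need overrides, time zones, and handoff rules, which this sketch leaves out.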

Detection & Alerting: Incident detection and alerting focuses on communicating that something is abnormal. In this step, monitoring the right metrics and setting the correct thresholds are important to reduce false positives. In the cloud, multiple monitoring solutions often cover different parts of the infrastructure, such as network, application, performance, or compliance monitoring. An undesired state can trigger a chain reaction across these tools, so it becomes crucial to aggregate and triage their signals and then alert only on the things that matter.
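The aggregate, triage, and alert flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific product works: alerts from different monitoring tools are grouped by service and check, each group keeps its highest severity, and only groups at or above a paging threshold are returned. The Alert fields, the three severity levels, and the triage function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed severity scale; real tools define their own levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    source: str    # monitoring tool that raised the alert
    service: str   # affected service
    check: str     # what was measured, e.g. "latency"
    severity: str  # one of the keys in SEVERITY_RANK


def triage(alerts, page_at="critical"):
    """Aggregate alerts per (service, check) and return only groups worth paging on."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)

    pages = []
    for (service, check), members in groups.items():
        # A group is as severe as its worst member.
        worst = max(members, key=lambda a: SEVERITY_RANK[a.severity])
        if SEVERITY_RANK[worst.severity] >= SEVERITY_RANK[page_at]:
            pages.append({
                "service": service,
                "check": check,
                "severity": worst.severity,
                "sources": sorted({a.source for a in members}),
                "count": len(members),
            })

    # Most severe groups first, then alphabetical for a stable order.
    pages.sort(key=lambda p: (-SEVERITY_RANK[p["severity"]], p["service"], p["check"]))
    return pages
```

Three alerts about the same latency problem from two tools would produce a single page that lists both sources, while an informational compliance finding would be held back unless page_at is lowered.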

Containment: At this stage, the goal is not to heal the system or find the root cause of the incident. Containment is about limiting the damage and preventing it from spreading. In complex incidents, teams join a war room and work together to stop the bleeding. Often an incident commander assigns tasks to predefined roles and takes informed actions from the incident command center.

Remediation: Once the incident is under control, it is time to address the underlying problem and work out how to correct it so that a similar incident does not occur in the future. A decision-making framework can guide the approach depending on the type of incident (simple, complicated, complex, or chaotic). This gives incident responders a structured way to determine the best course of action based on the nature of the problem itself.

Another popular approach is to use chat and incident management tools like Slack and PagerDuty so that teams can discuss and assess the incident together. Modern tools make collaborative investigation and remediation much easier: responders can trigger actions with the click of a button or by typing a few words into a shared chat channel where everyone has visibility.
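Returning to the decision-making framework mentioned under Remediation: the four incident types listed there match the domains of the Cynefin framework. The article does not name a framework, so treating it as Cynefin is an assumption. Under that assumption, each type maps to a different order of steps, which a small Python lookup can capture. IncidentType, APPROACH, and next_steps are hypothetical names chosen for this sketch.

```python
from enum import Enum


class IncidentType(Enum):
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"


# Order of steps per incident type, following the Cynefin framework
# (assumed here; the article does not say which framework it has in mind).
APPROACH = {
    IncidentType.SIMPLE: ("sense", "categorize", "respond"),   # apply a known runbook
    IncidentType.COMPLICATED: ("sense", "analyze", "respond"), # bring in expert analysis
    IncidentType.COMPLEX: ("probe", "sense", "respond"),       # run safe experiments first
    IncidentType.CHAOTIC: ("act", "sense", "respond"),         # stabilize before analyzing
}


def next_steps(incident_type: str) -> tuple:
    """Look up the suggested order of steps for an incident type given as text."""
    return APPROACH[IncidentType(incident_type.strip().lower())]
```

The point of the lookup is that responders first agree on what kind of problem they face and only then pick a way of working, rather than applying the same routine to every incident.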

Analysis: Incident response does not end once the issue is remediated. Continuous improvement requires learning from mistakes, and the final step of any incident response plan should support that. Postmortems, or post-incident reviews, help teams evaluate the incident and implement new measures to reduce the chances of a similar incident. An essential rule when writing postmortems is to focus on identifying the issue and creating processes and practices that will prevent it from recurring. Pointing fingers does not contribute to a culture of improvement.


