Businesses have put heavy emphasis on cybersecurity, and for good reason. To perform their intended functions, computer systems and networks must be free from intrusion, damage, and other interference. But in the quest to prevent malicious software, ransomware events, stolen data, and other breaches, businesses may be overlooking additional risks that can also impede their operations.
According to a recent McKinsey article, these additional technology constraints include “capacity limitations, system uptime, data quality, and the ability to recover from a catastrophic technological, physical, or cyber event.” Events that disrupt computer systems and networks can limit, degrade, or stop software development services to clients for a brief or extended time, or even bring all company operations to a halt. Any of these scenarios could have catastrophic results.
The McKinsey article notes that protecting against these threats “requires a resilient infrastructure with heightened visibility and transparency across the technology stack to keep an organization functioning in the event of a cyberattack, data corruption, catastrophic system failure, or other types of incidents.” In the following sections, we explore why companies need resiliency, its key components and stages, and how to build a resilient organization. But first, we take a closer look at what resiliency is.
What Is Resiliency?
To understand resiliency, companies must first consider two other concepts, criticality and risk tolerance. Criticality can be defined as the level of dependence an organization—and, in turn, its customers, other stakeholders, and the public—has on a process or system. The more critical an element is to the company’s or the public’s functioning, the more resilient it must be to protect that functionality.
In the event of a threat situation, a lack of resiliency may result in a declining percentage of operational elements within the system, threats to other parts of the system or to other systems, and the potential for additional damage throughout and beyond the organization. Risk tolerance refers to the level of damage to systems that an organization could sustain during a catastrophic event while still expecting a full recovery.
An important factor in determining both criticality and risk tolerance is the nature of the company’s work. For example, a major blow such as an extreme weather event could impair a power utility’s ability to function and could also have a wide-reaching impact on the communities it serves, leaving homes and businesses without the electricity that is critical to their functioning and, sometimes, to people’s ability to stay alive.
Therefore, companies that provide infrastructure and other critical services have lower risk tolerance. The U.S. Cybersecurity & Infrastructure Security Agency (CISA) has identified 16 critical infrastructure sectors:
- Chemical
- Commercial facilities (spaces open to the general public)
- Communications
- Critical manufacturing (companies that make items with national significance)
- Dams
- Defense industrial base
- Emergency services
- Energy
- Financial services
- Food and agriculture
- Government facilities
- Healthcare and public health
- Information technology
- Nuclear reactors, materials, and waste
- Transportation systems
- Water and wastewater
So, resiliency is the ability of a company to guard against the damage resulting from catastrophic events and, if needed, to rebuild afterward. Timing is also a significant factor because the longer an event goes unresolved, the worse its impact on the business may be. For example, if an e-commerce operation loses website functionality, each passing day means lost sales, which for the largest retailers can reach billions of dollars.
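To make the timing point concrete, the short Python sketch below estimates lost sales for outages of different lengths. The function name, the revenue figure, and the assumption that sales are spread evenly across the day are all hypothetical simplifications for illustration, not figures from the McKinsey article.

```python
def estimated_lost_sales(daily_revenue: float, outage_hours: float) -> float:
    """Estimate sales lost during an outage.

    Assumes revenue is spread evenly across the day, which is a
    simplification: real traffic usually has peaks and troughs.
    """
    if daily_revenue < 0 or outage_hours < 0:
        raise ValueError("daily_revenue and outage_hours must be non-negative")
    hourly_revenue = daily_revenue / 24
    return hourly_revenue * outage_hours


# Hypothetical store averaging $240,000 in sales per day.
for hours in (1, 6, 24, 72):
    loss = estimated_lost_sales(240_000, hours)
    print(f"{hours:>2} h outage: about ${loss:,.0f} in lost sales")
```

Even this rough model shows why recovery time matters: the estimated loss grows in direct proportion to the length of the outage.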
Why Companies Need to Be Resilient
Businesses of all types and sizes should prioritize resilience for a wide variety of reasons, including those listed here:
- Meet obligations to society, customers, and other stakeholders. When companies do business in the public sphere, they become beholden to a range of stakeholders, including investors, partners, and vendors, in addition to customers and society as a whole. While their only legal obligation may be to investors, they are wise, both ethically and reputationally, to do all they can to continue to provide access to their services and to keep all these stakeholders safe.
- Maintain business continuity. To keep their stated and implicit promises to stakeholders, businesses must take steps to avoid downtime and disruptions to their services. This issue is even more important for critical infrastructure sectors such as healthcare, energy, and finance.
- Comply with regulatory standards. Many industries have specific regulatory requirements and compliance obligations related to resilience. Companies must meet these obligations to avoid legal consequences, financial penalties, and reputational damage.
- Support employee morale and productivity. Organizations that maintain a high level of resilience demonstrate preparedness and adaptability, which positively impacts employee morale and productivity. When employees feel confident that their organization can effectively handle disruptions, they are more likely to remain engaged, motivated, and productive, even in challenging situations.
- Enhance longevity. Resilient companies are sustainable companies because they manage risks well, enabling them to continue pursuing their mission in circumstances under which others might lag behind. Resilient businesses are able to thrive when others may fail.
- Strengthen competitive advantage. Companies can gain an edge in the marketplace by paying attention to all of the above factors. Those that demonstrate a high level of resilience are perceived as reliable partners, suppliers, and service providers. These perceptions can attract new customers, strengthen existing relationships, and provide differentiation from competitors.
Key Components of Technology Resilience
Technology resiliency applies to many areas of a company’s systems. While each requires specific and unique attention, companies are wise to think systemically about creating a holistic resiliency strategy. A focus on recovery plays a crucial role in this process. Key technology areas to consider include the following:
- Hardware. Organizations must ensure reliability and redundancy for physical equipment, including servers, storage systems, network infrastructure, data centers, power and cooling systems, and other components. Strategies to ensure resiliency include creating redundant configurations, installing fault-tolerant systems, using power backups, and creating disaster recovery sites.
- Software and applications. Businesses should focus on the strength and availability of software applications and systems that include fault tolerance, error handling, and recovery features. These characteristics enable software to continue operating in the event of failures, errors, or disruptions—or recover quickly afterward. Strategies include the use of data replication, load balancing, and automated failover.
- Networks. Companies must also pay attention to network resiliency, which ensures the reliability and availability of network components such as routers, switches, firewalls, and connectivity elements. To minimize the impact of network failures, businesses can implement redundant network designs, diverse network paths, and network segmentation. Strategies to increase resiliency include proactive monitoring, traffic management, and disaster recovery planning.
- Data. Organizations must also build resiliency into one of their most valuable assets, data. Doing so involves protecting and ensuring the availability, integrity, and recoverability of critical data using backups, replication, redundancy, and disaster recovery planning. Strategies to maximize data resiliency include the use of encryption, access controls, retention policies, and validation practices.
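Two of the software strategies listed above, load balancing and automated failover, can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions rather than a production design: the `FailoverBalancer` class, the backend names, and the use of `ConnectionError` to signal a failed backend are all hypothetical.

```python
from itertools import cycle
from typing import Callable, Dict


class FailoverBalancer:
    """Round-robin load balancer that skips backends marked unhealthy."""

    def __init__(self, backends: Dict[str, Callable[[str], str]]):
        self._backends = backends
        self._healthy = {name: True for name in backends}
        self._order = cycle(list(backends))

    def handle(self, request: str) -> str:
        # Try each backend at most once per request.
        for _ in range(len(self._backends)):
            name = next(self._order)
            if not self._healthy[name]:
                continue
            try:
                return self._backends[name](request)
            except ConnectionError:
                # Automated failover: take the failing backend out of
                # rotation and move on to the next one.
                self._healthy[name] = False
        raise RuntimeError("no healthy backends available")


def make_backend(name: str, fail: bool = False) -> Callable[[str], str]:
    """Build a fake backend; `fail=True` simulates an outage."""
    def serve(request: str) -> str:
        if fail:
            raise ConnectionError(f"{name} is down")
        return f"{name} served {request}"
    return serve


balancer = FailoverBalancer({
    "primary": make_backend("primary", fail=True),  # simulated outage
    "replica-1": make_backend("replica-1"),
    "replica-2": make_backend("replica-2"),
})
responses = [balancer.handle(f"req-{i}") for i in range(3)]
for line in responses:
    print(line)
```

In this run the simulated primary fails on the first request, so the balancer removes it from rotation and the two replicas share the remaining traffic. A real deployment would also need a way to return a recovered backend to service, which this sketch omits.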
However, resiliency is about much more than machinery. It is also about the practices and flexibility of IT team members and other professionals throughout an organization. According to the McKinsey article, “Technology resilience prepares organizations to overcome challenges when their technology stack is compromised, reducing the frequency of catastrophic events and enabling them to recover faster in the case of an event.”
That resilience translates to human factors and processes such as robust training programs, incident response planning and drills, and effective communication plans and practices. Such factors should be monitored and updated regularly to ensure companies are up to speed with the latest resiliency tactics and threats.
Finally, organizations must consider sources they may have less control over, including vendors, third-party services, and cloud providers. To achieve maximum resiliency, companies should thoroughly evaluate the resilience of such providers and their commitment to it. Service level agreements, backup and recovery capabilities, and contingency plans provide a starting point for vetting these partners.
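When comparing providers’ service level agreements, it can help to translate an availability percentage into the downtime it actually permits. The sketch below does that arithmetic in Python. The function name and the 30-day billing period are assumptions for illustration, and real SLAs often define measurement windows and exclusions differently.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


# Compare a few commonly quoted availability targets over a 30-day period.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime")
```

Over a 30-day period, a 99.9% target permits roughly 43 minutes of downtime, while 99.99% permits only about 4 minutes, a difference worth weighing against the criticality of the system being hosted.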
The Resiliency Continuum
There is no one-size-fits-all prescription for technology resiliency plans. Companies must determine what level of resiliency they need and start to create the conditions to maintain it. Resiliency maturity typically falls at one of four levels along a continuum:
- Basic. At this level, an organization might be reactive rather than proactive in ensuring resilience, such as waiting for users or customers to report outages rather than monitoring for them.
- Developing. At this stage, an organization might be more proactive with things like backups and server mirroring, in addition to system monitoring. The company might have a disaster plan in place in case critical systems are compromised and begin to research root causes that could increase the likelihood of damaging incidents.
- Mature. Here, in addition to a fully fleshed-out disaster plan, an organization might actively synchronize various systems and actively monitor specific components such as applications and databases. Stress testing may be employed to ensure measures are operational.
- Advanced. At this point, an organization will consider resilience a critical factor in every decision and purchase. This level includes active monitoring of every critical system as well as a sophisticated disaster plan that is regularly updated. Knowledge gained from past events is used to improve best practices.
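The move from the basic level to the later ones is largely a move from waiting for outage reports to actively probing systems. The following Python sketch shows one way such proactive monitoring might look. The `HealthMonitor` class, the probe function, and the threshold of three consecutive failures are hypothetical choices, and a real monitor would run on a schedule and send alerts to an on-call channel rather than storing them in a list.

```python
from typing import Callable, List


class HealthMonitor:
    """Runs a probe and records an alert after repeated consecutive failures."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerts: List[str] = []

    def check_once(self) -> bool:
        try:
            healthy = self.probe()
        except Exception:
            # A probe that crashes is treated the same as a failed check.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            # Alert once, at the moment the threshold is reached.
            if self.consecutive_failures == self.failure_threshold:
                self.alerts.append(
                    f"ALERT: {self.failure_threshold} consecutive failed checks"
                )
        return healthy


# Simulated probe results: one success, three failures, then recovery.
simulated_results = iter([True, False, False, False, True])
monitor = HealthMonitor(lambda: next(simulated_results), failure_threshold=3)
for _ in range(5):
    monitor.check_once()
print(monitor.alerts)
```

Requiring several consecutive failures before alerting is a common way to avoid paging staff over a single transient error, though the right threshold depends on how critical the system is.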
Building Business Continuity Principles into the Organization
Again, resiliency is more than just technology—it encompasses processes, including the high-level thought processes that most organizations employ in building a resilient culture. Initially, these companies must leverage the same technology solutions they aim to protect in order to construct that protective framework.
Resiliency, thus, is not solely about safeguarding technology but also about strategically using it for optimal protection. Specifically, companies should establish metrics for success and monitor performance to know when objectives are being met. With these ideas in place, they can take steps that contribute to resilience:
- Identify critical systems. The most critical services include those needed to fulfill obligations to customers or clients, partners, investors, regulators, and society at large. While prioritizing may be challenging, organizations should put those systems first, then determine the order of others in the critical functioning of the company.
- Identify likely to unlikely events. The McKinsey article suggests “reviewing past incident-response logs, incident reports, and other documents to identify contributing factors, patterns, and insights that can shed light on causes behind the incidents.” For regions that suffer from regular extreme weather events, for example, a hurricane could realistically cause system delays or failures, while a solar storm is possible but much less likely.
- Determine the right technology. Many companies are moving operations to the cloud. But, for some, the right answer could be on-prem equipment, which offers a higher level of physical control, creating the opportunity to implement customized security measures, disaster recovery plans, and backup strategies and enact immediate response and recovery procedures following a threat event. Businesses may also consider a combination of on-prem systems and cloud systems or backups.
- Determine appropriate processes. How equipment is used is just as important as the equipment itself. Appropriate processes could include setting up regions or availability zones, load balancing, data mirroring, or snapshots/backups.
- Consider relationships between various systems. For example, IT and operational technology (OT) are closely connected and should be considered together when determining resiliency strategies. OT refers to technology systems and devices that monitor, control, and manage operations in manufacturing, energy, transportation, and other critical infrastructure. These systems often operate in real time. In recent years, IT and OT have converged and come to rely on each other for effective resiliency. While IT systems provide the underlying infrastructure, OT systems contribute to the resilience of the overall organization, which should occur through the collaboration of IT and OT professionals.
- Build results into planning. Organizations should build the results of all this research into planning for digital initiatives. Resiliency plans should include this background as well as timelines and milestones, responsible parties, governance rules, and plans to establish baselines and progress toward goals. Additionally, they should include steps for reevaluating assumptions and procedures.
- Test. The best way to determine a baseline and progress is to test, which can be done in a number of ways:
  - Business continuity and disaster recovery testing. BCDR involves simulating disruptive events and determining the company’s ability to recover critical systems and operations.
  - Red team exercises. In this approach, organizations use ethical hackers to mimic real-world attacks to identify system vulnerabilities and weaknesses.
  - Load and stress testing. This procedure simulates high traffic, large data volumes, or intensive processing demands to determine system behavior.
  - Fault injection testing. Companies can use this process to introduce incidents such as network failures, power outages, or hardware or software malfunctions into a system to evaluate its resiliency and recovery capabilities.
  - Data recovery testing. This type of testing validates the effectiveness of data backup, replication, and restoration processes by restoring data from backups.
  - Cloud and infrastructure testing. As stated earlier, organizations must ensure providers’ commitment to resilience. This type of testing does just that by determining the availability of redundant infrastructure components, network resilience, data replication, and failover capabilities offered by the cloud provider.
  - Vulnerability assessments and penetration testing. These types of tests identify security weaknesses and vulnerabilities through comprehensive security assessments.
  - Monitoring and incident response drills. This process involves ongoing monitoring and incident response drills to help organizations detect and respond to potential disruptions in real time.
- Learn. Finally, when technology professionals discover vulnerabilities, companies must address them to create an even more robust defense.
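Of the testing approaches listed above, data recovery testing is among the easiest to sketch in code. The Python example below backs up a file, restores it, and confirms the restored copy matches a checksum recorded at backup time. It then deliberately corrupts the backup to show that the check catches the problem. The function names, file name, and directory layout are hypothetical, and a real drill would restore from the organization’s actual backup system rather than from a local copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def take_backup(source: Path, backup_dir: Path) -> str:
    """Copy the source file into backup_dir and return its checksum."""
    shutil.copy2(source, backup_dir / source.name)
    return sha256_of(source)


def restore_and_verify(backup_file: Path, restore_dir: Path, expected: str) -> bool:
    """Restore a backup and confirm it matches the recorded checksum."""
    restored = restore_dir / backup_file.name
    shutil.copy2(backup_file, restored)
    return sha256_of(restored) == expected


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    backup_dir = root / "backup"
    restore_dir = root / "restore"
    backup_dir.mkdir()
    restore_dir.mkdir()

    source = root / "orders.csv"  # hypothetical critical data file
    source.write_text("order_id,total\n1001,59.90\n")

    checksum = take_backup(source, backup_dir)
    backup_file = backup_dir / source.name
    restore_ok = restore_and_verify(backup_file, restore_dir, checksum)

    # Simulate silent corruption of the backup, then test again.
    backup_file.write_text("corrupted")
    corrupted_ok = restore_and_verify(backup_file, restore_dir, checksum)

print("clean backup restores correctly:", restore_ok)
print("corrupted backup passes check:", corrupted_ok)
```

Recording the checksum when the backup is taken, rather than comparing against the live file later, lets the drill detect corruption even after the original data has changed or been lost.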
The Future of Resiliency
While resiliency is critical in 2024, it will become even more so in the coming years as the following factors come into play:
- Society continues to become more reliant on technology. With each passing day, it becomes harder to imagine life without technology as we know it today. Both consumers and businesses increasingly rely on it to perform everything from grocery shopping to complex data analysis and resolving global problems.
- Technology becomes more complex. The systems required to deliver on that demand for increasing convenience must continue to evolve in ever-more complicated arrangements and interconnections.
- The world becomes more interconnected. As became evident during the COVID-19 pandemic, the interconnected nature of global systems and supply chains increases the potential for “domino” effects when disruptions occur.
- Extreme weather events become more frequent. While many professionals and ordinary citizens are working to slow climate change, much of the damage has already been done, meaning extreme weather events are likely to become more frequent before they become less so.
- Regulatory and compliance requirements get stricter. Governments and regulatory bodies recognize the importance of resiliency and are taking steps to ensure companies provide it. In turn, those companies must comply or face penalties and reputational damage.
- Geopolitical influences become more unpredictable. Threat actors, including governments and quasi-governmental entities, are an ongoing source of potential harm, representing one more reason to prioritize resiliency.