Your plan for the next system outage may be built on wishful thinking
Canadian CIOs have spent a decade building programs to stop data breaches. They got good at it. Now the call that can define a career is a different one entirely.
The systems are down.
Six hours in, the customer-facing app is still dark, and the team trying to recover it can’t tell anyone when it’s coming back. Curtis Simpson, chief strategy officer at Gambit Security, told tech leaders at this year’s CIO Association of Canada Peer Forum that the job has been rewritten, and most resilience programs were built for the old one.
“You’re not generally punished for data loss,” said Simpson.
“Today, businesses, enterprises, organizations, and governments are punished for outages [and] loss of availability.”
When something knocks an organization’s systems offline, whether a cyberattack, a software update gone wrong, a data centre fire, or a cloud provider failure, the tech team’s job is to bring everything back.
That work is called recovery, and how fast it happens decides how much damage the business absorbs.
Splunk’s 2026 Hidden Costs of Downtime study found that unplanned outages cost Global 2000 companies an average of $300 million each per year, with the total cohort impact up roughly 50% in two years.
Cloud outages are no longer rare edge cases. Amazon Web Services (AWS), Microsoft Azure, and other major providers have all had recent failures that knocked customers offline for hours.
The October 2025 AWS outage knocked roughly 2,000 organizations offline for close to 15 hours, including Lloyds Banking Group, Coinbase, Snapchat and the London Stock Exchange.
Simpson said CIOs used to treat a major AWS regional outage as rare enough that it didn’t need to be accounted for in their planning.
“Those days are gone,” he said.
Many of those same cloud providers serve Canadian organizations, and how long any of them would stay down in a similar event comes back to how well their recovery plans hold up.
What Simpson described is the part of resilience that gets real fast, before everyone starts refreshing the crisis spreadsheet.
The four questions every technology leader should be able to answer
Most leaders can’t answer all four with measured numbers, and that’s where the recovery plan falls apart long before the next outage tests it.
The first question is about what the technology team is trying to bring back. A customer-facing app like online banking sits on top of many pieces of infrastructure, including the servers, databases, and network connections. Most teams report recovery on each piece separately.
The customer interacts with the app, and any broken piece in this underlying infrastructure leaves them locked out.
“Nobody cares about whether a specific system or asset or host is recoverable,” said Simpson. “That doesn’t matter.”
How fast the app can come back is the next question. Most organizations have a target written down somewhere, whether it’s an hour, a day, or another timeline the business previously agreed to.
Simpson estimates 95% of organizations aren’t testing recovery end-to-end on a consistent basis. The result is recovery targets that read well on paper but have never been tested against the way a real outage unfolds. The recovery target is what the board will want to hear during the outage. What customers will care about afterward is how long they were locked out and whether they come back at all.
The third question is how much going offline would cost the company. When CIOs report risk to the board, they answer two questions: how likely an outage is and how much it would cost. The first usually has data behind it. The second has often been a guess, based on rough assumptions about lost transactions or customers walking away.
“I’ve mostly been measuring and managing likelihood and I’ve been telling stories around impact,” said Simpson, who previously served as Global CISO at Sysco and Armis.
Without a real cost number, the board can’t decide whether the recovery plan needs $2 million in investment or $50 million.
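To make the arithmetic behind that board conversation concrete, here is a minimal sketch of a likelihood-times-impact estimate. Every figure and function name in it is a hypothetical placeholder, not a number from Simpson or from any organization named here, and a real model would need inputs from finance and continuity planning.

```python
# Illustrative only: all inputs below are made-up placeholders.

def outage_cost(hours_down, revenue_lost_per_hour, fixed_costs=0):
    """Rough cost of one outage: lost revenue plus one-time costs."""
    return hours_down * revenue_lost_per_hour + fixed_costs


def expected_annual_loss(outages_per_year, cost_per_outage):
    """Likelihood multiplied by impact, expressed per year."""
    return outages_per_year * cost_per_outage


# Hypothetical scenario: a 15-hour outage, $200,000 lost per hour,
# $500,000 in one-time response costs, expected once every two years.
single_outage = outage_cost(15, 200_000, 500_000)
annual_exposure = expected_annual_loss(0.5, single_outage)

print(f"Cost of one outage: ${single_outage:,.0f}")
print(f"Expected annual loss: ${annual_exposure:,.0f}")
```

The value of a sketch like this is less in the output than in forcing each input to be a measured number someone can defend rather than a story.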
Finally, how much downtime can the business stand before real damage sets in? Simpson said leaders often wait for the business to give them that number, and waiting is the wrong call.
Technology leaders have to walk in with a number already worked out, drawn from the continuity planners, risk managers, or finance teams, and ask the business to confirm or push back.
“The reality is somebody knows, somebody has insights into this,” Simpson said.
A recovery plan can pass the audit and fail the outage
Recovery testing is supposed to work like a school fire drill.
You don’t wait until smoke is coming down the hallway to find out whether the exits are blocked, the alarm works, or half the class thinks the meeting point is beside the soccer field while the rest meander aimlessly.
The same thing happens in technology.
Many teams test the pieces separately: the backup works, the secondary system works, and the connection to the backup data centre works. Everyone gets a passing grade, and the spreadsheet looks perfect.
Then a real outage hits, and all those pieces have to work together at the same time, in real time.
That is the part Simpson said too many organizations still don’t know. A recovery isn’t finished when one server comes back or one backup restores. It’s finished when the customer can use the application again.
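A rough sketch shows why passing component tests can hide a long customer-facing outage. It assumes, as a simplification, that each piece can only start recovering once everything it depends on is back. The component names, recovery times, and dependency map are all hypothetical.

```python
# Illustrative only: names, minutes, and dependencies are made up.
RECOVERY_MINUTES = {
    "network": 30,
    "database": 240,
    "app_server": 45,
    "banking_app": 15,
}

# What each component needs before its own recovery can begin.
DEPENDS_ON = {
    "network": [],
    "database": ["network"],
    "app_server": ["network"],
    "banking_app": ["database", "app_server"],
}


def time_until_usable(component):
    """Minutes from the start of the outage until the component is usable."""
    waits = [time_until_usable(dep) for dep in DEPENDS_ON[component]]
    return max(waits, default=0) + RECOVERY_MINUTES[component]


for name in RECOVERY_MINUTES:
    print(f"{name}: {RECOVERY_MINUTES[name]} min alone, "
          f"{time_until_usable(name)} min end to end")
```

In this made-up case the app itself restores in 15 minutes when tested alone, but the customer waits 285 minutes, because the app cannot come back until the slowest chain beneath it has finished.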
Testing falls short partly because the old systems didn’t disappear when the cloud arrived.
Many companies in banking, manufacturing, and inventory operations still run AS/400s, a class of IBM business computer dating to the late 1980s, alongside modern cloud applications layered on top.
“We didn’t move to the cloud. We added the cloud,” said Simpson. “Many are still using AS/400s, mainframes, et cetera, interacting with middleware platforms that are interacting with the cloud.”
Joseph Ruck, head of field architecture at Gambit Security, calls the disconnect between what companies think their recovery looks like on paper, and what it would look like in a real outage, the “petri dish paradox.”
Companies have to attest to auditors that they can recover. Those attestations are usually a meeting between people who want to say the plan works.
Customers of two U.S. credit unions found out the hard way what an untested recovery plan looks like.
The June 2024 ransomware attack on California-based Patelco Credit Union, with about 530,000 members and roughly $9 billion in assets, took most banking services offline for more than two weeks. Members couldn’t access their money.
Patelco’s own filings reported more than $39 million in quarterly losses tied to the incident, most of it covering overdrafts during the outage. A $7.2 million class-action settlement is now awaiting court approval.
Ruck said the same pattern shows up across every outage he has worked: organizations whose self-assessment of their own readiness ends up being “absolutely certain, but absolutely wrong.”
VyStar Credit Union’s 2022 outage is the second example. The Florida credit union was upgrading its core banking software, a three-day project that turned into weeks of customers locked out of basic services, with some features unavailable for more than six months.
The U.S. Consumer Financial Protection Bureau fined VyStar $1.5 million in late 2024 over what it called a botched rollout. Ruck described it as a self-inflicted outage.
“They were their own disruptor,” said Ruck.
Both cases point to the same problem. A recovery plan can exist on paper and still fail when customers need the system back.
That’s a call no CIO wants to get six hours into an outage that won’t end, with customers locked out and no honest recovery time the bridge could give.
“We, the board, the executives, don’t care what caused the outage,” said Simpson. “Could be a cyber attack, could be an AI-based outage, could be an infrastructure failure. Nobody cares.”
For many Canadian organizations, the next outage is a question of when, not if. The harder question is whether the recovery plan on file has been tested honestly or only shown to an auditor.
Final shots
- The recovery time the team commits to is only useful if it has been tested under real conditions.
- Recovery needs the CIO and CISO working from the same plan before an outage turns into another after-action review.
- Component tests can make the audit feel tidy. Full recovery testing shows whether customers can get back in.
Digital Journal is the national media partner for the CIO Association of Canada.