Insights from CIDT's DevOps and Blockchain Engineers
Alexey Kolesnik, DevOps Engineer · Vasyl Naumenko, Blockchain Developer
Most technical teams approach blockchain downtime the same way they'd approach a server outage. The reality is more nuanced, and understanding the difference between these two worlds could save your team weeks of misdirected effort.
We asked two of our engineers — Alexey Kolesnik (DevOps) and Vasyl Naumenko (Blockchain) — how they actually think about uptime. What emerged from that conversation challenged even the original premise of this article.
The myth: blockchain can 'go down' like a regular app
The first thing Vasyl clarified is something many founders get wrong from the start: a properly decentralized blockchain is, by design, extremely resistant to downtime.
"The whole idea of blockchain is that it's a decentralized system — there's no single center that can control or switch it off. Asking how to prevent blockchain downtime is a bit like asking how to prevent the internet from going offline."
— Vasyl Naumenko, Blockchain Developer
This resilience comes from Byzantine Fault Tolerance (BFT) consensus — the same principle that makes it impossible for a single bad actor to corrupt the network. As long as more than two-thirds of nodes remain active and honest, the network continues to operate. Nodes spread across different data centers and jurisdictions leave no single point that can be switched off.
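The arithmetic behind that two-thirds figure is worth seeing once. A rough illustration follows; the exact quorum rule is protocol-specific (Tendermint-style chains, for example, count voting power rather than node count), so treat this as the classic textbook bound, not any particular network's rule.

```python
def bft_tolerance(n_validators: int) -> dict:
    """Classic BFT bound: n validators tolerate f faulty or offline
    nodes only if n >= 3f + 1, so f = (n - 1) // 3."""
    f = (n_validators - 1) // 3   # max faulty nodes the network survives
    quorum = 2 * f + 1            # honest votes needed to finalize a block
    return {"max_faulty": f, "quorum": quorum}

# A 4-validator network survives 1 failure; 100 validators survive 33.
print(bft_tolerance(4))    # {'max_faulty': 1, 'quorum': 3}
print(bft_tolerance(100))  # {'max_faulty': 33, 'quorum': 67}
```

The takeaway matches Vasyl's point: resilience scales with the number of independent validators, not with the reliability of any single server.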
The practical implication: once your mainnet has enough independent validators across different operators and locations, the network itself stops being your primary uptime concern.
What actually causes problems — and where
Both engineers pointed to the same areas — and none of them are the network protocol itself.
- Your network isn't decentralized yet
Early-stage mainnets — and all testnets — are typically run by a small number of nodes controlled by the same team. At this stage, the network is effectively centralized, and yes, it can go offline if those servers go down.
"Testnet is usually spun up locally, within one company — it's centralized. Mainnet is designed to be decentralized, but early on it often isn't. Once it genuinely is, no single party can take it down."
— Vasyl Naumenko
More independent validators in more locations means fewer single points of failure. The tokenomics that attract those validators and the operational resilience they provide are two outcomes of the same design decision.
- Individual validator downtime
Individual validators can go offline — due to server payment lapses, configuration errors, or penalties for double-signing and prolonged downtime. The latter is handled through a process called slashing, which results in a partial loss of their staked tokens. The network keeps running, but that validator drops out of consensus and stops earning rewards.
"Depending on the protocol, a validator may lose part of their stake through slashing — but even without that, losing the ability to earn rewards is incentive enough to stay online."
— Vasyl Naumenko
Most blockchain protocols include grace periods to account for this. On one network CIDT supports, validators have a 72-hour window to reconnect before being marked inactive. It's a sensible safeguard that balances flexibility with accountability.
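A monitoring job that tracks such a window might look like the sketch below. The 72-hour figure comes from the example above; the early-warning threshold at half the window is our own assumption, and real values are protocol-specific.

```python
from datetime import datetime, timedelta, timezone

# Assumption: reconnection window from the example network above.
GRACE_WINDOW = timedelta(hours=72)

def validator_status(last_signed_at: datetime, now: datetime) -> str:
    """Classify a validator by how long it has gone without signing."""
    silent_for = now - last_signed_at
    if silent_for >= GRACE_WINDOW:
        return "inactive"            # past the window: marked inactive
    if silent_for >= GRACE_WINDOW / 2:
        return "warning"             # alert operators well before the cutoff
    return "healthy"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(validator_status(now - timedelta(hours=40), now))  # warning
```

Alerting at the halfway mark rather than at the deadline is the operational point: the grace period is a safeguard, not a budget to spend.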
- Protocol upgrade risks
Protocol upgrades are where classical downtime is most likely: a deployment can introduce unexpected behavior even when the network itself stays up. The worst case is a consensus-level bug shipped in an upgrade, which can cause a chain halt — block production stops entirely until validators coordinate a fix.
"Murphy's Law applies here too. But everything gets tested, announcements go out to all validators, and updates are usually automated. In my experience, if something breaks, it gets fixed fast — we're talking hours."
— Vasyl Naumenko
Upgrades are always tested extensively on testnet first, sometimes for weeks, before touching mainnet. A logic error on mainnet with live financial flows carries consequences that a staging bug does not.
- The smart contract boundary
One more thing Vasyl was careful to distinguish: exploits and failures in smart contracts are not the same as blockchain downtime.
Projects can collapse because of logic vulnerabilities in their on-chain programs — but the underlying blockchain keeps running. Network resilience and application-layer security are separate concerns requiring separate attention. When you hear about a 'blockchain hack,' it's almost always a smart contract vulnerability, not a failure of the chain itself.
The infrastructure layer: where DevOps practices matter
Even if the blockchain protocol itself is highly resilient, the infrastructure surrounding it requires serious operational discipline. Alexey's approach to this is shaped by direct experience across multiple production deployments.
Test on testnet. Every time.
Every change — binary updates, smart contract deployments, configuration edits — goes through testnet before mainnet.
"We always test in testnet first, to avoid pushing a raw release into mainnet. If something goes wrong, we roll back in production to the previous version rather than trying to fix it live."
— Alexey Kolesnik, DevOps Engineer
The rollback-first philosophy is deliberate. Attempting to fix a production issue while the network is live adds risk on top of risk. Rolling back, diagnosing in testnet, and re-releasing may take longer, but it produces fewer incidents.
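The rollback path can be sketched as a small piece of control flow, with the installer and health check injected as callables. Both are hypothetical stand-ins for whatever your stack actually uses, e.g. a service restart plus an RPC liveness probe.

```python
def deploy(version, previous, install, healthy):
    """Rollback-first deploy: install the new version, run a health
    check, and restore the previous version on failure instead of
    debugging the live network."""
    install(version)
    if healthy():
        return ("running", version)
    install(previous)                  # roll back immediately
    return ("rolled_back", previous)   # diagnose on testnet, re-release

# Toy run: the new release fails its health check and gets rolled back.
installed = []
result = deploy("v1.3.0", "v1.2.9",
                install=installed.append,
                healthy=lambda: False)
print(result)     # ('rolled_back', 'v1.2.9')
print(installed)  # ['v1.3.0', 'v1.2.9']
```

The design choice is that the failure branch contains no diagnosis at all; investigation happens afterwards, on testnet, with the previous version already back in production.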
Monitor at two levels
Alexey distinguishes between two distinct monitoring layers that are both essential:
- Server-level monitoring tracks CPU usage, memory, and disk space. Nodes need sufficient disk space to keep writing new blocks — running out means the validator stops participating.
- Validator-level monitoring watches actual blockchain behavior: is the validator signing blocks and earning rewards? If not, something is wrong — even if the server itself looks healthy.
"If we didn't have those metrics, we'd either find out too late — or the client would discover that rewards hadn't been generating for days. That's not a situation you want to be in."
— Alexey Kolesnik
The second layer is easy to overlook. A server can be running perfectly while the validator has drifted out of sync or been slashed. Monitoring both dimensions gives you a complete picture.
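Put together, the two layers amount to a check like this. Field names and thresholds are illustrative, not taken from any specific client or chain.

```python
def check_node(free_disk_gb, last_signed_height, chain_height,
               min_free_gb=50, max_lag=10):
    """Two monitoring layers: server-level (disk) and validator-level
    (signing activity). Returns a list of alert strings."""
    alerts = []
    if free_disk_gb < min_free_gb:       # server-level: without disk space
        alerts.append("low disk")        # the node stops writing blocks
    lag = chain_height - last_signed_height
    if lag > max_lag:                    # validator-level: the server can be
        alerts.append(f"validator {lag} blocks behind")  # up yet not signing
    return alerts

# Server looks fine, but the validator stopped signing 500 blocks ago.
print(check_node(free_disk_gb=200, last_signed_height=1000, chain_height=1500))
# ['validator 500 blocks behind']
```

The second check is the one that catches the silent failure mode Alexey describes: healthy CPU, memory, and disk, with no rewards being generated.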
Daily backups — and actually test recovery
For one of the blockchain projects we support, the team runs daily backups of the blockchain state. Without a recent snapshot, syncing from scratch after a node failure can take days. Your backup cadence sets the ceiling on how fast you can recover.
"Syncing from scratch can take 6–7 days. With proper backups, you can restore in a couple of hours."
— Alexey Kolesnik
The number comes from an actual incident: a node failure with no backup meant six to seven days before the network was fully operational again, while recovery from a recent snapshot takes hours. The backup itself is not the point; the tested, documented recovery process is.
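A minimal sketch of a daily snapshot job with retention, assuming the node's state directory can be safely copied (in practice you would stop the node or take a filesystem-level snapshot first). Paths and the seven-snapshot retention are assumptions for illustration.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

def snapshot(data_dir: str, backup_root: str, keep: int = 7) -> Path:
    """Copy the node's state directory into a dated snapshot and
    keep only the newest `keep` snapshots."""
    dest = Path(backup_root) / f"state-{date.today():%Y%m%d}"
    shutil.copytree(data_dir, dest, dirs_exist_ok=True)
    for old in sorted(Path(backup_root).glob("state-*"))[:-keep]:
        shutil.rmtree(old)               # retention: prune old snapshots
    return dest

# Toy run against throwaway directories standing in for the node's data.
data_dir, backup_root = tempfile.mkdtemp(), tempfile.mkdtemp()
Path(data_dir, "blocks.db").write_text("chain state")
latest = snapshot(data_dir, backup_root)
print(latest.name)  # e.g. state-20240601
```

The matching half of the practice is restoring from the latest snapshot onto a scratch machine on a schedule, so the recovery path is known to work before it is needed.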
Maintenance windows and communication
When planned changes to mainnet are necessary, coordinate with your client before touching anything. They need time to notify their users — especially in financial applications where even a brief interruption can have real consequences.
Clients who aren't warned in advance can't prepare their users. What starts as a routine update becomes a communication problem on top of a technical one.
What this means for founders building on blockchain
If you're launching a blockchain-based product — whether a custom L1, an appchain, or a validator-dependent protocol — here's what to focus on, distilled from both perspectives:
Before mainnet launch
- Ensure you have enough nodes across independent operators and locations — decentralization is your primary resilience mechanism.
- Run thorough testing in testnet, including simulated contract execution and edge-case scenarios.
- Set up server-level monitoring (CPU, memory, disk) before go-live.
- Add validator health monitoring: a server can be online and the validator still not participating in consensus or generating rewards.
- Establish daily backups and document (and test) your recovery procedure.
Ongoing operations
- Maintain clear communication protocols for planned upgrades — inform your client in advance.
- Track validator activity continuously, not just server uptime.
- Roll back first, fix second — never debug in production under pressure.
- Align with a DevOps partner who understands both infrastructure and blockchain-specific behavior.
About CIDT
CIDT is an engineering partner for early-stage and growing teams building in Web3 and beyond. We've supported blockchain infrastructure projects — from testnet deployment to production monitoring and Validator-as-a-Service. If you're building on a custom blockchain and want to make sure your infrastructure is solid before mainnet, we'd be glad to talk.
Book a call with the CIDT team →