A Polkadot Postmortem - 24.05.2021

On 24 May 2021, Polkadot nodes failed with an out of memory (OOM) error on block 5,202,216. This block contained an on-chain solution to the validator election, which is normally computed off-chain and only takes place on-chain if no off-chain solution is submitted.

By Bastian Köcher, May 27, 2021

TL;DR: On 24 May 2021, Polkadot nodes failed with an out of memory (OOM) error on block 5,202,216. This block contained an on-chain solution to the validator election, which is normally computed off-chain and only takes place on-chain if no off-chain solution is submitted. Due to the large number of nominators, the election overflowed the memory allocated in the Wasm environment.

While an update was being prepared to fix the issue, validators were asked to temporarily downgrade their node software to a previous version that includes a native (non-Wasm) version of the runtime. The native version is not constrained by the Wasm memory allocator. The network recovered after an hour and ten minutes of downtime.

Later, on block 5,203,204, several nodes failed with a “storage root mismatch” error. Investigation showed that this was caused by a difference between the compiler versions used to build the native runtime and the on-chain Wasm runtime. The solution was to implement a feature that allows overriding the on-chain Wasm runtime with a Wasm runtime built with the correct compiler version.

The issue has since been resolved and precautions have been implemented to prevent this from happening again in the future.

The bad

On 24.05.2021, Polkadot nodes failed with an out of memory (OOM) error while trying to build block 5202216. The nodes themselves did not crash, but the runtime (i.e. the blockchain’s state transition function) did. Polkadot’s runtime is compiled to WebAssembly and is executed either by a Wasm interpreter or a Wasm compiler. However, the runtime execution environment always provides a fixed amount of memory (64MB at that time), and this wasn’t enough for this block.
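To make the fixed memory budget concrete, here is a minimal sketch of a capped allocator. It assumes that "64MB" means 64 MiB and relies on the WebAssembly page size of 64 KiB; the `BoundedHeap` class is a hypothetical illustration and not the allocator the node actually uses.

```python
WASM_PAGE_SIZE = 64 * 1024  # bytes per WebAssembly linear-memory page


def pages_for(mebibytes: int) -> int:
    """Number of Wasm pages that make up a budget given in MiB."""
    return mebibytes * 1024 * 1024 // WASM_PAGE_SIZE


class OutOfMemory(Exception):
    """Raised when an allocation would exceed the fixed budget."""


class BoundedHeap:
    """Hypothetical bump allocator with a hard cap (illustration only)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def allocate(self, size: int) -> int:
        """Reserve `size` bytes and return their offset, or fail at the cap."""
        if self.used_bytes + size > self.limit_bytes:
            remaining = self.limit_bytes - self.used_bytes
            raise OutOfMemory(f"need {size} bytes, only {remaining} left")
        offset = self.used_bytes
        self.used_bytes += size
        return offset
```

Under these assumptions a 64 MiB budget corresponds to 1024 pages. A workload that needs more than that fails no matter how much memory the host machine has, which is why the runtime could run out of memory while the node process itself kept running.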

This block was the last block of the penultimate session in the era, meaning that a new validator set needed to be elected for the new era that would start after the next session. The election of the validator set can be done off-chain or on-chain, but off-chain is preferred because the election algorithm is computationally heavy. However, for this session no validator submitted a solution (presumably because they ran into the same OOM while doing the election off-chain), so the election had to be done on-chain, and the result was the OOM that all validators hit while trying to author this block. The fix for the OOM was fairly simple: increase the default memory size of the Wasm runtime to 128MB (https://github.com/paritytech/substrate/pull/8892).
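The fallback described above can be sketched roughly as follows. The function name, its parameters, and the string labels are hypothetical and only illustrate the control flow; they do not mirror the runtime's actual election code.

```python
from typing import Callable, Optional, Sequence, Tuple

Validators = Sequence[str]


def elect_validators(
    submitted_solution: Optional[Validators],
    compute_on_chain: Callable[[], Validators],
) -> Tuple[Validators, str]:
    """Prefer a solution computed off-chain; fall back to on-chain election.

    `submitted_solution` is None when nobody submitted a solution in time.
    `compute_on_chain` stands in for the expensive election algorithm that
    then has to run inside block execution.
    """
    if submitted_solution is not None:
        return submitted_solution, "off-chain"
    # No submission: the heavy computation now runs during block authoring,
    # inside the same bounded Wasm memory as the rest of the runtime.
    return compute_on_chain(), "on-chain"
```

In this sketch, a failure inside `compute_on_chain` (for example, running out of memory) propagates to the caller, which corresponds to the block failing to build.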

To bring this change to all validators, a new release would need to be cut, and a large number of validators would need to update. However, there was a much easier short-term solution to this problem that was, most importantly, faster to deploy. Polkadot’s runtime is compiled not only to Wasm but also to native code for better performance, and crucially, the native runtime does not put any bounds on memory usage during execution. But the native runtime only matches the on-chain runtime when the running node is from the same release as the on-chain runtime. The on-chain runtime at this point was the runtime matching the v0.8.30 release, which was released on 08.04.2021. Since then, there had already been three new releases, meaning most of the validators were already running the latest node release (v0.9.x).
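A simplified sketch of why the downgrade was needed: the native runtime is only usable when it matches the on-chain runtime. The function below is hypothetical and compares release strings for illustration; it is not how the node actually decides which executor to use.

```python
def pick_executor(node_release: str, on_chain_release: str, prefer_native: bool) -> str:
    """Return which runtime executes blocks in this simplified model.

    Native execution is only possible when the node was built from the same
    release as the runtime stored on-chain; otherwise the on-chain Wasm
    runtime has to be used.
    """
    if prefer_native and node_release == on_chain_release:
        return "native"
    return "wasm"
```

In this model, a v0.9.x node still ends up on Wasm even when native execution is preferred, because its native runtime does not match the v0.8.30 runtime on-chain. Only a v0.8.30 node can execute the block natively and so avoid the Wasm memory limit.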

So, in an effort to get past the problematic block as fast as possible, all validator operators were asked to downgrade their validators to v0.8.30 and to run them with the `--execution native` flag to force running with the native runtime. Overall, it took about 1 hour and 10 minutes to go from detecting the issue, through coming up with a short-term solution and announcing it to validators, to having new blocks built and the network fully recovered.

After the network was back, we started preparing the 0.9.3 release to distribute the increase of the Wasm max memory usage so we could support using the Wasm runtime again. In this process, we took a node and wanted to check that syncing the problematic block with the increased memory ceiling now worked with Wasm. The problematic block did indeed import, but we encountered a storage root mismatch while trying to import block 5203204.

The ugly

A storage root mismatch means that importing a block doesn’t lead to the same storage root advertised by the block author. In general, in a blockchain the same input should always lead to the same output. However, in this case the network was still running and building blocks, which could only mean that there was a non-determinism between the native and the Wasm runtime, because we had instructed all validators to run with the native runtime.
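The check behind this error can be sketched as follows. This is a deliberately simplified model: the "storage root" here is a SHA-256 hash over the sorted key/value pairs, standing in for the Merkle trie root a real node computes, and `import_block`, `apply_block`, and `StorageRootMismatch` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict

State = Dict[bytes, bytes]


def storage_root(state: State) -> str:
    """Simplified stand-in for a trie root: a hash over sorted key/value pairs."""
    hasher = hashlib.sha256()
    for key in sorted(state):
        value = state[key]
        # Length-prefix each field so different key/value splits cannot collide.
        hasher.update(len(key).to_bytes(4, "little") + key)
        hasher.update(len(value).to_bytes(4, "little") + value)
    return hasher.hexdigest()


class StorageRootMismatch(Exception):
    """The locally computed root differs from the one the author advertised."""


def import_block(
    state: State,
    apply_block: Callable[[State], State],
    advertised_root: str,
) -> State:
    """Re-execute a block on a copy of the state and verify the resulting root."""
    new_state = apply_block(dict(state))
    computed = storage_root(new_state)
    if computed != advertised_root:
        raise StorageRootMismatch(f"computed {computed}, expected {advertised_root}")
    return new_state
```

If the importing node's `apply_block` writes even one stored value differently from the author's, the hashes differ and the import is rejected.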

So we started to investigate the mismatch between the native and Wasm runtimes. We tried to sync the chain locally first with the same release and the native runtime. However, this also led to the same storage root mismatch. This was even more alarming, because the same code compiled for the same architecture should always produce the same results. When we compile the Wasm runtime we do this using the so-called `no_std` environment, which involves using different code paths. So, it is “easier” to introduce some mismatch there, but compiling the native runtime twice should result in code that does the same thing both times.

This brought us to the assumption that the Rust compiler may have been generating faulty code that resulted in the mismatch we had seen. Due to some extreme luck (otherwise our endeavour would probably have taken a bit longer), someone at Parity still had a binary of this release lying around that wasn’t the same as the one attached to the release on GitHub. This binary was able to sync the chain with the native runtime without any problems. The only difference between this binary and the one we built before was the Rust compiler version that had been used. So we thought maybe something had changed between the latest compiler version and the version that we used to build the node back then. And yes, after downgrading the Rust compiler and re-building the release branch, the node now managed to sync successfully.

The good

After verifying that the native runtime compiled with the old Rust compiler could sync the chain, we also tried compiling the Wasm runtime with this Rust compiler. There is a special flag for the Polkadot node that allows us to override the on-chain Wasm runtime with a local version, and we used this to verify that syncing worked. So the question became: why did we have this mismatch between the native and Wasm runtimes of the 0.8.30 release? You need to know that we use the Rust nightly compiler to compile the Wasm runtime (the nightly is required because not everything we use in the Wasm build is yet in the stable Rust compiler). The compiler versions used for the node and the Wasm runtime are part of the release announcement.

So something must have changed between the 1.51.0 stable Rust compiler (released on 23.03.2021 and used to build the native runtime) and the Rust nightly compiler from 07.04.2021 that was used to build the Wasm runtime. After some time bisecting the Rust toolchains between these dates, we found the nightly from 05.03.2021 to be the first one that broke our determinism. So we only needed to check the commits that got merged between 04.03.2021 and 05.03.2021 and found the problematic commit.
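The bisection itself can be sketched as a binary search over nightly dates. The helper below is hypothetical; it assumes a known-good starting date, a known-bad end date, and that every nightly after the first bad one is also bad. The `is_good` callback stands in for "build the node with that nightly and try to sync the chain".

```python
from datetime import date, timedelta
from typing import Callable


def first_bad_nightly(good: date, bad: date, is_good: Callable[[date], bool]) -> date:
    """Return the earliest date in (good, bad] whose nightly fails the check.

    Invariant: the nightly from `good` passes and the nightly from `bad` fails.
    """
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_good(mid):
            good = mid
        else:
            bad = mid
    return bad
```

For example, with an assumed known-good nightly from 2021-02-01 (a made-up starting point) and the known-bad nightly from 2021-04-07, a check that starts failing on 2021-03-05 is located in about six builds instead of more than sixty.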

Compiling the Rust compiler without this commit and using the self-built compiler to compile our node showed that the native runtime produced the correct data and we could sync the chain. The commit changed the `binary_search_by` function in a way that it could return a different index when there are multiple matches. As we use this function in the runtime, it can lead to a slightly different ordering of the data that is stored in the state, which leads to a different storage root.
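To illustrate how two correct binary searches can disagree, here is a small sketch with two hand-written variants. Neither is the standard library's implementation, before or after the commit; they only show that "any matching index" leaves room for different answers. Inserting at the returned index is a hypothetical usage pattern, not the runtime's actual code.

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[int, str]  # (sort key, payload)


def search_early_return(items: List[Entry], key: int) -> Optional[int]:
    """Binary search that returns as soon as it sees any matching key."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid][0] == key:
            return mid
        if items[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    return None


def search_leftmost(items: List[Entry], key: int) -> Optional[int]:
    """Binary search that keeps narrowing until it reaches the first match."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(items) and items[lo][0] == key:
        return lo
    return None


def insert_at_match(
    items: List[Entry],
    new_entry: Entry,
    search: Callable[[List[Entry], int], Optional[int]],
) -> List[Entry]:
    """Insert `new_entry` at the index where `search` finds an equal key."""
    index = search(items, new_entry[0])
    if index is None:
        raise KeyError(new_entry[0])
    return items[:index] + [new_entry] + items[index:]
```

With `[(1, "a"), (2, "b"), (2, "c"), (2, "d"), (3, "e")]` and key `2`, the first variant returns index 2 and the second returns index 1. Both results are sorted by key, but the payloads end up in a different order, so the encoded value written to storage differs as well.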

So this meant that we now had blocks built by the native runtime that could not be synced with the Wasm runtime, and we could not change the on-chain Wasm runtime to fix this, because you cannot rewrite the history of the blockchain without forking. We came up with a pull request that introduces `code_substitute` to the chain specification. The chain specification is mainly used to store the genesis and some other information about the chain. This new field `code_substitute` is a map that uses a block hash as key and maps to a Wasm runtime code blob. It instructs the node to overwrite the on-chain Wasm runtime with the given one from every block after the one specified in the chain specification until the spec version of the runtime doesn’t match anymore.
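The selection rule described above might look roughly like this. The types and field names are hypothetical, the block hash is assumed to have already been resolved to a block number, and treating "after" as strictly after is also an assumption; the actual implementation may differ on all of these points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeSubstitute:
    """Hypothetical in-memory form of one `code_substitute` entry."""

    start_block: int   # number of the block named by the hash key (assumed resolved)
    spec_version: int  # spec version of the on-chain runtime being replaced
    wasm_blob: bytes   # replacement Wasm runtime code


def runtime_code_for(
    block_number: int,
    on_chain_spec_version: int,
    on_chain_code: bytes,
    substitute: Optional[CodeSubstitute],
) -> bytes:
    """Pick the Wasm code used to execute the block with the given number."""
    if (
        substitute is not None
        and block_number > substitute.start_block
        and on_chain_spec_version == substitute.spec_version
    ):
        return substitute.wasm_blob
    # Before the substitute applies, or once a runtime upgrade has changed the
    # spec version, the node goes back to the code stored on-chain.
    return on_chain_code
```

Tying the substitute to the spec version means it expires on its own: as soon as the chain upgrades its runtime, the override no longer matches and the new on-chain code is used.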

We also created a pull request that uses the `code_substitute` with the correct values to enable the nodes to sync again using Wasm. Anyone can rebuild the runtime using `srtool` to make sure that what’s being built is the code from v0.8.30 and that they get the same Wasm blob.

With the 0.9.3 release the node contains all the required fixes to make the chain work as expected.

In the future, we will improve the situation further:

  • The deprecation of the native runtime will now be pursued with a much higher priority. Using the Wasm compiler Wasmtime already brings us to a performance level that is almost the same as using the native runtime, so we don’t really need the native optimization anymore, especially given all its potential downsides.
  • The allocator will be improved to support a much more flexible allocation of resources, meaning we will not cap the maximum allocation at 128MB and will probably support the maximum of Wasm (4GB).
  • On-chain elections will be completely disabled; an election now needs to always happen off-chain and be submitted to the runtime.
  • Until the allocator is improved the off-chain worker will use a higher memory limit than the on-chain Wasm runtime execution. This should help with making sure that off-chain elections don’t run out of memory and can be successfully submitted.
  • For the time being, with a native and Wasm runtime, we will make sure to use the same compiler version for the native and the Wasm build. This should prevent running into changes resulting from using different toolchain versions.
