Preliminary Analysis of the Invalid Minting Bug

Published by AVAX on

The Avalanche network experienced a serious issue earlier this week. This article describes what caused it, how a team of developers and community members addressed it, and the lessons to be drawn from it.

TL;DR

Heavy load on the system triggered a non-deterministic bug related to state verification. This caused some validators to accept some invalid mint transactions, while the rest of the network refused to honor these transactions and stalled the C-chain.

This was an invalid mint, not a double spend. Independent transactions on the X-chain continued to process. The network did not halt.

Once the problem was identified, the code was quickly patched to apply the requisite checks in the future, without fund loss. The patched code was distributed to the community and applied to the network. The Avalanche network is stronger than before.

But the full story is much more colorful than that.

It started with the Pangolin launch, which generated significant activity on Avalanche’s smart contract chain. Recall that Avalanche has 3 default chains: C-chain for contracts, X-chain for asset transfers, and P-chain for staking and subnet management. They each run a different virtual machine, designed to serve their respective purposes and validated by the same set of validators.

AVAX and other digital assets can be transferred between these chains. Because these chains are separate and isolated from each other, they do not know about or parse each other’s transactions. When someone moves AVAX from one chain to another, the system destroys the AVAX on the source chain, and mints it on the destination chain through a pair of transactions. This design enables the high-performance, DAG-based X-chain to be coupled to the slower, totally-ordered, linear C- and P-chains, without requiring any time or data structure synchronization.
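The burn-then-mint hand-off can be pictured with a minimal Go sketch. It is an illustration only, not AvalancheGo code: the `Chain`, `SharedMemory`, `Export`, and `Import` names are hypothetical, and per-owner balances are an assumption made to keep the example short.

```go
package main

import "errors"

// UTXOID identifies an exported output (hypothetical type for this sketch).
type UTXOID string

// Chain is a toy ledger with one balance per owner.
type Chain struct {
	balances map[string]uint64
}

// SharedMemory models the hand-off area between two chains: outputs that
// were destroyed on a source chain and not yet minted on a destination.
type SharedMemory struct {
	exported map[UTXOID]uint64
}

var (
	ErrInsufficientFunds = errors.New("insufficient funds on source chain")
	ErrDuplicateExport   = errors.New("utxo id already exported")
	ErrUnknownUTXO       = errors.New("utxo not found or already imported")
)

// Export destroys amount on the source chain and records it for import.
func Export(src *Chain, mem *SharedMemory, owner string, id UTXOID, amount uint64) error {
	if src.balances[owner] < amount {
		return ErrInsufficientFunds
	}
	if _, exists := mem.exported[id]; exists {
		return ErrDuplicateExport
	}
	src.balances[owner] -= amount
	mem.exported[id] = amount
	return nil
}

// Import mints the exported amount on the destination chain. Each exported
// output can be consumed only once; a second attempt is an invalid mint.
func Import(dst *Chain, mem *SharedMemory, owner string, id UTXOID) error {
	amount, ok := mem.exported[id]
	if !ok {
		return ErrUnknownUTXO
	}
	delete(mem.exported, id)
	dst.balances[owner] += amount
	return nil
}
```

In this toy model, a second `Import` of the same `UTXOID` returns `ErrUnknownUTXO`. That is the check an invalid mint has to slip past.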

All blockchains, Avalanche included, perform validity checks on each transaction and block. The validity checks related to minting transactions on the C-chain in AvalancheGo v1.1.5 were performed here.

The purpose of this code is to ensure that all import transactions are valid. You can see, for example, that the C-chain genesis block is taken as a base case, and that all other minting transactions are verified through a block cache. As an illustration of what happened, suppose that the network has issued blocks B0 ← B1 ← B2 ← B3 ← B4, where ← indicates a parent relationship.

If B1 and B4 contain a conflicting minting transaction, B1 should be marked as valid and B4 should be rejected for being invalid. The validity check for B4 should ensure that none of its ancestors B1 through B3 consume a conflicting minting transaction. Now keep in mind that under high load, multiple blocks can be in processing at once, with only a prefix having been accepted.
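For reference, the intended check can be sketched without any cache. This is a simplified model, not the AvalancheGo implementation: the `Block` type and `ConflictsWithAncestors` function are hypothetical, and the sketch assumes that inputs consumed by already accepted blocks are checked separately against committed state.

```go
package main

// Block is a hypothetical, stripped-down block for this sketch.
type Block struct {
	Parent   *Block
	Accepted bool
	Inputs   []string // input IDs consumed by the block's mint/destroy tx; empty if none
}

// ConflictsWithAncestors walks from b's parent back to the last accepted
// block and reports whether any input of b is already consumed by one of
// those undecided ancestors.
func ConflictsWithAncestors(b *Block) bool {
	seen := make(map[string]struct{})
	for anc := b.Parent; anc != nil && !anc.Accepted; anc = anc.Parent {
		for _, in := range anc.Inputs {
			seen[in] = struct{}{}
		}
	}
	for _, in := range b.Inputs {
		if _, ok := seen[in]; ok {
			return true
		}
	}
	return false
}
```

With B0 accepted and B1 and B4 both consuming the same input, the walk from B4 reaches B1 and reports the conflict, regardless of what B2 and B3 contain.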

Suppose that block B0 has been accepted, and blocks B1, B2, B3, and B4 are all at the proposal stage at the same time, so they are unaccepted (also known as undecided or unfinalized, that is, they have not been committed to the blockchain yet). It turns out that this code snippet for checking minting transactions contains a subtle bug that arises under a very specific, timing-dependent condition.

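// Only blocks that carry an atomic transaction reach this branch.
if atx != nil {
	inputs.Union(atx.UnsignedTx.(UnsignedAtomicTx).InputUTXOs())
	// inputsCopy is filled only here, so it stays empty for other blocks.
	inputsCopy.Union(*inputs)
}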

Specifically, this snippet populates a variable called inputsCopy, but it never performs a union of inputs with inputsCopy if the current block does not contain a minting or destroy transaction. The resulting empty input set is then cached, which breaks the invariant that a cached set contains all the input IDs of its ancestors’ minting and destroy transactions. As a result, an invalid cache update, made at one point in time because of a concurrent not-yet-accepted parent, could later cause a conflicting transaction to be marked as valid. Consider the scenario where B1 contains a minting transaction, B2 contains a non-minting transaction, B3 contains an unrelated minting transaction, and blocks B1 through B3 are not yet accepted: a bad cache entry could then enable a spurious mint in B4.
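The effect of the empty cache entry can be reproduced in a small model. The sketch below is not the AvalancheGo code. It borrows the names `inputs` and `inputsCopy` from the snippet, while `InputSet`, `Block`, `Verifier`, and the `Buggy` flag are hypothetical. It assumes blocks are verified parent-first and that the cache is keyed by block.

```go
package main

// InputSet is a set of input IDs (hypothetical type for this sketch).
type InputSet map[string]struct{}

// Union adds every element of o to s.
func (s InputSet) Union(o InputSet) {
	for k := range o {
		s[k] = struct{}{}
	}
}

// Overlaps reports whether s and o share at least one element.
func (s InputSet) Overlaps(o InputSet) bool {
	for k := range o {
		if _, ok := s[k]; ok {
			return true
		}
	}
	return false
}

// Block is a simplified block; Inputs is nil when it has no atomic tx.
type Block struct {
	Parent *Block
	Inputs InputSet
}

// Verifier caches, per block, the inputs consumed by it and its ancestors.
type Verifier struct {
	Buggy bool // when true, mimic the missing union described above
	cache map[*Block]InputSet
}

func NewVerifier(buggy bool) *Verifier {
	return &Verifier{Buggy: buggy, cache: make(map[*Block]InputSet)}
}

// Verify returns false when b conflicts with the inputs cached for its parent.
func (v *Verifier) Verify(b *Block) bool {
	inputs := InputSet{}
	if b.Parent != nil {
		inputs.Union(v.cache[b.Parent]) // a missing entry acts as an empty set
	}
	if b.Inputs != nil && inputs.Overlaps(b.Inputs) {
		return false
	}
	inputsCopy := InputSet{}
	if b.Inputs != nil {
		inputs.Union(b.Inputs)
		inputsCopy.Union(inputs)
	} else if !v.Buggy {
		// Carry the ancestors' inputs forward even without an atomic tx.
		inputsCopy.Union(inputs)
	}
	v.cache[b] = inputsCopy // in buggy mode this is empty for such blocks
	return true
}
```

Running the B1 through B4 scenario with `Buggy` set to true, the entry cached for B2 is empty, B3 caches only its own input, and B4 passes verification even though it reuses B1's input. With `Buggy` set to false, B4 is rejected.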

The bug did not affect regular transactions, coin transfers, asset transfers, coin destruction, or smart contract invocations. Avalanche never allowed any user to successfully send the same funds to two recipients. This means that absolutely no double-spends happened.

Instead, the bug permitted an invalid mint operation on the C-chain by failing to record a conflict with an ancestor block, and did so if and only if that ancestor block was undecided simultaneously on a sufficiently large number of nodes.

The fact that the bug could only be triggered when three ancestors were undecided meant that the bug was non-deterministic. That is, it could only occur under certain intermittent network conditions. Since all nodes proceed at slightly different rates, one node may have successfully validated and accepted B1 when B4 was being checked for validity, in which case it would correctly mark B4 as invalid, while another might not have finalized B1 at the time it checked B4, thus permitting B4.
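The divergence between nodes can be illustrated with a heavily simplified model. Everything here is hypothetical: `Node`, `AllowsMint`, and `Diverge` are names invented for this sketch. It assumes a node rejects a mint if its input is already consumed either in committed state or in the inputs it has recorded for undecided ancestors.

```go
package main

// Node is a hypothetical, simplified view of one validator's state.
type Node struct {
	Committed map[string]bool // inputs consumed by blocks this node has accepted
	Pending   map[string]bool // inputs recorded for this node's undecided ancestors
}

// AllowsMint reports whether the node would treat a mint consuming utxo as valid.
func (n Node) AllowsMint(utxo string) bool {
	return !n.Committed[utxo] && !n.Pending[utxo]
}

// Diverge reports whether two nodes reach different verdicts on the same mint.
func Diverge(a, b Node, utxo string) bool {
	return a.AllowsMint(utxo) != b.AllowsMint(utxo)
}
```

A node that has already accepted B1 has B1's input in `Committed` and rejects B4. A node on which B1 is still undecided, and whose `Pending` set lost that input because of the bad cache entry, allows B4. So `Diverge` returns true for that pair.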

On February 11th, an increase in network traffic due to the launch of Pangolin raised the number of blocks being processed concurrently, which triggered the bug.

The upshot is that some validators accepted blocks that contained invalid import transactions while others did not. Eventually, the validators that accepted invalid mints began to disagree amongst themselves as well because of the non-determinism, which led them to stop making progress on the C-chain to ensure safety until these issues were resolved.

Because the X- and P-chains did not contain the minting bug, the Avalanche network continued to process transactions unrelated to the C-chain. The C-chain stopped processing, but other independent transactions on the Avalanche network never stopped. The X-chain is a DAG, not a linear chain, which enabled it to keep processing transactions even as developers were debugging the system.

Per the guarantees of the Avalanche protocol, no two nodes accepted different blocks at the same height or containing otherwise conflicting transactions. Specifically, some Avalanche nodes continued to make progress on the correct canonical chain while others that didn’t encounter the bug simply halted. Validators who were not susceptible to the non-deterministic bug refused to make any progress when a collection of validators accepted a block they considered invalid. Because a majority of the validators did accept the block, however, the remaining validators could not make progress on any other blocks at that height that did not contain the invalid mint.

The Fix

At first thought, the easiest way to fix a problem like this may seem to be to rewrite the blockchain and undo accepted transactions. But because no single entity is in control of the Avalanche network or controls a sufficient percentage of the nodes to do so, such an approach was actually infeasible. This is a good thing, as it shows that the Avalanche network is truly decentralized.

Instead, the best feasible fix is one that is incrementally deployable, that is, it does not require a hard or a soft “fork,” does not require control over a prohibitive majority of the network, and provides benefits to everyone who deploys it. The developers quickly put together a patch that accepted all of the blocks that the network had accepted, and rolled the network forwards to the same set of common tips. The critical part of the patch is the following, which disables the cache and introduces the notion of special blocks that respect the invalid mints.

The diff can be viewed here.

The downside of the patch was that it permitted the generation of 790.2160157 additional AVAX. To maintain the invariant that there will never be more than 720 million AVAX, the Avalanche Foundation has decided to burn the same amount of AVAX that was created. Deploying the patch allowed all of the correct nodes to agree on the highest valid tip and continue making progress in unison.

Conclusion

Almost every major blockchain has had episodes where a bug caused an issue that threatened the safety or liveness of the system. These are undesirable but inevitable events that offer opportunities to harden the system.

The bug did not stem from a fundamental problem with Avalanche consensus, the network, or the system.

The Avalanche ecosystem collectively responded quickly to what happened. The existing code and the deployed patch ensured all funds remained safe and permitted a graceful recovery of the system. The network is healthy, which anyone can verify by performing transactions or observing network metrics.

Exchanges are verifying their internal accounting as they do after any such event. The bug coincided with the Chinese New Year, which has caused delays in the reactivation of services. No funds were or could be lost.

The exchanges will take time to get back into sync with the network. If you were performing a withdrawal from an exchange at the time of the network slowdown, there is a possibility that the exchange needs to rebroadcast your withdrawal transaction. If you were performing a deposit, and explorers are showing that the transaction went through, then the exchanges should credit your account when they sync up with the network.

This episode revealed and fixed a complex bug, and at the same time demonstrated to the world that the Avalanche network is truly decentralized. The Avalanche network is stronger as a result, and its community remains as incredibly supportive as always.

As the analysis covers complex subjects, updates may be made to clarify.


Preliminary Analysis of the Invalid Minting Bug was originally published in Avalanche on Medium.
