AvalancheGo v1.9.12 X-Chain Regression

Published by AVAX on

The Avalanche Primary Network experienced 2 brief periods of instability (134 minutes on 3/22/23 and 82 minutes on 3/26/23) because of a regression in OperationTx verification introduced during a refactor of the X-Chain. Avalanche Subnets, which run consensus independently of the Primary Network, still finalized blocks as usual during these 2 periods.

Issue: Erroneous X-Chain Refactor to Prepare for Cortina (036e34)

Previously, the execution of X-Chain state transitions was spread over multiple files with multiple abstractions.

In an effort to unify and simplify the pre- and post-linearization code paths before Cortina, all of this state transition logic was migrated to a single location. The special case around the additional OperationTx UTXOs, however, was not properly migrated: https://github.com/ava-labs/avalanchego/commit/036e34f143836dada1409af104e150f4b22598be

Because this refactor shipped in v1.9.12, nodes running v1.9.11 executed OperationTx transactions differently from nodes running v1.9.12.
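To illustrate why a missing validity check splits nodes, here is a minimal toy sketch in Go. It is not the AvalancheGo code: the Tx and State types, the function names, and the exact check that is dropped are all hypothetical and chosen only for illustration (the real logic is in the commit linked above). It shows the same transaction being rejected under one rule set and accepted under another, starting from identical state.

```go
package main

import "fmt"

// Tx is a hypothetical, heavily simplified stand-in for an OperationTx.
// It is NOT the avalanchego type.
type Tx struct {
	Inputs  []string // UTXO IDs spent as ordinary inputs
	OpUTXOs []string // additional UTXO IDs referenced by the operations
}

// State is a toy UTXO set: the IDs that are currently unspent.
type State map[string]bool

// verifyFull models, for this sketch only, a rule set that checks every UTXO
// the transaction touches, including the additional operation UTXOs.
func verifyFull(s State, tx Tx) bool {
	for _, id := range tx.Inputs {
		if !s[id] {
			return false
		}
	}
	for _, id := range tx.OpUTXOs {
		if !s[id] {
			return false
		}
	}
	return true
}

// verifyRegressed models a rule set that lost the operation-UTXO check
// (an assumed simplification of the regression, for illustration).
func verifyRegressed(s State, tx Tx) bool {
	for _, id := range tx.Inputs {
		if !s[id] {
			return false
		}
	}
	return true
}

// execute applies tx to s if verify accepts it, and reports whether it did.
func execute(s State, tx Tx, verify func(State, Tx) bool) bool {
	if !verify(s, tx) {
		return false
	}
	for _, id := range tx.Inputs {
		delete(s, id)
	}
	for _, id := range tx.OpUTXOs {
		delete(s, id)
	}
	return true
}

// runBoth executes one transaction from identical starting states under
// both rule sets and returns whether each rule set accepted it.
func runBoth() (fullAccepted, regressedAccepted bool) {
	tx := Tx{Inputs: []string{"utxo-a"}, OpUTXOs: []string{"utxo-missing"}}
	nodeA := State{"utxo-a": true}
	nodeB := State{"utxo-a": true}
	return execute(nodeA, tx, verifyFull), execute(nodeB, tx, verifyRegressed)
}

func main() {
	full, regressed := runBoth()
	fmt.Println("full rules accepted:", full)
	fmt.Println("regressed rules accepted:", regressed)
}
```

In this toy model the two nodes disagree on whether the transaction is valid, so their states differ after execution even though they started out identical.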

Fix: Restoring Missing Validity Checks to OperationTx and Repairing State (v1.9.12…v1.9.16)

To fix this regression, the missing logic from v1.9.11 OperationTx handling was reintroduced to protect against the incorrect execution of any further transactions. This patch was shipped in v1.9.14.

Because a handful of OperationTx transactions were accepted on the X-Chain while a supermajority of the Avalanche Network was running v1.9.12, the next step was to re-align the canonical state across nodes running v1.9.11, v1.9.12, and v1.9.14. This was completed when validators upgraded to v1.9.15. This change could not be released as part of v1.9.14 because incorrect execution had to be prevented before state re-alignment could be performed (otherwise it would not have been possible to compute a deterministic state to re-align to).

Shortly after v1.9.15 was released, however, we noticed that the number of processing vertices on the X-Chain remained at 2 (instead of fluctuating within the typical range of ~0–3).

Upon further investigation, we found one remaining item added to state pre-v1.9.14 that was not properly re-aligned in v1.9.15. This final re-alignment was completed when validators upgraded to v1.9.16.

Impact: 2 Brief Periods of Network Instability on C-Chain and P-Chain

Around the time that validation behavior was modified in v1.9.14 and when the state was re-aligned in v1.9.16, block acceptance was delayed on both the C-Chain and P-Chain (X-Chain was accepting new transactions during both periods):

  • 7:24 PM ET to 9:38 PM ET on 3/22/23 [134 minutes]
  • 6:36 AM ET to 7:58 AM ET on 3/26/23 [82 minutes]

In the first case, a substantial number of validators had not yet updated to the latest available version of AvalancheGo and were thus running with different execution rules. In the second case, a substantial number of validators had non-aligned states. Once more validators updated to the latest version of AvalancheGo available at each time, the Avalanche Network regained stability.

The majority of Subnets continued to finalize blocks, as designed, even when there were spikes in unfinalized blocks on the Primary Network (screenshot taken over 3/22/23).

However, a number of Subnet APIs became inaccessible during this period of instability. Nodes returned an unhealthy status and were removed from their associated load balancers, even though they could still serve traffic and were accepting new blocks (a node returns a failing health status if ANY aspect of the node is unhealthy, even if a particular Subnet is fine). This led to a dip in transaction throughput: many users who relied on publicly accessible endpoints could not issue transactions, and explorers that read data from these endpoints could not index new data.

Looking Forward

To add further regression testing to the X-Chain as soon as possible, the Cortina activation timeline on Fuji will be delayed by 1 week. The code for the Fuji activation will now be released on 4/3/23 and Cortina will activate on Fuji on 4/6/23.

A few months ago, we kicked off an X-Chain state merklization effort to protect against silent execution discrepancies and to unblock State Sync. The first product of this workstream was the ALPHA release of x/merkledb, and the work continues in Cortina with the addition of an unused state root field in the new X-Chain block format: https://github.com/ava-labs/avalanchego/blob/7d73b59cb4838d304387ea680b9cc4053b72620c/vms/avm/blocks/standard_block.go#L25. Once x/merkledb is production-ready, this state root field will be computed and verified during X-Chain block execution to ensure any variance in transaction execution is quickly detected during testing and at runtime. We are stress-testing x/merkledb in the HyperSDK and have begun our first audit.
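The idea of verifying a state root during block execution can be sketched in a few lines of Go. This is a simplified illustration, not x/merkledb and not the AvalancheGo block format: it uses a flat SHA-256 hash over sorted key/value pairs instead of a Merkle trie, and the stateRoot and verifyStateRoot names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"sort"
)

// stateRoot computes a deterministic commitment to a toy key/value state.
// Assumption: a flat hash over sorted, length-prefixed entries stands in
// for a real Merkle root.
func stateRoot(state map[string]string) [32]byte {
	keys := make([]string, 0, len(state))
	for k := range state {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort first

	h := sha256.New()
	for _, k := range keys {
		v := state[k]
		// Length prefixes keep distinct entries from hashing identically.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(v), v)
	}
	var root [32]byte
	copy(root[:], h.Sum(nil))
	return root
}

// verifyStateRoot compares the root claimed by a block with the root of the
// state this node computed locally after executing that block.
func verifyStateRoot(claimed [32]byte, local map[string]string) error {
	if stateRoot(local) != claimed {
		return errors.New("state root mismatch: local execution diverged")
	}
	return nil
}

func main() {
	proposer := map[string]string{"utxo-1": "unspent", "utxo-2": "unspent"}
	diverged := map[string]string{"utxo-1": "unspent", "utxo-2": "spent"}

	claimed := stateRoot(proposer)
	fmt.Println("matching node:", verifyStateRoot(claimed, proposer))
	fmt.Println("diverged node:", verifyStateRoot(claimed, diverged))
}
```

With a check of this kind, a node whose execution differs from the block producer's would fail verification immediately rather than silently carrying a different state.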

To improve the Subnet UX, AvalancheGo will introduce a Subnet-scoped health check that will allow Subnet integrators to continue serving API requests behind load balancers even if another Subnet running on the same node is unstable. This is another step in the continuing journey to ensure there is strong fault isolation between Subnets on Avalanche.
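The difference between a node-wide health check and a Subnet-scoped one can be sketched as follows. This is a hypothetical model in Go, not the AvalancheGo health API: the NodeHealth type, its methods, and the Subnet and chain names are assumptions made for illustration.

```go
package main

import "fmt"

// NodeHealth is a hypothetical model: Subnet name -> chain name -> healthy.
type NodeHealth map[string]map[string]bool

// Overall reports healthy only if every chain on every Subnet is healthy.
// This mirrors the "fail if ANY aspect is unhealthy" behavior.
func (h NodeHealth) Overall() bool {
	for _, chains := range h {
		for _, ok := range chains {
			if !ok {
				return false
			}
		}
	}
	return true
}

// Subnet reports health for a single Subnet, ignoring all others.
// An unknown Subnet is treated as unhealthy.
func (h NodeHealth) Subnet(name string) bool {
	chains, found := h[name]
	if !found || len(chains) == 0 {
		return false
	}
	for _, ok := range chains {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	h := NodeHealth{
		"primary":  {"P": false, "C": false, "X": true},
		"subnet-a": {"chain-a": true},
	}
	fmt.Println("node-wide healthy:", h.Overall())
	fmt.Println("subnet-a healthy:", h.Subnet("subnet-a"))
}
```

A load balancer probing the scoped check for subnet-a would keep the node in rotation in this scenario, whereas one probing the node-wide check would remove it.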

We appreciate the eagerness of the Avalanche community to share their logs/metrics with each other to help find the root cause of this regression and the speed (usually 30–90 minutes) at which they updated a supermajority of stake after each release. Your contributions to recovery were invaluable.

AvalancheGo v1.9.12 X-Chain Regression was originally published in Avalanche on Medium.

