Incident Retrospective: Self Chain Mainnet v2 CosmWasm Upgrade

Dear Self Chain Community,

We recently experienced an extended consensus halt during the Self Chain Mainnet v2 upgrade, which activated CosmWasm smart contracts. In this retrospective, we share what went wrong, how we resolved it, and the measures we're implementing to prevent a recurrence.


🛑 Summary of the Incident

On April 18, 2025, at block height #4,805,959, the scheduled Mainnet v2 upgrade was activated. Shortly afterward, the network experienced severe consensus issues, including:

  • AppHash mismatches
  • Repeated round regressions
  • Double-signing (equivocation) events
  • Multiple validator slashings and jailings
  • Extended network halts

The instability lasted from roughly 11:44 UTC on April 18 until the morning of April 19, causing downtime for validators, users, and ecosystem participants.


🔍 Detailed Analysis: What Went Wrong?

Upon deep analysis, we've identified three primary root causes:

1. Non-deterministic State Migration

The core upgrade handler (responsible for migrating blockchain state at the upgrade height) was non-deterministic, meaning validators ended up with slightly differing blockchain states. Even minor differences in state data resulted in:

  • Mismatching AppHash and LastResultsHash across validators
  • Complete failure of consensus (validators refused to precommit blocks)
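To make the failure mode concrete, here is a minimal Go sketch of one common source of non-determinism: iterating over a map. This is an illustration only. It is not the actual Self Chain upgrade handler, and this post does not state that map iteration was the specific cause. The function name hashBalances and the sample addresses are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashBalances folds a set of account balances into a single digest.
// Go does not guarantee map iteration order, so the keys are sorted first.
// Hashing in raw `range` order could yield different digests on different
// nodes even though every node holds the same data.
func hashBalances(balances map[string]uint64) string {
	keys := make([]string, 0, len(balances))
	for k := range balances {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order, so every node hashes the same byte stream

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%d;", k, balances[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical sample data, not real chain state.
	balances := map[string]uint64{"addr-a": 10, "addr-b": 20, "addr-c": 30}
	fmt.Println(hashBalances(balances))
}
```

The same principle applies to anything a migration writes to state: the bytes must depend only on the prior state and the block being processed, never on iteration order, wall-clock time, or local configuration.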

2. Insufficient Validator Coordination

Validator instructions were clear, but coordination during rollback and recovery was insufficient. As a result, validators ended up with:

  • Different binaries (version mismatches)
  • Different snapshots or blockchain states
  • Diverging consensus rounds, causing round regression errors

3. Double-Signing Events (Byzantine Behavior)

Because validators repeatedly reset and restarted with inconsistent states, some accidentally double-signed blocks. This triggered severe slashing events, automatic jailings, and further consensus disruptions.
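The mechanism behind this can be sketched in a few lines of Go. Tendermint-style validators keep a record of the last height, round, and step they signed, and refuse to sign anything at or before that position. The code below is a simplified model under that assumption, not the real signer: the type signState and the method checkAndAdvance are hypothetical names, and a production signer handles more cases (for example, re-sending an identical vote).

```go
package main

import "fmt"

// signState models a validator's last-signed record: the latest
// height/round/step for which it has already produced a signature.
type signState struct {
	Height int64
	Round  int32
	Step   int8
}

// checkAndAdvance permits signing only if (height, round, step) is strictly
// later than the last signed position, then records the new position.
func (s *signState) checkAndAdvance(height int64, round int32, step int8) error {
	later := height > s.Height ||
		(height == s.Height && round > s.Round) ||
		(height == s.Height && round == s.Round && step > s.Step)
	if !later {
		return fmt.Errorf("refusing to sign at %d/%d/%d: last signed %d/%d/%d",
			height, round, step, s.Height, s.Round, s.Step)
	}
	s.Height, s.Round, s.Step = height, round, step
	return nil
}

func main() {
	st := &signState{}
	fmt.Println(st.checkAndAdvance(100, 0, 1)) // first vote at this position
	fmt.Println(st.checkAndAdvance(100, 0, 1)) // same position again: rejected

	// Wiping the record discards the protection. After a reset, the same
	// position is accepted again, which is how a restart with inconsistent
	// state can produce two signatures for one height and round.
	*st = signState{}
	fmt.Println(st.checkAndAdvance(100, 0, 1))
}
```

This is why resetting signing state during recovery has to be coordinated: the guard only works while the record of what was already signed survives the restart.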


🚨 Detailed Timeline of Events

Timestamp (UTC)                Event
Apr 11, 2025, 11:00            Governance proposal voting started
Apr 16, 2025, 11:00            Governance proposal approved
Apr 18, 2025, 11:08:09         Chain halted at block #4,805,959 for upgrade
Apr 18, 2025, ~11:20           Validators upgraded to v2 binary; consensus briefly restored
Apr 18, 2025, ~11:44           First consensus failure due to AppHash mismatch
Apr 18, 2025, ~12:05–17:36     Repeated attempts to restore consensus failed; multiple round regressions
Apr 18, 2025, ~17:36           Critical consensus panic (CONSENSUS FAILURE) occurred due to Byzantine validators
Apr 18, 2025, ~18:52           Mass validator jailing and network-wide halt
Apr 19, 2025, ~00:39           Identified the Byzantine validators responsible for the double-signing events
Apr 19, 2025, early morning    Coordinated final rollback, using identical snapshot and deterministic binary
Apr 19, 2025, later morning    Consensus finally restored; stable network operation resumed

🛠️ Immediate Actions & Recovery Steps

To fully resolve these issues, we executed the following emergency measures:

  • Snapshot Restoration:
    All validators were instructed to reset and restore from a single snapshot taken at the same block height (#4,805,958).
  • Corrected Deterministic Binary:
    Released and enforced the mainnet-v2.0.1 binary with fully deterministic state migration.
  • Consensus State Reset:
    Validators explicitly cleared corrupted consensus states (cs.wal) and carefully recreated priv_validator_state.json.
  • Validator Slashing & Unjailing:
    Addressed the double-signing events: affected validators were slashed and temporarily jailed, then unjailed once stability was restored.

📌 Lessons Learned (What we'll improve):

We deeply apologize for the disruption caused by these events. Moving forward, we've committed to the following improvements:

1. Strict Determinism in State Migrations

  • Mandatory code reviews and automated testing to enforce strict determinism.
  • Public testnet dry-runs before every upgrade.
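As a rough idea of what an automated determinism check can look like, the Go sketch below runs a migration several times on the same input and compares the resulting digests. It is a simplified illustration, not our test suite: migrateFunc and checkDeterministic are hypothetical names, and the state is modeled as a plain byte slice. Repeated runs on one machine also catch only some classes of non-determinism; differences between platforms or binary versions need cross-node comparison on a testnet.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// migrateFunc stands in for an upgrade handler: it takes the serialized
// pre-upgrade state and returns the serialized post-upgrade state.
type migrateFunc func(pre []byte) []byte

// checkDeterministic runs the migration `runs` times on the same input and
// returns an error if any run's digest differs from the first run's.
func checkDeterministic(m migrateFunc, pre []byte, runs int) error {
	var first [32]byte
	for i := 0; i < runs; i++ {
		input := append([]byte(nil), pre...) // fresh copy so runs cannot affect each other
		sum := sha256.Sum256(m(input))
		if i == 0 {
			first = sum
			continue
		}
		if !bytes.Equal(sum[:], first[:]) {
			return fmt.Errorf("run %d produced digest %x, want %x", i, sum, first)
		}
	}
	return nil
}

func main() {
	// Toy migration: upper-case every byte. Deterministic by construction.
	upper := func(pre []byte) []byte { return bytes.ToUpper(pre) }
	fmt.Println(checkDeterministic(upper, []byte("state"), 100))
}
```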

2. Enhanced Validator Coordination & Communication

  • Real-time validator coordination channels (Discord, Telegram) dedicated to upgrade events.
  • Mandatory use of officially provided snapshots and binaries during upgrades.

3. Improved Incident Response Protocols

  • Clearly documented recovery & rollback procedures available publicly in advance.
  • Immediate public transparency updates during incidents.

🛡️ Preventive Measures (Our Commitment):

  • Mandatory Testnet Upgrades:
    Every upgrade will go through multiple testnet rounds with determinism testing tools.
  • Detailed Validator Upgrade Checklists:
    Providing validators with precise, step-by-step upgrade instructions, including checksum-verified binaries, snapshots, and consensus recovery instructions.
  • Incident Simulation & Drills:
    Regular simulations of upgrade-related incidents for rapid response preparedness.
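For the checksum-verified binaries and snapshots mentioned in the checklist item above, verification can be as small as the following Go sketch. It assumes the published checksum is a hex-encoded SHA-256 digest, which this post does not specify, and the function names verifyChecksum and verifyFile are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"strings"
)

// verifyChecksum hashes data with SHA-256 and compares the hex digest with
// the published value, ignoring case and surrounding whitespace.
func verifyChecksum(data []byte, expected string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if !strings.EqualFold(got, strings.TrimSpace(expected)) {
		return fmt.Errorf("checksum mismatch: got %s", got)
	}
	return nil
}

// verifyFile reads the whole file into memory and checks its digest.
// For very large snapshots a streaming hash would be preferable.
func verifyFile(path, expected string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	return verifyChecksum(data, expected)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: verify <file> <expected-sha256>")
		os.Exit(2)
	}
	if err := verifyFile(os.Args[1], os.Args[2]); err != nil {
		fmt.Println("FAIL:", err)
		os.Exit(1)
	}
	fmt.Println("OK")
}
```

Running a check like this before starting the node means a validator on the wrong binary or snapshot finds out before it joins consensus rather than after.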

🙏 Final Words and Appreciation

We deeply appreciate the patience, cooperation, and proactive support of the entire validator community, exchanges, and our broader Self Chain ecosystem partners throughout this incident.

Your resilience, prompt collaboration, and dedication allowed the network to restore stability and resume normal operations.

We're committed to learning from these incidents. Our future upgrades and operational practices will benefit immensely from these hard-earned insights, and we'll strive to ensure this never happens again.

Thank you sincerely for your trust, patience, and continued support.

Warm regards,
Self Chain Core Team
