We wanted to provide a post-mortem on our staking pool’s downtime from May 28th to June 5th.
We first got wind of a potential issue near the end of epoch 268. Our block production was far lower than expected for our stake, which prompted internal discussions about the cause. Our concerns became reality when we did not see any blocks in epoch 269. We started conducting internal checks, including double-checking our relays. We eventually found the problem to be an overdue KES key rotation, which we corrected on May 31st.
We were still not seeing any blocks after the key rotation. We started working with other staking pool operators to look for further issues, but did not find any problems with our setup. Eventually, we identified mistakes in how we had performed the KES key rotation. Specifically, there was some internal miscommunication related to Cardano staking pool operations.
On June 6th, we started to see blocks again on our staking pool, and we now consider the issue resolved.
As part of our fix, we have added monitors for KES key expiration and block production. However, the main cause of this outage was a lack of human resources to monitor and maintain our Cardano staking pool. We have been actively hiring talented DevOps and infrastructure engineers, but we realize we must push further to add more people. This will be our number one priority for the remainder of this quarter.
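For readers who run their own pools, here is a minimal sketch of the kind of checks such monitors can perform. It is an illustration, not our production tooling. The function names, the 14-day warning threshold, and the Poisson approximation for block production are assumptions made for this example. The default values (129,600 slots per KES period, 62 KES evolutions, 1-second slots) are what we understand the mainnet Shelley genesis to specify, so please verify them against your own genesis file.

```python
import math
from dataclasses import dataclass


@dataclass
class KesStatus:
    current_period: int   # KES period the chain is in right now
    expiry_period: int    # first KES period in which the certificate is no longer valid
    periods_left: int     # whole KES periods until expiry (<= 0 means expired)
    days_left: float      # approximate wall-clock days until expiry


def kes_status(current_slot: int,
               cert_start_period: int,
               slots_per_kes_period: int = 129_600,  # assumed mainnet value
               max_kes_evolutions: int = 62,         # assumed mainnet value
               slot_length_s: float = 1.0) -> KesStatus:
    """Estimate how long an operational certificate remains usable.

    cert_start_period is the KES period the certificate was issued for.
    """
    current_period = current_slot // slots_per_kes_period
    expiry_period = cert_start_period + max_kes_evolutions
    periods_left = expiry_period - current_period
    # Slots remaining until the first slot of the expiry period.
    slots_left = expiry_period * slots_per_kes_period - current_slot
    days_left = slots_left * slot_length_s / 86_400
    return KesStatus(current_period, expiry_period, periods_left, days_left)


def kes_needs_rotation(status: KesStatus, warn_days: float = 14.0) -> bool:
    """Alert when the key is expired or within warn_days of expiring.

    The 14-day default is an arbitrary choice for this sketch.
    """
    return status.days_left <= warn_days


def zero_block_probability(expected_blocks: float) -> float:
    """Rough chance of producing no blocks in an epoch purely by luck.

    Treats the block count as Poisson-distributed with the given mean,
    which is an approximation chosen for this sketch.
    """
    return math.exp(-expected_blocks)
```

A monitor built on this would feed in the current slot from the node and the start period from the operational certificate, and page someone well before `days_left` reaches zero. The block production check works the other way around: if a pool expects many blocks in an epoch and sees none, the probability of that being bad luck is tiny, so it is worth raising an alert right away.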
We want to give special thanks to Markus from Clio.1 for his help in our time of need. If you haven’t yet, please check out their staking pool!
stakefish is the leading validator for Proof of Stake blockchains. With support for 10+ networks, our mission is to secure and contribute to this exciting new ecosystem while enabling our users to stake with confidence. Because our nodes and our team are globally distributed, we are able to maintain 24-hour coverage.