From Outage to Opportunity: Hardening Our Infrastructure for What’s Next

Mar 30, 2025

What Happened?

During epoch 762, the SUNREN validator node crashed at around 03:00 EST on March 27, 2025.

Immediate Response & Contingency Activation

Our partner who operates the SUNREN validator node was alerted to the crash and failed over to his backup node, which runs in a different location, within about 60 minutes of the initial crash.

Impact Assessment

The immediate impact of the crash was about 1 hour of downtime for the SUNREN validator node. While the validator was running on the backup node, Staked RPC services did not benefit from the enhanced landing speed because SWQoS was not online. Furthermore, the SUNREN node operator chose to keep his validator identity on the backup node until the SVS team woke up the next morning to help identify the root cause of the crash. In total, SWQoS was offline for around 8 hours. This did not cause any hard downtime for Staked RPC users; however, sendTransaction landing times were negatively affected.
Unknown to us at the time, the SUNREN backup node hardware was having its own issues. For a reason that has not yet been identified, the backup validator hardware suffered from severe vote latency, which caused long voting delays and a poor vote score during this time. As a result, during epoch 762, the SUNREN validator did not meet the Solana Foundation Delegation Program (SFDP) minimum vote score (98%) and had its stake revoked the following epoch.
The revoked stake translates into slower sendTransaction landing times on our Staked RPC services, because the SWQoS service has less overall stake weight behind it. Additionally, it makes it harder to convince stake pools to delegate to our validator node, as no one wants to stake with a validator that has downtime.
The good news is that the SFDP stake should be reinstated in the next couple of days as our performance returns to normal over epoch 764.

Root Cause Analysis

SVS and the SUNREN validator node operator investigated all of the logs to try to identify what caused the crash, but ultimately we were unable to find the culprit. This is mostly because no errors were reported in the logs around the time of the crash. Additionally, all of SVS's RPC nodes kept operating normally during this time, which indicates the problem only affected the validator node itself. This is the first time the validator node has crashed and had significant downtime since SVS took over managing the hardware for the SUNREN node.

Improvements We've Made

As a result of this very unfortunate event, it was evident that we needed to harden the restart process so that if a crash does happen again, we can at least minimize its negative impact. We therefore made several changes to the systemd service file that runs the Solana validator software.
An important point is that the SUNREN validator does not take snapshots while running. This is on purpose, as we found that enabling snapshots had a negative impact on performance. The downside is that the node takes longer to start up after a crash, because it has to download a new snapshot first. SVS has a dedicated snapshot RPC node that we use to manually download a snapshot when starting up other RPC nodes as well as the validator node. However, if a node crashes and we are not around to respond, it follows the normal snapshot download process built into the Solana validator software, which tries its best to find a public RPC node with an up-to-date snapshot file to download from. This can sometimes be really slow: I have seen it take over an hour to download the snapshot file publicly, depending on which RPC node it decides to use.
Therefore, we have created our own custom solution to get the best of both worlds without having to rely on slow snapshot downloads.
We have created a custom bash script that checks the current snapshot directory to see if an up-to-date snapshot file is available. If there isn't one, the script automatically downloads the snapshot files from our dedicated snapshot RPC node before starting the Solana validator software. In testing on the SUNREN validator, we were able to download fresh snapshot files from our dedicated RPC node, start the Solana validator service, and catch all the way up to the tip of the network in approximately 10 minutes. This is mostly because our dedicated snapshot RPC node lets us download fresh snapshot files in under 4 minutes, as opposed to relying on a random public RPC node for this task.
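To illustrate the "check first, download only if stale" decision described above, here is a minimal Python sketch. It is not the actual script (which is bash); it only mirrors the idea under a few assumptions. It assumes snapshot archives follow the usual naming of snapshot-<slot>-<hash>.tar.<ext> for full snapshots and incremental-snapshot-<base_slot>-<slot>-<hash>.tar.<ext> for incremental ones, and that the current slot can be read with the getSlot JSON-RPC method. The function names and the max_slot_lag threshold are hypothetical and chosen purely for illustration.

```python
import json
import re
import urllib.request
from pathlib import Path
from typing import Optional

# Assumed archive naming (the hash is treated as an opaque alphanumeric string).
FULL_RE = re.compile(r"^snapshot-(\d+)-[A-Za-z0-9]+\.tar(\.\w+)?$")
INCR_RE = re.compile(r"^incremental-snapshot-(\d+)-(\d+)-[A-Za-z0-9]+\.tar(\.\w+)?$")


def newest_local_slot(snapshot_dir: Path) -> Optional[int]:
    """Return the newest slot restorable from local archives, or None."""
    full_slots = set()
    incrementals = []  # (base_slot, slot) pairs
    for entry in snapshot_dir.iterdir():
        if not entry.is_file():
            continue
        full = FULL_RE.match(entry.name)
        if full:
            full_slots.add(int(full.group(1)))
            continue
        incr = INCR_RE.match(entry.name)
        if incr:
            incrementals.append((int(incr.group(1)), int(incr.group(2))))
    if not full_slots:
        return None
    # An incremental archive only counts if its base full snapshot is present.
    usable = list(full_slots)
    usable += [slot for base, slot in incrementals if base in full_slots]
    return max(usable)


def fetch_current_slot(rpc_url: str, timeout: float = 10.0) -> int:
    """Ask an RPC node for the current slot via the getSlot method."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getSlot"}).encode()
    request = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return int(json.load(response)["result"])


def needs_download(local_slot: Optional[int], current_slot: int,
                   max_slot_lag: int) -> bool:
    """True when there is no local snapshot or it lags too far behind.

    max_slot_lag is a hypothetical tuning knob, not a value from the post.
    """
    return local_slot is None or current_slot - local_slot > max_slot_lag
```

A wrapper run before the validator starts could call newest_local_slot on the snapshot directory, compare the result against fetch_current_slot for a trusted RPC endpoint, and fetch fresh archives from the dedicated snapshot node only when needs_download returns True.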
As a result, if this ever happens again and we are not around to respond, Staked RPC service users and SUNREN validator stakers can rest assured that we expect downtime to stay at around 10 minutes going forward.

Our Commitment to You

Issues happen, and it is impossible to expect everything to work perfectly from the get-go. However, we do our best to learn from issues and mistakes such as this so that we can continue to improve. Additionally, I have open-sourced the "Solana-Snapshot-Auto-Download" script we created to combat this issue on our GitHub here. I hope other RPC operators find it useful and can also benefit from this situation.
If anyone has any further questions related to this event, please reach out in our Discord server here.

-bigJ

SOLANA

VIBE STATION

Lightning-fast Solana solutions—built for developers, powered by experts.

Terms & Conditions
