
Operating a Solana Validator: Security via Simplicity and Effective Monitoring
A minimalist approach to security, monitoring, and validator operations
Author: Giuseppe Cocomazzi
Editor: Seth Hallem

Building infrastructure and systems often feels like a choice between two unsatisfying options: adopt the heavyweight frameworks everyone uses and spend weeks learning their idiosyncrasies, or build something yourself and face the inevitable raised eyebrows about ignoring established and standard solutions. The former path offers the comfort of convention; the latter, the discomfort of justifying every decision.
When Certora joined the Solana Foundation Delegation Program by setting up Solana Validator nodes for Mainnet and Testnet, the program’s demanding requirements and the high-performance, low-latency nature of the Solana network demanded quick, critical design decisions from the outset.
We chose discomfort.
Not out of stubbornness or a reflexive distrust of popular tools, but because the conventional solutions carried too much weight for what we needed, and because starting fresh meant we could make calculated choices about complexity rather than inheriting someone else's. What follows is an account of building a system that consciously rejects the prevailing architectural patterns.
The first, non-obvious task was to select the right bare-metal hosting provider, with the additional requirement of avoiding colocation in already densely populated data centers, as specifically mandated in the Delegation Criteria. But we still wanted to reach the “cool kids” in the densest hubs with reasonably low latency.
While latency matters for day-to-day performance, an important and often overlooked aspect of networking is connectivity: you want a hosting provider that not only has a decent number of upstreams, but whose upstreams are the right ones; Tier-1 carriers are what to look for. The same applies to peering: prefer providers whose peer lists include internet exchanges and large peering fabrics.
Copious upstreams matter for network resilience: a provider with only one upstream has a single point of failure. The right peering, in turn, means fewer hops towards the interconnected Solana hubs.
Choosing the right provider, then installing and fine-tuning the Solana validator, were really the easiest tasks. The hardest part was ensuring smooth daily operations, which required a security and monitoring architecture with sufficient flexibility. Traditional wisdom (followed by almost all off-the-shelf solutions today) states that a data collection agent must run on the machine under observation and periodically send its data to a host that organizes it into a time-series database. This is the approach taken by Prometheus, InfluxDB, and Grafana, among others.
However, maintaining a full Prometheus/Grafana stack is a major undertaking: keeping it upgraded and secure, and absorbing its documentation to the point of being able to troubleshoot issues quickly, all take time. For instance, after reading the documentation of many data collection agents, we still could not find a clear way to limit the bandwidth of data transfers without an application-level traffic shaper. Traffic predictability is an important factor to keep in mind when dealing with the bursts of data that can occur during leader-scheduled slots.
We ultimately decided to write our own monitoring “anti-framework” to forgo a dedicated agent altogether, and to rely on battle-tested Unix facilities such as cron, logrotate, and rsync’s --bwlimit. The data points are just JSONL appended to files on a remote machine; writes to these files are orchestrated with flock(1) for mutually exclusive access.
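To make the shape of this concrete, here is a minimal sketch of how those primitives can compose; the paths, metric names, bandwidth cap, and host below are hypothetical placeholders, not our actual configuration, and the real scripts differ in the details:

```python
#!/usr/bin/env python3
"""Sketch of the agentless flow: cron invokes this script, which appends one
JSON line per run and lets rsync ship the flat files with a bandwidth cap.
Paths, field names, the remote host, and the cap are illustrative assumptions."""
import fcntl
import json
import shutil
import subprocess
import time

METRICS_FILE = "/var/log/validator-metrics/system.jsonl"   # hypothetical path
REMOTE = "monitor.example.org:/srv/metrics/validator-01/"  # hypothetical host

def append_datapoint(path: str, record: dict) -> None:
    """Append a JSONL record while holding an exclusive advisory lock,
    the same kind of lock flock(1) takes on the command line."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        try:
            f.write(json.dumps(record) + "\n")
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

def ship_metrics(local_dir: str, remote: str, bwlimit_kbps: int = 5000) -> None:
    """Push the flat files to the monitoring host, capping bandwidth so a
    burst of validator logs cannot saturate the uplink."""
    subprocess.run(
        ["rsync", "-a", f"--bwlimit={bwlimit_kbps}", local_dir, remote],
        check=True,
    )

if __name__ == "__main__":
    disk = shutil.disk_usage("/")
    append_datapoint(METRICS_FILE, {
        "ts": int(time.time()),
        "disk_used_pct": round(100 * disk.used / disk.total, 2),
    })
    ship_metrics("/var/log/validator-metrics/", REMOTE)
```

A crontab entry invoking a script like this is the entire “agent”: nothing stays resident in memory between runs.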
We completely avoided compression on the validator machine: the Solana validator’s logging can be very verbose, and even the smartest compression algorithm could occasionally steal substantial CPU cycles, or an entire core, when churning through such large files. Thus, logs and metrics are kept uncompressed after rotation and sent uncompressed to the remote monitoring host.
The “anti-framework” we ended up implementing embodies the Unix philosophy: small, simple, composable utilities operating on text data. Flat text files liberate us from conforming to a “central” database schema or binary format, and make it easy to decouple data processing from data storage. No specific query language is needed to reason about the data, which can be ingested by simple Python pipelines; this makes knowledge easy to transfer and avoids reliance on specialists in a niche query language.
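As an illustration, with hypothetical file paths and field names rather than our actual schema, a typical pipeline is nothing more than a loop over JSON lines:

```python
#!/usr/bin/env python3
"""Sketch of a flat-file pipeline: no query language, just line-oriented JSON.
The field names ("ts", "slot_lag") are assumptions made for the example."""
import json
import sys
from pathlib import Path

def read_jsonl(path: Path):
    """Yield one dict per line, skipping lines that fail to parse."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # a truncated line left by rotation is not fatal

def worst_slot_lag(paths):
    """Return the maximum observed slot lag across all given files."""
    return max(
        (rec.get("slot_lag", 0) for p in paths for rec in read_jsonl(p)),
        default=0,
    )

if __name__ == "__main__":
    files = [Path(p) for p in sys.argv[1:]]
    print(f"worst slot lag: {worst_slot_lag(files)}")
```

Because the files are plain text, the same data can just as easily be piped through grep, awk, or sort.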
Even the graphical representation of the data follows the same paradigm. A static HTML file rendering the dashboards with ChartJS and served by nginx was deemed more than enough:

The alerting system is simply a script reading from the same flat files. Alerting rules are just Python code. All scripts are orchestrated by cron, so there is no long-running event loop idly waiting for work or, worse, liable to be stopped by the inevitable unhandled “Exception”. This eliminates the need for additional facilities to monitor the status of the monitoring system itself. The only long-running services are nginx and cron. Development, testing, and deployment took only a few days, and the total amount of code is under 1,000 lines.
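As a sketch of what such a rule can look like (the path, field names, threshold, and recipient are all hypothetical):

```python
#!/usr/bin/env python3
"""Sketch of a cron-driven alert rule over the same flat files.
Path, field names, threshold, and recipient are illustrative assumptions."""
import json
import subprocess
from pathlib import Path

METRICS_FILE = Path("/srv/metrics/validator-01/system.jsonl")  # hypothetical
DISK_THRESHOLD_PCT = 90.0                                      # hypothetical

def latest_record(path: Path) -> dict:
    """Return the last parseable JSONL record in the file."""
    last = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue
    return last

def send_alert(subject: str, body: str) -> None:
    """Hand the alert to the local MTA; cron's own MAILTO is another option."""
    subprocess.run(
        ["mail", "-s", subject, "oncall@example.org"],
        input=body.encode(),
        check=True,
    )

if __name__ == "__main__":
    rec = latest_record(METRICS_FILE)
    if rec.get("disk_used_pct", 0) > DISK_THRESHOLD_PCT:
        send_alert(
            "validator disk usage high",
            f"disk_used_pct={rec['disk_used_pct']} at ts={rec.get('ts')}",
        )
```

If the rule fires, cron delivers the message; if the script crashes, cron reports that too, so the monitoring of the monitor comes for free.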
This approach consciously deviates from the traditional engineering advice to avoid “reinventing the wheel” or “rolling your own” solutions. We deviated deliberately: adopting large, new technologies means accepting someone else’s one-size-fits-all vision, which invariably introduces unnecessary layers of abstraction and complexity that we did not want to commit to. Instead, we eschewed complex frameworks and favored a “Unix” philosophy that remains a viable approach to systems engineering even for enterprise and mission-critical deployments.