Building Secure and Resilient Platforms

August 22, 2024
Kirat Singh

Security incidents always present valuable opportunities for us to review and, where necessary, revise Beacon’s security and deployment processes.

Deploying urgent and important updates, quickly, safely, and securely

Getting urgent and important updates out to clients is a core requirement for many software and services companies, Beacon included. Our clients need to move at the speed of markets, whether that means rolling out their own changes or deploying Beacon’s enhancements and updates. A common aphorism is that you can have your updates quickly, safely, or securely: pick any two. But companies that operate in milliseconds with vast amounts of potentially sensitive information cannot afford to pick just two of these options; they need all three.

Understanding dependencies to boost platform resilience

Our software team’s combined experience in the demanding, high-intensity software environments of big banks and major investment houses has helped us design and build a platform with multiple layers of resilience. We have each worked through intentional cybersecurity attacks and accidental software crashes that took down large percentages of our employers’ IT resources. A common lesson from many of these incidents is to pay close attention to dependencies and downstream systems.

Designing for multiple isolated release streams and staggering deployments

When we designed Beacon’s underlying architecture, we clearly demarcated release streams for different privilege levels. The most privileged streams cover VM images, boot-up sequences, and post-boot configurations; these run at root level and are designed with minimal dependencies and the most scrutiny. We layer our orchestration stack on top without root-level privileges, and finally run client code in containers with minimal privileges.

Every layer in the stack has independent release streams, each of which can be rolled forward/backward automatically, using metadata stored in permissioned object stores that run against a different stack.
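
To make the idea concrete, here is a minimal sketch of independent release streams that can each be rolled forward or backward. It is an illustration only, not Beacon’s implementation: the `ReleaseStream` class, the layer names, and the tag format are all hypothetical, and the in-memory list stands in for the release metadata that would live in a permissioned object store.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReleaseStream:
    """One independently released layer of the stack (hypothetical model)."""
    name: str
    tags: List[str] = field(default_factory=list)  # tags in release order
    position: int = -1  # index of the active tag; -1 means nothing released

    @property
    def active(self) -> str:
        if self.position < 0:
            raise RuntimeError(f"{self.name}: nothing released yet")
        return self.tags[self.position]

    def publish(self, tag: str) -> str:
        """Append a new tag and make it the active one."""
        self.tags.append(tag)
        self.position = len(self.tags) - 1
        return tag

    def roll_back(self) -> str:
        """Step to the previous tag in this stream only."""
        if self.position <= 0:
            raise RuntimeError(f"{self.name}: no earlier tag to roll back to")
        self.position -= 1
        return self.active

    def roll_forward(self) -> str:
        """Step to the next tag in this stream only."""
        if self.position >= len(self.tags) - 1:
            raise RuntimeError(f"{self.name}: already at the newest tag")
        self.position += 1
        return self.active


# Hypothetical layer names and tags; each layer moves independently.
streams = {
    name: ReleaseStream(name)
    for name in ("vm-image", "orchestration", "client-runtime")
}
streams["vm-image"].publish("2024.07.5")
streams["orchestration"].publish("2024.08.1")
streams["orchestration"].publish("2024.08.2")
streams["orchestration"].roll_back()  # the vm-image stream is unaffected
```

Because each stream keeps its own history and pointer, rolling the orchestration layer back leaves the VM image layer exactly where it was.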

Our releases are completely automated to move from one tag to the next. We require automated tests with coverage tracking as a breaking step when moving a release between tags. Coverage is a simple and effective tool, but it can be hard to use well. Teams often get caught up in mandated coverage percentages. I have a much simpler rule: any code that is added or changed must have coverage. This makes it easy to review pull requests (PRs), and forces us as developers to think about the right investment in testing.
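
The rule above can be sketched as a small check. This is a simplified illustration under stated assumptions, not our actual tooling: it assumes you have already extracted, per file, the set of executable line numbers a change touched and the set of line numbers the test run executed. The function names and sample file names are hypothetical.

```python
from typing import Dict, List, Set


def uncovered_changes(
    changed: Dict[str, Set[int]],
    covered: Dict[str, Set[int]],
) -> Dict[str, List[int]]:
    """Return the changed lines that no test executed, keyed by file."""
    gaps: Dict[str, List[int]] = {}
    for path, lines in changed.items():
        # A file with no coverage data counts as entirely uncovered.
        missing = sorted(lines - covered.get(path, set()))
        if missing:
            gaps[path] = missing
    return gaps


def coverage_gate(
    changed: Dict[str, Set[int]],
    covered: Dict[str, Set[int]],
) -> bool:
    """Pass only when every added or changed line was executed by a test."""
    return not uncovered_changes(changed, covered)


# Hypothetical change set: three lines in one file, one line in another.
changed = {"pricing.py": {10, 11, 12}, "risk.py": {40}}
covered = {"pricing.py": {1, 2, 10, 11}}

gaps = uncovered_changes(changed, covered)
```

Here `gaps` lists line 12 of `pricing.py` and line 40 of `risk.py`, so the gate fails. A reviewer sees exactly which new lines lack tests, without any debate over an overall percentage.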

Making this work with the right level of security requires a big investment into zero trust infrastructure and distributed service reliability without single points of failure. This is what we do so you don’t have to worry about it.

We’re obsessed with automating CI/CD so you can release the stack at any level as part of our Git flow.

Staggered deployments

I can’t overstate how important it is to stagger your deployment plans. Release internally first and use the release yourself before sending it to clients. You’ll get valuable feedback from other teams, and it will flush out any unintended interactions. Cloud-based virtualization is your friend here. You can run end-to-end releases that test combinations of architectures, distributions, and their interactions. We can run end-to-end releases across our entire stack in under an hour if we choose.
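
A staggered plan can be expressed as ordered rings, with the rollout halting at the first failure so later rings never receive a bad release. The sketch below is illustrative only: the ring contents, target names, and the `deploy_ok` callback (which stands in for "deploy, then verify") are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def staggered_rollout(
    rings: Sequence[Sequence[str]],
    deploy_ok: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy ring by ring and stop at the first target that fails.

    Returns the targets that were deployed successfully and the target
    that failed, or None if every ring completed.
    """
    deployed: List[str] = []
    for ring in rings:
        for target in ring:
            if not deploy_ok(target):
                return deployed, target  # later rings are never touched
            deployed.append(target)
    return deployed, None


# Hypothetical rings: internal use first, then clients in stages.
rings = [["internal"], ["client-a", "client-b"], ["client-c"]]

# Simulate a release that fails verification on client-b.
deployed, failed = staggered_rollout(rings, lambda target: target != "client-b")
```

In this simulated run the release reaches `internal` and `client-a`, fails on `client-b`, and never reaches `client-c`.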

Monitoring incidents and security as a service

Each client also has control over when the update is deployed in their environment, resulting in a staggered update process. Should any of our clients have a problem with the update, their supervisor process notifies the Beacon Alerting System, which opens a Jira ticket and notifies our operational team. In the unlikely event that something does go wrong, whether a bad update or an attempted security breach, the platform can always roll back to the earlier locked and version-controlled image. Clients using our Managed Security Monitoring Service gain the additional benefit of 24/7 monitoring by Beacon’s Security Team, who use machine learning and anomaly detection to identify risks before they impact operations, providing continuous, real-time protection against both malicious activity and unintentional errors.
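
The notify-then-roll-back flow described above might look roughly like this. It is a sketch under assumptions, not the Beacon supervisor: the `Deployment` fields, the `healthy` check, and the `notify` callback (standing in for the alerting system) are hypothetical, and triggering the rollback automatically on a failed check is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Deployment:
    client: str
    active_image: str    # image currently running
    previous_image: str  # earlier locked, version-controlled image


def supervise(
    dep: Deployment,
    healthy: Callable[[Deployment], bool],
    notify: Callable[[str], None],
) -> Deployment:
    """Leave a healthy deployment alone; otherwise alert and roll back."""
    if healthy(dep):
        return dep
    notify(
        f"{dep.client}: {dep.active_image} failed its health check; "
        f"rolling back to {dep.previous_image}"
    )
    # Assumption: rollback simply reactivates the previous image.
    return Deployment(dep.client, dep.previous_image, dep.previous_image)


# Collect alerts in a list instead of calling a real alerting service.
alerts: List[str] = []
dep = Deployment("client-a", active_image="img-2", previous_image="img-1")
result = supervise(dep, healthy=lambda d: False, notify=alerts.append)
```

Passing `notify` in as a callback keeps the supervisor independent of any particular ticketing or paging system.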

Learning from everyone, everywhere, all at once

For what it is worth, recent security incidents have been a non-issue for us. I woke up to see that our platform infrastructure was operating normally. According to at least one of our clients, Beacon was “the only thing that worked” that day. That should not be taken as a boast: modern, high-performance operating environments are very complex, and no one is immune. But we can substantially reduce the probability of a significant outage by following our core principles. We have designed resilience into the platform, but we know that computers will always find a new way to really mess things up. So we take the lessons learned from this and other incidents, review our own practices and processes, and revise where needed.