Staff Site Reliability Engineer
Hybrid/remote-first - Ontario, Canada
$170,000 - $250,000 + RRSP matching
Your opportunity
Our client builds a high-throughput system powering hundreds of billions of daily transactions, each completed within milliseconds, across globally distributed infrastructure designed for reliability and efficiency. They have been quietly bootstrapping and growing in line with revenue for over two decades. They operate at planetary scale with an exchange platform that handles nearly 500 billion daily auctions and is trending towards 1 trillion daily auctions in the next ~5 years. To this day, the company has not raised money from venture capitalists. It maintains its independence from external stakeholders, enabling it to chart its own course, maintain a long-term perspective and build an enduring, sustainable business that currently employs ~600 team members globally.
As a Staff Site Reliability Engineer (SRE), you’ll be a technical leader, collaborating cross-functionally to architect solutions that elevate operational excellence and reliability on a global scale. With a focus on low-latency systems, you’ll develop strategies for proactive monitoring, automation, and incident management, helping drive initiatives that keep our client’s platform highly resilient and continuously available. You’ll work closely with globally distributed engineering teams to integrate best practices into the development lifecycle, ensuring each layer of their system is robust, scalable, and optimized for "real-time" performance.
Our client has been stubbornly racking and stacking infrastructure around the world for the duration of their existence, a habit that allows them to price themselves at the cost of electricity while their competitors are mired in rising cloud infrastructure costs. While the major cloud infrastructure providers' offerings are considered in isolated, strategic instances, deep expertise with on-prem and hybrid infrastructure systems is the focus of this search. Our client’s infrastructure spans continents, supporting a growing business where every millisecond counts. As part of this team, you’ll guide projects that impact millions of users and directly shape the future of content creation and journalism across the open internet.
If you’re looking for a role that offers an opportunity to innovate and optimize systems at a massive scale, creating a lasting impact in an environment that values technical excellence and resilience, this could be for you.
Tech stack
Ansible, Terraform, Docker, Kafka, Nexus
Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
Hadoop, HDFS, Spark, HBase
Go, Python, Bash, or Perl for automation
Bare-metal, vSphere, KVM, Kubernetes
Key responsibilities
Vision and strategic direction: Staying up to date on the latest technological and industry innovations in order to identify and drive strategic initiatives that impact the entire business
Technical leadership: Leading significant architectural projects involving cross-functional teams that enhance system performance and reliability on a global scale
Operational excellence: Building and enhancing proactive automation, monitoring, and incident management technologies and processes
Collaboration with software engineering: Integrating SRE best practices, along with system scalability and resiliency considerations, into the SDLC and engineering culture
Incident management: Leading incident response efforts that drive rapid resolutions and thoughtful post-incident analysis
Providing insights: Designing and implementing reporting mechanisms that provide deep insight into system health and reliability
Your know-how
6+ years in Site Reliability Engineering (SRE) within low-latency, global-scale environments, ideally with upstream Kubernetes in on-prem or hybrid cloud contexts
3+ years of experience in technical leadership and team-building roles
Expertise in incident response and root cause analysis
Expertise with configuration management and associated tools (Ansible, Puppet, Salt, etc.)
Expertise with observability components (Prometheus, OpenTelemetry, ELK, Mimir)
Comfort with the Cloud Native Computing Foundation (CNCF) suite of SRE tools (Rook, Jaeger, Cilium, ArgoCD, OPA)
Software engineering skills, ideally (but not necessarily) with Go, Python and/or Perl
Excellent command of English and expertise in cross-functional communications
Interested in learning more?
Please upload your resume or a .pdf export of your LinkedIn profile using the following link or send your resume or LinkedIn profile URL to talent@lutrapartners.com with “Staff SRE” as the subject, and one of our partners will be in contact shortly!