Houston, Texas

Site Reliability Engineer

Posted on Tuesday, 19th May 2026

Financial Services

Negotiable

Permanent

Senior Site Reliability Engineer

iO Associates are supporting a growing financial services organisation that specialises in building resilient, cloud-native platforms that set the standard for operational excellence. Renowned for their innovative approach and commitment to growth, they foster a collaborative environment that values technical excellence, continuous learning, and impactful contributions.

Role Overview

This pivotal role has been created to support a strategic expansion of their reliability capabilities. As a Senior Site Reliability Engineer, you will be instrumental in enhancing the stability, availability, and scalability of vital systems. Your expertise will directly influence the resilience of mission-critical infrastructure, enabling the organisation to deliver seamless services and uphold their reputation for operational excellence.

Key Responsibilities

Lead initiatives to define and refine service reliability metrics, error budgets, and operational best practices
Architect and implement comprehensive monitoring, logging, and alerting solutions to improve incident detection and response
Drive incident management processes, including structured post-incident reviews and preventative measures
Collaborate with development teams to optimise performance, scalability, and disaster recovery strategies
Design and maintain highly available, fault-tolerant cloud architectures across multiple regions
Develop and maintain Infrastructure as Code modules and CI/CD pipelines to ensure consistent, automated deployment processes
Participate in defining recovery objectives (RTO/RPO) and orchestrate regular disaster recovery testing exercises
Produce documentation, runbooks, and reports to support ongoing operational improvements

Essential Skills & Experience

A minimum of 7 years’ experience in Site Reliability Engineering, DevOps, or related fields supporting production environments
Proven expertise with cloud platforms, especially AWS, with significant hands-on experience managing mission-critical systems
Strong knowledge of Infrastructure as Code tools such as Terraform or CloudFormation/CDK
Demonstrable experience with CI/CD pipelines and automation frameworks, including version-controlled infrastructure
Skilled in designing resilient cloud architectures and implementing BCP/DR plans with structured testing
Proficiency with monitoring tools such as Datadog, Prometheus, Grafana, or ELK/OpenSearch
Solid Linux fundamentals, networking knowledge (TCP/IP, DNS, load balancing), and troubleshooting skills
Practical scripting experience in languages such as Python, Bash, or Go
Excellent technical documentation and communication skills, with the ability to create clear runbooks and reports

Desirable Qualifications & Skills

Familiarity with additional cloud providers such as Azure or Oracle Cloud
Experience with container orchestration platforms such as Kubernetes, EKS, or ECS
Knowledge of advanced observability tooling such as OpenTelemetry, including distributed tracing
Certifications such as AWS Certified Solutions Architect, AWS Certified DevOps Engineer, or Kubernetes CKA/CKAD
Experience with progressive delivery techniques, including blue/green and canary deployments, as well as automated rollback strategies
