Houston, Texas

Site Reliability Engineer

Posted on Tuesday, 19th May 2026

Financial Services
Houston, Texas
Negotiable
Permanent

Senior Site Reliability Engineer

iO Associates are supporting a growing financial services organisation that specialises in building resilient, cloud-native platforms that set the standard for operational excellence. Renowned for their innovative approach and commitment to growth, they foster a collaborative environment that values technical excellence, continuous learning, and impactful contributions.

Role Overview

This pivotal role has been created to support a strategic expansion of their reliability capabilities. As a Senior Site Reliability Engineer, you will be instrumental in enhancing the stability, availability, and scalability of vital systems. Your expertise will directly influence the resilience of mission-critical infrastructure, enabling the organisation to deliver seamless services and uphold their reputation for operational excellence.

Key Responsibilities

  • Lead initiatives to define and refine service reliability metrics, error budgets, and operational best practices
  • Architect and implement comprehensive monitoring, logging, and alerting solutions to improve incident detection and response
  • Drive incident management processes, including structured post-incident reviews and preventative measures
  • Collaborate with development teams to optimise performance, scalability, and disaster recovery strategies
  • Design and maintain highly available, fault-tolerant cloud architectures across multiple regions
  • Develop and maintain Infrastructure as Code modules and CI/CD pipelines to ensure consistent, automated deployment processes
  • Participate in defining recovery objectives (RTO/RPO) and orchestrate regular disaster recovery testing exercises
  • Produce documentation, runbooks, and reports to support ongoing operational improvements

Essential Skills & Experience

  • A minimum of 7 years’ experience in Site Reliability Engineering, DevOps, or related fields supporting production environments
  • Proven expertise with cloud platforms, especially AWS, with significant hands-on experience managing mission-critical systems
  • Strong knowledge of Infrastructure as Code tools such as Terraform or CloudFormation/CDK
  • Demonstrable experience with CI/CD pipelines and automation frameworks, including version-controlled infrastructure
  • Skilled in designing resilient cloud architectures and implementing BCP/DR plans with structured testing
  • Proficiency with monitoring tools such as Datadog, Prometheus, Grafana, or ELK/OpenSearch
  • Solid Linux fundamentals, networking knowledge (TCP/IP, DNS, load balancing), and troubleshooting skills
  • Practical scripting experience in languages such as Python, Bash, or Go
  • Excellent technical documentation and communication skills, with the ability to create clear runbooks and reports

Desirable Qualifications & Skills

  • Familiarity with additional cloud providers such as Azure or Oracle Cloud
  • Experience with container orchestration platforms such as Kubernetes, EKS, or ECS
  • Knowledge of advanced observability practices such as distributed tracing, and tools such as OpenTelemetry
  • Certifications such as AWS Solutions Architect, DevOps Engineer, or Kubernetes CKA/CKAD
  • Experience with progressive delivery techniques, including blue/green and canary deployments, as well as automated rollback strategies
