Senior Site Reliability Engineer
iO Associates are supporting a growing financial services organisation that specialises in building resilient, cloud-native platforms that set the standard for operational excellence. Renowned for their innovative approach and commitment to growth, they foster a collaborative environment that values technical excellence, continuous learning, and impactful contributions.
Role Overview
This pivotal role has been created to support a strategic expansion of their reliability capabilities. As a Senior Site Reliability Engineer, you will be instrumental in enhancing the stability, availability, and scalability of vital systems. Your expertise will directly influence the resilience of mission-critical infrastructure, enabling the organisation to deliver seamless services and uphold their reputation for operational excellence.
Key Responsibilities
- Lead initiatives to define and refine service reliability metrics, error budgets, and operational best practices
- Architect and implement comprehensive monitoring, logging, and alerting solutions to improve incident detection and response
- Drive incident management processes, including structured post-incident reviews and preventative measures
- Collaborate with development teams to optimise performance, scalability, and disaster recovery strategies
- Design and maintain highly available, fault-tolerant cloud architectures across multiple regions
- Develop and maintain Infrastructure as Code modules and CI/CD pipelines to ensure consistent, automated deployment processes
- Participate in defining recovery objectives (RTO/RPO) and orchestrate regular disaster recovery testing exercises
- Produce documentation, runbooks, and reports to support ongoing operational improvements
Essential Skills & Experience
- A minimum of 7 years’ experience in Site Reliability Engineering, DevOps, or related fields supporting production environments
- Proven expertise with cloud platforms, especially AWS, with significant hands-on experience managing mission-critical systems
- Strong knowledge of Infrastructure as Code tools such as Terraform or CloudFormation/CDK
- Demonstrable experience with CI/CD pipelines and automation frameworks, including version-controlled infrastructure
- Skilled in designing resilient cloud architectures and implementing BCP/DR plans with structured testing
- Proficiency with monitoring tools such as Datadog, Prometheus, Grafana, or ELK/OpenSearch
- Solid Linux fundamentals, networking knowledge (TCP/IP, DNS, load balancing), and troubleshooting skills
- Practical scripting experience in languages such as Python, Bash, or Go
- Excellent technical documentation and communication skills, with the ability to create clear runbooks and reports
Desirable Qualifications & Skills
- Familiarity with additional cloud providers such as Azure or Oracle Cloud
- Experience with container orchestration platforms such as Kubernetes, EKS, or ECS
- Knowledge of advanced observability tooling and practices, such as OpenTelemetry and distributed tracing
- Certifications such as AWS Certified Solutions Architect, AWS Certified DevOps Engineer, or Kubernetes CKA/CKAD
- Experience with progressive delivery techniques, including blue/green and canary deployments, and with automated rollback strategies