The Site Reliability Engineer applies software engineering principles to operations, ensuring production systems are highly available, performant, and scalable. This role bridges development and operations teams to build self-healing infrastructure, automate manual tasks, and define reliability standards. The SRE leads incident response, conducts capacity planning, and implements monitoring solutions so that services meet their service-level objectives (SLOs) and stay within their error budgets.
Key Responsibilities:
- Design, deploy, and maintain monitoring and alerting systems using tools like Prometheus, Grafana, ELK, or Datadog
- Develop and manage Infrastructure as Code (IaC) with Terraform, CloudFormation, Ansible, or Pulumi to automate provisioning and configuration
- Build and optimize CI/CD pipelines (Jenkins, CircleCI, GitHub Actions) for reliable, repeatable deployments
- Implement container orchestration and management with Kubernetes, Docker Swarm, or similar platforms
- Lead on-call rotations, perform incident response and root-cause analysis, and drive blameless postmortems to prevent recurrence
- Conduct load testing, chaos engineering experiments, and capacity planning to identify and address performance bottlenecks
- Collaborate with security teams to integrate vulnerability scanning, secrets management (Vault), and compliance checks into DevOps workflows
Qualifications & Skills:
- 3+ years of experience in a Site Reliability, DevOps, or Operations Engineering role
- Strong proficiency with Linux/Unix administration and networking concepts (TCP/IP, DNS, load balancing)
- Hands-on experience with cloud providers (AWS, GCP, or Azure) and related managed services
- Advanced scripting or programming skills in Python, Go, or Bash for automation and tooling
- Deep understanding of SLIs, SLOs, and error budgets, and the ability to translate these into technical and operational practice
- Excellent collaboration and communication skills to work effectively across development, QA, and support teams
- Familiarity with service meshes (Istio, Linkerd), distributed tracing (Jaeger, Zipkin), and security best practices in cloud environments