Site Reliability Engineer (SRE)

Full-time

Boston, MA

Ensure uptime and performance for critical services.

The Site Reliability Engineer applies software engineering principles to operations, ensuring production systems are highly available, performant, and scalable. This role bridges development and operations teams to build self-healing infrastructure, automate manual tasks, and define reliability standards. The SRE leads incident response, conducts capacity planning, and implements monitoring solutions to meet service-level objectives (SLOs) and stay within error budgets.

Key Responsibilities:

Design, deploy, and maintain monitoring and alerting systems using tools like Prometheus, Grafana, ELK, or Datadog
Develop and manage Infrastructure as Code (IaC) with Terraform, CloudFormation, Ansible, or Pulumi to automate provisioning and configuration
Build and optimize CI/CD pipelines (Jenkins, CircleCI, GitHub Actions) for reliable, repeatable deployments
Implement container orchestration and management with Kubernetes, Docker Swarm, or similar platforms
Lead on-call rotations, perform incident response and root-cause analysis, and drive blameless postmortems to prevent recurrence
Conduct load testing, chaos engineering experiments, and capacity planning to identify and address performance bottlenecks
Collaborate with security teams to integrate vulnerability scanning, secrets management (Vault), and compliance checks into DevOps workflows

Qualifications & Skills:

3+ years of experience in a Site Reliability, DevOps, or Operations Engineering role
Strong proficiency with Linux/Unix administration and networking concepts (TCP/IP, DNS, load balancing)
Hands-on experience with cloud providers (AWS, GCP, or Azure) and related managed services
Advanced scripting or programming skills in Python, Go, or Bash for automation and tooling
Deep understanding of SLIs, SLOs, and error budgets, and the ability to translate these into technical and operational practice
Excellent collaboration and communication skills to work effectively across development, QA, and support teams
Familiarity with service meshes (Istio, Linkerd), distributed tracing (Jaeger, Zipkin), and security best practices in cloud environments
