Site Reliability Engineer (SRE)

Full-time

Boston, MA

Ensure uptime and performance for critical services.

The Site Reliability Engineer applies software engineering principles to operations, ensuring production systems are highly available, performant, and scalable. This role bridges development and operations teams to build self-healing infrastructure, automate manual tasks, and define reliability standards. The SRE leads incident response, conducts capacity planning, and implements monitoring solutions to meet service-level objectives (SLOs) and stay within error budgets.

Key Responsibilities:

  • Design, deploy, and maintain monitoring and alerting systems using tools like Prometheus, Grafana, ELK, or Datadog
  • Develop and manage Infrastructure as Code (IaC) with Terraform, CloudFormation, Ansible, or Pulumi to automate provisioning and configuration
  • Build and optimize CI/CD pipelines (Jenkins, CircleCI, GitHub Actions) for reliable, repeatable deployments
  • Implement container orchestration and management with Kubernetes, Docker Swarm, or similar platforms
  • Lead on-call rotations, perform incident response and root-cause analysis, and drive blameless postmortems to prevent recurrence
  • Conduct load testing, chaos engineering experiments, and capacity planning to identify and address performance bottlenecks
  • Collaborate with security teams to integrate vulnerability scanning, secrets management (Vault), and compliance checks into DevOps workflows

Qualifications & Skills:

  • 3+ years of experience in a Site Reliability, DevOps, or Operations Engineering role
  • Strong proficiency with Linux/Unix administration and networking concepts (TCP/IP, DNS, load balancing)
  • Hands-on experience with cloud providers (AWS, GCP, or Azure) and related managed services
  • Advanced scripting or programming skills in Python, Go, or Bash for automation and tooling
  • Deep understanding of SLIs, SLOs, and error budgets, and the ability to translate these into technical and operational practice
  • Excellent collaboration and communication skills to work effectively across development, QA, and support teams
  • Familiarity with service mesh (Istio, Linkerd), distributed tracing (Jaeger, Zipkin), and security best practices in cloud environments
