Build reliable, scalable systems through automation and engineering. Improve service stability using SLOs, monitoring, and incident response.

About our client
A U.S.-based e-commerce organization specializing in personalized products, operating high-volume digital platforms supported by global teams. The company emphasizes technology-driven operations, strong customer experience, and scalable infrastructure to support rapid growth and large production capacity.

Description

Reliability & Performance
Define and manage SLIs, SLOs, and error budgets. Improve system reliability, scalability, and resilience. Lead reliability reviews and prevent incidents proactively.

Observability & Monitoring
Build and maintain monitoring, logging, and alerting. Ensure actionable alerts and effective dashboards. Implement distributed tracing.

Automation & Tooling
Automate operational tasks to reduce toil. Build tools for reliability and automated remediation.

CI/CD & Deployments
Improve CI/CD pipelines for safe deployments. Implement canary, blue/green, and rollback strategies. Ensure production readiness.

Incident Management
Join on-call rotations. Lead incident response and post-incident reviews. Promote a blameless culture.

Cloud & Infrastructure
Manage AWS/Azure cloud environments. Work with containers, serverless, and event-driven systems. Ensure scalable, secure, and cost-efficient infrastructure.

Infrastructure as Code
Build and manage infrastructure using Terraform. Maintain automated and consistent provisioning.

Security & Compliance
Embed security in CI/CD pipelines. Support audits and compliance activities.

Candidate profile (m/f)
4+ years of experience in SRE, DevOps, or Platform Engineering.
Strong software engineering mindset and programming/scripting skills (Python, Go, Bash, etc.).
Hands-on experience with AWS or Azure cloud environments.
Solid understanding of distributed systems and cloud-native architectures.
Proficiency with Terraform and Infrastructure as Code practices.
Experience defining and managing SLIs, SLOs, and error budgets.
Strong background in observability: monitoring, logging, alerting, and tracing.
Experience improving CI/CD pipelines and deployment strategies.
Ability to lead incident response and conduct blameless postmortems.
Familiarity with automation, reliability tooling, and reducing operational toil.
Strong analytical and problem-solving skills.
Excellent communication skills and ability to partner with engineering teams.
Proactive, detail-oriented, and focused on continuous improvement.
Advanced English (B2-C1) required for daily communication with international teams.

What we offer
100% remote role from Colombia.
Indefinite-term contract through Michael Page Colombia.
Exposure to modern SRE practices, automation frameworks, resilience engineering, and cloud-native tooling.
Professional growth through complex technical challenges and continuous learning.
Chance to work with global teams and cutting-edge cloud technologies across AWS and Azure.
Site Reliability Engineer (SRE)
MICHAEL PAGE COLOMBIA
San Gil, San Gil
Posted 11 days ago