At Amwell, we’re transforming healthcare for all, powered by technology and inspired by people. Here, your ideas don’t just matter; they drive real change, improving lives on a global scale. We marry technology and innovation with clinical excellence to provide trusted solutions that solve the healthcare industry’s biggest pain points, and we are on a mission to enable greater access to more convenient, affordable, and effective care. We do this through our technology-enabled care platform, which is designed to help our clients achieve their digital care ambitions, today and in the future.

We offer programs spanning the full care continuum, including urgent, acute and specialty care, behavioral health, and services for the treatment of chronic conditions such as heart and cardiometabolic diseases. Programs are powered by Amwell as well as our growing partner network. For almost two decades, Amwell has proudly served some of the largest and most sophisticated healthcare organizations in the U.S. and worldwide. Our team is passionate about technology’s role in transforming care delivery and making it more equitable, accessible, efficient, cost-effective and navigable for all.

Brief Overview

As a Staff Site Reliability Engineer (P4), you will define and elevate the reliability standards across the platform. This role goes beyond owning individual services: you will establish the patterns, practices, and tooling that enable all teams to build and operate reliable systems at scale. You will operate across team boundaries, identifying systemic reliability risks and designing cross-cutting solutions that improve the overall health of the platform. Acting as a bridge between service-level reliability and organizational maturity, you will help ensure reliability becomes a built-in property of the system rather than a reactive effort. This role combines deep technical expertise with strong leadership and influence.
You will mentor senior engineers, guide architectural decisions, and promote a culture of proactive reliability, observability, and operational excellence across the organization.

Core Responsibilities

Define and evolve reliability standards, patterns, and tooling adopted across the platform.
Own the reliability posture for critical service domains and drive architectural reviews to ensure reliability, operability, and recovery are first-class concerns.
Design and implement cross-cutting reliability mechanisms such as circuit breakers, retry policies, graceful degradation, and load shedding.
Establish and maintain scalable SLO frameworks that teams can adopt with minimal friction.
Lead complex, multi-service incident response as an incident commander and drive high-quality postmortems focused on systemic improvements.
Identify recurring incident patterns and implement structural solutions to prevent future failures.
Improve incident response processes, tooling, escalation paths, and communication practices.
Design and drive observability strategies across services, including metrics, logs, traces, and alerting systems.
Ensure alerting is actionable and aligned with SLOs, and build shared dashboards and runbooks to reduce time to resolution.
Collaborate with Platform Engineering to strengthen infrastructure reliability across Kubernetes (EKS), networking, and data systems.
Contribute to infrastructure as code for reliability-critical components and validate disaster recovery, backup, and restore strategies.
Drive chaos engineering practices and ensure deployment pipelines include reliability safeguards such as health checks, canary releases, and rollback automation.
Lead capacity planning and performance optimization efforts across services and shared infrastructure.
Identify bottlenecks and failure risks across distributed systems and design solutions that improve resilience and recovery.
Mentor engineers through design reviews, incident leadership, and knowledge sharing.
Promote best practices, improve operational maturity across teams, and influence engineering culture toward proactive reliability investments.
Represent reliability concerns in cross-functional planning and contribute to long-term platform strategy.

Qualifications

8+ years of experience in Site Reliability Engineering, infrastructure, or production engineering roles.
Strong experience operating and improving large-scale production systems in AWS environments.
Deep expertise in Kubernetes (preferably EKS), including networking, scheduling, and observability.
Hands-on experience with Infrastructure as Code tools such as Terraform or CDKTF.
Advanced understanding of distributed systems, networking, and failure modes.
Experience designing and managing observability stacks (e.g., Prometheus, Grafana, OpenSearch, OpenTelemetry).
Proven experience leading incident response for complex, multi-service production environments.
Demonstrated ability to drive systemic reliability improvements across teams and platforms.
Strong written communication skills, including experience creating postmortems, design documents, runbooks, and architectural proposals.
Experience with service mesh technologies (e.g., Istio) and mTLS is a plus.
Familiarity with GitOps workflows (e.g., ArgoCD, Flux) is a plus.
Experience working in compliance-driven environments (e.g., HIPAA, SOC2, FedRAMP) is preferred.
Exposure to chaos engineering practices and cost-aware infrastructure design (FinOps) is a plus.

Benefits

Medical Plan Coverage provided by Colmédica
Plan Coverage provided by Pan American
Hybrid Allowance
Additional Paid Time Off
Maternity Leave: 18 weeks
Parental/Paternity Leave: 2 mandatory weeks + 4 weeks
Mental Health and Resiliency
Virtual Second Opinion with the Cleveland Clinic Coverage
LinkedIn Learning
Rewards and Recognition
Service Anniversaries
Annual Bonus
Referral Program
Amwell tuition reimbursement benefit
Staff Site Reliability Engineer
B CAPITAL
Remote
Posted 12 days ago