Svitla Systems Inc. is looking for a Junior Site Reliability Engineer for a full-time position (40 hours per week) in Colombia.

Our client is an American cybersecurity company that specializes in data center and cloud security, focusing on preventing the lateral movement of cyber threats within IT environments through micro-segmentation technology. The flagship product provides visibility into workload communication across diverse compute environments, automatically generates optimal segmentation policies, and enforces firewall rules at the host level. It helps organizations reduce cyber risk by containing threats and stopping ransomware before it spreads. The company pioneered the concept of breach containment, combining real-time threat detection and automated response across hybrid and multi-cloud environments. Its platform aligns with Zero Trust security principles, which treat all network traffic as untrusted until verified. The company emphasizes AI-driven breach containment, threat intelligence integration, and policy automation to simplify and strengthen security operations, protecting more than 15 of the Fortune 500, 6 of the 10 largest global banks, and 3 of the 5 largest SaaS enterprise companies.

You’ll join a 25-person Cloud Operations Group that runs a multi-cloud security SaaS platform serving enterprise customers. This senior role owns production reliability during the 8:00 AM – 5:00 PM Pacific Time window, leads incidents, sharpens SLOs and runbooks, and mentors a small cohort of SREs. You will partner with Product Service Owners, Platform SRE, and Infra to drive uptime, safe releases, observability quality, and steady reduction of toil across a mixed estate that is moving from legacy to modern patterns. The goal for the first 90–120 days is to demonstrate faster, safer incident response under Senior SRE guidance.

Requirements
2+ years of experience in Site Reliability Engineering/Production Operations for high-availability SaaS or distributed systems.
Basic knowledge of Kubernetes operations in production (cluster lifecycle, networking, storage, workload debugging, node health, autoscaling).
Basic understanding of AWS and/or Azure, including IAM, VPC/networking, load balancing, and managed K8s services.
Experience with Terraform (authoring and refactoring modules, providers, workspaces, drift remediation, plan/apply safety).
Solid experience with Argo CD and GitOps workflows.
Proven expertise in incident management, crisp written and verbal communication, and stakeholder management during outages.
Fundamentals of Linux and networking, plus scripting in Python or a similar language for automation and diagnostics.
Familiarity with compliance-aware operations (SOC 2 or ISO), evidence capture, and change tracking.

Nice to have:
Experience with multi-cloud service design patterns and customer data sovereignty considerations.
Experience migrating legacy config management to an IaC-first model.
Basic knowledge of observability design using Observe, Datadog, Prometheus, and Grafana.

Responsibilities
Assist in incident response for Sev-1/Sev-2 events by joining war rooms, documenting timelines, and supporting communication updates.
Learn incident command protocols and contribute to post-incident reviews with clear documentation and follow-up actions.
Help monitor service health against defined SLOs and error budgets.
Participate in efforts to reduce MTTA/MTTR by improving alert response, updating runbooks, and testing diagnostic tools.
Support safe deployment practices using Argo CD or similar tools.
Validate pre-deploy checks and health gates under guidance from senior engineers.
Collaborate with service owners to ensure release readiness.
Contribute to improving observability by standardizing logs, metrics, and traces.
Help identify noisy or flapping alerts and propose cleanup opportunities.
Learn to define and monitor golden signals for key services.
Assist in capacity validation and autoscaling tests across EKS/AKS environments.
Document redundancy patterns and help identify single points of failure.
Participate in chaos engineering exercises and game-day simulations.
Help maintain legacy infrastructure (Chef/Puppet) while learning Terraform and GitOps workflows.
Assist with migration tasks and contribute to infrastructure stability efforts.
Draft and improve SOPs and runbooks with guidance from senior team members.
Follow change management and deployment checklist procedures.
Help gather audit-ready evidence for compliance frameworks like SOC 2 and ISO.
Actively seek mentorship from senior SREs on Kubernetes, incident handling, and root cause analysis.
Share learnings with peers and contribute to a culture of continuous improvement.

About Svitla
Svitla Systems is a trusted global IT solutions company headquartered in California, with business and development offices throughout the US, Latin America, Europe, and Asia. Svitla is an outspoken advocate of workplace flexibility, best known for its well-established remote culture, individual approach to our teammates’ professional and personal growth, and welcoming environment. Since 2003, Svitla has served a wide range of clients, from innovative start-ups in California to large corporations such as Ingenico, Amplience, InvoiceASAP, and Global Citizen. At Svitla, developers work directly with clients’ teams, building lasting and successful partnerships through seamless integration with on-site processes. Svitla Systems’ global mission is to build a business that contributes to the well-being of our partners, personnel, and their families, improves our communities, and makes a lasting difference in the world.

Join us! If you are interested in our vacancy, just click "Apply".
We will be happy to see you in our friendly team :)
Junior Site Reliability Engineer
SVITLA SYSTEMS, INC.
Work from home
Posted 17 days ago