We are seeking a Lead Site Reliability Engineer (Infrastructure) to join our fast-moving VSaaS engineering organization. This role carries responsibility for technical leadership and operational execution of the Infrastructure SRE team. You will own the reliability, scalability, and operability of our shared platform and production systems, while shaping how reliability engineering and SRE practices are applied across the organization and mentoring senior and staff engineers.

You will work closely with product engineering and platform teams to ensure a seamless developer experience, while setting standards, driving priorities, and leading by example during incidents and high-impact operational work. This role requires a strong technical background in cloud infrastructure, distributed systems, CI/CD, and GitOps, along with hands-on development experience in Golang and/or Python, to improve developer workflows, automation, and long-term system reliability.

This is a remote role in the United States.

Role Overview

Site Reliability Engineer - Infrastructure

The Infrastructure team provides leadership, direction, and accountability for platform architecture, system design, and end-to-end implementation to meet and exceed product non-functional requirements, including quality, security, reliability, availability, and performance. Site Reliability Engineers enable Product Development teams to ship features with reliable velocity by owning the stability, scalability, and operability of the underlying infrastructure and shared services.

What You Will Do:

As a Lead Site Reliability Engineer, you will:

• Operate and evolve large-scale distributed systems, anticipating failure modes and proactively mitigating risks across production environments, while owning day-to-day production operations, including monitoring, alert triage, incident response, post-incident analysis, and critical incident coordination and documentation.
• Lead the design, build, and implementation of automation, orchestration, and operational tooling that improves efficiency, reliability, and signal-to-noise ratio, and that reduces recurring issues and service-impacting events.
• Set technical direction and influence platform strategy by defining platform architecture, system design, and documentation to guide development, testing, deployment, and long-term maintenance of complex distributed systems.
• Establish and enforce standards, operational rigor, and best practices for deploying, monitoring, managing, and operating cloud-native and distributed infrastructure environments.
• Lead the adoption and execution of modern CI/CD, GitOps, and cloud-native infrastructure practices, ensuring reliable, scalable, and traceable software and infrastructure releases.
• Mentor and develop senior and staff engineers, reinforcing SRE principles, DevOps practices, accountability, and operational excellence across the Infrastructure SRE team.
• Collaborate closely with product and engineering stakeholders, advocating for an SRE mindset and system-level thinking to maximize reliability, performance, availability, security, and scalability across shared platforms and services.


Lead Site Reliability Engineer - Infrastructure
