Director, Site Reliability Engineering
Company: eGain Corporation
Location: Cupertino
Posted on: January 13, 2026
Job Description:
eGain is the leader in AI knowledge management solutions for
enterprises. As organizations recognize the critical value of
trusted knowledge and content feeding AI systems, eGain provides
the single source of truth—explainable, reliable, and
maintainable—that serves as the repository for all enterprise
know-how including SOPs, policy documents, troubleshooting guides,
and product information. This foundation enables scalable and
effective AI automation of business operations, with customer
service as the primary point of ROI. Our solutions power leading
companies including JP Morgan, Liberty Mutual, Florida Blue, and
Bosch.

The Opportunity
Join us in reimagining knowledge management
as mission-critical infrastructure for the AI-powered enterprise.
We’re seeking talented, hungry, and bold leaders to shape the
future of how enterprises leverage AI and knowledge at scale.

Position Overview
As Director of Site Reliability Engineering, you
will ensure that eGain’s AI knowledge management platform operates
with the reliability, performance, and resilience that enterprise
customers demand. You’ll lead the strategy and execution for
observability, incident management, capacity planning, and
continuous improvement of our production systems. This role is
critical as our platform becomes mission-critical infrastructure
for the world’s leading enterprises.

Key Responsibilities
• Build and lead a world-class SRE organization that ensures exceptional reliability and performance of eGain’s cloud services
• Define and achieve ambitious SLOs/SLAs that meet the demands of enterprise customers operating 24/7 customer service operations
• Establish comprehensive observability across the platform including monitoring, logging, tracing, and alerting
• Drive incident response processes, post-mortems, and continuous improvement to prevent recurring issues
• Lead capacity planning and performance optimization to ensure the platform scales efficiently with customer growth
• Implement automation for deployment, operations, and remediation to reduce toil and improve reliability
• Partner with platform and application engineering teams to build reliability into the system from the ground up
• Champion a culture of reliability engineering across the organization, educating teams on best practices
• Manage disaster recovery planning and business continuity to protect customer operations
• Own the technical relationship with customers on reliability and performance topics

What We’re Looking For
• 10 years of experience in software engineering, operations, or SRE roles with 5 years in SRE leadership
• Deep expertise in observability tools, monitoring systems, and incident management practices
• Strong background in distributed systems, cloud infrastructure, and production operations at scale
• Experience establishing and achieving SLOs/SLAs for enterprise SaaS or mission-critical systems
• Proficiency with automation, infrastructure-as-code, and modern DevOps/SRE tooling
• Track record of improving system reliability through data-driven approaches and systematic problem-solving
• Excellent incident management and crisis leadership skills
• Strong collaboration abilities and experience partnering with engineering teams to improve reliability
• Passion for operational excellence and continuous improvement
• Bold thinking about what’s possible in system reliability combined with pragmatic execution

Why eGain
• Ensure the reliability of systems that power customer service for the world’s leading enterprises
• Build SRE practices from the ground up with significant impact and visibility
• Work with modern cloud technologies and solve complex reliability challenges
• Lead a team focused on operational excellence and engineering rigor

Keywords: eGain Corporation, San Mateo, Director, Site Reliability Engineering, IT / Software / Systems, Cupertino, California