Senior DGX Cloud AI Infrastructure Software Engineer
Company: NVIDIA
Location: Santa Clara
Posted on: June 2, 2025
Job Description:
Joining NVIDIA's DGX Cloud AI Efficiency Team means contributing
to the infrastructure that powers our innovative AI research. This
team focuses on optimizing efficiency and resiliency of AI
workloads, as well as developing scalable AI and Data
infrastructure tools and services. Our objective is to deliver a
stable, scalable environment for AI researchers, providing them
with the necessary resources and scale to foster innovation. We are
seeking an AI infrastructure software engineer to join our team.
You'll be instrumental in designing, building, and maintaining AI
infrastructure that enables large-scale AI training and inferencing.
The responsibilities include implementing software and systems
engineering practices to ensure high efficiency and availability of
AI systems.

As a senior DGX Cloud AI Infrastructure software
engineer at NVIDIA, you will have the opportunity to work on
innovative technologies that power the future of AI and data
science, and be part of a dynamic and supportive team that values
learning and growth. The role provides the autonomy to work on
meaningful projects with the support and mentorship needed to
succeed, and contributes to a culture of blameless postmortems,
iterative improvement, and risk-taking. If you are seeking an
exciting and rewarding career that makes a difference, we invite
you to apply now!

What you'll be doing:
- Develop infrastructure software and tools for large-scale AI,
LLM, and GenAI infrastructure.
- Develop and optimize tools to improve infrastructure efficiency
and resiliency.
- Root-cause, analyze, and triage failures from the application
level to the hardware level.
- Enhance infrastructure and products underpinning NVIDIA's AI
platforms.
- Co-design and implement APIs for integration with NVIDIA's
resiliency stacks.
- Define meaningful and actionable reliability metrics to track
and improve system and service reliability.
- Apply skills in problem-solving, root cause analysis, and
optimization.

What we need to see:
- 8+ years of experience in developing software
infrastructure for large-scale AI systems.
- Bachelor's degree or higher in Computer Science or a related
technical field (or equivalent experience).
- Strong debugging skills and experience in analyzing and
triaging AI applications from the application level to the hardware
level.
- Proven track record in building and scaling large-scale
distributed systems.
- Experience with AI training and inferencing and data
infrastructure services.
- Familiarity with operating large-scale observability platforms for
monitoring and logging (e.g., ELK, Prometheus, Loki).
- Proficiency in programming languages such as Python and C/C++,
and in scripting languages.
- Excellent communication and collaboration skills, and a culture
of diversity, intellectual curiosity, problem solving, and openness
are essential.

Ways to stand out from the crowd:
- Experience working with large-scale AI clusters.
- Strong understanding of NVIDIA GPUs and network technologies
(RDMA, IB, NCCL).
- Good understanding of the internals of DL frameworks such as
PyTorch, TensorFlow, JAX, and Ray.
- Experience with root cause analysis of failures at datacenter
scale.
- Strong background in software design and development.

NVIDIA
leads the way in groundbreaking developments in Artificial
Intelligence, High-Performance Computing, and Visualization. The
GPU, our invention, serves as the visual cortex of modern computers
and is at the heart of our products and services. Our work opens up
new universes to explore, enables amazing creativity and discovery,
and powers what were once science fiction inventions, from
artificial intelligence to autonomous cars. NVIDIA is looking for
exceptional people like you to help us accelerate the next wave of
artificial intelligence.

The base salary range is 184,000 USD -
356,500 USD. Your base salary will be determined based on your
location, experience, and the pay of employees in similar
positions. You will also be eligible for equity and benefits.
NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed
to fostering a diverse work environment and proud to be an equal
opportunity employer. As we highly value diversity in our current
and future employees, we do not discriminate (including in our
hiring and promotion practices) on the basis of race, religion,
color, national origin, gender, gender expression, sexual
orientation, age, marital status, veteran status, disability status
or any other characteristic protected by law.