Senior Production Engineer, Tooling & Frameworks
Company: CoreWeave
Location: Sunnyvale
Posted on: February 14, 2026
Job Description:
CoreWeave is The Essential Cloud
for AI™. Built for pioneers by pioneers, CoreWeave delivers a
platform of technology, tools, and teams that enables innovators to
build and scale AI with confidence. Trusted by leading AI labs,
startups, and global enterprises, CoreWeave combines superior
infrastructure performance with deep technical expertise to
accelerate breakthroughs and turn compute into capability. Founded
in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV)
in March 2025. Learn more at www.coreweave.com.

About the Role
Production Engineering ensures CoreWeave's cloud runs with
world-class reliability, performance, and operational excellence.
Herd is our newest innovation: an agentic AI platform that serves
as CoreWeave's intelligent SRE assistant - combining AI reasoning,
data infrastructure, and observability into an autonomous
operational intelligence layer for internal use. As a Production
Engineer on Herd, you'll define and build the systems that power a
scalable agentic ecosystem. You'll design distributed services and
data pipelines that process, embed, and retrieve operational
knowledge at scale, enabling LLM-powered agents to work alongside
human engineers in production. This is a hands-on role at the
intersection of AI operations, distributed systems, and data
infrastructure.

What You'll Do
- Architect and build large-scale distributed systems that power AI
  SRE Platforms.
- Design data infrastructure for AI reasoning (embedding generation,
  context retrieval, vector stores) optimized for real-time
  operational queries.
- Build agent orchestration and lifecycle components so agents can
  communicate, delegate, and reason collectively across CoreWeave
  systems.
- Integrate AI SRE Platform with a large number of internal systems
  (Kubernetes, observability platforms, etc.) to enable end-to-end
  automation and insights.
- Lead architectural design discussions and set technical direction
  for AI-driven reliability systems.
- Partner across Production Engineering, Data Engineering, ML
  Infrastructure, and Platform to operate AI SRE as a
  high-availability platform embedded in critical reliability
  workflows.
- Develop services that interpret telemetry, detect anomalies, and
  generate RCA (root cause analysis) and PIR (post-incident review)
  artifacts; trigger automated mitigations where appropriate.
- Codify operational best practices into services, APIs, and
  Kubernetes-native components.
- Participate in an on-call rotation supporting the systems you
  build.

What You've Worked On (Minimum Qualifications)
- 5 years in software or infrastructure engineering building and
  operating distributed systems at scale.
- Proficiency in Python (or similar), with experience delivering
  production microservices and data pipelines.
- Expertise in Kubernetes, container orchestration, and cloud-native
  architectures.
- Strong understanding of data systems (streaming, indexing, caching,
  ETL).
- Demonstrated experience designing for scalability, fault tolerance,
  and performance.

Preferred Qualifications
- Experience building RAG systems or embedding-based search.
- Familiarity with vector databases and text/signal retrieval
  systems.
- Experience developing or deploying agentic AI systems (LLM-based
  automation, AI observability).
- Strong distributed-systems background (consensus, message buses,
  job orchestration, eventual consistency).
- Experience designing data schemas and APIs for knowledge
  representation and operational reasoning.
- Familiarity with ChatOps frameworks or workflow orchestration
  (Temporal, Argo, Airflow).
- Background in MLOps, AI infrastructure, or platform reliability
  engineering.
- Experience with observability frameworks (Prometheus, Grafana,
  OpenTelemetry) and using telemetry for automated reasoning or
  remediation.

Why CoreWeave
At CoreWeave, we work hard, have fun, and move fast.
You'll join a team that values curiosity, ownership, and creative
problem-solving. As part of Production Engineering, you'll operate
at the intersection of AI and reliability — building systems that
make operating the most powerful AI cloud in the world smarter
every day.

Core Values:
- Be Curious at Your Core
- Act Like an Owner
- Empower Employees
- Deliver Best-in-Class Client Experiences
- Achieve More Together

The base salary range for this role is $139,000 to
$204,000. The starting salary will be determined based on
job-related knowledge, skills, experience, and market location. We
strive for both market alignment and internal equity when
determining compensation. In addition to base salary, our total
rewards package includes a discretionary bonus, equity awards, and
a comprehensive benefits program (all based on eligibility).

What We Offer
The range we've posted represents the typical compensation
range for this role. To determine actual compensation, we review
the market rate for each candidate, which can depend on a variety of
factors, including qualifications, experience, interview
performance, and location. In addition to a competitive salary, we
offer a variety of benefits to support your needs, including:
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to Participate in Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations
- A casual work environment
- A work culture focused on innovative disruption

Our Workplace
While we prioritize a hybrid work environment, remote
work may be considered for candidates located more than 30 miles
from an office, based on role requirements for specialized skill
sets. New hires will be invited to attend onboarding at one of our
hubs within their first month. Teams also gather quarterly to
support collaboration.

California Consumer Privacy Act - California applicants only

CoreWeave is an equal opportunity employer,
committed to fostering an inclusive and supportive workplace. All
qualified applicants and candidates will receive consideration for
employment without regard to race, color, religion, sex,
disability, age, sexual orientation, gender identity, national
origin, veteran status, or genetic information. As part of this
commitment and consistent with the Americans with Disabilities Act
(ADA), CoreWeave will ensure that qualified applicants and
candidates with disabilities are provided reasonable accommodations
for the hiring process, unless such accommodation would cause an
undue hardship. If reasonable accommodation is needed, please
contact: careers@coreweave.com.

Export Control Compliance
This position requires access to export controlled information. To
conform to U.S. Government export regulations applicable to that
information, applicant must either be (A) a U.S. person, defined as
a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident
(green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv)
asylee under 8 U.S.C. § 1158, (B) eligible to access the export
controlled information without a required export authorization, or
(C) eligible and reasonably likely to obtain the required export
authorization from the applicable U.S. government agency. CoreWeave
may, for legitimate business reasons, decline to pursue any export
licensing process.
Keywords: CoreWeave, Danville, Senior Production Engineer, Tooling & Frameworks, IT / Software / Systems, Sunnyvale, California