Sr/Staff Software Engineer, Observability (Network Engineering)
Company: Crusoe
Location: San Francisco
Posted on: February 19, 2026
Job Description:
Crusoe's mission is to accelerate the abundance of energy and
intelligence. We’re crafting the engine that powers a world where
people can create ambitiously with AI, without sacrificing scale,
speed, or sustainability. Be a part of the AI revolution with
sustainable technology at Crusoe. Here, you'll drive meaningful
innovation, make a tangible impact, and join a team that’s setting
the pace for responsible, transformative cloud infrastructure.

About This Role:
We are looking for a highly skilled engineer with deep expertise in
building and operating observability platforms at scale. As a
Senior/Staff Software Engineer, Observability, you'll be supporting
our team of Network Engineers (no network engineering experience
required). The Crusoe Cloud Network Engineering Team is responsible
for designing, building, and operating the global edge, backbone,
and data center network for High Performance Compute (HPC) clusters
with GPUs. This is a unique opportunity to build an observability
platform specific to devices, one that is language-agnostic and
scalable. You will design, develop, and run Crusoe’s next-generation
observability stack, enabling engineers to understand the internal
state of distributed systems through metrics, logs, and traces.

A Day in the Life:
- Designing and operating scalable observability systems (metrics,
  logging, tracing) across multi-datacenter Kubernetes environments
- Architecting end-to-end telemetry pipelines, including ingestion,
  storage, querying, and visualization
- Extending monitoring and alerting with Prometheus, Alertmanager,
  Thanos/Cortex, Grafana, and OpenTelemetry
- Building scalable log collection and processing pipelines with
  Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
- Implementing distributed tracing platforms (Tempo, Jaeger,
  OpenTelemetry) and integrating with service meshes, load balancers,
  and APIs
- Defining and driving adoption of SLOs, SLIs, and error budgets
  across services and teams
- Automating provisioning and scaling of observability infrastructure
  with Kubernetes, Terraform, and custom tooling (Go, Python)
- Ensuring reliability and cost efficiency of telemetry pipelines
  while supporting high-volume workloads (AI/ML, HPC clusters, GPU
  infrastructure)
- Embedding security best practices into observability platforms,
  including RBAC, TLS, secret management, and multi-tenant access
  controls
- Mentoring engineers and shaping Crusoe’s observability strategy and
  technical roadmap

You Will Thrive In This Role If You Have:
- 6 years of experience with distributed systems, with a focus on
  observability and monitoring systems
- Deep expertise with metrics systems (Prometheus, Thanos, Mimir,
  Cortex), logging pipelines (Fluent Bit, Vector, Loki,
  ELK/OpenSearch), and tracing platforms (Jaeger, Tempo,
  OpenTelemetry)
- Strong programming skills in Go or Python for automation,
  operators, and custom integrations
- Experience running observability platforms on Kubernetes and
  operating them at scale across multi-datacenter environments
- Proven ability to design, optimize, and scale telemetry pipelines
  handling high-cardinality, high-throughput data
- Solid understanding of distributed systems, performance
  engineering, and debugging complex workloads
- Familiarity with service meshes, networking, and workload
  instrumentation (Envoy, Istio, OpenTelemetry SDKs)
- Strong collaboration skills and the ability to influence
  engineering teams to adopt observability best practices

Benefits:
- Industry competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology
  company
- Health insurance package options that include HDHP and PPO, vision,
  and dental for you and your dependents
- Employer contributions to HSA accounts
- Paid Parental Leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit: $300 per month

Compensation Range:
Compensation will be paid in the range of $172,000 - $253,000 plus
bonus. Restricted Stock Units are included in all offers.
Compensation to be determined by the applicant’s education,
experience, knowledge, skills, and abilities, as well as internal
equity and alignment with market data.

Crusoe is an Equal
Opportunity Employer. Employment decisions are made without regard
to race, color, religion, disability, genetic information,
pregnancy, citizenship, marital status, sex/gender, sexual
preference/orientation, gender identity, age, veteran status,
national origin, or any other status protected by law or
regulation.
Keywords: Crusoe, Danville, Sr/Staff Software Engineer, Observability (Network Engineering), Engineering, San Francisco, California