    Manager of Infrastructure Engineering (Observability)

    Voltage Park

    Job Details

    Location: Redmond, Washington, United States
    Posted: 1 day ago
    Job Type: Full-time

    Job Description

    Voltage Park is your enterprise AI factory. We offer scalable compute power and on-demand and reserved bare-metal AI infrastructure using NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, we provide flexible, affordable GPU solutions that power everyone from builders to enterprises.

    Voltage Park is looking for a Manager of Infrastructure Engineering for our Infrastructure Engineering team. Our team is responsible for building automation, tooling, and API-driven systems to bridge the gap between our physical infrastructure and the systems that our customers depend on for AI/ML training, inference, and HPC workloads at scale.

    In this role, you’ll design and implement systems that enable humans and software to interact programmatically with thousands of bare-metal servers, storage clusters, and high-performance networks. You will work closely with teams across Voltage Park to drive new infrastructure rollouts and improve the lifecycle management of existing resources. Observability is not a nice-to-have—it is foundational to how we operate safely, efficiently, and at scale.

    QUALIFICATIONS:

    • 7+ years in infrastructure engineering, SRE, or platform roles
    • 2+ years managing technical teams
    • Deep experience designing and operating observability systems at scale
    • Strong background in Linux, distributed systems, and production operations
    • Experience in GPU, HPC, or AI infrastructure environments
    • Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)
    • Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs

    Strong Technical Background In

    • Metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)
    • Logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)
    • Distributed tracing (OpenTelemetry, Jaeger, Tempo)
    • Kubernetes observability (nodes, clusters, workloads, control plane)
    • Alerting strategy, SLOs, SLIs, and error budgets
    • High-cardinality, high-volume telemetry tradeoffs

    Nice to Have

    • Experience designing observability for monitoring hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)
    • Experience operating observability platforms across multiple data centers and failure domains
    • Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)
    • Experience balancing telemetry cost, retention, and fidelity at large scale
    • Prior experience evolving alerting from reactive to SLO-driven
    • Experience building or scaling observability teams or platforms in high-growth environments

    WHAT YOU'LL DO:

    Technical Ownership & Strategy

    • Own Voltage Park’s observability strategy across infrastructure and platform layers
    • Define standards for metrics, logs, traces, alerts, dashboards, and SLOs
    • Drive architecture decisions for telemetry pipelines, storage, and retention
    • Balance signal quality, system performance, and cost at scale

    Team Leadership

    • Build, manage, and mentor a team of infrastructure engineers focused on observability
    • Set clear technical direction, priorities, and expectations
    • Review designs, guide implementation, and raise the bar on operational rigor
    • Partner closely with Engineering and Operations teams

    Platform Engineering

    • Design and operate high-throughput observability pipelines (metrics, logs, traces)
    • Ensure observability platforms are reliable, scalable, and resilient
    • Improve alert quality and reduce noise across production systems
    • Enable self-service observability for internal engineering teams

    Reliability & Operations

    • Participate in and lead infrastructure incident response
    • Use observability data to drive root-cause analysis and systemic improvements
    • Build feedback loops from incidents into better tooling, alerts, and runbooks
    • Help establish a culture of measurement-driven reliability

    Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.
