FusionProbe — GB200 NVL72 BMC Diagnostic Collector
- Full-stack diagnostic tool for NVIDIA GB200 NVL72 GPU servers; collects BMC snapshots over Redfish.
- Modular collection engine across 9 modules walking 600+ HGX-specific Redfish endpoints.
- Self-contained dark-themed HTML viewer: health dashboard, sensor table, GPU accordions, log timeline.
- Cross-platform Electron app — bundles PowerShell 7, ships .dmg + .exe with no end-user dependencies.
- Mock BMC server with 600+ routes and 7 integration test suites (339+ assertions, byte-exact).
- Reverse-engineered endpoint inventory documenting NVIDIA HGX paths absent from the public DMTF spec.
A BMC diagnostic collector and report viewer for NVIDIA GB200 NVL72 servers. One run produces a self-contained HTML report a field technician can email, drop on OneDrive, or attach to a support ticket — colleagues open it anywhere with zero install, no server, no driver, no admin rights.
Purpose
GB200 NVL72 is NVIDIA's flagship AI-training platform: 4 GPUs per sled, 72 GPUs per rack, ~$4M+ per rack, liquid-cooled. When a unit fails in the field, every minute of downtime is expensive — and existing options for triage all had real gaps:
- The vendor BMC web UI works, but it's browse-one-screen-at-a-time. No way to archive a snapshot, no way to share, no comparison across systems.
- Raw Redfish curl scripts are fast for one engineer and useless for a team.
nvidia-bug-report.shis in-band — it requires the host OS booted with NVIDIA drivers loaded. Useless during pre-OS issues, OS-install failures, or bare-metal triage.- NVIDIA Field Diagnostics is heavyweight, environment-specific, and not aimed at the kind of fleet-wide triage Dell field operations runs daily.
What was missing was a single command that captures everything an engineer might need — Redfish state, IPMI events, sensor readings, leak detectors, BMC logs, firmware versions — and bundles it into one portable artifact you can attach to a JIRA ticket or send to NVIDIA L3 without giving them BMC credentials.
I built FusionProbe to fill that gap, and to make the resulting data actionable rather than just dumped on the reader.
Goals
- Out-of-band only. Must work even when the host OS is wiped or has no drivers loaded — diagnostics shouldn't depend on the very thing being diagnosed.
- Zero installation on the technician's laptop. No package manager, no admin rights, no Python, no Node. Drop a folder onto the machine, double-click .bat, go.
- Single shareable artifact. Output one file the recipient can open without unzipping, running a server, or trusting an installer.
- Read-only by design. The collector cannot reset the BMC, clear logs, modify configuration, or alter system state — verified at the command level on every release.
- Educational. Junior technicians reading the output should be able to learn what each IPMI command means and what each finding implies, without leaving the viewer.
Constraints
- Field laptops run Windows with PowerShell 5.1 (no PS7, no Node, no admin rights, no internet during runs).
- BMCs are on isolated management networks reachable only via a "repair jumper" — so all traffic gets tunneled over a single SSH session.
- Many BMCs flake under load (TLS handshake failures, mid-session connection drops, "Stream was not readable" errors). Reliability had to be engineered in.
- Distribution is manual — sync the bundle through iCloud or OneDrive, copy to a colleague's laptop. No app stores, no auto-update.
- The viewer has to render gracefully from
file://, OneDrive web preview, and a local HTTP server — without three different code paths.
Architecture
Two halves: a PowerShell collector, and a self-contained HTML viewer.
Field laptop (Windows)
└─ Run-FusionProbe.ps1 (Windows Forms GUI)
└─ SSH tunnel session → repair jumper
├─ TCP port-forward → BMC:443 (Redfish over HTTPS)
└─ remote shell → ipmitool over IPMB
└─ writes <serial>-<timestamp>/ + viewer.html (self-contained)
The collector walks every relevant Redfish endpoint (firmware, inventory, GPU, sensors, leak detectors, 13 log services), runs 19 read-only ipmitool commands on the jumper, parses output, and produces a structured zip bundle. The viewer is built by a final New-Viewer.ps1 step that inlines every bundle JSON and log file as a window.BUNDLE script block at the top of viewer.html. The renderer prefers that embedded data and falls back to fetch() when the bundle is served over HTTP — same file, both modes.
Development Process
This was an iterative, feedback-driven build. Every release was triggered by a real problem someone hit in the field — not a roadmap I wrote up front:
| Phase | What triggered it | What shipped |
|---|---|---|
| Initial collector | "We need a one-shot snapshot of a GB200 BMC." | PowerShell modules, one per Redfish domain. Zip bundle output. |
| Script GUI | "Field techs aren't running PowerShell by hand." | Windows Forms front-end, double-clickable .bat. |
| Reliability fixes | "Stream was not readable" errors mid-collection. | -DisableKeepAlive on every request, exponential-backoff retries, transport-error connection-pool flush. |
| Self-contained viewer | "Colleagues open viewer.html on OneDrive and see no data." | New-Viewer.ps1 inlines all data; same file works from file://, OneDrive Web, or HTTP. |
| Verdict panel | "I have hundreds of sensors — what's actually wrong?" | Rule-based finding synthesis with severity, evidence, recommended actions. XID code → action mapping for ~20 codes. |
| Support packet export | "I retype the same triage notes into every JIRA ticket." | One-click markdown export with serial / firmware / findings / log excerpts. |
| Fleet view | "I want to triage 18 racks at once." | Drag-and-drop multiple bundles → row-per-system dashboard. |
| Trends mode | "We need to catch GPU degradation over hours, not snapshots." | Watch-Bmc.ps1 polls on an interval; viewer auto-detects same-serial bundles → sparklines + delta-since-first. |
| Trainee glossary | "Junior techs don't know XID, FRU, DCMI, ERoT, etc." | Built-in glossary panel with ~35 terms across BMC / sensors / GPU / cooling. |
| IPMI command descriptions | Same audience. | Each of the 19 ipmitool commands rendered with a one-line explanation embedded next to its output. |
| GUI polish | "Run Collection feels frozen for 3 seconds." | UI panel switches first, blocking work runs after with DoEvents() flushes; cleanup deferred so New Collection feels instant. |
Engineering Decisions
- PowerShell over Python — Windows-native, no install, runs on every field laptop. Shipping a Python interpreter would have meant vendor approval and ~50 MB of bundled binaries.
- Self-contained HTML viewer over Electron — I prototyped an Electron desktop app first. It was a 100 MB installer with code-signing pain and a parallel codebase. I deleted it. The viewer is now a single 3–5 MB HTML file that works everywhere, including environments where you can't install software.
- JSON-driven viewer with embedded fallback — Single source of truth (the bundle JSONs), rendered identically whether served over HTTP or opened from disk. The renderer's
fetchJson()checkswindow.BUNDLEfirst, then network — one indirection, two modes. - Rule-based verdict, not ML — Findings come from hand-curated rules with explanations a technician can read and override. Auditable, deterministic, no model drift, no opaque "the algorithm says X" failures.
- Read-only by design — Every IPMI command audited; destructive variants (
mc reset,sel clear,chassis power off,raw 0x..) explicitly excluded so the tool can run on production systems without authorization friction.
Challenges Solved
A few representative ones:
- PowerShell 5.1 + UTF-8 encoding — Em dashes in source files were mangled by the system codepage, breaking the parser at runtime on Windows-1252 hosts. Fixed by sanitizing all
.ps1files to ASCII; documented as a maintenance rule. </script>inside a JS comment — Closed the surrounding<script>tag prematurely (the HTML parser doesn't understand JS comments) and rendered the rest of the JS as page text. Fixed with<\/script>escape.- TLS handshake failures through SSH tunnels — BMCs would close pooled connections mid-collection, surfacing as "Stream was not readable." Fixed by forcing fresh TCP+TLS per request via
-DisableKeepAliveand proactiveServicePoint.CloseConnectionGroup()on retry. - Empty-string args dropped by
pwsh -File— Couldn't pass an empty jumper password through the script-bundle GUI. Fixed by passing secrets via env vars (FUSION_PROBE_USER/FUSION_PROBE_PASS/FUSION_PROBE_JUMPER_PASS), matching the pattern Microsoft uses forGet-Credential. - Windows Forms UI freezes on cleanup —
Stop-SshTunnelandRemove-Job -Forceblock the message pump for several seconds, leaving text boxes unable to take focus. Fixed by reordering: switch panels first, pump the queue withApplication::DoEvents(), defer cleanup off the user-perceived path.
Outcomes
- One command, complete diagnostic snapshot of a $4M+ system in ~90 seconds.
- One file (3–5 MB
viewer.html) that works on any reviewer's machine, including OneDrive web preview, with zero install. - Triage that's actionable — every finding maps to a recommended action with evidence linked back to the source data.
- Fleet-aware — N bundles in a folder become a row-per-system dashboard with severity rollups.
- Trend-aware — repeated snapshots of the same system render as sparklines and a snapshot table.
- Trainee-aware — every IPMI command and BMC concept is explained inside the tool.
- Read-only and auditable — safe to run on production hardware without sign-off.
Tech Stack
- PowerShell 5.1 / 7 — collector, GUI, watcher
- Windows Forms — local GUI on the technician's laptop
- Posh-SSH (SSH.NET) — bundled SSH library for the jumper tunnel
- Redfish (DMTF) API over HTTPS — primary BMC interface
- IPMI / ipmitool — secondary, run remotely on the jumper over SSH
- Vanilla HTML + CSS + ES2020 JavaScript — viewer; no framework, no build step
- SVG sparklines — trend visualization
- Markdown export — support-packet format
Built and maintained by Samar Siddiqui. Designed to make BMC diagnostics shareable, actionable, and educational — turning a structured data dump into a triage assistant.