
Fix kmsg timestamps#288

Open
Sayan- wants to merge 1 commit into
mainfrom
sayan/kernel-1356-kmsg-timestamps


@Sayan- Sayan- commented Jun 12, 2026

Contributor

Overview

system_oom_kill events were stamped with the kmsg envelope timestamp, which is CLOCK_MONOTONIC-derived. That clock freezes while a unikraft VM is suspended (scale-to-zero), so on any VM that had gone through standby, OOM timestamps skewed backward by the accumulated suspended duration, sometimes by hours.

  • sysmon kmsg reader now stamps each record with wall-clock read time (time.Now()) instead of the envelope timestamp. We only ever read live records (the source seeks to end first), so read time is an accurate event time.
  • Documented the invariant on events.Event.Ts and added gotcha #13 in AGENTS.md: all event timestamps must be wall-clock time at emit/observe, never a monotonic or source clock.

Test

  • Added a linux-gated unit test asserting that kmsgparserSource stamps observation time, not the envelope's monotonic time.

Note

Low Risk
Localized change to kmsg timestamp stamping and documentation; behavior improves correctness on suspended VMs with minimal blast radius.

Overview
Fixes system_oom_kill (and related OOM parsing) timestamps that could appear hours in the past on scale-to-zero VMs because kmsg envelope times are monotonic-derived and freeze while the VM is suspended.

kmsgparserSource now sets each KmsgMessage.Timestamp to wall-clock read time (time.Now() at forward), not the parser’s envelope timestamp; comments on OomInstance.TimeOfDeath and KmsgMessage describe that contract.

Documents the same rule on events.Event.Ts and adds AGENTS.md gotcha #13 so in-process telemetry producers stamp at emit/observe. Adds a linux-gated unit test that the forwarded stamp is observation time, not the envelope time.

Reviewed by Cursor Bugbot for commit 0b909ed. Bugbot is set up for automated code reviews on this repo.

@Sayan- Sayan- requested review from archandatta and rgarcia June 12, 2026 20:05
@firetiger-agent


Created a monitoring plan for this PR.

What this PR does: Fixes OOM kill events (system_oom_kill) delivering timestamps hours in the past on browser VMs that had been suspended and resumed via scale-to-zero. Customers reading OOM event timestamps now get accurate wall-clock times.

Intended effect:

  • system_oom_kill event ts accuracy: No production metric directly measures OOM event timestamps (they flow to customers over the telemetry SSE stream, not to our OTel pipeline). Confirmed if scale-to-zero VMs that OOM report event ts within seconds of wall-clock, rather than hours behind. No pre-deploy telemetry baseline exists for this signal.
  • copy telemetry stream ERROR rate: baseline ~5 errors/hr; confirmed normal if rate stays below 20/hr after deploy.

Risks:

  • OOM event delivery stops silently — sysmon goroutine panic or channel close; signal: any sysmon WARN/ERROR log in API logs (baseline: 0); alert on any occurrence.
  • copy telemetry stream error spike — telemetry SSE path regression; signal: copy telemetry stream ERROR logs in API logs; alert if >20/hr sustained for 30+ min (baseline: ~5/hr).
  • VM image build fails — new VMs continue running the old image with skewed OOM timestamps; signal: CI build failure in kernel-images; alert if build does not complete after merge.

Status updates will be posted automatically on this PR as monitoring progresses.


