Keep and backfill Usage Event records of running apps, tasks, and service instances #5210
Open
joyvuu-dave wants to merge 4 commits into
Heads-up: #5121 merged, replacing machinist + Sham with FactoryBot. This PR will need a rebase and a fixup - your diff adds around 29
Done.
The scheduled usage event cleanup job used to delete every record older than the configured cutoff age, including the opening STARTED/CREATED event of a resource that is still running. Once the cleanup deleted that event, nothing was left to reconstruct what is running right now.

Database::OldRecordCleanup can now optionally keep the records of running resources. Each model declares its lifecycles via usage_lifecycles: which states open a run (STARTED/CREATED/TASK_STARTED, plus the WAS_RUNNING/TASK_WAS_RUNNING baselines), which state closes it (STOPPED/DELETED/TASK_STOPPED), and which column names the resource. An old opening event is then only deleted when:

* a closing event for the same resource exists later and is also old -- the run is over; or
* it is neither the first opening of the current run nor the resource's latest one (again judged only against old rows).

Consumers only need the first opening (the true start time) and the latest (the current size). The ones in between, written each time a running resource is scaled or updated, tell a consumer nothing it still needs -- and deleting them is what keeps the table size bounded for long-running, frequently-changed resources.

The app and service usage event repositories turn this on with keep_running_records: true. Asking for it on a model without usage_lifecycles raises an error instead of silently deleting the records of running resources.

Task events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING -> TASK_STOPPED, matched by task_guid), so the start events of long-running tasks survive cleanup too. Task baselines use their own TASK_WAS_RUNNING state because task events carry an empty app_guid: if they said WAS_RUNNING, the app lifecycle would see them all as events of one app whose guid is '' and wrongly delete them (and the backfill's repair would write bogus STOPPED events for that phantom app).
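The deletion rule for old opening events described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code in Database::OldRecordCleanup: the `Event` struct, the `old` flag, and `deletable_opening_ids` are hypothetical names, only the app lifecycle is modeled, and "the current run" is read as the openings that follow the last old closing event.

```ruby
# Hypothetical in-memory model of a usage event row; `old` marks rows
# that are older than the cleanup cutoff.
Event = Struct.new(:id, :state, :guid, :old, keyword_init: true)

OPENING_STATES = %w[STARTED WAS_RUNNING].freeze
CLOSING_STATE = 'STOPPED'.freeze

# Returns the ids of old opening events that the rule allows deleting.
# Only old rows are considered, mirroring "judged only against old rows".
def deletable_opening_ids(events)
  events.select(&:old).group_by(&:guid).flat_map do |_guid, rows|
    rows = rows.sort_by(&:id)
    openings = rows.select { |e| OPENING_STATES.include?(e.state) }
    last_close_id = rows.select { |e| e.state == CLOSING_STATE }.map(&:id).max

    # Assumption: openings after the last old closing event form the current run.
    current_run = openings.select { |e| last_close_id.nil? || e.id > last_close_id }

    # Keep the first opening (true start time) and the latest (current size).
    keep_ids = [current_run.first, current_run.last].compact.map(&:id)

    # Everything else is either closed by a later old event or "in between".
    openings.map(&:id) - keep_ids
  end
end

events = [
  Event.new(id: 1, state: 'STARTED', guid: 'app-1', old: true),  # closed by id 2
  Event.new(id: 2, state: 'STOPPED', guid: 'app-1', old: true),
  Event.new(id: 3, state: 'STARTED', guid: 'app-1', old: true),  # first of current run
  Event.new(id: 4, state: 'STARTED', guid: 'app-1', old: true),  # in-between scale event
  Event.new(id: 5, state: 'STARTED', guid: 'app-1', old: true),  # latest old opening
  Event.new(id: 6, state: 'STARTED', guid: 'app-1', old: false)  # too recent to consider
]

p deletable_opening_ids(events) # => [1, 4]
```

In this sample, event 1 goes because its run ended, and event 4 goes because it is neither the first nor the latest opening of the current run.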
Deletion runs in a deliberate order: first the opening events that are safe to delete, while the events that make them safe still exist; then everything else. The reverse order could delete a closing event first and leave its opening event looking like a still-running resource.

The cleanup log line now reports the row counts BatchDelete returns instead of running extra COUNT queries, and BatchDelete fetches each batch's ids in the same query that checks whether anything is left, halving the evaluations of the (potentially expensive) filtered dataset.

Also renames the positional days_ago argument to a cutoff_age_in_days keyword.
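The batching idea can be illustrated with a small in-memory sketch. It is not the real BatchDelete: `FakeDataset`, `first_ids`, `delete_ids`, and `batch_delete` are hypothetical stand-ins, and the `evaluations` counter exists only to show that each batch evaluates the filtered dataset once, with the same fetch serving as the "anything left?" check.

```ruby
# Hypothetical stand-in for an expensive filtered dataset.
class FakeDataset
  attr_reader :evaluations

  def initialize(ids)
    @ids = ids.sort
    @evaluations = 0
  end

  # One evaluation of the filter: returns up to `limit` ids, lowest first.
  def first_ids(limit)
    @evaluations += 1
    @ids.first(limit)
  end

  # Deletes by primary key and returns the number of rows removed.
  def delete_ids(ids)
    @ids -= ids
    ids.size
  end
end

# Deletes in batches and returns the total row count, so the caller can
# log it without a separate COUNT query.
def batch_delete(dataset, batch_size:)
  deleted = 0
  loop do
    ids = dataset.first_ids(batch_size) # fetch doubles as the emptiness check
    break if ids.empty?

    deleted += dataset.delete_ids(ids)
  end
  deleted
end

dataset = FakeDataset.new([1, 2, 3, 4, 5])
puts batch_delete(dataset, batch_size: 2) # 5 rows deleted
puts dataset.evaluations                  # 4: three non-empty batches plus the final empty fetch
```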
Add a composite [state, <guid>, id] index on app_usage_events and service_usage_events. The keep-running cleanup decides whether to delete an event by looking up related events of the same resource (same guid, a given state, a higher or lower id), and the backfill checks whether a resource already has an event on record; both lookups walk exactly this index. Created concurrently on Postgres. Task events need no third index: they are looked up by task_guid, a task has only a handful of events, and the existing app_usage_events_task_guid_index makes that cheap.
…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app process, a TASK_WAS_RUNNING event for every currently-running task, and a WAS_RUNNING event for every existing service instance. Billing consumers can then bootstrap a complete picture of what is running, even though the usage event cleanup deleted the original STARTED/TASK_STARTED/CREATED events long ago.

The backfill is a batched VCAP::WasRunningBackfill helper called from thin no_transaction migrations, following the bigint-migration pattern. It walks the started processes / running tasks / service instances in id order, one batch at a time, each batch in its own READ COMMITTED transaction -- so no statement comes near the migration statement timeout, and MySQL's INSERT..SELECT takes no shared next-key locks on the scanned rows while the API keeps serving traffic.

Tasks in CANCELING count as running: they stay billable until Diego reports them dead, and no usage event marks the moment a task enters CANCELING. The app query limits its package/droplet subqueries to each batch's apps so it never scans those whole tables, and it COALESCEs nullable legacy columns so one bad NULL row cannot abort a deploy.

The seeds skip any resource whose start is already on record -- an earlier baseline, or a real STARTED/TASK_STARTED/CREATED/UPDATED event -- so running the backfill again cannot give a resource a second start that a consumer would bill twice.

The API stays live during migrations, so a seed batch can race a stop or delete and write a baseline for a resource that is already gone -- or whose stop event landed earlier in the table, with a lower id. Deleting such rows would not help: consumers read these tables forward, by id, and keep what they read. A poller may already have the baseline, and for tasks a TASK_STOPPED may already have been written against it. You can delete a row; you cannot make a consumer un-read it.
So instead, a post-seed repair adds the missing ending event (STOPPED / DELETED / TASK_STOPPED) for every baseline whose resource is no longer running and that has no later ending event (one with a higher id). The ending is built from the baseline row itself, which carries every NOT NULL column an ending needs -- necessary, because the resource row may be gone entirely. A baseline that already has its real ending is never touched, and each added ending stops its baseline from matching the test, so re-running the backfill changes nothing.

Two properties of the added ending are deliberate. Its created_at is the repair time, not the true stop time: a bounded overbill that ends, which beats a missing ending billed forever. And its previous_state is the baseline's state, which no normal ending carries, so repaired endings are easy to tell apart.

A skip_was_running_backfill config flag lets operators opt out. The migrations check it (not the helper), because they are recorded as applied either way; 'rake db:was_running_backfill' runs the same seeding and repair later. Use the rake task after a skipped migration, once after the deploy that ships these migrations (to repair anything that slipped through while old API servers were still running), or after a destructive usage-event purge, which wipes the task start events that task stop events depend on. The rake task takes a session advisory lock so two runs cannot both add the same missing ending.

The migrations' down blocks are deliberate no-ops: consumers may already have read the seeded rows, and deleting a row cannot make a consumer un-read it -- it would only leave the stop events written against these rows without a start event to pair with.

Document the WAS_RUNNING/TASK_WAS_RUNNING states, their created_at semantics, the repaired ending events, and the rules consumers must follow on the V3 resources, and list the new states in the legacy V2 usage-event docs because V2 reads the same event rows.
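The seed-then-repair flow described in this commit message can be sketched against an in-memory event list. This is a simplified illustration, not VCAP::WasRunningBackfill: the hash-based rows and the helpers `next_id`, `seed_baselines`, and `repair_missing_endings` are hypothetical, only the app lifecycle (STARTED/WAS_RUNNING, closed by STOPPED) is modeled, and each `each_slice` batch merely stands in for a per-batch transaction.

```ruby
START_STATES = %w[STARTED WAS_RUNNING].freeze

# Hypothetical helper: next auto-increment id for the in-memory table.
def next_id(events)
  (events.map { |e| e[:id] }.max || 0) + 1
end

# Seeds a WAS_RUNNING baseline for each running guid that has no start on
# record. Returns the number of rows written; a second run writes none.
def seed_baselines(events, running_guids, batch_size: 2)
  seeded = 0
  running_guids.each_slice(batch_size) do |batch| # one short transaction per batch
    batch.each do |guid|
      next if events.any? { |e| e[:guid] == guid && START_STATES.include?(e[:state]) }

      events << { id: next_id(events), state: 'WAS_RUNNING', guid: guid }
      seeded += 1
    end
  end
  seeded
end

# Adds a STOPPED event for every baseline whose resource is no longer running
# and that has no ending with a higher id. Returns the number of rows added.
def repair_missing_endings(events, running_guids, now:)
  added = 0
  events.select { |e| e[:state] == 'WAS_RUNNING' }.each do |baseline|
    next if running_guids.include?(baseline[:guid])

    ended_later = events.any? do |e|
      e[:state] == 'STOPPED' && e[:guid] == baseline[:guid] && e[:id] > baseline[:id]
    end
    next if ended_later

    # Built from the baseline row alone; the resource row may already be gone.
    events << { id: next_id(events), state: 'STOPPED', guid: baseline[:guid],
                previous_state: baseline[:state], created_at: now }
    added += 1
  end
  added
end

events = [{ id: 1, state: 'STARTED', guid: 'a' }]
puts seed_baselines(events, %w[a b c])                          # 2 ('a' already has a start)
puts seed_baselines(events, %w[a b c])                          # 0 (idempotent)
puts repair_missing_endings(events, %w[a c], now: Time.now)     # 1 ('b' stopped meanwhile)
puts repair_missing_endings(events, %w[a c], now: Time.now)     # 0 (the added ending blocks a repeat)
```

The added ending carries the baseline's state as previous_state and the repair time as created_at, which are the two deliberate properties named above.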
create_stop_event_if_needed skipped the TASK_STOPPED event whenever the TASK_STARTED event was absent. So a task whose start event the cleanup had already deleted never got a stop event when it finished, and a billing consumer that had recorded the start billed the task forever.

Now the stop is written when either piece of recorded start evidence exists: the TASK_STARTED event, or the TASK_WAS_RUNNING baseline the backfill seeds for tasks that were already running when the keep-running cleanup was introduced. A legitimately started task always has one of the two: the cleanup no longer deletes the start event of a running task, and the backfill covers tasks that had already lost theirs. When neither exists (say a task canceled before it ever ran), no consumer ever saw the task start, and a stop event would be noise nothing can pair with.

The after_destroy hook now goes through the same check. It used to write a stop unconditionally, so destroying a never-started PENDING task (app deletion destroys each non-terminal task) produced exactly the unmatched stop the update path avoids.

Both pieces of evidence are looked up in one query, and a comment pins a MySQL constraint: at MySQL's default REPEATABLE READ isolation level, the evidence read must be the first query in the surrounding transaction.
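The evidence check described above can be sketched as follows. It is a minimal illustration, not the actual create_stop_event_if_needed: `write_task_stop_if_started` and the hash-based rows are hypothetical, and the single `any?` pass stands in for the one query that looks up both pieces of evidence.

```ruby
START_EVIDENCE = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

# Appends a TASK_STOPPED event only when the task has recorded start
# evidence (a real start or a seeded baseline). Returns the new row or nil.
def write_task_stop_if_started(events, task_guid)
  # One pass over the rows covers both kinds of evidence.
  started = events.any? do |e|
    e[:task_guid] == task_guid && START_EVIDENCE.include?(e[:state])
  end
  return nil unless started

  stop = { state: 'TASK_STOPPED', task_guid: task_guid }
  events << stop
  stop
end

events = [
  { state: 'TASK_STARTED', task_guid: 'task-1' },     # normal start
  { state: 'TASK_WAS_RUNNING', task_guid: 'task-2' }  # start event lost, baseline seeded
]

p write_task_stop_if_started(events, 'task-1') # stop written
p write_task_stop_if_started(events, 'task-2') # stop written against the baseline
p write_task_stop_if_started(events, 'task-3') # nil: never started, nothing to pair with
```

Routing both the update path and the destroy hook through one check like this is what keeps a never-started task from producing an unmatched stop.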
This PR fully addresses #4182.
It solves the issue of consumers of Usage Records not having a way of determining the current state of running apps, tasks, and service instances.
With this change, usage event records related to running Apps, Tasks, and Service Instances are kept from being pruned during the normal cleanup job. A one-time backfill also seeds a baseline event (WAS_RUNNING/TASK_WAS_RUNNING) for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.

Seeded baseline events are never deleted, because consumers may already have read them. If a resource stops while the backfill is running and its baseline is left without a matching ending event, the backfill adds the missing ending event (STOPPED/DELETED/TASK_STOPPED) instead. Two consequences for consumers, both documented on the V3 usage event resources: an added ending event carries the time of the repair rather than the exact stop time, so the interval it closes can run slightly long; and consumers should close an interval on the first ending event they see and ignore any duplicates. Task stop events are also only emitted when the task has a start event or baseline on record, so consumers never see a stop they cannot pair with a start.

Deployment note
After deploying this change, run rake db:was_running_backfill once so we repair any events that happened while old API servers were still serving traffic.
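For consumers, the rule stated in the description (close an interval on the first ending event and ignore any duplicates) might look like the sketch below. It is an assumption-laden illustration rather than the documented V3 contract: `fold_intervals` and the hash-based rows are hypothetical, only app events are modeled, and the first opening of a run is taken as its start time.

```ruby
OPENING_STATES = %w[STARTED WAS_RUNNING].freeze

# Folds events (read forward by id) into closed intervals and still-running
# resources. Later openings of a run never move its start, and an ending
# with no open interval (a duplicate) is ignored.
def fold_intervals(events)
  running = {} # guid => start time of the current run
  closed = []  # [guid, start, stop]

  events.sort_by { |e| e[:id] }.each do |e|
    if OPENING_STATES.include?(e[:state])
      running[e[:guid]] ||= e[:created_at] # keep the first opening as the start
    elsif e[:state] == 'STOPPED' && running.key?(e[:guid])
      closed << [e[:guid], running.delete(e[:guid]), e[:created_at]]
    end
  end

  { running: running, closed: closed }
end

events = [
  { id: 1, state: 'STARTED',     guid: 'a', created_at: 10 },
  { id: 2, state: 'STARTED',     guid: 'a', created_at: 20 }, # scale event, same run
  { id: 3, state: 'STOPPED',     guid: 'a', created_at: 30 }, # first ending closes the run
  { id: 4, state: 'STOPPED',     guid: 'a', created_at: 35 }, # duplicate ending, ignored
  { id: 5, state: 'WAS_RUNNING', guid: 'b', created_at: 40 }  # seeded baseline, still running
]

p fold_intervals(events)
```

Because a duplicate ending finds no open interval, it changes nothing, so a repaired ending that arrives after a real one is harmless to a consumer written this way.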