`ClaimedExecution` can be stuck in `claimed` state forever when `#finished` raises

## Problem

When the DB is in an unstable state (e.g., a transient network issue), `SolidQueue::ClaimedExecution#perform` sometimes fails to delete the `SolidQueue::ClaimedExecution` and mark `SolidQueue::Job` as finished.

Once this happens (we actually hit it in our production environment), the job remains `claimed` and will never be finished because the owning worker process remains alive.

## Reproduction

<details>
<summary> Reproducible script </summary>
<p>

```ruby
# script.rb
require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  gem "rails"
  gem "sqlite3"
  gem "solid_queue", require: false, github: 'rails/solid_queue'
  gem 'debug'
end

# Setup application

require 'active_job/railtie'
require 'active_record/railtie'
require 'solid_queue'

require "minitest/autorun"

File.delete('solid_queue_repro.sqlite3') if File.exist?('solid_queue_repro.sqlite3')
ENV["DATABASE_URL"] = "sqlite3:solid_queue_repro.sqlite3?timeout=5000"

class TestApp < Rails::Application
  config.eager_load = false
  config.logger = Logger.new(File.open('rails.log', 'w'))
  config.logger.level = Logger::INFO
  config.secret_key_base = "secret_key_base"

  config.active_job.queue_adapter = :solid_queue
  config.solid_queue.silence_polling = true
end

Rails.application.initialize!

# Create tables for Solid Queue
ActiveRecord::Schema.define do
  instance_eval(File.read(SolidQueue::Engine.root.join("lib/generators/solid_queue/install/templates/db/queue_schema.rb")))
end

class TestJob < ActiveJob::Base
  queue_as :default

  def perform
    # do nothing
  end
end

# Mimics DB instability
SolidQueue::ClaimedExecution.prepend(
  Module.new do
    # https://github.com/rails/solid_queue/blob/176721e33f542e07923fe02964cd55d2c18b4389/app/models/solid_queue/claimed_execution.rb#L110-L115
    def finished
      raise 'Intentional failure that mimics a transient DB glitch' if rand < 0.5

      super
    end
  end
)

# Test the behavior under DB instability
class TestTransientDbFailure < ActiveSupport::TestCase
  def setup
    @supervisor_pid = fork do
      config = SolidQueue::Configuration.new(queues: "default", workers: 1)
      SolidQueue::Supervisor.start(configuration: config)
    end
  end

  def teardown
    Process.kill(:TERM, @supervisor_pid)
    Process.waitpid(@supervisor_pid)
  end

  def test_transient_db_failure
    100.times do
      TestJob.set(wait: 2.seconds).perform_later
    end
    assert_equal 100, SolidQueue::ScheduledExecution.count

    # Never finishes
    until SolidQueue::Job.finished.count == 100
      sleep 1
      puts "finished_count: #{SolidQueue::Job.finished.count}, claimed_count: #{SolidQueue::ClaimedExecution.count}"
    end
  end
end
```


</p>
</details>

I would expect Solid Queue to eventually finish every job, but in practice some jobs are stuck in `claimed` forever:

```bash
$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 3
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
...
```

## Root Cause

When a worker completes the job, it calls [ClaimedExecution#finished](https://github.com/rails/solid_queue/blob/176721e33f542e07923fe02964cd55d2c18b4389/app/models/solid_queue/claimed_execution.rb#L110) to delete the `ClaimedExecution` and mark `Job` finished.
If that transaction raises, the exception ends up in the `on_rejection!` block inside [Pool](https://github.com/rails/solid_queue/blob/176721e33f542e07923fe02964cd55d2c18b4389/lib/solid_queue/pool.rb#L28-L30), and it swallows the exception.

As a result:

* The `ClaimedExecution` is **not** deleted
* The owning worker process is still alive and registered, so the execution is **not** considered orphaned and is never released by any recovery mechanisms
* The job therefore remains `claimed` indefinitely, even though it is completed

## Possible Solutions

I'm listing two possible approaches below, but I'm not sure either of them is ideal.
I'd love to hear your thoughts and discuss this.

### 1. Stop the worker when `ClaimedExecution#finished` fails

One possible solution to this issue is to stop the worker when it fails to complete `ClaimedExecution`.

The following is the PoC for this approach.
https://github.com/rails/solid_queue/compare/main...genya0407:solid_queue:feature/experimental-change-for-fixing-claimed-execution?expand=1

With this change, the previous reproducible script successfully completed all the jobs.

```bash
# Change the gem source
$ cat main.rb | grep 'gem "solid_queue"'
gem "solid_queue", require: false, github: 'genya0407/solid_queue', branch: 'feature/experimental-change-for-fixing-claimed-execution'

$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 0
finished_count: 1, claimed_count: 3
finished_count: 6, claimed_count: 3
finished_count: 8, claimed_count: 3
---snip---
finished_count: 99, claimed_count: 0
finished_count: 99, claimed_count: 0
finished_count: 100, claimed_count: 0
```

### 2. Periodic worker restarts

Once the worker is restarted, the stuck claimed executions will be released immediately and eventually retried.
In fact, when we encountered this issue, we restarted the workers by restarting the supervisors, and the jobs were released and re-executed.

So maybe supporting an automatic restart feature based on uptime or the number of processed jobs could mitigate this problem and others.
I believe, aside from this problem, restarting worker processes constantly is a common practice.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ClaimedExecution` can be stuck in `claimed` state forever when `#finished` raises #748

Problem

Reproduction

Root Cause

Possible Solutions

1. Stop the worker when `ClaimedExecution#finished` fails

2. Periodic worker restarts

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ClaimedExecution can be stuck in claimed state forever when #finished raises #748

Description

Problem

Reproduction

Root Cause

Possible Solutions

1. Stop the worker when ClaimedExecution#finished fails

2. Periodic worker restarts

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`ClaimedExecution` can be stuck in `claimed` state forever when `#finished` raises #748

1. Stop the worker when `ClaimedExecution#finished` fails