Skip to content

ClaimedExecution can be stuck in claimed state forever when #finished raises #748

@genya0407

Description

@genya0407

Problem

When the DB is in an unstable state (e.g., a transient network issue), SolidQueue::ClaimedExecution#perform sometimes fails to delete the SolidQueue::ClaimedExecution and mark SolidQueue::Job as finished.

Once this happens (we actually hit it in our production environment), the job remains claimed and will never be finished because the owning worker process remains alive.

Reproduction

Reproducible script

# script.rb
require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  gem "rails"
  gem "sqlite3"
  gem "solid_queue", require: false, github: 'rails/solid_queue'
  gem 'debug'
end

# Setup application

require 'active_job/railtie'
require 'active_record/railtie'
require 'solid_queue'

require "minitest/autorun"

File.delete('solid_queue_repro.sqlite3') if File.exist?('solid_queue_repro.sqlite3')
ENV["DATABASE_URL"] = "sqlite3:solid_queue_repro.sqlite3?timeout=5000"

class TestApp < Rails::Application
  config.eager_load = false
  config.logger = Logger.new(File.open('rails.log', 'w'))
  config.logger.level = Logger::INFO
  config.secret_key_base = "secret_key_base"

  config.active_job.queue_adapter = :solid_queue
  config.solid_queue.silence_polling = true
end

Rails.application.initialize!

# Create tables for Solid Queue
ActiveRecord::Schema.define do
  instance_eval(File.read(SolidQueue::Engine.root.join("lib/generators/solid_queue/install/templates/db/queue_schema.rb")))
end

class TestJob < ActiveJob::Base
  queue_as :default

  def perform
    # do nothing
  end
end

# Mimics DB instability
SolidQueue::ClaimedExecution.prepend(
  Module.new do
    # https://github.com/rails/solid_queue/blob/176721e33f542e07923fe02964cd55d2c18b4389/app/models/solid_queue/claimed_execution.rb#L110-L115
    def finished
      raise 'Intentional failure that mimics a transient DB glitch' if rand < 0.5

      super
    end
  end
)

# Test the behavior under DB instability
class TestTransientDbFailure < ActiveSupport::TestCase
  def setup
    @supervisor_pid = fork do
      config = SolidQueue::Configuration.new(queues: "default", workers: 1)
      SolidQueue::Supervisor.start(configuration: config)
    end
  end

  def teardown
    Process.kill(:TERM, @supervisor_pid)
    Process.waitpid(@supervisor_pid)
  end

  def test_transient_db_failure
    100.times do
      TestJob.set(wait: 2.seconds).perform_later
    end
    assert_equal 100, SolidQueue::ScheduledExecution.count

    # Never finishes
    until SolidQueue::Job.finished.count == 100
      sleep 1
      puts "finished_count: #{SolidQueue::Job.finished.count}, claimed_count: #{SolidQueue::ClaimedExecution.count}"
    end
  end
end

I would expect Solid Queue to eventually finish every job, but in practice some jobs are stuck in claimed forever:

$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 3
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
...

Root Cause

When a worker completes the job, it calls ClaimedExecution#finished to delete the ClaimedExecution and mark Job finished.
If that transaction raises, the exception ends up in the on_rejection! block inside Pool, and it swallows the exception.

As a result:

  • The ClaimedExecution is not deleted
  • The owning worker process is still alive and registered, so the execution is not considered orphaned and is never released by any recovery mechanisms
  • The job therefore remains claimed indefinitely, even though it is completed

Possible Solutions

I'm listing two possible approaches below, but I'm not sure either of them is ideal.
I'd love to hear your thoughts and discuss this.

1. Stop the worker when ClaimedExecution#finished fails

One possible solution to this issue is to stop the worker when it fails to complete ClaimedExecution.

The following is the PoC for this approach.
https://github.com/rails/solid_queue/compare/main...genya0407:solid_queue:feature/experimental-change-for-fixing-claimed-execution?expand=1

With this change, the previous reproducible script successfully completed all the jobs.

# Change the gem source
$ cat main.rb | grep 'gem "solid_queue"'
gem "solid_queue", require: false, github: 'genya0407/solid_queue', branch: 'feature/experimental-change-for-fixing-claimed-execution'

$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 0
finished_count: 1, claimed_count: 3
finished_count: 6, claimed_count: 3
finished_count: 8, claimed_count: 3
---snip---
finished_count: 99, claimed_count: 0
finished_count: 99, claimed_count: 0
finished_count: 100, claimed_count: 0

2. Periodic worker restarts

Once the worker is restarted, the stuck claimed executions will be released immediately and eventually retried.
In fact, when we encountered this issue, we restarted the workers by restarting the supervisors, and the jobs were released and re-executed.

So maybe supporting an automatic restart feature based on uptime or the number of processed jobs could mitigate this problem and others.
I believe, aside from this problem, restarting worker processes constantly is a common practice.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions