Problem
When the DB is in an unstable state (e.g., a transient network issue), SolidQueue::ClaimedExecution#perform sometimes fails to delete the SolidQueue::ClaimedExecution and mark SolidQueue::Job as finished.
Once this happens (we actually hit it in our production environment), the job remains claimed and will never be finished because the owning worker process remains alive.
Reproduction
Reproducible script
# script.rb
require "bundler/inline"
gemfile(true) do
source "https://rubygems.org"
gem "rails"
gem "sqlite3"
gem "solid_queue", require: false, github: 'rails/solid_queue'
gem 'debug'
end
# Setup application
require 'active_job/railtie'
require 'active_record/railtie'
require 'solid_queue'
require "minitest/autorun"
File.delete('solid_queue_repro.sqlite3') if File.exist?('solid_queue_repro.sqlite3')
ENV["DATABASE_URL"] = "sqlite3:solid_queue_repro.sqlite3?timeout=5000"
class TestApp < Rails::Application
config.eager_load = false
config.logger = Logger.new(File.open('rails.log', 'w'))
config.logger.level = Logger::INFO
config.secret_key_base = "secret_key_base"
config.active_job.queue_adapter = :solid_queue
config.solid_queue.silence_polling = true
end
Rails.application.initialize!
# Create tables for Solid Queue
ActiveRecord::Schema.define do
instance_eval(File.read(SolidQueue::Engine.root.join("lib/generators/solid_queue/install/templates/db/queue_schema.rb")))
end
class TestJob < ActiveJob::Base
queue_as :default
def perform
# do nothing
end
end
# Mimics DB instability
SolidQueue::ClaimedExecution.prepend(
Module.new do
# https://github.com/rails/solid_queue/blob/176721e33f542e07923fe02964cd55d2c18b4389/app/models/solid_queue/claimed_execution.rb#L110-L115
def finished
raise 'Intentional failure that mimics a transient DB glitch' if rand < 0.5
super
end
end
)
# Test the behavior under DB instability
class TestTransientDbFailure < ActiveSupport::TestCase
def setup
@supervisor_pid = fork do
config = SolidQueue::Configuration.new(queues: "default", workers: 1)
SolidQueue::Supervisor.start(configuration: config)
end
end
def teardown
Process.kill(:TERM, @supervisor_pid)
Process.waitpid(@supervisor_pid)
end
def test_transient_db_failure
100.times do
TestJob.set(wait: 2.seconds).perform_later
end
assert_equal 100, SolidQueue::ScheduledExecution.count
# Never finishes
until SolidQueue::Job.finished.count == 100
sleep 1
puts "finished_count: #{SolidQueue::Job.finished.count}, claimed_count: #{SolidQueue::ClaimedExecution.count}"
end
end
end
I would expect Solid Queue to eventually finish every job, but in practice some jobs are stuck in claimed forever:
$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 3
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
finished_count: 46, claimed_count: 54
...
Root Cause
When a worker completes the job, it calls ClaimedExecution#finished to delete the ClaimedExecution and mark Job finished.
If that transaction raises, the exception ends up in the on_rejection! block inside Pool, and it swallows the exception.
As a result:
- The
ClaimedExecution is not deleted
- The owning worker process is still alive and registered, so the execution is not considered orphaned and is never released by any recovery mechanisms
- The job therefore remains
claimed indefinitely, even though it is completed
Possible Solutions
I'm listing two possible approaches below, but I'm not sure either of them is ideal.
I'd love to hear your thoughts and discuss this.
1. Stop the worker when ClaimedExecution#finished fails
One possible solution to this issue is to stop the worker when it fails to complete ClaimedExecution.
The following is the PoC for this approach.
https://github.com/rails/solid_queue/compare/main...genya0407:solid_queue:feature/experimental-change-for-fixing-claimed-execution?expand=1
With this change, the previous reproducible script successfully completed all the jobs.
# Change the gem source
$ cat main.rb | grep 'gem "solid_queue"'
gem "solid_queue", require: false, github: 'genya0407/solid_queue', branch: 'feature/experimental-change-for-fixing-claimed-execution'
$ ruby script.rb
---snip---
finished_count: 0, claimed_count: 0
finished_count: 0, claimed_count: 0
finished_count: 1, claimed_count: 3
finished_count: 6, claimed_count: 3
finished_count: 8, claimed_count: 3
---snip---
finished_count: 99, claimed_count: 0
finished_count: 99, claimed_count: 0
finished_count: 100, claimed_count: 0
2. Periodic worker restarts
Once the worker is restarted, the stuck claimed executions will be released immediately and eventually retried.
In fact, when we encountered this issue, we restarted the workers by restarting the supervisors, and the jobs were released and re-executed.
So maybe supporting an automatic restart feature based on uptime or the number of processed jobs could mitigate this problem and others.
I believe, aside from this problem, restarting worker processes constantly is a common practice.
Problem
When the DB is in an unstable state (e.g., a transient network issue),
SolidQueue::ClaimedExecution#performsometimes fails to delete theSolidQueue::ClaimedExecutionand markSolidQueue::Jobas finished.Once this happens (we actually hit it in our production environment), the job remains
claimedand will never be finished because the owning worker process remains alive.Reproduction
Reproducible script
I would expect Solid Queue to eventually finish every job, but in practice some jobs are stuck in
claimedforever:Root Cause
When a worker completes the job, it calls ClaimedExecution#finished to delete the
ClaimedExecutionand markJobfinished.If that transaction raises, the exception ends up in the
on_rejection!block inside Pool, and it swallows the exception.As a result:
ClaimedExecutionis not deletedclaimedindefinitely, even though it is completedPossible Solutions
I'm listing two possible approaches below, but I'm not sure either of them is ideal.
I'd love to hear your thoughts and discuss this.
1. Stop the worker when
ClaimedExecution#finishedfailsOne possible solution to this issue is to stop the worker when it fails to complete
ClaimedExecution.The following is the PoC for this approach.
https://github.com/rails/solid_queue/compare/main...genya0407:solid_queue:feature/experimental-change-for-fixing-claimed-execution?expand=1
With this change, the previous reproducible script successfully completed all the jobs.
2. Periodic worker restarts
Once the worker is restarted, the stuck claimed executions will be released immediately and eventually retried.
In fact, when we encountered this issue, we restarted the workers by restarting the supervisors, and the jobs were released and re-executed.
So maybe supporting an automatic restart feature based on uptime or the number of processed jobs could mitigate this problem and others.
I believe, aside from this problem, restarting worker processes constantly is a common practice.