Design and implementation ideas for job retries #509

@r-c-n

Opening a discussion here to put together ideas about the requirements, use cases and design of job retries.

I'll start by pointing out that there isn't a single solution to this. It will depend on the requirements we settle on, but even within a defined set of requirements there are many ways to implement retries.

IMO, regardless of the solution we choose in the end, handling job retries from a high-level point of view (where a retry is an entity exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is internal to the API, handled automatically by the core and not exposed outside). The two aren't incompatible, though. A mix can be a sane option: a low-level retry stage is performed automatically by the runtimes when they detect a clear condition to trigger a retry (such as what Lava does), with no API/Core intervention, and a higher-level retry stage can then be performed by whatever logic consumes the API results. Under this scheme, the low-level retry stage would act as a filter that absorbs runtime-specific retries that would be too noisy to handle at a higher level.
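To make the two-stage idea concrete, here is a rough sketch. Everything in it is hypothetical: the `RunResult` type, the status strings (`"pass"`, `"fail"`, `"infra_error"`) and both function names are made up for illustration and don't correspond to any existing API or runtime code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    # Hypothetical status values: "pass", "fail" or "infra_error".
    status: str
    detail: str = ""


def run_with_low_level_retries(
    run_once: Callable[[], RunResult], max_attempts: int = 3
) -> RunResult:
    """Low-level stage: the runtime silently reruns on a clear
    infrastructure error, without involving the API or core."""
    result = run_once()
    attempts = 1
    while result.status == "infra_error" and attempts < max_attempts:
        result = run_once()
        attempts += 1
    # Only this final result would be published through the API.
    return result


def needs_high_level_retry(result: RunResult) -> bool:
    """High-level stage: a client looks at the published result and
    decides whether to request a retry. The policy here (retry anything
    that did not pass) is just a placeholder."""
    return result.status != "pass"
```

With this split, a client would only ever see the result that survives the low-level stage, so transient runtime glitches never reach the higher-level retry logic unless they exhaust `max_attempts`.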

Questions

How will retries be triggered?

  • Automatically (low-level, platform-based)?
    • What will trigger an automatic retry?
  • Will API clients (users / pipeline / tools) submit or request job retries?
    • What information will be needed to submit / request a retry?
    • Can they submit a retry of anything or only of certain job types?
    • Job ownership: which users can trigger retries for a certain job?

Is there a retry limit per run?

How to conceptualize a "retry"?

  • Does it need to be a separate entity?
  • Can a retry be simply a "job"?
  • If not, how to define the relationship between a job run and its retries?
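As one possible answer to these questions (and only as a starting point for discussion), a retry could be modelled as a regular job that carries a reference to the job it retries. The sketch below assumes a hypothetical `Job` type with `retry_of` and `retry_count` fields, a made-up `MAX_RETRIES` limit and an owner-only permission rule; none of these exist in the current API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_ids = count(1)   # stand-in for IDs that the API would assign
MAX_RETRIES = 2   # hypothetical per-run retry limit


@dataclass
class Job:
    name: str
    owner: str
    id: int = field(default_factory=lambda: next(_ids))
    retry_of: Optional[int] = None  # ID of the original job, if this is a retry
    retry_count: int = 0            # how many retries precede this job


def make_retry(job: Job, requester: str) -> Job:
    """Create a new job that retries `job`, enforcing ownership and the limit."""
    if requester != job.owner:
        raise PermissionError(f"{requester} does not own job {job.id}")
    if job.retry_count >= MAX_RETRIES:
        raise ValueError(f"retry limit ({MAX_RETRIES}) reached for job {job.id}")
    # Always point at the first job in the chain, so all retries share one root.
    root_id = job.retry_of if job.retry_of is not None else job.id
    return Job(
        name=job.name,
        owner=job.owner,
        retry_of=root_id,
        retry_count=job.retry_count + 1,
    )
```

Under this model a retry is "simply a job", and the relationship to the original run is a single field rather than a separate entity. Whether the link should point at the root job or at the immediately preceding attempt is an open design choice.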
