Design and implementation ideas for job retries #509

Opening this discussion to collect ideas about the requirements, use cases and design of job retries.

I'll start by pointing out that there isn't a single solution to this. It will depend on the requirements we settle on, and even within a defined set of requirements there are many ways to implement retries.

IMO, regardless of the solution we choose in the end, handling job retries from a high-level point of view (where a retry is an entity exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is internal to the API, handled automatically by the core and not exposed outside). The two aren't incompatible, though. A mix can be a sane option: a low-level retry stage is performed automatically by the runtimes when they detect a clear condition to trigger a retry (such as what LAVA does), with no API/Core intervention, and a higher-level retry stage can then be performed by whatever logic consumes the API results. Under this scheme, the low-level stage would act as a filter, absorbing runtime-specific retries that would be too noisy to handle at a higher level.
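To make the two-stage idea more concrete, here is a minimal Python sketch. It is only an illustration, not a proposal for the actual implementation: every name in it (`Result`, `run_with_low_level_retries`, `high_level_retry`, the `"infra_error"` status) is hypothetical, and it assumes a runtime can tell an infrastructure error apart from a real test failure.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# All names and status strings below are hypothetical, for illustration only.


@dataclass
class Result:
    status: str    # "pass", "fail" or "infra_error"
    attempts: int  # runs consumed by the low-level stage for this result


def run_with_low_level_retries(run_job: Callable[[], str],
                               max_attempts: int = 3) -> Result:
    """Low-level stage: the runtime silently re-runs the job on a clear
    infrastructure error. Nothing here is visible through the API."""
    attempts = 0
    while True:
        attempts += 1
        status = run_job()
        if status != "infra_error" or attempts >= max_attempts:
            return Result(status, attempts)


def high_level_retry(run_job: Callable[[], str],
                     should_retry: Callable[[Result], bool],
                     max_retries: int = 1) -> Tuple[Result, int]:
    """High-level stage: some client logic looks at the result exposed by
    the API and decides whether to request another run."""
    result = run_with_low_level_retries(run_job)
    retries = 0
    while retries < max_retries and should_retry(result):
        retries += 1
        result = run_with_low_level_retries(run_job)
    return result, retries


# Simulated job: two infrastructure errors, one real failure, then a pass.
outcomes = iter(["infra_error", "infra_error", "fail", "pass"])
final, retries = high_level_retry(lambda: next(outcomes),
                                  should_retry=lambda r: r.status == "fail")
```

In this simulated run the two infrastructure errors are absorbed by the low-level stage, so the higher level only ever sees one failure and requests a single retry.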

### Questions

#### How will retries be triggered?

- Automatically (low-level, platform-based)?
  - What will trigger an automatic retry?
- Will API clients (users / pipeline / tools) submit or request job retries?
  - What information will be needed to submit / request a retry?
  - Can they submit a retry of anything or only of certain job types?
  - Job ownership: which users can trigger retries for a certain job?

#### Is there a retry limit per run?

#### How to conceptualize a "retry"?

- Does it need to be a separate entity?
- Can a retry be simply a "job"?
- If not, how to define the relationship between a job run and its retries?
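As a starting point for discussion, here is a sketch of one possible answer to the questions above, where a retry is just a regular job that points back to the run it retries. It doesn't settle any of them. The `Job` fields, the `retry_of` link, the `JobStore` class, the owner-only rule and the limit of two retries are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical data model: none of these names come from the existing API.


@dataclass
class Job:
    id: str
    name: str
    owner: str
    retry_of: Optional[str] = None  # id of the original run, None for a first run


class RetryLimitReached(Exception):
    """Raised when a run already has the maximum number of retries."""


class JobStore:
    def __init__(self, max_retries: int = 2) -> None:
        self.jobs: Dict[str, Job] = {}
        self.max_retries = max_retries

    def add(self, job: Job) -> Job:
        self.jobs[job.id] = job
        return job

    def retries_of(self, job_id: str) -> List[Job]:
        """All jobs that are retries of the given original run."""
        return [j for j in self.jobs.values() if j.retry_of == job_id]

    def submit_retry(self, job_id: str, user: str) -> Job:
        job = self.jobs[job_id]
        # Retrying a retry still links back to the original run.
        root_id = job.retry_of or job.id
        root = self.jobs[root_id]
        if user != root.owner:  # assumed rule: only the owner may retry
            raise PermissionError(f"{user} cannot retry job {root_id}")
        existing = self.retries_of(root_id)
        if len(existing) >= self.max_retries:
            raise RetryLimitReached(root_id)
        retry = Job(id=f"{root_id}-retry{len(existing) + 1}",
                    name=root.name, owner=root.owner, retry_of=root_id)
        return self.add(retry)


store = JobStore(max_retries=2)
store.add(Job(id="j1", name="baseline", owner="alice"))
first_retry = store.submit_retry("j1", user="alice")
second_retry = store.submit_retry(first_retry.id, user="alice")
```

With this shape the relationship between a run and its retries is a single parent link, and the per-run limit is a count over that link. A separate "retry" entity would instead need its own schema and API endpoints.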
