Opening a discussion here to put together ideas about the requirements, use cases and design of job retries.
I'll start by pointing out that there isn't a single solution to this. The right one depends on our requirements, and even within a defined set of requirements there are many ways to implement retries.
IMO, regardless of the solution we choose in the end, handling job retries from a high-level point of view (where a retry is an entity that's exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is something internal that's automatically handled by the core and not exposed outside). The two aren't incompatible, though. A mix of both can be a sane option: the runtimes automatically perform a low-level retry stage when they detect a clear condition to trigger a retry (such as what Lava does), with no API/Core intervention, and a higher-level retry stage can then be performed by whatever logic consumes the API results. Under this scheme, the low-level stage would act as a filter, absorbing runtime-specific retries that would be too noisy to handle at a higher level.
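To make the mixed scheme a bit more concrete, here is a minimal Python sketch of the two stages. Everything in it is an assumption for illustration only: the status strings, `run_with_runtime_retries`, `needs_high_level_retry` and the attempt limit are hypothetical names and values, not anything that exists in the API or in Lava.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    status: str        # hypothetical values: "pass", "fail", "infra_error"
    attempts: int = 1  # runtime-level attempts made before reporting


def run_with_runtime_retries(run_once: Callable[[], str],
                             max_attempts: int = 3) -> Result:
    """Low-level stage: the runtime retries on its own when it sees a clear
    infrastructure error. Only the final outcome is reported upwards."""
    status = "infra_error"
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        status = run_once()
        if status != "infra_error":
            break  # a real result ("pass" or "fail"), stop retrying
    return Result(status=status, attempts=attempts)


def needs_high_level_retry(result: Result) -> bool:
    """High-level stage: logic outside the core looks at the result exposed
    by the API and decides whether to request a retry. The policy here
    (retry anything that did not pass) is just a placeholder."""
    return result.status != "pass"


# Usage: two infrastructure errors are absorbed by the runtime, so the
# higher level only ever sees the final "pass".
outcomes = iter(["infra_error", "infra_error", "pass"])
result = run_with_runtime_retries(lambda: next(outcomes))
```

The point of the split is that the higher level never sees the two intermediate infrastructure errors, only a single result that took three attempts.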
Questions
How will retries be triggered?
- Automatically (low-level, platform-based)?
  - What will trigger an automatic retry?
- Will API clients (users / pipeline / tools) submit or request job retries?
  - What information will be needed to submit / request a retry?
  - Can they submit a retry of anything or only of certain job types?
  - Job ownership: which users can trigger retries for a certain job?
Is there a retry limit per run?
How to conceptualize a "retry"?
- Does it need to be a separate entity?
- Can a retry be simply a "job"?
  - If not, how to define the relationship between a job run and its retries?
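As one possible answer to the questions above, here is a rough Python sketch where a retry is simply a job that points back at the original run. It also shows where an ownership check and a per-run retry limit could live. All names (`Job`, `JobStore`, `retry_of`, `max_retries`) are hypothetical and only meant to feed the discussion, not a proposal for the actual data model.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    id: int
    name: str
    owner: str
    retry_of: Optional[int] = None  # id of the original run, None if not a retry


class JobStore:
    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries  # assumed per-run retry limit
        self._jobs: Dict[int, Job] = {}
        self._ids = itertools.count(1)

    def submit(self, name: str, owner: str) -> Job:
        job = Job(id=next(self._ids), name=name, owner=owner)
        self._jobs[job.id] = job
        return job

    def retries_of(self, job_id: int) -> List[Job]:
        """All retries linked to the given original run, in creation order."""
        return [j for j in self._jobs.values() if j.retry_of == job_id]

    def retry(self, job_id: int, user: str) -> Job:
        job = self._jobs[job_id]
        # Retrying a retry still links the new job to the original run.
        root_id = job.retry_of if job.retry_of is not None else job.id
        root = self._jobs[root_id]
        if user != root.owner:
            raise PermissionError("only the job owner can request a retry")
        if len(self.retries_of(root_id)) >= self.max_retries:
            raise ValueError("retry limit reached for this run")
        new_job = Job(id=next(self._ids), name=root.name,
                      owner=root.owner, retry_of=root_id)
        self._jobs[new_job.id] = new_job
        return new_job


# Usage: one original run, two retries, then the limit kicks in.
store = JobStore(max_retries=2)
original = store.submit("baseline", owner="alice")
first = store.retry(original.id, user="alice")
second = store.retry(first.id, user="alice")  # linked to the original run
```

With this shape no separate "retry" entity is needed: the relationship between a run and its retries is a single `retry_of` field, and the limit is counted per original run rather than per retry.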