Single GPU benchmark scripts #15514

Open
ChSonnabend wants to merge 9 commits into AliceO2Group:dev from ChSonnabend:devel_fst_numactl

Conversation

@ChSonnabend

Collaborator

This PR adds two scripts for benchmarking single-GPU performance:

  • gen_single_gpu_rtc_benchmark.sh generates the workflow from dpl-workflow.sh by setting environment variables and using early stops to avoid processing failures
  • analyze_gpu_benchmarks.py then analyzes the resulting log file for processing times, records them, histograms them, and fits a Gaussian to the result to determine the mean processing time per timeslice

@ChSonnabend ChSonnabend requested a review from a team as a code owner June 12, 2026 07:28

@davidrohr davidrohr left a comment

Collaborator

Didn't check anything in detail, but these are the things that immediately came to mind:


# ROCm library injection is only useful for HIP runs. Keep it off by default for CUDA/NVIDIA containers,
# because mixed AMD/NVIDIA hosts can otherwise leak ROCm libraries into LD_LIBRARY_PATH.
if [[ "${GPUTYPE:-}" == "HIP" && "0$BENCH_AUTO_ROCM_LIBS" == "01" ]]; then

With new bash you can just use $BENCH_AUTO_ROCM_LIBS == 1
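
A minimal sketch of the simplified check, assuming the script runs under bash (not plain sh). Inside `[[ ]]`, an unset or empty variable compares as an empty string, so the `"0$VAR" == "01"` padding is not required. The function name `rocm_libs_enabled` is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the "0$VAR" == "01" idiom guards against empty operands in old-style tests;
# bash's [[ ]] does not need it.
rocm_libs_enabled() {
  # ${VAR:-} keeps the check safe even if the caller enables `set -u`.
  [[ "${GPUTYPE:-}" == "HIP" && "${BENCH_AUTO_ROCM_LIBS:-}" == 1 ]]
}

GPUTYPE=HIP
unset BENCH_AUTO_ROCM_LIBS
if rocm_libs_enabled; then echo "inject ROCm libs"; else echo "skip ROCm libs"; fi

BENCH_AUTO_ROCM_LIBS=1
if rocm_libs_enabled; then echo "inject ROCm libs"; else echo "skip ROCm libs"; fi
```

The first call takes the "skip" branch because the variable is unset; the second takes the "inject" branch.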


export DPL_REPORT_PROCESSING="${DPL_REPORT_PROCESSING:-1}"

export FST_TMUX_NO_EPN="${FST_TMUX_NO_EPN:-1}"

Not needed, since start_tmux.sh is not used.

# ----------------------------------------------------------------------------------------------------------------------
# Locate original workflow script. Keep the original untouched.

: "${GEN_TOPO_MYDIR:=$(dirname "$(realpath "$0")")}"

Why don't you simply use $O2_ROOT/dpl-workflow.sh?

export WORKFLOW_PARAMETERS="${WORKFLOW_PARAMETERS:-GPU,CTF}"
export GPUTYPE="${GPUTYPE:-CUDA}"
export NGPUS=1
export NUMAGPUIDS=1

NUMAGPUIDS and NUMAID should not be set if NUMA pinning is not used.

Comment on lines +46 to +59
export EPNSYNCMODE="${EPNSYNCMODE:-0}"
export SYNCMODE="${SYNCMODE:-1}"
export SYNCRAWMODE="${SYNCRAWMODE:-0}"

export TIMEFRAME_RATE_LIMIT="${TIMEFRAME_RATE_LIMIT:-5}"
export GEN_TOPO_NO_TF_RATE_UPSCALING="${GEN_TOPO_NO_TF_RATE_UPSCALING:-1}"

export DISABLE_ROOT_OUTPUT="${DISABLE_ROOT_OUTPUT:-1}"

# Double pipeline requires zsraw input. Therefore default to raw TF input, not CTF.
export CTFINPUT="${CTFINPUT:-0}"
export RAWTFINPUT="${RAWTFINPUT:-1}"
export DIGITINPUT="${DIGITINPUT:-0}"
export EXTINPUT="${EXTINPUT:-0}"

Why do you redefine all the defaults that come from setenv.sh?
I would set only the settings that you actually need.
That should be:
SYNCMODE=1
TIMEFRAME_RATE_LIMIT=5
RAWTFINPUT=1
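
As a sketch, the reduced block in the benchmark script could look like the following. It keeps the `${VAR:-default}` pattern already used in the diff so that values exported by the caller still take precedence; that choice is an assumption, and plain assignments would work as well.

```shell
#!/usr/bin/env bash
# Only the three settings named in the review; everything else is left to setenv.sh.
export SYNCMODE="${SYNCMODE:-1}"
export TIMEFRAME_RATE_LIMIT="${TIMEFRAME_RATE_LIMIT:-5}"
export RAWTFINPUT="${RAWTFINPUT:-1}"
```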

source "$PWD/local_env.sh"
fi

export ALICE_O2_FST="${ALICE_O2_FST:-1}"

This is a hack for running on MI100; I would not put it in this script.


export ALICE_O2_FST="${ALICE_O2_FST:-1}"

if [[ -f "$GEN_TOPO_MYDIR/setenv.sh" ]]; then

dpl-workflow.sh will source setenv.sh, so why do you source it here?

# Let O2/core dumps land in the benchmark run directory, not in the original working directory.
export CORE_DUMP_DIR="${CORE_DUMP_DIR:-$RUNDIR}"
export O2_CORE_DUMP_DIR="${O2_CORE_DUMP_DIR:-$RUNDIR}"
export FAIRMQ_SHM_MONITOR_CONFIG="${FAIRMQ_SHM_MONITOR_CONFIG:-}"

We do not run the SHM MONITOR, why do you need this?

@alibuild

Collaborator

Error while checking build/O2/fullCI_slc9 for 6353bf6 at 2026-06-12 21:08:

## sw/BUILD/O2-full-system-test-latest/log
command /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local1/prodtests/full-system-test/dpl-workflow.sh had nonzero exit code 1

Full log here.

(has_detector_reco ITS && ! has_detector_gpu ITS) && ! has_detector_from_global_reader ITS && add_W o2-its-reco-workflow "$ITS_CONFIG $ITS_STAGGERED $DISABLE_MC ${DISABLE_DIGIT_CLUSTER_INPUT:-} $DISABLE_ROOT_OUTPUT --pipeline $(get_N its-tracker ITS REST 1 ITSTRK),$(get_N its-clusterer ITS REST 1 ITSCL)" "$ITS_CONFIG_KEY;$ITSMFT_STROBES;$ITSEXTRAERR"
[[ ${DISABLE_DIGIT_CLUSTER_INPUT:-} =~ "--digits-from-upstream" ]] && has_detector_gpu ITS && ! has_detector_from_global_reader ITS && add_W o2-its-reco-workflow "--disable-tracking ${DISABLE_DIGIT_CLUSTER_INPUT:-} $ITS_STAGGERED $DISABLE_MC $DISABLE_ROOT_OUTPUT --pipeline $(get_N its-clusterer ITS REST 1 ITSCL)" "$ITS_CONFIG_KEY;$ITSMFT_STROBES;$ITSEXTRAERR"
(has_detector_reco TPC || has_detector_ctf TPC) && ! has_detector_from_global_reader TPC && add_W o2-gpu-reco-workflow "--gpu-reconstruction \"$GPU_CONFIG_SELF\" --input-type=$GPU_INPUT $DISABLE_MC --output-type $GPU_OUTPUT $([[ $TPC_CORR_OPT == *--disable-ctp-lumi-request* ]] && echo --disable-ctp-lumi-request) $ITS_STAGGERED --pipeline gpu-reconstruction:${N_TPCTRK:-1},gpu-reconstruction-prepare:${N_TPCTRK:-1} $GPU_CONFIG" "GPU_global.deviceType=$GPUTYPE;GPU_proc.debugLevel=0;$GPU_CONFIG_KEY;$TRACKTUNETPCINNER;"
(has_detector_reco TPC || has_detector_ctf TPC) && ! has_detector_from_global_reader TPC && add_W o2-gpu-reco-workflow "--gpu-reconstruction \"$GPU_CONFIG_SELF\" $MSLOG --input-type=$GPU_INPUT $DISABLE_MC --output-type $GPU_OUTPUT $([[ $TPC_CORR_OPT == *--disable-ctp-lumi-request* ]] && echo --disable-ctp-lumi-request) $ITS_STAGGERED --pipeline gpu-reconstruction:${N_TPCTRK:-1},gpu-reconstruction-prepare:${N_TPCTRK:-1} $GPU_CONFIG" "GPU_global.deviceType=$GPUTYPE;GPU_proc.debugLevel=0;$GPU_CONFIG_KEY;$TRACKTUNETPCINNER;"

Instead of modifying dpl-workflow.sh, you can just set

ARGS_EXTRA_PROCESS_o2_gpu_reco_workflow="--log-timestamp-us"

in your benchmark script.
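
A sketch of what this could look like in the benchmark script. The `export` line uses the variable and value from the comment above. The helper `extra_args_for` is hypothetical and is not the O2 implementation; it only illustrates the apparent naming pattern, in which the executable name with dashes replaced by underscores is appended to `ARGS_EXTRA_PROCESS_`.

```shell
#!/usr/bin/env bash
# Set before dpl-workflow.sh is invoked (variable name and value from the review comment).
export ARGS_EXTRA_PROCESS_o2_gpu_reco_workflow="--log-timestamp-us"

# Hypothetical helper, for illustration only: map an executable name to the
# per-process variable (dashes become underscores) and print its value, if any.
extra_args_for() {
  local var="ARGS_EXTRA_PROCESS_${1//-/_}"
  printf '%s' "${!var:-}"
}

echo "o2-gpu-reco-workflow $(extra_args_for o2-gpu-reco-workflow)"
```

This keeps dpl-workflow.sh untouched, so the extra logging option only applies when the benchmark script is used.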


export DPL_REPORT_PROCESSING="${DPL_REPORT_PROCESSING:-1}"
export WORKFLOW_PARAMETERS="${WORKFLOW_PARAMETERS:-GPU,CTF}"
export GPUTYPE="${GPUTYPE:-CUDA}"

Perhaps I would not default to CUDA here but require the user to set it, since the script is supposed to work equally for CUDA and for HIP. That avoids user error when the user doesn't provide it.
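
One way to enforce this, sketched under the assumption that CUDA and HIP are the only backends the benchmark is meant to cover. The function name `require_gputype` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: refuse to guess the GPU backend; the caller must export GPUTYPE explicitly.
require_gputype() {
  case "${GPUTYPE:-}" in
    CUDA|HIP) return 0 ;;
    "") echo "ERROR: GPUTYPE is not set; export GPUTYPE=CUDA or GPUTYPE=HIP" >&2; return 1 ;;
    *)  echo "ERROR: unsupported GPUTYPE='${GPUTYPE}'" >&2; return 1 ;;
  esac
}

# Demo in subshells so the calling environment is left untouched.
( GPUTYPE=HIP; require_gputype && echo "GPUTYPE ok" )
( unset GPUTYPE; require_gputype 2>/dev/null || echo "GPUTYPE missing" )
```

In the real script the call would be followed by `|| exit 1` so that a missing value stops the run before any workflow is generated.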

export O2_GPU_DOUBLE_PIPELINE="${O2_GPU_DOUBLE_PIPELINE:-1}"
export O2_GPU_RTC="${O2_GPU_RTC:-1}"
export SYNCMODE="${SYNCMODE:-1}"
export DISABLE_ROOT_OUTPUT="${DISABLE_ROOT_OUTPUT:-1}"

DISABLE_ROOT_OUTPUT is already enabled by default.
So you can remove it here.
(And btw, the correct value for this setting would be DISABLE_ROOT_OUTPUT="--disable-root-output")
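
The reason the value matters can be seen in a small sketch: the workflow lines quoted in the CI log splice `$DISABLE_ROOT_OUTPUT` directly into the argument string, so whatever the variable contains ends up on the command line. The function `build_cmd` below is a hypothetical stand-in for such a line, not code from dpl-workflow.sh.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a workflow line that splices the variable into its arguments.
build_cmd() {
  echo "o2-its-reco-workflow ${DISABLE_ROOT_OUTPUT:-}"
}

DISABLE_ROOT_OUTPUT=1 build_cmd                      # a stray "1" ends up on the command line
DISABLE_ROOT_OUTPUT=--disable-root-output build_cmd  # the flag is passed as intended
```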


export RUN_BENCHMARK="${RUN_BENCHMARK:-0}"

echo "# Alien/JAliEn environment check:"

I still don't understand why we need this alien token magic. If alien-token-info finds the token before running this script, that should be all that is needed?
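
A sketch of the simpler check this comment points toward, assuming that `alien-token-info` exits with a nonzero status when no valid token is available. The function name `have_valid_token` is hypothetical, and the command name is a parameter only so that the function can be exercised without a JAliEn installation.

```shell
#!/usr/bin/env bash
# Sketch: rely on the token check alone instead of extra environment handling.
have_valid_token() {
  local cmd="${1:-alien-token-info}"
  command -v "$cmd" >/dev/null 2>&1 || return 2   # tool not available in this environment
  "$cmd" >/dev/null 2>&1                           # assumed: nonzero exit means no valid token
}

if have_valid_token; then
  echo "# token found, nothing else to set up"
else
  echo "# no valid token (or alien-token-info missing)" >&2
fi
```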
