perf: OpenMP + SIMD optimization for OFDFT esolver grid loops #7495
Open
Xiao-Han666 wants to merge 2 commits into
- P0-1: Add #pragma omp parallel for simd to 10 grid loops across before_opt, update_potential, optimize, update_rho, after_opt, cal_energy. All guarded by #ifdef _OPENMP for serial fallback.
- P0-opt: Replace per-iteration new/delete in optimize() with persistent buffer ptemp_phi_persistent_, allocated once in allocate_array().
- P1-1: Three-tier version-adaptive offload for dL/dφ computation (_OPENMP>=201811→GPU target, <201811→host parallel, none→serial).
- P2-2: Precompute cos/sin/rho0 outside inner loops; fuse 3 ZEROS calls into single grid loop to reduce loop overhead.

Modified files:
  source/source_esolver/esolver_of.cpp (+134/-36)
  source/source_esolver/esolver_of.h (+1, ptemp_phi_persistent_)
  source/source_esolver/esolver_of_tool.cpp (+8, allocate persistent buffer)
perf: OpenMP + SIMD optimization for OFDFT esolver grid loops
Summary
Add OpenMP multithreading and SIMD vectorization to `ESolver_OF` mesh-grid loops, achieving a 1.48× wall-time speedup on a Si₆₄ diamond benchmark while preserving bit-identical total energies.
Changes
P0-1 — OpenMP parallel for simd (10 grid loops)
Ten element-wise loops across `before_opt`, `update_potential`, `optimize`, `update_rho`, `after_opt`, and `cal_energy` are parallelized with `#pragma omp parallel for simd`. All pragmas are guarded by `#ifdef _OPENMP` so the code compiles and runs correctly without OpenMP.

| Function | Directive |
| --- | --- |
| `before_opt` | `#pragma omp parallel for simd` |
| `before_opt` | `#pragma omp parallel for simd` |
| `before_opt` | `#pragma omp parallel for simd` |
| `update_potential` | `#pragma omp parallel for simd` |
| `update_potential` | `parallel for simd` / `target teams` |
| `optimize` | `#pragma omp parallel for simd` |
| `update_rho` | `#pragma omp parallel for simd` |
| `after_opt` | `#pragma omp parallel for simd` |
| `after_opt` | `#pragma omp parallel for simd` |
| `cal_energy` | `#pragma omp parallel for simd reduction(+:local_sum)` |

P0-opt — Persistent line-search buffer
`optimize()` previously called `new double[nrxx]` / `delete[]` for `ptemp_phi` on every SCF iteration. A persistent buffer `ptemp_phi_persistent_` is now allocated once in `allocate_array()` and reused, eliminating heap-allocation contention under multithreading.
P1-1 — GPU-target offload for dL/dφ
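The dL/dφ element-wise loop in `update_potential` uses a three-tier fallback:
- `_OPENMP >= 201811`: GPU target offload
- `_OPENMP < 201811`: host-parallel loop
- OpenMP not available: serial loop
P2-2 — SIMD + precomputation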
- `simd` clause for AVX2 packed-FMA generation.
- `update_rho`: `cos(theta)` / `sin(theta)` hoisted outside the inner loop.
- `before_opt`: `nelec / omega` division hoisted to `const double rho0`.
- `before_opt`: three separate `ZEROS()` calls fused into a single zeroing loop.
Files Modified
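
| File | Change |
| --- | --- |
| `source/source_esolver/esolver_of.cpp` | +134/-36 |
| `source/source_esolver/esolver_of.h` | +1, `ptemp_phi_persistent_` member |
| `source/source_esolver/esolver_of_tool.cpp` | +8, allocates the persistent buffer |

Benchmark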
System: Si₆₄ diamond 2×2×2 supercell, ecutwfc = 100 Ry, WT KEDF, 2 MPI processes
Hardware: WSL2 / i5-11260H (4C8T) / 11 GB RAM
All runs yield identical total energy (−6965.09274 eV).
Notes
- … `module_pw` and `module_ofdft` respectively.
- `#ifdef _OPENMP` guards: the code compiles and runs correctly with `-DUSE_OPENMP=OFF`.
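
For reviewers who want to see the guard pattern in isolation, here is a minimal sketch of the `#ifdef _OPENMP` guards and the three-tier dispatch described under P1-1. It is not the code from `esolver_of.cpp`: the function names `fused_multiply_add` and `sum_grid`, their signatures, and the `map` clauses are assumptions made for illustration. Only the `_OPENMP >= 201811` threshold, the `parallel for simd` directive, and the `reduction(+:local_sum)` clause are taken from the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element-wise kernel: out[i] = a[i] * b[i] + c.
// Tier selection follows the thresholds quoted in this PR:
//   _OPENMP >= 201811 -> target offload (runs on the host if no device is available),
//   older OpenMP      -> host threads + SIMD,
//   no OpenMP         -> the pragma disappears and the loop runs serially.
void fused_multiply_add(const double* a, const double* b, double c,
                        double* out, int n)
{
    assert(n >= 0);
#if defined(_OPENMP) && _OPENMP >= 201811
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: out[0:n])
#elif defined(_OPENMP)
    #pragma omp parallel for simd
#endif
    for (int i = 0; i < n; ++i)
    {
        out[i] = a[i] * b[i] + c;
    }
}

// Hypothetical grid sum using the same guard plus a reduction clause,
// so each thread accumulates privately and the partial sums are combined.
double sum_grid(const double* v, int n)
{
    assert(n >= 0);
    double local_sum = 0.0;
#ifdef _OPENMP
    #pragma omp parallel for simd reduction(+:local_sum)
#endif
    for (int i = 0; i < n; ++i)
    {
        local_sum += v[i];
    }
    return local_sum;
}
```

Because every pragma sits inside a preprocessor guard, the same translation unit builds with or without `-fopenmp`. A caller would use it as `fused_multiply_add(a.data(), b.data(), 0.5, out.data(), n)` followed by `sum_grid(out.data(), n)`. Note that a parallel reduction may sum in a different order than the serial loop, so results can differ in the last bits for general inputs.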