
perf: OpenMP + SIMD optimization for OFDFT esolver grid loops#7495

Open: Xiao-Han666 wants to merge 2 commits into deepmodeling:develop from Xiao-Han666:develop

Summary

Add OpenMP multithreading and SIMD vectorization to ESolver_OF mesh-grid loops, achieving a 1.48× wall-time speedup on a Si₆₄ diamond benchmark (2 OpenMP threads vs. the single-threaded baseline) while reproducing the total energy to the reported precision (−6965.09274 eV).

Changes

P0-1 — OpenMP parallel for simd (10 grid loops)

Ten element-wise loops across before_opt, update_potential, optimize, update_rho, after_opt, and cal_energy are parallelized with #pragma omp parallel for simd. All pragmas are guarded by #ifdef _OPENMP so the code compiles and runs correctly without OpenMP.

| # | Function | Loop | Directive |
|---|---|---|---|
| 1 | before_opt | φ init (uniform) | #pragma omp parallel for simd |
| 2 | before_opt | φ init (from file) | #pragma omp parallel for simd |
| 3 | before_opt | Fused zeroing (3 arrays) | #pragma omp parallel for simd |
| 4 | update_potential | V_eff → dE/dφ copy | #pragma omp parallel for simd |
| 5 | update_potential | dL/dφ computation | parallel for simd / target teams |
| 6 | optimize | ptemp_phi/rho init | #pragma omp parallel for simd |
| 7 | update_rho | φ/ρ update | #pragma omp parallel for simd |
| 8 | after_opt | rho_save copy | #pragma omp parallel for simd |
| 9 | after_opt | ML data extraction | #pragma omp parallel for simd |
| 10 | cal_energy | Explicit reduction (replaces BLAS dot) | #pragma omp parallel for simd reduction(+:local_sum) |
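
As an illustration of the pattern in the table, here is a minimal sketch of a guarded element-wise loop and a guarded explicit reduction. The helper names `copy_grid` and `dot_grid` are hypothetical and are not taken from the PR; only the directives and the `local_sum` name follow the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: element-wise copy over the real-space grid.
void copy_grid(const double* src, double* dst, int nrxx)
{
    assert(nrxx >= 0);
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        dst[ir] = src[ir];
    }
}

// Hypothetical helper: dot product written as an explicit reduction,
// in the spirit of loop 10 (cal_energy).
double dot_grid(const double* a, const double* b, int nrxx)
{
    assert(nrxx >= 0);
    double local_sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for simd reduction(+ : local_sum)
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        local_sum += a[ir] * b[ir];
    }
    return local_sum;
}
```

Without OpenMP the pragmas are compiled out and both functions run serially. With OpenMP, the reduction sums partial results per thread, so for general data the last bits of the result can differ from a serial or BLAS dot product.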

P0-opt — Persistent line-search buffer

optimize() previously called new double[nrxx] / delete[] for ptemp_phi on every SCF iteration. A persistent buffer ptemp_phi_persistent_ is now allocated once in allocate_array() and reused, eliminating heap-allocation contention under multithreading.
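
The idea can be sketched as follows. This is not the PR's code: the class `LineSearchSketch` is a hypothetical stand-in for the solver, and it assumes a `std::vector` as backing storage, whereas the PR may manage a raw pointer. Only the names `allocate_array`, `optimize`, `ptemp_phi`, and `ptemp_phi_persistent_` come from the description.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the solver class; only buffer handling is shown.
class LineSearchSketch
{
  public:
    // Called once, analogous to allocate_array() in the description.
    void allocate_array(int nrxx)
    {
        assert(nrxx >= 0);
        ptemp_phi_persistent_.assign(nrxx, 0.0);
    }

    // Called every iteration: reuses the buffer, so no new/delete happens here.
    double* optimize(const double* phi)
    {
        const int nrxx = static_cast<int>(ptemp_phi_persistent_.size());
        double* ptemp_phi = ptemp_phi_persistent_.data();
        for (int ir = 0; ir < nrxx; ++ir)
        {
            ptemp_phi[ir] = phi[ir];
        }
        return ptemp_phi;
    }

  private:
    std::vector<double> ptemp_phi_persistent_; // allocated once, reused
};
```

Repeated calls to `optimize()` return the same storage, which is the property the persistent buffer is meant to provide.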

P1-1 — GPU-target offload for dL/dφ

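The dL/dφ element-wise loop in update_potential selects its directive at compile time through a three-tier fallback (target offload, host threads, or serial when OpenMP is disabled):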

#if defined(_OPENMP) && _OPENMP >= 201811
#pragma omp target teams distribute parallel for simd  // GPU
#elif defined(_OPENMP)
#pragma omp parallel for simd                          // host threads
#endif
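
The snippet above does not show how the arrays reach the device. Below is a minimal sketch of the same three-tier selection applied to a complete loop, assuming explicit `map` clauses for the input and output arrays. The function `scale_grid` and its arithmetic are hypothetical placeholders for the dL/dφ computation, and whether the PR maps data this way is not stated.

```cpp
#include <cassert>
#include <vector>

// Hypothetical kernel standing in for the dL/dphi loop: out[ir] = scale * in[ir].
void scale_grid(const double* in, double* out, double scale, int nrxx)
{
    assert(nrxx >= 0);
#if defined(_OPENMP) && _OPENMP >= 201811
    // Tier 1: target offload; the map clauses (an assumption of this sketch)
    // copy `in` to the device and `out` back to the host.
#pragma omp target teams distribute parallel for simd map(to : in[0:nrxx]) map(from : out[0:nrxx])
#elif defined(_OPENMP)
    // Tier 2: host threads.
#pragma omp parallel for simd
#endif
    // Tier 3: plain serial loop when OpenMP is disabled.
    for (int ir = 0; ir < nrxx; ++ir)
    {
        out[ir] = scale * in[ir];
    }
}
```

Note that the `_OPENMP` value only reports the specification version the compiler implements; it does not indicate that a GPU is present. When no offload device is available, a target region normally executes on the host.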

P2-2 — SIMD + precomputation

  • All 10 loops include simd in their directive so the compiler can vectorize them (for example, AVX2 packed FMA when the build flags enable it).
  • update_rho: cos(theta) and sin(theta) hoisted outside the inner loop.
  • before_opt: nelec / omega division hoisted to const double rho0.
  • before_opt: three separate ZEROS() calls fused into a single zeroing loop.
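
A rough sketch of the hoisting in update_rho, under the assumption that the update has the form φ ← φ·cos θ + d·sin θ followed by ρ = φ². The function name `update_rho_sketch`, the parameter names, and the exact update formula are illustrative assumptions and are not copied from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical update: phi <- phi*cos(theta) + dir*sin(theta), rho <- phi^2.
void update_rho_sketch(double* phi, const double* dir, double* rho, double theta, int nrxx)
{
    assert(nrxx >= 0);
    // Evaluate the trigonometric functions once instead of once per grid point.
    const double cos_t = std::cos(theta);
    const double sin_t = std::sin(theta);
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        phi[ir] = phi[ir] * cos_t + dir[ir] * sin_t;
        rho[ir] = phi[ir] * phi[ir];
    }
}
```

With the loop-invariant values held in `const` locals, the loop body is reduced to multiplies and adds, which is the form a compiler can most readily vectorize.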

Files Modified

| File | Δ | Description |
|---|---|---|
| source/source_esolver/esolver_of.cpp | +134/−36 | Core optimizations |
| source/source_esolver/esolver_of.h | +1 | ptemp_phi_persistent_ member |
| source/source_esolver/esolver_of_tool.cpp | +8 | Persistent buffer allocation |

Benchmark

System: Si₆₄ diamond 2×2×2 supercell, ecutwfc = 100 Ry, WT KEDF, 2 MPI processes
Hardware: WSL2 / i5-11260H (4C8T) / 11 GB RAM

| Version | OMP threads | Wall time | Speedup vs. ORIG (OMP=1) |
|---|---|---|---|
| ORIG (baseline) | 1 | 316 s | 1.00× |
| ORIG | 4 | 319 s | 0.99× |
| OPT | 1 | 295 s | 1.07× |
| OPT | 2 | 213 s | 1.48× |
| OPT | 4 | 223 s | 1.42× |

All runs yield identical total energy (−6965.09274 eV).

Notes

  • P2-1 (multi-trial line search) was prototyped but reverted — changing the optimization path broke exact reproducibility with existing reference data. It is a valid future enhancement that would require updating test references.
  • P1-2 (non-blocking MPI) and P3-1 (FP32-mixed KEDF) are out of scope for this PR as they require changes in module_pw and module_ofdft respectively.
  • All optimizations use #ifdef _OPENMP guards: the code compiles and runs correctly with -DUSE_OPENMP=OFF.

- P0-1: Add #pragma omp parallel for simd to 10 grid loops across
  before_opt, update_potential, optimize, update_rho, after_opt, cal_energy.
  All guarded by #ifdef _OPENMP for serial fallback.
- P0-opt: Replace per-iteration new/delete in optimize() with persistent
  buffer ptemp_phi_persistent_, allocated once in allocate_array().
- P1-1: Three-tier version-adaptive offload for dL/dφ computation
  (_OPENMP>=201811→GPU target, <201811→host parallel, none→serial).
- P2-2: Precompute cos/sin/rho0 outside inner loops; fuse 3 ZEROS calls
  into single grid loop to reduce loop overhead.

Modified files:
  source/source_esolver/esolver_of.cpp       (+134/-36)
  source/source_esolver/esolver_of.h         (+1, ptemp_phi_persistent_)
  source/source_esolver/esolver_of_tool.cpp  (+8, allocate persistent buffer)