perf: OpenMP + SIMD optimization for OFDFT esolver grid loops by Xiao-Han666 · Pull Request #7495 · deepmodeling/abacus-develop

Xiao-Han666 · 2026-06-19T13:16:11Z

perf: OpenMP + SIMD optimization for OFDFT esolver grid loops

Summary

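Add OpenMP multithreading and SIMD vectorization to the ESolver_OF real-space grid loops. On a Si₆₄ diamond benchmark, the optimized build with 2 OMP threads runs 1.48× faster in wall time than the original single-threaded build, and every run reports the same total energy (−6965.09274 eV).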

Changes

P0-1 — OpenMP parallel for simd (10 grid loops)

Ten element-wise loops across before_opt, update_potential, optimize, update_rho, after_opt, and cal_energy are parallelized with #pragma omp parallel for simd. All pragmas are guarded by #ifdef _OPENMP so the code compiles and runs correctly without OpenMP.

| # | Function | Loop | Directive |
|---|---|---|---|
| 1 | `before_opt` | φ init (uniform) | `#pragma omp parallel for simd` |
| 2 | `before_opt` | φ init (from file) | `#pragma omp parallel for simd` |
| 3 | `before_opt` | Fused zeroing (3 arrays) | `#pragma omp parallel for simd` |
| 4 | `update_potential` | V_eff → dE/dφ copy | `#pragma omp parallel for simd` |
| 5 | `update_potential` | dL/dφ computation | `parallel for simd` / `target teams` |
| 6 | `optimize` | ptemp_phi/rho init | `#pragma omp parallel for simd` |
| 7 | `update_rho` | φ/ρ update | `#pragma omp parallel for simd` |
| 8 | `after_opt` | rho_save copy | `#pragma omp parallel for simd` |
| 9 | `after_opt` | ML data extraction | `#pragma omp parallel for simd` |
| 10 | `cal_energy` | Explicit reduction (replaces BLAS dot) | `#pragma omp parallel for simd reduction(+:local_sum)` |
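As a rough illustration of the pattern in the table, the sketch below shows a guarded element-wise loop and a guarded reduction. The function names `copy_potential` and `grid_dot` and the `std::vector` arguments are hypothetical and chosen for this example; they are not taken from esolver_of.cpp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: element-wise copy of V_eff into dE/dphi (pattern of loop 4).
void copy_potential(const std::vector<double>& v_eff, std::vector<double>& dEdphi)
{
    assert(v_eff.size() == dEdphi.size());
    const int nrxx = static_cast<int>(v_eff.size());
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        dEdphi[ir] = v_eff[ir];
    }
}

// Hypothetical helper: sum_i a[i] * b[i] * dV with an explicit reduction
// (pattern of loop 10, which replaces a BLAS dot product).
double grid_dot(const std::vector<double>& a, const std::vector<double>& b, double dV)
{
    assert(a.size() == b.size());
    const int nrxx = static_cast<int>(a.size());
    double local_sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for simd reduction(+:local_sum)
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        local_sum += a[ir] * b[ir];
    }
    return local_sum * dV;
}
```

Without OpenMP the pragmas are skipped by the preprocessor and both loops run serially. With OpenMP, a reduction generally changes the order of the floating-point additions, so the last bits of the sum can depend on the thread count; this is worth checking against the reference data whenever a reduction is introduced.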

P0-opt — Persistent line-search buffer

optimize() previously called new double[nrxx] / delete[] for ptemp_phi on every SCF iteration. A persistent buffer ptemp_phi_persistent_ is now allocated once in allocate_array() and reused, eliminating heap-allocation contention under multithreading.
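The idea can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `LineSearchBuffer` and the method `begin_trial` are hypothetical, and `std::vector` is used here for brevity, whereas the PR may manage the buffer as a raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the solver class; only the buffer handling is shown.
class LineSearchBuffer
{
  public:
    // Called once during setup: allocate the scratch array for nrxx grid points.
    void allocate_array(std::size_t nrxx)
    {
        ptemp_phi_persistent_.assign(nrxx, 0.0);
    }

    // Called at every optimization step: reuse the scratch array instead of
    // calling new double[nrxx] / delete[] each time.
    double* begin_trial(const std::vector<double>& phi)
    {
        assert(phi.size() == ptemp_phi_persistent_.size());
        for (std::size_t ir = 0; ir < phi.size(); ++ir)
        {
            ptemp_phi_persistent_[ir] = phi[ir];
        }
        return ptemp_phi_persistent_.data();
    }

  private:
    std::vector<double> ptemp_phi_persistent_; // allocated once, reused afterwards
};
```

Two consecutive calls to `begin_trial` return the same address, which shows that no allocation takes place inside the iteration.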

P1-1 — GPU-target offload for dL/dφ

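The dL/dφ element-wise loop in update_potential uses a three-tier fallback selected at compile time from the _OPENMP version macro: device offload when the compiler reports OpenMP 5.0 or newer (_OPENMP >= 201811), host threads for older OpenMP versions, and plain serial code when OpenMP is disabled: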

```cpp
#if defined(_OPENMP) && _OPENMP >= 201811
#pragma omp target teams distribute parallel for simd  // GPU
#elif defined(_OPENMP)
#pragma omp parallel for simd                          // host threads
#endif
// no pragma is emitted without OpenMP: the loop runs serially
```
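A possible complete form of this loop is sketched below. It rests on two assumptions that the description above does not confirm: that dL/dφ = dE/dφ − 2μφ (the usual Lagrangian derivative when ρ = φ² and μ is the chemical potential), and that the arrays are passed as raw pointers. The function name `compute_dLdphi` and the `map` clauses are also illustrative; explicit mapping is included because a discrete device generally cannot dereference host pointers without it.

```cpp
#include <cassert>

// Hypothetical helper: dLdphi[ir] = dEdphi[ir] - 2 * mu * phi[ir] on nrxx grid points.
void compute_dLdphi(const double* dEdphi, const double* phi, double mu,
                    double* dLdphi, int nrxx)
{
    assert(nrxx >= 0);
#if defined(_OPENMP) && _OPENMP >= 201811
    // Tier 1: offload to the default device; runs on the host if no device is available.
#pragma omp target teams distribute parallel for simd \
    map(to: dEdphi[0:nrxx], phi[0:nrxx]) map(from: dLdphi[0:nrxx])
#elif defined(_OPENMP)
    // Tier 2: host threads plus SIMD.
#pragma omp parallel for simd
#endif
    // Tier 3: without OpenMP this is a plain serial loop.
    for (int ir = 0; ir < nrxx; ++ir)
    {
        dLdphi[ir] = dEdphi[ir] - 2.0 * mu * phi[ir];
    }
}
```

Mapping the arrays on every call copies them to and from the device each time, so for a loop this small the transfer cost may outweigh the gain unless the data already lives on the device.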

P2-2 — SIMD + precomputation

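- All 10 loops carry the simd clause, which asks the compiler to vectorize them (for example with AVX2 packed instructions, including FMA where the loop body multiplies and adds).
- update_rho: cos(theta) and sin(theta) are computed once, outside the grid loop.
- before_opt: the nelec / omega division is hoisted into const double rho0.
- before_opt: three separate ZEROS() calls are fused into a single zeroing loop.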
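The three precomputation items might look roughly like this. The sketch assumes that the line-search update rotates φ toward the search direction by an angle theta and then sets ρ = φ², and that the uniform initial guess is φ = sqrt(nelec / omega); the function names and signatures are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical update_rho: the trigonometric factors are evaluated once, not per grid point.
void update_rho_sketch(std::vector<double>& phi, const std::vector<double>& direction,
                       double theta, std::vector<double>& rho)
{
    assert(phi.size() == direction.size() && phi.size() == rho.size());
    const double cos_t = std::cos(theta);
    const double sin_t = std::sin(theta);
    const int nrxx = static_cast<int>(phi.size());
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        phi[ir] = phi[ir] * cos_t + direction[ir] * sin_t;
        rho[ir] = phi[ir] * phi[ir];
    }
}

// Hypothetical uniform initialization: the division is hoisted into rho0.
void init_uniform_sketch(double nelec, double omega, std::vector<double>& phi,
                         std::vector<double>& rho)
{
    assert(omega > 0.0 && phi.size() == rho.size());
    const double rho0 = nelec / omega;
    const double phi0 = std::sqrt(rho0);
    const int nrxx = static_cast<int>(phi.size());
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        rho[ir] = rho0;
        phi[ir] = phi0;
    }
}

// Hypothetical fused zeroing: one pass over the grid instead of three.
void fused_zero(std::vector<double>& a, std::vector<double>& b, std::vector<double>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    const int nrxx = static_cast<int>(a.size());
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (int ir = 0; ir < nrxx; ++ir)
    {
        a[ir] = 0.0;
        b[ir] = 0.0;
        c[ir] = 0.0;
    }
}
```

Hoisting loop-invariant values by hand keeps the loop body free of function calls, which tends to make vectorization easier for the compiler to prove safe.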

Files Modified

| File | Δ | Description |
|---|---|---|
| `source/source_esolver/esolver_of.cpp` | +134/−36 | Core optimizations |
| `source/source_esolver/esolver_of.h` | +1 | `ptemp_phi_persistent_` member |
| `source/source_esolver/esolver_of_tool.cpp` | +8 | Persistent buffer allocation |

Benchmark

System: Si₆₄ diamond 2×2×2 supercell, ecutwfc = 100 Ry, WT KEDF, 2 MPI processes
Hardware: WSL2 / i5-11260H (4C8T) / 11 GB RAM

| Version | OMP threads | Wall time | Speedup vs ORIG OMP=1 |
|---|---|---|---|
| ORIG (baseline) | 1 | 316 s | 1.00× |
| ORIG | 4 | 319 s | 0.99× |
| OPT | 1 | 295 s | 1.07× |
| OPT | 2 | 213 s | 1.48× |
| OPT | 4 | 223 s | 1.42× |

All runs yield identical total energy (−6965.09274 eV).

Notes

- P2-1 (multi-trial line search) was prototyped but reverted: changing the optimization path broke exact reproducibility with existing reference data. It is a valid future enhancement that would require updating test references.
- P1-2 (non-blocking MPI) and P3-1 (FP32-mixed KEDF) are out of scope for this PR, as they require changes in module_pw and module_ofdft respectively.
- All optimizations use #ifdef _OPENMP guards: the code compiles and runs correctly with -DUSE_OPENMP=OFF.

- P0-1: Add #pragma omp parallel for simd to 10 grid loops across before_opt, update_potential, optimize, update_rho, after_opt, cal_energy. All guarded by #ifdef _OPENMP for serial fallback.
- P0-opt: Replace per-iteration new/delete in optimize() with persistent buffer ptemp_phi_persistent_, allocated once in allocate_array().
- P1-1: Three-tier version-adaptive offload for dL/dφ computation (_OPENMP >= 201811 → GPU target, < 201811 → host parallel, none → serial).
- P2-2: Precompute cos/sin/rho0 outside inner loops; fuse 3 ZEROS calls into single grid loop to reduce loop overhead.

Modified files:
- source/source_esolver/esolver_of.cpp (+134/-36)
- source/source_esolver/esolver_of.h (+1, ptemp_phi_persistent_)
- source/source_esolver/esolver_of_tool.cpp (+8, allocate persistent buffer)

Xiao-Han666 added 2 commits June 19, 2026 20:20

Removed unnecessary files

9ca0ce4

mohanchen added the project_learning label Jun 19, 2026

Xiao-Han666 wants to merge 2 commits into deepmodeling:develop from Xiao-Han666:develop
