This project started as part of a bachelor thesis.
It contains code for n-body simulations that were used to analyze the runtime behavior of two widely known n-body algorithms: the naive approach and the Barnes-Hut algorithm.
Both algorithms are implemented using SYCL and can be executed in parallel on GPUs and CPUs.
- Parallel SYCL implementation of the naive algorithm
- Parallel SYCL implementation of the Barnes-Hut algorithm
- Support for CPUs as well as GPUs from NVIDIA and AMD (not tested on Intel GPUs)
- ParaView output for visualization
The project supports two different SYCL implementations: AdaptiveCpp and DPC++. The project requires Linux.
After the installation of one of the two supported SYCL implementations, clone this repository and create a build directory:
git clone https://github.com/TimThuering/N-Body-Simulation.git
cd N-Body-Simulation
mkdir build
cd build
AdaptiveCpp is supported with the CUDA, ROCm, and OpenMP backends.
Note that since we use sycl::nd_ranges, AdaptiveCpp should be built with the omp.accelerated target enabled in order to achieve acceptable performance on CPUs.
The following commands build the project with AdaptiveCpp for CUDA and OpenMP.
Replace sm_XX with the compute capability of your GPU, e.g., sm_75.
cmake -DUSE_DPCPP=OFF -DHIPSYCL_TARGETS="cuda:sm_XX;omp.accelerated" -DCMAKE_BUILD_TYPE=release ..
make
To build the project for AMD GPUs, replace cuda:sm_XX with hip:gfxXXX and replace gfxXXX according to your AMD GPU.
If you do not wish to build the project with support for CPUs with OpenMP, delete ;omp.accelerated.
For further details, please refer to the AdaptiveCpp documentation.
The project was tested with this AdaptiveCpp commit.
Note: using AdaptiveCpp's new generic target may work, but hasn't been tested.
DPC++ is supported with the CUDA backend. AMD GPUs are only supported with DPC++ when using the naive algorithm.
The following commands build the project with DPC++ for CUDA.
cmake -DUSE_DPCPP=ON -DCMAKE_BUILD_TYPE=release ..
make
The following commands build the project with DPC++ for AMD GPUS.
Make sure that the DEVICE_LIB_PATH environment variable points the the location of <path to amdgcn>/amdgcn/bitcode/
cmake -DUSE_DPCPP=ON -DUSE_DPCPP_AMD=ON -DDPCPP_ARCH="gfxXXX" -DCMAKE_BUILD_TYPE=release ..
make
Replace gfxXXX according to your AMD GPU.
For more information, please refer to the DPC++ documentation.
The project was tested with this DPC++ commit.
The program has several optional and mandatory program arguments.
| Argument | Description | Notes |
|---|---|---|
--file |
Path to a .csv file containing the data for the simulation | - |
--dt |
Width of the time step for the simulation | E.g.: 1h for one hour |
--t_end |
The internal time until the system will be simulated | E.g.: 365d for 365 days or 12y for twelve years |
--vs |
The time step width of the visualization | E.g.: 1d to visualize every day |
--vs_dir |
The top-level output directory for the output files | A separate folder (with a timestamp) that contains all output files will be created in this directory |
--algorithm |
The algorithm to use for the simulation | Either <naive> or <BarnesHut> |
| Argument | Description | Notes |
|---|---|---|
--use_gpus |
Enable / disable execution on GPUs (GPU execution is enabled by default) |
true or false |
--energy |
Enable / disable computation of the energy of the system in each visualized step (disabled by default). The results will be written to the ParaView files. |
true or false, can result into long runtimes with large datasets |
| Argument | Description | Notes |
|---|---|---|
--block_size |
The size of the blocks after which the local memory is updated in the naive algorithm |
Should be a power of 2, e.g., 128 |
--opt_stage |
Selects the optimization stage of the naive algorithm (default: 2) |
Possible values:
|
| Argument | Description | Notes |
|---|---|---|
--theta |
The theta-value which determines the accuracy of the Barnes-Hut algorithm |
Smaller values like 0.2 mean worse performance but better accuracy Larger values like 1 result into better performance but worse accuracy |
--num_wi_octree |
Determines the number of work-items used to build the octree |
- |
--num_wi_top_octree |
Determines the number of work-items used to build the top of the octree |
- |
--num_wi_AABB |
Determines the number of work-items used to calculate the AABB |
- |
--num_wi_com |
Determines the number of work-items used to calculate the center of mass |
- |
--max_level_top_octree |
Determines the maximum level to which the top of the octree gets build |
- |
--wg_size_barnes_hut |
Determines the work-group size of the acceleration kernel | - |
--sort_bodies |
Enable / disable sorting of the bodies according to their in-order position in the octree (enabled by default) |
true or false |
--storage_size_param |
Scales the amount of memory for the octree data structures |
Use only if you encounter problems with specific datasets |
--stack_size_param |
Scales the amount of memory for the stack used to traverse the octree |
Use only if you encounter problems with specific datasets |
| Option | Description | Notes |
|---|---|---|
-DUSE_DPCPP |
Enable / disable the usage of DPC++ | ON (Default) or OFF |
-DUSE_OCTREE_TOP_DOWN_SYNC |
Use the top-down synchronized approach without subtrees for the tree creation | Default OFF |
-DENABLE_TESTS |
Enable building of tests | Only supported with DPC++ |
-DUSE_DPCPP_AMD |
Use DPC++ with AMD GPUs | - |
-DDPCPP_ARCH |
Specify the GPU architecture for DPC++ when using AMD GPUs | Not recommended when using NVIDIA GPUs |
The input data for the simulation has to contain information about all bodies of the system. The columns of the .csv file have to contain the following information about each body, starting in the second line:
| id | name of body | name for class of body | mass of body in kg | x position | y position | z position | x velocity | y velocity | z velocity |
|---|
Use a , as the separator. The unit of length has to be the astronomical unit (AU).
The dataset converter can be used to generate datasets in this format.
The following command starts a simulation using the naive algorithm.
./N_Body_Simulation --file=<path to simulation data> --dt=1h --t_end=365d --vs=1d --vs_dir=<path to output directory> --algorithm=naive
The following command starts a simulation using the Barnes-Hut algorithm, specifying work-item counts for the tree creation and the theta value explicitly.
./N_Body_Simulation --file=<path to simulation data> --dt=1h --t_end=365d --vs=1d --vs_dir=<path to output directory> --algorithm=BarnesHut --theta=0.6 --num_wi_top_octree=640 --num_wi_octree=640
The implementation of the naive algorithm is based on the work by Nyland et al. The implementation of the Barnes-Hut algorithm is based on the work by Burtscher et al.