GPU Profiling
For the CPU version of the code we have VTune for profiling, as well as our tasking plots. However, for the GPU version of the code we need different software to profile the MegaKernel™ and improve its performance.
Getting nvvp
The GUI profiling tool can be downloaded here. The NVIDIA Visual Profiler isn't usually linked within your system (so typing nvvp in the terminal, or looking in your menus, isn't going to help), meaning you will need to look up the install location in the Quick Start Guide, or you can use the cudatoolkit module on Piz Daint.
Compiling
You can profile tests and the main kernel using the cudatoolkit module on Piz Daint. Once the module has loaded you will need to recompile your code (check that the -lineinfo flag is set for nvcc so that it collects profiling information). Then run nvprof ./binary; this will run your code and give you some basic profiling output.
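As a sketch only (the file names, optimisation level, and architecture flag below are assumptions, not the project's actual build settings), a Makefile rule with -lineinfo enabled might look like:

```make
# -lineinfo embeds source-line information in the device code so that
# nvprof/nvvp can attribute samples and metrics to individual lines.
NVCC      = nvcc
NVCCFLAGS = -O3 -lineinfo -arch=sm_60   # sm_60 targets the P100 GPUs on Piz Daint

binary: main.cu
	$(NVCC) $(NVCCFLAGS) -o $@ $<
```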
test_27_cells
To get this to work with test_27_cells you will need to remove the memory clear at the end of the test, as there are some as-yet undiagnosed problems with it.
nvprof
Chuck these in a shell script (and of course replace <Binary> <Opt> with your binary and its options).
nvprof --export-profile timeline.prof <Binary> <Opt>
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <Binary> <Opt>
nvprof --source-level-analysis pc_sampling -o pcsampling.prof <Binary> <Opt>
nvprof --analysis-metrics -o analysis_metrics.prof <Binary> <Opt>
You can then use it as a submission script with sbatch, or call
salloc -C gpu --res=<res> --time=H:MM:SS
to get allocated time on an actual node with a GPU.
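For example (keeping <res> as a placeholder for your reservation name), an interactive session might look like:

```shell
# Request an interactive allocation on a GPU node; <res> stays a placeholder.
salloc -C gpu --res=<res> --time=0:30:00
# Inside the allocation, launch the profiling on the compute node
# (profile.sh is a hypothetical wrapper holding the nvprof commands above):
srun ./profile.sh
```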
Note that if you are planning on running the profiler again within the same directory you will have to either pass -f to nvprof or delete your .prof files.
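As a quick illustration of the cleanup step (the file names below match the wrapper commands above; use nvprof -f if you'd rather overwrite in place):

```shell
# Simulate stale output left over from a previous profiling run:
touch timeline.prof metrics.prof pcsampling.prof analysis_metrics.prof
# Remove all .prof files before profiling again:
rm -f ./*.prof
```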
Using nvvp
You will need to download the .prof files from the cluster and store them somewhere on your own machine. They can then be analysed using the GUI tool from NVIDIA, nvvp.
More information on running the profiler can be found in these manual pages, but to get you started: use File -> Import, choose nvprof, and select the files. See below for an example.
Selecting your data files:
Example timeline once the data has loaded:
PC Sampling
PC Sampling tells us where we are spending most of the time in our code, on a line-by-line basis. Here's a quick video that shows how to get to the PC Sampling area in nvvp. You will have to link your source code to nvvp, so ensure that you have a local copy and that you choose the correct file. Otherwise, you will have to click on the pencil-shaped button to unlink and re-link your code.
At the moment the majority of the time in our code is spent waiting for data to come down from global memory to registers, which we hope to improve with caching.
Realtime Profiling on Piz Daint
To analyse the data you can either use this NVIDIA tool to do things in real time on your local machine, or (probably preferably) just generate the profiles on Piz Daint with nvprof and copy them to your local machine.
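For example, from your local machine (the hostname alias and remote path are illustrative, not real locations):

```shell
# Pull the generated profiles down from the cluster into the current directory;
# replace the host alias and run directory with your own.
scp daint:/path/to/run-directory/*.prof .
```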
API
Testing a single task
On branch cuda_test, you can edit and compile a test that runs a single task. To do so, copy the task that you wish to test into the tests/testcuda.cu file and update do_test_pair or do_test. You will also need to switch runPair on or off in the main function. To compile, run (the script is written for Daint)
./make_cuda.sh
and then submit the following script:
#!/bin/bash -l
#SBATCH --job-name=job
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --partition=normal
#SBATCH --constraint=gpu
#SBATCH --res=eurohack
nvprof --export-profile timeline.prof ./testcuda -p 8 -r 10
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof ./testcuda -p 8 -r 10
nvprof --source-level-analysis pc_sampling -o pcsampling.prof ./testcuda -p 8 -r 10
nvprof --analysis-metrics -o analysis_metrics.prof ./testcuda -p 8 -r 10