For the CPU version of the code we have VTune for profiling, as well as our tasking plots. However, for the GPU version of the code we need different software to profile the MegaKernel™ and improve its performance.
The GUI profiling tool can be downloaded here. The NVIDIA Visual Profiler isn't usually linked within your system (so typing `nvvp` in the terminal, or looking in your menus, isn't going to help), meaning you will need to look up the install location here in the Quick Start Guide, or you can use the `cudatoolkit` module on Piz Daint.
You can profile tests and the main kernel using the `cudatoolkit` module on Piz Daint. Once the module has loaded you will need to recompile your code (check that the `-lineinfo` flag is passed to `nvcc` so that source-line profiling information is available). Then run `nvprof ./binary`; this will run your code and give you some basic profiling output.
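The whole cycle on Daint might look something like the following sketch. The module name matches the text above, but the build invocation is an assumption — substitute your project's actual build command.

```shell
# Load the CUDA toolkit so nvcc and nvprof are available
module load cudatoolkit

# Recompile with -lineinfo so nvprof can correlate samples with source lines
# (illustrative only - replace with your project's real build invocation)
nvcc -O3 -lineinfo -o binary main.cu

# Run under nvprof for a basic summary of kernel and API times
nvprof ./binary
```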
To get this to work with `test_27_cells` you will need to remove the memory clear at the end of the test, as there are some as-yet undiagnosed problems with it.
Chuck these in a shell script (and of course replace `<Binary> <Opt>` with your binary and its options):

```shell
nvprof --export-profile timeline.prof <Binary> <Opt>
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <Binary> <Opt>
nvprof --source-level-analysis pc_sampling -o pcsampling.prof <Binary> <Opt>
nvprof --analysis-metrics -o analysis_metrics.prof <Binary> <Opt>
```
You can then use it as a submission script with `sbatch`, or call `salloc -C gpu --res=<res> --time=H:MM:SS` to get allocated time on an actual node with a GPU.
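For an interactive session, the pattern is roughly as follows. The script name `profile.sh` is a placeholder for whatever shell script you put the `nvprof` commands in; the reservation name and walltime are also placeholders.

```shell
# Request an interactive allocation on a GPU node (reservation name is a placeholder)
salloc -C gpu --res=<res> --time=0:30:00

# Once the allocation is granted, run the profiling script on the compute node
srun ./profile.sh
```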
Note that if you are planning on running the profiler again within the same directory you will have to either pass `-f` to `nvprof` (so it overwrites the existing output files) or delete your old profile files first.
You will need to download these profile files from the cluster and store them somewhere on your own machine. They can then be analysed using the GUI tool from NVIDIA, `nvvp`.
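Copying the profiles down can be done with `scp`, for example as below. The username and remote path are placeholders — adjust them to your own account and run directory.

```shell
# Copy the generated .prof files from Daint to the current local directory
# (username and remote path are placeholders)
scp user@daint.cscs.ch:/path/to/run/dir/*.prof .
```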
More information on running the profiler can be found in these manual pages, but to get you started: use File -> Import, choose nvprof, and select the files. See below for an example.
PC Sampling tells us where we are spending most of the time in our code, on a line-by-line basis. Here's a quick video that shows how to get to the PC Sampling area in `nvvp`. You will have to link your source code in `nvvp`, so ensure that you have a local copy and that you choose the correct file. Otherwise, you will have to click on the pencil-shaped button to unlink and re-link your code.
At the moment the majority of the time in our code is spent waiting for data to come down from global memory into registers, which we hope to improve with caching.
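As an illustration of the kind of caching meant here — a generic sketch, not the actual MegaKernel code, with all names invented for the example — data that several threads in a block repeatedly re-read from global memory can be staged once into shared memory:

```cuda
// Generic sketch: stage frequently re-read data into shared memory once
// per block, instead of each thread re-fetching it from global memory.
// Assumes blockDim.x <= 256; all names here are illustrative.
__global__ void pair_interactions(const float *x, float *out, int n) {
  __shared__ float x_cache[256];            // one tile per block
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;

  // Each thread loads one element into shared memory...
  x_cache[tid] = (i < n) ? x[i] : 0.f;
  __syncthreads();                          // ...then the tile is visible to all

  // Subsequent reads hit shared memory (low latency) rather than DRAM.
  if (i < n) {
    float acc = 0.f;
    for (int j = 0; j < blockDim.x; ++j)
      acc += x_cache[tid] * x_cache[j];
    out[i] = acc;
  }
}
```

The win is that each element of `x` is read from global memory once per block rather than once per thread; the repeated accesses in the inner loop are served from shared memory instead.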
Realtime Profiling on Piz Daint
To analyse the data on your local machine you can use either this NVIDIA tool to work in real time or (probably preferably) just generate the profiles on Piz Daint with `nvprof` and copy them to your local machine.
Testing a single task
On the `cuda_test` branch, you can edit and compile a test that runs a single task. To do so, copy the task that you wish to test into the `tests/testcuda.cu` file and update `do_test_pair` or `do_test`. You will also need to switch `runPair` on or off in `main`. To compile, do (script written for Daint)
and run the following script
```shell
#!/bin/bash -l
#SBATCH --job-name=job
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --partition=normal
#SBATCH --constraint=gpu
#SBATCH --res=eurohack

nvprof --export-profile timeline.prof ./testcuda -p 8 -r 10
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof ./testcuda -p 8 -r 10
nvprof --source-level-analysis pc_sampling -o pcsampling.prof ./testcuda -p 8 -r 10
nvprof --analysis-metrics -o analysis_metrics.prof ./testcuda -p 8 -r 10
```