Code-Agnostic Performance Characterisation and Enhancement
Ben Menadue
Academic Consultant
nci.org.au
@NCInews
Who Uses the NCI?
• NCI has a large user base
– 1000s of users across 100s of projects
• These projects encompass almost every research area
– physical sciences
– Earth sciences
– engineering
– mathematics
– finance
– social science
• Correspondingly, there is a huge variation in backgrounds and experience
– some are programmers – can optimise their algorithm and code to suit the machine
– most just run pre-packaged software – no control over the source
Performance Characterisation for Beginners
• If source code is available, can instrument and profile in the usual fashion.
– For less experienced users, we often walk them through this and help them analyse the results.
• What about for pre-built packages?
– Use an LD_PRELOAD to catch and log e.g. MPI calls
• We provide several such tools:
– IPM
– mpiP
– perf (not an LD_PRELOAD, but still doesn’t require recompilation)
• IPM is our tool of choice:
– easy to use: module load openmpi ipm
– interfaces with PAPI for hardware counters
– NCI patches for message binning, rounding off, suspend-resume…
IPM Profile of CCAM
• Performance of CCAM on Raijin was not what we expected – slower than on Vayu!
• Profiled a run using IPM to see what was going on…
Performance Improvement in CCAM
• What can we do to improve the performance?
• Standard software in use by many researchers.
– Can’t change the algorithm or code.
• Need a different strategy to improve performance.
• Work with the communication and system libraries instead.
• IPM profile shows huge overhead coming from MPI calls.
– Mellanox Accelerators
• Messaging Accelerator (MXM)
– improves message passing by using extra Mellanox hardware features
• Fabric Collective Accelerator (FCA)
– offloads collectives from the processes to the interconnect hardware
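As an illustration only: Open MPI of that era exposed these accelerators through MCA parameters. The exact parameter names differ between releases (check `ompi_info` on your system), and `./model.exe` is a placeholder for the application binary:

```shell
# Illustrative sketch -- MCA parameter names vary between Open MPI
# releases.  Select the MXM message layer and enable FCA collective
# offload, then launch the (placeholder) application:
mpirun --mca pml cm --mca mtl mxm \
       --mca coll_fca_enable 1 \
       ./model.exe
```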
Performance Improvement in CCAM – Without Changing Code
• CCAM execution time; avg. time is based on a varied number of runs:

Date            Change                               Avg. Time (s)   Stdev
July 2013       Original results                     1664.2          148.36
November 2013   Mellanox Accelerators                 827.8           24.6
March 2014      Kernel updates and tweaks, MXM, FCA   750.7            4
April 2014      Latest result with HT                 700.07           2.14
Performance Improvement in CCAM
• The operating system can also impact performance
– Moved to the latest CentOS 6 kernel and operating system
• new task scheduling, memory management, …
– Enabled hyperthreading
• allow operating system tasks to run on separate hardware threads – reduce impact and jitter
Application Software Stack
• General package installations are made on request to a central location, /apps.
– Lustre filesystem, mounted on all nodes.
• We typically build these so they pass all their tests.
– This normally means default optimisation and gcc.
– Fortran 90/03/08 modules and libraries are built using both gfortran and ifort since the ABI is different.
• While this gives quite reasonable performance, some users need or want more.
• Working closely with a developer of Fluidity to compile a custom software stack:
– Lots of dependencies: MPI, PETSc, Metis, Scotch, Zoltan, Python, GMSH, …
– All built using latest Intel compilers and OpenMPI with very high optimisation settings.
– Found several compiler bugs – reported to Intel and several already fixed.
• 20% improvement in runtime using custom software stack!
– Still using a debugging build of PETSc – known to have significant performance impact.
Summary
• Even without changing a line of source code, there are still lots of performance enhancements available!
• Highly optimised software stack for best serial performance.
• Using latest kernel and system libraries can reduce impact from operating system.
• Hyperthreading reduces jitter and impact from O/S tasks.
• Mellanox Accelerators can significantly improve MPI performance – especially for collectives.