Zhang Qing - GPU Technology Conference

Transcription

Zhang Qing,[email protected]
HPC Application R&D Manager,Inspur
• Inspur-Nvidia GPU Joint Lab Introduction
• Caffe-MPI: a parallel Caffe framework based on a GPU cluster
Inspur-Nvidia GPU Joint Lab Introduction
• Inspur-Nvidia GPU Joint Lab application research directions
– Traditional HPC
– Deep Learning
Field         | Application    | Client                        | Speed-up ratio | Platform
Life Science  | BLASTN         | Beijing Institute of Genomics | 35X (kernel)   | 1 GPU / 1 CPU core
Life Science  | ET             | Institute of Biophysics, CAS  | 48X            | 1 GPU / 1 CPU core
Life Science  | RNA            | —                             | 100X           | 1 GPU / 1 CPU core
CFD           | LBM_LES        | —                             | 8X             | 24 GPU nodes / 24 CPU nodes
Oil & gas     | CSP            | BGP                           | 5X             | 6 GPU nodes / 6 CPU nodes
Oil & gas     | PSTM           | BGP                           | 9X             | 4 GPU + 2 CPU / 2 CPU
Oil & gas     | Scandip        | BGP                           | —              | —
Deep Learning | Caffe          | Qihoo                         | 12.5X          | 16 GPU / 1 GPU
Deep Learning | DNN            | IFLYTEK                       | 13X            | 16 GPU / 1 GPU
Deep Learning | K-means        | Qihoo                         | 35X            | 1 GPU / 1 CPU core
Deep Learning | Neural Network | Qihoo                         | 270X           | 4 GPU / 1 CPU core
• Application: DNN
• Client: IFLYTEK
• Performance: 16 GPU / 1 GPU = 13X
[Figure] Deep learning for speech recognition, applied to: mobile phone, car, business travel query, intelligent customer service.
• Application: neural network
• Client: Qihoo
• Performance: 4 GPU / 1 CPU core = 270X
[Figure] Caffe training time breakdown: ForwardBackward computing 80% (data parallel), weight computing 16% (partly parallelizable), net update 4% (partly parallelizable).
• Caffe has many users and is very popular in China.
• Caffe needs a long training time for big data on a single GPU node.
• Caffe's ForwardBackward computing, weight computing, and net update can all be parallelized on a GPU cluster (see the estimate below for why all three matter).
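As a rough estimate (not from the talk): if only the 80% ForwardBackward share were parallelized and the remaining 20% stayed serial, Amdahl's law would cap the speed-up on 16 GPUs at

\[ S(N) = \frac{1}{(1-p) + p/N}, \quad p = 0.8,\ N = 16 \ \Rightarrow\ S = \frac{1}{0.2 + 0.05} = 4 \]

which is far below the 13X reported later; this is why weight computing and net update must also be parallelized, and communication overlapped with computing.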
• What is Caffe-MPI?
– Developed by Inspur
• Open source: https://github.com/Caffe-MPI/Caffe-MPI.github.io
– Based on the Berkeley Vision and Learning Center (BVLC) single-GPU Caffe version
– A GPU cluster Caffe version
– Supports 16+ GPUs for training
• Based on HPC technology
– Hardware arch: IB + GPU cluster + Lustre
– Software arch: MPI + Pthread + CUDA
• Data parallel on a GPU cluster
GPU Cluster Configuration

GPU master node | Multi GPUs
GPU slave node  | Multi GPUs
Storage         | Lustre
Network         | 56 Gb/s IB
Software        | Linux / CUDA 7.5 / MVAPICH2
• MPI Master-Slave model
– Master process: multiple Pthread threads + CUDA threads
– Slave processes: CUDA threads
Reference: Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, et al. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.
• Master process (process 0)
– Three Pthread groups:
• Read data and send data in parallel
• Weight computing and parameter update
• Parameter communication
• Slave processes
– CPU
• Receive training data from the master process
• Send weight data (GPU-to-GPU)
• Receive new net data (GPU-to-GPU)
– GPU
• ForwardBackward computing
• Slave node
– The number of slave processes = the number of GPUs
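A minimal sketch of this master-slave loop in plain MPI (an illustration, not the actual Caffe-MPI source: forward_backward is a hypothetical stand-in for the CUDA ForwardBackward step, and the real framework layers Pthread groups and GPU RDMA on top of this):

// Illustrative master-slave data-parallel loop (not the Caffe-MPI source).
// Build with: mpic++ sketch.cpp -o sketch && mpirun -np 4 ./sketch
#include <mpi.h>
#include <vector>

const int WEIGHTS = 1024;     // assumed model size for the sketch
const int ITERATIONS = 10;

// Hypothetical stand-in: slaves would run CUDA ForwardBackward here.
void forward_backward(const std::vector<float>& w, std::vector<float>& grad) {
    for (int i = 0; i < WEIGHTS; ++i) grad[i] = 0.001f * w[i];
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<float> weights(WEIGHTS, 1.0f);
    std::vector<float> grad(WEIGHTS, 0.0f), grad_sum(WEIGHTS, 0.0f);

    for (int it = 0; it < ITERATIONS; ++it) {
        // Every slave computes a local gradient on its own data shard;
        // the master's grad stays zero and contributes nothing to the sum.
        if (rank != 0) forward_backward(weights, grad);

        // Weight data flows to the master (in Caffe-MPI: GPU-to-GPU via RDMA).
        MPI_Reduce(grad.data(), grad_sum.data(), WEIGHTS, MPI_FLOAT,
                   MPI_SUM, 0, MPI_COMM_WORLD);

        // The master does the weight computing and net update.
        if (rank == 0)
            for (int i = 0; i < WEIGHTS; ++i)
                weights[i] -= 0.01f * grad_sum[i] / (size - 1);

        // New net data is sent back to all slaves.
        MPI_Bcast(weights.data(), WEIGHTS, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}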
• GPU parallel computing
• Computing & communication run asynchronously in parallel
• Communication optimization
– GPU RDMA: weight data and net data between GPUs

Total Time = max( T(read data + send data), T(ForwardBackward computing + weight computing + net update + net send) )
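The max() holds only when data reading is fully overlapped with computing. A minimal double-buffering sketch of that overlap with std::thread (an assumed simplification of the Pthread pipeline; read_batch and compute_batch are hypothetical stand-ins):

// Double-buffered overlap of data reading and computing (illustrative).
// Build with: g++ -std=c++11 -pthread overlap.cpp
#include <thread>
#include <vector>
#include <cstdio>

const int BATCHES = 8, BATCH = 4096;

// Hypothetical stand-ins for a Lustre read and the GPU compute step.
void read_batch(int id, std::vector<float>& buf) { buf.assign(BATCH, float(id)); }
void compute_batch(const std::vector<float>& buf) { /* ForwardBackward here */ }

int main() {
    std::vector<float> buffers[2] = {std::vector<float>(BATCH),
                                     std::vector<float>(BATCH)};
    read_batch(0, buffers[0]);                       // prime the pipeline
    for (int b = 0; b < BATCHES; ++b) {
        std::thread reader;                          // reads batch b+1 while batch b computes
        if (b + 1 < BATCHES)
            reader = std::thread(read_batch, b + 1, std::ref(buffers[(b + 1) % 2]));
        compute_batch(buffers[b % 2]);
        if (reader.joinable()) reader.join();
    }
    std::printf("done\n");
    return 0;
}

With this structure the per-iteration cost is the larger of the read time and the compute time, not their sum, which is exactly what the Total Time formula above expresses.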
• Speed-up ratio: 16 GPU / 1 GPU = 10.45X
• Scalability efficiency: 65%
• Speed-up ratio: 16 GPU / 1 GPU = 10.74X
• Scalability efficiency: 67%
• Performance speed-up by cuDNN = 21%
• Speed-up ratio: 16 GPU / 1 GPU = 12.66X
• Scalability efficiency: 79%
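For reference, scalability efficiency here is the measured speed-up divided by the ideal linear speed-up on 16 GPUs:

\[ E = \frac{S_{16}}{16}:\quad \frac{10.45}{16} \approx 65\%,\quad \frac{10.74}{16} \approx 67\%,\quad \frac{12.66}{16} \approx 79\% \]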
[Chart] GoogLeNet (iterations = 4000, batch size = 64), training time (s) vs. number of GPUs: 1 GPU = 1,380 s; 16 GPUs (Caffe-MPI) = 132 s; 16 GPUs (Caffe-MPI + cuDNN) = 109 s.
• Read training data from Lustre storage in parallel and send it to different GPUs (see the sketch below)
– The GPU cluster is divided into many groups
– Every group has a master node
– Every master node reads and sends data in parallel with multiple processes + multiple threads
• Can support large-scale GPU computing for a big training platform
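A minimal sketch of the grouped parallel read in MPI (assumed structure: GROUP_SIZE, SHARD, and load_from_lustre are hypothetical, and the real implementation reads with multiple processes and threads per group master):

// Grouped data loading: one master per group reads from shared storage
// and scatters shards to the group's GPU ranks (illustrative).
#include <mpi.h>
#include <vector>

const int GROUP_SIZE = 4;     // assumed ranks per group
const int SHARD = 1024;       // floats per rank per iteration

// Hypothetical stand-in for a multi-threaded Lustre read.
void load_from_lustre(std::vector<float>& buf) { buf.assign(buf.size(), 1.0f); }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Split the cluster into groups; rank 0 of each group is its master.
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP_SIZE, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    std::vector<float> whole(grank == 0 ? SHARD * gsize : 0);
    std::vector<float> shard(SHARD);

    if (grank == 0) load_from_lustre(whole);   // group master reads for the group
    MPI_Scatter(whole.data(), SHARD, MPI_FLOAT,
                shard.data(), SHARD, MPI_FLOAT, 0, group);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}

Because each group master reads independently, adding groups scales the aggregate read bandwidth instead of funneling all I/O through a single node.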
• Speed-up ratio: 16 GPU / 1 GPU = 13X
• Scalability efficiency: 81%
• The next work:
– Support cuDNN 4.0
– MPI framework tuning
• Symmetric model
• Caffe-MPI version open-source roadmap:
– Q2: computing-intensive model: support 32+ GPU parallelism
– Q3: IO-intensive model: support 16+ GPU parallelism
– Q4: support half precision for Pascal GPUs
Conclusions
• Caffe-MPI is based on an HPC technology architecture
– Performance: 16 GPU / 1 GPU = 13X
• Caffe-MPI can support 16+ GPUs to train on big data
• Inspur will continue to open-source new versions
– 32-GPU parallel version for the computing-intensive model
– 16+ GPU parallel version for the IO-intensive model
– Half-precision support for Pascal GPUs