Zhang Qing - GPU Technology Conference
Transcription
Zhang Qing, [email protected]
HPC Application R&D Manager, Inspur

• Inspur-Nvidia GPU Joint Lab Introduction
• Caffe-MPI: Parallel Caffe framework based on GPU cluster

Inspur-Nvidia GPU Joint Lab Introduction
• Inspur-Nvidia GPU Joint Lab App Research Directions
– Traditional HPC
– Deep Learning

Field        | Application    | Client                         | Speed-up ratio | Platform
Life Science | BLASTN         | Beijing Institute of Genomics  | 35X (kernel)   | 1 GPU / 1 CPU core
Life Science | ET             | Institute of Biophysics, CAS   | 48X            | 1 GPU / 1 CPU core
CFD          | LBM_LES        | -                              | 100X           | 1 GPU / 1 CPU core
Oil&gas      | RNA            | BGP                            | 8X             | 24 GPU nodes / 24 CPU nodes
Oil&gas      | PSTM           | BGP                            | 5X             | 6 GPU nodes / 6 CPU nodes
Oil&gas      | Scandip        | BGP                            | 9X             | 4 GPU + 2 CPU / 2 CPU
CSP          | Caffe          | Qihoo                          | 12.5X          | 16 GPU / 1 GPU
CSP          | DNN            | IFLYTEK                        | 13X            | 16 GPU / 1 GPU
CSP          | K-means        | Qihoo                          | 35X            | 1 GPU / 1 CPU core
CSP          | Neural Network | Qihoo                          | 270X           | 4 GPU / 1 CPU core

• Application: DNN
• Client: IFLYTEK
• Performance: 16 GPU / 1 GPU = 13X
• Deep learning for speech recognition: mobile phone, car, business travel query, intelligent customer service

• Application: neural network
• Client: Qihoo
• Performance: 4 GPU / 1 CPU core = 270X

Caffe time profile, Time (s):
– ForwardBackward computing: 80% (data parallel)
– Weight computing: 16% (some part can be parallelized)
– Net update: 4% (some part can be parallelized)

• Caffe has many users and is very popular in China.
• Training Caffe on big data with a single GPU node takes a long time.
• Caffe's ForwardBackward computing, weight computing, and net update can all be parallelized on a GPU cluster.
• What is Caffe-MPI?
– Developed by Inspur
• Open source: https://github.com/Caffe-MPI/Caffe-MPI.github.io
– Based on the Berkeley Vision and Learning Center (BVLC) single-GPU Caffe version
– A GPU cluster version of Caffe
– Supports training on 16+ GPUs
• Based on HPC technology
– Hardware architecture: IB + GPU cluster + Lustre
– Software architecture: MPI + Pthread + CUDA
• Data parallel on GPU cluster

GPU Cluster Configuration
GPU master node | Multiple GPUs
GPU slave node  | Multiple GPUs
Storage         | Lustre
Network         | 56 Gb/s IB
Software        | Linux / CUDA 7.5 / MVAPICH2

• MPI master-slave model
– Master process: multiple Pthread threads + CUDA threads
– Slave process: CUDA threads
Reference: Q Ho, J Cipar, H Cui, JK Kim, S Lee, ... More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.

• Master process (process 0)
– Three Pthread groups
• Parallel data reading and sending
• Weight computing and parameter update
• Parameter communication
• Slave process
– CPU
• Receives training data from the master process
• Sends weight data (GPU-to-GPU)
• Receives new net data (GPU-to-GPU)
– GPU
• ForwardBackward computing
• Slave node
– Number of slave processes = number of GPUs
• GPU parallel computing
• Computing and communication run asynchronously in parallel
• Communication optimization
– GPU RDMA: weight data and net data between GPUs

$T_{\text{total}} = \max\left(T_{\text{read data + send data}},\; T_{\text{ForwardBackward computing + weight computing and net update + net send}}\right)$

• Speed-up ratio: 16 GPU / 1 GPU = 10.45X
• Scalability efficiency: 65%
• Speed-up ratio: 16 GPU / 1 GPU = 10.74X
• Scalability efficiency: 67%
• Performance speed-up from cuDNN = 21%
• Speed-up ratio: 16 GPU / 1 GPU = 12.66X
• Scalability efficiency: 79%

[Figure: GoogLeNet (iterations = 4000, batch size = 64), training time (s) versus number of GPUs. 1 GPU: 1,380 s; 16 GPUs (Caffe-MPI): 132 s; 16 GPUs (Caffe-MPI + cuDNN): 109 s.]

• Parallel reading of training data from Lustre storage and sending it to different GPUs
– The GPU cluster is divided into many groups
– Every group has a master node
– Every master node reads and sends data in parallel using multiple processes + multiple threads
• Can support large-scale GPU computing for a big training platform
• Speed-up ratio: 16 GPU / 1 GPU = 13X
• Scalability efficiency: 81%
• Next work:
– Support cuDNN 4.0
– MPI framework tuning
• Symmetric model
• Caffe-MPI open-source version roadmap
• Q2: Computing-intensive model: support 32+ GPUs in parallel
• Q3: IO-intensive model: support 16+ GPUs in parallel
• Q4: Support half precision for Pascal GPUs

Conclusions
• Caffe-MPI is based on an HPC technology architecture
– Performance: 16 GPU / 1 GPU = 13X
• Caffe-MPI supports training on big data with 16+ GPUs
• Inspur will continue to open-source new versions
– 32-GPU parallel version for the computing-intensive model
– 16+ GPU parallel version for the IO-intensive model
– Half-precision support for Pascal GPUs
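
The speed-up and scalability figures quoted in the slides can be checked with a few lines of arithmetic. The sketch below is not part of Caffe-MPI; the function names are hypothetical, and it assumes that "scalability efficiency" means speed-up divided by the number of GPUs, which is consistent with the quoted pairs (for example 13X on 16 GPUs and 81%). The training times are the three bar values read from the GoogLeNet chart.

```python
def speedup(t_single, t_parallel):
    """Speed-up ratio: single-GPU time divided by multi-GPU time."""
    return t_single / t_parallel


def scaling_efficiency(t_single, t_parallel, n_gpus):
    """Scalability efficiency, assumed here to be speed-up / number of GPUs."""
    return speedup(t_single, t_parallel) / n_gpus


def overlapped_time(t_read_send, t_compute):
    """Total time when data feeding overlaps with computing.

    t_compute stands for ForwardBackward computing + weight computing
    and net update + net send, as in the total-time formula on the slide.
    """
    return max(t_read_send, t_compute)


# Training times (s) from the GoogLeNet chart (iterations = 4000, batch size = 64).
T_1_GPU = 1380.0
T_16_GPU = 132.0
T_16_GPU_CUDNN = 109.0

if __name__ == "__main__":
    for label, t in [("Caffe-MPI", T_16_GPU), ("Caffe-MPI + cuDNN", T_16_GPU_CUDNN)]:
        s = speedup(T_1_GPU, t)
        e = scaling_efficiency(T_1_GPU, t, 16)
        print(f"{label}: {s:.2f}X speed-up, {e:.0%} efficiency")
    print(f"cuDNN gain: {T_16_GPU / T_16_GPU_CUDNN - 1:.0%}")
```

With these inputs the sketch reproduces the 10.45X / 65% and 12.66X / 79% pairs and the 21% cuDNN gain. The `overlapped_time` helper only restates the max() model: once reading and sending data is hidden behind computing, the slower of the two paths sets the iteration time.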
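
The master-slave data-parallel pattern described in the slides can be illustrated with a toy simulation. This is only a sketch under loud assumptions: Python threads stand in for MPI processes and GPUs, a one-parameter linear model stands in for the network, and all names (`reader`, `worker`, `train`) are hypothetical rather than taken from the Caffe-MPI source. It shows the division of labour only: a master-side reader thread feeds batches while the workers compute gradients, and the master averages the gradients and updates the weight.

```python
import queue
import threading


def make_batches(n_batches, batch_size):
    """Hypothetical data source: points on the line y = 3x."""
    batches = []
    for b in range(n_batches):
        xs = [((b * batch_size + i) % 10) / 10.0 + 0.1 for i in range(batch_size)]
        batches.append([(x, 3.0 * x) for x in xs])
    return batches


def reader(batches, out_q):
    """Master-side reader thread: feeds batches while workers compute."""
    for batch in batches:
        out_q.put(batch)  # blocks when the prefetch queue is full
    out_q.put(None)       # sentinel: no more data


def gradient(w, shard):
    """Gradient of the mean squared error of y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def worker(rank, w, shard, grads):
    """Slave role: the 'ForwardBackward' step on this worker's shard."""
    grads[rank] = gradient(w, shard)


def train(n_workers=4, n_batches=200, batch_size=16, lr=0.5):
    q = queue.Queue(maxsize=4)
    feeder = threading.Thread(
        target=reader, args=(make_batches(n_batches, batch_size), q)
    )
    feeder.start()
    w = 0.0
    while True:
        batch = q.get()
        if batch is None:
            break
        # Split the batch evenly across workers (data parallelism).
        shards = [batch[r::n_workers] for r in range(n_workers)]
        grads = [0.0] * n_workers
        threads = [
            threading.Thread(target=worker, args=(r, w, shards[r], grads))
            for r in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Master role: average the gradients and update the weight.
        w -= lr * sum(grads) / n_workers
    feeder.join()
    return w


if __name__ == "__main__":
    print(f"learned weight: {train():.6f}")
```

In a real cluster the queue hand-off would be MPI sends to slave processes and the gradient exchange would go GPU-to-GPU; the simulation keeps only the control flow, so it says nothing about actual performance.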