Intel and Facebook* Collaborate to Boost Caffe2 Performance on Intel CPUs

by Shrutee K/DNS
Every day, the world generates more and more information — text, pictures, videos and more. In recent years, artificial intelligence and deep learning have improved several applications that help people better understand this information with state-of-the-art voice/speech recognition, image/video recognition, and recommendation engines.
Most deep learning workloads consist of both training and inference. Training usually requires many hours or days to complete. Inference usually requires milliseconds or seconds and is often one step of a larger process. While the computing intensity of inference is much lower than that of training, inference is often done on a much larger dataset. Therefore, the total computing resources spent on inference are likely to dwarf those spent on training. The overwhelming majority of all inference workloads run on Intel® Xeon® CPUs.
Over the past year, Intel rapidly added CPU support across several deep learning frameworks to optimize for a variety of training and inference applications. At the center of these optimizations is the Intel® Math Kernel Library (Intel® MKL), which makes use of Intel® Advanced Vector Extensions CPU instructions (e.g., Intel® AVX-512) that provide enhanced support for deep learning applications.
Caffe2* is an open source deep learning framework created by Facebook and built with expression, speed, and modularity in mind. Caffe2 is deployed at Facebook to help researchers train large machine learning models and deliver AI on mobile devices. Now, developers will have access to many of the same tools, allowing them to run large-scale distributed training scenarios and build machine learning applications for mobile.
Intel and Facebook are collaborating to integrate Intel® MKL functions into Caffe2 for optimal inference performance on CPUs. Table 1 shows inference performance numbers on AlexNet* using the Intel® MKL library and the Eigen* BLAS library for comparison. In this table, OMP_NUM_THREADS indicates the number of physical cores used in these workloads (details in the table caption). These results show that Caffe2 is highly optimized on CPUs and offers competitive performance. For small-batch inference workloads, it is recommended to run each workload on a single CPU core and to run multiple workloads in parallel, one workload per core.
              OMP_NUM_THREADS=44              OMP_NUM_THREADS=1
Batch size    Intel® MKL      Eigen BLAS      Intel® MKL      Eigen BLAS
              (images/sec)    (images/sec)    (images/sec)    (images/sec)
1             173.4           5.2             28.6            5.1
32            1500.2          29.3            64.6            15.4
64            1596.3          35.3            66.0            15.5
256           1735.2          44.9            67.3            16.2
Table 1: Performance results on Caffe2 using the AlexNet topology with Intel® MKL and Eigen BLAS. Experiments were performed on Intel® Xeon® processor E5-2699 v4 (codename Broadwell) @ 2.20GHz with dual sockets, 22 physical cores per socket (total of 44 physical cores in both sockets), 122GB RAM DDR4, 2133 MHz, HT Disabled, on Linux 3.10.0-514.2.2.el7.x86_64 CentOS 7.3.1611, Intel® MKL version 20170209, Eigen BLAS version 3.3.2, based on Caffe2 as of April 18, 2017.
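The recommendation to run one small-batch workload per core can be checked against the numbers in Table 1. The sketch below is only an illustration: it copies the table values into a dictionary, computes the Intel® MKL speedup over Eigen BLAS, and estimates an idealized aggregate throughput for 44 independent single-threaded workloads. The linear-scaling estimate is an assumption, not a measurement; in practice, shared caches and memory bandwidth would likely reduce it. The names TABLE_1 and summarize are hypothetical.

```python
# Throughput figures (images/sec) copied from Table 1.
# batch size -> (MKL, 44 threads), (Eigen, 44 threads), (MKL, 1 thread), (Eigen, 1 thread)
TABLE_1 = {
    1: (173.4, 5.2, 28.6, 5.1),
    32: (1500.2, 29.3, 64.6, 15.4),
    64: (1596.3, 35.3, 66.0, 15.5),
    256: (1735.2, 44.9, 67.3, 16.2),
}
PHYSICAL_CORES = 44  # dual socket, 22 cores per socket (from the table caption)


def summarize(table=TABLE_1, cores=PHYSICAL_CORES):
    """Return one summary dict per batch size, sorted by batch size."""
    rows = []
    for batch, (mkl_44, eigen_44, mkl_1, eigen_1) in sorted(table.items()):
        rows.append({
            "batch": batch,
            # Speedup of Intel MKL over Eigen BLAS at each thread count.
            "mkl_vs_eigen_44": mkl_44 / eigen_44,
            "mkl_vs_eigen_1": mkl_1 / eigen_1,
            # Measured throughput of one workload spread over all cores.
            "measured_44_threads": mkl_44,
            # ASSUMPTION: perfect linear scaling of independent
            # single-threaded workloads; an upper bound, not a measurement.
            "ideal_one_per_core": mkl_1 * cores,
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(
            f"batch {row['batch']:>3}: "
            f"MKL/Eigen = {row['mkl_vs_eigen_44']:.1f}x (44 threads), "
            f"{row['mkl_vs_eigen_1']:.1f}x (1 thread); "
            f"one workload on 44 threads = {row['measured_44_threads']:.1f} img/s, "
            f"44 single-core workloads (ideal) = {row['ideal_one_per_core']:.1f} img/s"
        )
```

For batch size 1, the idealized estimate is 28.6 × 44 = 1258.4 images/sec, compared with the 173.4 images/sec measured when a single workload is spread across all 44 threads. This is consistent with the one-workload-per-core recommendation for small batches.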
Later this year, the new generation of Intel® Xeon® processors (codename Skylake) will become available to the general market. Skylake introduces 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine, Intel® AVX-512, providing a significant performance boost over the previous 256-bit wide AVX2 instructions in the Haswell/Broadwell processors for both training and inference workloads. The 512-bit wide FMAs essentially double the FLOPS that Skylake can deliver and significantly speed up the single-precision matrix arithmetic used in convolutional and recurrent neural networks. Inference workloads are massively parallel and will benefit from the larger core count offered by Skylake. In addition, the Skylake CPUs have a re-architected memory subsystem that supports faster system memory and a larger Mid-Level Cache (MLC) per core, which also contributes to the performance improvement over current-generation CPUs and to a significant enhancement over the common installed base of four-year-old systems.
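The doubling of FLOPS follows from simple arithmetic on the vector width. The sketch below is a rough illustration rather than a vendor specification: it counts single-precision lanes per vector register and treats each FMA as two floating-point operations per lane. The figure of two FMA units per core is an assumption made here for illustration; it is not stated in this article and may vary by processor model. The function name peak_flops_per_cycle is hypothetical.

```python
def peak_flops_per_cycle(vector_bits, element_bits=32, fma_units=2):
    """Theoretical peak floating-point operations per cycle per core.

    ASSUMPTION: fma_units=2 is an illustrative default, not a figure
    taken from the article.
    """
    lanes = vector_bits // element_bits  # single-precision values per register
    flops_per_fma = 2                    # one multiply plus one add per lane
    return lanes * flops_per_fma * fma_units


avx2 = peak_flops_per_cycle(256)    # 256-bit AVX2 (Haswell/Broadwell)
avx512 = peak_flops_per_cycle(512)  # 512-bit AVX-512 (Skylake)

if __name__ == "__main__":
    print(f"AVX2:    {avx2} FLOPs/cycle/core")
    print(f"AVX-512: {avx512} FLOPs/cycle/core")
    print(f"Ratio:   {avx512 / avx2:.1f}x")
```

Under these assumptions the ratio is 2x regardless of how many FMA units are assumed, since the unit count cancels out. Real-world gains also depend on clock frequency and memory behavior, which this calculation ignores.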
Andres Rodriguez, PhD, is a Sr. Principal Engineer with Intel’s AI Products Group (AIPG), where he designs deep learning solutions for Intel’s customers and provides technical leadership across Intel for deep learning products. He has 13 years of experience working in artificial intelligence. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He has over 20 peer-reviewed publications in journals and conferences, and a book chapter on machine learning.
Niv Sundaram, PhD, is Director of Engineering with Intel’s Datacenter Engineering Group (DEG), focusing on performance and power optimizations of current and emerging workloads. In this role, she leads a team that works with Intel’s customers to characterize deep learning/machine learning and augmented/virtual/mixed reality workloads for the datacenter. Niv has a PhD in Electrical Engineering from the University of Wisconsin-Madison and has one issued patent and several peer-reviewed publications.
