Face detection is the process of locating the faces of one or more people in an image. It finds wide application in areas such as autofocus in cameras, person tracking in video conferencing, and gaming. This computer vision problem, essentially a binary classifier with localization, is addressed by many face detection frameworks, and computer scientists and engineers continue to improve on them.
While face recognition identifies the person, face detection is the stepping stone to it, producing the bounding box that contains the face for further processing by the recognition algorithm. Few false positives and false negatives, fast detection, robustness in poor conditions, and the ability to detect partially visible faces are some of the desirable attributes of face detection frameworks. This blog covers some of the popular machine learning based face detection frameworks and tools and briefly describes how they work. Embien's edge computing services include deploying face detection frameworks on resource-constrained embedded hardware for real-time AI vision applications.
OpenCV Haar cascade
Originally proposed by Paul Viola and Michael Jones, this method runs cascaded classifiers on Haar features extracted from images to identify faces. It is one of the most widely used and least computationally intensive face detection algorithms. Haar features are essentially small binary filters that use the difference between the pixel sums of two or more adjacent rectangular regions as a feature. This gives good results because faces share some common characteristics. For example, the rectangular region covering both eyes is darker than the corresponding rectangle on the forehead. Speed comes from the use of integral images, where each entry holds the sum of all pixels above and to the left of it. With this representation, the sum of pixels in any rectangular region can be calculated with just a few additions and subtractions.
OpenCV Haar Cascade (Image Source: https://docs.opencv.org/4.x/d2/d99/tutorial_js_face_detection.html)
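The integral image idea described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names (integral_image, rect_sum, two_rect_feature) and the tiny 4x4 test image are made up for this example and are not part of OpenCV.

```python
def integral_image(img):
    """Build a summed-area table padded with one leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: upper half minus lower half."""
    half = h // 2
    upper = rect_sum(ii, x, y, w, half)
    lower = rect_sum(ii, x, y + half, w, half)
    return upper - lower


# Hypothetical 4x4 patch: bright "forehead" rows above dark "eye" rows.
image = [
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
ii = integral_image(image)
print(rect_sum(ii, 0, 0, 4, 4))
print(two_rect_feature(ii, 0, 0, 4, 4))
```

Once the table is built, every rectangle sum costs the same four lookups regardless of the rectangle's size, which is what makes evaluating a large number of Haar features affordable.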
Since the number of candidate features is very high, a variant of Adaptive Boosting (AdaBoost) is trained to select the best features. AdaBoost combines many weak classifiers into a strong classifier as a weighted sum, and these classifiers are arranged in a cascade of stages so that windows that clearly do not contain a face are rejected early. With training on a large set of images, it is possible to achieve good accuracy. Further, the detector can be run at any window size, enabling detection of faces of varying sizes. OpenCV provides a Haar cascade implementation that can be used to run this face detection algorithm.
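A minimal sketch of how a boosted weighted sum and an early-rejecting cascade fit together is shown below. It assumes each weak classifier is a simple decision stump on one feature value. All thresholds, polarities, and weights are hypothetical numbers chosen for illustration, not values from any trained cascade.

```python
def strong_classify(features, weak_classifiers, stage_threshold):
    """Weighted vote of decision stumps.

    Each weak classifier is a tuple (feature_index, threshold, polarity, alpha).
    A stump votes 1 when polarity * feature < polarity * threshold, else 0.
    """
    score = 0.0
    for index, threshold, polarity, alpha in weak_classifiers:
        vote = 1 if polarity * features[index] < polarity * threshold else 0
        score += alpha * vote
    return score >= stage_threshold


def cascade_classify(features, stages):
    """Run the stages in order and reject as soon as any stage says 'not a face'."""
    for weak_classifiers, stage_threshold in stages:
        if not strong_classify(features, weak_classifiers, stage_threshold):
            return False
    return True


# Hypothetical two-stage cascade. Each stage threshold is half the sum of its alphas.
stages = [
    ([(0, 10.0, -1, 1.0)], 0.5),                    # passes when feature 0 > 10
    ([(1, 5.0, -1, 0.8), (2, 3.0, 1, 0.6)], 0.7),   # feature 1 > 5, feature 2 < 3
]

print(cascade_classify([64.0, 12.0, 1.0], stages))  # passes both stages
print(cascade_classify([2.0, 12.0, 1.0], stages))   # rejected by the first stage
```

Most windows in an image are background, so rejecting them in the cheap first stage is where the cascade saves most of its time.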
OpenCV DNN
Deep neural networks (DNNs), loosely inspired by the human nervous system, power some of the most accurate face detection algorithms. A DNN has more than one hidden layer and is trained using backpropagation, in which the weights of the neurons are updated to reflect the learning. With repeated supervised training, DNNs can be used for different kinds of applications. It is also possible to build DNN-based face detection frameworks from a variety of layers such as convolutional layers and max-pooling layers.
OpenCV includes a face detection model called YuNet that is trained on WIDER Face and is highly optimized for performance. It replaces standard convolution with depthwise and pointwise convolutions, making it one of the most efficient face detection frameworks for resource-constrained edge devices. OpenCV's DNN face detector module is widely used in edge deployment scenarios.
OpenCV DNN (Image Source: https://opencv.org/blog/opencv-face-detection-cascade-classifier-vs-yunet/)
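To see why swapping a standard convolution for a depthwise plus pointwise pair saves so much, it helps to count the weights. The sketch below does that arithmetic. The kernel size and channel counts are hypothetical and are not YuNet's actual layer dimensions, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
standard = standard_conv_params(3, 64, 64)
separable = separable_conv_params(3, 64, 64)
print(standard, separable, round(standard / separable, 2))
```

For this assumed layer the separable version needs roughly an eighth of the weights, and the number of multiply-accumulate operations shrinks by the same ratio.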
Pre-trained files for OpenCV's SSD-based DNN face detector are readily available for use: res10_300x300_ssd_iter_140000_fp16.caffemodel and deploy.prototxt for Caffe, and opencv_face_detector_uint8.pb and opencv_face_detector.pbtxt for TensorFlow. Alternatively, the model can be trained on your own images and then used.
HOG + Linear SVM on Dlib
Dlib is a very useful and practical toolkit for making real world machine learning and data analysis applications. It provides a few face detection algorithms, one of which is the Histogram of Oriented Gradients (HOG) detector with a linear SVM. The image is split into small cells, and the magnitude and direction of the gradient are computed for each pixel in a cell. The angles are then typically grouped into 9 bins at 0°, 20°, 40°, 60°, 80°, 100°, 120°, 140°, and 160°. To build the histogram, each gradient's magnitude is split between the nearest angle bins in proportion to how close its angle is to each of them.
HOG + Linear SVM on Dlib (Image Source: https://www.warse.org/IJETER/static/pdf/file/ijeter244892020.pdf)
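The histogram step can be sketched as follows, assuming unsigned gradients (angles folded into the 0° to 180° range) and a proportional split of each magnitude between the two neighbouring bins. The function name cell_histogram and the two-pixel example cell are made up for illustration and do not reflect Dlib's internal implementation.

```python
import math

BIN_WIDTH = 20   # degrees per bin
NUM_BINS = 9     # bins at 0, 20, ..., 160 degrees


def cell_histogram(gx, gy):
    """Accumulate a 9-bin orientation histogram for one cell.

    gx and gy are 2D lists holding the horizontal and vertical gradient per pixel.
    """
    hist = [0.0] * NUM_BINS
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            # Unsigned orientation: fold the angle into [0, 180).
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            position = angle / BIN_WIDTH
            lower = int(position) % NUM_BINS
            upper = (lower + 1) % NUM_BINS   # 160 degrees wraps around to 0
            fraction = position - int(position)
            # Split the magnitude between the two nearest bins.
            hist[lower] += magnitude * (1.0 - fraction)
            hist[upper] += magnitude * fraction
    return hist


# Hypothetical cell with two pixels: one horizontal gradient, one vertical gradient.
gx = [[1.0, 0.0]]
gy = [[0.0, 2.0]]
hist = cell_histogram(gx, gy)
print([round(v, 3) for v in hist])
```

The vertical gradient has an angle of 90°, which sits exactly between the 80° and 100° bins, so its magnitude of 2 is shared equally between them.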
These histograms of oriented gradients describe the features of the face region. The histograms are normalized over blocks of neighboring cells and combined into a feature vector. Finally, a linear Support Vector Machine is trained with both positive and negative samples to create the binary classifier. During detection, the same steps are followed over a sliding window: the HOG feature vector is extracted from each window and given to the linear SVM to determine whether or not a face is present. Among the Dlib-based face detection algorithms, this approach is highly efficient on CPU.
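The detection loop described above can be sketched as below. This is a simplified illustration: normalization is applied to the whole feature vector rather than per block, the feature extractor is a stand-in passed in by the caller, and the SVM weights and bias are hypothetical values rather than a trained model.

```python
import math


def l2_normalize(vector, eps=1e-6):
    """Scale a feature vector to unit length (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


def svm_score(features, weights, bias):
    """Linear SVM decision value: dot(weights, features) + bias."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield the top-left corner of every window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y


def detect(extract_features, img_w, img_h, win, stride, weights, bias):
    """Return (x, y, w, h) for every window the SVM scores as a face."""
    hits = []
    for x, y in sliding_windows(img_w, img_h, win, win, stride):
        features = l2_normalize(extract_features(x, y, win, win))
        if svm_score(features, weights, bias) > 0:
            hits.append((x, y, win, win))
    return hits


# Stand-in extractor: pretends only the window at (4, 0) looks like a face.
def toy_features(x, y, w, h):
    return [1.0, 0.0] if (x, y) == (4, 0) else [0.0, 1.0]


print(detect(toy_features, 8, 8, 4, 4, weights=[1.0, -1.0], bias=0.0))
```

In a real detector the window is also scanned over several image scales, and overlapping hits are merged into a single box.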
Dlib MMOD CNN face detector
Dlib also offers another algorithm called the MMOD CNN face detector. Based on Convolutional Neural Networks (CNN), it implements the Max-Margin Object Detection (MMOD) algorithm for improved results. A CNN is a special type of neural network that works effectively on grid-based data like images and runs far more efficiently on GPUs than on CPUs. MMOD trains the detector with a max-margin objective, maximizing the margin between true face windows and everything else. The 68 facial landmarks often associated with Dlib come from its separate shape predictor, which is typically run on the face region returned by a detector.
While this CNN-based algorithm can be trained with fewer images than other face detection algorithms, the Dlib library comes with a pretrained model that fares quite well for most applications. Though the algorithm is slower than the HOG detector, especially on CPU, it offers better detection accuracy.
MediaPipe Face Detection
MediaPipe Face Detection is part of the MediaPipe on-device machine learning solutions framework and is based on the BlazeFace detector. It is one of the face detection frameworks optimized for edge and mobile use cases. BlazeFace uses an improved lightweight network inspired by MobileNet, makes heavier use of depthwise convolutions, and is designed around the fixed cost of dispatching operations to the GPU.
MediaPipe Face Detection (Image Source: https://developers.google.com/mediapipe/solutions/vision/face_detector)
The MediaPipe face detection framework can output face locations along with the left eye, right eye, nose tip, mouth, left eye tragion, and right eye tragion facial key points. It can quite accurately detect and track human faces in real-time video streams or images. This makes it one of the most practical face detection frameworks for embedded AI applications.
Retinaface with TensorFlow
Retinaface with TensorFlow is one of the more advanced face detection frameworks available today. It performs single-stage dense face localization and achieves high accuracy across scale variations, including small faces. A feature pyramid network (FPN) lets it detect faces at multiple scales simultaneously. Among face detection frameworks, it stands out for its ability to create 3D face reconstructions from 2D images along with landmark estimation. Compared to MTCNN, it demonstrates higher accuracy and lower loss. Edge video analytics deployments frequently use Retinaface or Edgeface-based face detection frameworks for real-time processing at the camera node. Edgeface brings CNN-based face detection to edge hardware with limited power budgets.
Conclusion
While these are some of the face detection frameworks in use, there are many others such as Single Shot Multibox Detector (SSD), Dual Shot Face Detector (DSFD), and MTCNN. NVIDIA offers the DetectNet_v2 detector with ResNet18 as the feature extractor, targeted at its GPU platforms.
Of the above face detection algorithms, Retinaface and MTCNN both perform well, with Retinaface offering higher accuracy along with the ability to create a 3D face from a 2D image together with landmarks. Edgeface extends CNN-based face detection frameworks to devices with severely constrained compute. The end application dictates the choice of algorithm, based on the computation power available, the presence of a CPU or a GPU, the required detection speed, and the desired accuracy.
