A new framework for training a CNN with a hardware-software architecture

Parra Prada, Dorfell Leonardo

Tesis de Doctorado en Ingeniería - Ingeniería Eléctrica (16.74Mb)

Autor

Parra Prada, Dorfell Leonardo

Director

Camargo Bareño, Carlos Ivan

Tipo de contenido

Trabajo de grado - Doctorado

Idioma del documento

Inglés

Fecha de publicación

2023-04

@misc{unal_84550, author = {Parra Prada Dorfell Leonardo}, title = {A new framework for training a CNN with a hardware-software architecture}, year = {2023-04}, abstract = {Facial Expression Recognition (FER) systems classify emotions by using geometrical approaches or Machine Learning (ML) algorithms such as Convolutional Neural Networks (CNNs). However, designing these systems could be a challenging task that depends on the data set's quality or the designer's expertise. Moreover, CNNs inference requires a large amount of memory and computational resources, making it unfeasible for low-cost embedded systems. Hence, although GPUs are expensive and have high power consumption, they are frequently employed because they considerably reduce the inference time compared to CPUs. On the other hand, SoCs implemented in FPGAs could employ less power and support pipelining. However, the floating point representation may result in intricate and larger designs that are only suitable for high-end FPGAs. Therefore, custom hardware-software architectures that maintain acceptable performance while using simpler data representations are advantageous. To address these challenges, this work proposes a design methodology for CNN-based FER systems. The methodology includes the preprocessing, the Local binary pattern (LBP), and the data augmentation. Besides, several CNN models were trained with TensorFlow and the JAFFE data set to validate the methodology. In each test, the relationship between parameters, layers, and performance was studied, as were the overfitting and underfitting scenarios. Furthermore, this work introduces the model M6, a single channel CNN that reaches an accuracy of 94% in less than 30 epochs. M6 has 306.182 parameters in 1.17 MB. In addition, the work also employs the quantization methodology from TensorFlow Lite (tflite), to compute the inference of a CNN using integer numbers. M6's accuracy dropped from 94.44% to 83.33% after quantization, the number of parameters increased from 306.182 to 306.652, and the model size decreased almost 4x from 1.17 MB to 0.3 MB. Also, the work presents a custom hardware-software architecture to accelerate CNNs known as the FER SoC, which reproduces the main tflite operations in hardware. Hence, as the integer numbers are fully mapped to hardware registers, the accelerator results will be similar to their software counterparts. The architecture has been tested on a Zybo-Z7 development board with 1 GB RAM and the Zynq7 device XCZ7020-CLG400. Moreover, it was observed that the architecture got the same accuracy but was 20% slower than a laptop equipped with an AMD CPU with 16 threads, 16 GB of RAM and a Nvidia GTX1660Ti GPU. Therefore, it is recommended to assess whether the trade-off between quantization and inference time is worth it for the target application. Lastly, another contribution is the framework for CNNs' training in custom hardware-software architectures known as Resiliency. It has been used to train and run the inference of the single-channel M6 model. Resiliency provides the design files needed as well as the Pynq 2.7 image created for running ML frameworks such as TensorFlow and PyTorch. Although the training time was slow, the accuracy and loss were consistent to traditional approaches. However, the execution time could be improved by utilizing bigger FPGAs with MPSoCs like the Zynq Ultrascale family. (Texto tomado de la fuente)}, url = {https://repositorio.unal.edu.co/handle/unal/84550} }TY - GEN T1 - A new framework for training a CNN with a hardware-software architecture AU - Parra Prada, Dorfell Leonardo Y1 - 2023-04 UR - https://repositorio.unal.edu.co/handle/unal/84550 PB - Universidad Nacional de Colombia AB - Facial Expression Recognition (FER) systems classify emotions by using geometrical approaches or Machine Learning (ML) algorithms such as Convolutional Neural Networks (CNNs). However, designing these systems could be a challenging task that depends on the data set's quality or the designer's expertise. Moreover, CNNs inference requires a large amount of memory and computational resources, making it unfeasible for low-cost embedded systems. Hence, although GPUs are expensive and have high power consumption, they are frequently employed because they considerably reduce the inference time compared to CPUs. On the other hand, SoCs implemented in FPGAs could employ less power and support pipelining. However, the floating point representation may result in intricate and larger designs that are only suitable for high-end FPGAs. Therefore, custom hardware-software architectures that maintain acceptable performance while using simpler data representations are advantageous. To address these challenges, this work proposes a design methodology for CNN-based FER systems. The methodology includes the preprocessing, the Local binary pattern (LBP), and the data augmentation. Besides, several CNN models were trained with TensorFlow and the JAFFE data set to validate the methodology. In each test, the relationship between parameters, layers, and performance was studied, as were the overfitting and underfitting scenarios. Furthermore, this work introduces the model M6, a single channel CNN that reaches an accuracy of 94% in less than 30 epochs. M6 has 306.182 parameters in 1.17 MB. In addition, the work also employs the quantization methodology from TensorFlow Lite (tflite), to compute the inference of a CNN using integer numbers. M6's accuracy dropped from 94.44% to 83.33% after quantization, the number of parameters increased from 306.182 to 306.652, and the model size decreased almost 4x from 1.17 MB to 0.3 MB. Also, the work presents a custom hardware-software architecture to accelerate CNNs known as the FER SoC, which reproduces the main tflite operations in hardware. Hence, as the integer numbers are fully mapped to hardware registers, the accelerator results will be similar to their software counterparts. The architecture has been tested on a Zybo-Z7 development board with 1 GB RAM and the Zynq7 device XCZ7020-CLG400. Moreover, it was observed that the architecture got the same accuracy but was 20% slower than a laptop equipped with an AMD CPU with 16 threads, 16 GB of RAM and a Nvidia GTX1660Ti GPU. Therefore, it is recommended to assess whether the trade-off between quantization and inference time is worth it for the target application. Lastly, another contribution is the framework for CNNs' training in custom hardware-software architectures known as Resiliency. It has been used to train and run the inference of the single-channel M6 model. Resiliency provides the design files needed as well as the Pynq 2.7 image created for running ML frameworks such as TensorFlow and PyTorch. Although the training time was slow, the accuracy and loss were consistent to traditional approaches. However, the execution time could be improved by utilizing bigger FPGAs with MPSoCs like the Zynq Ultrascale family. (Texto tomado de la fuente) ER -

Abstract

Facial Expression Recognition (FER) systems classify emotions by using geometrical approaches or Machine Learning (ML) algorithms such as Convolutional Neural Networks (CNNs). However, designing these systems could be a challenging task that depends on the data set's quality or the designer's expertise. Moreover, CNNs inference requires a large amount of memory and computational resources, making it unfeasible for low-cost embedded systems. Hence, although GPUs are expensive and have high power consumption, they are frequently employed because they considerably reduce the inference time compared to CPUs. On the other hand, SoCs implemented in FPGAs could employ less power and support pipelining. However, the floating point representation may result in intricate and larger designs that are only suitable for high-end FPGAs. Therefore, custom hardware-software architectures that maintain acceptable performance while using simpler data representations are advantageous. To address these challenges, this work proposes a design methodology for CNN-based FER systems. The methodology includes the preprocessing, the Local binary pattern (LBP), and the data augmentation. Besides, several CNN models were trained with TensorFlow and the JAFFE data set to validate the methodology. In each test, the relationship between parameters, layers, and performance was studied, as were the overfitting and underfitting scenarios. Furthermore, this work introduces the model M6, a single channel CNN that reaches an accuracy of 94% in less than 30 epochs. M6 has 306.182 parameters in 1.17 MB. In addition, the work also employs the quantization methodology from TensorFlow Lite (tflite), to compute the inference of a CNN using integer numbers. M6's accuracy dropped from 94.44% to 83.33% after quantization, the number of parameters increased from 306.182 to 306.652, and the model size decreased almost 4x from 1.17 MB to 0.3 MB. Also, the work presents a custom hardware-software architecture to accelerate CNNs known as the FER SoC, which reproduces the main tflite operations in hardware. Hence, as the integer numbers are fully mapped to hardware registers, the accelerator results will be similar to their software counterparts. The architecture has been tested on a Zybo-Z7 development board with 1 GB RAM and the Zynq7 device XCZ7020-CLG400. Moreover, it was observed that the architecture got the same accuracy but was 20% slower than a laptop equipped with an AMD CPU with 16 threads, 16 GB of RAM and a Nvidia GTX1660Ti GPU. Therefore, it is recommended to assess whether the trade-off between quantization and inference time is worth it for the target application. Lastly, another contribution is the framework for CNNs' training in custom hardware-software architectures known as Resiliency. It has been used to train and run the inference of the single-channel M6 model. Resiliency provides the design files needed as well as the Pynq 2.7 image created for running ML frameworks such as TensorFlow and PyTorch. Although the training time was slow, the accuracy and loss were consistent to traditional approaches. However, the execution time could be improved by utilizing bigger FPGAs with MPSoCs like the Zynq Ultrascale family. (Texto tomado de la fuente)

Resumen

Los sistemas de reconocimiento de expresiones faciales (FER) clasifican emociones usando estrategias geométricas o algoritmos de Machine Learning (ML) como redes neuronales convolucionales (CNNs). Sin embargo, el diseño de estos sistemas es una tarea compleja que depende de la calidad del set de datos y la experiencia del diseñador. Además, la inferencia de las CNNs requiere recursos de memoria y cómputo que hacen inviable el uso de sistemas embebidos de bajo costo. Igualmente, aunque las GPUs son costosas y presentan un alto consumo de potencia, se utilizan frecuentemente porque reducen el tiempo de ejecución en comparación a las CPU. Por otro lado, los sistemas on-chip (SoCs) implementados en FPGAs emplean menos potencia y soportan cómputo en paralelo. No obstante, representaciones numéricas como punto flotante pueden resultar en diseños complejos sólo adecuadas para FPGAs de gama alta. Por esta razón, el uso de arquitecturas de hardware-software que emplean representaciones numéricas sencillas y mantienen un desempeño aceptable son favorables. Para afrontar estos desafíos, este trabajo propone una metodología de diseño para sistemas FER basados en CNNs. La metodología incluye el preprocesamiento, el patrón local binario (LBP), y la aumentación de datos. Asimismo, para validar la metodología varios modelos CNNs fueron entrenados con TensorFlow y el set de datos JAFFE. En cada test, se estudia la relación entre los parámetros, las capas y el desempeño, el subentrenamiento y el sobreentrenamiento. Además, este trabajo introduce un modelo CNN de un canal llamado M6 que alcanza una exactitud de 94% en menos de 30 épocas. M6 tiene 306.182 parámetros y emplea 1.17 MB. El trabajo también utiliza la estrategia de quantización de TensorFlow Lite (tflite) para computar la inferencia de la CNN empleando números enteros. Después de la quantización, la exactitud de M6 se redujo de 94.44% a 83.33%, el número de parámetros aumentó de 306.182 a 306.652, y el tamaño del modelo se redujo aproximadamente 4 veces pasando de 1.17 MB a 0.3 MB. Igualmente, el trabajo presenta el FER SoC, una arquitectura de hardware-software para la aceleración de CNNs que reproduce las operaciones principales de tflite en hardware. En este caso, como los registros en hardware soportan las operaciones enteras, los resultados del acelerador son similares a la contraparte software. El FER SoC fue implementado en el sistema de desarrollo Zybo-Z7, el cuál emplea 1 GB RAM, y la FPGA Zynq XCZ7020-CLG400. Además, se observó que la arquitectura obtuvo la misma exactitud que una laptop con una CPU AMD de 16 threads, 16 GB de RAM, y la GPU de Nvidia GTX1660Ti, pero fue 20% más lenta. Por lo que se recomienda evaluar sí el intercambio entre la quantización y el tiempo de ejecución es suficiente para la aplicación objetivo. Por último, otra contribución del trabajo es el framework Resiliency, el cuál permite el entrenamiento y la inferencia de modelos CNN de un solo canal. Resiliency, provee los archivos de diseño necesarios y la imagen Pynq 2.7 creada para ejecutar los frameworks de ML TensorFlow y PyTorch. Aunque el tiempo de entrenamiento fue lento, la exactitud y la perdida son consistentes con las estrategias tradicionales. Sin embargo, el tiempo de ejecución puede ser mejorado usando FPGAs de gama alta con MPSoCs como las Zynq Ultrascale.

Palabras clave

FER ; CNN ; FPGA ; HNN ; FER ; CNN ; FPGA ; HNN ;

Descripción Física/Lógica/Digital

ilustraciones, diagramas., fotografías a color

URI

https://repositorio.unal.edu.co/handle/unal/84550

Colecciones

Doctorado en Ingeniería - Eléctrica [47]

Esta obra está bajo licencia internacional Creative Commons Reconocimiento-NoComercial 4.0.Este documento ha sido depositado por parte de el(los) autor(es) bajo la siguiente constancia de depósito