Detection of Indonesian Food to Estimate Nutritional Information Using YOLOv5

Online food delivery applications have become very popular. They make it convenient to find and order food, but this convenience can lead to uncontrolled choices about the type and amount of food consumed. To maintain a healthy lifestyle, people need to eat healthy and nutritious food. The goal of this research is to build a YOLOv5-based model that detects Indonesian food in images so that nutrition can then be estimated from per-serving information sourced from the FatSecret Indonesia website. The research stages are data collection, data pre-processing, training, testing, evaluation, image detection, and model export. The outcome is an object detection model that is ready to be implemented in Android applications or websites to detect images of Indonesian food and estimate each nutrient. The detection results give an average accuracy of 98.6%, a precision of 95%, a recall of 95.3%, and an F1-Score of 95%. The detection results are then used to estimate nutrition by taking per-portion information from the FatSecret Indonesia website. In experiments on seven pictures of Indonesian food, the estimation worked well and displayed nutritional information including energy, protein, fat, and carbohydrates.


I. INTRODUCTION
The Industrial Revolution 4.0 has brought technological advancements that make it easier for people to find and meet their needs, including the need for food [1]. Online food delivery applications are now a convenient answer to finding and meeting people's food needs. People who feel lazy, are busy with piles of work, or want to avoid traffic jams often prefer to order food through such an application [2]. Research by Nielsen Media Research Singapore in 2019 shows that Indonesia is a large market for food delivery services, especially during the COVID-19 pandemic. This is evidenced by the many food delivery services available, such as GoFood and ShopeeFood [3]. This convenience can cause the type and amount of food consumed to become uncontrolled. Research [4] explains that the use of social media has led to a considerable increase in food consumption. A lack of information about nutritional needs and nutritional content can also lead to overnutrition, such as obesity. The average energy adequacy rate for Indonesian people is 2,100 kcal, and the protein adequacy rate is 57 grams per person per day at the consumption level [5]. This guidance is often not well understood by the community, so people still frequently eat without regard to nutrition. The solution proposed for this issue is to develop a model that can recognize the type of food and then estimate its nutritional information.
Several studies have developed systems that detect food and then estimate its nutrition. One of them [6] built an application using YOLOv3 that detects food in a captured image and then displays the item's nutritional value. The study reported an accuracy of 100% for images with only one object, 88% for images with three objects, and 68% for images with five objects. From this study, it can be concluded that the more objects there are in an image, the lower the accuracy.
A system to assist users in controlling their eating behaviors through food detection and calorie analysis was proposed in prior research titled "Food Calorie and Nutrition Analysis System based on Mask R-CNN" by [7]. This research involved the detection of food using Mask R-CNN and analysis of the food proportion using a food mask.
159 TEKNIKA, Volume 12 (2)
Another study [1], which built an application for recognizing fruit objects (apples, grapes, pears, oranges, and bananas) using YOLOv3, achieved an accuracy of 97.4%. In this study, there was a case of overfitting because the data used for testing was the training data.
Research [8] addressed food detection using the Single Shot Detector (SSD) in an Android application. The model can detect food on a plate and then estimate the number of calories by taking per-serving calorie information from the FatSecret website.
Research on the detection of mold in food was carried out by [9], which compared detection models based on YOLOv3, YOLOv4, and YOLOv5. In this study, YOLOv5 outperformed the other models in precision, recall, and average precision (AP), with scores of 98.10%, 100%, and 99.60%, respectively. The researchers suggest increasing the number of images to improve the results.
The study [10] compares different object identification and localization algorithms based on accuracy, execution time, and parameter values with different input image sizes. This study identified a novel single-stage model methodology that increases speed without significantly reducing accuracy. The comparison findings demonstrate that YOLOv3-Tiny accelerated object detection while retaining result accuracy.
The YOLO algorithm has also been used to detect prostate cancer and to determine important regions on input biopsy images automatically [11]. The first set contains 50 actual photos of prostate tissue that are comparable to the training set, with a 97% accuracy rate. The accuracy rate for the test set, which comprises 137 entirely different actual prostate tissue biopsy photos, was 89%. Misdetected and misclassified cases were found in four of the test images. Data augmentation is the primary cause of poor detection and classification. Although the augmented image data increases the system's accuracy, it can also lead to errors in detection and classification. Nevertheless, according to the test results, tools for diagnosing prostate cancer with high accuracy can be created using artificial intelligence techniques such as object detection algorithms.
YOLO has also been used for detecting fruit [12]. This research developed YOLO-Tomato, a more accurate and compact version of the YOLOv3 model for tomato detection. Compared to other state-of-the-art detection techniques, YOLO-Tomato performed the best.
Another study [13] suggests a better method for dealing with mask-wearing detection based on YOLO-v4. The results demonstrate that the accuracy of mask detection in this study is higher than in other algorithms. Regardless, the situation of false and missed detection has evolved. Furthermore, the algorithm presented in this paper reduces the model's training costs and complexity requirements, allowing the algorithm to be extended to other object detection tasks like the detection of mask wear by staff, passengers, students, and other patients in addition to being deployed on medium devices.
The research [14] has introduced LabelStoma, an application for measuring stomatal density. LabelStoma uses the YOLO algorithm to detect stomata and has 0.91 in F1-Score when tested on photos from the species used for training. This application will contribute to a better understanding of processes related to plant gas exchange as well as the carbon and water cycles.
This study [15] created a deep learning target identification method called YOLO-SASE based on the YOLO algorithm to enhance the ability to detect infrared tiny targets in complex backgrounds. The algorithm proposed in this study's experiment had accuracy and recall rates that were 2% and 3% higher than those of the original model, and the stability of the results was substantially better than during training.
This research builds a model for detecting Indonesian food using YOLOv5, which is regarded as one of the best-performing recent object detection models. This is supported by research [16] on the development of all five versions of YOLO, which concluded that the latest version, YOLOv5, performs better than YOLOv4 in terms of speed and accuracy. After food has been successfully detected, the nutritional estimation is carried out by taking per-serving information sourced from the FatSecret Indonesia website.

II. RESEARCH METHOD
The research process consisted of seven stages: data collection, data pre-processing, training, testing, evaluation, image detection to display nutritional information, and model export to a file with the .pt extension. The process is presented in Figure 1. The first stage was data collection, covering both images and nutritional information. The image datasets were sourced from Kaggle, Roboflow, and Google Images, and the nutrition dataset was sourced from the FatSecret Indonesia website. The Indonesian food classes used in this study are bakso kuah, batagor kering, gado-gado, nasi goreng, pempek, sate ayam, and soto Lamongan.
The data pre-processing consists of image annotation, splitting the image dataset, resizing images, and image augmentation:
1. Image annotation. This is the stage of labeling classes and drawing bounding boxes on the image. The result of this process is a file with the extension .txt [17]. The file contains, for each object, the object class represented by a number, the x coordinate, the y coordinate, the width, and the height of the object [18].
2. Split image datasets. After the images have been annotated, the dataset is divided into training, validation, and testing data.
3. Resize the image. This process makes the image sizes uniform [19] and adjusts them to the input size read by the model [20].
4. Image augmentation. According to Perez & Wang, image augmentation modifies an image so that the computer treats the modified image as a different image, while humans can still recognize it as the same image [21]. This process is carried out on the training data by cropping the image, rotating it, adjusting its brightness and exposure, and combining images into a collage or mosaic.
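The annotation file described in step 1 can be illustrated with a short sketch. It assumes the usual YOLO label convention, in which each line holds a class number followed by the box center and size normalized to the range 0 to 1. The class number, the sample line, and the 640×640 image size are hypothetical values chosen for illustration, not data from this study.

```python
from dataclasses import dataclass


@dataclass
class YoloBox:
    class_id: int    # object class represented by a number
    x_center: float  # box center x, normalized to [0, 1]
    y_center: float  # box center y, normalized to [0, 1]
    width: float     # box width, normalized to [0, 1]
    height: float    # box height, normalized to [0, 1]


def parse_yolo_line(line: str) -> YoloBox:
    """Parse one line of a YOLO-format .txt annotation file."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    x, y, w, h = (float(p) for p in parts[1:])
    return YoloBox(int(parts[0]), x, y, w, h)


def to_pixel_corners(box: YoloBox, img_w: int, img_h: int):
    """Convert a normalized box to pixel (x_min, y_min, x_max, y_max)."""
    x_min = (box.x_center - box.width / 2) * img_w
    y_min = (box.y_center - box.height / 2) * img_h
    x_max = (box.x_center + box.width / 2) * img_w
    y_max = (box.y_center + box.height / 2) * img_h
    return tuple(round(v) for v in (x_min, y_min, x_max, y_max))


# Hypothetical annotation line: class 3, box centered in the image.
sample_box = parse_yolo_line("3 0.5 0.5 0.25 0.5")
sample_corners = to_pixel_corners(sample_box, 640, 640)
```

Keeping the coordinates normalized means the same annotation stays valid after the resize step, since only the image dimensions passed to the conversion change.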
Training is the process of fitting the model to the prepared training data [20]. The model used in this research is YOLOv5. Validation data is also used in this process to measure accuracy while training is in progress [22].
YOLOv5 can outperform other YOLO models in terms of accuracy and speed [16]. In research conducted by Yan, where robots were given the task of picking apples, the YOLOv5 model was compared to the YOLOv3 and YOLOv4 models. In that study, mAP increased by 14.95% and 4.75%, respectively [9].
The architecture of YOLOv5 is implemented in PyTorch and divided into three parts, namely the backbone, neck, and head, as shown in Figure 2. The backbone uses the Focus and CSPDarknet53 structures, which reduce the required CUDA memory, reduce the number of layers, and speed up forward propagation and backpropagation. The neck uses PANet, which adopts the Feature Pyramid Network (FPN) and includes several bottom-up and top-down layers. This improves the propagation of low-level features in the model. The last component in this architecture, the head, uses the same components as the third and fourth versions and produces three different feature map outputs to achieve multiscale prediction. This helps the model predict small to large objects efficiently.
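The multiscale prediction performed by the head can be illustrated with a small calculation. The sketch below assumes a 640×640 input and detection strides of 8, 16, and 32. These are common YOLOv5 defaults but are not stated in this study, and the function name is illustrative.

```python
def detection_grid_sizes(image_size: int = 640, strides=(8, 16, 32)):
    """Return the side length of each square output feature map.

    A smaller stride gives a finer grid, which suits small objects;
    a larger stride gives a coarser grid, which suits large objects.
    """
    return [image_size // stride for stride in strides]


# Assumed default input size of 640 pixels per side.
grids = detection_grid_sizes()
```

Under these assumptions the three feature maps are 80×80, 40×40, and 20×20 cells, so each object can be matched at the scale closest to its size.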
In the YOLOv5 architecture presented in Figure 2, the Focus layer on the backbone evolved from the YOLOv3 structure, replacing what were previously three layers with a single layer in this fifth version. Conv denotes a convolution layer. C3 consists of three convolution layers and a module built from a series of bottlenecks. SPP is a pooling layer used to remove the network's fixed input size limitation. Upsample upsamples the previous layer's fusion at the closest node. Concat is a concatenation layer that joins the previous layer's output with another feature map. The third Conv2d is a detection module used in the head.
The testing process evaluates the trained model using test data. The testing phase continues with predicting the bounding box that frames each detected object. The output of this process is a set of food pictures showing the detected objects together with the confidence score of each detection [23].
The evaluation process is used to evaluate the performance of a model using evaluation metrics such as the confusion matrix, accuracy, precision, recall, and F1-score.

Confusion Matrix
The Confusion Matrix is a table that describes model performance on a test data set with known values [24]. The table is shown in Table 1.
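For reference, the four metrics discussed in the following subsections are usually written in terms of the confusion matrix cells. The forms below are the standard textbook definitions and are assumed, not confirmed, to match the equations cited in this section. Here TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives for a given class.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-Score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```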

Accuracy
Following Sokolova & Lapalme, the accuracy score describes how accurately the model as a whole detects data correctly. The accuracy calculation is stated in equation (1) [25].

Precision
Precision is the ratio of images correctly detected as positive by the system to all images detected as positive. Precision is stated in equation (2) [26].

Recall
Recall is the number of images correctly detected as positive by the system divided by the total number of positive samples [18] [26]. Recall is stated in equation (3).

F1-Score
The F1-Score combines the precision and recall computations. The worst F1-Score is 0, while the best F1-Score is 1. A good F1-Score indicates that the resulting model has good precision and recall scores [27]. The F1-Score is stated in equation (4) [26].
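As a worked illustration of these metrics, the following sketch derives per-class accuracy, precision, recall, and F1-Score from a multi-class confusion matrix by treating one class as positive and all others as negative. The 3×3 matrix is made-up toy data, and the row/column convention (rows are true classes, columns are predicted classes) is an assumption of this example, not a description of the study's Table 1.

```python
def per_class_metrics(confusion, class_index):
    """Compute one-vs-rest metrics for a single class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp                   # missed positives
    fp = sum(row[class_index] for row in confusion) - tp    # false alarms
    tn = total - tp - fn - fp

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy confusion matrix for three hypothetical classes (not study data).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
toy_metrics = per_class_metrics(toy_confusion, 0)
```

For class 0 in this toy matrix, TP = 8, FN = 2, FP = 1, and TN = 19, which gives an accuracy of 27/30 = 0.9, a precision of 8/9, and a recall of 0.8. Averaging such per-class values over all classes yields the kind of summary figures reported for a whole model.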
If the evaluation results reach the desired level, the work proceeds to the next stage. If they do not, the process is repeated from the beginning and refined until the expected evaluation results are achieved.
The next stage is image detection followed by the display of nutritional information. In this process, the model detects images taken from the training dataset, and the detection results, in the form of strings, are used to retrieve nutritional information from the FatSecret Indonesia website.
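The step of turning a detected class string into nutritional information can be sketched as a simple lookup. This is a minimal illustration that assumes the per-serving data has already been stored locally in a dictionary; it does not reproduce how the study retrieves data from the website. The function and key names are hypothetical, and the values are the per-serving figures reported in Section III.

```python
# Per-serving values as reported in Section III of this paper.
NUTRITION_PER_SERVING = {
    "batagor kering": {"energy_kcal": 290, "protein_g": 14.92,
                       "fat_g": 29.14, "carbohydrate_g": 10.28},
    "nasi goreng": {"energy_kcal": 250, "protein_g": 9.28,
                    "fat_g": 31.38, "carbohydrate_g": 9.39},
    "sate ayam": {"energy_kcal": 101, "protein_g": 6.67,
                  "fat_g": 2.19, "carbohydrate_g": 8.79},
    "soto lamongan": {"energy_kcal": 312, "protein_g": 14.92,
                      "fat_g": 19.55, "carbohydrate_g": 24.01},
}


def lookup_nutrition(label: str):
    """Return the per-serving record for a detected label, or None."""
    return NUTRITION_PER_SERVING.get(label.strip().lower())


def describe(label: str) -> str:
    """Format the nutrition record of a detected label for display."""
    info = lookup_nutrition(label)
    if info is None:
        return f"{label}: no nutrition data available"
    return (f"{label}: {info['energy_kcal']} kcal, "
            f"protein {info['protein_g']} g, fat {info['fat_g']} g, "
            f"carbohydrates {info['carbohydrate_g']} g")
```

Normalizing the label to lowercase before the lookup keeps the match robust when the class name is capitalized differently in the model's output.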
The Python programming language is utilized in this research, which is executed in Google Colaboratory.
The next process is exporting, or saving, the model to local storage using the PyTorch library. The output is a file with the .pt extension, which can then be used in a website or an Android application.
PyTorch is an open-source machine learning library used for applications such as computer vision. It was created by Facebook and was designed to make models simpler to write than in other frameworks such as TensorFlow. The YOLOv5 model is currently available only for PyTorch, which is why PyTorch was used for this research [28].

III. RESULT AND DISCUSSION
The first process was data collection. The nutritional information was collected on the assumption that the portions of Indonesian food are those given on the FatSecret Indonesia website. This information is used to estimate the nutritional value of each object class. The nutritional information includes energy, fat, carbohydrates, and protein.
Food image data was sourced from Kaggle, Roboflow, and Google Images. The image classes are bakso kuah, batagor kering, gado-gado, nasi goreng, pempek, sate ayam, and soto Lamongan. A total of 2,625 images were collected, with 375 images in each class. Each image contains only one food object.
The data pre-processing consists of these steps:
1. Image Annotations. This stage is carried out using tools from Roboflow. The annotation process is shown in Figure 3.
4. Image Augmentation. The augmentations used in this process include crop, rotation, brightness, exposure, and mosaic. Augmentation is carried out only on the training data, so the training data, originally 300 images per class, becomes 900 images per class, because two types of augmentation are added for each class to increase the variety of the images. With this process, the initial 2,625 images increased to 6,825.
The training process uses YOLOv5 and is carried out on the pre-processed training data. It is executed on the Google Colaboratory platform using the PyTorch library, runs for 150 epochs, and takes about 3 hours and 39 minutes.
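The dataset sizes reported for the augmentation step can be checked with a short calculation. The sketch below uses only figures stated in this section (7 classes, 375 images per class, 300 training images per class growing to 900); the variable and function names are illustrative.

```python
NUM_CLASSES = 7                  # bakso kuah ... soto Lamongan
IMAGES_PER_CLASS = 375           # collected images per class
TRAIN_PER_CLASS = 300            # training images per class before augmentation
AUGMENTED_TRAIN_PER_CLASS = 900  # training images per class after augmentation


def dataset_sizes():
    """Return the dataset totals before and after augmentation."""
    original_total = NUM_CLASSES * IMAGES_PER_CLASS
    # Images kept out of training (validation and testing) are not augmented.
    held_out = NUM_CLASSES * (IMAGES_PER_CLASS - TRAIN_PER_CLASS)
    train_after = NUM_CLASSES * AUGMENTED_TRAIN_PER_CLASS
    return {
        "original_total": original_total,
        "held_out": held_out,
        "train_after": train_after,
        "final_total": train_after + held_out,
    }


sizes = dataset_sizes()
```

The totals come out to 2,625 images before augmentation and 6,300 + 525 = 6,825 images afterwards, which agrees with the reported growth of the dataset.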
The testing process is carried out using test data. Figure 4 shows sample detection results on the test data, including (e) Pempek, (f) Sate Ayam, and (g) Soto Lamongan.
The samples shown in Figure 5 each have a bounding box that frames the detected object, together with the model's confidence score for that detection. For bakso kuah, the model recognized the object with a confidence score of 72%. For batagor kering, the confidence score was 93%. For gado-gado, it was 65%, while nasi goreng obtained 90%. For pempek, the confidence score was 89%, and for sate ayam it was 90%. For the last object, soto Lamongan, the confidence score was 90%.
Once the testing procedure is complete, the next step is to evaluate the results using the confusion matrix to determine the accuracy, precision, recall, and F1-Score. The evaluation results for each class are as follows. Based on Table 2, the highest accuracy is obtained in the soto Lamongan class, with a value of 99.59%. The highest precision is found in the batagor kering class, with a value of 1.00. The highest recall is found in the sate ayam and soto Lamongan classes, each with a value of 1.00. The highest F1-Score is in the bakso kuah and soto Lamongan classes, each with a value of 0.98.
From the evaluation results in Table 2, the average accuracy over all classes in this study was 98.61%, the average precision was 0.95, the average recall was 0.96, and the average F1-Score was 0.95.
After the authors tested the model using a nutritional dataset, the next process was to display nutritional information for each food that was successfully detected. The results of this process are shown in Figure 6. Based on Figure 7, the model was successful in detecting the batagor kering image and estimating its nutrition. The nutrients contained in batagor kering include energy of 290 kcal, protein of 14.92 g, fat of 29.14 g, and carbohydrates of 10.28 g. Based on Figure 9, the model was successful in detecting the nasi goreng image and estimating its nutrition. The nutrients contained in nasi goreng include energy of 250 kcal, protein of 9.28 g, fat of 31.38 g, and carbohydrates of 9.39 g. Based on Figure 11, the model was successful in detecting the sate ayam image and estimating its nutrition. The nutrients contained in sate ayam include energy of 101 kcal, protein of 6.67 g, fat of 2.19 g, and carbohydrates of 8.79 g. Based on Figure 12, the model was successful in detecting the soto Lamongan image and estimating its nutrition. The nutrients contained in soto Lamongan include energy of 312 kcal, protein of 14.92 g, fat of 19.55 g, and carbohydrates of 24.01 g.
The weights of the trained model are stored automatically after the training process is complete. This is the model export process. The model is stored with the .pt file extension, which is a common convention of the PyTorch library.

IV. CONCLUSION
An image detection system for Indonesian food that estimates nutritional information has been built using the YOLOv5 model. The YOLOv5 model performs very well for the detection of Indonesian specialties, as evidenced by an average accuracy of 98.6%, an average precision of 0.95, an average recall of 0.96, and an average F1-Score of 0.95. The detection results are then used to estimate nutrition by taking per-portion information from the FatSecret Indonesia website. In experiments on seven pictures of Indonesian food, the estimation worked well and displayed nutritional information including energy, protein, fat, and carbohydrates. The outcome of this research is an object detection model that is ready to be implemented in Android applications or websites. Further research could implement the model as an Android application or website with supporting features, such as recording daily calorie intake, to make it easier for users.