# 1 - Investigating the Classification of Breast Cancer Subtypes using KMeans

This project provides an insight into an investigation of the classification of breast cancer sub-types using proetomic dataset through a machine learning approach.

Status: draft, Type: Project

Kehinde Ezekiel, su21-reu-362, Edit

Keywords: AI, cancer, breast, algorithms, machine learning, healthcare, subtypes, classification.

## 1. Introduction

Breast cancer is the most common cancer, and also the primary cause of mortality due to cancer in females around the World. It is an heterogenous disease that is characterized by the abnormal growth of cells in the breast region1. Early diagnosis and detection of possible cancerous cells in the breast usually increase survival chances and provide a better approach for treatment and management. Treatment and management often depend on the stage of cancer, the subtype, the tumor size, location and many other factors. During the last 20 years, four major intrinsic molecular subtypes for breast cancer- luminal A, luminal B, HER2-enriched and Basal-like have been identified, classified and intensively studied. Each subtype has its distinct morphologies and clinical treatment. The classification is based on gene expression profiling, specificaly defined by mRNA expression of 50 genes (also known as, PAM50 Genes). This test is known as the PAM50 test. The accurate grouping of breast cancer into its relevant subtypes can improve accurate treatment-decision making2. The PAM50 test is now known as the Prosigna Breast Cancer Prognostic Gene Signature Assay 50 (known as Prosigna) and it analyzes the activity of certain genes in early-stage, hormone-receptor-positive breast cancer3. This classification is based on the mRNA expression and the activity of 50 genes and it aims to estimate the risk of distant reccurrence of breast cancer. Since the assay was based on mRNA expression, this project suggested that a classification based on the final product of mRNA, that is protein, can be implemented to investigate its role in the classifictaion of molecular breast cancer subtypes. As a result, the project was focused on the use of a proteomic dataset which contained published iTRAQ proteome profiling of 77 breast cancer samples and expression values for the proteins of each sample.

Most times, breast cancer is diagnosed and detected through a combination of different approaches such as imaging (e.g. mammogram and ultrasound), physical examination by a radiologist and biopsy. Biopsy is used to confirm the breast cancer symptoms. However, research has shown that radiologists can miss up to 30% of breast cancer tissues during detection4. This gap has brought about the introduction of Computer aided Diagnosis (CAD) systems can help detect abnormalities in an efficient manner. CAD is a technology that includes utilizing the concept of artificial intelligence(AI) and medical image processing to find abnormal signs in the human body5. Machine Learning is a subset of AI and it has several algorithms that can be used to build a model to perform a specific task or to predict a pattern. KMeans is one of such algorithm.

Building a model using machine learning involves selecting and preparing the appropriate dataset, identifying the accurate machine learnning algorithm to use, training the algorithm on the data to build a model, validating the resulting model’s performance on testing data and using the model on a new data6. In this project, KMeans was the algorithm used in this project, the datasets were prepared through several procedures like filtering, merging. KMeans clustering method was used to investigate the classification of the molecular subtypes. Its efficacy is often tested by a silhouette score. A silhouette score shows how similar an object is to its own cluster and it ranges from -1 to 1 where a high values indicates that an object is well matched to its own cluster. A homogeneity score determines if a cluster should only contain samples that belong to a particular class. It ranges from a value between 0 to 1 with low values indicating a low homogeneity.

The project investigated the efficient number of clusters that could be generated for the proteome dataset which would consequetly provide an optimal classification of the protein expression values for the breast cancer samples. The proteins that were used in the KMeans analysis were the proteins that were associated with the PAM50 genes. The result of the project could provide insights to medical scientists and researchers to identify any interelatedness between the original classification of breast cancer molecular subtypes.

## 2. Datasets

Datasets are eseential in drawing conclusion. In the diagnosis, detection and classification of breast cancecr, datasets have been essential to draw conclusion by identifying patterns. These datasets range from imaging datasets to clinical datasets, proteomic datasets etc. Large amounts of data have been collected due to new technological and computational advances like the use of websites like NCBI, equipments like Electroencephalogram (EEG) which record clinical information. Medical researchers leverage these datasets to make useful health care decisions that affect a region, gender or the world. The need for accuracy and reproducibilty has led to the use of machine learning as an important tool for drawing conclusions.

Machine Learning involves training a piece of software, also known as model, to idnetify patterns from a dataset and make useful predictions. There are several factors to be considered when using datasets. One of such is data privacy. Recently, measures have been taken to ensure that the privacy of data. Some of these measures include, replacing codes for patients name, using documents and mobile applications that ask for permission from patients before using their data. Recently, the World Health Organization (WHO) made her report on AI and provided priniples that ensure that AI works for all. On of such is that the designer of AI technologies should satisfy regulatory requirements for safety, accuracy and efficacy for well-defined use cases or indications. Measures of quality control in practice and quality improvement in the use of AI must be available7. Building a model using machine learning involves selecting and preparing the appropriate dataset, identifying the accurate machine learnning algorithm to use, training the algorithm on the data to build a model, validating the resulting model’s performance on testing data and using the model on a new data6. In this project, KMEans was the algorithm used in this project, the datasets were prepared through several procedures like filtering, merging.

## 3. The KMeans Approach

KMeans clustering is an unsupervised machine learning algorithm that makes inferences from datasets without referring to a known outcome. It aims to identify underlying patterns in a dataset by looking for a fixed number of clusters, (known as k). The required number of clusters is chosen by the person building the model. KMeans was used in this project to classify the protein IDs (or RefSeq_IDs) into clusters. Each cluster was designed to be associated with related protein IDs.

Three datasets were used for the algorithm. The first and main dataset was a proteomic dataset. It contained published iTRAQ proteome profiling of 77 breast cancer samples generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH). Each sample contained expression values for ~12000 proteins, with missing values present when a given protein could not be quantified in a given sample. The variables in the dataset included the RefSeq_accession_number(also known as RefSeq protein ID), “the gene_symbol” (which was unique to each gene), “the gene_name” (which was the full name of the gene). The remaining columns were the log2 iTRAQ ratios for each of the 77 samples while the last three columns are from healthy individuals.

The second dataset was a PAM50 dataset. It contained the list of genes and proteins used in the PAM50 classification system. The variables include the RefSeqProteinID which matched the Protein IDs(or RefSeq_IDs) in the main proteome dataset.

The third dataset was a clinical data of about 105 clinical breast cancer samples. 77 of the breast cancer samples were the samples in the first dataset. The excluded samples were as a result of protein degradation8. The variables in the dataset are: ‘Complete TCGA ID', ‘Gender’, ‘Age at Initial Pathologic Diagnosis’, ‘ER Status’, ‘PR Status’, ‘HER2 Final Status’, ‘Tumor’, ‘Tumor–T1 Coded’, ‘Node’, ‘Node-Coded’, ‘Metastasis’, ‘Metastasis-Coded’, ‘AJCC Stage’, ‘Converted Stage’, ‘Survival Data Form’, ‘Vital Status’, ‘Days to Date of Last Contact’, ‘Days to date of Death’, ‘OS event’, ‘OS Time’, ‘PAM50 mRNA’, ‘SigClust Unsupervised mRNA’, ‘SigClust Intrinsic mRNA’, ‘miRNA Clusters’, ‘methylation Clusters’, ‘RPPA Clusters’, ‘CN Clusters’, ‘Integrated Clusters (with PAM50)’, ‘Integrated Clusters (no exp)’, ‘Integrated Clusters (unsup exp).’

During the preparation of the datasets for KMeans analysis, unused columns like “gene_name” and “gene_symbol” were removed in the first dataset. The first and third dataset were merged together. Prior to merging, the variable ‘Complete TCGA ID’ in the third dataset was found to be the same as the TCGAs in the first dataset. The Complete TCGA ID refered to a breast cancer patient, some patients were found in both datasets. The TCGA ID in the first dataset was renamed to match with the TCGA of the third dataset, thereby giving the same syntax. The first dataset was also transposed as a row and its gene expression as the columns. These processes were done in order to merge both dataset efficiently.

After merging, the “PAM5O RNA” variable from the second dataset was selected to join the merged dataset. This single dataset was named “pam50data”. It contained all the variables that were needed for KMeans Analysis which included the genes that were used for the PAM50 classification (only 43 were available in the dataset), the complete TCGA ID of each 80 patient, and their molecular tumor type. Missing values in the dataset were imputed using SimpleImputer. Then, KMeans clustering was performed. The metrics were tested with cluster numbers of 3, 4, 5, 20 and 79. The bigger numbers (20 and 79) were tested just for comparison. Further details on the codes written can be found in 9. Also, 10 and 11 were kernels that provided insights for the written code.

## 5. Results and Images

Several codes were written to determine the best number of clusters for the model. The effectiveness of a cluster is often measured by scores such as silhouette score, homogeneity score and adjusted rand score.

The silhouette score for a cluster of 3, 4, and 5, 8, 20 and 79 were 0.143, 0.1393, 0.1193, 0.50968, 0.0872, 0.012 while the homogenenity scores were 0.4635, 0.4749, 0.1193, 0.5617, 0.6519 and 1.0 respectively. The homogeneity score for 79 is 1.0 since the algorithm can assign all the points into sepearate clusters. However, it is not efficient for the dataset we used. A cluster of 3 works best since the silhouette score is high and the homogeneity score jumps ~2-fold.

Figures 1 and 2 show the results of the visualization of the clusters of 3 and 4.

Figure 1: The classification of Breast Cancer Moleecular Subtypes using KMeans Clustering. (k=3). Each data point represnt the expression value for the genes that were used for clustering.

Figure 2: The classification of Breast Cancer Moleecular Subtypes using KMeans Clustering. (k=4). Each data point represnt the expression value for the genes that were used for clustering.

## 6. Benchmark

This program was executed on a Google Colab server and the entire runtime took 1.012 seconds Table 1 lists the amount of time taken to loop for n_components. The n_components is gotten from the code and it refers to the features of the dataset.

Name Status Time(s)
parallel 1 ok 0.647
parallel 3 ok 0.936
parallel 5 ok 0.952
parallel 7 ok 0.943
parallel 9 ok 1.002
parallel 11 ok 0.991
parallel 13 ok 0.958
parallel 15 ok 1.012

Benchmark: The table shows the parallel process time take the for loop for n_components.

## 7. Conclusion

The results of the KMeans analysis showed that three clusters provided an optimal result for the classification using a proteomic dataset. A cluster of 3 provided a balanced silhouette and homogeneity score. This predict that some interrelatedness could exist between the original PAM50 subtype classfication, since the result of classifying a protein dataset using a machine learning algorithm identified a cluster of 3 as one with the optimal result. Also, future research could be done by using other machine learning algorithms, possibly a supervised learning algotithm, to identify the correlation between the clusters and the four molecular subtypes. This model can be improved on and if proven to show that there truly exist a relationship between the four molecular subtypes, more research could be done to identify the factors that contribure to the interelatedness. This would lead medical scientists and researchers to work on better innovative methods that will aid the treatment and management of breast cancer.

## 8. Acknowledgments

This projected was immensely supported by Dr. Gregor von Laszewski. Also, a big appreciation to the REU Instructors (Carlos Theran, Yohn Jairo and Victor Adankai) for their contribution, support, teachings and advice. Also, gratitude to my colleagues who helped me out; Jacques Fleischer, David Umanzor and Sheimy Paz Serpa. gratitude to my colleagues. Lastly, appreciation to Dr. Byron Greene, the Florida A&M University, the Indiana University and Bethune Cookman University for providing a platform to be able to learn new things and embark on new projects.

## 9. References

1. Akram, Muhammad et al. “Awareness and current knowledge of breast cancer.” Biological research vol. 50,1 33. 2 Oct. 2017, doi:10.1186/s40659-017-0140-9 ↩︎

2. Wallden, Brett et al. “Development and verification of the PAM50-based Prosigna breast cancer gene signature assay.” BMC medical genomics vol. 8 54. 22 Aug. 2015, doi:10.1186/s12920-015-0129-6 ↩︎

3. Breast Cancer.org Prosigna Breast Cancer Prognostic Gene Signature Assay. https://www.breastcancer.org/symptoms/testing/types/prosigna ↩︎

4. L. Hussain, W. Aziz, S. Saeed, S. Rathore and M. Rafique, “Automated Breast Cancer Detection Using Machine Learning Techniques by Extracting Different Feature Extracting Strategies,” 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 2018, pp. 327-331, doi: 10.1109/TrustCom/BigDataSE.2018.00057. ↩︎

5. Halalli, Bhagirathi et al. “Computer Aided Diagnosis - Medical Image Analysis Techniques.” 20 Dec. 2017, doi: 10.5772/intechopen.69792 ↩︎

6. Salod, Zakia, and Yashik Singh. “Comparison of the performance of machine learning algorithms in breast cancer screening and detection: A protocol.” Journal of public health research vol. 8,3 1677. 4 Dec. 2019, doi:10.4081/jphr.2019.1677Articles ↩︎

7. WHO, WHO issues first global report on Artificial Intelligence (AI) in health and six guiding principles for its design and use. https://www.who.int/news/item/28-06-2021-who-issues-first-global-report-on-ai-in-health-and-six-guiding-principles-for-its-design-and-use ↩︎

8. Mertins, Philipp et al. “Proteogenomics connects somatic mutations to signalling in breast cancer.” Nature vol. 534,7605 (2016): 55-62. doi:10.1038/nature18003 ↩︎

9. Kehinde Ezekiel, Project Code, https://github.com/cybertraining-dsc/su21-reu-362/blob/main/project/code/final_breastcancerproject.ipynb ↩︎

10. Kaggle_breast_cancer_proteomes «https://pastebin.com/A0Wj41DP> ↩︎

11. Proteomes_clustering_analysis https://www.kaggle.com/shashwatwork/proteomes-clustering-analysis ↩︎

# 2 - Project: Detection of Autism Spectrum Disorder with a Facial Image using Artificial Intelligence

This project uses artificial intelligence to explore the possibility of using a facial image analysis to detect Autism in children. Early detection and diagnosis of Autism, along with treatment, is needed to minimize some of the difficulties that people with Autism encounter. Autism is usually diagnosed by a specialist through various Autism screening methods. This can be an expensive and complex process. Many children that display signs of Autism go undiagnosed because there families lack the expenses needed to pay for Autism screening and diagnosing. The development of a potential inexpensive, but accurate way to detect Autism in children is necessary for low-income families. In this project, a Convolutional Neural Network (CNN) is utilized, along with a dataset obtained from Kaggle. This dataset consists of collected images of male and female, autistic and non-autistic children between the ages of two to fourteen years old. These images are used to train and test the CNN model. When one of the images are received by the model and importance is assigned to various features in the image, an output variable (autistic or non-autistic) is received.

Status: final: Project

Myra Saunders, su21-reu-378, Edit

Keywords: Autism Spectrum Disorder, Detection, Artificial Intelligence, Deep Learning, Convolutional Neural Network.

## 1. Introduction

Autism Spectrum Disorder (ASD) is a broad range of lifelong developmental and neurological disorders that usually appear during early childhood. Autism affects the brain and can cause challenges with speech and nonverbal communication, repetitive behaviors, and social skills. Autism Spectrum Disorder can occur in all socioeconomic, ethnic, and racial groups, and can usually be detected and diagnosed from the age of three years old and up. As of June 2021, the World Health Organization has estimated that one in 160 children have an Autism Spectrum Disorder worldwide1. Early detection of Autism, along with treatment, is crucial to minimize some of the difficulties and symptoms that people with Autism face2. Symptoms of Autism Spectrum Disorder are normally identified based on psychological criteria3. Specialists use techniques such as behaivoral observation reports, questionaires, and a review of the child’s cognitive ability to detect and diagose Autism in children.

Many researchers believe that there is a correlation between facial morphology and Autism Spectrum Disorder, and that people with Autism have distinct facial features that can be used to detect their Autism Spectrum Disorder4. Human faces encode important markers that can be used to detect Autism Spectrum Disorder by analyzing facial features, eye contact,facial movements, and more5. Scientists found that children diagnosed with Autism share common facial feature distinctions from children who are not diagnosed with Autism6. Some of these facial features are wide-set eyes, short middle region of the face, and a broad upper face. Figure 1 provides an example of the facial feature differences between a child with Autism and a child without.

Figure 1: Image of Child with Autism (left) and Child with no Autism (right)7.

Due to the distinct features of Autistic individuals, we believe that it is necessary to explore the possiblities of using a facial analysis to detect Autism in children, using Artificial Intelligence (AI). Many researchers have attempted to explore the possibility of using various novel algorithms to detect and diagnose children, adolescents, and adults with Autism2. Previous research has been done to determine if Autism Spectrum Disorder can be detected in children by analyzing a facial image7. The author of this research collected approximately 1500 facial images of children with Autism from websites and Facebook pages associated with Autism. The facial images of non-autistic children were randomly downloaded from online and cropped.The author aimed to provide a first level screening for autism diagnosis, whereby parents could submit an image of their child and in return recieve a probability of the potential of Autism, without cost.

To contribute to this previous research7, this project will propose a model that can be used to detect the presence of Autism in children based on a facial image analysis. A deep learning algorithm will be used to develop an inexpensive, accurate, and effective method to detect Autism in children. This project implements and utilizes a Convolutional Neural Network (CNN) classifier to explore the possibility of using a facial image analysis to detect Autism in children, with an accuracy of 95% or higher. Most of the coding used for this CNN model was obtained from the Kaggle dataset and was done by Fran Valuch8. We made changes to some parts of this code, which will be discussed further in this project. The goal of this project is not to diagnose Autism, but to explore the possibility of detecting Autism at its early stage, using a facial image analysis.

Previous work exists on the use of artificial intelligence to detect Autism in children using a facial image. Most of this previous work used the Autism kaggle dataset7, which was also used for this project. One study utilized MobileNet followed by two dense layers in order to perform deep learning on the dataset6. MobileNet was used because of its ability to compute outputs much faster, as it can reduce both computation and model size. The first layer was dedicated to distribution, and allowed customisation of weights to input into the second dense layer. The second dense layer allowed for classification. The architecture of this algorithm is shown below in Figure 2.

Figure 2: Algorithm Architecture using MobileNet6.

Training of this model completed after fifteen epochs, which resulted in a test accuracy of 94.64%. In this project we utilize a classic Convolutional Neural Network model using tensorflow. This will be done in hopes of obtaining a test accuracy of 95% or higher.

## 3. Dataset

The dataset used for this project was obtained from Kaggle7. This dataset contained approximately 1500 facial images of children with Autism that were obtained from websites and Facebook pages associated with Autism. The facial images of non-autistic children were randomly downloaded from online. The pictures obtained were not of the best quality or consistency with respect to the facial alignment. Therefore, the author developed a python program to automatically crop the images to include only the extent possible for a facial image. These images consist of male and female children that are of different races and range from around ages two to fourteen.

This project uses version 12 of this dataset, which is the latest version. The dataset consists of three directories labled test, train, and valid, along with a CSV file. The training set is labeled as train, and consists of ‘Autistic’ and ‘Non-Autistic’ subdirectories. These subdirectories contain 1269 images of autistic and 1269 images of non-autistic children respectively. The validation set located in the valid directory are separated into ‘Autistic’ and ‘Non-autistic’ subdirectories. These subdirectories also contain 100 images of autistic and 100 images of non-autistic children respectively. The testing set located in the test directory is divided into 100 images of autistic children and 100 images of non-autistic children. All of the images provided in this dataset are in 224 X 224 X 3, jpg format. Table 1 provides a summary of the content in the dataset.

Table 1: Summary Table of Dataset.

## 4. Proposed Methodology

Convolutional Neural Network (CNN)

This project utilizes a Convolution Neural Network (CNN) to develop a program that can be used to detect the presence of Autism in children from a facial image analysis. If successful this program can be used an inexpensive method to detect Autism in children at its early stages. We believed that a CNN model would be the best way create this program because of its little dependence on preprocessing data. A Convolutional Neural Network was also used becuase of its ability to take in an image and assign importance to, and identify diferent objects within the image. CNN also has very high accuracy when dealing with image recognition. The dataset used contains 1269 training images that were used to train and test this Convolution Neural Network model. The architecture of this model can be seen in Figure 3.

Figure 3: Architecture of utilized Convolutional Neural Network Model.

## 5. Results

The results of this project is estimated by affectability and accuracy by utilizing the Confusion Matrix CNN. The results also rely on how correct and precise the model was trained. This model was created to explore the possibility of detecting Autism in children at its early stage, using a facial image analysis. A Convolutional Neural Network classifier was used to create this model. For this CNN model we utilized max pooling and Rectified Linear Unit (ReLU), with two epochs. This resulted in an accuracy of 71%. These results can be seen below in Figure 4. Figure 5 displays some of the images that were classified and labeled correctly (right) and the others that were labeled incorrectly (left).

Figure 4: Results after Execution.

validation loss: 57% - validation accuracy: 68% - training loss: 55% - training accuracy: 71%

Figure 5: Correct Labels and Incorrect Labels.

## 6. Benchmark

Figure 6 shows the Confusion Matrix of the Convolutional Neural Network model used in this project. The Confusion Matrix displays a summary of the model’s predicted results after its attempt to classify each image as either autistic or non-autistic. Out of the 200 images, 159 of the images were labled correctly and 41 of the images were labled incorrectly.

Figure 6: Confusion Matrix of the Convolutional Neural Network model.

Cloudmesh-common9 was used to create a Stopwatch module, that was used to measure and study the training and testing time of the model. Table 2 shows the cloudmesh benchmark output.

Table 2: Cloudmesh Benchmark

Name Status Time Sum Start tag msg Node User OS Version
Train ok 3745.28 3745.28 2021-08-10 16:08:57 dab8db0489cd collab Linux #1 SMP Sat Jun 5 09:50:34 PDT 2021
Test ok 2.088 2.088 2021-08-10 17:43:09 dab8db0489cd collab Linux #1 SMP Sat Jun 5 09:50:34 PDT 2021

## 7. Conclusions and Future Work

Autism Spectrum Disorder is a broad range of lifelong developmental and neurological disorders that is considered one of the most growing disorders in children. The World Health Organization has estimated that one in 160 children have an Autism Spectrum Disorder worldwide1. Techniques that are used by specialists to detect autism can be time consuming and inconvenient for some families. Considering these factors, finding effective and essential ways to detect Autism in children is a neccesity. The aim of this project was to create a model that would analyze facial images of children, and in return determine if the child is Autistic or not. This was done in hopes of receiving 95% accuracy or higher. After executing the model we received an accuracy of 71%.

As shown in the results section above, some of the pictures that were initially labeled as Autistic, were labeled incorrectly after running the model. This low accuracy rate could be improved if the CNN model is combined with other algorithms such as transfer learning and VGG-19. This low accuracy could also be improved by using a dataset that includes a wider variety and larger amount of images. We could also ensure that images in the dataset includes children that are of a wider age range. These improvements could possibly increase our chances of obtaining an accuracy of 95% or higher. When this model is improved and an accuracy of atleast 95% is achieved, furture work can be done to create a model that can be used for Autistic individuals outside of the dataset age range (2 - 14 years old).

## 8. Acknowledgments

The author of this project would like to express a vote of thanks to Yohn Jairo, Carlos Theran, and Dr. Gregor von Laszewski for their encouragement and guidance throughout this project. A special vote of thanks also goes to Florida A&M University for funding this wonderful research program. The completion of this project could not have been possible without their support.

## 9. References

1. World Health Organization. 2021. Autism spectrum disorders, [Online resource] https://www.who.int/news-room/fact-sheets/detail/autism-spectrum-disorders ↩︎

2. Raj, S., and Masood, S., 2020. Analysis and Detection of Autism Spectrum Disorder Using Machine Learning Techniques, [Online resource https://reader.elsevier.com/reader/sd/pii/S1877050920308656?token=D9747D2397E831563D1F58D80697D9016C30AAC6074638AA926D06E86426CE4CBF7932313AD5C3504440AFE0112F3868&originRegion=us-east-1&originCreation=20210704171932 ↩︎

3. Khodatars, M., Shoeibi, A., Ghassemi, N., Jafari, M., Khadem, A., Sadeghi, D., Moridian, P., Hussain, S., Alizadehsani, R., Zare, A., Khosravi, A., Nahavandi, S., Acharya, U. R., and Berk, M., 2020. Deep Learning for Neuroimaging-based Diagnosis and Rehabilitation of Autism Spectrum Disorder: A Review. [Online resource] https://arxiv.org/pdf/2007.01285.pdf ↩︎

4. Musser, M., 2020. Detecting Autism Spectrum Disorder in Children using Computer Vision, Adapting facial recognition models to detect Autism Spectrum Disorder. [Online resource] https://towardsdatascience.com/detecting-autism-spectrum-disorder-in-children-with-computer-vision-8abd7fc9b40a ↩︎

5. Akter, T., Ali, M. H., Khan, I., Satu, S., Uddin, Jamal., Alyami, S. A., Ali, S., Azad, A., and Moni, M. A., 2021. Improved Transfer-Learning-Based Facial Recognition Framework to Detect Autistic Children at an Early Stage. [Online resource] https://www.mdpi.com/2076-3425/11/6/734 ↩︎

6. Beary, M., Hadsell, A., Messersmith, R., Hosseini, M., 2020. Diagnosis of Autism in Children using Facial Analysis and Deep Learning. [Online resource] https://arxiv.org/ftp/arxiv/papers/2008/2008.02890.pdf ↩︎

7. Piosenka, G., 2020. Detect Autism from a facial image. [Online resource] https://www.kaggle.com/gpiosenka/autistic-children-data-set-traintestvalidate?select=autism.csv ↩︎

8. Valuch, F., 2021. Easy Autism Detection with TF.[Online resource] https://www.kaggle.com/franvaluch/easy-autism-detection-with-tf/comments ↩︎

9. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark - Cloudmesh-Common, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 3 - Project: Analyzing the Advantages and Disadvantages of Artificial Intelligence for Breast Cancer Detection in Women

Breast Cancer is one of the most dangerous type of disease that affects many women. For detecting Breast Cancer, machine learning techniques are applied to improve the accuracy of diagnosis.

Status: draft, Type: Project

RonDaisja Dunn, su21-reu-377, Edit

Keywords: project, reu, breast cancer, Artificial Intelligence, diagnosis detection, women, early detection, advantages, disadvantages

## 1. Introduction

The leading cause of cancer death in women worldwide is breast cancer. This deadly form of cancer has impacted many women across the globe. Specifically, African American women have been the most negatively impacted. Their death rates due to breast cancer have surpassed all other ethnicities. Serial screening is an essential part in detecting Breast cancer. Detecting the early stages of this disease and decreasing mortality rates is most effective by utilizing serial screening. Some women detect that they could have breast cancer by discovering a painless lump in their breast. Other women began to detect that there may be a problem due to annual and bi-annual breast screenings. Screening in younger women is not likely, because breast cancer is most likely to be detected in older women. Women from the age 55 to 69 are likely to be diagnosed with breast cancer. Women who frequently participate in receiving mammograms reduce the chance of breast cancer mortality.

Artificial Intelligence is the branch of computer science dedicated to the development of computer algorithms to accomplish tasks traditionally associated with human intelligence, such as the ability to learn and solve problems. This branch of computer science coincides with diagnosing breast cancer in individuals because of the use of radiology. Radiological images can be quantitated and can inform and train some algorithms. There are many terms that relate to Artificial Intelligence such as artificial neural networks (ANNs), machine and deep learning (ML, DL). These techniques complete duties in healthcare, including radiology. Machine learning interprets pixel data and patterns from mammograms. Benign or malignant features for inputs are defined by microcalcifications. Deep learning is effective in breast imaging, where it can identify several features such as edges, textures, and lines. More intricate features such as organs, shapes, and lesions can also be detected. Neural networks algorithms are used for image feature extractions that cannot be detected beyond human recognition.

A computer system that can perform complicated data analysis and picture recognition tasks is known as artificial intelligence (AI). Both massive processing power and the application of deep learning techniques made this possible, and are increasingly being used in the medical field. Mammograms are the x-rays used to detect breast cancer in women. Early detection is important to reduce deaths, because that is when the cancer is most treatable. Screenings have presented a 15%-35% false report in screened women. Errors and the ability to view the cancer from the human eye are the reasons for the false reports. Artificial Intelligence offers many advantages when detecting breast cancer. These advantages include less false reports, fewer cases missed because the AI program does not get tired and it reduces the effort of reading thousands of mammograms.

## 2. Methods From Literature Review

The goal was to emphasize the present data in terms of test accuracy and clinical utility results, as well as any gaps in the evidence. Women are screened by getting photos taken of each breast from different views. Two readers are assigned to interpret the photographs in a sequential order. Each reader decides whether the photograph is normal or whether a woman should be recalled for further examination. Arbitration is used when there is a disagreement. If a woman is recalled, she will be offered extra testing to see if she has cancer.

Another goal is to detect cancer at an earlier stage during screening so that therapy can be more successful. Some malignancies found during screening, on the other hand, might never have given the woman symptoms. Overdiagnosis is a term used to describe a situation in which a person has caused harm to another person during their lifetime. As a result, overtreatment (unnecessary treatment) occurs. Since some malignancies are overlooked during screening, the women are misled.

The methods in diagnostic procedures vary between radiologists and Artificial Intelligence networks. In a breast ultrasound exam, radiologists look for abnormal abnormalities in each image, while AI networks analyze each image in an exam that is processed separately using a ResNet-18 model, and a saliency map is generated, identifying the most essential sections. With radiologists, the focus is on photos with abnormal lesions and with AI networks the image is given an attention score based on its relative value. To make a final diagnosis, radiologists consider signals in all photos, and AI computes final predictions for benign and malignant results by combining information from all photos using an attention technique.

## 3. Results From Literature Review

Using pathology data, each breast in an exam was given a label indicating the presence of cancer. Image-guided biopsy or surgical excision were used to collect tissues for pathological tests. The AI system was shown to perform comparably to board-certified breast radiologists in the reader study subgroup. In this reader research, the AI system detected tumors with the same sensitivity as radiologists, but with greater specificity, a higher PPV, and a lower biopsy rate. Furthermore, the AI system outperformed all ten radiologists in terms of AUROC and AUPRC. This pattern was replicated in the subgroup study, which revealed that the algorithm could correctly interpret Ultrasound Screening examinations that radiologists considered challenging.

Figure 1: Analysis of saliency maps on a qualitative level- This figure displays the sagittal and transverse views of the lesion (left) and the AI’s saliency maps indicating the anticipated sites of benign (center) and malignant (right) findings in each of the six instances (a-f) from the reader study.

## 4. Datasets

Figure 2: The probabilistic forecasts of each hybrid model were randomly divided to fit the reader’s sensitivity. The dichotomization of the AI’s predictions matches the sensitivity of the average radiologists. Readers' AUROC, AUPRC, specificity, and PPV improve as a result of the collaboration between AI and readers, whereas biopsy rates decrease.

## 5. Conclusion

There are some benefits of AI help with mammogram screenings. The reduction in treatment expenses is one of the advantages of screening. Treatment for people who are diagnosed sooner is less invasive and expensive, which may lessen patient anxiety and improve their prognosis. One or all human readers could be replaced by AI. AI may be used to pre-screen photos, with only the most aggressive ones being reviewed by humans. AI could be employed as a reader aid, with the human reader relying on the AI system for guidance during the reading process.

However, there is also fear that AI could discover changes that would never hurt women. Because the adoption of AI systems will alter the current screening program, it’s crucial to determine how accurate AI is in breast screening clinical practice before making any changes. It’s uncertain how effective AI is at detecting breast cancer in different sorts of women or in different groups of women (for example different ethnic groups). AI could significantly minimize staff workload, as well as the proportion of cancers overlooked during screening, and the amount of women who are asked to return for more tests despite the fact that they do not have cancer. According to the findings of the reader survey, such teamwork between AI systems and radiologists increases diagnosis accuracy and decreases false positive biopsies for all 10 radiologists. This research indicated that integrating the Artificial intelligence system’s predictions enhanced the performance of all readers.

## 6. Acknowledgments

Thank you to the extremely intellectual, informative, patient and courteous instructors of the Research Experience for Undergraduates Program.

1. Carlos Theran, REU Instructor
2. Yohn Jairo Parra, REU Instructor
3. Gregor von Laszewski, REU Instructor
5. REU Peers
6. Florida Agricultural and Mechanical University

## 7. References

1. Coleman C. Early Detection and Screening for Breast Cancer. Semin Oncol Nurs. 2017 May;33(2):141-155. doi: 10.1016/j.soncn.2017.02.009. Epub 2017 Mar 29. PMID: 28365057
2. Freeman, K., Geppert, J., Stinton, C., Todkill, D., Johnson, S., Clarke, A., & Taylor-Phillips, S. (2021, May 10). Use of Artificial Intelligence for Image Analysis in Breast Cancer Screening. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/987021/AI_in_BSP_Rapid_review_consultation_2021.pdf
3. Li, J., Zhou, Z., Dong, J., Fu, Y., Li, Y., Luan, Z., & Peng, X. (2021). Predicting breast cancer 5-year survival using machine learning: A systematic review. PloS one, 16(4), e0250370.
4. Mendelsonm, Ellen B., Artificial Intelligence in Breast Imaging: Potentials and Limitations. American Journal of Roentgenology 2019 212:2, 293-299
5. Seely, J. M., & Alhassan, T. (2018). Screening for breast cancer in 2018-what should we be doing today?. Current oncology (Toronto, Ont.), 25(Suppl 1), S115–S124.
6. Shamout, F. E., Shen, A., Witowski, J., Oliver, J., & Geras, K. (2021, June 24). Improving Breast Cancer Detection in Ultrasound Imaging Using AI. NVIDIA Developer Blog. https://developer.nvidia.com/blog/improving-breast-cancer-detection-in-ultrasound-imaging-using-ai/

# 4 - Increasing Cervical Cancer Risk Analysis

Cervical Cancer is an increasing matter that is affecting various women across the nation, in this project we will be analyzing risk factors that are producing higher chances of this cancer. In order to analyize these risk factors a machine learning technique is implemented to help us understand the leading factors of cervical cancer.

Status: draft, Type: Project

Theresa Jean-Baptistee, su21-reu-369, Edit

Keywords: Cervical, Cancer, Diseases, Data, conditions

## 1. Introduction

Cervical cancer is a disease that is increasing in various women nationwide. It occurs within the cells of the cervix (can be seen in stage 1 of the image below). This cancer is the fourth leading cancer, where there are about 52,800 cases found each year, predominantly being in lower developed countries. Cervical cancer occurs most commonly in women who are within their 50’s and who has symptoms such as watery and bloody discharge, bleeding, and painful intercourse. Two other common causes can be an early start on sexual activity and multiple partners. The most common way to determine if one may be affected by this disease is through a pap smear. When witnessed early it, can allow a better chance of results and treatment.

Cervical cancer is so important for the future of reproduction, being the cause of a successful or unsuccessful birth with completions like premature a child. The cervix help keeps the fetus stable within the uterus during this cycle, towards the end of development, it softens and dilates for the birth of a child. If diagnosed with this cancer, a miracle would be needed to conceive a child after having treatment. Most treatments begin with a biopsy removing affected areas of cervical tissue. As it continues, to spread radiotherapy might be recommended to treat the cancer where may affect the womb. lastly, one may need to have a hysterectomy which is the removal of the womb.

In this paper, we will study the exact cause and risk factors that may place someone in this position. If spotted early it wouldn’t affect someone’s dream chance of conceiving or affect their reproductive parts. Using various data sets we will study the way everything may alignes in causes and machine leaning would be the primary technique to used interpretate the relation between variables and risk factor on cervical cancer.

## 2. DataSets

The Data sets obatained shows the primary risk factors that affect women ages 15 and above. The few factors that sticked out the most were age, start of sexual activity, tabacoo intake, and IUD. The age and start of sexual activity maybe primary factor because a person is more liable to catch an STD and get this diease from mutiple parnters never really knowing what the other person may be doing outside of the encounterment. Tabcoo intake causes an affect making a person by weaking the immune system and making somone more septable to the disease. The IUD has the highest number on the data set being a primary factor that may put a person at risk, this device aids the prevention of pregency by thickneing the mucos of the cervix that could later cause infection or make your more spetiable to them.

## 3. Other People Works

The research of others work has made a huge imapact to this project starting from data to important knowledge needed to conduct the project. With the various research sites, we were able to witness what the affects various day to day activtie affect women long term. The Cervical Cancer Diagnosis Using a Chicken Swarm Optimization Based Machine Learning Method, was a big aid throught the project explaing the stages of cervical cancer, ways it can be treated, and the affects it may cause. With the data that was used from UCI Machine Learning, we were able to find efficent correlation into the data, helping the implented machine learning algorithm for the classification task.

## 4. Explantion of Confusion Matrix

The confusion matrix generated by multilayer perceptron can be explained as the perdicted summary results from the data obtained. Zero is when no cervical cancer is witnessed, one is when cervical cancer is seen. A hundrend and sixty-two is the highest number of this disease seen on the chart and the lowest number being winessed is two and eight being quiet of a jump.

## 6. Conclusion

In conclusion it can be found as women partake in their first sexual activity and continue they are more at risk. 162 is a dramatic number not necessarily being affected by age and 0 is only seen when a person does not partake in it. In the future I hope to keep furthering my Knowledge on Cervical Cancer, hopefully coming up with a realistic method to cure this disease where one can continue to live their life with as a human being.

## 7. Acknowledgments

The author would like to thank Yohn, Carlos, Gregor, Victor, and Jacques for all of their Help. Thank you!

# 5 - Cyber Attacks Detection Using AI Algorithms

This research is analysing multiple artificial intelligence algorithms to detect cyber attacks

Status: draft, Type: Project

Victor Adankai, su21-reu-365, Edit

Keywords: AI, ML, DL, Cybersecurity, Cyber Attacks.

## 1. Introduction

• Find literature about AI and Cyber Attacks on IoT Devices Dectection.
• Analyze the literature and explain how AI for Cyber Attacks on IOT Devices Detection are beneficial.

#### Types of Cyber Attacks

• Denial of service (DoS) Attack:
• Remote to Local Attack:
• Probing:
• User to Root Attack:
• Poisoning Attack:
• Evasion Attack:
• Integrity Attack:
• Malware Attack:
• Phising Attack:
• Zero Day Attack:
• Sinkhole Attack:
• Causative Attack:

#### Examples of AI Algorithms for Cyber Attacks Detection

• Convolutional Neural Network (CNN)
• Autoencoder (AE)
• Deep Belief Network (DBN)
• Recurrent Neural Network (RNN)
• Generative Adversal Network (GAN)
• Deep Reinforcement Learning (DIL)

## 2. Datasets

• Finding data sets in IoT Devices Cyber Attacks.
• Can any of the data sets be used in AI?
• What are the challenges with IoT Devices Cyber Attacks data set? Privacy, HIPPA, Size, Avalibility
• Datasets can be huge and GitHub has limited space. Only very small datasets should be stored in GitHub. However, if the data is publicly available you program must contain a download function instead that you customize. Write it using pythons request. You will get point deductions if you check-in data sets that are large and do not use the download function.

## 3. Using Images

• Place a cool image into projects images in my directory
• Correct the following link, replace the fa number with my su number and thne chart of png.
• If the image has been copied, you must use a reference such as shown in the Figure 1 caption.

Figure 1: Images can be included in the report, but if they are copied you must cite them 1.

## 4. Benchmark

Your project must include a benchmark. The easiest is to use cloudmesh-common [^2]

## 5. Conclusion

A convincing but not fake conclusion should summarize what the conclusion of the project is.

## 6. Acknowledgments

• Gregor von Laszewski
• Yohn Jairo Bautista
• Carlos Theran

## 7. References

1. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 6 - Report: Dentronics: Classifying Dental Implant Systems by using Automated Deep Learning

Artificial intelligence is a branch of computer science that focuses on building and programming machines to think like humans and mimic their actions. The proper concept definition of this term cannot be achieved simply by applying a mathematical, engineering, or logical approach but requires an approach that is linked to a deep cognitive scientific inquiry. The use of machine-based learning is constantly evolving the dental and medical field to assist with medical decision making process. In addition to diagnosis of visually confirmed dental caries and impacted teeth, studies applying machine learning based on artificial neural networks to dental treatment through analysis of dental magnetic resonance imaging, computed tomography, and cephalometric radiography are actively underway, and some visible results are emerging at a rapid pace for commercialization. Researchers have found deep convolutional neural networks to have a future place in the dental field when it comes to classification of dental implants using radiographic images.

Status: final, Type: Project

Jamyla Young, su21-reu-376, Edit

Keywords: Dental implants, Deep Learning, Prosthodontics, Implant classificiation, Artificial Intelligence, Neural Networks.

## 1. Introduction

Dental implants are ribbed oral protheses typically made up of biocompatible titanium to replace the missing root(s) of an absent tooth. These dental protheses are used to support the jaw bone to prevent deterioration due to an absent root1. This is referred to as bone resorption which can result to facial malformation as well as reduced oral function such as biting and chewing. These devices are composed of three elements that imitates a natural tooth function and structure.The implant which are typically ribbed and threaded to promote stability while integrating within the bone tissue. The osseointegration process usually takes 6-8 months to rebuild the bone to support the implant. An implant abutment is fixed on top of the implant to act as a base for prosthetic devices 2. Prefabricated abutments are manufactured in many shapes, sizes and angles depending on the location of the implant and the types of prothesis that will be attached. Dental abutments support a range of prothetic devices such as dental crowns, bridges, and dentures 3.

Osseointegrated dental implants depend on various factors that affect the anchorage of the implant to the bone tissue. Successful surgical anchoring techniques can contribute to long term success of implant stability. Primary stability plays a role 2 week postoperatively by achieving mechanical retention of the implant. It helps establish a mechanical microenvironment for gradual bone healing, or osseointegration-This is secondary implant stability. Bone type, implant length, implant and diameter influences primary and secondary implant stability. Implant length can range from 6mm to 20mm; however, the most common lengths are between 8mm to 15mm. Many studies suggest that implant length contribute to decreasing bone stress and increasing implant stability. Bone stress can occur at both the cortical and cancellous part of the bone. Increasing implant length will decrease stress in the cancellous part of the bone while increasing the implant diameter can decrease stress in the cortical part of the bone4. Bone type can promote positive bone stimulation around an implant improving the overall function. There are four different types: Type I, Type II, Type III, and Type IV. Type I is the most dense of them which provides more cortical anchorage but has limited vascularity. Type II is the best for osseointegration because it provides good cortical anchorage and has better vascularity than type I. Type III and IV have a thin layer of cortical bone which decrease the success rate of primary stability5.

Implant stability can be measured using the Implant Stability Quotient (ISQ) as an indirect indicator to determine the time frame for implant loading and prognostic indicator for implant failure 4. This can be measured by resonance frequency analysis (RFA) immediately after the implant has been placed. Resonance frequency analysis is the measurement in which a device vibrates in response to frequencies in the range of 5-15 kHz. The peak amplitude of the response is then encoded into the implant stability quotient (ISQ). The clinical range of ISQ is from 55-80. High stability is >70 ISQ while medium stability is between 60-69 ISQ. Low stability is <60 ISQ6.

There are over 2000 types of dental implant systems (DIS) that differs in diameter, length, shape, coating, and surface material properties. These devices have more than a 90% long termed survival rate which ranges more than 10 years. Inevitably, biological and mechanical complications such as fractures, low implant stability, and screw loosening can occur. Therefore, identifying the correct Dental Implant System is essential to repair or replace the existing system. Methods and techniques that enables clear identification is insufficient 7.

Artificial intelligence is a branch of computer science that focuses on building and programming machines to think like humans and mimic their actions. A deep convolutional neural network (DCNN) is a brach of artificial intelligence that applies multiple layers of nonlinear processing units for feature extraction, transformation, and classification of high dimensional datasets. Deep convolutional neural networks are commonly used to identify patterns in images and videos. The structure typically consist of four types of layers: convolution, pooling, activation, and fully connected. These neural networks use images as an input to train a classifier which employs a mathematical operation called a convolution. Deep neural networks have been successfully applied in the dental field and demonstrated advantages in terms of diagnosis and prognosis. Using automated deep convolutional neural networks is highly efficient in classifying different dental implant systems compared to most dental professionals7.

## 2. Data sets

Researchers at Daejon Dental Hospital used automated deep convolutional neural networks to evaluate the efficacy of its ability to classify dental implant systems and compare the performance with dental professionals using radiographic images.

11,980 raw panoramic and periapical radiographic images of dental implant systems were collected. these images were then randomly divided into 2 groups: 9584 (80%) images were selected for the training dataset and the remaining 2396 (20%) images were used as the testing dataset.

## 2.1 Dental implant classification

Dental implant systems were classified into six different types with a diameter of 3.3-5.0mm and a length of 7-13mm.

• Astra OsseoSpeed TX (Dentsply IH AB, Molndal, Sweden), with a diameter of 4.5–5.0 mm and a length of 9–13 mm;
• Implantium (Dentium, Seoul, Korea), with a diameter of 3.6–5.0 mm and a length of 8–12 mm;
• Superline (Dentium, Seoul, Korea), with a diameter of 3.6–5.0 mm and a length of 8–12 mm;
• TSIII (Osstem, Seoul, Korea), with a diameter of 3.5–5.0 mm and a length of 7–13 mm;
• SLActive BL (Institut Straumann AG, Basel, Switzerland), with a diameter of 3.3–4.8 mm and a length of 8–12 mm;
• SLActive BLT (Institut Straumann AG, Basel, Switzerland), with a diameter of 3.3–4.8 mm and a length of 8–12 mm.

## 2.2 Deep Convulutional Neural Network

Using Neuro-T to automatically select the model and optimize hyper-parameter. During training and inference, the automated DCNN automatically creates effective deep learning models and searches the optimal hyperparameters. An Adam optimizer with L2 regularization was used for transfer learning. The batch size was set to 432, and the automated DCNN architecture consisted of 18 layers with no dropout.

Figure 1: Overview of an automated deep convolutional neural network 7.

## 3. Results

For the evaluation, the following statistical parameters were taken into account: receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), 95% confidence intervals (CIs), standard error (SE), Youden index (sensitivity + specificity − 1), sensitivity, and specificity, which were calculated using Neuro-T and R statistical software . Delong’s method was used to compare the AUCs generated from the test dataset, and the significance level was set at p < 0.05.

The accuracy of the automated DCNN abased on the AUC, Youden index, sensitivity, and specificity for the 2,396 panoramic and periapical radiographic images were 0.954(95% CI = 0.933–0.970, SE = 0.011), 0.808, 0.955, and 0.853, respectively. Using only panoramic radiographic images (n = 1429), the automated DCNN achieved an AUC of 0.929 (95% CI = 0.904–0.949, SE = 0.018, Youden index = 0804, sensitivity = 0.922, and specificity = 0.882), while the corresponding value using only periapical radiographic images (n = 967) achieved an AUC of 0.961 (95% CI = 0.941–0.976, SE = 0.009, Youden index = 0.802, sensitivity = 0.955, and specificity = 0.846). There were no significant differences in accuracy among the three ROC curves.

Figure 2: The accuracy of the automated DCNN for the test dataset did not show a significant difference among the three ROC three ROC curves based on DeLong’s method 7.

The Straumann SLActive BLT implant system has a relatively large tapered shape compared to other types of DISs. Thus, the automated DCNN (AUC = 0.981, 95% CI = 0.949–0.996). However, for the Dentium Superline and Osstem TSIII implant systems that do not have conspicuous characteristic elements with a tapered shape, the automated DCNN classified correctly with an AUC of 0.903 (95% CI = 0.850–0.967) and 0.937 (95% CI = 0.890–0.967)

Figure 3 (a-f): Performance of the automated DCNN and comparison with dental professionals for classification of six types of DIS 7

## 4. Conclusion

Nonetheless, this study has certain limitations. Although six types of DISs were selected from three different dental hospitals and categorized as a dataset, the training dataset was still insufficient for clinical practice. Therefore, it is necessary to build a high-quality and large-scale dataset containing different types of DISs. If time and cost are not limited, the automated DCNN can be continuously trained and optimized for improved accuracy. Additionally, the automated DCNN regulates the entire process, including appropriate model selection and optimized hyper-parameter adjustment. The automated DCNN can help clinical dental practitioners to classify various types of DISs based on dental radiographic images. Nevertheless, further studies are necessary to determine the efficacy and feasibility of applying the automated DCNN in clinical practice.

## 5. Acknowledgments

1. Carlos Theran, REU Instructor

2. Yohn Jairo Parra, REU Instructor

3. Gregor von Laszewski, REU Instructor

5. Jacques Fleischer, REU peer

6. Florida Agricultural and Mechanical University

## 6. References

[^3] Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub https://github.com/cloudmesh/cloudmesh-common

1. Karras, Spiro, Look at the structure of dental implants.(2020, September 2). https://www.drkarras.com/a-look-at-the-structure-of-dental-implants/ ↩︎

2. Ghidrai, G. (n.d.). Dental implant abutment. Stomatologia pe intelesul tuturor. https://www.infodentis.com/dental-implants/abutment.php ↩︎

3. Bataineh, A. B., & Al-Dakes, A. M. (2017, January 1). The influence of length of implant on primary stability: An in vitro study using resonance frequency analysis. Journal of clinical and experimental dentistry. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5268121/ ↩︎

4. Huang, H., G, Wu., & E, Hunziker. (2020). The clinical significance of implant Stability QUOTIENT (ISQ) MEASUREMENTS: A literature review. Journal of oral biology and craniofacial research. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7494467/ ↩︎

5. Li, J., Yin, X., Huang, L., Mouraret, S., Brunski, J. B., Cordova, L., Salmon, B., & Helms, J. A. (2017, July). Relationships among Bone QUALITY, IMPLANT Osseointegration, and WNT SIGNALING. Journal of dental research https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480808/ ↩︎

6. Möhlhenrich, S. C., Heussen, N., Modabber, A., Bock, A., Hölzle, F., Wilmes, B., Danesh, G., & Szalma, J. (2020, July 24). Influence of bone density, screw size and surgical procedure on orthodontic mini-implant placement – part b: Implant stability. International Journal of Oral and Maxillofacial Surgery. https://www.sciencedirect.com/science/article/abs/pii/S0901502720302496 ↩︎

7. Lee JH, Kim YT, Lee JB, Jeong SN. A Performance Comparison between Automated Deep Learning and Dental Professionals in Classification of Dental Implant Systems from Dental Imaging: A Multi-Center Study. Diagnostics (Basel). 2020 Nov 7;10(11):910. doi: 10.3390/diagnostics10110910. PMID: 33171758; PMCID: PMC7694989. ↩︎

# 7 -

Here comes the abstract

Fix the links: and than remove this line

Status: draft, Type: Project

Fix the links: and than remove this line

Gregor von Laszewski, hid-example, Edit

Keywords: tensorflow, example.

## 1. Introduction

Do not include this tip in your document:

Tip: Please note that an up to date version of these instructions is available at

Here comes a convincing introduction to the problem

## 2. Report Format

The report is written in (hugo) markdown and not commonmark. As such some features are not visible in GitHub. You can set up hugo on your local computer if you want to see how it renders or commit and wait 10 minutes once your report is bound into cybertraining.

To set up the report, you must first replace the word hid-example in this example report with your hid. the hid will look something like sp21-599-111

It is to be noted that markdown works best if you include an empty line before and after each context change. Thus the following is wrong:

# This is My Headline
This author does ignore proper markdown while not using empty lines between context changes
1. This is because this author ignors all best practices


Instead, this should be

# This is My Headline

We do not ignore proper markdown while using empty lines between context changes

1. This is because we encourage best practices to cause issues.


## 2.1. GitHub Actions

When going to GitHub Actions you will see a report is autmatically generated with some help on improving your markdown. We will not review any document that does not pass this check.

## 2.2. PAst Copy from Word or other Editors is a Disaster!

We recommend that you sue a proper that is integrated with GitHub or you use the commandline tools. We may include comments into your document that you will have to fix, If you juys past copy you will

1. Not learn how to use GitHub properly and we deduct points
2. Overwrite our coments that you than may miss and may result in point deductions as you have not addressed them.

## 2.3. Report or Project

You have two choices for the final project.

1. Project, That is a final report that includes code.
2. Report, that is a final project without code.

YOu will be including the type of the project as a prefix to your title, as well as in the Type tag at the beginning of your project.

## 3. Using Images

Figure 1: Images can be included in the report, but if they are copied you must cite them 1.

## 4. Using itemized lists only where needed

Remember this is not a powerpoint presentation, but a report so we recommend

1. Use itemized or enumeration lists sparingly
2. When using bulleted lists use * and not -

## 5. Datasets

Datasets can be huge and GitHub has limited space. Only very small datasets should be stored in GitHub. However, if the data is publicly available you program must contain a download function instead that you customize. Write it using pythons request. You will get point deductions if you check-in data sets that are large and do not use the download function.

## 6. Benchmark

Your project must include a benchmark. The easiest is to use cloudmesh-common 2

## 6. Conclusion

A convincing but not fake conclusion should summarize what the conclusion of the project is.

## 8. Acknowledgments

Please add acknowledgments to all that contributed or helped on this project.

## 9. References

Your report must include at least 6 references. Please use customary academic citation and not just URLs. As we will at one point automatically change the references from superscript to square brackets it is best to introduce a space before the first square bracket.

1. Use of energy explained - Energy use in homes, [Online resource] https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php ↩︎

2. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 8 - Report: Aquatic Animals Classification Using AI

Marine animals play an important role in the ecosystem. ‘Aquatic animals play an important role in nutrient cycles because they store a large proportion of ecosystem nutrients in their tissues, transport nutrients farther than other aquatic animals and excrete nutrients in dissolved forms that are readily available to primary producers’ (Vanni MJ 1) Fish images are captured by scuba divers, tourist, or underwater submarines. different angles of fishes image can be very difficult to get because of the constant movement of the fish. In addition to getting the right angles, the images of marine animals are usually low-quality because of the water. Underwater cameras that is required for a good quality image can be expensive. Using AI could potentially increase the marine population by the help of classification by testing the usage of machine learning using the images obtained from the aquarium combined with advanced technology. We collect 164 fish images data from Georgia acquarium to look at the different movements.

Status: final, Type: Report

Timia Williams, su21-reu-370, Edit

Keywords: tensorflow, example.

## 1. Introduction

It can be challenging to obtain a large number of different complex species in a single aquatic environment. Traditionally, it would take marine biologists years to collect the data and successfully classify the type of species obtained [1]. Scientist says that more than 90 percent of the ocean’s species are still undiscovered, with some estimating that there are anywhere between a few hundred thousand and a few million more to be discovered" (National Geographic Society). Currently, scientists know of around 226,000 ocean species. Now and days, Artificial intelligence and machine learning has been used for detection and classification in images. In this project, We will propose to use machine learning techniques to analyze the images obtained from the Georgia Aquarium to identify legal and illegal fishing.

## 2. Machine learning in fish species.

Aquatic ecologists often count animals to keep up the population count of providing critical conservation and management. Since the creation of underwater cameras and other recording equipment, underwater devices have allowed scientists to safely and efficiently classify fishes images without the disadvantages of manually entering data, ultimately saving lots of time, labor, and money. The use of machine learning to automate image processing has its benefits but has rarely been adopted in aquatic studies. With using efforts to use deep learning methods, the classification of specific species could potentially increase. In fact, there is a study done in Australia’s ocean waters that classification of fish through deep learning was more efficient that manual human classification. In the study to test the abundance of different species, “The computer’s performance in determining abundance was 7.1% better than human marine experts and 13.4% better than citizen scientists in single image test datasets, and 1.5 and 7.8% higher in video datasets, respectively” (Campbell, M. D.). This remarkably explain that using machiene learning in marine animals is a better method than a manually classifying Aquatic animals Not only is it good for classification, it will be used to answer broader questions such as population count, the location of species, its abundance, and how it appears to be thriving. Since Machine learning and deep learning are often defined as one, both learning methods will be used to analyze the images and find patterns on my data.

## 3. Datasets

We used two datasets in my project. The first dataset includes the pictures that I took at the Georgia Acquarium. That dataset was used for testing. The second dataset used was a fish dataset from kaggle which contains 9 different seafood types (Black Sea Sprat, Gilt-Head Bream, Hourse Mackerel, Red Mullet, Red Sea Bream, Sea Bass, Shrimp, Striped Red Mullet, Trout). For each type, there are 1000 augmented images and their pair-wise augmented ground truths.

The link to access the Kaggle dataset is https://www.kaggle.com/crowww/a-large-scale-fish-dataset

## 3.1. Sample of Images of Personal Dataset

Left to right: Banded Archerfish, Lionfish, and Red Piranha

Figure 1: These images are samples of my personal data which is made up of images of fishes taken at the Georgia Acquarium.

## 4. Conclusion

Deep learning methods provide a faster, cheaper, and more accurate alternative to manual data analysis methods currently used to monitor and assess animal abundance and have much to offer the field of aquatic ecology. We was able to create a model to prove that we can use AI to efficiently detect and classify marine animals.

## 5. Acknowledgments

Special thanks to these people that helped me with this paper: Gregor von Laszewski Yohn Jairo Carlos Theran Jacques Fleischer Victor Adankai

# 9 - Project: Hand Tracking with AI

In this project we study the ability of an AI to recognize letters from the American Sign Language (ASL) alphabet. We use a Convolutional Neural Network and apply it to a dataset of hands in different positionings showing the letters ‘a’, ‘b’, and ‘c’ in ASL. With this we build a model to recognize the letter and output the letter it predicts.

Status: draft, Type: Project

David Umanzor, su21-reu-364, Edit

Keywords: ai, object recognition, image processing, computer vision, american sign language.

## 1. Introduction

Object detection and feature selection are essential tasks in computer vision and have been approached from various perspectives over the past few decades 1. The brain uses object recognition to solve an inverse problem: one where (surface properties, shapes, and arrangements of objects) need to be inferred from the perceived outcome of the image formation process 2. Visual object recognition as a neural substrate in humans was revealed by neuropsychological studies. There are specific brain regions that cause object recognition, yet we still do not understand how the brain achieves this remarkable behavior 3. Human beings rely and rapidly recognize objects despite considerable retinal image transformations arising from changes in lighting, image size, position, and viewing angle 3.

A gesture is a form of nonverbal communication done with positions and movements of the hand, arms, body parts, hand shapes, movements of the lips or face 4. One of the key differences of hand gestures is that they allow communication over a long distance 5. American Sign Language (ASL) is a formal language that has the same lingual properties as oral languages commonly used by deaf people to communicate [6]. ASL typically is formed by the finger, hand, and arm positioning and can contain static and dynamic movement or a combination of both to communicate words and meanings to another 6. Communication with other people can be challenging because people are not typically willing to learn sign language 6.

In this paper, we consider the problem of detecting and understanding American Sign Language. We test CNN’s ability to recognize the ASL alphabet. As advancements in technology increase, there are more improvements to 2D methods of hand detection. Commonly these methods are visual-based, using color, shape, and edge to detect and recognize the hand 7. There are issues to these technologies like inconsistent lighting conditions, non-hand color similarity, and varying viewpoints that can decrease the model’s ability to recognize the hand and its positioning. We use a Convolutional Neural Network to create the model and detect different letters of American Sign Language.

## 2. Data Sets

In this research we use two sources of datasets, the first is from kaggle which it was already prepared but we needed more. The second is self made dataset by take images in good lighting against a white wall, it was then cropped to 400x400 pixels focused on the hand. The program then sets the images to grayscale as the color is not needed for this research. Finally, the images are reduced to 50x50 resolution for the AI to use for training.

Figure 1: Dataset of hands doing different alphabet letters in ASL 5.

## 3. Documentation

Figure 2: The Convolutional Neural Network (CNN) model.

This model shows the CNN model that we used to train the AI. The CNN takes pictures and breaks them down into smaller segments called features. It is trained to find patterns and features over the images allowing the CNN to predict an ‘a’, ‘b’, or ‘c’ upon the given ASL image with high accuracy. A CNN uses a convolution operation that filters every possible position the feature it collected can be matched to and attempts to find where it fits in 8. This process is repeated and becomes the convolution layer or in the image depicted as Conv2d + Relu. The ReLU stands for the rectified linear unit and is used as an activation function for the CNN 9.

• [] What is a Relu operation? ReLU operation is a rectified linear unit and is used as an activation function for the CNN, we use a Leaky ReLU in our model because it is easy to use to train the model quickly and it has a small tolerance for negatives values unlike the normal ReLU fuction. paperswithcode.com/method/leaky-relu add figure of a leaky ReLU

• [] What is a Conv2d? Conv2d is a 2D Convolution layer meant for a images as it uses height and width. They build a filter across the image by recognizing the similarities of the image

• [] What is a BashNormalization operation? Batch Normalization is a process that standardizes the updates as the Convolutional process sets weights and as the neural network goes through each layer the procedure keeps adjusting to a target that never stays the same, requiring more epochs and and reduces the time it takes to train a deep learning neural network. :Reference 12:

• [] what is Maxpooling operation? Maximum pooling is an operation the gathers the biggest number in each collection of each feature map. This provides a way to avoid over-fitting

• [] what is Fully Connected?

• [] what is Softmax operation?

## 4. Methodology

In this research, we built the model using a convolution neural network (CNN) to create an AI that can recognize ASL letters (‘a’, ‘b’, and ‘c’), using a collection of 282 images. The Dataset contains 94 images for each letter to train the AI’s CNN. This can be expanded to allow an AI to recognize letters, words, and any expression that can be made using a still image of the hands. A CNN fits this perfectly as we can use its ability to assign importance to segments of an image and tell the difference from one another using weights and biases. With the proper training, it is able to learn and identify these characteristics [9].

## 5. Benchmark

Figure 3: The Confusion Matrix of the finished CNN model.

The Confusion Matrix shows the results of the model after being tested on its ability to recognize each letter, in the image it shows that the AI had a difficult time recognizing the difference between an ‘a’ and ‘c’ only getting 6% of the images labelled as ‘c’ correct.

## 6. Conclusion

We build a model to recognize an ASL given an image and predict the corresponding letter using a convolutional neural network. The model provides a means of 66% accuracy in classifying the ASL among the three classes ‘a’, ‘b’, and ‘c’. From the given results, the letters ‘a’ and ‘c’ became the most difficult for the CNN to differentiate from each other, as shown in the confusion matrix in figure 3. We suggest that the low accuracy rate is based on similar appearing grayscale of the letters ‘a’ and ‘c’ and the lack of a larger dataset for the AI to learn from. We determine that using a larger dataset of the entire alphabet and increasing the number of examples of each letter to train the AI could improve the results.

We found that the low accuracy can be increased by improving the resolution of the image giving the program more features to go off of in its computing to recognize the image, going from model 1 at 50 x 50 pixels to model 2 at 80 x 80 pixels there was an increase from 66% to 76% in accuracy, this in theory should improve as the resolution of the image increases from 100x 100 to 200x 200 and at the best the image’s resolution would be left at the original size off 400 x 400. This is accuracy increase would be because as the resolution drops the program has less information and some of the important landmarks of the hand are lost due to the resolution of the image.

• Correct formatting and grammar

Future studies using a larger dataset can be applied to more complex methods than just singular letters but words from the ASL language to recreate a text to speech software based around ASL hand positioning.

## 7. Acknowledgments

We thank Carlos Theran (Florida A & M University) for advising, guidance, and resources used in the research; We thank Yohn Jairo (Florida A & M University) for guidance and aid on the research report; We thank Gregor von Laszewki (Florida A & M University) for advice and commenting on the code and report; We thank the Polk State LSAMP Program for aid in obtaining this opportunity. We thank Florida A & M University for funding this research.

## 8. References

1. Pan, T.-Y., Zhang, C., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., & Chao, W.-L. (2021, July 5). On Model Calibration for Long-Tailed Object Detection and Instance Segmentation. arXiv.org. https://arxiv.org/abs/2107.02170↩︎

2. Wardle, S. G., & Baker, C. (2020). Recent advances in understanding object recognition in the human brain: Deep neural networks, temporal dynamics, and context. F1000Research. F1000 Research Ltd. https://doi.org/10.12688/f1000research.22296.1 ↩︎

3. Wardle, S. G., & Baker, C. (2020). Recent advances in understanding object recognition in the human brain: Deep neural networks, temporal dynamics, and context. F1000Research. F1000 Research Ltd. https://doi.org/10.12688/f1000research.22296.1 ↩︎

4. Dabre, K., & Dholay, S. (2014). Machine learning model for sign language interpretation using webcam images. 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), 317-321. https://ieeexplore.ieee.org/document/6839279 ↩︎

5. tecperson, Sign Language MNIST Drop-In Replacement for MNIST for Hand Gesture Recognition Tasks, [Kaggle] https://www.kaggle.com/datamunge/sign-language-mnist ↩︎

6. A. Rahagiyanto, A. Basuki, R. Sigit, A. Anwar and M. Zikky, “Hand Gesture Classification for Sign Language Using Artificial Neural Network,” 2017 21st International Computer Science and Engineering Conference (ICSEC), 2017, pp. 1-5, <doi: 10.1109/ICSEC.2017.8443898> ↩︎

7. Jiayi Wang, Franziska Mueller, Florian Bernard, Suzanne Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A. Otaduy, Dan Casas, and Christian Theobalt. 2020. RGB2Hands: real-time tracking of 3D hand interactions from monocular RGB video. ACM Trans. Graph. 39, 6, Article 218 (December 2020), 16 pages. https://doi.org/10.1145/3414685.3417852 ↩︎

8. Rohrer, B. (2016, August 18). How do Convolutional Neural Networks work? Library for end-to-end machine learning. https://e2eml.school/how_convolutional_neural_networks_work.html ↩︎

9. Patel, K. (2020, October 18). Convolution neural networks - a beginner’s Guide. Towards Data Science. https://towardsdatascience.com/convolution-neural-networks-a-beginners-guide-implementing-a-mnist-hand-written-digit-8aa60330d022 ↩︎

# 10 - Review: Handwriting Recognition Using AI

This study reviews two approaches and/or machine learning tools used by researchers/developers to convert handwritten information into digital forms using Artificial Intelligence.

Status: final, Type: Report

Mikahla Reeves, su21-reu-366, Edit

Keywords: handwriting recognition, optical character recognition, deep learning.

## 1. Introduction

Perhaps one of the most monumental things in this modern-day is how our devices can behave like brains. Our various devices can call mom, play our favorite song, and answer our questions by just a simple utterance of Siri or Alexa. These things are all possible because of what we call artificial intelligence. Artificial intelligence is a part of computer science that involves learning, problem-solving, and replication of human intelligence. When we hear of artificial intelligence, we often hear of machine learning as well. The reason for this is because machine learning also involves the use of human intelligence. Machine learning is the process of a program or system getting more capable over time 1. One example of machine learning at work is Netflix. Netflix is a streaming service that allows users to watch a variety of tv shows and movies, and it also falls under the category of a recommendation engine. Recommendation engines/applications like Netflix do not need to be explicitly programmed. However, their algorithms mine the data, identify patterns, and then the applications can make recommendations.

Now, what is handwriting recognition? Handwriting Recognition is a branch of (OCR) Optical Character Recognition. It is a technology that receives handwritten information from paper, images, and other items and interprets them into digital text in real-time 2. Handwriting recognition is a well-established area in the field of image processing. Over the last few years, developers have created handwriting recognition technology to convert written postal codes, addresses, math questions, essays, and many more types of written information into digital forms, thus making life easier for businesses and individuals. However, the development of handwriting recognition technology has been quite challenging.

One of the main challenges of handwriting recognition is accuracy, or in other words, the variability in data. There is a wide variety of handwriting styles, both good and bad, thus making it harder for developers to provide enough samples of what a specific character/integer looks like 3. In handwriting recognition, the computer has to translate the handwriting into a format that it understands, and this is where Optical Character Recognition becomes useful. In OCR, the computer focuses on a character, compares it to characters in its database, then identifies what the letters are and fundamentally what the words are. Also, this is why deep learning algorithms like Convolutional Neural Networks and Long Short Term Memory exist. This study will highlight the impact each algorithm has on the development of handwriting recognition.

## 2. Convolutional Neural Network Model

Figure 1: Architecture of CNN for feature extraction 4

In this model, the input image passes through two convolutional layers, two sub-sample layers, and a linear SVM (Support Vector Machine) that allows for the output which is a class prediction. This class prediction leads to the editable text file.

Class prediction is a supervised learning method where the algorithm learns from samples with known class membership (training set) and establishes a prediction rule to classify new samples (test set). This method can be used, for instance, to predict cancer types using genomic expression profiling 5.

## 3. Long Short Term Memory Model

Figure 2: Overview of the CNN-RNN hybrid network architecture 6

This model has a spacial transformer network, residual convolutional blocks, bidirectional LSTMs and the CTC loss (Connectionist Temporal Classification loss) which are all the processes the input worded image has to pass through before the output which is a label sequence.

Sequence labeling is a typical NLP task which assigns a class or label to each token in a given input sequence 7.

## 4. Handwriting Recognition using CNN

Figure 3: Flowchart of handwriting character recognition on form document using CNN 4

In the study, former developers created a system to recognize the handwriting characters on form document automatically and convert it into editable text. The system consists of four stages: get ROI (Region of Interest), pre-processing, segmentation and classification. In the getting ROI stage, according to the specified coordinates, the ROI is cropped. Next, each ROI goes through pre-processing. The pre-processing consists of bounding box removal using the eccentricity criteria, median filter, and bare open. The output image of the pre-processing stage will be segmented using the Connected Component Labeling (CCL) method. It aims to get an individual character4.

## 5. Conclusion

In this study, we learned how to use synthetic data, domain-specific image normalization, and augmentation - to train an LSTM architecture6. Additionally, we learned how a CNN is a powerful feature extraction method when applied to extract the feature of the handwritten characters and linear SVM using L1 loss function and L2 regularization used as end classifier4.

For future research, we can focus on improving the CNN model to be able to better process information from images to create digital text.

## 6. Acknowledgments

This paper would not have been possible without the exceptional support of Gregor von Laszewski, Carlos Theran, Yohn Jairo. Their constant guidance, enthusiasm, knowledge and encouragement have been a huge motivation to keep going and to complete this work. Thank you to Jacques Fleicher, for always making himself available to answer questions. Finally, thank you to Byron Greene and the Florida A&M University for providing this great opportunity for undergraduate students to do research.

## 7. References

1. Brown, S., 2021. Machine learning, explained | MIT Sloan. [online] MIT Sloan. Available at: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained↩︎

2. Handwriting Recognition in 2021: In-depth Guide. (n.d.). https://research.aimultiple.com/handwriting-recognition ↩︎

3. ThinkAutomation. 2021. Why is handwriting recognition so difficult for AI? - ThinkAutomation. [online] Available at: https://www.thinkautomation.com/bots-and-ai/why-is-handwriting-recognition-so-difficult-for-ai/↩︎

4. Darmatasia, and Mohamad Ivan Fanany. 2017. “Handwriting Recognition on Form Document Using Convolutional Neural Network and Support Vector Machines (CNN-SVM).” In 2017 5th International Conference on Information and Communication Technology (ICoIC7), 1–6. ↩︎

5. “Class Prediction (Predict Parameter Value).” NEBC: NERC Environmental Bioinformatics Centre. Silicon Genetics, 2002. http://nebc.nerc.ac.uk/courses/GeneSpring/GS_Mar2006/Class%20Prediction.pdf ↩︎

6. K. Dutta, P. Krishnan, M. Mathew and C. V. Jawahar, “Improving CNN-RNN Hybrid Networks for Handwriting Recognition,” 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp. 80-85, doi: 10.1109/ICFHR-2018.2018.00023. ↩︎

7. Jacob. “Deep Text Representation for Sequence Labeling.” Medium. Mosaix, August 15, 2019. https://medium.com/mosaix/deep-text-representation-for-sequence-labeling-2f2e605ed9d ↩︎

# 11 - Project: Analyzing Hashimoto disease causes, symptoms and cases improvements using Topic Modeling

Analyzing factors as immune systems, genetics and diets than can lead to Hashimoto disease

Status: final, Type: Project

Sheimy Paz, su21-reu-372, Edit

Keywords: Thyroid disease, Hashimoto, H Pylori, Implants, Food Sensitivity, Diary sensitivity, Healthy Diets, Exercise, topic modeling, text mining, BERT model.

## 1. Introduction

Hashimoto thyroiditis is an organ-specific autoimmune disorder. its symptoms were first described 1912 but the disease was not recognized until 1957. Hashimoto is an autoimmune disorder that destroys thyroid cells and is antibody-mediated 1. In a female-to-men radio at least 10:4 women are more often affected than men. The diagnostic is often called between the ages of 30 to 50 years 2. Pathologically speaking, Hashimoto stimulates the formation of antithyroid antibodies that attack the thyroid tissue, causing progressive fibrosis. Hashimoto is believe to be the concequence of a combination of mutated genes and eviromental factors3. The disorder is difficult to diagnose since in the early course of the disease the patients may or may not exhibit symptoms or laboratory findings of hyperthyroidism, it may show normal values because the destruction of the gland cells may be intermittent 1. Clinical and epidemiological studies suggest worldwide that the most common cause of hypothyroidism is an inadequate dietary intake of iodine.

Due to the arduous labor to identify this disorder a Machine Learning algorithm based on prediction would help to identify Hashimoto in early stages as well as any other health issues related to it 4. This will be helpful for patients that would be able to get the correct treatment in an early stage of the illness avoiding future complications. This research algorithm was mainly intended to find patient testimonies of improvements, completed healed cases, early symptoms, trigger factors or any useful information about the disorder.

Hashimoto autoimmune diseases have been linked to the infection caused by H pylori bacteria. H pylori is until the date the most common chronic bacterial infection, affecting half of the world’s population and is known for the presence of Caga antigens which are virulent strains that have been found in organ and non-organ specific autoimmune diseases 23. Another important trigger of Hashimoto disorder is the inadequate modern diet patterns and the environmental factors that are closely related to it 4. For instance, western diet consumption is an essential factor that trigs the disorder since this food is highly preserved and predominate the consumption of artificial flavors and sugars which have dramatically increase in the past years, adding to it the use of chemicals and insecticides in the fruits and vegetables and the massive introduction of hormones for meat production, all this can be the cause of the rise of autoimmune diseases 5.

We utilize deep learning BERT model to train our dataset. BERT is a superior performer Bidirectional Encoder, which superimposes 12 or 24 layers of multiheaded attention in a Transformer 1. Bert stands for Bidirectional(read from left to right and vice versa with the purpose of an accurate understanding of the meaning of each word in a sentence or document) Encoder Representations from Transformers(the used of transformers and bidirectional models allows the learning of contextual relations between words). Notice that BERT uses two training strategies MLM and NSP.

Masked Lenguage Model (MLM) process is made by masking around 15% of token making the model predict the meaning or value of each of the masked words. In technical word it requires 3 steps Adding a classification layer on top of the encoder output, Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension. And lastly calculating the probability of each word in the vocabulary with SoftMax. Here we can see an image of the process 6.

MLM Bert Figure: “Masked Language Model Figure Example” 6.

Next Sentence Prediction (NSP) process is based in sentence prediction. The model obtains pair of sentences as inputs, and it is train to predict which is the second sentence in the pair. In The training process 50% of the input sentences are in fact first and second sentence and in the other 50% the second sentences are random sentences used for training purposes. The model is able to distinguish if the second sentence is connected to the first sentence by a 3-step process. An CLS (the reserved token to represent the start of sequence) is inserted at the beginning of the first sentence while the SEP (separate segments or sentence) is inserted at the end of each sentence. And embedding indicating sentence A or B is added to each token, and lastly a positional embedding is added to each token to indicate its position in the sequence like is shown on the image 6.

NSP Bert Figure: “Next Sentence Prediction Figure Example” 6.

By the trained model Parameter learning we obtains the word embeddings of the input sentence or input sentence pair in the unsupervised learning framework proceeds by solving the following two tasks: Masked Language Model and Next Sentence Prediction.

We try to use Bert model in the small dataset Hashimoto without any success because the BERT model was overfitting the data points. We use LDA model to train the Hashimoto dataset which allow us to find topic probabilities that we compare with the thyroiditis dataset that was trained with the BERT model-framework.

We used Natural Language Toolkit (NLTK) which is a module that uses the process of splitting sentences from paragraph, split words, recognizing the meaning of those words, to highlighting the main subjects, with the purpose to help to understand the meaning of the document 7. For instance, in our NLTK model we used two data sets Hashimoto and thyroiditis and we were able to identify the top 30 topics connected to these disorders. From the information collected we were able to identify general information like association of the disorder with other health issues. The impact of Hashimoto patient with covid19, long term consequences of untreated Hashimoto, recommendation for advance cases, and diet suggestion for improvement. The used of Natural Language tool kit made a precise and less time consuming research process.

## 2. Summary Tables

We can observe in this table the differences between this two similar disorders that are frequently misunderstood.

Summary Table 1: “Differences Between Hashimoto’s Thyroiditis and Grave’s Disease” 8.

Summary Table 2: “Hashimoto’s thyroiditis is associated with other important disorders” 8.

Summary Table 3: “Overview of the main dietary recommendations for patients with Hashimoto” 8.

## 3. Datasets

Silobreaker software was used to obtain scientific information related to the Hashimoto disease coming from different sources such as journals, proceedings, tweets, and news. Our date consists in the fallowing feature: ID, cluster Id, Description, publication date, Source URL, publisher. And the purpose is to analyze the preform of the proposed approach to discover the hiding semantic structures related with Hashimoto and thyroiditis the description from the gather data is used to study the frequency of Hashimoto and thyroiditis appears in the documents and detecting words and phrases patterns within them to automatically clustering work groups.

The dataset was obtained from Silobreaker database which is a commercial database. We got access through Florida A&M University who provided me the right to query the data. the link for the silobreaker information is [Here] (https://www.silobreaker.com/) 9.

This data was preprocessed dropping the columns ‘Id’, ‘ClusterId’, ‘Language’, ‘LastUpdated’,‘CreatedDate’,‘FirstReported’. Also, stop words and punctuation were removed, we convert to lower case all the titles.

## 4. Results

The following figures were creating with the help of libraries like gensim. Gensim stands for Generate similar and is an unsupervised library wide used for topic modeling and natural language processing based on modern statistical machine learning 10. it can handle large text collections of data and can preforms task like corpora, building document, word vectors and topic identification, which is one of the technics we used here, and we can observe it in some of the images. Each figure is described and explains the method we used to created it along with the relationship of the key word or major topic to the Hashimoto disorder.

Figure 1: “Example of a Word Cloud Object”

On Figure 1 we observed an example of a word cloud object and represent the difference words found in our dataset and the size of the words means the frequency of the given words in the document. Meaning that the size of the words is proportional to the frequency of its used.

Figure 2: “Example of a Intertopic Distance Map”

Figure 2 shows an Intertopic Distance Map which is a two-dimensional space filled with circles representing the proportional number of words that belongs to each topic making the distance to each other represent the relation between the topics, meaning that topics that are closer together have more words in common. For instance, in topic 1 we observed word like hypothyroidism, Morgan, symptoms after a small search we were able to find that Morgan is a well-known writer that presented thyroiditis symptoms after giving birth which is something that happen to some women’s and then recover after a couple of months, however this increments the risk of developing the syndrome later in their lives 11. On topic 4 we see words like food, levothyroxine, liothyronine, selenium and dietary. the relationship between these words is symptom control, symptoms relive, some natural remedies and supplements 1213.

Figure 3: “Top 30 major Topics”

On figure 3 we observed a bar chart that shows 30 major terms. The bars indicate the total frequency of the term across the entire corpus. The size of the bubble measures the importance of the topics, relative to the data. for example, for visualization purposes we used the first topic that include Hashimoto, thyroiditis, and selenium. Saliency is a measure of how much the term talks about the topic. And in terms of findings is important to mentions the relationship between Hashimoto thyroiditis and selenium. Selenium is a suplement recomended for patients with this disorder that have shown a reduction on antibody levels 13.

Figure 4: “Example of Hierarchical Clustering chart”

On figure 4 we can see that the dendrograms have been created joining points 4 with 9, 0 with 2, 1 with 6, and 12 with 13. The vertical height of the dendrogram shows the Euclidean distances between points. It is easy to see that Euclidean distance between points 12 and 13 is greater than the distance between point 4 and 9. This is because the algorithm is clustering by similarity, differences, and frequency of words. We observed in the dark green dendrogram topic 7,3,4,9 which are all related to an advance stage of the disorder. we can find the information about certain treatments, causes of the disorder, level of damage at certain stages. On the reds dendrograms we observe topics 0,2,1,6 which are closely related to diagnosis, early symptoms and procedures used for the diagnosis of the disorder.

Figure 5: “Exaple of Similarity Matrix Chart”

On figure 5 we can see a similarity matrix chart, the graph is build based on similarity reached from the volume of topic and association by document, therefore the graph show groups of documents that are cluster together based on similarities. in this case the blue square is an indication of a strong similarity, and the green and light green is an indication of different topics. for instance, we are able to derive as a conclusion that carcinoma cancer, carcinoma therapy, lymph papillary metastasis and hypothyroidism are closely related. in facts they are advance stages of the disorder. E.g. Carcinoma therapy is a type of treatment that can be used for this disorder 14.

Figure 6: “Example of Term Score Decline Per Topic Chart”

On figure 6 we observed TF-IDF which is an interesting technic used on machine learning that have the ability to give weight to those words that are not frequent in the document but can carry important information. In this example we can see how topic 12, covid19 pandemic patients is the at the top of the chart and then start declining when the rank term increase. The science behind this behave is explain by the TF-IDF which is term frequency - Inverse document frequency. Therefore, covid 19 was a relative new disease, and we do not expect to have a high frequency used in the document. In this case we were able to find information about Hashimoto patients and covid19 which it seems not to causes any extreme symptoms for patient with this disorder others than the ones expected from a healthy person in other words Hashimoto patients have the same risk of a healthy person 15.

Figure 7: “Example of Topic Probability chart”

On figure 7 we see a probability distribution chart based on each topic frequency and its relationship with the main topic: Hashimoto thyroiditis causes or cure. We can see that topic 12 is the least frequent or least related since most of its content is about covid19. Then we have topic 11 zebrafish which is related to the investigation of the disorder but most of its content is about the research made on zebrafish and how had help researchers to understand thyroid diseases in other no mammals’ animals, but is not closely related to the major point of this project, however, is an interesting research which have provide useful information about thyroiditis 16.

Figure 8: “Example of a Topic Word Score Chart”

On figure 8 we have Topic Word Scores chart that provides a deep understanding of large corpus of texts trough topic extraction. for instance, the data used in this project provide 5 fundamental topics from 0 to 4. Essentially each topic provided closely related words with deep information about the disorder itself, treatments, diagnosis, and symptoms. E.g. in topic number 4 we find a specific word “eye” which it does not seem to have a close relationship with Hashimoto thyroiditis but in facts is related to one of the early symptoms that the human body experiment most likely when is still undiagnosed [16]. In the same topic we also find the word teprotumumab which is an eye relieve medication recommended from doctors to relive the symptoms, in other word is not the cure but it helps 17.

## 5. Hashimoto Findings

As we can see our findings are wide in aspects of causes which is one of the main keys, because if we know the cause of something most likely we will be able to avoid it. However, this disorder is considered relative knew and have been around for some decades only, but it is necessary to point out the relation of diseases with the environment. Environmental changes are a fact and are affecting us every day even when we don’t notice it. We have seen an exponential increase of Hashimoto cases in the last five decades, and at the same time the last five decades have been potentially related to climate change, high levels of pollution, less fertile soils, increased use of pesticides on food, etc. It would be a good idea to think about our environment and how to help it heal since it will bring benefits for all of us18.

Table Summary: “Finding summary on causes, descriptions and recommendations.”

Possible Causes Description Recomendations
Genetic predispositions Genetically linked Manage stress
Dietary errors Imbalance of iodine intake Balance is key
Nutritional deficiencies not enough veggies, vitamins and minerals eat more veggies
Hormone deficiencies lover levels of vit D Enough sleep, Take some sun light
Viral, bacterial, yeast, and parasitic infections. H pylori, Bad guts microbes food hygiene
Enviromental Factors Pollution, Pesticides used etc. Human footprint on environment

Possible causes

Genetic predispositions, Dietary errors, Nutritional deficiencies, Hormone deficiencies, Viral, bacterial, yeast, and parasitic infections 8.

Hashimoto Trigger Food

Some Food that can trigger Hashimoto are gluten, dairy, some type of grains, eggs, nuts or nightshades, sugar, sweeteners, sweet fruits, including honey, agave, maple syrup, and coconut sugar and high-glycemic fruits like watermelon, mango, pineapple, grapes, canned and dried fruits. Vegetable oil, specially hydrogenated oils, ad trans-fat. Patient with this disorder may experience symptoms of fatigue, rashes, joint pain, digestive issues, headaches, anxiety, and depression after eating some of these foods 19.

Hashimoto recommended Diets

The recommended foods are healthy fats like coconut, avocado, and olive oil, ghee, grass-fed and organic meat, wild fish, healthy fats, fermented foods like coconut yogurt, kombucha, fermented cucumbers and pickle ginger, and plenty of vegetable like Asparagus, spinach, lettuce, broccoli, beets, cauliflower, carrots, celery, artichokes, garlic, onions 19.

Environmental causes of Hashimoto

There have been an increase in the number of Hashimoto cases in the United States since 1950s. These is one of the reason research explain that Hashimoto disorder can be closely related to environmental causes since the rapid increase of cases can not only be related to family gens as it takes at least two generations to acquire and transfer gen mutation. Adding to this that for generation thru history human have been fitting microorganisms than enter our body but for the past centuries our environment has become very hygienic consequently our immune system suddeling was left without aggressors therefore humane start developing more allergies and autoimmune diseases. Another important factor is the balance of iodine intake because too much is as dangerous for people with genetic Hashimoto predisposition but too littler can be also dangerous for patients with the disorder to reduce goiter which is the enlargement of the thyroid glands 18.

It is still not enough research to state that low vitamin D levels are a cause or a consequence of the Hashimoto disorder, but it is a fact that most patients with this disorder have low levels of vitamin D this insufficient this is closely related to insufficient sun exposure 18.

The exposure to certain synthetic pesticide. An important fact is that 9 out of 12 pesticides are dangerous and persistent pollutants 18.

Symptoms of Hashimoto’s

Some of the symptoms are fatigue and sluggishness, sensitivity to cold, constipation, pale and dry skin, dry eyes, puffy face, brittle nails, hair loss, enlargement of the tongue, unexplained weight gain, muscle aches, tenderness and stiffness, joint pain and stiffness, muscle weakness, excessive or prolonged menstrual bleeding, depression, memory lapses, Another symptom reported by some patients was ablation, some patient described as an acceleration of the heart rhythm 20.

Complications

Tissue damage, Abnormal look of the thyroid gland (figure 2), goiter, Heart problems, mental health issues, myxedema, birth defects 20, Nodule (figure 4 Similarity Matrix topic 3), and High antibody level. It is important to mention an association between high levels of thyroid autoantibodies and the increased of mood disorders, thyroid autoimmunity disease, celiac disease, panic disorder and major depressive disorder 8.

Recomendations

Healthy diets, exercising, selenium supplementation [8], healthy sun exposure at an adequate time, getting enough sleep is primordial for the human body, in special for the metabolism regulation and the creation of normal hormones that the human body needs, 8 lowering stress levels by physical exercise is a good idea, exercise like yoga and reiki are valuable because it also exercise you brain with meditation which is a great stress reliever.

## 6. Benchmark

We used benchmark to perform the process time to get topics frequency in parallel using google colab with run type: GPU and TPU. We can observe that TPU machines take less time to classify topic 1. Tensor Processor Unit (TPU) is designed to run cutting-edge machine learning models with AI services on Google Cloud 21

Benchmark Topics Frequency:

parallel Topic Status Time processor
164 1_cancer_follicular_carcinoma_autoimmune ok 0.53 GPU
190 1_cancer_follicular_carcinoma_autoimmune ok 0.002 TPU

## 7. Conclusion

As expected, we were able to derive helpful information of the Hashimoto thyroiditis disorder. we attempted to summarize our findings concerning Hashimoto thyroiditis in aspects of causes, symptoms, recommended diets and supplements and used medication.

Our findings highlight the great potential of the model we used. certainly, topic modeling method was a precise idea for the optimization of the research process. We also used various features of genism, which allows to manipulate data texts on NPL projects. The use of clustering technics was very useful to label our findings on the large datasets. Each used graph provided useful details and key words that later help us to review each important topic in a faster manner and develop the research project with accurate results.

## 8. Acknowledgments

Gregor von Laszewski

Yohn J Parra

Carlos Theran

## 9. References

1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, [Online resource] https://arxiv.org/abs/1810.04805 ↩︎

2. Helicobacter pylori infection in women with Hashimoto thyroiditis, [Online resource] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5265752/ ↩︎

3. I. Voloshyna, V.I Krivenko, V.G Deynega, M.A Voloshyn, Autoimmune thyrod disease related to helicobacterer pylori contamination, [Online resource] https://www.endocrine-abstracts.org/ea/0041/eposters/ea0041gp213_eposter.pdf ↩︎

4. How your diet can trigger Hashimoto’s, [Online resource] https://www.boostthyroid.com/blog/2019/4/5/how-your-diet-can-trigger-hashimotos ↩︎

5. Hypothyroidism in Context: Where We’ve Been and Where We’re Going, [Online resource] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822815/ ↩︎

6. BERT Explained: State of the art language model for NLP https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 ↩︎

7. Pavan Sanagapati, Knowledge Graph & NLP Tutorial-(BERT,spaCy,NLTK), [Online resource] https://www.kaggle.com/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk ↩︎

8. Hashimoto’s Thyroiditis, A Common Disorder in Women: How to Treat It, [Online resource] https://www.townsendletter.com/article/441-hashimotos-thyroiditis-common-disorder-in-women/ ↩︎

9. Silobreaker: Intelligent platform for the data era https://www.silobreaker.com ↩︎

10. Gensim Tutorial – A Complete Beginners Guide, [Onile resource] https://www.machinelearningplus.com/nlp/gensim-tutorial/ ↩︎

11. Julia Haskins, Thyroid Conditions Raise the Risk of Pregnancy Complications, [Online resource] https://www.healthline.com/health-news/children-thyroid-conditions-raise-pregnancy-risks-052913 ↩︎

12. How your diet can trigger Hashimoto’s, [Online resource] https://www.boostthyroid.com/blog/2019/4/5/how-your-diet-can-trigger-hashimotos ↩︎

13. Selenium Supplementation for Hashimoto’s Thyroiditis, [Online resource] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005265/ ↩︎

14. Thyroid Cancer Treatment, [Online resource] https://www.cancer.gov/types/thyroid/patient/thyroid-treatment-pdq ↩︎

15. Hashimoto’s Disease And Coronavirus (COVID-19), [Online resource] https://www.palomahealth.com/learn/coronavirus-and-hashimotos-disease ↩︎

16. How zebrafish research has helped in understanding thyroid diseases, [Online resource] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5730863/ ↩︎

17. Teprotumumab for the Treatment of Active Thyroid Eye Disease, [Online resource] https://www.nejm.org/doi/full/10.1056/nejmoa1910434 ↩︎

18. 11 environmental triggers of Hashimoto’s, [Online research] https://www.boostthyroid.com/blog/11-environmental-triggers-of-hashimotos ↩︎

19. Hashimoto’s low thyroid autoimmune, [Online research] https://www.redriverhealthandwellness.com/diet-hashimotos-hypothyroidism/ ↩︎

20. Hashimoto’s disease, [Online research] https://www.mayoclinic.org/diseases-conditions/hashimotos-disease/symptoms-causes/syc-20351855 ↩︎

21. TPU: Tensor Processor Unit https://cloud.google.com/tpu ↩︎

# 12 - Project: Classification of Hyperspectral Images

Here comes the abstract

Status: draft, Type: Project

Carlos Theran, su21-reu-360, Edit

Keywords: tensorflow, example.

## 1. Introduction

Do not include this tip in your document:

Tip: Please note that an up to date version of these instructions is available at

Here comes a convincing introduction to the problem

## 2. Report Format

The report is written in (hugo) markdown and not commonmark. As such some features are not visible in GitHub. You can set up hugo on your local computer if you want to see how it renders or commit and wait 10 minutes once your report is bound into cybertraining.

To set up the report, you must first replace the word hid-example in this example report with your hid. the hid will look something like sp21-599-111

It is to be noted that markdown works best if you include an empty line before and after each context change. Thus the following is wrong:

# This is My Headline
This author does ignore proper markdown while not using empty lines between context changes
1. This is because this author ignors all best practices


Instead, this should be

# This is My Headline

We do not ignore proper markdown while using empty lines between context changes

1. This is because we encourage best practices to cause issues.


## 2.1. GitHub Actions

When going to GitHub Actions you will see a report is autmatically generated with some help on improving your markdown. We will not review any document that does not pass this check.

## 2.2. PAst Copy from Word or other Editors is a Disaster!

We recommend that you sue a proper that is integrated with GitHub or you use the commandline tools. We may include comments into your document that you will have to fix, If you juys past copy you will

1. Not learn how to use GitHub properly and we deduct points
2. Overwrite our coments that you than may miss and may result in point deductions as you have not addressed them.

## 2.3. Report or Project

You have two choices for the final project.

1. Project, That is a final report that includes code.
2. Report, that is a final project without code.

YOu will be including the type of the project as a prefix to your title, as well as in the Type tag at the beginning of your project.

## 3. Using Images

Figure 1: Images can be included in the report, but if they are copied you must cite them 1.

## 4. Using itemized lists only where needed

Remember this is not a powerpoint presentation, but a report so we recommend

1. Use itemized or enumeration lists sparingly
2. When using bulleted lists use * and not -

## 5. Datasets

Datasets can be huge and GitHub has limited space. Only very small datasets should be stored in GitHub. However, if the data is publicly available you program must contain a download function instead that you customize. Write it using pythons request. You will get point deductions if you check-in data sets that are large and do not use the download function.

## 6. Benchmark

Your project must include a benchmark. The easiest is to use cloudmesh-common 2

## 6. Conclusion

A convincing but not fake conclusion should summarize what the conclusion of the project is.

## 8. Acknowledgments

Please add acknowledgments to all that contributed or helped on this project.

## 9. References

Your report must include at least 6 references. Please use customary academic citation and not just URLs. As we will at one point automatically change the references from superscript to square brackets it is best to introduce a space before the first square bracket.

1. Use of energy explained - Energy use in homes, [Online resource] https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php ↩︎

2. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 13 - Project: Detecting Multiple Sclerosis Symptoms using AI

This work implements machine learning algorithim apply in Multiple Sclerosis symptoms and provides treatment options available

Status: draft, Type: Project

Raeven Hatcher, su21-reu-371, Edit

Keywords: tensorflow, example.

## 1. Introduction

MS or Multiple sclerosis is a potentially disabling autoimmune disease that can damage the brain, spinal cord, and optic nerves located in the eyes. It is the most common progressive neurological disability that affects adolescents. The immune system attacks the central nervous system in this disease, specifically myelin (sheth that covers and protects nerve fibers), oligodendrocytes (myelin-producing cells), and the nerve fibers located under myelin. Myelin enables nerves to send and receive electrical signals swiftly and effectively. The myelin sheath becomes scarred from being attacked. These attacks make the myelin sheath inflamed in little patches, observable on an MRI scan. These little inflamed patches potentially disrupt messages moving along the nerves, which ultimately lead to the symptoms of Multiple sclerosis. If these attacks happen frequently, permanent damage can occur to the involved nerve. Because Multiple sclerosis affects the central nervous system, which control all of the actions carried out in the body, symptoms can affect any part of the body and vary. The most common symptoms of this progressive disease include muscle weakness, pins and needle sensation, electrical shock sensation, loss of bladder control, muscle spasms, tremors, double or blurred vision, partial or total vision loss, to name a few. Researchers are not sure what causes Multiple sclerosis but believe those between the ages of 20 and 40, women, smoke, are exposed to certain infections, have a vitamin D and B12 deficiency, and related to someone affected by this disease are more susceptible.

It can be difficult to diagnose MS due to the symptoms usually being vague or very similar to other conditions. There is no single test to diagnose it positively. However, doctors can choose a neurological examination, MRI scan, evoked potential test, lumbar puncture, or a blood test to diagnose a patient properly. Currently, clinical practices divide MS into four phenotypes: clinically isolated syndrome (CIS), relapsing-remitting MS (RRMS), primary-progressive MS (PPMS), and secondary progressive MS (SPMS). Two factors define these phenotypes; disease activity (evidenced by relapses or new activity on MRI scan) and progression of disability. Phenotypes are routinely used in clinical trials to choose patients and conduct treatment plans.

New technologies, such as artificial intelligence and machine learning, help assess multidimensional data to recognize groups with similar features. When implemented in apparent abnormalities on MRI scans, these new technologies have assured promising results in classifying patients who share similar pathobiological mechanisms rather than the typical clinical features.

Researchers at UCL work with the Artificial intelligence (AI) tool SuStain (Subtype and Stage Inference) to ask whether AI can find Multiple sclerosis subtypes that follow a particular pattern on brain images? The results uncovered three data-driven MS subtypes defined by pathological abnormalities seen on brain images (Skylar). The three data-driven MS subtypes are cortex-led, normal-appearing WM-led, Lesion-led. Cortex-led MS is characterized by early tissue shrinkage (atrophy) in the outer layer of the brain. Normal-appearing WM-led is identified by irregular diffused tissue located in the middle of the brain. Lastly, a lesion-led subtype is characterized by early extension accumulation of brain damage areas that lead to severe atrophy in numerous brain regions. All three of these subtypes correlate to the earliest abnormalities observed on an MRI scan within each pattern.

In this experiment, researchers utilized the SuStain tool to capture MRI scans of 6,332 patients. The unsupervised SuStain taught itself and identified those three patterns that were previously undiscovered.

## 3. Using Images

The above image shows the MRI-based subtypes. The color shades range from blue to pink, representing the probability of abnormality mild to severe, respectively. (Eshaghi)

## 4. Datasets

MRI brain scans of 6,322 MS patients. look if you can find figures descrbing the data.

## 5. Benchmark

Your project must include a benchmark. The easiest is to use cloudmesh-common.

## 6. Conclusion

A vital barrier in distinguishing subtypes in Multiple sclerosis is to stitch observations together from cross-sectional or longitudinal studies. Grouping individuals based wholly on their MRI scan is ineffective because patients belonging to the same subgroup could show ranging abnormalities as the disease progresses and would appear different. SuStaIn, Subtype and Staging Inference, a newly developed unsupervised machine learning algorithm aids in uncovering data-driven disease subtypes that have distinct temporal progression patterns. “The ability to disentangle temporal and phenotypic heterogeneity makes SuStain different from other unsupervised learning or clustering algorithms” (Eshaghi). SuStaIn identifies subtypes given the data, defined by a particular pattern of variation in a set of features, such as MRI abnormalities. Once the SuStain subtypes and their MRI trajectories are adequately identified, the disease model can conclude how approximately a patient, whose MRI is unseen, belongs to each of the three subtypes and stages.

A total of 9,390 patients participated in this research study. Six thousand three hundred twenty-two patients were utilized in training, and 3,068 patients were used for the validation dataset. Patient characteristics such as sex, age, disease duration, and expanded disability status scale (EDSS) were similar between the training and validation dataset. There were 18 MRI features measured, 13 of those differed dramatically from those between the MS training dataset and control group and were maintained in the SustaIn model. Three subtypes, with very distinct patterns, were identified in the training dataset and validated in the validation dataset. The early abnormalities noticed by SuStain helped define the three subtypes: cortex-led, normal-appearing white matter-led, and lesion-led.

There was a statistically significant difference in the rate of the disease progression between the subtypes in the training dataset and validation datasets. The lesion-led subtype held a 30% higher risk of developing 24-week confirmed disability progression (CDP) than the cortex-led subtype in the training dataset. The lesion-led validation dataset had a 32% higher risk of confirmed disability progression than the cortex-led subtype. No other differences in the advancement of disability between subtypes were noted. When SuStaIn was applied to the training and validation dataset, it was pointed out that there were differences in the risk of disability progression between SuStaIn stages.

Each MRI-based subtype had a different response to treatment, comparing those on treatment and those on placebo. The lesion-led subtype showed a remarkable response to the treatment. Patients on the lesion-led active treatment subtype showed a significantly slower worsening of EDSS than those on the placebo. No differences in the rate of EDSS were observed in those on the placebo compared to active treatment in the NAWM-led and cortex-led subtypes.

When SuStain was applied to a large set of Multiple sclerosis scans, it identified three subtypes. Researchers found out the patient’s baseline subtype and stage were associated with an increased risk of disease progression. Combining clinical information with the MRI-based three subtypes increased the predictive accuracy of just using the MRI scan information alone. The patterns of MRI abnormality in these subtypes provide perspicacity into disease mechanisms, and, alongside clinical phenotypes, they may aid the stratification of patients for future studies.

## 7. Acknowledgments

The author likes to thank Gregor von Laszewski, Yohn Jairo, and Carlos Theran.

## 8. References

[^6] What Is MS? National Multiple Sclerosis Society. (n.d.). https://www.nationalmssociety.org/What-is-MS.

# 14 - Report: AI in Orthodontics

In this effort we are analyzing X-ray images in AI and identifying cavitites

Status: final, Type: Report

Whitney McNair, su21-reu-363, Edit

Keywords: ai, orthodontics, x-rays.

## 1. Introduction

Dental field technology capability has increased over the past 25 years, and has helped reduce time, cost, medical errors, and dependence on human expertise. Intelligence in orthodontics can learn, build, remember, understand and recognize designs from techniques used in correcting the teeth like retainers. Dental field can create alternatives, adapt to change and explore experiences with sub-groups of patients. AI has taken part of the dental field by accurately and efficiently processing the best data from treatments. For smart use of Health Data, machine learning and artificial intelligence are expected to promote further development of the digital revolution in (dental) medicine, like x-rays, using algorithms to simulate human cognition in the analysis of complex data. The performance is better, the higher the degree of repetitive pattern and the larger the amount of accessible data1.

## 2. Data Sets

We found a dataset on a kaggle website that is about dental images. The data was collected by Mr. Parth Chokhra. The name of the dataset is Dental Images of kjbjl. The dataset did not have metadata and an explanation of how they collected the data. The data set supports how x-rays of teeth in dentistry becomes artificial intelligence. The Dental Images of kjbjl dataset was used in AI already using autoencoders. Autoencoders are an freely artificial neural network (located in the nervous system) that learns how to accurately encode data and reconstruct the data back from the reduced encoded depiction to a representation that is closes to the original. For some challenges with Orthodontics data sets with privacy, size, avalibility were surprisingly hard to find than we thought.

## 3. Figures

Below we observed actual dental x-rays. These dental x-rays images below came from Parth Chorkhra on kaggle.com 2. The images are patient x-rays taken by Parth in his dental imagery data set. In the images we can see the caps and nerves of the teeth. Using these x-rays, we may can also find cavities if there are some. We can also identify other issues with patients teeth by taking and using x-rays.

Figure 1: First x-ray

Figure 2: Second x-ray

Figure 3: Third x-ray

## 4. Example of a AI algorighm in Orthodontics

On a separate kaggle website, we found a code for DENTAL PANORAMIC KNN 1. The kaggle site shows dental codes taken place in Orthodontics.

## 5. Benchmark

Here is an algorithm/code from one of the researchers we found. are using to study the performance of their algorithms or code.

Lateral and frontal facial images of patients who visited the Orthodontic department (352 patients) were employed as the training and evaluation data. An experienced orthodontist examined all the facial images for each patient and identified as many clinically used facial traits during the orthodontic diagnosis process as possible (e.g., deviation of the lips, deviation of the mouth, asymmetry of the face, concave profile, upper lip retrusion, presence of scars). A sample patient’s image, a list of sample assessments (i.e., labels), and the multi-label data used in the work by Murata et al. are shown in Figure 4.

Figure 4: Sample patient’s image and a list of sample assessments (i.e., labels) including the region of interest, evaluation, etc. In a previous study by Murata et al., they employed labels representing only the facial part (mouth, chin, and whole face), distorted direction (right and left), and its severity (severe, mild, no deviation).

## 6. Conclusion

Artificial intelligence is rapidly expanding into multiple facets of society. Orthodontics may be one of the fastest branches of dentistry to adapt AI for three reasons. First, patient encounters during treatment generate many types of data. Second, the standardization in the field of dentistry is low compared to other areas of healthcare. A range of valid treatment options exists for any given case. Using AI and large datasets (that include diagnostic results, treatments, and outcomes), one can now measure the effectiveness of different treatment modalities given very specific clinical findings and conditions. Third, orthodontics is largely practiced by independent dentists in their own clinics. Despite the promise of AI, the volume of orthodontic research in this field is relatively low. Further, the clinical accuracy of AI must be improved with an increased number and variety of cases. Before AI can take on a more important role in making diagnostic recommendations, the volume and quality of research data will need to increase1.

## 7. Acknowledgments

Dr. Gregor von Laszewski, Carlos and Yohn guided me throughout this process.

## 8. References

1. Hasnitadita. (2021, July 10). DENTAL panoramic knn. Kaggle. https://www.kaggle.com/hasnitadita/dental-panoramic-knn ↩︎

2. Chokhra, P. (2020, June 29). Medical image dataset. Kaggle. https://www.kaggle.com/parthplc/medical-image-dataset ↩︎

# 15 - Time Series Analysis of Blockchain-Based Cryptocurrency Price Changes

This project applies neural networks and Artificial Intelligence (AI) to historical records of high-risk cryptocurrency coins to train a prediction model that guesses their price. The code in this project contains Jupyter notebooks, one of which outputs a timeseries graph of any cryptocurrency price once a csv file of the historical data is inputted into the program. Another Jupyter notebook trains an LSTM, or a long short-term memory model, to predict a cryptocurrency’s closing price. The LSTM is fed the close price, which is the price that the currency has at the end of the day, so it can learn from those values. The notebook creates two sets: a training set and a test set to assess the accuracy of the results. The data is then normalized using manual min-max scaling so that the model does not experience any bias; this also enhances the performance of the model. Then, the model is trained using three layers— an LSTM, dropout, and dense layer—minimizing the loss through 50 epochs of training; from this training, a recurrent neural network (RNN) is produced and fitted to the training set. Additionally, a graph of the loss over each epoch is produced, with the loss minimizing over time. Finally, the notebook plots a line graph of the actual currency price in red and the predicted price in blue. The process is then repeated for several more cryptocurrencies to compare prediction models. The parameters for the LSTM, such as number of epochs and batch size, are tweaked to try and minimize the root mean square error.

Status: final, Type: Project

Jacques Fleischer, su21-reu-361, Edit

Keywords: cryptocurrency, investing, business, blockchain.

## 1. Introduction

Blockchain is an open, distributed ledger which records transactions of cryptocurrency. Systems in blockchain are decentralized, which means that these transactions are shared and distributed among all participants on the blockchain for maximum accountability. Furthermore, this new blockchain technology is becoming an increasingly popular alternative to mainstream transactions through traditional banks3. These transactions utilize blockchain-based cryptocurrency, which is a popular investment of today’s age, particularly in Bitcoin. However, the U.S. Securities and Exchange Commission warns that high-risk accompanies these investments4.

Artificial Intelligence (AI) can be used to predict the prices' behavior to avoid cryptocurrency coins' severe volatility that can scare away possible investors5. AI and blockchain technology make an ideal partnership in data science; the insights generated from the former and the secure environment ensured by the latter create a goldmine for valuable information. For example, an up-and-coming innovation is the automatic trading of digital investment assets by AI, which will hugely outperform trading conducted by humans6. This innovation would not be possible without the construction of a program which can pinpoint the most ideal time to buy and sell. Similarly, AI is applied in this experiment to predict the future price of cryptocurrencies on a number of different blockchains, including the Electro-Optical System and Ethereum.

Long short-term memory (LSTM) is a neural network (form of AI) which ingests information and processes data using a gradient-based learning algorithm7. This creates an algorithm that improves with additional parameters; the algorithm learns as it ingests. LSTM neural networks will be employed to analyze pre-existing price data so that the model can attempt to generate the future price in varying timetables, such as ten days, several months, or a year from the last date. This project will provide as a boon for insights into investments with potentially great returns. These findings can contribute to a positive cycle of attracting investors to a coin, which results in a price increase, which repeats. The main objective is to provide insights for investors on an up-and-coming product: cryptocurrency.

## 2. Datasets

This project utilizes yfinance, a Python module which downloads the historical prices of a cryptocurrency from the first day of its inception to whichever day the program is executed. For example, the Yahoo Finance page for EOS-USD is the source for Figure 18. Figure 1 shows the historical data on a line graph when the program receives “EOS-USD” as an input.

Figure 1: Line graph of EOS price from 1 July 2017 to 22 July 2021. Generated using yfinance-lstm.ipynb2 located in project/code, utilizing price data from Yahoo Finance8.

## 3. Architecture

Figure 2: The process of producing LSTM timeseries based on cryptocurrency price.

This program undergoes the four main phases outlined in Figure 2: retrieving data from Yahoo Finance8, isolating the Close prices (the price the cryptocurrency has at the end of each day), training the LSTM to predict Close prices, and plotting the prediction model, respectively.

## 4. Implementation

Initially, this project was meant to scrape prices using the BeautifulSoup Python module; however, slight changes in a financial page’s website caused the code to break. Alternatively, Kaggle offered historical datasets of cryptocurrency, but they were not up to date. Thus, the final method of retrieving data is from Yahoo Finance through the yfinance Python module, which returns the coins' price from the day to its inception to the present day.

The code is inspired from Towards Data Science articles by Serafeim Loukas9 and Viraf10, who explore using LSTM to predict stock timeseries. This program contains adjustments and changes to their code so that cryptocurrency is analyzed instead. This project opts to use LSTM (long short-term memory) to predict the price because it has a memory capacity, which is ideal for a timeseries data set analysis such as cryptocurrency price over time. LSTM can remember historical patterns and use them to inform further predictions; it can also selectively choose which datapoints to use and which to disregard for the model11. For example, this experiment’s code isolates only the close values to predict them and nothing else.

Firstly, the code asks the user for the ticker of the cryptocurrency that is to be predicted, such as EOS-USD or BTC-USD. A complete list of acceptable inputs is under the Symbol column at https://finance.yahoo.com/cryptocurrencies but theoretically, the program should be able to analyze traditional stocks as well as cryptocurrency.

Then, the program downloads the historical data for the corresponding coin through the yfinance Python module. The data must go through normalization for simplicity and optimization of the model. Next, the Close data (the price that the currency has at the end of the day, everyday since the coin’s inception) is split into two sets: a training set and a test set, which are further split into their own respective x and y sets to guide the model through training.

The training model is run through a layer of long short-term memory, as well as a dropout layer to prevent overfitting and a dense layer to give the model a memory capacity. Figure 3 showcases the setup of the LSTM layer.

Figure 3: Visual depiction of one layer of long short-term memory12

After training through 50 epochs, the program generated Figure 4, a line graph of the prediction model. Unless otherwise specified, the following figures use the EOS-USD data set from July 1st, 2017 to July 26th, 2021. Note that only the last 200 days are predicted so that the model can analyze the preexisting data prior to the 200 days for training purposes.

Figure 4: EOS-USD price overlayed with the latest 200 days predicted by LSTM

Figure 5: Zoomed-in graph (same as Figure 4 but scaled x and y-axis for readability)

During training, the number of epochs can affect the model loss. According to the following figures 6 and 7, the loss starts to minimize around the 30th epoch of training. The greater the number of epochs, the sharper and more accurate the prediction becomes, but it does not vastly improve after around the 30th epoch.

Figure 6: Line graph of model loss over the number of epochs the prediction model completed using EOS-USD data set

Figure 7: Effect of EOS-USD prediction model based on number of epochs completed

The epochs can also affect the Mean Squared Error, which details how close the prediction line is to the true Close values in United States Dollars (USD). As demonstrated in Table 1, more epochs lessens the Mean Squared Error (but the change becomes negligible after 25 epochs).

Table 1: Number of epochs compared with Mean Squared Error; all tests were run with EOS-USD as input. The Mean Squared Error is rounded to the nearest thousandth.

Epochs Mean Squared Error
5 0.924 USD
15 0.558 USD
25 0.478 USD
50 0.485 USD
100 0.490 USD

Lastly, cryptocurrencies other than EOS such as Dogecoin, Ethereum, and Bitcoin can be analyzed as well. Figure 8 demonstrates the prediction models generated for these cryptocurrencies.

Figure 8: EOS, Dogecoin, Ethereum, and Bitcoin prediction models

Dogecoin presents a model that does not account well for the sharp rises, likely because the training period encompasses a period of relative inactivity (no high changes in price).

## 5. Benchmark

The benchmark is run within yfinance-lstm.ipynb located in project/code2. The program ran on a 64-bit Windows 10 Home Edition (21H1) computer with a Ryzen 5 3600 processor (3.6 GHz). It also has dual-channel 16 GB RAM clocked at 3200 MHz and a GTX 1660 Ventus XS OC graphics card. Table 2 lists these specifications as well as the allocated computer memory during runtime and module versions. Table 3 shows that the amount of time it takes to train the 50 epochs for the LSTM is around 15 seconds, while the entire program execution takes around 16 seconds. A StopWatch module was used from the package cloudmesh-common13 to precisely measure the training time.

Table 2: First half of cloudmesh benchmark output, which details the specifications and status of the computer at the time of program execution

Attribute Value
cpu_cores 6
cpu_count 12
frequency scpufreq(current=3600.0, min=0.0, max=3600.0)
mem.available 7.1 GiB
mem.free 7.1 GiB
mem.percent 55.3 %
mem.total 16.0 GiB
mem.used 8.8 GiB
platform.version (‘10’, ‘10.0.19043’, ‘SP0’, ‘Multiprocessor Free’)
python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
python.pip 21.1.3
python.version 3.9.5
sys.platform win32
uname.machine AMD64
uname.processor AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
uname.release 10
uname.system Windows
uname.version 10.0.19043

Table 3: Second half of cloudmesh benchmark output, which reports the execution time of training, overall program, and prediction

Name Time Sum Start OS Version
Overall time 16.589 s 35.273 s 2021-07-26 18:39:57 Windows (‘10’, ‘10.0.19043’, ‘SP0’, ‘Multiprocessor Free’)
Training time 15.186 s 30.986 s 2021-07-26 18:39:58 Windows (‘10’, ‘10.0.19043’, ‘SP0’, ‘Multiprocessor Free’)
Prediction time 0.227 s 0.474 s 2021-07-26 18:40:13 Windows (‘10’, ‘10.0.19043’, ‘SP0’, ‘Multiprocessor Free’)

## 6. Conclusion

At first glance, the results look promising as the predictions have minimal deviation from the true values. However, upon closer look, the values lag by one day, which is a sign that they are only viewing the previous day and mimicking those values. Furthermore, the model cannot go several days or years into the future because there is no data to run on, such as opening price or volume. The experiment is further confounded by the nature of stock prices: they follow random walk theory, which means that the nature in which they move follows a random walk: the changes in price do not necessarily happen as a result of previous changes. Thus, this nature of stocks contradicts the very architecture of this experiment because long short-term memory assumes that the values have an effect on one another.

For future research, a program can scrape tweets from influencers' Twitter pages so that a model can guess whether public discussion of a cryptocurrency is favorable or unfavorable (and whether the price will increase as a result).

## 7. Acknowledgments

Thank you to Dr. Gregor von Laszewski, Dr. Yohn Jairo Parra Bautista, and Dr. Carlos Theran for their invaluable guidance. Furthermore, thank you to Florida A&M University for graciously funding this scientific excursion and Miami Dade College School of Science for this research opportunity.

## 8. References

1. Jacques Fleischer, README.md Install Documentation, [GitHub] https://github.com/cybertraining-dsc/su21-reu-361/blob/main/project/code/README.md ↩︎

2. Jacques Fleischer, yfinance-lstm.ipynb Jupyter Notebook, [GitHub] https://github.com/cybertraining-dsc/su21-reu-361/blob/main/project/code/yfinance-lstm.ipynb ↩︎

3. Marco Iansiti and Karim R. Lakhani, The Truth About Blockchain, [Online resource] https://hbr.org/2017/01/the-truth-about-blockchain ↩︎

5. Jeremy Swinfen Green, Understanding cryptocurrency market fluctuations, [Online resource] https://www.telegraph.co.uk/business/business-reporter/cryptocurrency-market-fluctuations/ ↩︎

6. Raj Shroff, When Blockchain Meets Artificial Intelligence. [Online resource] https://medium.com/swlh/when-blockchain-meets-artificial-intelligence-e448968d0482 ↩︎

7. Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory, [Online resource] https://www.bioinf.jku.at/publications/older/2604.pdf ↩︎

8. Yahoo Finance, EOS USD (EOS-USD), [Online resource] https://finance.yahoo.com/quote/EOS-USD/history?p=EOS-USD ↩︎

9. Serafeim Loukas, Time-Series Forecasting: Predicting Stock Prices Using An LSTM Model, [Online resource] https://towardsdatascience.com/lstm-time-series-forecasting-predicting-stock-prices-using-an-lstm-model-6223e9644a2f ↩︎

10. Viraf, How (NOT) To Predict Stock Prices With LSTMs, [Online resource] https://towardsdatascience.com/how-not-to-predict-stock-prices-with-lstms-a51f564ccbca ↩︎

11. Derk Zomer, Using machine learning to predict future bitcoin prices, [Online resource] https://towardsdatascience.com/using-machine-learning-to-predict-future-bitcoin-prices-6637e7bfa58f ↩︎

12. Christopher Olah, Understanding LSTM Networks, [Online resource] https://colah.github.io/posts/2015-08-Understanding-LSTMs/ ↩︎

13. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 16 - Analysis of Covid-19 Vaccination Rates in Different Races

With the ready availability of COVID-19 vaccinations, it is concerning that a suprising large portion of the U.S. population still refuses to recieve one. In order to control the spread of the pandemic and possibly even erradicate it completely, it is integral that the United States vaccinate as much of the population as possible. Not only does this require ensuring that everyone who wishes to be vaccinated recieves a vaccine, it also requires that those who are unwilling to recieve the vaccine are persuaded to take it. The goal of this report is to analyze the demographics of those who are hesitant to recieve the vaccine and find the reasoning behind their decision. This will make it easier to properly persuade them to recieve the vaccine and aid in raising the United States' vaccination rates.

Status: draft, Type: Project

Ololade Latinwo, su21-reu-375, Edit

Keywords: tensorflow, example.

## 1. Introduction

It has been shown by several economic and health institutions that rates COVID-19 in the United States have been among the highest in the world. Estimates show that about 10 million people have been infected and over a quarter of a million have died in the U.S. by the end of November 2020 1. Fortunately, several pharmaceutical companies such as Pfizer, Moderna, and Johnson & Johnson have managed to create a vaccine by the end of 2020, with several million Americans being given the vaccine by early March. Interestingly, it appears that despite the ready availability of vaccines, a sizeable portion of the population has no intention of receiving either their second does or either dose at all. Voluntarily receiving the COVID-19 vaccine is integral to putting the pandemic to an end, so it is important to explore which demographics are hesitant to receive their vaccine and explore their reasons for doing so. In this project, the variables that will be examined are age, sex, race, education level, location within the United States, and political affiliation.

## 2. Data Sets

The first of these larger data sets are the results from a study done in the Journal of Community of Health done by Jagdish Khubchandani, Sushil Sharma, James H. Price, Michael J. Wiblishauser, Manoj Sharma, and Fern J. Webb. In this study, participants of a variety of backgrounds were given a questionnaire regarding whether or not they were likely or unlikely to receive the COVID vaccine. The variables included in the study and explored in this project are sex, age group, race, education level, regional location, and political affiliation. The raw data for this study is not available, however, the study has published the results as percentages. The second and third data sets are two maps of the United States that are provided by the CDC and illustrate the estimated rates of vaccine hesitancy and vaccination progress. Additionally, the United States Census Bureau provided a breakdown of vaccine hesitancy and progress by certain variables, of which age, sex, race and ethnicity, and education level are explored in this project. Like the study done by the Journal of Community Health, the raw data is not available, but percentages are available for each variable.

### 2.1 COVID-19 Vaccination Hesitancy in all 50 States by County

Figure 1: An image of the United States illustrating the percentage of adults who are hesitant to recieve a vaccine for COVID-192.

### 2.2 Vaccination Rates by Demographic

Figures 2a-d: Images of vaccine hesitancy rates by certain demographics 2.

### 2.3 COVID-19 Vaccination Rates in all 50 States

Figure 3: An image of the United States illustrating the percentage of adults who have recieved at least one dose of the COVID-19 vaccine3.

### 2.4 Vaccination Rates in all 50 States

Figures 4a-d: Images of vaccine hesitancy rates by certain demographics

### 2.5 Results of the Study Done by the Journal of Community Health

Figure 5a-c: Relevant results for the explored variables from a study in vaccine hesitancy rates done by the Journal of Community Health

## 3. Results

### 3.1 Race

In the study done by the Journal of Community Health, Black Americans are shown to have the highest rate of vaccine hesitancy at 34% with White Americans not very far behind at 22%1. Additionally, Asian Americans have the lowest rate of hesitancy at 11%1. Similar results can be seen in the data provided by the US Census Bureau, with Black and White Americans having the highest two vaccination hesitancy rates, however they are shown to be very close, at 10.6% and 11.9%, respectively4. In Addition, Asian Americans are once again shown to have the lowest vaccination rates at as low as 2.3%4. This information correlates with the data provided by the US Census Bureau, which shows that Black Americans have the lowest rate of vaccination at 72.7% and Asian Americans have the highest rate of vaccination of 94.1%4.

### 3.2 Sex

In the study done by the Journal of Community Health, men and women are shown to have very similar rates of vaccine hesitancy at 22% for both groups1. This can also be seen in the data provided by the US Census Bureau with women having a hesitancy rate of 10.3% and men having a barely higher rate of 11.3%4. This goes along well with the vaccination rates provided by the US Census Bureau, which shows that men and women have vaccination rates of 81.4% and 80.5%4.

### 3.3 Age

The age ranges for both studies are grouped slightly differently, yet, interestingly enough, both data sets show two completely different age groups having the highest rate of hesitancy. In the Journal of Community Health study, the 41-60-year-old group is shown to have the highest rate at 24%1. Yet, in the US Census Bureau’s data, ages 25-39 have the highest hesitancy rate at 15.9%4. Because it has a much larger sample size, it is safer to assume that the US Census Bureaus data is more correct in this case. However, both data sets show that seniors have the lowest hesitancy rates, which makes sense as this age group has the highest vaccination rate of 93%4.

### 3.4 Education Level

Once again, the education levels are grouped slightly differently with the US Census Bureau not taking account for those with a Master’s degree above and the Journal of Community Health grouping together those who have a high school education and those who have not completed high school. However, unlike the previous variable, very similar results are yielded despite the difference. In both studies, those who had a high school education or below had the highest hesitancy rate at 31% in the Journal of Community Health study and 15.7% for those who have less than a high school education and 13.8% for those who have one in the data provided by the US Census Bureau14. This correlates with the vaccination rate data as those who have a high school education or less have the two lowest vaccination rates of 74.3% and 70.9%, respectively4.

### 3.5 Location

The Journal of Community Health is the only data set that assigns specific percentages to the hesitancy rates to regions of the United States with the Northeast having the highest hesitancy rate of 25%, followed very closely by the West and South at 24% and 23%, respectively1. This leaves the Midwest with the lowest hesitancy rate of 18%1. This is partially reflected by the data displayed in the map of the United States that illustrates the rates of hesitancy made by the Census Bureau, which shows that the rates of vaccine hesitancy are slightly higher in the Northeast, West, and Midwest, with the lowest hesitancy rate in the South2. Once again, due to the much larger sample size, it is safer to assume that the Census Bureau is more correct.

### 3.6 Political Affiliation

This variable is only seen in the Journal of Community Health study. It shows that vaccine hesitancy is highest in Republicans at 29% and lowest in Democrats at 16%.

## 4. Conclusion

After analyzing the data, a person’s sex appears to have no influence regarding a person’s willingness to recieve a COVID vaccine. However, the other variables looked at during this study appear to have an influence. One’s level of education appears to have great influence over their willingness to take the vaccine. This makes sense as those who are less educated are less likely to understand the significance of vaccinations and how they protect the U.S. population and may also be more vulnerable to false information about the vaccine as they may not have the skills to deduce whether a source is reliable or not. Additionally, the data shows that younger adults are less likely to recieve the vaccine. This is possibly due to the common misconception that young people are not adversely affected by the virus and as a result the vaccine is not necessary, which could be why , according to the US Census Bureau, 34.9% of those who are hesitant to take the vaccine don’t believe that they need it4. One’s political affiliation had a great influence on whether or not someone would be receptive to the vaccine as there is a 13% difference between the hesitancy rates in Republicans and Democrats1. This is very interesting considering that the region with the greatest Republican presence, the South, has the lowest hesitancy rate while the rest of the regions, which have a smaller Republican presence, have the highest hesitancy rates 32. Race is also a significant variable, with Black Americans having the lowest vaccination rates and among the highest vaccination hesitancy rates. This could be due to a number of overlapping factors, such as level of education and past historical events. For example, a majority of Black Americans only have a high school diploma, while a relatively small amount of them have a Bachelor’s Degree and as we’ve seen before, those with a lower level of education are more likely to be hesitant about taking the vaccine5. Additionally, Black Americans have a fairly long history of being used as medical guinea pigs, with the most infamous example being the Tuskeegee syphillis study, which would understandably result in some aversion to new vaccinations from the community6.

After analyzing this data, it is very apparent that, in order to persuade more Americans to take it, that a significant amount of time and effort be put towards having easily accesible education to the real facts about the COVID vaccine, especially in poorer areas where there people are more likely to have lower levels of education. Additionally, to reduce vaccine hesitancy amongst Republicans, we must stop politicizing the vaccine as well as the severity of COVID-19 and frame the pandemic as a bipartisan matter of public health. Regarding the high vaccination hesitancy amongst Black Americans, there are no short-term fixes as a significant portion of the community has not forgotten about the several transgressions that the United States has committed against them, however, the sooner the United States manages to regain the trust of the Black community, the closer we come to ending this pandemic.

## 6. Acknowledgments

Special thanks to Yohn J Parra, Carlos Theran, and Gregor Lasweski for supporting this project.

## 7. References

1. Khubchandani, J., Sharma, S., Price, J.H. et al. COVID-19 Vaccination Hesitancy in the United States: A Rapid National Assessment. J Community Health 46, 270–277 (2021). https://doi.org/10.1007/s10900-020-00958-x ↩︎

2. Estimates of vaccine hesitancy for COVID-19 Center for Disease Control and Prevention https://data.cdc.gov/stories/s/Vaccine-Hesitancy-for-COVID-19/cnd2-a6zw ↩︎

3. Party Affiliation by Region Pew Research Center https://www.pewforum.org/religious-landscape-study/compare/party-affiliation/by/region/-trend ↩︎

4. Household Pulse Survey COVID-19 Vaccination Tracker United States Census Bureau https://www.census.gov/library/visualizations/interactive/household-pulse-survey-covid-19-vaccination-tracker.html ↩︎

5. Black High School Attainment Nearly on Par With National Average United States Census Bureau https://www.census.gov/library/stories/2020/06/black-high-school-attainment-nearly-on-par-with-national-average.html ↩︎

6. Washington, Harriet A. Medical Apartheid: The Dark History of Medical Experimentation on Black Americans from Colonial Times to the Present https://www.jstor.org/stable/25610054?seq=1#metadata_info_tab_contents ↩︎

# 17 - Aquatic Toxicity Analysis with the aid of Autonomous Surface Vehicle (ASV)

With the passage of time, human activities have created and contributed much to the aggrandizing problems of various forms of environmental pollution. Massive amounts of industrial effluents and agricultural waste wash-offs, that often comprise pesticides and other forms of agricultural chemicals, find their way to fresh water bodies, to lakes, and eventually to the oceanic systems. Such events start producing a gradual increase in the toxicity levels of marine ecosystems thereby perturbing the natural balance of such water-bodies. In this endeavor, an attempt will be made to analyze the various water quality metrics (viz. temperature, pH, dissolved-oxygen level, and conductivity) that are measured with the help of autonomous surface vehicles (ASV). The collected data will undergo big data analysis tasks so as to find the general trend of values for the water quality of the given region. These obtained values will then be compared with sample water quality values obtained from neighboring sources of water for ascertaining if these sample values exhibit aberration from the established values that were found earlier from the big data analysis tasks for water-quality standards. In the event, the sample data popints significantly deviate from the standard values established earlier, it can then be successfully concluded that the aquatic system in question, from which the water sample was sourced from, has been degraded and may no longer be utilized for any form of human usage, such as being used for drinking water purposes.

Status: final, Type: Project

Saptarshi Sinha, fa20-523-312, Edit

Keywords: toxicology, pollution, autonomous systems, surface vehicle, sensors, arduino, water quality, data analysis, environment, big data, ecosystem

## 1. Introduction

When it comes to revolutionizing our qualities of life and improving standards, there is not another branch of science and technology that has made more impact than the myriad technological capabilities offered by the areas of Artificial Intelligence (AI) and its sub-fields involving Computer Vision, Robotics, Machine Learning, Deep Learning, Reinforcement Learning, etc. It should be borne in mind that AI was developed to allow machines/computer processors to work in the same way as the human brain works and which could make intelligent decisions at every conscious level. It was meant to help with tasks for rendering scientific applications more smarter and efficient. There are many tasks that can be performed in a far more dexterous fashion by employing smart-machines and algorithms than by involving human beings. But even more importantly, AI has also been designed to perform tasks that cannot be successfully completed by employing human beings. This could either be due to the prolonged boredom of the task itself, or a task that involves hazardous environments that cannot sustain life-forms for a long time. Some examples in this regard would involve exploring deep mines or volcanic trenches for mineral deposits, exploring the vast expanse of the universe and heavenly bodies, etc. And this is where the concept employing AI/Robotics based technology fits in perfectly for aquatic monitoring and oceanographical surveillance based applications.

Toxicity analysis of ecologically vulnerable water-bodies, or any other marine ecosystem for that matter, could give us a treasure trove of information regarding biodiversity, mineral deposits, unknown biophysical phenomenon, but most importantly, it could also provide meaningful and scientific information related to the degradation of the ecosystem itself. In this research endeavor, an attempt will be made to utilize aquatic Autonomous Surface Vehicle (ASV) that will be deployed in marine ecosystems and which can continue collecting data for a prolonged period of time. Such vehicles are typically embedded with different kinds of electronic sensors, that are capable of measuring physical quantities such as temperature, pH, specific conductance, dissolved oxygen level, etc. The data collected by such a system can either be over a period of time (temporal data), or it could cover a vast aquatic geographical region (spatial data). This is the procedure by which environmental organizations record and store massive amounts of data that can be analyzed to obtain useful information about the particular water body in question. Such analytical work will provide us with statistical results that can then be compared with existing sample values so as to decipher whether the water source, from where the sample was obtained, manifests normal trend of values or shows large deviations from established trends that can signify an anomaly or biodegradation of the ecosystem. The datasets used in this endeavor are provided publicly by environmental organizations in the United States, such as the US Geological Survey (USGS). While the primary goal involves conducting big data analysis tasks for the databases so as to obtain useful statistical results, a secondary goal in this project involves finding out the extent to which a particular sample of water, obtained from a specific source of water, deviates from the normal trend of values. The extent of such deviations can then give us an indication about the status of the aquatic degradation of the ecosystem in question. The data analysis framework will be made as robust as possible and in this effort, we will work with data values that are spread over multiple years and not just focused on a single year.

## 2. Background Research and Previous Work

After reviewing the necessary background literature and previous work that has been done in this field, it can be stated that most of such endeavors focused majorly on environmental data collection with the help of sensors attached to a navigational buoy in a particular location of a water-body. Such works did not involve any significant data analysis framework and focused on particular niche areas. For instance, a particular research effort involved deploying a surface vehicle that collected data from large swaths of geographical areas in various water bodies but concentrated primarily on different algorithms employed for vehicular navigation and their relative success rates 1. Other research attempts focussed on even more niche areas such as study of the migration pattern exhibited by zooplanktons upon natural and aritifical irradiance 2, and detection and monitoring of marine fauna 3. Although these are interesting projects and can provide us with novel information about various aspects of biological and aquatic research, such research attempts neither focused much on the data analysis portion for multiple sensory inputs (viz. temperature, pH, specific conductance, and dissolved oxygen level, which are the four most important water quality parameters) nor did they involve an intricate procedure to compare the data with sample observations so as to arrive at a suitable conclusion regarding the extent of environmental degradation of a particular water body.

As mentioned in the previous section, this research endeavor will exhaustively focus not just on the working principles and deployment of surface vehicles to collect data, but it will also involve employing deeper study towards the subject of big-data analysis of both the current data of the system in question and the past data obtained for the same aquatic profile. In this way, it would be possible to learn more about the toxicological aspects of the ecosystem in question and which can then be also applied to neighboring regions.

## 3. Choice of Data-sets

Upon exploring a wide array of available datasets, the following two data repositories were given consideration to get the required water quality based data over a particular period of time and for a particular geographical region:

After going through the sample data values than can be visualized from the respective websites of USGS and EPA, the USGS datasets were chosen over the EPA datasets. This is mainly because the USGS datasets are more commensurate with the research goal of this endeavor, especially since it contains a huge array of databases that focuses on the four most important water quality data which are - temperature, pH, specific conductance, and dissolved oxygen level. Some previous work was conducted on similar USGS datasets by a particular research team 6. However, such work was drastically different in nature, when compared with this research attempt, since its emphasis was on a very broad perspective so as to create an overview of how to use and visualize the data from the USGS water quality portal. Besides, such work emphasizes on characterizing the seasonal variation of lake water clarity in different regions throughout the continental US, something that is very deviant from what would be addressed in this particular article which majorly involves studying environmental degradation and aquatic toxicology from the context of big data analytical tasks.

To address the questions involving existence of multiple data-sets and motivation of using multiple data-sets, we must keep in mind that the very nature of this study is based on historical trends of the nature of water-quality in a particular region from the past and to this effect, emphasis has been given to use data values from the past years as well in addition to the current year. The geographical location for this analysis was chosen to be the East Fork Whitewater River that is located at Richmond, IN (USA). The years that have been chosen in this case are 2017, 2018, 2019, and 2020. For all these years, focus would be placed on the same time-period, that spans from November 1 to November 14, for all these years so as to establish consistency and high fidelity across the borad. Having multiple data-sets in this way will help us in achieving robust data-analytical results. It would also ensure that too much focus is not given on outlier cases, that may be relevant to just a particular time and day on a given year, or an aberration in the data that may only have surfaced due to an unknown underlying phenomenon or some form of cataclysmic event from the past. Using multiple datasets would help to get a resultant data structure that is more likely to converge towards an approximate level of historical thresholds and which can then be used to find out how a current sample data-point deviates from such established trends of previous patterns.

## 4. Methodology

### 4.1 Hardware Component

Although this project focuses more on the data analysis portion than mechanical details, some information relating to the design and working principle of ASVs is provided herewith.The rough outline of an autonomous surface vehicle (ASV) in question has been perceived in Autodesk Fusion 360, which is a software package that helps creating and printing three-dimensional custom designs. A preliminary model has been designed in this software and after printing, it can be interfaced with the appropriate sensors in question. The system can be driven by an Arduino-Uno based microcontroller or even a Raspberry Pi based microprocessor, and it can comprise different types of environmental sensors that helps with collecting and offloading data to remote servers/machines. Some of these sensors can be purchased commercially from the vendor, “Atlas Scientific” 7. For instance, the following sensors can be used with an ASV to measure the four most important water quality parameters involving temperature, pH, dissolved oxygen level, and specific conductance values:

A very rudimentary framework of such a system has been realized in the Autodesk Fusion 360 software architecture as shown below. A two-hull framework is usually more helpful than a single hull based design since the former would help with stability issues especially while navigating through choppy waters. Figure 1 shows the design in the form of a very simplistic platform but which definitely lays down the foundation for a more complex structure for an ASV system.

Figure 1: Nascent framework of an ASV system in Fusion 360

With the chassis framework out of the way, a careful analysis could be conducted towards the other successful components of such a vehicle so as to complete the entire build process for a fully functional prototype ASV. In essence, an ASV can be thought of being composed of certain key sub-elements. From a broad perspective, they comprise the hardware makeup, a suitable propulsion system, a sensing system, a communication system, and an appropriate source of onboard power source. The hardware makeup being out of the way, the other aspects can now be elaborated as follows:

Propulsion System: Primarily, the two major possibilities for propulsion systems in an ASV involve using either a single servo motor with an assortment of rudders and propellers for appropriate steering, or using two separate servo motors, one of which will drive the left-hand side of the system and the other would drive the right-hand side. The second arrangement is preferred in many scenarios as it provides with better maneuverability and control of the system as a whole. For instance, to move forward in a rectilinear fashion, both the motors would be given the same level of power. Whereas for steering the system in a particular direction, one of the motors would be assigned a lower power level than the other, thereby enabling the system to curve inwards on the side which has the motor with a lower power level. Of course, there will always be perturbations and natural disturbances that will deter the system from making these correct path changes. For this reason, a Proportional-Integral-Derivative (PID) controlled response could be augmented with the locomotion algorithms.

Sensing System: An ASV can have as many sensors as possible (dependent upon physical and electrical constraints of microcontroller/microprocessor) but for a study like this, an arrangement involving four different sensors needs to be integrated in the ASV which measures the four principle water quality parameters. This way, when the entire ASV system is deployed in an aquatic environment, it will be able to simultaneously provide readings for all four water-quality parameters in this case. Precisely, these water-quality parameters would be temperature, potential of hydrogen (pH), dissolved oxygen level, and specific conductance value. It should be noted in this perspective that it is possible to include even more sensors in this ASV system. However, the reason why it is generally not helpful to go beyond a certain number of sensors is primarily because of two reasons. Firstly, these parameters are important to most toxicological analysis studies 1, and the readings provided by such a sensory system could be considered as a foundation which could provide future directions (including adding more sensors, if needed). Secondly, we should also keep in mind that the hardware system has certain constraints. In this scenario, it involves a sensory shield (that can be integrated with a microcontroller or microprocessor) which can be used for incorporating multiple sensors. But it also has a maximum of four ports for four different sensors only. Though it is possible to add multiple layers of shield on top of the others (thereby raising the capability of integrating the number of sensors to eight or even more), it leads to unnecessary bandwidth issues, memory depletion possibilities, along with an increased demand of higher power supply. These issues will especially be more consequential if we are dealing with a microcontroller that has very limited memory and power, unlike a microprocessor. Hence, the decision to stick with only a certain number of sensors is really an important one.

Communication System: This is possibly the most important part of the ASV system as we need to device a technique to offload the data that is collected by the vehicle back to a remote computer/server that would likely be located at a considerable distance away from the ASV. There are different options that can be considered in this regard for establishing a proper communicative functionality between the ASV and the remote computer. Some options that are typically considered involve Bluetooth, IR signals, RF signals, GPS-based system, satellite communication, etc. There are both pros and cons when it comes to using any of these different communication systems for the ASV. However, the most important metric in this case involves the maximum range the communication system could span over. Obviously, some of the options (viz. Bluetooth) would not be possible in this regard as they have a very limited communication range. Some others (viz. satellite communication systems) have a very high range but are nevertheless not feasible for small-scale research endeavors as they require too much onboard processing power to even carry out their most basic operations. Hence, a balanced approach is normally followed in these scenarios and a GPS/RF-based system is often found to be a reliable candidate for carrying out the proposed tasks of an ASV.

Power Source: Finally, we certainly need an onboard processing power system that can provide the required amount of power to all the functional entities housed in the ASV system. The characteristic of such a desired power source would be that it would not require frequent charging and can sustain proper ASV operations for at least five to six hours. Additionally, the weight of the power source should also not be too clumsy that might put the stability of the entire ASV system in jeopardy. It must have a suitable weight, and should also come in an appropriate shape and size such that the weight of the entire power module is evenly distributed over a large area, thereby further reinforcing the stability of the system.

### 4.2 Software Component

With the help of the collected data from ASVs, the datasets of USGS are then prepared which meticulously tabulates all the readings from the different water sensors of the ASV. Such tabular data is made public for research and other educational purposes. In this endeavor, such datasets will be analyzed to decipher the median convergent values of the water body for the four different parameters that have been measured (i.e. temperature, pH, dissolved oxygen level, and specific conductance). The results of this data analysis task will manifest the water quality parametric values and standards for the particular aquatic ecosystem. Such a result will then be used to find out if a different water sample value sourced from a particular region deviates by a large proportion from the established standards which was obtained after analyzing the historical data from USGS for a regional source of water. The USGS website makes it easier to find data from a nearby geographical region by making it possible to enter the desired location prior to searching for water quality data in their huge databases. In this way, one can also use these databases to figure out if the water quality parameters of the particular ecological system varies wildly from a neighboring system that has almost the same geographical and ecological attributes.

The establishment of the degree of variance of the sample data from the normal standards will be carried out by deciphering the number of water quality paramteric values that are aberrant in nature. For instance, a sample value with only an aberrant pH value could be classified as “Critical Category 1” whereas, a sample value with aberrant values for pH, temperature, and specific conductance would be classified as “Critical Category 3”. The aberrant nature of a particular parameter is postulated by enumerating how far the values are away from the established median data, which was obtained from the past/historical datasets. This will involve centroid-based calculations for the k-means algorithm (discussed in the next section). Such aberrant nature of a particular water quality paramteric value can also be figured out by using the context of standard deviations and quartile ranges. For instance, if the current data resides in the second quartile, it can be demarcated as being more or less consistent with previously established values. However, if it resides in the first or third quartile then it might be that the particular ecosystem has aberrant aspects, which would then need to be investigated for possible effects of outside pollutants (viz. industrial effluents, agricultural wash-off, etc.), or presence of harmful invasive species that might be altering the delicate natural balance of the ecosystem in question.

The software logic is located at this link 8. It has been created using the aid of Google Colaboratory platform, or simply Colab 9. The code has been appropriately commented and documented in such a way that it can be easily reproduced. Important instructions and vital information about different aspects of the code have been properly written down wherever necessary. The coding framework helps in corroborating the inferences and conclusions made by this research attempt, and which are described in detail in the subsequent sections. The major steps that have been followed in establishing the software logic for this big-data analytical task are discussed below.

#### 4.2.1 Data Pre-processing

Meta-data: As with any other instance of big data, the data obtained from the USGS website is rife with many unnecessary information. Most of these information relate to meta-data for the particular database and it also contains detailed logistical information such as location information, units used for measurement, contact information, etc. Fortunately, all these information have been bunched up nicely at the beginning of the database and they were conveniently filtered out by skipping the first thirty-four (34) rows while reading the corresponding comma separated value (csv) files.

Extraction of required water-quality parameters (Temperature, Specific Conductance, pH, Dissolved Oxygen): After filtering out the meta-data, it is very essential to focus only on the relevant portion of the database which contains the required information that is needed for the data analysis tasks. In this case, if we observe carefully, we will notice that not all the columns contain the required information relating to water-quality parameters. Some of them include information such as date, time, system’s unit, etc. Since these will not be required for our analysis task, we extract only those columns that contain information relating to the four primary water-quality parameters. Figure 2 displays the layout of the USGS dataset files containing all the extraneous information that are deemed unnecessary for the big data analysis task.

Figure 2: Sample Database file obtained from the USGS water-quality database for the year 2017

#### 4.2.2 Attributes of the preliminary data

Seasonal Consistency: The preliminary data, that was pre-processed and extracted, was plotted to visualize the basic results. The data pertains to a particular duration of time in a specific seasonal time of the year, more importantly that spans the first two weeks of November for all the four years. This has been done to maintain consistency of results across the platform and to create as much as a robust framework as possible that can have high fidelity.

Data-points as x-axis: Additionally, it should be kept in mind that the data has already been time-stamped rigorously. More precisely, the databases that are uploaded to the USGS website have data tabulated in them that are arranged in a chronological manner. As a result, having the x-axis refer to actual data-points means the same if we had the x-axis refer to time instead. These plotted results have been provided in the next section.

Visualization of trends: The preliminary plotting of the data helps us to visualize the overall trends of the variation of the four important water-quality parameters. This gives an approximate idea regarding what we should normally expect from the water-quality data, the approximate maximum and minimum range of values, and it further helps in detecting any kind of outlier situations that might arise either due to the presence of artifacts, or nuisance environmental variables.

#### 4.2.3 Unsupervised learning: K-means clustering analysis (“Safe” & “Unsafe” Centroid calculation)

General Clustering Analysis: The concept behind any clustering based approach involves an unsupervised learning mechanism. This means that the dataset traditionally does not come labelled and the task in hand is to find patterns within the data-set based on suitable metrics. These patterns help delineate the different clusters and classify the data by assigning them to one of these clusters. This process is usually carried out by measuring the euclidean distance of each point from the “centroids” of every cluster. The resultant clusters are created in such a way that the distance within the points in a particular cluster are minimized as much as possible whereas, the corresponding distance between points from different clusters are maximized. Figure 3 summarizes this concept of clustering technique.

Figure 3: Concept of Clustering analysis adopted as an unsupervised learning process 10

K-means Classification: Centroid Calculation: The algorithm for this step first starts with selecting a random centroid value to start with to begin the iteration process. In order to be rigorous in this regard, the initial points for centroid values were chosen to be one standard deviation away from the median values for the four water-quality parameters. More importantly, two initial centroid values were chosen that would result in two clusters, one belonging to the safe water-standard cluster and the other belonging to the unsafe category. For the safe centroid value, the point chosen was such that it was one standard deviation lower than the median values for the temperature and specific conductance parameters, whereas it was one standard deviation higher than the median values for the pH and disolved oxygen level parameters. This logic was reversed in case of the unsafe centroid starting value. The intuition behind this approach comes from the fact that a lower temperature and specific conductance value means lower degree of exothermic reactions and lower amount dissolved salts which are typically the traits of unpolluted water sources, and which also have higher pH value (that is, less acidic) and higher level of dissolved oxygen. Hence, the initial centroid value for the “safe” category was chosen in this way. For the unsafe cateory, the metrics were simply reversed.

Iteration process: The next steps for the centroid calculation involves creating an iteration process which will keep updating and perfecting the centroid values upon every execution of the iteration. In this case, the condition for ending the iteration involved either exceeding fifty iterations or reaching the ideal situation where the new centroid value equals the previous centroid value. In the later case, it can be mentioned that the centroid calculation has converged to a specific ideal value. Fortunately, in the case of this project, it was possible to arrive at this convergence of centroid values with not too many iteration steps. The details of this process has been explained in the coding framework with appropriate comments.

Assigning of Clusters: At the end of the iteration step, we end up with the final centroid values for the safe and unsafe category. With the help of these values, we calculate the euclidean distance for every points and assign them to appropriate clusters to which they are closest to. In this way, we complete the unsupervised algorithm of clustering process for any unlabelled data that is provided to us. In this particular endeavor, we assign the value “0” for a predicted cluster indicating a safe set of water-quality values for a given point, and a value of “1” for an unsafe set of water-quality values for a given point. Figure 4 summarizes this idea behind the K-means clustering method that chiefly works as an unsupervised learning algorithm.

Figure 4: Concept of Clustering analysis adopted as an unsupervised learning process 11

#### 4.2.4 Display of results & analysis of any given sample values set

In the final step, we display all the results and relevant plots for this big-data analysis task. Additionally, based on the labelled data and predicted classes for safe and unsafe water-quality standards, it will now be possible to find whether an arbitrary set of data values, that represent water-quality data for a particular region, would be classified as safe or unsafe as per this technique. This has also been shown in the results section.

## 5. Inference

### 5.1 Analysis of extracted data and statistical information

The first preliminary set of results were analyzed to get a general idea of how the water quality paramters vary for the system in question. For the purposes of data visualization, the processed data (viewed as a data-frame in python) was first analyzed to understand the four primary water-quality parameters that are being worked upon. The data-frames for the big data sets are shown below, along with the corresponding statistical information that were evalauted for such attributes. Figure 5 shows the preliminary results that were obtained for the data visualization and statistical processing tasks.

Figure 5: Displaying results of data visualization and statistical information for the water-quality parameters

In the above set of results, it should be worthwhile to note that temperature is measured in the celsius scale, specific conductance is measured in microsiemens per centimeter at 25 degree celsius, pH is measured in the usual standard range (between 0-14), and the level of dissolved oxygen is measured in milligrams per liter.

Next, the content of the dataset, after it is processed in the software architecture, is plotted. It displays the alteration of the values (expressed in scatter plots) of the four main water-quality parameters (viz. Temperature, Specific Conductance, pH, and Dissolved Oxygen) over the period of time that starts from November 1 to November 14 for the four years involving 2017, 2018, 2019, and 2020.

Figure 6 displays the scatter plot data for the “temperature” attribute in the year 2017.

Figure 6: Scatter plot for the water-quality parameter involving “Temperature” (2017)

Figure 7 displays the scatter plot data for the “specific conductance” attribute in the year 2017.

Figure 7: Scatter plot for the water-quality parameter involving “Specific Conductance” (2017)

Figure 8 displays the scatter plot data for the “pH” attribute in the year 2017.

Figure 8: Scatter plot for the water-quality parameter involving “pH” (2017)

Figure 9 displays the scatter plot data for the “dissolved oxygen” attribute in the year 2017.

Figure 9: Scatter plot for the water-quality parameter involving “Dissolved Oxygen” (2017)

Figure 10 displays the scatter plot data for the “temperature” attribute in the year 2018.

Figure 10: Scatter plot for the water-quality parameter involving “Temperature” (2018)

Figure 11 displays the scatter plot data for the “specific conductance” attribute in the year 2018.

Figure 11: Scatter plot for the water-quality parameter involving “Specific Conductance” (2018)

Figure 12 displays the scatter plot data for the “pH” attribute in the year 2018.

Figure 12: Scatter plot for the water-quality parameter involving “pH” (2018)

Figure 13 displays the scatter plot data for the “dissolved oxygen” attribute in the year 2018.

Figure 13: Scatter plot for the water-quality parameter involving “Dissolved Oxygen” (2018)

Figure 14 displays the scatter plot data for the “temperature” attribute in the year 2019.

Figure 14: Scatter plot for the water-quality parameter involving “Temperature” (2019)

Figure 15 displays the scatter plot data for the “specific conductance” attribute in the year 2019.

Figure 15: Scatter plot for the water-quality parameter involving “Specific Conductance” (2019)

Figure 16 displays the scatter plot data for the “pH” attribute in the year 2019.

Figure 16: Scatter plot for the water-quality parameter involving “pH” (2019)

Figure 17 displays the scatter plot data for the “dissolved oxygen” attribute in the year 2019.

Figure 17: Scatter plot for the water-quality parameter involving “Dissolved Oxygen” (2019)

Figure 18 displays the scatter plot data for the “temperature” attribute in the year 2020.

Figure 18: Scatter plot for the water-quality parameter involving “Temperature” (2020)

Figure 19 displays the scatter plot data for the “specific conductance” attribute in the year 2020.

Figure 19: Scatter plot for the water-quality parameter involving “Specific Conductance” (2020)

Figure 20 displays the scatter plot data for the “pH” attribute in the year 2020.

Figure 20: Scatter plot for the water-quality parameter involving “pH” (2020)

Figure 21 displays the scatter plot data for the “dissolved oxygen” attribute in the year 2020.

Figure 21: Scatter plot for the water-quality parameter involving “Dissolved Oxygen” (2020)

### 5.2 Centroids & Predicted Classes/Clusters

Figure 22 shows the final centroid values for the safe and unsafe water-quality standards for the year 2017. Furthermore, Figure 23 shows the predicted classes for the safe and unsafe clusters, which were calculated based on the results of the centroid values, for the same year of 2017.

Figure 22: Safe and unsafe centroid values for the year 2017.

Figure 23: Predicted classes for safe (“0”) and unsafe (“1”) clusters for the year 2017.

### 5.3 Heatmaps for the years - 2017, 2018, 2019, and 2020

For the all the four years, heatmaps were plotted to get more information about the trend of the data. Chiefly, the heatmaps give us an empirical form of ideology relating to the degree of correlation between the different water-quality parameters in this data analysis task.

Figure 24 visualizes the heat-map and shows the relationships between the various aquatic parameters for the year of 2017.

Figure 24: Heatmap for the water-quality parameters (Year - 2017)

Figure 25 visualizes the heat-map and shows the relationships between the various aquatic parameters for the year of 2018.

Figure 25: Heatmap for the water-quality parameters (Year - 2018)

Figure 26 visualizes the heat-map and shows the relationships between the various aquatic parameters for the year of 2019.

Figure 26: Heatmap for the water-quality parameters (Year - 2019)

Figure 27 visualizes the heat-map and shows the relationships between the various aquatic parameters for the year of 2020.

Figure 27: Heatmap for the water-quality parameters (Year - 2020)

### 5.4 Analysis of sample set of values

In this portion, we test the unsupervised learning mechanism on actual sample sets of water-quality values. For this purpose, we feed the sample values to the coding framework and based on the centroids calculated for the past four years, it is able to identify whether the given water sample belong in the safe or unsafe category. Further, if it belongs in the unsafe category, the system can further inform us the degree of criticality of the water-quality degradation. This is carried out by evaluating how many water quality parametric values are beyond the normal range. Based on this analysis, a critical nature is then displayed as an output result. As an example, a critical category of “2” would signify two of the water-quality parameters were beyond the normal range of values, while the others were normal. Needless to say, higher is the critical category level, the more degraded the water source is.

Figure 28 displays the results obtained for specific sample values of water quality parameters.

Figure 28: Analysis of water sample values by ascertaining the degree of degradation based on critical level

### 5.5 Benchmark information

Finally, the benchmark analysis results are shown below for the respective tasks carried out by the coding platform. Some important facts that should be kept in mind in this regard are as follows:

• The benchmark analysis was carried out using the cloudmesh benchmark procedure in Python (executed in Google Colab).
• A sleep-time of “5” was selected as the standard for the benchmark analysis and it has been adopted consistently for all the other benchmark calculations.
• The time values for the different benchmark results were not rounded off in this case as it was both important and interesting to know the minute differences between the different kinds of tasks that were carried out in this regard. It should be noted that the final benchmark value is exceptionally high since this step involves the part where the human user inputs the sample values for ascertaining the safety standard for a water source.

Figure 29 shows the benchmark results that were obtained for specific sections in the coding framework.

Figure 29: Benchmark results for all the tasks that were carried out for this big data analysis task using Google Colab

## 6. Conclusion

This research endeavor implements a big data analysis framework for analyzing toxicity of aquatic systems. It is an attempt where hardware and software meet together to give reliable results, and on the basis of which we are able to design an elaborate mechanism that can analyze other sample water sources and provide decisions regarding its degradation. Although the endeavor carried out in this example might not involve a plethora of high-end applications or intricate logic frameworks, it still provides a very decent foundational approach for carrying out toxicological analysis for water sources. Most importantly, it should be noted that the results obtained for sample values correspond to what we would normally expect for polluted and non-polluted sources of water. For instance, it was found that water sources with high values of specific conductance were categorized in the “unsafe” category which is to be expected since a high conductance value typically signifies the presence of dissolved salts and ions in the water source, which normally indicates that effluents or agricultural run-off might have made its way to the water source. Additionally, it was also found out that water samples with high values of dissolved oxygen levels were categorized in the “safe” category which is certainly true based on biological postulates.

Of course, a more diverse array of data coupled with more scientific and enhanced ASV systems would have probably provided us with even better results. But as indicated earlier, this research endeavor provides a pragmatic foundational approach to conducting this kind of big data analytical work. We can build up on the logic presented in this endeavor to come up with even more advanced and robust version of toxicological analysis tasks.

## 7. Acknowledgements

The author would like to thank Dr. Geoffrey Fox, Dr. Gregor von Laszewski, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article.

## 8. References

1. Valada A., Velagapudi P., Kannan B., Tomaszewski C., Kantor G., Scerri P. (2014) Development of a Low Cost Multi-Robot Autonomous Marine Surface Platform. In: Yoshida K., Tadokoro S. (eds) Field and Service Robotics. Springer Tracts in Advanced Robotics, vol 92. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40686-7_43 ↩︎

2. M. Ludvigsen, J. Berge, M. Geoffroy, J. H. Cohen, P. R. De La Torre, S. M. Nornes, H. Singh, A. J. Sørensen, M. Daase, G. Johnsen, Use of an Autonomous Surface Vehicle reveals small-scale diel vertical migrations of zooplankton and susceptibility to light pollution under low solar irradiance. Sci. Adv. 4, eaap9887 (2018). https://advances.sciencemag.org/content/4/1/eaap9887/tab-pdf ↩︎

3. Verfuss U., et al., (2019, March). A review of unmanned vehicles for the detection and monitoring of marine fauna. Marine Pollution Bulletin, Volume 140, Pages 17-29. Retrieved from https://doi.org/10.1016/j.marpolbul.2019.01.009 ↩︎

4. USGS Water Quality Data, Accessed: Nov. 2020, https://waterdata.usgs.gov/nwis/qw ↩︎

5. EPA Water Quality Data, Accessed Nov. 2020, https://www.epa.gov/waterdata/water-quality-data-download ↩︎

6. Read, E. K., Carr, L., De Cicco, L., Dugan, H. A., Hanson, P. C., Hart, J. A., Kreft, J., Read, J. S., and Winslow, L. A. (2017), Water quality data for national‐scale aquatic research: The Water Quality Portal, Water Resour. Res., 53, 1735– 1745, doi:10.1002/2016WR019993. https://agupubs.onlinelibrary.wiley.com/doi/epdf/10.1002/2016WR019993 ↩︎

7. Atlas Scientific team. Atlas Scientific Environmental Robotics. Retrieved from https://atlas-scientific.com/ ↩︎

8. Sinha, Saptarshi. (2020) A Big-data analysis framework for Toxicological Study. Retrieved from https://github.com/cybertraining-dsc/fa20-523-312/blob/main/project/code/toxicologyASV.ipynb ↩︎

10. Tan, Steinback, Kumar (2004, April). Introduction to Data Mining, Lecture Notes for Chapter 8, Page 2. Accessed: Nov. 2020, https://www-users.cs.umn.edu/~kumar001/dmbook/dmslides/chap8_basic_cluster_analysis.pdf ↩︎

11. Alan J. (2019, November). K-means: A Complete Introduction. Retrieved from https://towardsdatascience.com/k-means-a-complete-introduction-1702af9cd8c ↩︎

# 18 - How Big Data has Affected Statistics in Baseball

The purpose of this report is to highlight how the inception of big data in baseball has changed the way baseball is played and how it affects the choices managers make before, during, and after a game. It was found that big data analytics can allow baseball teams to make more sound and intelligent decisions when making calls during games and signing contracts with free agent and rookie players. The significance of this project and what was found was that teams that adopt the moneyball mentality would be able to perform at much higher levels than before with a much lower budget than other teams. The main conclusion from the report was that the use of data analytics in baseball is a fairly new idea, but if implemented on a larger scale than only a couple of teams, it could greatly change the way baseball is played from a managerial standpoint.

Status: final, Type: report

Holden Hunt, holdhunt@iu.edu, fa20-523-328, Edit

Keywords: sports, data analysis, baseball, performance

## 1. Introduction

Whenever people talk about sports, they will always talk about some kind of statistic to show that their team is performing well, or certain player(s) are playing incredibly well. This is due to the fact that statistics has become extremely important in sports, especially in rating an entity’s performance. While essentially every sport has adopted statistics to quantify performance, baseball is the most well-known sport to use it, due to the obscene number of stats that are tracked for each player and team, as well as the sport that uses stats the most in how they play the game. The MLB publishes about 85 different statistics for individual players, including the stats tracked of the teams, there is likely to be about double the amount if tracked statistics for the sport of baseball. The way that all the statistics are calculated, is, of course, by analyzing big data found from the players and teams. This report will mainly talk about the history of big data and data analytics in baseball, what the data is tracking, what we can learn from the data, and how the data is used.

## 2. Dataset

The dataset that will be analyzed in this report will be the Lahmen Sabermetrics dataset. This dataset is a large dataset curated by Sean Lahmen, which contains baseball data starting from the year 1871, which is when the Major League Baseball association was founded 1. The dataset contains data for many different types of statistics, including batting stats, fielding stats, pitching stats, awards stats, player salaries, and games they played in. The data for this dataset has statistics from the last season that occurred (2020 season), but the data that could be accessed for this report is from 1871-2015. I plan to use this dataset for discussing later how the data in sets like this is used for statistical analysis in baseball and how teams can use this to their advantage.

## 3. Background

The concept of baseball has been a sport that has existed for centuries, but the actual sport called baseball started in early to mid 1800s. Baseball became popularized in the United States in the 1850s, where a baseball craze hit the New York area, and baseball quickly was named a national pastime. The first professional baseball team was the Cincinnati Red Stockings, which was established in 1869, and the first professional league was established in 1871, and was called the National Association of Professional Base Ball Players. This league only lasted a few years and was replaced by a more formally structured league called the National League in 1876, and the American League was established in 1901 from the Western League. A vast majority of the modern rules of baseball were in place by 1893, and the last major change was instituted in 1901 where foul balls are counted as strikes. The World Series was inaugurated in the fall of 1903, where the champion of the National League would play against the champion of the American League. During this time, there were many problems with the league, such as strikes due to poor and unequal pay and the discrimination of African Americans. The era in the time of the early 1900s was the first era of baseball, and the second era of baseball started in the 1920s where a plethora of changes to the game caused the sport to move from more of a pitcher’s game to a hitter’s game, which was emphasized by the success of the first power hitter in professional baseball, Babe Ruth. In the 1960s, baseball was losing revenue due to the rising popularity of football, the league had to make changes to combat this drop in revenue and popularity. The changes lead to the salaries of the players getting increased and also an increase in attendance to games, which means increased revenue 2.

Big data is able to be used in baseball through the use of statistics, which has become a major part in how the sport is played. The use of statistics in baseball has become known as sabermetrics, which can be used by teams to make calls for the game based on numbers. The idea of using sabermetrics started from the book called Moneyball. The book was published in 2003, and was about the Oakland Athletics baseball team and how they were able to use sabermetric analysis to compete on equal grounds with teams that had much more money and good players than the A’s team 3. This book has had a major impact on the way baseball is played today for several teams. For example, teams like the New York Yankees, the St. Louis Cardinals, and the Boston Red Sox have hired full-time sabermetric analysts in attempts to gain an edge over other teams by using these sabermetrics to influence their decisions. Also, the Tampa Bay Rays were able to make the moneyball idea a reality by making it to the 2020 World Series with a much lower budget team and using lots of sabermetrics in their decision-making 3. This moneyball strategy is not the perfect strategy however, because the Rays lost the World Series, and the turning point for their loss could be pointed to a decision they made based on what the analytics said they should do, which ended up being the wrong choice, which lost them a crucial game in the series.

## 4. Big Data in Baseball

The data that is being analyzed for the statistics in baseball are datasets similar to the one this report is looking into, the Lahmen Sabermetrics dataset. Since this dataset contains a vast amount of data for each player and many different tables containing many different kinds of data, many kinds of statistics are able to be tracked for each player. With this large amount of statistics, teams are able to look to numbers and analytics in order to make the best decision on the actions to make during a game. Teams can also use analytic technology to predict the performance of a player based on their previous accomplishments and comparing that to similar players and situations in the past 4.

This kind of analysis can be used to gauge the potential performance of a free-agent or rookie player, as well as deciding what player should be in the starting lineup for the upcoming game(s). Since these sabermetric analyses are able to predict the performance of a player in the coming years, they are able to tell if a contract made by a team is likely to not be a smart deal since they tend to make long-term deals for lots of money, even though it is likely that the player will not continue to perform at the same level they are currently at for the entirety of their contract. This is normally due to the inability to play at an incredibly high level consistently for a long period of time and regression of performance from age, which is shown to start occuring at around 32 years old 4. This simply means that large contracts have a trend of being a large loss of money in the long run shown from analysis of similar types of contracts in the past. This kind of analysis also allows for a players ranking to be deciding by more areas than before. For example, a player’s offensive capabilities can be shown by looking at more categories than the amount of home runs hit and batting average, they can also look at baserunning skill, slugging percentage, and overall baseball intelligence. This ability of looking at a player’s overall capabilities in a more analytical manner allows teams to not throw all their budget into one or two top prospected players, but can spread their money across several talented players to have a good and balanced team. Another reason why deciding to not spend lots of money on a long contract for a top prospected player is that the analysis shows that players have started to have shorter lengths of time where they are able to perform at their best, even though other sports have seen the opposite in recent years . However, young players have been performing at a much higher level in recent years and they have had younger players moving from the minor to the major league much faster than before 4.

The data that is presented in the Lahman Sabermetrics database and other similar databases is able to allow analysts to compare data and statistics of one team with any/every other team with relative ease and in an easy to understand way. For example, the comparative analysis Figure 1 below shows that the payroll of teams and their winning percentage, analysts are able to learn that the New York Yankees have a much higher payroll than all other teams and they have a very good win rate, but there are other teams that do have very high payrolls and have the same good rate rate. Also, Figure 1 shows that there are other teams that have higher payrolls than average, but have a very bad win rate compared to all other teams, including teams that have a much lower payroll 5. This kind of analysis shows us that spending lots of money does not guarantee a strong season, which can strengthen the idea of the moneyball strategy coined earlier where teams attempt to waste less money by spreading budgets across several players other than spending most of the budget on only one or two players.

Figure 1: Comparative Analysis of Payroll to Win Percentage 5

This is only one way that the Lahman Sabermetrics dataset can be used, but there are many more ways this data can be used to make league wide analyses and compare a certain team to others. This can be used by teams to possibly learn what they might be doing wrong if they feel as though they should be performing better.

## 5. Conclusion

This report discusses the history of baseball and how big data analytics came to be prevalent in the sport, as well as how big data is used in baseball and what can be learned from the use of it so far. Big data is able to be used to make decisions that could greatly benefit a team from saving money on a contract with a player to making a choice during a game. Big data analytics use in baseball is a fairly new occurrence, but due to the advantages a team can gain from using analytics, it is likely that use of it will increase soon in the future.

## 6. Acknowledgements

The author would like to thank Dr. Gregor Von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article.

## 7. References

1. Lahman, Sean. “Lahman’s Baseball Database - Dataset by Bgadoci.” Data.world, 5 Oct. 2016. https://data.world/bgadoci/lahmans-baseball-database ↩︎

2. Wikipedia. “History of Baseball.” Wikipedia, Wikimedia Foundation, 27 Oct. 2020. https://en.wikipedia.org/wiki/History_of_baseball ↩︎

3. Wikipedia. “Moneyball.” Wikipedia, Wikimedia Foundation, 28 Oct. 2020. https://en.wikipedia.org/wiki/Moneyball ↩︎

4. Wharton University of Pennsylvania. “Analytics in Baseball: How More Data Is Changing the Game.” Knowledge@Wharton, 21 Feb. 2019. https://knowledge.wharton.upenn.edu/article/analytics-in-baseball/ ↩︎

5. Tibau, Marcelo. “Exploratory data analysis and baseball.” Exploratory Data Analysis and Baseball, 3 Jan. 2017. https://rstudio-pubs-static.s3.amazonaws.com/239462_de94dc54e71f45718aa3a03fc0bcd432.html ↩︎

# 19 - Predictive Model For Pitches Thrown By Major League Baseball Pitchers

The topic of this review is how big data analysis is used in a predictive model for classifying what pitches are going to be thrown next. Baseball is a pitcher’s game, as they can control the tempo. Pitchers have to decide what type of pitch they want to throw to the batter based on how their statistics compare to that of the batters. They need to know what the batter struggles to hit against, and where in the strike zone they struggle the most. With the introduction of technology into sports, data scientists are sliding headfirst into Major League Baseball. And with the introduction of Statcast in 2015, The MLB has been looking at different ways to use technology in the game. In 2020 alone, the MLB introduce several different types of technologies to keep the fans engaged with the games while not being able to attend them [^3]. In this paper, we will be exploring a predictive model to determine pitches thrown by each pitcher in the MLB. We will be reviewing several predictive models to understand how this can be done with the use of big data.

Status: final, Type: report

Bryce Wieczorek, fa20-523-343, Edit

Keywords: Pitch type, release speed, release spin, baseball, pitchers, MLB

## 1. Introduction

Big data is everywhere and has been used in sports for years. Especially in Major League Baseball, where big data has been analyzed for various things. For example, in the movie Moneyball (which is based on true events), the general manager of the Oakland A’s uses analytics to find players that will win them games that other teams overlook. They used many different statistics; batting average, batting average with runners on base, fielding percentage, etc., to find these players. Not only do teams use big data to find the right players for them, but the MLB also uses it for their Statcast technology. Statcast was implemented in 2015 and is an automated tool that analyzes data to deliver accurate statistics in real time 1. This technology tracks many different aspects of the game, including many different pitching statistics.

The pitching statistics that are recording is actually through the PITCHf/x system 2. This system was implemented nine years before Statcast. The system is made up of three cameras that are located throughout the stadium that measure the speed, spin, and trajectory of the baseball as it is thrown. Statcast adopted this technology as the MLB was looking at ways to track more than just pitching statistics.

And while there is a lot of data about pitching, we will only be focusing on release speed, release spin, pitch type, and the pitcher for our model. And while there is a lot of data about these different pitches and the pitchers, it can still be difficult to predict exactly what type of pitch is thrown according to the pitcher. For instance, the difference between different types of fastballs, a four-seam and two-seam, can be different by less than a mile per hour and forty rotations per minute (as thrown by Aaron Nola of the Philadelphia Phillies). In our model that we try to create, we are use these variables, along with the last three pitches thrown for each pitch type, to try to predict what pitch was just thrown.

## 2. Background and Previous Work

Baseball players are always looking for a way to have an edge over their opponents. Whether that is through training, recovery, use of supplements, and by watching film or statistics to find what your opponent’s struggle with.

There are several existing models that have been made to predict pitch type. We plan on examining them to determine how this can be done. We are looking for a general blueprint of which multiple models have used and/or have followed. This also allowed us to see what type of datasets would best be used for an analysis of this sort.

### 2.1 Harvard College Model

The first prediction model we studied was done through Harvard University 3. This prediction model was very complex and explored many different types of analysis. They first investigated doing a binary classification model. However, they ultimately decided against this because the label of pitches did not accurately represent what they actual were. This led them into multi-class predictive models in which they used. The other types of analysis were boosted trees, classification trees, random forests, linear discriminant analysis, and vector machines. The main point of this report was to see if they could replicate previous work done, but they ultimately failed at each one. This resulted to them in trying to determine if one can correctly predict pitch type selection. There research showed that machine learning cannot correctly predict pitch type due to the many different aspects that need to be analyzed. They hoped that their work could be used as a reference for future work done.

We found this article to be very useful in our work since it stands as a reference for future work. We found this to be helpful in our review of predictive models for pitch types due to the different models they tried to replicate.

### 2.2 North Carolina State University Model

The prediction model created by North Carolina State University 4 was created to predict the next pitch that was going to be thrown. Their model compared old data to the current live data of a game, they are using many different types of data to predict the next type of pitch. They used a very large database that consisted of 287 MLB pitchers who had an average of 81 different pitch features (pitch type, ball rotation, speed, etc.). They are trying to determine if the next pitch will be a fastball or an off-speed pitch. Like the previously mentioned model, this model is also using trees to give a classification output. The parent node is the first type of pitch thrown, which then leads to if the pitch was a strike or a ball, then it uses this information and compares it to their dataset to predict the next set of nodes.

We found this to be very useful for our review since their model worked correctly and it used current data from the game. With Statcast today, we find this to be very important since all of this information is recorded and logged as each pitch is thrown.

## 3. Dataset

Major League Baseball first started using Statcast in 2015, after a successful trial run in 2014. Statcast 1 provided the MLB with a way to collect and analyze huge amounts of data in real time. Some of the data it collects, which is used in many different models, includes the pitcher’s name, pitch type, release spin, release speed, and the amount of times each pitch was thrown.

Since we do not need all the data Statcast collects, we found a dataset with the datatypes listed above 5. We chose this dataset because it contains a significant amount of data that can be used for a prediction model. The dataset contains a lot of information about not only the pitchers, but about the batters. The database shows what each batter has done against every type of pitch they have faced, whether they hit the ball, fouled it, swung and missed, etc. We felt as if this data is very important as well since coaches and pitchers will want to know how the batter reacts to different pitches and the different locations of each pitch. This would be a vital part to any pitch prediction model as coaches signal to the pitchers what types of pitches to throw according the strengths and weaknesses of the batters.

## 4. Search and Analysis

After conducting our background research, we determined that in order to build an accurate model to predict which pitch was just thrown, you would need to use a similarity analysis tool model with classification trees. This model allows us to take old data and compare it to the live data of the pitch. By comparing the live data to the older data, it will give the most accurate results due to how the players are playing during that game. A batter may struggle with a certain type of pitch that day and not another.

The similarity analysis tool model would best be used to find instances of the batter in certain pitch counts. For instance, this tool would analyze the different amount of times the batter has faced a count 1 – 2 (1 ball and two strikes), and then find what pitch they saw next and what the outcome was. This information would then be filtered to see what pitch will have the best outcome for the pitcher, it would tell them what pitch to throw and where to throw it. This is where the classification trees would then be used. The trees would classify the recommended pitches to throw and their location to the types of pitches that they throw.

## 5. Limitations

There are several limitations to predictive models for pitch types 6. The biggest is pitch classification itself. Pitchers ultimately pick what types of pitches they throw and as some pitches behave like others, different pitchers can classify the same pitch type as something else. For example, the traditional curve ball and a knuckle curve react the same. Other classification challenges that can be faced can be related to how the pitcher is playing. As the game goes on, pitcher’s velocities traditionally decrease due to their arms tiring, this could result in misidentification between different types of fastballs. The last limitation that could affect the pitch classification could be the PITCHf/x system malfunctioning or interference with the system.

## 6. Conclusion

This report discusses the use of predictive model in baseball and how they can be used to predict the next pitch thrown. By reviewing previous models down by Harvard 3 and by North Carolina State University 4, we determined that the best way to build a predictive model would be by using a similarity analysis tool model with classification trees. With the Carolina State University’s successful model, we also concluded that there were limitations to it, as different pitchers classify their pitches differently and telling different pitches apart due to the similar behaviors. The use of predictive models in baseball could change the game as it gives the pitchers an advantage over the batters. We hope that this report can show how big analytics can affect the game of baseball.

## 7. Acknowledgements

The author would like to thank Dr. Gregor Von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article.

## 8 References

1. [2020. About Statcast. MLB Advanced Media. http://m.mlb.com/glossary/statcast ↩︎

2. Nathan, Alan M. Tracking Baseballs Using Video Technology: The PITCHf/x, HITf/x, and FIELDf/x Systems. http://baseball.physics.illinois.edu/pitchtracker.html ↩︎

3. Plunkett, Ryan. 2019. Pitch Type Prediction in Major League Baseball. Bachelor’s thesis, Harvard College. https://dash.harvard.edu/handle/1/37364634 ↩︎

4. Sidle, Glenn. 2017. Using Multi-Class Classification Methods to Predict Baseball Pitch Types. North Carolina State University. https://projects.ncsu.edu/crsc/reports/ftp/pdf/crsc-tr17-10.pdf ↩︎

5. Schale, Paul. 2020. MLB Pitch Data 2015-2018. Kaggle https://www.kaggle.com/pschale/mlb-pitch-data-20152018 ↩︎

6. Sharpe, Sam. 2020. MLB Pitch Classification. Medium. https://technology.mlblogs.com/mlb-pitch-classification-64a1e32ee079 ↩︎

# 20 - Big Data Analytics in the National Basketball Association

The National Basketball Association and the deciding factors in understanding how the game should be played in terms of coaching styles, positions of players, and understanding the efficiencies of shooting certain shots is something that is prevalent in why analytics is used. Analytics is a topic space within basketball that has been growing and emerging as something that can make a big difference in the outcomes of gameplay. With the small analytic departments that have been incorporated within teams, results have already started coming in with the teams that use the analytics showing more advantages and dominance over opponents who don’t. We will analyze positions on the court of players and how big data and analytics can further take those positions and their game statistics and transform them into useful strategies against opponents.

Status: final, Type: Report

Igue Khaleel, fa20-523-317, Edit

Keywords: basketball, sports, team, analytics , statistics, positions

## 1. Introduction

The National Basketball Association was first created in the year of 1946 with the name of BAA (Basketball Association of America). However, in 1949 the name was changed to the NBA with a total of 17 teams1. As time progressed the league started picking up steam and more and more teams began to join and it wasn’t until the 90’s that we see the total amount of NBA teams be produced.This league consists of professional basketball players from both national and international spaces of the world. As there are 16 roster spots per team and 32 teams in total, only the very most athletic, skillfull, and colossal individuals are chosen to represent this league. Now, knowing the special skillsets of individual players, the founder of basketball, James Naismith, created positions to maximize these individual players for team success. On the court there are 5 positions : point guard, shooting guard, small forward, power forward, and center1.

#### 1.1 Point Guard

Starting with the point guard, generally these individuals are the smallest players on the court with an average height around 6'2 tall. With what these player lack in height they make up for in skillset in terms of quickness, passing, agility, ball handling, and natural shooting ability. Point guards are generally looked at to be the floor general of the team and take up the job of setting up the coach’s gameplan and teamates.

#### 1.2 Shooting Guard

The shooting guard is a generally a slightly taller player than the point guard and like the name suggests they are generally the player known for their indiviualisitc shooting prowess whehter if it is beyond the 3 point line or in the mid-range. Shooting guards are known to be positioned in the perimeter(outside the arc) as a partner to the point guard. On occasion, the role of the shooting guard is expanded in the case that the point guard is pressured so the role may be for the shooting guard to be better at defense or a player that can help in the playmaking duties of the point guard.

#### 1.3 Small Forward

The small forward is where things change in terms of roles when comparing to the guards of that were previously mentioned above. They can be considered hybrids in the sense that they can both operate on the perimeter like guards and can go down low like power forwards and centers which will be discussed later. Noramlly with wings(another name for small forward) with an average height around 6'7, there are a plethora of responsibilites in order to be considered effective. The reason for this is because generally speaking, small forwards are the most athletic player on the court. They basically have most the agility and ball handling of guards and most of the physicallity and power of power forwards/centers. Understandibly, there are tasked with big defensive assignments and are usually looked at to be a decent to above-average producer on offense.

#### 1.4 Power Forward

The power forward position is where the physicallity of players matters more. Generally these players are around 6'9 to 6'11 and are heavier than most players. Becuase of this they give up speed and shooting which is why they operate around the free throw line and basket. They are looked at to protect the interior with the center from smaller players and small forwards driving in the lane to the basket.

#### 1.5 Center

The center is considered mostly the point guard of the defense of the team. They are generally the anchor that protects the rim primarily and takes up defensive assignemtns and calls. Without a competent center, a team can see their defense take a hit. Along with defense, centers are good options to go to when the team has offensive lulls since the easiest shot to make in the nba is a hook shot or layup and the center operates 3 feet from the basket. Centers generally range from 6'11 to as high as 7'6 in height. On rare occasions you can see 6'9 to 6'10 centers take the court and that is generally because of play-style or above-average defense.

## 2. Era of Analytics

The National Basketball Association continues to not only grow in the sense of continued personnel but an increase of cap(cash flow) amongst teams as well. Within the scope of this prosperous cap situation that the NBA has accumulated over the years through merchandising, tickets, and tv deals, teams have found flexibility in the ability to create the optimal situation for whatever version of basketball the General Manager sees fit for the vision of the team. In terms of better understanding how this can be accomplished it is best to understand what spurred this action of finding styles to lead to the best team success.

That particular action is players such as Stephen Curry, a 6-3 NBA point guard, that led to the change in utilizing analytics. The year Steph Curry broke through as an MVP, his team; the Golden State Warriors broke the former Chicago Bulls record of 72-9. This in big part was due to Steph Curry breaking the 3pt record as well as Golden State adopting the small ball philosophy. This particular year gave birth to the era of analytics because of how dominate those two approaches were.

#### 3.1 The Houston Rockets

This has then inspired teams to introduce analytics departments to measure ways to beat the game and exploit mismatches in defensive schemes and height within players. An example of a team that spearheaded this change in strategy is the Houston Rockets. Their GM(General Manager) Daryl Morey was a MIT graduate who advocated for a team that primarily shoots three point shots as their main forte2. The science behind this concept was that 33% shooting from the three point line measure to 50% from the two point line respectively. This was in the works in the year of 2017 just two years removed from Steph Curry’s three point dominance in his MVP season. In terms of numbers representing the change, the 2018 Houston Rockets attempted approximately 82% of their shot attempts around the three point line and the restricted area(the circle around ~5 feet in diameter surrounding the rim)2. The next best team in that department was eleven percent down at 71% in terms of attempts. In this year, the Rockets won their conference at a record of 65 wins - 17 losses as well as break the NBA record in three pointers attemted and made.

#### 3.2 Tools

In order to evaluate these players and acquire the data necessary for analyization, the NBA partnered with a company name STATS to provide the necessary tools for data collection. STATS worked with the NBA by installing six cameras in each basketball arena in order to, “track player and referee movements at 25 frames per second to get the most analytical data for teams and the NBA to analyze3.” This is very effective in terms of showing the play-by-play moves of players in a system as well as even how referees move. With players, these tools can serve as a chess board where the coach is able to watch pieces move and can determine where certain positions could be optimized to its maximum efficiency. This allows for film sessions to be more productive and helpful for players to better see where they fit and even improve in. In terms of referees, throughout sports it is known that referees have cost some games due to missed calls or questionable decisions. This technology can help in terms of understanding: 1) how a specific referee calls certain fouls and 2) if there seems to be a number count of fouls depending on what team the referee is reffing historically. Understanding both the tendencies of players and refs alike gives coaching staffs a direction to go in when preparing for opponents on a game-by-game basis3.

#### 3.3 Draft Philosophy

Another facet of the game that is likewise impacted by the tools and techniques described in 3.2 is the NBA draft. The NBA draft consists a total of 60 players selected in two rounds combined. The general consensus before this analytics era was to choose the best player avaible most of the time. Teams back then usually drafted big men(e.g. forwards and centers) because it was considered a safe pick and known to help your team better in more areas. As time passed, we’ve seen a shift to more guards that are drafted instead to fit the narrative the analytics presents to teams regarding the best path to success. For example, earlier Stephen Curry was mentioned to be one of the foundational reasons that the analytics movement was largely adapted. The year Curry got drafted, the #1 pick in the draft was Blake Griffin who at the time was considered the best Power Forward in the draft while Curry was drafted at 8th overall and even James Harden of the Houston Rockets was drafted 3rd4. As we fast forward to 2020, both Curry and Harden are looked at as the two best players from their draft class with Curry revolutionizing the three point shot and Harden being the ultimate analytics player with his ability to manipulate the defense and draw free throws from fouls like no player has ever done. As years passed, there has been a shift in drafting players with the mindset of that particular players' potential over fit in the sense that teams look for the best available player that fits the teams system the most efficiently4. An example is the upcoming 2020 NBA draft where there is a question of who will become the #1 and 2 pick respectively. The Golden State Warriors have the 2nd pick in the draft because of a year of injuries for all of their star players. So, they typically aren’t looking for a player like most losing teams are doing in the draft. In the eyes of many scouts, some view a player like Lamelo Ball, a 6'7 point guard as the best player or at least second best and others see players like Anthony Edwards(SG), James Wiseman(C) and Deni Advija(SG) as potentially better fits and safer picks. However, for the warriors rumors over social media from notable sources have shown that they aren’t interested in drafting Lamelo Ball as he is a point guard and they have Steph Curry already. They instead prefer to choose a Small Forward or Center that can help their defensive potential and style of play. Years ago, that may not have been the case as the best player available would usually come off the board and the team would figure it out after that. Thus, this shows how analytics has not only persuaded teams to change their play styles and system but also the players that come with it whether they are veterans or incoming rookies.

## 4. Background Work and Advanced Analytics in Basketball

Considering how the impacts of how implemented analytics has aided the NBA atmosphere as mentioned above, we look to learn technologies and work that help bring this about. This begins with camera systems that have been implemented by a company named SportVU who’ve helped bring about change in NBA arenas since 2013 that track player and basketball movement across the arenas5. This system goes further in the analysis of collecting data in the context of individual player statistics being captured as well as their positioning on the court and speed in particular instances.

Thus by capturing the basic statistics such as points, assists, rebounds, steals and blocks. Analytical tools such as Player Efficiency Ratings and Defensive Metrics were better used to analyze players and their individualistic impacts on the basketball court. The impacts of these analytical/computational metrics are represented in many organizations abilities to understand the scope of player’s salaries and positioning on the court, who to draft, and helps sports analyst on TV shows such as FS1 and ESPN to easily break down the game of Basketball.

This is where algorithims come to play as an algorithim needs a dataset from which it can train itself and develop statistical patterns to help in predictive analysis and representation for coaches and teams to utilize respectively. An example of this is show through students named Panna Felsen and Lucy from the University of California Berkely, who are developing a software name Bhostgusters that helped analyze the body positions of players and further the response and movements of a team to certain plays run by the opposition5. The end goal of this for coaches to be able to draw up a play on a tablet and see potential conflicts, results, and how opponents may counteract that particular play.

Other technologies that are being developed and implemented are things like, CourtVision, which is technology that shows the statistics of a player making a shot based on that players' past statistics and position on the court. As the player is moving through the court the numbers change to reflect his efficiency on certain areas on the court based on this. As stated by Marcus Woo, the author of this article, these technologies aren’t meant to replace the systems in place but instead are there the help in efficiency and effectiveness5.

## 5. Algorithims associated with NBA

When it comes to the variety of algorithims used in the National Basketball Association, we will be analyzing the range of algorithims discussed through articles and papers on google scholar. We looked at a total of five algorithims that were commonolu shown to be used of the most searches when it came to predictive and learning analysis within NBA analytics departements and outside agencies. The algorithims as presented are: K-means, Artificial Neural Networks, Linear Regression, Logistic Regression, and Support Vector Machines6. Linear Progression was by far the most written on topic within the five algorithims listed above with a total of 11,000 searches. It is followed by the Support Vector Machines with 5,240, Logistic Regression with 4,500, Artificial Neural Networks with 4,300, and K-Means with 1,590 search results(*all results via google scholar search bar).

#### 5.1 K-Means

The first algorithim we’ll look at is K-Means which is classified as generally the “clustering algorithim” which takes the form of initializing a single point of k or the mean and organizing the data towards that particular mean6. This is then repeated over and over until the appropriate results are found and compiled. Now as National Basketball Association statistics are inserted this can be used to cluster players together than fit the criteria on certain outcomes of points, rebounds, assists, and blocks.

#### 5.2 Linear Regression

Linear Regression, which is very commonly used in machine learning is very effective as a predictor tool. It works by forming “regression coefficients” that stems from pitting together independent variables which help in predictions within a game6. So, throught the input and output variables taht are presented predictive measurements can be performed to highlight potential productivety. An example is the “Box-Plus-Minus”. This was created to show a basketball player’s overall court production and effect through their statistics, what position they play on the court and the wins and losses that team incurs because of this7. This was built through linear regression and shows through charts based on statistics how productive a player is or potentially can be given the system and oppurtunities.

#### 5.3 Logistic Regression

Similarly to Linear Regression, Logistic Regression shares a lot of features in terms of the formula used for prediction except it utilizes a sigmoid as opposed to a linear function when performing calculations. Weight values are the main form of predictions in whatever form of scenario or situation in which that analyst wants to produce6. An example of this is shown through a logistical regression analysis performed by Oklahoma State University on clutch and non-clutch shots by players in the National Basketball Association. The premise of this is taking the data of an individual player based on their shooting percentages in spots on the floor relative to the distance of the defender on them and using that to figure out the potential of a player making a shot in the clutch(universally known as the last two minutes in a close game)8. This then shows how a predictive algorithm can be utilized not only based on solely percentages and efficiencies but also with the inclusion of situation on a basketball floor.

#### 5.4 Support Vector Machines

Support Vector Machines are considered to be a very formidable tool when it comes to measuring classification issues. This modeled machine creates a decision-making tree that helps in the predictions of basketball games and thus can help coaches form strategies and gameplans around what the model predicts can happen. Additional advantages that come with this tool is its ability to operate in high dimensions, the ability to identify kernels, and its memory efficiency9. The minor issue with this machine is the lack of rule generation but as it is more of an emerging tool overtime this is something that is relatively fixable10. The advantages

#### 5.5 Artificial Neural Networks

With Artificial Neural Networks the use of the Multi-Layer Perceptron is prevalent and it is highlighted by the vertices of a group in correlation to input varables and comes out with the output9. This tool according the Beckler is also considered to be, “an adaptive system that changes its structure based on external and internal information flows during the network training phase”6. With this, the Artificial Neural Network is considered to be one of the most accurate predictive tools when it comes to basketball and can predict patterns as more data is inputed9.

## 6. Conclusion

As time progresses, we will continue to see the use of analytics as well as the expanision of analytics departments in not only the National Basketball Association but other professional sports as well. The impacts of analytics have been highlighted through recent years as mentioned above with the change to styles of play, and the way coaches approach gameplans before each respective game is played. As Adam Silver, the commissioner of the National Basketball Association stated, “Analytics have become front and center with precisely when players are rested, how many minutes they get, who they’re matched up against11.” Through this, Silver explains not only to technical aspect of basketball that analytics supports but the physical aspect which can aid in preventing things like player injuries and rest. Understandibly, this highlights how analytics can help the league now and in the future; especially when more sophisticated machine learning tools and algorithims are produced for this purpose.

## 7. Acknowledgment

The author would like to thank Dr. Gregor von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article.

## 8. References

1. Online, N., 2020. NBA History. [online] Nbahoopsonline.com. Available at: https://nbahoopsonline.com/History/#:~:text=The%20NBA%20began%20life%20as,start%20of%20the%20next%20season. [ Accessed 20 October 2020]. ↩︎

2. Editor, M., 2020. How NBA Analytics Is Changing Basketball | Merrimack College. [online] Merrimack College Data Science Degrees. Available at: https://onlinedsa.merrimack.edu/nba-analytics-changing-basketball/ [Accessed 16 November 2020]. ↩︎

3. N. M. Abbas, “NBA Data Analytics: Changing the Game,” Medium, 21-Aug-2019. [Online]. Available: https://towardsdatascience.com/nba-data-analytics-changing-the-game-a9ad59d1f116. [Accessed: 17-Nov-2020]. ↩︎

4. C. Ford, “NBA Draft 2009,” ESPN. [Online]. Available: http://www.espn.com/nba/draft2009/index?topId=4279081. [Accessed: 17-Nov-2020]. ↩︎

5. M. Woo, “Artificial Intelligence in NBA Basketball,” Inside Science, 21-Dec-2018. [Online]. Available: https://insidescience.org/news/artificial-intelligence-nba-basketball. [Accessed: 07-Dec-2020]. ↩︎

6. M. Beckler and M. Papamichael, “NBA Oracle,” 10701 Report, 2008. [Online]. Available: https://www.mbeckler.org/coursework/2008-2009/10701_report.pdf. [Accessed: 06-Dec-2020]. ↩︎

7. R. Anderson, “NBA Data Analysis Using Python & Machine Learning,” Medium, 02-Sep-2020. [Online]. Available: https://randerson112358.medium.com/nba-data-analysis-exploration-9293f311e0e8. [Accessed: 07-Dec-2020]. ↩︎

8. J. P. Hwang, “Learn linear regression using scikit-learn and NBA data: Data science with sports,” Medium, 18-Sep-2020. [Online]. Available: https://towardsdatascience.com/learn-linear-regression-using-scikit-learn-and-nba-data-data-science-with-sports-9908b0f6a031. [Accessed: 07-Dec-2020]. ↩︎

9. J. Perricone, I. Shaw, and W. Swie¸chowicz, “Predicting Results for Professional Basketball Using NBA API Data,” Stanford.edu, 2016. [Online]. Available: http://cs229.stanford.edu/proj2016/report/PerriconeShawSwiechowicz-PredictingResultsforProfessionalBasketballUsingNBAAPIData.pdf. [Accessed: 06-Dec-2020]. ↩︎

10. A. P. B. N. Barakat, J. H. F. L. Breiman, M. T. R. Burbidge, K.-S. S. T. Chen, J. L. R. WW. Cooper, V. N. V. C. Cortes, E. F. M. Hall, J. Holland, R. C. E. J. Kennedy, K. J. Kim, K. H. T. K. Kirchner, J. S. S. P. Kvan, A. C. W. BL. Lee, B. B. D. Martens, J. Mercer, J. K. B. Min, O. B. K. Muata, J. S. L. IS. Oh, P. M. M. M. Pal, J. R. Quinlan, F. P.-C. FJR. Ruiz, W. H. C. JY. Shih, H. M. E. I.-D. MBA. Snousy, P. V. E. Štrumbelj, L. C. FEH. Tay, V. V. S. S. Tripathi, G. Valentini, V. N. Vapnik, G. D. N. Vlastakis, J. N. Wang, E. Y. K. A. Widodo, C. F. H. TA. Zak, and J. S. J. Zhou, “Analyzing basketball games by a support vector machines with decision tree model,” Neural Computing and Applications, 01-Jan-1970. [Online]. Available: https://link.springer.com/article/10.1007/s00521-016-2321-9. [Accessed: 07-Dec-2020]. ↩︎

11. 2017 A. S. M. N. A. Jun 01, “The NBA’s Adam Silver: How Analytics Is Transforming Basketball,” Knowledge@Wharton. [Online]. Available: https://knowledge.wharton.upenn.edu/article/nbas-adam-silver-analytics-transforming-basketball/. [Accessed: 07-Dec-2020]. ↩︎

# 21 - Big Data in E-Commerce

The topic of my report is big data in e-commerce. E-commerce is a big part of todays society. During the shopping online, the recommend commodities are fitter and fitter for my liking and willingness to buy. This is the merit of big data. Big data use my purchase history and browsing history to analyze my liking and recommend the goods for me.

Status: final, Type: Report

Wanru Li, fa20-523-329, Edit

Keywords: e-commerce, big data, data analysis

## 1. Introduction

E-commerce is already changed by big data a lot. As we can see in the lecture slides, the retail store was closed rapidly in the past three years. It can be seen that e-commerce has begun to take shape and has been accepted by customers. E-commerce can make shopping more convenient for customers, and enable companies to better discover current trends and customers' favorite categories for better development.

For customers, they can find the item they want easier than find it in a retail store. Maybe the customer doesn’t know what he wants, doesn’t know his brand, only knows its style, but that’s enough to search for the item on e-commerce. There are also some e-commerce services that offer photo search, which makes shopping easier. Shopping in e-commerce usually keeps a record of the purchase. In this way, you don’t have to go to a retail store to buy some products repeatedly. Instead, you can directly find the products in the record and place orders, which saves a lot of time.

For companies, there are more changes. They can analyze customers preferences and purchasing power based on their browsing data, shopping cart data, and purchasing data. Large enough to predict the future business trend, small enough to better see the customer’s evaluation of the product.

E-commerce companies have access to a lot of data, which makes it easy for them to analyze product trends and customer preferences. As talent says, “Retail websites track the number of clicks per page, the average number of products people add to their shopping carts before checking out, and the average length of time between a homepage visit and a purchase. If customers are signed up for a rewards or subscription program, companies can analyze demographic, age, style, size, and socioeconomic information. Predictive analytics can help companies develop new strategies to prevent shopping cart abandonment, lessen time to purchase, and cater to budding trends. Likewise, e-commerce companies use this data to accurately predict inventory needs with changes in seasonality or the economy” 1. There is an example of the Lenovo, to enhance the customer experience and stand out from the competition, Lenovo needs to understand customers' needs, preferences, and purchasing behaviors. By collecting data sets from various touchpoints, Lenovo USES real-time predictive analytics to improve customer experience and increase revenue per retail segment by 11 percent 1.

Meeting customer needs is not just an immediate problem. E-commerce depends on having the right inventory in the future. Big data can help the company to be prepared for the emerging trend, in the slow or potential prosperity and development of the year, or around major activity plan marketing activities. E-commerce companies will compile large data sets. By evaluating the data of a few years ago, electronics retailers can plan accordingly inventory, inventory to predict peak, simplify the overall business operations, and predict demand. E-commerce sites, for example, can do it in the shopping rush hour in social media significantly depreciate sales promotion, to eliminate redundant products. In order to optimize pricing decisions, e-commerce sites can also provide a special discount. Through big data analysis and machine learning, learn when to offer discounts, how long they should last, and what discount prices are offered more accurately (para 8).

E-commerce is bound to dominate the retail market in the future because it can help retailers better analyze and predict future trends, which retailers cannot resist. At the same time, e-commerce provides better ways for customers to shop. With better analysis, retail companies will be able to provide better service to customers, so e-commerce will be more and more accepted and popular in the future.

## 2. Background Research and Previous Work

The most common and widely used application of big data is in e-commerce. Nowadays, the application of big data in e-commerce is relatively mature. As Artur Olechowski wrote, “As businesses scale up, they also collect an increasing amount of data. They need to get interested in data and its processing; this is just inevitable. That’s why a data-driven e-commerce company should regularly measure and improve upon: shopper analysis, customer service personalization, customer experience, the security of online payment processing, targeted advertising” 2.

There are also some disadvantages of the big data, or to say more need to do after getting the data. Artur Olechowski wrote, “Understand the problem of security — Big Data tools gather a lot of data about every single customer who visits your site. This is a lot of sensitive information. If your security is compromised, you could lose your reputation. That’s why before adopting the data technology, make sure to hire a cybersecurity expert to keep all of your data private and secure”2 . Security is always a big problem with big data. This is one of the components will be analyzed in my report. He also wrote, “Lack of analytics will become a bigger problem — Big Data is all about gathering information, but to make use of it, your system should also be able to process it. High-quality Big Data solutions can do that and then visualize insights in a simple manner. That’s how you can make this valuable information useful to everyone, from managers to customer service reps” 2. The analysis is also an important part of the big data. Only collecting data cannot help e-commerce anything. Security and analytics will be talked about in my report.

## 3. Choice of Data-sets

QUARTERLY RETAIL E-COMMERCE SALES 2 nd QUARTER 2020:https://www.census.gov/retail/mrts/www/data/pdf/ec_current.pdf

For the dataset, the source website provided in the project requirements will be used, if there needs more information, data and information on the web will be searched for. As a result of recent COVID-19 incidents, many organizations work in a small capacity or have entirely ceased activities. The Census Bureau has tracked and analyzed the response and data quality in this manner.

Monthly Retail Trade from Census will be analyzed. The Census Bureau of the Department of Commerce today reported that the forecast of U.S. retail e-commerce revenue for the second quarter of 2020 adjusted for seasonal fluctuations, but not for price adjustments, was $211.5 billion, a rise of 31.8 per cent (plus or minus 1.2 per cent) from the first quarter of 2020. Total retail revenues were projected at$1,311.0 billion for the second quarter of 2020, a decline of 3.9 percent (plus or minus 0.4 percent) from the first quarter of 2020. The e-commerce forecast for the second quarter of 2020 increased (para1).

Retail e-commerce sales are estimated from the same sample used for the Monthly Retail Trade Survey (MRTS) to estimate preliminary and final U.S. retail sales. Advance U.S. online transactions are calculated from a subsample of the MRTS survey that is not of appropriate magnitude to calculate improvements in retail e-commerce transactions.

A stratified basic random sampling procedure is used to pick approximately 10,800 retailers, except food services, whose transactions are then weighted and benchmarked to reflect the entire universe of over two million retailers. The MRTS sample is focused on probability and represents all employer firms engaged in retail activities as described in the North American Industry Classification System (NAICS). Coverage covers all vendors whether or not they are active in e-commerce. Internet travel agents, financial brokers and distributors, and ticket sales companies are not listed as retail and are not included with either the gross retail or retail e‐commerce sales figures. Non employees are reflected in the projections by benchmarking of previous annual survey estimates that contain non employer revenue based on administrative data. E-commerce revenues are included in the gross monthly sales figures.

The MRTS sample is revised on a continuous basis to account for new retail employees (including those selling over the Internet), company deaths and other shifts in the retail business environment. Firms are asked to report e-commerce revenue on a monthly basis separately. For each month of the year, data for non-responsive sampling units shall be calculated from reacting sampling units falling under the same class of sector and sales size segment or on the basis of the company’s historical results. Responding firms account for approximately 67% of the e-commerce sales estimate and approximately 72% of the U.S. retail sales estimate for any quarter.

Estimates are obtained by summing the weighted sales (either reported or charged) for each month of the quarter. The monthly figures are benchmarked against previous annual survey estimates. Quartal projections are determined summing up the monthly benchmarked figures. The forecast for the last quarter is a provisional forecast. The calculation is also open to revision. Data consumers who make their own projections using data from this study can only apply to the Census Bureau as the source of input data.

This article publishes forecasts optimized for seasonal variation and variations in holiday and trade days, but not for adjustments in rates. As feedback for the X‐13ARIMA‐SEATS programme, we have used the updated figures of quarterly figures of e-commerce revenue for the fourth quarter 1999 up to the present quarter. For revenue, we estimated the quarterly adjusted figures for each year with an additional modified monthly revenue forecast. Seasonal estimate adjustment is an approximation based on current and previous experiences.

The estimates containing sample errors and non-sample errors in this article are based on a survey.

The difference between the prediction and the results of the full population listing under the same sample conditions is the sampling error. This mistake happens when a national poll only tests a sub-set of the total population. Estimated sampling variance measurements are standard errors and variance coefficients, as stated in Table 2 of this article.

## 4. Search and Analysis

This year the pandemic accelerated growth in ecommerce in the US, with online revenues projected to hit just 2022. The top 10 ecommerce retailers will strengthen their hold on the retail market with our Q3 American retail prediction.

This year, revenues of US eCommerce are projected to hit $794.50 billion, up 32.4% annually. This is even more than the 18.0% predicted in our Q2, since customers are now ignoring shops and opting to buy online in the wake of the pandemic. In “US Ecommerce Growth Jumps to More than 30%, Accelerating Online Shopping Shift by Nearly 2 Years”, it says, “‘We’ve seen ecommerce accelerate in ways that didn’t seem possible last spring, given the extent of the economic crisis,’ said Andrew Lipsman, eMarketer principal analyst at Insider Intelligence. ‘While much of the shift has been led by essential categories like grocery, there has been surprising strength in discretionary categories like consumer electronics and home furnishings that benefited from pandemic-driven lifestyle needs’” 3. This year, ecommerce revenues are projected to hit 14.4 percent and 19.2 percent of all US retail spending by 2024. Without purchases of petrol and cars, ecommerce penetration leaps to 20.6% (classes sold almost entirely offline). In “US Ecommerce Growth Jumps to More than 30%, Accelerating Online Shopping Shift by Nearly 2 Years”, it writes, “‘There will be some lasting impacts from the pandemic that will fundamentally change how people shop,’ said Cindy Liu, eMarketer senior forecasting analyst at Insider Intelligence. ‘For one, many stores, particularly department stores, may close permanently. Secondly, we believe consumer shopping behaviors will permanently change. Many consumers have either shopped online for the first time or shopped in new categories (i.e., groceries). Both the increase in new users and frequency of purchasing will have a lasting impact on retail’” 3. Online commerce will be so high that this year, at$4,711 trillion, this will more than compensate for the 3,2 percent fall in brick and mortar expenses. Complete US retail revenue will also remain relatively flat.

## 5. Conclusion

### 6.2 Africa’s Geographical Problems

Africa’s geography is causing similar problems to their internet distribution just like America’s. Africa is a ginormous continent with an area of 11.73 million miles^2. With that much land it can be hard to cover this entire continent with internet and provide people in rural areas with internet. About one third of the population is out of reach of all mobile broadband reach and can not get access to any internet. To cover this land Africa needs to invest an more infrastructure to cover this land with 250,000 base stations and 250,000 kilometers of fiber. Africa’s population is spread out throughout the whole country, based on Figure 4 we see all the connections and networks throughout Africa there are. It is estimated that 45% of African’s are to far away from any fiber network to connect to the internet. These areas with no networks or infrastructure are areas with low populations and not hotspots or major cities. This problem might not be solved for a while as companies will focus on hotspots rather than helping people in need to access the internet. One of the more recent developments that might help Africa is the SpaceX Satellite internet 15. This can help them as they wont need the vast infrastructure to create the network and the connection will be able to reach all over the continent. Africa’s geography limits people connection based on where they live and if they are near the water to connect the submerged connections. With more investment and new inventions we could see this geographical challenge tackled within 20 years.

Figure 4: Seeing the hotspots throughout the continent and where the internet access is shows the disparity of rural vs urban living. Hotspots of populations are targeted more for internet connection as it makes sense for companies financially to target those areas. This creates a large divide for areas inside the continent who do not have networks or any chance of connecting to the networks. This figure shows how this disparity is present within Africa.

## 2. Resources

Table 2.1: Resources

No. Name Version Type Notes
1. Python 3.6.9 Programming language Python is a high-level interpreted programming language.
2. MongoDB 4.4.2 Database MongoDB is a NoSQL Database program that uses JSON-like documents.
3. Heroku 0.1.4 Cloud Platform Heroku is a cloud platform used for deploying applications. It uses a Git server to handle application repositories.
4. Gunicorn 20.0.4 Server Gateway Interface Gunicorn is a python web server gateway interface . It is mainly used in the project for running python applications on Heroku.
5. Tensorflow 2.3.1 Python Library Tensorflow is an open-source machine learning library. It is mainly used in the project for training models and predicting results.
6. Keras 2.4.3 Python Library Keras is an open-source python library used for interfacing with artificial neural networks. It is an interface for the Tensorflow library.
7. Scikit-Learn 0.22.2 Python Library Scikit-learn is an open-source machine learning library featuring various algorithms for classification, regression and clustering problems.
8. Numpy 1.16.0 Python Library Numpy is a python library used for handling and performing various operations on large multi-dimensional arrays.
9. Scipy 1.5.4 Python Library Scipy is a python library used for scientific and technical computing. It is not directly used in the project but serves as a dependency for tensorflow.
10. Pandas 1.1.4 Python Library Pandas is a python library used mainly for large scale data manipulation and analysis.
11. Matplotlib 3.2.2 Python Library Matplotlib is a python library used for graphing and plotting.
12. Pickle 1.0.2 Python Library Pickle-mixin is a python library used for saving and loading python variables.
13. Pymongo 3.11.2 Python Library Pymongo is a python library containing tools for working with MongoDB.
14. Flask 1.1.2 Python Library Flask is a micro web framework for python. It is used in the project for creating the API.
15. Datetime 4.3 Python Library Datetime is a python library used for handling dates as date objects.
16. Pytz 2020.4 Python Library Pytz is a python library used for accurate timezone calculations.
17 Yahoo Financials 1.6 Python Library Yahoo Financials is an unofficial python library used for extracting data from Yahoo Finance website by web scraping.
18 Dns Python 2.0.0 Python Library DNS python is a necessary dependency of Pymongo.

## 3. Dataset

The project builds its own dataset by extracting the data from Yahoo Finance website using Yahoo Financial python library 4. The data includes cryptocurrency prices, stock market indices and foreign exchange rates from September 30 2015 to December 5 2020. The project uses historical data of 6 cryptocurrencies - Bitcoin (BTC), Ethereum (ETH), Dash (DASH), Litecoin (LTC), Monero (XMR) and Ripple (XRP), 25 stock market indices - S&P 500 (USA), Dow 30 (USA), NASDAQ (USA), Russell 2000 (USA), S&P/TSX (Canada), IBOVESPA (Brazil), IPC MEXICO (Mexico), Nikkei 225 (Japan), HANG SENG INDEX (Hong Kong), SSE (China), Shenzhen Component (China), TSEC (Taiwan), KOSPI (South Korea), STI (Singapore), Jakarta Composite Index (Indonesia), FTSE Bursa Malaysia KLCI (Malaysia), S&P/ASX 200 (Australia), S&P/NZX 50 (New Zealand), S&P BSE (India), FTSE 100 (UK), DAX (Germany), CAC 40 (France), ESTX 50 (Europe), EURONEXT 100 (Europe), BEL 20 (Belgium), and 22 foreign exchange rates - Australian Dollar, Euro, New Zealand Dollar, British Pound, Brazilian Real, Canadian Dollar, Chinese Yuan, Hong Kong Dollar, Indian Rupee, Korean Won, Mexican Peso, South African Rand, Singapore Dollar, Danish Krone, Japanese Yen, Malaysian Ringgit, Norwegian Krone, Swedish Krona, Sri Lankan Rupee, Swiss Franc, New Taiwan Dollar, Thai Baht. This data is, then, posted to MongoDB Database. The three databases are created for each of the data types - Cryptocurrency prices, Stock Market Indices and Foreign Exchange Rates. The three databases each contain one collection for every currency, index and rate respectively. These collections have a uniform structure containing 6 columns - “id”, “formatted_date”, “low”, “high”, “open” and “close”. The tickers used to extract data from Yahoo Finance 4 are stated in Figure 3.1.

Figure 3.1: Ticker Information

The data is, then, preprocessed to get only one column per date (“close” price) and to add missing information by replicating previous day’s values, which is used to make a large dataset including the prices of all indices and rates for all the dates within the given range. This data is saved in a different MongoDB Database and collection, both, called nn_data. This collection has 54 columns containing closing prices for each cryptocurrency price, stock market index and foreign exchange rate and the date. The rows represent different dates.

One additional database is also created - Predictions - which contain the predictions of cryptocurrency prices for each day and it’s true value. The collection has 13 columns containing a date column and 2 columns for each cryptocurrency (prediction value and true value). New rows are inserted everyday for all collections except the “nn_data” collection. Figure 3.2 represents the overview of the MongoDB Cluster. Figure 3.3 shows the structure of the nn_data collection.

Figure 3.2: MongoDB Cluster Overview

Figure 3.3: Short Structure of NN_data Collection

## 4. Analysis

### 4.1 Principal Component Analysis

Principal Component Analysis uses Singular Value Decomposition (SVD) for dimensionality reduction, exploratory data analysis and making predictive models. PCA helps understand a linear relationship in the data5. In this project, PCA is used for the preliminary analysis to find a pattern between the target and the features. Here we have tried to make some observations by performing PCA on various cryptocurrencies with stocks and forex data. In this analysis, we reduced the dimension of the dataset to 3D, represented in Figure 4.1. The first and second dimension is on x-axis and y-axis respectively whereas the third dimension is used in the color. On observing the scatter plots in Figure 4.1, we can clearly see the patterns formed by various relationships. Therefore, it can be stated that the target and features are related in some way based on the principal component analysis.

Figure 4.1: Principal Component Analysis

### 4.2 TSNE Analysis

T-Distributed Stochastic Neighbour Embedding is mainly used for non-linear dimensionality reduction. TSNE uses local relationships between points to create a low-dimensional mapping. TSNE uses Gaussian distribution to create a probability distribution. In this project, TSNE is used to analyze non-linear relationships between cryptocurrencies and the features (stock indices and forex rates), which were not visible in the principal component analysis. It can be observed in Figure 4.2, that there are visible patterns in the data i.e. same colored data points are in some pattern, proving a non linear relationship. The t-SNE plots in Figure 4.2 are not like the typical t-SNE plots i.e. they do not have any clusters. This might be because of the size of the dataset.

Figure 4.2: t-SNE Analysis

### 4.3 Weighted Features Analysis

Layers of neural networks have weights assigned to each feature column. These weights are updated continuously while training. Analyzing the weights of the model which is trained for this project, can give us a picture of the important features. To perform such an analysis, the top five feature weights are noted for each layer. The number of times a feature is present in the top five of a layer, is also noted. This is represented in Figure 4.3, where we can observe that the New Zealand Dollar and the Canadian Dollar are repeated most number of times in the top five weights of layers.

Figure 4.3: No. of repetitions in top five weights

The relationships of these two features - New Zealand Dollar and Canadian Dollar with various cryptocurrencies are, then, analyzed in Figure 4.4 and Figure 4.5. It can be observed that Bitcoin has a direct relationship with these rates. Bitcoin can be observed to increase with an increase in NZD to USD rate and an increase in CAD to USD rate. For the rest of the cryptocurrencies, we can observe that they tend to rise when the NZD to USD rate and the CAD to USD rate are stable and tend to fall when the rates move towards either of the extremes.

Figure 4.4: Relationship of NZD with Cryptocurrencies

Figure 4.5: Relationship of CAD with Cryptocurrencies

## 5. Neural Network

### 5.1 Data Preprocessing

The first step to build a neural network for predicting cryptocurrency prices, is to clean the data. In this step, data from the “NN_data” collection is imported. Two scalers are used to normalize the data, one for feature columns and other for the target columns. For this purpose, “StandardScaler” from Scikit-learn library is used. These scalers are made to fit with the data and then saved to a file using pickle-mixin, in order to use it later for predictions. These scalers are then used to normalize the data using mean and standard deviation. This normalized data is shuffled and split into a training set and a test set. This procedure is done by using the “train_test_split()” function from the Scikit-learn library. The data is split into 94:6 ratio for training and testing respectively. The final data is split into four - X_train, X_test, y_train and y_test and is ready for training the neural network model.

### 5.2 Model

For the purpose of predicting the prices of cryptocurrency based on previous day’s stock indices and forex rates, the project uses a fully connected neural network. The solution to this problem could have been perceived in different ways like making it a classification problem by predicting rise or fall in price or by making it a regression problem by either predicting the actual price or by predicting the growth. After trying all these ways of solution, it was concluded that predicting the price regression problem was the best option.

The final model comprises three layers - one input layer, one hidden layer and one output layer. The first layer uses 8 units with an input dimension of (None, 47) and uses Rectified Linear Unit (ReLU) as its activation function, and He Normal as its kernel initializer. The second layer which is a hidden layer uses 2670 hidden units with Rectified Linear Unit (ReLU) Activation function. ReLU is used because of its faster and effective training in regression models. The third layer which is the output layer has 6 units, one each for predicting 6 cryptocurrencies. The output layer uses linear activation function.

The overview of the final model can be seen in Figure 5.2. The predictions using the un-trained model can be seen in Figure 5.3, where we can observe the initialization of weights.

Figure 5.2: Model Overview

Figure 5.3: Visualization of Initial Weights

### 5.3 Training

The neural network model is compiled before training. The model is compiled using Adam optimizer with a default learning rate of 0.001. The model uses Mean Squared Error as its loss function in order to reduce the error and give a close approximation of the cryptocurrency prices. Mean squared error is also used as a metric to visualize the performance of the model.

The model is, then, trained by using X_train and y_train, as mentioned above, for 5000 epochs and by splitting the dataset for validation (20% for validation). The performance of the training of the final model for first 2500 epochs can be observed in Figure 5.4.

Figure 5.4: Final Model Training

This particular model was chosen because of its low validation mean squared error as compared to the performance of other models. Figure 5.5 represents the performance of a similar fully connected model with Random Normal as its initializer instead of He Normal. Figure 5.6 represents the performance of a Convolutional Neural Network. This model was trained with a much lower mean squared error but had a higher validation mean squared error and was therefore dropped.

Figure 5.5: Performance of Fully Connected with Random Normal

Figure 5.6: Performance of Convolutional Neural Network

### 5.4 Prediction

After training, the model is stored in a .h5 file, which can be used to make predictions. For making predictions, the project preprocesses the data provided which needs to be of the input dimension of the model i.e. of shape (1, 47). Both the scalers which were saved earlier in the preprocessing stage are loaded again using pickle-mixin. The feature scaler is used to transform the new data to normalized data. This normalized data of the given dimension is then used to predict the prices for six cryptocurrencies. Since regression models do not show accuracy directly, it can be measured manually by rounding off the predicted values and the corresponding true values to the decimal place of one or two and then getting the difference between the two and comparing it to a preset threshold. If the values are rounded off to one decimal place and the threshold is set to 0.05 on the normalized predictions, the accuracy of the prediction is approximately 88% and if the values are rounded off to two decimal places, the accuracy is approximately 62%. The predictions of the test data and the corresponding true values for Bitcoin can be observed in Figure 5.7, where similarities can be observed. Prediction for a new date for the prices of all six cryptocurrencies and its true values can be observed in Figure 5.8. Figure 5.9 also displays the actual result of this project as it can be observed that the predictions and the true values have similar trend with a low margin of error.

Figure 5.7: Prediction vs. True

Figure 5.8: Prediction vs. True for one day’s test data

Figure 5.9: Prediction vs. True for all cryptocurrencies

## 6. Deployment

### 6.1 Daily Update

The database is supposed to be updated daily using a web-app deployed on Heroku. Heroku is a cloud platform used for deploying web-apps of various languages and also uses a Git-server for repositories 6. This daily update web-app is triggered daily at 07.30 AM UTC i.e 2.00 AM EST. The web-app extracts the data for the previous day and updates all the collections. The new data is then preprocessed by using the saved feature normalizer. This normalized data is used to get predictions for the prices of cryptocurrencies for the day that just started. The web-app then gets the true values of the cryptocurrency prices for the previous day and updates the predictions collection using this data for future comparison. The web-app is currently deployed on Heroku and is triggered daily using Heroku Scheduler. The web-app is entirely coded in Python.

### 6.2 REST Service

The data from the MongoDB databases can be accessed using a public RESTful API. The API is developed using Flask-Python. The API usage is given below.

URL - https://crypto-project-api.herokuapp.com/

/get_data/single/market/index/date

Type - GET

Sample Request -

https://crypto-project-api.herokuapp.com/get_data/single/crypto/bitcoin/2020-12-05

Sample Respose -

{
"data":
[
{
"close":19154.23046875,
"date":"2020-12-05",
"high":19160.44921875,
"low":18590.193359375,
"open":18698.384765625
}
],
"status":"Success"
}


/get_data/multiple/market/index/start_date/end_date

Type - GET

Sample Request -

https://crypto-project-api.herokuapp.com/get_data/multiple/crypto/bitcoin/2020-12-02/2020-12-05

Sample Respose -

{
"data":
[
{
"close":"19201.091796875",
"date":"2020-12-02",
"high":"19308.330078125",
"low":"18347.71875",
"open":"18801.744140625"
},
{
"close":"19371.041015625",
"date":"2020-12-03",
"high":"19430.89453125",
"low":"18937.4296875",
"open":"18949.251953125"
},
{
"close":19154.23046875,
"date":"2020-12-05",
"high":19160.44921875,
"low":18590.193359375,
"open":18698.384765625
}
],
"status":"Success"
}


/get_predictions/date

Type - GET

Sample Request -

https://crypto-project-api.herokuapp.com/get_predictions/2020-12-05

Sample Respose -

{
"data":
[
{
"bitcoin":"16204.04",
"dash":"24.148237",
"date":"2020-12-05",
"ethereum":"503.43005",
"litecoin":"66.6938",
"monero":"120.718414",
"ripple":"0.55850273"
}
],
"status":"Success"
}


## 7. Conclusion

After analyzing the historical data of Stock Market Indices, Foreign Exchange Rates and Cryptocurrency Prices, it can be concluded that there does exist a non-linear relationship between the three. It can also be concluded that cryptocurrency prices can be predicted and its trend can be estimated using Stock Indices and Forex Rates. There is still a large scope of improvement in reducing the mean squared error. The project can further improve the neural network model for better predictions. In the end, it is safe to conclude that the indicators of international politics like Stock Market Indices and Forex Exchange Rates are factors affecting the prices of cryptocurrency.

## 8. Acknowledgement

Krish Hemant Mhatre would like to thank Indiana University and Luddy School of Informatics, Computing and Engineering for providing me with the opportunity to work on this project. He would also like to thank Dr. Geoffrey C. Fox, Dr. Gregor von Laszewski and the Assistant Instructors of ENGR-E-534 Big Data Analytics and Applications for their constant guidance and support.

## References

1. Szmigiera, M. “Cryptocurrency Market Value 2013-2019.” Statista, 20 Jan. 2020, https://www.statista.com/statistics/730876/cryptocurrency-maket-value↩︎

2. Lansky, Jan. “Possible State Approaches to Cryptocurrencies.” Journal of Systems Integration, University of Finance and Administration in Prague Czech Republic, http://www.si-journal.org/index.php/JSI/article/view/335↩︎

3. Sovbetov, Yhlas. “Factors Influencing Cryptocurrency Prices: Evidence from Bitcoin, Ethereum, Dash, Litcoin, and Monero.” Journal of Economics and Financial Analysis, London School of Commerce, 26 Feb. 2018, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3125347↩︎

4. Sanders, Connor. “YahooFinancials.” PyPI, JECSand, 22 Oct. 2017, https://pypi.org/project/yahoofinancials/↩︎

5. Jaadi, Zakaria. “A Step-by-Step Explanation of Principal Component Analysis.” Built In, https://builtin.com/data-science/step-step-explanation-principal-component-analysis↩︎

6. “What Is Heroku.” Heroku, https://www.heroku.com/what↩︎

# 28 - Project: Deep Learning in Drug Discovery

Machine learning has been a mainstay in drug discovery for decades. Artificial neural networks have been used in computational approaches to drug discovery since the 1990s. Under traditional approaches, emphasis in drug discovery was placed on understanding chemical molecular fingerprints, in order to predict biological activity. More recently however, deep learning approaches have been adopted instead of computational methods. This paper outlines work conducted in predicting drug molecular activity, using deep learning approaches.

Status: final, Type: Project

Anesu Chaora, sp21-599-359, Edit

Keywords: Deep Learning, drug discovery.

## 1. Introduction

### 1.1. De novo molecular design

Deep learning (DL) is finding uses in developing novel chemical structures. Methods that employ variational autoencoders (VAE) have been used to generate new chemical structures. Approaches have involved encoding input string molecule structures, then reparametrizing the underlying latent variables, before searching for viable solutions in the latent space by using methods such as Bayesian optimizations. The results are then decoded back into simplified molecular-input line-entry system (SMILES) notation, for recovery of molecular descriptors. Variations to this method involve using generative adversarial networks (GAN)s, as subnetworks in the architecture, to generate the new chemical structures 1.

Other approaches for developing new chemical structures involve recurrent neural networks (RNN), to generate new valid SMILES strings, after training the RNNs on copious quantities of known SMILES datasets. The RNNs use probability distributions learned from training sets, to generate new strings that correspond to molecular structures 2. Variations to this approach incorporate reinforcement learning to reward models for new chemical structures, while punishing them for undesirable results 3.

### 1.2. Bioactivity prediction

Computational methods have been used in drug development for decades 4. The emergence of high-throughput screening (HTS), in which automated equipment is used to conduct large assays of scientific experiments on molecular compounds in parallel, has resulted in generation of enormous amounts of data that require processing. Quantitative structure activity relationship (QSAR) models for predicting the biological activity responses to physiochemical properties of predictor chemicals, extensively use machine learning models like support vector machines (SVM) and random decision forests (RF) for processing 1, 5.

While deep learning (DL) approaches have an advantage over single-layer machine learning methods, when predicting biological activity responses to properties of predictor chemicals, they have only recently been used for this 1. The need to interpret how predictions are made through computationally oriented drug discovery, is seen - in part - as a factor to why DL approaches have not been adopted as quickly in this area 6. However, because DL models can learn complex non-linear data patterns, using their multiple hidden layers to capture patterns in data, they are better suited for processing complex life sciences data, than other machine learning approaches 6.

Their applications have included profiling tumors at molecular level and predicting drug responses, based on pharmacological and biological molecular structures, functions, and dynamics. This is attributed to their ability to handle high dimensionality in data features, making them appealing for use in predicting drug response 5.

For example, deep neural networks were used in models that won NIH’s Toxi21 Challenge 7 on using chemical structure data only to predict compounds of concern to human health 8. DL models were also found to perform better than standard RF models 9 in predicting the biological activities of molecular compounds in the Merck Molecular Activity Challenge on Kaggle 10. Details of the challenge follow.

### 2.1. Merck Molecular Activity Challenge on Kaggle

A challenge to identify the best statistical techniques for predicting molecular activity was issued by Merck & Co Pharmaceutical, through Kaggle in October of 2012. The stated goal of the challenge was to ‘help develop safe and effective medicines by predicting molecular activity’ for effects that were both on and off target 10.

### 2.2. The Dataset

A dataset was provided for the challenge 10. It consisted of 15 molecular activity datasets. Each dataset contained rows corresponding to assays of biological activity for chemical compounds. The datasets were subdivided into training and test set files. The training and test dataset split was done by dates of testing 10, with test set dates consisting of assays conducted after the training set assays.

The training set files each had a column with molecular descriptors that were formulated from chemical molecular structures. A second column in the files contained numeric values, corresponding to raw activity measures. These were not normalized, and indicated measures in different units.

The remainder of the columns in each training dataset file indicated disguised substructures of molecules. Values in each row, under the substructure (atom pair and donor-acceptor pair) codes, corresponded to the frequencies at which each of the substructures appeared in each compound. Figure 1 shows part of the head row for one of the training dataset files, and the first records in the file.

Figure 1: Head Row of 1 of 15 Training Dataset files

The test dataset files were similar (Figure 2) to the training files, except they did not include the column for activity measures. The challenge presented was to predict the activity measures for the test dataset.

Figure 2: Head Row of 1 of 15 Test Dataset files

### 2.3. A Deep Learning Algorithm

The entry that won the Merck Molecular Activity Challenge on Kaggle used an ensemble of methods that included a fully connected neural network as the main contributor to the high accuracy in predicting molecular activity 9. Evaluations of predictions for molecular activity for the test set assays were then determined using the mean of the correlation coefficient (R2) of the 15 data sets. Sample code in R was provided for evaluating the correlation coefficient. The code, and formula for R2 are appended in Appendix 1.

An approach of employing convolutional networks on substructures of molecules, to concentrate learning on localized features, while reducing the number of parameters in the overall network, was also proposed in literature on improving molecular activity predictions. This methodology of identifying molecular substructures as graph convolutions, prior to further processing, was discussed by authors 11, 12.

In line with the above research, an ensemble of networks for predicting molecular activity was planned for this project, using the Merck dataset, and hyperparameter configurations found optimal by the cited authors. Recognized optimal activation functions, for different neural network types and prediction types 13, were also earmarked for use on the project.

## 3. Project Implementation

Implementation details for the project were as follows:

### 3.1. Tools and Environment

The Python programming language (version 3.7.10) was used on Google Colab (https://colab.research.google.com).

A subscription account to the service was employed, for access to more RAM (High-RAM runtime shape) during development, although the free standard subscription will suffice for the version of code included in this repository.

Google Colab GPU hardware accelerators were used in the runtime configuration.

Prerequisites for the code included packages from Cloudmesh, for benchmarking performance, and from Kaggle, for API access to related data.

Keras libraries were used for implementing the molecular activity prediction model.

### 3.2. Implementation Overview

This project’s implementation of a molecular activity prediction model consisted of a fully connected neural network. The network used the Adam 14 optimization algorithm, at a learning rate of 0.001 and beta_1 calibration of 0.5. Mean Squared Error (MSE) was used for the loss function, and R-Squared 15 for the metric. Batch sizes were set at 128. These parameter choices were selected by referencing the choices of other prior investigators 16.

The network was trained on the 15 datasets separately, by iterating through the storage location containing preprocessed data, and sampling the data into training, evaluation and prediction datasets - before running the training. The evaluation and prediction steps, for each dataset, where also executed during the iteration of each molecular activity dataset. Running the processing in this way was necessitated by the fact that the 15 datasets each had different feature set columns, corresponding to different molecular substructures. As such, they could not be readily processed through a single dataframe.

An additional compounding factor was that the data was missing the molecular activity results (actual readings) associated with the dataset provided for testing. These were not available through Kaggle as the original competition withheld these from contestants, reserving them as a means for evaluating the accuracy of the models submitted. In the absence of this data, for validating the results of this project, the available training data was split into samples that were then used for the exercise. The training of the fully connected network was allocated 80% of the data, while the testing/evaluation of the model was allocated 10% of the data. The remaining data (10%) was used for evaluating predictions.

### 3.3. Benchmarks

Benchmarks captured during code execution, using cloudmesh-common 7, were as follows:

• The data download process from Kaggle, through the Kaggle data API, took 29 seconds.

• Data preprocessing scripts took 8 minutes and 56 seconds to render the data ready for training and evaluation. Preprocessing of data included iterating through the issued datasets separately, since each file contained different combinations of feature columns (molecular substructures).

• The model training, evaluation and prediction step took 7 minutes and 45 seconds.

### 3.4. Findings

The square of the correlation coefficient (R^2) values obtained (coefficient of determination) 17 during training and evaluation were considerably low (< 0.1). A value of one (1) would indicate a goodness of fit for the model that implies that the model is completely on target with predicting accurate outcomes (molecular activity) from the independent variables (substructures/feature sets). Such a model would thus fully account for the predictions, given a set of substructures as inputs. A value of zero (0) would indicate a total lack of correlation between the input feature values and the predicted outputs. As such, it would imply that there is a lot of unexplained variance in the outputs of the model. The square of the correlation coefficient values obtained for this model (<0.1) therefore imply that it either did not learn enough, or other unexplained (by the model) variance caused unreliable predictions.

## 4. Discussion

An overwhelming proportion of the data elements provided through the datasets were zeros (0)s, indicating that no frequencies of the molecular substructures/features were present in the molecules represented by particular rows of data elements. This disproportionate representation of absent molecular substructure frequencies, versus the significantly lower instances where there were frequencies appears to have had an effect of dampening the learning of the fully connected neural network.

This supports approaches that advocated for the use of convolutional neural networks 11, 12 as auxiliary components to help focus learning on pertinent substructures. While the planning phase of this project had incorporated inclusion of such, the investigator ran out of time to implement an ensemble network that would include the suggestions.

Apart from employing convolutions, other preprocessing approaches for rescaling, and normalizing, the data features and activations 16 could have helped the learning, and subsequently the predictions made. This reinforces the fact that deep learning models, as is true of other machine learning approaches, rely deeply on the quality and preparation of data fed into them.

## 5. Conclusion

Deep learning is a very powerful new approach to solving many machine learning problems, including some that have eluded solutions till now. While deep learning models offer robust and sophisticated ways of learning patterns in data, they are still only half the story. The quality and appropriate preparation of the data fed into models is equally important when seeking to have meaningful results.

## 6. Acknowledgments

Acknowledgements go to Dr. Geoffrey Fox for his excellent guidance on ways to think about deep learning approaches, and for his instructorship of the course ‘ENG-E599: AI-First Engineering’, for which this project is a deliverable. Acknowledgements also go to Dr. Gregor von Laszewski for his astute tips and recommendations on technical matters, and on coding and documention etiquette.

## 7. Appendix

Square of the Correlation Coefficient (R2) Formula:

Sample R2 Code in the R Programming Language:

Rsquared <- function(x,y) {
# Returns R-squared.
# R2 = \frac{[\sum_i(x_i-\bar x)(y_i-\bar y)]^2}{\sum_i(x_i-\bar x)^2 \sum_j(y_j-\bar y)^2}
# Arugments: x = solution activities
#            y = predicted activities
if ( length(x) != length(y) ) {
warning("Input vectors must be same length!")
}
else {
avx <- mean(x)
avy <- mean(y)
num <- sum( (x-avx)*(y-avy) )
num <- num*num
denom <- sum( (x-avx)*(x-avx) ) * sum( (y-avy)*(y-avy) )
return(num/denom)
}
}


10

## References

1. Hongming Chen, O. E. (2018). The rise of deep learning in drug discovery. Elsevier. ↩︎

2. Marwin H. S. Segler, T. K. (2018). Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. America Chemical Society. ↩︎

3. N Jaques, S. G. (2017). Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control. Proceedings of the 34th International Conference on Machine Learning, PMLR (pp. 1645-1654). MLResearchPress. ↩︎

4. Gregory Sliwoski, S. K. (2014). Computational Methods in Drug Discovery. Pharmacol Rev, 334 - 395. ↩︎

5. Delora Baptista, P. G. (2020). Deep learning for drug response prediction in cancer. Briefings in Bioinformatics, 22, 2021, 360–379. ↩︎

6. Erik Gawehn, J. A. (2016). Deep Learning in Drug Discovery. Molecular Informatics, 3 - 14. ↩︎

7. National Institute of Health. (2014, November 14). Tox21 Data Challenge 2014. Retrieved from [tripod.nih.gov:] https://tripod.nih.gov/tox21/challenge/ ↩︎

8. Andreas Mayr, G. K. (2016). Deeptox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science. ↩︎

9. Junshui Ma, R. P. (2015). Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling, 263-274. ↩︎

10. Kaggle. (n.d.). Merck Molecular Activity Challenge. Retrieved from [Kaggle.com:] https://www.kaggle.com/c/MerckActivity ↩︎

11. Kearnes, S., McCloskey, K., Berndl, M., Pande, V., & Riley, P. (2016). Molecular graph convolutions: moving beyond fingerprints. Switzerland: Springer International Publishing . ↩︎

12. Mikael Henaff, J. B. (2015). Deep Convolutional Networks on Graph-Structured Data. ↩︎

13. Bronlee, J. (2021, January 22). How to Choose an Activation Function for Deep Learning. Retrieved from [https://machinelearningmastery.com:] https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/ ↩︎

15. Keras. (2021). Regression Metrics. Retrieved from [https://keras.io:] https://keras.io/api/metrics/regression_metrics/ ↩︎

16. RuwanT (2017, May 16). Merk. Retrieved from [https://github.com:] https://github.com/RuwanT/merck/blob/master/README.md ↩︎

17. Wikipedia (2021). Coefficient of Determination. Retrieved from [https://wikipedia.org:] https://en.wikipedia.org/wiki/Coefficient_of_determination ↩︎

# 29 - Big Data Application in E-commerce

As a result of the last twenty year’s Internet development globally, the E-commerce industry is getting stronger and stronger. While customers enjoyed their convenient online purchase environment, E-commerce sees the potential for the data and information customers left during their online shopping process. One fundamental usage for this information is to perform a Recommendation Strategy to give customers potential products they would also like to purchase. This report will build a User-Based Collaborative Filtering strategy to provide customer recommendation products based on the database of previous customer purchase records. This report will start with an overview of the background and explain the dataset it chose Amazon Review Data. After that, each step for the code and step made in a corresponding file Big_tata_Application_in_E_commense.ipynb will be illustrated, and the User-Based Collaborative Filtering strategy will be presented step by step.

Status: final, Type: Project

Tao Liu, fa20-523-339, Edit

Keywords: recommendation strategy,user-based, collaborative filtering, business, big data, E-commerce, customer behavior

## 1. Introduction

Big data have many applications in scientific research and business, from those in the hardware perspective like Higgs Discovery to the software perspective like E-commence. However, with the passage of time, online shopping and E-commerce have become one of the most popular events for citizens' lives and society. Millions of goods are now sold online the customers all over the world. With the 5G technology’s implementation, this trend is now inevitable. These activities will create millions of data about customer’s behaviors like how they value products, how they purchase or sell the products, and how they review the goods purchased would have a tremendous contribution for corporations to analyze. These data can not only help convince the current strategies of E-commerce on the right track, but a potential way to see which step E-commerce can make up for attracting more customers to buy the goods. At the same time, these data can also be implemented as a way for recommendation strategies for E-commerce. It will help customers find the next products they like in a short period by implementing machine learning technology on Big Data. The corporations also can enjoy the increase of sales and attractions by recommendation strategies. A better recommendation strategy on E-commerce is now the new trend for massive data scientists and researchers’ target. Therefore, this field is now one of the most popular research areas in the data science fields.

In this final project, An User-Based Collaborative Filtering Strategy will be implemented to get a taste of the recommendation strategy based on Customer’s Gift card purchase records and the item they also viewed and bought. The algorithm’s logic is the following: A used record indicates that customer who bought product A and also view/buy products B&C. When a new customer comes and shows his interest in B&C, product A would be recommended. This logic is addressed based on the daily-experience of customer behaviors on their E-commerce experience.

## 2. Background

Recommendation Strategy is a quite popular research area in recent years with a strong real-world influence. It is largely used in E-commerce platforms like Taobao, Amazon, etc. Therefore, It is obvious that there are plenty of recommendation strategies have been done. Though every E-commerce recommendation algorithm may be different from each other, the most popular technique for recommendation systems is called Collaborative Filtering. It is a technique that can filter out items that a user might like based on reactions by similar users. During this technique, the memory-based method is considered in this report since it uses a dataset to calculate the prediction using statistical techniques. This strategy will be able to fulfill in the local environment with a proper dataset. There are two kinds of memory-based methods available in the market: User-Based Collaborative Filtering, Item-Based Collaborative Filtering 1. This project will only focus on the User-Based Collaborative Filtering Strategy since Item-Based Collaborative Filtering requires a customer review rate for evaluation. The customer review rate for evaluation is not in the dataset available in the market. Therefore, Item-Based Collaborative Filtering unlikely to be implemented, and the User-Based Collaborative Filtering Strategy is considered.

## 3. Choice of Data-sets

The dataset for this study is called Amazon Review Data 2. Particularly, since the dataset is now reached billions of amount, the subcategory gift card will be used as an example since the overall customer record is 1547 and the amount of data retrieved is currently in the right amount of training. This fact can help to perform User-Based Collaborative Filtering in a controlled timeline.

## 4. Data Preprocessing and cleaning

The first step will be data collection and data cleaning. The raw data-set is imported directly from data-set contributors' online storage meta_Gift_Cards.json.gz 3 to Google Colab notebook. The raw database retrieved directly from the website will be shown in Table 1.

Attribute Description Example
category The category of the record ["Gift Cards", “Gift Cards”]\
tech1 tech relate to it ""
description The description of the product “Gift card for the purchase of goods…”
fit fit for its record ""
title title for the product “Serendipity 3 $100.00 Gift Card” also_buy the product also bought ["B005ESMEBQ"]\ image image of the gift card "" tech2 tech relate to it "" brand brand of the product “Amazon” feature feature of the product “Amazon.com Gift cards never expire” rank rank of the product "" also_view the product also view ["BT00DC6QU4"]\ details detail for the product “3.4 x 2.1 inches ; 1.44 ounces” main_cat main category of the product “Grocery” similar_item similar_item of the product "" date date of the product assigned "" price price of the product "" asin product asin code “B001BKEWF2” Table 1: The description for the dataset Since the attributes category, main_cat are the same for the whole dataset, they will not be valid training labels. The attributes tech1, fit, tech2, rank, similar_item, date, price have no/ extremely less filled in. That made them also invalid for being training labels. The attributes image, description and feature is unique per item and hard to find the similarity in numeric purpose and then hard to be used as labels. Therefore, only attributes also_buy, also_view, asin are trained as attributes and labels in this algorithm. Figure 1 is a shortcut for the raw database. THE RAW DATABASE The size of DATABASE : 1547 0 0 {"category": ["Gift Cards", "Gift Cards"], "te... 1 {"category": ["Gift Cards", "Gift Cards"], "te... 2 {"category": ["Gift Cards", "Gift Cards"], "te... 3 {"category": ["Gift Cards", "Gift Cards"], "te... 4 {"category": ["Gift Cards", "Gift Cards"], "te... ... ... 1542 {"category": ["Gift Cards", "Gift Cards"], "te... 1543 {"category": ["Gift Cards", "Gift Cards"], "te... 1544 {"category": ["Gift Cards", "Gift Cards"], "te... 1545 {"category": ["Gift Cards", "Gift Cards"], "te... 1546 {"category": ["Gift Cards", "Gift Cards"], "te... [1547 rows x 1 columns]  Figure 1: The raw database For the training purpose, all asins that appeared in the dataset, either from also_buy & also_view list or * asin*, have to be reformatted from alphabet character to numeric character. For example, the original label for a particular item may be called **B001BKEWF2**. It will now be reformatted to a numeric number as 0. In that case, it can be a better fit-in the next step training method and easy to track. This step will be essential since it will help the also_view and also_buy dataset to be reformatted and make sure they are reformed in the track without overlapping each other. Therefore, a reformat_asin function is called for reformatting all the asins in the dataset and is performed as a dictionary. A shortcut for the *Asin Dictionary* is shown in **Figure 2**. The 4561 Lines of Reformatted ASIN reference dictionary as following. {'B001BKEWF2': 0, 'B001GXRQW0': 1, 'B001H53QE4': 2, 'B001H53QEO': 3, 'B001KMWN2K': 4, 'B001M1UVQO': 5, 'B001M1UVZA': 6, 'B001M5GKHE': 7, 'B002BSHDJK': 8, 'B002DN7XS4': 9, 'B002H9RN0C': 10, 'B002MS7BPA': 11, 'B002NZXF9S': 12, 'B002O018DM': 13, 'B002O0536U': 14, 'B002OOBESC': 15, 'B002PY04EG': 16, 'B002QFXC7U': 17, 'B002QTM0Y2': 18, 'B002QTPZUI': 19, 'B002SC9DRO': 20, 'B002UKLD7M': 21, 'B002VFYGC0': 22, 'B002VG4AR0': 23, 'B002VG4BRO': 24, 'B002W8YL6W': 25, 'B002XNLC04': 26, 'B002XNOVDE': 27, 'B002YEWXZ0': 28, 'B002YEWXMI': 29, 'B003755QI6': 30, 'B003CMYYGY': 31, 'B003NALDC8': 32, 'B003XNIBTS': 33, 'B003ZYIKDM': 34, 'B00414Y7Y6': 35, 'B0046IIHMK': 36, 'B004BVCHDC': 37, 'B004CG61UQ': 38, 'B004CZRZKW': 39, 'B004D01QJ2': 40, 'B004KNWWPE': 41, 'B004KNWWP4': 42, 'B004KNWWR2': 43, 'B004KNWWRC': 44, 'B004KNWWT0': 45, 'B004KNWWRW': 46, 'B004KNWWQ8': 47, 'B004KNWWNG': 48, 'B004KNWWPO': 49, 'B004KNWWXQ': 50, 'B004KNWWUE': 51, 'B004KNWWYU': 52, 'B004KNWWWC': 53, 'B004KNWX3A': 54, 'B004KNWX1W': 55, 'B004KNWWZE': 56, 'B004KNWWSQ': 57, 'B004KNWX4Y': 58, 'B004KNWX12': 59, 'B004KNWX3U': 60, 'B004KNWX62': 61, 'B004KNWX2Q': 62, 'B004KNWX6C': 63...}  Figure 2: The ASIN dictionary Then the data contained in the each record’s attributes: also_view & also_buy will be reformated as Figure 3 and Figure 4. Figure 3 is about the also_view item in reformatted numeric numbers based on each item customer purchased. Figure 4 is about the also_buy item in reformatted numeric numbers based on each item customer purchased. also_view List: The first 10 lines Item 0 : [] Item 1 : [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025] Item 2 : [2026, 2027, 2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035, 922, 2036, 283, 2037, 2038, 2001, 2000, 2013, 2039, 2040, 2007, 2041, 2042, 2009, 1233, 2043, 2014, 234, 2044, 2012, 2005, 2045, 2046, 2002, 2047, 378, 2048, 1382, 2008, 2004, 2011, 2049, 2050, 2051, 2052, 2003, 2053, 2054, 2018, 2055, 2056] Item 3 : [] Item 4 : [] Item 5 : [] Item 6 : [] Item 7 : [] Item 8 : [] Item 9 : [] Item 10 : [2057, 2058, 2059]  Figure 3: The also_view list also_buy List: The first 20 lines Item 0 : [] Item 1 : [] Item 2 : [2026, 2028, 2027, 2049, 1382, 2037, 2012, 2023] Item 3 : [] Item 4 : [] Item 5 : [] Item 6 : [] Item 7 : [] Item 8 : [4224, 4225, 4226, 4227, 4228, 4229, 4230, 4231, 4232, 4233, 4234, 4235, 4236, 4237, 4238, 4239, 4240, 4241, 4242, 4243, 4244, 4245, 4246, 4247, 4248, 4249, 4250, 4251, 4252] Item 9 : [] Item 10 : []  Figure 4: The also_buy list While the also_buy list and also_view list is addressed. It is also important to know how many times a particular item appeared in other items' also view list and also buy list. These dictionaries will help to calculate the recommendation rate later. Figure 5 and Figure 6 is an example for how many times item 2000 appeared in other item’s also_view and also_buy lists. also_view dictionary: use Item 2000 as an example Item 2000 : [1, 2, 11, 12, 51, 60, 63, 65, 66, 67, 85, 86, 90, 94, 99, 100, 101, 103, 107, 108, 113, 116, 123, 126, 127, 129, 130, 141, 142, 143, 145, 146, 147, 148, 194, 199, 200, 204, 217, 221, 225, 229, 230, 231, 232, 233, 234, 235, 251, 253, 254, 260, 264, 268, 269, 270, 271, 280, 284, 285, 286, 287, 288, 294, 295, 296, 298, 299, 305, 306, 307, 308, 309, 313, 319, 327, 328, 338, 339, 344, 346, 348, 355, 356, 360, 371, 372, 377, 380, 389, 394, 406, 407, 410, 415, 440, 456, 469, 480, 490, 494, 495, 496, 502, 505, 509, 511, 512, 514, 517, 519, 520, 527, 530, 548, 591, 595, 600, 608, 609, 621, 631, 633, 670, 671, 672, 673, 675, 681, 689, 691, 695, 697, 707, 708, 709, 719, 783, 792, 793, 796, 797, 801, 803, 804, 807, 810, 816, 817, 818, 819, 836, 840, 842, 856, 892, 902, 913, 914, 917, 921, 955, 968, 972, 974, 975, 979, 981, 990, 991, 997, 998, 999, 1000, 1001, 1003, 1005, 1006, 1007, 1010, 1011, 1014, 1015, 1017, 1018, 1023, 1024, 1026, 1027, 1028, 1031, 1032, 1035, 1037, 1038, 1039, 1040, 1042, 1043, 1050, 1069, 1070, 1084, 1114, 1115, 1116, 1117, 1119, 1143, 1153, 1171, 1175, 1192, 1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1207, 1208, 1213, 1217, 1218, 1220, 1222, 1233, 1236, 1238, 1242, 1244, 1245, 1246, 1249, 1251, 1258, 1268, 1270, 1280, 1285, 1289, 1290, 1292, 1295, 1315, 1318, 1319, 1324, 1328, 1330, 1333, 1336, 1341, 1345, 1346, 1347, 1348, 1352, 1359, 1361, 1365, 1366, 1373, 1378, 1384, 1389, 1394, 1395, 1396, 1403, 1405, 1406, 1407, 1414, 1415, 1417, 1418, 1419, 1420, 1423, 1424, 1426, 1427, 1430, 1431, 1432, 1433, 1434, 1437, 1443, 1453, 1454, 1455, 1457, 1458, 1462, 1463, 1464, 1467, 1468, 1469, 1470, 1472, 1474, 1475, 1477, 1478, 1480, 1481, 1482, 1486, 1488, 1492, 1496, 1497, 1498, 1499, 1500, 1501, 1504, 1505, 1506, 1508, 1509, 1512, 1513, 1514, 1515, 1523, 1530, 1533, 1537, 1539, 1546]  Figure 5: The also_view dictionary also_buy dictionary: use Item 2000 as an example Item 2000 : [217, 231, 235, 236, 277, 284, 285, 286, 287, 306, 307, 308, 327, 359, 476, 482, 505, 583, 609, 719, 891, 922, 963, 1065, 1328, 1359, 1384, 1399, 1482, 1483, 1490, 1496, 1497, 1499, 1509, 1512, 1540]  Figure 6: The also_buy dictionary ## 5. Recommendation Rate and Similarity Calculation While all the dictionaries and attributes-label relationship are prepared in Part4, the recommendation rate calculation is addressed in this part. There are two types of similarity methods in this algorithm: Cosine Similarity and Euclidean Distance Similarity that perform the similarity calculation. Before calculating the similarity, the first step would be phrasing the recommendation rate for each item to another item. The Figure 8 is a shortcut for the recommendation rate matrix. It will use the logic in Figure 7. for item in the asin list: for asin in the also_view dictionary: if asin is founded in also_view dictionary[item] list: score for this item increase 2 each item in the also_view_dict[asin]'s score will be also increase 2 for asin in the also_view dictionary: if asin is founded in also_view dictionary[item] list: score for this item increase 10 each item in the also_view_dict[asin]'s score will be also increase 10 for other scores which is currently 0, assigned the average value for it return the overall matrix for the further step  Figure 7: The sudocode for giving the recommendation rate for the likelyhood of the next purchase item based the current purchase Item 0 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 1 : [13.0, 52, 28, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 2, 2, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0, 13.0 ...] Item 2 : [29.5, 28, 182, 29.5, 29.5, 29.5, 29.5, 29.5, 29.5, 29.5, 29.5, 4, 2, 29.5, 29.5, 29.5, 29.5, 29.5, 29.5, 29.5 ...] Item 3 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 4 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 5 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 6 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 7 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 8 : [14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 290, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5, 14.5 ...] Item 9 : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ...] Item 10 : [1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 6, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5 ...]  Figure 8: The shortcut for recommenation rate matrix The first similarity method implemented is Cosine Similarity 4. It will use the cosine of the angle between vectors(see Figure 9) to address the similarity between different items. By implementing this method with sklearn.metrics.pairwise package, it will rephrase the whole recommendation similarity as table 2. Figure 9: The cosine similarity item 0 1. 2 3 4 5 6 7. 8. …1547 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 1 0. 0. 0.928569 0. 0. 0. 0. 0. 0.873242 2 0. 0. 0.928569 0. 0. 0. 0. 0. 0. table 2: The shortcut for using consine similarity to address the recommendation result The second similarity method implemented is Euclidean Distance Similarity5. It will use Euclidean Distance to calculate the distance between each items as a way to calculate similarity (see Figure 10). By implementing this method with scipy.spatial.distance_matrix package, it will rephrase the whole recommendation similarity. Figure 11 is an example with the item 1. Figure 10 The Euclidean Distance calculation Item 1 : [1005.70671669 0. 1370.09142031 1005.70671669 1005.70671669 1005.70671669 1005.70671669 1005.70671669 710.89169358 1005.70671669 905.23339532 862.88933242 971.0745337 1005.70671669 1005.70671669 1005.70671669 1005.70671669 1005.70671669 1005.70671669 1005.70671669...]  Figure 11: The Euclidean Distance similarity example of item 1 ## 6. Accuracy The accuracy for the consine_similarity and euclidean distance similarity with the number of items already purchased is shown as Figure 12. Here the blue line represented the cosine similarity, and the red line represented the euclidean distance similarity. As presented, the more item joined as purchased, the less likely both similarity algorithms accurately locate the next item the customer may want to purchase next. However, overall the consine_similarity performed better accuracy compared to Euclidean Distance similarity. In Figure 13, both accurate number and both wrong number is addressed. Both wrong numbers changed dramatically after more already-purchased items joined. This fact convinces the prior statement: this algorithm works better when the given object is 1 but can’t handle many purchased item. Figure 12: The Cosine similarity and Euclidean Distance Accuracy Comparison Figure 13: The bothright and bothwrong accuracy comparison ## 7. Benchmark The Benchmark for each step for the project is stated in Figure 14 The overall Time spent is affordable. The accuracy calculation part(57s) and the Euclidean Distance algorithm implementation(74s) have taken the majority of time for the running. The Accuracy time consumed would be considered proper since it will randomly assign one to ten items and perform recommendation items based on it. The time spent is necessary and should be considered normal. The Euclidean Distance algorithm would be considered making sense since it is trying to perform the difference in two 1547X1547 matrixs. Name Status Time Data Online Downloading Process ok 1.028 Raw Database ok 0.529 Database Reformatting process ok 0.587 Recommendation Rate Calculation ok 1.197 Consine_Similarity ok 0.835 Euclidean distance ok 73.895 Recommendation strategy showcase-Cosine_Similarity ok 0.003 Recommendation strategy showcase-Euclidean distance ok 0.004 Accuracy ok 57.119 Showcase-Cosine_Similarity ok 0.003 Showcase-Euclidean distance ok 0.003 Figure 14: Benchmark The time comparison for Cosine Similarity and Euclidean Distance Time Comparison is addressed in Figure 15 As stated, the euclidean distance algorithm has taken much more time than cosine similarity. Therefore, the cosine similarity should be considered as efficient in these two similarities. Figure 15: The Cosine Similarity and Euclidean Distance Time Comparison ## 8. Conclusion This project Big_tata_Application_in_E_commense.ipynb 6 is attempted to get a taste of the recommendation strategy based on User-Based Collaborative Filtering. Based on this attemption, the two similarity methods: Cosine Similarity and Euclidean Distance are addressed. After analyzing accuracy and time consumption for each method, Cosine Similarity performed better in both the accuracy and implementation time. Therefore the cosine similarity method is recommended to use in the recommendation algorithm strategies. This project should be aware of Limitations. Since the rating attribute is missing in the dataset, the recommendation rate was assigned by the author. Therefore, in real-world implementation, both methods' accuracy can be expected to be higher than in this project. Besides, the cross-section recommendation strategies are not implemented. This project is only focused on the gift card section recommendations. With the multiple aspects of goods customer purchases addressed, both methods' accuracy can also be expected to be higher. ## 9. Acknowledgements The author would like to thank Dr. Gregor Von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article. ## 10. References 1. Build a Recommendation Engine With Collaborative Filtering. Ajitsaria, A. 2020 https://realpython.com/build-recommendation-engine-collaborative-filtering/ ↩︎ 2. Justifying recommendations using distantly-labeled reviews and fined-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019 http://jmcauley.ucsd.edu/data/amazon/ ↩︎ 3. meta_Gift_Cards.json.gz http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz ↩︎ 4. Recommendation Systems : User-based Collaborative Filtering using N Nearest Neighbors. Ashay Pathak. 2019 https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0 ↩︎ 5. Similarity and Distance Metrics for Data Science and Machine Learning. Gonzalo Ferreiro Volpi. 2019 https://medium.com/dataseries/similarity-and-distance-metrics-for-data-science-and-machine-learning-e5121b3956f8 ↩︎ 6. Big_tata_Application_in_E_commense.ipynb https://github.com/cybertraining-dsc/fa20-523-339/raw/main/project/Big_tata_Application_in_E_commense.ipynb ↩︎ # 30 - Residential Power Usage Prediction We are living in a technology-driven world. Innovations make human life easier. As science advances, the usage of electrical and electronic gadgets are leaping. This leads to the shoot up of power consumption. Weather plays an important role in power usage. Even the outbreak of Covid-19 has impacted daily power utilization. Similarly, many factors influence the use of electricity-driven appliances at homes. Monitoring these factors and consolidating them will result in a humungous amount of data. But analyzing this data will help to keep track of power consumption. This system provides a prediction of usage of electric power at residences in the future and will enable people to plan ahead of time and not be surprised by the monthly electricity bill. Status: final, Type: Project Siny P. Raphel, fa20-523-314, Edit Keywords: power usage, big data, regression ## 1. Introduction Electricity is an inevitable part of our day-to-day life. The residential power sector consumes about one-fifth of the total energy in the U.S. economy1. Most of the appliances in a household use electricity for its working. The usage of electricity in a residence depends on the standard of living of the country, weather conditions, family size, type of residence, etc2. Most of the houses in the USA are equipped with lightings and refrigerators using electric power. The usage of air conditioners is also increasing. From Figure 1, we can see that the top three categories for energy consumption are air conditioning, space heating, water heating as of 2015. Figure 1: Residential electricity consumption by end use, 20153. Climate change is one of the biggest challenges in our current time. As a result, temperatures are rising. Therefore, to analyze energy consumption, understanding weather variations are critical4. As the temperature rises, the use of air conditioners is also rising. As shown in Figure 1, air conditioning is the primary source of power consumption in households. The weather change has also resulted in a drop in temperatures and variation in humidity. These results in secondary power consumers. Even though weather plays an important role in power usage, factors like household income, age of the residents, family type, etc also influence consumption. During the holidays' many people tend to spend time outside which reduces power utilization at their homes. Similarly, during the weekend, since most people have work off, the appliances will be frequently consumed compared to weekdays when they go to work. Our world is currently facing an epidemic. Most of the countries had months of lockdown periods. Schools and many workplaces were closed. People were not allowed to go out and so they were stuck in their homes. As a result, power expending reduced drastically everywhere other than residences. But during the lockdown period, the energy consumption of residences spiked. Most of the electric service providers like Duke, Dominion provide customers their consumption data so that customers are aware of their usages. Some providers give predictions on their future usages so that they are prepared. ## 2. Reason to choose this dataset There were many datasets on residential power usage analysis in Kaggle itself. But most of them were three or four years old. This dataset has the recent data of power consumptions together with weather data of each day. Since the pandemic hit the world in 2019-2020, the availability of recent data is considered to be significant for the analysis. This dataset is chosen because, 1. It has the latest power usage data - till August 2020. 2. It has marked covid lockdown, vacations, weekdays and weekends which is a challenge for the prediction. ## 3. Datasets This project is based on the dataset, Residential Power Usage 3 years data in Kaggle datasets5. The dataset contains data of hourly power consumption of a 2 storied house in Houston, Texas from 01-06-2016 to August 2020 and also weather conditions of each day like temperatures, humidity wind etc of that area. Each day is marked whether it is a weekday, weekend, vacation or COVID-19 lockdown. The project is intending to build a model to predict the future power consumption of a house with similar environments from the available data. Python6 is used for the development and since the expected output is a continuous variable, linear regression is considered for the baseline model. Later the performance of the base model is compared to one or two other models like tuned linear regression, gradient boosting, Light Gbm, or random forest. Data is spread across two csv files. • power_usage_2016_to_2020.csv This file depicts the hourly electricity usage of the house for three years, from 2016 to 2020. It contains basic details like startdate with hour, the value of power consumption in kwh, day of the week and notes. It has 4 features and 35953 instances. Figure 2: First five rows of power_usage_2016_to_2020 data Figure 2 provides a snapshot of the first few rows of the data. Day of the week is an integer value with 0 being Monday. The column notes layout details like whether that day was weekend, weekday, covid lockdown or vacation, as shown in Figure 3. Figure 3: Details in notes column • weather_2016_2020_daily.csv The second file or the weather file imparts the weather conditions of that particular area on each day. It has 19 features and 1553 instances. Figure 4 is the snapshot of the first few rows and columns of this file. Figure 4: First few rows of weather_2016_2020_daily data Each feature in this data has different units and the units of the features are given in Table 1. Table 1: Feature units Feature names Units Temperature F deg Dew Point F deg Humidity %age Wind mph Pressure Hg Precipitation inch Weather file has additional features like date and day of the date. ## 4. Data preprocessing The data has to be preprocessed before modelling for predictions. ### 4.1 Data download and load The data in this project is directly downloaded from Kaggle. The downloaded file is then unzipped and loaded to two dataframes using python codes. For more detailed explanation and codes for download and load of data, see python code Download datasets and Load datasets sections. ### 4.2 Data descriptive analysis The data loaded has to be analyzed properly before it can be preprocessed. An analysis is made on the existence of missing values, the range of each feature, etc. On analysis, it is determined that there are no missing values and the date format in both tables is different. The StartDate feature of the power_usage dataset and Date feature of the weather dataset is to be used as a key to merge the two datasets. But the format of both features is different. StartDate feature is the combination of date and hour. Whereas, Date feature of weather is just the date. Hence, these issues will have to be taken care of before merging data. ### 4.3 Preprocessing data In this step, the column name Values (kWh) is renamed to Value and also date format issue is addressed. Firstly, StartDate column is split into Date and Hour columns. Since the StartDate column is in Pandas Period type, the function strftime() is used for converting to the required format. ### 4.4 Merge datasets For proper analysis of data, it is critical that the analyst should be able to analyze the relationships of each feature concerning the target feature(Value in kWh in this project). Therefore, both power_usage and weather tables are merged with respect to the Date column. The resulting table has a total of 35952 instances and 22 features. ## 5. Exploratory Data Analysis Here we analyze different features, their relationship with each other, and with the target. Figure 5: Average power usage by day of the week In Figure 5, the average power usage by the day of the week is plotted7. It is analyzed that Saturday and Friday have the most usage compared to other days of the week. Since the day of the week represents values Sunday-Saturday, we can consider it as a categorical feature. Similarly, from Figure 6, there is a huge dip in power usage during vacation. Other three occasions like covid lockdown, weekend and weekdays have almost the same power usage, even though consumption during weekends outweigh. Figure 6: Average power usage by type of the day In Figure 7, we compare the monthly power consumption for three years - 2018, 2019, 20208. The overall power usage in 2019 is less compared to 2018. But in 2020 may be due to Covid-lockdown the power consumption shoots. Also, power consumption peaks in the months of June, July, and August. Figure 7: Average power usage per month for three years Figure 8: Correlation plot between features The correlation plot in Figure 8, depicts the inter-correlation between features. We can see that features like temperature, dew and pressure has a high correlation to our target feature. Also, different temperatures and dew features are inter-correlated. Therefore, all the intercorrelated features except for temp_avg can be dropped during feature selection. ## 6. Modeling Modeling of the data includes splitting data into train and test, include cross-validation, create pipelines, select metrics for measuring performance, run data in regression models, and discuss results. ### 6.1 Split Data For measuring the accuracy of the model, the main data is split into train and test. 20% of data is selected as test data and the remaining 80% is the train data. The proportion of notes(vacation, weekday, weekend, and covid lockdown) are different. Therefore, we stratify the data according to the notes column. After the split, train data has 28761 rows and test data has 7191 rows. ### 6.2 Pipelines Categorical variables and numeric variables are separated and processed in pipelines separately. Categorical features are one hot encoded before feeding to the model. Similarly, numerical features are standardized before modeling. Later these two pipelines are joined and modeled used Linear regression and other models. ### 6.3 Metrics Our target is a continuous variable and hence we implement regression models for prediction. To determine how accurate a regression model is, we use the following metrics. #### 6.3.1 Mean squared error(MSE) MSE is the average of squares of error. The larger the MSE score, the larger the errors are. Models with lower values of MSE is considered to perform well. But, since MSE is the squared value, the scale of the target variable and MSE will be different. Therefore, we go for RMSE values. #### 6.3.2 Root mean squared error(RMSE) RMSE is the square root of MSE scores. The square root is introduced to make the scale of the errors to be the same as the scale of targets. Similar to MSE, the lower scores for RMSE means better model performance. Therefore, in this project, the models with lower RMSE values will be monitored9. #### 6.3.3 R-Squared(R2) Score R2 score is the goodness-of-fit measure. It’s a statistical measure that ranges between 0 and 1. R2 score helps the analyst to understand how similar the fitted line is to the data it is fitted to. The closer it is to one, the more likely the model predicts its variance. Similarly, if the score is zero, the model doesn’t predict any variance. In this project, the R2 score of the test data is calculated. The model with the highest R2 scores will be considered9. ### 6.4 Baseline Linear Regression model We use linear regression as our baseline model. For the baseline model, we are not hyperparameter tuning. For the baseline model, the train RMSE score was 0.6783, and R2 for the test set was 0.4460. These values are then compared to other regression models with hyperparameter tuning. ### 6.5 Other regression models After developing a baseline model, we are developing four other regression models and comparing the results. We implement feature selection and hyperparameter tuning. As we analyzed in exploratory data analysis, some features have strong inter-correlation and these features are dropped. The parameters for the regression models are hyper tuned and modeled in GridsearchCV of sklearn package. The models used for prediction are: • Linear regression with hyperparameter tuning • Gradient boosting • XGBoost • Light GBM Similar to the baseline model, the metrics like train RMSE, test RMSE, and test R2 scores are calculated. ### 6.6 Results Figure 9: Performance of all the regression models Figure 9 documents the performance of all the regression models used. cloudmesh.common benchmark and stopwatch framework are used to monitor and record the time taken for each step in this project10. Time taken for critical steps like downloading data, loading data, preprocessing data, training and predictions of each model are recorded. The StopWatch recordings are shown in Table 2. StopWatch recordings played an important role in the selection of the best model. Benchmark also provides a detailed report on the system or device information as shown in Table 2. Table 2: Benchmark results Attribute Value BUG_REPORT_URL https://bugs.launchpad.net/ubuntu/" DISTRIB_CODENAME bionic DISTRIB_DESCRIPTION “Ubuntu 18.04.5 LTS” DISTRIB_ID Ubuntu DISTRIB_RELEASE 18.04 HOME_URL https://www.ubuntu.com/" ID ubuntu ID_LIKE debian NAME “Ubuntu” PRETTY_NAME “Ubuntu 18.04.5 LTS” PRIVACY_POLICY_URL https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" SUPPORT_URL https://help.ubuntu.com/" UBUNTU_CODENAME bionic VERSION “18.04.5 LTS (Bionic Beaver)” VERSION_CODENAME bionic VERSION_ID “18.04” cpu_count 2 mem.active 1.0 GiB mem.available 11.2 GiB mem.free 8.5 GiB mem.inactive 2.6 GiB mem.percent 11.6 % mem.total 12.7 GiB mem.used 1.9 GiB platform.version #1 SMP Thu Jul 23 08:00:38 PDT 2020 python 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0] python.pip 19.3.1 python.version 3.6.9 sys.platform linux uname.machine x86_64 uname.node a1f46a7ed3c2 uname.processor x86_64 uname.release 4.19.112+ uname.system Linux uname.version #1 SMP Thu Jul 23 08:00:38 PDT 2020 user collab Name Status Time Sum Start tag Node User OS Version Data download ok 2.652 2.652 2020-11-30 12:43:50 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Data load ok 0.074 0.074 2020-11-30 12:43:53 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Data preprocessing ok 67.618 67.618 2020-11-30 12:43:53 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Baseline Linear Regression ok 2.814 2.814 2020-11-30 12:45:03 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Linear Regression ok 5.581 5.581 2020-11-30 12:45:06 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Gradient Boosting ok 244.868 244.868 2020-11-30 12:45:12 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 XGBoost ok 2946.7 2946.7 2020-11-30 12:49:16 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Light GBM ok 770.967 770.967 2020-11-30 13:38:23 a1f46a7ed3c2 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 For the baseline model, the RMSE values were high and R2 scores were small compared to all other regression models. The hyperparameter tuned linear regression model scores are better compared to the baseline model. But the other three models outweigh both linear models. XGBoost has the lowest RMSE and highest R2 score of all other models. But the time taken for execution is too long. Therefore, XGBoost is computationally expensive which leads us to ignore its scores. Gradient boosting and Light GBM have similar scores and hence the time taken for execution has to be considered as the deciding factor here. Gradient boosting completed 135 fits in 244.868 seconds whereas LightGBM took around 770.967 seconds for executing 3645 fits and then prediction. Since per fit execution time for Light GBM is too small, we consider Light GBM as the best model for predicting daily power usage of a residence with similar background conditions. The RMSE scores for Light GBM are .2896 for train and .2910 for the test. The R2 score for the test set is .6526. ## 7. Conclusion As the importance of electricity is increasing, the need to know how or where the power usage increase will be a lifesaver for the electricity consumers. In this project, the daily power consumption of a house is analyzed and modeled for a prediction of electricity usage for residences with similar environments. The model considered a set of parameters like weather conditions, weekdays, type of days, etc. for prediction. Since the output is power consumption in kWh, we selected regression for modeling and prediction. Experiments are conducted on five regression models. After analyzing the experiment results, we concluded that the performance of the Light GBM model is better and faster compared to all other models. ## 8. Acknowledgments The author would like to express special thanks to Dr. Geoffrey Fox, Dr. Gregor von Laszewski, and all the associate instructors of the Big Data Applications course (FA20-BL-ENGR-E534-11530) offered by Indiana University, Bloomington for their continuous guidance and support throughout the project. ## 9. References 1. Jia Li and Richard E. Just, Modeling household energy consumption and adoption of energy efficient technology, Energy Economics, vol. 72, pp. 404-415, 2018. Available: https://www.sciencedirect.com/science/article/pii/S0140988318301440#bbb0180 ↩︎ 2. Domestic Power Consumption, [Online resource] https://en.wikipedia.org/wiki/Domestic_energy_consumption ↩︎ 3. Use of energy explained - Energy use in homes, [Online resource] https://www.eia.gov/energyexplained/use-of-energy/electricity-use-in-homes.php ↩︎ 4. Yating Li, William A. Pizer, and Libo Wu, Climate change and residential electricity consumption in the Yangtze River Delta, China, Research article, Available: https://www.pnas.org/content/116/2/472#ref-1 ↩︎ 5. Residential Power Usage dataset, https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries ↩︎ 6. Residential Power Usage Prediction script, https://github.com/cybertraining-dsc/fa20-523-314/blob/main/project/code/residential_power_usage_prediction.ipynb ↩︎ 7. seaborn: statistical data visualization, https://seaborn.pydata.org/index.html ↩︎ 8. Group by: split-apply-combine, https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html ↩︎ 9. Mean Square Error & R2 Score Clearly Explained, [Online resource] https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/ ↩︎ 10. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, https://github.com/cloudmesh/cloudmesh-common ↩︎ # 31 - Big Data Applications in the Gaming Industry Gaming is one of the fastest growing aspects of the modern entertainment industry. It’s a rapidly evolving market, where trends can change in a near instant, meaning that companies need to be ready for near anything when making decisions that may impact development times, targets and milestones. Companies need to be able to see market trends as they happen, not post factum, which frequently means predicting things based off of freshly incoming data. Big data is also used for development of the games themselves, allowing for new experiences and capabilities. It’s a relatively new use for big data, but as AI capabilities in games are developed further this is becoming a very important method of providing more immersive experiences. Last use case that will be talked about, is monetization in games, as big data has also found a use there as well. Status: final, Type: Report Aleksandr Linde, fa20-523-340, Edit Keywords: gaming, big data, product development, computer science, technology, microtransactions, artifician intelligence ## 1. Introduction The video game market is one of the fastest growing aspects of the modern entertainment industry, and in 2020 brought in 92 billion USD worldwide out of a total worldwide entertainment market value income of 199.64 billion USD. With global player count reaching 2.7 billion users, more and more people choose to spend some of their leisure time behind a controller or a keyboard 12. This phenomenon isn’t exactly a new thing. Originally getting its start in 1972 with the release of the Magnavox Odyssey, the first home console with replaceable cartridges. The scale of this achievement was hardly recognized, as back then if you wanted to play something, the only option was arcades. Arcades were a social experience but being able to play the exact same titles, if slightly downgraded, at home was a breakthrough. These first gen consoles were quite clunky and by modern standards unimpressive, yet they were the vital first step for birthing what we now know today as the video game market. At that point games had been a thing for a around a decade, but they were primarily limited to a group of computer hobbyists, who would exchange copies of homebrew software amongst themselves. In this format it would be impossible to get any sort of mainstream popularity. With time, home consoles had changed this dramatically. Games become a mainstream phenomenon, which means that suddenly the potential player pool is a lot larger, and as a result we see an explosion in the popularity of gaming. ### 1.1 Market Expansion & Segmentation Over the decades, this market has grown into a massive global phenomenon, becoming one of the primary forms of media alongside film, music, and art. Thanks to all of this we have seen 3 distinct market segments emerge. 1. Personal Computers – (Laptops and desktops) 2. Consoles – (Nintendo Switch, 3DS, Xbox, Playstation) 3. Mobile Phones – (Anything with the Google Play store and IOS Appstore) Each segment has some interesting specifics. Mobile games account for 33% of all app downloads, 74% of all mobile consumer spending and 10% of all raw time spent in apps. By the end of 2019 the amount of people who played mobile games topped 2.4 billion 3. An important reason for this is accessibility, since mobile phones as of today, are the most commonly bought tech item in the world. In many developing nations, mobile phones are a commonplace piece of tech that is owned by the majority of adults 4.As the global population increases in wealth and size, this trend is only set to increase. Meaning that any company that ignores the mobile phone market is loosing out of massive sums of money. The same is true, but for a lesser scale, in personal computers, meaning that as the population grows, the PC market will expand as well. This is less true for consoles but machines from the last console generation sold a combined 221 million units, with a yearly revenue of 48.7 Billion in 2019 5. Far cry from what mobile games make, but still rather significant, spelling good fortunes for the health of the industry in the coming years. As a result of all this, it is only natural that companies will focus more and more of their attention on the developing world for expansion This is where big data can help significantly, ## 2. Big data in mobile gaming spaces. A big reason for the integration of big data analytics into mobile games comes from the intelligence edge that it provides you in a competitive market space. Being able to track Key Performance Indicators, or KPI’s for short allows you to rapidly shift strategies to fix growing problems as soon as you see them. For example, customer retention is one of, if not the most important metric in any gaming product, but or mobile gaming this is especially critical since gaming attention spans are getting shorter, and in a market where attention spans are already low from the get-go, hemorrhaging customers can spell doom for any app 6. Thus, an important big data application is the CLTV – customer lifetime analysis, which tracks customer retention and average return per customer on a specific platform. Machine learning in this case factors in a lot of different variables that effect just how long the user will end up using your app before putting it down to go do something else. Reducing the churn, or the rate at which you intake and expel customers is thus a key business consideration when bringing a new gaming product to market 7. This is more important for mobile game customers since the ROI per user on mobile platforms is generally lower, thus maximizing user count rapidly becomes an extremely important tactic to get a self-sustaining userbase that can bring in stable profits 8. But how are games monetized in the first place, and how does big data play into it? ### 2.1 Mobile game monetizaton Most mobile games operate on the freemium model, where the base game is free, but you pay for things such as boosts, upgrades and cosmetics 9. Systems such as this still allow you to earn most things in the game normally, yet make it prohibitively difficult to do so, requiring a large time investment. What happens often in this case is that players will spend money to save time and effort that would have normally gone into grinding out these items for free [9]. So why is this effective? Because the initial entry barrier is quite literally nonexistent and the main monetization fees don’t actually cost that much at first, making up mostly 5-10$ purchases10. This doesn’t seem like much, but this eventually reaches into the sunk cost fallacy where users get so ingrained in an ecosystem and simply don’t want to leave. Mobile monetization platform Tapjoy recently used big data analysis to identify 5 different categories of mobile users 11. The one that interests us the most is the whales. Whales are called as such due to how despite making up only 10% of a game’s population, they will usually be responsible for 70% of the cash flow from games. Many developers work specifically to design systems that aren’t exactly fun but work more to trap players like this within the ecosystem. While we may not agree with this rather predatory practice, its another big data application to be cognizant of. It works off the same principle that gambling does, via enticing potential rewards that don’t actually have publicly avaliable information about your actual chance to win. When the positive effect happens, it’s usually minor, but still works like a Skinner box trigger where the user gets the positive feedback that further keeps them in the churn loop 12. Is it scummy? Many players think so but using big data in this case allows us to hyper target these users. Big data analytics generate user reports that show directly what users are most susceptible, how to reach them and most importantly how to stick them into the endless loop where they keep dumping more and more money into a game that many don’t really even love anymore 13 . For companies this is great, since it earns you a customer who is guaranteed to stay and dump money in a marketplace where getting the average user to pay even 5$for anything is already a feat that many cant handle. Using big data lets developers and companies be 10 steps ahead of potential users. By the time they realize they are addicted its far too late. Similar systems are making their way into our desktop and console spaces as well. ### 2.2 Profits from Monetization. In fact, microtransactions that function this way can now be found on every platform and genre, all due to how insanely profitable it is. This year, gaming industry valuations rose 30% because of the absurd amounts of money that get pulled in via microtransactions 14. All of this, possible only due to the massively increasing use of big data analytical tools. Why make a good game, when you can hyper tailor the monetization so that players are guaranteed to stop caring about how good your game actually is when they get sufficiently sucked in enough. This trend is only increasing since its predicted that by 2023, 99% of all game downloads will be free to play with various forms of microtransaction based monetization 14. However, its not all doom and gloom. Big data can also be used for some other really interesting applications when it comes to developing the game itself, and not just the predatory monetization methodology. ## 3. Big Data and AI Development on In-Game AI Systems Last summer, the world of online poker had quite the shock when a machine learning algorithm beat 4 seasoned poker vets 15. But poker is actually a fairly simple game, so why is this important? It’s a big deal due to how most games feature AI in one form or another. When one plays strategy games, they can usually immediately tell if they are playing versus an AI or vs an actual player due to how current AI tech still isn’t nearly as good as a regular player. However, in simple games like poker, you can make systems that almost perfectly copy human behavior. All that is required is that they are trained on the users of the game, and then become nearly indistinguishable from a regular player. This opens up massive new possibilities, because soon we will be able to mimic whole players so that even games that are functionally dead due to lower player counts can still have users enjoy content that was made for multiplayer and such. Plus, just having smarter AI for non-player characters and enemies would be a nice touch. Currently most games that have AI opponents function on a system that is called a Finite State Machine 16. Systems like this have a strict instruction set that they can’t really deviate from, nor make new strategies.This causes everythign to feel scripted and dumb, yet this is also a fairly lightweight method of ai control. FSM can be refined into what is called a Monte Carlo Search Tree (MCST) algorithm, where computers will on their own, make decision trees based on the reward value of the tree endpoint. MCST’s can become massive, so in order to cut down on the sheer amount of processing power that is needed, developers will make the AI randomly select a few paths, which will then in turn be implemented as actions. This cut down version adds the randomness that players expect from other players, but also removes a lot of rigidity of traditional AI systems 16. Using machine learning models on real player actions allows MCST models to be really close to how an actual player acts in specific situations. However, training machine learning models on players also has another, really interesting purpose – dual AI systems. ### 3.1. AI implementation Some games, feature AI that is divided into two separate algorithms, the director, and the controller. Director AI, has only one objective, make the game experience as enjoyable as possible. It is a macro level passive AI that bases game triggers and events off of player action. For example, causing random noises when it detects player stress via their control inputs17. This means that the system can detect whether it needs to up the ante on what is actually happening in game, introduce new enemies, change environmental effects etc. This molds the experience into a completely unique system that learns from every player that has ever played the game. All of this data goes into crafting emerging experiences that can’t be replicated via a rigid AI system. But this is only one piece of the puzzle as the other component of this system is the controller AI. Controller AI is the sidekick that the director AI uses to help immerse the player. We will explain how it works in detail, but what is important is that the controller AI is ultimately subordinate to the director. In a normal AI system, the algorithm knows everything, can see everything and will pretend it has no idea about what the player is doing, yet is ultimately aware of their actions. This system is the easiest to make, but also can seem to players like the AI is cheating or being exceedingly stupid at times. Let’s use the example of an alien hunting down a player. Normally, the AI would just head to the players location and just see them once they enter detection range. What two tier systems do is limit the information flow to the controller AI from the director AI. The director sees all, but it limits what the controller can visualize. Instead of saying, Go to area, find player in location, attack player, what the controller gets as input is – Go to area and look for player. The director ai can set the area as anything. But what matters is that the end alien cannot actually see everything that’s happening. What it does is learn from the player in previous encounters. Early on in the game, it might just do a cursory pass around an area and leave, while later on, it might have found the player in a specific location and thus will now check for them in places like it. The alien learns as it goes, and the beauty of this system is that if you select a higher difficulty, then the game can just draw upon cloud data to fill in that early learning steps, dropping in smarter, more intelligent enemies much earlier in the game 1617. This is only possible via big data analysis. Any other solution would mean either massive amounts of very rigid code, or very blatant information cheating. Best example of this is Alien: Isolation a 2014 title by Creative Assembly, that impressed the gaming world with just how scary the AI implementation was. In our opinion this is the best example of a two tier system so far. Systems like this are only possible with big data implementation and the proper ML algorithms. As games advance to more and more realistic worlds, this approach, in our opinion is going to become the norm, since it allows for really flexible systems that are engaging to play around with. However, this isn’t the only way developers can make systems fairer and fun for the end user. Another area where big data is gaining lots of traction is level design. ## 4. Level Design and Balance, An Unlikely Big Data Application When crafting virtual worlds its always difficult to strike a balance between breadth and scope. A common approach is to hand craft levels in games that do not require that much breadth. This approach has great benefits in terms of the attention to detail and depth of narrative, yet crafting each level by hand takes a while. Big data in this case helps with testing, since previously playtesting had players run the map time after time. Now, you can run hundreds of models in addition to the players, compare their paths, and find the best solution for what would increase player enjoyment. However, the main drawback of this approach is its great cost for the developer. Much moreso than just using a script to autogenerate terrain. Some developers will focus entirely on auto-generation as a means of level design instead, which can provide near infinite possibilities for content (see Minecraft), but has the danger of being exceedingly bland to the end user. Big data analytics are a way of alleviating this via looking at what terrain players prefer more, and thus adjusting level generation parameters accordingly. Thus, removing some of the randomness that such worlds depend on. However, this isn’t the main use of such analytics. A big reason why developers collect massive amounts of data about levels is actually balance issues. This is especially true for Esports titles that dominate the current games markets. Titles where minute advantages in map design get exploited to their fullest extent 18. Using data collected from tens of thousands of matches, we can see what paths are most taken, what are optimal firing angles (most esports titles are shooters) and any detail that can give players a leg up over the competition. This allows maps to be fine tuned to produce the most memorable player experiences, feel hand crafted and also be as balanced as possible. ### 4.1. Balance Balance isn’t just a competitive thing however, multiplayer games live and die due to balance, as unbalanced gameplay drives away players, leaving only people who are ok with it, which in turn drives away new players 19. Ultimately this increases player churn, and you end up with a dead game 20. Normally you would just use player feedback, but if you have mountains of raw data, ML and big data can also allow you to really fine tune specific aspects of balance. This isn’t only limited to maps, but also skills, abilities and puzzles. Balance has always been a really fine line between enjoyment and fairness yet it’s the developer’s job to ensure that no one is left out in the cold on purpose. Some genre’s are more balance intensive and thus require more data to make things fair even when taking into randomness into action. A great example are MOBAs, Multiplayer Online Battle Arenas, where being down even 1 player can cause a team to get crushed in a 4v5 matchup. A single person leaving completely changes the dynamic of the game, and without big data its hard to compensate for events like that, since they are at their core, massively unbalanced. But being able to account for literally any situation, allows developers to craft systems that can handle disbalance better in these cases 21. Anything that includes player vs player is very difficult to balance for as well. A good use for big data in this case would be truly fair matchmaking. Currently most systems of this sort use only things like win rate and total playtime, in order to pair up players. However, if we start watching for minute flags on how well players actually play, we can make systems that are specifically made to balance players as much as possible by looking at the minute details of how they actually play. ## 5. What the future holds One thing we can definitely count on is the sheer amount of data that is going to be generated from now on by gaming. With always online experiences being the norm, machines send data reports back to the main system which matches them with other players in the same instance to allow playing in the same world. As more and more players enter the community, we will run into the issue of where an absurdly massive amount of data is being generated, and not all of it is positive. Its going to get harder to get the full picture since now looking over all the data would make the task of sifting through it extremely difficult. What is most exciting is the developments of self-learning AI based on the mountains of this newfound data. Currently unless you use two tier AI systems, there isn’t a way to make AI believable, and having bad AI can potentially ruin a game for most players. With more advancements in reliable AI systems based on potent data analysis, there exists the potential for AI that is near indistinguishable from players 22. ## 6. Conclusion Gaming has gone from a very niche and simple pastime to one of the biggest entertainment markets in the world. From mobile phones, to supercharged desktop setups, the field is expansive to the point where there is something for anyone who wants to have blow a couple of hours behind a screen. The speed at which the field is developing is truly astonishing, and a good portion of it is being driven by the previously discussed Big Data use cases. By being able to use all this data for developmental purposes, game companies and publishers can craft truly memorable and interesting experiences.It has been demonstrated that big data analytics is having some very profound effects on the video game markets from pretty much every side. The end goal of this developemnt is systems that can predict nearly any game situation and adjuist parameters accordingly to maximise player enjoyment. It means, more advanced AI systems that feel as lifelike as humanly possible that can populate virtual worlds. It means systems that can properly react to player inputs in order to create dynamic worlds that aren’t just a static canvas that a player is thrust into upon booting up the game. This may seem far off, especially with how a lot of development teams dont exactly use Big Data in the right way, but the best examples are getting quite close in its implementation. ## Acknowledgements I would like to thank Dr Gregor von Laszewski for helping me despite me taking a long time with quite a few assignments. I want to thank the AI team for their massive help with any issues that arose in during the class period, as well as the excellent avaliablity of their help whenever it was needed. Would like to thank Dr. Geoffery Fox for his very informative and thorough class. ## 7. Refernces 1. The Average Gamer: How the Demographics Have Shifted. (n.d.). Retrieved December 08, 2020, from https://www.gamesparks.com/blog/the-average-gamer-how-the-demographics-have-shifted/ ↩︎ 2. Nakamura, Y. (2019, January 23). Peak Video Game? Retrieved December 08, 2020, from https://www.bloomberg.com/news/articles/2019-01-23/peak-video-game-top-analyst-sees-industry-slumping-in-2019 ↩︎ 3. Kaplan, O. (2019, August 22). Mobile gaming is a$68.5 billion global business, and investors are buying in. Retrieved December 08, 2020, from https://techcrunch.com/2019/08/22/mobile-gaming-mints-money/ ↩︎

4. Silver, L., Smith, A., Johnson, C., Jiang, J., Anderson, M., & Rainie, L. (2020, August 25). 1. Use of smartphones and social media is common across most emerging economies. Retrieved December 08, 2020, from https://www.pewresearch.org/internet/2019/03/07/use-of-smartphones-and-social-media-is-common-across-most-emerging-economies/ ↩︎

5. Ali, A. (2020, November 10). The State of the Multi-Billion Dollar Console Gaming Market. Retrieved December 08, 2020, from https://www.visualcapitalist.com/multi-billion-dollar-console-gaming-market/ ↩︎

6. Filippo, A. (2019, December 17). Our attention spans are changing, and so must game design. Retrieved December 08, 2020, from https://www.polygon.com/2019/12/17/20928761/game-design-subscriptions-attention ↩︎

7. Addepto. (2019, March 07). Benefits of Big Data Analytics in the Mobile Gaming Industry. Retrieved December 08, 2020, from https://medium.com/datadriveninvestor/benefits-of-big-data-analytics-in-the-mobile-gaming-industry-2b4747b90878 ↩︎

8. Rands, D., & Rands, K. (2018, January 26). How big data is disrupting the gaming industry. Retrieved December 08, 2020, from https://www.cio.com/article/3251172/how-big-data-is-disrupting-the-gaming-industry.html ↩︎

9. Matrofailo, I. (2015, December 21). Retention and LTV as Core Metrics to Measure Mobile Game Performance. Retrieved December 08, 2020, from https://medium.com/@imatrof/retention-and-ltv-as-core-metrics-to-measure-mobile-game-performance-89229e70f710 ↩︎

10. Batt, S. (2018, October 04). What Is a “Whale” In Mobile Gaming? Retrieved December 08, 2020, from https://www.maketecheasier.com/what-is-whale-in-mobile-gaming/ ↩︎

11. Shaul, B. (2016, March 01). Infographic: ‘Whales’ Account for 70% of In-App Purchase Revenue. Retrieved December 08, 2020, from https://www.adweek.com/digital/infographic-whales-account-for-70-of-in-app-purchase-revenue/ ↩︎

12. Perez, D. (2012, January 13). Skinner’s Box and Video Games: How to Create Addictive Games - LevelSkip - Video Games. Retrieved December 08, 2020, from https://levelskip.com/how-to/Skinners-Box-and-Video-Games ↩︎

13. Muench Frederickm (2014, March 18), The New Skinner Box: We and Mobile Analytics, December 7th 2020, https://www.psychologytoday.com/us/blog/more-tech-support/201403/the-new-skinner-box-web-and-mobile-analytics ↩︎

14. Gardner, M. (2020, September 19). Report: Gaming Industry Value To Rise 30%–With Thanks To Microtransactions. Retrieved December 08, 2020, from https://www.forbes.com/sites/mattgardner1/2020/09/19/gaming-industry-value-200-billion-fortnite-microtransactions/?sh=3374fce32bb4 ↩︎

15. Gardner, M. (2020, June 11). What’s The Future Of Gaming? Industry Professors Tell Us What To Expect. Retrieved December 08, 2020, from https://www.forbes.com/sites/mattgardner1/2020/06/11/whats-the-future-of-gaming-industry-professors-tell-us-what-to-expect/ ↩︎

16. Maass, L. (2019, July 01). Artificial Intelligence in Video Games. Retrieved December 08, 2020, from https://towardsdatascience.com/artificial-intelligence-in-video-games-3e2566d59c22 ↩︎

17. Burford, G. (2016, April 26). Alien Isolation’s Artificial Intelligence Was Good…Too Good. Retrieved December 08, 2020, from https://kotaku.com/alien-isolations-artificial-intelligence-was-good-too-1714227179 ↩︎

18. Ozyazgan, E. (2019, December 14). The Data Science Boom in Esports. Retrieved December 08, 2020, from https://towardsdatascience.com/the-data-science-boom-in-esports-8cf9a59fd573/ ↩︎

19. Cormack, L. (2018, June 29). Balancing game data with player data - DR Studios/505 Games. Retrieved December 08, 2020, from https://deltadna.com/blog/balancing-game-data-player-data/ ↩︎

20. Sergeev, A. (2019, July 15). Analytics of Map Design: Use Big Data to Build Levels. Retrieved December 08, 2020, from https://80.lv/articles/analytics-of-map-design-use-big-data-to-build-levels/ ↩︎

21. Site Admin. (2017, March 18). Retrieved December 08, 2020, from http://dmtolpeko.com/2017/03/18/moba-games-analytics-platform-balance-details/ ↩︎

22. Is AI in Video Games the Future of Gaming? (2020, November 21). Retrieved December 08, 2020, from https://www.gamedesigning.org/gaming/artificial-intelligence/ ↩︎

# 32 - Project: Forecasting Natural Gas Demand/Supply

Natural Gas(NG) is one of the valuable ones among the other energy resources. It is used as a heating source for homes and businesses through city gas companies and utilized as a raw material for power plants to generate electricity. Through this, it can be seen that various purposes of NG demand arise in the different fields. In addition, it is essential to identify accurate demand for NG as there is growing volatility in energy demand depending on the direction of the government’s environmental policy. This project focuses on building the model of forecasting the NG demand and supply amount of South Korea, which relies on imports for much of its energy sources. Datasets for training include various fields such as weather and prices of other energy resources, which are open-source. Also, those are trained by using deep learning methods such as the multi-layer perceptron(MLP) with long short-term memory(LSTM), using Tensorflow. In addition, a combination of the dataset from various factors is created by using pandas for training scenario-wise, and the results are compared by changing the variables and analyzed by different viewpoints.

Status: final, Type: Project

Baekeun Park, sp21-599-356, Edit

Keywords: Natural Gas, supply, forecasting, South Korea, MLP with LSTM, Tensorflow, various dataset.

## 1. Introduction

South Korea relies on imports for 92.8 percent of its energy resources as of the first half of 2020 1. Among the energy resources, the Korea Gas Corporation(KOGAS) imports Liquified Natural Gas(LNG) from around the world and supplies it to power generation plants, gas-utility companies, and city gas companies throughout the country 2. It produces and supplies NG in order to ensure a stable gas supply for the nation. Moreover, it operates LNG storage tanks at LNG acquisition bases, storing LNG during the season when city gas demand is low and replenish LNG during winter when demand is higher than supply 3.

The wholesale charges consist of raw material costs (LNG introduction and incidental costs) and gas supply costs 4. Therefore, the forecasting NG demand/supply will help establish an optimized mid-to-long-term plan for the introduction of LNG and stable NG supply and economic effects.

The factors which influence NG demand include weather, economic conditions, and petroleum prices. The winter weather strongly influences NG demand, and the hot summer weather can increase electric power demand for NG. In addition, some large-volume fuel consumers such as power plants and iron, steel, and paper mills can switch between NG, coal, and petroleum, depending on the cost of each fuel 5.

Therefore, some indicators related to weather, economic conditions, and the price of other energy resources can be used for this project.

Khotanzad and Elragal (1999) proposed a two-stage system with the first stage containing a combination of artificial neural network(ANN) for prediction of daily NG consumption 6, and Khotanzad et al. (2000) combined eight different algorithms to improve the performance of forecasters 7. Mustafa Akpinar et al. (2016) used daily NG consumption data to forecast the NG demand by ABC-based ANN 8. Also, Athanasios Anagnostis et al. (2019) conducted daily NG demand prediction by a comparative analysis between ANN and LSTM 9. Unlike those methods, MLP with LSTM is applied for this project, and external factors affecting NG demand are changed and compared.

## 3. Datasets

As described, weather datasets like temperature and precipitation, price datasets of other energy resources like crude oil and coal, and economic indicators like exchange rate are used in this project for forecasting NG demand and supply.

There is an NG supply dataset 10 from a public data portal in South Korea. It includes four years from 2016 to 2019 of regional monthly NG supply in the nine different cities of South Korea. In addition, climate data such as temperature and precipitation 11 for the same period can be obtained from the Korea Meteorological Administration. Similarly, data on the price of four types of crude oil 12 and various types of coal price datasets per month 13 are also available through corresponding agencies. Finally, the Won-Dollar exchange rate dataset 14 with the same period is used.

As mentioned above, each dataset has monthly information and also has average values instead of the NG supply dataset. It is regionally separated or combined according to the test scenario. For example, the NG supply dataset has nine different cities. One column of cities is split from the original dataset and merged with another regional dataset like temperature or precipitation. On the other hand, each regional value’s summation is utilized in a scenario where a national dataset is needed.

The dataset is applied differently for each scenario. In scenario one, all datasets such as crude oil price, coal price, exchange rate, and regional temperature and precipitation are merged with regional dataset, especially Seoul. For scenario two, all climate datasets are used with the regional dataset. Only temperature dataset is utilized with regional dataset in scenario three. In addition, in scenario four, all cases are the same as in scenario one, but the timesteps are changed to two months. Finally, the national dataset is used for scenario five.

Figure 1: External factors affecting natural gas

## 4.1. Min-Max scaling

In this project, all datasets are rescaled between 0 and 1 by Min-Max scaling, one of the most common normalization methods. If there is a feature with anonymous data, The maximum value(max(x)) of data is converted to 1, and the minimum value(min(x)) of data is converted to 0. The other values between the maximum value and the minimum value get converted to x', between 0 and 1.

## 4.2. Training

For forecasting the NG supply amount from the time series dataset, MLP with LSTM network model is designed by using Tensorflow. The first and second LSTM layers have 100 units, and a total of 3 layers of MLP follow it. Each MLP layer has 100 neurons instead of the final layer, where its neuron is 1. In addition, dropout was designated to prevent overfitting of data, the Adam is used as an optimizer, and the Rectified Linear Unit(ReLU) as an activation function.

Figure 2: Structure of network model

## 4.3. Evaluation

Mean Absolute Error(MAE) and Root Mean Squared Error(RMSE) are applied for this time series dataset to evaluate this network model. The MAE measures the average magnitude of the errors and is presented by the formula as following, where n is the number of errors, is the true value, and is the predicted value.

Also, The RMSE is used for observing the differences between the actual dataset and prediction values. The following is the formula of RMSE, and each value of this is the same for MAE.

## 4.4. Prediction

Since the datasets used for the training process are normalized between 0 and 1, they get converted to a range of the ground truth values again. From these rescaled datasets, it is possible to obtain the RMSE and compare the differences between the actual value and the predicted value.

## 5. Result

In all scenarios, main variables such as dropout, learning rate, and epochs are fixed under the same conditions and are 0.1, 0.0005, and 100 in order. In scenarios one, two, three, and five, the training set is applied as twelve months, and in scenario four, next month’s prediction comes from the previous two months dataset. For comparative analysis, the results are obtained by changing the size of the training set from twelve months to twenty-four months, and the effect is described. Each scenario shows individual results and is comprehensively compared at the end of this part.

## 5.1 Scenario one(regional dataset)

The final MAE of the train set is around 0.05, and the one of the test set is around 0.19. Also, the RMSE between actual data and predicted data is around 227018. The predictive graph tends to deviate a lot at the beginning of the part, but it shows a relatively similar shape at the end of the graph.

Figure 3: Loss for scenario one

Figure 4: Prediction results for scenario one

## 5.2 Scenario two(regional climate dataset)

The final MAE of the train set is around 0.10, and the one of the test set is around 0.14. Also, the RMSE is around 185205. Although the predictive graph still differs compared to the actual graph, it shows similar trends in shape.

Figure 5: Loss for scenario two

Figure 6: Prediction results for scenario two

## 5.3 Scenario three(regional temperature dataset)

The final MAE of the train set is around 0.13, and the one of the test set is around 0.14. Also, the RMSE is around 207585. While the tendency to follow high and low seems similar, but changes in the middle seem to be misleading.

Figure 7: Loss for scenario three

Figure 8: Prediction results for scenario three

## 5.4 Scenario four(applying timesteps)

The final MAE of the train set is around 0.06, and the one of the test set is around 0.30. Also, the RMSE is around 340843. Out of all scenarios, the predictive graph shows to have the most differences. However, in the last part, there is a somewhat akin tendency.

Figure 9: Loss for scenario four

Figure 10: Prediction results for scenario four

## 5.5 Scenario five(national dataset)

The final MAE of the train set is around 0.03 and the one of test set is around 0.14. Also, the RMSE between real data and predicted data is around 587340. Tremendous RMSE value results, but direct comparisons are not possible because the baseline volume is different from other scenarios. Although the predictive graph shows discrepancy, it tends to be similar to the results in scenario two.

Figure 11: Loss for scenario five

Figure 12: Prediction results for scenario five

## 5.6 Overall results

Out of the five scenarios in total, the second and third have smaller RMSE than others, and the graphs also show relatively similar results. The first and fourth show differences in the beginning and similar trends in the last part. However, it is noteworthy that the gap at the beginning of them is very large, but it tends to shrink together at the point of decline and stretch together at the point of increase.

In the first and fifth scenarios, all data are identical except that they differ in regional scale in temperature and precipitation. It is also the same that twelve months of data are used as the training set. From the subtle differences in the shape of the resulting graph, it can be seen that the national average data cannot represent the situation in a particular region, and the amount of NG supply differs depending on the circumstances in the region.

Figure 13: Total prediction results: 12 months training set

After changing the training set from twelve months to twenty-four months, the results are more clearly visible. The second and third prediction graphs have a more similar shape and the RMSE value decreases than the previous setting. The results of other scenarios show that the overall shape has improved; contrarily, the shape of the rapidly changing middle part is better in the previous condition.

Figure 14: Total prediction results: 24 months training set

## 6. Benchmarks

For a benchmark, the Cloudmesh StopWatch and Benchmark 15 is used to measure the program’s performance. The time spent on data load, data preprocessing, network model compile, training, and the prediction was separately measured, and the overall time for execution of all scenarios is around 77 seconds. It can be seen that The training time for the fourth scenario is the longest, and the one for the fifth scenario is the shortest.

Figure 15: Benchmarks

## 7. Conclusion

From the results of this project, it can be seen that simplifying factors that have a significant impact shows better efficiency than combining various factors. For example, NG consumption tends to increase for heating in cold weather. In addition, there is much precipitation in warm or hot weather; on the contrary, there is relatively little precipitation in the cold weather. It can be seen that these seasonal elements show relatively high consistency for affecting prediction when those are used as training datasets. Also, the predictions are derived more effectively when the seasonal datasets are combined.

However, in training set with a duration of twelve months, the last part of the scenario tends to match the actual data despite using the dataset combined with various factors that appears to be seasonally unrelated. Furthermore, when the training set is doubled on the same dataset, it can be seen that the differences between the actual and prediction graph are decreased than the result of a smaller training set. Based on this, it can be expected that the results could vary if a large amount of dataset with a more extended period is used and the ratio of the training set is appropriately adjusted.

South Korea imports a large amount of its energy resources. Also, the plan for energy demand and supply is being made and operated through nation-led policies. Ironically, the government’s plan also shows a sharp change in direction with recent environmental issues, and the volatility of demand in the energy market is increasing than before. Therefore, methodologies for accurate forecasting of energy demand will need to be complemented and developed constantly to prepare for and overcome this variability.

In this project, Forecasting NG demand and supply was carried out using various data factors such as weather and price that is relatively easily obtained than the datasets which are complex economic indicators or classified as confidential. Nevertheless, state-of-the-art deep learning methods show that it has the flexibility and potential to forecast NG demand through the tendency of the results that indicate a relatively consistent with the actual data. From this point of view, it is thought that the research on NG in South Korea should be conducted in an advanced form by utilizing various data and more specialized analysis.

## 8. Acknowledgments

The author would like to thank Dr. Gregor von Laszewski for his invaluable feedback, continued assistance, and suggestions on this paper, and Dr. Geoffrey Fox for sharing his expertise in Deep Learning and Artificial Intelligence applications throughout this Deep Learning Application: AI-First Engineering course offered in the Spring 2021 semester at Indiana University, Bloomington.

## 9. Source code

The source code for all experiments and results can be found here as ipynb link and as pdf link.

## 10. References

1. 2020 Monthly Energy Statistics, [Online resource] http://www.keei.re.kr/keei/download/MES2009.pdf, Sep. 2020 ↩︎

2. KOGAS profile, [Online resource] https://www.kogas.or.kr:9450/eng/contents.do?key=1498 ↩︎

3. LNG production phase, [Online resource] https://www.kogas.or.kr:9450/portal/contents.do?key=2014 ↩︎

4. NG wholesale charges, [Online resource] https://www.kogas.or.kr:9450/portal/contents.do?key=2026 ↩︎

5. Natural gas explained, [Online resource], https://www.eia.gov/energyexplained/natural-gas/factors-affecting-natural-gas-prices.php, Aug, 2020 ↩︎

6. A. Khotanzad and H. Elragal, “Natural gas load forecasting with combination of adaptive neural networks,” IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), 1999, pp. 4069-4072 vol.6, doi: 10.1109/IJCNN.1999.830812. ↩︎

7. A. Khotanzad, H. Elragal and T. . -L. Lu, “Combination of artificial neural-network forecasters for prediction of natural gas consumption,” in IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 464-473, March 2000, doi: 10.1109/72.839015. ↩︎

8. M. Akpinar, M. F. Adak and N. Yumusak, “Forecasting natural gas consumption with hybrid neural networks — Artificial bee colony,” 2016 2nd International Conference on Intelligent Energy and Power Systems (IEPS), 2016, pp. 1-6, doi: 10.1109/IEPS.2016.7521852. ↩︎

9. A. Anagnostis, E. Papageorgiou, V. Dafopoulos and D. Bochtis, “Applying Long Short-Term Memory Networks for natural gas demand prediction,” 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), 2019, pp. 1-7, doi: 10.1109/IISA.2019.8900746. ↩︎

10. NG supply dataset, [Online resource], https://www.data.go.kr/data/15049904/fileData.do, Apr, 2020 ↩︎

11. Regional climate dataset, [Online resource] https://data.kma.go.kr/climate/RankState/selectRankStatisticsDivisionList.do?pgmNo=179 ↩︎

12. Crude oil orice dataset, [Online resource] https://www.petronet.co.kr/main2.jsp ↩︎

13. Bituminous coal price dataset, [Online resource] https://www.kores.net/komis/price/mineralprice/ironoreenergy/pricetrend/baseMetals.do?mc_seq=3030003&mnrl_pc_mc_seq=506 ↩︎

14. Won-Dollar exchange rate dateset, [Online resource] http://ecos.bok.or.kr/flex/EasySearch.jsp?langGubun=K&topCode=022Y013 ↩︎

15. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, [GitHub] https://github.com/cloudmesh/cloudmesh-common ↩︎

# 33 - Big Data on Gesture Recognition and Machine Learning

Since our technology is more and more advanced as time goes by, traditional human-computer interaction has become increasingly difficult to meet people’s demands. In this digital era, people need faster and more efficient methods to obtain information and data. Traditional and single input and output devices are not fast and convenient enough, it also requires users to learn their own methods of use, which is extremely inefficient and completely a waste of time. Therefore, artificial intelligence comes out, and its rise has followed the changeover times, and it satisfied people’s needs. At the same time, gesture is one of the most important way for human to deliver information. It is simple, efficient, convenient, and universally acceptable. Therefore, gesture recognition has become an emerging field in intelligent human-computer interaction field, with great potential and future.

Status: final, Type: Report

Sunny Xu, Peiran Zhao, Kris Zhang, fa20-523-315, Edit

Keywords: gesture recognition, human, technology, big data, artificial intelligence, body language, facial expression

## 1. Introduction

Technology is probably one of the most attracting things for people nowadays. Whether it is the new iPhone coming out or some random new technology that is bring into our life. It is a matter of fact that technology has become one of the essential parts of our life and our society. Simply, our life will change a lot without technology. As of today, since technology is improving so fast, there are many things that can be related to AI and machine learning. A lot of the ordinary things around our life becomes data. And the reason why they become data is because there is a need for them in having better technology to improve our life. For example, language was stored into data to produce technology like translator to provide convenience for people that does not speak the language. Another example is that roads were stored into data to produce GPS to guide direction for people. Nowadays, people values communication and interaction between others. Since gesture recognition is one of the most important ways to understand people and know their emotion, it becomes a popular field of study for many scientists. There are multiply field of study in gesture recognition and each require a lot of amount of time to know them well. For the report, we do research about hand gesture, body gesture and facial expression. Of course, there will be a lot of other fields related to gesture recognition, for example, like animal gestures. They all can be stored into data and get study in the research by scientists. Many people might have question about how gesture recognition are has anything to do with technology. They simply do not think that they can be related, but in fact, they are related. Companies like Intel and Microsoft have already created so many studies for new technology in that field. For example, Intel proposed combining facial recognition with device recognition to authenticate users. Studying gestures recognition will often reveal what the think. For example, when someone is lying, their eye will tend to look around and they tend to touch their nose with their hand, etc. So, studying gesture recognition will not only help people understand much more about human beings and it can also help our technology grow. For example, in AI and machine learning, studying gestures recognition will make or improve AI and machine learning to better understand humans and be more human-like.

## 2. Background

Nowadays, people are doing more and more high-tech research, which also makes various high-tech products appear in society. For people, electricity is as important as water and air. Can’t imagine life without electricity. We can realize that technology is changing everything about people from all aspects. People living in the high-tech era are also forced to learn and understand the usage of various high-tech products. As a representative of high technology, artificial intelligence has also attracted widespread attention from society. Due to the emergence of artificial intelligence, people have also begun to realize that maintaining human characteristics is also an important aspect of high technology.

People’s living environment is inseparable from high technology. As for the use of human body information, almost every high-tech has different usage 1. For example, face recognition is used in many places to check-in. This kind of technology enables the machine to store the information of the human face and determine whether it is indeed the right person by judging the five senses. We are most familiar with using this technology in airports, customs, and companies to check in at work. Not only that, but the smartphones we use every day are also unlocked through this face recognition technology. Another example is the game console that we are very familiar with. Game consoles such as Xbox and PS already have methods for identifying people’s bodies. They can identify several key points of people through the images received by their own cameras, thus inputting this line of action into the world of the game.

Many researchers are now studying other applications of human movements, gestures, and facial expressions. One of the most influential ones is that Google’s scientists have developed a new computer vision method for hand perception. Google researchers identified the movement of a hand through twenty-one 3D points on the hand. Research Engineers Valentin Bazarevsky and Fan Zhang stated that “The ability to perceive the shape and motion of hands can be a vital component in improving the user experience across a variety of technological domains and platforms 2.” This model can currently identify many common cultural features. gesture. They have done experiments. When people play a game of “rock, paper, scissors” in front of the camera, this model can also judge everyone’s win or loss by recognizing gestures.

More than that, many artificial intelligences can now understand people’s feelings and intentions by identifying people’s facial expressions. This also allows us to know how big a database is behind this to support the operation of these studies. But collecting these data about gesture recognition is not easy. Many times we need to worry about not only whether the data we input is correct, but also whether the target identified by artificial intelligence is clear and the output information is accurate.

## 3. Gesture Recognition

Gesture recognition is mainly divided into two categories, one is based on external device recognition, the specific application is data gloves, wearing it on user’s hand, to obtain and analysis information through sensors. This method has obvious shortcomings, though it is accurate and has excellent response speed, but it is costly and is not good for large-scale promotion. The other one is the use of computer vision. People do not need to wear gloves. As its name implies, this method collects and analyzes information through a computer. It is convenient, comfortable, and not so limited based on external device identification. In contrast, it has greater potential and is more in line with the trend of the times. Of course, this method needs more effective and accurate algorithms to support, because the gestures made by different people at different times, in different environments and at different angles also represent different meanings. So, if we want more accurate information feedback. Then the advancement of algorithms and technology is inevitable. The development of gesture recognition is also the development of artificial intelligence, a process of the development of various algorithms from data gloves to the development of computer vision-based optical technology plays a role in promoting it.

### 3.1 Hand Gesture

#### 3.1.1 Hand Gesture Recognition and Big Data

Hand gesture recognition is commonly used in gesture recognition because fingers are the most flexible and it is able to create different angles that will represent different meanings. The hand gesture itself is also an easy but efficient way for us human beings to communicate and send messages to each other. The existence of hand gestures can be considered easy but powerful. However, if we are using the application of hand gesture recognition, it is a much more complicated process. In real life, we can just ben our finger or simply make a fist so that other people will understand our message. But when using hand gesture recognition there are many processes that are being involved. Hand gesture is commonly used in geesture recoginitaion As we did our research and based on our life experiences, hand gesture recognition is a very hot topic and has all the potential to be the next wave. Hand gesture recognition has recently achieved big success in many fields. The advancement and development of hand gesture recognition is also the development of other technology such as the advancement of computer chips, the advancement of algorithms, the advancement of machine learning even advancement of deep learning, and the advancement of cameras from 2D to 3D. The most important part of hand gesture recognition is big data and machine learning. Because of the development of big data and machine learning, data scientists are able to have better datasets, build a more accurate and successful model and be able to process the information and predict the most accurate results. Hand gesture recognition is a significant link in Gesture recognition.However gesture recognition is also not only about hand gesture recognition, it also includes other body parts such as facial expression recognition and body gesture recognition. With the help of the whole system of different gesture recognitions, the data can be recorded and processed by AIs. The results or predictions can be used currently or later on for different purposes in different areas 3.

#### 3.1.2 Principles of Hand Gesture Recognition

Hand gesture recognition is a complicated process involving many steps. And in order to get the most accurate result, it will need a large amount of quality data and a scientific model with high performance. Hand gesture recognition is also at a developing stage simply because there are so many possible factors that can influence the result. Possible factors include skin color, background color, hand gesture angle, and Bending angle, etc. To simplify the process of gesture recognition, AIs will use 3-D cameras to capture images. After that, the data of the image will be collected and processed by programs and built models. And lastly, AIs will be able to use that model to get an accurate result in order to have a corresponding response or predict future actions. To explain all processes of hand gesture recognition in detail, it includes graphic gathering, retreatment, skin color segmentation, hand gesture segmentation, and finally hand gesture recognition. Hand Gesture Recognition can not achieve the best accuracy without all any of these steps. Within all these steps, skin color segmentation is the most crucial step in order to increase accuracy and this process will be explained in the next session 4.

### 3.2 Body Gesture

#### 3.2.1 Introduction to Body Gesture

Body gestures, which can also be called body language, refer to humans expressing their ideas through the coordinated activities of various body parts. In our lives, body language is ubiquitous. It is like a bridge for our human communication. Through body language expression, it is often easier for us to understand what the other person wants to express. At the same time, we can express ourselves better. The profession of an actor is a good example of body language. This is a compulsory course for every actor because actors can only use their performances to let us know what they are expressing 6. At this time, body language becomes extremely important. Different characters have different body movements in different situations, and actors need to make the right body language at a specific time to let the audience know their inner feelings. Yes, the most important point of body language is to convey mood through movement.

In many cases, certain actions will make people feel emotions. For us who communicate with all kinds of people every day, there are also many body languages that we are more familiar with. For example, when a person hangs his head, it means that he is unhappy, walking back and forth is a sign of a person’s anxiety, and body shaking is caused by nervousness, etc.

#### 3.2.2 Body Gesture and Big Data

As a piece of big data, body language requires data collected by studying human movements. Scientists found that when a person wants to convey a complete message, body language accounts for half. And because body language belongs to a person’s actions subconsciously, it is rarely deceptive. All of your nonverbal behaviors—the gestures you make, your posture, your tone of voice, how many eyes contact you make—send strong messages 7. In many cases, these unconscious messages from our bodies allow the people who communicate with us to feel our intentions. Even when we stop talking, these messages will not stop. This also explains why scientists want to collect data to let artificial intelligence understand human behavior. In order for artificial intelligence to understand human mood or intention from people’s body postures and actions, scientists have collected a lot of human body actions that show intentions in different situations through research. The music gesture artificial intelligence developed by MIT-IBM Watson AI Lab is a good example 8. The music gesture artificial intelligence developed by MIT-IBM Watson AI Lab can enable artificial intelligence to judge and isolate the sounds of individual instruments through body and gesture movements. This success is undoubtedly created by the big data of the entire body and gestures. The research room collects a large number of human structure actions to provide artificial intelligence with a large amount of information so that the artificial intelligence can judge what melody the musician is playing through body gestures and key points of the face. This can improve its ability to distinguish and separate sounds when artificial intelligence listens to the entire piece of music.

Most of the artificial intelligence’s analysis of the human body requires facial expressions and body movements. This recognition cannot be achieved only by calculation. What is needed is the collection of the meaning of different body movements of the human body by a large database. The more situations are collected, the more accurate the analysis of human emotions and intentions by artificial intelligence will be. The easiest way is to include more. Just like humans, broadening your horizons is a way to better understand the world. The way of recording actions is not complicated. Just set several key movable joints of the human body to several points, and then connect the red dots with lines to get the approximate shape of the human body. At this time, the actions made by the human body will be included in the artificial intelligence. In the recording process, the upper body and lower body can be recorded separately. In order to avoid in some cases, the existence of obstructions will cause artificial intelligence to fail to recognize correctly.

#### 3.2.3 Random Forest Algorithm in Body Gesture Recognition

Body gesture recognition is pretty useful but pretty hard to achieve because of its limitations and harsh requirements. Without the development of all kinds of 3D cameras, body gesture recognition is just an unrealistic dream. In order to get important and precise data for the body gesture recognition to process, different angles, light, background all needs to be captured 9. For body gestures, the biggest difficulty is that if you only capture data in the front, it will not give you the correct information and result in most of the time. In this way, you will need precise data from different angles. A Korean team has done an experiment using three 3D cameras and three stereo cameras to capture images and record data from different angles. The data were recorded in a large database that includes captured data both from outside and inside. One of the most popular algorithms used in body gesture recognition is the random forest algorithm. It is very famous and useful in all types of machine learning projects. It is a type of supervised learning algorithm. Because there are all types of data are needed to be a record and process. The random forest algorithm is perfect for that, the biggest advantage of this algorithm is that it can let each individual tree mainly focus on one part or one characteristic of body gesture data because of this algorithm’s ability to combine all weak classifiers into a strong one 10. It is simple but so powerful and efficient. In addition to that, it works really well with body gesture recognition. With the algorithm and advanced cameras, precise data could be collected and AIs will be able to get useful information at different levels.

### 3.3 Face Gesture

#### 3.3.1 Introduction to Face Gesture (Facial Expression)

Body language is one of the ways that we can express ourselves without saying any words. It has been suggested that body language may account for between 60 to 65% of all communication 6. According to expert, body language is used every day for us to communicate with each other. During our communication, we not only use words but also use body gestures, hand gestures and most importantly, we use facial expression most. During communication with different people, our face communicate different thoughts, idea, and emotion and the reason why we use facial expression more than any other body gestures is that when we have certain emotion, it is express in our face automatically. Facial expression is often not under our control. That is why people often say that the word that come out of mouth cannot always be true, but their facial expression will reveal what those people are thinking about. So, what is facial expression exactly? According to Jason Matthew Harley, Facial expressions are configurations of different micromotor movements in the face that are used to infer a person’s discrete emotional state 9. Some example of common facial expression will be: Happiness, Sadness, Anger, Surprise, Disgust, Fear, etc 6. Each facial expression will have some deep meaning behind it. For example, A simple facial expression like smiling can be translated into a sign of approval, or it can be translated into a sign of friendly. If we put all those emotion into big data, it will help us to understand ourselves much better.

#### 3.3.2 Sense Organs on The Face

The facial expression expresses our emotion during the communication by micro movement of our sense organs. The most used organs are the eyes and mouth and sometimes, the eyebrows.

##### 3.3.2.1 Eye

The eyes are one of the most important communication tools in our ways of communication with each other. When we communicate with each other, the eye contact will be inevitable. The signal in your eye will tell people what you are think. Eye gaze is a sample of paying attention when communicating with others. When you are talking to a person and if his eye is directly on you and both of you keep having eye contact. In this situation, this mean that he is interested in what you say and is paying attention to what you say. On the other hand, if the action of breaking eye contact happens very frequently, it means that he is not interested, distracted, or not paying attention to you and what you are saying.

Blinking is another eye signal that is very often and will happen in communicating with other people. When talking to other people, blinking is very usual and will happen every time when you are going to communicate with different people. But the frequency of blanking can give away what are you feeling right now. People often blink more rapidly when they are feeling distressed or uncomfortable. Infrequent blinking may indicate that a person is intentionally trying to control his or her eye movements 6. For example, when A person is lying, he might try to control his blinking frequency to make other people feel like he is calm and saying the truth. In order to persuade other people that he is calm and telling the truth, he will need to blink less frequently.

Pupil size is a very important facial expression. Pupil size can be a very subtle nonverbal communication signal. While light levels in the environment control pupil dilation, sometimes emotions can also cause small changes in pupil size 6. For example, when you are surprised by something, your pupil size will become noticeably larger than before. When having a communication, dilated eyes can also mean that the person is interesting in the communication.

##### 3.3.2.2 Mouth

Mouth expression and movement will also be a huge part in communicating with other and reading body language. The easiest example will be smiling. A micro movement of your mouth and lip will give signal to others about what do you think or how are you feeling. When you tighten your lips, it means that you either distaste, disapprove or distrust other people when having a conversation. When you bite your lips, it means that you are worried, anxious, or stressed. When someone tries to hide certain emotional reaction, they tend to cover their mouth in order not to display any facial expression through lip movement. For example, when you are laughing. The simple movement of turning up or down of the lip will also indicate what a person is feeling. When the mouth is slightly turn up, it might mean that the person is either feeling happy or optimistic. On the other hand, a slightly down-turned mouth can be an indicator of sadness, disapproval, or even an outright grimace 6.

Nowadays, since technology is so advance, everything around us can be turn into data and everything can be related to data. Facial expression is often study by different scientist in research because it allows us to understand more about human and communication between different people. One of the relatively new and promising trends in using facial expressions to classify learners' emotions is the development and use of software programs the automate the process of coding using advanced machine learning technologies. For example, FaceReader is a commercially available facial recognition program that uses an active appearance model to model participant faces and identifies their facial expression. The program further utilizes an artificial neural network, with seven outputs to classify learner’s emotions 11. Also, facial expression can be analyzed in other software programs like the Computer Expression Recognition Toolbox. Emotion is a huge study field in the technology field, and facial expression is one of the best ways to study and analyze people’s emotion. Emotion technology is becoming huge right now and will be even more popular in the future according to MIT Technology Review, Emotion recognition – or using technology to analyze facial expressions and infer feelings-is, by one estimate, set to be a 25 billion business by 2023 12. So back to the topic about big data and facial expression. Why are those things related? It is because, first everything is data around us. Your facial expression can be stored into data for other to learn and detect too. One of the examples is that, in 2003, The US Transportation Security Administration started training humans to spot potential terrorists by reading their facial expression. And by that, scientist believe that if human can do that, with data and AI technology, robot can detect facial expression more accurate than human. #### 3.3.4 The Problem with Detecting Emotion for Technology Nowadays Even though facial expression can reveal people’s emotion and what they think, but there has been “growing pushback” against the statement. A group of scientists brought together a research after reviewing more than 1,000 paper on emotion detection. After the research, the conclusion of it is hard to use facial expressions alone to accurately tell how someone is feeling is made. Human’s mind is very hard to predict. People do not always cry when they feel down and smile when they feel happy. The facial expression can not always reveal the true feeling the person is feeling. Not only that, because there is not enough data for facial expression, people will often mistakenly categorize other’s facial expression. For example, Kairos, which is a facial biometrics company, promise retailers that it can use a emotion recognition technology to figure out how their customers are feeling. But when they are labeling the data to feed the algorithm, one big problem reveals. An observer might read a facial expression as “surprised,” but without asking the original person, it is hard to know what the real emotion was 12. So the problems with technology that involves around facial expression are first, there is not enough data. Second is that facial expression sometimes can not be always true. #### 3.3.5 Classification Algorithms Nowadays, since technology is growing so fast, there are a lot of interaction between humans and computer. Facial expression plays an essential role in social interaction with other people. It is not arguably one of the best ways to understand human. “It is reported that facial expression constitutes 55% of the effect of a communicated message while language and voice constitute 7% and 38% respectively. With the rapid development of computer vision and artificial intelligence, facial expression recognition becomes the key technology of advanced human computer interaction 13.” This quote from the research shows that facial expression is one of the main tools that we are using to communicate with other people and interact with computer. So being able to recognize and identify the facial expression becomes relatively important. The main objective for facial expression recognitions is to use its conveying information automatically and correctly. As a result, feature extraction is very important to the facial expression recognition process. The process needs to be smooth and without any mistakes. So, algorithms are needed in the processes. Classification analysis is an important component of facial recognition, it is mainly used to find data distribution that is valuable and at the same time, find data models in the potential data. At present it has further study of the database, data mining, statistics, and other fields 13. In addition to that, one of the major obstacles and limitation of facial expression recognition is face detection. To detect the face, you will need to locate the faces in an image or a photograph. This is where scientists applicate classification algorithm, machine learning and deep learning. Recently, convolutional neural network model has become so successful that facial recognition is the next top wave 14. ## 4. Conclusion With the development of artificial intelligence, human-computer interaction, big data, and machine learning even deep learning is getting more mature. Gesture Recognition including Hand Gesture Recognition, Body Gesture Recognition, and Face Gesture Recognition has finally come true into a real-life application and already achieved huge success in many areas. But it still has much more potential in all possible areas that could change people’s lives drastically in a good way. Gestures are the simplest and the most natural language of all human beings. It sends the clearest message for communicating between people, and even human and computers. Because of the more powerful cameras, better big data technology, and more efficient and effective algorithms from deep learning, Scientists are able to use color and the Gesture Segmentation method to remove useless color data in order to maximize the accuracy of the result. As we are doing our research, we also find out Hand Gesture Recognition is not the only Recognition in this area, Body Gesture Recognition and Face Gesture Recognition or facial expression are also very important, they can also deliver messages in the simplest way. They are also very effective when building relationships between humans and machines. Face Gesture or facial expression could not only deliver messages but even deliver emotions. Micromovements of facial expressions studied by different scientists could be very useful in predicting the emotions of humans. Body Gesture Recognition is also helpful as we did our research with the body gesture data scientists collected from different musicians with different instruments. They are able to predict the melodies or even the songs played by that musician. This is mind-blowing because with this type of technology and applications we are able to achieve more and use it in many possible fields. With all these combined, scientists could build a very successful and mature Gesture Recognition model to get the most accurate result or prediction. According to the research and our own analysis, we come up with a conclusion that Gesture Recognition will be the next hot trendy topic and are applicable in many possible areas including Security, AI, economics, manufacture, the game industry, and even medical services. With Gesture Recognition being applied, scientists are able to develop much smarter AIs and machines that can interact with humans more efficiently and more fluently. AIs will be able to receive and understand messages from humans more easily and will able to function better. This is also a great message for many handicapped people. With Hand Gesture Recognition being used, their life will also be easier and happier and that’s definitely something we are want to see because the overall goal of all the technologies is to make people’s life easier and bring the greatest amount of happiness to the greatest amount of people. However, the technology we have right now is not advanced enough yet, in order to get a more accurate result, we still need to develop better cameras, better algorithms, and better models. But we all believe that this era is the big data era, and everything could happen as big data and deep learning technology get more and more advanced and mature. We believe in and look forward to the beautiful future of Gesture Recognition. And we also think people should really pay more and closer attention to this field since Gesture Recognition is the next wave. ## 5. References 1. Srilatha, Poluka, and Tiruveedhula Saranya. “Advancements in Gesture Recognition Technology.” IOSR Journal of VLSI and Signal Processing, vol. 4, no. 4, 2014, pp. 01–07, iosrjournals.org/iosr-jvlsi/papers/vol4-issue4/Version-1/A04410107.pdf, 10.9790/4200-04410107. Accessed 25 Oct. 2020. ↩︎ 2. Bazarevdsky, V., & Zhang, F. (2019, August 19). On-device, real-time hand tracking with MediaPipe. Google AI Blog. https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html ↩︎ 3. F. Zhan, “Hand Gesture Recognition with Convolution Neural Networks,” 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 2019, pp. 295-298, doi: 10.1109/IRI.2019.00054. ↩︎ 4. Di Zhang, DZ.(2019) Research on Hand Gesture Recognition Technology Based on Machine Learning, Nanjing University of Posts and Telecommunications. ↩︎ 5. A. Choudhury, A. K. Talukdar and K. K. Sarma, “A novel hand segmentation method for multiple-hand gesture recognition system under complex background,” 2014 International Conference on Signal Processing and Integrated Networks (SPIN), Noida, 2014, pp. 136-140, doi: 10.1109/SPIN.2014.6776936. ↩︎ 6. Cherry, K. (2019, September 28). How to Read Body Language and Facial Expressions. Verywell Mind. Retrieved November 8, 2020, from https://www.verywellmind.com/understand-body-language-and-facial-expressions-4147228 ↩︎ 7. Segal, J., Smith, M., Robinson, L., & Boose, G. (2020, October). Nonverbal Communication and Body Language. HelpGuide.org. https://www.helpguide.org/articles/relationships-communication/nonverbal-communication.htm ↩︎ 8. Martineau, K. (2020, June 25). Identifying a melody by studying a musician’s body language. MIT News | Massachusetts Institute of Technology. https://news.mit.edu/2020/music-gesture-artificial-intelligence-identifies-melody-by-musician-body-language-0625 ↩︎ 9. N. Normani et al., “A machine learning approach for gesture recognition with a lensless smart sensor system,” 2018 IEEE 15th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Las Vegas, NV, 2018, pp. 136-139, doi: 10.1109/BSN.2018.8329677. ↩︎ 10. Bon-Woo Hwang, Sungmin Kim and Seong-Whan Lee, “A full-body gesture database for automatic gesture recognition,” 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, 2006, pp. 243-248, doi: 10.1109/FGR.2006.8. ↩︎ 11. Harley, J. M. (2016). Facial Expression. ScienceDirect. https://www.sciencedirect.com/topics/computer-science/facial-expression ↩︎ 12. Chen, A. (2019, July 26). Computers can’t tell if you’re happy when you smile. MIT Technology Review. https://www.technologyreview.com/2019/07/26/238782/emotion-recognition-technology-artifical-intelligence-inaccurate-psychology/ ↩︎ 13. Ou, J. (2012). Classification algorithms research on facial expression recognition. Retrieved from https://www.sciencedirect.com/science/article/pii/S1875389212006438 ↩︎ 14. Brownlee, J. (2020, August 24). How to perform face detection with deep learning. Retrieved from https://machinelearningmastery.com/how-to-perform-face-detection-with-classical-and-deep-learning-methods-in-python-with-keras/ ↩︎ # 34 - Big Data in the Healthcare Industry Healthcare is an organized provision of medical practices provided to individuals or a community. Over centuries the application of innovative healthcare has been needed increasingly as humans expand their life span and become more aware of better preventative care practices. The application of Big Data within the industry of Healthcare is of the utmost importance in order to quantify the effects of wide scale efficient and safe solutions. Pharmaceutical and Bio Data Research companies can use big data to intake large facets of patient record data and use this collected data to iterate how preventative care can be implemented before diseases actually present themselves in stages that are beyond the point of potential recovery. Data collected in laboratory settings and statistics collected from medical and state institutions of healthcare facilitate time, money, and life saving initiatives as deep learning can in certain instances perform better than the average doctor at detecting malignant cells. Big data within healthcare has proven great results for the advancement and diverse application of informed reasoning towards medical solutions. Cristian Villanueva, Christina Colon Status: final, Type: Report Keywords: EHR, Healthcare, diagnosis, application, treatment, AI, network, records ## 1. Introduction Healthcare is a multi-dimensional system established with the aim of the prevention, diagnosis, and treatment of health-related issues or impairments in human beings1. The many dimensions of Healthcare can be characterized by the influx of information coming and going from each level as there are multiple different applications of Healthcare. These applications can include but are not limited to vaccines, surgeries, x-rays, medicines/treatments. Big data plays a pivotal role in Healthcare diagnostics, predictions, and accelerated results/outcomes of these applications. Big Data has the ability to save millions of dollars through automating 40% of radiologist’s tasks, saving time on potential treatments through digital patients, and by providing improved outcomes1. With higher accuracy rates of diagnosis and advanced AI is able to transform hypothetical analysis into data driven diagnosis and treatment strategies. ## 2. Patient Records EHR stands for ‘electronic health records’ and is a digital version of a patient’s paper chart.The Healthcare industry utilizes EHR for maintaining records of everything related in their institutions. EHR are real-time, patient centred records that make information available instantly and securely to authorized users2. EHR is capable of holding even more information as it is possible to include such information such as medical history, diagnoses, medications, treatment plans, immunization dates, allergies, radiology images, and laboratory and test results. According to Definitive Healthcare data from 2020, more than 89 percent of all hospitals have implemented inpatient or ambulatory EHR systems 3. A network of information surrounding a patient’s health record and medical data allows for the research and production of such progressive advancement in treatment. To underline the potential of the resources, more than 110 million EHRs around the continents were inspected for genetic disease research4. This is the capability of EHRs as it holds information capable of diagnosing, preventing and treating other patients for early detection of an ailment or disease. Through the application of neural networks in Deep Learning models, EHR’s could be compiled and analyzed to identify inconspicuous indicators of disease development of patients in early stages far before a human doctor would be able to make a clear diagnosis. The application has the ability to work far ahead for preventive measures as well as the allocation of resources to make sure that patients are paying for the care at minimum costs, the appropriate method of medical intervention is applied, and physicians’ workload can become less strenuous. ### 2.1 EHR Application: Detecting Error and Reducing Costs In order to understand the impact that Big Data such as EHRs has on the Healthcare industry an example of research is presented in the form of collection before and after implementation of EHR. The research study collected data for the period of 1 year before EHR (pre-EHR) and 1 year after EHR (post-EHR) implementation. What was noticed in the analyzes of the data was in the area of ‘Medication errors and near misses’ the research stated ‘medication errors per 1000 hospital days decreased 14.0%-from 17.9% in the pre-EHR period to 15.4% in the 9 months after CPOE implementation’5. The research determined that with implementation of EHR with (CPOE) computerized provider order entry was able to reduce the costs of treatment and improvised upon the safety of their patients. Participants of the study mentioned that there was an increase in speed when it came to pharmacy, laboratory and radiology orders. The research also stated ‘our study demonstrated an 18% reduction in laboratory testing’. The study touched upon the rapidness that EHR can add to a process of treatment when orders are validated much quicker and hospitals and patients save money from the rapid diagnosis and treatment. This cuts out the middle-man of deliberate testing and examinations upon patients so they don’t have to cover the costs or undergo wasteful testing from their own EHR and other extensive EHR that it utilizes for comparison. Examples of models used in this example study include data mining through phenotyping and natural language processing. In this way data mining allows large sets of patient data to be aggregated in order to make inferences over a population or theories regarding how a disease will progress in any given patient. Phenotyping categorizes features of patients' health and their DNA and ultimately their overall health. Association rule data mining helps automated systems in their predictions in order to predict behavioral and health outcomes of patients’ circumstances. ## 3. AI Models in Cancer Detection AI is modifying early detection of cancer as models are capable of being more accurate and precise with the analysis of mass and cell images. The difficulty of diagnosing cancer is because of the possibilities of either the mass being benign or malignant. The amount of time overlooking the cell nuclei and its features to either determine if it is malignant or benign can be staggering for oncologists. Utilizing the information of what’s known about cancer can train AI to be calibrated to scour through several images and screenings of cell nuclei to find the key indicators. These key indicators can also be whittled down even further as there is AI to determine which indicators have the highest correlation with malignant cancer. As a dataset from Kaggle consisting of 569 cases of malignant and benign breast cancer, it represented 357 cases of benign and 212 of malignant. With that information there were initially 33 features that may have indicated malignancy in these cases. The 33 features were reduced to 10 features as not all of them equally displayed the same level of contribution to the diagnosis. Across the 10 features there were 5 features that demonstrated the highest correlation to the malignancy. Several models were adapted to find the highest accuracy and precision. This form of AI detection improves upon the efficacy of early cancer detection. Figure 1. Demonstrates how AI in this study used images to cross-analyze features of a patient’s results to verify what model is the most accurate and precise to determine which model can best serve a physician in their diagnostic report. ### 3.1 Early Detection Big Data Applications ‘An ounce of prevention is worth more than a pound of a cure’ is a common philosophy held by medical professionals. The meaning behind this ideology is found in that if one can prevent a disease from ever taking its final form through performing small routine tasks and check ups, a plethora of harm and suffering from trying to recage a disease can be avoided. Many medical solutions for diseases such as cancer or degenerative brain diseases rely on the idea that outside medical intervention will strengthen the patient enough for the human body to heal itself through existing biological principles 6. For example, vaccines work by injecting dead cells into a patient so that its antibodies can be learned and immunity can be built up by white blood cells naturally. Intervening before one is infected must be completed for these measures to be effective. If preventative care such as routine screenings on individuals with family history of diseases or those with general genetic predispositions then the power truly lies in having the discernment knowledge to catch the disease early. In many diseases once a patient is presenting symptoms, it is too late or survival/recovery probability percentages are slashed. This places immense pressures on patients themselves to work to have access to routine screenings and even more pressure on physicians to intake these patients and make preliminary diagnosis with little more than a visual analysis of the patient. Big data automates these tasks and gives physicians an incredible advantage and discernment as to what is truly happening within a patient’s circumstance. ### 3.2 Detecting Cervical Cancer Cervical cancer in the past was one of the most common causes of cancer death for women in the United States. However preventive care in the form of pap test has been able to drop the death rate significantly. In the pap test images are taken of the women’s cervix to identify any changes that might indicate cancer is going to form. Cervical cancer has a much higher death rate without early detection as a cure is easier to take full effect in the early stages. Artificial Intelligence performs an algorithm and gives the computer the ability to act and reason based on a set of known information. Machine learning implements more data and allows the computer to work iteratively and make predictions and make decisions based on the massive amount of data provided. In this way, machines have had the ability to detect cervical cancer with greater precision and accuracy in some cases than gynecologists 7. Imaging of cervical screenings targeted by a convolution neural network is the key to unlocking correlations behind the large sum of images. By implementing further reasoning into the data set, the CNN is able to classify enhanced recognition of cancer as or before it forms. This study using this method of machine learning has been able to perform with 90-96% accuracy and save lives. The CNN is able to identify the colors, shapes, sizes, edges and other features pertaining to cancerous cells. This is ground breaking for women in underdeveloped countries like India and Nigeria where the death rate for cervical cancer is much higher than the United States due to lack of access to routine pap smears. Women could get results on their cervical cancer status even if they do not get a pap smear every 3 years as recommended by doctors. For example if a woman in Nigeria has her first pap smear at the age of 40 when the recommended age to start pap smears is 21 she has gone unchecked for nearly 20 years and the early detection window is narrowed. However, if she is one of the 20% of women who get cervical cancer over the age of 65, a deep learning analysis of her pap smear at 40 could save her life and roadblock potential suffering. Early detection is key and big data optimizes early detection windows by providing a deeper analysis in the preventive care stages. From here doctors are able to implement the best care plan available on a case by case basis. ## 4. Artificial intelligence in Cardiovascular Disease AI in cardiovascular disease models are innovating disease detection by segmenting different types of analysis together for more efficient and accurate results. Being that cardiovascular diseases typically agitate/involve the heart and lungs there are numerous dynamics surrounding why a person is experiencing certain symptoms or at risk for development of a more critical diagnosis. Immense amount of labor is included in the diagnosis and treatment of individuals with cardiovascular disease on behalf of general physicians, specialists, nurses, and several other medical professionals. Artificial intelligence has the capability to add a layer of ease and accuracy that is involved in analyzing a patient’s status or risk for cardiovascular disease. AI is able to overcome the challenges of low quality pixelated images from analyzes and draw clearer and more accurate conclusions at a stage where more prevention strategies can be implemented. AI in this sense is able to analyze the systems of the human body as a whole as opposed to a doctor which might have several appointments with a patient to determine results from evaluations on lungs, heart, etc. By segmenting x-rays from numerous patients AI is able to learn and grow its data set to produce increasingly accurate and precise results[^8]. By using a combination of recurrent neural networks and convolutional neural networks artificial intelligence is able to go beyond what currently exists in terms of medical analysis and provide optimum results for patients in need. Recurrent neural networks function by building upon past data in order to create new output in series. They work hand in hand with Convolutional Neural networks which focus on analyzing advanced imagery based on qualitative data and can weigh biases on potential prescriptive outcomes. [^10] Figure 2. Demonstrates a wireframe of how data is computed to draw relevant conclusions from thousands of images and pinpoint exact predictions of diagnosis. Risk analysis is crucial for heart attack prevention and understanding how suspeectable a person is to heart failure. Being that heart attacks can lead to strokes due to loss of blood and oxygen to the brain, these imaging tools serve as an invaluable life saving mechanism to help bring prevention to the forefront of these medical emergencies. ## 5. Deep Learning Techniques for Genomics A digital patient is the idea that a patient’s health record can be compiled with live and exact biometrics for the purpose of testing. Through this method medical professionals will have the ability to propose new solutions to patients and monitor potential effects of operations or medicines over a period of time in a condensed/rapid results format. Essentially if a patient would be able to see how their body reacts to medical procedures before they are performed. The digital copy of a patient would receive simulated trial treatments to better understand what would happen over a period of time if the solution was adopted. For example, a patient would be able to verify with their physician what type of diuretics, beta inhibitors, or angiotensin receptor blocker medication would be the most effective solution to their hypertension regulatory needs[^11]. Physicians would be able to mitigate the risks and side effects associated with a certain solution given a patients expected behavior in response to what has been uploaded to the model. In order to produce deep learning results, models must be implemented by indicating genetic markers by which computational methods can traverse the genetics strands and draw relevant conclusions. In this way data can be processed to propose changes to disease carrying chains of DNA or fortify immune based responses in those who are immunocompromised[^9]. Figure 3. Illustrates how genes are analyzed through data collection methods such as EHR, personal biometric data, and family history in order to track what type of disease poses a threat and how to prevent, predict, and treat disease at the molecular level. Producing accurate methods of treatments, medications as well as predictions without having to put the patient through any trials. ## 6. Discussion In considering the numerous innovations made possible by Big Data one can expect major impacts on society as we know it. Access to these types of data solutions should be made accessible to all those who are in need. Collectively an effort must be made to promote equitable access to life saving artificial intelligence discussed in this report. Processing power and lack of resources stand as a barrier to widespread access to proper testing. However, governments and industries in the private sector must work together to avoid monopolies and price gouging limitations to such valuable data and computing models. With further investment into deep learning models error margins can be narrowed and risk percentages and be slimmed pertaining to prescriptive analysis in specific use cases. The more access to information and examples are available, the better and more advanced a deep learning system can become. With the addition of electronic health records and past analysis artificial intelligence has the power to exponentially revolutionize the healthcare industry. By providing patients with services that could save their lives there is more incentive to stay involved in personal health as computation is optimized targeting patients for more results focused visits to the doctor. Doctors themselves are able to be relieved of a portion of the workload and foster a greater work life balance through cutting down on testing time and having more time to interact with patients for educational informative appointments. Legally medical professionals will be able to use prediction errors as alternative signals to further analyze a patient and justify treatment measures. Using data visualization of potential outcomes via a specific treatment method will empower patients and doctors to choose the pathway with the most favorable outcome.Convolutional Neural Networks within deep learning is one of the if not the most essential form of algorithm for AI in healthcare. CNN allows images to be input in a way that allows for learnable weights and biases to be calculated for and differentiate and match aspects of images that would go unknown to the human eye. Through identifying the edges, shape, size, color, amount of scarring CNN is able to identify cancerous and non-cancerous cells into five categories: normal, mild, moderate, severe, and carcinoma. Accuracy in this space is above 95% and creates a new opportunity space for medical professionals to provide their patients with a high level of accuracy and timely action planning for treatment and recovery[^13]. Beyond human healthcare CNN modeling has the potential to transfer into the realm of veterinary medicine, agricultural engineering, and sustainable environment initiatives to detect invasive species and similar disease development. Dogs or cats with cancer or heart worm could be analyzed in order to determine that with their heredity/breed and life span what are the chances and timeline for disease development. Crop production could be amplified with the processing of plant genomes in combination with soil to foresee what combination will produce the most abundant and profitable harvest. Lastly, ecosystems distrubed by global warming have the capability of being studied with CNN in order to factor in changes to the environment and what solutions could be on the horizon. With enough sample collection the power of CNN has the capability of securing a brighter future for tomorrow. ## 7. Conclusion Healthcare is an essential resource to living a long life and without it we can see our lifespan slashed nearly in half or even more for those who are hindered by hereditary ailments. Healthcare has been around as long as medicine and such other treatments have been around and that was centuries ago. The field has expanded well beyond what could’ve been expected for any medical professional or institution. Where the information and resources are available to save and care for the life before them even when a lack of training can hinder them the resources are present. It’s come to be such an accomplishment to mesh the medical practices of many medical professionals and Big Data to develop the largest compendium of medical practices in the world. By the allowance of such an asset many are able to collaborate with new findings and reinforcing old findings as these prevalent results allow physicians to work without faltering over inconclusive findings. The goal for this area of Big Data is to continue making the EHR system more secure and friendly towards medical professionals in different areas of practice as well as allowing easy access for patients who seek out their own medical history. The more advancements in this area of Healthcare can be applicable to other fields that must reference the compendium that maintains individuals and their history going forward. Such a structure will continue to aid generations of physicians and patients alike and can aid technological advancements along the way. ## 8. Acknowledgements We would like to thank Dr Gregor von Laszweski for allowing us to complete this report despite the delays there was as well as the lack of communication. We would also like to thank the AI team for their commitment to assisting in this class as even through a pandemic they continued to help the students complete the course. We would also like to thank Dr. Geoffrey Fox for teaching the course and making the class as informative as possible given his experience with the field of Big Data. ## 9. References [^8] Arslan, M., Owais, M., & Mahmood, T. Artificial Intelligence-Based Diagnosis of Cardiac and Related Diseases (2020, March 23). Retrieved December 13, 2020 from https://www.mdpi.com/2077-0383/9/3/871/htm [^9] Eraslan, G., Avsec, Z., Gagneur, J., & Theis, Fabian J.. Deep learning: new computational modelling techniques for genomics. (2019, April 10). Retrieved December 14, 2020 from https://www.nature.com/articles/s41576-019-0122-6 [^10] 1Regina. AI to Detect Cancer. (2019, November 22). Retrieved December 14, 2020 from https://towardsdatascience.com/ai-for-cancer-detection-cadb583ae1c5 [^11] Koumakis, L. Deep learning models in genomics; are we there yet? (2020). Retrieved December 14, 2020 from https://www.sciencedirect.com/science/article/pii/S2001037020303068 [^12] Ross, M.K., Wei, W., & Ohno-Machado, L., ‘Big Data’ and the Electronic Health Record (2014, August 15). Retrieved 15, 2020 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287068/ [^13] P, Shanthi. B., Faruqi, F., K, Hareesha, K., & Kudva, R., Deep Convolution Neural Network for Malignancy Detection and Classification in Microscopic Uterine Cervex Cell Images (2019, November 1). Retrieved December 15, 2020 from https://pubmed.ncbi.nlm.nih.gov/31759371/ 1. Laney, D., AD. Mauro, M., Gubbi, J., Doyle-Lindrud, S., Gillum, R., Reiser, S., . . . Reardon, S. Big data in healthcare: Management, analysis and future prospects (2019, June 19). Retrieved December 10, 2020, from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0217-0 ↩︎ 2. What is an electronic health record (EHR)? (2019, September 10). Retrieved December 10, 2020 from https://www.healthit.gov/faq/what-electronic-health-record-ehr ↩︎ 3. Moriarty, A. Does Hospital EHR Adoption Actually Improve Data Sharing? (2020, October 23) Retrieved December 10, 2020 from https://blog.definitivehc.com/hospital-ehr-adoption ↩︎ 4. Cruciana, Paula A. The Implications of Big Data in Healthcare (2019, November 21) Retrieved December 11, 2020 from https://ieeexplore.ieee.org/document/8970084 ↩︎ 5. Zlabek, Jonathan A. Early cost and safety benefits of an inpatient electronic health record (2011, February 2) Retrieved December 10, 2020 from https://academic.oup.com/jamia/article/18/2/169/802487 ↩︎ 6. Artificial Intelligence-Oppurtunities in Cancer Research. (2020, August 31). Retrieved December 11, 2020 from https://www.cancer.gov/research/areas/diagnosis/artificial-intelligence ↩︎ 7. Zhang, R., Simon, G., & Yu, F. Advancing Alzheimer’s research: A review of big data promises. (2017, June 4) Retrieved December 11, 2020 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5590222/ ↩︎ # 35 - Analysis of Various Machine Learning Classification Techniques in Detecting Heart Disease As cardiovascular diseases are the number 1 cause of death in the United States, the study of the factors and early detection and treatment could improve quality of life and lifespans. From investigating how the variety of factors related to cardiovascular health relate to a general trend, it has resulted in general guidelines to reduce the risk of experiencing a cardiovascular disease. However, this is a rudimentary way of preventative care that allows for those who do not fall into these risk categories to fall through. By applying machine learning, one could develop a flexible solution to actively monitor, find trends, and flag patients at risk to be treated immediately. Solving not only the risk categories but has the potential to be expanded to annual checkup data revolutionizing health care. Status: final, Type: Project Ethan Nguyen, fa20-523-309, Edit Keywords: health, healthcare, cardiovascular disease, data analysis ## 1. Introduction Since cardiovascular diseases are the number 1 cause of death in the United States, early prevention could help in extending one’s life span and possibly quality of life 1. Since there are cases where patients do not show any signs of cardiovascular trouble until an event occurs, having an algorithm predict from their medical history would help in picking up on early warning signs a physician may overlook. Or could also reveal additional risk factors and patterns for research on prevention and treatment. In turn this would be a great tool to apply in preventive care, which is the type of healthcare policy that focuses in diagnosing and preventing health issues that would otherwise require specialized treatment or is not treatable 2. This also has the potential to trickle down and increase the quality of life and lifespan of populations at a reduced cost as catching issues early most likely results in cheaper treatments 2. This project will take a high-level overview of common, widely available classification algorithms and analyze their effectiveness for this specific use case. Notable ones include, Gaussian Naive Bayes, K-Nearest Neighbors, and Support Vector Machines. Additionally, two data sets that contain common features will be used to increase the training and test pool for evaluation. As well as to explore if additional feature types contribute to a better prediction. The goal of this project being a gateway to further research in data preprocessing, tuning, or development of specialized algorithms as well as further ideas on what data could be provided. ## 2. Datasets The range of creation dates are 1988 and 2019 respectively with different features of which 4 are common between. This does bring up a small hiccup in preprocessing to consider. Namely the possibility of changing diet and culture trends resulting in significantly different trends/patterns within the same age group. As well as possible differences in measurement accuracy. However this large gap is within the scope of the project in exploring which features can help provide an accurate prediction. This possible phenomenon may be of interest to explore closely if time allows. Whether a trend itself is even present or there is an overarching trend across different cultures and time periods. Or to consider if this difference is significant enough that the data from the various sets needs to be adjusted to normalize the ages to present day. ### 2.1 Dataset Cleaning The datasets used have already been significantly cleaned from the raw data and has been provided as a csv file. These files were then imported into the python notebook as pandas dataframes for easy manipulation. An initial check was made to ensure the integrity of the data matched the description from the source websites. Then some preprocessing was completed to normalize the common features between the datasets. These features were gender, age, and cholesterol levels. The first two adjustments were trivial in conversion however, in the case of cholesterol levels, the 2019 set is on a 1-3 scale while the 1988 dataset provided them as real measurements. A conversion of the 1988 dataset was done based on guidelines found online for the age range of the dataset 3. ### 2.2 Dataset Analysis From this point on, the 1988 dataset will be referred to as dav_set and 2019 data set will be referred to as sav_set. To provide further insight on what to expect and how a model would be applied, the population of the datasets was analysed first. As depicted in Figure 2.1 the population samples of both datasets of gender vs age show the majority of the data is centered around 60 years of age with a growing slope from 30 onwards. Figure 2.1: Age vs Gender distributions of the dav_set and sav_set. This trend appears to signify that the datasets focused solely on an older population or general trend in society of not monitoring heart conditions as closely in the younger generation. Moving on to Figure 2.2, we see an interesting trend with a significant growing trend in the sav_set in older population having more cardiovascular issues compared to the dav_set. While this cannot be seen in the dav_set. This may be caused by the additional life expectancy or a change in diet as noted in the introduction. Figure 2.2: Age vs Target distributions of the dav_set and sav_set. In Figure 2.3, the probability of having cardiovascular issues between the sets are interesting. In the dav_set the inequality of higher probability could be attributed to the larger female samples in the dataset. With the sav_set having a more equal probability between the genders. Figure 2.3: Gender vs Probability of cardiovascular issues of the dav_set and sav_set. Finally, in Figure 2.4 is the probability vs cholesterol levels. This one is very interesting between the two datasets in terms of trend levels. With the dav_set having a higher risk at normal levels compared to the sav_set. This could be another hint of a societal change across the years or may in fact be due to the low sample size. Especially since the sav_set matches the general consensus of higher cholesterol levels increasing risk of cardiovascular issues 3. Figure 2.4: Cholesterol levels vs Probability of cardiovascular issues of the dav_set and sav_set. To close out this initial analysis is the correlation map of each of the features. From Figure 2.5 and 2.6 it can be concluded that both of these datasets are viable to conduct machine learning as the correlation factor is below the recommended value of 0.8 4. Although we do see the signs of a low sample amount in the dav_set with a higher correlation factor compared to the sav_set. Figure 2.5: dav_set correlation matrix. Figure 2.6: sav_set correlation matrix. ## 3. Machine Learning Algorithms and Implementation With many machine learning algorithms already available and many more in development. Selecting the optimal one for an application can be a challenging balance since each algorithm has both its advantages and disadvantages. As mentioned in the introduction, we will explore applying the most common and established algorithms available to the public. Starting off, is selecting a library from the most popular ones available. Namely Keras, Pytorch, Tensorflow, and Scikit-Learn. Upon further investigation it was determined that Scikit-Learn would be used for this project. The reason being Scikit-Learn is a great general machine learning library that also includes pre and post processing functions. While Keras, Pytorch, and Tensorflow are targeted for neural networks and other higher-level deep learning algorithms which are outside of the scope of this project at this time 5. ### 3.1 Scikit-Learn and Algorithm Types Diving further into the Scikit-Learn library, its key strength appears to be the variety of algorithms available that are relatively easy to implement against a dataset. Of those available, they are classified under three different categories based on the approach each takes. They are as follows: • Classification • Applied to problems that require identifying the category an object belongs to. • Regression • For predicting or modeling continuous values. • Clustering • Grouping similar objects into groups. For this project, we will be investigating the Classification and Clustering algorithms offered by the library due to the nature of our dataset. Since it is a binary answer, the continuous prediction capability of regression algorithms will not fair well. Compared to classification type algorithms which are well suited for determining binary and multi-class classification on datasets 6. Along with Clustering algorithms being capable of grouping unlabeled data which is one of the key problem points mentioned in the introduction 7. ### 3.2 Classification Algorithms The following algorithms were determined to be candidates for this project based on the documentation available on the Scikit-learn for supervised learning 8. #### 3.2.1 Support Vector Machines This algorithm was chosen because classification is one of the target types and has a decent list of advantages that appear to be applicable to this dataset 6. • Effective in high dimensional spaces as well as if the number dimensions out number samples. • Is very versatile. #### 3.2.2 K-Nearest Neighbors This algorithm was selected due to being a non-parametric method that has been successful in classification applications 9. From the dataset analysis, it is appears that the decision boundary may be very irregular which is a strong point of this type of method. #### 3.2.3 Gaussian Naive Bayes Is an implementation of the Naive Bayes theorem that has been targeted for classification. The advantages of this algorithm is its speed and requires a small training set compared to more advanced algorithms 10. #### 3.2.4 Decision Trees This algorithm was chosen to investigate another non-parametric method to determine their efficacy against this dataset application. This algorithm also has some advantages over K-Nearest namely 11. • Simple to interpret and visualize • Requires little data preparation • Handles numerical and categorical data instead of needing to normalize • Can validate the model and is possible to audit from a liability standpoint. ### 3.3 Clustering Algorithms The following algorithms were determined to be candidates for this project based on the table of clustering algorithms available on the Scikit-learn 7. #### 3.3.1 K-Means The usecase for this algorithm is general purpose with even and low number of clusters 7. Of which the sav_set appears to have with the even distribution across most of the features. #### 3.3.2 Mean-shift This algorithm was chosen for its strength in dealing with uneven cluster sizes and non-flat geometry 7. Though it is not easily scalable the application of our small dataset size might be of interest. #### 3.3.3 Spectral Clustering As an inverse, this algorithm was chosen for its strength with fewer uneven clusters 7. In comparison to Mean-shift, this maybe the better algorithm for this application. ### 3.4 Implementation The implementation of these algorithms were done under the direction of the documentation page for each respective algorithm. The jupyter notebook used for this project is available at https://github.com/cybertraining-dsc/fa20-523-309/blob/main/project/data_analysis/ml_algorithms.ipynb with each algorithm having a corresponding cell. A benchmarking library is also included to determine the efficiency of each algorithm in processing time. One thing of note is the lack of functions used for the classification compared to the clustering algorithms. The justification for this discrepancy is due to inexperience in creating optimal implementations as well as determining that not being implemented in a function would not have a significant impact on performance. Additionally, graphs representing the test data were included to help visualize the performance of the clustering algorithms utilizing example code from the documentation 12. #### 3.4.1 Dataset Preprocessing Pre-processing of the cleaned datasets for the classification algorithms was done under guidance of the scikit learn documentation 13. Overall, each algorithm was trained and tested with the same split for each run. While the split data could have been passed directly to the algorithms, they were normalized further using the built-in fit_transform function for the best results possible. Pre-processing of the cleaned datasets for the clustering algorithms was done under guidance of the scikit learn documentation 7. Compared to the classification algorithms, a dimensionality reduction was conducted using Principal component analysis (PCA). This step condenses the multiple features into a 2 feature array which the clustering algorithms were optimized for, increasing the odds for the best results possible. Another note is the dataset split was conducted during execution of the algorithm. Upon further investigation, it was determined that this does not have an effect on the ending results as the randomization was disabled due to setting the same random_state parameter for each call. ## 4. Results & Discussion ### 4.1 Algorithm Metrics The metrics used to determine the viability of each of the algorithms are precision, recall, and f1-score. These are simple metrics based on the values from a confusion matrix which is a visualization of the False and True Positives and Negatives. Precision is essentially how accurate was the algorithm in classifying each data point. This however, is not a good metric to solely base performance as precision does not account for imbalanced distributions within a dataset 14. This is where the recall metric comes in which is defined as how many samples were accurately classified by the algorithm. This is a more versatile metric as it can compensate for imbalanced datasets. While it may not be in our case as seen in the dataset analysis where we have a relatively balanced ratio. It still gives great insight on the performance for our application. Finally is the f1-score which is the harmonic mean of the precision and recall metric 14. This will be the key metric we will mainly focus on as it strikes a good balance between the two more primitive metrics. Since one may think in medical applications one would want to maximize recall, it is at the cost of precision which ends up in more false predictions which is essentially an overfitting scenario 14. Something that reduces the viability of the model to the application especially since we have a relatively balanced dataset, more customized weighting is not as necessary. The metrics for each algorithm implementation are as follows. The training time metric is provided by the cloudmesh.common benchmark library 15. #### 4.1.1 Support Vector Machines Table 4.1: dav_set metrics Precision Recall f1-score No Disease 0.99 0.94 0.96 Has Disease 0.95 0.99 0.97 Training Time 0.038 sec Table 4.2: sav_set metrics Precision Recall f1-score No Disease 0.99 0.94 0.96 Has Disease 0.95 0.99 0.97 Training Time 167.897 sec #### 4.1.2 K-Nearest Neighbors Table 4.3: dav_set metrics Precision Recall f1-score No Disease 0.88 0.86 0.87 Has Disease 0.87 0.90 0.88 Training Time 0.025 sec Table 4.4: sav_set metrics Precision Recall f1-score No Disease 0.62 0.74 0.67 Has Disease 0.67 0.54 0.60 Training Time 10.116 sec #### 4.1.3 Gaussian Naive Bayes Table 4.5: dav_set metrics Precision Recall f1-score No Disease 0.88 0.81 0.84 Has Disease 0.83 0.90 0.86 Training Time 0.011 sec Table 4.6: sav_set metrics Precision Recall f1-score No Disease 0.56 0.90 0.69 Has Disease 0.72 0.28 0.40 Training Time 0.057 sec #### 4.1.4 Decision Trees Table 4.7: dav_set metrics Precision Recall f1-score No Disease 0.92 0.97 0.95 Has Disease 0.97 0.93 0.95 Training Time 0.009 sec Table 4.8: sav_set metrics Precision Recall f1-score No Disease 0.71 0.80 0.75 Has Disease 0.76 0.66 0.71 Training Time 0.272 sec #### 4.1.5 K-Means Figure 4.1: dav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.9: dav_set metrics Precision Recall f1-score No Disease 0.22 0.29 0.25 Has Disease 0.12 0.09 0.10 Training Time 0.376 sec Figure 4.2: sav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.10: sav_set metrics Precision Recall f1-score No Disease 0.51 0.69 0.59 Has Disease 0.52 0.34 0.41 Training Time 1.429 sec #### 4.1.6 Mean-shift Figure 4.3: dav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.11: dav_set metrics Precision Recall f1-score No Disease 0.47 1.00 0.64 Has Disease 0.00 0.00 0.00 Training Time 0.461 sec Figure 4.4: sav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.12: sav_set metrics Precision Recall f1-score No Disease 0.50 1.00 0.67 Has Disease 0.00 0.00 0.00 Training Time 193.93 sec #### 4.1.7 Spectral Clustering Figure 4.5: dav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.13: dav_set metrics Precision Recall f1-score No Disease 0.86 0.74 0.79 Has Disease 0.79 0.89 0.84 Training Time 0.628 sec Figure 4.6: sav_set algorithm visualization. The axis have no corresponding unit due to the PCA operation. Table 4.14: sav_set metrics Precision Recall f1-score No Disease 0.56 0.57 0.57 Has Disease 0.56 0.56 0.56 Training Time 208.822 sec ### 4.2 System Information Google Collab was used to train and evaluate the models selected. The specifications of the system in use is provided by the cloudmesh.common benchmark library and is listed in Table 4.15 15. Table 4.15: Training and Evaluation System Specifications Attribute Value BUG_REPORT_URL https://bugs.launchpad.net/ubuntu/" DISTRIB_CODENAME bionic DISTRIB_DESCRIPTION “Ubuntu 18.04.5 LTS” DISTRIB_ID Ubuntu DISTRIB_RELEASE 18.04 HOME_URL https://www.ubuntu.com/" ID ubuntu ID_LIKE debian NAME “Ubuntu” PRETTY_NAME “Ubuntu 18.04.5 LTS” PRIVACY_POLICY_URL https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" SUPPORT_URL https://help.ubuntu.com/" UBUNTU_CODENAME bionic VERSION “18.04.5 LTS (Bionic Beaver)” VERSION_CODENAME bionic VERSION_ID “18.04” cpu_count 2 mem.active 698.5 MiB mem.available 11.9 GiB mem.free 9.2 GiB mem.inactive 2.6 GiB mem.percent 6.5 % mem.total 12.7 GiB mem.used 1.6 GiB platform.version #1 SMP Thu Jul 23 08:00:38 PDT 2020 python 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0] python.pip 19.3.1 python.version 3.6.9 sys.platform linux uname.machine x86_64 uname.node bc15b46ebcf6 uname.processor x86_64 uname.release 4.19.112+ uname.system Linux uname.version #1 SMP Thu Jul 23 08:00:38 PDT 2020 user collab ### 4.3 Discussion In analyzing the resulting metrics in section 4.1, two major trends between the algorithms are apparent. 1. The classification algorithms perform significantly better than the clustering algorithms. 2. Significant signs of overfitting for the dav_set. Addressing the first point, it is obvious from the metric performance where on average the classification algorithms were higher than the clustering algorithms. At a lower training time cost as well, which indicates that classification algorithms are well suited for this application than clustering. Especially when looking at the results for Mean-Shift in section 4.1.6 where the algorithm failed to identify any patient with a disease. This also illustrates the discussion on the metrics used to determine performance as the recall was 100% at the cost of missing every patient that would have required treatment illustrated by Figure 4.3 and 4.4. On this topic, comparing the actual data graphs for each of the clustering algorithms and comparing them to the example clustering figures within the scikit documentation, it solidifies that this is not the correct algorithm type for this dataset 7. Moving on to the next point, it can be seen that overfitting is occurring for the dav_set in comparing the performance to the sav_set for the same algorithm which can be seen in the corresponding tables in sections 4.1.2, 4.1.3, and 4.1.4. Here the performance gap is at least 20% between the two compared to what one would assume should be relatively close to each other. While this could also illustrate the affect the various features have on the algorithm, it was determined that this is most likely due to the small dataset size having a larger influence than anticipated. ## 5. Conclusion Reviewing these results, a clear conclusion cannot be accurately be determined due to the considerable amount of variables involved that were not able to be isolated to a desirable level. Namely the compromises that were mentioned in section 2.1 and general dataset availability. However, it was determined that the main goal of this project was accomplished where the Support Vector Machine algorithm was narrowed down as a viable candidate for future work. Due in part to the overall f1-score performance for both datasets, providing confidence that overfitting may not occur. While there is a downside in scalability due to the significant increase in training time between the smaller dav_set and larger sav_set. This could indicate that further research should be focused on either improving this algorithm or creating a new one based on the underlying mechanism. In relation to the types of features, it could be interpreted from this project that further efforts require a more expansive and modern dataset to perform to a level suitable for real world applications. As possible factors affecting the performance are in the accuracy and granularity of the measurements and factors available to learn from. This however, is seen to be a difficult challenge due to the nature of privacy laws on health data but, as proposed in the introduction. It would be very interesting to apply this project’s findings on more general health data that is retrieved in annual visits. ## 6. Acknowledgements The author would like to thank Dr. Gregor Von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their assistance and suggestions with regard to this project. ## References 1. Centers for Disease Control and Prevention. 2020. Heart Disease Facts | Cdc.Gov. [online] Available at: https://www.cdc.gov/heartdisease/facts.htm [Accessed 16 November 2020]. ↩︎ 2. Amadeo, K., 2020. Preventive Care: How It Lowers Healthcare Costs In America. [online] The Balance. Available at: https://www.thebalance.com/preventive-care-how-it-lowers-aca-costs-3306074 [Accessed 16 November 2020]. ↩︎ 3. WebMD. 2020. Understanding Your Cholesterol Report. [online] Available at: https://www.webmd.com/cholesterol-management/understanding-your-cholesterol-report [Accessed 21 October 2020]. ↩︎ 4. R, V., 2020. Feature Selection — Correlation And P-Value. [online] Medium. Available at: https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf [Accessed 21 October 2020]. ↩︎ 5. Stack Overflow. 2020. Differences In Scikit Learn, Keras, Or Pytorch. [online] Available at: https://stackoverflow.com/questions/54527439/differences-in-scikit-learn-keras-or-pytorch [Accessed 27 October 2020]. ↩︎ 6. Scikit-learn.org. 2020. 1.4. Support Vector Machines — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/svm.html#classification [Accessed 27 October 2020]. ↩︎ 7. Scikit-learn.org. 2020. 2.3. Clustering — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/clustering.html#clustering [Accessed 27 October 2020]. ↩︎ 8. Scikit-learn.org. 2020. 1. Supervised Learning — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning [Accessed 27 October 2020]. ↩︎ 9. Scikit-learn.org. 2020. 1.6. Nearest Neighbors — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/neighbors.html [Accessed 27 October 2020]. ↩︎ 10. Scikit-learn.org. 2020. 1.9. Naive Bayes — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/naive_bayes.html [Accessed 27 October 2020]. ↩︎ 11. Scikit-learn.org. 2020. 1.10. Decision Trees — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/tree.html [Accessed 27 October 2020]. ↩︎ 12. Scikit-learn.org. 2020. A Demo Of K-Means Clustering On The Handwritten Digits Data — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py [Accessed 17 November 2020]. ↩︎ 13. Scikit-learn.org. 2020. 6.3. Preprocessing Data — Scikit-Learn 0.23.2 Documentation. [online] Available at: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing [Accessed 17 November 2020]. ↩︎ 14. Mianaee, S., 2020. 20 Popular Machine Learning Metrics. Part 1: Classification & Regression Evaluation Metrics. [online] Medium. Available at: https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce [Accessed 10 November 2020]. ↩︎ 15. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, https://github.com/cloudmesh/cloudmesh-common ↩︎ # 36 - Predicting Hotel Reservation Cancellation Rates As a result of the Covid-19 pandemic all segments of the travel industry face financial struggle. The lodging segment, in particular, has had the financial records scrutinized revealing a glaring problem. Since the beginning of 2019, the lodging segment has seen reservation cancellation rates near 40%. At the directive of business and marketing experts, hotels have previously attempted to solve the problem through an increased focus on reservation retention, flexible booking policies, and targeted marketing. These attempts did not produce results, and continue to leave rooms un-rented which is detrimental to the bottom line. This document will explain the creation and testing of a novel process to combat the rising cancellation rate. By analyzing reservation data from a nationwide hotel chain, it is hoped that an algorithm may be developed capable of predicting the likeliness that a traveler is to cancel a reservation. The resulting algorithm will be evaluated for accuracy. If the resulting algorithm has a satisfactory accuracy, it would make clear to the hotel industry that the use of big data is key to solving this problem. Status: final, Type: Project Anthony Tugman, fa20-523-323, Edit Keywords: travel, finance, hospitality, tourism, data analysis, environment, big data ## 1. Introduction Big Data is a term that describes the large volume of data that is collected and stored by a business on a day-to-day basis. While the scale of this data is impressive, what is more interesting is what can be done by analyzing the data 1. Becoming more commonplace, companies are beginning to use Big Data to gain advantage in competing, innovating, and capturing customers. It is necessary for businesses to collaborate with Data Scientists to expose patterns and other insights that can be gained from inspection of the data. This collaboration is amongst software engineers, hardware engineers, and Data Scientists who develop powerful machine learning algorithms to efficiently analyze the data. Businesses have multiple benefits to gain from effectively utilizing Big Data including cost savings, time reductions, market condition insights, advertising insights, as well as driving innovation and product development 1. The lodging industry takes a hit to the bottom line each year as the rate of room reservation cancellations continues to rise. In the instance that a guest cancels a room without adequate notice there are additional expenses the hotel faces to re-market the room as available. If the room is unable to be rented, the hotel loses the revenue 2. At the directive of business and marketing experts, hotels have attempted to solve this problem through an increased focus on reservation retention, flexible booking policies, and targeted marketing campaigns. However, even with these efforts, the reservation cancellation rate continues to rise unchecked reaching upwards of 40% in 2019 2. By analyzing the Big Data the lodging industry collects on its customers, it would be possible to form a model capable of predicting whether a customer is likely to cancel a reservation. To develop this model, a large data set was sourced that tracked anonymized reservation information for a chain of hotels over a three-year period. Machine learning techniques will be applied to the data set to form a model capable of producing the aforementioned predictions. If the model proves to be statistically significant, recommendations will be made to the lodging industry as to how the results should be interpreted. ## 2. Background Research and Previous Work Room reservation completion rates are an important consideration for revenue management in the hotel industry. Unsurprisingly, there have been previous attempts at creating an algorithm capable of predicting room cancellation rates. However, these attempts seem to have taken a more general approach such as predicting cancellations by particular ranges of dates 3. Other attempts make broad claims about the accuracy of the resulting algorithm without reliably proving so through statistical analysis 4. Most importantly for the scope of this course, these previous attempts use subsets of the full data set available. This has the potential to lead to unpredictability in the performance of the algorithm. To differentiate from and improve on the attempts previously mentioned, the algorithm produced as a result of the current effort will form a model capable of predicting if a reservation with cancel based on the reservation as a whole, rather than a subset of dates. Not predict the cancellation rate over general stipulations, but rather for a specific customer. Additionally, a special focus will be on proving that the created algorithm is statistically significant and can accurately be extrapolated to larger subsets of the data. Finally, for initial training and testing, larger subsets of the data sets will be utilized due to the increased processing power available from Google Colab. It is important to note that the first-mentioned previous work 3 appears to use a proprietary data set while the second-mentioned work 4 appears to utilize the same data set that this study will as well. ## 3. DataSets Locating a data set appropriate for this task proved to be challenging. Privacy concerns and the desire to keep internal corporate information confidential adds difficulty in locating the necessary data. Ultimately, the following data repository was selected to train and test the reservation cancellation, prediction model: The Hotel Booking Demand data set contains approximately 120,000 unique entries with 32 describing attributes. This data comes from a popular data science website, and previous analysis has occurred. This data set was featured as part of a small contest on the hosting website where the goal was to predict the likelihood of a guest making a reservation based on certain attributes while this study will instead attempt to predict the likelihood that a reservation is canceled. The 32 individual attributes of the data will be evaluated for their weight on the outcome of cancellation. Unlike previously mentioned studies, the attribute describing how the booking was made (in person, online) will be utilized. Researchers believe that the increase in reservations booked through online third-party platforms is contributing to the increase in cancellation 6. Considering this attribute may have a significant impact on the overall accuracy of the developed predictive algorithm. Finally, the data set provides information on reservations over the span of a three-year period. During the training and testing phase, an appropriate amount of data will be used from each year to account for trends in cancellations that may have occurred over time. Attribute Description hotel hotel type (resort or city) is_canceled reservation canceled (true/false) lead_time number of days between booking and arrival arrival_date_year year of arrival date arrival_date_month month of arrival date arrival_date_week_number week number of year for arrival date arrival_date_day_of_month day of arrival date stays_in_weekend_nights number of weekend nights (Sat. and Sun.) booked stays_in_week_nights number of week nights (Mon. to Fri.) booked adults number of adults children number of children babies number of babies meal meal booked (multiple categories) country country of origin market_segment market segment (multiple categories) distribution_channel booking distribution channel (multiple categories) is_repeated_guest is repeat guest (true/false) previous_cancellations number of times customer has canceled previously previous_bookings_not_canceled number of times customer has completed a reservation reserved_room_type reserved room type assigned_room_type assigned room type booking_changes number of changes made to reservation from booking to arrival deposit_type deposit made (true/false) agent ID of the travel agent company ID of the company days_in_waiting_list how many days customer took to confirm reservation customer_type type of customer (multiple categories) adr average daily rate required_car_parking_space number of parking spaces required for reservation total_of_special_requests number of special requests made reservation_status status of reservation reservation_status_date date status was last updated ## 4. Data Preprocessing The raw data set 5 is imported directly from Kaggle into a Google Colab notebook 7. Data manipulation is handled using Pandas. Pandas was specifically written for data manipulation and analysis, and will make for a simple process preprocessing the data. The raw data set must be prepared before a model can be developed from it. Before preprocessing the data: • shape: (119390, 32) • duplicate entries: 31,994 The features of the data set, the categories other than what is being predicted, have varying levels of importance to the predictive model. By inspection, the following features are removed: • country: in a format unsuitable for predictive model, unable to convert • agent: the ID number of the booking agent will not affect reservation outcome • babies: no reservation had babies • children: no reservation had children • company: the ID number of the booking company will not affect reservation outcome • reservation_status_date: intermediate status of reservation is irrelevant In addition, duplicates are removed. In the case of the features ‘reserved_room_type’ and ‘assigned_room_type’ what is of interest is if the guest was given the requested room type. As the room code system has been anonymized and is proprietary to the brand, it is impossible to make inferences other than this. To simplify the number of features, a Boolean comparison is performed on ‘reserved_room_type’ and ‘assigned_room_type’. ‘reserved_room_type’ and ‘assigned_room_type’ are deleted while the Boolean results are converted to integer values then placed in a new feature category, ‘room_correct’. As it stands, multiple data entries across various features are strings. In order to build a predictive model, the data entries are converted to integers. #Convert to numerical values df = df.replace(['City Hotel', 'HB', 'Online TA', 'TA/TO', 'No Deposit', 'Transient', 'Check-Out'], '0') df = df.replace(['Resort Hotel', 'January', 'BB', 'Ofline TA/TO', 'GDS', 'Non Refund', 'Transient-Party', Canceled'], '1') df = df.replace(['February', 'SC', 'Groups', 'Refundable', 'Group', 'No-Show'], '2') df = df.replace(['March', 'FB', 'Direct', 'Contract'], '3') df = df.replace(['April', 'Undefined', 'Corporate'], '4') df = df.replace(['May', 'Complementary'], '5') df = df.replace(['June', 'Aviation'], '6') df = df.replace(['July'], '7') df = df.replace(['August'], '8') df = df.replace(['September'], '9') df = df.replace(['October'], '10') df = df.replace(['November'], '11') df = df.replace(['December'], '12')  After preprocessing the data: • shape: (84938, 25) • duplicate entries: 0 Figure 1: Snapshot of Data Set after Preprocessing ## 5. Model Creation To form the predictive model a Random Forest Classifier will be used. With the use of the sklearn package, the Random Forest Classifier is simple to implement. The Random Forest Classifier is based on the concept of a decision tree. A decision tree is a series of yes/no question asked about the data which eventually leads to a predicted class or value 8. To start the creation of the model, the data must first be split into a training and testing set. Typically, this is a ratio that must be adjusted to determine which will result in the higher accuracy. Here are the accuracy outcomes for various ratios: Train/Test Ratio Accuracy 80/20 77.64% 70/30 77.66% 60/40 79.56% 50/50 77.92% 40/60 75.73% 30/70 74.71% 20/80 73.13% The train/test ratio of 60/40 had the best initial accuracy of 79.56% so this ratio will be used in the creation of the final model. For the initial test, all remaining features will be used to train the reservation cancellation outcome. To determine the accuracy of the resulting model, the number of predicted cancellations is compared to the number of actual cancellations. As the model stands, the accuracy is at 79.56%. This model relies on 23 features for the prediction. With this many features it is possible that some features have no effect on the cancellation outcome. It is also possible that some features are so closely related that calculating each individually hinders performance while having little effect on outcome. To evaluate the importance of features, Pearson’s correlation coefficent can be used. Correlation coefficients are used in statistics to measure how strong the relationship is between two variables. The calculation formula returns a value between -1 and 1 where a value of 1 indicates a strong positive relationship, -1 indicates a strong negative relationship, and 0 indicates that there is no relationship at all 9. Figure 2 shows the correlation between the remaining features. Figure 2: Pearson’s Correlation Graph of Remaining Features In Figure 2 it is straightforward to identify the correlation between the target variable ‘is_canceled’ and the remaining features. It does not appear than any variable has a strong positive or negative correlation, returning a value close to positive or negative 1. There does however appear to be a dominant correlation between ‘is_canceled’ and three features: ‘lead_time’, ‘adr’, and ‘room_correct’. The train/test ratio is again 60/40 and the baseline accuracy of the model is 79.56%. The remaining features ‘lead_time’, ‘adr’, and ‘room_correct’ are used to develop the new model. Accuracy is again determined by comparing the number of predicted cancellations to the number of actual cancellations. The updated model has an accuracy of 85.18%. Figure 3 shows the correlation between the remaining features. It is important to note that the relationship between the remaining features and target value does not appear to be strong, however there is a correlation nonetheless. Figure 3: Pearson’s Correlation Graph of Updated Remaining Features As a final visualization, Figure 4 shows a comparison between the predicted and actual cancellation instances. The graph reveals an interesting pattern, the model is over predicting early in the data set and under predicting as it proceeds through the data set. Further inspection and manipulation of the Random Forest parameters were unable to eliminate this pattern. Figure 4: Model Results Predicted vs. Actual ## 6. Benchmark To measure program performance in the Google Colab notebook, Cloudmesh Common 10 was used to create a benchmark. In this instance, performance was measured for overall code execution, data loading, preparation of the data, the creation of model one, and the creation of model two. The most important increase in performance was between the creation of models one and two. With 23 features, model one took 8.161 seconds to train while model 2, with 3 features, took 7.01 seconds to train. By reducing the number of features between the two models there is a 5.62% increase in accuracy and a 14.10% decrease in processing time. Figure 5 provides more insight into the parameters the benchmark tracked and returned. Additionally, the table provides an analysis of computation time: Train/Test Ratio Accuracy 80/20 77.64% 70/30 77.66% 60/40 79.56% 50/50 77.92% 40/60 75.73% 30/70 74.71% 20/80 73.13% Figure 5: Cloudmesh Benchmark Results ## 7. Conclusion From the results of the updated predictive model, it is apparent that the lodging industry should invest focus into the Big Data they keep on their customers. As each hotel chain is unique, it would be necessary for each to develop their own predictive model however it has been demonstrated that such a model would be effective in reducing the number of rooms going unoccupied from reservation cancellation. As the model is predicting at 85% accuracy, this is a 35% increase in the amount of reservation cancellations that can be accounted for over the current predictive techniques. To prevent further damage from reservation cancellations the hotel would have theoretically been able to overbook room reservations by 35% or less as they anticipated cancellations. ## 8. References 1. “Big Data - Definition, Importance, Examples & Tools”, RDA, 2020. [Online]. Available: https://www.rd-alliance.org/group/big-data-ig-data-development-ig/wiki/big-data-definition-importance-examples-tools#:~:text=Big%20data%20is%20a%20term,day%2Dto%2Dday%20basis.&text=It's%20what%20organizations%20do%20with,decisions%20and%20strategic%20business%20moves. [Accessed: 12- Nov- 2020]. ↩︎ 2. “Predicting Hotel Booking Cancellations Using Machine Learning - Step by Step Guide with Real Data and Python”, Linkedin.com, 2020. [Online]. Available: https://www.linkedin.com/pulse/u-hotel-booking-cancellations-using-machine-learning-manuel-banza/. [Accessed: 08- Nov- 2020]. ↩︎ 3. “(PDF) Predicting Hotel Booking Cancellation to Decrease Uncertainty and Increase Revenue”, ResearchGate, 2020. [Online]. Available: https://www.researchgate.net/publication/310504011_Predicting_Hotel_Booking_Cancellation_to_Decrease_Uncertainty_and_Increase_Revenue. [Accessed: 08- Nov- 2020. ↩︎ 4. “Predicting Hotel Cancellations with Machine Learning”, Medium, 2020. [Online]. Available: https://towardsdatascience.com/predicting-hotel-cancellations-with-machine-learning-fa669f93e794. [Accessed: 08- Nov- 2020]. ↩︎ 5. “Hotel booking demand”, Kaggle.com, 2020. [Online]. Available: https://www.kaggle.com/jessemostipak/hotel-booking-demand. [Accessed: 08- Nov- 2020]. ↩︎ 6. “Global Cancellation Rate of Hotel Reservations Reaches 40% on Average”, Hospitality Technology, 2020. [Online]. Available: https://hospitalitytech.com/global-cancellation-rate-hotel-reservations-reaches-40-average. [Accessed: 08- Nov- 2020]. ↩︎ 7. “An Implementation and Explanation of the Random Forest in Python”, Medium, 2020. [Online]. Available: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76. [Accessed: 12- Nov- 2020]. ↩︎ 8. “Correlation Coefficient: Simple Definition, Formula, Easy Calculation Steps”, Statistics How To, 2020. [Online]. Available: https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/. [Accessed: 12- Nov- 2020]. ↩︎ 9. Gregor von Laszewski, Cloudmesh StopWatch and Benchmark from the Cloudmesh Common Library, https://github.com/cloudmesh/cloudmesh-common ↩︎ # 37 - Analysis of Future of Buffalo Breeds and Milk Production Growth in India Water buffalo (Bubalus bubalis) is also called Domestic Water Buffalo or Asian Water Buffalo. It is large bovid originating in Indian subcontinent, Southeast Asia, and China and today found in other regions of world - Europe, Australia, North America, South America and some African countries. There are two extant types recognized based on morphological and behavioral criteria: 1. River Buffalo - Mostly found in Indian subcontinent and further west to the Balkans, Egypt, and Italy. 2. Swamp Buffalo - Found from west of Assam through Southeast Asia to the Yangtze valley of China in the east. India is the largest milk producer and consumer compared to other countries in the world and stands unique in terms of the largest share of milk being produced coming from buffaloes. The aim of this academic project is to study the livestock census data of buffalo breeds in India and their milk production using Empirical Benchmarking analysis method at state level. Looking at the small sample of data, our analysis indicates that we have been seeing increasing trends in past few years in livestock and milk production but there are considerable opportunities to increase production using combined interventions. Status: final, Type: Project Gangaprasad Shahapurkar, fa20-523-326, Edit Python Notebook Keywords: hid 326, i532, buffalo, milk production, livestock, benchmarking, in-milk yield, agriculture, india, analysis ## 1. Introduction Indian agriculture sector has been playing a vital role in overall contribution to Indian economy. Most of the rural community in the nation still make their livelihood on dairy farming or agriculture farming. Dairy farming itself has been on its progressive stage from past few years and it is contributing to almost more than 25% of agriculture Gross Domestic Product (GDP) 1. Livestock rearing has been integral part of the nation’s rural community and this sector is leveraging the economy in a big way considering the growth seen. It not only provides food, income, employment but also plays major role in life of farmers. It also does other contributions to the overall rural development of the nation. The output of livestock rearing such as milk, egg, meat, and wool provides everyday income to the farmers on daily basis, it provides nutrition to consumers and indirectly it helps in contributing to the overall economy and socio-economic development of the country. The world buffalo population is estimated at 185.29 million, spread in some 42 countries, of which 179.75 million (97%) are in Asia (Source: Fao.org/stat 2008). Hence major share of buffalo milk production in world comes from Asia (see Figure 1). India has 105.1 million and they comprise approximately 56.7 percent of the total world buffalo population. During the last 10 years, the world buffalo population increased by approximately 1.49% annually, by 1.53% in India, 1.45% in Asia and 2.67% in the rest of the world. Figure 1 shows worldwide share of milk production from buffalo breed. Figure 2 highlights percentage contribution of Asia including other top 2 contributors to world milk production. Figure 1: Production of Milk, whole fresh buffalo in World + (Total), Average 2013 - 2018 2 Figure 2: Production share of Milk, whole fresh buffalo by region, Average 2013 - 2018 2 ## 2. Background Research and Previous Work Production of milk and meat from buffaloes in Asian countries over the last decades has shown a varying pattern: in countries such as India, Sri Lanka, Pakistan and China. Buffaloes are known to be better at converting poor-quality roughage into milk and meat. They are reported to have 5% higher digestibility of crude fibre than high-yielding cows; and a 4% to 5% higher efficiency of utilization of metabolic energy for milk production 3, 4. After studying literatures and researches it was noticed that there has been some research around to quantify livestock yield gaps. There is no standard methodology, but multiple methods were combined for research. Researchers were able to calculate relative yield gaps for the dairy production in India and Ethiopia 5. There was analysis based on attainable yields using Empirical Benchmarking, and Stochastic Frontier Analysis to evaluate possible interventions for increasing production (household modelling). It was noticed that large yield gaps exist for dairy production in both countries, and packages of interventions are required to bridge these gaps rather than single interventions. Part of the research was borrowed to analyze the limited dataset chosen as part of this project. ## 3. Choice of Datasets Number of online literatures and datasets were checked to find out suitable dataset required for this project analysis. Below dataset were found promising: 1. DAHD Data 6, 7, 8, 9 (20th India Livestock Census Data) The Animal Husbandry Statistics Division of the Department of Animal Husbandry & Dairying Division (DAHD) is responsible for the generation of animal husbandry statistics through the schemes of livestock census and integrated sample surveys 1. Survey is defined by Indian Agriculture Statistics Research Institute (IASRI) 10. This is the only scheme through which considerable data, particularly on the production estimate of major livestock products, is being generated for policy formulation in the livestock sector. It is mandate for this division to • Conduct quinquennial livestock census • Conduct annual sample survey through integrated sample survey • Publish annual production estimates of milk, eggs, meat, wool and other related animal husbandry statistics based on survey 1. FAO Data 11, 2 Food and Agriculture Organization (FAO) of United Nation publishes worldwide data on the aspects of dairy farming which can also be visualized online with the options provided. Some of the data from this source was used to extract useful summary needed in analysis. 1. [UIDAI Data] (https://uidai.gov.in) 12 Unique Identification Authority of India (UIDAI) was created with the objective to issue Unique Identification numbers (UID), named as Aadhaar, to all residents of India. Projected population data of 2020 was extracted from this source. In addition to above, other demographics information such as area of each state, district count was extracted from OpenStreetMap 13. Agricultural zone information was extracted from report of Food and Nutrition Security Analysis, India, 2019 14. ## 4. Methodology ### 4.1 Software Components This project has been implemented in Python 3.7 version. Jupyter Notebook application was used to develop the code and produce a notebook document. Jupyter notebook is a Client-Server architecture-based application which allows modification and execution of code through web browser. Jupyter notebook can be installed locally and accessed through localhost browser or it can be installed on a remote machine and accessed via internet 15, 16. Following python libraries were used in overall code development. Before running the code, one must make sure that these libraries are installed. • Pandas This is a high performance and easy to use library. It was used for data cleaning, data analysis, & data preparation. • NumPy NumPy is python core library used for scientific computing. Some of the basic functions were used in this project. • Matplotlib This is a comprehensive library used for static, animated and interactive visualization. • OS This is another standard library of Python which provides miscellaneous operating system interface functions. • Scikit-learn (Sklearn) Robust library that provides efficient tools for machine learning and statistical modelling. • Seaborn Python data visualization library based on matplotlib. ### 4.2 Data Processing The raw data retrieved from various sources was in excel or report format. The data was pre-processed and stored back in csv format for the purpose of this project and to easily process it. This dataset was further processed through various stages via EDA, feature engineering and modelling. #### 4.2.1 EDA Preprocessed dataset selected for this analysis contained information at the state level. There were two seperate pre-processed dataset used. Below was the nature of the attributes in the main dataset which had buffalo information: • State Name: Name of each state in India. One record for each state. • Buffalo count: Total number of male and female buffaloes. Data recorded for 14 types of buffalo breeds. One attribute for each type of female and male breeds. • In Milk animals: Number of In-Milk animals per state (figures in 000 no’s) recorded each year from 2013 to 2019. One attribute per year • Yield per In-Milk animal: Yield per In-Milk animals per state (figures in kg/day) recorded each year from 2013 to 2019. One attribute per year. • Milk production: - Milk production per state (figures in 000 tones) recorded each year from 2013 to 2019. One attribute per year. Below was the nature of the attributes in the secondary dataset which had demographic information: • Geographic information: features captured at state level where each feature represented - projected population of 2020, total districts in each state, total villages in each state, official area of each state in square kilometer. • Climatic information: One of attribute for each zone highlighted in Table 1 Table 1: Agro climatic regions in India Zones Agro-Climatic Regions States Zone 1 Western Himalayan Region Jammu and Kashmir, Himachal Pradesh, Uttarakhand Zone 2 Eastern Himalayan Region Assam, Sikkim, West Bengal, Manipur, Mizoram, Andhra Pradesh, Meghalaya, Tripura Zone 3 Lower Gangetic Plains Region West Bengal Zone 4 Middle Gangetic Plains Region Uttar Pradesh, Bihar, Jharkhand Zone 5 Upper Gangetic Plains Region Uttar Pradesh Zone 6 Trans-Gangetic Plains Region Punjab, Haryana, Delhi and Rajasthan Zone 7 Eastern Plateau and Hills Region Maharashtra, Chhattisgarh, Jharkhand, Orissa and West Bengal Zone 8 Central Plateau and Hills Region Madhya Pradesh, Rajasthan, Uttar Pradesh, Chhattisgarh Zone 9 Western Plateau and Hills Region Maharashtra, Madhya Pradesh, Chhattisgarh and Rajasthan Zone 10 Southern Plateau and Hills Region Andhra Pradesh, Karnataka, Tamil Nadu, Telangana, Chhattisgarh Zone 11 East Coast Plains and Hills Region Orissa, Andhra Pradesh, Tamil Nadu and Pondicherry Zone 12 West Coast Plains and Ghat Region Tamil Nadu, Kerala, Goa, Karnataka, Maharashtra, Gujarat Zone 13 Gujarat Plains and Hills Region Gujarat, Madhya Pradesh, Rajasthan, Maharashtra Zone 14 Western Dry Region Rajasthan Zone 15 The Islands Region Andaman and Nicobar, Lakshadweep Figure 3 shows top 10 states from livestock census having total number of buffalo counts. Uttar Pradesh was the state which reported a greater number of buffaloes compared to any other states in the country. Figure 3: Top 10 state by buffalo counts There were greater number of female buffaloes reported in country (80%) compared to male buffalo breeds. Figure 4: Buffalo breeds ratio by male and female #### 4.2.2 Feature Engineering Murrah buffalo shown in Figure 5 is the most productive and globally famous breed 17, 18. This breed is resistant to diseases and can adjust to various Indian climate conditions. Figure 5: Murrah buffalo (Bubalus bubalis), globally famous local breed of Haryana, were exported to many nations 19, 20 In feature engineering multiple attributes were derived needed for modelling or during analysis. Table 2 shows percentage share of Murrah buffalo breed in top 10 states having highest number of total buffaloes. Though Uttar Pradesh was top state in India in terms of total number of buffaloes but percentage share of Murrah buffalo was more in state of Punjab. Table 2: Murrah buffalo percent share in top 10 state with buffalo count State Name Murrah Buffalo Count Total Buffalo Count % Murrah Breed UTTAR PRADESH 20110852 30625334 65.67 RAJASTHAN 6448563 12976095 49.70 ANDHRA PRADESH 5227270 10622790 49.21 HARYANA 5011145 6085312 82.35 PUNJAB 4116508 5159734 79.78 BIHAR 2419952 7567233 31.98 MADHYA PRADESH 1446078 8187989 17.66 MAHARASHTRA 986981 5594392 17.64 TAMIL NADU 435634 780431 55.82 UTTARAKHAND 378917 987775 38.36 Survey dataset had three primary attributes reported at the state level. Data reported from 2013 to 2019 for in-milk animals, yield per in-milk animals and milk production per state were averaged for analysis purpose. Total number of buffaloes per breed type were calculated from the data provided in the dataset. Following list of breeds were identified from dataset 21. • Banni • Bhadawari • Chilika • Jaffarabadi • Kalahandi • Marathwadi • Mehsana • Murrah • Nagpuri • Nili Ravi • Non-Descript • Pandharpuri • Surti • Toda Data showed that Uttar Pradesh had highest average milk production with in the top 10 states whereas Punjab state had highest average yield per in-milk animals. Figure 6 shows the share of top 3 breeds in both the state. Common attribute seen between two states was highest number of Murrah breed buffaloes compared to other breeds. Figure 6: Top 3 types of buffalo breeds of Uttar Pradesh and Punjab ### 4.3 Modelling #### 4.3.1 Data Preperation Data from main dataset and supplementary dataset which was demographics data was merged into one dataset. This dataset was not labelled and it was small dataset. We considered average milk production as our target label for analysis. The rest of the features were divided based on categorical and numerical nature. Our dataset did not had any categorical features except state name which was used as index column so no futher processing was considered for this attribute. All the features were of numerical nature and all the data points were not on same scale. Hence datapoints were normalized for further processing. #### 4.3.2 Empirical Benchmarking Model There are two dominant approach of economic modelling to estimate the production behavior - Empirical Benchmarking and Stochastic Frontier Analysis 22, 23. Empirical Benchmarking is simple modelling method, and it is one of the two dominant approach. This method was used to analyze past 6 years of data points available in the livestock dataset. In this approach milk production data of past 6 years was averaged. Top 10 states with most milk production reported were compared with average of the whole sample. The problem analyzed as part of this project was relatively small. The comparison did not consider all possible characteristics for modelling. Correlation of target variable with various demographics features available in the dataset was calculated. Table 3 and Table 4 shows the positive and negative coorelation with target variable average milk production respectively. We noticed from Table 3 that average milk production contribution was getting affected by number of in-milk animals reported in the particular year census data and their was also factor of climatic conditions affecting the milk production. India is divided into 15 Agro climate zones. Agro zone 6 is Trans-Gangetic Plains region. Indian states Chandigarh, Delhi, Haryana, Punjab, Rajasthan (some parts) falls under this zone. Agricultural development has shown phenomenal growth in overall productivity and in providing better environment for dairy farming in this zone 24. Table 3: Positive coorelation of target with demographics features Feature % avg_milk_production 1.00 avg_in_milk 0.95 total_female 0.90 total_male 0.71 agro_climatic_zone6 0.50 Table 4: Negative coorelation of target with demographics features Feature % official_area_sqkm -0.17 agro_climatic_zone2 -0.19 avg_yield_in_milk -0.23 district_count -0.34 proj_population_2020 -0.94 #### 4.3.3 Linear Regression Our target variable considered was a continous variable. In an attempt to perform dimension reduction, Principal Component Analysis (PCA) was applied to available limited dataset. In the example presented, an pipeline was constructed that had dimension reduction followed by Linear regression classifier. Grid search cross validation was applied (GridSearchCV) to find the best parameters and score. Linear regression was applied with default parameter settings whereas parameter range was passed for PCA. Below is snapshot of pipeline implemented. # Define a pipeline to search for the best combination of PCA truncation # and classifier regularization. pca = PCA() # Linear Regression without parameters linear = LinearRegression() full_pipeline_with_predictor = Pipeline([ ("preparation", num_pipeline), ("pca",pca), ("linear", linear) ]) # Parameters of pipelines: param_grid = { 'pca__n_components': [5, 15, 30, 45, 64] }  ## 5. Results Based on simple Empirical Benchmarking Analysis and trends noticed in data it appears that it is possible to increase production past currently attainable yields (see Figure 7). The current scale of the yield does indicate that, leading states have best breeds of buffaloes. Different methods of analyzing yield gaps can be combined to give estimates of attainable yields. It will also help to evaluate possible interventions to increase production and profits. Figure 7: Average milk production in top 10 state with benchmark The best parameter came out of the pipeline implemented are highlighted here. The results were not promising due the limited dataset so further experimentation was not attempted, but one thing was noticed that cross validated score came out to 0.28 and dimension reduction to 5 components. Best parameter (CV score=0.282): {'pca__n_components': 5} PCA(copy=True, iterated_power='auto', n_components=None, random_state=None, svd_solver='auto', tol=0.0, whiten=False)  Figure 8: Covariance Heat Map We were able to calculate correlation of census data with other socioeconomic factors like population information, climate information (see Figure 8). The biggest probable limitation here was availability of good quality data. It would have been possible to conduct the analysis at finer level if more granular level data would have been available. Our analysis had to be done state level rather than at district level or specific area. ## 6. Conclusion The analysis done above with the limited dataset showed that there are considerable gaps in the average yield per in-milk buffalo of state Punjab and Uttar Pradesh, compared to other states in top 10 list. These states have larger share of Murrah breed buffaloes. Based on the data trends it appears that it is possible to increase the production past current attenable numbers. However, this would need to combine different methods and multiple strategies. ## 7. Acknowledgements The author would like to thank Dr. Gregor von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article. ## 8. References 1. PIB Delhi. (2019). Department of Animal Husbandry & Dairying releases 20th Livestock Census, 16 (Oct 2019). https://pib.gov.in/PressReleasePage.aspx?PRID=1588304 ↩︎ 2. FAO - Food and Agriculture Organization of United Nation, Accessed: Nov. 2020, http://www.fao.org/faostat/en/#data ↩︎ 3. Mudgal, V.D. (1988). “Proc. of the Second World Buffalo Congress, New Delhi, India”, 12 to 17 Dec.:454. ↩︎ 4. Alessandro, Nardone. (2010). “Buffalo Production and Research”. Italian Journal of Animal Science. 5. 10.4081/ijas.2006.203. ↩︎ 5. Mayberry, Dianne (07/2017). Yield gap analyses to estimate attainable bovine milk yields and evaluate options to increase production in Ethiopia and India. Agricultural systems (0308-521X), 155 , p. 43. ↩︎ 6. Department of Animal Husbandry and Dairying. http://dahd.nic.in/about-us/divisions/statistics ↩︎ 7. Department of Animal Husbandry and Dairying, Accessed: Oct. 2020, http://dadf.gov.in/sites/default/filess/20th%20Livestock%20census-2019%20All%20India%20Report.pdf ↩︎ 8. Department of Animal Husbandry and Dairying, Accessed: Oct. 2020, http://dadf.gov.in/sites/default/filess/Village%20and%20Ward%20Level%20Data%20%5BMale%20%26%20Female%5D.xlsx ↩︎ 9. Department of Animal Husbandry and Dairying, Accessed: Oct. 2020, http://dadf.gov.in/sites/default/filess/District-wise%20buffalo%20population%202019_0.pdf ↩︎ 10. IASRI - Indian Agriculture Statistics Research Institute. https://iasri.icar.gov.in/ ↩︎ 11. F.A.O. (2008). Food and Agriculture Organization. Rome Italy. STAT http://database.www.fao.org ↩︎ 12. Unique Identification Authority of India, Accessed: Nov. 2020, https://uidai.gov.in/images/state-wise-aadhaar-saturation.pdf ↩︎ 13. OpenStreetMap, Accessed: Nov. 2020, https://wiki.openstreetmap.org/wiki/Main_Page ↩︎ 14. Food and Nutrition Security Analysis, India, 2019, Accessed: Nov. 2020, http://mospi.nic.in/sites/default/files/publication_reports/document%281%29.pdf ↩︎ 15. Corey Schafer. Jupyter Notebook Tutorial: Introduction, Setup, and Walkthrough. (Sep. 22, 2016). Accessed: Nov. 07, 2020. [Online Video]. Available: https://www.youtube.com/watch?v=HW29067qVWk ↩︎ 16. The Jupyter Notebook. Jupyter Team. Accessed: Nov. 07, 2020. [Online]. Available: https://jupyter-notebook.readthedocs.io/en/stable/notebook.html ↩︎ 17. Bharathi Dairy Farm. http://www.bharathidairyfarm.com/about-murrah.php ↩︎ 18. Water Buffalo. Accessed: Oct 26, 2020. [Online]. Available: https://en.wikipedia.org/wiki/Water_buffalo ↩︎ 19. Kleomarlo. Own work, CC BY-SA 3.0. [Online]. Available: https://commons.wikimedia.org/w/index.php?curid=4349862 ↩︎ 20. ICAR - Central Institute for Research on Buffaloes. https://cirb.res.in/ ↩︎ 21. List of Water Buffalo Breeds. Accessed: Oct 26, 2020. [Online]. Available: https://en.wikipedia.org/wiki/List_of_water_buffalo_breeds ↩︎ 22. Bogetoft P., Otto L. (2011) Stochastic Frontier Analysis SFA. In: Benchmarking with DEA, SFA, and R. International Series in Operations Research & Management Science, vol 157. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-7961-2_7 ↩︎ 23. Aigner, Dennis (07/1977). “Formulation and estimation of stochastic frontier production function models”. Journal of econometrics (0304-4076), 6 (1), p. 21. https://doi.org/10.1016/0304-4076(77)90052-5 ↩︎ 24. Farm Mechanization-Department of Agriculture and Cooperation, India, Accessed: Nov. 2020, http://farmech.gov.in/06035-04-ACZ6-15052006.pdf ↩︎ # 38 - Music Mood Classification Music analysis on an individual level is incredibly subjective. A particular song can leave polarizing impressions on the emotions of its listener. One person may find a sense of calm in a piece, while another feels energy. In this study we examine the audio and lyrical features of popular songs in order to find relationships in a song’s lyrics, audio features, and its valence. We take advantage of the audio data provided by Spotify for each song in their massive library, as well as lyrical data from popular music news and lyrics site, Genius. Status: final, Type: Project Kunaal Shah, fa20-523-341, Edit Keywords: music, mood classification, audio, audio content analysis, lyrics, lyrical analysis, big data, spotify, emotion ## 1. Introduction The overall mood of a musical piece is generally very difficult to decipher due to the highly subjective nature of music. One person might think a song is energetic and happy, while another may think it is quite sad. This can be attributed to varying interpretations of tone and lyrics in song between different listeners. In this project we study both the audio and lyrical patterns of a song through machine learning and natural language processing (NLP) to find a relationship between the song’s lyrics and its valence, or its overall positivity. Previous studies take three different ways in classifying the mood of a song according to various mood models by analyzing audio, analyzing lyrics, and analyzing lyrics and audio. Most of these studies have been successful in their goals but have uses a limited collection of songs/words for their analysis 1. Perhaps obviously, the best results come when combining audio and lyrics. A simple weighting is given by a study from the University of Illinois to categorize moods of a song by audio and lyrical content analysis, A simple weighting is given by a study from the University of Illinois to categorize moods of a song by audio and lyrical content analysis. phybrid = \alpha plyrics + (1 - \alpha )paudio When researching existing work, we found two applications that approach music recommendations based on mood, one is called ‘moooodify’, a free web application developed by an independent music enthusiast, Siddharth Ahuja 2. Another website, Organize Your Music, aims to organize a Spotify user’s music library based on mood, genre, popularity, style, and other categories 3. However, both of these applications do not seem to take into account any lyrical analysis of a song. Lyrics of a song can be used to learn a lot about music from lexical pattern analysis to gender, genre, and mood analyses. For example, in an individual study a researcher found that female artists tend to mention girls, women, and friends a lot, while male artists sing about late Saturday Nights, play and love 4. Another popular project, SongSim, used repetition to visualize the parts of a song 5. Findings such as these can be used to uncover the gender of an artist based on their lyrics. Similarly, by use of NLP tools, lyrical text can be analyzed to elicit the mood and emotion of a song. ## 3. Datasets For the audio content analysis portion of this project, we use Spotify’s Web API, which provides a great amount audio data for every song in Spotify’s song collection, including valence, energy, and danceability 6. For the lyrical analysis portion of this project, we use Genius’s API to pull lyrics information for a song. Genius is a website where users submit lyrics and annotations to several popular songs 7. To perform sentiment analysis on a set of lyrics collected from Genius, we use the NLTK Vader library. ## 4. Analysis For the purposes of this study, we analyze a track’s lyrics and assign them scores based on their positivity, negativity, and neutrality. We then append this data to the audio feature data we receive from Spotify. To compare and find relationships and meaningfulness in using lyrics and audio features to predict a song’s valence, we employ several statistical and machine learning approaches. We try linear regression and polynomial regression to find relationships between several features of a track and a song’s valence. Then we perform multivariate linear regression to find how accurately we can predict a song’s valence based on the audio and lyrical features available in our dataset. ### 4.1 Accumulation of audio features and lyrics From our data sources, we collected data for roughly 10000 of the most popular songs released between 2017 and 2020, taking account of several audio and lyrical features present in the track. We gathered this data by hand, first querying the most popular 2000 newly released songs in each year between 2017 and 2020. We then sent requests to Genius to gather lyrics for each song. Some songs, even though they were popular, did not have lyrics present on Genius, these songs were excluded from our dataset. With BeautifulSoup, we extracted and cleaned up the lyrics, removing anything that is not a part of the song’s lyrics like annotations left by users, section headings (Chorus, Hook, etc), and empty lines. After exclusions our data covered 6551 Spotify tracks. ### 4.2 Performance of sentiment analysis on lyrics With a song’s lyrics in hand, we used NLTK’s sentiment module, Vader, to read each line in the lyrics. NLTK Vader Sentiment Intensity Analyzer is a pretrained machine learning model that reads a line of text and assigns it scores of positivity, negativity, neutrality, and and overall compound score. We marked lines with a compound score greater than 0.5 as positive, less than -0.1 as negative, and anything in between as neutral. We then found the percentages of positive, negative, and neutral lines in a song’s composition and saved them to our dataset. We performed a brief analysis of the legibility of the Vader module in determining sentiment on four separate strings. “I’m happy” and “I’m so happy” were used to compare two positive lines, “I’m happy” was expected to have a positive compound score, but slightly less positive than “I’m so happy”. Similarly, we used two negative lines “I’m sad” and the slightly more extreme, “I’m so sad” which were expected to result in negative compound scores with “I’m sad” being less negative than “I’m so sad”. Scores for 'I'm happy': { 'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719 } Scores for 'I'm so happy': { 'neg': 0.0, 'neu': 0.334, 'pos': 0.666, 'compound': 0.6115 } Scores for 'I'm sad': { 'neg': 0.756, 'neu': 0.244, 'pos': 0.0, 'compound': -0.4767 } Scores for 'I'm so sad': { 'neg': 0.629, 'neu': 0.371, 'pos': 0.0, 'compound': -0.5256 }  While these results confirmed our expectations, a few issues come to the table with our use of the Vader module. One is that Vader takes into consideration additional string features such as punctuation in its determination of score, meaning “I’m so sad!” will be more negative than “I’m so sad”. Since lyrics on Genius are contributed by the community, in most cases there is a lack of consistency using accurate punctuation. Additionally, in some cases there can be typos present in a line of lyrics, both of which can skew our data. However we determined that our method in using the Vader module is suitable for our project as we simply want to determine if a track is positive or negative without needing to be too specific. Another issue is that our implementation of Vader acts only on English words. Again, since lyrics on Genius are contributed by the community, there could be errors in our data from misspelled word contributions as well as sections or entire lyrics written in different languages. In addition to performing sentiment analysis on the lyrics, we tokenized the lyrics, removing common words such as ‘a’, ‘the’,‘for’, etc. This was done to collect data on the number of meaningful and number of non-repeating words in each song. Albeit while this data was never used in our study, it could prove useful in future studies. Table 1 displays a snapshot of the data we collected from seven tracks released in 2020. The dataset contains 27 fields, 12 of which describe the audio features of a track, and 8 of which describe the lyrics of the track. For the purpose of this study we exclude the use of audio features key, duration, and time signature. Table 1: Snapshot of dataset containing tracks released in 2020 danceability energy key loudness speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature name artist num_positive num_negative num_neutral positivity negativity neutrality word_count unique_word_count 0.709 0.548 10 -8.493 0.353 0.65 1.59E-06 0.133 0.543 83.995 160000 4 What You Know Bout Love Pop Smoke 7 2 33 0.166666667 0.047619048 0.785714286 209 130 0.799 0.66 1 -6.153 0.079 0.256 0 0.111 0.471 140.04 195429 4 Lemonade Internet Money 8 15 34 0.140350877 0.263157895 0.596491228 307 177 0.514 0.73 1 -5.934 0.0598 0.00146 9.54E-05 0.0897 0.334 171.005 200040 4 Blinding Lights The Weeknd 3 10 22 0.085714286 0.285714286 0.628571429 150 75 0.65 0.613 9 -6.13 0.128 0.00336 0 0.267 0.0804 149.972 194621 4 Wishing Well Juice WRLD 0 22 30 0 0.423076923 0.576923077 238 104 0.737 0.802 0 -4.771 0.0878 0.468 0 0.0931 0.682 144.015 172325 4 positions Ariana Grande 10 5 33 0.208333333 0.104166667 0.6875 178 73 0.357 0.425 5 -7.301 0.0333 0.584 0 0.322 0.27 102.078 198040 3 Heather Conan Gray 3 4 22 0.103448276 0.137931034 0.75862069 114 66 0.83 0.585 0 -6.476 0.094 0.237 0 0.248 0.485 109.978 173711 4 34+35 Ariana Grande 3 13 52 0.044117647 0.191176471 0.764705882 249 127 ### 4.3 Description of select data fields The following terms defined are important in our analyses. In our data set most terms contain are represented by a value between 0 and 1, indicating least to most. For example, looking at the first two rows in Table 1, we can see that the track by the artist, Pop Smoke, has a greater speechiness score, indicating a greater percentage of that song contains spoken word. • Danceability: uses several musical elements (tempo, stability, beat strength, regularity) to determine how suitable a given track is for dancing • Energy: measures intensity of a song • Loudness: a songs overall loudness measured in decibels • Speechiness: identifies how much of a track contains spoken word • Acousticness: confidence of a track being acoustic, or with physical instruments • Instrumentalness: confidence of a track having no vocals • Liveness: confidence of a track being a live recording • Valence: predicts the overall happiness, or positivity of a track based on its musical features • Tempo: the average beats per minute of a track • Positivity: percentage of lines in a track’s lyrics determined to have a positive sentiment score • Negativity: percentage of lines in a track’s lyrics determined to have a negative sentiment score • Neutrality: percentage of lines in a track’s lyrics determined to have a neutral sentiment score Out of these fields, we seek to find which audio features correlate to a song’s valence and if our positivity and negativity scores of a song’s lyrics provide any meaningfulness in determining a song’s positivity. For the purpose of this study we mainly focus on valence, energy, danceability, positivity, and negativity. ### 4.4 Preliminary Analysis of Data When calculating averages of the feature fields captured in our dataset, we found it interesting that based on our lyrical interpretation, tracks between 2017 and 2020 tended to be more negative than positive. The average negativity score for a track in our dataset was 0.21 which means 21% of the lines in the track were deemed to have negative connotation, while having a 0.08 positivity score. Figure 1: Heatmap of data with fields valence, energy, danceability, positivity, negativity Backed by Figure 1, we find that track lyrics tend to be more negative than positive. However for the most part, even with tracks with negative lyrics, the valence, or overall happiness of the audio features hovers around 0.5; indicating that most songs tend to have neutral audio features. Looking at tracks with lyrics that are highly positive we find that the valence rises to about 0.7 to 0.8 and that songs with extremely high negatively also cause the valence to drop to the 0.3 range. These observations indicate that only extremes in lyrical sentiment correlate significantly in a song’s valence, as some songs with negative lyrics may also be fast-tempo and energetic, keeping the valence relatively high compared to lyrical composition. This is shown in our visualization, where both tracks with positive and negative lyricals have high energy and danceability values, indicating fast-tempos and high-pitches. ### 4.5 Scatterplot Analysis Figure 2: Scatterplots showing relation of features danceability, energy, speechiness, positivity, negativity, and neutrality to valence. Figure 2 describes the relation of several data fields we collected to a song’s valence, or its overall positivity. We find that the positivity and negativity plots reflect that of the speechiness plot in that there seems to be little correlation between the x and y axes. On the other hand neutrality seems to show a positive correlation between a song’s lyrical content and its respective valence. If a song is more neutral, it seems more likely to have a higher valence. Figure 3: Distributions of field values across the Spotify music library 6. Our scatterplots do show consistency with the expected distributions exemplified in the Spotify API documentation, as shown in Figure 3. In the top three plots, which use values for audio features obtained exclusively obtained from the audio features given by Spotify, we can see the these matching distributions which imply that most songs fall in the 0.4 to 0.8 range for danceability, energy, and valence, and 0 to 0.1 for speechiness. The low distribution in speechiness can be explained by music features being more dependant on instruments and sounds than spoken word. A track with higher than 0.33 speechiness score indicates that the track is very high in spoken word content over music, like a poetry recitation, talk show clip, etc 6. ### 4.6 Linear and Polynomial Regression Analyses We performed a simple linear regression test against valence with the audio and lyrical features described in Figure 2 and Figure 3. Like the charts show, it was hard to find any linear correlation between the fields. Table 2 displays the r-squared results that we obtained when applying linear regression to find the relationship between a song’s feature and its valence. The only features that indicate potential relationships with a song’s valence are energy, and danceability, as definitions of energy and and danceability indicate some semblance of positivity as well. Table 2: R-Squared results obtained from linear regression application on select fields against valence Feature R-Squared Positivity -0.090859047 Negativity -0.039686828 Neutrality 0.093002783 Energy 0.367113611 Danceability 0.324412662 Speechiness 0.066492856 Since we found little relation between the selected features and valence, we tried applying polynomial regression with the same features as shown in Table 3. Again, we failed to find any relationship between a feature in our dataset and the song’s valence. Energy and danceability once again were found to have the highest relationship with valence. We speculate that some of the data we have is misleading the regression applications; as mentioned before, we found some issues in reading sentiment in the lyrics we collected due to misspelled words, inaccurate punctuations, and non-english words. Table 3: R-Square results obtained from polynomial regression application on select data fields against valence Feature R-Squared Positivity 0.013164307 Negativity 0.001588184 Neutrality 0.010308495 Energy 0.136822113 Danceability 0.113119545 Speechiness 0.008913925 ### 4.7 Multivariate Regression Analysis We performed multivariate regression tests to predict a song’s valence with a training set of 5500 tracks and a test set of 551 tracks. Our first test only included four independent variables: neutrality, energy, danceability, and speechiness. Our second test included all numerical fields available in our data, adding loudness, acousticness, liveness, instrumentalness, tempo, positivity, word count, and unique word count to the regression coefficient calculations. In both tests we calculated the relative mean squared error (RMSE) between our predicted values and the actual values of a song’s valence given several features. Our RMSEs were 0.1982 and 0.1905 respectively, indicating that as expected, adding additional pertinent independent variables gave slightly better results. However given that a song’s valence is captured between 0 and 1.0, and both our RSMEs were approximately 0.19, it is unclear how significant the results of these tests are. Figure 4 and Figure 5 show the calculated differences between the predicted and actual values for the first 50 tracks in our testing dataset for each regression test respectively. Figure 4: Differences between expected and predicted values with application of multivariate regression model with 4 independent variables Figure 5: Differences between expected and predicted values with application of multivariate regression model with 12 independent variables ## 5. Benchmarks Table 4 displays the benchmarks we received from key parts of our analyses. As expected, creating our dataset took a longer amount of time relative to the rest of the benchmarks. This is because accumulating the data involved sending two requests to online sources, and running the sentiment intensity analyzer on the lyrics received from the Genius API calls. Getting the sentiment of a line of text itself did not take much time at all. We found it interesting that applying multivariate regression on our dataset was much quicker than calculating averages on our dataset with numpy, and that it was the fastest process to complete. Table 4: Benchmark Results Name Status Time Sum Start tag Node User OS Version Create dataset of 10 tracks ok 12.971 168.523 2020-12-07 00:19:30 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Sentiment Intensity Analyzer on a line of lyrical text ok 0.001 0.005 2020-12-07 00:19:49 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Load dataset ok 0.109 1.08 2020-12-07 00:19:59 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Calculate averages of values in dataset ok 0.275 0.597 2020-12-07 00:19:59 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Multivariate Regression Analysis on dataset ok 0.03 0.151 2020-12-07 00:21:49 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Generate and display heatmap of data ok 0.194 0.194 2020-12-07 00:20:03 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 Plot differences ok 0.504 1.473 2020-12-07 00:21:50 884e3d61f237 collab Linux #1 SMP Thu Jul 23 08:00:38 PDT 2020 ## 6. Conclusion We received inconclusive results from our study. The linear and polynomial regression tests that we performed, showed little correlation between our lyrical features and a track’s valence. This was backed by our multivariate regression test which performed with a RSME score of about 0.19 on our dataset. Since valence is recorded on a scale from 0 to 1.0, this means that our predictions typically fall within 20% of the actual value, which is considerably inaccurate. As previous studies have shown massive improvements in combining lyrical and audio features for machine learning applications in music, we believe that the blame for our low scores falls heavily on our approach to assigning sentiment scores on our lyrics 81. Future studies should consider the presence of foreign lyrics and the potential inaccuracies of community submitted lyrics. There are several other elements of this study that could be improved upon in future iterations. In this project we only worked with songs released after the beginning of 2017, but obviously, people would still enjoy listening to songs from previous years. The Spotiy API contains audio features data for every song in its library, so it would be worth collecting that data on every song for usage in the generation of song recommendations. Secondly, our data set excluded songs on Spotify, whose lyrics could not be found easily on Genius.com. We should have handled these cases by attempting to find the lyrics from other popular websites which store music lyrics. And lastly, we worked with a very small dataset relative to the total amount of songs that exist, or that are available on Spotify. There is great possibility in repeating this study quite easily with a greater selection of songs. We were surprised by how small the file sizes were of our dataset of 6551 songs, the aggregated data set being only 2.3 megabytes in size. Using that value, a set of one million songs can be estimated to only be around 350 megabytes. ## 7. Acknowledgements We would like to give our thanks to Dr. Geoffrey Fox, Dr. Gregor von Laszewski, and the other associate instructors who taught FA20-BL-ENGR-E534-11530: Big Data Applications during the Fall 2020 semester at Indiana University, Bloomington for their suggestions and assistance in compiling this project report. Additionally we would like to thank the students who contributed to Piazza by either answering questions that we had ourselves, or giving their own suggestions and experiences in building projects. In taking this course we learned of several applications of and use cases for big data applications, and gained the knowledge to build our own big data projects. ## 8. References 1. Kashyap, N., Choudhury, T., Chaudhary, D. K., & Lal, R. (2016). Mood Based Classification of Music by Analyzing Lyrical Data Using Text Mining. 2016 International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE). doi:10.1109/icmete.2016.65 ↩︎ 2. Ahuja, S. (2019, September 25). Sort your music by any mood - Introducing moooodify. Retrieved November 17, 2020, from https://blog.usejournal.com/sort-your-music-by-any-mood-introducing-moooodify-41749e80faab ↩︎ 3. Lamere, P. (2016, August 6). Organize Your Music. Retrieved November 17, 2020, from http://organizeyourmusic.playlistmachinery.com/ ↩︎ 4. Jeong, J. (2019, January 19). What Songs Tell Us About: Text Mining with Lyrics. Retrieved November 17, 2020, from https://towardsdatascience.com/what-songs-tell-us-about-text-mining-with-lyrics-ca80f98b3829 ↩︎ 5. Morris, C. (2016). SongSim. Retrieved November 17, 2020, from https://colinmorris.github.io/SongSim/ ↩︎ 6. Get Audio Features for a Track. (2020). Retrieved November 17, 2020, from https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/ ↩︎ 7. Genius API Documentation. (2020). Retrieved November 17, 2020, from https://docs.genius.com/ ↩︎ 8. Hu, X., & Downie, J. S. (2010). Improving mood classification in music digital libraries by combining lyrics and audio. Proceedings of the 10th Annual Joint Conference on Digital Libraries - JCDL ‘10. doi:10.1145/1816123.1816146 ↩︎ # 39 - Does Modern Day Music Lack Uniqueness Compared to Music before the 21st Century his project looked at 99 years of spotify music data and determined that all features of most tracks have changed in different ways. Because uniqueness can be related to variation the variation of different features were used to determine if tracks did lack uniqueness. Through data analysis it was concluded that they did. Status: final, Type: Project • Raymond Adams, fa20-523-333 • Edit Keywords: music, spotify data, uniqueness, music evolution, 21st century music. ## 1. Introduction Music is one of the most influential elements in the arts. It has a great impact on the way humans act and feel. Research, done by Nina Avramova, has shown that different genres of music bring about different emotions and feelings through the listener. Nonetheless, humans also have a major impact on the music itself. Music and humans are mutually dependent, therefore when one evolves, so does the other. This scientific journal intends to progress the current understanding of how music has changed since the 21st century. It also aims to determine if this change in music has led to a lack of uniqueness amongst the common features of a song compared to music before the 21st century. ## 2. Data The data is located on Kaggle and was collected by a data scientist named Yamac Eren Ay. He collected more than 160,000 songs from Spotify that ranged from 1921 to 2020. Some of the features that this data set includes and will be used to conduct an analysis are: danceability, energy, acousticness, instrumentalness, valence, tempo, key, and loudness. This data frame can be seen in Figure 1. Figure 1: Dataframe of spotify data collected from Kaggle Danceability describes how appropriate a track is for dancing by looking at multiple elements including tempo, rhythm stability, beat strength, and general regularity. The closer the value is to 0.0 the less danceable the song is and the closer it is to 1.0 the more danceable it is. Energy is a sensual measure of intensity and activity. Usually, energetic songs feel fast, loud, and noisy. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy 1. The closer this value is to 0.0 the less energetic the track is and the closer it is to 1.0 the more energetic the track is. Acoustic music refers to songs that are created using instruments and recorded in a natural environment as opposed to being recorded by electronic means. The closer the value is to 0.0 the less acoustic it is and the closer it is to 1.0 the more acoustic it is. Instrumentalness predicts how vocal a track is. Thus, songs that contain words other than “Oh” and “Ah” are considered vocal. The closer the value is to 0.0 the less likely the track contains vocals and the closer the value is to 1.0 the more likely it contains vocals. Valence describes the musical positiveness expressed through a song. Tracks with high valence (closer to 0.0) sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence (closer to 1.0) sound more negative (e.g. sad, depressed, angry) 1. The tempo is the overall pace of a track measured in BPM (beats per minute). The key is the overall approximated pitch that a song is played in. The possible keys and their integer values are: C = 0; C# = 1; D = 2; D#, Eb = 3; E = 4; F = 5; F#, Gb = 6; G = 7; G#, Ab = 8; A = 9; A#, Bb = 10; B, Cb = 11. The overall loudness of a track is measured in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB, 0 being the most loud 1. ## 3. Methods Data analysis was used to answer this paper’s research question. The data set was imported into Jupyter notebook using pandas, a software library built for data manipulation and analysis. The first step was cleaning the data. The data set contained 169,909 rows and 19 columns. After dropping rows where at least one column had NaN values, values that are undefined or unrepresentable, the data set still contained all 169,909 rows. Thus, the data set was cleaned by Yamac Ay prior to it being downloaded. The second step in the data analysis was editing the data. The objective of this research was to compare music tracks before the 21st century to tracks after the 21st century and see if, how, and why they differ. As well as do tracks after the 21st century lack uniqueness compared to tracks that were created before the 21st century. When Yamac Ay collected the data he separated it into five comma-separated values (csv) files. The first file, titled data.csv, contained all the information that was needed to conduct data analysis. Although this file contained the feature “year” that was required to analyze the data based on the period of time, it still needed to be manipulated to distinguish what years were attributed to before and after the 21st century. A python script was built to create a new column titled “years_split” that sorted all rows into qualitative binary values. These values were defined as “before_21st_century” and “after_21st_century”. Rows, where the tracks feature “year” were between 0 and 2000 were assigned to “before_21st_century” and tracks where the feature “year” was between 2001 and 2020 were assigned to “after_21st_century”. It is important to note that over 76% of the rows were attributed to “before_21st_century”. Therefore, the majority of tracks collected in this dataset were released before 2001. ## 4. Results The features that were analyzed were quantitative values. Thus, it was decided that histograms were the best plots for examining and comparing the data. The first feature that was analyzed is “danceability”. The visualization for danceability is seen in Figure 2. Figure 2: Danceability shows how danceable a song is The histogram assigned to “before_21st_century” resembles a normal distribution. The mean of the histogram is 0.52 while the mode is 0.57. The variance of the data is 0.02986. The histogram assigned to “after_21st_century” closely resembles a normal distribution. The mean is 0.59 and the mode is 0.61. The variance of the data is 0.02997. The bulk of the data before the 21st century lies between 0.2 and 0.8. However, when looking at the data after the 21st century the majority of it lies between 0.4 and 0.9. This implies that songs have become more danceable but the variation of less danceable to danceable is practically the same. The second feature that was analyzed is “energy”. The visualization for energy is seen in Figure 3. Figure 3: Energy shows how energetic a song is The histogram assigned to “before_21st_century” does not resemble a normal distribution. The mean of the histogram is 0.44 while the mode is 0.25. The variance of the data is 0.06819. The histogram assigned to “after_21st_century” also does not resemble a normal distribution. The mean is 0.65 and the mode is 0.73. The variance of the data is 0.05030. The data before the 21st century is skewed right while the data after the 21st century is skewed left. This indicates that tracks have become much more energetic since the 21st century. Songs before 2001 on average have an energy level of 0.44 but there are still many songs with high energy levels. Where as, songs after 2001 on average have an energy level of 0.65 but there are very few songs with low energy levels. The third feature that was analyzed was “acousticness”. The visualization for acousticness is seen in Figure 4. Figure 4: Acousticness shows how acoustic (type of instrument as a collective) a song is The mean of the histogram assigned to “before_21st_century” is 0.57 while the mode is 0.995. The variance of the data is 0.13687. The histogram assigned to “after_21st_century” has a mean of 0.26 and mode of 0.114. The variance of the data is 0.08445 . The graph shows that music made before the 21st century varied from non-acoustic to acoustic. However, when analyzing music after the 21st century the graph shows that most music is created using non-acoustic instruments. It is assumed that this change in outlet of sounds is due to music production transitioning from acoustic to analog to now digital. However, more in depth research would need to be completed to confirm this assumption. The fourth histogram to be anaylzed was “instrumentalness”. The visualization for instrumentalness is seen in Figure 5. Figure 5: Instrumentalness shows how instrumental a song is The mean of the histogram assigned to “before_21st_century” is 0.19 while the mode is 0.0. The variance of the data is 0.10699. The histogram assigned to “after_21st_century” has a mean of 0.07 and mode of 0.0. The variance of the data is 0.04786 . By analyzing the graph it appears that the instrumentalness for before and after the 21st century are relatively similar. Both histograms are skewed right but the histogram attributed to after the 21st century has much less songs that are instrumental compared to songs before the 21st century. The variation of non-instrumental to instrumental tracks after the 21st century is all far less compared to tracks before the 21st century. The fifth histogram that was analyzed is “valence”. The visualization for valence is seen in Figure 6. Figure 6: Valence shows how positive a song is The mean of the histogram assigned to “before_21st_century” is 0.54 while the mode is 0.961. The variance of the data is 0.07035. The histogram assigned to “after_21st_century” has a mean of 0.49 and mode of 0.961. The variance of the data is 0.06207. By analyzing the graph we can see that the valence before and after the 21st century has remained fairly the same in terms of shape. However, the average value of valence after the 21st century decreased by 0.05. Thus, songs have become less positive but there are still a good amount of positive songs being created. The sixth histogram that was analyzed is “tempo”. The visualization for tempo is seen in Figure 7. Figure 7: Tempo shows the speed a song is played in The mean of the histogram attributed to “before_21st_century"is 115.66. The variance of the data is 933.57150. The histogram assigned to “after_21st_century” has a mean of 121.19. The variance of the data is 955.44287. This indicates that tracks after the 21st century have increased tempo by a little over 6 BPM. Tracks after the 21st century also have more variation than tracks before the 21st century. The seventh histogram that was analyzed is “key”. The visualization for key is seen in Figure 8. Figure 8: Key labels the overall pitch a song is in The mean of the histogram assigned to “before_21st_century” is 5.19 The variance of the data is 12.22468. The histogram assigned to “after_21st_century” has a mean of 5.24. The variance of the data is 12.79017. This information implies that the key of songs have mostly stayed the same hoever, there are less songs after the 21st century being created in C, C#, and D compared to songs before the 21st century. The key of songs after 2001 are also more spread out compared to songs before 2001. The eighth histogram that was analyzed is “loudness”. The visualization for loudness is seen in Figure 9. Figure 9: Loudness shows the average decibels a track is played in The mean of the histogram assigned to “before_21st_century” is -12.60 while the mode is -11.82. The variance of the data is 29.17625. The histogram assigned to “after_21st_century” has a mean of -7.32 and mode of -4.80. The variance of the data is 20.35049. The shapes of both histograms are the same, however the histogram attributed to after the 21st century shifted to the right by 5.28. This displays the increase in loudness of songs created after 2001. The variance of the data shows that songs after 2001 are mostly very loud. Wheres as, songs before 2001 had greater variation of less loud to more loud. ## 5. Conclusion Music over the centuries has continuously changed. This change has typically been embraced by the masses. However, in the early 2000s adults who grew up on different styles of music began stating through word of mouth, interviews, tv shows, and movies that “music isn’t the same anymore”. They even claimed that most current songs sound the same and lack uniqueness. This scientific research set out to determine how music has changed and if in fact, modern music lacks uniqueness. After analyzing the songs before and after the 21st century it was determined that all the features of a track have changed in some way. The danceability, energy, tempo, and loudness of a song have increased. While the acousticness, valence, and instrumentalness have decreased. The number of songs after the 21st century that was created in the key of C, C#, and D has decreased. The variation of energetic, acoustic, instrumental, valent, and loud songs have decreased since 2001. While the variation of tempo and key has increased since 2001. The factor for determining whether music lacks uniqueness in this paper will be variance. If a feature has more than alpha = 0.01 difference of variability from before to after the 21st century then it will be determined that the feature is less unique after the 21st century. The difference in variances among the feature “energetic” is 0.01789, “acousticness” is 0.05242, “instrumentalness” is 0.05913, “valence” is 0.00828, and “loudness” is 8.82576. Thus, the only feature that does not lack uniqueness when compared to songs before the 21st century is “valence”. Based on this information music overall after the 21st century lacks uniqueness compared to music before the 21st century. This lack of uniqueness did not start after the 21st century. It started during the Enlightenment period. During this period of time, classical music was the most popular genre of music. Before this era, Baroque music was extremely popular. Artists such as Johann Sebastian Bach created complex compositions that were played for the elite. This style of music “was filled with complex melodies and exaggerated ornamentation, music of the Enlightenment period was technically simpler.” 2 Instead of focusing on these complexities the new music focused on enjoyment, pleasure, and being memorable. People now wanted to be able to hum and play songs themselves. This desired feature has caused music to constantly become simpler. Thus, reducing songs variances amongst features. A further step that could be taken in this research is predicting what these features will look like in the future. Machine learning could be used to make this prediction. This would give music enthusiasts and professionals a greater understanding of where music is headed and how to adapt. ## 5.1 Limitations Initially this project was supposed to be conducted on Hip-Hop music. However, the way the data was collected and stored did not allow for this analysis to be done. In the future a more in depth analysis could be conducted on a specific genre. ## 7. References 1. Developer.spotify.com. 2020. Get Audio Features For A Track | Spotify For Developers. [online] Available at: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/ [Accessed 6 November 2020]. ↩︎ 2. Muscato, C. and Clayton, J., 2020. Music During The Enlightenment Period. [online] Study.com. Available at: https://study.com/academy/lesson/music-during-the-enlightenment-period.html [Accessed 5 November 2020]. ↩︎ # 40 - NBA Performance and Injury Sports Medicine will be a7.2 billion dollar industry by 2025. The NBA has a vested interest in predicting performance of players as they return from injury. The authors evaluated datasets available to the public within the 2010 decade to build machine and deep learning models to expect results. The team utilized Gradient Based Regressor, Light GBM, and Keras Deep Learning models. The results showed that the coefficient of determination for the deep learning model was approximately 98.5%. The team recommends future work to predicting individual player performance utilizing the Keras model.

Status: final, Type: Project

• Gavin Hemmerlein, fa20-523-301
• Chelsea Gorius, fa20-523-344
• Edit

Keywords: basketball, NBA, injury, performance, salary, rehabilitation, artificial intelligence, convolutional neural network, lightGBM, deep learning, gradient based regressor.

## 1. Introduction

The topic to be investigated is basketball player performance as it relates to injury. The topic of injury and recovery is a multi-billion dollar industry. The Sports Medicine field is expected to reach $7.2 billion dollars by 2025 1. The scope of this effort is to explore National Basketball Association(NBA) teams, but the additional uses of a topic such as this could expand into other realms such as the National Football League, Major League Baseball, the Olympic Committees, and many other avenues. For leagues with salaries, projecting an expected return on the investment can assist in contract negotiations and cater expectations. Competing at such a high level of intensity puts these players at a greater risk to injury than the average athlete because of the intense and constant strain on their bodies. The overall valuation of the NBA in recent years is over 2 billion dollars, meaning each team is spending millions of dollars in the pursuit of a championship every season. Injuries to players can cost teams not only wins but also significant profits. Ticket sales alone for a single NBA finals game have reported greater than 10 million dollars in profit for the home team, if a team’s star player gets injured just before the playoffs and the team does not succeed, that is a lot of money lost. These injuries can have an effect no matter the time of year, regular season ticket sales have been known to fluctuate with injuries from the team’s top performers. Besides ticket sales these injuries can also influence viewership, TV or streaming, and potentially lead to a greater loss in profits. With the health of the players and so much money at stake NBA team organizations as a whole do their best to take care of their players and keep them injury free. ## 2. Background Research and Previous Work The assumptions were made based on current literature as well. The injury return and limitations upon return of Anterior Cruciate Ligament (ACL) rupture (ACLR) are well documented and known. Interesting enough, forty percent of the players in the study occurred during the fourth quarter 2. This leads some credence to the idea that fatigue is a major factor in the occurrence of these injuries. The current literature also shows that a second or third injury can occur more frequently due to minor injuries. “When an athlete is recovering from an injury or surgery, tissue is already compromised and thus requires far more attention despite the recovery of joint motion and strength. Moreover, injuries and surgical procedures can create detraining issues that increase the likelihood of further injury” 3. ## 3. Dataset To compare performance and injury, a minimum of two datasets will be needed. The first is a dataset of injuries for players 4. This dataset created the samples necessary for review. Once the controls for injuries were established, the next requirement was to establish pre-injury performance parameters and post-injury parameters. These areas were where the feature engineering took place. The datasets needed had to include appropriate basketball performance stats to establish a metric to encompass a player’s performance. One example that ESPN has tried in the past is the Player Efficiency Rating (PER). To accomplish this, it was important to review player performance within games such as in the NBA games data 5 dataset because of how it allowed the team to evaluate the player performance throughout the season, and not just the average stats across the year. In addition to that the data from the NBA games data 6 dataset was valuable in order to compare the calculated performance metrics just before an injury or after recovery to the player’s overall performance that season or in seasons prior. That comparison provided a solid baseline to understand how injuries can effect a player’s performance. With in depth information about each game of the season, and not just the teams and players aggregated stats, added to the data provided from the injury dataset 4 the team was be able to compose new metrics to understand how these injuries are actually affecting the players performance. Along the way attempted to discover if there is also a causal relationship to the severity of some of the injuries, based on how the player was performing just before the injury. The term load management has become popular in recent years to describe players taking rest periodically throughout the season in order to prevent injury from overplaying. This new practice has received both support for the player safety it provides and also criticism around players taking too much time off. Of course not all injuries are entirely based on the recent strain under the players body, but a better understanding about how that affects the injury as a whole could give better insight into avoiding more injuries. It is important to remember though that any pattern identification would not lead to an elimination of all injuries, any contact sport will continue to have injuries, especially one as high impact as the NBA. There is value to learn from why some players are able to return from certain injuries more quickly and why some return to almost equivalent or better playing performance than before the injury. This comparison of performance was attempted by deriving metrics based on varying ranges of games immediately leading up to injury and then immediately after returning from injury. In addition to that performed comparisons to the players known peak performance to better understand how the injury affected them. Another factor that was important to include is the length of time recovering from the injury. Different players take differing amounts of time off, sometimes even with similar injuries. Something will be said about the player’s dedication to recovery and determination to remain at peak performance, even through injury, when looking at how severe their injury was, how much time was taken for recovery, and how they performed upon returning. These datasets were chosen because they allow for a review of individual game performance, for each team, throughout each season in the recent decade. Aggregate statistics such as points per game (ppg) can be deceptive because duration of the metric is such a large period of time. The large sample of 82 games can lead to a perception issue when reviewing the data. These datasets include more variables to help the team determine effects to player injury, such as minutes per game (mpg) to understand how strenuous the pre-injury performance or how fatigue may have played a factor in the injury. Understanding more of the variables such as fouls given or drawn can help determine if the player or other team seemed to be the primary aggressor before any injury. ### 3.1 Data Transformations and Calculations Using the Kaggle package the datasets were downloaded direct from the website and unzipped to a directory accessible by the ‘project_dateEngineering.ipynb’ notebook. The 7 unzipped datasets are then loaded into the notebook as pandas data frames using the ‘.read_csv()’ function. The data engineering performed in the notebook includes removal of excess data and data type transformations across almost all the data frames loaded. This data transformation includes transforming the games details column ‘MIN’, meaning minutes played, from a timestamp format to a numerical format that could have calculations like summation or average performed on it. This was a crucial transformation since minutes played have a direct correlation to player fatigue, which can increase a player’s chance of injury. One of the more difficult tasks was transforming the Injury dataset into something that would provide more information through machine learning and analysis. The dataset is loaded as one data set where 2 columns ‘Relinquished’ and ‘Acquired’ defined if the row in questions was a player leaving the roster due to injury or returning from injury, respectively. In this case for each for one of those two columns contained a players name and the other was blank. Besides that the data frame contained information like the date, notes, and the team name. In order to appropriately understand each injury as whole the data frame needs to be transformed into one where each row contains the player, the start date of the injury, and the end date of the injury. In order to do this first the original Injury dataset was separated into rows marking the start of an injury and those marking the end of an injury. Data frames from the NBA games data 5 data set were used to join TeamID and PlayerID columns to the Injury datasets. An ‘iterrows():’ loop was then used on the data frame marking the start of an injury to specifically locate the corresponding row in the Injury End data frame with the same PlayerID and where the return date was the closest date after the injury date. As this new data frame was being transformed, it was noted that sometimes a Player would have multiple rows with the same Injury ending date but different injury start dates, this can happen if an injury worsens or the player did not play due to last minute decision. In order to solve this the table was grouped by the PlayerID and InjuryEnd Date while keeping the oldest Injury Start date, since the model will want to see the full length of the injury. From there it was simple to calculate the difference in days for each row between the Injury start and end dates. This data frame is called ‘df_Injury_length’ in the notebook and is much easier to use for improved understanding of NBA injuries than the original format of the Injury data set. Once created, the ‘df_Injury_length’ data frame was copied and built upon. Using ‘iterrows():’ loop again to filter down the games details data frame rows with the same PlayerId, over 60 calculated columns are created to produce the ‘df_Injury_stats’ data frame. The data frame includes performance statistics specifically from the game the player was injured and the game the player returned from that injury. In addition to this aggregate performance metrics were calculated based on the 5 games prior to the injury and the 5 games post returning from injury. At this time the season of when the injury occurred and when the player returned is also stored in the dataframe. This will allow comparisons between the ‘df_Injury_stats’ data frame and the ‘df_Season_stats’ data frame which contains the players average performance metrics for entire seasons. A few interesting figures were generated within the Exploratory Data Analysis (EDA) stage. Figure 1 gave a view of the load of the player returning from injury. The load to the player will show how recovered the player is upon completion of rehab. Many teams decide to slowly work a returning player in. Additionally, the amount of time for an injury can be seen on this graph. The longer the injury, the more unlikely the player will return to action. Figure 1: Average Minutes Played in First Five Games Upon Return over Injury Length in Days* Figure 2 shows the frequency in which a player is injured. The idea behind this graph is to see a relationship between the time leading up to the injury. Interesting enough, there is no key indication of where injury is more likely to occur. It can be assumed that there is a rarity of players who see playing time greater than 30 minutes. The histogram only shows a near flat relationship; which was surprising. Figure 2: Frequency of Injuries by Average Minutes Played in Prior Five Games* Figure 3 shows the length of injury over number of injuries. By reviewing this data, it can be seen that most injuries occur fewer rather than more often. A player that is deemed injury prone will be a lot more likely to be cut from the team. This data makes sense. Figure 3: Injury Length in Days over Number of Injuries Figure 4 shows the injury length over average minutes played in the five games before injury. This graph attempts to show all of the previous games and the impacts to the players injury. The data looks evenly distributed, but the majority of plaers do not play close to 40 minutes per game. By looking at this data, it shows that minutes played does likely contribute to the injury severity. Figure 4: Injury Length in Days over Avg Minutes Played in Prior 5 Games Figure 5 shows that in general the number of games played does not have a significant relationship to the length of the injury. There is a darker cluster between 500-1000 days injured that exists over the 40-82 games played, this could suggest that as more games are played there is likeliness for more severe injury. Figure 5: Injury Length in Days over Player Games Played that Season Figures 6, Figure 7, and Figure 8 attempt to demonstrate if any relationship exists visually between a player’s injury length and their age, weight, or height. For the most part Figure 6 shows most severe injuries occurring to younger players, which could make sense considering they can perform more difficult moves or have more stamina than older players. Some severe injuries still exist among the older players, this also makes sense considering their bodies have been under stress for many years and are more prone to injury. It should be noted that there are more players in the league that fall into the younger age bucket than the older ages. It is difficult to identify any pattern on Figure 7. If anything the graph is somewhat normally shaped similar to the heights of players across the league. Suprisingly the injuries on Figure 8 are clustered a bit towards the left, being the lighter players. This could be explained through the fact that the lighter players are often more athletic and perform more strenuous moves than heavier players. It is also somewhat surprising since the argument that heavier players are putting more strain on their bodies could be used as a reason why heavier players would have worse injuries. One possible explanation could be the musculature adding more of the dense body mass could add protection to weakened joints. More investigation would be needed to identify an exact reason. Figure 6: Injury Length in Days over Player Age that Season Figure 7: Injury Length in Days over Player Height in Inches Figure 8: Injury Length in Days over Player Weight in Kilograms Finally, the team decided to use the z-score to normalize all of the data. By using the Z-score from the individual data in a column of df_Injury_stats, the team was able to limit variability of multiple metrics across the dataframe. A player’s blocks and steals should be a miniscule amount compared to minutes or points of some players. The same can be said of assists, technical fouls, or any other statistic in the course of an NBA game. The Z-score, by nature of the metric from the mean, allows for much less variability across the columns. ## 4. Methodology The objective of this project was to develop performance indicators for injured players returning to basketball in the NBA. It is unreasonable to expect a player to return to the same level of play post injury immediately upon starting back up after recovery. It often takes a player months if not years to return to the same level of play as pre-injury, especially considering the severity of the injuries. In order to successfully analyze this information from the datasets, a predictive model will need to be created using a large set of the data to train. From this point, a test run was used to gauge the validity and accuracy of the model compared to some of the data set aside. The model created was able to provide feature importance to give a better understanding of which specific features are the most crucial when it comes to determining how bad the effects of an injury may or may not be on player performance. Feature engineering was performed prior to training the model in order to improve the chances of higher accuracy from the predictions. This model could be used to keep an eye out for how a player’s performance intensity and the engineered features could affect how long a player takes to recover from injury, if there are any warning signs prior to an injury, and even how well they perform when returning. ### 4.1 Development of Models To help with review of the data, conditioned data was used to save resources on Google Colab. By conditioning the data and saving the files as a .CSV, the team was able to create a streamlined process. Additionally, the team found benefit by uploading these files to Google Drive to quickly import data near real time. After operating in this fashion for some time, the team was able to load the datasets into Github and utilize that feature. By loading the datasets up to Github, a url could be used to link the files directly to the files saved on Github without using a token like with Kaggle or Google Drive. The files saved were the following: Table 1: Datasets Imported Dataframe Title 1. df_Injury_stats 2. df_Injury_length 3. df_Season_stats 4. games 5. df_Games_gamesDetails 6. injuries_2010-2018 7. players 8. ranking 9. teams Every time Google Colab loads data, it takes time and resources. The team was able to utilize the cross platform connectivity of the Google utilities. The team could then focus on building models as opposed to conditioning data every time the code was ran. #### 4.1.1 Evaluation Metrics The metrics chosen were designed to give results on Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Explained Variance (EV) Score. MAE is a measure of errors between paired observations experiencing the same expression. RMSE is the standard deviation of the prediction errors for our dataset. EV is the relationship between the train data and the test data. By using these metrics, the team is capable of reviewing the data in a statistical manner. #### 4.1.2 Gradient Boost Regression The initial model that was used was a Gradient Boosting Regressor (GBR) model. This model produced the results shown in Table 2. The GBR model builds in a stage-wise fashion; similarly to other boosting methods. GBR also generalizes the data and attempts to optimize the results utilizing a loss function. An example of the algorithm can be seen in Figure 5. Figure 5: Gradient Boosting Regressor 7 The team saw a relationship given the data. Table 2 shows the results of that model. The results were promising given the speed and utility of a GBR model. The team reviewed the data multiple times after multiple stages of conditioning the data. Table 2: GBR Results Category Value MAE Mean -10.787 MAE STD 0.687 RMSE Mean -115.929 RMSE STD 96.64 EV Mean 1.0 EV STD 0.0 After running a GBR model, the decision was made to try multiple models to see what gives the best results. The team settled on LightGBM and a Deep Learning model utilizing Keras built on the TensorFlow platform. These results will be seen in 4.1.2 and 4.1.3. #### 4.1.2 LightGBM Regression Another algorithm chosen was a Light Gradient Boost Machine (LightGBM) model. LightGBM is known for its lightweight and resource sparse abilities. The model is built from decision tree algorithms and used for ranking, classification, and other machine learning tasks. By choosing LightGBM data scientists are able to analyze larger data a faster approach. LightGBM can often over fit a model if the data is too small, but fortunately for the purpose of this assignment the data available for NBA injuries and stats is extremely large. Availability of data allowed for smooth operation of the LightGBM model. Mandot explains the model really well in The Medium. Mandot said, “Light GBM can handle the large size of data and takes lower memory to run. Another reason of why Light GBM is popular is because it focuses on accuracy of results. LGBM also supports GPU learning and thus data scientists are widely using LGBM for data science application development” 8. There are a lot of benefits available to this algorithm. Figure 6: LightGBM Algorithm: Leafwise searching 8 When running the model Table 3 was generated. This table uses the same metrics as the GBR Results Table (Table 2). After reviewing the results, the GBR model still appeared to be a viable avenue. The Keras model will be evaluated next to see most optimal model to use for repeatable fresults. Table 3: LightGBM Results Category Value MAE Mean -0.011 MAE STD 0.001 RMSE Mean -0.128 RMSE STD 0.046 EV Mean 0.982 EV STD 0.013 #### 4.1.3 Keras Deep Learning Models The final model attempted was a Deep Learning model. A few runs of different layers and epochs were chosen. They can be seen in Table 4 (shown later). The model was sequentially ran through the test layers to refine the model. When this is done, each predecessor layer acts as an input to the next layer’s input for the model. The results can produce accurate results while using unsupervised learning. The visualization for this model can be seen in the following figure: Figure 7: Neural Network 9 When the team ran the Neural Networks, the data went through three layers. Each layer was built upon the previous similarly to the figure. This allowed for the team to capture information from the processing. Table 4 shows the results for the deep learning model. Table 4: Epochs and Batch Sizes Chosen Number Regressor Epoch Regressor Batch Sizes KFolds Model Epochs R2 1. 25 25 10 10 0.985 2. 40 25 20 10 0.894 3. 20 25 20 10 0.966 4. 20 20 10 10 0.707 5. 25 25 10 5 0.611 6. 25 25 10 20 0.982 The team has decided that the results for the Deep Learning are the most desirable. This model would be the one that the team would recommend based on the results from the metrics available. The parameters the team recommends are italicized in Line 1 of Table 4. ## 5. Inference With the data available, some conclusions can be made. Not all injuries are of the same severity. By treating an ACL tear in the same manner as a bruise, the team doctors would take terrible approaches to rehab. The severity of the injury is a part of the approach to therapy. This detail is nearly impossible to capture in the model. Another aspect to come to a conclusion is that not every player recovers in the same timetable as another. Genetics, diet, effort, and mental health can all harm or reinforce the efforts from the medical staff. These areas are hard to capture in the data and cannot be appropriately reviewed with this model. It is also difficult to indicate where a previous injury may have contributed to a current injury. The kinetic chain is a structure of the musculoskeletal system that moves the body using the muscles and bones. If one portion of the chain is compromised, the entire chain will need to be modified to continue movement. This modification can result in more injuries. The data cannot provide this information. It is important to remember these possible confounding variables when interpreting the results of the model. ## 6. Conclusion After reviewing the results, the team created a robust model to predict the performance of a player after an injury. The coefficient of determination for the deep learning model shows a strong relationship between the training and test sets. After conditioning the data, the results can be seen in Table 2, Table 3, and Table 5. The team had an objective to find this correlation and build it to the point where injury and performance can be modeled. The team was able to accomplish this goal. Additionally, these results are consistent with the current scientific literature 2 3. The biological community has been able to record these results for decades. By leveraging this effort, the scientific community could move to a more proactive approach as opposed to reactive with respect to injury controls. This data will also allow for proper contract negotiations to take place in the NBA, considering potential decisions to avoid injury may include less playing time. The negotiations are pivotal to ensuring that expectations are met in the future seasons; especially when injury occurs in the final year of a player’s contract. Teams with an improved understanding of how players can or will return from injury have an opportunity to make the best of scenarios where other teams may be hesitant to sign an injured player. These different opportunities for a team’s front office could be the difference between a championship ring and missing the playoffs entirely. ## 6.1 Limitations With respect to the current work, the models could be continued to be refined. Currently the results are to the original intentions of the team, but improvements can be made. Feature Engineering is always an area where the models can improve. Some valuable features to be created in the future are the calculations for the player’s efficiency overall, as well as offensinve and defensive efficiencies in each game. The team would also like to develop a model to use the stats of a player in pre-injury and apply that to the post-injury set of metrics. Also, the team would like to move to where the same could be applied given the length of the injury to the player while considering the severity of the injury. Longer and more severe injury will lead to different future results than say a long not severe injury, or a short injury that was somewhat severe. The number of varaibles that could provide more valuable information to the model are endless. ## 7. Acknowledgements The authors would like to thank Dr. Gregor von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions with regard to exploring this idea and also for their aid with preparing the various drafts of this article. In addition to that the community of students from the FA20-BL-ENGR-E534-11530: Big Data Applications course also deserve a thanks from the author for the support, continued engagement, and valuable discussions through Piazza. ### 7.1 Work Breakdown For the effort developed, the team split tasks between each other to cover more ground. The requirements for the investigation required a more extensive effort for the teams in the ENGR-E 534 class. To accomplish the requirements, the task was expanded by addressing multiple datasets within the semester and building in multiple models to display the results. The team members were responsible for committing in Github multiple times throughout the semester. The tasks were divided as follows: 1. Chelsea Gorius • Exploratory Data Analysis • Feature Engineering • Keras Deep Learning Model 2. Gavin Hemmerlein • Organization of Items • Model Development 3. Both • Report • All Outstanding Items ## 8. References 1. A. Mehra, Sports Medicine Market worth$7.2 billion by 2025, [online] Markets and Markets. https://www.marketsandmarkets.com/PressReleases/sports-medicine-devices.asp [Accessed Oct. 15, 2020]. ↩︎

2. J. Harris, B. Erickson, B. Bach Jr, G. Abrams, G. Cvetanovich, B. Forsythe, F. McCormick, A. Gupta, B. Cole, Return-to-Sport and Performance After Anterior Cruciate Ligament Reconstruction in National Basketball Association Players, Sports Health. 2013 Nov;5(6):562-8. doi: 10.1177/1941738113495788. [Online serial]. Available: https://pubmed.ncbi.nlm.nih.gov/24427434 [Accessed Oct. 24, 2020]. ↩︎

3. W. Kraemer, C. Denegar, and S. Flanagan, Recovery From Injury in Sport: Considerations in the Transition From Medical Care to Performance Care, Sports Health. 2009 Sep; 1(5): 392–395.[Online serial]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3445177 [Accessed Oct. 24, 2020]. ↩︎

4. R. Hopkins, NBA Injuries from 2010-2020, [online] Kaggle. https://www.kaggle.com/ghopkins/nba-injuries-2010-2018 [Accessed Oct. 9, 2020]. ↩︎

5. N. Lauga, NBA games data, [online] Kaggle. https://www.kaggle.com/nathanlauga/nba-games?select=games_details.csv [Accessed Oct. 9, 2020]. ↩︎

6. J. Cirtautas, NBA Players, [online] Kaggle. https://www.kaggle.com/justinas/nba-players-data [Accessed Oct. 9, 2020]. ↩︎

7. V. Aliyev, A hands-on explanation of Gradient Boosting Regression, [online] Medium. https://medium.com/@vagifaliyev/a-hands-on-explanation-of-gradient-boosting-regression-4cfe7cfdf9e [Accessed Nov., 9 2020]. ↩︎

8. P. Mandon, What is LightGBM, How to implement it? How to fine tune the parameters?, [online] Medium. https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc [Accessed Nov., 9 2020]. ↩︎

9. The Data Scientist, What deep learning is and isn’t, [online] The Data Scientist. https://thedatascientist.com/what-deep-learning-is-and-isnt [Accessed Nov., 9 2020]. ↩︎

# 41 - NFL Regular Season Skilled Position Player Performance as a Predictor of Playoff Appearance Overtime

The present research investigates the value of in-game performance metrics for NFL skill position players (i.e., Quarterback, Wide Receiver, Tight End, Running Back and Full Back) in predicting post-season qualification. Utilizing nflscrapR-data that collects all regular season in-game performance metrics between 2009-2018, we are able to analyze the value of each of these in-game metrics by including them in a regression model that explores each variables strength in predicting post-season qualification. We also explore a comparative analysis between two time periods in the NFL (2009-2011 vs 2016-2018) to see if there is a shift in the critical metrics that predict post-season qualification for NFL teams. Theoretically, this could help inform the debate as to whether there has been a shift in the style of play in the NFL across the previous decade and where those changes may be taking place according to the data. Implications and future research are discussed.

Status: final, Type: Project

Travis Whitaker, fa20-523-308, Edit