Big Data in the Healthcare Industry

Healthcare is the organized provision of medical services to individuals or a community. Over the centuries, the need for innovative healthcare has grown as humans extend their life span and become more aware of preventative care practices. Applying Big Data within the healthcare industry is important for quantifying the effects of efficient and safe solutions at wide scale. Pharmaceutical and bio data research companies can use big data to ingest large volumes of patient record data and use it to work out how preventative care can be implemented before diseases progress to stages beyond the point of potential recovery. Data collected in laboratory settings, together with statistics collected from medical and state healthcare institutions, supports time, money, and life saving initiatives, as deep learning can in certain instances perform better than the average doctor at detecting malignant cells. Big data within healthcare has produced strong results for the advancement and diverse application of informed reasoning towards medical solutions.

Cristian Villanueva, Christina Colon


fa20-523-352




Keywords: EHR, Healthcare, diagnosis, application, treatment, AI, network, records

1. Introduction

Healthcare is a multi-dimensional system established for the prevention, diagnosis, and treatment of health-related issues or impairments in human beings[^1]. Its many dimensions can be characterized by the flow of information into and out of each level, since healthcare has many different applications. These include, but are not limited to, vaccines, surgeries, x-rays, and medicines or treatments. Big data plays a pivotal role in healthcare diagnostics, predictions, and the accelerated outcomes of these applications. Big Data has the potential to save millions of dollars by automating 40% of radiologists’ tasks, to save time on potential treatments through digital patients, and to provide improved outcomes[^1]. With higher diagnostic accuracy rates, advanced AI is able to transform hypothetical analysis into data-driven diagnosis and treatment strategies.

2. Patient Records

EHR stands for ‘electronic health record’, a digital version of a patient’s paper chart. The healthcare industry uses EHRs to maintain records across its institutions. EHRs are real-time, patient-centered records that make information available instantly and securely to authorized users[^2]. An EHR can hold more than a paper chart, including medical history, diagnoses, medications, treatment plans, immunization dates, allergies, radiology images, and laboratory and test results. According to Definitive Healthcare data from 2020, more than 89 percent of all hospitals have implemented inpatient or ambulatory EHR systems[^3]. A network of information surrounding a patient’s health record and medical data supports research and progressive advances in treatment. To underline the potential of this resource, more than 110 million EHRs from across several continents were examined for genetic disease research[^4]. EHRs hold information that can help diagnose, prevent, and treat disease in other patients through early detection. By applying neural networks in deep learning models, EHRs could be compiled and analyzed to identify inconspicuous indicators of disease development in its early stages, well before a human doctor would be able to make a clear diagnosis. This allows preventive measures to be taken early and resources to be allocated so that patients pay the minimum cost for care, the appropriate method of medical intervention is applied, and physicians’ workloads become less strenuous.

2.1 EHR Application: Detecting Error and Reducing Costs

To understand the impact that Big Data such as EHRs has on the healthcare industry, consider a research study that collected data for 1 year before EHR implementation (pre-EHR) and 1 year after it (post-EHR). In the area of ‘medication errors and near misses’, the study reported that ‘medication errors per 1000 hospital days decreased 14.0%-from 17.9 in the pre-EHR period to 15.4 in the 9 months after CPOE implementation’[^5]. The researchers determined that implementing an EHR with computerized provider order entry (CPOE) reduced the costs of treatment and improved patient safety. Participants in the study mentioned an increase in speed for pharmacy, laboratory, and radiology orders. The research also stated ‘our study demonstrated an 18% reduction in laboratory testing’. The study touched on the speed an EHR can add to the treatment process: orders are validated much more quickly, and hospitals and patients save money through rapid diagnosis and treatment. Because a patient’s own EHR, and the wider body of EHRs it can be compared against, already holds much of the needed information, patients do not have to undergo or pay for wasteful testing and examinations. Examples of models used in this example study include data mining through phenotyping and natural language processing. Data mining allows large sets of patient data to be aggregated in order to make inferences over a population or form theories about how a disease will progress in a given patient. Phenotyping categorizes features of patients’ health, their DNA, and ultimately their overall health. Association rule mining helps automated systems predict behavioral and health outcomes from patients’ circumstances.
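The 14.0% figure quoted above is a relative decrease between the two reported rates, and the arithmetic can be checked directly. The following is a minimal sketch; the function name `relative_reduction` is a hypothetical helper introduced here for illustration and is not taken from the study.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative decrease from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Medication errors per 1000 hospital days, as quoted from the study.
pre_ehr_rate = 17.9
post_cpoe_rate = 15.4

reduction = relative_reduction(pre_ehr_rate, post_cpoe_rate)
print(f"Relative reduction: {reduction:.1f}%")  # prints "Relative reduction: 14.0%"
```

The same helper applies to any before/after pair of rates, provided both are measured in the same unit.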

3. AI Models in Cancer Detection

AI is changing the early detection of cancer, as models can be more accurate and precise in analyzing images of masses and cells. Diagnosing cancer is difficult because a mass may be either benign or malignant. The time oncologists spend examining cell nuclei and their features to determine whether a mass is malignant or benign can be staggering. Using what is known about cancer, AI can be trained to scour many images and screenings of cell nuclei for the key indicators. These key indicators can be narrowed down further, since AI can determine which indicators correlate most strongly with malignant cancer. One dataset from Kaggle consists of 569 cases of breast cancer, 357 benign and 212 malignant. It initially included 33 features that may have indicated malignancy. These were reduced to 10 features, as not all of them contributed equally to the diagnosis. Of those 10 features, 5 demonstrated the highest correlation with malignancy. Several models were then compared to find the one with the highest accuracy and precision. This form of AI detection improves the efficacy of early cancer detection.
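One simple way to narrow a feature set by its correlation with malignancy is to rank features by the absolute Pearson correlation between each feature and the diagnosis label. The sketch below assumes that approach; the report does not state which correlation measure was used. The feature names and all values are made-up toy data for illustration, not rows from the Kaggle dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_features(features, labels, top_k):
    """Return the top_k feature names by absolute correlation with labels."""
    scores = {name: abs(pearson(values, labels))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (assumed values): 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "radius_mean": [20.1, 18.5, 19.7, 12.3, 11.8, 13.0],
    "texture_mean": [17.0, 21.5, 19.0, 18.2, 20.9, 17.8],
    "concavity_mean": [0.30, 0.12, 0.25, 0.10, 0.05, 0.15],
}

print(rank_features(features, labels, top_k=2))
```

On this toy data the weakly related `texture_mean` column is dropped, which mirrors the reduction from a larger feature set to the few features that contribute most.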

AI Models Demonstrate Accuracy & Precision

Figure 1. The AI in this study used images to cross-analyze features of a patient’s results, comparing models to determine which is the most accurate and precise and can therefore best serve a physician in their diagnostic report.
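Accuracy and precision, the two measures used to compare models here, can both be computed from a model's predictions and the true diagnoses. The following is a minimal sketch using the standard definitions; the label lists are invented for illustration and do not come from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = malignant, 0 = benign)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Share of all cases that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    """Share of predicted-malignant cases that really were malignant."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels for eight cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(accuracy(y_true, y_pred))   # 0.75
print(precision(y_true, y_pred))  # 0.75
```

The two measures answer different questions: accuracy covers every case, while precision only looks at the cases the model flagged as malignant, so a model can score well on one and poorly on the other.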

3.1 Early Detection Big Data Applications

‘An ounce of prevention is worth a pound of cure’ is a common philosophy held by medical professionals. The idea is that if a disease can be prevented from ever reaching its final form through small routine tasks and check-ups, a great deal of the harm and suffering involved in trying to bring it back under control can be avoided. Many medical solutions for diseases such as cancer or degenerative brain diseases rely on the idea that outside medical intervention will strengthen the patient enough for the human body to heal itself through existing biological principles[^6]. For example, vaccines work by introducing a weakened or inactivated form of a pathogen, or a component of it, so that the immune system learns to recognize it and builds up immunity naturally. For such measures to be effective, the intervention must take place before infection. The same holds for preventative care such as routine screenings of individuals with a family history of disease or a general genetic predisposition: the power lies in having the knowledge and discernment to catch the disease early. In many diseases, once a patient presents symptoms it is too late, or the probability of survival or recovery is sharply reduced. This places immense pressure on patients to secure access to routine screenings, and even more pressure on physicians to take in these patients and make a preliminary diagnosis with little more than a visual analysis. Big data can automate parts of these tasks and give physicians a considerable advantage in discerning what is truly happening in a patient’s circumstance.

3.2 Detecting Cervical Cancer

Cervical cancer was once one of the most common causes of cancer death for women in the United States. However, preventive care in the form of the Pap test has reduced the death rate significantly. In a Pap test, cells are collected from the cervix and examined, typically as microscope images, for changes that might indicate cancer is going to form. Cervical cancer has a much higher death rate without early detection, as treatment is more effective in the early stages. Artificial intelligence gives a computer the ability to act and reason based on a set of known information. Machine learning builds on this by letting the computer work iteratively over massive amounts of data to make predictions and decisions. In this way, machines have been able to detect cervical cancer with greater precision and accuracy, in some cases, than gynecologists[^7]. Screening images analyzed by a convolutional neural network (CNN) are the key to uncovering correlations across the large collection of images. With further training on the data set, the CNN is able to recognize cancer as or before it forms. One study using this method of machine learning performed with 90-96% accuracy, with the potential to save lives. The CNN is able to identify the colors, shapes, sizes, edges, and other features pertaining to cancerous cells.

This is groundbreaking for women in developing countries such as India and Nigeria, where the death rate for cervical cancer is much higher than in the United States due to lack of access to routine Pap smears. Women could get results on their cervical cancer status even if they do not get a Pap smear every 3 years as recommended by doctors. For example, if a woman in Nigeria has her first Pap smear at the age of 40, when the recommended age to start is 21, she has gone unchecked for nearly 20 years and the early detection window has narrowed. However, if she would otherwise be among the 20% of cervical cancer patients diagnosed over the age of 65, a deep learning analysis of her Pap smear at 40 could save her life and spare her potential suffering. Early detection is key, and big data widens early detection windows by providing a deeper analysis in the preventive care stages. From there, doctors are able to implement the best care plan available on a case-by-case basis.
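The edge and shape detection attributed to the CNN in this section rests on the convolution operation: a small kernel slides over the image and responds strongly where pixel intensity changes. The sketch below illustrates that single operation in plain Python. It is not the network from the cited study; the image and the vertical-edge kernel are hand-picked toy values, whereas a trained CNN learns its kernel weights from data.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map.

    This is the cross-correlation form that CNN layers conventionally
    call convolution.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy grayscale image: dark region on the left, bright region on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The non-zero columns of the feature map mark where the boundary between the dark and bright regions sits, which is the kind of low-level signal later layers combine into shapes and sizes.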

4. Artificial Intelligence in Cardiovascular Disease

AI models for cardiovascular disease are innovating disease detection by combining different types of analysis for more efficient and accurate results. Because cardiovascular diseases typically involve the heart and lungs, many factors can explain why a person is experiencing certain symptoms or is at risk of developing a more critical condition. An immense amount of labor goes into the diagnosis and treatment of individuals with cardiovascular disease on the part of general physicians, specialists, nurses, and several other medical professionals. Artificial intelligence can add a layer of ease and accuracy to analyzing a patient’s status or risk for cardiovascular disease. AI is able to overcome the challenges of low-quality, pixelated images and draw clearer and more accurate conclusions at a stage where more prevention strategies can still be implemented. In this sense AI can analyze the systems of the human body as a whole, whereas a doctor might need several appointments with a patient to evaluate the lungs, heart, and so on. By segmenting x-rays from numerous patients, AI is able to learn and grow its data set to produce increasingly accurate and precise results[^8]. By combining recurrent neural networks and convolutional neural networks, artificial intelligence can go beyond what currently exists in medical analysis and provide better results for patients in need. Recurrent neural networks process data in sequence, building on past inputs to produce each new output. They work hand in hand with convolutional neural networks, which analyze images by applying learned weights and biases to extract the features that inform a predicted outcome.
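The way a recurrent neural network builds on past data can be shown with a single recurrence: each new hidden state is computed from the current input and the previous hidden state. The following is a minimal sketch with a scalar hidden state and hand-picked weights, chosen only for readability. Real models use weight matrices learned from data, and nothing here is taken from the cited work.

```python
import math


def rnn_forward(inputs, w_input, w_hidden, bias):
    """Run a one-unit recurrent layer over a sequence and return every state.

    state_t = tanh(w_input * x_t + w_hidden * state_{t-1} + bias)
    """
    hidden = 0.0  # initial state: no memory of earlier inputs yet
    states = []
    for x in inputs:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states


# A single pulse followed by silence (assumed toy sequence).
sequence = [1.0, 0.0, 0.0]

with_memory = rnn_forward(sequence, w_input=1.0, w_hidden=0.5, bias=0.0)
without_memory = rnn_forward(sequence, w_input=1.0, w_hidden=0.0, bias=0.0)

print(with_memory)     # the pulse decays gradually across later steps
print(without_memory)  # later steps fall straight back to zero
```

With a non-zero recurrent weight the first input keeps influencing later outputs, which is what lets such a network relate an earlier measurement to a later one in a series.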

AI Learning Wireframe[^10]

Figure 2. A wireframe of how data is processed to draw relevant conclusions from thousands of images and arrive at diagnostic predictions. Risk analysis is crucial for heart attack prevention and for understanding how susceptible a person is to heart failure. Because heart attacks can lead to strokes through loss of blood and oxygen to the brain, these imaging tools serve as an invaluable life saving mechanism that helps bring prevention to the forefront of these medical emergencies.

5. Deep Learning Techniques for Genomics

A digital patient is the idea that a patient’s health record can be combined with live, exact biometrics for the purpose of testing. Through this method, medical professionals will be able to propose new solutions to patients and monitor the potential effects of operations or medicines over a period of time in a condensed, rapid-results format. Essentially, a patient would be able to see how their body reacts to medical procedures before they are performed. The digital copy of the patient would receive simulated trial treatments to show what would happen over time if the solution were adopted. For example, a patient would be able to verify with their physician which type of diuretic, beta blocker, or angiotensin receptor blocker would be the most effective medication for regulating their hypertension[^11]. Physicians would be able to mitigate the risks and side effects associated with a given solution based on the patient’s expected response, as predicted from the data uploaded to the model. To produce deep learning results, models must be given genetic markers that computational methods can use to traverse the genetic strands and draw relevant conclusions. In this way, data can be processed to propose changes to disease-carrying chains of DNA or to fortify immune responses in those who are immunocompromised[^9].

Genomics Illustration

Figure 3. Illustrates how genes are analyzed through data collection methods such as EHR, personal biometric data, and family history in order to track what type of disease poses a threat and how to prevent, predict, and treat disease at the molecular level. This produces accurate treatments, medications, and predictions without having to put the patient through any trials.

6. Discussion

In considering the numerous innovations made possible by Big Data, one can expect major impacts on society as we know it. These types of data solutions should be made accessible to all those who are in need. A collective effort must be made to promote equitable access to the life saving artificial intelligence discussed in this report. Limited processing power and lack of resources stand as barriers to widespread access to proper testing. Governments and private sector industries must therefore work together to avoid monopolies and price gouging that would limit access to such valuable data and computing models. With further investment in deep learning models, error margins can be narrowed and risk percentages can be reduced for prescriptive analysis in specific use cases. The more information and examples are available, the better and more advanced a deep learning system can become. With the addition of electronic health records and past analysis, artificial intelligence has the power to revolutionize the healthcare industry. When patients are offered services that could save their lives, they have more incentive to stay involved in their personal health, and optimized computation can target patients for more results-focused visits to the doctor. Doctors can be relieved of a portion of their workload and foster a better work-life balance by cutting down on testing time and having more time to interact with patients in educational, informative appointments. Legally, medical professionals will be able to use prediction errors as alternative signals to further analyze a patient and justify treatment measures. Data visualization of the potential outcomes of a specific treatment method will empower patients and doctors to choose the pathway with the most favorable outcome.

Convolutional neural networks (CNNs) are one of the most essential, if not the most essential, forms of deep learning algorithm for AI in healthcare. A CNN takes images as input and learns weights and biases that let it differentiate and match aspects of images that would go unnoticed by the human eye. By identifying edges, shape, size, color, and amount of scarring, a CNN is able to classify cells into five categories: normal, mild, moderate, severe, and carcinoma. Accuracy in this space is above 95%, which creates a new opportunity for medical professionals to provide their patients with highly accurate and timely action planning for treatment and recovery[^13]. Beyond human healthcare, CNN modeling has the potential to transfer into veterinary medicine, agricultural engineering, and sustainable environment initiatives to detect invasive species and similar disease development. Dogs or cats with cancer or heartworm could be analyzed to determine, given their heredity or breed and life span, the chances and timeline of disease development. Crop production could be increased by processing plant genomes in combination with soil data to foresee which combination will produce the most abundant and profitable harvest. Lastly, ecosystems disturbed by global warming could be studied with CNNs in order to factor in changes to the environment and identify what solutions could be on the horizon. With enough sample collection, CNNs have the potential to help secure a brighter future.
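The five-category cell classification described in this discussion is typically the last step of a CNN: the network produces one score per category, and the scores are turned into probabilities. The sketch below assumes a softmax output, which is a common choice but is not confirmed by the cited study. The raw scores are invented for illustration.

```python
import math

# Category names as listed in the discussion above.
CATEGORIES = ["normal", "mild", "moderate", "severe", "carcinoma"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most likely category and its probability."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CATEGORIES[best], probabilities[best]


# Hypothetical raw scores from the final layer for one cell image.
raw_scores = [0.2, 0.1, 0.4, 2.5, 1.0]

label, probability = classify(raw_scores)
print(label, round(probability, 3))
```

Reporting the probability alongside the label lets a physician see how confident the model is, rather than receiving a bare category.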

7. Conclusion

Healthcare is an essential resource for living a long life; without it, lifespans can be shortened dramatically, and even more so for those hindered by hereditary ailments. Healthcare has existed for as long as medicine and other treatments have, which is to say for centuries. The field has expanded well beyond what any medical professional or institution could have expected. Information and resources are now available to save and care for a patient’s life even where a lack of training might otherwise hinder the caregiver. It is a real accomplishment to mesh the practices of many medical professionals with Big Data to develop the largest compendium of medical practices in the world. Such an asset allows many to collaborate on new findings and reinforce old ones, and these results allow physicians to work without faltering over inconclusive findings. The goal for this area of Big Data is to continue making the EHR system more secure and friendly towards medical professionals in different areas of practice, while allowing easy access for patients who seek out their own medical history. Further advancements in this area of healthcare can be applied to other fields that must reference the compendium of individuals and their histories going forward. Such a structure will continue to aid generations of physicians and patients alike and can aid technological advancements along the way.

8. Acknowledgements

We would like to thank Dr. Gregor von Laszewski for allowing us to complete this report despite the delays and the lack of communication. We would also like to thank the AI team for their commitment to assisting in this class; even through a pandemic they continued to help the students complete the course. We would also like to thank Dr. Geoffrey Fox for teaching the course and making the class as informative as possible given his experience in the field of Big Data.

9. References

[^1] Laney, D., AD. Mauro, M., Gubbi, J., Doyle-Lindrud, S., Gillum, R., Reiser, S., . . . Reardon, S. Big data in healthcare: Management, analysis and future prospects (2019, June 19). Retrieved December 10, 2020, from

[^2] What is an electronic health record (EHR)? (2019, September 10). Retrieved December 10, 2020 from

[^3] Moriarty, A. Does Hospital EHR Adoption Actually Improve Data Sharing? (2020, October 23). Retrieved December 10, 2020 from

[^4] Cruciana, Paula A. The Implications of Big Data in Healthcare (2019, November 21). Retrieved December 11, 2020 from

[^5] Zlabek, Jonathan A. Early cost and safety benefits of an inpatient electronic health record (2011, February 2). Retrieved December 10, 2020 from

[^6] Artificial Intelligence-Opportunities in Cancer Research. (2020, August 31). Retrieved December 11, 2020 from

[^7] Zhang, R., Simon, G., & Yu, F. Advancing Alzheimer’s research: A review of big data promises. (2017, June 4). Retrieved December 11, 2020 from

[^8] Arslan, M., Owais, M., & Mahmood, T. Artificial Intelligence-Based Diagnosis of Cardiac and Related Diseases (2020, March 23). Retrieved December 13, 2020 from

[^9] Eraslan, G., Avsec, Z., Gagneur, J., & Theis, Fabian J. Deep learning: new computational modelling techniques for genomics. (2019, April 10). Retrieved December 14, 2020 from

[^10] 1Regina. AI to Detect Cancer. (2019, November 22). Retrieved December 14, 2020 from

[^11] Koumakis, L. Deep learning models in genomics; are we there yet? (2020). Retrieved December 14, 2020 from

[^12] Ross, M.K., Wei, W., & Ohno-Machado, L. ‘Big Data’ and the Electronic Health Record (2014, August 15). Retrieved 15, 2020 from

[^13] P, Shanthi. B., Faruqi, F., K, Hareesha, K., & Kudva, R. Deep Convolution Neural Network for Malignancy Detection and Classification in Microscopic Uterine Cervix Cell Images (2019, November 1). Retrieved December 15, 2020 from