Big Data Applications are an important topic with impact in both academia and industry.
2020
- 1: Introduction to AI-Driven Digital Transformation
- 2: BDAA Fall 2020 Course Lectures and Organization
- 3: Big Data Use Cases Survey
- 4: Physics
- 5: Introduction to AI in Health and Medicine
- 6: Mobility (Industry)
- 7: Sports
- 8: Space and Energy
- 9: AI In Banking
- 10: Cloud Computing
- 11: Transportation Systems
- 12: Commerce
- 13: Python Warm Up
- 14: MNIST Classification on Google Colab
1 - Introduction to AI-Driven Digital Transformation
Overview
This lecture is recorded in 8 parts and gives an introduction to and motivation for the class. This and other lectures in the class are divided into “bite-sized lessons” from 5 to 30 minutes in length; that’s why it has 8 parts.
The lecture explains what students might gain from the class even if they end up in different types of jobs: data engineering, software engineering, data science, or business (application) expertise. It stresses that we are well into a transformation that impacts industry, research, and the way life is lived. This transformation is centered on doing things digitally, with clouds, edge computing, and deep learning providing the implementation. This “AI-Driven Digital Transformation” is as transformational as the Industrial Revolution of the past. We note that deep learning dominates most innovative AI, replacing several traditional machine learning methods.
The slides for this course can be found at E534-Fall2020-Introduction
A: Getting Started: BDAA Course Introduction Part A: Big Data Applications and Analytics
This lesson briefly describes the trends driving, and consequent on, the AI-Driven Digital Transformation. It discusses the organizational aspects of the class and notes that the two driving trends are clouds and AI. Clouds are mature and a dominant presence. AI is still rapidly changing, and we can expect further major changes. The edge (devices and associated local fog computing) has always been important, but now more is being done there.
B: Technology Futures from Gartner’s Analysis: BDAA Course Introduction Part B: Big Data Applications and Analytics
This lesson goes through the technologies (AI, Edge, Cloud) from 2008 to 2020 that are driving the AI-Driven Digital Transformation. We use Hype Cycles and Priority Matrices from Gartner, tracking important concepts from the Innovation Trigger and Peak of Inflated Expectations through to the Plateau of Productivity. We contrast clouds and AI.
C: Big Data Trends: BDAA Course Introduction Part C: Big Data Applications and Analytics
- This gives illustrations of sources of big data.
- It gives key graphs of data sizes and images uploaded; computing, data, and bandwidth trends;
- Cloud-Edge architecture.
- Intelligent machines, and a comparison of the data from aircraft engine monitors with that from Twitter
D: Computing Trends: BDAA Course Introduction Part D: Big Data Applications and Analytics
- Multicore revolution
- Overall Global AI and Modeling Supercomputer GAIMSC
- Moore's Law compared to Deep Learning computing needs
- Intel and NVIDIA status
E: Big Data and Science: BDAA Course Introduction Part E: Big Data Applications and Analytics
- Applications and Analytics
- Cyberinfrastructure, e-moreorlessanything.
- LHC, Higgs Boson and accelerators.
- Astronomy, SKA, multi-wavelength.
- Polar Grid.
- Genome Sequencing.
- Examples, Long Tail of Science.
- Wired’s End of Science; the 4 paradigms.
- More data versus Better algorithms.
F: Big Data Systems: BDAA Course Introduction Part F: Big Data Applications and Analytics
- Clouds, Service-oriented architectures, HPC (High Performance Computing), Apache Software
- DIKW process illustrated by Google maps
- Raw data to Information/Knowledge/Wisdom/Decision; the data deluge from the Edge
- Parallel Computing
- MapReduce
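To make the MapReduce bullet concrete, here is a minimal word-count sketch in Python (our own illustration, not code from the lectures): a map phase emits (word, 1) pairs and a reduce phase sums the counts per word.

```python
# Minimal word-count in the MapReduce style: map emits (word, 1) pairs,
# the shuffle groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:   # shuffle and reduce combined for brevity
        counts[word] += one
    return dict(counts)

docs = ["big data applications", "big data analytics", "cloud and big data"]
print(reduce_phase(map_phase(docs)))  # e.g. {'big': 3, 'data': 3, ...}
```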
G: Industry Transformation: BDAA Course Introduction Part G: Big Data Applications and Analytics
AI grows in importance and industries transform with
- Core Technologies related to
- New “Industries” over the last 25 years
- Traditional “Industries” Transformed; malls and other old industries transform
- Good to be master of Cloud Computing and Deep Learning
- AI-First Industries
H: Jobs and Conclusions: BDAA Course Introduction Part H: Big Data Applications and Analytics
- Job trends
- Become digitally savvy so you can take advantage of the AI/Cloud/Edge revolution with different jobs
- The qualitative idea of Big Data has turned into a quantitative realization through Cloud, Edge, and Deep Learning
- Clouds are here to stay and one should plan on exploiting them
- Data Intensive studies in business and research continue to grow in importance
2 - BDAA Fall 2020 Course Lectures and Organization
This describes the weekly meetings, the overall videos, and the homework.
Contents
Week 1
This first class discussed overall issues and did the first ~40% of the introductory slides. This presentation is also available as 8 recorded presentations under Introduction to AI-Driven Digital Transformation
Administrative topics
The following topics were addressed
- Homework
- Difference between undergrad and graduate requirements
- Contact
- Communication via Piazza
If you have questions please post them on Piazza.
Assignment 1
- Post a professional three-paragraph bio on Piazza. Please post it under the folder bio. Use as subject “Bio: Lastname, Firstname”. Research what a professional biography is. Remember to write it in the third person and focus on professional activities. Look up the bios of Geoffrey or Gregor as examples.
- Write report described in Homework 1
- Please study recorded lectures either in zoom or in Introduction to AI-Driven Digital Transformation
Week 2
This covered the remaining 60% of the introductory slides. This presentation is also available as 8 recorded presentations.
Student questions were answered.
Video and Assignment
These introduce Colab with examples and a homework using Colab for deep learning. Please study the videos and do the homework.
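As a rough indication of the kind of Colab deep learning exercise involved (a generic sketch, not the actual homework code), a minimal Keras MNIST classifier runs as-is on Google Colab:

```python
# Minimal MNIST classifier in Keras; runs unchanged on Google Colab.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28x28 image -> 784 vector
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"), # 10 digit classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```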
Week 3
This lecture reviewed where we had gotten to and introduced the new Cybertraining website. Then we gave an overview of the use case lectures, which are to be studied this week. The use case overview slides are available as Google Slides.
Videos
Please study Big Data Use Cases Survey
Big Data in pictures
Collage of Big Data Players
Software systems of importance through early 2016. This collection was stopped due to rapid change, but the categories and entries are still valuable. We call this HPC-ABDS, for High Performance Computing Enhanced Apache Big Data Stack.
HPC-ABDS Global AI Supercomputer compared to a classic cluster.
Six Computational Paradigms for Data Analytics
Features that can be used to distinguish and group together applications in both data and computational science
Week 4
We surveyed next week's videos, which describe the search for the Higgs Boson and the statistical methods used in the analysis of such counting experiments.
The Higgs Boson slides are available as Google Slides.
Videos for Week 4
Please study Discovery of Higgs Boson
Week 5
This week's class and its Zoom video cover two topics
- Discussion of Final Project for Class and use of markdown text technology based on slides Course Project.
- Summary of Sports Informatics Module based on slides Sports Summary.
Videos for Week 5
Please Study Sports as Big Data Analytics
Week 6
This week’s video class recorded first part of Google Slides and emphasizes that these lectures are partly aimed at suggesting projects.
This class started with a review of applications for AI enhancement.
- Physics: Discovery of Higgs Boson
- Survey of 51 Applications
- Sports
- Health and Medicine
plus those covered in Spring 2020 but not updated this semester
- Banking: YouTube playlist and Google slides
- Commerce: YouTube playlist and Google slides
- Mobility Industry: YouTube playlist and Google slides
- Transportation Systems: YouTube playlist and Google slides
- Space and Energy: YouTube playlist and Google slides
We focus on Health and Medicine with a summary talk.
Videos for Week 6
See module on Health and Medicine
Week 7
This week’s video class recorded the second part of Google Slides.
Videos for Week 7
Continue module on Health and Medicine
Week 8
We discussed projects with the current list at https://docs.google.com/document/d/13TZclzrWvkgQK6-8UR-LBu_LkpRsbiQ5FG1p--VZKZ8/edit?usp=sharing
This week’s video class recorded the first part of Google Slides.
Videos for Week 8
Module on Cloud Computing
Week 9
We discussed the use of GitHub for projects (the recording missed a small part of this) and continued the discussion of cloud computing, but did not finish the slides yet.
This week’s video class recorded the second part of Google Slides.
Videos for Week 9
Continue work on your project and complete the study of the videos already assigned. If they interest you, please review the videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don't forget the participation grade from GitHub activity each week.
Week 10
We discussed the use of GitHub for projects and finished the summary of cloud computing.
This week’s video class recorded the last part of Google Slides.
Videos for Week 10
Continue work on your project and complete the study of the videos already assigned. If they interest you, please review the videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don't forget the participation grade from GitHub activity each week.
Week 11
This week's video class went through project questions.
Videos for Week 11
Continue work on your project and complete the study of the videos already assigned. If they interest you, please review the videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don't forget the participation grade from GitHub activity each week.
Week 12
This week's video class discussed deep learning for Time Series. There are Google Slides for this.
Videos for Week 12
Continue work on your project and complete the study of the videos already assigned. If they interest you, please review the videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don't forget the participation grade from GitHub activity each week.
Week 13
This week's video class went through project questions.
The class is formally finished. Please submit your homework and project.
Week 14
This week's video class went through project questions.
The class is formally finished. Please submit your homework and project.
Week 15
This week's video class was a technical presentation on “Deep Learning for Images”. There are Google Slides.
3 - Big Data Use Cases Survey
This unit has four lectures (slide decks). The survey is 6 years old, but the illustrative scope of Big Data Applications is still valid and has no better alternative. The problems and the use of clouds have not changed. There have been algorithmic advances (deep learning) in some cases. The lectures are
- Overview of NIST Process
- The 51 Use Cases divided into groups
- Common features of the 51 Use Cases
- 10 Patterns of data – computer – user interaction seen in Big Data Applications
There is an overview of these lectures below. The use case overview slides recorded here are available as Google Slides.
Lecture set 1. Overview of NIST Big Data Public Working Group (NBD-PWG) Process and Results
This is first of 4 lectures on Big Data Use Cases. It describes the process by which NIST produced this survey
Use Case 1-1 Introduction to NIST Big Data Public Working Group
The focus of the NBD-PWG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, secure reference architectures, and a technology roadmap. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables that enable big data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value to be added by big data service providers and the flow of data between stakeholders in a cohesive and secure manner.
Introduction (13:02)
Use Case 1-2 Definitions and Taxonomies Subgroup
The focus is to gain a better understanding of the principles of Big Data. It is important to develop a consensus-based common language and vocabulary for the terms used in Big Data across stakeholders from industry, academia, and government. In addition, it is critical to identify the essential actors with their roles and responsibilities, and to subdivide them into components and sub-components according to how they interact/relate with each other and to their similarities and differences. For Definitions: compile the terms used by all stakeholders regarding the meaning of Big Data from various standards bodies, domain applications, and diversified operational environments. For Taxonomies: identify the key actors with their roles and responsibilities from all stakeholders, and categorize them into components and subcomponents based on their similarities and differences. In particular, Data Science and Big Data terms are discussed.
Taxonomies (7:42)
Use Case 1-3 Reference Architecture Subgroup
The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus-based approach to orchestrate vendor-neutral, technology- and infrastructure-agnostic analytics tools and computing environments. The goal is to enable Big Data stakeholders to pick and choose technology-agnostic analytics tools for processing and visualization on any computing platform and cluster, while allowing value to be added by Big Data service providers and the flow of data between stakeholders in a cohesive and secure manner. Results include a reference architecture with well-defined components and linkage, as well as several exemplars.
Architecture (10:05)
Use Case 1-4 Security and Privacy Subgroup
The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture that handles security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifying which key organizations are working on these standards. The Top Ten Big Data Security and Privacy Challenges from the CSA (Cloud Security Alliance) BDWG are studied. Specialized use cases include Retail/Marketing, Modern Day Consumerism, Nielsen Homescan, Web Traffic Analysis, Healthcare, Health Information Exchange, Genetic Privacy, Pharma Clinical Trial Data Sharing, Cyber-security, Government, Military, and Education.
Security (9:51)
Use Case 1-5 Technology Roadmap Subgroup
The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus vision with recommendations on how Big Data should move forward, by performing a good gap analysis using the materials gathered from all the other NBD subgroups. This includes setting standardization and adoption priorities through an understanding of what standards are available or under development as part of the recommendations. The tasks are: gather input from the NBD subgroups and study the taxonomies for the actors' roles and responsibilities, the use cases and requirements, and the secure reference architecture; gain an understanding of what standards are available or under development for Big Data; perform a thorough gap analysis and document the findings; identify possible barriers that may delay or prevent the adoption of Big Data; and document the vision and recommendations.
Technology (4:14)
Use Case 1-6 Requirements and Use Case Subgroup
The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. The tasks are: gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the essence of the use cases (not done yet); and work with the Reference Architecture subgroup to validate the requirements and reference architecture by explicitly implementing some patterns based on the use cases. The progress of gathering use cases (discussed in the next two units) and the systemization of requirements are discussed.
Requirements (27:28)
Use Case 1-7 Recent Updates of work of NIST Public Big Data Working Group
This video is an update on recent work in this area. The first slide of this short lesson discusses a new version of the use case survey that had many improvements, including tags to label key features (as discussed in slide deck 3), and merged in a significant set of security and privacy fields. These came from the security and privacy working group described in lesson 4 of this slide deck. A link for this new use case form is https://bigdatawg.nist.gov/_uploadfiles/M0621_v2_7345181325.pdf
A recent December 2018 use case form for Astronomy’s Square Kilometer Array is at https://docs.google.com/document/d/1CxqCISK4v9LMMmGox-PG1bLeaRcbAI4cDIlmcoRqbDs/edit?usp=sharing This uses a simplification of the official new form.
The second (last) slide in the update gives some useful links to the latest work. NIST's latest published work is at https://bigdatawg.nist.gov/V3_output_docs.php Related activities are described at http://hpc-abds.org/kaleidoscope/
Lecture set 2: 51 Big Data Use Cases from NIST Big Data Public Working Group (NBD-PWG)
Use Case 2-1 Government Use Cases
This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).
Government Use Cases (17:43)
Use Case 2-2 Commercial Use Cases
This covers Cloud Eco-System for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing; and Simulation-driven Materials Genomics.
This lesson is divided into 3 separate videos
Part 1
(9:31)
Part 2
(19:45)
Part 3
(10:48)
Use Case 2-3 Defense Use Cases
This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance and Intelligence Data Processing and Analysis.
Defense Use Cases (15:43)
Use Case 2-4 Healthcare and Life Science Use Cases
This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.
Healthcare and Life Science Use Cases (30:11)
Use Case 2-5 Deep Learning and Social Networks Use Cases
This covers Large-scale Deep Learning; Organizing large-scale, unstructured collections of consumer photos; Truthy: Information diffusion research from Twitter Data; Crowd Sourcing in the Humanities as Source for Big and Dynamic Data; CINET: Cyberinfrastructure for Network (Graph) Science and Analytics; and NIST Information Access Division analytic technology performance measurement, evaluations, and standards.
Deep Learning and Social Networks Use Cases (14:19)
Use Case 2-6 Research Ecosystem Use Cases
DataNet Federation Consortium DFC; The ‘Discinnet process’, metadata - big data global experiment; Semantic Graph-search on Scientific Chemical and Text-based Data; and Light source beamlines.
Research Ecosystem Use Cases (9:09)
Use Case 2-7 Astronomy and Physics Use Cases
This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.
Astronomy and Physics Use Cases (17:33)
Use Case 2-8 Environment, Earth and Polar Science Use Cases
EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, DataProduct Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE’s NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area and DOE-BER AmeriFlux and FLUXNET Networks.
Environment, Earth and Polar Science Use Cases (25:29)
Use Case 2-9 Energy Use Case
This covers Consumption forecasting in Smart Grids.
Energy Use Case (4:01)
Lecture set 3: Features of 51 Big Data Use Cases from the NIST Big Data Public Working Group (NBD-PWG)
This unit discusses the categories used to classify the 51 use-cases. These categories include concepts used for parallelism and low and high level computational structure. The first lesson is an introduction to all categories and the further lessons give details of particular categories.
Use Case 3-1 Summary of Use Case Classification
This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.
Summary of Use Case Classification (23:39)
Use Case 3-2 Database(SQL) Use Case Classification
This discusses classic (SQL) database approach to data handling with Search&Query and Index features. Comparisons are made to NoSQL approaches.
Database (SQL) Use Case Classification (11:13)
Use Case 3-3 NoSQL Use Case Classification
This discusses NoSQL (compared in previous lesson) with HDFS, Hadoop and Hbase. The Apache Big data stack is introduced and further details of comparison with SQL.
NoSQL Use Case Classification (11:20)
Use Case 3-4 Other Use Case Classifications
This discusses a subset of use case features: GIS, Sensors, and the support of data analysis and fusion by streaming data between filters.
Use Case Classifications I (12:42)
Use Case 3-5
This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.
Use Case Classifications II (20:18)
Use Case 3-6
This discusses the classifications PP, Fusion, EGO, HPC, GIS, Agent, MC, MR, Expectation Maximization, and benchmarks.
Use Case 3-7 Other Benchmark Sets and Classifications
This video looks at several efforts to divide applications into categories of related applications. It includes “Computational Giants” from the National Research Council; Linpack or HPL from the HPC community; the NAS Parallel Benchmarks from NASA; and finally the Berkeley Dwarfs from UCB. The second part of this video describes efforts in the Digital Science Center to develop a Big Data classification and to unify Big Data and simulation categories. This leads to the Ogre and Convergence Diamonds. Diamonds have facets representing the different aspects by which we classify applications. See http://hpc-abds.org/kaleidoscope/
Lecture set 4. The 10 Use Case Patterns from the NIST Big Data Public Working Group (NBD-PWG)
In this last slide deck of the use cases unit, we focus on 10 use case patterns. These include multi-user querying, real-time analytics, batch analytics, data movement from external data sources, interactive analysis, data visualization, ETL, data mining, and orchestration of sequential and parallel data transformations. We go through the different ways the user and system interact in each case. The use case patterns are divided into 3 classes: 1) initial examples, 2) science data use case patterns, and 3) the remaining use case patterns.
Resources
- NIST Big Data Public Working Group (NBD-PWG) Process
- Big Data Definitions
- Big Data Taxonomies
- Big Data Use Cases and Requirements
- Big Data Security and Privacy
- Big Data Architecture White Paper Survey
- Big Data Reference Architecture
- Big Data Standards Roadmap
Some of the links below may be outdated. Please let us know the new links and notify us of the outdated links.
- DCGSA Standard Cloud (this link does not exist any longer)
- Online 51 Use Cases
- Summary of Requirements Subgroup
- Use Case 6 Mendeley
- Use Case 7 Netflix
- Use Case 8 Search
- http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013 (this link does not exist any longer)
- https://web.archive.org/web/20160828041032/http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html (Archived Pages)
- http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws
- http://www.slideshare.net/beechung/recommender-systems-tutorialpart1intro
- http://www.worldwidewebsize.com/
- Use Case 9 IaaS (infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System provided by Cloud Service Providers (CSPs) and Cloud Brokerage Service Providers (CBSPs)
- Use Case 11 and Use Case 12 Simulation driven Materials Genomics
- Use Case 13 Large Scale Geospatial Analysis and Visualization
- Use Case 14 Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance
- https://web.archive.org/web/20160828235002/http://www.militaryaerospace.com/topics/m/video/79088650/persistent-surveillance-relies-on-extracting-relevant-data-points-and-connecting-the-dots.htm (Archived Pages),
- http://www.defencetalk.com/wide-area-persistent-surveillance-revolutionizes-tactical-isr-45745/
- Use Case 15 Intelligence Data Processing and Analysis
- http://www.afcea-aberdeen.org/files/presentations/AFCEAAberdeen_DCGSA_COLWells_PS.pdf
- http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf
- http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012/_T14/_SmithEtAl/_HorizontalIntegrationOfWarfighterIntel.pdf (this link does not exist any longer)
- https://www.youtube.com/watch?v=l4Qii7T8zeg (this link does not exist any longer)
- http://dcgsa.apg.army.mil/ (this link does not exist any longer)
- Use Case 16 Electronic Medical Record (EMR) Data:
- Regenstrief Institute
- Logical observation identifiers names and codes
- Indiana Health Information Exchange
- Institute of Medicine Learning Healthcare System (this link does not exist any longer)
- Use Case 17
- Pathology Imaging/digital pathology
- https://web.cci.emory.edu/confluence/display/HadoopGIS (this link does not exist any longer)
- Use Case 19 Genome in a Bottle Consortium:
- www.genomeinabottle.org (this link does not exist any longer)
- Use Case 20 Comparative analysis for metagenomes and genomes
- Use Case 25
- Use Case 26 Deep Learning: Recent popular press coverage of deep learning technology:
- http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html
- http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html
- http://www.wired.com/2013/06/andrew_ng/
- A recent research paper on HPC for Deep Learning (this link does not exist any longer)
- Widely-used tutorials and references for Deep Learning:
- Use Case 27 Organizing large-scale, unstructured collections of consumer photos
- Use Case 28
- Use Case 30 CINET: Cyberinfrastructure for Network (Graph) Science and Analytics (this link does not exist any longer)
- Use Case 31 NIST Information Access Division analytic technology performance measurement, evaluations, and standards
- Use Case 32
- DataNet Federation Consortium DFC: The DataNet Federation Consortium,
- iRODS
- Use Case 33 The ‘Discinnet process’, big data global experiment (this link does not exist any longer)
- Use Case 34 Semantic Graph-search on Scientific Chemical and Text-based Data
- Use Case 35 Light source beamlines
- http://www-als.lbl.gov/
- https://www1.aps.anl.gov/ (this link does not exist any longer)
- Use Case 36
- Use Case 37 DOE Extreme Data from Cosmological Sky Survey and Simulations
- Use Case 38 Large Survey Data for Cosmology
- Use Case 39 Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle
- Use Case 40 Belle II High Energy Physics Experiment
- Use Case 41 EISCAT 3D incoherent scatter radar system
- Use Case 42 ENVRI, Common Operations of Environmental Research Infrastructure
- ENVRI Project website
- ENVRI Reference Model (Archive Pages)
- ENVRI deliverable D3.2: Analysis of common requirements of Environmental Research Infrastructures (this link does not exist any longer)
- ICOS
- Euro-Argo
- EISCAT 3D (Archived Pages)
- LifeWatch
- EPOS
- EMSO
- Use Case 43 Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets
- Use Case 44 UAVSAR Data Processing, Data Product Delivery, and Data Services
- Use Case 47 Atmospheric Turbulence - Event Discovery and Predictive Analytics
- Use Case 48 Climate Studies using the Community Earth System Model at DOE’s NERSC center
- Use Case 50 DOE-BER AmeriFlux and FLUXNET Networks
- Use Case 51 Consumption forecasting in Smart Grids
- https://web.archive.org/web/20160412194521/http://dslab.usc.edu/smartgrid.php (Archived Pages)
- https://web.archive.org/web/20120130051124/http://ganges.usc.edu/wiki/Smart_Grid (Archived Pages)
- https://www.ladwp.com/ladwp/faces/ladwp/aboutus/a-power/a-p-smartgridla?_afrLoop=157401916661989&_afrWindowMode=0&_afrWindowId=null#%40%3F_afrWindowId%3Dnull%26_afrLoop%3D157401916661989%26_afrWindowMode%3D0%26_adf.ctrl-state%3Db7yulr4rl_17 (this link does not exist any longer)
- http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6475927
4 - Physics
E534 2020 Big Data Applications and Analytics Discovery of Higgs Boson
Summary: This section of the class is devoted to a particular physics experiment but uses it to discuss so-called counting experiments. Here one observes “events” that occur randomly in time, and one studies the properties of those events; in particular, whether the events are collections of subatomic particles coming from the decay of a “Higgs Boson” produced in high-energy accelerator collisions. The four video lecture sets (Parts I, II, III, IV) start by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms, and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal, various backgrounds, and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation as to why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random numbers with their generators and seeds lead to a discussion of the Binomial and Poisson distributions, and of Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.
Colab Notebooks for Physics Usecases
For this lecture, we will be using the following Colab Notebooks along with the following lecture materials. They will be referenced in the corresponding sections.
- Notebook A
- Notebook B
- Notebook C
- Notebook D
Looking for Higgs Particle Part I : Bumps in Histograms, Experiments and Accelerators
This unit is devoted to Python and Java experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal and various backgrounds and with various event totals. The lectures use Python but the use of Java is described. Students today can ignore Java!
Slides {20 slides}
Looking for Higgs Particle and Counting Introduction 1
We return to the particle case with slides used in the introduction and stress that particles often manifest as bumps in histograms, and that those bumps need to be large enough to stand out from the background in a statistically significant fashion.
Video:
{slides 1-5}
Looking for Higgs Particle II Counting Introduction 2
We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion
Video:
{slides 6-8}
Experimental Facilities
We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.
Video:
{slides 9-14}
Accelerator Picture Gallery of Big Science
This lesson gives a small picture gallery of accelerators: accelerators, detection chambers, and magnets in tunnels, plus a large underground laboratory used for experiments that need shielding from backgrounds such as cosmic rays.
{slides 14-20}
Resources
http://www.sciencedirect.com/science/article/pii/S037026931200857X
http://www.nature.com/news/specials/lhc/interactive.html
Looking for Higgs Particles Part II: Python Event Counting for Signal and Background
This unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal, various backgrounds, and various event totals.
Slides {slides 1-29}
Class Software
We discuss Python on both a backend server (FutureGrid - now closed!) and a local client. We point out a useful book on Python for data analysis.
{slides 1-10}
Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 1: The background
Event Counting
We define event counting for data-collection environments. We discuss the Python and Java code that generates events according to a particular scenario (the important idea of Monte Carlo data): here, a sloping background plus either a Higgs particle generated similarly to the LHC observations or one observed with better resolution (smaller measurement error).
{slides 11-14}
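A minimal sketch of this kind of Monte Carlo event generation (our own illustration with invented numbers, not the class notebook): a sloping (linear) background plus a Gaussian “Higgs” bump near 126 GeV, histogrammed over a mass range.

```python
# Monte Carlo events: sloping background plus a Gaussian signal near 126 GeV.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1234)

def sloping_background(n, lo=110.0, hi=140.0):
    """Sample n masses from a density proportional to (hi - m) by accept-reject."""
    events = []
    while len(events) < n:
        m = rng.uniform(lo, hi)
        if rng.uniform(0, 1) < (hi - m) / (hi - lo):  # accept with prob. ~ f(m)
            events.append(m)
    return np.array(events)

background = sloping_background(20000)
signal = rng.normal(loc=126.0, scale=2.0, size=300)  # Gaussian "Higgs" bump

plt.hist(np.concatenate([background, signal]), bins=15, range=(110, 140))
plt.xlabel("Mass (GeV)")
plt.ylabel("Events per bin")
plt.show()
```

Rerunning with more signal events or a smaller `scale` (better resolution) makes the bump stand out more clearly above the background, which is exactly the effect studied in this unit.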
Examples of Event Counting I with Python Examples of Signal and Background
This uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and changing the measurement resolution for the Higgs.
{slides 15-23}
Refer to A: Studying Higgs Boson Analysis. Signal and Background, Parts 1, 2, 3, 4, 6, 7
Examples of Event Counting II: Change shape of background and number of Higgs Particles produced in experiment
This lesson continues the examination of Monte Carlo data, looking at the effect of changes in the number of Higgs particles produced and in the shape of the background.
{slides 25-29}
Refer to A: Studying Higgs Boson Analysis. Signal and Background, Parts 5-6
Refer to B: Studying Higgs Boson Analysis. Signal and Background
Resources
Python for Data Analysis: Agile Tools for Real-World Data By Wes McKinney, Publisher: O’Reilly Media, Released: October 2012, Pages: 472.
https://en.wikipedia.org/wiki/DataMelt
Looking for Higgs Part III: Random variables, Physics and Normal Distributions
We introduce random variables and some simple principles of statistics and explain why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is not discussed in this unit.
Slides {slides 1-39}
Statistics Overview and Fundamental Idea: Random Variables
We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.
{slides 1-6}
Physics and Random Variables
We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what’s going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.
Part 1
{slides 6-9}
Part 2
{slides 10-12}
Statistics of Events with Normal Distributions
We introduce the Poisson and Binomial distributions and define independent, identically distributed (IID) random variables. We give the law of large numbers, defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.
{slides 13-19}
Refer to C: Gaussian Distributions and Counting Experiments, Part 1
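A quick numerical illustration of the counting errors (our own sketch, not Notebook C itself): a bin with expected count N fluctuates by about sqrt(N), so the relative error shrinks as 1/sqrt(N) as more data is taken.

```python
# Poisson counting fluctuations: std ~ sqrt(N), relative error ~ 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
for expected in [10, 100, 10000]:
    # Simulate many repeated counting experiments with the same expected count.
    counts = rng.poisson(lam=expected, size=100000)
    print(f"N={expected:6d}  std={counts.std():8.2f}  "
          f"sqrt(N)={np.sqrt(expected):8.2f}  relative={counts.std()/expected:.4f}")
```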
Gaussian Distributions
We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.
{slides 21-32}
Refer to C: Gaussian Distributions and Counting Experiments, Part 2
Using Statistics
We discuss the significance of a standard deviation and the role of biases and insufficient statistics, with a Python example of getting incorrect answers.
{slides 33-39}
Refer to C: Gaussian Distributions and Counting Experiments, Part 3
Resources
http://indico.cern.ch/event/20453/session/6/contribution/15?materialId=slides
http://www.atlas.ch/photos/events.html (this link is outdated)
https://cms.cern/
Looking for Higgs Part IV: Random Numbers, Distributions and Central Limit Theorem
We discuss random numbers with their generators and seeds. We introduce the Binomial and Poisson distributions. Monte Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes' law conclude the discussion. Python and Java (for students; not reviewed in class) examples and physics applications are given.
Slides {slides 1-44}
Generators and Seeds
We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed that determines where the generation sequence starts.
Part 1
{slides 5-6}
Part 2
{slides 7-13}
Refer to D: Random Numbers, Part 1
Refer to C: Gaussian Distributions and Counting Experiments, Part 4
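A minimal illustration of seeds (a generic sketch, not Notebook D): the same seed reproduces exactly the same “random” sequence, while a different seed gives a different stream.

```python
# The seed fixes the starting state of a pseudo-random generator,
# so the same seed reproduces the same "random" numbers exactly.
import numpy as np

a = np.random.default_rng(seed=42).standard_normal(3)
b = np.random.default_rng(seed=42).standard_normal(3)  # same seed, same state
c = np.random.default_rng(seed=7).standard_normal(3)   # different seed

print(np.array_equal(a, b))  # True: identical sequences
print(np.array_equal(a, c))  # False: a different stream
```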
Binomial Distribution
We define the binomial distribution and give LHC data as an example of where this distribution is valid.
{slides 14-22}
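As a hedged illustration (invented numbers, not actual LHC values): if each of n independent collisions has a small probability p of yielding a given final state, the observed count is Binomial(n, p), with mean np and standard deviation sqrt(np(1-p)).

```python
# Binomial counting: n independent trials, each succeeding with probability p.
import numpy as np

rng = np.random.default_rng(2020)
n, p = 1_000_000, 3e-4                   # illustrative numbers only
counts = rng.binomial(n, p, size=10000)  # many simulated experiments

print(counts.mean(), n * p)                    # mean is close to n*p
print(counts.std(), (n * p * (1 - p)) ** 0.5)  # std is close to sqrt(n*p*(1-p))
```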
Accept-Reject Methods for generating Random (Monte-Carlo) Events
We introduce an advanced method, accept/reject, for generating random variables with arbitrary distributions.
{slides 23-27}
Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 1
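A minimal sketch of the accept-reject idea (our own example with a simple triangular density, not the class notebook): draw x uniformly over the range and keep it with probability proportional to f(x).

```python
# Accept-reject: to sample from a density f(x) bounded by M on [a, b],
# draw x uniform in [a, b] and u uniform in [0, M]; keep x whenever u < f(x).
import numpy as np

rng = np.random.default_rng(1)

def accept_reject(f, a, b, M, n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(a, b)
        if rng.uniform(0, M) < f(x):
            samples.append(x)
    return np.array(samples)

# Example: triangular density f(x) = 2x on [0, 1], bounded by M = 2.
xs = accept_reject(lambda x: 2 * x, 0.0, 1.0, 2.0, 10000)
print(xs.mean())  # close to 2/3, the mean of the density f(x) = 2x on [0, 1]
```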
Monte Carlo Method
We define the Monte Carlo method, which typically uses the accept/reject method to sample events from the relevant distribution.
{slides 27-28}
Poisson Distribution
We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.
{slides 30-33}
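One reason the Poisson distribution matters for counting experiments is that it is the large-n, small-p limit of the Binomial; a quick numerical check of this (our own sketch):

```python
# Poisson(lambda = n*p) approximates Binomial(n, p) for large n and small p.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100000, 5e-5                        # illustrative values; lambda = n*p = 5
binom = rng.binomial(n, p, size=200000)
poisson = rng.poisson(lam=n * p, size=200000)

for k in range(10):
    # Empirical probabilities of observing exactly k events are nearly equal.
    print(k, (binom == k).mean(), (poisson == k).mean())
```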
Central Limit Theorem
We introduce the Central Limit Theorem and give examples from Wikipedia.
{slides 35-37}
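A standard numerical demonstration of the Central Limit Theorem (a generic sketch, not the class notebook): sums of many IID uniform variables look Gaussian, even though a single uniform variable looks nothing like one.

```python
# Central Limit Theorem: sums of many IID uniforms become bell-shaped.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
sums = rng.uniform(0, 1, size=(100000, 50)).sum(axis=1)  # sum 50 uniforms each time

plt.hist(sums, bins=60, density=True)  # approximately Gaussian, centered at 25
plt.xlabel("Sum of 50 uniform(0,1) variables")
plt.show()
```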
Interpretation of Probability: Bayes v. Frequency
This lesson describes the difference between the Bayesian and frequentist views of probability. Bayes's law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.
{slides 38-44}
Refer to C: Gaussian Distributions and Counting Experiments, Part 5
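A toy numerical version of this accumulation of evidence (invented numbers, purely illustrative): Bayes's law lets each independent channel multiply the prior odds for the signal hypothesis by its likelihood ratio.

```python
# Bayes's law P(S|D) = P(D|S) P(S) / P(D), applied channel by channel:
# each independent channel multiplies the odds by its likelihood ratio.
prior_signal = 0.5                    # prior probability of the signal hypothesis
likelihood_ratios = [2.0, 1.5, 3.0]   # P(data|signal)/P(data|background), invented

odds = prior_signal / (1 - prior_signal)
for lr in likelihood_ratios:
    odds *= lr                        # Bayes update from one channel

posterior = odds / (1 + odds)
print(posterior)  # 0.9: the combined channels strengthen the signal hypothesis
```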
Homework 3 (Posted on Canvas)
The use case analysis that you were asked to watch in week 3 described over 50 use cases based on templates filled in by users. This homework has two choices.
- Consider the existing SKA Template here. Use this plus web resources such as this to write a 3-page description of the science goals, current progress, and big data challenges of the SKA.
OR
- Here is a blank use case Template (make your own copy). Choose any Big Data use case (including those in the videos, but address the 2020 rather than the 2013 situation) and fill in ONE new use case template, producing about 2 pages of new material (summed over the answers to the questions).
Homework 4 (Posted on Canvas)
Consider Physics Colab A “Part 4 Error Estimates” and Colab D “Part 2 Varying Seed (randomly) gives distinct results”.
Consider a Higgs of measured width 0.5 GeV (narrowGauss and narrowTotal in Part A) and use the analysis of A Part 4 to estimate the difference between signal and background compared to the expected error (standard deviation).
Run 3 different random number choices (as in D Part 2) to show how the conclusions change.
We recommend changing the bin size in A Part 4 from 2 GeV to 1 GeV, in which case the Higgs signal will be in two bins that you can add (equivalently, use 2 GeV histogram bins, shifting the origin so that the Higgs mass of 126 GeV is in the center of a bin).
Suppose you keep the background unchanged and reduce the Higgs signal by a factor of 2 (from 300 to 150 events). Can the Higgs still be detected?
5 - Introduction to AI in Health and Medicine
Overview
This module discusses AI and the digital transformation for the Health and Medicine area, with a special emphasis on COVID-19 issues. We cover both the impact of COVID and some of the many activities that are addressing it. Parts B and C have an extensive general discussion of AI in Health and Medicine.
The complete presentation is available at Google Slides, while the videos are a YouTube playlist.
Part A: Introduction
This lesson describes some overarching issues, including the
- Summary in terms of Hype Cycles
- Players in the digital health ecosystem, and in particular the role of Big Tech, which has the needed AI expertise and infrastructure, from clouds to smart watches/phones
- Views of Patients and Doctors on New Technology
- Role of clouds. This is essentially assumed throughout the presentation but not stressed.
- Importance of Security
- Introduction to the Internet of Medical Things; this area is discussed in more detail later in the presentation
Part B: Diagnostics
This highlights some diagnostic applications of AI and the digital transformation. Part C also has some diagnostic coverage, especially of particular applications.
- General use of AI in Diagnostics
- Early progress in diagnostic imaging, including Radiology and Ophthalmology
- AI In Clinical Decision Support
- Digital Therapeutics is a recognized and growing activity area
Part C: Examples
This lesson covers a broad range of AI uses in Health and Medicine
- Flagging issues requiring urgent attention, and more generally AI for Precision Medicine
- Oncology and cancer, which have made early progress as they exploit AI for images; avoiding mistakes and diagnosing curable cervical cancer in developing countries with less screening
- Predicting Gestational Diabetes
- Cardiovascular diagnostics and AI to interpret and guide ultrasound measurements
- Robot Nurses and robots to comfort patients
- AI to guide cosmetic surgery measuring beauty
- AI in the analysis of DNA in blood tests
- AI For Stroke detection (large vessel occlusion)
- AI monitoring of breathing to flag opioid-induced respiratory depression.
- AI to relieve administration burden including voice to text for Doctor’s notes
- AI in consumer genomics
- Areas that are slow, including genomics, Consumer Robotics, Augmented/Virtual Reality, and Blockchain
- AI analysis of information resources flags problems earlier
- Internet of Medical Things applications from watches to toothbrushes
Part D: Impact of Covid-19
This covers some aspects of the impact of the COVID-19 pandemic starting in March 2020.
- The features of the first stimulus bill
- Impact on Digital Health, Banking, Fintech, and Commerce: bricks and mortar, e-commerce, groceries, credit cards, advertising, connectivity, the tech industry, Ride Hailing and Delivery
- Impact on Restaurants, Airlines, Cruise lines, general travel, Food Delivery
- Impact of working from home and videoconferencing
- The economy, and the often positive trends for the Tech industry
Part E: Covid-19 and Recession
This is largely outdated, as it centered on the start of the pandemic-induced recession, and we now know what really happened. Probably the pandemic accelerated the transformation of industry and the use of AI.
Part F: Tackling Covid-19
This discusses some of the AI and digital methods used to understand and reduce the impact of COVID-19.
- Robots for remote patient examination
- Computerized tomography scan + AI to identify COVID-19
- Early activities of Big Tech and COVID
- Other early biotech activities with COVID-19
- Remote-work technology: Hopin, Zoom, Run the World, FreeConferenceCall, Slack, GroWrk, Webex, Lifesize, Google Meet, Teams
- Vaccines
- Wearables and Monitoring, Remote patient monitoring
- Telehealth, Telemedicine and Mobile Health
Part G: Data and Computational Science and Covid-19
This lesson reviews some sophisticated high performance computing (HPC) and Big Data approaches to COVID-19.
- Rosetta volunteer computing to analyze proteins
- COVID-19 High Performance Computing Consortium
- AI based drug discovery by startup Insilico Medicine
- Review of several research projects
- Global Pervasive Computational Epidemiology for COVID-19 studies
- Simulations of Virtual Tissues at Indiana University available on nanoHUB
Part H: Screening Drug Candidates
A major project involving Department of Energy Supercomputers
- General Structure of Drug Discovery
- DeepDriveMD Project using AI combined with molecular dynamics to accelerate discovery of drug properties
Part I: Areas for Covid19 Study and Pandemics as Complex Systems
- Possible Projects in AI for Health and Medicine and especially COVID-19
- Pandemics as a Complex System
- AI and computational Futures for Complex Systems
6 - Mobility (Industry)
Overview
- Industry being transformed by a) Autonomy (AI) and b) Electric power
- Established Organizations can’t change
- General Motors (employees: 225,000 in 2016 down to around 180,000 in 2018) finds it hard to compete with Tesla (42,000 employees)
- GM's market value was half that of Tesla at the start of 2020 but is just 11% of it as of October 2020
- GM purchased Cruise to compete
- Funding and then buying startups is an important “transformation” strategy
- Autonomy needs Sensors, Computers, Algorithms, and Software
- Also experience (training data)
- Algorithms are the main bottleneck; others will automatically improve, although there is lots of interesting work in new sensors, computers, and software
- Over the last 3 years, electrical power has gone from interesting to “bound to happen”; Tesla’s happy customers probably contribute to this
- Batteries and Charging stations needed
Mobility Industry A: Introduction
- Futures of Automobile Industry, Mobility, and Ride-Hailing
- Self-cleaning cars
- Medical Transportation
- Society of Automotive Engineers, Levels 0-5
- Gartner’s conservative View
Mobility Industry B: Self Driving AI
- Image processing and Deep Learning
- Examples of Self Driving cars
- Road construction Industry
- Role of Simulated data
- Role of AI in autonomy
- Fleet cars
- 3 Leaders: Waymo, Cruise, NVIDIA
Mobility Industry C: General Motors View
- Talk by Dave Brooks at GM, “AI for Automotive Engineering”
- Zero crashes, zero emission, zero congestion
- GM moving to electric autonomous vehicles
Mobility Industry D: Self Driving Snippets
- Worries about and data on its Progress
- Tesla’s specialized self-driving chip
- Some tasks that are hard for AI
- Scooters and Bikes
Mobility Industry E: Electrical Power
- Rise in use of electrical power
- Special opportunities in e-Trucks and time scale
- Future of Trucks
- Tesla market value
- Drones and Robot deliveries; role of 5G
- Robots in Logistics
7 - Sports
Sports with Big Data Applications
E534 2020 Big Data Applications and Analytics Sports Informatics Part I. Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, the Olympics, NFL Football, Basketball, Tennis, and Horse Racing.
Part 1
Unit Summary (Part I): This unit discusses baseball, starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports like basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than the many-player phenomena in other sports. We discuss the Performance-Dollar relationship, including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.
Lesson Summaries
Part 1.1 - E534 Sports - Introduction and Sabermetrics (Baseball Informatics) Lesson
Introduction to all Sports Informatics, Moneyball The 2002-2003 Oakland Athletics, Diamond Dollars economic model of baseball, Performance - Dollar relationship, Value of a Win.
{slides 1-15}
Part 1.2 - E534 Sports - Basic Sabermetrics
Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data, OPS, wOBA, ERA, ERC, FIP, UZR.
{slides 16-26}
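For readers unfamiliar with these statistics, here is a small sketch (standard definitions, invented stat line) computing OPS (on-base plus slugging) and ERA from basic counting data:

```python
# Two classic sabermetric measures from basic counting statistics.
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base over plate appearances counted."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

def era(earned_runs, innings_pitched):
    """Earned run average: earned runs per nine innings."""
    return 9 * earned_runs / innings_pitched

# Invented stat line: 160 hits = 100 singles + 35 doubles + 3 triples + 22 HR.
ops = obp(h=160, bb=60, hbp=5, ab=550, sf=5) + slg(100, 35, 3, 22, 550)
print(round(ops, 3), round(era(70, 200.0), 2))  # ~0.848 OPS, 3.15 ERA
```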
Part 1.3 - E534 Sports - Wins Above Replacement
Wins Above Replacement (WAR), Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another Sabermetrics Example, Summary of Sabermetrics.
{slides 17-40}
Part 2
E534 2020 Big Data Applications and Analytics Sports Informatics Part II. Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, the Olympics, NFL Football, Basketball, Tennis, and Horse Racing.
Unit Summary (Part II): This unit discusses ‘advanced sabermetrics’, covering the advances made possible by video from PITCHf/X, FIELDf/X, HITf/X, COMMANDf/X, and MLBAM.
Part 2.1 - E534 Sports - Pitching Clustering
A Big Data pitcher-clustering method introduced by Vince Gennaro. Data from his blog and a video at the 2013 SABR conference.
{slides 1-16}
Part 2.2 - E534 Sports - Pitcher Quality
Results of optimizing matchups. Data from a video at the 2013 SABR conference.
{slides 17-24}
Part 2.3 - E534 Sports - PITCHf/X
Examples of use of PITCHf/X.
{slides 25-30}
Part 2.4 - E534 Sports - Other Video Data Gathering in Baseball
FIELDf/X, MLBAM, HITf/X, COMMANDf/X.
{slides 26-41}
Part 3
E534 2020 Big Data Applications and Analytics Sports Informatics Part III. Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, the Olympics, NFL Football, Basketball, Tennis, and Horse Racing.
Unit Summary (Part III): We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.
Lesson Summaries
Part 3.1 - E534 Sports - Wearables
Consumer Sports, Stake Holders, and Multiple Factors.
{Slides 1-17}
Part 3.2 - E534 Sports - Soccer and the Olympics
Soccer, Tracking Players and Balls, Olympics.
{Slides 17-24}
Part 3.3 - E534 Sports - Spatial Visualization in NFL and NBA
NFL, NBA, and Spatial Visualization.
{Slides 25-38}
Part 3.4 - E534 Sports - Tennis and Horse Racing
Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.
{Slides 39-44}
8 - Space and Energy
Overview
- Energy sources and AI for powering Grids.
- Energy Solution from Bill Gates
- Space and AI
A: Energy
- Distributed Energy Resources as a grid of renewables with a hierarchical set of Local Distribution Areas
- Electric Vehicles in Grid
- Economics of microgrids
- Investment into Clean Energy
- Batteries
- Fusion and Deep Learning for plasma stability
- AI for Power Grid, Virtual Power Plant, Power Consumption Monitoring, Electricity Trading
B: Clean Energy startups from Bill Gates
- 26 Startups in areas like long-duration storage, nuclear energy, carbon capture, batteries, fusion, and hydropower …
- The slide deck gives links to 26 companies from their website and pitchbook which describes their startup status (#employees, funding)
- It summarizes their products
C: Space
- Space supports AI with communications, image data and global navigation
- AI Supports space in AI-controlled remote manufacturing, imaging control, system control, dynamic spectrum use
- Privatization of Space - SpaceX, Investment
- 57,000 satellites through 2029
9 - AI In Banking
Overview
In this lecture, AI in Banking is discussed. Here we focus on the transition of legacy banks towards AI-based banking, real-world examples of AI in Banking, banking systems, and banking as a service.
AI in Banking A: The Transition of legacy Banks
- Types of AI that are used
- Closing of physical branches
- Making the transition
- Growth in Fintech as legacy bank services decline
AI in Banking B: FinTech
- Fintech examples and investment
- Broad areas of finance/banking where Fintech is operating
AI in Banking C: Neobanks
- Types and Examples of neobanks
- Customer uptake by world region
- Neobanking in Small and Medium Business segment
- Neobanking in real estate, mortgages
- South American Examples
AI in Banking D: The System
- The Front, Middle, Back Office
- Front Office: Chatbots
- Robo-advisors
- Middle Office: Fraud, Money laundering
- Fintech
- Payment Gateways (Back Office)
- Banking as a Service
AI in Banking E: Examples
- Credit cards
- The stock trading ecosystem
- Robots counting coins
- AI in Insurance: Chatbots, Customer Support
- Banking itself
- Handwriting recognition
- Detect leaks for insurance
AI in Banking F: As a Service
- Banking Services Stack
- Business Model
- Several Examples
- Metrics compared among examples
- Breadth, Depth, Reputation, Speed to Market, Scalability
10 - Cloud Computing
E534 Cloud Computing Unit
Overall Summary
Video:
Defining Clouds I: Basic definition of cloud and two very simple examples of why virtualization is important
- How clouds are situated wrt HPC and supercomputers
- Why multicore chips are important
- Typical data center
Video:
Defining Clouds II: Service-oriented architectures: Software services as Message-linked computing capabilities
- The different aaS’s: Network, Infrastructure, Platform, Software
- The amazing services that Amazon AWS and Microsoft Azure have
- Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices
- Gartner hypecycle and priority matrix on Infrastructure Strategies
Video:
Defining Clouds III: Cloud Market Share
- How important are they?
- How much money do they make?
Video:
Virtualization: Virtualization Technologies, Hypervisors and the different approaches
- KVM, Xen, Docker and OpenStack
Video:
Cloud Infrastructure I: Comments on trends in the data center and its technologies
- Clouds physically across the world
- Green computing
- Fraction of world’s computing ecosystem in clouds and associated sizes
- An analysis from Cisco of the size of cloud computing
Video:
Cloud Infrastructure II: Gartner hypecycle and priority matrix on Compute Infrastructure
- Containers compared to virtual machines
- The emergence of artificial intelligence as a dominant force
Video:
Cloud Software: HPC-ABDS with over 350 software packages and how to use each of its 21 layers
- Google’s software innovations
- MapReduce in pictures (a toy word-count sketch follows this list)
- Cloud and HPC software stacks compared
- Components need to support cloud/distributed system programming
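To make the MapReduce idea concrete, here is a toy word count in plain Python (an illustrative sketch only; it is not taken from HPC-ABDS or the lecture software):
# Toy MapReduce-style word count
from collections import defaultdict

documents = ["big data on clouds", "clouds and big data"]

# Map: emit (word, 1) pairs from every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'on': 1, 'clouds': 2, 'and': 1}
Real MapReduce frameworks run the map and reduce steps in parallel across many machines; the shuffle is the communication step in between.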
Video:
Cloud Applications I: Clouds in science, where the area is called cyberinfrastructure; the science usage pattern from NIST
- Artificial Intelligence from Gartner
Video:
Cloud Applications II: Characterize Applications using the NIST approach
- Internet of Things
- Different types of MapReduce
Video:
Parallel Computing Analogies: Parallel Computing in pictures
- Some useful analogies and principles
Video:
Real Parallel Computing: Single Program/Instruction Multiple Data (SPMD/SIMD)
- Big Data and Simulations Compared
- What is hard to do?
Video:
Storage: Cloud data approaches
- Repositories, File Systems, Data lakes
Video:
HPC and Clouds: The Branscomb Pyramid
- Supercomputers versus clouds
- Science Computing Environments
Video:
Comparison of Data Analytics with Simulation: Structure of different applications for simulations and Big Data
- Software implications
- Languages
Video:
The Future I: Gartner cloud computing hypecycle and priority matrix 2017 and 2019
- Hyperscale computing
- Serverless and FaaS
- Cloud Native
- Microservices
- Update to 2019 Hypecycle
Video:
Future and Other Issues II: Security
- Blockchain
Video:
Future and Other Issues III: Fault Tolerance
11 - Transportation Systems
Transportation Systems Summary
- The ride-hailing industry highlights the growth of a new “Transportation System” TS
  a. For ride-hailing, TS controls rides, matching drivers and customers; it predicts how to position cars and how to avoid traffic slowdowns
  b. However, TS is much bigger outside ride-hailing as we move into the “connected vehicle” era
  c. TS will probably find autonomous vehicles easier to deal with than human drivers
- Cloud Fog and Edge components
- AI for autonomous vehicles has centered on generalized image processing
- TS also needs AI (and DL) but this is for routing and geospatial time-series; different technologies from those for image processing
Transportation Systems A: Introduction
- “Smart” Insurance
- Fundamentals of Ride-Hailing
Transportation Systems B: Components of a Ride-Hailing System
- Transportation Brain and Services
- Maps, Routing
- Traffic forecasting with deep learning
Transportation Systems C: Different AI Approaches in Ride-Hailing
- View as a Time Series: LSTM and ARIMA (see the sketch after this list)
- View as an image in a 2D earth surface - Convolutional networks
- Use of Graph Neural Nets
- Use of Convolutional Recurrent Neural Nets
- Spatio-temporal modeling
- Comparison of data with predictions
- Reinforcement Learning
- Formulation of General Geospatial Time-Series Problem
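To make the time-series view concrete, here is a minimal Keras sketch of demand forecasting (the synthetic data, window length, and layer sizes are all illustrative assumptions, not any ride-hailing system's actual model):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Synthetic demand series, e.g. ride requests per 5-minute interval
series = np.sin(np.linspace(0, 60, 2000)) + np.random.normal(0, 0.1, 2000)

window = 12  # predict the next interval from the previous 12
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape(-1, window, 1)  # LSTM expects (samples, timesteps, features)

model = Sequential()
model.add(LSTM(32, input_shape=(window, 1)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

# Forecast the next interval from the most recent window
print(model.predict(series[-window:].reshape(1, window, 1)))
ARIMA would model the same series with autoregressive and moving-average terms instead of a learned recurrent state.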
12 - Commerce
Overview
AI in Commerce A: The Old Way of Doing Things
- AI in Commerce
- AI-First Engineering, Deep Learning
- E-commerce and the transformation of “Bricks and Mortar”
AI in Commerce B: AI in Retail
- Personalization
- Search
- Image Processing to Speed up Shopping
- Walmart
AI in Commerce C: The Revolution that is Amazon
- Retail Revolution
- Saves Time and Effort and adds Novelty with Modernized Retail
- Looking ahead at the evolution of Retail
AI in Commerce D: DLMalls e-commerce
- Amazon sellers
- Rise of Shopify
- Selling Products on Amazon
AI in Commerce E: Recommender Engines, Digital media
- Spotify recommender engines
- Collaborative Filtering (a toy example follows this list)
- Audio Modelling
- DNN for Recommender engines
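To illustrate collaborative filtering, here is a toy user-based example (the ratings matrix is made up; this is a sketch of the general technique, not Spotify's production method):
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated/played"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0  # recommend items for user 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])

# Predict unseen items as a similarity-weighted average of other users' ratings
for item in np.where(ratings[target] == 0)[0]:
    others = [u for u in range(len(ratings)) if u != target]
    num = sum(sims[u] * ratings[u, item] for u in others)
    den = sum(sims[u] for u in others if ratings[u, item] > 0) + 1e-9
    print(f"item {item}: predicted score {num / den:.2f}")
The DNN approaches mentioned above replace the hand-coded similarity with learned embeddings for users and items.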
13 - Python Warm Up
Python Exercise on Google Colab
View in Github | Download Notebook
In this exercise, we will take a look at some basic Python Concepts needed for day-to-day coding.
Check the installed Python version.
! python --version
Python 3.7.6
Simple For Loop
for i in range(10):
print(i)
0
1
2
3
4
5
6
7
8
9
List
list_items = ['a', 'b', 'c', 'd', 'e']
Retrieving an Element
list_items[2]
'c'
Append New Values
list_items.append('f')
list_items
['a', 'b', 'c', 'd', 'e', 'f']
Remove an Element
list_items.remove('a')
list_items
['b', 'c', 'd', 'e', 'f']
Dictionary
dictionary_items = {'a':1, 'b': 2, 'c': 3}
Retrieving an Item by Key
dictionary_items['b']
2
Append New Item with Key
dictionary_items['c'] = 4
dictionary_items
{'a': 1, 'b': 2, 'c': 4}
Delete an Item with Key
del dictionary_items['a']
dictionary_items
{'b': 2, 'c': 4}
Comparators
x = 10
y = 20
z = 30
x > y
False
x < z
True
z == x
False
if x < z:
print("This is True")
This is True
if x > z:
print("This is True")
else:
print("This is False")
This is False
Arithmetic
k = x * y * z
k
6000
j = x + y + z
j
60
m = x - y
m
-10
n = x / z
n
0.3333333333333333
Numpy
Create a Random Numpy Array
import numpy as np
a = np.random.rand(100)
a.shape
(100,)
Reshape Numpy Array
b = a.reshape(10,10)
b.shape
(10, 10)
Manipulate Array Elements
c = b * 10
c[0]
array([3.33575458, 7.39029235, 5.54086921, 9.88592471, 4.9246252 ,
1.76107178, 3.5817523 , 3.74828708, 3.57490794, 6.55752319])
c = np.mean(b, axis=1)
c.shape
(10,)
print(c)
[0.60673061 0.4223565 0.42687517 0.6260857 0.60814217 0.66445627
0.54888432 0.68262262 0.42523459 0.61504903]
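For intuition on the axis argument, here is a small added example (not part of the original notebook output above):
d = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(np.mean(d, axis=0))  # column means: [1.5 2.5 3.5]
print(np.mean(d, axis=1))  # row means: [1. 4.]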
14 - MNIST Classification on Google Colab
View in Github | Download Notebook
In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures of handwritten digits (0-9).
Import Libraries
Note: https://python-future.org/quickstart.html
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist
Warm Up Exercise
Pre-process data
Load data
First we load the data from the built-in mnist dataset in Keras. Here we have to split the dataset into training and testing data. Both the training data and the testing data have two components: features and labels. Every sample in the dataset has a corresponding label. In MNIST each training sample contains image data represented as an array, and the training labels are from 0-9.
Here x_train holds the training features and y_train the training labels. The same goes for the testing data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
Identify Number of Classes
As this is a digit classification problem, we need to know how many classes there are, so we count the number of unique labels.
num_labels = len(np.unique(y_train))
Convert Labels To One-Hot Vector
Read more on one-hot vector.
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
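For intuition, here is a small added example of what to_categorical produces (not part of the original notebook):
from keras.utils import to_categorical

print(to_categorical([0, 2, 1]))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]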
Image Reshaping
The training model is designed to consume the data as a vector, so this reshaping is model dependent. Here we assume the image is square; for MNIST, image_size is 28 and input_size is therefore 28 × 28 = 784.
image_size = x_train.shape[1]
input_size = image_size * image_size
Resize and Normalize
The next step is to reshape each image into a vector and normalize the data. Pixel values range from 0 to 255, so an easy way to normalize is to divide by the maximum value.
x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255
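As a quick sanity check (an added snippet; the shapes are the standard MNIST training and test sizes):
print(x_train.shape)  # (60000, 784)
print(x_test.shape)   # (10000, 784)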
Create a Keras Model
Keras is a neural network library. The summary function provides a tabular summary of the model you created, and the plot_model function draws a graph of the network you created.
# Create Model
# network parameters
batch_size = 128
hidden_units = 512
model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))  # fully connected hidden layer (no activation applied here)
model.add(Dense(num_labels))  # one output unit per digit class
model.add(Activation('softmax'))  # convert the outputs to class probabilities
model.summary()
plot_model(model, to_file='mlp-mnist.png', show_shapes=True)
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 512) 401920
_________________________________________________________________
dense_6 (Dense) (None, 10) 5130
_________________________________________________________________
activation_5 (Activation) (None, 10) 0
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
Compile and Train
A Keras model needs to be compiled before it can be trained. In the compile function, you specify the optimizer you want, the metrics you expect, and the loss function you need to use.
Here we use the Adam optimizer, a popular optimizer for neural networks.
The loss function we use is categorical_crossentropy.
Once the model is compiled, the fit function is called, passing the training data, the number of epochs, and the batch size.
The batch size determines the number of samples used per minibatch in optimizing the loss. With 60,000 training images and a batch size of 128, one epoch takes ceil(60000/128) = 469 steps, which matches the progress bar below.
Note: Change the number of epochs and the batch size and see what happens.
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
469/469 [==============================] - 3s 7ms/step - loss: 0.3647 - accuracy: 0.8947
<tensorflow.python.keras.callbacks.History at 0x7fe88faf4c50>
Testing
Now we can test the trained model. Use the evaluate function, passing the test data and batch size; it returns the loss and the accuracy.
Exercise MNIST_V1.0: Try to observe the network behavior by changing the number of epochs and the batch size, and record the best accuracy you can obtain. Here you can record what happens when you change these values. Describe your observations in 50-100 words.
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
79/79 [==============================] - 0s 4ms/step - loss: 0.2984 - accuracy: 0.9148
Test accuracy: 91.5%
Final Note
This program can be regarded as the hello-world program of deep learning. The objective of this exercise is not to teach you the depths of deep learning, but to teach you the basic concepts needed to design a simple network that solves a problem. Before running the whole code, read all the instructions that precede each code section.
Homework
Solve Exercise MNIST_V1.0.