Modules

A list of modules that have been used in various classes and tutorials

This page has been recently added and will contain new modules that can be integrated into courses.

List

1 - Contributors

We provide a list of content contributors to material distributed as part of this Web Site.

A partial list of contributors.

List of contributors

HID Lastname Firstname
fa18-423-02 Liuwie Kelvin
fa18-423-03 Tamhankar Omkar
fa18-423-05 Hu Yixing
fa18-423-06 Mick Chandler
fa18-423-07 Gillum Michael
fa18-423-08 Zhao Yuli
fa18-516-01 Angelier Mario
fa18-516-02 Barshikar Vineet
fa18-516-03 Branam Jonathan
fa18-516-04 Demeulenaere David
fa18-516-06 Filliman Paul
fa18-516-08 Joshi Varun
fa18-516-10 Li Rui
fa18-516-11 Cheruvu Murali
fa18-516-12 Luo Yu
fa18-516-14 Manipon Gerald
fa18-516-17 Pope Brad
fa18-516-18 Rastogi Richa
fa18-516-19 Rutledge De’Angelo
fa18-516-21 Shanishchara Mihir
fa18-516-22 Sims Ian
fa18-516-23 Sriramulu Anand
fa18-516-24 Withana Sachith
fa18-516-25 Wu Chun Sheng
fa18-516-26 Andalibi Vafa
fa18-516-29 Singh Shilpa
fa18-516-30 Kamau Alexander
fa18-516-31 Spell Jordan
fa18-523-52 Heine Anna
fa18-523-53 Kakarala Chaitanya
fa18-523-56 Hinders Daniel
fa18-523-57 Rajendran Divya
fa18-523-58 Duvvuri Venkata Pramod Kumar
fa18-523-59 Bhutka Jatinkumar
fa18-523-60 Fetko Izolda
fa18-523-61 Stockwell Jay
fa18-523-62 Bahl Manek
fa18-523-63 Miller Mark
fa18-523-64 Tupe Nishad
fa18-523-65 Patil Prajakta
fa18-523-66 Sanjay Ritu
fa18-523-67 Sridhar Sahithya
fa18-523-68 AKKAS Selahattin
fa18-523-69 Rai Sohan
fa18-523-70 Dash Sushmita
fa18-523-71 Kota Uma Bhargavi
fa18-523-72 Bhoyar Vishal
fa18-523-73 Tong Wang
fa18-523-74 Ma Yeyi
fa18-523-79 Rapelli Abhishek
fa18-523-80 Beall Evan
fa18-523-81 Putti Harika
fa18-523-82 Madineni Pavan Kumar
fa18-523-83 Tran Nhi
fa18-523-84 Hilgenkamp Adam
fa18-523-85 Li Bo
fa18-523-86 Liu Jeff
fa18-523-88 Leite John
fa19-516-140 Abdelgader Mohamed
fa19-516-141 (Bala) Balakrishna Katuru
fa19-516-142 Martel Tran
fa19-516-143 Sanders Sheri
fa19-516-144 Holland Andrew
fa19-516-145 Kumar Anurag
fa19-516-146 Jones Kenneth
fa19-516-147 Upadhyay Harsha
fa19-516-148 Raizada Sub
fa19-516-149 Modi Hely
fa19-516-150 Kowshi Akshay
fa19-516-151 Liu Qiwei
fa19-516-152 Pagadala Pratibha Madharapakkam
fa19-516-153 Mirjankar Anish
fa19-516-154 Shah Aneri
fa19-516-155 Pimparkar Ketan
fa19-516-156 Nagarajan Manikandan
fa19-516-157 Wang Chenxu
fa19-516-158 Dayanand Daivik
fa19-516-159 Zebrowski Austin
fa19-516-160 Jain Shreyans
fa19-516-161 Nelson Jim
fa19-516-162 Katukota Shivani
fa19-516-163 Hoerr John
fa19-516-164 Mirjankar Siddhesh
fa19-516-165 Wang Zhi
fa19-516-166 Funk Brian
fa19-516-167 Screen William
fa19-516-168 Deopura Deepak
fa19-516-169 Pandit Harshawardhan
fa19-516-170 Wan Yanting
fa19-516-171 Kandimalla Jagadeesh
fa19-516-172 Shaik Nayeemullah Baig
fa19-516-173 Yadav Brijesh
fa19-516-174 Ancha Sahithi
fa19-523-180 Grant Jonathon
fa19-523-181 Falkenstein Max
fa19-523-182 Siddiqui Zak
fa19-523-183 Creech Brent
fa19-523-184 Floreak Michael
fa19-523-186 Park Soowon
fa19-523-187 Fang Chris
fa19-523-188 Katukota Shivani
fa19-523-189 Wang Huizhou
fa19-523-190 Konger Skyler
fa19-523-191 Tao Yiyu
fa19-523-192 Kim Jihoon
fa19-523-193 Sung Lin-Fei
fa19-523-194 Minton Ashley
fa19-523-195 Gan Kang Jie
fa19-523-196 Zhang Xinzhuo
fa19-523-198 Matthys Dominic
fa19-523-199 Gupta Lakshya
fa19-523-200 Chaudhari Naimesh
fa19-523-201 Bohlander Ross
fa19-523-202 Liu Limeng
fa19-523-203 Yoo Jisang
fa19-523-204 Dingman Andrew
fa19-523-205 Palani Senthil
fa19-523-206 Arivukadal Lenin
fa19-523-207 Chadderwala Nihir
fa19-523-208 Natarajan Saravanan
fa19-523-209 Kirgiz Asya
fa19-523-210 Han Matthew
fa19-523-211 Chiang Yu-Hsi
fa19-523-212 Clemons Josiah
fa19-523-213 Hu Die
fa19-523-214 Liu Yihan
fa19-523-215 Farris Chris
fa19-523-216 Kasem Jamal
hid-sp18-201 Ali Sohile
hid-sp18-202 Cantor Gabrielle
hid-sp18-203 Clarke Jack
hid-sp18-204 Gruenberg Maxwell
hid-sp18-205 Krzesniak Jonathan
hid-sp18-206 Mhatre Krish Hemant
hid-sp18-207 Phillips Eli
hid-sp18-208 Fanbo Sun
hid-sp18-209 Tugman Anthony
hid-sp18-210 Whelan Aidan
hid-sp18-401 Arra Goutham
hid-sp18-402 Athaley Sushant
hid-sp18-403 Axthelm Alexander
hid-sp18-404 Carmickle Rick
hid-sp18-405 Chen Min
hid-sp18-406 Dasegowda Ramyashree
hid-sp18-407 Keith Hickman
hid-sp18-408 Joshi Manoj
hid-sp18-409 Kadupitige Kadupitiya
hid-sp18-410 Kamatgi Karan
hid-sp18-411 Kaveripakam Venkatesh Aditya
hid-sp18-412 Kotabagi Karan
hid-sp18-413 Lavania Anubhav
hid-sp18-414 Joao Leite
hid-sp18-415 Mudvari Janaki
hid-sp18-416 Sabra Ossen
hid-sp18-417 Ray Rashmi
hid-sp18-418 Surya Sekar
hid-sp18-419 Sobolik Bertholt
hid-sp18-420 Swarnima Sowani
hid-sp18-421 Vijjigiri Priyadarshini
hid-sp18-501 Agunbiade Tolu
hid-sp18-502 Alshi Ankita
hid-sp18-503 Arnav Arnav
hid-sp18-504 Arshad Moeen
hid-sp18-505 Cate Averill
hid-sp18-506 Esteban Orly
hid-sp18-507 Giuliani Stephen
hid-sp18-508 Guo Yue
hid-sp18-509 Irey Ryan
hid-sp18-510 Kaul Naveen
hid-sp18-511 Khandelwal Sandeep Kumar
hid-sp18-512 Kikaya Felix
hid-sp18-513 Kugan Uma
hid-sp18-514 Lambadi Ravinder
hid-sp18-515 Lin Qingyun
hid-sp18-516 Pathan Shagufta
hid-sp18-517 Pitkar Harshad
hid-sp18-518 Robinson Michael
hid-sp18-519 Saurabh Shukla
hid-sp18-520 Sinha Arijit
hid-sp18-521 Steinbruegge Scott
hid-sp18-522 Swaroop Saurabh
hid-sp18-523 Tandon Ritesh
hid-sp18-524 Tian Hao
hid-sp18-525 Walker Bruce
hid-sp18-526 Whitson Timothy
hid-sp18-601 Ferrari Juliano
hid-sp18-602 Naredla Keerthi
hid-sp18-701 Unni Sunanda Unni
hid-sp18-702 Dubey Lokesh
hid-sp18-703 Rufael Ribka
hid-sp18-704 Meier Zachary
hid-sp18-705 Thompson Timothy
hid-sp18-706 Sylla Hady
hid-sp18-707 Smith Michael
hid-sp18-708 Wright Darren
hid-sp18-709 Castro Andres
hid-sp18-710 Kugan Uma M
hid-sp18-711 Kagita Mani
sp19-222-100 Saxberg Jarod
sp19-222-101 Bower Eric
sp19-222-102 Danehy Ryan
sp19-222-89 Fischer Brandon
sp19-222-90 Japundza Ethan
sp19-222-91 Zhang Tyler
sp19-222-92 Yeagley Ben
sp19-222-93 Schwantes Brian
sp19-222-94 Gotts Andrew
sp19-222-96 Olson Mercedes
sp19-222-97 Levy Zach
sp19-222-98 McDowell Xandria
sp19-222-99 Badillo Jesus
sp19-516-121 Bahramian Hamidreza
sp19-516-122 Duer Anthony
sp19-516-123 Challa Mallik
sp19-516-124 Garbe Andrew
sp19-516-125 Fine Keli
sp19-516-126 Peters David
sp19-516-127 Collins Eric
sp19-516-128 Rawat Tarun
sp19-516-129 Ludwig Robert
sp19-516-130 Rachepalli Jeevan Reddy
sp19-516-131 Huang Jing
sp19-516-132 Gupta Himanshu
sp19-516-133 Mannarswamy Aravind
sp19-516-134 Sivan Manjunath
sp19-516-135 Yue Xiao
sp19-516-136 Eggleton Joaquin Avila
sp19-516-138 Samanvitha Pradhan
sp19-516-139 Pullakhandam Srimannarayana
sp19-616-111 Vangalapat Tharak
sp19-616-112 Joshi Shirish
sp20-516-220 Goodman Josh
sp20-516-222 McCandless Peter
sp20-516-223 Dharmchand Rahul
sp20-516-224 Mishra Divyanshu
sp20-516-227 Gu Xin
sp20-516-229 Shaw Prateek
sp20-516-230 Thornton Ashley
sp20-516-231 Kegerreis Brian
sp20-516-232 Singam Ashok
sp20-516-233 Zhang Holly
sp20-516-234 Goldfarb Andrew
sp20-516-235 Ibadi Yasir Al
sp20-516-236 Achath Seema
sp20-516-237 Beckford Jonathan
sp20-516-238 Mishra Ishan
sp20-516-239 Lam Sara
sp20-516-240 Nicasio Falconi
sp20-516-241 Jaswal Nitesh
sp20-516-243 Drummond David
sp20-516-245 Baker Joshua
sp20-516-246 Fischer Rhonda
sp20-516-247 Gupta Akshay
sp20-516-248 Bookland Hannah
sp20-516-250 Palani Senthil
sp20-516-251 Jiang Shihui
sp20-516-252 Zhu Jessica
sp20-516-253 Arivukadal Lenin
sp20-516-254 Kagita Mani
sp20-516-255 Porwal Prafull

2 - List

Links and locations of the available modules in markdown

This page contains the list of current modules.

Legend

  • h - header missing the #
  • m - too many #’s in titles
Title Warn Cybertraining
chapters read
     Assignments read
     devops read
         Puppet read
         devop-ci read
         DevOps with Azure Monitor read
         DevOps - Continuous Improvement read
         Terraform read
         Travis read
         Infrastructure as Code (IaC) read
         DevOps read
         DevOps with AWS read
     SECTION read
          Setting up the OS on Multiple Pi’s read
         Quick Start read
         Syllabus read
         Course Overview read
         Bigdata Technologies and Algorithms read
          Github Book Issues read
         Traditional Cluster Technologies read
         Course Details read
          Setting up the OS on a Pi read
         Incoming read
         Cloud Clusters read
     intro read
         Big Data read
     iaas read
         openstack read
             OpenStackSDK read
             OpenStack read
         azure read
             Serverless Infrastructure as a Services read
             Azure read
             Microsoft Azure and Cloud Products read
             Microsoft Azure read
         Introduction read
         aws read
             Amazon Web Service Products read
             Amazon Web Services read
             AWS Products read
         watson read
             IBM Watson read
         futuresystems read
             FutureSystems read
         AWS Boto read
     msg read
         Exercises read
         Python Apache Avro read
         Amazon Kinesis Data Streams read
         MQTT read
     Lab’s read
     prg read
         Language read
         go read
             Editors Supporting Go read
             Introduction to Go for Cloud Computing read
             Go CMD read
             Open API read
             Exercises read
             Go REST read
         python read
             facedetection read
                 NIST Pedestrian and Face Detection read
             Language read
             DocOpts read
             fingerprint read
                 Fingerprint Matching read
             Pyenv in a docker container read
             cmd Module read
             dask read
                 Dask read
             Subprocess read
             Editors read
             Word Count with Parallel Python read
             numpy read
                 NumPy read
             opencv read
                 Secchi Disk read
                 OpenCV read
             Introduction to Python read
             Advanced Topics read
             Interactive Python read
             random-forest read
                 Dask - Random Forest Feature Detection read
             scipy read
                 Scipy read
             pandas read
         Github REST Services read
         Python read
     art read
         Sentient Architecture read
     preface read
         Corrections read
         ePub Readers read
         Exercises read
         Deprecated read
         Creating the ePubs from source read
         Contributors read
         Contributing read
         Class Git read
         Preface read
         Notation read
         Updates read
         Emojis read
     DBase read
     in read
         AWS DocumentDB read
         Assignments read
         Artificial Intelligence Service with REST read
         Box read
         Datasets read
         Visualization read
         Software Projects read
         Amazon Aurora DB read
     Romeo read
     faas read
         Fission read
         OpenFaaS read
         Fn read
         Exercises read
         FaaS read
         Apache OpenWhisk read
         Microsoft Azure Function read
         IronFunction read
         Riff read
     FAQ: General read
     books read
         Free Books read
     To Do read
     nist read
     iot read
         Hardware for IoT Projects read
         Introduction read
         Projects read
         ESP8266 read
         Sensors read
         GrovePi Modules read
         Dexter read
         Raspberry PI 3 read
     FAQ: 516 read
     container read
         Docker Clusters read
         Docker and Docker Swarm on FutureSystems read
         Resources read
         Introduction to Containers read
         CNCF read
         Apache Spark with Docker read
         Exercises read
         Hadoop with Docker read
         Using Kubernetes on FutureSystems read
         Docker Flask REST Service read
         Introduction to Docker read
         Docker Swarm read
         Docker Compose read
         Bookmanager in Container read
         Introduction to Kubernetes read
         Docker Hub read
     pi read
         Run Commands at Boot time read
         Pi Cluster Form Factor read
         car read
             Raspberry Pi Robot Car with Face Recognition and Identification read
          Message Passing Interface Cluster read
         Cluster Setup read
         Exercise read
         Pi Software Collection read
         VNC read
          Fortran read
         Raspberry PI read
         About read
         Setup of a Development Environment read
         kubernetes read
             526 read
                 head read
                     Head Node Setup read
                     !/bin/sh read
             417 read
     References read
     Assignments read
     h read
     ai read
         Artificial Intelligence Service with REST read
     new read
         Screenshot Rename Automator read
         Distributed Message Queues read
     Modifying the Pi Image without a PI read
     FAQ: 423/523 and others colocated with them read
     Overview read
     doc read
         Markdown read
         Emacs read
         Report Format read
         Report Format read
         Writing a Scientific Article or Conference Paper read
         Writing a Scientific Article read
         Recording Audio with Autoplay read
         Graphviz read
         General Remarks about Communication read
         Scientific Writing read
         Communicating Research in Other Ways read
         Overview read
         Projects read
     case read
         IU 100 Node Cluster Case read
     os read
         Ubuntu on an USB stick read
         Ubuntu Resources read
         Ubuntu Setup read
     Markdown Lint read
     windows read
         Windows Download read
     data read
         Data Formats read
         Mongoengine read
         AWS RedShift read
         MongoDB in Python read
     issues read
         Github Issues read
     mapreduce read
         Spark read
         Spark Streaming read
         Hadoop SDSC read
         Twister2 read
         AWS Elastic Map Reduce (AWS EMR) read
         Amazon Elastic Map Reduce (EMR) read
         Hadoop Virtual Cluster Installation Using Cloudmesh read
         Introduction to Mapreduce read
         User Defined Functions in Spark read
         Apache HBase read
         Hadoop Distributed File System (Hadoop HDFS) read
     rest read
         OpenAPI REST Services with Swagger read
         HATEOAS read
         OpenAPI REST Service via Introspection read
         Django REST Framework read
         REST Specifications read
         OpenAPI 2.0 Specification read
         OpenAPI REST Service via Codegen read
         Extensions to Eve read
         Exercises read
         OpenAPI REST Services with Swagger read
         Rest Services with Eve read
         OpenAPI 3.0 REST Service via Introspection read
         REST AI services Example read
         Introduction to REST read
     git read
         Github read
     bigdata read
         Python read
         assignments read
             Assignment 8 read
             Assignment 3 read
             Assignment 7 read
             Assignment 6 read
             Assignment 2 read
             Assignment 5 read
             Assignment 1 read
             Assignments read
             Assignment 4 read
         Physics with Big Data Applications read
         Part I Motivation I read
         Part III Cloud read
         Sports with Big Data Applications read
         github read
             Track Progress with Github read
         Introduction to Deep Learning read
         Introduction to Deep Learning Part III: Deep Learning Algorithms and Usage read
         Part II Motivation Archive read
         Introduction to Deep Learning Part II: Applications read
         Introduction to the Course read
         Introduction to Deep Learning Part I read
     h read
     deprecated read
         Assignments read
         Internet of Things read
         How to Run VMs (IaaS) read
         How to Run Iterative MapReduce (PaaS) read
         How to Run MapReduce (PaaS) read
         How to Build a Search Engine (SaaS) read

3 - Autogenerating Analytics Rest Services

In this section, we will deploy a Pipeline Anova SVM API on an openapi service using cloudmesh-openapi

1. Overview

1.1 Prerequisite

It is assumed that the user has installed, and is familiar with, the following:

  • python3 --version >= 3.8
  • Linux Command line

1.2 Effort

  • 15 minutes (not including assignment)

1.3 List of Topics Covered

In this module, we focus on the following:

  • Training ML models with stateless requests
  • Generating RESTful APIs using cms openapi for existing python code
  • Deploying openapi definitions onto a localserver
  • Interacting with newly created openapi services

1.4 Syntax of this Tutorial

We describe the syntax for terminal commands used in this tutorial using the following example:

(TESTENV) ~ $ echo "hello"

Here, we are in the python virtual environment (TESTENV) in the home directory ~. The $ symbol denotes the beginning of the terminal command (i.e., echo "hello"). When copying and pasting commands, do not include the $ or anything before it.

2. Creating a virtual environment

It is best practice to create virtual environments when you do not envision needing a python package consistently. We also want to place all source code in a common directory called cm. Let us create one for this tutorial.

On your Linux/Mac, open a new terminal.

~ $ python3 -m venv ~/ENV3

The above will create a new python virtual environment. Activate it with the following.

~ $ source ~/ENV3/bin/activate

First, we verify that our python and pip are the ones from the virtual environment, and we update pip:

(ENV3) ~ $ which python
/Users/user/ENV3/bin/python

(ENV3) ~ $ which pip
/Users/user/ENV3/bin/pip

(ENV3) ~ $ pip install -U pip

Now we can use cloudmesh-installer to install the code in developer mode. This gives you access to the source code.

First, create a new directory for the cloudmesh code.

(ENV3) ~ $ mkdir ~/cm
(ENV3) ~ $ cd ~/cm

Next, we install cloudmesh-installer and use it to install cloudmesh openapi.

(ENV3) ~/cm $ pip install -U pip
(ENV3) ~/cm $ pip install cloudmesh-installer
(ENV3) ~/cm $ cloudmesh-installer get openapi

Finally, for this tutorial, we use sklearn. Install the needed packages as follows:

(ENV3) ~/cm $ pip install sklearn pandas

3. The Python Code

Let’s take a look at the python code we would like to make a REST service from. First, let’s navigate to the local openapi repository that was installed with cloudmesh-installer.

(ENV3) ~/cm $ cd cloudmesh-openapi

(ENV3) ~/cm/cloudmesh-openapi $ pwd
/Users/user/cm/cloudmesh-openapi

Let us take a look at the PipelineAnova SVM example code.

A Pipeline is a sequence of transformations followed by a final estimator. Analysis of variance (ANOVA) is used for feature selection, and a support vector machine (SVM) is used as the actual learning model on the selected features.

Use your favorite editor to look at it (whether it be vscode, vim, nano, etc.). We will use emacs:

(ENV3) ~/cm/cloudmesh-openapi $ emacs ./tests/Scikitlearn-experimental/sklearn_svm.py

The class within this file has two main methods to interact with (aside from the file upload capability, which is added at runtime):

@classmethod
def train(cls, filename: str) -> str:
    """
    Given the filename of an uploaded file, train a PipelineAnovaSVM
    model from the data. Assumption of data is the classifications 
    are in the last column of the data.

    Returns the classification report of the test split
    """
    # some code...

@classmethod
def make_prediction(cls, model_name: str, params: str):
    """
    Make a prediction based on training configuration
    """
    # some code...

Note the parameters that each of these methods takes in. These parameters are expected as part of the stateless request for each method.
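
To make the Pipeline/ANOVA/SVM idea concrete outside of the service context, here is a minimal, self-contained sketch written directly against scikit-learn. It is for illustration only and is not the cloudmesh example code itself; the choice of k=3 selected features and a linear kernel are arbitrary assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Load the iris data set and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The ANOVA F-test selects the most informative features (k=3 is an arbitrary choice),
# and an SVM is trained on the selected features.
model = Pipeline([
    ("anova", SelectKBest(f_classif, k=3)),
    ("svm", SVC(kernel="linear")),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))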

4. Generating the OpenAPI YAML file

Let us now use the python code from above to create the openapi YAML file that we will deploy onto our server. To correctly generate this file, use the following command:

(ENV3) ~/cm/cloudmesh-openapi $ cms openapi generate PipelineAnovaSVM \
    --filename=./tests/Scikitlearn-experimental/sklearn_svm.py \
    --import_class \
    --enable_upload

Let us digest the options we have specified:

  • --filename indicates the path to the python file in which our code is located
  • --import_class notifies cms openapi that the YAML file is generated from a class. The name of this class is specified as PipelineAnovaSVM
  • --enable_upload allows the user to upload files to be stored on the server for reference. This flag causes cms openapi to auto-generate a new python file with the upload method appended to the end of the file. For this example, you will notice a new file has been added in the same directory as sklearn_svm.py. The file is aptly called: sklearn_svm_upload-enabled.py

5. The OpenAPI YAML File (optional)

If the generation step in Section 4 completed correctly, cms will have generated the corresponding openapi YAML file. Let us take a look at it.

(ENV3) ~/cm/cloudmesh-openapi $ emacs ./tests/Scikitlearn-experimental/sklearn_svm.yaml

This YAML file has a lot of information to digest. The basic structure is documented here. However, it is not necessary to understand this information to deploy RESTful APIs.

Nevertheless, take a look at paths: on line 9 of this file. Under this section, several different endpoints for our API are listed. Notice the correspondence between the endpoints and the python file we generated from.

6. Starting the Server

Using the YAML file generated in Section 4, we can now start the server.

(ENV3) ~/cm/cloudmesh-openapi $ cms openapi server start ./tests/Scikitlearn-experimental/sklearn_svm.yaml

The server should now be active. Navigate to http://localhost:8080/cloudmesh/ui.

7. Interacting With the Endpoints

7.1 Uploading the Dataset

We now have a nice user interface to interact with our newly generated API. Let us upload the data set. We are going to use the iris data set in this example. We have provided it for you to use. Simply navigate to the /upload endpoint by clicking on it, then click Try it out.

We can now upload the file. Click on Choose File and upload the data set located at ~/cm/cloudmesh-openapi/tests/Scikitlearn-experimental/iris.data. Simply hit Execute after the file is uploaded. We should then get a 200 return code (telling us that everything went ok).

7.2 Training on the Dataset

The server now has our dataset. Let us now navigate to the /train endpoint by, again, clicking on it. Similarly, click Try it out. The parameter being asked for is the filename. The filename we are interested in is iris.data. Then click Execute. We should get another 200 return code with a Classification Report in the Response Body.

7.3 Making Predictions

We now have a trained model on the iris data set. Let us now use it to make predictions. The model expects 4 attribute values: sepal length, sepal width, petal length, and petal width. Let us use the values 5.1, 3.5, 1.4, 0.2 as our attributes. The expected classification is Iris-setosa.

Navigate to the /make_prediction endpoint as we have with other endpoints. Again, let us Try it out. We need to provide the name of the model and the params (attribute values). For the model name, our model is aptly called iris (based on the name of the data set).

As expected, we have a classification of Iris-setosa.
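
The same three steps can also be scripted with the Python requests library instead of the browser UI. The sketch below is a starting point under stated assumptions: it presumes the endpoints live under http://localhost:8080/cloudmesh with the /upload, /train, and /make_prediction paths shown in the UI, and it guesses the HTTP methods, the upload field name, and the parameter names. Check the generated sklearn_svm.yaml file for the exact paths, methods, and parameters before relying on it.

import requests

BASE = "http://localhost:8080/cloudmesh"  # assumed base path; verify in the YAML file

# 1. Upload the iris data set (multipart file upload; the field name is an assumption).
with open("tests/Scikitlearn-experimental/iris.data", "rb") as f:
    r = requests.post(f"{BASE}/upload", files={"upload": f})
print(r.status_code)

# 2. Train on the uploaded file; the response body should contain the classification report.
r = requests.get(f"{BASE}/train", params={"filename": "iris.data"})
print(r.status_code, r.text)

# 3. Request a prediction for one flower (expected class: Iris-setosa).
r = requests.get(f"{BASE}/make_prediction",
                 params={"model_name": "iris", "params": "5.1, 3.5, 1.4, 0.2"})
print(r.status_code, r.text)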

8. Clean Up (optional)

At this point, we have created and trained a model using cms openapi. After satisfactory use, we can shut down the server. Let us check what we have running.

(ENV3) ~/cm/cloudmesh-openapi $ cms openapi server ps
openapi server ps

INFO: Running Cloudmesh OpenAPI Servers

+-------------+-------+--------------------------------------------------+
| name        | pid   | spec                                             |
+-------------+-------+--------------------------------------------------+
| sklearn_svm | 94428 | ./tests/Scikitlearn-                             |
|             |       | experimental/sklearn_svm.yaml                    |
+-------------+-------+--------------------------------------------------+

We can stop the server with the following command:

(ENV3) ~/cm/cloudmesh-openapi $ cms openapi server stop sklearn_svm

We can verify the server is shut down by running the ps command again.

(ENV3) ~/cm/cloudmesh-openapi $ cms openapi server ps
openapi server ps

INFO: Running Cloudmesh OpenAPI Servers

None

9. Uninstallation (Optional)

After running this tutorial, you may uninstall all cloudmesh-related things as follows:

First, deactivate the virtual environment.

(ENV3) ~/cm/cloudmesh-openapi $ deactivate

~/cm/cloudmesh-openapi $ cd ~

Then, we remove the ~/cm directory.

~ $ rm -r -f ~/cm

We also remove the cloudmesh hidden files:

~ $ rm -r -f ~/.cloudmesh

Lastly, we delete our virtual environment.

~ $ rm -r -f ~/ENV3

Cloudmesh is now successfully uninstalled.

10. Assignments

Many ML models follow the same basic process for training and testing:

  1. Upload Training Data
  2. Train the model
  3. Test the model

Using the PipelineAnovaSVM code as a template, write python code for a new model and deploy it as a RESTful API as we have done above. Train and test your model using the provided iris data set. There are plenty of examples that can be referenced here.

11. References

4 - DevOps

We present here a collection of information and tools related to DevOps.

4.1 - DevOps - Continuous Improvement

Introduction to DevOps and Continuous Integration

Deploying enterprise applications has always been challenging. Without consistent and reliable processes and practices, it would be impossible to track and measure the deployment artifacts: which code files and configuration data have been deployed to which servers, and what level of unit and integration testing has been done among the various components of the enterprise applications. Deploying software to the cloud is much more complex, given that DevOps teams do not have extensive access to the infrastructure and are forced to follow the guidelines and tools provided by the cloud companies. In recent years, Continuous Integration (CI) and Continuous Deployment (CD) have become the DevOps mantra for delivering software reliably and consistently.

While the CI/CD process is difficult enough, monitoring the deployed applications is emerging as a new challenge, especially on infrastructure that is largely virtual, with VMs in combination with containers. Continuous Monitoring (CM) is a somewhat newer concept that has been gaining rapid popularity and is becoming an integral part of the overall DevOps functionality. Depending on where the software has been deployed, continuous monitoring can be as simple as watching the behavior of the applications, or as complex as end-to-end visibility across the infrastructure, with heart-beat and health checks of the deployed applications along with dynamic scalability based on their usage. To address this challenge, building a robust monitoring pipeline is a necessity. Continuous Monitoring is much easier to get right if it is considered as early as possible and baked into the software during development; we can then provide much better tracking and analyze metrics much closer to the application needs. Cloud companies, aware of this necessity, provide various DevOps tools to make CI/CD and continuous monitoring as easy as possible. While some of these tools and aspects are provided by the cloud offerings, others must be planned and built into our software.

At a high level, we can think of a simple pipeline to achieve a consistent and scalable deployment process, the CI/CD and Continuous Monitoring Pipeline:

  • Step 1 - Continuous Development - Plan, Code, Build and Test:

    Planning, coding, and building the deployable artifacts (code, configuration, database, etc.) and letting them go through the various types of tests along all dimensions, technical to business and internal to external, as automated as possible. All these aspects come under Continuous Development.

  • Step 2 - Continuous Improvement - Deploy, Operate and Monitor:

    Once deployed to production, this covers how the applications are operated: bug and health checks, performance and scalability, and monitoring of the infrastructure itself, including cold-start delays caused by on-demand VM/container instantiation that the cloud offerings perform as part of dynamic scaling, depending on the selected hosting options. Making the necessary adjustments to improve the overall experience is essentially what is called Continuous Improvement.

4.2 - Infrastructure as Code (IaC)

Infrastructure as Code is the ability of code to generate, maintain and destroy application infrastructure like server, storage and networking, without requiring manual changes.

Learning Objectives

  • Introduction to IaC
  • How IaC is related to DevOps
  • How IaC differs from Configuration Management Tools, and how is it related
  • Listing of IaC Tools
  • Further Reading

Introduction to IaC

IaC (Infrastructure as Code) is the ability of code to generate, maintain, and destroy application infrastructure such as servers, storage, and networking, without requiring manual changes. The state of the infrastructure is maintained in files.

Cloud architectures and containers have forced the use of IaC, as there are simply too many elements to manage at each layer. It is impractical to keep up with the traditional method of raising tickets and having someone do it for you. Scaling demands, elasticity during odd hours, and usage-based billing all require provisioning, managing, and destroying infrastructure much more dynamically.

From the book “Amazon Web Services in Action” by Wittig and Wittig [1], using a script or a declarative description has the following advantages:

  • Consistent usage
  • Dependencies are handled
  • Replicable
  • Customizable
  • Testable
  • Can figure out updated state
  • Minimizes human failure
  • Documentation for your infrastructure

Sometimes IaC tools are also called orchestration tools, but that label is less accurate and often misleading.

DevOps has the following key practices:

  • Automated Infrastructure
  • Automated Configuration Management, including Security
  • Shared version control between Dev and Ops
  • Continuous Build - Integrate - Test - Deploy
  • Continuous Monitoring and Observability

The first practice, Automated Infrastructure, can be fulfilled by IaC tools. Having the code for IaC and Configuration Management in the same code repository as the application code ensures adherence to the practice of shared version control.

Typically, the workflow of the DevOps team includes running Configuration Management tool scripts after running IaC tools, for configurations, security, connectivity, and initializations.

There are four broad categories of such tools [2]:

  • Ad hoc scripts: Any shell, Python, Perl, Lua scripts that are written
  • Configuration management tools: Chef, Puppet, Ansible, SaltStack
  • Server templating tools: Docker, Packer, Vagrant
  • Server provisioning tools: Terraform, Heat, CloudFormation, Cloud Deployment Manager, Azure Resource Manager

Configuration Management tools make use of scripts to achieve a state. IaC tools maintain state and metadata created in the past.

However, the big difference is that the state achieved by running procedural code or scripts may differ from the state when it was first created, because:

  • The ordering of the scripts determines the state; if the order changes, the state will differ. Also, issues like the waiting time required for resources to be created, modified, or destroyed have to be dealt with correctly.
  • Version changes in procedural code are inevitable and will lead to a different state.

Chef and Ansible are more procedural, while Terraform, CloudFormation, SaltStack, Puppet and Heat are more declarative.

IaC and other declarative tools do suffer from some inflexibility, since their declarative languages are less expressive than general-purpose scripting languages.

Listing of IaC Tools

IaC tools that are cloud specific include:

  • Amazon AWS - AWS CloudFormation
  • Google Cloud - Cloud Deployment Manager
  • Microsoft Azure - Azure Resource Manager
  • OpenStack - Heat

Terraform is not a cloud-specific tool; it is multi-vendor. It has good support for all major clouds; however, Terraform scripts are not portable across clouds.

Advantages of IaC

IaC solves the problem of environment drift, which used to lead to the infamous “but it works on my machine” kind of errors that are difficult to trace.

IaC guarantees idempotence, a known and predictable end state, irrespective of the starting state. Idempotency is achieved either by automatically configuring an existing target or by discarding the existing target and recreating a fresh environment.
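
As a toy illustration of the difference (in plain Python, not tied to any particular IaC tool), compare an “ensure” style function, which converges to the same end state no matter how often it runs, with a procedural append whose end state depends on how many times it has been executed:

import os

def ensure_directory(path: str) -> None:
    # Idempotent: one run or a hundred runs leave the system in the same state.
    os.makedirs(path, exist_ok=True)

def append_marker(path: str) -> None:
    # Not idempotent: every run changes the file, so the end state depends on
    # how many times the script was executed.
    with open(path, "a") as f:
        f.write("configured\n")

ensure_directory("/tmp/demo-infra")
append_marker("/tmp/demo-infra/state.log")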

Further Reading

Please see books and resources like “Terraform: Up and Running” [2] for more real-world advice on IaC, structuring Terraform code, and good deployment practices.

A good resource for IaC is the book “Infrastructure as Code” [3].

References

[1] A. Wittig and M. Wittig, Amazon Web Services in Action, 1st ed. Manning Press, 2015.

[2] Y. Brikman, Terraform: Up and Running, 1st ed. O’Reilly Media Inc, 2017.

[3] K. Morris, Infrastructure as Code, 1st ed. O’Reilly Media Inc, 2015.

4.3 - Ansible

Ansible is an open-source IT automation DevOps engine allowing you to manage and configure many compute resources in a scalable, consistent and reliable way.

Introduction to Ansible

Ansible is an open-source IT automation DevOps engine allowing you to manage and configure many compute resources in a scalable, consistent and reliable way.

Ansible automates the following tasks:

  • Provisioning: It sets up the servers that you will use as part of your infrastructure.

  • Configuration management: You can change the configuration of an application, OS, or device. You can implement security policies and other configuration tasks.

  • Service management: You can start and stop services and install updates.

  • Application deployment: You can conduct application deployments in an automated fashion that integrate with your DevOps strategies.

Prerequisite

We assume you

  • can install Ubuntu 18.04 virtual machine on VirtualBox

  • can install software packages via ‘apt-get’ tool in Ubuntu virtual host

  • have already reserved a virtual cluster (with at least one virtual machine in it) on some cloud, or can use VMs installed in VirtualBox instead.

  • have SSH credentials and can login to your virtual machines.

Setting up a playbook

Let us develop a sample from scratch, based on the paradigms that ansible supports. We are going to use Ansible to install Apache server on our virtual machines.

First, we install ansible on our machine and make sure we have an up to date OS:

$ sudo apt-get update
$ sudo apt-get install ansible

Next, we prepare a working environment for our Ansible example:

$ mkdir ansible-apache
$ cd ansible-apache

To use ansible we need a local configuration. When you execute Ansible within this folder, this local configuration file always overrides a system-level Ansible configuration. It is in general beneficial to keep custom configurations local unless you absolutely believe they should be applied system wide. Create a file ansible.cfg in this folder and add the following:

[defaults]
hostfile = hosts.txt

This local configuration file tells Ansible that the target machine names are given in a file named hosts.txt. Next we specify the hosts in that file.

You should have SSH login access to all VMs listed in this file, as stated in our prerequisites. Now create and edit the file hosts.txt with the following content:

[apache]
<server_ip> ansible_ssh_user=<server_username>

The name apache in the brackets defines a server group name. We will use this name to refer to all server items in this group. As we intend to install and run apache on the server, the name choice seems quite appropriate. Fill in the IP addresses of the virtual machines you launched in VirtualBox and fire up these VMs in VirtualBox.

To deploy the service, we need to create a playbook. A playbook tells Ansible what to do; it uses YAML syntax. Create and edit a file with a proper name, e.g., apache.yml, as follows:

---
- hosts: apache # apache is the group name we just defined
  become: yes # this operation needs privilege access
  tasks:
    - name: install apache2 # text description
      apt: name=apache2 update_cache=yes state=latest

This block defines the target VMs and the operations (tasks) that need to be applied. We are using the apt module to indicate the software packages that need to be installed. Depending on the distribution of the operating system, Ansible will find the correct package installer without your intervention. Thus an Ansible playbook could also work for multiple different OSes.

Ansible relies on various kinds of modules to fulfil tasks on the remote servers. These modules are developed for particular tasks and take related arguments. For instance, when we use the apt module, we need to tell it which package we intend to install; that is why we provide a value for the name= argument. The name attribute of the task itself is just a description that will be printed when the task is executed.

Run the playbook

In the same folder, execute

ansible-playbook apache.yml --ask-sudo-pass

After a successful run, open a browser and fill in your server IP. You should see an ‘It works!’ Apache2 Ubuntu default page. Make sure the security policy on your cloud opens port 80 to let the HTTP traffic go through.

Ansible playbooks can have more complex and fancier structures and syntax. Go explore! This example is based on:

We offer a more advanced Ansible example in the next section.

Ansible Roles

Next we install the R package onto our cloud VMs. R is a useful statistical programming language commonly used in many scientific and statistical computing projects, maybe also the one you chose for this class. With this example we illustrate the concept of Ansible roles, install source code from GitHub, and make use of variables. These are key features you will find useful in your project deployments.

We proceed in a top-down fashion in this example. We start from a playbook that is already good to go. You can execute this playbook (do not do it yet; always read the entire section first) to get R installed on your remote hosts. We then extend this concise playbook by introducing functionality that does the same tasks in different ways. Although these different ways are not strictly necessary, they help you grasp the power of Ansible and ease your life when they are needed in your real projects.

Let us now create the following playbook with the name example.yml:

---
- hosts: R_hosts
  become: yes
  tasks:
    - name: install the R package
      apt: name=r-base update_cache=yes state=latest

The host group R_hosts is defined in the file hosts.txt, which our local Ansible configuration file (ansible.cfg) points to:

[R_hosts]
<cloud_server_ip> ansible_ssh_user=<cloud_server_username>

Certainly, this should get the installation job done, but we are going to extend it next using a feature called roles.

Roles are an important concept used often in large Ansible projects. You divide a series of tasks into different groups, where each group corresponds to a certain role within the project.

For example, if your project is to deploy a web site, you may need to install the back-end database, the web server that responds to HTTP requests, and the web application itself. These are three different roles, each carrying out its own installation and configuration tasks.

Even though we only need to install the R package in this example, we can still do it by defining a role ‘r’. Let us modify our example.yml to be:

---
- hosts: R_hosts

  roles:
    - r

Now we create a directory structure in your top project directory as follows

$ mkdir -p roles/r/tasks
$ touch roles/r/tasks/main.yml

Next, we edit the main.yml file and include the following content:

---
- name: install the R package
  apt: name=r-base update_cache=yes state=latest
  become: yes

You probably already get the point. We take the tasks section out of the earlier example.yml and re-organize it into roles. Each role specified in example.yml should have its own directory under roles/, and the tasks to be done by this role are listed in its tasks/main.yml file, as shown previously.

Using Variables

We demonstrate this feature by installing source code from GitHub. Although R can be installed through the OS package manager (apt-get, etc.), the software used in your projects may not be. Many research projects are only available via Git instead. Here we are going to show you how to install packages from their Git repositories. Instead of directly executing the apt module, we pretend Ubuntu does not provide this package and that you have to find it on Git. The source code of R can be found at https://github.com/wch/r-source.git. We are going to clone it to a remote VM’s hard drive, build the package, and install the binary there.

To do so, we need a few new Ansible modules. You may remember from the last example that Ansible modules help us do different tasks based on the arguments we pass to them. It will come as no surprise that Ansible has a git module to take care of git-related work, and a command module to run shell commands. Let us modify roles/r/tasks/main.yml to be:

---
- name: get R package source
  git:
    repo: https://github.com/wch/r-source.git
    dest: /tmp/R

- name: build and install R
  become: yes
  command: chdir=/tmp/R "{{ item }}"
  with_items:
    - ./configure
    - make
    - make install

The role r will now carry out two tasks: one clones the R source code into /tmp/R, and the other uses a series of shell commands to build and install the package.

Note that the commands executed by the second task may not be available on a fresh VM image. But the point of this example is to show an alternative way to install packages, so we conveniently assume the conditions are all met.

To achieve this we are using variables in a separate file.

We have typed several string constants in our Ansible scripts so far. In general, it is good practice to give these values names and use them by referring to those names. This way, your complex Ansible project can be less error prone. Create a file in the same directory and name it vars.yml:

---
repository: https://github.com/wch/r-source.git
tmp: /tmp/R

Accordingly, we will update our example.yml:

---
- hosts: R_hosts
  vars_files:
    - vars.yml
  roles:
    - r

As shown, we specify vars_files to tell the script that the file vars.yml supplies the variable values, which are referenced by double curly brackets as in roles/r/tasks/main.yml:

---
- name: get R package source
  git:
    repo: "{{ repository }}"
    dest: "{{ tmp }}"

- name: build and install R
  become: yes
  command: chdir="{{ tmp }}" "{{ item }}"
  with_items:
    - ./configure
    - make
    - make install

Now, just edit the hosts.txt file with your target VMs' IP addresses and execute the playbook.

You should be able to extend the Ansible playbook for your needs. Configuration tools like Ansible are important components to master the cloud environment.

Ansible Galaxy

Ansible Galaxy is a marketplace where developers can share Ansible roles to complete their system administration tasks. Roles exchanged in the Ansible Galaxy community need to follow common conventions so that all participants know what to expect. We illustrate the details in this section.

It is good to follow the Ansible Galaxy standard during your development as much as possible.

Ansible Galaxy helloworld

Let us start with the simplest case: we will build an Ansible Galaxy project that installs the Emacs software package on your localhost as the target host. It is a helloworld project only meant to get us familiar with Ansible Galaxy project structures.

First you need to create a directory. Let us call it mongodb:

$ mkdir mongodb

Go ahead and create the files README.md, playbook.yml, and inventory, and a subdirectory roles/. playbook.yml is your project playbook. It should perform the Emacs installation task by executing the corresponding role you will develop in the folder roles/. The only difference is that we will construct the role with the help of ansible-galaxy this time.

Now, let ansible-galaxy initialize the directory structure for you:

$ cd roles
$ ansible-galaxy init <to-be-created-role-name>

The naming convention is to concatenate your name and the role name with a dot. The figure below shows what the resulting directory structure looks like.

Figure: Ansible Galaxy role directory structure

Let us fill in the information for our project. There are several main.yml files in different folders, and we illustrate their usage.

defaults and vars:

These folders hold variable key-value pairs for your playbook scripts. We will leave them empty in this example.

files:

This folder is for files that need to be copied to the target hosts. Data files or configuration files can be placed here if needed. We will leave it empty too.

templates:

Similar in purpose to files/, the templates folder is allocated for template files. Keep it empty for a simple Emacs installation.

handlers:

This is reserved for managing services running on the target hosts, for example to restart a service under certain circumstances.

tasks:

This folder holds the actual script for all tasks. You can use the Emacs installation task here:

---
- name: install Emacs on Ubuntu 16.04
  become: yes
  package: name=emacs state=present

meta:

Provide necessary metadata for our Ansible Galaxy project for shipping:

---
galaxy_info:
  author: <your name>
  description: emacs installation on Ubuntu 16.04
  license:
    - MIT
  min_ansible_version: 2.0
  platforms:
    - name: Ubuntu
      versions:
        - xenial
  galaxy_tags:
    - development

dependencies: []

Next, let us test it out. You have your Ansible Galaxy role ready now. To test it as a user, go to your project directory and edit the other two files, inventory.txt and playbook.yml, which were already generated for you in the tests directory by the script:

$ ansible-playbook -i ./hosts playbook.yml

After running this playbook, you should have Emacs installed on localhost.

A Complete Ansible Galaxy Project

We are going to use ansible-galaxy to set up a sample project. This sample project will:

  • use a cloud cluster with multiple VMs
  • deploy Apache Spark on this cluster
  • install a particular HPC application
  • prepare raw data for this cluster to process
  • run the experiment and collect results

Ansible: Write a Playbook for MongoDB

Ansible playbooks are automated scripts written in the YAML data format. Instead of using manual commands to set up multiple remote machines, you can utilize Ansible playbooks to configure entire systems. YAML syntax is easy to read and expresses the data structures of certain Ansible functions. You simply write some tasks, for example installing software, configuring default settings, and starting the software, in an Ansible playbook. With the few examples in this section, you will understand how it works and how to write your own playbooks.

There are also several examples of using Ansible playbooks on the official site. They cover everything from basic usage of Ansible playbooks to advanced usage such as applying patches and updates with different roles and groups.

We are going to write a basic Ansible playbook. Keep in mind that Ansible is the main program, while a playbook is a template that you would like to apply; you may have several playbooks for your Ansible setup.

First playbook for MongoDB Installation

As a first example, we are going to write a playbook which installs MongoDB server. It includes the following tasks:

  • Import the public key used by the package management system
  • Create a list file for MongoDB
  • Reload local package database
  • Install the MongoDB packages
  • Start MongoDB

The material presented here is based on the manual installation of MongoDB from the official site.

We also assume that we install MongoDB on Ubuntu 15.10.

Enabling Root SSH Access

Some setups of managed nodes may not allow you to log in as root. As this may be problematic later, let us create a playbook to resolve this. Create an enable-root-access.yaml file with the following contents:

---
- hosts: ansible-test
  remote_user: ubuntu
  tasks:
    - name: Enable root login
      shell: sudo cp ~/.ssh/authorized_keys /root/.ssh/

Explanation:

  • hosts specifies the name of a group of machines in the inventory

  • remote_user specifies the username on the managed nodes to log in as

  • tasks is a list of tasks to accomplish having a name (a description) and modules to execute. In this case we use the shell module.

We can run this playbook like so:

$ ansible-playbook -i inventory.txt -c ssh enable-root-access.yaml

PLAY [ansible-test] ***********************************************************

GATHERING FACTS ***************************************************************
ok: [10.23.2.105]
ok: [10.23.2.104]

TASK: [Enable root login] *****************************************************
changed: [10.23.2.104]
changed: [10.23.2.105]

PLAY RECAP ********************************************************************
10.23.2.104                : ok=2    changed=1    unreachable=0    failed=0
10.23.2.105                : ok=2    changed=1    unreachable=0    failed=0

Hosts and Users

The first step is choosing the hosts on which to install MongoDB and a user account to run the commands (tasks). We start with the following lines in an example file named mongodb.yaml:

---
- hosts: ansible-test
  remote_user: root
  become: yes

In a previous section, we set up two machines under the ansible-test group name. We use these two machines for the MongoDB installation. Also, we use the root account to complete the Ansible tasks.

Indentation is important in the YAML format. Do not ignore the spaces at the start of each line.

Tasks

A list of tasks contains commands or configurations to be executed on remote machines in sequential order. Each task comes with a name and a module to run your command or configuration. You provide a description of your task in the name section and choose a module for your task. There are several modules that you can use; for example, the shell module simply executes a command without considering its return value. You may use the apt or yum module, which are packaging modules, to install software. You can find the entire list of modules here: http://docs.ansible.com/list_of_all_modules.html

Module apt_key: add repository keys

We need to import the MongoDB public GPG key. This is going to be the first task in our playbook:

tasks:
  - name: Import the public key used by the package management system
    apt_key: keyserver=hkp://keyserver.ubuntu.com:80 id=7F0CEB10 state=present

Module apt_repository: add repositories

Next add the MongoDB repository to apt:

- name: Add MongoDB repository
  apt_repository: repo='deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' state=present

Module apt: install packages

We use the apt module to install the mongodb-org package. A notify action is added to start mongod after the completion of this task. Use the update_cache=yes option to reload the local package database:

- name: install mongodb
  apt: pkg=mongodb-org state=latest update_cache=yes
  notify:
  - start mongodb

Module service: manage services

We use handlers here to start or restart services. They are similar to tasks but run only once, when notified:

handlers:
  - name: start mongodb
    service: name=mongod state=started

The Full Playbook

Our first playbook looks like this:

---
- hosts: ansible-test
  remote_user: root
  become: yes
  tasks:
  - name: Import the public key used by the package management system
    apt_key: keyserver=hkp://keyserver.ubuntu.com:80 id=7F0CEB10 state=present
  - name: Add MongoDB repository
    apt_repository: repo='deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' state=present
  - name: install mongodb
    apt: pkg=mongodb-org state=latest update_cache=yes
    notify:
    - start mongodb
  handlers:
    - name: start mongodb
      service: name=mongod state=started

Running a Playbook

We use ansible-playbook command to run our playbook:

$ ansible-playbook -i inventory.txt -c ssh mongodb.yaml

PLAY [ansible-test] ***********************************************************

GATHERING FACTS ***************************************************************
ok: [10.23.2.104]
ok: [10.23.2.105]

TASK: [Import the public key used by the package management system] ***********
changed: [10.23.2.104]
changed: [10.23.2.105]

TASK: [Add MongoDB repository] ************************************************
changed: [10.23.2.104]
changed: [10.23.2.105]

TASK: [install mongodb] *******************************************************
changed: [10.23.2.104]
changed: [10.23.2.105]

NOTIFIED: [start mongodb] *****************************************************
ok: [10.23.2.105]
ok: [10.23.2.104]

PLAY RECAP ********************************************************************
10.23.2.104                : ok=5    changed=3    unreachable=0    failed=0
10.23.2.105                : ok=5    changed=3    unreachable=0    failed=0

If you rerun the playbook, you should see that nothing changed:

$ ansible-playbook -i inventory.txt -c ssh mongodb.yaml

PLAY [ansible-test] ***********************************************************

GATHERING FACTS ***************************************************************
ok: [10.23.2.105]
ok: [10.23.2.104]

TASK: [Import the public key used by the package management system] ***********
ok: [10.23.2.104]
ok: [10.23.2.105]

TASK: [Add MongoDB repository] ************************************************
ok: [10.23.2.104]
ok: [10.23.2.105]

TASK: [install mongodb] *******************************************************
ok: [10.23.2.105]
ok: [10.23.2.104]

PLAY RECAP ********************************************************************
10.23.2.104                : ok=4    changed=0    unreachable=0    failed=0
10.23.2.105                : ok=4    changed=0    unreachable=0    failed=0

Sanity Check: Test MongoDB

Let us try to run mongo to enter the MongoDB shell:

$ ssh ubuntu@$IP
$ mongo
MongoDB shell version: 2.6.9
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
>
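
You can also check the installation from Python. The following is a minimal sketch, assuming the pymongo driver is installed (pip install pymongo) and MongoDB is listening on the default port 27017 of the target host:

from pymongo import MongoClient

# Connect to the MongoDB instance installed by the playbook (default port 27017).
client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=5000)

# The ping command raises an exception if the server cannot be reached.
client.admin.command("ping")
print("MongoDB version:", client.server_info()["version"])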

Terms

  • Module: Ansible library to run or manage services, packages, files or commands.

  • Handler: A task that is triggered by a notifier and runs only when notified.

  • Task: Ansible job to run a command, check files, or update configurations.

  • Playbook: A list of tasks for Ansible nodes, written in YAML format.

  • YAML: A human-readable, generic data serialization format.

Reference

The main tutorial from Ansible is here: http://docs.ansible.com/playbooks_intro.html

You can also find an index of the ansible modules here: http://docs.ansible.com/modules_by_category.html

Exercise

We have shown a couple of examples of using Ansible tools. Before you apply them in your final project, practice them in this exercise:

  • set up the project structure similar to Ansible Galaxy example
  • install MongoDB from the package manager (apt in this class)
  • configure your MongoDB installation to start the service automatically
  • use default port and let it serve local client connections only

4.4 - Puppet

Puppet is a configuration management tool that simplifies the complex tasks of deploying new software, applying software updates, and rolling back software packages in a large cluster.

Overview

Configuration management is an important task of the IT department in any organization. It is the process of managing infrastructure changes in a structured and systematic way. Manually rolling back infrastructure to a previous version of the software is cumbersome, time consuming, and error prone. Puppet is a configuration management tool that simplifies the complex tasks of deploying new software, applying software updates, and rolling back software packages in a large cluster. Puppet does this through Infrastructure as Code (IaC): code is written for the infrastructure in one central location and is pushed to nodes in all environments (Dev, Test, Production) using the puppet tool. Configuration management tools follow two approaches for managing infrastructure: configuration push and pull. In push configuration, infrastructure as code is pushed from a centralized server to the nodes, whereas in pull configuration the nodes pull infrastructure as code from the central server, as shown in fig. 1.

Figure 1: Infrastructure As Code [1]

Puppet uses push and pull configuration in centralized manner as shown in fig. 2.

Figure 2: push-pull-config Image [1]

Another popular infrastructure tool is Ansible. It does not distinguish master and client nodes: any node in Ansible can act as an executor, and any node containing the inventory list and SSH credentials can play the master role to connect with other nodes, as opposed to the Puppet architecture, where server and agent software needs to be set up and installed. Configuring Ansible nodes is simple; they only require Python version 2.5 or greater. Ansible uses a push architecture for configuration.

Master slave architecture

Puppet uses a master-slave architecture, as shown in fig. 3. The Puppet server is called the master node and the client nodes are called Puppet agents. Agents poll the server at a regular interval and pull the updated configuration from the master. The Puppet master is highly available: it supports a multi-master architecture, and if one master goes down, a backup master stands up to serve the infrastructure.

Workflow

  • Nodes (Puppet agents) send information (e.g., IP address, hardware details, network) to the master. The master stores such information in a manifest file.
  • The master node compiles a catalog file containing the configuration information that needs to be applied on the agent nodes.
  • The master pushes the catalog to the Puppet agent nodes to implement the configuration.
  • The client nodes send an updated report back to the master, and the master updates its inventory.
  • All exchanges between master and agent are secured through SSL encryption (see fig. 3).

Figure 3: Master and Slave Architecture [1]

Fig. 4 shows the workflow between master and slave.

Figure 4: Master Slave Workflow 1 [1]

Fig. 5 shows the SSL workflow between master and slave.

Figure 5: Master Slave SSL Workflow [1]

Puppet comes in two forms: open source Puppet and Puppet Enterprise. In this tutorial we showcase the installation steps for both forms.

Install Opensource Puppet on Ubuntu

We will demonstrate the installation of Puppet on Ubuntu.

Prerequisite - at least 4 GB RAM and an Ubuntu box (standalone or VM)

First, we need to make sure that the Puppet master and agent are able to communicate with each other. The agent should be able to connect to the master using its name.

Configure the Puppet server name and map it to its IP address:

$ sudo nano /etc/hosts

The contents of /etc/hosts should look like:

<ip_address> my-puppet-master

Here my-puppet-master is the name of the Puppet master to which the Puppet agent will try to connect.

Press <ctrl> + O to save and <ctrl> + X to exit.

Next, we will install Puppet on the Ubuntu server. We execute the following commands to pull from the official Puppet Labs repository:

$ curl -O https://apt.puppetlabs.com/puppetlabs-release-pc1-xenial.deb
$ sudo dpkg -i puppetlabs-release-pc1-xenial.deb
$ sudo apt-get update

Install the Puppet server:

$ sudo apt-get install puppetserver

The default installation of the Puppet server is configured to use 2 GB of RAM. However, we can customize this by opening the puppetserver configuration file:

$ sudo nano /etc/default/puppetserver

This will open the file in the editor. Look for the JAVA_ARGS line and change the value of the -Xms and -Xmx parameters to 3g if we wish to configure the Puppet server for 3 GB of RAM. Note that the default value of these parameters is 2g.

JAVA_ARGS="-Xms3g -Xmx3g -XX:MaxPermSize=256m"

Press <ctrl> + O to save and <ctrl> + X to exit.

By default the Puppet server is configured to use port 8140 to communicate with agents. We need to make sure that the firewall allows communication on this port:

$ sudo ufw allow 8140

Next, we start the Puppet server:

$ sudo systemctl start puppetserver

Verify that the server has started:

$ sudo systemctl status puppetserver

We should see “active (running)” if the server has started successfully:

$ sudo systemctl status puppetserver
● puppetserver.service - puppetserver Service
   Loaded: loaded (/lib/systemd/system/puppetserver.service; disabled; vendor pr
   Active: active (running) since Sun 2019-01-27 00:12:38 EST; 2min 29s ago
  Process: 3262 ExecStart=/opt/puppetlabs/server/apps/puppetserver/bin/puppetser
 Main PID: 3269 (java)
   CGroup: /system.slice/puppetserver.service
           └─3269 /usr/bin/java -Xms3g -Xmx3g -XX:MaxPermSize=256m -Djava.securi

Jan 27 00:11:34 ritesh-ubuntu1 systemd[1]: Starting puppetserver Service...
Jan 27 00:11:34 ritesh-ubuntu1 puppetserver[3262]: OpenJDK 64-Bit Server VM warn
Jan 27 00:12:38 ritesh-ubuntu1 systemd[1]: Started puppetserver Service.
lines 1-11/11 (END)

Configure the Puppet server to start at boot time:

$ sudo systemctl enable puppetserver

Next, we will install the Puppet agent:

$ sudo apt-get install puppet-agent

Start the Puppet agent:

$ sudo systemctl start puppet

Configure the Puppet agent to start at boot time:

$ sudo systemctl enable puppet

Next, we need to change the Puppet agent configuration file so that the agent can connect to the Puppet master and communicate:

$ sudo nano /etc/puppetlabs/puppet/puppet.conf

The configuration file will be opened in an editor. Add the following sections to the file:

[main]
certname = <puppet-agent>
server = <my-puppet-server>

[agent]
server = <my-puppet-server>

Note: my-puppet-server is the name that we set up in the /etc/hosts file while installing the Puppet server, and certname is the name of the certificate.

The Puppet agent sends a certificate signing request to the Puppet server when it connects for the first time. After the request is signed, the Puppet server trusts and identifies the agent for managing.

Execute the following command on the Puppet master in order to see all incoming certificate signing requests:

$ sudo /opt/puppetlabs/bin/puppet cert list

We will see something like:

$ sudo /opt/puppetlabs/bin/puppet cert list
 "puppet-agent" (SHA256) 7B:C1:FA:73:7A:35:00:93:AF:9F:42:05:77:9B:
 05:09:2F:EA:15:A7:5C:C9:D7:2F:D7:4F:37:A8:6E:3C:FF:6B
Note that puppet-agent is the name that we configured for certname in the puppet.conf file.

After validating that the request is from a valid and trusted agent, we sign the request:

$ sudo /opt/puppetlabs/bin/puppet cert sign puppet-agent

We will see a message saying the certificate was signed if the operation is successful:

$ sudo /opt/puppetlabs/bin/puppet cert sign puppet-agent
Signing Certificate Request for:
  "puppet-agent" (SHA256) 7B:C1:FA:73:7A:35:00:93:AF:9F:42:05:77:9B:05:09:2F:
  EA:15:A7:5C:C9:D7:2F:D7:4F:37:A8:6E:3C:FF:6B
Notice: Signed certificate request for puppet-agent
Notice: Removing file Puppet::SSL::CertificateRequest puppet-agent
at '/etc/puppetlabs/puppet/ssl/ca/requests/puppet-agent.pem'

Next, we will verify the installation and make sure that the Puppet server is able to push configuration to the agent. Puppet uses domain-specific language code written in manifest (.pp) files.

Create the default manifest site.pp file:

$ sudo nano /etc/puppetlabs/code/environments/production/manifests/site.pp

This will open the file in edit mode. Make the following changes to this file:

file {'/tmp/it_works.txt':                        # resource type file and filename
  ensure  => present,                             # make sure it exists
  mode    => '0644',                              # file permissions
  content => "It works!\n",  # Print the eth0 IP fact
}

The domain-specific language is used to create the it_works.txt file inside the /tmp directory on the agent node. The ensure directive makes sure that the file is present; it creates one if the file is removed. The mode directive sets the file permissions (0644: owner read/write, others read only). The content directive defines the content placed in the file.

Next, we test the installation on a single node:

$ sudo /opt/puppetlabs/bin/puppet agent --test

A successful verification will display:

Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for puppet-agent
Info: Applying configuration version '1548305548'
Notice: /Stage[main]/Main/File[/tmp/it_works.txt]/content:
--- /tmp/it_works.txt    2019-01-27 02:32:49.810181594 +0000
+++ /tmp/puppet-file20190124-9628-1vy51gg    2019-01-27 02:52:28.717734377 +0000
@@ -0,0 +1 @@
+it works!

Info: Computing checksum on file /tmp/it_works.txt
Info: /Stage[main]/Main/File[/tmp/it_works.txt]: Filebucketed /tmp/it_works.txt
to puppet with sum d41d8cd98f00b204e9800998ecf8427e
Notice: /Stage[main]/Main/File[/tmp/it_works.txt]/content: content
changed '{md5}d41d8cd98f00b204e9800998ecf8427e' to '{md5}0375aad9b9f3905d3c545b500e871aca'
Info: Creating state file /opt/puppetlabs/puppet/cache/state/state.yaml
Notice: Applied catalog in 0.13 seconds

Installation of Puppet Enterprise

First, download ubuntu-<version and arch>.tar.gz and the GPG signature file onto the Ubuntu VM.

Second, we import the Puppet public key:

$ wget -O - https://downloads.puppetlabs.com/puppet-gpg-signing-key.pub | gpg --import

We will see output such as:

--2019-02-03 14:02:54--  https://downloads.puppetlabs.com/puppet-gpg-signing-key.pub
Resolving downloads.puppetlabs.com
(downloads.puppetlabs.com)... 2600:9000:201a:b800:10:d91b:7380:93a1
, 2600:9000:201a:800:10:d91b:7380:93a1, 2600:9000:201a:be00:10:d91b:7380:93a1, ...
Connecting to downloads.puppetlabs.com (downloads.puppetlabs.com)
|2600:9000:201a:b800:10:d91b:7380:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3139 (3.1K) [binary/octet-stream]
Saving to: ‘STDOUT’

-                   100%[===================>]   3.07K  --.-KB/s    in 0s

2019-02-03 14:02:54 (618 MB/s) - written to stdout [3139/3139]

gpg: key 7F438280EF8D349F: "Puppet, Inc. Release Key
(Puppet, Inc. Release Key) <release@puppet.com>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1

Third, we print the fingerprint of the key used:

$ gpg --fingerprint 0x7F438280EF8D349F

We will see successful output such as:

pub   rsa4096 2016-08-18 [SC] [expires: 2021-08-17]
      6F6B 1550 9CF8 E59E 6E46  9F32 7F43 8280 EF8D 349F
uid           [ unknown] Puppet, Inc. Release Key
(Puppet, Inc. Release Key) <release@puppet.com>
sub   rsa4096 2016-08-18 [E] [expires: 2021-08-17]

Fourth, we verify the release signature of the downloaded package:

$ gpg --verify puppet-enterprise-VERSION-PLATFORM.tar.gz.asc

Successful output will show as:

gpg: assuming signed data in 'puppet-enterprise-2019.0.2-ubuntu-18.04-amd64.tar.gz'
gpg: Signature made Fri 25 Jan 2019 02:03:23 PM EST
gpg:                using RSA key 7F438280EF8D349F
gpg: Good signature from "Puppet, Inc. Release Key
(Puppet, Inc. Release Key) <release@puppet.com>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: 6F6B 1550 9CF8 E59E 6E46  9F32 7F43 8280 EF8D 349

Next, we need to unpack the installation tarball. We store the path of the tarball in the $TARBALL variable, which will be used in our installation:

$ export TARBALL=<path_of_tarball_file>

Then, we extract the tarball:

$ tar -xf $TARBALL

Next, we run the installer from the installer directory:

$ sudo ./puppet-enterprise-installer

This will ask us to choose an installation option; we can choose between a guided installation and a text-based installation.

~/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64
$ sudo ./puppet-enterprise-installer
~/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64
~/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64
=============================================================
    Puppet Enterprise Installer
=============================================================

## Installer analytics are enabled by default.
## To disable, set the DISABLE_ANALYTICS environment variable and rerun
this script.
For example, "sudo DISABLE_ANALYTICS=1 ./puppet-enterprise-installer".
## If puppet_enterprise::send_analytics_data is set to false in your
existing pe.conf, this is not necessary and analytics will be disabled.

Puppet Enterprise offers three different methods of installation.

[1] Express Installation (Recommended)

This method will install PE and provide you with a link at the end
of the installation to reset your PE console admin password

Make sure to click on the link and reset your password before proceeding
to use PE

[2] Text-mode Install

This method will open your EDITOR (vi) with a PE config file (pe.conf)
for you to edit before you proceed with installation.

The pe.conf file is a HOCON formatted file that declares parameters
and values needed to install and configure PE.
We recommend that you review it carefully before proceeding.

[3] Graphical-mode Install

This method will install and configure a temporary webserver to walk
you through the various configuration options.

NOTE: This method requires you to be able to access port 3000 on this
machine from your desktop web browser.

=============================================================

 How to proceed? [1]:

-------------------------------------------------------------------

Press 3 for the web-based Graphical-mode Install.

When successful, we will see output such as:

## We're preparing the Web Installer...

2019-02-02T20:01:39.677-05:00 Running command:
mkdir -p /opt/puppetlabs/puppet/share/installer/installer
2019-02-02T20:01:39.685-05:00 Running command:
cp -pR /home/ritesh/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64/*
/opt/puppetlabs/puppet/share/installer/installer/

## Go to https://<localhost>:3000 in your browser to continue installation.

By default the Puppet Enterprise server uses port 3000. Make sure that the firewall allows communication on port 3000:

$ sudo ufw allow 3000

Next, go to the https://localhost:3000 URL to complete the installation.

Click on the Get Started button.

Choose "Install on this server".

Enter <mypserver> as the DNS name. This is our Puppet server name; it can also be configured in the config file.

Enter the console admin password.

Click Continue.

We will get a "Confirm the plan" screen with the following information:

The Puppet master component
Hostname
ritesh-ubuntu-pe
DNS aliases
<mypserver>

Click Continue and verify the installer validation screen.

Click the Deploy Now button.

Puppet Enterprise will be installed and will display a message on the screen:

Puppet agent ran successfully

Log in to the console with the admin password that was set earlier and click on the nodes links to manage nodes.

Installing Puppet Enterprise using the text-mode monolithic installation

$ sudo ./puppet-enterprise-installer

Enter 2 at the "How to proceed?" prompt for the text-mode monolithic installation. The following message will be displayed if successful.

2019-02-02T22:08:12.662-05:00 - [Notice]: Applied catalog in 339.28 seconds
2019-02-02T22:08:13.856-05:00 - [Notice]:
Sent analytics: pe_installer - install_finish - succeeded
* /opt/puppetlabs/puppet/bin/puppet infrastructure configure
--detailed-exitcodes --environmentpath /opt/puppetlabs/server/data/environments
--environment enterprise --no-noop --install=2019.0.2 --install-method='repair'
* returned: 2

## Puppet Enterprise configuration complete!


Documentation: https://puppet.com/docs/pe/2019.0/pe_user_guide.html
Release notes: https://puppet.com/docs/pe/2019.0/pe_release_notes.html

If this is a monolithic configuration, run 'puppet agent -t' to complete the
setup of this system.

If this is a split configuration, install or upgrade the remaining PE components,
and then run puppet agent -t on the Puppet master, PuppetDB, and PE console,
in that order.
~/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64
2019-02-02T22:08:14.805-05:00 Running command: /opt/puppetlabs/puppet/bin/puppet
agent --enable
~/pe/puppet-enterprise-2019.0.2-ubuntu-18.04-amd64$

This is called a monolithic installation because all components of Puppet Enterprise, such as the Puppet master, PuppetDB, and the Console, are installed on a single node. This installation type is easy to install, and troubleshooting errors and upgrading the infrastructure are simple. It can easily support an infrastructure of up to 20,000 managed nodes, and compile masters can be added as the network grows. This is the recommended installation type for small to mid-size organizations [2].

The pe.conf configuration file will be opened in an editor to configure values. This file contains the parameters and values for installing, upgrading, and configuring Puppet.

Some important parameters that can be specified in the pe.conf file are:

console_admin_password
puppet_enterprise::console_host
puppet_enterprise::puppetdb_host
puppet_enterprise::puppetdb_database_name
puppet_enterprise::puppetdb_database_user

Lastly, we run Puppet after the installation is complete:

$ puppet agent -t

A text-mode split installation is performed for large networks. Compared to the monolithic installation, the split installation type can manage a large infrastructure that requires more than 20,000 nodes. In this type of installation, the different components of Puppet Enterprise (master, PuppetDB, and Console) are installed on different nodes. This installation type is recommended for organizations with large infrastructure needs [3].

In this type of installation, we need to install the components in a specific order: first the master, then PuppetDB, followed by the Console.

Puppet Enterprise master and agent settings can be configured in the puppet.conf file. Most configuration settings of the Puppet Enterprise components, such as the master, the agent, and security certificates, are specified in this file.

Config section of Agent Node

[main]

certname = <http://your-domain-name.com/>
server = puppetserver
environment = testing
runinterval = 4h

Config section of Master Node

[main]

certname =  <http://your-domain-name.com/>
server = puppetserver
environment = testing
runinterval = 4h
strict_variables = true

[master]

dns_alt_names = puppetserver,puppet, <http://your-domain-name.com/>
reports = puppetdb
storeconfigs_backend = puppetdb
storeconfigs = true
environment_timeout = unlimited

Comment lines, settings lines, and setting variables are the main components of a Puppet configuration file. Comments in the config file are specified by prefixing them with a hash character. A settings line consists of the name of the setting followed by an equals sign and the value of the setting. A setting value generally consists of one word, but multiple words can be specified in rare cases [4].
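
As a short illustration of these components, a fragment of a puppet.conf could look like the following (the certname and server values here are placeholders, not values from this tutorial):

# comment line: ignored by Puppet
[main]
# settings lines: a setting name, an equals sign, and a value
certname = agent01.example.com
server = puppetserver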

References

[1] Edureka, “Puppet tutorial – devops tool for configuration management.” Web Page, May-2017 [Online]. Available: https://www.edureka.co/blog/videos/puppet-tutorial/

[2] Puppet, “Text mode installation: Monolithic.” Web Page, Nov-2017 [Online]. Available: https://puppet.com/docs/pe/2017.1/install_text_mode_mono.html

[3] Puppet, “Text mode installation : Split.” Web Page, Nov-2017 [Online]. Available: https://puppet.com/docs/pe/2017.1/install_text_mode_split.html

[4] Puppet, “Config files: The main config files.” Web Page, Apr-2014 [Online]. Available: https://puppet.com/docs/puppet/5.3/config_file_main.html

4.5 - Travis

Travis CI is a continuous integration tool that is often used as part of DevOps development. It is a hosted service that enables users to test their projects on GitHub.

Once travis is activated in a GitHub project, the developers can place a .travis file in the project root. Upon check-in, the travis configuration file will be interpreted and the commands indicated in it will be executed.

In fact, this book also has its own travis file.

Please inspect it, as we will illustrate some of its concepts here. Unfortunately, travis does not use an up-to-date operating system such as Ubuntu 18.04 and therefore contains outdated libraries. Although we would be able to use containers, we have elected to use a mechanism that updates the operating system as we need.

This is done in the install phase, which in our case installs a new version of pandoc as well as some additional libraries that we use.

In the env section we specify, via the PATH variable, where our executables can be found.

The last portion of our example file specifies the script that is executed after the install phase has completed. As our installation contains convenient and sophisticated makefiles, the script is very simple: it executes the appropriate make command in the corresponding directories.
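
Since we do not reproduce the actual file here, the following is only a hedged sketch of the general shape such a .travis file can take; the <version> placeholders, the requirements.txt install step, and the make invocation are illustrative assumptions rather than the book's real configuration.

dist: xenial                              # an older Ubuntu image, per the limitation noted above
language: python

env:
  global:
    - PATH=$HOME/.local/bin:$PATH         # make locally installed tools visible

install:
  # replace the outdated distribution pandoc with a newer release
  - wget https://github.com/jgm/pandoc/releases/download/<version>/pandoc-<version>-1-amd64.deb
  - sudo dpkg -i pandoc-<version>-1-amd64.deb
  - pip install --user -r requirements.txt

script:
  - make -C <book-directory>              # run the existing makefiles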

Exercises

E.travis.1:

Develop an alternative travis file that instead uses a preconfigured Ubuntu 18.04 container.

E.travis.2:

Develop a travis file that checks our books on multiple operating systems, such as macOS and Ubuntu 18.04.

Resources

4.6 - DevOps with AWS

AWS's cloud offering comes with scalable, high-performance, end-to-end support for DevOps

AWS's cloud offering comes with scalable, high-performance, end-to-end support for DevOps, all the way from automatic deployment and monitoring of infrastructure-as-code to our cloud application code. AWS provides various DevOps tools to make deployment and support automation as simple as possible.

AWS DevOps Tools

The following is a list of DevOps tools for CI/CD workflows.

  • CodeStar: AWS CodeStar provides a unified UI to enable simpler deployment automation.
  • CodePipeline: CI/CD service for faster and more reliable application and infrastructure updates.
  • CodeBuild: fully managed build service that compiles, tests, and creates software packages that are ready to deploy.
  • CodeDeploy: deployment automation tool to deploy to on-premise and on-cloud EC2 instances with near-zero downtime during application deployments.

Infrastructure Automation

AWS provides services to make micro-services easily deployable onto containers and serverless platforms.

  • Elastic Container Service: highly scalable container management service.
  • CodePipeline: CI/CD service for faster and more reliable application and infrastructure updates.
  • AWS Lambda: serverless computing using Function-as-a-Service (FaaS) methodologies.
  • AWS CloudFormation: tool to create and manage related AWS resources.
  • AWS OpsWorks: server configuration management tool.

Monitoring and Logging

  • Amazon CloudWatch: tool to monitor AWS resources and cloud applications, collect and track metrics and logs, and set alarms.
  • AWS X-Ray: allows developers to analyze and troubleshoot performance issues of their cloud applications and micro-services.

For more information, please visit Amazon AWS [1].

References

[1] Amazon AWS, DevOps and AWS. Amazon, 2019 [Online]. Available: https://aws.amazon.com/devops/

4.7 - DevOps with Azure Monitor

Microsoft provides a unified tool called Azure Monitor for end-to-end monitoring of the infrastructure and deployed applications.

Microsoft provides a unified tool called Azure Monitor for end-to-end monitoring of the infrastructure and deployed applications. Azure Monitor can greatly help DevOps teams by proactively and reactively monitoring applications for bug tracking and health checks, and by providing metrics that can hint at various scalability aspects.

Figure 1: Azure Monitor [1]

Azure Monitor accommodates applications developed in various programming languages: .NET, Java, Node.js, Python, and others. With the Azure Application Insights telemetry API incorporated into the applications, Azure Monitor can provide more detailed metrics and analytics around specific tracking needs such as usage and bugs.

Azure Monitor can help us track the health, performance, and scalability issues of the infrastructure (VMs, containers, storage, network, and all Azure services) by automatically providing various platform metrics, activity logs, and diagnostic logs.

Azure Monitor provides programmatic access to the activity and diagnostic logs through PowerShell scripts. It also allows querying them using powerful query tools for advanced in-depth analysis and reporting.

Azure Monitor proactively monitors and notifies us of critical conditions (reaching quota limits, abnormal usage, failing health checks) and gives recommendations, along with making attempts to correct some of those conditions.

Azure Monitor dashboards allow us to visualize various aspects of the data (metrics, logs, usage patterns) in tabular and graphical widgets.

Azure Monitor also facilitates closer monitoring of micro-services if they are provided through Azure's serverless Function-as-a-Service offering.

For more information, please visit Microsoft Azure Website [1].

References

[1] Microsoft Azure, Azure Monitor Overview. Microsoft, 2018 [Online]. Available: https://docs.microsoft.com/en-us/azure/azure-monitor/overview

5 - Google Colab

A gentle introduction to Google Colab for Programming

In this section we introduce you to using Google Colab to run deep learning models.

1. Updates

  1. Another Python notebook demonstrating StopWatch and Benchmark is available in the References (Benchmark Colab Notebook).

  2. The line ! pip install cloudmesh-installer is not needed, but is used in the video.

2. Introduction to Google Colab

This video contains the introduction to Google Colab. In this section we will be learning how to start a Google Colab project.

3. Programming in Google Colab

In this video we will learn how to create a simple Colab notebook.

Required Installations

pip install numpy

4. Benchmarking in Google Colab with Cloudmesh

In this video we learn how to do a basic benchmark with Cloudmesh tools. Cloudmesh StopWatch will be used in this tutorial.

Required Installations

pip install numpy
pip install cloudmesh-common

Correction: the video also shows pip install cloudmesh-installer; this is not necessary for this example.
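
A minimal sketch of such a StopWatch benchmark is shown below. It assumes the cloudmesh-common StopWatch API (StopWatch.start/stop) that is also used later in this document; the matrix size and the StopWatch.benchmark() summary call are assumptions that may need adapting to your cloudmesh-common version.

import numpy as np
from cloudmesh.common.StopWatch import StopWatch

# time a simple numpy workload
StopWatch.start("matmul")
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = a.dot(b)
StopWatch.stop("matmul")

# print a summary of all recorded timers
# (assumes StopWatch.benchmark() is available in your cloudmesh-common version)
StopWatch.benchmark()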

5. References

  1. Benchmark Colab Notebook. https://colab.research.google.com/drive/1tG7IcP-XMQiNVxU05yazKQYciQ9GpMat

6 - Modules from SDSC

This is a list of test modules to demonstrate integration with content contributed by SDSC.

Modules contributed by SDSC.

6.1 - Jupyter Notebooks in Comet over HTTP

1. Overview

1.1. Prerequisite

  • Account on Comet

1.2. Effort

  • 30 minutes

1.3. Topics covered

  • Using Notebooks on Comet

2. SSH to Jupyter Notebooks on Comet

We describe how to connect the browser on your local host (laptop) to a Jupyter service running on Comet over HTTP and demonstrate why the connection is not secure.

connection over HTTP

Note: Google Chrome has many local ports open in the range 7713 - 7794. They all connect to port 80 or 443 on the other end.

3. Log onto comet.sdsc.edu

ssh -Y -l <username> <system name>.sdsc.edu
  • create a test directory, or cd into one you have already created
  • Clone the examples repository:
git clone https://github.com/sdsc-hpc-training-org/notebook-examples.git

4. Launch a notebook on the login node

Run the jupyter command. Be sure to set the --ip option to use the hostname, which will appear in your URL:

[mthomas@comet-14-01:~] jupyter notebook  --no-browser --ip=`/bin/hostname`

You will see output similar to that shown below:

[I 08:06:32.961 NotebookApp] JupyterLab extension loaded from /home/mthomas/miniconda3/lib/python3.7/site-packages/jupyterlab
[I 08:06:32.961 NotebookApp] JupyterLab application directory is /home/mthomas/miniconda3/share/jupyter/lab
[I 08:06:33.486 NotebookApp] Serving notebooks from local directory: /home/mthomas
[I 08:06:33.487 NotebookApp] The Jupyter Notebook is running at:
[I 08:06:33.487 NotebookApp] http://comet-14-01.sdsc.edu:8888/?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b
[I 08:06:33.487 NotebookApp]  or http://127.0.0.1:8888/?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b
[I 08:06:33.487 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:06:33.494 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///home/mthomas/.local/share/jupyter/runtime/nbserver-6614-open.html
    Or copy and paste one of these URLs:
        http://comet-14-01.sdsc.edu:8888/?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b
     or http://127.0.0.1:8888/?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b
[I 08:06:45.773 NotebookApp] 302 GET /?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b (76.176.117.51) 0.74ms
[E 08:06:45.925 NotebookApp] Could not open static file ''
[W 08:06:46.033 NotebookApp] 404 GET /static/components/react/react-dom.production.min.js (76.176.117.51) 7.39ms referer=http://comet-14-01.sdsc.edu:8888/tree?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b
[W 08:06:46.131 NotebookApp] 404 GET /static/components/react/react-dom.production.min.js (76.176.117.51) 1.02ms referer=http://comet-14-01.sdsc.edu:8888/tree?token=6d7a48dda7cc1635d6d08f63aa1a696008fa89d8aa84ad2b

Notice that the notebook URL is using HTTP, and when you connect the browser on your local system to this URL, the connection will not be secure. Note: it is against SDSC Comet policy to run applications on the login nodes, and any such applications will be killed by the system admins. A better way is to run the job on an interactive node or on a compute node using the batch queue (see the Comet User Guide), as described in the next sections.

5. Obtain an interactive node

Jobs can be run on the cluster in batch mode or in interactive mode. Batch jobs are performed remotely and without manual intervention. Interactive mode enables you to run/compile your program and set up your environment on a compute node dedicated to you. To obtain an interactive node, type:

srun --pty --nodes=1 --ntasks-per-node=24 -p compute -t 02:00:00 --wait 0 /bin/bash

You will have to wait for your node to be allocated - which can take a few or many minutes. You will see pending messages like the ones below:

srun: job 24000544 queued and waiting for resources
srun: job 24000544 has been allocated resources
[mthomas@comet-18-29:~/hpctrain/python/PythonSeries]

You can also check the status of jobs in the queue system to get an idea of how long you may need to wait.
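
For example, the standard SLURM squeue client (assuming it is available on the Comet login node, as on most SLURM clusters) lists your queued and running jobs:

squeue -u $USER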

Launch the Jupyter Notebook application. Note: this application will be running on Comet, and you will be given a URL which will connect your local web browser to the interactive Comet session:

jupyter notebook --no-browser --ip=`/bin/hostname`

This will give you an address that contains the compute node hostname and a token, something like:

http://comet-14-0-4:8888/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

You can then paste it into your browser. You will see a running Jupyter notebook and a listing of the notebooks in your directory. From there everything should work as in a regular notebook. Note: this token is your authentication, so do not email it or send it around. It will go away when you stop the notebook.

To learn about Python, run the Python basics.ipynb notebook. To see an example of remote visualization, run the Matplotlib.ipynb notebook!

5.1 Access the node in your browser

Copy the URL above into the browser running on your laptop.

5.2 Use your jupyterlab/jupyter notebook server!

Enjoy. Note that your notebook is unsecured.

7 - AI-First Engineering Cybertraining Spring 2021 - Module

Here you will find a number of modules and components for introducing you to AI-First Engineering

Big Data Applications are an important topic that has impact in academia and industry.

7.1 - 2021

Here you will find a number of modules and components for introducing you to AI-First Engineering.

Big Data Applications are an important topic that has impact in academia and industry.

7.1.1 - Introduction to AI-Driven Digital Transformation

Last semester's introductory lecture, with an introduction to and motivation for the Big Data Applications and Analytics class. See section G for material directly related to this class, but all sections are relevant.

This Lecture is recorded in 8 parts and gives an introduction and motivation for the class. This and other lectures in class are divided into “bite-sized lessons” from 5 to 30 minutes in length; that’s why it has 8 parts.

The lecture explains what students might gain from the class even if they end up with different types of jobs, from data engineering, software engineering, and data science to business (application) experts. It stresses that we are well into a transformation that impacts industry, research, and the way life is lived. This transformation is centered on doing things the digital way, with clouds, edge computing, and deep learning providing the implementation. This "AI-Driven Digital Transformation" is as transformational as the Industrial Revolution in the past. We note that deep learning dominates most innovative AI, replacing several traditional machine learning methods.

The slides for this course can be found at E534-Fall2020-Introduction

A: Getting Started: BDAA Course Introduction Part A: Big Data Applications and Analytics

This lesson describes briefly the trends driving and consequent of the AI-Driven Digital Transformation. It discusses the organizational aspects of the class and notes the two driving trends are clouds and AI. Clouds are mature and a dominant presence. AI is still rapidly changing and we can expect further major changes. The edge (devices and associated local fog computing) has always been important but now more is being done there.

B: Technology Futures from Gartner’s Analysis: BDAA Course Introduction Part B: Big Data Applications and Analytics

This lesson goes through the technologies (AI, Edge, Cloud) from 2008-2020 that are driving the AI-Driven Digital Transformation. We use Hype Cycles and Priority Matrices from Gartner, tracking important concepts from the Innovation Trigger and Peak of Inflated Expectations through the Plateau of Productivity. We contrast clouds and AI.

  • This gives illustrations of sources of big data.
  • It gives key graphs of data sizes, images uploaded; computing, data, bandwidth trends;
  • Cloud-Edge architecture.
  • Intelligent machines and comparison of data from aircraft engine monitors compared to Twitter
  • Multicore revolution
  • Overall Global AI and Modeling Supercomputer GAIMSC
  • Moore's Law compared to Deep Learning computing needs
  • Intel and NVIDIA status

E: Big Data and Science: BDAA Course Introduction Part E: Big Data Applications and Analytics

  • Applications and Analytics
  • Cyberinfrastructure, e-moreorlessanything.
  • LHC, Higgs Boson and accelerators.
  • Astronomy, SKA, multi-wavelength.
  • Polar Grid.
  • Genome Sequencing.
  • Examples, Long Tail of Science.
  • Wired’s End of Science; the 4 paradigms.
  • More data versus Better algorithms.

F: Big Data Systems: BDAA Course Introduction Part F: Big Data Applications and Analytics

  • Clouds, Service-oriented architectures, HPC (High Performance Computing), Apache Software
  • DIKW process illustrated by Google maps
  • Raw data to Information/Knowledge/Wisdom/Decision; the data deluge from the Edge
  • Parallel Computing
  • Map Reduce

G: Industry Transformation: BDAA Course Introduction Part G: Big Data Applications and Analytics

AI grows in importance and industries transform with

  • Core Technologies related to
  • New “Industries” over the last 25 years
  • Traditional “Industries” Transformed; malls and other old industries transform
  • Good to be master of Cloud Computing and Deep Learning
  • AI-First Industries,

H: Jobs and Conclusions: BDAA Course Introduction Part H: Big Data Applications and Analytics

  • Job trends
  • Become digitally savvy so you can take advantage of the AI/Cloud/Edge revolution with different jobs
  • The qualitative idea of Big Data has turned into a quantitative realization as Cloud, Edge and Deep Learning
  • Clouds are here to stay and one should plan on exploiting them
  • Data Intensive studies in business and research continue to grow in importance

7.1.2 - AI-First Engineering Cybertraining Spring 2021

Updated On an ongoing Basis

Week 1

Lecture

Our first meeting is 01:10P-02:25P on Tuesday

The zoom will be https://iu.zoom.us/my/gc.fox

We will discuss how to interact with us. We can adjust the course somewhat.

Also, as lectures are or will be put on YouTube, we will go to one lecture per week; we will choose the day.

The Syllabus has a general course description

Please communicate initially by email gcf@iu.edu

This first class discussed structure of class and agreed to have a section on deep learning technology.

We gave Introductory Lecture

Assignments

Week 2

Introduction

We gave an introductory lecture on optimization and deep learning. Unfortunately we did not record the zoom session, but we did make an offline recording with the slides IntroDLOpt: Introduction to Deep Learning and Optimization and a YouTube video.

Google Colab

We also went through material on using Google Colab with examples. This is a lecture plus four Python notebooks

with recorded video

First DL: Deep Learning MNIST Example Spring 2021

We now have recorded all the introductory deep learning material

with recorded videos

IntroDLOpt: Introduction to Deep Learning and Optimization

  • Video: IntroDLOpt: Introduction to Deep Learning and Optimization

Opt: Overview of Optimization Spring2021

  • Video: Opt: Overview of Optimization Spring2021

DLBasic: Deep Learning - Some examples Spring 2021

  • Video: DLBasic: Deep Learning - Some examples Spring 2021

DLBasic: Components of Deep Learning Systems

  • Video: DLBasic: Components of Deep Learning Systems

DLBasic: Summary of Types of Deep Learning Systems

  • Video: DLBasic: Summary of Types of Deep Learning Systems

Week 3

Deep Learning Examples, 1

We discussed deep learning examples covering first half of slides DLBasic: Deep Learning - Some examples with recorded video

Week 4

Deep Learning Examples, 2 plus Components

We concluded deep learning examples and covered components with slides Deep Learning: More Examples and Components with recorded video

Week 5

Deep Learning Networks plus Overview of Optimization

We covered two topics in this weeks video

with recorded video

Week 6

Deep Learning and AI Examples in Health and Medicine

We went about 2/3rds of way through presentation AI First Scenarios: Health and Medicine

with recorded video

Week 7

Deep Learning and AI Examples

with recorded video

Week 8

Deep Learning and AI Examples

with recorded video

Week 9

Deep Learning and AI Examples

with recorded video

Week 10

GitHub for the Class project

  • We explain how to use GitHub for the class project. A video is available on YouTube. Please note that we only uploaded the relevant portion. The other half of the lecture went into individual comments for each student which we have not published. The comments are included in the GitHub repository.

Note project guidelines are given here

Video

Week 11

The Final Project

  • We described the guidelines of final projects in Slides
  • We were impressed by the seven student presentations describing their chosen project and approach.

Video

Week 12

Practical Issues in Deep Learning for Earthquakes

We used our research on Earthquake forecasting, to illustrate deep learning for Time Series with slides

Video

Week 13

Practical Issues in Deep Learning for Earthquakes

We continued discussion that illustrated deep learning for Time Series with the same slides as last week

Video

7.1.3 - Introduction to AI in Health and Medicine

This section discusses the health and medicine sector

Overview

This module discusses AI and the digital transformation for the Health and Medicine Area with a special emphasis on COVID-19 issues. We cover both the impact of COVID and some of the many activities that are addressing it. Parts B and C have an extensive general discussion of AI in Health and Medicine

The complete presentation is available at Google Slides while the videos are a YouTube playlist

Part A: Introduction

This lesson describes some overarching issues including the

  • Summary in terms of Hypecycles
  • Players in the digital health ecosystem and in particular role of Big Tech which has needed AI expertise and infrastructure from clouds to smart watches/phones
  • Views of Patients and Doctors on New Technology
  • Role of clouds. This is essentially assumed throughout presentation but not stressed.
  • Importance of Security
  • Introduction to the Internet of Medical Things; this area is discussed in more detail later in the presentation

slides

Part B: Diagnostics

This highlights some diagnostic applications of AI and the digital transformation. Part C also has some diagnostic coverage, especially particular applications.

  • General use of AI in Diagnostics
  • Early progress in diagnostic imaging including Radiology and Ophthalmology
  • AI In Clinical Decision Support
  • Digital Therapeutics is a recognized and growing activity area

slides

Part C: Examples

This lesson covers a broad range of AI uses in Health and Medicine

  • Flagging issues requiring urgent attention and, more generally, AI for Precision Medicine
  • Oncology and cancer have made early progress as they exploit AI for images. Avoiding mistakes and diagnosing curable cervical cancer in developing countries with less screening.
  • Predicting Gestational Diabetes
  • cardiovascular diagnostics and AI to interpret and guide Ultrasound measurements
  • Robot Nurses and robots to comfort patients
  • AI to guide cosmetic surgery measuring beauty
  • AI in analysis DNA in blood tests
  • AI For Stroke detection (large vessel occlusion)
  • AI monitoring of breathing to flag opioid-induced respiratory depression.
  • AI to relieve administration burden including voice to text for Doctor’s notes
  • AI in consumer genomics
  • Areas that are slow including genomics, Consumer Robotics, Augmented/Virtual Reality and Blockchain
  • AI analysis of information resources flags problems earlier
  • Internet of Medical Things applications from watches to toothbrushes

slides

Part D: Impact of Covid-19

This covers some aspects of the impact of the COVID-19 pandemic starting in March 2020.

  • The features of the first stimulus bill
  • Impact on Digital Health, Banking, Fintech, Commerce – bricks and mortar, e-commerce, groceries, credit cards, advertising, connectivity, tech industry, Ride Hailing and Delivery,
  • Impact on Restaurants, Airlines, Cruise lines, general travel, Food Delivery
  • Impact of working from home and videoconferencing
  • The economy and
  • The often positive trends for Tech industry

slides

Part E: Covid-19 and Recession

This is largely outdated, as it centered on the start of the pandemic-induced recession, and we know what really happened now. Probably the pandemic accelerated the transformation of industry and the use of AI.

slides

Part F: Tackling Covid-19

This discusses some of the AI and digital methods used to understand and reduce the impact of COVID-19.

  • Robots for remote patient examination
  • computerized tomography scan + AI to identify COVID-19
  • Early activities of Big Tech and COVID
  • Other early biotech activities with COVID-19
  • Remote-work technology: Hopin, Zoom, Run the World, FreeConferenceCall, Slack, GroWrk, Webex, Lifesize, Google Meet, Teams
  • Vaccines
  • Wearables and Monitoring, Remote patient monitoring
  • Telehealth, Telemedicine and Mobile Health

slides

Part G: Data and Computational Science and Covid-19

This lesson reviews some sophisticated high performance computing HPC and Big Data approaches to COVID

  • Rosetta volunteer computer to analyze proteins
  • COVID-19 High Performance Computing Consortium
  • AI based drug discovery by startup Insilico Medicine
  • Review of several research projects
  • Global Pervasive Computational Epidemiology for COVID-19 studies
  • Simulations of Virtual Tissues at Indiana University available on nanoHUB

slides

Part H: Screening Drug Candidates

A major project involving Department of Energy Supercomputers

  • General Structure of Drug Discovery
  • DeepDriveMD Project using AI combined with molecular dynamics to accelerate discovery of drug properties

slides

Part I: Areas for Covid19 Study and Pandemics as Complex Systems

slides

  • Possible Projects in AI for Health and Medicine and especially COVID-19
  • Pandemics as a Complex System
  • AI and computational Futures for Complex Systems

7.1.4 - Mobility (Industry)

This section discusses the mobility in Industry

Overview

  1. Industry being transformed by a) Autonomy (AI) and b) Electric power
  2. Established Organizations can’t change
    • General Motors (employees: 225,000 in 2016 to around 180,000 in 2018) finds it hard to compete with Tesla (42000 employees)
    • The market value of GM was half that of Tesla at the start of 2020 but was just 11% of it by October 2020
    • GM purchased Cruise to compete
    • Funding and then buying startups is an important “transformation” strategy
  3. Autonomy needs Sensors Computers Algorithms and Software
    • Also experience (training data)
    • Algorithms main bottleneck; others will automatically improve although lots of interesting work in new sensors, computers and software
    • Over the last 3 years, electrical power has gone from interesting to “bound to happen”; Tesla’s happy customers probably contribute to this
    • Batteries and Charging stations needed

Summary Slides

Full Slide Deck

Mobility Industry A: Introduction

  • Futures of Automobile Industry, Mobility, and Ride-Hailing
  • Self-cleaning cars
  • Medical Transportation
  • Society of Automotive Engineers, Levels 0-5
  • Gartner’s conservative View

Mobility Industry B: Self Driving AI

  • Image processing and Deep Learning
  • Examples of Self Driving cars
  • Road construction Industry
  • Role of Simulated data
  • Role of AI in autonomy
  • Fleet cars
  • 3 Leaders: Waymo, Cruise, NVIDIA

Mobility Industry C: General Motors View

  • Talk by Dave Brooks at GM, “AI for Automotive Engineering”
  • Zero crashes, zero emission, zero congestion
  • GM moving to electric autonomous vehicles

Mobility Industry D: Self Driving Snippets

  • Worries about and data on its Progress
  • Tesla’s specialized self-driving chip
  • Some tasks that are hard for AI
  • Scooters and Bikes

Mobility Industry E: Electrical Power

  • Rise in use of electrical power
  • Special opportunities in e-Trucks and time scale
  • Future of Trucks
  • Tesla market value
  • Drones and Robot deliveries; role of 5G
  • Robots in Logistics

7.1.5 - Space and Energy

This section discusses the space and energy.

Overview

  1. Energy sources and AI for powering Grids.
  2. Energy Solution from Bill Gates
  3. Space and AI

Full Slide Deck

A: Energy

  • Distributed Energy Resources as a grid of renewables with a hierarchical set of Local Distribution Areas
  • Electric Vehicles in Grid
  • Economics of microgrids
  • Investment into Clean Energy
  • Batteries
  • Fusion and Deep Learning for plasma stability
  • AI for Power Grid, Virtual Power Plant, Power Consumption Monitoring, Electricity Trading

Slides

B: Clean Energy startups from Bill Gates

  • 26 Startups in areas like long-duration storage, nuclear energy, carbon capture, batteries, fusion, and hydropower …
  • The slide deck gives links to 26 companies from their website and pitchbook which describes their startup status (#employees, funding)
  • It summarizes their products

Slides

C: Space

  • Space supports AI with communications, image data and global navigation
  • AI Supports space in AI-controlled remote manufacturing, imaging control, system control, dynamic spectrum use
  • Privatization of Space - SpaceX, Investment
  • 57,000 satellites through 2029

Slides

7.1.6 - AI In Banking

This section discusses AI in Banking

Overview

In this lecture, AI in Banking is discussed. Here we focus on the transition of legacy banks towards AI based banking, real world examples of AI in Banking, banking systems and banking as a service.

Slides

AI in Banking A: The Transition of legacy Banks

  1. Types of AI that is used
  2. Closing of physical branches
  3. Making the transition
  4. Growth in Fintech as legacy bank services decline

AI in Banking B: FinTech

  1. Fintech examples and investment
  2. Broad areas of finance/banking where Fintech operating

AI in Banking C: Neobanks

  1. Types and Examples of neobanks
  2. Customer uptake by world region
  3. Neobanking in Small and Medium Business segment
  4. Neobanking in real estate, mortgages
  5. South American Examples

AI in Banking D: The System

  1. The Front, Middle, Back Office
  2. Front Office: Chatbots
  3. Robo-advisors
  4. Middle Office: Fraud, Money laundering
  5. Fintech
  6. Payment Gateways (Back Office)
  7. Banking as a Service

AI in Banking E: Examples

  1. Credit cards
  2. The stock trading ecosystem
  3. Robots counting coins
  4. AI in Insurance: Chatbots, Customer Support
  5. Banking itself
  6. Handwriting recognition
  7. Detect leaks for insurance

AI in Banking F: As a Service

  1. Banking Services Stack
  2. Business Model
  3. Several Examples
  4. Metrics compared among examples
  5. Breadth, Depth, Reputation, Speed to Market, Scalability

7.1.7 - Cloud Computing

Cloud Computing

E534 Cloud Computing Unit

Full Slide Deck

Overall Summary

Video:

Defining Clouds I: Basic definition of cloud and two very simple examples of why virtualization is important

  1. How clouds are situated wrt HPC and supercomputers
  2. Why multicore chips are important
  3. Typical data center

Video:

Defining Clouds II: Service-oriented architectures: Software services as Message-linked computing capabilities

  1. The different aaS’s: Network, Infrastructure, Platform, Software
  2. The amazing services that Amazon AWS and Microsoft Azure have
  3. Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices
  4. Gartner hypecycle and priority matrix on Infrastructure Strategies

Video:

Defining Clouds III: Cloud Market Share

  1. How important are they?
  2. How much money do they make?

Video:

Virtualization: Virtualization Technologies, Hypervisors and the different approaches

  1. KVM Xen, Docker and Openstack

Video:

  1. Clouds physically across the world
  2. Green computing
  3. Fraction of world’s computing ecosystem in clouds and associated sizes
  4. An analysis from Cisco of size of cloud computing

Video:

Cloud Infrastructure II: Gartner hypecycle and priority matrix on Compute Infrastructure

  1. Containers compared to virtual machines
  2. The emergence of artificial intelligence as a dominant force

Video:

Cloud Software: HPC-ABDS with over 350 software packages and how to use each of 21 layers

  1. Google’s software innovations
  2. MapReduce in pictures
  3. Cloud and HPC software stacks compared
  4. Components need to support cloud/distributed system programming

Video:

Cloud Applications I: Clouds in science where area called cyberinfrastructure; the science usage pattern from NIST

  1. Artificial Intelligence from Gartner

Video:

Cloud Applications II: Characterize Applications using NIST approach

  1. Internet of Things
  2. Different types of MapReduce

Video:

Parallel Computing Analogies: Parallel Computing in pictures

  1. Some useful analogies and principles

Video:

Real Parallel Computing: Single Program/Instruction Multiple Data SIMD SPMD

  1. Big Data and Simulations Compared
  2. What is hard to do?

Video:

Storage: Cloud data approaches

  1. Repositories, File Systems, Data lakes

Video:

HPC and Clouds: The Branscomb Pyramid

  1. Supercomputers versus clouds
  2. Science Computing Environments

Video:

Comparison of Data Analytics with Simulation: Structure of different applications for simulations and Big Data

  1. Software implications
  2. Languages

Video:

The Future I: Gartner cloud computing hypecycle and priority matrix 2017 and 2019

  1. Hyperscale computing
  2. Serverless and FaaS
  3. Cloud Native
  4. Microservices
  5. Update to 2019 Hypecycle

Video:

Future and Other Issues II: Security

  1. Blockchain

Video:

Future and Other Issues III: Fault Tolerance

Video:

7.1.8 - Transportation Systems

This section discusses the transportation systems

Transportation Systems Summary

  1. The ride-hailing industry highlights the growth of a new “Transportation System” TS
    a. For ride-hailing, TS controls rides, matching drivers and customers; it predicts how to position cars and how to avoid traffic slowdowns
    b. However, TS is much bigger outside ride-hailing as we move into the “connected vehicle” era
    c. TS will probably find autonomous vehicles easier to deal with than human drivers
  2. Cloud Fog and Edge components
  3. Autonomous AI was centered on generalized image processing
  4. TS also needs AI (and DL) but this is for routing and geospatial time-series; different technologies from those for image processing

Slides

Transportation Systems A: Introduction

  1. “Smart” Insurance
  2. Fundamentals of Ride-Hailing

Transportation Systems B: Components of a Ride-Hailing System

  1. Transportation Brain and Services
  2. Maps, Routing,
  3. Traffic forecasting with deep learning

Transportation Systems C: Different AI Approaches in Ride-Hailing

  1. View as a Time Series: LSTM and ARIMA
  2. View as an image in a 2D earth surface - Convolutional networks
  3. Use of Graph Neural Nets
  4. Use of Convolutional Recurrent Neural Nets
  5. Spatio-temporal modeling
  6. Comparison of data with predictions
  7. Reinforcement Learning
  8. Formulation of General Geospatial Time-Series Problem

7.1.9 - Commerce

This section discusses Commerce

Overview

Slides

AI in Commerce A: The Old way of doing things

  1. AI in Commerce
  2. AI-First Engineering, Deep Learning
  3. E-commerce and the transformation of “Bricks and Mortar”

AI in Commerce B: AI in Retail

  1. Personalization
  2. Search
  3. Image Processing to Speed up Shopping
  4. Walmart

AI in Commerce C: The Revolution that is Amazon

  1. Retail Revolution
  2. Saves Time, Effort and Novelty with Modernized Retail
  3. Looking ahead of Retail evolution

AI in Commerce D: DLMalls e-commerce

  1. Amazon sellers
  2. Rise of Shopify
  3. Selling Products on Amazon

AI in Commerce E: Recommender Engines, Digital media

  1. Spotify recommender engines
  2. Collaborative Filtering
  3. Audio Modelling
  4. DNN for Recommender engines

7.1.10 - Python Warm Up

Python Exercise on Google Colab

Python Exercise on Google Colab

Open In Colab View in Github Download Notebook

In this exercise, we will take a look at some basic Python Concepts needed for day-to-day coding.

Check the installed Python version.

! python --version
Python 3.7.6

Simple For Loop

for i in range(10):
  print(i)
0
1
2
3
4
5
6
7
8
9

List

list_items = ['a', 'b', 'c', 'd', 'e']

Retrieving an Element

list_items[2]
'c'

Append New Values

list_items.append('f')
list_items
['a', 'b', 'c', 'd', 'e', 'f']

Remove an Element

list_items.remove('a')
list_items
['b', 'c', 'd', 'e', 'f']

Dictionary

dictionary_items = {'a':1, 'b': 2, 'c': 3}

Retrieving an Item by Key

dictionary_items['b']
2

Append New Item with Key

dictionary_items['c'] = 4
dictionary_items
{'a': 1, 'b': 2, 'c': 4}

Delete an Item with Key

del dictionary_items['a'] 
dictionary_items
{'b': 2, 'c': 4}

Comparators

x = 10
y = 20 
z = 30
x > y 
False
x < z
True
z == x
False
if x < z:
  print("This is True")
This is True
if x > z:
  print("This is True")
else:
  print("This is False")  
This is False

Arithmetic

k = x * y * z
k
6000
j = x + y + z
j
60
m = x -y 
m
-10
n = x / z
n
0.3333333333333333

Numpy

Create a Random Numpy Array

import numpy as np
a = np.random.rand(100)
a.shape
(100,)

Reshape Numpy Array

b = a.reshape(10,10)
b.shape
(10, 10)

Manipulate Array Elements

c = b * 10
c[0]
array([3.33575458, 7.39029235, 5.54086921, 9.88592471, 4.9246252 ,
       1.76107178, 3.5817523 , 3.74828708, 3.57490794, 6.55752319])
c = np.mean(b,axis=1)
c.shape
(10,)
print(c)
[0.60673061 0.4223565  0.42687517 0.6260857  0.60814217 0.66445627 
  0.54888432 0.68262262 0.42523459 0.61504903]

7.1.11 - Distributed Training for MNIST

Distributed Training for MNIST
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython notebook to solve an image classification problem with a Multi-Layer Perceptron combined with LSTM layers.

Pre-requisites

Install the following Python packages

  1. cloudmesh-installer
  2. cloudmesh-common
pip3 install cloudmesh-installer
pip3 install cloudmesh-common

Sample MLP + LSTM with Tensorflow Keras

Import Libraries

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, SimpleRNN, InputLayer, LSTM, Dropout
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.datasets import mnist
from cloudmesh.common.StopWatch import StopWatch

Download Data and Pre-Process

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")


StopWatch.start("data-pre-process")
num_labels = len(np.unique(y_train))


y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


image_size = x_train.shape[1]
x_train = np.reshape(x_train,[-1, image_size, image_size])
x_test = np.reshape(x_test,[-1, image_size, image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
StopWatch.stop("data-pre-process")

input_shape = (image_size, image_size)
batch_size = 128
units = 256
dropout = 0.2

Define Model

Here we use the Tensorflow distributed training components to train the model on multiple CPUs or GPUs. A Colab instance does not provide multiple GPUs, so for this demonstration select ‘None’ as the hardware accelerator under ‘Change runtime type’ in the Runtime menu. To run on multiple GPUs no code change is required. Learn more about distributed training.
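
Before building the model it can be useful to check which devices TensorFlow actually sees in the current runtime; this is a small sketch added here, not part of the original notebook, and its output depends on the selected Colab runtime:

import tensorflow as tf

# Physical devices visible to TensorFlow; MirroredStrategy replicates the
# model across all visible GPUs, or falls back to the CPU if none exist.
print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_physical_devices('CPU'))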

StopWatch.start("compile")
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = Sequential()
  # LSTM Layers
  model.add(LSTM(units=units,                      
                      input_shape=input_shape,
                      return_sequences=True))
  model.add(LSTM(units=units, 
                      dropout=dropout,                      
                      return_sequences=True))
  model.add(LSTM(units=units, 
                      dropout=dropout,                      
                      return_sequences=False))
  # MLP Layers
  model.add(Dense(units))
  model.add(Activation('relu'))
  model.add(Dropout(dropout))
  model.add(Dense(units))
  model.add(Activation('relu'))
  model.add(Dropout(dropout))
  # Softmax_layer
  model.add(Dense(num_labels))
  model.add(Activation('softmax'))
  model.summary()
  plot_model(model, to_file='rnn-mnist.png', show_shapes=True)
  
  print("Number of devices: {}".format(strategy.num_replicas_in_sync))

  model.compile(loss='categorical_crossentropy',
                optimizer='sgd',
                metrics=['accuracy'])
StopWatch.stop("compile")
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_6 (LSTM)                (None, 28, 256)           291840    
_________________________________________________________________
lstm_7 (LSTM)                (None, 28, 256)           525312    
_________________________________________________________________
lstm_8 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_6 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_6 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_7 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 10)                2570      
_________________________________________________________________
activation_8 (Activation)    (None, 10)                0         
=================================================================
Total params: 1,476,618
Trainable params: 1,476,618
Non-trainable params: 0
_________________________________________________________________
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of devices: 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

Train

StopWatch.start("train")
model.fit(x_train, y_train, epochs=30, batch_size=batch_size)
StopWatch.stop("train")
Epoch 1/30
469/469 [==============================] - 7s 16ms/step - loss: 2.0427 - accuracy: 0.2718
Epoch 2/30
469/469 [==============================] - 7s 16ms/step - loss: 1.6934 - accuracy: 0.4007
Epoch 3/30
469/469 [==============================] - 7s 16ms/step - loss: 1.2997 - accuracy: 0.5497
...
Epoch 28/30
469/469 [==============================] - 8s 17ms/step - loss: 0.1175 - accuracy: 0.9640
Epoch 29/30
469/469 [==============================] - 8s 17ms/step - loss: 0.1158 - accuracy: 0.9645
Epoch 30/30
469/469 [==============================] - 8s 17ms/step - loss: 0.1098 - accuracy: 0.9661

Test

StopWatch.start("evaluate")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("evaluate")

StopWatch.benchmark()
79/79 [==============================] - 3s 9ms/step - loss: 0.0898 - accuracy: 0.9719

Test accuracy: 97.2%

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.5 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.5 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mem.active          | 2.4 GiB                                                          |
| mem.available       | 10.3 GiB                                                         |
| mem.free            | 4.5 GiB                                                          |
| mem.inactive        | 5.4 GiB                                                          |
| mem.percent         | 18.6 %                                                           |
| mem.total           | 12.7 GiB                                                         |
| mem.used            | 3.3 GiB                                                          |
| platform.version    | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| python              | 3.7.10 (default, Feb 20 2021, 21:17:23)                          |
|                     | [GCC 7.5.0]                                                      |
| python.pip          | 19.3.1                                                           |
| python.version      | 3.7.10                                                           |
| sys.platform        | linux                                                            |
| uname.machine       | x86_64                                                           |
| uname.node          | b39e0899c1f8                                                     |
| uname.processor     | x86_64                                                           |
| uname.release       | 4.19.112+                                                        |
| uname.system        | Linux                                                            |
| uname.version       | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| user                | collab                                                           |
+---------------------+------------------------------------------------------------------+

+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+
| Name             | Status   |    Time |     Sum | Start               | tag   | Node         | User   | OS    | Version                             |
|------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------|
| data-load        | failed   |   0.473 |   0.473 | 2021-03-07 11:34:03 |       | b39e0899c1f8 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| data-pre-process | failed   |   0.073 |   0.073 | 2021-03-07 11:34:03 |       | b39e0899c1f8 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| compile          | failed   |   0.876 |   7.187 | 2021-03-07 11:38:05 |       | b39e0899c1f8 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| train            | failed   | 229.341 | 257.023 | 2021-03-07 11:38:44 |       | b39e0899c1f8 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| evaluate         | failed   |   2.659 |   4.25  | 2021-03-07 11:44:54 |       | b39e0899c1f8 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+

# csv,timer,status,time,sum,start,tag,uname.node,user,uname.system,platform.version
# csv,data-load,failed,0.473,0.473,2021-03-07 11:34:03,,b39e0899c1f8,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,data-pre-process,failed,0.073,0.073,2021-03-07 11:34:03,,b39e0899c1f8,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,compile,failed,0.876,7.187,2021-03-07 11:38:05,,b39e0899c1f8,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,train,failed,229.341,257.023,2021-03-07 11:38:44,,b39e0899c1f8,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,evaluate,failed,2.659,4.25,2021-03-07 11:44:54,,b39e0899c1f8,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020

Reference:

  1. Advanced Deep Learning with Keras
  2. Distributed Training with TensorFlow
  3. Keras with Tensorflow Distributed Training

7.1.12 - MLP + LSTM with MNIST on Google Colab

MLP + LSTM with MNIST on Google Colab
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem with a Multi Layer Perceptron combined with LSTM layers.

Pre-requisites

Install the following Python packages

  1. cloudmesh-installer
  2. cloudmesh-common
pip3 install cloudmesh-installer
pip3 install cloudmesh-common

Sample MLP + LSTM with Tensorflow Keras

Import Libraries

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, SimpleRNN, InputLayer, LSTM, Dropout
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.datasets import mnist
from cloudmesh.common.StopWatch import StopWatch

Download Data and Pre-Process

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")


StopWatch.start("data-pre-process")
num_labels = len(np.unique(y_train))


y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


image_size = x_train.shape[1]
x_train = np.reshape(x_train,[-1, image_size, image_size])
x_test = np.reshape(x_test,[-1, image_size, image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
StopWatch.stop("data-pre-process")

input_shape = (image_size, image_size)
batch_size = 128
units = 256
dropout = 0.2

Define Model

StopWatch.start("compile")
model = Sequential()
# LSTM Layers
model.add(LSTM(units=units,                      
                     input_shape=input_shape,
                     return_sequences=True))
model.add(LSTM(units=units, 
                     dropout=dropout,                      
                     return_sequences=True))
model.add(LSTM(units=units, 
                     dropout=dropout,                      
                     return_sequences=False))
# MLP Layers
model.add(Dense(units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
# Softmax_layer
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='rnn-mnist.png', show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
StopWatch.stop("compile")
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 28, 256)           291840    
_________________________________________________________________
lstm_1 (LSTM)                (None, 28, 256)           525312    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense (Dense)                (None, 256)               65792     
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570      
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
=================================================================
Total params: 1,476,618
Trainable params: 1,476,618
Non-trainable params: 0

Train

StopWatch.start("train")
model.fit(x_train, y_train, epochs=30, batch_size=batch_size)
StopWatch.stop("train")
469/469 [==============================] - 378s 796ms/step - loss: 2.2689 - accuracy: 0.2075

Test

StopWatch.start("evaluate")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("evaluate")

StopWatch.benchmark()
79/79 [==============================] - 1s 7ms/step - loss: 2.2275 - accuracy: 0.3120

Test accuracy: 31.2%

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.5 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.5 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mem.active          | 1.9 GiB                                                          |
| mem.available       | 10.7 GiB                                                         |
| mem.free            | 7.3 GiB                                                          |
| mem.inactive        | 3.0 GiB                                                          |
| mem.percent         | 15.6 %                                                           |
| mem.total           | 12.7 GiB                                                         |
| mem.used            | 2.3 GiB                                                          |
| platform.version    | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| python              | 3.6.9 (default, Oct  8 2020, 12:12:24)                           |
|                     | [GCC 8.4.0]                                                      |
| python.pip          | 19.3.1                                                           |
| python.version      | 3.6.9                                                            |
| sys.platform        | linux                                                            |
| uname.machine       | x86_64                                                           |
| uname.node          | 9810ccb69d08                                                     |
| uname.processor     | x86_64                                                           |
| uname.release       | 4.19.112+                                                        |
| uname.system        | Linux                                                            |
| uname.version       | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| user                | collab                                                           |
+---------------------+------------------------------------------------------------------+

+------------------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------+
| Name             | Status   |   Time |    Sum | Start               | tag   | Node         | User   | OS    | Version                             |
|------------------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------|
| data-load        | failed   |  0.61  |  0.61  | 2021-02-21 21:35:06 |       | 9810ccb69d08 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| data-pre-process | failed   |  0.076 |  0.076 | 2021-02-21 21:35:07 |       | 9810ccb69d08 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| compile          | failed   |  6.445 |  6.445 | 2021-02-21 21:35:07 |       | 9810ccb69d08 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| train            | failed   | 17.171 | 17.171 | 2021-02-21 21:35:13 |       | 9810ccb69d08 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| evaluate         | failed   |  1.442 |  1.442 | 2021-02-21 21:35:31 |       | 9810ccb69d08 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
+------------------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------+

# csv,timer,status,time,sum,start,tag,uname.node,user,uname.system,platform.version
# csv,data-load,failed,0.61,0.61,2021-02-21 21:35:06,,9810ccb69d08,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,data-pre-process,failed,0.076,0.076,2021-02-21 21:35:07,,9810ccb69d08,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,compile,failed,6.445,6.445,2021-02-21 21:35:07,,9810ccb69d08,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,train,failed,17.171,17.171,2021-02-21 21:35:13,,9810ccb69d08,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,evaluate,failed,1.442,1.442,2021-02-21 21:35:31,,9810ccb69d08,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020

Reference:

Original Source Code

7.1.13 - MNIST Classification on Google Colab

MNIST Classification on Google Colab
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures of handwritten digits (0-9).

Import Libraries

Note: https://python-future.org/quickstart.html

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist

Warm Up Exercise

Pre-process data

Load data

First we load the data from the built-in MNIST dataset that ships with Keras. The dataset is already split into training and testing data. Each split has two components: features and labels, i.e., every sample in the dataset has a corresponding label. In MNIST each training sample is an image represented as an array, and the labels are the digits 0-9.

We use x_train for the training features and y_train for the training labels; the same naming applies to the testing data.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step

Identify Number of Classes

Since this is a digit classification problem, we need to know how many classes there are, so we count the number of unique labels.

num_labels = len(np.unique(y_train))

Convert Labels To One-Hot Vector

Read more on one-hot vector.

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
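
To see what to_categorical produces, here is a minimal sketch added for illustration (not part of the original notebook): each label becomes a vector with a 1 at the index of its class and 0 elsewhere.

from keras.utils import to_categorical

print(to_categorical([0, 1, 2], num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]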

Image Reshaping

The model is designed to take each image as a flat vector; this reshaping is model dependent. Here we assume the image is square, so for MNIST the flattened input size is 28 × 28 = 784.

image_size = x_train.shape[1]
input_size = image_size * image_size

Resize and Normalize

The next step is to reshape each image into a vector and to normalize the data. Pixel values range from 0 to 255, so an easy way to normalize is to divide by the maximum value.

x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255

Create a Keras Model

Keras is a neural network library. The summary function prints a tabular summary of the model you created, and the plot_model function produces a graph of the network.

# Create Model
# network parameters
batch_size = 128
hidden_units = 512

model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='mlp-mnist.png', show_shapes=True)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 512)               401920    
_________________________________________________________________
dense_6 (Dense)              (None, 10)                5130      
_________________________________________________________________
activation_5 (Activation)    (None, 10)                0         
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

[Image: mlp-mnist.png, the network graph generated by plot_model]

Compile and Train

A Keras model needs to be compiled before it can be trained. In the compile function you specify the optimizer you want to use, the metrics you expect, and the loss function.

Here we use the Adam optimizer, a widely used optimizer for neural networks.

The loss function we use is categorical_crossentropy.

Once the model is compiled, the fit function is called with the training data, the number of epochs, and the batch size.

The batch size determines the number of samples used per minibatch when optimizing the loss.
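
As a quick sanity check (added here, not part of the original notebook), the number of optimizer steps per epoch is the training set size divided by the batch size, rounded up; with the 60,000 MNIST training images and a batch size of 128 this gives the 469 steps shown in the training log below.

import math

steps_per_epoch = math.ceil(60000 / 128)
print(steps_per_epoch)  # 469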

Note: Change the number of epochs and the batch size and see what happens.

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
469/469 [==============================] - 3s 7ms/step - loss: 0.3647 - accuracy: 0.8947





<tensorflow.python.keras.callbacks.History at 0x7fe88faf4c50>

Testing

Now we can test the trained model. Call the evaluate function with the test data and the batch size to retrieve the loss and the accuracy.

Exercise MNIST_V1.0: Try to observe the network behavior by changing the number of epochs and the batch size, and record the best accuracy you can obtain. Describe your observations in 50-100 words.

loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
79/79 [==============================] - 0s 4ms/step - loss: 0.2984 - accuracy: 0.9148

Test accuracy: 91.5%

Final Note

This program can be thought of as the hello world of deep learning. The objective of this exercise is not to teach you the depths of deep learning, but the basic concepts needed to design a simple network that solves a problem. Before running the whole code, read the instructions that precede each code section.

Homework

Solve Exercise MNIST_V1.0.

Reference:

Original Source Code

7.1.14 - MNIST With PyTorch

MNIST With PyTorch
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem with a Multi Layer Perceptron in PyTorch.

Import Libraries

import numpy as np
import torch
import torchvision
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
from torch import nn
from torch import optim
from time import time
import os
from google.colab import drive

Pre-Process Data

Here we download the data using the PyTorch data utilities and transform it with a normalization function. PyTorch provides a data loading abstraction called DataLoader, where we can set the batch size and whether to shuffle the data when loading each batch. Each DataLoader expects a PyTorch Dataset. The Dataset abstraction and DataLoader usage can be found here.

# Data transformation function 
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])

# DataSet
train_data_set = datasets.MNIST('drive/My Drive/mnist/data/', download=True, train=True, transform=transform)
validation_data_set = datasets.MNIST('drive/My Drive/mnist/data/', download=True, train=False, transform=transform)

# DataLoader
train_loader = torch.utils.data.DataLoader(train_data_set, batch_size=32, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_data_set, batch_size=32, shuffle=True)
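
As a quick check (a small addition, not part of the original notebook), we can pull one batch from the loader and inspect its shape; with batch_size=32 each batch holds 32 single-channel 28x28 images and 32 labels.

# Peek at a single batch from the training loader
images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([32, 1, 28, 28])
print(labels.shape)  # torch.Size([32])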

Define Network

Here we choose an input size that matches the network definition; either the data or the input layer must be reshaped so that the input data shape matches the network input shape. We also define a set of hidden unit sizes along with the output layer size. The output_size must match the number of labels in the classification problem, while the hidden unit sizes can be chosen depending on the problem. nn.Sequential is one way to create the network (an equivalent nn.Module formulation is sketched after the printed model below). Here we stack a set of linear layers together with a log-softmax output layer for classification.

input_size = 784
hidden_sizes = [128, 128, 64, 64]
output_size = 10

model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], hidden_sizes[2]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[2], hidden_sizes[3]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[3], output_size),
                      nn.LogSoftmax(dim=1))

                      
print(model)
Sequential(
  (0): Linear(in_features=784, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): ReLU()
  (4): Linear(in_features=128, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=64, bias=True)
  (7): ReLU()
  (8): Linear(in_features=64, out_features=10, bias=True)
  (9): LogSoftmax(dim=1)
)
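
As noted above, nn.Sequential is only one way to define this network. For comparison, here is a minimal sketch of the same architecture written as an nn.Module subclass; this is an alternative formulation added here, not part of the original notebook, and the class name is just for illustration.

import torch.nn.functional as F
from torch import nn

class MLP(nn.Module):  # hypothetical class name, for illustration only
    def __init__(self):
        super().__init__()
        # Same layer sizes as the Sequential model above
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 64)
        self.out = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        # Log-probabilities, matching the LogSoftmax output used above
        return F.log_softmax(self.out(x), dim=1)

# model = MLP() would then be a drop-in replacement for the Sequential model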

Define Loss Function and Optimizer

We use the negative log-likelihood loss (NLLLoss), which expects log-probabilities and therefore pairs with the LogSoftmax output layer defined above. Read more about Loss Functions and Optimizers supported by PyTorch.

criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003, momentum=0.9)

Train

epochs = 5

for epoch in range(epochs):
    loss_per_epoch = 0
    for images, labels in train_loader:
        images = images.view(images.shape[0], -1)
    
        # Gradients cleared per batch
        optimizer.zero_grad()
        
        # Pass input to the model
        output = model(images)
        # Calculate loss after training compared to labels
        loss = criterion(output, labels)
        
        # backpropagation 
        loss.backward()
        
        # optimizer step to update the weights
        optimizer.step()
        
        loss_per_epoch += loss.item()
    average_loss = loss_per_epoch / len(train_loader)
    print("Epoch {} - Training loss: {}".format(epoch, average_loss))
Epoch 0 - Training loss: 1.3052690227402808
Epoch 1 - Training loss: 0.33809808635317695
Epoch 2 - Training loss: 0.22927882223685922
Epoch 3 - Training loss: 0.16807103878669521
Epoch 4 - Training loss: 0.1369301250545995

Model Evaluation

Similar to the training data loader, we use the validation loader to load the data batch by batch, run the feed-forward network to get the prediction, and compare it to the label associated with each data point.

correct_predictions, all_count = 0, 0
# enumerate data from the data validation loader (loads a batch at a time)
for batch_id, (images,labels) in enumerate(validation_loader):
  for i in range(len(labels)):
    img = images[i].view(1, 784)
    # at prediction stage, only feed-forward calculation is required. 
    with torch.no_grad():
        logps = model(img)

    # Output layer of the network uses a LogSoftMax layer
    # Hence the probability must be calculated with the exponential values. 
    # The final layer returns an array of probabilities for each label
    # Pick the maximum probability and the corresponding index
    # The corresponding index is the predicted label 
    ps = torch.exp(logps)
    probab = list(ps.numpy()[0])
    pred_label = probab.index(max(probab))
    true_label = labels.numpy()[i]
    if(true_label == pred_label):
      correct_predictions += 1
    all_count += 1

print(f"Model Accuracy {(correct_predictions/all_count) * 100} %")
Model Accuracy 95.95 %
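
The per-image loop above makes the LogSoftmax/argmax logic explicit; the same evaluation can also be written batch-wise with torch.argmax. Here is a minimal sketch of that alternative, added here (not part of the original notebook) and assuming the model and validation_loader defined above:

correct_predictions, all_count = 0, 0
with torch.no_grad():
    for images, labels in validation_loader:
        # Flatten the batch and take the most probable class per row
        logps = model(images.view(images.shape[0], -1))
        predictions = torch.argmax(logps, dim=1)
        correct_predictions += (predictions == labels).sum().item()
        all_count += labels.shape[0]

print(f"Model Accuracy {(correct_predictions / all_count) * 100} %")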

Reference:

  1. Torch NN Sequential
  2. Handwritten Digit Recognition Using PyTorch — Intro To Neural Networks
  3. MNIST Handwritten Digit Recognition in PyTorch

7.1.15 - MNIST-AutoEncoder Classification on Google Colab

MNIST with AutoEncoder: Classification on Google Colab
Open In Colab View in Github Download Notebook

Prerequisites

Install the following packages

! pip3 install cloudmesh-installer
! pip3 install cloudmesh-common

Import Libraries

import tensorflow as tf
from keras.layers import Dense, Input
from keras.layers import Conv2D, Flatten
from keras.layers import Reshape, Conv2DTranspose
from keras.models import Model
from keras.datasets import mnist
from keras.utils import plot_model
from keras import backend as K

import numpy as np
import matplotlib.pyplot as plt

Download Data and Pre-Process

(x_train, y_train), (x_test, y_test) = mnist.load_data()

image_size = x_train.shape[1]
x_train = np.reshape(x_train, [-1, image_size, image_size, 1])
x_test = np.reshape(x_test, [-1, image_size, image_size, 1])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

input_shape = (image_size, image_size, 1)
batch_size = 32
kernel_size = 3
latent_dim = 16
hidden_units = [32, 64]
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step

Define Model

inputs = Input(shape=input_shape, name='encoder_input')
x = inputs
x = Dense(hidden_units[0], activation='relu')(x)
x = Dense(hidden_units[1], activation='relu')(x)

shape = K.int_shape(x)

# generate latent vector
x = Flatten()(x)
latent = Dense(latent_dim, name='latent_vector')(x)

# instantiate encoder model
encoder = Model(inputs,
                latent,
                name='encoder')
encoder.summary()
plot_model(encoder,
           to_file='encoder.png',
           show_shapes=True)


latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
x = Reshape((shape[1], shape[2], shape[3]))(x)
x = Dense(hidden_units[0], activation='relu')(x)
x = Dense(hidden_units[1], activation='relu')(x)

outputs = Dense(1, activation='relu')(x)

decoder = Model(latent_inputs, outputs, name='decoder')
decoder.summary()
plot_model(decoder, to_file='decoder.png', show_shapes=True)

autoencoder = Model(inputs,
                    decoder(encoder(inputs)),
                    name='autoencoder')
autoencoder.summary()
plot_model(autoencoder,
           to_file='autoencoder.png',
           show_shapes=True)

autoencoder.compile(loss='mse', optimizer='adam')
Model: "encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
encoder_input (InputLayer)   [(None, 28, 28, 1)]       0         
_________________________________________________________________
dense_2 (Dense)              (None, 28, 28, 32)        64        
_________________________________________________________________
dense_3 (Dense)              (None, 28, 28, 64)        2112      
_________________________________________________________________
flatten_1 (Flatten)          (None, 50176)             0         
_________________________________________________________________
latent_vector (Dense)        (None, 16)                802832    
=================================================================
Total params: 805,008
Trainable params: 805,008
Non-trainable params: 0
_________________________________________________________________
Model: "decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
decoder_input (InputLayer)   [(None, 16)]              0         
_________________________________________________________________
dense_4 (Dense)              (None, 50176)             852992    
_________________________________________________________________
reshape (Reshape)            (None, 28, 28, 64)        0         
_________________________________________________________________
dense_5 (Dense)              (None, 28, 28, 32)        2080      
_________________________________________________________________
dense_6 (Dense)              (None, 28, 28, 64)        2112      
_________________________________________________________________
dense_7 (Dense)              (None, 28, 28, 1)         65        
=================================================================
Total params: 857,249
Trainable params: 857,249
Non-trainable params: 0
_________________________________________________________________
Model: "autoencoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
encoder_input (InputLayer)   [(None, 28, 28, 1)]       0         
_________________________________________________________________
encoder (Functional)         (None, 16)                805008    
_________________________________________________________________
decoder (Functional)         (None, 28, 28, 1)         857249    
=================================================================
Total params: 1,662,257
Trainable params: 1,662,257
Non-trainable params: 0

Train

autoencoder.fit(x_train,
                x_train,
                validation_data=(x_test, x_test),
                epochs=1,
                batch_size=batch_size)
1875/1875 [==============================] - 112s 60ms/step - loss: 0.0268 - val_loss: 0.0131

<tensorflow.python.keras.callbacks.History at 0x7f3ecb2e0be0>

Test

x_decoded = autoencoder.predict(x_test)

Visualize

imgs = np.concatenate([x_test[:8], x_decoded[:8]])
imgs = imgs.reshape((4, 4, image_size, image_size))
imgs = np.vstack([np.hstack(i) for i in imgs])
plt.figure()
plt.axis('off')
plt.title('Input: 1st 2 rows, Decoded: last 2 rows')
plt.imshow(imgs, interpolation='none', cmap='gray')
plt.savefig('input_and_decoded.png')
plt.show()

7.1.16 - MNIST-CNN Classification on Google Colab

MNIST with Convolutional Neural Networks: Classification on Google Colab
Open In Colab View in Github Download Notebook

Prerequisites

Install the following packages

! pip3 install cloudmesh-installer
! pip3 install cloudmesh-common

Import Libraries

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, Flatten, AveragePooling2D
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist

Download Data and Pre-Process

(x_train, y_train), (x_test, y_test) = mnist.load_data()

num_labels = len(np.unique(y_train))

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

image_size = x_train.shape[1]
x_train = np.reshape(x_train,[-1, image_size, image_size, 1])
x_test = np.reshape(x_test,[-1, image_size, image_size, 1])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

input_shape = (image_size, image_size, 1)
print(input_shape)
batch_size = 128
kernel_size = 3
pool_size = 2
filters = 64
dropout = 0.2
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
(28, 28, 1)

Define Model

model = Sequential()
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 input_shape=input_shape,
                 padding='same'))
model.add(MaxPooling2D(pool_size))
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 input_shape=input_shape,
                 padding='same'))
model.add(MaxPooling2D(pool_size))
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 padding='same'))
model.add(MaxPooling2D(pool_size))
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu'))
model.add(Flatten())
model.add(Dropout(dropout))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='cnn-mnist.png', show_shapes=True)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_4 (Conv2D)            (None, 28, 28, 64)        640       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 14, 14, 64)        36928     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 3, 3, 64)          0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 1, 1, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 64)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
=================================================================
Total params: 112,074
Trainable params: 112,074
Non-trainable params: 0
_________________________________________________________________

Train

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# train the network
model.fit(x_train, y_train, epochs=10, batch_size=batch_size)
469/469 [==============================] - 125s 266ms/step - loss: 0.6794 - accuracy: 0.7783

<tensorflow.python.keras.callbacks.History at 0x7f35d4b104e0>

Test

loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
79/79 [==============================] - 6s 68ms/step - loss: 0.0608 - accuracy: 0.9813

Test accuracy: 98.1%

7.1.17 - MNIST-LSTM Classification on Google Colab

MNIST-LSTM Classification on Google Colab
Open In Colab View in Github Download Notebook

Pre-requisites

Install the following Python packages

  1. cloudmesh-installer
  2. cloudmesh-common
pip3 install cloudmesh-installer
pip3 install cloudmesh-common

Sample LSTM with Tensorflow Keras

Import Libraries

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, SimpleRNN, InputLayer, LSTM
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.datasets import mnist
from cloudmesh.common.StopWatch import StopWatch

Download Data and Pre-Process

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")


StopWatch.start("data-pre-process")
num_labels = len(np.unique(y_train))


y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


image_size = x_train.shape[1]
x_train = np.reshape(x_train,[-1, image_size, image_size])
x_test = np.reshape(x_test,[-1, image_size, image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
StopWatch.stop("data-pre-process")

input_shape = (image_size, image_size)
batch_size = 128
units = 256
dropout = 0.2

Define Model

StopWatch.start("compile")
model = Sequential()
model.add(LSTM(units=units,                      
                     input_shape=input_shape,
                     return_sequences=True))
model.add(LSTM(units=units, 
                     dropout=dropout,                      
                     return_sequences=True))
model.add(LSTM(units=units, 
                     dropout=dropout,                      
                     return_sequences=False))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='rnn-mnist.png', show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
StopWatch.stop("compile")
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_3 (LSTM)                (None, 28, 256)           291840    
_________________________________________________________________
lstm_4 (LSTM)                (None, 28, 256)           525312    
_________________________________________________________________
lstm_5 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                2570      
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
=================================================================
Total params: 1,345,034
Trainable params: 1,345,034
Non-trainable params: 0

Train

StopWatch.start("train")
model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
StopWatch.stop("train")
469/469 [==============================] - 378s 796ms/step - loss: 2.2689 - accuracy: 0.2075

Test

StopWatch.start("evaluate")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("evaluate")

StopWatch.benchmark()
79/79 [==============================] - 22s 260ms/step - loss: 1.9646 - accuracy: 0.3505

Test accuracy: 35.0%

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.5 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.5 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mem.active          | 1.5 GiB                                                          |
| mem.available       | 11.4 GiB                                                         |
| mem.free            | 9.3 GiB                                                          |
| mem.inactive        | 1.7 GiB                                                          |
| mem.percent         | 10.4 %                                                           |
| mem.total           | 12.7 GiB                                                         |
| mem.used            | 1.3 GiB                                                          |
| platform.version    | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| python              | 3.6.9 (default, Oct  8 2020, 12:12:24)                           |
|                     | [GCC 8.4.0]                                                      |
| python.pip          | 19.3.1                                                           |
| python.version      | 3.6.9                                                            |
| sys.platform        | linux                                                            |
| uname.machine       | x86_64                                                           |
| uname.node          | 351ef0f61c92                                                     |
| uname.processor     | x86_64                                                           |
| uname.release       | 4.19.112+                                                        |
| uname.system        | Linux                                                            |
| uname.version       | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| user                | collab                                                           |
+---------------------+------------------------------------------------------------------+

+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+
| Name             | Status   |    Time |     Sum | Start               | tag   | Node         | User   | OS    | Version                             |
|------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------|
| data-load        | failed   |   0.354 |   0.967 | 2021-02-18 15:27:21 |       | 351ef0f61c92 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| data-pre-process | failed   |   0.098 |   0.198 | 2021-02-18 15:27:21 |       | 351ef0f61c92 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| compile          | failed   |   0.932 |   2.352 | 2021-02-18 15:27:23 |       | 351ef0f61c92 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| train            | failed   | 377.842 | 377.842 | 2021-02-18 15:27:26 |       | 351ef0f61c92 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| evaluate         | failed   |  21.689 |  21.689 | 2021-02-18 15:33:44 |       | 351ef0f61c92 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+

# csv,timer,status,time,sum,start,tag,uname.node,user,uname.system,platform.version
# csv,data-load,failed,0.354,0.967,2021-02-18 15:27:21,,351ef0f61c92,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,data-pre-process,failed,0.098,0.198,2021-02-18 15:27:21,,351ef0f61c92,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,compile,failed,0.932,2.352,2021-02-18 15:27:23,,351ef0f61c92,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,train,failed,377.842,377.842,2021-02-18 15:27:26,,351ef0f61c92,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,evaluate,failed,21.689,21.689,2021-02-18 15:33:44,,351ef0f61c92,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020

Reference:

Original Source Code

7.1.18 - MNIST-MLP Classification on Google Colab

MNIST-MLP Classification on Google Colab
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem with a Multi Layer Perceptron.

Pre-requisites

Install the following Python packages

  1. cloudmesh-installer
  2. cloudmesh-common
pip3 install cloudmesh-installer
pip3 install cloudmesh-common

Sample MLP with Tensorflow Keras

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time 

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist
#import pydotplus
from keras.utils.vis_utils import model_to_dot
#from keras.utils.vis_utils import pydot


from cloudmesh.common.StopWatch import StopWatch

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")

num_labels = len(np.unique(y_train))


y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


image_size = x_train.shape[1]
input_size = image_size * image_size


x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255

batch_size = 128
hidden_units = 512
dropout = 0.45

model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(hidden_units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(hidden_units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(hidden_units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='mlp-mnist.png', show_shapes=True)

StopWatch.start("compile")
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
StopWatch.stop("compile")
StopWatch.start("train")
model.fit(x_train, y_train, epochs=5, batch_size=batch_size)
StopWatch.stop("train")

StopWatch.start("test")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("test")

StopWatch.benchmark()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 512)               401920    
_________________________________________________________________
activation (Activation)      (None, 512)               0         
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_2 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_3 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                5130      
_________________________________________________________________
activation_4 (Activation)    (None, 10)                0         
=================================================================
Total params: 1,195,018
Trainable params: 1,195,018
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
469/469 [==============================] - 14s 29ms/step - loss: 0.7886 - accuracy: 0.7334
Epoch 2/5
469/469 [==============================] - 14s 29ms/step - loss: 0.1981 - accuracy: 0.9433
Epoch 3/5
469/469 [==============================] - 14s 29ms/step - loss: 0.1546 - accuracy: 0.9572
Epoch 4/5
469/469 [==============================] - 14s 29ms/step - loss: 0.1302 - accuracy: 0.9641
Epoch 5/5
469/469 [==============================] - 14s 29ms/step - loss: 0.1168 - accuracy: 0.9663
79/79 [==============================] - 1s 9ms/step - loss: 0.0785 - accuracy: 0.9765

Test accuracy: 97.6%

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.5 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.5 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mem.active          | 1.2 GiB                                                          |
| mem.available       | 11.6 GiB                                                         |
| mem.free            | 9.8 GiB                                                          |
| mem.inactive        | 1.4 GiB                                                          |
| mem.percent         | 8.4 %                                                            |
| mem.total           | 12.7 GiB                                                         |
| mem.used            | 913.7 MiB                                                        |
| platform.version    | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| python              | 3.6.9 (default, Oct  8 2020, 12:12:24)                           |
|                     | [GCC 8.4.0]                                                      |
| python.pip          | 19.3.1                                                           |
| python.version      | 3.6.9                                                            |
| sys.platform        | linux                                                            |
| uname.machine       | x86_64                                                           |
| uname.node          | 6609095905d1                                                     |
| uname.processor     | x86_64                                                           |
| uname.release       | 4.19.112+                                                        |
| uname.system        | Linux                                                            |
| uname.version       | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| user                | collab                                                           |
+---------------------+------------------------------------------------------------------+

+-----------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------+
| Name      | Status   |   Time |    Sum | Start               | tag   | Node         | User   | OS    | Version                             |
|-----------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------|
| data-load | failed   |  0.549 |  0.549 | 2021-02-15 15:24:00 |       | 6609095905d1 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| compile   | failed   |  0.023 |  0.023 | 2021-02-15 15:24:01 |       | 6609095905d1 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| train     | failed   | 69.1   | 69.1   | 2021-02-15 15:24:01 |       | 6609095905d1 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| test      | failed   |  0.907 |  0.907 | 2021-02-15 15:25:10 |       | 6609095905d1 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
+-----------+----------+--------+--------+---------------------+-------+--------------+--------+-------+-------------------------------------+

# csv,timer,status,time,sum,start,tag,uname.node,user,uname.system,platform.version
# csv,data-load,failed,0.549,0.549,2021-02-15 15:24:00,,6609095905d1,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,compile,failed,0.023,0.023,2021-02-15 15:24:01,,6609095905d1,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,train,failed,69.1,69.1,2021-02-15 15:24:01,,6609095905d1,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,test,failed,0.907,0.907,2021-02-15 15:25:10,,6609095905d1,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020

Reference:

Original Source to Source Code

7.1.19 - MNIST-RNN Classification on Google Colab

MNIST with Recurrent Neural Networks: Classification on Google Colab
Open In Colab View in Github Download Notebook

Prerequisites

Install the following packages

! pip3 install cloudmesh-installer
! pip3 install cloudmesh-common

Import Libraries

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, SimpleRNN
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.datasets import mnist
from cloudmesh.common.StopWatch import StopWatch

Download Data and Pre-Process

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")


StopWatch.start("data-pre-process")
# the number of distinct digit classes (10 for MNIST)
num_labels = len(np.unique(y_train))


y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


# treat each 28x28 image as a sequence of 28 rows of 28 pixels, scaled to [0, 1]
image_size = x_train.shape[1]
x_train = np.reshape(x_train, [-1, image_size, image_size])
x_test = np.reshape(x_test, [-1, image_size, image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
StopWatch.stop("data-pre-process")

# network hyper-parameters
input_shape = (image_size, image_size)
batch_size = 128
units = 256
dropout = 0.2

Define Model

StopWatch.start("compile")
# three stacked SimpleRNN layers followed by a softmax output layer
model = Sequential()
model.add(SimpleRNN(units=units,
                    dropout=dropout,
                    input_shape=input_shape, return_sequences=True))
model.add(SimpleRNN(units=units,
                    dropout=dropout,
                    return_sequences=True))
model.add(SimpleRNN(units=units,
                    dropout=dropout,
                    return_sequences=False))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='rnn-mnist.png', show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
StopWatch.stop("compile")
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn (SimpleRNN)       (None, 28, 256)           72960     
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 28, 256)           131328    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 256)               131328    
_________________________________________________________________
dense (Dense)                (None, 10)                2570      
_________________________________________________________________
activation (Activation)      (None, 10)                0         
=================================================================
Total params: 338,186
Trainable params: 338,186
Non-trainable params: 0

Train

StopWatch.start("train")
model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
StopWatch.stop("train")
469/469 [==============================] - 125s 266ms/step - loss: 0.6794 - accuracy: 0.7783

<tensorflow.python.keras.callbacks.History at 0x7f35d4b104e0>

Test

StopWatch.start("evaluate")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("evaluate")

StopWatch.benchmark()
79/79 [==============================] - 7s 80ms/step - loss: 0.2581 - accuracy: 0.9181

Test accuracy: 91.8%

+---------------------+------------------------------------------------------------------+
| Attribute           | Value                                                            |
|---------------------+------------------------------------------------------------------|
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.5 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.5 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.5 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mem.active          | 1.3 GiB                                                          |
| mem.available       | 11.6 GiB                                                         |
| mem.free            | 9.7 GiB                                                          |
| mem.inactive        | 1.5 GiB                                                          |
| mem.percent         | 8.5 %                                                            |
| mem.total           | 12.7 GiB                                                         |
| mem.used            | 978.6 MiB                                                        |
| platform.version    | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| python              | 3.6.9 (default, Oct  8 2020, 12:12:24)                           |
|                     | [GCC 8.4.0]                                                      |
| python.pip          | 19.3.1                                                           |
| python.version      | 3.6.9                                                            |
| sys.platform        | linux                                                            |
| uname.machine       | x86_64                                                           |
| uname.node          | 8f16b3b1f784                                                     |
| uname.processor     | x86_64                                                           |
| uname.release       | 4.19.112+                                                        |
| uname.system        | Linux                                                            |
| uname.version       | #1 SMP Thu Jul 23 08:00:38 PDT 2020                              |
| user                | collab                                                           |
+---------------------+------------------------------------------------------------------+

+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+
| Name             | Status   |    Time |     Sum | Start               | tag   | Node         | User   | OS    | Version                             |
|------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------|
| data-load        | failed   |   0.36  |   0.36  | 2021-02-18 15:16:12 |       | 8f16b3b1f784 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| data-pre-process | failed   |   0.086 |   0.086 | 2021-02-18 15:16:12 |       | 8f16b3b1f784 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| compile          | failed   |   0.51  |   0.51  | 2021-02-18 15:16:12 |       | 8f16b3b1f784 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| train            | failed   | 126.612 | 126.612 | 2021-02-18 15:16:13 |       | 8f16b3b1f784 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
| evaluate         | failed   |   6.798 |   6.798 | 2021-02-18 15:18:19 |       | 8f16b3b1f784 | collab | Linux | #1 SMP Thu Jul 23 08:00:38 PDT 2020 |
+------------------+----------+---------+---------+---------------------+-------+--------------+--------+-------+-------------------------------------+

# csv,timer,status,time,sum,start,tag,uname.node,user,uname.system,platform.version
# csv,data-load,failed,0.36,0.36,2021-02-18 15:16:12,,8f16b3b1f784,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,data-pre-process,failed,0.086,0.086,2021-02-18 15:16:12,,8f16b3b1f784,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,compile,failed,0.51,0.51,2021-02-18 15:16:12,,8f16b3b1f784,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,train,failed,126.612,126.612,2021-02-18 15:16:13,,8f16b3b1f784,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020
# csv,evaluate,failed,6.798,6.798,2021-02-18 15:18:19,,8f16b3b1f784,collab,Linux,#1 SMP Thu Jul 23 08:00:38 PDT 2020


Reference:

Original Source to Source Code

8 - Big Data Applications

Here you will find a number of modules and components for introducing you to big data applications.

Big Data Applications are an important topic that has impact in academia and industry.

8.1 - 2020

Here you will find a number of modules and components for introducing you to big data applications.

Big Data Applications are an important topic that has impact in academia and industry.

8.1.1 - Introduction to AI-Driven Digital Transformation

The Full Introductory Lecture with introduction to and Motivation for Big Data Applications and Analytics Class

Overview

This Lecture is recorded in 8 parts and gives an introduction and motivation for the class. This and other lectures in class are divided into “bite-sized lessons” from 5 to 30 minutes in length; that’s why it has 8 parts.

The lecture explains what students might gain from the class even if they end up in different types of jobs, from data engineering, software engineering, and data science to being a business (application) expert. It stresses that we are well into a transformation that impacts industry, research, and the way life is lived. This transformation is centered on using the digital way, with clouds, edge computing, and deep learning giving the implementation. This “AI-Driven Digital Transformation” is as transformational as the Industrial Revolution of the past. We note that deep learning dominates most innovative AI, replacing several traditional machine learning methods.

The slides for this course can be found at E534-Fall2020-Introduction

A: Getting Started: BDAA Course Introduction Part A: Big Data Applications and Analytics

This lesson briefly describes the trends driving, and resulting from, the AI-Driven Digital Transformation. It discusses the organizational aspects of the class and notes that the two driving trends are clouds and AI. Clouds are mature and a dominant presence. AI is still rapidly changing and we can expect further major changes. The edge (devices and associated local fog computing) has always been important, but now more is being done there.

B: Technology Futures from Gartner’s Analysis: BDAA Course Introduction Part B: Big Data Applications and Analytics

This lesson goes through the technologies (AI, Edge, Cloud) from 2008-2020 that are driving the AI-Driven Digital Transformation. We use Hype Cycles and Priority Matrices from Gartner, tracking important concepts from the Innovation Trigger and Peak of Inflated Expectations through the Plateau of Productivity. We contrast clouds and AI.

  • This gives illustrations of sources of big data.
  • It gives key graphs of data sizes, images uploaded; computing, data, bandwidth trends;
  • Cloud-Edge architecture.
  • Intelligent machines and a comparison of data from aircraft engine monitors with Twitter
  • Multicore revolution
  • Overall Global AI and Modeling Supercomputer GAIMSC
  • Moore's Law compared to Deep Learning computing needs
  • Intel and NVIDIA status

E: Big Data and Science: BDAA Course Introduction Part E: Big Data Applications and Analytics

  • Applications and Analytics
  • Cyberinfrastructure, e-moreorlessanything.
  • LHC, Higgs Boson and accelerators.
  • Astronomy, SKA, multi-wavelength.
  • Polar Grid.
  • Genome Sequencing.
  • Examples, Long Tail of Science.
  • Wired’s End of Science; the 4 paradigms.
  • More data versus Better algorithms.

F: Big Data Systems: BDAA Course Introduction Part F: Big Data Applications and Analytics

  • Clouds, Service-oriented architectures, HPC (High Performance Computing), Apache Software
  • DIKW process illustrated by Google maps
  • Raw data to Information/Knowledge/Wisdom/Decision; Deluge from the Edge
  • Parallel Computing
  • Map Reduce

G: Industry Transformation: BDAA Course Introduction Part G: Big Data Applications and Analytics

AI grows in importance and industries transform with

  • Core Technologies related to
  • New “Industries” over the last 25 years
  • Traditional “Industries” Transformed; malls and other old industries transform
  • Good to be master of Cloud Computing and Deep Learning
  • AI-First Industries,

H: Jobs and Conclusions: BDAA Course Introduction Part H: Big Data Applications and Analytics

  • Job trends
  • Become digitally savvy so you can take advantage of the AI/Cloud/Edge revolution with different jobs
  • The qualitative idea of Big Data has turned into a quantitative realization as Cloud, Edge and Deep Learning
  • Clouds are here to stay and one should plan on exploiting them
  • Data Intensive studies in business and research continue to grow in importance

8.1.2 - BDAA Fall 2020 Course Lectures and Organization

Updated On an ongoing Basis

Week 1

This first class discussed overall issues and did the first ~40% of the introductory slides. This presentation is also available as 8 recorded presentations under Introduction to AI-Driven Digital Transformation

Administrative topics

The following topics were addressed

  • Homework
  • Difference between undergrad and graduate requirements
  • Contact
  • Communication via Piazza

If you have questions please post them on Piazza.

Assignment 1

  • Post a professional three paragraph Bio on Piazza. Please post it under the folder bio. Use as subject “Bio: Lastname, Firstname”. Research what a professional Biography is. Remember to write it in 3rd person and focus on professional activities. Look up the Bios from Geoffrey or Gregor as examples.
  • Write report described in Homework 1
  • Please study recorded lectures either in zoom or in Introduction to AI-Driven Digital Transformation

Week 2

This did the remaining 60% of the introductory slides. This presentation is also available as 8 recorded presentations

Student questions were answered

Video and Assignment

These introduce Colab with examples and a Homework using Colab for deep learning. Please study videos and do homework.

Week 3

This lecture reviewed where we had got to and introduced the new Cybertraining web site. Then we gave an overview of the use case lectures which are to be studied this week. The use case overview slides are available as Google Slides


Videos

Please study Big Data Use Cases Survey

Big Data in pictures

Collage of Big Data Players


Software systems of importance through early 2016. This collection was stopped due to rapid change but categories and entries are still valuable. We call this HPC-ABDS for High Performance Computing Enhanced Apache Big Data Stack

HPC-ABDS

HPC-ABDS Global AI Supercomputer compared to classic cluster.

AI Supercomputer

Six Computational Paradigms for Data Analytics

6 System Data Architectures

Features that can be used to distinguish and group together applications in both data and computational science

Facets

Week 4

We surveyed next week's videos, which describe the search for the Higgs Boson and the statistics methods used in the analysis of such counting experiments.

The Higgs Boson slides are available as Google Slides


Videos for Week 4

Please study Discovery of Higgs Boson

Week 5

This week’s class and its zoom video cover two topics

  • Discussion of Final Project for Class and use of markdown text technology based on slides Course Project.
  • Summary of Sports Informatics Module based on slides Sports Summary.


Videos for Week 5

Please Study Sports as Big Data Analytics

Week 6

This week’s video class recorded the first part of Google Slides and emphasizes that these lectures are partly aimed at suggesting projects.


This class started with a review of applications for AI enhancement.

plus those covered Spring 2020 but not updated this semester

We focus on Health and Medicine with a summary talk

Videos for Week 6

See module on Health and Medicine

Week 7

This week’s video class recorded the second part of Google Slides.


Videos for Week 7

Continue module on Health and Medicine

Week 8

We discussed projects with current list https://docs.google.com/document/d/13TZclzrWvkgQK6-8UR-LBu_LkpRsbiQ5FG1p--VZKZ8/edit?usp=sharing

This week’s video class recorded the first part of Google Slides.


Videos for Week 8

Module on Cloud Computing

Week 9

We discussed the use of GitHub for projects (the recording missed a small part of this) and continued the discussion of cloud computing but did not finish the slides yet.

This week’s video class recorded the second part of Google Slides.


Videos for Week 9

Continue work on project and complete study of videos already assigned. If interesting to you, please review videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don’t forget the participation grade from GitHub activity each week

Week 10

We discussed use of GitHub for projects and finished summary of cloud computing.

This week’s video class recorded the last part of Google Slides.


Videos for Week 10

Continue work on project and complete study of videos already assigned. If interesting to you, please review videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don’t forget the participation grade from GitHub activity each week

Week 11

This week's video class went through project questions.


Videos for Week 11

Continue work on project and complete study of videos already assigned. If interesting to you, please review videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don’t forget the participation grade from GitHub activity each week

Week 12

This week's video class discussed deep learning for Time Series. There are Google Slides for this


Videos for Week 12

Continue work on project and complete study of videos already assigned. If interesting to you, please review videos on AI in Banking, Space and Energy, Transportation Systems, Mobility (Industry), and Commerce. Don’t forget the participation grade from GitHub activity each week.

Week 13

This week's video class went through project questions.


The class is formally finished. Please submit your homework and project.

Week 14

This week's video class went through project questions.


The class is formally finished. Please submit your homework and project.

Week 15

This week's video class was a technical presentation on “Deep Learning for Images”. There are Google Slides


8.1.3 - Big Data Use Cases Survey

4 Lectures on Big Data Use Cases Survey

This unit has four lectures (slide decks). The survey is 6 years old, but the illustrative scope of Big Data Applications is still valid and has no better alternative. The problems and the use of clouds have not changed. There have been algorithmic advances (deep learning) in some cases. The lectures are

    1. Overview of NIST Process
    2. The 51 Use cases divided into groups
    3. Common features of the 51 Use Cases
    4. 10 Patterns of data – computer – user interaction seen in Big Data Applications

There is an overview of these lectures below. The use case overview slides recorded here are available as Google Slides


Lecture set 1. Overview of NIST Big Data Public Working Group (NBD-PWG) Process and Results

This is the first of 4 lectures on Big Data Use Cases. It describes the process by which NIST produced this survey

Presentation or Google Slides

Use Case 1-1 Introduction to NIST Big Data Public Working Group

The focus of the NBD-PWG is to form a community of interest from industry, academia, and government, with the goal of developing a consensus definition, taxonomies, secure reference architectures, and a technology roadmap. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables that enable big data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value-added from big data service providers and the flow of data between the stakeholders in a cohesive and secure manner.

Introduction (13:02)

Use Case 1-2 Definitions and Taxonomies Subgroup

The focus is to gain a better understanding of the principles of Big Data. It is important to develop a consensus-based common language and vocabulary of terms used in Big Data across stakeholders from industry, academia, and government. In addition, it is also critical to identify essential actors with roles and responsibilities, and subdivide them into components and sub-components according to how they interact/relate with each other based on their similarities and differences. For Definitions: compile terms used by all stakeholders regarding the meaning of Big Data from various standards bodies, domain applications, and diversified operational environments. For Taxonomies: identify key actors with their roles and responsibilities from all stakeholders, and categorize them into components and subcomponents based on their similarities and differences. In particular, Data Science and Big Data terms are discussed.

Taxonomies (7:42)

Use Case 1-3 Reference Architecture Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus-based approach to orchestrate vendor-neutral, technology- and infrastructure-agnostic analytics tools and computing environments. The goal is to enable Big Data stakeholders to pick and choose technology-agnostic analytics tools for processing and visualization on any computing platform and cluster, while allowing value-added from Big Data service providers and the flow of data between the stakeholders in a cohesive and secure manner. Results include a reference architecture with well-defined components and linkage as well as several exemplars.

Architecture (10:05)

Use Case 1-4 Security and Privacy Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture to handle security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifying which key organizations are working on these standards. The Top Ten Big Data Security and Privacy Challenges from the CSA (Cloud Security Alliance) BDWG are studied. Specialized use cases include Retail/Marketing, Modern Day Consumerism, Nielsen Homescan, Web Traffic Analysis, Healthcare, Health Information Exchange, Genetic Privacy, Pharma Clinical Trial Data Sharing, Cyber-security, Government, Military and Education.

Security (9:51)

Use Case 1-5 Technology Roadmap Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus vision with recommendations on how Big Data should move forward by performing a good gap analysis through the materials gathered from all other NBD subgroups. This includes setting standardization and adoption priorities through an understanding of what standards are available or under development as part of the recommendations. Tasks are to gather input from the NBD subgroups and study the taxonomies for the actors' roles and responsibilities, use cases and requirements, and the secure reference architecture; gain an understanding of what standards are available or under development for Big Data; perform a thorough gap analysis and document the findings; identify what possible barriers may delay or prevent the adoption of Big Data; and document the vision and recommendations.

Technology (4:14)

Use Case 1-6 Requirements and Use Case Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze/prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the essence of the use cases (not done yet); and work with the Reference Architecture subgroup to validate requirements and the reference architecture by explicitly implementing some patterns based on use cases. The progress of gathering use cases (discussed in the next two units) and requirements systemization are discussed.

Requirements (27:28)

Use Case 1-7 Recent Updates of work of NIST Public Big Data Working Group

This video is an update of recent work in this area. The first slide of this short lesson discusses a new version of use case survey that had many improvements including tags to label key features (as discussed in slide deck 3) and merged in a significant set of security and privacy fields. This came from the security and privacy working group described in lesson 4 of this slide deck. A link for this new use case form is https://bigdatawg.nist.gov/_uploadfiles/M0621_v2_7345181325.pdf

A recent December 2018 use case form for Astronomy’s Square Kilometer Array is at https://docs.google.com/document/d/1CxqCISK4v9LMMmGox-PG1bLeaRcbAI4cDIlmcoRqbDs/edit?usp=sharing This uses a simplification of the official new form.

The second (last) slide in the update gives some useful information on the latest work. NIST's latest published work is at https://bigdatawg.nist.gov/V3_output_docs.php Related activities are described at http://hpc-abds.org/kaleidoscope/

Lecture set 2: 51 Big Data Use Cases from NIST Big Data Public Working Group (NBD-PWG)

Presentation or Google Slides

Use Case 2-1 Government Use Cases

This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).

Government Use Cases (17:43)

Use Case 2-2 Commercial Use Cases

This covers Cloud Eco-System for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

This lesson is divided into 3 separate videos

Part 1

(9:31)

Part 2

(19:45)

Part 3

(10:48)

Use Case 2-3 Defense Use Cases

This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance and Intelligence Data Processing and Analysis.

Defense Use Cases (15:43)

Use Case 2-4 Healthcare and Life Science Use Cases

This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

Healthcare and Life Science Use Cases (30:11)

Use Case 2-5 Deep Learning and Social Networks Use Cases

This covers Large-scale Deep Learning; Organizing large-scale, unstructured collections of consumer photos; Truthy: Information diffusion research from Twitter Data; Crowd Sourcing in the Humanities as Source for Big and Dynamic Data; CINET: Cyberinfrastructure for Network (Graph) Science and Analytics and NIST Information Access Division analytic technology performance measurement, evaluations, and standards.

Deep Learning and Social Networks Use Cases (14:19)

Use Case 2-6 Research Ecosystem Use Cases

DataNet Federation Consortium DFC; The ‘Discinnet process’, metadata -big data global experiment; Semantic Graph-search on Scientific Chemical and Text-based Data and Light source beamlines.

Research Ecosystem Use Cases (9:09)

Use Case 2-7 Astronomy and Physics Use Cases

This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.

Astronomy and Physics Use Cases (17:33)

Use Case 2-8 Environment, Earth and Polar Science Use Cases

EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, Data Product Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE’s NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area and DOE-BER AmeriFlux and FLUXNET Networks.

Environment, Earth and Polar Science Use Cases (25:29)

Use Case 2-9 Energy Use Case

This covers Consumption forecasting in Smart Grids.

Energy Use Case (4:01)

Lecture set 3: Features of 51 Big Data Use Cases from the NIST Big Data Public Working Group (NBD-PWG)

This unit discusses the categories used to classify the 51 use-cases. These categories include concepts used for parallelism and low and high level computational structure. The first lesson is an introduction to all categories and the further lessons give details of particular categories.

Presentation or Google Slides

Use Case 3-1 Summary of Use Case Classification

This discusses concepts used for parallelism and low and high level computational structure. Parallelism can be over People (users or subjects), Decision makers; Items such as Images, EMR, Sequences; observations, contents of online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points. Low level computational types include PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (iterative MapReduce); Graph; Fusion; MC (Monte Carlo) and Streaming. High level computational types include Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation maximization); GIS; HPC; Agents. Patterns include Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; Crowd Sourcing.

Summary of Use Case Classification (23:39)

Use Case 3-2 Database(SQL) Use Case Classification

This discusses the classic (SQL) database approach to data handling with Search&Query and Index features. Comparisons are made to NoSQL approaches.

Database (SQL) Use Case Classification (11:13)

Use Case 3-3 NoSQL Use Case Classification

This discusses NoSQL (compared in the previous lesson) with HDFS, Hadoop and HBase. The Apache Big Data stack is introduced and further details of the comparison with SQL are given.

NoSQL Use Case Classification (11:20)

Use Case 3-4 Other Use Case Classifications

This discusses a subset of use case features: GIS and Sensors, and the support of data analysis and fusion by streaming data between filters.

Use Case Classifications I (12:42)

Use Case 3-5

This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.

Case Classifications II (20:18)

Use Case 3-6

This discusses the classification, PP, Fusion, EGO, HPC, GIS, Agent, MC, PP, MR, Expectation maximization and benchmarks.

Use Case 3-7 Other Benchmark Sets and Classifications

This video looks at several efforts to divide applications into categories of related applications. It includes “Computational Giants” from the National Research Council; Linpack or HPL from the HPC community; the NAS Parallel Benchmarks from NASA; and finally the Berkeley Dwarfs from UCB. The second part of this video describes efforts in the Digital Science Center to develop a Big Data classification and to unify Big Data and simulation categories. This leads to the Ogre and Convergence Diamonds. Diamonds have facets representing the different aspects by which we classify applications. See http://hpc-abds.org/kaleidoscope/

Lecture set 4. The 10 Use Case Patterns from the NIST Big Data Public Working Group (NBD-PWG)

Presentation or Google Slides

In this last slide deck of the use cases unit, we will be focusing on 10 Use case patterns. This includes multi-user querying, real-time analytics, batch analytics, data movement from external data sources, interactive analysis, data visualization, ETL, data mining and orchestration of sequential and parallel data transformations. We go through the different ways the user and system interact in each case. The use case patterns are divided into 3 classes 1) initial examples 2) science data use case patterns and 3) remaining use case patterns.

Resources

Some of the links below may be outdated. Please let us know the new links and notify us of the outdated links.

8.1.4 - Physics

Week 4: Big Data applications and Physics

E534 2020 Big Data Applications and Analytics Discovery of Higgs Boson

Summary: This section of the class is devoted to a particular Physics experiment but uses this to discuss so-called counting experiments. Here one observes “events” that occur randomly in time and one studies the properties of the events; in particular, whether the events are collections of subatomic particles coming from the decay of particles from a “Higgs Boson” produced in high-energy accelerator collisions. The four video lecture sets (Parts I, II, III, IV) start by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms, and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal, various backgrounds, and various event totals. Then random variables and some simple principles of statistics are introduced with an explanation as to why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random Numbers with their Generators and Seeds lead to a discussion of the Binomial and Poisson Distributions, and of Monte-Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

Colab Notebooks for Physics Usecases

For this lecture, we will be using the following Colab Notebooks along with the following lecture materials. They will be referenced in the corresponding sections.

  1. Notebook A
  2. Notebook B
  3. Notebook C
  4. Notebook D

Looking for Higgs Particle Part I : Bumps in Histograms, Experiments and Accelerators

This unit is devoted to Python and Java experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal and various backgrounds and with various event totals. The lectures use Python but the use of Java is described. Students today can ignore Java!

Slides {20 slides}

Looking for Higgs Particle and Counting Introduction 1

We return to the particle case with slides used in the introduction and stress that particles often manifest as bumps in histograms, and that those bumps need to be large enough to stand out from the background in a statistically significant fashion.

Video:

{slides1-5}

Looking for Higgs Particle II Counting Introduction 2

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion

Video:

{slides 6-8}

Experimental Facilities

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

Video:

{slides 9-14}

This lesson gives a small picture gallery of accelerators: accelerators, detection chambers and magnets in tunnels, and a large underground laboratory used for experiments where you need to be shielded from backgrounds such as cosmic rays.

{slides 14-20}

Resources

http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf

http://www.sciencedirect.com/science/article/pii/S037026931200857X

http://www.nature.com/news/specials/lhc/interactive.html

Looking for Higgs Particles Part II: Python Event Counting for Signal and Background

This unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal and various backgrounds and with various event totals.

Slides {1-29 slides}

Class Software

We discuss Python on either a backend server (FutureGrid, now closed) or a local client. We point out a useful book on Python for data analysis.

{slides 1-10}

Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 1 The background

Event Counting

We define event counting for data collection environments. We discuss the Python and Java code to generate events according to a particular scenario (the important idea of Monte Carlo data). Here we use a sloping background plus either a Higgs particle generated similarly to the LHC observation or one observed with better resolution (smaller measurement error).

{slides 11-14}
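
As a rough illustration of the Monte Carlo event generation described above (a hedged sketch, not the class notebook; the 126 GeV mass, widths, and event counts below are assumptions for illustration only), one could generate and histogram such pseudo-data with NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)                 # fixed seed for reproducibility

# illustrative (assumed) parameters, not taken from the class notebook
n_background = 20000                            # sloping-background events
n_signal = 300                                  # Higgs-like signal events
mass_lo, mass_hi = 100.0, 150.0                 # histogram range in GeV
higgs_mass, higgs_width = 126.0, 2.0            # GeV

# linearly falling ("sloping") background generated by accept-reject
m = rng.uniform(mass_lo, mass_hi, size=4 * n_background)
keep = rng.uniform(0, 1, size=m.size) < (mass_hi - m) / (mass_hi - mass_lo)
background = m[keep][:n_background]

# Gaussian signal bump at the assumed Higgs mass
signal = rng.normal(higgs_mass, higgs_width, size=n_signal)

plt.hist(np.concatenate([background, signal]),
         bins=np.arange(mass_lo, mass_hi + 2.0, 2.0), histtype='step')
plt.xlabel('Mass (GeV)')
plt.ylabel('Events per 2 GeV bin')
plt.title('Toy counting experiment: sloping background plus Gaussian signal')
plt.show()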

Examples of Event Counting I with Python Examples of Signal and Background

This uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and the measurement resolution for the Higgs.

{slides 15-23}

Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 1,2,3,4,6,7

Examples of Event Counting II: Change shape of background and number of Higgs Particles produced in experiment

This lesson continues the examination of Monte Carlo data looking at the effect of change in the number of Higgs particles produced and in the change in the shape of the background.

{slides 25-29}

Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 5- Part 6

Refer to B: Studying Higgs Boson Analysis. Signal and Background

Resources

Python for Data Analysis: Agile Tools for Real-World Data By Wes McKinney, Publisher: O’Reilly Media, Released: October 2012, Pages: 472.

http://jwork.org/scavis/api/

https://en.wikipedia.org/wiki/DataMelt

Looking for Higgs Part III: Random variables, Physics and Normal Distributions

We introduce random variables and some simple principles of statistics and explain why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is not discussed in this unit.

Slides {slides 1-39}

Statistics Overview and Fundamental Idea: Random Variables

We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.

{slides 1-6}

Physics and Random Variables

We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what’s going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

Part 1

{slides 6-9}

Part 2

{slides 10-12}

Statistics of Events with Normal Distributions

We introduce Poisson and Binomial distributions and define independent identically distributed (IID) random variables. We give the law of large numbers defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.

{slides 13-19}

Refer to C: Gaussian Distributions and Counting Experiments, Part 1
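
As a minimal sketch of the counting-error idea in this lesson (the expected count of 100 is an assumed, illustrative number): Poisson counts with mean N fluctuate with a standard deviation of roughly sqrt(N), which is why large counts look Gaussian.

import numpy as np

rng = np.random.default_rng(0)
mean_counts = 100                       # assumed expected events per bin
counts = rng.poisson(mean_counts, size=100_000)

print("sample mean:", counts.mean())    # close to 100
print("sample std: ", counts.std())     # close to sqrt(100) = 10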

Gaussian Distributions

We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.

{slides 21-32}

Refer to C: Gaussian Distributions and Counting Experiments, Part 2

Using Statistics

We discuss the significance of a standard deviation and the role of biases and insufficient statistics, with a Python example of getting incorrect answers.

{slides 33-39}

Refer to C: Gaussian Distributions and Counting Experiments, Part 3
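
A minimal, hedged sketch of how such a significance estimate might look (the event counts are illustrative assumptions, not taken from the class notebooks): the excess of events is compared to the Poisson error on the background.

import numpy as np

# illustrative (assumed) numbers: events counted in the signal region of a histogram
background_events = 2000          # expected background under the bump
signal_events = 300               # excess events attributed to the signal

# the Poisson counting error on the background is roughly sqrt(N)
sigma = np.sqrt(background_events)
significance = signal_events / sigma
print(f"significance ~ {significance:.1f} standard deviations")   # about 6.7 sigma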

Resources

http://indico.cern.ch/event/20453/session/6/contribution/15?materialId=slides http://www.atlas.ch/photos/events.html (this link is outdated) https://cms.cern/

Looking for Higgs Part IV: Random Numbers, Distributions and Central Limit Theorem

We discuss Random Numbers with their Generators and Seeds. It introduces the Binomial and Poisson Distributions. Monte-Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes' law conclude the discussion. Python and Java (for the student; not reviewed in class) examples and Physics applications are given.

Slides {slides 1-44}

Generators and Seeds

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to determine how to start the generation.

Part 1

{slides 5-6}

Part 2

{slides 7-13}

Refer to D: Random Numbers, Part 1

Refer to C: Gaussian Distributions and Counting Experiments, Part 4
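
A minimal sketch of the seed idea, assuming NumPy's default generator (not the class notebook code): the same seed reproduces the same sequence, while a different seed gives a different one.

import numpy as np

# same seed -> identical "random" sequence; different seed -> a different sequence
a = np.random.default_rng(seed=1234).uniform(size=3)
b = np.random.default_rng(seed=1234).uniform(size=3)
c = np.random.default_rng(seed=5678).uniform(size=3)

print(np.allclose(a, b))    # True  - reproducible
print(np.allclose(a, c))    # False - a different stream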

Binomial Distribution

We define the binomial distribution and give LHC data as an example of where this distribution is valid.

{slides 14-22}

Accept-Reject Methods for generating Random (Monte-Carlo) Events

We introduce an advanced method, accept/reject, for generating random variables with arbitrary distributions.

{slides 23-27}

Refer to A: Studying Higgs Boson Analysis. Signal and Background, Part 1
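
A minimal sketch of the accept-reject idea (the pdf shape, range, and bound below are illustrative assumptions, not the class notebook): propose points uniformly and keep each one with probability proportional to the target density.

import numpy as np

rng = np.random.default_rng(7)

def accept_reject(pdf, lo, hi, pdf_max, n_samples):
    # sample from an arbitrary (unnormalized) pdf on [lo, hi] by accept-reject
    samples = []
    while len(samples) < n_samples:
        x = rng.uniform(lo, hi)                  # propose uniformly in the range
        if rng.uniform(0, pdf_max) < pdf(x):     # keep with probability pdf(x)/pdf_max
            samples.append(x)
    return np.array(samples)

# example: a Gaussian bump on a flat background (illustrative shape only)
pdf = lambda x: 1.0 + 5.0 * np.exp(-0.5 * ((x - 126.0) / 2.0) ** 2)
events = accept_reject(pdf, 100.0, 150.0, pdf_max=6.0, n_samples=1000)
print(events[:5])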

Monte Carlo Method

We define the Monte Carlo method, which typically uses the accept/reject method to sample from the required distributions.

{slides 27-28}

Poisson Distribution

We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.

{slides 30-33}

Central Limit Theorem

We introduce the Central Limit Theorem and give examples from Wikipedia.

{slides 35-37}
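
A minimal sketch of the Central Limit Theorem idea (the sample sizes are illustrative): the mean of n uniform random numbers becomes increasingly Gaussian, with a standard deviation shrinking like 1/sqrt(n).

import numpy as np

rng = np.random.default_rng(3)

# the mean of n uniform(0, 1) draws has std 1/sqrt(12 n) and looks ever more Gaussian
for n in (1, 2, 10, 100):
    means = rng.uniform(size=(100_000, n)).mean(axis=1)
    print(f"n={n:4d}  std of mean = {means.std():.4f}  (CLT predicts {1/np.sqrt(12*n):.4f})")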

Interpretation of Probability: Bayes v. Frequency

This lesson describes the difference between the Bayes and frequency views of probability. Bayes' law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.

{slides 38-44}

Refer to C: Gaussian Distributions and Counting Experiments, Part 5
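
A minimal numeric sketch of Bayes' law (the probabilities below are made-up illustrations, not Higgs numbers): the posterior for a hypothesis H given data combines the prior with the likelihoods under H and not-H.

# Bayes' law: P(H | data) = P(data | H) * P(H) / P(data)
p_h = 0.01                    # assumed prior probability of hypothesis H
p_data_given_h = 0.9          # assumed likelihood of the data if H is true
p_data_given_not_h = 0.05     # assumed likelihood of the data if H is false

p_data = p_data_given_h * p_h + p_data_given_not_h * (1 - p_h)
p_h_given_data = p_data_given_h * p_h / p_data
print(round(p_h_given_data, 3))   # about 0.154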

Homework 3 (Posted on Canvas)

The use case analysis that you were asked to watch in week 3 described over 50 use cases based on templates filled in by users. This homework has two choices.

  1. Consider the existing SKA Template here. Use this plus web resources such as this to write a 3-page description of the science goals, current progress, and big data challenges of the SKA

OR

  1. Here is a Blank use case Template (make your own copy). Choose any Big Data use case (including those in the videos, but address the 2020 rather than the 2013 situation) and fill in ONE new use case template producing about 2 pages of new material (summed over answers to questions)

Homework 4 (Posted on Canvas)

Consider Physics Colab A “Part 4 Error Estimates” and Colab D “Part 2 Varying Seed (randomly) gives distinct results”

Consider a Higgs of measured width 0.5 GeV (narrowGauss and narrowTotal in Part A) and use the analysis of A Part 4 to estimate the difference between signal and background compared to the expected error (standard deviation)

Run 3 different random number choices (as in D Part 2) to show how conclusions change

We recommend changing the bin size in A Part 4 from 2 GeV to 1 GeV, so that the Higgs signal will be in two bins that you can add (equivalently, use 2 GeV histogram bins, shifting the origin so that the Higgs mass of 126 GeV is in the center of a bin)

Suppose you keep background unchanged and reduce the Higgs signal by a factor of 2 (300 to 150 events). Can Higgs still be detected?

8.1.5 - Introduction to AI in Health and Medicine

This section discusses the health and medicine sector

Overview

This module discusses AI and the digital transformation for the Health and Medicine area with a special emphasis on COVID-19 issues. We cover both the impact of COVID and some of the many activities that are addressing it. Parts B and C have an extensive general discussion of AI in Health and Medicine.

The complete presentation is available at Google Slides while the videos are a YouTube playlist

Part A: Introduction

This lesson describes some overarching issues including the

  • Summary in terms of Hype Cycles
  • Players in the digital health ecosystem and in particular the role of Big Tech, which has the needed AI expertise and infrastructure, from clouds to smart watches/phones
  • Views of Patients and Doctors on New Technology
  • Role of clouds. This is essentially assumed throughout the presentation but not stressed.
  • Importance of Security
  • Introduction to the Internet of Medical Things; this area is discussed in more detail later in the presentation

slides

Part B: Diagnostics

This highlights some diagnostic applications of AI and the digital transformation. Part C also has some diagnostic coverage, especially particular applications.

  • General use of AI in Diagnostics
  • Early progress in diagnostic imaging including Radiology and Ophthalmology
  • AI In Clinical Decision Support
  • Digital Therapeutics is a recognized and growing activity area

slides

Part C: Examples

This lesson covers a broad range of AI uses in Health and Medicine

  • Flagging issues requiring urgent attention and, more generally, AI for Precision Medicine
  • Oncology and cancer have made early progress as they exploit AI for images, avoiding mistakes and diagnosing curable cervical cancer in developing countries with less screening.
  • Predicting Gestational Diabetes
  • cardiovascular diagnostics and AI to interpret and guide Ultrasound measurements
  • Robot Nurses and robots to comfort patients
  • AI to guide cosmetic surgery measuring beauty
  • AI in the analysis of DNA in blood tests
  • AI For Stroke detection (large vessel occlusion)
  • AI monitoring of breathing to flag opioid-induced respiratory depression.
  • AI to relieve the administrative burden, including voice-to-text for Doctors' notes
  • AI in consumer genomics
  • Areas that are slow including genomics, Consumer Robotics, Augmented/Virtual Reality and Blockchain
  • AI analysis of information resources flags problems earlier
  • Internet of Medical Things applications from watches to toothbrushes

slides

Part D: Impact of Covid-19

This covers some aspects of the impact of the COVID-19 pandemic, starting in March 2020

  • The features of the first stimulus bill
  • Impact on Digital Health, Banking, Fintech, Commerce – bricks and mortar, e-commerce, groceries, credit cards, advertising, connectivity, tech industry, Ride Hailing and Delivery,
  • Impact on Restaurants, Airlines, Cruise lines, general travel, Food Delivery
  • Impact of working from home and videoconferencing
  • The economy and the often positive trends for the tech industry

slides

Part E: Covid-19 and Recession

This is largely outdated, as it centered on the start of the pandemic-induced recession, and we now know what actually happened. The pandemic probably accelerated the transformation of industry and the use of AI.

slides

Part F: Tackling Covid-19

This discusses some of the AI and digital methods used to understand and reduce the impact of COVID-19

  • Robots for remote patient examination
  • computerized tomography scan + AI to identify COVID-19
  • Early activities of Big Tech and COVID
  • Other early biotech activities with COVID-19
  • Remote-work technology: Hopin, Zoom, Run the World, FreeConferenceCall, Slack, GroWrk, Webex, Lifesize, Google Meet, Teams
  • Vaccines
  • Wearables and Monitoring, Remote patient monitoring
  • Telehealth, Telemedicine and Mobile Health

slides

Part G: Data and Computational Science and Covid-19

This lesson reviews some sophisticated high-performance computing (HPC) and Big Data approaches to COVID-19

  • Rosetta volunteer computing to analyze proteins
  • COVID-19 High Performance Computing Consortium
  • AI based drug discovery by startup Insilico Medicine
  • Review of several research projects
  • Global Pervasive Computational Epidemiology for COVID-19 studies
  • Simulations of Virtual Tissues at Indiana University available on nanoHUB

slides

Part H: Screening Drug Candidates

A major project involving Department of Energy Supercomputers

  • General Structure of Drug Discovery
  • DeepDriveMD Project using AI combined with molecular dynamics to accelerate discovery of drug properties

slides

Part I: Areas for Covid-19 Study and Pandemics as Complex Systems

slides

  • Possible Projects in AI for Health and Medicine and especially COVID-19
  • Pandemics as a Complex System
  • AI and computational Futures for Complex Systems

8.1.6 - Mobility (Industry)

This section discusses mobility in industry

Overview

  1. Industry being transformed by a) Autonomy (AI) and b) Electric power
  2. Established Organizations can’t change
    • General Motors (employees: 225,000 in 2016 to around 180,000 in 2018) finds it hard to compete with Tesla (42000 employees)
    • GM’s market value was half that of Tesla at the start of 2020 but was just 11% of it by October 2020
    • GM purchased Cruise to compete
    • Funding and then buying startups is an important “transformation” strategy
  3. Autonomy needs Sensors Computers Algorithms and Software
    • Also experience (training data)
    • Algorithms are the main bottleneck; the other components will improve on their own, although there is lots of interesting work in new sensors, computers and software
    • Over the last 3 years, electrical power has gone from interesting to “bound to happen”; Tesla’s happy customers probably contribute to this
    • Batteries and Charging stations needed

Summary Slides

Full Slide Deck

Mobility Industry A: Introduction

  • Futures of Automobile Industry, Mobility, and Ride-Hailing
  • Self-cleaning cars
  • Medical Transportation
  • Society of Automotive Engineers, Levels 0-5
  • Gartner’s conservative View

Mobility Industry B: Self Driving AI

  • Image processing and Deep Learning
  • Examples of Self Driving cars
  • Road construction Industry
  • Role of Simulated data
  • Role of AI in autonomy
  • Fleet cars
  • 3 Leaders: Waymo, Cruise, NVIDIA

Mobility Industry C: General Motors View

  • Talk by Dave Brooks at GM, “AI for Automotive Engineering”
  • Zero crashes, zero emission, zero congestion
  • GM moving to electric autonomous vehicles

Mobility Industry D: Self Driving Snippets

  • Worries about, and data on, the progress of self-driving
  • Tesla’s specialized self-driving chip
  • Some tasks that are hard for AI
  • Scooters and Bikes

Mobility Industry E: Electrical Power

  • Rise in use of electrical power
  • Special opportunities in e-Trucks and time scale
  • Future of Trucks
  • Tesla market value
  • Drones and Robot deliveries; role of 5G
  • Robots in Logistics

8.1.7 - Sports

Week 5: Big Data and Sports.

Sports with Big Data Applications

E534 2020 Big Data Applications and Analytics Sports Informatics Part I Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Part 1

Unit Summary (Part I): This unit discusses baseball, starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports like basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than the many-player phenomena in other sports. We discuss the Performance-Dollar relationship including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.

Slides

Lesson Summaries

Part 1.1 - E534 Sports - Introduction and Sabermetrics (Baseball Informatics) Lesson

Introduction to all Sports Informatics, Moneyball The 2002-2003 Oakland Athletics, Diamond Dollars economic model of baseball, Performance - Dollar relationship, Value of a Win.

{slides 1-15}

Part 1.2 - E534 Sports - Basic Sabermetrics

Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data, OPS, wOBA, ERA, ERC, FIP, UZR.

{slides 16-26}
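
As a small illustration of how the basic sabermetric measures are built from counting statistics, the sketch below computes OBP, SLG, and OPS for a made-up stat line; the numbers are hypothetical and not taken from the slides.

# Illustrative OPS (on-base plus slugging) calculation from a made-up stat line
ab, h, doubles, triples, hr = 550, 165, 35, 3, 25   # at-bats, hits, 2B, 3B, HR
bb, hbp, sf = 60, 5, 4                              # walks, hit-by-pitch, sacrifice flies

singles = h - doubles - triples - hr
obp = (h + bb + hbp) / (ab + bb + hbp + sf)                 # on-base percentage
slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab   # slugging percentage
ops = obp + slg
print("OBP=%.3f SLG=%.3f OPS=%.3f" % (obp, slg, ops))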

Part 1.3 - E534 Sports - Wins Above Replacement

Wins above Replacement WAR, Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another, Sabermetrics Example, Summary of Sabermetrics.

{slides 17-40}

Part 2

E534 2020 Big Data Applications and Analytics Sports Informatics Part II Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Slides

Unit Summary (Part II): This unit discusses ‘advanced sabermetrics’ covering advances possible from using video from PITCHf/X, FIELDf/X, HITf/X, COMMANDf/X and MLBAM.

Part 2.1 - E534 Sports - Pitching Clustering

A Big Data pitcher clustering method introduced by Vince Gennaro; data from his blog and a video at the 2013 SABR conference.

{slides 1-16}

Part 2.2 - E534 Sports - Pitcher Quality

Results of optimizing match ups, Data from video at 2013 SABR conference.

{slides 17-24}

Part 2.3 - E534 Sports - PITCHf/X

Examples of use of PITCHf/X.

{slides 25-30}

Part 2.4 - E534 Sports - Other Video Data Gathering in Baseball

FIELDf/X, MLBAM, HITf/X, COMMANDf/X.

{slides 26-41}

Part 3

E534 2020 Big Data Applications and Analytics Sports Informatics Part III. Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Unit Summary (Part III): We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Slides

Lesson Summaries

Part 3.1 - E534 Sports - Wearables

Consumer Sports, Stakeholders, and Multiple Factors.

{Slides 1-17}

Part 3.2 - E534 Sports - Soccer and the Olympics

Soccer, Tracking Players and Balls, Olympics.

{Slides 17-24}

Part 3.3 - E534 Sports - Spatial Visualization in NFL and NBA

NFL, NBA, and Spatial Visualization.

{Slides 25-38}

Part 3.4 - E534 Sports - Tennis and Horse Racing

Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.

{Slides 39-44}

8.1.8 - Space and Energy

This section discusses space and energy.

Overview

  1. Energy sources and AI for powering Grids.
  2. Energy Solution from Bill Gates
  3. Space and AI

Full Slide Deck

A: Energy

  • Distributed Energy Resources as a grid of renewables with a hierarchical set of Local Distribution Areas
  • Electric Vehicles in Grid
  • Economics of microgrids
  • Investment into Clean Energy
  • Batteries
  • Fusion and Deep Learning for plasma stability
  • AI for Power Grid, Virtual Power Plant, Power Consumption Monitoring, Electricity Trading

Slides

B: Clean Energy startups from Bill Gates

  • 26 Startups in areas like long-duration storage, nuclear energy, carbon capture, batteries, fusion, and hydropower …
  • The slide deck gives links to the 26 companies from their websites and Pitchbook, which describes their startup status (number of employees, funding)
  • It summarizes their products

Slides

C: Space

  • Space supports AI with communications, image data and global navigation
  • AI Supports space in AI-controlled remote manufacturing, imaging control, system control, dynamic spectrum use
  • Privatization of Space - SpaceX, Investment
  • 57,000 satellites through 2029

Slides

8.1.9 - AI In Banking

This section discusses AI in Banking

Overview

In this lecture, AI in Banking is discussed. Here we focus on the transition of legacy banks towards AI based banking, real world examples of AI in Banking, banking systems and banking as a service.

Slides

AI in Banking A: The Transition of legacy Banks

  1. Types of AI that are used
  2. Closing of physical branches
  3. Making the transition
  4. Growth in Fintech as legacy bank services decline

AI in Banking B: FinTech

  1. Fintech examples and investment
  2. Broad areas of finance/banking where Fintech is operating

AI in Banking C: Neobanks

  1. Types and Examples of neobanks
  2. Customer uptake by world region
  3. Neobanking in Small and Medium Business segment
  4. Neobanking in real estate, mortgages
  5. South American Examples

AI in Banking D: The System

  1. The Front, Middle, Back Office
  2. Front Office: Chatbots
  3. Robo-advisors
  4. Middle Office: Fraud, Money laundering
  5. Fintech
  6. Payment Gateways (Back Office)
  7. Banking as a Service

AI in Banking E: Examples

  1. Credit cards
  2. The stock trading ecosystem
  3. Robots counting coins
  4. AI in Insurance: Chatbots, Customer Support
  5. Banking itself
  6. Handwriting recognition
  7. Detect leaks for insurance

AI in Banking F: As a Service

  1. Banking Services Stack
  2. Business Model
  3. Several Examples
  4. Metrics compared among examples
  5. Breadth, Depth, Reputation, Speed to Market, Scalability

8.1.10 - Cloud Computing

Cloud Computing

E534 Cloud Computing Unit

Full Slide Deck

Overall Summary

Video:

Defining Clouds I: Basic definition of cloud and two very simple examples of why virtualization is important

  1. How clouds are situated wrt HPC and supercomputers
  2. Why multicore chips are important
  3. Typical data center

Video:

Defining Clouds II: Service-oriented architectures: Software services as Message-linked computing capabilities

  1. The different aaS’s: Network, Infrastructure, Platform, Software
  2. The amazing services that Amazon AWS and Microsoft Azure have
  3. Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices
  4. Gartner hypecycle and priority matrix on Infrastructure Strategies

Video:

Defining Clouds III: Cloud Market Share

  1. How important are they?
  2. How much money do they make?

Video:

Virtualization: Virtualization Technologies, Hypervisors and the different approaches

  1. KVM Xen, Docker and Openstack

Video:

Cloud Infrastructure I: Trends in the data center and its technologies

  1. Clouds physically across the world
  2. Green computing
  3. Fraction of world’s computing ecosystem in clouds and associated sizes
  4. An analysis from Cisco of size of cloud computing

Video:

Cloud Infrastructure II: Gartner hypecycle and priority matrix on Compute Infrastructure

  1. Containers compared to virtual machines
  2. The emergence of artificial intelligence as a dominant force

Video:

Cloud Software: HPC-ABDS with over 350 software packages and how to use each of 21 layers

  1. Google’s software innovations
  2. MapReduce in pictures (a toy word-count example follows this list)
  3. Cloud and HPC software stacks compared
  4. Components need to support cloud/distributed system programming
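
To make the MapReduce idea concrete, here is a toy word count in plain Python; real MapReduce frameworks distribute the map, shuffle, and reduce phases across many machines, so this is only a single-machine sketch.

from collections import defaultdict

documents = ["big data on clouds", "clouds and big data analytics"]

# Map: emit (word, 1) pairs from every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + Reduce: group pairs by key and sum the counts
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # e.g. {'big': 2, 'data': 2, 'on': 1, 'clouds': 2, ...}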

Video:

Cloud Applications I: Clouds in science, where the area is called cyberinfrastructure; the science usage pattern from NIST

  1. Artificial Intelligence from Gartner

Video:

Cloud Applications II: Characterize Applications using NIST approach

  1. Internet of Things
  2. Different types of MapReduce

Video:

Parallel Computing Analogies: Parallel Computing in pictures

  1. Some useful analogies and principles

Video:

Real Parallel Computing: Single Program/Instruction Multiple Data SIMD SPMD

  1. Big Data and Simulations Compared
  2. What is hard to do?

Video:

Storage: Cloud data approaches

  1. Repositories, File Systems, Data lakes

Video:

HPC and Clouds: The Branscomb Pyramid

  1. Supercomputers versus clouds
  2. Science Computing Environments

Video:

Comparison of Data Analytics with Simulation: Structure of different applications for simulations and Big Data

  1. Software implications
  2. Languages

Video:

The Future I: Gartner cloud computing hypecycle and priority matrix 2017 and 2019

  1. Hyperscale computing
  2. Serverless and FaaS
  3. Cloud Native
  4. Microservices
  5. Update to 2019 Hypecycle

Video:

Future and Other Issues II: Security

  1. Blockchain

Video:

Future and Other Issues III: Fault Tolerance

Video:

8.1.11 - Transportation Systems

This section discusses transportation systems

Transportation Systems Summary

  1. The ride-hailing industry highlights the growth of a new “Transportation System” TS
    • For ride-hailing, TS controls rides, matching drivers and customers; it predicts how to position cars and how to avoid traffic slowdowns
    • However, TS is much bigger outside ride-hailing as we move into the “connected vehicle” era
    • TS will probably find autonomous vehicles easier to deal with than human drivers
  2. Cloud Fog and Edge components
  3. AI for autonomy has centered on generalized image processing
  4. TS also needs AI (and DL) but this is for routing and geospatial time-series; different technologies from those for image processing

Slides

Transportation Systems A: Introduction

  1. “Smart” Insurance
  2. Fundamentals of Ride-Hailing

Transportation Systems B: Components of a Ride-Hailing System

  1. Transportation Brain and Services
  2. Maps, Routing,
  3. Traffic forecasting with deep learning

Transportation Systems C: Different AI Approaches in Ride-Hailing

  1. View as a Time Series: LSTM and ARIMA (a sketch follows this list)
  2. View as an image in a 2D earth surface - Convolutional networks
  3. Use of Graph Neural Nets
  4. Use of Convolutional Recurrent Neural Nets
  5. Spatio-temporal modeling
  6. Comparison of data with predictions
  7. Reinforcement Learning
  8. Formulation of General Geospatial Time-Series Problem
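
As referenced in item 1 above, here is a minimal sketch of the time-series view using a Keras LSTM on a synthetic demand-like series; the sine-plus-noise data, window length, and all hyperparameters are illustrative choices, not taken from the lecture material.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

np.random.seed(0)
series = np.sin(np.arange(0, 200, 0.1)) + 0.1 * np.random.randn(2000)  # toy demand signal

window = 12   # predict the next value from the previous 12 values
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape(-1, window, 1)          # (samples, timesteps, features)

model = Sequential()
model.add(LSTM(32, input_shape=(window, 1)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

print("prediction:", float(model.predict(X[-1:]).ravel()[0]), "actual:", y[-1])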

8.1.12 - Commerce

This section discusses Commerce

Overview

Slides

AI in Commerce A: The Old way of doing things

  1. AI in Commerce
  2. AI-First Engineering, Deep Learning
  3. E-commerce and the transformation of “Bricks and Mortar”

AI in Commerce B: AI in Retail

  1. Personalization
  2. Search
  3. Image Processing to Speed up Shopping
  4. Walmart

AI in Commerce C: The Revolution that is Amazon

  1. Retail Revolution
  2. Saves Time, Effort and Novelty with Modernized Retail
  3. Looking ahead of Retail evolution

AI in Commerce D: DLMalls e-commerce

  1. Amazon sellers
  2. Rise of Shopify
  3. Selling Products on Amazon

AI in Commerce E: Recommender Engines, Digital media

  1. Spotify recommender engines
  2. Collaborative Filtering (a toy example follows this list)
  3. Audio Modelling
  4. DNN for Recommender engines
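
To illustrate the collaborative filtering idea referenced above, the following toy item-based filter scores an unseen item for a user via cosine similarity on a tiny, made-up user-item rating matrix; production recommenders work on far larger and sparser data.

import numpy as np

# rows = users, columns = items; 0 means "not rated" (made-up data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

n_items = ratings.shape[1]
sim = np.array([[cosine(ratings[:, i], ratings[:, j]) for j in range(n_items)]
                for i in range(n_items)])

user, item = 0, 2                      # predict user 0's rating for unrated item 2
rated = ratings[user] > 0              # items the user has already rated
score = sim[item, rated] @ ratings[user, rated] / sim[item, rated].sum()
print("predicted rating: %.2f" % score)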

8.1.13 - Python Warm Up

Python Exercise on Google Colab

Python Exercise on Google Colab

Open In Colab View in Github Download Notebook

In this exercise, we will take a look at some basic Python Concepts needed for day-to-day coding.

Check the installed Python version.

! python --version
Python 3.7.6

Simple For Loop

for i in range(10):
  print(i)
0
1
2
3
4
5
6
7
8
9

List

list_items = ['a', 'b', 'c', 'd', 'e']

Retrieving an Element

list_items[2]
'c'

Append New Values

list_items.append('f')
list_items
['a', 'b', 'c', 'd', 'e', 'f']

Remove an Element

list_items.remove('a')
list_items
['b', 'c', 'd', 'e', 'f']

Dictionary

dictionary_items = {'a':1, 'b': 2, 'c': 3}

Retrieving an Item by Key

dictionary_items['b']
2

Append New Item with Key

dictionary_items['c'] = 4
dictionary_items
{'a': 1, 'b': 2, 'c': 4}

Delete an Item with Key

del dictionary_items['a'] 
dictionary_items
{'b': 2, 'c': 4}

Comparators

x = 10
y = 20 
z = 30
x > y 
False
x < z
True
z == x
False
if x < z:
  print("This is True")
This is True
if x > z:
  print("This is True")
else:
  print("This is False")  
This is False

Arithmetic

k = x * y * z
k
6000
j = x + y + z
j
60
m = x -y 
m
-10
n = x / z
n
0.3333333333333333

Numpy

Create a Random Numpy Array

import numpy as np
a = np.random.rand(100)
a.shape
(100,)

Reshape Numpy Array

b = a.reshape(10,10)
b.shape
(10, 10)

Manipulate Array Elements

c = b * 10
c[0]
array([3.33575458, 7.39029235, 5.54086921, 9.88592471, 4.9246252 ,
       1.76107178, 3.5817523 , 3.74828708, 3.57490794, 6.55752319])
c = np.mean(b,axis=1)
c.shape
(10,)
print(c)
[0.60673061 0.4223565  0.42687517 0.6260857  0.60814217 0.66445627 
  0.54888432 0.68262262 0.42523459 0.61504903]

8.1.14 - MNIST Classification on Google Colab

MNIST Classification on Google Colab
Open In Colab View in Github Download Notebook

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures of handwritten digits.

Import Libraries

Note: https://python-future.org/quickstart.html

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist

Warm Up Exercise

Pre-process data

Load data

First we load the data from the built-in mnist dataset in Keras. Here the data set is split into training and testing data. Both the training and testing data have two components, features and labels, since every sample in the dataset has a corresponding label. In MNIST each sample contains image data represented as an array, and the labels range from 0 to 9.

Here x_train holds the training features and y_train the training labels. The same goes for the testing data.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step

Identify Number of Classes

As this is a digit classification problem, we need to know how many classes there are, so we count the number of unique labels.

num_labels = len(np.unique(y_train))

Convert Labels To One-Hot Vector

Read more on one-hot vector.

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
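
As a quick illustration (not part of the lab itself), to_categorical turns the digit label 3 into a length-10 vector with a 1 in position 3:

from keras.utils import to_categorical
print(to_categorical([3], num_classes=10))
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]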

Image Reshaping

The training model is designed by treating the data as a vector; this is a model-dependent modification. Here we assume the image is square.

image_size = x_train.shape[1]
input_size = image_size * image_size

Resize and Normalize

The next step is to reshape each image into a vector and normalize the data. Image values range from 0 to 255, so an easy way to normalize is to divide by the maximum value.

x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255

Create a Keras Model

Keras is a neural network library. The summary function provides a tabular summary of the model you created, and the plot_model function draws a graph of the network you created.

# Create Model
# network parameters
batch_size = 4
hidden_units = 64

model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='mlp-mnist.png', show_shapes=True)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 512)               401920    
_________________________________________________________________
dense_6 (Dense)              (None, 10)                5130      
_________________________________________________________________
activation_5 (Activation)    (None, 10)                0         
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

images

Compile and Train

A Keras model needs to be compiled before it can be trained. In the compile function, you provide the optimizer you want to use, the metrics you expect, and the type of loss function you need.

Here we use the Adam optimizer, a widely used optimizer in neural networks.

The loss function we have used is categorical_crossentropy.

Once the model is compiled, the fit function is called, passing the training data, the number of epochs, and the batch size.

The batch size determines the number of samples used per minibatch when optimizing the loss function.

Note: Change the number of epochs, batch size and see what happens.

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
469/469 [==============================] - 3s 7ms/step - loss: 0.3647 - accuracy: 0.8947





<tensorflow.python.keras.callbacks.History at 0x7fe88faf4c50>

Testing

Now we can test the trained model. Use the evaluate function, passing the test data and batch size; the loss and accuracy values can then be retrieved.

Exercise MNIST_V1.0: Try to observe the network behavior by changing the number of epochs and the batch size, and record the best accuracy that you can obtain. Here you can record what happens when you change these values. Describe your observations in 50-100 words. (A sketch of such a parameter sweep is given after the test output below.)

loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
79/79 [==============================] - 0s 4ms/step - loss: 0.2984 - accuracy: 0.9148

Test accuracy: 91.5%
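
For the exercise above, one possible way to organize the experiments is sketched below: a small sweep over epochs and batch sizes that records the test accuracy of each run. The particular values tried are arbitrary, and the sweep can be slow on a CPU.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
num_labels = len(np.unique(y_train))
input_size = x_train.shape[1] * x_train.shape[2]
x_train = x_train.reshape(-1, input_size).astype('float32') / 255
x_test = x_test.reshape(-1, input_size).astype('float32') / 255
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

results = {}
for epochs in (1, 2):
    for batch_size in (32, 128):
        model = Sequential()
        model.add(Dense(64, input_dim=input_size))
        model.add(Dense(num_labels))
        model.add(Activation('softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        results[(epochs, batch_size)] = acc

for (epochs, batch_size), acc in sorted(results.items()):
    print("epochs=%d batch_size=%-3d test accuracy=%.4f" % (epochs, batch_size, acc))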

Final Note

This program can be seen as the hello world program of deep learning. The objective of this exercise is not to teach you the depths of deep learning, but to teach you the basic concepts needed to design a simple network to solve a problem. Before running the whole code, read the instructions preceding each code section.

Homework

Solve Exercise MNIST_V1.0.

Reference:

Original Source of the Source Code

8.2 - 2019

Here you will find a number of modules and components for introducing you to big data applications.

Big Data Applications are an important topic that has an impact in academia and industry.

8.2.2 - Introduction (Fall 2018)

Introduction (Fall 2018)

Introduction to Big Data Applications

This is an overview course of Big Data Applications covering a broad range of problems and solutions. It covers cloud computing technologies and includes a project. Also, algorithms are introduced and illustrated.

General Remarks Including Hype cycles

This is Part 1 of the introduction. We start with some general remarks and take a closer look at the emerging technology hype cycles.

1.a Gartner’s Hypecycles and especially those for emerging technologies between 2016 and 2018

  • Video . Audio

1.b Gartner’s Hypecycles with Emerging technologies hypecycles and the priority matrix at selected times 2008-2015

  • Video . Presentation

1.a + 1.b:

  • Audio
  • Technology trends
  • Industry reports

Data Deluge

This is Part 2 of the introduction.

2.a Business usage patterns from NIST

  • Video . Presentation

2.b Cyberinfrastructure and AI

  • Video . Presentation

2.a + 2.b

  • Audio
  • Several examples of rapid data and information growth in different areas
  • Value of data and analytics

Jobs

This is Part 3 of the introduction.

  • Video . Presentation . Audio
  • Job opportunities in the areas of data science, clouds, computer science and computer engineering
  • Job demands in different countries and companies.
  • Trends and forecasts of job demands in the future.

This is Part 4 of the introduction.

4a. Industry Trends: Technology Trends by 2014

  • Video

4b. Industry Trends: 2015 onwards

  • Video

An older set of trend slides is available from:

4a. Industry Trends: Technology Trends by 2014

A current set is available at:

4b. Industry Trends: 2015 onwards

  • Presentation . Audio

4c. Industry Trends: Voice and HCI, cars, Deep learning

  • Audio
  • Many technology trends through end of 2014 and 2015 onwards, examples in different fields
  • Voice and HCI, Cars Evolving and Deep learning

Digital Disruption and Transformation

This is Part 5 of the introduction.

  1. Digital Disruption and Transformation
  • Video . Presentation . Audio
  • The past displaced by digital disruption

Computing Model

This is Part 6 of the introduction.

6a. Computing Model: earlier discussion by 2014:

  • Video . Presentation . Audio

6b. Computing Model: developments after 2014 including Blockchain:

  • Video . Presentation . Audio
  • Industry has adopted clouds, which are attractive for data analytics; this includes big companies such as Google, Amazon, Microsoft and so on.
  • Some examples of development: AWS quarterly revenue, critical capabilities public cloud infrastructure as a service.
  • Blockchain: ledgers redone, blockchain consortia.

Research Model

This is Part 7 of the introduction.

Research Model: 4th Paradigm; From Theory to Data driven science?

  • Video . Presentation . Audio
  • The 4 paradigms of scientific research: theory, experiment and observation, simulation of theory or model, and data-driven.

Data Science Pipeline

This is Part 8 of the introduction. 8. Data Science Pipeline

  • Video . Presentation
  • DIKW process: Data, Information, Knowledge, Wisdom and Decision.
  • Example of Google Maps/navigation.
  • Criteria for Data Science platform.

Physics as an Application Example

This is Part 9 of the introduction.

  • Physics as an application example.

Technology Example

This is Part 10 of the introduction.

  • Overview of many informatics areas, recommender systems in detail.
  • NETFLIX on personalization, recommendation, data science.

Exploring Data Bags and Spaces

This is Part 11 of the introduction.

  1. Exploring data bags and spaces: Recommender Systems II
  • Video . Presentation
  • Distances in funny spaces, about “real” spaces and how to use distances.

Another Example: Web Search Information Retrieval

This is Part 12 of the introduction. 12. Another Example: Web Search Information Retrieval

  • Video . Presentation

Cloud Application in Research

This is Part 13 of the introduction discussing cloud applications in research.

  1. Cloud Applications in Research: Science Clouds and Internet of Things
  • Presentation

Software Ecosystems: Parallel Computing and MapReduce

This is Part 14 of the introduction discussing the software ecosystem

  1. Software Ecosystems: Parallel Computing and MapReduce
  • Presentation

Conclusions

This is Part 15 of the introduction with some concluding remarks. 15. Conclusions

  • Video . Presentation . Audio

8.2.3 - Motivation

We present the motivation why big data is so important

Part I Motivation I

Motivation

Big Data Applications & Analytics: Motivation/Overview; Machine (actually Deep) Learning, Big Data, and the Cloud; Centerpieces of the Current and Future Economy,

00) Mechanics of Course, Summary, and overall remarks on course

In this section we discuss the summary of the motivation section.

01A) Technology Hypecycle I

Today clouds and big data have got through the hype cycle (they have emerged), but features like blockchain, serverless and machine learning are on recent hype cycles, while areas like deep learning have several entries (as in fact do clouds). Topics: Gartner’s Hypecycles and especially that for emerging technologies in 2019; the phases of hypecycles; the Priority Matrix with benefits and adoption time; an initial discussion of the 2019 Hypecycle for Emerging Technologies.

01B) Technology Hypecycle II

Today clouds and big data have got through the hype cycle (they have emerged), but features like blockchain, serverless and machine learning are on recent hype cycles, while areas like deep learning have several entries (as in fact do clouds). Topics: Gartner’s Hypecycles and especially that for emerging technologies in 2019; details of the 2019 Emerging Technology and related (AI, Cloud) Hypecycles.

01C) Technology Hypecycle III

Today clouds and big data have got through the hype cycle (they have emerged), but features like blockchain, serverless and machine learning are on recent hype cycles, while areas like deep learning have several entries (as in fact do clouds). Topics: Gartner’s Hypecycles and Priority Matrices for emerging technologies in 2018, 2017 and 2016. More details on 2018 will be found in Unit 1A of the 2018 Presentation and details of 2015 in Unit 1B (Journey to Digital Business). 1A in 2018 also discusses the 2017 Data Center Infrastructure hypecycle, removed as this hype cycle disappeared in later years.

01D) Technology Hypecycle IV

Today clouds and big data have got through the hype cycle (they have emerged), but features like blockchain, serverless and machine learning are on recent hype cycles, while areas like deep learning have several entries (as in fact do clouds). Topics: Emerging Technologies hypecycles and the Priority Matrix at selected times 2008-2015. Clouds star from 2008 to today; they are mixed up with transformational and disruptive changes. Unit 1B of the 2018 Presentation has more details of this history, including Priority Matrices.

02)

02A) Clouds/Big Data Applications I

The Data Deluge: Big Data. A lot of the best examples have NOT been updated (as I can’t find updates), so some slides are old but still make the correct points. The Big Data Deluge has become the Deep Learning Deluge. Big Data is an agreed fact; Deep Learning is still evolving fast but has a stream of successes!

02B) Cloud/Big Data Applications II

Clouds in science, where the area is called cyberinfrastructure. The usage pattern from NIST is removed; see the 2018 lecture 2B of the motivation for this discussion.

02C) Cloud/Big Data

Usage trends: Google and related trends; Artificial Intelligence from Microsoft, Gartner and Meeker.

03) Jobs in areas like Data Science, Clouds, and Computer Science and Computer Engineering

More details removed as dated but still valid. See 2018 Lesson 4C for the 3 technology trends for 2016: Voice as HCI, Cars, Deep Learning.

05) Digital Disruption and Transformation: The Past displaced by Digital Disruption; some more details are in the 2018 Presentation Lesson 5.

06)

06A) Computing Model I: Industry adopted clouds, which are attractive for data analytics. Clouds are a dominant force in Industry. Examples are given.

06B) Computing Model II with 3 subsections is removed; please see the 2018 Presentation for this. It covered developments after 2014, mainly from Gartner: cloud market share and blockchain.

07) Research Model 4th Paradigm; From Theory to Data driven science?

08) Data Science Pipeline DIKW: Data, Information, Knowledge, Wisdom, Decisions.

More details on Data Science Platforms are in 2018 Lesson 8 presentation

09) Physics: Looking for Higgs Particle with Large Hadron Collider LHC Physics as a big data example

10) Recommender Systems I General remarks and Netflix example

11) Recommender Systems II Exploring Data Bags and Spaces

12) Web Search and Information Retrieval Another Big Data Example

13) Cloud Applications in Research (Removed): Science Clouds, Internet of Things

This is a continuation of Part 12. See the 2018 Presentation (same as 2017 for lesson 13) and Cloud Unit 2019-I this year.

14) Parallel Computing and MapReduce Software Ecosystems

15) Online education and data science education Removed.

You can find it in the 2017 version. In @sec:534-week2 you can see more about this.

16) Conclusions

The conclusion is contained in the latter part of Part 15.

Motivation Archive Big Data Applications and Analytics: Motivation/Overview; Machine (actually Deep) Learning, Big Data, and the Cloud; Centerpieces of the Current and Future Economy. Backup Lectures from previous years referenced in 2019 class

8.2.4 - Motivation (cont.)

We present the motivation why big data is so important

Part II Motivation Archive

2018 BDAA Motivation-1A) Technology Hypecycle I

In this section we discuss general remarks, including hype curves.

2018 BDAA Motivation-1B) Technology Hypecycle II

In this section we continue our discussion on general remarks including Hype curves.

2018 BDAA Motivation-2B) Cloud/Big Data Applications II

In this section we discuss clouds in science, where the area is called cyberinfrastructure; the usage pattern from NIST; and Artificial Intelligence from Gartner and Meeker.

In this section we discuss Lesson 4A: many technology trends through the end of 2014.

In this section we continue our discussion on industry trends. This section includes Lesson 4B: 2015 onwards, many technology adoption trends.

In this section we continue our discussion on industry trends. This section contains Lesson 4C: 2015 onwards, 3 technology trends: voice as HCI, cars, deep learning.

2018 BDAA Motivation-6B) Computing Model II

In this section we discuss computing models. This section contains lesson 6B, with 3 subsections: developments after 2014 mainly from Gartner, cloud market share, and blockchain.

2017 BDAA Motivation-8) Data Science Pipeline DIKW

In this section, we discuss data science pipelines. This section covers data, information, knowledge and wisdom, which form the DIKW term. It also contains some discussion on data science platforms.

2017 BDAA Motivation-13) Cloud Applications in Research Science Clouds Internet of Things

In this section we discuss the Internet of Things and related cloud applications.

2017 BDAA Motivation-15) Data Science Education Opportunities at Universities

In this section we discuss more on data science education opportunities.

8.2.5 - Cloud

We present an overview of cloud computing for big data

Part III Cloud {#sec:534-week3}

A. Summary of Course

B. Defining Clouds I

In this lecture we discuss the basic definition of cloud and two very simple examples of why virtualization is important.

In this lecture we discuss how clouds are situated with respect to HPC and supercomputers, why multicore chips are important, and a typical data center.

C. Defining Clouds II

In this lecture we discuss service-oriented architectures, Software services as Message-linked computing capabilities.

In this lecture we discuss the different aaS’s: Network, Infrastructure, Platform, Software; the amazing services that Amazon AWS and Microsoft Azure have; initial Gartner comments on clouds (they are now the norm) and the evolution of servers; serverless and microservices; and the Gartner hypecycle and priority matrix on Infrastructure Strategies.

D. Defining Clouds III: Cloud Market Share

In this lecture we discuss how important the cloud market shares are and how much money they make.

E. Virtualization: Virtualization Technologies,

In this lecture we discuss hypervisors and the different approaches KVM, Xen, Docker and Openstack.

F. Cloud Infrastructure I

In this lecture we comment on trends in the data center and its technologies: clouds physically spread across the world, green computing, and the fraction of the world’s computing ecosystem in clouds and the associated sizes. An analysis from Cisco of the size of cloud computing is also discussed in this lecture.

G. Cloud Infrastructure II

In this lecture, we discuss the Gartner hypecycle and priority matrix on Compute Infrastructure, containers compared to virtual machines, and the emergence of artificial intelligence as a dominant force.

H. Cloud Software:

In this lecture we discuss HPC-ABDS with over 350 software packages and how to use each of its 21 layers; Google’s software innovations; MapReduce in pictures; cloud and HPC software stacks compared; and the components needed to support cloud/distributed system programming.

I. Cloud Applications I: Clouds in Science

In this lecture we discuss clouds in science, where the area is called cyberinfrastructure; the science usage pattern from NIST; and Artificial Intelligence from Gartner.

J. Cloud Applications II: Characterize Applications using NIST

In this lecture we discuss the NIST approach, the Internet of Things, and different types of MapReduce.

K. Parallel Computing

In this lecture we discuss parallel computing in pictures and some useful analogies and principles.

L. Real Parallel Computing: Single Program/Instruction Multiple Data SIMD SPMD

In this lecture, we compare Big Data and Simulations and furthermore discuss what is hard to do.

M. Storage: Cloud data

In this lecture we discuss the cloud data approaches: repositories, file systems, and data lakes.

N. HPC and Clouds

In this lecture we discuss the Branscomb Pyramid, supercomputers versus clouds, and science computing environments.

O. Comparison of Data Analytics with Simulation:

In this lecture we discuss the structure of different applications for simulations and Big Data, the software implications, and languages.

P. The Future I

In this lecture we discuss the Gartner cloud computing hypecycle and priority matrix for 2017 and 2019, hyperscale computing, serverless and FaaS, cloud native, microservices, and the update to the 2019 Hypecycle.

Q. The Future and Other Issues II

In this lecture we discuss security and blockchain.

R. The Future and other Issues III

In this lecture we discuss fault tolerance.

8.2.6 - Physics

Big Data applications and Physics

Physics with Big Data Applications {#sec:534-week5}

E534 2019 Big Data Applications and Analytics Discovery of Higgs Boson Part I (Unit 8) Section Units 9-11 Summary: This section starts by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms, and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of signal shape, various backgrounds, and various event totals. Then random variables and some simple principles of statistics are introduced with an explanation as to why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random Numbers with their Generators and Seeds lead to a discussion of the Binomial and Poisson Distributions, and Monte-Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

Unit 8:

8.1 - Looking for Higgs: 1. Particle and Counting Introduction 1

We return to the particle case with slides used in the introduction and stress that particles are often manifested as bumps in histograms, and those bumps need to be large enough to stand out from the background in a statistically significant fashion.

8.2 - Looking for Higgs: 2. Particle and Counting Introduction 2

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

8.3 - Looking for Higgs: 3. Particle Experiments

This lesson gives a small picture gallery of accelerators. Accelerators, detection chambers and magnets in tunnels, and a large underground laboratory used for experiments where you need to be shielded from background like cosmic rays.

Unit 9

This unit is devoted to Python experiments with Geoffrey looking at histograms of Higgs Boson production with various forms of signal shape, various backgrounds, and various event totals.

9.1 - Looking for Higgs II: 1: Class Software

We discuss how this unit uses Java (deprecated) and Python on either a backend server (FutureGrid, now closed) or a local client. We point out a useful book on Python for data analysis. This lesson is deprecated; follow the current technology for the class.

9.2 - Looking for Higgs II: 2: Event Counting

We define “event counting” data collection environments. We discuss the Python and Java code to generate events according to a particular scenario (the important idea of Monte Carlo data). Here we use a sloping background plus either a Higgs particle generated similarly to the LHC observation or one observed with better resolution (smaller measurement error).

9.3 - Looking for Higgs II: 3: With Python examples of Signal plus Background

This uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and the measurement resolution for the Higgs.
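
A minimal sketch of this kind of Monte Carlo data set is shown below: a linearly falling background plus a Gaussian “Higgs” bump at 126 GeV, histogrammed in 2 GeV bins. The event counts, slope, and signal width are illustrative choices, not the exact values used in the class notebooks.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
lo, hi = 110.0, 140.0                       # mass range in GeV

def sloping_background(n):
    """Generate n events from a linearly falling density via accept-reject."""
    events = []
    while len(events) < n:
        x = np.random.uniform(lo, hi)
        if np.random.uniform(0, 1) < (hi - x) / (hi - lo):
            events.append(x)
    return np.array(events)

background = sloping_background(20000)
signal = np.random.normal(126.0, 2.0, size=300)   # Gaussian signal at 126 GeV

data = np.concatenate([background, signal])
plt.hist(data, bins=np.arange(lo, hi + 2, 2))     # 2 GeV bins
plt.xlabel("mass (GeV)")
plt.ylabel("events per 2 GeV bin")
plt.savefig("toy_higgs_histogram.png")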

9.4 - Looking for Higgs II: 4: Change shape of background & number of Higgs Particles

This lesson continues the examination of Monte Carlo data, looking at the effect of changing the number of Higgs particles produced and changing the shape of the background.

Unit 10

In this unit we discuss;

E534 2019 Big Data Applications and Analytics Discovery of Higgs Boson: Big Data Higgs Unit 10: Looking for Higgs Particles Part III: Random Variables, Physics and Normal Distributions. The Section Units 9-11 Summary is given at the start of this section. Unit Overview: Geoffrey introduces random variables and some simple principles of statistics and explains why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is currently not available in this unit.

10.1 - Statistics Overview and Fundamental Idea: Random Variables

We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.

10.2 - Physics and Random Variables I

We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what’s going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

10.3 - Physics and Random Variables II

We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what’s going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

10.4 - Statistics of Events with Normal Distributions

We introduce Poisson and Binomial distributions and define independent identically distributed (IID) random variables. We give the law of large numbers defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.

10.5 - Gaussian Distributions

We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.
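
A small counting experiment in Python makes the point: Poisson counts with mean N have a standard deviation close to sqrt(N), and for large N their histogram is well approximated by a Gaussian. The mean of 100 is an arbitrary illustrative choice.

import numpy as np

np.random.seed(0)
mean_count = 100
counts = np.random.poisson(mean_count, size=10000)   # 10000 repeated counting experiments

print("sample mean:", counts.mean())        # close to 100
print("sample std :", counts.std())         # close to sqrt(100) = 10
print("sqrt(mean) :", np.sqrt(mean_count))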

10.6 - Using Statistics

We discuss the significance of a standard deviation and the role of biases and insufficient statistics, with a Python example of getting incorrect answers.

Unit 11

In this section we discuss;

E534 2019 Big Data Applications and Analytics Discovery of Higgs Boson: Big Data Higgs Unit 11: Looking for Higgs Particles Part IV: Random Numbers, Distributions and Central Limit Theorem. The Section Units 9-11 Summary is given at the start of this section. Unit Overview: Geoffrey discusses Random Numbers with their Generators and Seeds. The unit introduces the Binomial and Poisson Distributions. Monte-Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes law conclude the discussion. Python and Java (for the student - not reviewed in class) examples and Physics applications are given.

11.1 - Generators and Seeds I

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to specify how to start the generation.

11.2 - Generators and Seeds II

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to specify how to start the generation.
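
The following illustration (not the class notebook itself) shows the role of the seed: the same seed reproduces the same sequence, while a different seed gives a different one.

import numpy as np

np.random.seed(42)
print(np.random.rand(3))   # the same three numbers every run with seed 42

np.random.seed(42)
print(np.random.rand(3))   # identical to the line above

np.random.seed(7)
print(np.random.rand(3))   # a different sequence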

11.3 - Binomial Distribution

We define the binomial distribution and give LHC data as an example of where this distribution is valid.

11.4 - Accept-Reject

We introduce an advanced method – accept/reject – for generating random variables with arbitrary distributions.
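
A minimal accept-reject sketch, with an arbitrarily chosen un-normalized density f(x) = x^2 on [0, 1] as the target, looks like this:

import numpy as np

def f(x):
    return x ** 2          # un-normalized target density on [0, 1]

f_max = 1.0                # an upper bound on f over the interval
np.random.seed(1)

samples = []
while len(samples) < 5000:
    x = np.random.uniform(0.0, 1.0)      # propose a point uniformly
    u = np.random.uniform(0.0, f_max)    # uniform height
    if u < f(x):                         # accept with probability f(x)/f_max
        samples.append(x)

samples = np.array(samples)
print("sample mean:", samples.mean())    # ~0.75 for a density proportional to x^2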

11.5 - Monte Carlo Method

We define the Monte Carlo method, which usually uses the accept/reject method in the typical case for a distribution.

11.6 - Poisson Distribution

We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.

11.7 - Central Limit Theorem

We introduce the Central Limit Theorem and give examples from Wikipedia.
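
The effect is easy to see numerically: means of n uniform random numbers cluster around 0.5 with a standard deviation of roughly 1/sqrt(12 n), approaching a Gaussian as n grows. The values of n below are arbitrary.

import numpy as np

np.random.seed(0)
for n in (1, 10, 100):
    means = np.random.uniform(0, 1, size=(10000, n)).mean(axis=1)
    print("n=%3d  mean of means=%.3f  std=%.3f  theory=%.3f"
          % (n, means.mean(), means.std(), 1 / np.sqrt(12 * n)))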

11.8 - Interpretation of Probability: Bayes v. Frequency

This lesson describes the difference between the Bayes and frequency views of probability. Bayes’s law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.

8.2.7 - Deep Learning

Introduction to Deep Learning

Introduction to Deep Learning {#sec:534-intro-to-dnn}

In this tutorial we will work through the first lab on deep neural networks. Basic classification using deep learning will be discussed in this chapter.

Video

MNIST Classification Version 1

Using Cloudmesh Common

Here we do a simple benchmark. We calculate the compile time, train time, test time and data loading time for this example. Installing the cloudmesh-common library is the first step. Focus on this section because Assignment 4 will be based on the content of this lab.

Video

!pip install cloudmesh-common
    Collecting cloudmesh-common
     Downloading https://files.pythonhosted.org/packages/42/72/3c4aabce294273db9819be4a0a350f506d2b50c19b7177fb6cfe1cbbfe63/cloudmesh_common-4.2.13-py2.py3-none-any.whl (55kB)
        |████████████████████████████████| 61kB 4.1MB/s
    Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from cloudmesh-common) (0.16.0)
    Collecting pathlib2 (from cloudmesh-common)
      Downloading https://files.pythonhosted.org/packages/e9/45/9c82d3666af4ef9f221cbb954e1d77ddbb513faf552aea6df5f37f1a4859/pathlib2-2.3.5-py2.py3-none-any.whl
    Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from cloudmesh-common) (2.5.3)
    Collecting simplejson (from cloudmesh-common)
      Downloading https://files.pythonhosted.org/packages/e3/24/c35fb1c1c315fc0fffe61ea00d3f88e85469004713dab488dee4f35b0aff/simplejson-3.16.0.tar.gz (81kB)
         |████████████████████████████████| 81kB 10.6MB/s
    Collecting python-hostlist (from cloudmesh-common)
      Downloading https://files.pythonhosted.org/packages/3d/0f/1846a7a0bdd5d890b6c07f34be89d1571a6addbe59efe59b7b0777e44924/python-hostlist-1.18.tar.gz
    Requirement already satisfied: pathlib in /usr/local/lib/python3.6/dist-packages (from cloudmesh-common) (1.0.1)
    Collecting colorama (from cloudmesh-common)
      Downloading https://files.pythonhosted.org/packages/4f/a6/728666f39bfff1719fc94c481890b2106837da9318031f71a8424b662e12/colorama-0.4.1-py2.py3-none-any.whl
    Collecting oyaml (from cloudmesh-common)
      Downloading https://files.pythonhosted.org/packages/00/37/ec89398d3163f8f63d892328730e04b3a10927e3780af25baf1ec74f880f/oyaml-0.9-py2.py3-none-any.whl
    Requirement already satisfied: humanize in /usr/local/lib/python3.6/dist-packages (from cloudmesh-common) (0.5.1)
    Requirement already satisfied: psutil in /usr/local/lib/python3.6/dist-packages (from cloudmesh-common) (5.4.8)
    Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from pathlib2->cloudmesh-common) (1.12.0)
    Requirement already satisfied: pyyaml in /usr/local/lib/python3.6/dist-packages (from oyaml->cloudmesh-common) (3.13)
    Building wheels for collected packages: simplejson, python-hostlist
      Building wheel for simplejson (setup.py) ... done
      Created wheel for simplejson: filename=simplejson-3.16.0-cp36-cp36m-linux_x86_64.whl size=114018 sha256=a6f35adb86819ff3de6c0afe475229029305b1c55c5a32b442fe94cda9500464
      Stored in directory: /root/.cache/pip/wheels/5d/1a/1e/0350bb3df3e74215cd91325344cc86c2c691f5306eb4d22c77
      Building wheel for python-hostlist (setup.py) ... done
      Created wheel for python-hostlist: filename=python_hostlist-1.18-cp36-none-any.whl size=38517 sha256=71fbb29433b52fab625e17ef2038476b910bc80b29a822ed00a783d3b1fb73e4
      Stored in directory: /root/.cache/pip/wheels/56/db/1d/b28216dccd982a983d8da66572c497d6a2e485eba7c4d6cba3
    Successfully built simplejson python-hostlist
    Installing collected packages: pathlib2, simplejson, python-hostlist, colorama, oyaml, cloudmesh-common
    Successfully installed cloudmesh-common-4.2.13 colorama-0.4.1 oyaml-0.9 pathlib2-2.3.5 python-hostlist-1.18 simplejson-3.16.0

In this lesson we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures of handwritten digits.

! python3 --version
Python 3.6.8
! pip install tensorflow-gpu==1.14.0
Collecting tensorflow-gpu==1.14.0
  Downloading https://files.pythonhosted.org/packages/76/04/43153bfdfcf6c9a4c38ecdb971ca9a75b9a791bb69a764d652c359aca504/tensorflow_gpu-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (377.0MB)
     |████████████████████████████████| 377.0MB 77kB/s
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.12.0)
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.15.0)
Requirement already satisfied: protobuf>=3.6.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (3.7.1)
Requirement already satisfied: keras-applications>=1.0.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.0.8)
Requirement already satisfied: gast>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (0.2.2)
Requirement already satisfied: astor>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (0.8.0)
Requirement already satisfied: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (0.8.0)
Requirement already satisfied: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.11.2)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (0.33.6)
Requirement already satisfied: tensorflow-estimator<1.15.0rc0,>=1.14.0rc0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.14.0)
Requirement already satisfied: tensorboard<1.15.0,>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.14.0)
Requirement already satisfied: numpy<2.0,>=1.14.5 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.16.5)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.1.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (1.1.0)
Requirement already satisfied: google-pasta>=0.1.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow-gpu==1.14.0) (0.1.7)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.6.1->tensorflow-gpu==1.14.0) (41.2.0)
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from keras-applications>=1.0.6->tensorflow-gpu==1.14.0) (2.8.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<1.15.0,>=1.14.0->tensorflow-gpu==1.14.0) (3.1.1)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<1.15.0,>=1.14.0->tensorflow-gpu==1.14.0) (0.15.6)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.14.0

Import Libraries

Note: https://python-future.org/quickstart.html

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.utils import to_categorical, plot_model
from keras.datasets import mnist

from cloudmesh.common.StopWatch import StopWatch

Using TensorFlow backend.

Pre-process data

Video

Load data

First we load the data from the built-in MNIST dataset in Keras.

StopWatch.start("data-load")
(x_train, y_train), (x_test, y_test) = mnist.load_data()
StopWatch.stop("data-load")
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step

Identify Number of Classes

As this is a number classification problem, we need to know how many classes there are, so we count the number of unique labels.

num_labels = len(np.unique(y_train))

Convert Labels To One-Hot Vector

Exercise MNIST_V1.0.0: Understand what a one-hot vector is.

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
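
To make Exercise MNIST_V1.0.0 concrete, here is a minimal, self-contained sketch (not part of the original tutorial) showing what to_categorical produces; the small label array is made up purely for illustration.

import numpy as np
from keras.utils import to_categorical

# Hypothetical labels, for illustration only.
labels = np.array([0, 2, 9])

# Each label becomes a vector of length 10 with a single 1 at the label's index.
print(to_categorical(labels, num_classes=10))
# [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]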

Image Reshaping

The model is designed to take the data as a vector, so this reshaping is model dependent. Here we assume the image is square.

image_size = x_train.shape[1]
input_size = image_size * image_size

Resize and Normalize

The next step is to continue the reshaping so the data fits into a vector, and then to normalize it. Image values range from 0 to 255, so an easy way to normalize is to divide by the maximum value.

Exercise MNIST_V1.0.1: Suggest another way to normalize the data while preserving or improving the accuracy.

x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255
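
As a hint toward Exercise MNIST_V1.0.1, a minimal sketch of one common alternative (our own suggestion, not the tutorial's method) is to standardize the pixels to zero mean and unit variance instead of dividing by 255, reusing the training-set statistics for the test set.

# Alternative normalization (sketch): zero mean, unit variance.
mean = x_train.mean()
std = x_train.std()
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std   # reuse the training statistics for the test set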

Create a Keras Model

Video

Keras is a neural network library. The most important thing with Keras is the way we design the neural network.

In this model we have a couple of ideas to understand.

Exercise MNIST_V1.1.0: Find out what a dense layer is.

A simple model can be created using a Sequential instance in Keras. To this instance we add the following layers.

  1. Dense Layer
  2. Activation Layer (Softmax is the activation function)

A dense layer and the layer that follows it are fully connected. Here the number of hidden units is 64, and the first dense layer is followed by a second dense layer and an activation layer.

Exercise MNIST_V1.2.0: Find out what the use of an activation function is, and find out why softmax was used as the last layer.

batch_size = 4
hidden_units = 64

model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='mnist_v1.png', show_shapes=True)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:66: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:541: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4432: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 64)                50240
_________________________________________________________________
dense_2 (Dense)              (None, 10)                650
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0
=================================================================
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________

Image: mnist_v1.png (model diagram produced by plot_model)

Compile and Train

Video

A Keras model needs to be compiled before it can be used for training. In the compile function, you provide the optimizer you want to use, the metrics you expect, and the type of loss function you need.

Here we use the Adam optimizer, a widely used optimizer for neural networks.

Exercise MNIST_V1.3.0: Find 3 other optimizers used in neural networks.

The loss function we have used is categorical_crossentropy.

Exercise MNIST_V1.4.0: Find other loss functions provided in Keras. Your answer can be limited to one or more.

Once the model is compiled, the fit function is called, passing the number of epochs, the training data, and the batch size.

The batch size determines the number of elements used per minibatch when optimizing the loss function.

Note: Change the number of epochs, batch size and see what happens.

Exercise MNIST_V1.5.0: Figure out a way to plot the loss function value. You can use any method you like.

StopWatch.start("compile")
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
StopWatch.stop("compile")
StopWatch.start("train")
model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
StopWatch.stop("train")
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:793: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3576: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250:
add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1033: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

Epoch 1/1
60000/60000 [==============================] - 20s 336us/step - loss: 0.3717 - acc: 0.8934
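
One possible way to approach Exercise MNIST_V1.5.0 (a sketch, assuming matplotlib is available in the Colab environment): model.fit returns a History object whose history dictionary stores the loss per epoch, which can then be plotted.

import matplotlib.pyplot as plt

# Re-run training and keep the History object returned by fit.
history = model.fit(x_train, y_train, epochs=5, batch_size=batch_size)

# history.history['loss'] holds one loss value per epoch.
plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.title('MNIST MLP training loss')
plt.show()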

Testing

Now we can test the trained model. Use the evaluate function, passing the test data and the batch size; the accuracy and the loss value are returned.

Exercise MNIST_V1.6.0: Try to optimize the network by changing the number of epochs and the batch size, and record the best accuracy that you can gain.

StopWatch.start("test")
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))
StopWatch.stop("test")
10000/10000 [==============================] - 1s 138us/step

Test accuracy: 91.0%
StopWatch.benchmark()
+---------------------+------------------------------------------------------------------+
| Machine Attribute   | Value                                                            |
+---------------------+------------------------------------------------------------------+
| BUG_REPORT_URL      | "https://bugs.launchpad.net/ubuntu/"                             |
| DISTRIB_CODENAME    | bionic                                                           |
| DISTRIB_DESCRIPTION | "Ubuntu 18.04.3 LTS"                                             |
| DISTRIB_ID          | Ubuntu                                                           |
| DISTRIB_RELEASE     | 18.04                                                            |
| HOME_URL            | "https://www.ubuntu.com/"                                        |
| ID                  | ubuntu                                                           |
| ID_LIKE             | debian                                                           |
| NAME                | "Ubuntu"                                                         |
| PRETTY_NAME         | "Ubuntu 18.04.3 LTS"                                             |
| PRIVACY_POLICY_URL  | "https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" |
| SUPPORT_URL         | "https://help.ubuntu.com/"                                       |
| UBUNTU_CODENAME     | bionic                                                           |
| VERSION             | "18.04.3 LTS (Bionic Beaver)"                                    |
| VERSION_CODENAME    | bionic                                                           |
| VERSION_ID          | "18.04"                                                          |
| cpu_count           | 2                                                                |
| mac_version         |                                                                  |
| machine             | ('x86_64',)                                                      |
| mem_active          | 973.8 MiB                                                        |
| mem_available       | 11.7 GiB                                                         |
| mem_free            | 5.1 GiB                                                          |
| mem_inactive        | 6.3 GiB                                                          |
| mem_percent         | 8.3%                                                             |
| mem_total           | 12.7 GiB                                                         |
| mem_used            | 877.3 MiB                                                        |
| node                | ('8281485b0a16',)                                                |
| platform            | Linux-4.14.137+-x86_64-with-Ubuntu-18.04-bionic                  |
| processor           | ('x86_64',)                                                      |
| processors          | Linux                                                            |
| python              | 3.6.8 (default, Jan 14 2019, 11:02:34)                           |
|                     | [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]      |
| release             | ('4.14.137+',)                                                   |
| sys                 | linux                                                            |
| system              | Linux                                                            |
| user                |                                                                  |
| version             | #1 SMP Thu Aug 8 02:47:02 PDT 2019                               |
| win_version         |                                                                  |
+---------------------+------------------------------------------------------------------+
+-----------+-------+---------------------+-----+-------------------+------+--------+-------------+-------------+
| timer     | time  | start               | tag | node              | user | system | mac_version | win_version |
+-----------+-------+---------------------+-----+-------------------+------+--------+-------------+-------------+
| data-load | 1.335 | 2019-09-27 13:37:41 |     | ('8281485b0a16',) |      | Linux  |             |             |
| compile   | 0.047 | 2019-09-27 13:37:43 |     | ('8281485b0a16',) |      | Linux  |             |             |
| train     | 20.58 | 2019-09-27 13:37:43 |     | ('8281485b0a16',) |      | Linux  |             |             |
| test      | 1.393 | 2019-09-27 13:38:03 |     | ('8281485b0a16',) |      | Linux  |             |             |
+-----------+-------+---------------------+-----+-------------------+------+--------+-------------+-------------+

timer,time,starttag,node,user,system,mac_version,win_version
data-load,1.335,None,('8281485b0a16',),,Linux,,
compile,0.047,None,('8281485b0a16',),,Linux,,
train,20.58,None,('8281485b0a16',),,Linux,,
test,1.393,None,('8281485b0a16',),,Linux,,

Final Note

This program can be considered the hello world program of deep learning. The objective of this exercise is not to teach you the depths of deep learning, but to teach you the basic concepts needed to design a simple network that solves a problem. Before running the whole code, read all the instructions that precede each code section. Solve all the problems noted in bold text with the Exercise keyword (Exercise MNIST_V1.0 - MNIST_V1.6). Write your answers and submit a PDF following Assignment 5. Include the code or observations you made for those sections.

Reference:

Mnist Database

Advanced Deep Learning Models

MNIST Deep Learning

8.2.8 - Sports

Big Data and Sports.

Sports with Big Data Applications {#sec:534-week7}

E534 2019 Big Data Applications and Analytics Sports Informatics Part I (Unit 32) Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Unit 32

Unit Summary (Part I, Unit 32): This unit discusses baseball, starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports such as basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than many player phenomena in other sports. We discuss the Performance-Dollar relationship, including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.

Lesson Summaries

BDAA 32.1 - E534 Sports - Introduction and Sabermetrics (Baseball Informatics) Lesson

Introduction to all Sports Informatics, Moneyball The 2002-2003 Oakland Athletics, Diamond Dollars economic model of baseball, Performance - Dollar relationship, Value of a Win.

BDAA 32.2 - E534 Sports - Basic Sabermetrics

Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data, OPS, wOBA, ERA, ERC, FIP, UZR.

BDAA 32.3 - E534 Sports - Wins Above Replacement

Wins above Replacement WAR, Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another, Sabermetrics Example, Summary of Sabermetrics.

Unit 33

E534 2019 Big Data Applications and Analytics Sports Informatics Part II (Unit 33) Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Unit Summary (Part II, Unit 33): This unit discusses ‘advanced sabermetrics’ covering advances possible from using video from PITCHf/X, FIELDf/X, HITf/X, COMMANDf/X and MLBAM.

BDAA 33.1 - E534 Sports - Pitching Clustering

A Big Data Pitcher Clustering method introduced by Vince Gennaro, Data from Blog and video at 2013 SABR conference

BDAA 33.2 - E534 Sports - Pitcher Quality

Results of optimizing match ups, Data from video at 2013 SABR conference.

BDAA 33.3 - E534 Sports - PITCHf/X

Examples of use of PITCHf/X.

BDAA 33.4 - E534 Sports - Other Video Data Gathering in Baseball

FIELDf/X, MLBAM, HITf/X, COMMANDf/X.

Unit 34

E534 2019 Big Data Applications and Analytics Sports Informatics Part III (Unit 34). Section Summary (Parts I, II, III): Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Unit Summary (Part III, Unit 34): We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Lesson Summaries

BDAA 34.1 - E534 Sports - Wearables

Consumer Sports, Stake Holders, and Multiple Factors.

BDAA 34.2 - E534 Sports - Soccer and the Olympics

Soccer, Tracking Players and Balls, Olympics.

BDAA 34.3 - E534 Sports - Spatial Visualization in NFL and NBA

NFL, NBA, and Spatial Visualization.

BDAA 34.4 - E534 Sports - Tennis and Horse Racing

Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.

8.2.9 - Deep Learning (Cont. I)

Introduction to Deep Learning (cont.) Part I

Introduction to Deep Learning Part I

E534 2019 BDAA DL Section Intro Unit: E534 2019 Big Data Applications and Analytics Introduction to Deep Learning Part I (Unit Intro) Section Summary

This section covers the growing importance of the use of Deep Learning in Big Data Applications and Analytics. The Intro Unit is an introduction to the technology with incidental examples. It includes an introduction to the laboratory where we use Keras and Tensorflow. The Tech Unit covers the deep learning technology in more detail. The Application Units cover deep learning applications at different levels of sophistication.

Intro Unit Summary

This unit is an introduction to deep learning with four major lessons

Optimization

Lesson Summaries. Optimization: Overview of Optimization. The Opt lesson overviews optimization with a focus on issues of importance for deep learning. It gives a quick review of Objective Functions, Local Minima (Optima), Annealing, the idea that everything is an optimization problem (with examples), Examples of Objective Functions, Greedy Algorithms, Distances in funny spaces, Discrete or Continuous Parameters, Genetic Algorithms, and Heuristics.

ImageSlides

First Deep Learning Example

FirstDL: Your First Deep Learning Example. The FirstDL lesson gives the experience of running a non-trivial deep learning application. It goes through the identification of digits from the MNIST database using a Multilayer Perceptron with Keras+Tensorflow running on Google Colab.

ImageSlides

Deep Learning Basics

DLBasic: Basic Terms Used in Deep Learning DLBasic lesson reviews important Deep Learning topics including Activation: (ReLU, Sigmoid, Tanh, Softmax), Loss Function, Optimizer, Stochastic Gradient Descent, Back Propagation, One-hot Vector, Vanishing Gradient, Hyperparameter

ImageSlides

Deep Learning Types

DLTypes: Types of Deep Learning: Summaries DLtypes Lesson reviews important Deep Learning neural network architectures including Multilayer Perceptron, CNN Convolutional Neural Network, Dropout for regularization, Max Pooling, RNN Recurrent Neural Networks, LSTM: Long Short Term Memory, GRU Gated Recurrent Unit, (Variational) Autoencoders, Transformer and Sequence to Sequence methods, GAN Generative Adversarial Network, (D)RL (Deep) Reinforcement Learning.

ImageSlides

8.2.10 - Deep Learning (Cont. II)

Introduction to Deep Learning (cont.) Part II

Introduction to Deep Learning Part II: Applications

This section covers the growing importance of the use of Deep Learning in Big Data Applications and Analytics. The Intro Unit is an introduction to the technology with incidental examples. The MNIST Unit covers an example on Google Colaboratory. The Technology Unit covers deep learning approaches in more detail than the Intro Unit. The Application Units cover deep learning applications at different levels of sophistication.

Applications of Deep Learning Unit Summary: This unit is an introduction to deep learning applications with currently 7 lessons.

Recommender: Overview of Recommender Systems

Recommender engines used to be dominated by collaborative filtering using matrix factorization and k-nearest neighbor approaches. Large systems like YouTube and Netflix now use deep learning. We look at systems like Spotify that use multiple sources of information.

ImageSlides

Retail: Overview of AI in Retail Sector (e-commerce)

The retail sector can use AI in Personalization, Search and Chatbots. Retailers must adopt AI to survive. We also discuss how to be a seller on Amazon.

ImageSlides

RideHailing: Overview of AI in Ride Hailing Industry (Uber, Lyft, Didi)

The Ride Hailing industry will grow as it becomes the main mobility method for many customers. Its technology investment includes deep learning for matching drivers and passengers. There is a huge overlap with the larger area of AI in transportation.

ImageSlides

SelfDriving: Overview of AI in Self (AI-Assisted) Driving cars

The automobile industry needs to remake itself into mobility companies. The basic automotive market is flat to down, but AI can improve productivity. The lesson also discusses electric vehicles and drones.

ImageSlides

Imaging: Overview of Scene Understanding

Imaging is an area where convolutional neural nets and deep learning have made amazing progress; all aspects of imaging are now dominated by deep learning. We discuss the impact of ImageNet in detail.

ImageSlides

MainlyMedicine: Overview of AI in Health and Telecommunication

The Telecommunication industry has little traditional growth to look forward to. It can use AI in its operations and exploit the trove of Big Data it possesses. Medicine has many breakthrough opportunities, but progress is hard, partly due to data privacy restrictions. Traditional Bioinformatics areas progress, but slowly; pathology is based on imagery and is making much better progress with deep learning.

ImageSlides

BankingFinance: Overview of Banking and Finance

The FinTech sector has huge investments (larger than the other applications we studied), and we can expect all aspects of Banking and Finance to be remade with online digital Banking as a Service. It is doubtful that traditional banks will thrive.

ImageSlides

8.2.11 - Introduction to Deep Learning (III)

Introduction to Deep Learning (III)

The use of deep learning algorithms is one of the most in-demand skills of this decade and the coming one. Providing hands-on experience with deep learning applications is one of the main goals of this lecture series. Let's get started.

Deep Learning Algorithm Part 1

In this part of the lecture series, the idea is to provide an understanding of the usage of various deep learning algorithms. In this lesson we talk about different algorithms in the deep learning world, discussing the multi-layer perceptron (MLP) and convolutional neural networks (CNN). We use the MNIST classification problem and solve it with both an MLP and a CNN.

ImageSlides

Deep Learning Algorithms Part 2

In this lesson, we continue our study of deep learning algorithms. We use Recurrent Neural Network (RNN) related examples to showcase how an RNN can be applied to MNIST classification.

ImageSlides

Deep Learning Algorithms Part 3

The CNN is one of the most prominent algorithms used in the deep learning world in the last decade. A lot of applications have been built using CNNs, and most of these applications deal with images, videos, and similar data. In this lesson we continue the lesson on convolutional neural networks and discuss a brief history of the CNN.

ImageSlides

Deep Learning Algorithms Part 4

In this lesson we continue our study of the CNN by understanding how historical findings supported the rise of Convolutional Neural Networks. We also discuss why the CNN has been used for applications in many different fields.

ImageSlides

Deep Learning Algorithms Part 5

In this lesson we discuss auto-encoders. This is one of the most widely used deep learning models for signal denoising and image denoising. Here we show how an auto-encoder can be used for such tasks.

ImageSlides

Deep Learning Algorithms Part 6

In this lesson we discuss one of the most famous deep neural network architectures, the Generative Adversarial Network (GAN). This deep learning model has the capability of generating new outputs from existing knowledge. A GAN model is like a counterfeiter who keeps improving in order to generate the best counterfeits.

ImageSlides

Additional Material

We have included more information on different types of deep neural networks and their usage. A summary of all the topics discussed under deep learning can be found in the following slide deck. Please refer to it for more information. Some of this information can help with writing term papers and projects.

ImageSlides

8.2.12 - Cloud Computing

Cloud Computing

E534 Cloud Computing Unit

:orange_book: Full Slide Deck https://drive.google.com/open?id=1e61jrgTSeG8wQvQ2v6Zsp5AA31KCZPEQ

This page https://docs.google.com/document/d/1D8bEzKe9eyQfbKbpqdzgkKnFMCBT1lWildAVdoH5hYY/edit?usp=sharing

Overall Summary

Video Video: https://drive.google.com/open?id=1Iq-sKUP28AiTeDU3cW_7L1fEQ2hqakae

:orange_book: Slides https://drive.google.com/open?id=1MLYwAM6MrrZSKQjKm570mNtyNHiWSCjC

Defining Clouds I:

Video Video https://drive.google.com/open?id=15TbpDGR2VOy5AAYb_o4740enMZKiVTSz

:orange_book: Slides https://drive.google.com/open?id=1CMqgcpNwNiMqP8TZooqBMhwFhu2EAa3C

  1. Basic definition of cloud and two very simple examples of why virtualization is important.
  2. How clouds are situated wrt HPC and supercomputers
  3. Why multicore chips are important
  4. Typical data center

Defining Clouds II:

Video Video https://drive.google.com/open?id=1BvJCqBQHLMhrPrUsYvGWoq1nk7iGD9cd

:orange_book: Slides https://drive.google.com/open?id=1_rczdp74g8hFnAvXQPVfZClpvoB_B3RN

  1. Service-oriented architectures: Software services as Message-linked computing capabilities
  2. The different aaS’s: Network, Infrastructure, Platform, Software
  3. The amazing services that Amazon AWS and Microsoft Azure have
  4. Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices

Defining Clouds III:

Video Video https://drive.google.com/open?id=1MjIU3N2PX_3SsYSN7eJtAlHGfdePbKEL

:orange_book: Slides https://drive.google.com/open?id=1cDJhE86YRAOCPCAz4dVv2ieq-4SwTYQW

  1. Cloud Market Share
  2. How important are they?
  3. How much money do they make?

Virtualization:

Video Video https://drive.google.com/open?id=1-zd6wf3zFCaTQFInosPHuHvcVrLOywsw

:orange_book: Slides https://drive.google.com/open?id=1_-BIAVHSgOnWQmMfIIC61wH-UBYywluO

  1. Virtualization Technologies, Hypervisors and the different approaches
  2. KVM Xen, Docker and Openstack

Cloud Infrastructure I:

Video Video https://drive.google.com/open?id=1CIVNiqu88yeRkeU5YOW3qNJbfQHwfBzE

:orange_book: Slides https://drive.google.com/open?id=11JRZe2RblX2MnJEAyNwc3zup6WS8lU-V

  1. Comments on trends in the data center and its technologies
  2. Clouds physically across the world
  3. Green computing
  4. Amount of world’s computing ecosystem in clouds

Cloud Infrastructure II:

Video Videos https://drive.google.com/open?id=1yGR0YaqSoZ83m1_Kz7q7esFrrxcFzVgl

:orange_book: Slides https://drive.google.com/open?id=1L6fnuALdW3ZTGFvu4nXsirPAn37ZMBEb

  1. Gartner hypecycle and priority matrix on Infrastructure Strategies and Compute Infrastructure
  2. Containers compared to virtual machines
  3. The emergence of artificial intelligence as a dominant force

Cloud Software:

Video Video https://drive.google.com/open?id=14HISqj17Ihom8G6v9KYR2GgAyjeK1mOp

:orange_book: Slides https://drive.google.com/open?id=10TaEQE9uEPBFtAHpCAT_1akCYbvlMCPg

  1. HPC-ABDS with over 350 software packages and how to use each of 21 layers
  2. Google’s software innovations
  3. MapReduce in pictures
  4. Cloud and HPC software stacks compared
  5. Components need to support cloud/distributed system programming

Cloud Applications I: Research applications

Video Video https://drive.google.com/open?id=11zuqeUbaxyfpONOmHRaJQinc4YSZszri

:orange_book: Slides https://drive.google.com/open?id=1hUgC82FLutp32rICEbPJMgHaadTlOOJv

  1. Clouds in science, where the area is called cyberinfrastructure

Cloud Applications II: Few key types

Video Video https://drive.google.com/open?id=1S2-MgshCSqi9a6_tqEVktktN4Nf6Hj4d

:orange_book: Slides https://drive.google.com/open?id=1KlYnTZgRzqjnG1g-Mf8NTvw1k8DYUCbw

  1. Internet of Things
  2. Different types of MapReduce

Parallel Computing in Pictures

Video Video https://drive.google.com/open?id=1LSnVj0Vw2LXOAF4_CMvehkn0qMIr4y4J

:orange_book: Slides https://drive.google.com/open?id=1IDozpqtGbTEzANDRt4JNb1Fhp7JCooZH

  1. Some useful analogies and principles
  2. Society and Building Hadrian’s wall

Parallel Computing in real world

Video Video https://drive.google.com/open?id=1d0pwvvQmm5VMyClm_kGlmB79H69ihHwk

:orange_book: Slides https://drive.google.com/open?id=1aPEIx98aDYaeJS-yY1JhqqnPPJbizDAJ

  1. Single Program/Instruction Multiple Data SIMD SPMD
  2. Parallel Computing in general
  3. Big Data and Simulations Compared
  4. What is hard to do?

Cloud Storage:

Video Video https://drive.google.com/open?id=1ukgyO048qX0uZ9sti3HxIDGscyKqeCaB

:orange_book: Slides https://drive.google.com/open?id=1rVRMcfrpFPpKVhw9VZ8I72TTW21QxzuI

  1. Cloud data approaches
  2. Repositories, File Systems, Data lakes

HPC and Clouds: The Branscomb Pyramid

Video Video https://drive.google.com/open?id=15rrCZ_yaMSpQNZg1lBs_YaOSPw1Rddog

:orange_book: Slides https://drive.google.com/open?id=1JRdtXWWW0qJrbWAXaHJHxDUZEhPCOK_C

  1. Supercomputers versus clouds
  2. Science Computing Environments

Comparison of Data Analytics with Simulation:

Video Video https://drive.google.com/open?id=1wmt7MQLz3Bf2mvLN8iHgXFHiuvGfyRKr

:orange_book: Slides https://drive.google.com/open?id=1vRv76LerhgJKUsGosXLVKq4s_wDqFlK4

  1. Structure of different applications for simulations and Big Data
  2. Software implications
  3. Languages

The Future:

Video Video https://drive.google.com/open?id=1A20g-rTYe0EKxMSX0HI4D8UyUDcq9IJc

:orange_book: Slides https://drive.google.com/open?id=1_vFA_SLsf4PQ7ATIxXpGPIPHawqYlV9K

  1. Gartner cloud computing hypecycle and priority matrix
  2. Hyperscale computing
  3. Serverless and FaaS
  4. Cloud Native
  5. Microservices

Fault Tolerance

Video Video https://drive.google.com/open?id=11hJA3BuT6pS9Ovv5oOWB3QOVgKG8vD24

:orange_book: Slides https://drive.google.com/open?id=1oNztdHQPDmj24NSGx1RzHa7XfZ5vqUZg

8.2.13 - Introduction to Cloud Computing

Introduction to Cloud Computing

Introduction to Cloud Computing

This introduction to Cloud Computing covers all aspects of the field drawing on industry and academic advances. It makes use of analyses from the Gartner group on future Industry trends. The presentation is broken into 21 parts starting with a survey of all the material covered. Note this first part is A while the substance of the talk is in parts B to U.

Introduction - Part A {#s:cloud-fundamentals-a}

  • Parts B to D define cloud computing, its key concepts and how it is situated in the data center space
  • The next part E reviews virtualization technologies comparing containers and hypervisors
  • Part F is the first on Gartner’s Hypecycles and especially those for emerging technologies in 2017 and 2016
  • Part G is the second on Gartner’s Hypecycles with Emerging Technologies hypecycles and the Priority matrix at selected times 2008-2015
  • Parts H and I cover Cloud Infrastructure with Comments on trends in the data center and its technologies and the Gartner hypecycle and priority matrix on Infrastructure Strategies and Compute Infrastructure
  • Part J covers Cloud Software with HPC-ABDS(High Performance Computing enhanced Apache Big Data Stack) with over 350 software packages and how to use each of its 21 layers
  • Part K is first on Cloud Applications covering those from industry and commercial usage patterns from NIST
  • Part L is second on Cloud Applications covering those from science, where the area is called cyberinfrastructure; we look at the science usage pattern from NIST
  • Part M is third on Cloud Applications covering the characterization of applications using the NIST approach.
  • Part N covers Clouds and Parallel Computing and compares Big Data and Simulations
  • Part O covers Cloud storage: Cloud data approaches: Repositories, File Systems, Data lakes
  • Part P covers HPC and Clouds with The Branscomb Pyramid and Supercomputers versus clouds
  • Part Q compares Data Analytics with Simulation with application and software implications
  • Part R compares Jobs from Computer Engineering, Clouds, Design and Data Science/Engineering
  • Part S covers the Future with Gartner cloud computing hypecycle and priority matrix, Hyperscale computing, Serverless and FaaS, Cloud Native and Microservices
  • Part T covers Security and Blockchain
  • Part U covers fault-tolerance

This lecture describes the contents of the following 20 parts (B to U).

Introduction - Part B - Defining Clouds I {#s:cloud-fundamentals-b}

B: Defining Clouds I

  • Basic definition of cloud and two very simple examples of why virtualization is important.
  • How clouds are situated wrt HPC and supercomputers
  • Why multicore chips are important
  • Typical data center

Introduction - Part C - Defining Clouds II {#s:cloud-fundamentals-c}

C: Defining Clouds II

  • Service-oriented architectures: Software services as Message-linked computing capabilities
  • The different aaS’s: Network, Infrastructure, Platform, Software
  • The amazing services that Amazon AWS and Microsoft Azure have
  • Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices

Introduction - Part D - Defining Clouds III {#s:cloud-fundamentals-d}

D: Defining Clouds III

  • Cloud Market Share
  • How important are they?
  • How much money do they make?

Introduction - Part E - Virtualization {#s:cloud-fundamentals-e}

E: Virtualization

  • Virtualization Technologies, Hypervisors and the different approaches
  • KVM Xen, Docker and Openstack
  • Several web resources are listed

Introduction - Part F - Technology Hypecycle I {#s:cloud-fundamentals-f}

F:Technology Hypecycle I

  • Gartner’s Hypecycles and especially that for emerging technologies in 2017 and 2016
  • The phases of hypecycles
  • Priority Matrix with benefits and adoption time
  • Today clouds have gotten through the cycle (they have emerged), but features like blockchain, serverless and machine learning are still on the cycle
  • Hypecycle and Priority Matrix for Data Center Infrastructure 2017

Introduction - Part G - Technology Hypecycle II {#s:cloud-fundamentals-g}

G: Technology Hypecycle II

  • Emerging Technologies hypecycles and Priority matrix at selected times 2008-2015
  • Clouds star from 2008 to today
  • They are mixed up with transformational and disruptive changes
  • The route to Digital Business (2015)

Introduction - Part H - IaaS I {#s:cloud-fundamentals-h}

H: Cloud Infrastructure I

  • Comments on trends in the data center and its technologies
  • Clouds physically across the world
  • Green computing and fraction of world’s computing ecosystem in clouds

Introduction - Part I - IaaS II {#s:cloud-fundamentals-i}

I: Cloud Infrastructure II

  • Gartner hypecycle and priority matrix on Infrastructure Strategies and Compute Infrastructure
  • Containers compared to virtual machines
  • The emergence of artificial intelligence as a dominant force

Introduction - Part J - Cloud Software {#s:cloud-fundamentals-j}

J: Cloud Software

  • HPC-ABDS(High Performance Computing enhanced Apache Big Data Stack) with over 350 software packages and how to use each of 21 layers
  • Google’s software innovations
  • MapReduce in pictures
  • Cloud and HPC software stacks compared
  • Components need to support cloud/distributed system programming
  • Single Program/Instruction Multiple Data SIMD SPMD

Introduction - Part K - Applications I {#s:cloud-fundamentals-k}

K: Cloud Applications I

  • Big Data in Industry/Social media; a lot of the best examples have NOT been updated, so some slides are old but still make the correct points
  • Some of the business usage patterns from NIST

Introduction - Part L - Applications II {#s:cloud-fundamentals-l}

L: Cloud Applications II

  • Clouds in science, where the area is called cyberinfrastructure
  • The science usage pattern from NIST
  • Artificial Intelligence from Gartner

Introduction - Part M - Applications III {#s:cloud-fundamentals-m}

M: Cloud Applications III

  • Characterize Applications using NIST approach
  • Internet of Things
  • Different types of MapReduce

Introduction - Part N - Parallelism {#s:cloud-fundamentals-n}

N: Clouds and Parallel Computing

  • Parallel Computing in general
  • Big Data and Simulations Compared
  • What is hard to do?

Introduction - Part O - Storage {#s:cloud-fundamentals-o}

O: Cloud Storage

  • Cloud data approaches
  • Repositories, File Systems, Data lakes

Introduction - Part P - HPC in the Cloud {#s:cloud-fundamentals-p}

P: HPC and Clouds

  • The Branscomb Pyramid
  • Supercomputers versus clouds
  • Science Computing Environments

Introduction - Part Q - Analytics and Simulation {#s:cloud-fundamentals-q}

Q: Comparison of Data Analytics with Simulation

  • Structure of different applications for simulations and Big Data
  • Software implications
  • Languages

Introduction - Part R - Jobs {#s:cloud-fundamentals-r}

R: Availability of Jobs in different areas

  • Computer Engineering
  • Clouds
  • Design
  • Data Science/Engineering

Introduction - Part S - The Future {#s:cloud-fundamentals-s}

S: The Future

  • Gartner cloud computing hypecycle and priority matrix highlights:

    • Hyperscale computing
    • Serverless and FaaS
    • Cloud Native
    • Microservices

Introduction - Part T - Security {#s:cloud-fundamentals-t}

T: Security

  • CIO Perspective
  • Blockchain

Introduction - Part U - Fault Tolerance {#s:cloud-fundamentals-u}

U: Fault Tolerance

  • S3 Fault Tolerance
  • Application Requirements


8.2.14 - Assignments

Assignments

Assignments

Due dates are on Canvas. Click on the links to checkout the assignment pages.

8.2.14.1 - Assignment 1

Assignment 1

Assignment 1

In the first assignment you will write a technical document on the current technology trends that you are pursuing and the trends that you would like to follow. In addition, include some information about your background in programming and some projects that you have done. There is no strict format for this one, but we expect a 2-page written document. Please submit a PDF.

Go to Canvas

8.2.14.2 - Assignment 2

Assignment 2

Assignment 2

In the second assignment, you will be working on Week 1 (see @sec:534-week1) lecture videos. Objectives are as follows.

  1. Summarize what you have understood. (2 page)
  2. Select a subtopic that you are interested in and research on the current trends (1 page)
  3. Suggest ideas that could improve the existing work (imaginations and possibilities) (1 page)

For this assignment we expect a 4-page document. You can use a single-column format for this document. Make sure you write exactly 4 pages. For your research section, make sure you add citations for the sources that you refer to. If you have issues with how to do citations, you can reach out to a TA to learn how. We will try to include some chapters on how to do this in our handbook. Submissions are in PDF format only.

Go to Canvas

8.2.14.3 - Assignment 3

Assignment 3

Assignment 3

In the third assignment, you will be working on (see @sec:534-week3) lecture videos. Objectives are as follows.

  1. Summarize what you have understood. (2 page)
  2. Select a subtopic that you are interested in and research on the current trends (1 page)
  3. Suggest ideas that could improve the existing work (imaginations and possibilities) (1 page)

For this assignment we expect a 4-page document. You can use a single-column format for this document. Make sure you write exactly 4 pages. For your research section, make sure you add citations for the sources that you refer to. If you have issues with how to do citations, you can reach out to a TA to learn how. We will try to include some chapters on how to do this in our handbook. Submissions are in PDF format only.

Go to Canvas

8.2.14.4 - Assignment 4

Assignment 4

Assignment 4

In the fourth assignment, you will be working on (see @sec:534-week5) lecture videos. Objectives are as follows.

  1. Summarize what you have understood. (1 page)
  2. Select a subtopic that you are interested in and research on the current trends (0.5 page)
  3. Suggest ideas that could improve the existing work (imaginations and possibilities) (0.5 page)
  4. Summarize a specific video segment in the video lectures. To do this, mention the video lecture name and section identification number, and specify which range of minutes of the video lecture you focused on (2 pages).

For this assignment we expect a 4-page document. You can use a single-column format for this document. Make sure you write exactly 4 pages. For your research section, make sure you add citations for the sources that you refer to. If you have issues with how to do citations, you can reach out to a TA to learn how. We will try to include some chapters on how to do this in our handbook. Submissions are in PDF format only.

Go to Canvas

8.2.14.5 - Assignment 5

Assignment 5

Assignment 5

In the fifth assignment, you will be working on (see @sec:534-intro-to-dnn) lecture videos. Objectives are as follows.

Run the given sample code and try to answer the questions under the exercise tag.

Follow the Exercises labelled from MNIST_V1.0.0 - MNIST_V1.6.0

For this assignment all you have to do is just answer all the questions. You can use a single column format for this document. Submissions are in pdf format only.

Go to Canvas

8.2.14.6 - Assignment 6

Assignment 6

Assignment 6

In the sixth assignment, you will be working on (see @sec:534-week7) lecture videos. Objectives are as follows.

  1. Summarize what you have understood. (1 page)
  2. Select a subtopic that you are interested in and research on the current trends (0.5 page)
  3. Suggest ideas that could improve the existing work (imaginations and possibilities) (0.5 page)
  4. Summarize a specific video segment in the video lectures. To do this, mention the video lecture name and section identification number, and specify which range of minutes of the video lecture you focused on (2 pages).
  5. Pick a sport you like and showcase how Big Data can be used to improve the game (1 page). Use techniques discussed in the lecture videos and mention which lecture video refers to each technique.

For this assignment we expect a 5-page document. You can use a single-column format for this document. Make sure you write exactly 5 pages. For your research section, make sure you add citations for the sources that you refer to. If you have issues with how to do citations, you can reach out to a TA to learn how. We will try to include some chapters on how to do this in our handbook. Submissions are in PDF format only.

Go to Canvas

8.2.14.7 - Assignment 7

Assignment 7

Assignment 7

For a Complete Project

This project must contain the following details:

  1. The idea of the project,

The idea does not need to be novel, but a novel idea will carry more weight towards a higher grade. If you are trying to replicate an existing idea, you need to provide the original source you are referring to. If it is a GitHub project, you need to reference it and showcase what you have done to improve it, or what changes you made in applying the same idea to solve a different problem.

a) For a deep learning project, if you are using an existing model, you need to explain how you used that model to solve the problem you propose.
b) If you plan to improve an existing model, explain the suggested improvements.
c) If you are just using an existing model to solve an existing problem, you need to do an extensive benchmark. This kind of project carries fewer marks than a project like a) or b).

  2. Benchmark

There is no need to use a very large dataset. You can use Google Colab and train your network with a smaller dataset; think of a smaller dataset like MNIST. The UCI Machine Learning Repository is a very good place to find such a dataset: https://archive.ics.uci.edu/ml/index.php

Collect CPU, GPU and TPU benchmarks. This can be similar to what we did in our first deep learning tutorial; a minimal timing sketch is shown below.
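
A minimal timing sketch, following the cloudmesh StopWatch pattern from the MNIST tutorial earlier in this section; train_model is a placeholder for your own training code and the timer name is only an example.

from cloudmesh.common.StopWatch import StopWatch

def train_model():
    # Placeholder for your own training routine.
    ...

StopWatch.start("train")   # time the training step on the current runtime (CPU, GPU, or TPU)
train_model()
StopWatch.stop("train")

StopWatch.benchmark()      # prints machine attributes and a table with all timers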

  3. Final Report

The report must include diagrams or flowcharts describing the idea. Present benchmark results in graphs, not in tables. Use the IEEE template to write the document; LaTeX or Word is your choice, but submit a PDF file only. Template: https://www.ieee.org/conferences/publishing/templates.html

  4. Submission. Include:

  • IPython Notebook (it must run the whole process, training, testing, benchmark, etc., in Google Colab). Providing a Colab link is acceptable.

  • The report in PDF format

This is the expected structure of your project.

In the first phase, you need to submit the project proposal by Nov 10th. This must include the idea of the project with the approximate details that you plan to include in the project. It does not need to claim the final result; it is just a proposal. Add a flowchart or diagrams to explain your idea. Use a maximum of 2 pages for your content. There is no extension for this submission. If you cannot make it by Nov 10th, you need to inform the professor and decide how you plan to finish the class.

Anyone who fails to submit this by the deadline will fail to complete the course.

For a Term Paper

For a graduate student doing a term paper, the maximum possible grade is an A-. This rule does not apply to undergraduate students.

A term paper must contain a minimum of 8 pages and a maximum of 10 pages, using any of the templates given in the project report writing section (https://www.ieee.org/conferences/publishing/templates.html).

So when you are writing the proposal, you need to select an area in deep learning applications, trends or innovations.

Once the area is sorted out, write a two-page proposal on what you will be including in the paper. This can be a rough estimate of what you will be writing.

When writing the paper,

You will be reading online blogs, papers, articles, etc., trying to understand the concepts and then writing the paper. In this process, make sure not to copy and paste from online sources. If we find such activity, your paper will not be accepted. Do references properly and paraphrase when needed.

Keep these points in mind before you propose the idea that you want to write about. The term paper must include a minimum of 15 references, including the articles, blogs or papers that you have read. You need to reference them in the write-up, so be cautious in deciding the idea for the proposal.

The submission date is Nov 10th and there will be no extensions. If you cannot make it by this date, you need to discuss with the professor how you want to finish the class. Reach us via office hours or class meetings to sort out any issues.

Special Note on Team Projects

Each member must submit the report. The common sections must be the Abstract, Introduction, Overall process, results, etc. Each contributor must write a section on his or her contribution to the project. This content must add 50% to the report. For instance, if the paper size is 8 pages for an individual project, another 4 pages explaining each member's contribution must be added for a two-person project; if there are 4 members, the additional pages must be 8, that is, 2 additional pages per author. If the results and methods involve your contribution, clearly state it in a subsection, Author's Contribution.

8.2.14.8 - Assignment 8

Assignment 8

Assignment 8

For term paper submission, please send us the pdf file to your paper in the submission.

If you’re doing a project, please make sure that the code is committed to the repository created at the beginning of the class. You can commit all before submission. But make sure you submit the report, (pdf) and the code for the project. Please follow the report guidelines provided under Assignment 7.

Please note, there are no extensions for final project submission. If there is any issue, please discuss this with Professor or TA ahead of time.

Special Note on Team Projects

Each member must submit the report. The common sections must be the Abstract, Introduction, Overall process, results, etc. Each contributor must write a section on his or her contribution to the project. This content must add 50% to the report. For instance, if the paper size is 8 pages for an individual project, another 4 pages explaining each member's contribution must be added for a two-person project; if there are 4 members, the additional pages must be 8, that is, 2 additional pages per author. If the results and methods involve your contribution, clearly state it in a subsection, Author's Contribution. Good luck!

8.2.15 - Applications

Applications

We will discuss each of these applications in more detail.

8.2.15.1 - Big Data Use Cases Survey

Big Data Use Cases Survey

This section covers 51 values of X and an overall study of Big Data that emerged from a NIST (National Institute of Standards and Technology) study of Big Data. The section covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: the Definitions and Taxonomies Subgroup, the Reference Architecture Subgroup, the Security and Privacy Subgroup, the Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The 51 use cases collected in this process are briefly discussed with a classification of the source of parallelism and the high and low level computational structure. We describe the key features of this classification.

NIST Big Data Public Working Group

This unit covers the NIST Big Data Public Working Group (NBD-PWG) Process and summarizes the work of five subgroups: the Definitions and Taxonomies Subgroup, the Reference Architecture Subgroup, the Security and Privacy Subgroup, the Technology Roadmap Subgroup and the Requirements and Use Case Subgroup. The work of the latter is continued in the next two units.

Presentation Overview (45)

Introduction to NIST Big Data Public Working

The focus of the NBD-PWG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, secure reference architectures, and a technology roadmap. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables that enable big data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value to be added by big data service providers and data to flow between the stakeholders in a cohesive and secure manner.

Video Introduction (13:02)

Definitions and Taxonomies Subgroup

The focus is to gain a better understanding of the principles of Big Data. It is important to develop a consensus-based common language and vocabulary of terms used in Big Data across stakeholders from industry, academia, and government. In addition, it is also critical to identify the essential actors with their roles and responsibilities, and to subdivide them into components and sub-components describing how they interact and relate with each other according to their similarities and differences.

For Definitions: Compile terms used from all stakeholders regarding the meaning of Big Data from various standard bodies, domain applications, and diversified operational environments. For Taxonomies: Identify key actors with their roles and responsibilities from all stakeholders, categorize them into components and subcomponents based on their similarities and differences. In particular data Science and Big Data terms are discussed.

Video Taxonomies (7:42)

Reference Architecture Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus-based approach to orchestrate vendor-neutral, technology and infrastructure agnostic for analytics tools and computing environments. The goal is to enable Big Data stakeholders to pick-and-choose technology-agnostic analytics tools for processing and visualization in any computing platform and cluster while allowing value-added from Big Data service providers and the flow of the data between the stakeholders in a cohesive and secure manner. Results include a reference architecture with well defined components and linkage as well as several exemplars.

Video Architecture (10:05)

Security and Privacy Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus secure reference architecture to handle security and privacy issues across all stakeholders. This includes gaining an understanding of what standards are available or under development, as well as identifies which key organizations are working on these standards. The Top Ten Big Data Security and Privacy Challenges from the CSA (Cloud Security Alliance) BDWG are studied. Specialized use cases include Retail/Marketing, Modern Day Consumerism, Nielsen Homescan, Web Traffic Analysis, Healthcare, Health Information Exchange, Genetic Privacy, Pharma Clinical Trial Data Sharing, Cyber-security, Government, Military and Education.

Video Security (9:51)

Technology Roadmap Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus vision with recommendations on how Big Data should move forward, by performing a good gap analysis of the materials gathered from all other NBD subgroups. This includes setting standardization and adoption priorities through an understanding of what standards are available or under development as part of the recommendations. The tasks are to gather input from the NBD subgroups and study the taxonomies for the actors' roles and responsibilities, use cases and requirements, and the secure reference architecture; to gain an understanding of what standards are available or under development for Big Data; to perform a thorough gap analysis and document the findings; to identify what possible barriers may delay or prevent adoption of Big Data; and to document the vision and recommendations.

Video Technology (4:14)

Interfaces Subgroup

This subgroup is working on the following document: NIST Big Data Interoperability Framework: Volume 8, Reference Architecture Interface.

This document summarizes interfaces that are instrumental for the interaction with Clouds, Containers, and HPC systems to manage virtual clusters in support of the NIST Big Data Reference Architecture (NBDRA). The Representational State Transfer (REST) paradigm is used to define these interfaces, allowing easy integration and adoption by a wide variety of frameworks. This volume, Volume 8, uses the work performed by the NBD-PWG to identify objects instrumental for the NBDRA, which is introduced in the NBDIF: Volume 6, Reference Architecture.

This presentation was given at the 2nd NIST Big Data Public Working Group (NBD-PWG) Workshop in Washington DC in June 2017. It explains our thoughts on automatically deriving a reference architecture from the Reference Architecture Interface specifications directly from the document.

The workshop Web page is located at

The agenda of the workshop is as follows:

The webcast of the presentation is given below; you need to fast forward to the particular time.

You are welcome to view other presentations if you are interested.

Requirements and Use Case Subgroup

The focus is to form a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This includes gathering and understanding various use cases from diversified application domains. Tasks are to gather use case input from all stakeholders; derive Big Data requirements from each use case; analyze and prioritize a list of challenging general requirements that may delay or prevent adoption of Big Data deployment; develop a set of general patterns capturing the essence of the use cases (not done yet); and work with the Reference Architecture Subgroup to validate the requirements and the reference architecture by explicitly implementing some patterns based on use cases. The progress of gathering use cases (discussed in the next two units) and of requirements systemization is discussed.

Video Requirements (27:28)

51 Big Data Use Cases

This unit consists of one or more slides for each of the 51 use cases; typically the additional slides are associated with pictures. Each use case is identified with its source of parallelism and its high- and low-level computational structure. As each new classification topic is introduced we briefly discuss it, but the full discussion of topics is given in the following unit.

Presentation 51 Use Cases (100)

Government Use Cases

This covers Census 2010 and 2000 - Title 13 Big Data; National Archives and Records Administration Accession NARA, Search, Retrieve, Preservation; Statistical Survey Response Improvement (Adaptive Design) and Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design).

Video Government Use Cases (17:43)

Commercial Use Cases

This covers Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Mendeley - An International Network of Research; Netflix Movie Service; Web Search; IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Cargo Shipping; Materials Data for Manufacturing and Simulation driven Materials Genomics.

Video Commercial Use Cases (17:43)

Defense Use Cases

This covers Large Scale Geospatial Analysis and Visualization; Object identification and tracking from Wide Area Large Format (WALF) Imagery or Full Motion Video (FMV) - Persistent Surveillance; and Intelligence Data Processing and Analysis.

Video Defense Use Cases (15:43)

Healthcare and Life Science Use Cases

This covers Electronic Medical Record (EMR) Data; Pathology Imaging/digital pathology; Computational Bioimaging; Genomic Measurements; Comparative analysis for metagenomes and genomes; Individualized Diabetes Management; Statistical Relational Artificial Intelligence for Health Care; World Population Scale Epidemiological Study; Social Contagion Modeling for Planning, Public Health and Disaster Management and Biodiversity and LifeWatch.

Video Healthcare and Life Science Use Cases (30:11)

Deep Learning and Social Networks Use Cases

This covers Large-scale Deep Learning; Organizing large-scale, unstructured collections of consumer photos; Truthy: Information diffusion research from Twitter Data; Crowd Sourcing in the Humanities as Source for Big and Dynamic Data; CINET: Cyberinfrastructure for Network (Graph) Science and Analytics; and NIST Information Access Division analytic technology performance measurement, evaluations, and standards.

Video Deep Learning and Social Networks Use Cases (14:19)

Research Ecosystem Use Cases

This covers DataNet Federation Consortium DFC; The ‘Discinnet process’, metadata - big data global experiment; Semantic Graph-search on Scientific Chemical and Text-based Data; and Light source beamlines.

Video Research Ecosystem Use Cases (9:09)

Astronomy and Physics Use Cases

This covers Catalina Real-Time Transient Survey (CRTS): a digital, panoramic, synoptic sky survey; DOE Extreme Data from Cosmological Sky Survey and Simulations; Large Survey Data for Cosmology; Particle Physics: Analysis of LHC Large Hadron Collider Data: Discovery of Higgs particle and Belle II High Energy Physics Experiment.

Video Astronomy and Physics Use Cases (17:33)

Environment, Earth and Polar Science Use Cases

This covers the EISCAT 3D incoherent scatter radar system; ENVRI, Common Operations of Environmental Research Infrastructure; Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets; UAVSAR Data Processing, Data Product Delivery, and Data Services; NASA LARC/GSFC iRODS Federation Testbed; MERRA Analytic Services MERRA/AS; Atmospheric Turbulence - Event Discovery and Predictive Analytics; Climate Studies using the Community Earth System Model at DOE’s NERSC center; DOE-BER Subsurface Biogeochemistry Scientific Focus Area; and DOE-BER AmeriFlux and FLUXNET Networks.

Video Environment, Earth and Polar Science Use Cases (25:29)

Energy Use Case

This covers Consumption forecasting in Smart Grids.

Video Energy Use Case (4:01)

Features of 51 Big Data Use Cases

This unit discusses the categories used to classify the 51 use cases. These categories include the concepts used for parallelism and the low- and high-level computational structure. The first lesson is an introduction to all the categories and the following lessons give details of particular categories.

Presentation Features (43)

Summary of Use Case Classification

This discusses the concepts used for parallelism and for the low- and high-level computational structure.

  • Parallelism can be over: People (users or subjects), Decision makers; Items such as Images, EMR, Sequences, observations, contents of an online store; Sensors – Internet of Things; Events; (Complex) Nodes in a Graph; Simple nodes as in a learning network; Tweets, Blogs, Documents, Web Pages etc.; Files or data to be backed up, moved or assigned metadata; Particles/cells/mesh points.
  • Low-level computational types include: PP (Pleasingly Parallel); MR (MapReduce); MRStat; MRIter (Iterative MapReduce); Graph; Fusion; MC (Monte Carlo); and Streaming.
  • High-level computational types include: Classification; S/Q (Search and Query); Index; CF (Collaborative Filtering); ML (Machine Learning); EGO (Large Scale Optimizations); EM (Expectation Maximization); GIS; HPC; and Agents.
  • Patterns include: Classic Database; NoSQL; Basic processing of data as in backup or metadata; GIS; Host of Sensors processed on demand; Pleasingly parallel processing; HPC assimilated with observational data; Agent-based models; Multi-modal data fusion or Knowledge Management; and Crowd Sourcing.

Video Summary of Use Case Classification (23:39)

Database(SQL) Use Case Classification

This discusses classic (SQL) database approach to data handling with Search&Query and Index features. Comparisons are made to NoSQL approaches.

Video Database (SQL) Use Case Classification (11:13)

NoSQL Use Case Classification

This discusses NoSQL (compared with SQL in the previous lesson) together with HDFS, Hadoop and HBase. The Apache Big Data stack is introduced and further details of the comparison with SQL are given.

Video NoSQL Use Case Classification (11:20)

Other Use Case Classifications

This discusses a subset of use case features: GIS, sensors, and the support of data analysis and fusion by streaming data between filters.

Video Use Case Classifications I (12:42)

This discusses a subset of use case features: Pleasingly Parallel, MRStat, data assimilation, crowd sourcing, agents, data fusion and agents, EGO and security.

Video Use Case Classifications II (20:18)

This discusses a subset of use case features: Classification, Monte Carlo, Streaming, PP, MR, MRStat, MRIter and HPC(MPI), global and local analytics (machine learning), parallel computing, Expectation Maximization, graphs and Collaborative Filtering.

Video Use Case Classifications III (17:25)

Resources

Some of the links below may be outdated. Please notify us of any outdated links and, where possible, point us to the new ones.

8.2.15.2 - Cloud Computing

Cloud Computing

We describe the central role of parallel computing in Clouds and Big Data, which is decomposed into lots of ‘‘Little data’’ running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition. Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing are introduced. This includes virtualization and the important ‘as a Service’ components, and we go through several different definitions of cloud computing.

Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. Two simple examples of the value of clouds for enterprise applications are given, with a review of different views as to the nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). Features of Grid and cloud computing and of data are treated. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack, explaining how they are used.

Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models are discussed, followed by the Cloud Industry stakeholders with a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud, including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in the cloud follow. We describe the way users and data interact with a cloud system. Big Data processing from an application perspective, with commercial examples including eBay, concludes the section after a discussion of data system architectures.

Parallel Computing (Outdated)

We describe the central role of Parallel computing in Clouds and Big Data which is decomposed into lots of ‘‘Little data’’ running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition.

Presentation Parallel Computing (33)

Decomposition

We describe why parallel computing is essential with Big Data and distinguish parallelism over users from parallelism over the data in a problem. The general ideas behind data decomposition are given, followed by a few often whimsical examples dreamed up 30 years ago in the early heady days of parallel computing. These include scientific simulations, defense against missile attack, and computer chess. The basic problem of parallel computing – efficient coordination of separate tasks processing different data parts – is described with MPI and MapReduce as two approaches. The challenges of data decomposition in irregular problems are noted.

Parallel Computing in Society

This lesson from the past notes that one can view society as an approach to parallel linkage of people. The largest example given is the construction of a long wall such as Hadrian’s wall between England and Scotland. Different approaches to parallelism are given with formulae for the speedup and efficiency. The concepts of grain size (the size of the problem tackled by an individual processor) and coordination overhead are exemplified. This example also illustrates Amdahl’s law and the relation between data and processor topology. The lesson concludes with other examples from nature including collections of neurons (the brain) and ants.
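For reference, the speedup and efficiency mentioned here, together with Amdahl’s law, can be stated in their standard textbook form (a summary added here, not taken from the slides), where T(N) is the run time on N processors and f is the fraction of the work that is inherently sequential:

```latex
\[
  S(N) = \frac{T(1)}{T(N)}, \qquad
  \varepsilon(N) = \frac{S(N)}{N}, \qquad
  S(N) \le \frac{1}{f + (1-f)/N} \xrightarrow{\;N \to \infty\;} \frac{1}{f}.
\]
```

The last inequality is Amdahl’s law: however many processors (or masons) are added, the sequential fraction f caps the achievable speedup.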

Parallel Processing for Hadrian’s Wall

This lesson returns to Hadrian’s wall and uses it to illustrate advanced issues in parallel computing. First we describe the basic SPMD – Single Program Multiple Data – model. Then irregular but homogeneous and heterogeneous problems are discussed. Static and dynamic load balancing are needed. Inner parallelism (as in vector instructions or the multiple fingers of masons) and outer parallelism (typical data parallelism) are demonstrated. Parallel I/O for Hadrian’s wall is followed by a slide summarizing this quaint comparison between Big Data parallelism and the construction of a large wall.

Resources

Introduction

We discuss Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing. This includes virtualization and the important ‘as a Service’ components, and we go through several different definitions of cloud computing. Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. The unit concludes with two simple examples of the value of clouds for enterprise applications. Gartner also has specific predictions for cloud computing growth areas.

Presentation Introduction (45)

Cyberinfrastructure for E-Applications

This introduction describes Cyberinfrastructure or e-infrastructure and its role in solving the electronic implementation of any problem where e-moreorlessanything is another term for moreorlessanything-Informatics and generalizes early discussion of e-Science and e-Business.

What is Cloud Computing: Introduction

Cloud Computing is introduced with an operational definition involving virtualization and efficient large data centers that can rent computers in an elastic fashion. The role of services is essential – it underlies the capabilities being offered in the cloud. The four basic aaS’s – Software (SaaS), Platform (PaaS), Infrastructure (IaaS) and Network (NaaS) – are introduced, with Research aaS and other capabilities (for example Sensors aaS, discussed later) being built on top of these.

What and Why is Cloud Computing: Other Views I

This lesson contains 5 slides with diverse comments on ‘‘what is cloud computing’’ from the web.

Gartner’s Emerging Technology Landscape for Clouds and Big Data

This lesson gives Gartner’s projections around the futures of cloud and Big Data. We start with a review of hype charts and then go into detailed Gartner analyses of the Cloud and Big Data areas. Big data itself is at the top of the hype cycle, so almost by definition predictions of doom are emerging. Before too much excitement sets in, note that spinach is above clouds and Big Data in Google trends.

Simple Examples of use of Cloud Computing

This short lesson gives two examples of rather straightforward commercial applications of cloud computing. One is server consolidation for multiple Microsoft database applications, and the second is the benefit of scale, comparing Gmail to multiple smaller installations. It ends with some fiscal comments.

Value of Cloud Computing

Some comments on fiscal value of cloud computing.

Resources

Software and Systems

We cover different views as to nature of architecture and application for Cloud Computing. Then we discuss cloud software for the cloud starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities with examples from Amazon and academic studies. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack explaining how they are used.

Presentation Software and Systems (32)

What is Cloud Computing

This lesson gives some general remark of cloud systems from an architecture and application perspective.

Introduction to Cloud Software Architecture: IaaS and PaaS I

We cover different views as to the nature of architecture and application for Cloud Computing. Then we discuss cloud software for the cloud, starting at virtual machine management (IaaS) and the broad Platform (middleware) capabilities, with examples from Amazon and academic studies. We summarize the 21 layers and almost 300 software packages in the HPC-ABDS Software Stack, explaining how they are used.

Using the HPC-ABDS Software Stack

Using the HPC-ABDS Software Stack.

Resources

Architectures, Applications and Systems

We start with a discussion of Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models. We summarize a 2014 Gartner analysis of Cloud computing providers. This is followed by applications on the cloud including data intensive problems, comparison with high performance computing, science clouds and the Internet of Things. Remarks on Security, Fault Tolerance and Synchronicity issues in cloud follow.

scroll: Architectures (64)

Cloud (Data Center) Architectures

Some remarks on what it takes to build (in software) a cloud ecosystem, and why clouds are the data center of the future are followed by pictures and discussions of several data centers from Microsoft (mainly) and Google. The role of containers is stressed as part of modular data centers that trade scalability for fault tolerance. Sizes of cloud centers and supercomputers are discussed as is “green” computing.

Analysis of Major Cloud Providers

Gartner 2014 Analysis of leading cloud providers.

Use of Dropbox, iCloud, Box etc.

Cloud Applications I

This lesson discusses applications on the cloud, including data intensive problems and their comparison with high performance computing.

Science Clouds

Science Applications and Internet of Things.

Security

This short lesson discusses the need for security and issues in its implementation.

Comments on Fault Tolerance and Synchronicity Constraints

Clouds trade scalability for greater possibility of faults but here clouds offer good support for recovery from faults. We discuss both storage and program fault tolerance noting that parallel computing is especially sensitive to faults as a fault in one task will impact all other tasks in the parallel job.

Resources

Data Systems

We describe the way users and data interact with a cloud system. The unit concludes with the treatment of data in the cloud from an architecture perspective and Big Data Processing from an application perspective with commercial examples including eBay.

Presentation Data Systems (49)

The 10 Interaction scenarios (access patterns) I

The next 3 lessons describe the way users and data interact with the system.

The 10 Interaction scenarios. Science Examples

This lesson describes the way users and data interact with the system for some science examples.

Remaining general access patterns

This lesson describes the way users and data interact with the system for the final set of examples.

Video Access Patterns (11:36)

Data in the Cloud

Databases, file systems, object stores and NoSQL are discussed and compared. The way to build a modern data repository in the cloud is introduced.

Video Data in the Cloud (10:24)

Applications Processing Big Data

This lesson collects remarks on Big data processing from several sources: Berkeley, Teradata, IBM, Oracle and eBay with architectures and application opportunities.

Video Processing Big Data (8:45)

Resources

8.2.15.3 - e-Commerce and LifeStyle

e-Commerce and LifeStyle

Recommender systems operate under the hood of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs. Kaggle competitions have been held to improve the success of the Netflix and other recommender systems. Attention is paid to models that are used to compare how changes to the systems affect their overall performance. It is interesting that the humble ranking has become such a dominant driver of the world’s economy. More examples of recommender systems are given from Google News, retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites.

The formulation of recommendations in terms of points in a space or bag is given, where bags of item properties, user properties, rankings and users are useful. Detail is given on the basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items. Items are viewed as points in a space of users in item-based collaborative filtering. The cosine similarity is introduced, as are the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed. A simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions are given. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of a training set and a testing set are introduced, with the training set pre-labeled. Recommender systems are used to motivate clustering, with k-means based clustering methods used and their results examined in Plotviz. The original labelling is compared to the clustering results and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them, and the value of heuristic algorithms.

Recommender Systems

We introduce Recommender systems as an optimization technology used in a variety of applications and contexts online. They operate in the background of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs, to the benefit of both.

There follows an exploration of the Kaggle competition site, other recommender systems and Netflix, as well as competitions held to improve the success of the Netflix recommender system. Finally attention is paid to models that are used to compare how changes to the systems affect their overall performance. It is interesting how the humble ranking has become such a dominant driver of the world’s economy.

Presentation Lifestyle Recommender (45)

Recommender Systems as an Optimization Problem

We define a set of general recommender systems as matching of items to people or perhaps collections of items to collections of people where items can be other people, products in a store, movies, jobs, events, web pages etc. We present this as “yet another optimization problem”.

Video Recommender Systems I (8:06)

Recommender Systems Introduction

We give a general discussion of recommender systems and point out that they are particularly valuable in the long tail of items (to be recommended) that are not commonly known. We pose them as a rating system and relate them to information retrieval rating systems. We can contrast recommender systems based on user profile and context; the most familiar, collaborative filtering of others’ rankings; item properties; knowledge; and hybrid cases mixing some or all of these.

Video Recommender Systems Introduction (12:56)

Kaggle Competitions

We look at Kaggle competitions with examples from the web site. In particular we discuss an Irvine class project involving ranking jokes.

Video Kaggle Competitions: (3:36)

Warning: Please note that we typically do not accept projects using Kaggle data for this class. This class is not about winning a Kaggle competition, and if done wrong such a project does not fulfill the minimum requirements for this class. Please consult with the instructor.

Examples of Recommender Systems

We go through a list of 9 recommender systems from the same Irvine class.

Video Examples of Recommender Systems (1:00)

Netflix on Recommender Systems

We summarize some interesting points from a tutorial from Netflix, for whom everything is a recommendation. Rankings are given in multiple categories, and categories that reflect user interests are especially important. Criteria used include explicit user preferences, implicit preferences based on ratings, and hybrid methods, as well as freshness and diversity. Netflix tries to explain the rationale of its recommendations. We give some data on Netflix operations and some methods used in its recommender systems. We describe the famous Netflix Kaggle competition to improve its rating system. The analogy to maximizing click-through rate is drawn and the objectives of optimization are given.

Video Netflix on Recommender Systems (14:20)

Next we go through Netflix’s methodology in letting data speak for itself in optimizing the recommender engine. An example is given on choosing self-produced movies. A/B testing is discussed with examples showing how testing does allow optimizing of sophisticated criteria. This lesson is concluded by comments on Netflix technology and the full spectrum of issues that are involved, including user interface, data, A/B testing, systems and architectures. We comment on optimizing for a household rather than optimizing for individuals in a household.

Video Consumer Data Science (13:04)

Other Examples of Recommender Systems

We continue the discussion of recommender systems and their use in e-commerce. More examples are given from Google News, Retail stores and in depth Yahoo! covering the multi-faceted criteria used in deciding recommendations on web sites. Then the formulation of recommendations in terms of points in a space or bag is given.

Here bags of item properties, user properties, rankings and users are useful. Then we go into detail on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items.

Presentation Lifestyle Recommender (49)

We start with a quick recap of recommender systems from previous unit; what they are with brief examples.

Video Recap and Examples of Recommender Systems (5:48)

Examples of Recommender Systems

We give 2 examples in more detail: namely Google News and Markdown in Retail.

Video Examples of Recommender Systems (8:34)

Recommender Systems in Yahoo Use Case Example

We describe in greatest detail the methods used to optimize Yahoo web sites. There are two lessons discussing the general approach and a third lesson examines a particular personalized Yahoo page with its different components. We point out the different criteria that must be blended in making decisions; these criteria include analysis of what the user does after a particular page is clicked: is the user satisfied, and can that be quantified by purchase decisions, etc.? We need to choose articles, ads, modules, movies, users, updates, etc. to optimize metrics such as relevance score, CTR, revenue and engagement. These lessons stress that even though we have big data, the recommender data is sparse. We discuss the approach that involves both batch (offline) and on-line (real time) components.

Video Recap of Recommender Systems II (8:46)

Video Recap of Recommender Systems III (10:48)

Video Case Study of Recommender systems (3:21)

User-based nearest-neighbor collaborative filtering

Collaborative filtering is a core approach to recommender systems. There are user-based and item-based collaborative filtering, and here we discuss the user-based case. Similarities in user rankings allow one to predict their interests, and typically this is quantified by the Pearson correlation, used to statistically quantify correlations between users.
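A minimal sketch of the Pearson-correlation step is given below, assuming a tiny invented dictionary of user ratings (it is not the class data or code):

```python
# User-based collaborative filtering sketch: compute the Pearson correlation
# between one user and every other user over the items they have both rated.
# The ratings below are invented for illustration; 0 means "not rated".
import numpy as np

ratings = {
    "alice": np.array([5, 3, 4, 4]),
    "bob":   np.array([3, 1, 2, 3]),
    "carol": np.array([4, 3, 4, 3]),
}

def pearson(u, v):
    """Pearson correlation restricted to items rated by both users."""
    mask = (u > 0) & (v > 0)
    if mask.sum() < 2:
        return 0.0
    return float(np.corrcoef(u[mask], v[mask])[0, 1])

# the most similar users would then be used to predict alice's missing ratings
sims = {name: pearson(ratings["alice"], r)
        for name, r in ratings.items() if name != "alice"}
print(sims)
```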

Video User-based nearest-neighbor collaborative filtering I (7:20)

Video User-based nearest-neighbor collaborative filtering II (7:29)

Vector Space Formulation of Recommender Systems

We go through recommender systems thinking of them as formulated in a funny vector space. This suggests using clustering to make recommendations.

Video Vector Space Formulation of Recommender Systems new (9:06)

Resources

Item-based Collaborative Filtering and its Technologies

We move on to item-based collaborative filtering where items are viewed as points in a space of users. The Cosine Similarity is introduced, the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed.

Presentation Lifestyle Filtering (18)

Item-based Collaborative Filtering

We covered user-based collaborative filtering in the previous unit. Here we start by discussing memory-based real time and model-based offline (batch) approaches. Now we look at item-based collaborative filtering, where items are viewed in the space of users and the cosine measure is used to quantify distances. We discuss optimizations and how batch processing can help. We discuss different Likert ranking scales and issues with new items that do not have a significant number of rankings.
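A minimal sketch of the cosine measure on an invented user-item ratings matrix follows (illustrative only, not the class code):

```python
# Item-based collaborative filtering sketch: each item is a vector of the
# ratings it received from users, and two items are similar when the angle
# between their vectors is small (cosine close to 1). The matrix is invented.
import numpy as np

# rows = users, columns = items; 0 means "not rated"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

n_items = R.shape[1]
sim = np.array([[cosine(R[:, i], R[:, j]) for j in range(n_items)]
                for i in range(n_items)])
print(np.round(sim, 2))   # item-item similarity matrix
```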

Video Item Based Filtering (11:18)

Video k Nearest Neighbors and High Dimensional Spaces (7:16)

k-Nearest Neighbors and High Dimensional Spaces

We define the k Nearest Neighbor algorithm and present the Python software but do not use it. We give examples from Wikipedia and describe performance issues. This algorithm illustrates the curse of dimensionality. If items were real vectors in a low dimensional space, there would be faster solution methods.

Video k Nearest Neighbors and High Dimensional Spaces (10:03)

Recommender Systems - K-Neighbors

Next we provide some sample Python code for the k Nearest Neighbor algorithm and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of training and testing sets are introduced, with the training set pre-labelled. This lesson is adapted from the Python k Nearest Neighbor code found on the web associated with a book by Harrington on Machine Learning [??]. There are two data sets. First we consider a set of 4 2D vectors divided into two categories (clusters) and use the k=3 Nearest Neighbor algorithm to classify 3 test points. Second we consider a 3D dataset that has already been classified and show how to normalize it. In this lesson we just use Matplotlib to give 2D plots.

The lesson goes through an example of using the k NN classification algorithm by dividing the dataset into 2 subsets. One is a training set with an initial classification; the other contains the test points to be classified by k=3 NN using the training set. The code records the fraction of points whose classification differs from the input classification. One can experiment with different sizes of the two subsets. The Python implementation of the algorithm is analyzed in detail.
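A small k-nearest-neighbor sketch in the spirit of this lesson is shown below, using four invented labelled 2D points and k=3; it is not the book’s code:

```python
# k-NN classification sketch: label a test point by a majority vote of the
# k closest training points (Euclidean distance). Data are invented.
import numpy as np
from collections import Counter

train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ["A", "A", "B", "B"]

def knn_classify(x, X, y, k=3):
    dists = np.linalg.norm(X - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]         # indices of the k closest points
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

for test_point in ([0.2, 0.1], [0.9, 0.8], [0.5, 0.5]):
    print(test_point, "->", knn_classify(np.array(test_point), train_X, train_y))
```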

Plotviz

The clustering methods are used and their results examined in Plotviz. The original labelling is compared to the clustering results and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them, and the value of heuristic algorithms.

Files

Resources k-means

8.2.15.4 - Health Informatics

Health Informatics

Presentation Health Informatics (131)

This section starts by discussing general aspects of Big Data and Health, including data sizes and different areas such as genomics, EBI, radiology and the Quantified Self movement. We review the current state of health care and trends associated with it, including the increased use of telemedicine. We summarize an industry survey by GE and Accenture and an impressive exemplar Cloud-based medicine system from Potsdam. We give some details of big data in medicine. Some remarks on Cloud computing and Health focus on security and privacy issues.

We survey an April 2013 McKinsey report on the Big Data revolution in US health care; a Microsoft report in this area and a European Union report on how Big Data will allow patient centered care in the future. Examples are given of the Internet of Things, which will have great impact on health including wearables. A study looks at 4 scenarios for healthcare in 2032. Two are positive, one middle of the road and one negative. The final topic is Genomics, Proteomics and Information Visualization.

Big Data and Health

This lesson starts with general aspects of Big Data and Health, including a list of subareas where Big Data is important. Data sizes are given in radiology, genomics, personalized medicine, and the Quantified Self movement, with sizes and access for the European Bioinformatics Institute.

Video Big Data and Health (10:02)

Status of Healthcare Today

This covers trends in the costs and types of healthcare with low cost genomes and an aging population. Social media and the government Brain initiative are also covered.

Video Status of Healthcare Today (16:09)

Telemedicine (Virtual Health)

This describes increasing use of telemedicine and how we tried and failed to do this in 1994.

Video Telemedicine (8:21)

Medical Big Data in the Clouds

An impressive exemplar Cloud-based medicine system from Potsdam.

Video Medical Big Data in the Clouds (15:02)

Medical image Big Data

Video Medical Image Big Data (6:33)

Clouds and Health

Video Clouds and Health (4:35)

McKinsey Report on the big-data revolution in US health care

This lesson covers 9 aspects of the McKinsey report:

  • The convergence of multiple positive changes has created a tipping point for innovation;
  • Primary data pools are at the heart of the big data revolution in healthcare;
  • Big data is changing the paradigm: these are the value pathways;
  • Applying early successes at scale could reduce US healthcare costs by $300 billion to $450 billion;
  • Most new big-data applications target consumers and providers across pathways;
  • Innovations are weighted towards influencing individual decision-making levers;
  • Big data innovations use a range of public, acquired, and proprietary data types;
  • Organizations implementing a big data transformation should provide the leadership required for the associated cultural transformation;
  • Companies must develop a range of big data capabilities.

Video McKinsey Report (14:53)

Microsoft Report on Big Data in Health

This lesson identifies data sources as Clinical Data, Pharma & Life Science Data, Patient & Consumer Data, Claims & Cost Data and Correlational Data. Three approaches are Live data feed, Advanced analytics and Social analytics.

Video Microsoft Report on Big Data in Health (2:26)

EU Report on Redesigning health in Europe for 2020

This lesson summarizes an EU Report on Redesigning health in Europe for 2020. The power of data is seen as a lever for change in My Data, My decisions; Liberate the data; Connect up everything; Revolutionize health; and Include Everyone removing the current correlation between health and wealth.

Video EU Report on Redesigning health in Europe for 2020 (5:00)

Medicine and the Internet of Things

The Internet of Things will have great impact on health including telemedicine and wearables. Examples are given.

Video Medicine and the Internet of Things (8:17)

Extrapolating to 2032

A study looks at 4 scenarios for healthcare in 2032. Two are positive, one middle of the road and one negative.

Video Extrapolating to 2032 (15:13)

Genomics, Proteomics and Information Visualization

A study of an Azure application with an Excel frontend and a cloud BLAST backend starts this lesson. This is followed by a big data analysis of personal genomics and an analysis of a typical DNA sequencing analytics pipeline. The Protein Sequence Universe is defined and used to motivate Multi-Dimensional Scaling (MDS). Sammon’s method is defined and its use illustrated by a metagenomics example. Subtleties in the use of MDS include a monotonic mapping of the dissimilarity function. The application to the COG Proteomics dataset is discussed. We note that the MDS approach is related to the well-known chisq (least squares) method and some aspects of the nonlinear minimization of chisq are discussed.
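For reference, Sammon’s method in its standard textbook form (stated here as a summary, not copied from the lecture) minimizes a weighted stress between the input dissimilarities δij and the distances dij of the projected points:

```latex
\[
  E \;=\; \frac{1}{\sum_{i<j} \delta_{ij}}
          \sum_{i<j} \frac{\left(\delta_{ij} - d_{ij}\right)^{2}}{\delta_{ij}},
\]
```

which has exactly the weighted least-squares (chisq) form referred to above, with the weights emphasizing the preservation of small dissimilarities.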

Video Genomics, Proteomics and Information Visualization (6:56)

Next we continue the discussion of the COG Protein Universe introduced in the last lesson. It is shown how Proteomics clusters are clearly seen in the Universe browser. This motivates a side remark on different clustering methods applied to metagenomics. Then we discuss the Generative Topographic Map (GTM) method that can be used for dimension reduction when the original data is in a metric space; in this case it is faster than MDS, as the GTM computational complexity scales like N, not N squared as seen in MDS.

Examples are given of GTM including an application to topic models in Information Retrieval. Indiana University has developed a deterministic annealing improvement of GTM. 3 separate clusterings are projected for visualization and show very different structure emphasizing the importance of visualizing results of data analytics. The final slide shows an application of MDS to generate and visualize phylogenetic trees.

Video Genomics, Proteomics and Information Visualization I (10:33)

Video Genomics, Proteomics and Information Visualization: II (7:41)

Presentation Proteomics and Information Visualization (131)

Resources

8.2.15.5 - Overview of Data Science

Overview of Data Science

What is Big Data, Data Analytics and X-Informatics?

We start with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. The first unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline are covered. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

In the next unit, we continue the discussion of the data deluge with a focus on scientific research. We take a first peek at data from the Large Hadron Collider, considered later as physics Informatics, and give some biology examples. We discuss the implication of data for the scientific method, which is changing, with the data-intensive methodology joining observation, theory and simulation as basic methods. Two broad classes of data are the long tail of sciences: many users with individually modest data adding up to a lot; and a myriad of Internet connected devices – the Internet of Things.

We give an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft with new centers holding up to a million servers. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing with a comparison to supercomputing. Features of the data deluge are discussed with a salutary example where more data did better than more thought. Then comes data science and one part of it, data analytics, the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

Data Science generics and Commercial Data Deluge

We start with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. This unit offers a look at the phenomenon described as the Data Deluge, starting with its broad features. We then discuss data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline. Then more detail is given on the flood of data from Internet and Industry applications, with eBay and General Electric discussed in most detail.

Presentation Commercial Data Deluge (45)

What is X-Informatics and its Motto

This discusses trends that are driven by and accompany Big data. We give some key terms including data, information, knowledge, wisdom, data analytics and data science. We discuss how clouds running Data Analytics Collaboratively processing Big Data can solve problems in X-Informatics. We list many values of X you can defined in various activities across the world.

Jobs

Big data is especially important as there are so many related jobs. We illustrate this for both cloud computing and data science from reports by Microsoft and the McKinsey institute respectively. We show a plot from LinkedIn showing the rapid increase in the number of data science and analytics jobs as a function of time.

Data Deluge: General Structure

We look at some broad features of the data deluge, starting with the size of data in various areas, especially in science research. We give examples from the real world of the importance of big data and illustrate how it is integrated into an enterprise IT architecture. We give some views as to what characterizes Big Data and why data science is a science that is needed to interpret all the data.

Data Science: Process

We stress the DIKW pipeline: Data becomes information that becomes knowledge and then wisdom, policy and decisions. This pipeline is illustrated with Google maps and we show how complex the ecosystem of data, transformations (filters) and its derived forms is.

Data Deluge: Internet

We give examples of Big data from the Internet with Tweets, uploaded photos and an illustration of the vitality and size of many commodity applications.

Data Deluge: Business

We give examples including the Big Data that enables wind farms, city transportation, telephone operations, machines with health monitors, and the banking, manufacturing and retail industries, both online and offline in shopping malls. We give examples from eBay showing how analytics allows them to refine and improve the customer experience.

Resources

Data Deluge and Scientific Applications and Methodology

Overview of Data Science

We continue the discussion of the data deluge with a focus on scientific research. We take a first peek at data from the Large Hadron Collider, considered later as physics Informatics, and give some biology examples. We discuss the implication of data for the scientific method, which is changing, with the data-intensive methodology joining observation, theory and simulation as basic methods. We discuss the long tail of sciences: many users with individually modest data adding up to a lot. The last lesson emphasizes how everyday devices, the Internet of Things, are being used to create a wealth of data.

Presentation Methodology (22)

Science and Research

We look into more big data examples with a focus on science and research. We give examples from astronomy, genomics, radiology, particle physics and the discovery of the Higgs particle (covered in more detail in later lessons), and the European Bioinformatics Institute, and contrast these with Facebook and Walmart.

Implications for Scientific Method

We discuss the emergence of a new, fourth methodology for scientific research based on data driven inquiry. We contrast this with the third methodology, computation or simulation based discovery, which itself emerged some 25 years ago.

Long Tail of Science

There is big science such as particle physics where a single experiment has 3000 people collaborate! Then there are individual investigators who do not generate a lot of data each, but together they add up to Big Data.

Internet of Things

A final category of Big Data comes from the Internet of Things, where lots of small devices, such as smart phones, web cams and video games, collect and disseminate data and are controlled and coordinated in the cloud.

Resources

Clouds and Big Data Processing; Data Science Process and Analytics

Overview of Data Science

We give an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft with new centers holding up to a million servers. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing with a comparison to supercomputing.

We discuss features of the data deluge with a salutary example where more data did better than more thought. We introduce data science and one part of it, data analytics, the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

Presentation Clouds (35)

Clouds

We describe cloud data centers with their staggering size with up to a million servers in a single data center and centers built modularly from shipping containers full of racks. The benefits of Clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of Cloud computing and a comparison to supercomputing.

Aspect of Data Deluge

Data, information, intelligence algorithms, infrastructure, data structure, semantics and knowledge are related. The semantic web and Big Data are compared. We give an example where “More data usually beats better algorithms”. We discuss examples of intelligent big data and list 8 different types of data deluge.

Data Science Process

We describe and critique one view of the work of a data scientist. Then we discuss and contrast 7 views of the process needed to speed data through the DIKW pipeline.

Data Analytics

Presentation Data Analytics (30)

We stress the importance of data analytics, giving examples from several fields. We note that better analytics is as important as better computing and storage capability. In the second video we look at High Performance Computing in Science and Engineering: the Tree and the Fruit.

Resources

8.2.15.6 - Physics

Physics

This section starts by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of the shape of the signal, various backgrounds and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation as to why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random numbers with their generators and seeds lead to a discussion of the Binomial and Poisson distributions, and of Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

Looking for Higgs Particles

Bumps in Histograms, Experiments and Accelerators

This unit is devoted to Python and Java experiments looking at histograms of Higgs Boson production with various forms of shape of signal and various background and with various event totals. The lectures use Python but use of Java is described.

  • Presentation Higgs (20)

  • <{gitcode}/physics/mr-higgs/higgs-classI-sloping.py>

Particle Counting

We return to the particle case with the slides used in the introduction and stress that particles are often manifested as bumps in histograms and that those bumps need to be large enough to stand out from the background in a statistically significant fashion.

We give a few details on one LHC experiment ATLAS. Experimental physics papers have a staggering number of authors and quite big budgets. Feynman diagrams describe processes in a fundamental fashion.

Experimental Facilities

This lesson gives a small picture gallery of accelerators: accelerators, detection chambers and magnets in tunnels, and a large underground laboratory used for experiments where you need to be shielded from backgrounds like cosmic rays.

Resources

Looking for Higgs Particles: Python Event Counting for Signal and Background (Part 2)

This unit is devoted to Python experiments looking at histograms of Higgs Boson production with various forms of shape of signal and various background and with various event totals.

Files:

  • <{gitcode}/physics/mr-higgs/higgs-classI-sloping.py>
  • <{gitcode}/physics/number-theory/higgs-classIII.py>
  • <{gitcode}/physics/mr-higgs/higgs-classII-uniform.py>

Event Counting

We define event counting data collection environments. We discuss the Python and Java code to generate events according to a particular scenario (the important idea of Monte Carlo data). Here we use a sloping background plus either a Higgs particle generated similarly to the LHC observation or one observed with better resolution (smaller measurement error).
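A minimal sketch of this kind of Monte Carlo event generation is given below, assuming an illustrative sloping background over a mass window plus a Gaussian bump; the masses, widths and event counts are invented and are not the class values:

```python
# Monte Carlo pseudo-events: a linearly falling ("sloping") background plus a
# narrow Gaussian signal, filled into a histogram as in the class experiments.
# All numbers here are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1234)

def sloping_background(n, lo=110.0, hi=140.0):
    """Draw n masses from a linearly falling density on [lo, hi] via accept-reject."""
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi)
        if rng.uniform(0.0, 1.0) < (hi - x) / (hi - lo):   # falling acceptance
            out.append(x)
    return np.array(out)

background = sloping_background(20000)
signal = rng.normal(loc=126.0, scale=2.0, size=300)        # narrow Gaussian bump

plt.hist(np.concatenate([background, signal]), bins=60, range=(110, 140))
plt.xlabel("Mass (GeV)")
plt.ylabel("Events per bin")
plt.title("Sloping background plus Gaussian signal (illustrative)")
plt.show()
```

Re-running with fewer events or a larger width shows how easily the bump disappears into the statistical fluctuations of the background, which is the point of the exercises that follow.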

Monte Carlo

This uses Monte Carlo data both to generate data like the experimental observations and to explore the effect of changing the amount of data and changing the measurement resolution for the Higgs.

Resources

Random Variables, Physics and Normal Distributions

We introduce random variables and some simple principles of statistics and explain why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Java is currently not available in this unit.

  • Presentation Higgs (39)
  • <{gitcode}/physics/number-theory/higgs-classIII.py>

Statistics Overview and Fundamental Idea: Random Variables

We go through the many different areas of statistics covered in the Physics unit. We define the statistics concept of a random variable.

Physics and Random Variables

We describe the DIKW pipeline for the analysis of this type of physics experiment and go through details of the analysis pipeline for the LHC ATLAS experiment. We give examples of event displays showing the final state particles seen in a few events. We illustrate how physicists decide what’s going on with a plot of expected Higgs production experimental cross sections (probabilities) for signal and background.

Statistics of Events with Normal Distributions

We introduce Poisson and Binomial distributions and define independent identically distributed (IID) random variables. We give the law of large numbers defining the errors in counting and leading to Gaussian distributions for many things. We demonstrate this in Python experiments.
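A small sketch of the counting-error idea follows, assuming an illustrative expected bin content of 100 events; it demonstrates the sqrt(N) rule rather than reproducing the class experiments:

```python
# Repeat a counting "experiment" many times: the spread of the observed counts
# around the expected value is close to sqrt(N), the Gaussian/Poisson counting
# error. The expected count and number of repetitions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
expected = 100
counts = rng.poisson(expected, size=10000)    # 10,000 repeated experiments

print("mean count:        ", counts.mean())
print("standard deviation:", counts.std())
print("sqrt(expected) ~   ", np.sqrt(expected))   # roughly equal for large N
```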

Gaussian Distributions

We introduce the Gaussian distribution and give Python examples of the fluctuations in counting Gaussian distributions.

Using Statistics

We discuss the significance of a standard deviation and the role of biases and insufficient statistics, with a Python example of getting incorrect answers.

Resources

Random Numbers, Distributions and Central Limit Theorem

We discuss random numbers with their generators and seeds. The unit introduces the Binomial and Poisson distributions. Monte Carlo and accept-reject methods are discussed. The Central Limit Theorem and Bayes’ law conclude the discussion. Python and Java (for the student – not reviewed in class) examples and physics applications are given.

Files:

  • <{gitcode}/physics/calculated-dice-roll/higgs-classIV-seeds.py>

Generators and Seeds

We define random numbers and describe how to generate them on the computer, giving Python examples. We define the seed used to specify how to start the generation.
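A minimal sketch of generators and seeds using NumPy (illustrative only) is:

```python
# The same seed reproduces the same pseudo-random sequence; a different seed
# gives a different stream. Seeds and sample sizes are arbitrary choices.
import numpy as np

print(np.random.default_rng(42).uniform(size=3))   # reproducible sequence
print(np.random.default_rng(42).uniform(size=3))   # identical to the line above
print(np.random.default_rng(7).uniform(size=3))    # different seed, different stream
```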

Binomial Distribution

We define the binomial distribution and give LHC data as an example of where this distribution is valid.

Accept-Reject

We introduce an advanced method accept/reject for generating random variables with arbitrary distributions.
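A minimal accept-reject sketch for an arbitrary bounded density follows, using an invented triangular density on [0, 1]:

```python
# Accept-reject: propose x uniformly, accept it with probability f(x) / M where
# M bounds f from above; the accepted points follow f. The density is invented.
import numpy as np

rng = np.random.default_rng(2024)
f = lambda x: 2.0 * x          # triangular density on [0, 1]
M = 2.0                        # upper bound of f on [0, 1]

def accept_reject(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0.0, 1.0)              # proposal
        if rng.uniform(0.0, 1.0) < f(x) / M:   # accept with probability f(x)/M
            samples.append(x)
    return np.array(samples)

draws = accept_reject(100_000)
print("sample mean:", draws.mean(), "(exact mean of f is 2/3)")
```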

Monte Carlo Method

We define the Monte Carlo method, which in the typical case uses the accept/reject method to sample the distributions involved.

Poisson Distribution

We extend the Binomial to the Poisson distribution and give a set of amusing examples from Wikipedia.
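For reference, in standard notation the Poisson distribution arises as the limit of the Binomial when the number of trials n grows while np → λ stays fixed (textbook form, included here only as a summary):

```latex
\[
  P(k \mid n, p) = \binom{n}{k} p^{k} (1-p)^{\,n-k}
  \;\longrightarrow\;
  P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}
  \qquad (n \to \infty,\ np \to \lambda).
\]
```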

Central Limit Theorem

We introduce the Central Limit Theorem and give examples from Wikipedia.

Interpretation of Probability: Bayes v. Frequency

This lesson describes the difference between the Bayes and frequency views of probability. Bayes’s law of conditional probability is derived and applied to the Higgs example to enable information about the Higgs from multiple channels and multiple experiments to be accumulated.
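For reference, Bayes’ law in its usual form, with H a hypothesis (for example, the presence of a Higgs signal) and D the observed data, is

```latex
\[
  P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
\]
```

so combining independent channels or experiments amounts to multiplying their likelihoods into P(D | H) before normalizing.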

Resources

SKA – Square Kilometer Array

Professor Diamond, accompanied by Dr. Rosie Bolton from the SKA Regional Centre Project, gave a presentation at SC17 that takes the audience “into the deepest reaches of the observable universe as they describe the SKA’s international partnership that will map and study the entire sky in greater detail than ever before.”

A summary article about this effort is available at:

8.2.15.7 - Plotviz

Plotviz

NOTE: This is a legacy application; it has now been replaced by WebPlotViz, a web browser based visualization tool which provides added functionality.

We introduce Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can see structure in data. Although most Big data is higher dimensional than 3, all can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependency of Plotviz.

Using Plotviz Software for Displaying Point Distributions in 3D

We introduce Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can see structure in data. Although most Big data is higher dimensional than 3, all can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependency of Plotviz.

Presentation Plotviz (34)

Files:

Motivation and Introduction to use

The motivation for Plotviz is that the human eye is very good at pattern recognition and can see structure in data. Although most Big Data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D, and one can check analyses like clustering and/or see structure missed in a computer analysis. The motivation slides show some Cheminformatics examples. The use of Plotviz is started in slide 4 with a discussion of the input file, which is either simple text or, when more features (like colors) need to be specified, a richer XML syntax. Plotviz deals with points and their classification (clustering). Next the protein sequence browser in 3D shows the basic structure of the Plotviz interface. The next two slides explain the core 3D and 2D manipulations respectively. Note that all files used in the examples are available to students.

Presentation Motivation (7:58)

Example of Use I: Cube and Structured Dataset

Initially we start with a simple plot of 8 points – the corners of a cube in 3 dimensions – showing basic operations such as size/color/labels and the Legend of points. The second example shows a dataset (coming from GTM dimension reduction) with significant structure. This has .pviz and .txt versions that are compared.

Presentation Example I (9:45)

Example of Use II: Proteomics and Synchronized Rotation

This starts with an examination of a sample of Protein Universe Browser showing how one uses Plotviz to look at different features of this set of Protein sequences projected to 3D. Then we show how to compare two datasets with synchronized rotation of a dataset clustered in 2 different ways; this dataset comes from k Nearest Neighbor discussion.

Presentation Proteomics and Synchronized Rotation (9:14)

Example of Use III: More Features and larger Proteomics Sample

This starts by describing use of Labels and Glyphs and the Default mode in Plotviz. Then we illustrate sophisticated use of these ideas to view a large Proteomics dataset.

Presentation Larger Proteomics Sample (8:37)

Example of Use IV: Tools and Examples

This lesson starts by describing the Plotviz tools and then sets up two examples – Oil Flow and Trading – described in PowerPoint. It finishes with the Plotviz viewing of Oil Flow data.

Presentation Plotviz I (10:17)

Example of Use V: Final Examples

This starts with Plotviz looking at the Trading example introduced in the previous lesson and then examines solvent data. It finishes with two large biology examples with 446K and 100K points, each with over 100 clusters. We finish with remarks on the Plotviz software structure and how to download it. We also remind you that a picture is worth a thousand words.

Video Plotviz II (14:58)

Resources

Download

8.2.15.8 - Practical K-Means, Map Reduce, and Page Rank for Big Data Applications and Analytics

Practical K-Means, Map Reduce, and Page Rank for Big Data Applications and Analytics

We use the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices for sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the hill between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the parallel maps run sequentially. This simple 2-map version can be generalized to scalable parallelism. Python is used to calculate PageRank from the web linkage matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit is concluded by a calculation of PageRank for general web pages by extracting the secret from Google.

Video K-Means I (11:42)

Video K-Means II (11:54)

K-means in Practice

We introduce the k-means algorithm in a gentle fashion and describe its key features, including the dangers of local minima. A simple example from Wikipedia is examined.

We use the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices for sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the hill between different solutions and the rationale for running K-means many times and choosing the best answer.

Files:

K-means in Python

We use the K-means Python code in the SciPy package to show real code for clustering and apply it to a set of 85 two-dimensional vectors – officially sets of weights and heights to be clustered to find T-shirt sizes. We run through the Python code with Matplotlib displays to divide the data into 2-5 clusters. Then we discuss Python code to generate 4 clusters of varying sizes, centered at the corners of a square in two dimensions. We give the K-means algorithm more formally than before and make the definition consistent with the code in SciPy.
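
A minimal sketch of the SciPy calls this lesson is based on is given below; it is not the course script, and the synthetic weight/height style data is an illustrative assumption.

import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# synthetic (weight, height) style points; the real lesson uses 85 such vectors
rng = np.random.default_rng(0)
data = rng.normal(loc=[70.0, 170.0], scale=[10.0, 8.0], size=(85, 2))

obs = whiten(data)                      # scale each column to unit variance
centroids, distortion = kmeans(obs, 3)  # cluster into 3 groups (e.g. T-shirt sizes)
labels, _ = vq(obs, centroids)          # assign each point to its nearest centroid

print("distortion:", distortion)
print("cluster sizes:", np.bincount(labels))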

Analysis of 4 Artificial Clusters

We present clustering results on the artificial set of 1000 2D points described in the previous lesson for 3 choices of cluster sizes: small, large and very large. We emphasize that SciPy always does 20 independent K-means runs and takes the best result – an approach to avoiding local minima. We allow this number of independent runs to be changed and, in particular, set to 1 to generate more interesting erratic results. We describe changes in our new K-means code, which also allows two measures of quality. The slides give many results of clustering into 2, 4, 6 and 8 clusters (there were only 4 real clusters). We show that the very small case has two very different solutions when clustered into two clusters and use this to discuss functions with multiple minima and a hill between them. The lesson has both a discussion of already produced results in slides and interactive use of Python for new runs.

Parallel K-means

We modify the SciPy K-means code to support a MapReduce execution style and run it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the parallel maps run sequentially. We stress that this simple 2-map version can be generalized to scalable parallelism.
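
The sketch below is an illustration of this MapReduce style, not the modified SciPy code itself: two mappers each process half of the points and a reducer combines their partial sums into new centroids; here the two maps run sequentially.

import numpy as np

def mapper(points, centroids):
    # emit, for each cluster: (sum of assigned points, count of assigned points)
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k = len(centroids)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        mask = labels == j
        sums[j] = points[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def reducer(partials, centroids):
    # combine partial sums and counts into new centroids
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    new = centroids.copy()
    nonempty = total_counts > 0
    new[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new

rng = np.random.default_rng(1)
points = rng.normal(size=(1000, 2))
centroids = points[:4].copy()           # crude initial guess: first 4 points
for _ in range(10):
    halves = np.array_split(points, 2)  # the two "maps", run one after the other
    partials = [mapper(h, centroids) for h in halves]
    centroids = reducer(partials, centroids)
print(centroids)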

Files:

PageRank in Practice

We use Python to calculate PageRank from the web linkage matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit is concluded by a calculation of PageRank for general web pages by extracting the secret from Google.
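
A minimal power-iteration sketch of PageRank is shown below; it is an illustration rather than the course notebook, the 4-page link matrix is made up, and the damping factor 0.85 is the commonly quoted value.

import numpy as np

# hypothetical 4-page web: entry [i, j] = 1 if page i links to page j
links = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]], dtype=float)

n = links.shape[0]
damping = 0.85
row_stochastic = links / links.sum(axis=1, keepdims=True)  # each row sums to 1

rank = np.full(n, 1.0 / n)
for _ in range(100):
    # power iteration: converges to the leading eigenvector of the damped matrix
    rank = (1 - damping) / n + damping * row_stochastic.T @ rank

print(rank / rank.sum())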

Files:

Resources

8.2.15.9 - Radar

Radar

The changing global climate is suspected to have long-term effects on much of the world’s inhabitants. Among the various effects, the rising sea level will directly affect many people living in low-lying coastal regions. While the ocean’s thermal expansion has been the dominant contributor to rises in sea level, the potential contribution of discharges from the polar ice sheets in Greenland and Antarctica may pose a more significant threat due to their unpredictable response to the changing climate. The Radar-Informatics unit provides a glimpse into the processes fueling global climate change and explains what methods are used for ice data acquisition and analysis.

Presentation Radar (58)

Introduction

This lesson motivates radar informatics by building on previous discussions of why X-applications are growing in data size and why analytics are necessary for acquiring knowledge from large data. The lesson details three mosaics of a changing Greenland ice sheet and provides a concise overview of subsequent lessons by explaining how other remote sensing technologies, such as radar, can be used to sound the polar ice sheets and what we are doing with radar images to extract knowledge to be incorporated into numerical models.

Remote Sensing

This lesson explains the basics of remote sensing, the characteristics of remote sensors and remote sensing applications. Emphasis is on image acquisition and data collection in the electromagnetic spectrum.

Ice Sheet Science

This lesson provides a brief understanding of why meltwater at the base of the ice sheet can be detrimental and why it is important for sensors to sound the bedrock.

Global Climate Change

This lesson provides an understanding of the processes behind the greenhouse effect, how warming affects the Polar Regions, and the implications of a rise in sea level.

Radio Overview

This lesson provides an elementary introduction to radar and its importance to remote sensing, especially to acquiring information about Greenland and Antarctica.

Radio Informatics

This lesson focuses on the use of sophisticated computer vision algorithms, such as active contours and a hidden Markov model, to support data analysis for extracting layers, so ice sheet models can accurately forecast future changes in climate.

8.2.15.10 - Sensors

Sensors

We start with the Internet of Things (IoT), giving examples like monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones and self-driving cars, and more generally transportation systems. We give examples of robots and drones. We introduce the Industrial Internet of Things (IIoT) and summarize industry-wide surveys and expectations. We give examples from General Electric. Sensor clouds control the many small distributed devices of IoT and IIoT. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes including U-Korea; and finally the smart electric grid.

Presentation Sensor I (31)

Presentation Sensor II (44)

Internet of Things

There are predicted to be 24-50 billion devices on the Internet by 2020; these are typically some sort of sensor, defined as any source or sink of time series data. Sensors include smartphones, webcams, monitors of machine operation, barcodes, surveillance cameras, scientific sensors (especially in earth and environmental science), drones and self-driving cars, and more generally transportation systems. The lesson gives many examples of distributed sensors, which form a Grid that is controlled by a cloud.

Video Internet of Things (12:36)

Robotics and IoT

Examples of Robots and Drones.

Video Robotics and IoT Expectations (8:05)

Industrial Internet of Things

We summarize industry-wide surveys and expectations.

Video Industrial Internet of Things (24:02)

Sensor Clouds

We describe the architecture of a Sensor Cloud control environment and give an example of an interface to an older version of it. The performance of the system is measured in terms of processing latency as a function of the number of involved sensors, with each delivering data at a 1.8 Mbps rate.

Video Sensor Clouds (4:40)

Earth/Environment/Polar Science data gathered by Sensors

This lesson gives examples of some sensors in the Earth/Environment/Polar Science field. It starts with material from the CReSIS polar remote sensing project and then looks at the NSF Ocean Observing Initiative and NASA’s MODIS or Moderate Resolution Imaging Spectroradiometer instrument on a satellite.

Video Earth/Environment/Polar Science data gathered by Sensors (4:58)

Ubiquitous/Smart Cities

For Ubiquitous/Smart cities we give two examples: Ubiquitous Korea and smart electrical grids.

Video Ubiquitous/Smart Cities (1:44)

U-Korea (U=Ubiquitous)

Korea has an interesting position: it is first worldwide in broadband access per capita, e-government, scientific literacy and total working hours. However, it is far down in measures like quality of life and GDP. U-Korea aims to improve the latter by pervasive computing, everywhere, anytime, i.e. by spreading sensors everywhere. The example of a ‘High-Tech Utopia’, New Songdo, is given.

Video U-Korea (U=Ubiquitous) (2:49)

Smart Grid

The electrical Smart Grid aims to enhance the USA’s aging electrical infrastructure by pervasive deployment of sensors and the integration of their measurements in a cloud or equivalent server infrastructure. A variety of new instruments include smart meters, power monitors, and measures of solar irradiance, wind speed, and temperature. One goal is autonomous local power units where good use is made of waste heat.

Video Smart Grid (6:04)

Resources

Note: These resources have not all been checked to see if they still exist; this is currently in progress.

8.2.15.11 - Sports

Sports

Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Basic Sabermetrics

This unit discusses baseball, starting with the movie Moneyball and the 2002-2003 Oakland Athletics. Unlike sports like basketball and soccer, most baseball action is built around individuals, often interacting in pairs. This is much easier to quantify than many-player phenomena in other sports. We discuss the Performance-Dollar relationship, including new stadiums and media/advertising. We look at classic baseball averages and sophisticated measures like Wins Above Replacement.

Presentation Overview (40)

Introduction and Sabermetrics (Baseball Informatics) Lesson

Introduction to all Sports Informatics, Moneyball The 2002-2003 Oakland Athletics, Diamond Dollars economic model of baseball, Performance - Dollar relationship, Value of a Win.

Video Introduction and Sabermetrics (Baseball Informatics) Lesson (31:4)

Basic Sabermetrics

Different Types of Baseball Data, Sabermetrics, Overview of all data, Details of some statistics based on basic data, OPS, wOBA, ERA, ERC, FIP, UZR.

Video Basic Sabermetrics (26:53)

Wins Above Replacement

Wins Above Replacement (WAR), Discussion of Calculation, Examples, Comparisons of different methods, Coefficient of Determination, Another Sabermetrics Example, Summary of Sabermetrics.

Video Wins Above Replacement (30:43)

Advanced Sabermetrics

This unit discusses ‘advanced sabermetrics’ covering advances possible from using video from PITCHf/X, FIELDf/X, HITf/X, COMMANDf/X and MLBAM.

Presentation Sports II (41)

Pitching Clustering

A Big Data Pitcher Clustering method introduced by Vince Gennaro, Data from Blog and video at 2013 SABR conference.

Video Pitching Clustering (20:59)

Pitcher Quality

Results of optimizing match ups, Data from video at 2013 SABR conference.

Video Pitcher Quality (10:02)

PITCHf/X

Examples of use of PITCHf/X.

Video PITCHf/X (10:39)

Other Video Data Gathering in Baseball

FIELDf/X, MLBAM, HITf/X, COMMANDf/X.

Video Other Video Data Gathering in Baseball (18:5)

Other Sports

We look at Wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other Sports: Soccer, Olympics, NFL Football, Basketball, Tennis and Horse Racing.

Presentation Sports III (44)

Wearables

Consumer Sports, Stake Holders, and Multiple Factors.

Video Wearables (22:2)

Soccer and the Olympics

Soccer, Tracking Players and Balls, Olympics.

Video Soccer and the Olympics (8:28)

Spatial Visualization in NFL and NBA

NFL, NBA, and Spatial Visualization.

Video Spatial Visualization in NFL and NBA (15:19)

Tennis and Horse Racing

Tennis, Horse Racing, and Continued Emphasis on Spatial Visualization.

Video Tennis and Horse Racing (8:52)

Resources

Note: These resources have not all been checked to see if they still exist; this is currently in progress.

8.2.15.12 - Statistics

Statistics

We assume that you are familiar with elementary statistics including

  • mean, minimum, maximum
  • standard deviation
  • probability
  • distribution
  • frequency distribution
  • Gaussian distribution
  • bell curve
  • standard normal probabilities
  • tables (z table)
  • regression
  • correlation

Some of these terms are explained in various sections throughout our application discussion, especially the Physics section. However, these terms are so elementary that any undergraduate or high school book will provide you with a good introduction.
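
As a short illustration, and not a replacement for the student contributions requested below, the following NumPy fragment demonstrates a few of the listed terms (mean, minimum, maximum, standard deviation, correlation) on made-up data.

import numpy as np

# made-up sample data for illustration only
heights = np.array([160, 165, 170, 175, 180], dtype=float)
weights = np.array([55, 60, 68, 72, 80], dtype=float)

print("mean:", heights.mean(), "min:", heights.min(), "max:", heights.max())
print("standard deviation:", heights.std(ddof=1))          # sample standard deviation
print("correlation:", np.corrcoef(heights, weights)[0, 1])  # Pearson correlation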

You are expected to identify these terms, and you can contribute to this section with non-plagiarized subsections explaining these topics for credit.

Topics identified by a :?: can be contributed by students. If you are interested, use Piazza to announce your willingness to do so.

Mean, minimum, maximum:

No Question

Standard deviation:

No Question

Probability:

No Question

Distribution:

No Question

Frequency distribution:

No Question

Gaussian distribution:

No Question

Bell curve:

No Question

Standard normal probabilities:

No Question

Tables (z-table):

No Question

Regression:

No Question

Correlation:

No Question

Exercise

E.Statistics.1:

Pick a term from the previous list and define it without plagiarizing. Create a pull request. Coordinate on Piazza so as not to duplicate someone else’s contribution. Also look into outstanding pull requests.

E.Statistics.2:

Pick a term from the previous list, develop a Python program demonstrating it, and create a pull request to contribute it to the examples directory. Make links to the GitHub location. Coordinate on Piazza so as not to duplicate someone else’s contribution. Also look into outstanding pull requests.

8.2.15.13 - Web Search and Text Mining

Web Search and Text Mining

This section starts with an overview of data mining and puts our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history with libraries, catalogs and concordances. An overview of web search is given, describing the continued evolution of search engines and the relation to the field of Information Retrieval.

The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, together with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag of words space. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distributions completes this unit.

We start with an overview of the different steps (data analytics) in web search and then go through key steps in detail, starting with document preparation. An inverted index is described and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it, and issues in web advertising and search follow. The use of clustering and topic models completes the section.

Web Search and Text Mining

The unit starts with the web: its size, shape (coming from the mutual linkage of pages by URLs) and universal power laws for the number of pages with a particular number of URLs linking out of or into a page. Information retrieval is introduced and compared to web search. A comparison is given between semantic searches as in databases and the full text search that is the basis of Web search. The origin of web search in libraries, catalogs and concordances is summarized. The DIKW – Data Information Knowledge Wisdom – model for web search is discussed. Then features of documents, collections and the important Bag of Words representation are presented. Queries are presented in the context of an Information Retrieval architecture. The method of judging quality of results, including recall, precision and diversity, is described. A time line for the evolution of search engines is given.

Boolean and Vector Space models for queries, including the cosine similarity, are introduced. Web Crawlers are discussed, and then the steps needed to analyze data from the Web and produce a set of terms. Building and accessing an inverted index is followed by the importance of term specificity and how it is captured in TF-IDF. We note how frequencies are converted into belief and relevance.

Presentation Web Search and Text Mining (56)

The Problem

Video Text Mining (9:56)

This lesson starts with the web: its size, shape (coming from the mutual linkage of pages by URLs) and universal power laws for the number of pages with a particular number of URLs linking out of or into a page.

Information Retrieval

Video Information Retrieval (6:06)

Information retrieval is introduced. A comparison is given between semantic searches as in databases and the full text search that is the basis of Web search. The ACM classification illustrates the potential complexity of ontologies. Some differences between web search and information retrieval are given.

History

Video Web Search History (5:48)

The origin of web search in libraries, catalogs and concordances is summarized.

Key Fundamental Principles

Video Principles (9:30)

This lesson describes the DIKW – Data Information Knowledge Wisdom – model for web search. Then it discusses documents, collections and the important Bag of Words representation.

Information Retrieval (Web Search) Components

Video Fundamental Principals of Web Search (5:06)

This describes queries in context of an Information Retrieval architecture. The method of judging quality of results including recall, precision and diversity is described.

Search Engines

Video Search Engines (3:08)

This short lesson describes a time line for evolution of search engines. The first web search approaches were directly built on Information retrieval but in 1998 the field was changed when Google was founded and showed the importance of URL structure as exemplified by PageRank.

Boolean and Vector Space Models

Video Boolean and Vector Space Model (6:17)

This lesson describes the Boolean and Vector Space models for query including the cosine similarity.

Web crawling and Document Preparation

Video Web crawling and Document Preparation (4:55)

This describes a Web Crawler and then the steps needed to analyze data from the Web and produce a set of terms.

Indices

Video Indices (5:44)

This lesson describes both building and accessing an inverted index. It describes how phrases are treated and gives details of query structure from some early logs.
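
A minimal sketch of an inverted index is given below; it is an illustration only, using a tiny hand-made document collection, and it answers a Boolean AND query by intersecting posting sets.

from collections import defaultdict

# tiny made-up document collection
docs = {
    0: "big data clouds and analytics",
    1: "clouds for web search",
    2: "web search and text mining",
}

# inverted index: term -> set of document ids containing the term
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_and(*terms):
    # Boolean AND query: intersect the posting sets of all query terms
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("web", "search"))  # -> {1, 2}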

TF-IDF and Probabilistic Models

Video TF-IDF and Probabilistic Models (3:57)

It describes the importance of term specificity and how it is captured in TF-IDF. It notes how frequencies are converted into belief and relevance.
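
The sketch below illustrates one common TF-IDF weighting, term frequency multiplied by log(N/df), together with the cosine similarity of the vector space model; it is an illustration and not necessarily the exact formula used in the slides.

import math
from collections import Counter

docs = ["big data analytics", "web search and text mining", "big data and web search"]
tokenized = [d.lower().split() for d in docs]
N = len(docs)
# document frequency: in how many documents does each term appear
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    # weight each term by tf * log(N / df)
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    # cosine similarity between two sparse term-weight vectors
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(toks) for toks in tokenized]
print(cosine(vecs[0], vecs[2]))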

Topics in Web Search and Text Mining

Presentation Text Mining (33)

We start with an overview of the different steps (data analytics) in web search. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. Issues in web advertising and search follow. This leads to the emerging field of computational advertising. The use of clustering and topic models completes the unit, with Google News as an example.

Video Web Search and Text Mining II (6:11)

This short lesson describes the different steps needed in web search including: get the digital data (from the web or from scanning); crawl the web; preprocess the data to get searchable things (words, positions); form an Inverted Index mapping words to documents; rank the relevance of documents with potentially sophisticated techniques; and integrate technology to support advertising and ways to allow or stop pages artificially enhancing relevance.

Video Related Applications (17:24)

The value of links and the concepts of Hubs and Authorities are discussed. This leads to the definition of PageRank with examples. Extensions of PageRank viewed as a reputation are discussed, with journal rankings and university department rankings as examples. There are many extensions of these ideas which are not discussed here, although topic models are covered briefly in a later lesson.

Video Web Advertising and Search (9:02)

Internet and mobile advertising is growing fast and can be personalized more than for traditional media. There are several advertising types: sponsored search, contextual ads, and display ads, and different models: cost per viewing, cost per clicking and cost per action. This leads to the emerging field of computational advertising.

Clustering and Topic Models

Video Clustering and Topic Models (6:21)

We discuss briefly approaches to defining groups of documents. We illustrate this for Google News and give an example that this can give different answers from word-based analyses. We mention some work at Indiana University on a Latent Semantic Indexing model.

Resources

All resources accessed March 2018.

8.2.15.14 - WebPlotViz

WebPlotViz

WebPlotViz is a browser based visualization tool developed at Indiana University. It allows users to visualize 2D and 3D data points in the web browser. WebPlotViz was developed as a successor to the previous visualization tool PlotViz, which was an application that needed to be installed on your machine. You can find more information about PlotViz in the PlotViz section.

Motivation

The motivation for WebPlotViz is similar to that of PlotViz: the human eye is very good at pattern recognition and can see structure in data. Although most big data is higher dimensional than 3, all data can be transformed by dimension reduction techniques to 3D, and one can check analyses like clustering and/or see structure missed in a computer analysis.

How to use

In order to use WebPlotViz you need to host the application as a server; this can be done on your local machine or an application server. The source code for WebPlotViz can be found at the GitHub repository WebPlotViz git Repo.

However, there is an online version hosted on Indiana University servers that you can access and use. The online version is available at WebPlotViz.

In order to use the services of WebPlotViz you first need to create a simple account by providing your email and a password. Once the account is created you can log in and upload files to WebPlotViz to be visualized.

Uploading files to WebPlotViz

While WebPlotViz accepts several file formats as input, we will look at the simplest and easiest format to use. Files are uploaded as “.txt” files with the following structure, where each value is separated by a space.

Index x_val y_val z_val cluster_id label

Example file:

0 0.155117377 0.011486086 -0.078151964 1 l1
1 0.148366394 0.010782429 -0.076370584 2 l2
2 0.170597667 -0.025115137 -0.082946074 2 l2
3 0.136063907 -0.006670781 -0.082583441 3 l3
4 0.158259943 0.015187686 -0.073592601 5 l5
5 0.162483279 0.014387166 -0.085987414 5 l5
6 0.138651632 0.013358333 -0.062633719 5 l5
7 0.168020213 0.010742307 -0.090281011 5 l5
8 0.15810229 0.007551404 -0.083311109 4 l4
9 0.146878082 0.003858649 -0.071298345 4 l4
10 0.151487542 0.011896318 -0.074281645 4 l4
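
As an illustration only, the following Python fragment writes a file in the “index x y z cluster_id label” format shown above; the file name points.txt and the random coordinates are made up.

import random

random.seed(0)
with open("points.txt", "w") as f:
    for i in range(11):
        # made-up coordinates; in practice these come from your dimension-reduced data
        x, y, z = (random.uniform(-0.1, 0.2) for _ in range(3))
        cluster = random.randint(1, 5)
        # one point per line: index x y z cluster_id label
        f.write(f"{i} {x:.9f} {y:.9f} {z:.9f} {cluster} l{cluster}\n")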

Once you have the data file properly formatted you can upload it through the WebPlotViz GUI. Once you log in to your account you should see a green “Upload” button in the top left corner. Once you press it you will see a form that allows you to choose the file, provide a description and select a group into which the file should be categorized. If you do not want to assign a group you can simply use the default group, which is picked automatically.

Once you have uploaded the file, it should appear in the list of plots under the heading “Artifacts”. Then you can click on the name or the “View” link to view the plot. Clicking on “View” will take you directly to the full view of the plot, while clicking on the name will show a summary of the plot with a smaller view (plot controls are not available in the smaller view). You can view how the sample dataset looks after uploading at the following link. @fig:webpviz-11 shows a screenshot of the plot.

11 Points WebPlotViz plot

11 Points Plot{#fig:webpviz-11}

Users can apply colors to clusters manually or choose one of the color schemes that are provided. All the controls for the clusters are made available once you click on the “Cluster List” button located in the bottom left corner of the plot (third button from the left). This will pop up a window that allows you to control all the settings of the clusters.

Features

WebPlotViz has many features that allow users to control and customize the plots. Other than simple 2D/3D plots, WebPlotViz also supports time series plots and tree structures. The examples section showcases examples of each of these. The data formats required for these plots are not covered here.

WebPlotViz Features labeled{#fig:webpviz-labled}

Some of the features are labeled in @fig:webpviz-labled. Please note that @fig:webpviz-labled shows a time series plot, so the playback controls shown in the figure are not available in single plots.

Some of the features are described in the short video that is linked on the home page of the hosted WebPlotViz site WebPlotViz.

Examples

Now we will take a look at a couple of examples that were visualized using WebPlotViz.

Fungi gene sequence clustering example

The following example is a plot from clustering done on a set of fungi gene sequence data.

Fungi Gene Sequence Plot

Fungi Gene Sequence{#fig:webpviz-fungi}

Stock market time series data

This example shows a time series plot. The plots were created from stock market data, so certain patterns can be followed for companies through the passing years.

Stock market data

Stock market data{#fig:webpviz-stock}

8.2.16 - Technologies

Technologies useful for this course

8.2.16.1 - Python

Python

Please see the Python book:

  • Introduction to Python for Cloud Computing, Gregor von Laszewski, Aug. 2019

8.2.16.2 - Github

Github

Track Progress with Github

We will be adding GitHub issues for all the assignments provided in the class. This way you can also keep track of the items that need to be completed. It is like a todo list: you can check things off once you complete them. This way you can easily track what you need to do, and you can comment on an issue to report questions you have. This is an experimental idea we are trying in the class. We hope this helps you manage your workload efficiently.

How to check this?

All you have to do is go to your git repository.

Here are the steps to use this tool effectively.

Step 1

Go to the repo. Here we use a sample repo.

Sample Repo

Link to your repo will be https://github.com/cloudmesh-community/fa19-{class-id}-{hid}

class-id is your class number, for instance 534. hid is the homework id assigned to you.

Step 2

In @fig:github-repo the red colored box shows where you need to navigate next. Click on issues.

Git Repo View{#fig:github-repo}

Step 3

In @fig:github-issue-list, the Git issue list looks like this. The entries here are dummy values we used to test the module. In your repo, things will be readable and identified based on the week. This way you know what you need to do each week.

Git Issue List{#fig:github-issue-list}

Step 4

@fig:github-issue-view shows how a git issue looks.

Git Issue View{#fig:github-issue-view}

Here you will see the things that you need to do, with a main task and subtasks. This looks like a todo list. No pressure; you can customize it the way you want. We’ll put in the basic skeleton for this one.

Step 5 (Optional)

In @fig:github-issue-assign, once you have completed an issue, you can assign a TA to resolve it if you have questions. In any issue you can make a comment and use the @ sign to add a specific TA. For E534 Fall 2019 you can add @vibhatha as an assignee for your issue and we will communicate to solve the issues. This is optional; you can also use Canvas or meeting hours to mention your concerns.

Git Issue View{#fig:github-issue-assign}

Step 6 (Optional)

In @fig:github-issue-label, you can add a label to your issue by clicking the labels option on the right hand side within a given issue.

Git Issue Label{#fig:github-issue-label}

9 - MNIST Example

In this module, you will learn how to use Google Colab with the well-known MNIST example

MNIST Character Recognition

We discuss in this module how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures of handwritten digits.

1. Overview

1.1. Prerequisite

  • Knowledge of Python
  • Google account

1.2. Effort

  • 3 hours

1.3. Topics covered

  • Using Google Colab
  • Running an AI application on Google Colab

Another MNIST course exists. However, this course has more information.

5. Introduction to Google Colab

Introduction to Google Colab

A Gentle Introduction to Google Colab (Web)

6. Basic Python in Google Colab

In this module, we will take a look at some fundamental Python Concepts needed for day-to-day coding.

A Gentle Introduction to Python on Google Colab (Web)

7. MNIST On Google colab

Next, we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures.
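
A minimal sketch of the kind of model such a notebook typically builds is shown below; it uses tf.keras and a small dense network, which is an assumption rather than the exact architecture of the course notebook.

import tensorflow as tf

# load the MNIST digits and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# a small, illustrative dense classifier for 28x28 digit images
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2, validation_split=0.1)
print(model.evaluate(x_test, y_test, verbose=0))  # [test loss, test accuracy]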

A PDF containing both lectures is available at Colab-and-MNIST.pdf

There are 5 videos

8. Assignments

  1. Get an account on Google if you do not have one.
  2. Do the optional Basic Python Colab lab module
  3. Do MNIST in Colab.

9. References

10 - Sample

For creating new pages on this Web site, we have prepared a number of samples.

The samples need to be placed into various directories based on their type. You need to make sure the title and the linkTitle in the page metadata are set. If they are in the same directory as another page, make sure all titles are unique.

10.1 - Module Sample

In this section you provide a short description that is displayed on the module summary page

Here comes the more complete abstract about this module; while the description is a short summary, the abstract is a bit more verbose.

Splash: An optional module-related image may be nice to create a splash that learners find attractive.

1. Overview

1.1. Prerequisite

Describe what knowledge is needed to start the module. Use a list and be as specific as possible:

  • Computer with Python 3.8.3

1.2. Effort

If possible, describe here how much effort it takes to complete the module. Use a list:

  • 1 hour

1.3. Topics

Please list here the topics that are covered by the module. A list is often the preferred way to do that. Use the abstract/pageinfo to provide a more textual description.

1.4. Organization

Please describe how the module is organized, if needed

2. Section A

Include Section A here

3. Section B

Include Section B here

4. Assignments

Include the assignments here. Use a numbered list.

5. References

Put the references here

10.2 - Alert Sample

The alert shortcode allows you to highlight information in your page.


The alert shortcode allows you to highlight information in your page. It creates a colored box surrounding your text, like this:

Usage

Parameter   Default   Description
theme       info      success, info, warning, danger

Basic examples

{{% alert theme="info" %}}**this** is a text{{% /alert %}}
{{% alert theme="success" %}}**Yeahhh !** is a text{{% /alert %}}
{{% alert theme="warning" %}}**Be carefull** is a text{{% /alert %}}
{{% alert theme="danger" %}}**Beware !** is a text{{% /alert %}}

10.4 - Figure Sample

Showcases how to include a figure.

A sample caption

10.5 - Mermaid Sample

Showcases how to include simple graphs generated by mermaid.

graph LR

	A[Introduction]
	B[Usecases]
	C[Physics]
	D[Sports]

A-->B-->C-->D

click A "/courses/bigdata2020/#introduction-to-ai-driven-digital-transformation" _blank
click B "/courses/bigdata2020/#big-data-usecases-survey" _blank
click C "/courses/bigdata2020/#physics" _blank
click D "/courses/bigdata2020/#sports" _blank

gantt
    title Class Calendar
    dateFormat  YYYY-MM-DD
    section Lectures
    Overview         :w1, 2020-08-28, 7d
    Introduction     :w2, 2020-09-04, 7d 
    Physics          :w3, 2020-09-11, 7d 
    Sport            :w4, 2020-09-18, 7d
    section Practice
	Colab            :after w1, 14d
    Github           :after w2, 7d	

To design a chart you can use the mermaid live editor.