
1 - Courses

List of courses.

With the help of modules, one can assemble their own courses. Courses can be designed individually or for a class with multiple students.

One of the main tools to export such courses is bookmanager, which you can find at https://pypi.org/project/cyberaide-bookmanager/ and https://github.com/cyberaide/bookmanager

1.1 - 2021 REU Course

This course introduces the REU students to various topics in Intelligent Systems Engineering. The course was taught in Summer 2021.

Rstudio with Git and GitHub Slides

Rstudio with Git and GitHub Slides

Programming with Python

Python is a great language for doing data science and AI; a comprehensive list of features is available in book form. Please note that when installing Python, you should always use a venv, as this is best practice.

Introduction to Python (ePub) (PDF)

Installation of Python

Installation of Python — June 7th, 2021 (AM)

Update to the Video:

Best practices in Python recommend using a Python venv. This is pretty easy to do and creates a separate Python environment for you, so you do not interfere with your system Python installation. Some IDEs may do this automatically, but it is still best practice to create one and bind the IDE to it. To do this:

  1. Download Python version 3.9.5 just as shown in the first lecture.

  2. After the download you do an additional step as follows:

    • on Mac:

      python3.9 -m venv ~/ENV3
      source ~/ENV3/bin/activate
      

      You need to run the source command every time you start a new terminal window; on macOS you can instead add it to your .zprofile.

  • on Windows you first install gitbash and do all your terminal work from gitbash, as this is more Linux-like. In gitbash, run

    python -m venv ~/ENV3
    source ~/ENV3/Scripts/activate
    

    In case you would like gitbash to do this automatically, you can add the source line to .bashrc and/or .bash_profile

  3. In case you use VSCode, you can also create a venv individually in a directory where you have your code.

    • On Mac: cd TO YOUR DIR; python3.9 -m venv .
    • On Windows cd TO YOUR DIR; python -m venv .

    Then start VSCode in the directory and it will ask you to use this venv. However, the global ENV3 venv may be better, and you can set your interpreter to it.

  4. On Pycharm we recommend you use the ENV3 venv and set the global interpreter to it.
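Once a venv is activated, you can verify from within Python that you are actually using it. This is a minimal sketch relying only on the standard `sys` module; `in_venv` is a helper name chosen here for illustration:

```python
import sys

def in_venv() -> bool:
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points to the base installation.
    return sys.prefix != sys.base_prefix

print(in_venv())
```

If this prints False in a terminal where you expected the venv to be active, re-run the activate script for that window.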

Jupyter Notebooks

Jupyter Notebooks — June 7th, 2021 (PM): This lecture provides an introduction to Jupyter Notebooks using Visual Studio Code as the IDE.

Github

Video: Github
Video: Github 2 — June 8th, 2021 (PM): In this lecture students learn how to create a project in RStudio and link it with a repository on GitHub to commit, pull, and push code from RStudio.

Introduction to Python

Slides: Introduction to Python: This introduction to Python covers the different data types, how to convert variable types, and how to understand and create flow control using conditional statements.
Video: Introduction to Python (1) — June 9th, 2021 (AM)
Video: Introduction to Python (2) — June 9th, 2021 (PM)
Video: Introduction to Python (3) — June 10th, 2021 (AM): This lecture introduces the use of Google Colab to code your Python programs using the resources provided by Google. DataFrames are also introduced and used to manipulate and analyze data.
Slides — June 10th, 2021 (PM): Strings, Numbers, Booleans, and Flow of Control Using If Statements
Slides: Strings, Numbers, Booleans, and Flow of Control Using If Statements (2)
Python Exercises - Lab 2

The first exercise will require a simple for loop, while the second is more complicated, requiring nested for loops and a break statement.

General Instructions: Create two different files with the extension .ipynb, one for each problem. The first file, factorial.ipynb, is for the factorial problem; the second, prime_number.ipynb, is for the prime number problem.

  1. Write a program that can find the factorial of any given number. For example, the factorial of the number 5 (often written as 5!) is 1 × 2 × 3 × 4 × 5, which equals 120. Your program should take an integer as input from the user.

    Note: The factorial is not defined for negative numbers and the factorial of Zero is 1; that is 0! = 1.

    You should

    1. If the number is less than zero, return with an error message.
    2. Check to see if the number is zero; if it is, then the answer is 1, so print this out.
    3. Otherwise, use a loop to generate the result and print it out.
  2. A prime number is a positive whole number, greater than 1, that has no divisors other than 1 and itself. For example, the numbers 2, 3, 5, and 7 are prime, as they cannot be divided evenly by any other whole number. However, the numbers 4 and 6 are not: both can be divided by 2, and 6 can also be divided by 3.

    You should write a program that calculates the prime numbers from 2 up to the value input by the user.

    You should

    1. If the user inputs a number below 2, print an error message.
    2. For each integer from 2 up to that number, determine whether it can be divided evenly by a smaller number (you will probably need two for loops for this, one nested inside the other).
    3. For each number that cannot be divided by any other number (that is, a prime number), print it out.
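The two exercises above can be sketched as follows. This is one possible solution (function names and error messages are our own choices, not a prescribed answer), using a simple for loop for the factorial and nested loops with a break for the primes:

```python
def factorial(n: int) -> int:
    """Factorial via a simple for loop (exercise 1)."""
    if n < 0:
        raise ValueError("factorial is not defined for negative numbers")
    result = 1  # covers the 0! = 1 case
    for i in range(2, n + 1):
        result *= i
    return result


def primes_up_to(limit: int) -> list:
    """Primes from 2 up to limit via nested loops and break (exercise 2)."""
    if limit < 2:
        raise ValueError("please enter a number of 2 or more")
    primes = []
    for n in range(2, limit + 1):
        for d in range(2, n):
            if n % d == 0:   # found a divisor: n is not prime
                break
        else:                # loop ended without a break: n is prime
            primes.append(n)
    return primes


print(factorial(5))      # → 120
print(primes_up_to(10))  # → [2, 3, 5, 7]
```

Note the use of Python's for/else: the else branch runs only when the inner loop finishes without hitting break.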

Motivation for the REU

Video — June 11th, 2021 (AM): Motivation for the REU: Data is Driven Everything
Slides: Motivation for the REU: Data is Driven Everything
Slides: Descriptive Statistics
Slides: Probability
Video — June 28th, 2021 (AM): Working on the GitHub template and Mendeley reference management

Data Science Tools

Slides: Data Science Tools
Video — June 14th, 2021 (AM): Numpy
Video — June 14th, 2021 (PM): Pandas data frames
Video — June 15th, 2021 (AM): Web data mining
Video — June 15th, 2021 (PM): Pandas IO
Video — June 16th, 2021 (AM): Pandas
Video: Matrix computation — June 16th, 2021 (PM): Linear algebra is a main component of data science. As a consequence, this lecture introduces the main matrix operations, such as addition, subtraction, multiplication, and piecewise multiplication.
Video: Pycharm Installation and Virtual Environment setup — June 18th, 2021 (AM)
Video: Applications of Matrix Operations using Images in Python — June 21st, 2021 (AM): In this lecture students learn the different applications of matrix operations using images in Python.
Video: Data Wrangling and Descriptive Statistics Using Python — June 21st, 2021 (AM)
Video: Data Wrangling and Descriptive Statistics Using Python — June 22nd, 2021 (PM)
Video: FURY Visualization and Microsoft Lecture — June 25th, 2021 (PM)
Video: Introduction to Probability — June 25th, 2021 (PM)
Video: Digital Twins and Virtual Tissue using CompuCell3D; Simulating Cancer Somatic Evolution in nanoHUB — July 2nd, 2021 (AM)
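The matrix operations covered in the matrix computation lecture can be tried out with NumPy. This is a small sketch, assuming `numpy` is installed in your venv; the example matrices are arbitrary:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)   # addition
print(A - B)   # subtraction
print(A @ B)   # matrix multiplication
print(A * B)   # piecewise (elementwise) multiplication
```

Note the difference between `@` (true matrix product) and `*` (elementwise product); mixing them up is a common beginner bug.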

AI First Engineering

Video: AI First Engineering: Learning material — June 25th, 2021 (AM)
Video: Adding content to your su21-reu repositories — June 17th, 2021 (PM)
Slides: AI First Engineering

Datasets for Projects

Video: Datasets for Projects: Data world and Kaggle — June 29th, 2021 (AM)
Video: Datasets for Projects: Data world and Kaggle, part 2 — June 29th, 2021 (PM)

Machine Learning Models

Video: K-Means: Unsupervised model — June 30th, 2021 (AM)
Video: Support Vector Machine: Supervised model — July 2nd, 2021 (PM)
Slides: Support Vector Machine: Supervised model
Video: Neural Networks: Deep Learning Supervised model — July 6th, 2021 (AM)
Video: Neural Networks: Deep Learning Model — July 6th, 2021 (AM)
Video: Data Visualization: Visualization for Data Science — July 7th, 2021 (AM)
Video: Convolutional Neural Networks: Deep Learning Model — July 8th, 2021 (AM)
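The K-Means lecture above can be tried in a few lines with scikit-learn. This is an illustrative sketch, assuming `scikit-learn` and `numpy` are installed; the toy data points are made up for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# two well-separated toy clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # the first two points share one label, the last two the other
print(km.cluster_centers_)  # one center near each group
```

K-Means is unsupervised: no class labels are given, and the algorithm groups points purely by distance to the learned cluster centers.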

Students Report Help

Video: Student Report Help with Introduction and Datasets — July 7th, 2021 (AM)
Video: Student Report Help with Introduction and Datasets — July 13th, 2021 (AM)

COVID-19

Video: Chemo-Preventive Effect of Vegetables and Fruits Consumption on the COVID-19 Pandemic — July 1st, 2021 (AM)

1.2 - AI-First Engineering Cybertraining

This course introduces the students to AI-First principles. The notes are prepared for the course taught in 2021.


Class Material

As part of this class, we will be using a variety of sources. To simplify the presentation we provide them in a variety of smaller packaged material including books, lecture notes, slides, presentations and code.

Note: We will regularly update the course material, so please always download the newest version. Some browsers try to be fancy and cache previous page visits. So please make sure to refresh the page.

We will use the following material:

Course Lectures and Management

Course Lectures. These meeting notes are updated weekly. (Web)

Overview

This course is built around the revolution driven by AI, and in particular deep learning, that is transforming all activities: industry, research, and lifestyle. It will have a similar structure to the Big Data class, and the details of the course will be adapted to the interests of participating students. It can include significant deep learning programming.

All activities – industry, research, and lifestyle – are being transformed by Artificial Intelligence (AI) and big data. AI is currently dominated by deep learning, implemented on a globally pervasive computing environment: the global AI supercomputer. This course studies the technologies and applications of this transformation.

We review the core technologies driving these transformations: digital transformation moving to AI transformation, big data, cloud computing, software and data engineering, edge computing and the Internet of Things, networks and telecommunications, the Apache big data stack, logistics and company infrastructure, augmented and virtual reality, and deep learning.

There are new “industries” that have emerged over the last 25 years: the Internet, remote collaboration and social media, search, cybersecurity, smart homes and cities, and robotics. However, our focus is on traditional industries transformed:

  • Computing
  • Transportation: ride-hailing, drones, electric self-driving autos/trucks, road management, travel
  • Construction
  • Space
  • Retail stores and e-commerce
  • Manufacturing: smart machines, digital twins
  • Agriculture and food
  • Hospitality and living spaces: buying homes, hotels, “room hailing”
  • Banking and financial technology: insurance, mortgages, payments, the stock market, bitcoin
  • Health: from deep learning for pathology to personalized genomics to remote surgery
  • Surveillance and monitoring: civilian disaster response; military command and control
  • Energy: solar, wind, oil
  • Science: more data, better analyzed; deep learning as the new applied mathematics
  • Sports: including sabermetrics
  • Entertainment and gaming, including eSports
  • News, advertising, information creation and dissemination, education, fake news and politics, and jobs

We select material from above to match student interests.

Students can take the course in either software-based or report-based mode. The lectures will be offered in video form with a weekly discussion class. Python and TensorFlow will be the main software used.

Lectures on Particular Topics

Introduction to AI-Driven Digital Transformation

Introduction to AI-Driven Digital Transformation (Web)

Introduction to Google Colab

A Gentle Introduction to Google Colab (Web)
A Gentle Introduction to Python on Google Colab (Web)
MNIST Classification on Google Colab (Web)
MNIST-MLP Classification on Google Colab (Web)
MNIST-RNN Classification on Google Colab (Web)
MNIST-LSTM Classification on Google Colab (Web)
MNIST-Autoencoder Classification on Google Colab (Web)
MNIST with MLP+LSTM Classification on Google Colab (Web)
Distributed Training with MNIST Classification on Google Colab (Web)
PyTorch with MNIST Classification on Google Colab (Web)
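The notebooks above build small neural networks for MNIST. The basic forward pass of an MLP can be sketched in pure NumPy; this is an illustrative toy, not the notebooks' actual code, and the layer sizes (128 hidden units) and random weights here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# toy batch of 4 flattened 28x28 "images", as in the MNIST notebooks
x = rng.random((4, 784))

# one hidden layer of 128 units, 10 output classes (digits 0-9)
W1 = rng.standard_normal((784, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 10)) * 0.01
b2 = np.zeros(10)

probs = softmax(relu(x @ W1 + b1) @ W2 + b2)
print(probs.shape)  # (4, 10); each row is a probability distribution over digits
```

Frameworks like TensorFlow and PyTorch add exactly what is missing here: the loss, the backward pass, and the optimizer.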

Material

Health and Medicine

The Health and Medicine sector has become more needed than ever. With the rise of COVID-19, resource usage, monitoring, research on antivirals, and many more challenging tasks fell on the shoulders of scientists. To face such challenges, AI can be a worthy partner in solving some of the related problems efficiently and effectively.

AI in Banking

AI in banking has become a vital component in providing the best services to people. AI helps secure bank transactions, provide suggestions, and deliver many other services for clients. Legacy banking systems are also being reinforced with novel AI techniques to align business models with technology.

Space and Energy

Energy is a term we find in everyday life. Conserving energy and using it smartly are vital to managing energy demands, and the role played by AI here has become significant in recent years. Many efforts have been made by industry leaders such as Bill Gates to provide better solutions for efficient energy consumption. Space exploration is also being reinforced with AI: better communication, remote sensing, and data analysis have become key components in the challenge to unravel the mysteries of the universe.

Mobility (Industry)

Mobility is a key part of everyday life. From the personal car to space-exploring rockets, there are many places that can be enhanced by AI. Autonomous vehicles and sensing features provide safety and efficiency, and many car companies have already moved toward AI to power vehicles and provide new features for drivers.

Cloud Computing

Cloud computing is a major component of today's service infrastructures. Artificial intelligence, microservices, storage, virtualization, and parallel computing are some of its key aspects.

Commerce

Commerce is a field reinforced with AI and related technologies to provide better service to clients. Amazon is one of the leading companies in e-commerce, where recommendation engines play a major role.

Complementary Material

  • When working with books, ePubs typically display better than PDF. For ePub, we recommend using iBooks on macOS and calibre on all other systems.

Piazza

Piazza. The link for all those who participate in the IU class to its class Piazza.

Scientific Writing with Markdown

Scientific Writing with Markdown (ePub) (PDF)

Git Pull Request

Git Pull Request. Here you will learn how to do a simple git pull request, either via the GitHub GUI or the git command-line tools.

Introduction to Linux

This course does not require you to do much Linux. However, if you do need it, we recommend the following as a starting point.

The most elementary Linux features can be learned in 12 hours. This includes bash, an editor, the directory structure, and managing files. Under Windows, we recommend using gitbash, a terminal with all the commands built in that you would need for elementary work.

Introduction to Linux (ePub) (PDF)

Older Course Material

Older versions of the material are available at

Lecture Notes 2020 (ePub) (PDF)
Big Data Applications (Nov. 2019) (ePub) (PDF)
Big Data Applications (2018) (ePub) (PDF)

Contributions

You can contribute to the material with useful links and sections that you find. Just make sure that you do not plagiarize when making contributions. Please review our guide on plagiarism.

Computer Needs

This course does not require a sophisticated computer. Most of the things can be done remotely. Even a Raspberry Pi with 4 or 8GB could be used as a terminal to log into remote computers. This will cost you between $50 - $100 dependent on which version and equipment. However, we will not teach you how to use or set up a Pi or another computer in this class. This is for you to do and find out.

In case you need to buy a new computer for school, make sure the computer is upgradable to 16GB of main memory. We no longer recommend HDDs; use SSDs instead. Buy fast ones, as not every SSD is the same; Samsung offers some under the EVO Pro branding. Get as much memory as you can afford. Also, make sure you back up your work regularly, either in online storage such as Google Drive or on an external drive.

1.2.1 - Project Guidelines

We present here the AI First Engineering project guidelines


All students in this class are doing a software project. (Some of our classes allow non-software projects.)

Details

The major deliverable of the course is a software project with a report. The project must include a programming part to get a full grade. It is expected that you identify a suitable analysis task and data set for the project, and that you learn how to apply this analysis as well as to motivate it. It is part of the learning outcome that you determine this instead of us giving you a topic. Each student will present their topic in class on April 1.

It is desirable that the project has a novel feature in it. A project that simply reproduces existing work may not receive the best grade, but this depends on what the analysis is and how you report it.

However, “major advances” and solving a full-size problem are not required. You can simplify both the network and the dataset to be able to complete the project. The project write-up should describe the “full-size” realistic problem, with software exemplifying an instructive example.

One goal of the class is to use open source technology wherever possible. As a beneficial side product of this, we are able to distribute all previous reports that use such technologies. This means you can cite your own work, for example, in your resume. For big data, we have more than 1000 data sets we point to.

Comments on Example Projects from previous classes

Warning: Please note that we do not make any quality assumptions about the published papers that we list here. It is up to you to identify outstanding papers.

Warning: Also note that these activities took place in previous classes, and the content of this class has since been updated or its focus has shifted. In particular, chapters on Google Colab, AI, and DL were added to the course after the date of most projects. Also, some of the documents include an additional assignment called a technology review. These are not the same as the project report or review we refer to here; they are just assignments done in 2-3 weeks. So please do not use them as a comparison for your own work. The activities we ask from you are substantially more involved than the technology reviews.

Format of Project

Plagiarism is of course not permitted. It is your responsibility to know what plagiarism is. We provide a detailed book about it here; you can also take the IU plagiarism test to learn more.

All project reports must be provided in github.com as a markdown file. All images must be in an images directory. You must use proper citations. Images copied from the Internet must have a citation in the image caption. Please use the IEEE citation format and do not use APA or Harvard style. Simply use footnotes in markdown, but treat them as regular citations and not text footnotes (that is, adhere to the IEEE rules).
All projects and reports must be checked into the Github repository. Please take a look at the example we created for you.

The report will be stored in github.com as:

./project/index.md

./project/images/mysampleimage.png

Length of Project Report

Software Project Reports: 2500 - 3000 Words.

Possible sources of datasets

Given next are links to collections of datasets that may be of use for homework assignments or projects.

FAQ

  • Why should you not just copy and paste into the GitHub GUI?

    We may make comments directly in your markdown or program files. If you just copy and paste, you may overlook such comments. Hence, only copy and paste small paragraphs if you need to. The best way of using GitHub is from the command line, using editors such as PyCharm or emacs.

  • Can I do a project that relates to my company?

    • Please go ahead and do so but make sure you use open-source data, and all results can be shared with everyone. If that is not the case, please pick a different project.
  • Can I use Word or Google doc, or LaTeX to hand in the final document?

    • No. You must use github.com and markdown.

    • Please note that exporting documents from Word or Google Docs can result in a markdown file that needs substantial cleanup.

  • Where do I find more information about markdown and plagiarism?

  • https://laszewski.github.io/publication/las-20-book-markdown/

  • https://cloudmesh-community.github.io/pub/vonLaszewski-writing.pdf

  • Can I use an online markdown editor?

    • There are many online markdown editors available. One of them is https://dillinger.io/. Use them to write your document, or to check one you have developed in another editor such as Word or Google Docs.

    • Remember, online editors can be risky if you lose your network connection. So we recommend developing small portions and copying them into a locally managed document that you then check into github.com.

    • Github GUI (recommended): this works very well, but the markdown is slightly limited. We use hugo’s markdown.

    • pyCharm (recommended): works very well.

    • emacs (recommended): works very well

  • What level of expertise and effort do I need to write markdown?

    • We taught 10-year-old students to use markdown in less than 5 minutes.
  • What level of expertise is needed to learn BibTeX

    • We have taught BibTeX to inexperienced students while using jabref in less than an hour (but it is not required for this course). You can use footnotes while making sure that the footnotes follow the IEEE format.
  • How can I get IEEE formatted footnotes?

    • Simply use jabref and paste and copy the text it produces.
  • Will there be more FAQs?

    • Please see our book on markdown.

    • Discuss your issue in piazza; if it is an issue that is not yet covered, we will add it to the book.

  • How do I write URLs?

    • Answered in book

    • Note: All URLs must be in either [TEXT](URLHERE) or <URLHERE> format.

1.3 - Big Data 2020

This course introduces the students to Cloud Big Data Applications. The notes are prepared for the course taught in 2020.


Class Material

As part of this class, we will be using a variety of sources. To simplify the presentation we provide them in a variety of smaller packaged material including books, lecture notes, slides, presentations and code.

Note: We will regularly update the course material, so please always download the newest version. Some browsers try to be fancy and cache previous page visits. So please make sure to refresh the page.

We will use the following material:

Course Lectures and Management

Course Lectures. These meeting notes are updated weekly. (Web)

Lectures on Particular Topics

Introduction to AI-Driven Digital Transformation

Introduction to AI-Driven Digital Transformation (Web)

Big Data Usecases Survey

This module covers 51 use cases of big data that emerged from a NIST (National Institute of Standards and Technology) study of big data. We cover the NIST Big Data Public Working Group (NBD-PWG) process and summarize the work of five subgroups: the Definitions and Taxonomies Subgroup, the Reference Architecture Subgroup, the Security and Privacy Subgroup, the Technology Roadmap Subgroup, and the Requirements and Use Case Subgroup. The 51 use cases collected in this process are briefly discussed, with a classification of the source of parallelism and the high- and low-level computational structure. We describe the key features of this classification.

Introduction to Google Colab

A Gentle Introduction to Google Colab (Web)
A Gentle Introduction to Python on Google Colab (Web)
MNIST Classification on Google Colab (Web)

Material

Physics

Big Data Applications and Analytics: Discovery of the Higgs Boson, Part I (Unit 8), Units 9-11. Summary: This section starts by describing the LHC accelerator at CERN and the evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms, and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various shapes of signal, various backgrounds, and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation of why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena; several Python illustrations are given. Random numbers, with their generators and seeds, lead to a discussion of the binomial and Poisson distributions, as well as Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.
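The accept-reject method mentioned above can be sketched in a few lines of Python. This illustrative example (not the unit's actual code) samples a standard Gaussian using a uniform proposal on [-5, 5]; the envelope constant M is chosen so the proposal always dominates the target density:

```python
import math
import random

def accept_reject_gaussian(n: int, seed: int = 42) -> list:
    """Sample n values from a standard normal via accept-reject."""
    random.seed(seed)
    norm = 1.0 / math.sqrt(2.0 * math.pi)   # peak of the Gaussian pdf
    # proposal q(x) = 1/10 on [-5, 5]; M chosen so that M * q(x) >= p(x)
    M = 10.0 * norm
    samples = []
    while len(samples) < n:
        x = random.uniform(-5.0, 5.0)
        u = random.uniform(0.0, 1.0)
        p = norm * math.exp(-0.5 * x * x)
        if u < p / (M * 0.1):   # accept with probability p(x) / (M * q(x))
            samples.append(x)
    return samples

s = accept_reject_gaussian(2000)
mean = sum(s) / len(s)
var = sum((v - mean) ** 2 for v in s) / len(s)
print(round(mean, 2), round(var, 2))  # close to 0 and 1, as expected for N(0, 1)
```

The accepted fraction is 1/M, so a tighter envelope wastes fewer proposals; the uniform proposal here is deliberately the simplest choice, not the most efficient.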

Sports

Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation, and discuss the importance of spatial visualization. We also look at other sports: soccer, the Olympics, NFL football, basketball, tennis, and horse racing.

Health and Medicine

The Health and Medicine sector has become more needed than ever. With the rise of COVID-19, resource usage, monitoring, research on antivirals, and many more challenging tasks fell on the shoulders of scientists. To face such challenges, AI can be a worthy partner in solving some of the related problems efficiently and effectively.

AI in Banking

AI in banking has become a vital component in providing the best services to people. AI helps secure bank transactions, provide suggestions, and deliver many other services for clients. Legacy banking systems are also being reinforced with novel AI techniques to align business models with technology.

Transportation Systems

Transportation systems are a vital component of human life. With the dawn of AI, transportation systems are being reinforced to provide better service for people. Terabytes of data collected in day-to-day transportation activities are analyzed to identify issues and provide a better experience for the user.

Space and Energy

Energy is a term we find in everyday life. Conserving energy and using it smartly are vital to managing energy demands, and the role played by AI here has become significant in recent years. Many efforts have been made by industry leaders such as Bill Gates to provide better solutions for efficient energy consumption. Space exploration is also being reinforced with AI: better communication, remote sensing, and data analysis have become key components in the challenge to unravel the mysteries of the universe.

Mobility (Industry)

Mobility is a key part of everyday life. From the personal car to space-exploring rockets, there are many places that can be enhanced by AI. Autonomous vehicles and sensing features provide safety and efficiency, and many car companies have already moved toward AI to power vehicles and provide new features for drivers.

Cloud Computing

Cloud computing is a major component of today's service infrastructures. Artificial intelligence, microservices, storage, virtualization, and parallel computing are some of its key aspects.

Commerce

Commerce is a field reinforced with AI and related technologies to provide better service to clients. Amazon is one of the leading companies in e-commerce, where recommendation engines play a major role.

Complementary Material

  • When working with books, ePubs typically display better than PDF. For ePub, we recommend using iBooks on macOS and calibre on all other systems.

Piazza

Piazza. The link for all those who participate in the IU class to its class Piazza.

Scientific Writing with Markdown

Scientific Writing with Markdown (ePub) (PDF)

Git Pull Request

Git Pull Request. Here you will learn how to do a simple git pull request, either via the GitHub GUI or the git command-line tools.

Introduction to Linux

This course does not require you to do much Linux. However, if you do need it, we recommend the following as a starting point.

The most elementary Linux features can be learned in 12 hours. This includes bash, an editor, the directory structure, and managing files. Under Windows, we recommend using gitbash, a terminal with all the commands built in that you would need for elementary work.

Introduction to Linux (ePub) (PDF)

Older Course Material

Older versions of the material are available at

Lecture Notes 2020 (ePub) (PDF)
Big Data Applications (Nov. 2019) (ePub) (PDF)
Big Data Applications (2018) (ePub) (PDF)

Contributions

You can contribute to the material with useful links and sections that you find. Just make sure that you do not plagiarize when making contributions. Please review our guide on plagiarism.

Computer Needs

This course does not require a sophisticated computer. Most of the things can be done remotely. Even a Raspberry Pi with 4 or 8GB could be used as a terminal to log into remote computers. This will cost you between $50 and $100 depending on version and equipment. However, we will not teach you how to use or set up a Pi or another computer in this class. This is for you to do and find out.

In case you need to buy a new computer for school, make sure the computer is upgradable to 16GB of main memory. We no longer recommend HDDs; use SSDs instead. Buy the fast ones, as not every SSD is the same. Samsung offers some under the EVO Pro branding. Get as much memory as you can afford. Also, make sure you back up your work regularly, either to online storage such as Google Drive or to an external drive.

1.4 - REU 2020

This course introduces the REU students to various topics in Intelligent Systems Engineering. The course was taught in Summer 2020.

Computational Foundations

  • Brief Overview of the Praxis AI Platform and Overview of the Learning Paths
  • Accessing Praxis Cloud
  • Introduction To Linux and the Command Line
  • Jupyter Notebooks
  • A Brief Intro to Machine Learning in Google Colaboratory

Programming with Python

Selected chapters from our Python book:

  • Analyzing Patient Data
  • Loops Lists Analyzing Data
  • Functions Errors Exceptions
  • Defensive Programming Debugging
Python Introduction to Python (ePub) (PDF)

Coronavirus Overview

Basic Virology and Immunology

Case Study: 1918 Influenza Pandemic Case Study: 1918 Influenza Pandemic: Prior to COVID-19, the 1918 influenza pandemic was the most severe pandemic in recent history. First identified in military personnel in the spring of 1918, the influenza was an H1N1 virus of avian origin. It is commonly referred to by scientists and historians as “the Mother of all Pandemics.” This pandemic is often referred to as the “Spanish Flu” in the lay press, though this name is a misnomer, and the virus likely originated elsewhere. Contemporary reporting focused heavily on Spain, as it was one of few places at the time that did not have restrictions on the press during World War I.
SnapShot: COVID-19 SnapShot: COVID-19: In December 2019, several cases of pneumonia of unknown origin were reported in Wuhan, China. The causative agent was characterized as a novel coronavirus, initially referred to as 2019-nCoV and renamed severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) (Zhou et al., 2020b). This respiratory illness, coronavirus disease 2019 (COVID-19), has spread rapidly by human-to-human transmission, caused major outbreaks worldwide, and resulted in considerable morbidity and mortality. On March 11, 2020, WHO classified COVID-19 as a pandemic. It has stressed health systems and the global economy, as governments balance prevention, clinical care, and socioeconomic challenges.
Basic Virology and Immunology Basic Virology and Immunology: In December 2019, a series of cases of pneumonia of unknown origin were reported in Wuhan, the capital city of Hubei province in China. The causative virus was isolated and characterized in January 2020 (Zhou et al., Nature 2020, Zhu et al., NEJM 2020). On January 12, 2020, the World Health Organization (WHO) tentatively named the virus as the 2019 novel coronavirus (2019-nCoV). On January 30, 2020 WHO issued a public health emergency of international concern (PHEIC) and on February 11, 2020, the WHO formally named the disease caused by the novel coronavirus as coronavirus disease 2019 (COVID-19). At that time, based on its genetic relatedness to known coronaviruses and established classification system, the International Committee on Taxonomy of Viruses classified and renamed 2019-nCoV as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). On March 11, 2020, the WHO formally characterized the global spread of COVID-19 as a pandemic, the first to be caused by a coronavirus.

Clinical Presentation

Clinical Presentation Clinical Presentation

Management of COVID-19

Management of COVID-19 Management of COVID-19

Investigational Therapeutics and Vaccine Development

Investigational Therapeutics and Vaccine Development Investigational Therapeutics and Vaccine Development

Coronavirus Genomics Superlab

Pull from Computational Biology Journey

SARS by the numbers

SARS by the numbers SARS by the numbers

Epidemiology

Introduction to Epidemiological Terms

Principles Principles
Summary Summary
Introduction Introduction

Where Are We Now?

Where Are We Now? Where Are We Now?

Where Will We Be Next?

Where Will We Be Next? Where Will We Be Next?

Approaches to Long-Term Planning

Approaches to Long-Term Planning Approaches to Long-Term Planning

Case Studies

Case Studies 1918-influenza-pandemic
2009 H1N1 pandemic
South Korea 2020

Introduction to AI/Deep Learning

Deep Learning in Health and Medicine B: Diagnostics

Deep Learning in Health and Medicine C: Examples

Deep Learning in Health and Medicine D: Impact of Corona Virus Covid-19

Deep Learning in Health and Medicine E: Corona Virus Covid-19 and Recession

Deep Learning in Health and Medicine F: Tackling Corona Virus Covid-19

Deep Learning in Health and Medicine G: Data and Computational Science and The Corona Virus Covid-19

Deep Learning in Health and Medicine H: Screening Covid-19 Drug Candidates

Deep Learning in Health and Medicine I: Areas for Covid19 Study and Pandemics as Complex Systems

REU Projects

  • REU Individual Project Overview and Expectations
  • Accessing the Coronavirus Datasets
  • Primer on How to Analyze the Data

Effect of AI on Industry and its Transformation Introduction to AI First Engineering

Examples of Applications of Deep Learning

Optimization – a key goal of Statistics, AI and Deep Learning

Learn the Deep Learning important words/components

Deep Learning and Imaging: Its first great success

For the Big Data class we revised the following material: Big Data Overview Fall 2019

Big Data, technology, clouds and selected applications

Big Data: 20 videos covering Big Data, technology, clouds and selected applications

Cloud Computing

Case Studies 18 Videos covering cloud computing

1.5 - Big Data 2019

This coursebook introduces the students to Cloud Big Data Applications

Big Data Applications

The document is available as an online book in ePub and PDF

For ePub, we recommend using iBooks on macOS and calibre on all other systems.

1.6 - Cloud Computing

This is a large volume that introduces you to many aspects of cloud computing.

Cloud Computing

The document is available as an online book in ePub and PDF from the following Web Page:

For ePub, we recommend using iBooks on macOS and calibre on all other systems.

The book has over 590 pages. Topics covered include:

  • DEFINITION OF CLOUD COMPUTING
  • CLOUD DATACENTER
  • CLOUD ARCHITECTURE
  • CLOUD REST
    • NIST
    • GRAPHQL
  • HYPERVISOR
    • Virtualization
      • Virtual Machine Management with QEMU
  • IAAS
    • Multipass
    • Vagrant
    • Amazon Web Services
    • Microsoft Azure
    • Google IaaS Cloud Services
    • OpenStack
    • Python Libcloud
    • AWS Boto
    • Cloudmesh
  • MAPREDUCE
    • HADOOP
    • SPARK
    • HADOOP ECOSYSTEM
    • TWISTER
    • HADOOP RDMA
  • CONTAINERS
    • DOCKER
    • KUBERNETES
    • Singularity
  • SERVERLESS
    • FaaS
    • Apache OpenWhisk
    • Kubeless
    • OpenFaaS
    • OpenLambda
  • MESSAGING
    • MQTT
    • Apache Avro
  • GO
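The MapReduce entries above describe a programming model as much as a software stack. As a small illustrative sketch (independent of Hadoop or Spark, with made-up example documents), word counting in the map-reduce style looks like this in plain Python:

```python
from collections import Counter
from functools import reduce

# two small example "documents" (made up for illustration)
docs = ["big data big clouds", "data on clouds"]

# map phase: count words within each document independently
mapped = [Counter(doc.split()) for doc in docs]

# reduce phase: merge the partial counts into a global count
total = reduce(lambda a, b: a + b, mapped)

print(total["big"])     # 2
print(total["data"])    # 2
print(total["clouds"])  # 2
```

In Hadoop or Spark the map and reduce phases run in parallel across many machines; the sketch only shows the shape of the computation.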

1.7 - Data Science to Help Society

In this module, we will learn how to apply data science for the good of society. We introduce two examples, one for COVID-19, the other for hydrology.

COVID 101, Climate Change and their Technologies

General Material

Python Language

Need to add material here

Using Google CoLab and Jupyter notebooks

  • For questions on software, please mail Fugang Wang

    • Fugang can also give you help on Python, including introductory material if you need extra help
  • 5 notebooks from Google

  • Introduction to Machine Learning Using TensorFlow (pptx)

  • Introduction to using Colab from IU class E534 with videos and note (google docs) This unit includes 3 videos

    • How to create a colab notebook (mp4)

    • How to create a simple program (mp4)

    • How to do benchmark (mp4)

  • Deep Learning for MNIST The docs are located alongside the video at

    • Introduction to MNIST

    • This teaches how to do deep learning on a handwriting example from NIST which is used in many textbooks

    • In the latter part of the document, a homework description is given. That can be ignored!

    • There are 5 videos

      1. DNN MNIST Introduction (mp4)

      2. DNN MNIST import section (mp4)

        • Running into import errors starting at the keras.models line in the code
      3. DNN MNIST data preprocessing (mp4)

      4. DNN MNIST model definition (mp4)

      5. DNN MNIST final presentation (mp4)

  • Jupyter notebook on Google Colab for COVID-19 data analysis ipynb
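The COVID-19 data analysis notebook mentioned above works with time series of case counts. As a minimal sketch with made-up numbers (not real data), a typical first analysis computes the day-over-day growth factor and the implied doubling time:

```python
import math

# hypothetical cumulative case counts for five consecutive days
cases = [100, 150, 225, 338, 507]

# day-over-day growth factors
growth = [cases[i + 1] / cases[i] for i in range(len(cases) - 1)]
print(growth)

# for a constant growth factor g, the doubling time in days is ln(2)/ln(g)
g = growth[0]
print(round(math.log(2) / math.log(g), 2))  # about 1.71 days
```

Real notebooks add smoothing and plotting on top of this, but the core arithmetic is the same.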

Follow-up on Discussion of AI remaking Industry worldwide

  • Class on AI First Engineering with 35 videos describing technologies and particular industries Commerce, Mobility, Banking, Health, Space, Energy in detail (youtube playlist)

  • Introductory Video (one of 35) discussing the Transformation - Industries invented and remade through AI (youtube)

  • Some online videos on deep learning

    • Introduction to AI First Engineering (youtube)
  • Examples of Applications of Deep Learning (youtube)

Optimization -- a key in Statistics, AI and Deep Learning (youtube)

Learn the Deep Learning important words and parts (youtube)

Deep Learning and Imaging: Its first great success (youtube)

Covid Material

Covid Biology Starting point

Medical Student COVID-19 Curriculum - COVID-19 Curriculum Module 1 and then module 2

Compucell3D Modelling material

Interactive Two-Part Virtual Miniworkshop on Open-Source CompuCell3D

Multiscale, Virtual-Tissue Spatio-Temporal Simulations of COVID-19 Infection, Viral Spread and Immune Response and Treatment Regimes (VTcovid19Symp)

  • Part I: Will be presented twice:
  • First Presentation June 11th, 2020, 2PM-5PM EST (6 PM- 9PM GMT)
  • Second Presentation June 12th, 9AM - 12 noon EST (1 PM - 4 PM GMT)
  • Part II: Will be presented twice:
  • First Presentation June 18th, 2020, 2PM-5PM EST (6 PM- 9PM GMT)
  • Second Presentation June 19th, 9AM - 12 noon EST (1 PM - 4 PM GMT)

Topics in Covid 101

  • Biology1 and Harvard medical school material above
  • Epidemiology2
  • Public Health: Social Distancing and Policies3
  • HPC4
  • Data Science 5,6,7
  • Modeling 8,9

Climate Change Material

Topics in Climate Change (Russell Hofmann)

References


  1. Y. M. Bar-On, A. I. Flamholz, R. Phillips, and R. Milo, “SARS-CoV-2 (COVID-19) by the numbers,” arXiv [q-bio.OT], 28-Mar-2020. http://arxiv.org/abs/2003.12886 ↩︎

  2. Jiangzhuo Chen, Simon Levin, Stephen Eubank, Henning Mortveit, Srinivasan Venkatramanan, Anil Vullikanti, and Madhav Marathe, “Networked Epidemiology for COVID-19,” Siam News, vol. 53, no. 05, Jun. 2020. https://sinews.siam.org/Details-Page/networked-epidemiology-for-covid-19 ↩︎

  3. A. Adiga, L. Wang, A. Sadilek, A. Tendulkar, S. Venkatramanan, A. Vullikanti, G. Aggarwal, A. Talekar, X. Ben, J. Chen, B. Lewis, S. Swarup, M. Tambe, and M. Marathe, “Interplay of global multi-scale human mobility, social distancing, government interventions, and COVID-19 dynamics,” medRxiv - Public and Global Health, 07-Jun-2020. http://dx.doi.org/10.1101/2020.06.05.20123760 ↩︎

  4. D. Machi, P. Bhattacharya, S. Hoops, J. Chen, H. Mortveit, S. Venkatramanan, B. Lewis, M. Wilson, A. Fadikar, T. Maiden, C. L. Barrett, and M. V. Marathe, “Scalable Epidemiological Workflows to Support COVID-19 Planning and Response,” May 2020. ↩︎

  5. Luca Magri and Nguyen Anh Khoa Doan, “First-principles Machine Learning for COVID-19 Modeling,” Siam News, vol. 53, no. 5, Jun. 2020. https://sinews.siam.org/Details-Page/first-principles-machine-learning-for-covid-19-modeling ↩︎

  6. Robert Marsland and Pankaj Mehta, “Data-driven modeling reveals a universal dynamic underlying the COVID-19 pandemic under social distancing,” arXiv [q-bio.PE], 21-Apr-2020. http://arxiv.org/abs/2004.10666 ↩︎

  7. Geoffrey Fox, “Deep Learning Based Time Evolution.” http://dsc.soic.indiana.edu/publications/Summary-DeepLearningBasedTimeEvolution.pdf ↩︎

  8. T. J. Sego, J. O. Aponte-Serrano, J. F. Gianlupi, S. Heaps, K. Breithaupt, L. Brusch, J. M. Osborne, E. M. Quardokus, and J. A. Glazier, “A Modular Framework for Multiscale Spatial Modeling of Viral Infection and Immune Response in Epithelial Tissue,” BioRxiv, 2020. https://www.biorxiv.org/content/10.1101/2020.04.27.064139v2.abstract ↩︎

  9. Yafei Wang, Gary An, Andrew Becker, Chase Cockrell, Nicholson Collier, Morgan Craig, Courtney L. Davis, James Faeder, Ashlee N. Ford Versypt, Juliano F. Gianlupi, James A. Glazier, Randy Heiland, Thomas Hillen, Mohammad Aminul Islam, Adrianne Jenner, Bing Liu, Penelope A Morel, Aarthi Narayanan, Jonathan Ozik, Padmini Rangamani, Jason Edward Shoemaker, Amber M. Smith, Paul Macklin, “Rapid community-driven development of a SARS-CoV-2 tissue simulator,” BioRxiv, 2020. https://www.biorxiv.org/content/10.1101/2020.04.02.019075v2.abstract ↩︎

  10. Gagne II, D. J., S. E. Haupt, D. W. Nychka, and G. Thompson, 2019: Interpretable Deep Learning for Spatial Analysis of Severe Hailstorms. Mon. Wea. Rev., 147, 2827–2845, https://doi.org/10.1175/MWR-D-18-0316.1 ↩︎

1.8 - Intelligent Systems

This book introduces you to the concepts used to build Intelligent Systems.

Intelligent Systems Engineering

The book is available in ePub and PDF

1.9 - Linux

You will learn here about using Linux while focusing mostly on shell command-line usage.

Linux will be used on many computers to develop and interact with cloud services. Especially popular are the command line tools that even exist on Windows. Thus we can have a uniform environment on all platforms using the bash shell.

For ePub, we recommend using iBooks on MacOS and calibre on all other systems.

Topics covered include:

  • Linux Shell
  • Perl one-liners
  • Refcards
  • SSH
    • keygen
    • agents
    • port forwarding
  • Shell on Windows
  • ZSH

1.10 - Markdown


An important part of any scientific research is to communicate and document it. Previously we used LaTeX in this class to provide the ability to contribute professional-looking documents. However, here we will describe how you can use markdown to create scientific documents. We use markdown also on the Web page.

Scientific Writing with Markdown

The document is available as an online book in ePub and PDF

For ePub, we recommend using iBooks on macOS and calibre on all other systems.

Topics covered include:

  • Plagiarism
  • Writing Scientific Articles
  • Markdown (Pandoc format)
  • Markdown for presentations
  • Writing papers and reports with markdown
  • Emacs and markdown as an editor
  • Graphviz in markdown

1.11 - OpenStack

You will have the opportunity to learn more about OpenStack. OpenStack is a cloud toolkit that allows you to do bare-metal and virtual machine provisioning.

OpenStack is usable via command-line tools and REST APIs. You will be able to experiment with it on Chameleon Cloud.

OpenStack with Chameleon Cloud

We have put together, from the Chameleon Cloud manual, a subset of information that is useful for using OpenStack. This focuses mostly on virtual machine provisioning. The reason we provide our own documentation here is to promote more secure utilization of Chameleon Cloud.

Additional material on how to uniformly access OpenStack via a multicloud command line tool is available at:

We highly recommend you use the multicloud environment, as it will also allow you to access AWS, Azure, Google, and other clouds from the same command-line interface.

The Chameleon Cloud document is available as an online book in ePub and PDF from the following Web Page:


For ePub, we recommend using iBooks on macOS and calibre on all other systems.

Topics covered include:

  • Using Chameleon Cloud more securely
  • Resources
  • Hardware
  • Charging
  • Getting Started
  • Virtual Machines
  • Commandline Interface
  • Horizon
  • Heat
  • Bare metal
  • FAQ

1.12 - Python

You will find here information about learning the Python Programming language and learn about its ecosystem.

Python is an easy to learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Introduction to Python

This online book will provide you with enough information to conduct your programming for the cloud in Python. Although the introduction was first developed for Cloud Computing related classes, it is a general introduction suitable for other classes.

Introduction to Python

The document is available as an online book in ePub and PDF

For ePub, we recommend using iBooks on macOS and calibre on all other systems.

Topics covered include:

  • Python Installation
    • Using Multiple different Python Versions
  • First Steps
    • REPL
    • Editors
    • Google Colab
  • Python Language
  • Python Modules
  • Selected Libraries
    • Python Cloudmesh Common Library
    • Basic Matplotlib
    • Basic Numpy
    • Python Data Management
    • Python Data Formats
    • Python MongoDB
  • Parallelism in Python
  • Scipy
  • Scikitlearn
  • Elementary Machine Learning
  • Dask
  • Applications
    • Fingerprint Matching
    • Face Detection

1.13 - MNIST Classification on Google Colab

In this mini-course, you will learn how to use Google Colab while using the well known MNIST example

MNIST Character Recognition

We discuss in this module how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures.

Prerequisite

  • Knowledge of Python
  • Google account

Effort

  • 1 hour

Topics covered

  • Using Google Colab
  • Running an AI application on Google Colab

1. Introduction to Google Colab

This module will introduce you to how to use Google Colab to run deep learning models.

A Gentle Introduction to Google Colab (Web)

2. (Optional) Basic Python in Google Colab

In this module, we will take a look at some fundamental Python Concepts needed for day-to-day coding.

A Gentle Introduction to Python on Google Colab (Web)

3. MNIST On Google colab

In this module, we discuss how to create a simple IPython Notebook to solve an image classification problem. MNIST contains a set of pictures

MNIST Classification on Google Colab (Web)
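The preprocessing steps that such an MNIST notebook performs can be sketched without any deep learning framework. The numbers below are made up for illustration; real MNIST images are 28x28 grids of pixel values from 0 to 255:

```python
# a hypothetical 2x2 grayscale "image" with pixel values 0..255
image = [[0, 128], [255, 64]]

# scale pixel values to [0, 1] before feeding them to a network
scaled = [[p / 255.0 for p in row] for row in image]
print(scaled[1][0])  # 1.0

# one-hot encode a digit label in 0..9
label = 3
one_hot = [1.0 if i == label else 0.0 for i in range(10)]
print(one_hot.index(1.0))  # 3
```

In the Colab notebook, libraries such as Keras perform these steps with array operations, but the idea is the same.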

Assignments

  1. Get an account on Google if you do not have one.
  2. Do the optional Basic Python Colab lab module
  3. Do the MNIST Colab module.

References

2 - Books

New: Experimental feature to convert the content we have in several open-source books into the Cybertraining Web pages. We have used the Python book as an example. However, we have additional books that we will convert.

2.1 - Python

Gregor von Laszewski (laszewski@gmail.com)

2.1.1 - Introduction to Python

Gregor von Laszewski (laszewski@gmail.com)


Learning Objectives

  • Quickly learn Python under the assumption that you already know a programming language
  • Work with modules
  • Understand docopts and cmd
  • Conduct some Python examples to refresh your Python knowledge
  • Learn about the map function in Python
  • Learn how to start subprocesses and redirect their output
  • Learn more advanced constructs such as multiprocessing and Queues
  • Understand why we do not use anaconda
  • Get familiar with venv

Portions of this lesson have been adapted from the official Python Tutorial copyright Python Software Foundation.

Python is an easy-to-learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third-party Python modules, programs and tools, and additional documentation. The Python interpreter can be extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

Python is an interpreted, dynamic, high-level programming language suitable for a wide range of applications.

The philosophy of Python is summarized in The Zen of Python as follows:

  • Explicit is better than implicit
  • Simple is better than complex
  • Complex is better than complicated
  • Readability counts

The main features of Python are:

  • Use of indentation whitespace to indicate blocks
  • Object-oriented paradigm
  • Dynamic typing
  • Interpreted runtime
  • Garbage-collected memory management
  • A large standard library
  • A large repository of third-party libraries
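Two of these features, dynamic typing and indentation-based blocks, can be seen in a minimal example:

```python
# dynamic typing: the same name can be rebound to values of different types
x = 42
print(type(x).__name__)  # int
x = "forty-two"
print(type(x).__name__)  # str

# whitespace indentation defines blocks
total = 0
for n in [1, 2, 3]:
    total += n
print(total)  # 6
```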

Python is used by many companies and is applied for web development, scientific computing, embedded applications, artificial intelligence, software development, and information security, to name a few.

The material collected here introduces the reader to the basic concepts and features of the Python language and system. After you have worked through the material you will be able to:

  • use Python
  • use the interactive Python interface
  • understand the basic syntax of Python
  • write and run Python programs
  • have an overview of the standard library
  • install Python libraries using venv for multi-Python interpreter development.

This book does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features and will give you a good idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules.

In order to conduct this lesson you need

  • A computer with Python 3.8.1 or newer
  • Familiarity with command line usage
  • A text editor such as PyCharm, emacs, vi, or others. You should identify which works best for you and set it up.

References

Some important additional information can be found on the following Web pages.

Python module of the week is a Web site that provides a number of short examples on how to use some elementary python modules. Not all modules are equally useful and you should decide if there are better alternatives. However, for beginners, this site provides a number of good examples

2.1.2 - Python Installation

Gregor von Laszewski (laszewski@gmail.com)


Learning Objectives

  • Learn how to install Python.
  • Find additional information about Python.
  • Make sure your Computer supports Python.

In this section, we explain how to install python 3.8 on a computer. Likely much of the code will work with earlier versions, but we do the development in Python on the newest version of Python available at https://www.python.org/downloads .

Hardware

Python does not require any special hardware. We have installed Python not only on PCs and laptops but also on Raspberry Pis and Lego Mindstorms.

However, there are some things to consider. If you use many programs on your desktop and run them all at the same time, you will find that even on up-to-date operating systems you can quickly run out of memory. This is especially true if you use editors such as PyCharm, which we highly recommend. Furthermore, as you will likely have lots of disk access, make sure to use a fast HDD or, better, an SSD.

A typical modern developer PC or laptop has 16GB RAM and an SSD. You can certainly run Python on a $35-$55 Raspberry Pi, but you probably will not be able to run PyCharm. There are many alternative editors with a smaller memory footprint available.

Python 3.9

Here we discuss how to install Python 3.9 or newer on your operating system. It is typically advantageous to use a newer version of Python so you can leverage the latest features. Please be aware that many operating systems come with older versions that may or may not work for you. You can always start with the version that is installed and update later if you run into issues.

Python 3.9 on macOS

You want a number of useful tools on your macOS. This includes git, make, and a c compiler. All this can be installed with Xcode which is available from

Once you have installed it, you need to install macOS XCode command-line tools:

$ xcode-select --install

The easiest installation of Python is to use the installation from https://www.python.org/downloads. Please, visit the page and follow the instructions to install the python .pkg file. After this install, you have python3 available from the command line.

Python 3.9 on macOS via Homebrew

Homebrew may not provide you with the newest version, so we recommend using the install from python.org if you can.

An alternative installation is provided from Homebrew. To use this install method, you need to install Homebrew first. Start the process by installing Python 3 using homebrew. Install homebrew using the instruction in their web page:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then you should be able to install Python using:

$ brew install python

Python 3.9 on Ubuntu 20.04

The default version of Python on Ubuntu 20.04 is 3.8. However, you can benefit from newer versions by either installing them through python.org or adding them as follows:

$ sudo apt-get update
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa -y
$ sudo apt-get install python3.9 python3-dev -y

Now you can verify the version with

$ python3.9 --version

which should be 3.9.5 or newer.

Now we will create a new virtual environment:

$ python3.9 -m venv --without-pip ~/ENV3

Now you must edit the ~/.bashrc file and add the following line at the end:

alias ENV3="source ~/ENV3/bin/activate"
ENV3

Now activate the virtual environment using:

$ source ~/.bashrc

You can install the pip for the virtual environment with the commands:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py
$ rm get-pip.py
$ pip install -U pip
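To verify from within Python that the virtual environment is actually active, you can compare the interpreter prefixes. This is a small sketch using only the standard library:

```python
import sys

# inside an activated venv, sys.prefix differs from the base interpreter prefix
in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
print("virtual environment active:", in_venv)
print("interpreter:", sys.executable)
```

If the check prints False, you likely forgot to source the activate script in the current terminal.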

Prerequisite Windows 10

Python 3.9.5 can be installed on Windows 10 using: https://www.python.org/downloads

Let us assume you choose the Web-based installer. Then you click on the downloaded file in the Edge browser (make sure the account you use has administrative privileges) and follow the instructions that the installer gives. It is important that you select [x] Add Python to PATH at one point; there will be an empty checkbox for this that you need to click.

Once it is installed, choose a terminal and execute

python --version

However, if you have installed conda for some reason, you need to read up on how to install Python 3.9.5 in conda or identify how to run conda and the python.org installation at the same time. We often see others giving the wrong installation instructions. Please also be aware that when you uninstall conda, it is not sufficient to just delete it. You will have to make sure that you unset the system variables automatically set at install time. This includes modifications on Linux and/or macOS in .zprofile, .bashrc, and .bash_profile. On Windows, PATH and other environment variables may have been modified.

Python in the Linux Subsystem

An alternative is to use Python from within the Linux Subsystem. But that has some limitations, and you will need to explore how to access the file system in the subsystem to have a smooth integration between your Windows host so you can, for example, use PyCharm.

To activate the Linux Subsystem, please follow the instructions at

A suitable distribution would be

However, as it may use an older version of Python, you may want to update it as previously discussed

Using venv

This step is needed if you have not already installed a venv for Python to make sure you are not interfering with your system Python. Not using a venv could have catastrophic consequences, including the destruction of your operating system tools if they rely on Python. The use of venv is simple. For our purposes, we assume that you use the directory:

~/ENV3

Follow these steps first:

First, cd to your home directory. Then execute:

$ python3 -m venv  ~/ENV3
$ source ~/ENV3/bin/activate

If you like to activate the venv automatically when you start a new terminal, add the following line at the end of your .bashrc (Ubuntu), or your .bash_profile or .zprofile (macOS) file:

$ source ~/ENV3/bin/activate

so the environment is always loaded. Now you are ready to install Cloudmesh.

Check if you have the right version of Python installed with

$ python --version

To make sure you have an up-to-date version of pip, issue the command

$ pip install pip -U

Install Python 3.9 via Anaconda

We are not recommending either conda or anaconda. If you do use them, it is your responsibility to update the information in this section accordingly.

Warning: We will check your Python installation, and if you use conda or anaconda you need to work on completing this section.

Download conda installer

Miniconda is recommended here. Download an installer for Windows, macOS, and Linux from this page: https://docs.conda.io/en/latest/miniconda.html

Install conda

Follow instructions to install conda for your operating systems:

Install Python via conda

To install Python 3.9.5 in a virtual environment with conda please use

$ cd ~
$ conda create -n ENV3 python=3.9.5
$ conda activate ENV3
$ conda install -c anaconda pip
$ conda deactivate

It is very important to make sure you have a newer version of pip installed. After you have installed and created ENV3, you need to activate it. This can be done with

$ conda activate ENV3

If you like to activate it when you start a new terminal, please add the line conda activate ENV3 to your .bashrc or .bash_profile. If you use zsh, please add it to .zprofile instead.

Version test

Regardless of which version you install, you must do a version test to make sure you have the correct python and pip versions:

$ python --version
$ pip --version

If you installed everything correctly you should see

Python 3.9.5
pip 21.1.2

or newer.
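You can also check the version programmatically from within Python, which is handy in scripts:

```python
import sys

# programmatic equivalent of `python --version`
major, minor, micro = sys.version_info[:3]
print(f"{major}.{minor}.{micro}")
```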

2.1.3 - Interactive Python

Gregor von Laszewski (laszewski@gmail.com)

Python can be used interactively. You can enter the interactive mode by entering the interactive loop by executing the command:

$ python

You will see something like the following:

$ python
Python 3.9.5 (v3.9.5:0a7dcbdb13, May  3 2021, 13:17:02) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> is the prompt used by the interpreter. This is similar to bash where commonly $ is used.

Sometimes it is convenient to show the prompt when illustrating an example. This is to provide some context for what we are doing. If you are following along you will not need to type in the prompt.

This interactive python process does the following:

  • read your input commands
  • evaluate your command
  • print the result of the evaluation
  • loop back to the beginning.

This is why you may see the interactive loop referred to as a REPL: Read-Evaluate-Print-Loop.

REPL (Read Eval Print Loop)

There are many different types beyond what we have seen so far, such as dictionaries, lists, and sets. One handy way of using the interactive python is to get the type of a value using type():

>>> type(42)
<class 'int'>
>>> type('hello')
<class 'str'>
>>> type(3.14)
<class 'float'>

You can also ask for help about something using help():

>>> help(int)
>>> help(list)
>>> help(str)

Using help() opens up a help message within a pager. To navigate you can use the spacebar to go down a page, w to go up a page, the arrow keys to go up/down line-by-line, or q to exit.

Interpreter

Although the interactive mode provides a convenient tool to test things out you will see quickly that for our class we want to use the python interpreter from the command line. Let us assume the program is called prg.py. Once you have written it in that file you simply can call it with

$ python prg.py

It is important to give your programs meaningful names.

2.1.4 - Editors

Gregor von Laszewski (laszewski@gmail.com)

This section is meant to give an overview of the Python editing tools needed for completing this course. There are many other alternatives; however, we do recommend using PyCharm.

PyCharm

PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with git.

Video: PyCharm (8:56)


Python in 45 minutes

Next is an additional community YouTube video about the Python programming language. Naturally, there are many alternatives to this video, but it is probably a good start. It also uses PyCharm which we recommend.

Video: Python (43:16), PyCharm


How much you want to understand Python is a bit up to you. While it is good to know classes and inheritance, you may be able to get away without using it for this class. However, we do recommend that you learn it.

PyCharm Installation:

Method 1: Download and install it from the PyCharm website. This is easy and if no automated install is required we recommend this method. Students and teachers can apply for a free professional version. Please note that Jupyter notebooks can only be viewed in the professional version.

Method 2: PyCharm Installation on ubuntu using umake

$ sudo add-apt-repository ppa:ubuntu-desktop/ubuntu-make
$ sudo apt-get update
$ sudo apt-get install ubuntu-make

Once the umake command is installed, use the next command to install PyCharm community edition:

$ umake ide pycharm

If you want to remove PyCharm installed using umake command, use this:

$ umake -r ide pycharm

Method 3: PyCharm installation on ubuntu using PPA

$ sudo add-apt-repository ppa:mystic-mirage/pycharm
$ sudo apt-get update
$ sudo apt-get install pycharm-community

PyCharm also has a Professional (paid) version that can be installed using the following command:

$ sudo apt-get install pycharm

Once installed, go to your VM dashboard and search for PyCharm.

2.1.5 - Google Colab

Gregor von Laszewski (laszewski@gmail.com)

In this section, we are going to introduce how to use Google Colab to run deep learning models.

Introduction to Google Colab

This video contains the introduction to Google Colab. In this section we will be learning how to start a Google Colab project.

Video


Programming in Google Colab

In this video, we will learn how to create a simple Colab notebook.

Required Installations

pip install numpy

Video


Benchmarking in Google Colab with Cloudmesh

In this video, we learn how to do a basic benchmark with Cloudmesh tools. Cloudmesh StopWatch will be used in this tutorial.

Required Installations

pip install numpy
pip install cloudmesh-installer
pip install cloudmesh-common

Video


2.1.6 - Language

Gregor von Laszewski (laszewski@gmail.com)

Statements and Strings

Let us explore the syntax of Python while starting with a print statement

print("Hello world from Python!")

This will print on the terminal

Hello world from Python!

The print function was given a string to process. A string is a sequence of characters. A character can be alphabetic (A through Z, lower and upper case), numeric (any of the digits), white space (spaces, tabs, newlines, etc.), a syntactic directive (comma, colon, quotation, exclamation, etc.), and so forth. A string is just a sequence of characters, typically indicated by surrounding the characters in double quotes.

Standard output is discussed in the Section Linux.

So, what happened when you pressed Enter? The interactive Python program read the line print ("Hello world from Python!"), split it into the print statement and the "Hello world from Python!" string, and then executed the line, showing you the output.

Comments

Comments in Python start with a #:

# This is a comment

Variables

You can store data into a variable to access it later. For instance:

hello = 'Hello world from Python!'
print(hello)

This will print again

Hello world from Python!

Data Types

Booleans

A boolean is a value that can have the values True or False. You can combine booleans with boolean operators such as and and or

print(True and True) # True
print(True and False) # False
print(False and False) # False
print(True or True) # True
print(True or False) # True
print(False or False) # False
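
Booleans also support not, and the boolean operators short-circuit: Python stops evaluating as soon as the result is known. A small sketch:

```python
print(not True)   # False
print(not False)  # True

# short-circuit: since the left side is False, the right side is never evaluated
calls = []
def noisy():
    calls.append('called')
    return True

print(False and noisy())  # False
print(calls)              # [] -- noisy() was never called
```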

Numbers

The interactive interpreter can also be used as a calculator. For instance, say we wanted to compute a multiple of 21:

print(21 * 2) # 42

We saw here the print statement again. We passed in the result of the operation 21 * 2. An integer (or int) in Python is a numeric value without a fractional component (those are called floating point numbers, or float for short).

The mathematical operators compute the related mathematical operation to the provided numbers. Some operators are:

Operator Function


* multiplication
/ division
+ addition
- subtraction
** exponent

Exponentiation $x^y$, x to the yth power, is written as x**y.

You can combine floats and ints:

print(3.14 * 42 / 11 + 4 - 2) # approximately 13.99
print(2**3) # 8

Note that operator precedence is important. Using parentheses to affect the order of operations gives different results, as expected:

print(1 + 2 * 3 - 4 / 5.0) # 6.2
print( (1 + 2) * (3 - 4) / 5.0 ) # -0.6

Module Management

A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can bind and reference. A module is a file consisting of Python code. A module can define functions, classes, and variables. A module can also include runnable code.

Import Statement

When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module. It is preferred to give each import its own line, such as:

import numpy
import matplotlib

When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module.

The from … import Statement

Python’s from statement lets you import specific attributes from a module into the current namespace. The from … import has the following syntax:

from datetime import datetime

Date Time in Python

The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.
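
As a small sketch of the date arithmetic mentioned above, datetime objects can be combined with timedelta:

```python
from datetime import datetime, timedelta

start = datetime(2017, 8, 21)
one_week_later = start + timedelta(weeks=1)
print(one_week_later)                 # 2017-08-28 00:00:00
print((one_week_later - start).days)  # 7
```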

The import Statement

You can use any Python source file as a module by executing an import statement in some other Python source file.

from datetime import datetime

This module offers a generic date/time string parser which is able to parse most known formats to represent a date and/or time.

from dateutil.parser import parse

pandas is an open-source Python library for data analysis that needs to be imported.

import pandas as pd

Create a string variable with the class start time

fall_start = '08-21-2017'

Convert the string to datetime format

datetime.strptime(fall_start, '%m-%d-%Y')
# datetime.datetime(2017, 8, 21, 0, 0)

Creating a list of strings as dates

class_dates = [
    '8/25/2017',
    '9/1/2017',
    '9/8/2017',
    '9/15/2017',
    '9/22/2017',
    '9/29/2017']

Convert Class_dates strings into datetime format and save the list into variable a

a = [datetime.strptime(x, '%m/%d/%Y') for x in class_dates]

Use parse() to attempt to auto-convert common string formats. The input to parse() must be a string or character stream, not a list.

parse(fall_start) # datetime.datetime(2017, 8, 21, 0, 0)

Use parse() on every element of the Class_dates string.

[parse(x) for x in class_dates]
# [datetime.datetime(2017, 8, 25, 0, 0),
#  datetime.datetime(2017, 9, 1, 0, 0),
#  datetime.datetime(2017, 9, 8, 0, 0),
#  datetime.datetime(2017, 9, 15, 0, 0),
#  datetime.datetime(2017, 9, 22, 0, 0),
#  datetime.datetime(2017, 9, 29, 0, 0)]

Use parse, but designate that the day is first.

parse(fall_start, dayfirst=True)
# datetime.datetime(2017, 8, 21, 0, 0)
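
The reverse of parsing is formatting: strftime turns a datetime back into a string, using the same format codes that strptime accepts:

```python
from datetime import datetime

d = datetime(2017, 8, 21)
print(d.strftime('%m-%d-%Y'))  # 08-21-2017
print(d.strftime('%Y/%m/%d'))  # 2017/08/21
```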

Create a dataframe. A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet or database table. A DataFrame is a group of Series objects that share an index (the column names).

import pandas as pd
data = {
  'dates': [
    '8/25/2017 18:47:05.069722',
    '9/1/2017 18:47:05.119994',
    '9/8/2017 18:47:05.178768',
    '9/15/2017 18:47:05.230071',
    '9/22/2017 18:47:05.230071',
    '9/29/2017 18:47:05.280592'],
  'complete': [1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(
  data,
  columns = ['dates','complete'])
print(df)
#                  dates  complete
#  0  8/25/2017 18:47:05.069722 1
#  1   9/1/2017 18:47:05.119994 0
#  2   9/8/2017 18:47:05.178768 1
#  3  9/15/2017 18:47:05.230071 1
#  4  9/22/2017 18:47:05.230071 0
#  5  9/29/2017 18:47:05.280592 1

Convert df['dates'] from string to datetime

import pandas as pd
pd.to_datetime(df['dates'])
# 0   2017-08-25 18:47:05.069722
# 1   2017-09-01 18:47:05.119994
# 2   2017-09-08 18:47:05.178768
# 3   2017-09-15 18:47:05.230071
# 4   2017-09-22 18:47:05.230071
# 5   2017-09-29 18:47:05.280592
# Name: dates, dtype: datetime64[ns]

Control Statements

Comparison

Computer programs do not only execute instructions. Occasionally, a choice needs to be made. Such a choice is based on a condition. Python has several conditional operators:

Operator Function


> greater than
< smaller than
== equals
!= not equal

Conditions are always combined with variables. A program can make a choice using the if keyword. For example:

x = int(input("Guess x:"))
if x == 4:
   print('Correct!')

In this example, Correct! will only be printed if the variable x equals four. Python can also execute multiple conditions using the elif and else keywords.

x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
elif abs(4 - x) == 1:
    print('Wrong, but close!')
else:
    print('Wrong, way off!')
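
Comparisons can also be chained, which often reads more naturally than combining them with and; a small sketch:

```python
x = 4
# equivalent to: 3 < x and x <= 5
print(3 < x <= 5)  # True

# a comparison evaluates to a boolean you can store in a variable
close = abs(4 - x) <= 1
print(close)       # True
```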

Iteration

To repeat code, the for keyword can be used. For example, to display the numbers from 1 to 10, we could write something like this:

for i in range(1, 11):
   print(i)

The second argument to the range, 11, is not inclusive, meaning that the loop will only get to 10 before it finishes. Python itself starts counting from 0, so this code will also work:

for i in range(0, 10):
   print(i + 1)

In fact, the range function defaults to starting value of 0, so it is equivalent to:

for i in range(10):
   print(i + 1)

We can also nest loops inside each other:

for i in range(0,10):
    for j in range(0,10):
        print(i,' ',j)

In this case, we have two nested loops. The code will iterate over the entire coordinate range (0,0) to (9,9)

Datatypes

Lists

see: https://www.tutorialspoint.com/python/python_lists.htm

Lists in Python are ordered sequences of elements, where each element can be accessed using a 0-based index.

To define a list, you simply list its elements between square brackets ‘[ ]':

names = [
  'Albert',
  'Jane',
  'Liz',
  'John',
  'Abby']
# access the first element of the list
names[0]
# 'Albert'
# access the third element of the list
names[2]
# 'Liz'

You can also use a negative index if you want to start counting elements from the end of the list. Thus, the last element has index -1, the second before the last element has index -2 and so on:

# access the last element of the list
names[-1]
# 'Abby'
# access the second last element of the list
names[-2]
# 'John'

Python also allows you to take whole slices of the list by specifying a beginning and end of the slice separated by a colon

# the middle elements, excluding first and last
names[1:-1]
# ['Jane', 'Liz', 'John']

As you can see from the example, the starting index in the slice is inclusive and the ending one, exclusive.
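
A slice can also take a step as a third value, and omitted bounds default to the ends of the list; a few sketches on the same names list:

```python
names = ['Albert', 'Jane', 'Liz', 'John', 'Abby']

print(names[1:4])   # ['Jane', 'Liz', 'John']
print(names[::2])   # every second element: ['Albert', 'Liz', 'Abby']
print(names[::-1])  # a reversed copy: ['Abby', 'John', 'Liz', 'Jane', 'Albert']
```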

Python provides a variety of methods for manipulating the members of a list.

You can add elements with ‘append':

names.append('Liz')
names
# ['Albert', 'Jane', 'Liz',
#  'John', 'Abby', 'Liz']

As you can see, the elements in a list need not be unique.

Merge two lists with ‘extend’:

names.extend(['Lindsay', 'Connor'])
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Abby', 'Liz', 'Lindsay', 'Connor']

Find the index of the first occurrence of an element with ‘index’:

names.index('Liz') # 2

Remove elements by value with ‘remove’:

names.remove('Abby')
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

Remove elements by index with ‘pop’:

names.pop(1)
# 'Jane'
names
# ['Albert', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

Notice that pop returns the element being removed, while remove does not.

If you are familiar with stacks from other programming languages, you can use insert and ‘pop’:

names.insert(0, 'Lincoln')
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay', 'Connor']
names.pop()
# 'Connor'
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']

The Python documentation contains a full list of list operations.

To go back to the range function you used earlier, it creates a sequence of numbers. In Python 3 it returns a range object, which you can convert to a list:

list(range(10))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(2, 10, 2))
# [2, 4, 6, 8]

Sets

Python lists can contain duplicates as you saw previously:

names = ['Albert', 'Jane', 'Liz',
         'John', 'Abby', 'Liz']

When we do not want this to be the case, we can use a set:

unique_names = set(names)
unique_names
# {'Albert', 'Jane', 'Liz', 'John', 'Abby'}

Keep in mind that the set is an unordered collection of objects, thus we can not access them by index:

unique_names[0]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: 'set' object is not subscriptable

However, we can convert a set to a list easily:

unique_names = list(unique_names)
unique_names
# ['Albert', 'Jane', 'Liz', 'John', 'Abby']
unique_names[0]
# 'Albert'

Notice that in this case, the order of elements in the new list matches the order in which the elements were displayed when we created the set. We had

{'Albert', 'Jane', 'Liz', 'John', 'Abby'}

and now we have

['Albert', 'Jane', 'Liz', 'John', 'Abby']

You should not assume this is the case in general. That is, do not make any assumptions about the order of elements in a set when it is converted to any type of sequential data structure.

You can change a set’s contents using the add, remove and update methods which correspond to the append, remove and extend methods in a list. In addition to these, set objects support the operations you may be familiar with from mathematical sets: union, intersection, difference, as well as operations to check containment. You can read about this in the Python documentation for sets.
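
A short sketch of these mathematical set operations:

```python
a = {'Albert', 'Jane', 'Liz'}
b = {'Liz', 'John'}

print(a | b)        # union of both sets
print(a & b)        # intersection: {'Liz'}
print(a - b)        # difference: elements of a that are not in b
print('Jane' in a)  # containment check: True
```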

Removal and Testing for Membership in Sets

One important advantage of a set over a list is that access to elements is fast. If you are familiar with different data structures from a Computer Science class, the Python list is implemented by an array, while the set is implemented by a hash table.

We will demonstrate this with an example. Let us say we have a list and a set of the same number of elements (approximately 100 thousand):

import sys, random, timeit
nums_set = set([random.randint(0, sys.maxsize) for _ in range(10**5)])
nums_list = list(nums_set)
len(nums_set)
# 100000

We will use the timeit Python module to time 100 operations that test for the existence of a member in either the list or set:

timeit.timeit('random.randint(0, sys.maxsize) in nums',
              setup='import random, sys; nums=%s' % str(nums_set), number=100)
# 0.0004038810729980469
timeit.timeit('random.randint(0, sys.maxsize) in nums',
              setup='import random, sys; nums=%s' % str(nums_list), number=100)
# 0.398054122924804

The exact duration of the operations on your system will be different, but the takeaway will be the same: searching for an element in a set is orders of magnitude faster than in a list. This is important to keep in mind when you work with large amounts of data.

Dictionaries

One of the very important data structures in python is a dictionary also referred to as dict.

A dictionary represents a key value store:

computer = {
  'name': 'mycomputer',
  'memory': 16,
  'kind': 'Laptop'
  }
print("computer['name']: ", computer['name'])
# computer['name']:  mycomputer
print("computer['memory']: ", computer['memory'])
# computer['memory']:  16

A convenient form to print named attributes is

print("{name} {memory}".format(**computer))

This form of printing with the format statement and a reference to data increases the readability of the print statements.
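
Since Python 3.6, f-strings offer an even more compact way to achieve the same result:

```python
computer = {
  'name': 'mycomputer',
  'memory': 16,
  'kind': 'Laptop'
  }
print(f"{computer['name']} {computer['memory']}")
# mycomputer 16
```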

You can delete elements with the following commands:

del computer['name'] # remove entry with key 'name'
# computer
# {'memory': 16, 'kind': 'Laptop'}
computer.clear()     # remove all entries in dict
# computer
# {}
del computer         # delete entire dictionary
# computer
# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  NameError: name 'computer' is not defined

You can iterate over a dict:

computer = {
  'name': 'mycomputer',
  'memory': 16,
  'kind': 'Laptop'
  }
for item in computer:
  print(item, computer[item])

# name mycomputer
# memory 16
# kind Laptop

Dictionary Keys and Values

You can retrieve both the keys and values of a dictionary using the keys() and values() methods of the dictionary, respectively:

computer.keys() # dict_keys(['name', 'memory', 'kind'])
computer.values() # dict_values(['mycomputer', 16, 'Laptop'])

Both methods return view objects, which you can convert to lists with list(). Please remember however that in versions of Python before 3.7, you cannot make any assumptions about the order in which the elements of a dictionary will be returned by the keys() and values() methods; since Python 3.7, dictionaries preserve insertion order.

However, you can assume that if you call keys() and values() in sequence, the order of elements will at least correspond in both methods.
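
A common idiom that sidesteps the ordering question is to iterate over key/value pairs together with items():

```python
computer = {
  'name': 'mycomputer',
  'memory': 16,
  'kind': 'Laptop'
  }
for key, value in computer.items():
    print(key, value)
# name mycomputer
# memory 16
# kind Laptop
```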

Counting with Dictionaries

One application of dictionaries that frequently comes up is counting the elements in a sequence. For example, say we have a sequence of coin flips:

import random
coin_flips = [
  random.choice(['heads', 'tails']) for _ in range(10)
]
# coin_flips
# ['heads', 'tails', 'heads',
#  'tails', 'heads', 'heads',
#  'tails', 'heads', 'heads', 'heads']

The actual list coin_flips will likely be different when you execute this on your computer since the outcomes of the coin flips are random.

To compute the probabilities of heads and tails, we could count how many heads and tails we have in the list:

counts = {'heads': 0, 'tails': 0}
for outcome in coin_flips:
   assert outcome in counts
   counts[outcome] += 1
print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
# Probability of heads: 0.70

print('Probability of tails: %.2f' % (counts['tails'] / sum(counts.values())))
# Probability of tails: 0.30

In addition to how we use the dictionary counts to count the elements of coin_flips, notice a couple of things about this example:

  1. We used the assert outcome in counts statement. The assert statement in Python allows you to easily insert debugging statements in your code to help you discover errors more quickly. assert statements are executed whenever the internal Python __debug__ variable is set to True, which is always the case unless you start Python with the -O option, which runs Python in optimized mode.

  2. When we computed the probability of tails, we used the built-in sum function, which allowed us to quickly find the total number of coin flips. sum is one of many built-in functions documented in the Python standard library reference.
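
As a side note, the standard library module collections provides a Counter class that implements this counting pattern directly; a minimal sketch:

```python
from collections import Counter

flips = ['heads', 'tails', 'heads', 'heads']
counts = Counter(flips)
print(counts['heads'])        # 3
print(counts.most_common(1))  # [('heads', 3)]
```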

Functions

You can reuse code by putting it inside a function that you can call in other parts of your programs. Functions are also a good way of grouping code that logically belongs together in one coherent whole. A function has a unique name in the program. Once you call a function, it will execute its body which consists of one or more lines of code:

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

print(check_triangle(4, 5, 6))

The def keyword tells Python we are defining a function. As part of the definition, we have the function name, check_triangle, and the parameters of the function – variables that will be populated when the function is called.

We call the function with arguments 4, 5, and 6, which are passed in order into the parameters a, b, and c. A function can be called several times with varying parameters. There is no limit to the number of function calls.

It is also possible to store the output of a function in a variable, so it can be reused.

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

result = check_triangle(4, 5, 6)
print(result)

Classes

A class is an encapsulation of data and the processes that work on them. The data is represented in member variables, and the processes are defined in the methods of the class (methods are functions inside the class). For example, let’s see how to define a Triangle class:

class Triangle(object):

    def __init__(self, length, width,
                 height, angle1, angle2, angle3):
        if not self._sides_ok(length, width, height):
            print('The sides of the triangle are invalid.')
        elif not self._angles_ok(angle1, angle2, angle3):
            print('The angles of the triangle are invalid.')

        self._length = length
        self._width = width
        self._height = height

        self._angle1 = angle1
        self._angle2 = angle2
        self._angle3 = angle3

    def _sides_ok(self, a, b, c):
        return \
            a < b + c and a > abs(b - c) and \
            b < a + c and b > abs(a - c) and \
            c < a + b and c > abs(a - b)

    def _angles_ok(self, a, b, c):
        return a + b + c == 180

triangle = Triangle(4, 5, 6, 35, 65, 80)

Python has full object-oriented programming (OOP) capabilities, however we can not cover all of them in this section, so if you need more information please refer to the Python docs on classes and OOP.

Modules

Now write this simple program and save it:

print("Hello Cloud!")

As a check, make sure the file contains the expected contents on the command line:

$ cat hello.py
print("Hello Cloud!")

To execute your program pass the file as a parameter to the python command:

$ python hello.py
Hello Cloud!

Files in which Python code is stored are called modules. You can execute a Python module from the command line like you just did, or you can import it in other Python code using the import statement.

Let us write a more involved Python program that will receive as input the lengths of the three sides of a triangle, and will output whether they define a valid triangle. A triangle is valid if the length of each side is less than the sum of the lengths of the other two sides and greater than the difference of the lengths of the other two sides:

"""Usage: check_triangle.py [-h] LENGTH WIDTH HEIGHT

Check if a triangle is valid.

Arguments:
  LENGTH     The length of the triangle.
  WIDTH      The width of the triangle.
  HEIGHT     The height of the triangle.

Options:
-h --help
"""
from docopt import docopt

if __name__ == '__main__':
  arguments = docopt(__doc__)
  a, b, c = (int(arguments['LENGTH']),
             int(arguments['WIDTH']),
             int(arguments['HEIGHT']))
  valid_triangle = \
      a < b + c and a > abs(b - c) and \
      b < a + c and b > abs(a - c) and \
      c < a + b and c > abs(a - b)
  print('Triangle with sides %d, %d and %d is valid: %r' % (
      a, b, c, valid_triangle
  ))

Assuming we save the program in a file called check_triangle.py, we can run it like so:

$ python check_triangle.py 4 5 6
Triangle with sides 4, 5 and 6 is valid: True

Let us break this down a bit.

  1. We’ve defined a boolean expression that tells us if the sides that were input define a valid triangle. The result of the expression is stored in the valid_triangle variable: True if the sides define a valid triangle, and False otherwise.
  2. We’ve used the backslash symbol \ to format our code nicely. The backslash simply indicates that the current line is being continued on the next line.
  3. When we run the program, we do the check if __name__ == '__main__'. __name__ is an internal Python variable that allows us to tell whether the current file is being run from the command line (the value will be '__main__'), or is being imported as a module (the value will be the name of the module). Thus, with this statement, we are just making sure the program is being run from the command line.
  4. We are using the docopt module to handle command line arguments. The advantage of using this module is that it generates a usage help statement for the program and enforces command line arguments automatically. All of this is done by parsing the docstring at the top of the file.
  5. In the print function, we are using Python’s string formatting capabilities to insert values into the string we are displaying.
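
The %-style formatting used in the print call is the oldest of several options; str.format and f-strings (Python 3.6+) express the same thing:

```python
a, b, c, valid = 4, 5, 6, True

print('Triangle with sides %d, %d and %d is valid: %r' % (a, b, c, valid))
print('Triangle with sides {}, {} and {} is valid: {!r}'.format(a, b, c, valid))
print(f'Triangle with sides {a}, {b} and {c} is valid: {valid!r}')
# all three print: Triangle with sides 4, 5 and 6 is valid: True
```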

Lambda Expressions

As opposed to normal functions in Python which are defined using the def keyword, lambda functions in Python are anonymous functions that do not have a name and are defined using the lambda keyword. The generic syntax of a lambda function is in the form of lambda arguments: expression, as shown in the following example:

greeter = lambda x: print('Hello %s!' % x)
greeter('Albert')

As you could probably guess, the result is:

Hello Albert!

Now consider the following examples:

power2 = lambda x: x ** 2

The power2 function defined in the expression, is equivalent to the following definition:

def power2(x):
    return x ** 2

Lambda functions are useful when you need a function for a short period. Note that they can also be very useful when passed as an argument to other built-in functions that take a function as an argument, e.g. filter() and map(). In the next example, we show how a lambda function can be combined with the filter function. Consider the list all_names which contains five words that rhyme together. We want to filter the words that contain the word name. To achieve this, we pass the function lambda x: 'name' in x as the first argument. This lambda function returns True if the word name exists as a substring in the string x. The second argument of the filter function is the list of names, i.e. all_names.

all_names = ['surname', 'rename', 'nickname', 'acclaims', 'defame']
filtered_names = list(filter(lambda x: 'name' in x, all_names))
print(filtered_names)
# ['surname', 'rename', 'nickname']

As you can see, the names are successfully filtered as we expected.

In Python, the filter function returns a filter object, an iterator that is lazily evaluated: we can neither access the elements of the filter object by index nor use len() to find its length.

list_a = [1, 2, 3, 4, 5]
filter_obj = filter(lambda x: x % 2 == 0, list_a)
# Convert the filer obj to a list
even_num = list(filter_obj)
print(even_num)
# Output: [2, 4]

In Python, a lambda function is a small, usually single-expression anonymous function that can have any number of arguments just like a normal function, but consists of only one expression with no return statement. The result of this expression is the return value.

Basic Syntax:

lambda arguments : expression

For example, a function in python

def multiply(a, b):
   return a*b

# call the function
multiply(3, 5) # outputs: 15

The same function can be written as a lambda function. This function, named multiply, has 2 arguments and returns their product.

Lambda equivalent for this function would be:

multiply = lambda a, b : a*b

print(multiply(3, 5))
# outputs: 15

Here a and b are the 2 arguments and a*b is the expression whose value is returned as an output.

Also, we do not need to assign the lambda function to a variable.

(lambda a, b : a*b)(3, 5)

Lambda functions are mostly passed as a parameter to a function which expects a function object, such as map or filter.

map

The basic syntax of the map function is

map(function_object, iterable1, iterable2,...)

map expects a function object and any number of iterables, such as a list or dictionary. It executes the function_object for each element in the sequence and returns an iterator over the elements modified by the function object.

Example:

def multiply2(x):
   return x * 2

list(map(multiply2, [2, 4, 6, 8]))
# Output [4, 8, 12, 16]

If we want to write the same function using Lambda

list(map(lambda x: x*2, [2, 4, 6, 8]))
# Output [4, 8, 12, 16]

dictionary

Now, let us see how we can iterate over a dictionary using map and lambda Let us say we have a dictionary object

dict_movies = [
    {'movie': 'avengers', 'comic': 'marvel'},
    {'movie': 'superman', 'comic': 'dc'}]

We can iterate over this dictionary and read the elements of it using map and lambda functions in following way:

list(map(lambda x : x['movie'], dict_movies))  # Output: ['avengers', 'superman']
list(map(lambda x : x['comic'], dict_movies))  # Output: ['marvel', 'dc']
list(map(lambda x : x['movie'] == "avengers", dict_movies))
# Output: [True, False]

In Python 3, the map function returns an iterator or map object which is lazily evaluated: we can neither access the elements of the map object by index nor use len() to find its length. We can force-convert the map output, i.e. the map object, to a list as shown next:

map_output = map(lambda x: x*2, [1, 2, 3, 4])
print(map_output)
# Output: map object: <map object at 0x04D6BAB0>
list_map_output = list(map_output)
print(list_map_output) # Output: [2, 4, 6, 8]

Iterators

In Python, the iterator protocol is defined using two methods: __iter__() and __next__(). The former returns the iterator object and the latter returns the next element of a sequence. Some advantages of iterators are as follows:

  • Readability
  • Supports sequences of infinite length
  • Saving resources

There are several built-in objects in Python which implement iterator protocol, e.g. string, list, dictionary. In the following example, we create a new class that follows the iterator protocol. We then use the class to generate log2 of numbers:

from math import log2

class LogTwo:
    "Implements an iterator of log two"

    def __init__(self, last=0):
        self.last = last

    def __iter__(self):
        self.current_num = 1
        return self

    def __next__(self):
        if self.current_num <= self.last:
            result = log2(self.current_num)
            self.current_num += 1
            return result
        else:
            raise StopIteration

L = LogTwo(5)
i = iter(L)
print(next(i))
print(next(i))
print(next(i))
print(next(i))

As you can see, we first create an instance of the class and obtain an iterator from it with iter(), which we assign to a variable called i. Then by calling the next() function four times, we get the following output:

$ python iterator.py
0.0
1.0
1.584962500721156
2.0

As you probably noticed, the lines are log2() of 1, 2, 3, 4 respectively.
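One of the advantages listed earlier, support for sequences of infinite length, follows directly from the fact that an iterator produces one element at a time. As a small sketch (the class name PowersOfTwo is our own, not part of the lecture), an iterator can describe an endless sequence while the caller decides how many elements to take:

```python
from itertools import islice

class PowersOfTwo:
    "Implements an endless iterator of powers of two"

    def __iter__(self):
        self.exponent = 0
        return self

    def __next__(self):
        result = 2 ** self.exponent
        self.exponent += 1
        return result  # never raises StopIteration: the sequence is infinite

# take only the first five elements of the infinite sequence
print(list(islice(PowersOfTwo(), 5)))
# Output: [1, 2, 4, 8, 16]
```

Because the values are computed on demand, no memory is ever allocated for the full (infinite) sequence.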

Generators

Before we go to generators, please make sure you understand iterators. Generators are also iterators, but they can only be iterated over once. That is because generators do not store their values in memory; instead, they generate the values on the fly. If we want to print those values, we can either call next() on them or use a for loop.

Generators with function

For example, we have a function named multiplyBy10 which returns all the input numbers multiplied by 10.

def multiplyBy10(numbers):
   result = []
   for i in numbers:
      result.append(i*10)
   return result

new_numbers = multiplyBy10([1,2,3,4,5])

print(new_numbers)  # Output: [10, 20, 30, 40, 50]

Now, if we want to use Generators here then we will make the following changes.

def multiplyBy10(numbers):
   for i in numbers:
      yield(i*10)

new_numbers = multiplyBy10([1,2,3,4,5])

print(new_numbers)  # Output: generator object

In generators, we use the yield keyword in place of return. When we try to print new_numbers now, it just prints a generator object. The reason is that generators do not hold any values in memory; they yield one result at a time. Essentially, the generator is just waiting for us to ask for the next result. To get the next result we can say print(next(new_numbers)): it reads the first input value, multiplies it by 10, and yields 10. We can call next(new_numbers) five times to print all numbers, and if we do it a sixth time we get a StopIteration error, which means the generator is exhausted and has no sixth element to yield.

print(next(new_numbers))  # Output: 10
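The exhaustion behavior can be demonstrated with a short sketch that keeps calling next() until StopIteration is raised:

```python
def multiplyBy10(numbers):
   for i in numbers:
      yield(i*10)

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

for _ in range(5):
    print(next(new_numbers))   # prints 10, 20, 30, 40, 50

try:
    next(new_numbers)          # a sixth call: the generator is exhausted
except StopIteration:
    print("the generator has no more values")
```

A for loop handles StopIteration for us automatically, which is why iterating with for never raises this error.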

Generators using for loop

If we now want to print the complete list of multiplied values, we can just do:

def multiplyBy10(numbers):
   for i in numbers:
      yield(i*10)

new_numbers = multiplyBy10([1,2,3,4,5])

for num in new_numbers:
   print(num)

The output will be:

10
20
30
40
50

Generators with List Comprehension

Python has something called list comprehension; if we use it, we can replace the complete function definition with just:

new_numbers = [x*10 for x in [1,2,3,4,5]]
print(new_numbers)  # Output: [10, 20, 30, 40, 50]

The point to note here is that the square brackets [] in the first line are very important. If we change them to (), we again get a generator object.

new_numbers = (x*10 for x in [1,2,3,4,5])
print(new_numbers)  # Output: generator object

We can get the individual elements again from Generators if we do a for loop over new_numbers, as we did previously. Alternatively, we can convert it into a list and then print it.

new_numbers = (x*10 for x in [1,2,3,4,5])
print(list(new_numbers))  # Output: [10, 20, 30, 40, 50]

However, if we convert the generator into a list, we lose the performance benefit, as we will see next.
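A common way to keep the benefit is to feed the generator expression directly into a consuming function such as sum(), which pulls one value at a time and never builds a list:

```python
# sum() consumes the generator one value at a time; no list is built
total = sum(x*10 for x in [1, 2, 3, 4, 5])
print(total)  # Output: 150
```

The same pattern works with other consumers such as max(), min(), or any().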

Why use Generators?

Generators are better for performance because they do not hold values in memory. With the small examples here this is not a big deal, since we are dealing with a small amount of data. But consider a scenario where the data set has millions of records. If we try to convert millions of data elements into a list, that will impact memory and performance because everything has to be held in memory.

Let us see an example of how generators help with performance. First, without generators: a normal function that builds and returns a list of millions of people records.

import random
import time

# mem_profile is the helper module used in this lecture to report
# the memory usage of the process
import mem_profile

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory}Mb')

def people_list(people):
   result = []
   for i in range(people):
      person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
            }
      result.append(person)
   return result

t1 = time.perf_counter()
people = people_list(10000000)
t2 = time.perf_counter()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory}Mb')
print('Took {time} seconds'.format(time=t2-t1))

#Output
Memory (Before): 15Mb
Memory (After): 318Mb
Took 1.2 seconds

The values shown are approximate and serve to compare with the next execution; if you run it yourself, you will see serious memory consumption and a considerable amount of time taken. Now the same task with a generator:

import random
import time

import mem_profile  # the same helper module as in the previous example

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory}Mb')

def people_generator(people):
   for i in range(people):
      person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
      yield person

t1 = time.perf_counter()
people = people_generator(10000000)
t2 = time.perf_counter()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory}Mb')
print('Took {time} seconds'.format(time=t2-t1))

#Output
Memory (Before): 15Mb
Memory (After): 15Mb
Took 0.01 seconds

After running the same code using a generator, we see a significant performance boost: almost 0 seconds. The reason is that in the case of generators we do not keep anything in memory; the system simply produces one record at a time, as it is asked for.

2.1.7 - Cloudmesh

Gregor von Laszewski (laszewski@gmail.com)

2.1.7.1 - Introduction

Gregor von Laszewski (laszewski@gmail.com)


Learning Objectives

  • Introduction to the cloudmesh API
  • Using cmd5 via cms
  • Introduction to cloudmesh convenience API for output, dotdict, shell, stopwatch, benchmark management
  • Creating your own cms commands
  • Cloudmesh configuration file
  • Cloudmesh inventory

In this chapter, we would like to introduce you to cloudmesh, which provides you with a number of convenient methods to interface with the local system, but also with cloud services. We will start by focusing on some simple APIs and then gradually introduce the cloudmesh shell, which provides not only a shell but also a command line interface, so you can use cloudmesh from a terminal. This dual ability is quite useful, as we can write cloudmesh scripts but can also invoke the functionality from the terminal. This is an important distinction from other tools that only allow command line interfaces.

Moreover, we also show you that it is easy to create new commands and add them dynamically to the cloudmesh shell via simple pip installs.

Cloudmesh is an evolving project, and you have the opportunity to improve it if you see some features missing.

The manual of cloudmesh can be found at

The API documentation is located at

We will initially focus on a subset of this functionality.

2.1.7.2 - Installation

Gregor von Laszewski (laszewski@gmail.com)

The installation of cloudmesh is simple and can technically be done via pip by a user. However, you are not a user, you are a developer. Cloudmesh is distributed in different topical repositories, and in order for developers to easily interact with them we have written a convenient cloudmesh-installer program.

As a developer, you must also use a python virtual environment to avoid affecting your system-wide Python installation. This can be achieved while using Python3 from python.org or via conda. However, we recommend that you use python.org, as this is the vanilla Python that most developers in the world use. Conda is often used by Python users who do not need bleeding-edge versions but rather older, prepackaged Python tools and libraries.

Prerequisite

We require you to create a python virtual environment and activate it. How to do this was discussed in the section on installing Python. Please create the ENV3 environment and activate it.

Basic Install

For developers, cloudmesh can be installed via a number of bundles. A bundle is a set of git repositories that are needed for a particular install. For us, the bundles of most interest are cms, cloud, and storage. We will introduce you to other bundles throughout this documentation.

If you like to find out more about the details of this you can look at cloudmesh-installer which will be regularly updated.

To make use of the bundles and the easy installation for developers, please install cloudmesh-installer via pip, but make sure you do this in a python virtual env as discussed previously. If not, you may impact your system negatively. Please note that we are not responsible for fixing your computer. Naturally, you can also use a virtual machine if you prefer. It is also important that we create a uniform development environment. In our case, we create an empty directory called cm in which we place the bundle.

$ mkdir cm
$ cd cm
$ pip install cloudmesh-installer

To see the bundle you can use

$ cloudmesh-installer bundles

We will start with the basic cloudmesh functionality at this time and only install the shell and some common APIs.

$ cloudmesh-installer git clone cms
$ cloudmesh-installer install cms

These commands download and install the cloudmesh shell into your environment. It is important that you use the -e flag.

To see if it works you can use the command

$ cms help

You will see an output. If this does not work for you, and you can not figure out the issue, please contact us so we can identify what went wrong.

For more information, please visit our Installation Instructions for Developers

2.1.7.3 - Output

Gregor von Laszewski (laszewski@gmail.com)

Cloudmesh provides a number of convenient APIs to make output easier or more fanciful.

These APIs include:

Console

Print is the usual function to output to the terminal. However, often we like to have colored output that helps us in the notification to the user. For this reason, we have a simple Console class that has several built-in features. You can even switch and define your own color schemes.

from cloudmesh.common.console import Console

msg = "my message"
Console.ok(msg)    # prints a green message
Console.error(msg) # prints a red message preceded by ERROR
Console.msg(msg)   # prints a regular black message

In the case of the error message, we also have convenient flags that allow us to include the traceback in the output.

Console.error(msg, prefix=True, traceflag=True)

The prefix can be switched on and off with the prefix flag, while the traceflag switches the inclusion of the traceback on and off.

The verbosity of the output is controlled via variables that are stored in the ~/.cloudmesh directory.

from cloudmesh.common.variables import Variables

variables = Variables()

variables['debug'] = True
variables['trace'] = True
variables['verbose'] = 10

For more features, see API: Console

In case you need a banner you can do this with

from cloudmesh.common.util import banner

banner("my text")

For more features, see API: Banner

Heading

A particularly useful function is HEADING() which prints the method name.

from cloudmesh.common.util import HEADING

class Example(object):

    def doit(self):
        HEADING()
        print ("Hello")

The invocation of the HEADING() function in doit prints a banner with the name information. The reason we did not implement it as a decorator is that you can place the HEADING() function at an arbitrary location in the method body.

For more features, see API: Heading

VERBOSE

Note: VERBOSE is not supported in jupyter notebooks

VERBOSE is a very useful method allowing you to print a dictionary. Not only will it print the dict, but it will also provide you with the information in which file and at which line number it is used. It will even print the name of the dict that you use in your code.

To use this you will have to enable the debugging methods for cloudmesh as discussed in sec. 1.1

from cloudmesh.common.debug import VERBOSE

m = {"key": "value"}
VERBOSE(m)

For more features, please see VERBOSE

Using print and pprint

In many cases, it may be sufficient to use print and pprint for debugging. However, as the code base gets big, you may forget where you placed print statements, or the print statements may have been added by others; therefore, we recommend that you use the VERBOSE function. If you use print or pprint, we recommend using a unique prefix, such as:

from pprint import pprint

d = {"sample": "value"}
print("MYDEBUG:")
pprint (d)

# or with print

print("MYDEBUG:", d)

2.1.7.4 - Dictionaries

Gregor von Laszewski (laszewski@gmail.com)

Dotdict

For simple dictionaries we sometimes like to simplify the notation with a . instead of using the []:

You can achieve this with dotdict

from cloudmesh.common.dotdict import dotdict

data = {
    "name": "Gregor"
}

data = dotdict(data)

Now you can either call

data["name"]

or

data.name

This is especially useful in if conditions as it may be easier to read and write

if data.name == "Gregor":

    print("this is quite readable")

and is the same as

if data["name"] == "Gregor":

    print("this is quite readable")

For more features, see API: dotdict

FlatDict

In some cases, it is useful to be able to flatten out dictionaries that contain dicts within dicts. For this, we can use FlatDict.

from cloudmesh.common.FlatDict import FlatDict

data = {
    "name": "Gregor",
    "address": {
        "city": "Bloomington",
        "state": "IN"

    }
}

flat = FlatDict(data, sep=".")

This will be converted to a dict with the following structure.

flat = {
    "name": "Gregor",
    "address.city": "Bloomington",
    "address.state": "IN"
}

With sep you can change the separator between the nested dict attributes. For more features, see API: FlatDict
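The flattening itself can be sketched in a few lines of plain Python. The following helper is hypothetical, for illustration only, and is not the cloudmesh implementation:

```python
def flatten(d, sep=".", prefix=""):
    """Recursively flatten nested dicts into dotted keys."""
    flat = {}
    for key, value in d.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            # recurse into nested dicts, extending the key path
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat

data = {"name": "Gregor",
        "address": {"city": "Bloomington", "state": "IN"}}

print(flatten(data))
# Output: {'name': 'Gregor', 'address.city': 'Bloomington', 'address.state': 'IN'}
```

This illustrates the transformation; in your code, please use FlatDict from cloudmesh.common, which provides additional features.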

Printing Dicts

In case we want to print dicts and lists of dicts in various formats, we have included a simple Printer that can print a dict in yaml, json, table, and csv format.

The function can even guess from the passed parameters what the input format is and uses the appropriate internal function.

A common example is

from pprint import pprint
from cloudmesh.common.Printer import Printer

data = [
    {
        "name": "Gregor",
        "address": {
            "street": "Funny Lane 11",
            "city": "Cloudville"
        }
    },
    {
        "name": "Albert",
        "address": {
            "street": "Memory Lane 1901",
            "city": "Cloudnine"
        }
    }
]


pprint(data)

table = Printer.flatwrite(data,
                          sort_keys=["name"],
                          order=["name", "address.street", "address.city"],
                          header=["Name", "Street", "City"],
                          output='table')

print(table)

For more features, see API: Printer

More examples are available in the source code as tests

2.1.7.5 - Shell

Gregor von Laszewski (laszewski@gmail.com)

Python provides a sophisticated method for starting background processes. However, in many cases, it is quite complex to interact with it. It also does not provide convenient wrappers that we can use to start them in a pythonic fashion. For this reason, we have written a primitive Shell class that provides just enough functionality to be useful in many cases.

Let us review some examples where result is set to the output of the command being executed.

from cloudmesh.common.Shell import Shell

result = Shell.execute('pwd')
print(result)

result = Shell.execute('ls', ["-l", "-a"])
print(result)

result = Shell.execute('ls', "-l -a")
print(result)

For many common commands, we provide built-in functions. For example:

result = Shell.ls("-aux")
print(result)

result = Shell.ls("-a", "-u", "-x")
print(result)

result = Shell.pwd()
print(result)

The list includes (naturally) commands that must be available on your OS. If a shell command is not available on your OS, please help us improve the code: either provide functions that work on your OS, or develop with us platform-independent functionality for a subset of the shell command's features that we all may benefit from.

  • VBoxManage(cls, *args)
  • bash(cls, *args)
  • blockdiag(cls, *args)
  • brew(cls, *args)
  • cat(cls, *args)
  • check_output(cls, *args, **kwargs)
  • check_python(cls)
  • cm(cls, *args)
  • cms(cls, *args)
  • command_exists(cls, name)
  • dialog(cls, *args)
  • edit(filename)
  • execute(cls,*args)
  • fgrep(cls, *args)
  • find_cygwin_executables(cls)
  • find_lines_with(cls, lines, what)
  • get_python(cls)
  • git(cls, *args)
  • grep(cls, *args)
  • head(cls, *args)
  • install(cls, name)
  • keystone(cls, *args)
  • kill(cls, *args)
  • live(cls, command, cwd=None)
  • ls(cls, *args)
  • mkdir(cls, directory)
  • mongod(cls, *args)
  • nosetests(cls, *args)
  • nova(cls, *args)
  • operating_system(cls)
  • pandoc(cls, *args)
  • ping(cls, host=None, count=1)
  • pip(cls, *args)
  • ps(cls, *args)
  • pwd(cls, *args)
  • rackdiag(cls, *args)
  • remove_line_with(cls, lines, what)
  • rm(cls, *args)
  • rsync(cls, *args)
  • scp(cls, *args)
  • sh(cls, *args)
  • sort(cls, *args)
  • ssh(cls, *args)
  • sudo(cls, *args)
  • tail(cls, *args)
  • terminal(cls, command='pwd')
  • terminal_type(cls)
  • unzip(cls, source_filename, dest_dir)
  • vagrant(cls, *args)
  • version(cls, name)
  • which(cls, command)

For more features, please see Shell

2.1.7.6 - StopWatch

Gregor von Laszewski (laszewski@gmail.com)

Often you find yourself in a situation where you would like to measure the time between two events. We provide a simple StopWatch that allows you not only to take a number of timings but also to print them out in a convenient format.

from cloudmesh.common.StopWatch import StopWatch
from time import sleep


StopWatch.start("test")
sleep(1)
StopWatch.stop("test")

print (StopWatch.get("test"))

To print the timings, you can also simply use:

StopWatch.benchmark()

For more features, please see StopWatch

2.1.7.7 - Cloudmesh Command Shell

Gregor von Laszewski (laszewski@gmail.com)

CMD5

Python’s CMD (https://docs.python.org/2/library/cmd.html) is a very useful package to create command line shells. However, it does not allow the dynamic integration of newly defined commands. Furthermore, additions to CMD need to be done within the same source tree. To simplify developing commands by a number of people and to have a dynamic plugin mechanism, we developed cmd5. It is a rewrite of our earlier efforts in cloudmesh client and cmd3.

Resources

The source code for cmd5 is located in GitHub:

We have discussed in the installation section how to install cloudmesh as a developer and have access to the source code in a directory called cm. As you read this document, we assume you are a developer and can skip the next section.

Installation from source

WARNING: DO NOT EXECUTE THIS IF YOU ARE A DEVELOPER OR YOUR ENVIRONMENT WILL NOT PROPERLY WORK. YOU LIKELY HAVE ALREADY INSTALLED CMD5 IF YOU USED THE CLOUDMESH INSTALLER.

However, if you are a user of cloudmesh you can install it with

$ pip install cloudmesh-cmd5

Execution

To run the shell you can activate it with the cms command. cms stands for cloudmesh shell:

(ENV3) $ cms

It will print the banner and enter the shell:

+-------------------------------------------------------+
|   ____ _                 _                     _      |
|  / ___| | ___  _   _  __| |_ __ ___   ___  ___| |__   |
| | |   | |/ _ \| | | |/ _` | '_ ` _ \ / _ \/ __| '_ \  |
| | |___| | (_) | |_| | (_| | | | | | |  __/\__ \ | | | |
|  \____|_|\___/ \__,_|\__,_|_| |_| |_|\___||___/_| |_| |
+-------------------------------------------------------+
|                  Cloudmesh CMD5 Shell                 |
+-------------------------------------------------------+

cms>

To see the list of commands you can say:

cms> help

To see the manual page for a specific command, please use:

help COMMANDNAME

Create your own Extension

One of the most important features of CMD5 is its ability to be extended with new commands. This is done via namespace packages. We recommend you name yours cloudmesh-mycommand, where mycommand is the name of the command that you would like to create. This can easily be done using the sys cloudmesh command (we suggest you use a different name than gregor, maybe your first name):

$ cms sys command generate gregor

It will download a template from cloudmesh called cloudmesh-bar and generate a new directory cloudmesh-gregor with all the needed files to create your own command and register it dynamically with cloudmesh. All you have to do is to cd into the directory and install the code:

$ cd cloudmesh-gregor
$ python setup.py install

Alternatively, you can run pip install . in the same directory.

Adding your command is easy. It is important that all objects are defined in the command itself and that no global variables are used, so that each shell command can stand alone. Naturally, you should develop API libraries outside of the cloudmesh shell command and reuse them, to keep the command code as small as possible. We place the command in:

cloudmesh/mycommand/command/gregor.py

Now you can go ahead and modify your command in that directory. It will look similar to (if you used the command name gregor):

from cloudmesh.shell.command import command
from cloudmesh.shell.command import PluginCommand

class GregorCommand(PluginCommand):

    @command
    def do_gregor(self, args, arguments):
        """
        ::
          Usage:
                gregor -f FILE
                gregor list
          This command does some useful things.
          Arguments:
              FILE   a file name
          Options:
              -f      specify the file
        """
        print(arguments)
        if arguments.FILE:
           print("You have used file: ", arguments.FILE)
        return ""

An important difference to other CMD solutions is that our commands can leverage (besides the standard definition) docopt as a way to define the manual page. This allows us to use the arguments as a dict and use simple if conditions to interpret the command. Using docopt has the advantage that contributors are forced to think about the command and its options and to document them from the start. Previously we used argparse and click. However, we noticed that for our contributors both systems led to commands that were either not properly documented or ambiguous, resulting in confusion and wrong usage by subsequent users. Hence, we recommend that you use docopt for documenting cmd5 commands. The transformation is enabled by the @command decorator, which generates a manual page and creates a proper help message for the shell automatically. Thus there is no need to introduce a separate help method, as would normally be needed in CMD, which reduces the effort it takes to contribute new commands in a dynamic fashion.

Bug: Quotes

We have one bug in cmd5 that relates to the use of quotes on the command line.

For example, you need to say

$ cms gregor -f \"file name with spaces\"

If you would like to help us fix this, that would be great. It requires the use of shlex. Unfortunately, we have not yet had time to fix this “feature.”

2.1.7.8 - Exercises

Gregor von Laszewski (laszewski@gmail.com)

When doing your assignment, make sure you label the programs appropriately with comments that clearly identify the assignment. Place all assignments in a folder on GitHub named “cloudmesh-exercises”

For example, name the program solving E.Cloudmesh.Common.1 e-cloudmesh-1.py and so on. For more complex assignments you can name them as you like, as long as in the file you have a comment such as

# fa19-516-000 E.Cloudmesh.Common.1

at the beginning of the file. Please do not store any screenshots in your GitHub repository of your working program.

Cloudmesh Common

E.Cloudmesh.Common.1

Develop a program that demonstrates the use of banner, HEADING, and VERBOSE.

E.Cloudmesh.Common.2

Develop a program that demonstrates the use of dotdict.

E.Cloudmesh.Common.3

Develop a program that demonstrates the use of FlatDict.

E.Cloudmesh.Common.4

Develop a program that demonstrates the use of cloudmesh.common.Shell.

E.Cloudmesh.Common.5

Develop a program that demonstrates the use of cloudmesh.common.StopWatch.

Cloudmesh Shell

E.Cloudmesh.Shell.1

Install cmd5 and the command cms on your computer.

E. Cloudmesh.Shell.2

Write a new command with your firstname as the command name.

E.Cloudmesh.Shell.3

Write a new command and experiment with docopt syntax and argument interpretation of the dict with if conditions.

E.Cloudmesh.Shell.4

If you have useful extensions that you like us to add by default, please work with us.

E.Cloudmesh.Shell.5

At this time one needs to quote in some commands the " in the shell command line. Develop and test code that fixes this.

2.1.8 - Data

Gregor von Laszewski (laszewski@gmail.com)

2.1.8.1 - Data Formats

Gregor von Laszewski (laszewski@gmail.com)

YAML

The term YAML stands for “YAML Ain't Markup Language.” According to the Web page at

“YAML is a human friendly data serialization standard for all programming languages.” There are multiple versions of YAML, and one needs to take care that the software supports the right version. The current version is YAML 1.2.

YAML is often used for configuration and in many cases can also be used as an XML replacement. It is important to note that YAML, in contrast to XML, removes the tags and replaces them with indentation. This naturally has the advantage that it is easier to read; however, the format is strict and needs to adhere to proper indentation. Thus it is important that you check your YAML files for correctness, either by writing, for example, a python program that reads your yaml file, or with an online YAML checker such as provided at

An example of how to use yaml in python is provided next. Please note that YAML is a superset of JSON. Originally YAML was designed as a markup language. However, as it is not document oriented but data oriented, it has been recast and no longer classifies itself as a markup language.

import sys
import yaml

try:
    yamlFilename = sys.argv[1]
    yamlFile = open(yamlFilename, "r")
except (IndexError, OSError):
    print("filename does not exist")
    sys.exit(1)
try:
    yaml.safe_load(yamlFile.read())
except yaml.YAMLError:
    print("YAML file is not valid.")

Resources:

JSON

The term JSON stands for JavaScript Object Notation. It is targeted as an open-standard file format that emphasizes human-readable text to transmit data objects. The data objects contain attribute-value pairs. Although it originates from JavaScript, the format itself is language independent. It uses brackets to allow organization of the data. Please note that YAML is a superset of JSON, so not all YAML documents can be converted to JSON. Furthermore, JSON does not support comments. For these reasons we often prefer to use YAML instead of JSON. However, JSON data can easily be translated to YAML as well as XML.
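The attribute-value pairs can be handled with Python's built-in json module. The following small sketch (the movie data is made up for illustration) round-trips between a JSON string and a Python dict:

```python
import json

# a JSON document with attribute-value pairs
text = '{"movie": "avengers", "comic": "marvel", "year": 2012}'

data = json.loads(text)   # parse JSON into a Python dict
print(data["comic"])      # Output: marvel

data["year"] = 2018
print(json.dumps(data, indent=2))  # serialize the dict back to JSON
```

Note that comments cannot be embedded in the JSON text itself, which is one of the reasons stated above for preferring YAML for configuration files.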

Resources:

XML

XML stands for Extensible Markup Language. XML allows one to define documents with the help of a set of rules in order to make them machine readable. The emphasis here is on machine readable, as documents in XML can quickly become complex and difficult for humans to understand. XML is used for documents as well as data structures.
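As a small sketch with Python's built-in xml.etree.ElementTree module (the document below is made up for illustration), XML can be parsed into a tree of elements whose text and attributes are machine readable:

```python
import xml.etree.ElementTree as ET

# a small, made-up XML document
document = """
<catalog>
  <movie comic="marvel">avengers</movie>
  <movie comic="dc">superman</movie>
</catalog>
"""

root = ET.fromstring(document)
for movie in root.findall("movie"):
    # access element text and attributes
    print(movie.text, movie.get("comic"))
# Output:
# avengers marvel
# superman dc
```

The nested tag structure that makes XML verbose for humans is exactly what makes it straightforward for a parser to traverse.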

A tutorial about XML is available at

Resources:

2.1.9 - Mongo

Gregor von Laszewski (laszewski@gmail.com)

2.1.9.1 - MongoDB in Python

Gregor von Laszewski (laszewski@gmail.com)


Learning Objectives

  • Introduction to basic MongoDB knowledge
  • Use of MongoDB via PyMongo
  • Use of MongoEngine MongoEngine and Object-Document mapper,
  • Use of Flask-Mongo

In today’s era, NoSQL databases have developed an enormous potential to process the unstructured data efficiently. Modern information is complex, extensive, and may not have pre-existing relationships. With the advent of the advanced search engines, machine learning, and Artificial Intelligence, technology expectations to process, store, and analyze such data have grown tremendously [@www-upwork]. The NoSQL database engines such as MongoDB, Redis, and Cassandra have successfully overcome the traditional relational database challenges such as scalability, performance, unstructured data growth, agile sprint cycles, and growing needs of processing data in real-time with minimal hardware processing power [@www-guru99]. The NoSQL databases are a new generation of engines that do not necessarily require SQL language and are sometimes also called Not Only SQL databases. However, most of them support various third-party open connectivity drivers that can map NoSQL queries to SQL’s. It would be safe to say that although NoSQL databases are still far from replacing the relational databases, they are adding an immense value when used in hybrid IT environments in conjunction with relational databases, based on the application specific needs [@www-guru99]. We will be covering the MongoDB technology, its driver PyMongo, its object-document mapper MongoEngine, and the Flask-PyMongo micro-web framework that make MongoDB more attractive and user-friendly.

Cloudmesh MongoDB Usage Quickstart

Before you read on, we would like you to read this quickstart. For many of the activities we do, the easiest way to interact with MongoDB is to use our cloudmesh functionality. This prelude section is not intended to describe all the details, but to get you started quickly while leveraging cloudmesh.

This is done via the cloudmesh cmd5 and the cloudmesh_community/cm code:

To install mongo on, for example, macOS you can use

$ cms admin mongo install

To start, stop and see the status of mongo you can use

$ cms admin mongo start
$ cms admin mongo stop
$ cms admin mongo status

To add an object to Mongo, you simply have to define a dict with predefined values for kind and cloud. In the future, such attributes can be passed to the function to determine the MongoDB collection.

from cloudmesh.mongo.DataBaseDecorator import DatabaseUpdate

@DatabaseUpdate
def test():
  data ={
    "kind": "test",
    "cloud": "testcloud",
    "value": "hello"
  }
  return data

result = test()

When you invoke the function, it will automatically store the information in MongoDB. Naturally, this requires that the ~/.cloudmesh/cloudmesh.yaml file is properly configured.

MongoDB

Today MongoDB is one of the leading NoSQL databases; it is fully capable of handling dynamic changes, processing large volumes of complex and unstructured data, and easily using object-oriented programming features, as well as addressing distributed system challenges [@www-mongodb]. At its core, MongoDB is an open source, cross-platform document database mainly written in the C++ language.

Installation

MongoDB can be installed on various Unix Platforms, including Linux, Ubuntu, Amazon Linux, etc [@www-digitaloceaninst]. This section focuses on installing MongoDB on Ubuntu 18.04 Bionic Beaver used as a standard OS for a virtual machine used as a part of Big Data Application Class during the 2018 Fall semester.

Installation procedure

Before installing, it is recommended to configure a non-root user and provide administrative privileges to it, in order to be able to perform general MongoDB admin tasks. This can be accomplished by logging in as the root user and proceeding in the following manner [@www-digitaloceanprep].

$ adduser mongoadmin
$ usermod -aG sudo sammy

When logged in as a regular user, one can perform actions with superuser privileges by typing sudo before each command [@www-digitaloceanprep].

Once the user setup is complete, one can log in as the regular user (mongoadmin) and use the following instructions to install MongoDB.

To update the Ubuntu packages to the most recent versions, use the next command:

$ sudo apt update

To install the MongoDB package:

$ sudo apt install -y mongodb

To check the service and database status:

$ sudo systemctl status mongodb

A successful MongoDB installation can be confirmed by output similar to this:

$ mongodb.service - An object/document-oriented database
    Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
    Active: active (running) since Sat 2018-11-15 07:48:04 UTC; 2min 17s ago
      Docs: man:mongod(1)
  Main PID: 2312 (mongod)
     Tasks: 23 (limit: 1153)
    CGroup: /system.slice/mongodb.service
           └─2312 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf

To verify the configuration, more specifically the installed version, server, and port, use the following command:

$ mongo --eval 'db.runCommand({ connectionStatus: 1 })'

Similarly, to restart MongoDB, use the following:

$ sudo systemctl restart mongodb

To allow access to MongoDB from an outside server, one can use the following command, which opens the firewall connection [@www-digitaloceaninst].

$ sudo ufw allow from your_other_server_ip/32 to any port 27017

Status can be verified by using:

$ sudo ufw status

Other MongoDB configurations, such as the port, hostnames, and file paths, can be edited through the /etc/mongodb.conf file.

$ sudo nano /etc/mongodb.conf

Also, to complete this step, the server's IP address must be added to the bind_ip value [@www-digitaloceaninst].

logappend=true

bind_ip = 127.0.0.1,your_server_ip
port = 27017

MongoDB is now listening for a remote connection that can be accessed by anyone with appropriate credentials [@www-digitaloceaninst].

Collections and Documents

Each database within the Mongo environment contains collections, which in turn contain documents. Collections and documents are analogous to tables and rows, respectively, in relational databases. The document structure is a key-value form that allows storing complex data types composed of field and value pairs. Documents are objects that correspond to native data types in many programming languages; hence a well-defined, embedded document can help reduce expensive joins and improve query performance. The _id field identifies each document uniquely [@www-guru99].
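Because documents are key-value structures, they map naturally onto dictionaries. The following pure-Python sketch (the field names and _id value are made up for illustration) shows an embedded document that would otherwise require a join across two tables in a relational schema:

```python
# A document as a nested dictionary; the embedded "address" would be a
# separate, joined table in a relational schema.
person = {
    "_id": "5099803df3f4948bd2f98391",   # unique document identifier
    "name": "Albert",
    "age": 21,
    "address": {                          # embedded document, no join needed
        "street": "107 S Indiana Ave",
        "city": "Bloomington",
    },
    "groups": ["AI", "Machine Learning"],  # array-valued field
}

# Fields are accessed by key, including nested ones.
city = person["address"]["city"]
```

Reading the nested field takes a single dictionary lookup chain, which mirrors how MongoDB dereferences dotted paths such as `address.city`.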

MongoDB offers the flexibility to write records that are not restricted by column types. The data storage approach is flexible, as it allows one to store data as it grows and to fulfill the varying needs of applications and/or users. It supports a JSON-like binary format known as BSON, in which data can be stored without specifying its type. Moreover, data can be distributed to multiple machines at high speed. MongoDB includes a sharding feature that partitions and spreads data across various servers. This makes MongoDB an excellent choice for cloud data processing. Its utilities can load high volumes of data at high speed, which ultimately provides greater flexibility and availability in a cloud-based environment [@www-upwork].

The dynamic schema structure within MongoDB allows easy testing of the small sprints in the Agile project management life cycles and research projects that require frequent changes to the data structure with minimal downtime. Contrary to this flexible process, modifying the data structure of relational databases can be a very tedious process [@www-upwork].

Collection example

The following collection example for a person named Albert includes additional information such as age, status, and group [@www-mongocollection].

{
  name: "Albert",
  age: "21",
  status: "Open",
  group: ["AI", "Machine Learning"]
}

Document structure

{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}

Collection Operations

If a collection does not exist, MongoDB creates it automatically when it is first referenced, for example by an insert or an index creation.

> db.myNewCollection1.insertOne( { x: 1 } )
> db.myNewCollection2.createIndex( { y: 1 } )

MongoDB Querying

The data retrieval patterns and the frequency of data manipulation statements such as inserts, updates, and deletes may demand the use of indexes or the sharding feature to improve query performance and the efficiency of the MongoDB environment [@www-guru99].

One of the significant differences between relational and NoSQL databases is joins. In a relational database, one can combine results from two or more tables using a common column, often called a key. The native table contains the primary key column, while the referenced table contains a foreign key. This mechanism allows one to make changes in a single row instead of changing all rows in the referenced table; this is referred to as normalization. MongoDB is a document database and mainly contains denormalized data, which means the data is repeated instead of indexed over a specific key. If the same data is required in more than one table, it needs to be repeated.

This constraint was eased in MongoDB version 3.2, which introduced a $lookup feature that works much like a left outer join. Lookups are restricted to aggregation pipelines, which means the data usually needs some filtering and grouping beforehand. For this reason, joins in MongoDB require more complicated querying than traditional relational database joins. Although lookups are still far from replacing joins, they are a prominent feature that can resolve some of the relational data challenges for MongoDB [@www-sitepoint].

MongoDB queries support regular expressions as well as range queries on specific fields, which eliminates the need to return entire documents [@www-guru99]. MongoDB collections do not enforce document structure like SQL databases, which is a compelling feature. However, it is essential to keep in mind the needs of the applications [@www-upwork].
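To see concretely what a left outer join computes, the following pure-Python sketch (the collection and field names are made up for illustration, not MongoDB API) emulates the behavior over two lists of documents: every document on the local side is kept, and matching foreign documents are gathered into an array field:

```python
# Two "collections" as lists of dicts.
orders = [
    {"_id": 1, "item": "book", "customer_id": 10},
    {"_id": 2, "item": "pen", "customer_id": 99},  # no matching customer
]
customers = [{"_id": 10, "name": "Albert"}]

def lookup(local, foreign, local_field, foreign_field, as_field):
    """Left outer join: every local document is kept; matches go into as_field."""
    joined = []
    for doc in local:
        matches = [f for f in foreign if f[foreign_field] == doc[local_field]]
        joined.append({**doc, as_field: matches})
    return joined

result = lookup(orders, customers, "customer_id", "_id", "customer")
```

Note that the unmatched order survives with an empty array, which is the defining property of a *left outer* join as opposed to an inner join.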

Mongo Queries examples

The queries can be executed from Mongo shell as well as through scripts.

To query the data from a MongoDB collection, one would use MongoDB’s find() method.

> db.COLLECTION_NAME.find()

The output can be formatted by using the pretty() command.

> db.mycol.find().pretty()

The MongoDB insert statements can be performed in the following manner:

> db.COLLECTION_NAME.insert(document)

“The $lookup command performs a left-outer-join to an unsharded collection in the same database to filter in documents from the joined collection for processing” [@www-mongodblookup].

{
  $lookup:
    {
      from: <collection to join>,
      localField: <field from the input documents>,
      foreignField: <field from the documents of the "from" collection>,
      as: <output array field>
    }
}

This operation is equivalent to the following SQL operation:

SELECT *, <output array field>
FROM collection
WHERE <output array field> IN (SELECT *
                               FROM <collection to join>
                               WHERE <foreignField> = <collection.localField>);

To perform a Like Match (Regex), one would use the following command:

> db.products.find( { sku: { $regex: /789$/ } } )

MongoDB Basic Functions

When it comes to the technical elements of MongoDB, it possesses a rich interface for importing and storing external data in various formats. Using the Mongo import/export tools, one can easily transfer contents from JSON, CSV, or TSV files into a database. MongoDB supports CRUD (create, read, update, delete) operations efficiently and has detailed documentation available on the product website. It can also query geospatial data, which it is capable of storing in GeoJSON objects. The aggregation operations of MongoDB process data records and return computed results. The MongoDB aggregation framework is modeled on the concept of data pipelines [@www-mongoexportimport].

Import/Export functions examples

To import JSON documents, one would use the following command:

$ mongoimport --db users --collection contacts --file contacts.json

The CSV import uses the input file name to import a collection, hence, the collection name is optional [@www-mongoexportimport].

$ mongoimport --db users --type csv --headerline --file /opt/backups/contacts.csv

“Mongoexport is a utility that produces a JSON or CSV export of data stored in a MongoDB instance” [@www-mongoexportimport].

$ mongoexport --db test --collection traffic --out traffic.json

Security Features

Data security is a crucial aspect of enterprise infrastructure management, which is why MongoDB provides various security features such as role-based access control, numerous authentication options, and encryption. It supports mechanisms such as SCRAM, LDAP, and Kerberos authentication. The administrator can create role/collection-based access control; roles can be predefined or custom. MongoDB can audit activities such as DDL, CRUD statements, and authentication and authorization operations [@www-mongosecurity].

Collection based access control example

A user defined role can contain the following privileges [@www-mongosecurity].

privileges: [
  { resource: { db: "products", collection: "inventory" }, actions: [ "find", "update" ] },
  { resource: { db: "products", collection: "orders" }, actions: [ "find" ] }
]

MongoDB Cloud Service

In regards to cloud technologies, MongoDB also offers a fully automated cloud service called Atlas with competitive pricing options. The Mongo Atlas cloud interface offers an interactive GUI for managing cloud resources and deploying applications quickly. The service is equipped with geographically distributed instances to ensure no single point of failure. Also, a well-rounded performance monitoring interface allows users to promptly detect anomalies and generate index suggestions to optimize the performance and reliability of the database. Global technology leaders such as Google, Facebook, eBay, and Nokia are leveraging MongoDB and Atlas cloud services, making MongoDB one of the most popular choices among NoSQL databases [@www-mongoatlas].

PyMongo

PyMongo is the official Python driver, or distribution, for working with the NoSQL database MongoDB [@api-mongodb-com-api]. The first version of the driver was developed in 2009 [@www-pymongo-blog], only two years after the development of MongoDB began. This driver allows developers to combine Python's versatility with MongoDB's flexible schema nature in successful applications. Currently, this driver supports MongoDB versions 2.6, 3.0, 3.2, 3.4, 3.6, and 4.0 [@www-github]. MongoDB and Python are a compatible fit considering that BSON (binary JSON), used in this NoSQL database, is very similar to Python dictionaries, which makes the collaboration between the two even more appealing [@www-mongodb-slideshare]. For this reason, dictionaries are the recommended tool for representing documents in PyMongo [@www-gearheart].

Installation

Prior to being able to exploit the benefits of Python and MongoDB simultaneously, the PyMongo distribution must be installed using pip. To install it on all platforms, the following command should be used [@www-api-mongodb-installation]:

$ python -m pip install pymongo

Specific versions of PyMongo can also be installed; in the following example, version 3.5.1 is installed [@www-api-mongodb-installation].

$ python -m pip install pymongo==3.5.1

A single line of code can be used to upgrade the driver as well [@www-api-mongodb-installation].

$ python -m pip install --upgrade pymongo

Furthermore, the installation process can be completed with the help of the easy_install tool, which requires users to use the following command [@www-api-mongodb-installation].

$ python -m easy_install pymongo

To do an upgrade of the driver using this tool, the following command is recommended [@www-api-mongodb-installation]:

$ python -m easy_install -U pymongo

There are many other ways of installing PyMongo directly from the source; however, they require the C extension dependencies to be installed prior to the driver installation step, as those builds pull the most up-to-date sources from GitHub to install the driver [@www-api-mongodb-installation].

To check if the installation was completed accurately, the following command is used in the Python console [@www-realpython].

import pymongo

If the command raises no exception within the Python shell, the PyMongo installation can be considered successful.

Dependencies

The PyMongo driver has a few dependencies that should be taken into consideration prior to its usage. Currently, it supports the CPython 2.7, 3.4+, PyPy, and PyPy3.5+ interpreters [@www-github]. An optional dependency that requires additional components to be installed is GSSAPI authentication [@www-github]. Unix-based machines require pykerberos, while Windows machines need WinKerberos to fulfill this requirement [@www-github]. This dependency can be installed automatically along with the driver in the following manner:

$ python -m pip install pymongo[gssapi]

Other third-party dependencies such as ipaddress, certifi, or wincerstore are necessary for connections with help of TLS/SSL and can also be simultaneously installed along with the driver installation [@www-github].

Running PyMongo with the Mongo Daemon

Once PyMongo is installed, the Mongo daemon can be run with a very simple command in a new terminal window [@www-realpython].

$ mongod

Connecting to a database using MongoClient

In order to establish a connection with a database, the MongoClient class needs to be imported; a MongoClient object then subsequently communicates with the database [@www-realpython].

from pymongo import MongoClient
client = MongoClient()

This command connects to the default local host on port 27017; however, depending on the programming requirements, one can also specify the host and port explicitly in the client instance or pass the same information via the MongoDB URI format [@www-realpython].

Accessing Databases

Since MongoClient plays a server role, it can be used to access any desired database in an easy way. There are two approaches: attribute-style access, where the name of the desired database is used as an attribute, and dictionary-style access [@www-realpython]. For example, to access a database called cloudmesh_community, one would use the following commands for the attribute and the dictionary method, respectively.

db = client.cloudmesh_community
db = client['cloudmesh_community']

Creating a Database

Creating a database is a straightforward process. First, one must create a MongoClient object and specify the connection (IP address) as well as the name of the database to be created [@www-w3schools]. An example of this command is presented in the following section:

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']

Inserting and Retrieving Documents (Querying)

Creating documents and storing data using PyMongo is as easy as accessing and creating databases. In order to add new data, a collection must be specified first. In this example, a decision is made to use the cloudmesh group of documents.

cloudmesh = db.cloudmesh

Once this step is completed, data may be inserted using the insert_one() method, which means that only one document is being created. Of course, insertion of multiple documents at the same time is possible as well with use of the insert_many() method [@www-realpython]. An example of this method is as follows:

course_info = {
     'course': 'Big Data Applications and Analytics',
     'instructor': ' Gregor von Laszewski',
     'chapter': 'technologies'
}
result = cloudmesh.insert_one(course_info)

Another example of this method would be to create a collection. If we wanted to create a collection of students in the cloudmesh_community, we would do it in the following manner:

student = [ {'name': 'John', 'st_id': 52642},
    {'name': 'Mercedes', 'st_id': 5717},
    {'name': 'Anna', 'st_id': 5654},
    {'name': 'Greg', 'st_id': 5423},
    {'name': 'Amaya', 'st_id': 3540},
    {'name': 'Cameron', 'st_id': 2343},
    {'name': 'Bozer', 'st_id': 4143},
    {'name': 'Cody', 'st_id': 2165} ]

client = MongoClient('mongodb://localhost:27017/')

with client:
    db = client.cloudmesh
    db.students.insert_many(student)

Retrieving documents is equally simple as creating them. The find_one() method can be used to retrieve one document [@www-realpython]. An implementation of this method is given in the following example.

gregors_course = cloudmesh.find_one({'instructor':'Gregor von Laszewski'})

Similarly, to retrieve multiple documents, one would use the find() method instead of find_one(). For example, to find all courses taught by professor von Laszewski, one would use the following command:

gregors_course = cloudmesh.find({'instructor':'Gregor von Laszewski'})

One thing users should be cognizant of when using the find() method is that it does not return results in an array format but as a cursor object, which combines methods that work together to help with data querying [@www-realpython]. In order to access the individual documents, one must iterate over the result [@www-realpython].
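The lazy nature of a cursor can be illustrated with a plain Python generator. This stand-in (illustrative only, not the real pymongo Cursor class) yields matching documents one at a time, and only materializes them into a list when you iterate:

```python
# A cursor behaves like a lazy iterator over matching documents, not a list.
def find(documents, **criteria):
    """Yield documents matching all key/value criteria, one at a time."""
    for doc in documents:
        if all(doc.get(k) == v for k, v in criteria.items()):
            yield doc

courses = [
    {"title": "Big Data", "instructor": "Gregor von Laszewski"},
    {"title": "Linear Algebra", "instructor": "Someone Else"},
]

# Iterating over the "cursor" yields the matches; list() materializes them.
matches = list(find(courses, instructor="Gregor von Laszewski"))
```

As with a real cursor, nothing is fetched until iteration begins, which is why results from find() must be looped over (or converted with list()) before they can be indexed.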

Limiting Results

When it comes to working with large databases, it is always useful to limit the number of query results. PyMongo supports this option with its limit() method [@www-w3schools]. This method takes one parameter that specifies the number of documents to be returned [@www-w3schools]. For example, if we had a collection with a large number of cloud technologies as individual documents, we could modify the query to return only the top 10 technologies. To do this, the following example could be utilized:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['technologies']
topten = col.find().limit(10)

Updating Collection

Updating documents is very similar to inserting and retrieving them. Depending on the number of documents to be updated, one would use the update_one() or update_many() method [@www-w3schools]. Two parameters need to be passed to the update_one() method for it to execute successfully: the first argument is the query object that specifies the document to be changed, and the second argument is the object that specifies the new value in the document. An example of the update_one() method in action is the following:

myquery = { 'course': 'Big Data Applications and Analytics' }
newvalues = { '$set': { 'course': 'Cloud Computing' } }

result = cloudmesh.update_one(myquery, newvalues)

Updating all documents that meet the same criteria can be done with the update_many() method [@www-w3schools]. For example, to update the instructor information in all documents whose course title starts with the letter B, we would do the following:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['courses']
query = { 'course': { '$regex': '^B' } }
newvalues = { '$set': { 'instructor': 'Gregor von Laszewski' } }

edited = col.update_many(query, newvalues)

Counting Documents

Counting documents can be done with the simple count_documents() operation instead of a full query [@www-pymongo-tutorial]. For example, we can count the documents in the cloudmesh_community by using the following command:

count = cloudmesh.count_documents({})

To create a more specific count, one would use a command similar to this:

count = cloudmesh.count_documents({'author': 'von Laszewski'})

This technology supports some more advanced querying options as well. Those advanced queries allow one to add certain constraints and narrow down the results even more. For example, to get the courses taught by professor von Laszewski before a certain date, one would use the following command:

import datetime
import pprint

d = datetime.datetime(2017, 11, 12, 12)
for course in cloudmesh.find({'date': {'$lt': d}}).sort('author'):
    pprint.pprint(course)

Indexing

Indexing is a very important part of querying. It can greatly improve query performance and also adds functionality and aids in storing documents [@www-pymongo-tutorial].

“To create a unique index on a key that rejects documents whose value for that key already exists in the index” [@www-pymongo-tutorial].

We first create the index in the following manner:

result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                  unique=True)
sorted(list(db.profiles.index_information()))

This command actually creates two different indexes. The first one is _id, created by MongoDB automatically, and the second is user_id, created by the user.

The purpose of this unique index is to prevent future insertions of documents whose user_id already exists in the collection.
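What a unique index enforces can be sketched with a plain Python set (illustrative only, not PyMongo API): an insert whose user_id is already recorded is rejected before it can enter the collection.

```python
# A set standing in for the unique index on user_id.
index = set()

def insert(doc):
    """Reject documents whose user_id already exists in the 'index'."""
    if doc["user_id"] in index:
        raise ValueError("duplicate key: user_id")
    index.add(doc["user_id"])
    return doc

insert({"user_id": "alice"})
try:
    insert({"user_id": "alice"})   # second insert of the same user_id
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

In MongoDB the analogous violation raises a DuplicateKeyError on the client side; the principle of checking the indexed value before accepting the write is the same.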

Sorting

Sorting on the server side is also available via MongoDB. The PyMongo sort() method is equivalent to the SQL ORDER BY statement, taking pymongo.ASCENDING or pymongo.DESCENDING as the direction [@book-ohiggins]. This is much more efficient, as the sorting is completed on the server side rather than on the client side. For example, to return all users with the first name Gregor, sorted in descending order by birthdate, we would use a command such as this:

users = cloudmesh.users.find({'firstname': 'Gregor'}).sort('dateofbirth', pymongo.DESCENDING)
for user in users:
    print(user.get('email'))

Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to

“provide projection capabilities to reshape the returned data” [@www-mongo-aggregation].

In the aggregation pipeline, documents pass through multiple pipeline stages, which transform them into result data. The basic pipeline stages include filters, which act like document transformations by changing the output form of the documents. Other stages group or sort documents by specific fields. By using native MongoDB operations, the pipeline operators aggregate results efficiently.
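As a mental model, a pipeline is just a sequence of stage functions applied to a stream of documents. The following pure-Python sketch (the stage names mirror MongoDB's $match and $group, but the implementation is illustrative only) chains a filter stage with a grouping-and-counting stage:

```python
# Each "stage" is a function from a list of documents to a list of documents.
def match(docs, **criteria):
    """Keep only documents whose fields equal the given criteria ($match)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

def group_count(docs, key):
    """Count documents per distinct value of key ($group with $sum: 1)."""
    counts = {}
    for d in docs:
        counts[d[key]] = counts.get(d[key], 0) + 1
    return [{"_id": k, "count": v} for k, v in counts.items()]

scores = [
    {"student": "Anna", "status": "pass"},
    {"student": "Greg", "status": "pass"},
    {"student": "Cody", "status": "fail"},
]

# Pipeline: filter first, then group the survivors.
result = group_count(match(scores, status="pass"), "status")
```

Because each stage consumes the previous stage's output, reordering stages changes the result, which is exactly why stage order matters in real aggregation pipelines.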

The addFields stage is used to add new fields to documents. It reshapes each document in the stream, similarly to the project stage. The output documents contain the existing fields from the input documents plus the newly added fields [@www-docs-mongodb]. The following example shows how to add student details to a document.

db.cloudmesh_community.aggregate([
  {
    $addFields: {
      "document.StudentDetails": {
        $concat: [ '$document.student.FirstName', '$document.student.LastName' ]
      }
    }
  }
])

The bucket stage is used to categorize incoming documents into groups, called buckets, based on specified boundary expressions [@www-docs-mongodb]. The following example achieves a similar bucketing effect using a $group stage with a $switch expression.

db.user.aggregate([
  {
    "$group": {
      "_id": {
        "city": "$city",
        "age": {
          "$let": {
            "vars": {
              "age": {
                "$subtract": [ { "$year": new Date() }, { "$year": "$birthDay" } ]
              }
            },
            "in": {
              "$switch": {
                "branches": [
                  { "case": { "$lt": [ "$$age", 20 ] },  "then": 0 },
                  { "case": { "$lt": [ "$$age", 30 ] },  "then": 20 },
                  { "case": { "$lt": [ "$$age", 40 ] },  "then": 30 },
                  { "case": { "$lt": [ "$$age", 50 ] },  "then": 40 },
                  { "case": { "$lt": [ "$$age", 200 ] }, "then": 50 }
                ]
              }
            }
          }
        }
      },
      "count": { "$sum": 1 }
    }
  }
])
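The bucketing logic of that $switch expression can be restated in a few lines of pure Python (illustrative only, not a MongoDB API): map each age to the lower bound of its decade bucket, then count documents per (city, bucket) pair:

```python
# Pure-Python equivalent of the $switch bucketing above.
def age_bucket(age):
    """Return the lower bound of the decade bucket an age falls into."""
    for lower in (50, 40, 30, 20):
        if age >= lower:
            return lower
    return 0

buckets = {}
for person in [{"city": "Bloomington", "age": 23},
               {"city": "Bloomington", "age": 27},
               {"city": "Chicago", "age": 41}]:
    key = (person["city"], age_bucket(person["age"]))
    buckets[key] = buckets.get(key, 0) + 1
```

The grouping key here plays the role of the compound `_id` in the $group stage, and the per-key counter plays the role of `{"$sum": 1}`.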

In the bucketAuto stage, the boundaries are automatically determined in an attempt to evenly distribute documents into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [@www-docs-mongodb].

db.artwork.aggregate( [
  {
    $bucketAuto: {
        groupBy: "$price",
        buckets: 4
    }
  }
 ] )

The collStats stage returns statistics regarding a collection or view [@www-docs-mongodb].

db.matrices.aggregate( [ { $collStats: { latencyStats: { histograms: true } } } ] )

The count stage passes to the next stage a document that contains the number of documents that were input to the stage [@www-docs-mongodb].

db.scores.aggregate( [
  { $match: { score: { $gt: 80 } } },
  { $count: "passing_scores" }
] )

The facet stage helps process multiple aggregation pipelines in a single stage [@www-docs-mongodb].

db.artwork.aggregate( [
  {
    $facet: {
      "categorizedByTags": [
        { $unwind: "$tags" },
        { $sortByCount: "$tags" }
      ],
      "categorizedByPrice": [
        // Filter out documents without a price e.g., _id: 7
        { $match: { price: { $exists: 1 } } },
        { $bucket: {
            groupBy: "$price",
            boundaries: [ 0, 150, 200, 300, 400 ],
            default: "Other",
            output: {
              "count": { $sum: 1 },
              "titles": { $push: "$title" }
            }
        } }
      ],
      "categorizedByYears(Auto)": [
        { $bucketAuto: { groupBy: "$year", buckets: 4 } }
      ]
    }
  }
] )

The geoNear stage returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [@www-docs-mongodb].

db.places.aggregate([
  {
    $geoNear: {
      near: { type: "Point", coordinates: [ -73.99279, 40.719296 ] },
      distanceField: "dist.calculated",
      maxDistance: 2,
      query: { type: "public" },
      includeLocs: "dist.location",
      num: 5,
      spherical: true
    }
  }
])

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [@www-docs-mongodb].

db.travelers.aggregate( [
 {
    $graphLookup: {
       from: "airports",
       startWith: "$nearestAirport",
       connectFromField: "connects",
       connectToField: "airport",
       maxDepth: 2,
       depthField: "numConnections",
       as: "destinations"
    }
 }
] )

The group stage consumes the document data and produces one document per distinct group. It has a RAM limit of 100 MB; if the stage exceeds this limit, it produces an error [@www-docs-mongodb].

db.sales.aggregate(
 [
    {
      $group : {
         _id : { month: { $month: "$date" }, day: { $dayOfMonth: "$date" },
         year: { $year: "$date" } },
         totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
         averageQuantity: { $avg: "$quantity" },
         count: { $sum: 1 }
       }
    }
 ]
)

The indexStats stage returns statistics regarding the use of each index for a collection [@www-docs-mongodb].

db.orders.aggregate( [ { $indexStats: { } } ] )

The limit stage is used for controlling the number of documents passed to the next stage in the pipeline [@www-docs-mongodb].

db.article.aggregate(
  { $limit : 5 }
)

The listLocalSessions stage gives the session information for sessions currently connected to the mongos or mongod instance [@www-docs-mongodb].

db.aggregate( [  { $listLocalSessions: { allUsers: true } } ] )

The listSessions stage lists all sessions that have been active long enough to propagate to the system.sessions collection [@www-docs-mongodb].

use config
db.system.sessions.aggregate( [ { $listSessions: { allUsers: true } } ] )

The lookup stage is useful for performing outer joins to other collections in the same database [@www-docs-mongodb].

{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}

The match stage is used to filter the document stream; only matching documents pass to the next stage [@www-docs-mongodb].

db.articles.aggregate(
    [ { $match : { author : "dave" } } ]
)

The project stage is used to reshape the documents by adding or deleting the fields.

db.books.aggregate( [ { $project : { title : 1 , author : 1 } } ] )

The redact stage reshapes stream documents by restricting information using information stored in documents themselves [@www-docs-mongodb].

db.accounts.aggregate(
  [
    { $match: { status: "A" } },
    {
      $redact: {
        $cond: {
          if: { $eq: [ "$level", 5 ] },
          then: "$$PRUNE",
          else: "$$DESCEND"
        }
      }
    }
  ]
);

The replaceRoot stage is used to replace a document with a specified embedded document [@www-docs-mongodb].

db.produce.aggregate( [
  { $replaceRoot: { newRoot: "$in_stock" } }
] )

The sample stage randomly selects a specified number of documents from the input [@www-docs-mongodb].

db.users.aggregate(
  [ { $sample: { size: 3 } } ]
)

The skip stage skips a specified number of initial documents and passes the remaining documents on to the pipeline [@www-docs-mongodb].

db.article.aggregate(
   { $skip : 5 }
);

The sort stage is useful for reordering the document stream by a specified sort key [@www-docs-mongodb].

 db.users.aggregate(
    [
      { $sort : { age : -1, posts: 1 } }
    ]
 )

The sortByCount stage groups the incoming documents based on a specified expression value and counts the documents in each distinct group [@www-docs-mongodb].

db.exhibits.aggregate(
[ { $unwind: "$tags" },  { $sortByCount: "$tags" } ] )

The unwind stage deconstructs an array field from the input documents to output a document for each element [@www-docs-mongodb].

db.inventory.aggregate( [ { $unwind: "$sizes" } ] )
db.inventory.aggregate( [ { $unwind: { path: "$sizes" } } ] )

The out stage is used to write aggregation pipeline results into a collection. This stage should be the last stage of a pipeline [@www-docs-mongodb].

db.books.aggregate( [
  { $group : { _id : "$author", books: { $push: "$title" } } },
  { $out : "authors" }
] )

Another option among the aggregation operations is the Map/Reduce framework, which essentially includes two different functions, map and reduce. The first one emits a key-value pair for each tag in the array, while the latter one

“sums over all of the emitted values for a given key” [@www-mongo-aggregation].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [@www-mongo-aggregation]. The Map/Reduce operation provides result data in a collection or returns results in-line. One can perform subsequent operations with the same input collection if the output is written to a collection [@www-docs-map-reduce]. An operation that produces results in-line must provide results within the BSON document size limit, which is currently 16 MB. These types of operations are not supported by views [@www-docs-map-reduce]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [@www-api-map-reduce]. Moreover, Map/Reduce can return more detailed results when the full_response=True argument is passed to the map_reduce() function [@www-api-map-reduce].
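The division of labor between map and reduce can be sketched in pure Python (this only emulates the idea; the real engine runs JavaScript functions server-side): map emits a (key, value) pair per tag, and reduce sums the emitted values for each key:

```python
from collections import defaultdict

docs = [
    {"title": "doc1", "tags": ["mongodb", "python"]},
    {"title": "doc2", "tags": ["python"]},
]

# Map phase: emit (tag, 1) for every tag of every document.
emitted = [(tag, 1) for doc in docs for tag in doc["tags"]]

# Reduce phase: sum all emitted values per key.
counts = defaultdict(int)
for key, value in emitted:
    counts[key] += value
```

The grouping of emitted pairs by key before reduction is the "shuffle" step that the Map/Reduce engine performs for you between the two phases.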

Deleting Documents from a Collection

The deletion of documents with PyMongo is fairly straightforward. To do so, one would use the remove() method of the PyMongo Collection object [@book-ohiggins]. Similarly to reads and updates, a specification of the documents to be removed is a must. For example, to remove all documents with a score of 1, one would use the following command:

cloudmesh.users.remove({"score": 1}, safe=True)

The safe parameter set to True ensures the operation was completed [@book-ohiggins].

Copying a Database

Copying databases within the same mongod instance or between different mongod servers is made possible with the command() method after connecting to the desired mongod instance [@www-pymongo-documentation-copydb]. For example, to copy the cloudmesh database and name the new database cloudmesh_copy, one would use the command() method in the following manner:

client.admin.command('copydb',
                         fromdb='cloudmesh',
                         todb='cloudmesh_copy')

There are two ways to copy a database between servers. If a server is not password-protected, one would not need to pass in the credentials or authenticate against the admin database [@www-pymongo-documentation-copydb]. In that case, to copy a database one would use the following command:

client.admin.command('copydb',
                         fromdb='cloudmesh',
                         todb='cloudmesh_copy',
                         fromhost='source.example.com')

On the other hand, if the server where we are copying the database to is protected, one would use this command instead:

client = MongoClient('target.example.com',
                     username='administrator',
                     password='pwd')
client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

PyMongo Strengths

One of PyMongo’s strengths is that it allows document creation and querying natively

“through the use of existing language features such as nested dictionaries and lists” [@book-ohiggins].

For moderately experienced Python developers, it is very easy to learn and to quickly feel comfortable with.

“For these reasons, MongoDB and Python make a powerful combination for rapid, iterative development of horizontally scalable backend applications” [@book-ohiggins].

According to [@book-ohiggins], MongoDB is very applicable to modern applications, which makes PyMongo equally valuable.

MongoEngine

“MongoEngine is an Object-Document Mapper, written in Python for working with MongoDB” [@www-docs-mongoengine].

It is a library that allows more advanced communication with MongoDB than PyMongo. As MongoEngine is technically considered to be an object-document mapper (ODM), it can also be considered to be

“equivalent to a SQL-based object relational mapper (ORM)” [@www-realpython].

The primary reason one would use an ODM is data conversion between computer systems that are not compatible with each other [@www-wikiodm]. For the purpose of converting data to the appropriate form, a virtual object database must be created within the utilized programming language [@www-wikiodm]. This library is also used to define schemata for documents within MongoDB, which ultimately helps minimize coding errors as well as define methods on existing fields [@www-mongoengine-schema]. It is also very beneficial to the overall workflow, as it tracks changes made to documents and aids in the document saving process [@www-mongoengine-instances].

Installation

The installation process for this technology is fairly simple as it is considered to be a library. To install it, one would use the following command [@www-installing]:

$ pip install mongoengine

A bleeding-edge version of MongoEngine can be installed directly from GitHub by first cloning the repository on the local machine, virtual machine, or cloud.

Connecting to a database using MongoEngine

Once installed, MongoEngine needs to be connected to an instance of mongod, similarly to PyMongo [@www-connecting]. The connect() function must be used to successfully complete this step, and the argument that must be passed to it is the name of the desired database [@www-connecting]. Prior to using this function, the function name needs to be imported from the MongoEngine library.

from mongoengine import connect
connect('cloudmesh_community')

Similarly to MongoClient, MongoEngine uses localhost and port 27017 by default; however, the connect() function also allows specifying other host and port arguments [@www-connecting].

connect('cloudmesh_community', host='196.185.1.62', port=16758)

Other types of connections (e.g., URI-based) are also supported; they can be completed by providing the URI to the connect() function [@www-connecting].

Querying using MongoEngine

To query MongoDB using MongoEngine, an objects attribute is used, which is, technically, a part of the document class [@www-querying]. This attribute is called the QuerySetManager, which in turn

“creates a new QuerySet object on access” [@www-querying].

To access individual documents in a database, this object needs to be iterated over. For example, to print all students in the cloudmesh_community object (database), the following loop would be used:

for user in cloudmesh_community.objects:
   print(user.student)

MongoEngine also has the capability of query filtering, which means that a keyword can be used within the called QuerySet object to retrieve specific information [@www-querying]. Let us say one would like to iterate over the cloudmesh_community students who are natives of Indiana. To achieve this, one would use the following command:

indy_students = cloudmesh_community.objects(state='IN')

This library also allows the use of all operators except for the equality operator in its queries, and moreover, has the capability of handling string queries, geo queries, list querying, and querying of the raw PyMongo queries [@www-querying].

The string queries are useful for performing text operations in conditional queries. Using MongoEngine’s field__operator keywords, a query to find documents whose state exactly matches ACTIVE can be performed in the following manner:

cloudmesh_community.objects(state__exact="ACTIVE")

The query to retrieve documents whose names start with the case-sensitive prefix AL can be written as:

cloudmesh_community.objects(name__startswith="AL")

To perform the same query case-insensitively, one would use the following command:

cloudmesh_community.objects(name__istartswith="AL")

MongoEngine allows data extraction of geographical locations by using geo queries. The geo_within operator checks if a geometry is within a polygon.

  cloudmesh_community.objects(
            point__geo_within=[[[40, 5], [40, 6], [41, 6], [40, 5]]])
  cloudmesh_community.objects(
            point__geo_within={"type": "Polygon",
                 "coordinates": [[[40, 5], [40, 6], [41, 6], [40, 5]]]})

The list query looks up documents where the specified field matches exactly the given value. To match all pages that have the word coding as an item in the tags list, one would use the following query:

  class Page(Document):
     tags = ListField(StringField())

  Page.objects(tags='coding')

Overall, it is safe to say that MongoEngine has good compatibility with Python. It provides functions that make Python easy to use with MongoDB, which makes this pair even more attractive to application developers.

Flask-PyMongo

“Flask is a micro-web framework written in Python” [@www-flask-framework].

It was developed after Django and is very pythonic in nature, which implies that it explicitly targets the Python user community. It is lightweight as it does not require additional tools or libraries, and hence it is classified as a micro-web framework. It is often used with MongoDB through the PyMongo connector, and it treats data within MongoDB as searchable Python dictionaries. Applications such as Pinterest, LinkedIn, and the community web page for Flask use the Flask framework. Moreover, it supports various features such as RESTful request dispatching, secure cookies, Google App Engine compatibility, and integrated support for unit testing [@www-flask-framework]. When it comes to connecting to a database, the connection details for MongoDB can be passed as a variable or configured in the PyMongo constructor with additional arguments such as username and password, if required. It is important that the versions of Flask and MongoDB are compatible with each other to avoid functionality breaks [@www-flask-pymongo].

Installation

Flask-PyMongo can be installed with an easy command such as this:

$ pip install Flask-PyMongo

PyMongo can be added in the following manner:

  from flask import Flask
  from flask_pymongo import PyMongo
  app = Flask(__name__)
  app.config["MONGO_URI"] = "mongodb://localhost:27017/cloudmesh_community"
  mongo = PyMongo(app)

Configuration

There are two ways to configure Flask-PyMongo. The first way would be to pass a MongoDB URI to the PyMongo constructor, while the second way would be to

“assign it to the MONGO_URI Flask configuration variable” [@www-flask-pymongo].

Connection to multiple databases/servers

Multiple PyMongo instances can be used to connect to multiple databases or database servers. To achieve this, one would use commands similar to the following:

  app = Flask(__name__)
  mongo1 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_one")
  mongo2 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_two")
  mongo3 = PyMongo(app, uri=
        "mongodb://another.host:27017/cloudmesh_community_Three")

Flask-PyMongo Methods

Flask-PyMongo provides helpers for some common tasks. One of them is the Collection.find_one_or_404 method shown in the following example:

  @app.route("/user/<username>")
  def user_profile(username):
      user = mongo.db.cloudmesh_community.find_one_or_404({"_id": username})
      return render_template("user.html", user=user)

This method is very similar to MongoDB’s find_one() method; however, instead of returning None it aborts with a 404 Not Found HTTP status [@www-flask-pymongo].

Similarly, the PyMongo.send_file and PyMongo.save_file methods work on the file-like objects and save them to GridFS using the given file name [@www-flask-pymongo].

Additional Libraries

Flask-MongoAlchemy and Flask-MongoEngine are additional libraries that can be used to connect to a MongoDB database while using enhanced features with the Flask app. Flask-MongoAlchemy is used as a proxy between Python and MongoDB. It provides options such as server-based or database-based authentication to connect to MongoDB. While the default is server-based, to use database-based authentication the config value MONGOALCHEMY_SERVER_AUTH must be set to False [@www-pythonhosted-MongoAlchemy].

Flask-MongoEngine is the Flask extension that provides integration with MongoEngine. It handles connection management for the apps. It can be installed through pip and set up very easily as well. The default configuration is the local host and port 27017. For a custom port, and in cases where MongoDB is running on another server, the host and port must be explicitly specified in the connect strings within the MONGODB_SETTINGS dictionary with app.config, along with the database username and password in cases where database authentication is enabled. URI-style connections are also supported; supply the URI as the host in the MONGODB_SETTINGS dictionary with app.config. There are various custom query sets available within Flask-MongoEngine that are attached to MongoEngine’s default QuerySet [@www-flask-mongoengine].

Classes and Wrappers

The cx and db attributes of the PyMongo objects provide access to the MongoDB server [@www-flask-pymongo]. To achieve this, one must pass the Flask app to the constructor or call init_app() [@www-flask-pymongo].

“Flask-PyMongo wraps PyMongo’s MongoClient, Database, and Collection classes, and overrides their attribute and item accessors” [@www-flask-pymongo].

This type of wrapping allows Flask-PyMongo to add methods to Collection while at the same time allowing MongoDB-style dotted expressions in the code [@www-flask-pymongo].

type(mongo.cx)
type(mongo.db)
type(mongo.db.cloudmesh_community)

Flask-PyMongo creates connectivity between Python and Flask using a MongoDB database and supports

“extensions that can add application features as if they were implemented in Flask itself” [@www-wiki-flask],

hence, it can be used to add Flask functionality to Python code. The extensions exist to support form validations, authentication technologies, object-relational mappers, and framework-related tools, which ultimately adds a lot of strength to this micro-web framework [@www-wiki-flask]. One of the main reasons it is frequently used with MongoDB is its capability of adding more control over databases and history [@www-wiki-flask].

2.1.9.2 - Mongoengine

Gregor von Laszewski (laszewski@gmail.com)

Introduction

MongoEngine is a document mapper for working with MongoDB from Python. To be able to use MongoEngine, MongoDB should already be installed and running.

Install and connect

Mongoengine can be installed by running:

    $ pip install mongoengine

This will install six, pymongo and mongoengine.

To connect to MongoDB, use the connect() function and specify the database name. You do not need to go to the mongo shell; this can be done from a Unix shell or the command line. In this case, we are connecting to a database named student_db.

from mongoengine import *
connect('student_db')

If MongoDB is running on a port different from the default, the port number and host need to be specified. If MongoDB requires authentication, a username and password need to be specified.

Basics

MongoDB does not enforce schemas. Compared to an RDBMS, a row in MongoDB is called a document and a table corresponds to a collection. Defining a schema is helpful as it minimizes coding errors. To define a schema, we create a class that inherits from Document.

from mongoengine import *

class Student(Document):
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)


Fields are not mandatory, but if needed, set the required keyword argument to True. There are multiple field types available. Each field can be customized by keyword arguments. If each student is sending text messages to the university’s central database, these can be stored using MongoDB. Each text can have different data types; some might have images and some might have URLs. So we can create a class Text and link it to Student by using a ReferenceField (similar to a foreign key in an RDBMS).

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    meta = {'allow_inheritance': True}

class OnlyText(Text):
    content = StringField()

class ImagePost(Text):
    image_path = StringField()

class LinkPost(Text):
    link_url = StringField()

MongoDB supports adding tags to individual texts rather than storing them separately and referencing them. Similarly, comments can also be stored directly in a Text.

class Comment(EmbeddedDocument):
    content = StringField()

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    tags = ListField(StringField(max_length=30))
    comments = ListField(EmbeddedDocumentField(Comment))

To access data, for example to print all titles:

for text in OnlyText.objects:
    print(text.title)

Searching for texts with a specific tag:

for text in Text.objects(tags='mongodb'):
    print(text.title)

2.1.10 - Other

Gregor von Laszewski (laszewski@gmail.com)

2.1.10.1 - Word Count with Parallel Python

Gregor von Laszewski (laszewski@gmail.com)

We will demonstrate Python’s multiprocessing API for parallel computation by writing a program that counts how many times each word in a collection of documents appears.

Generating a Document Collection

Before we begin, let us write a script that will generate document collections by specifying the number of documents and the number of words per document. This will make benchmarking straightforward.

To keep it simple, the vocabulary of the document collection will consist of random numbers rather than the words of an actual language:

'''Usage: generate_nums.py [-h] NUM_LISTS INTS_PER_LIST MIN_INT MAX_INT DEST_DIR

Generate random lists of integers and save them
as 1.txt, 2.txt, etc.

Arguments:
   NUM_LISTS      The number of lists to create.
   INTS_PER_LIST  The number of integers in each list.
   MIN_INT        Each generated integer will be >= MIN_INT.
   MAX_INT        Each generated integer will be <= MAX_INT.
   DEST_DIR       A directory where the generated numbers will be stored.

Options:
  -h --help
'''

import os, random, logging
from docopt import docopt


def generate_random_lists(num_lists,
                          ints_per_list, min_int, max_int):
    return [[random.randint(min_int, max_int) \
        for i in range(ints_per_list)] for i in range(num_lists)]


if __name__ == '__main__':
   args = docopt(__doc__)
   num_lists, ints_per_list, min_int, max_int, dest_dir = [
      int(args['NUM_LISTS']),
      int(args['INTS_PER_LIST']),
      int(args['MIN_INT']),
      int(args['MAX_INT']),
      args['DEST_DIR']
   ]

   if not os.path.exists(dest_dir):
      os.makedirs(dest_dir)

   lists = generate_random_lists(num_lists,
                                 ints_per_list,
                                 min_int,
                                 max_int)
   curr_list = 1
   for lst in lists:
      with open(os.path.join(dest_dir, '%d.txt' % curr_list), 'w') as f:
         f.write(os.linesep.join(map(str, lst)))
      curr_list += 1
   logging.debug('Numbers written.')

Notice that we are using the docopt module that you should be familiar with from the Section [Python DocOpts](#s-python-docopts) to make the script easy to run from the command line.

You can generate a document collection with this script as follows:

python generate_nums.py 1000 10000 0 100 docs-1000-10000

Serial Implementation

A first serial implementation of wordcount is straightforward:

'''Usage: wordcount.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''

import os, glob, logging
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)


def wordcount(files):
   counts = {}
   for filepath in files:
      with open(filepath, 'r') as f:
         words = [word.strip() for word in f.read().split()]
      for word in words:
         if word not in counts:
            counts[word] = 0
         counts[word] += 1
   return counts


if __name__ == '__main__':
   args = docopt(__doc__)
   if not os.path.exists(args['DATA_DIR']):
      raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])

   counts = wordcount(glob.glob(os.path.join(args['DATA_DIR'], '*.txt')))
   logging.debug(counts)
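As an aside, the counting loop in wordcount can also be expressed with the standard library’s collections.Counter; a minimal sketch on an in-memory string (file handling omitted):

```python
from collections import Counter

def count_text(text):
    # Split on whitespace and tally occurrences, like the loop in
    # wordcount() but delegating the bookkeeping to Counter.
    return Counter(text.split())

counts = count_text("42 7 42 42 7 13")
print(counts)  # Counter({'42': 3, '7': 2, '13': 1})
```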

Serial Implementation Using map and reduce

We can improve the serial implementation in anticipation of parallelizing the program by making use of Python’s map and reduce functions.

In short, you can use map to apply the same function to the members of a collection. For example, to convert a list of numbers to strings, you could do:

import random
nums = [random.randint(1, 2) for _ in range(10)]
print(nums)
[2, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(list(map(str, nums)))
['2', '1', '1', '1', '2', '2', '2', '2', '2', '2']

We can use reduce to apply the same function cumulatively to the items of a sequence (in Python 3, reduce must be imported from the functools module). For example, to find the total of the numbers in our list, we could use reduce as follows:

from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, nums))
17

We can simplify this even more by using a lambda function:

print(reduce(lambda x, y: x + y, nums))
17

You can read more about Python’s lambda function in the docs.

With this in mind, we can reimplement the wordcount example as follows:

'''Usage: wordcount_mapreduce.py [-h] DATA_DIR

Read a collection of .txt documents and count how
many times each word
appears in the collection.

Arguments:
   DATA_DIR  A directory with documents (.txt files).

Options:
   -h --help
'''

import os, glob, logging
from functools import reduce
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)

def count_words(filepath):
   counts = {}
   with open(filepath, 'r') as f:
      words = [word.strip() for word in f.read().split()]

   for word in words:
      if word not in counts:
         counts[word] = 0
      counts[word] += 1
   return counts


def merge_counts(counts1, counts2):
   for word, count in counts2.items():
      if word not in counts1:
         counts1[word] = 0
      counts1[word] += counts2[word]
   return counts1


if __name__ == '__main__':
   args = docopt(__doc__)
   if not os.path.exists(args['DATA_DIR']):
      raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])

   per_doc_counts = list(map(count_words,
                             glob.glob(os.path.join(args['DATA_DIR'],
                                                    '*.txt'))))
   counts = reduce(merge_counts, [{}] + per_doc_counts)
   logging.debug(counts)
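To check the behavior of merge_counts in isolation, here is a quick self-contained example (the function is redefined so the snippet runs on its own):

```python
def merge_counts(counts1, counts2):
   # Fold the word counts of the second dictionary into the first.
   for word in counts2:
      if word not in counts1:
         counts1[word] = 0
      counts1[word] += counts2[word]
   return counts1

merged = merge_counts({'to': 2, 'be': 1}, {'to': 1, 'or': 1})
print(merged)  # {'to': 3, 'be': 1, 'or': 1}
```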

Parallel Implementation

Drawing on the previous implementation using map and reduce, we can parallelize the implementation using Python’s multiprocessing API:

'''Usage: wordcount_mapreduce_parallel.py [-h] DATA_DIR NUM_PROCESSES

Read a collection of .txt documents and count, in parallel, how many
times each word appears in the collection.

Arguments:
   DATA_DIR       A directory with documents (.txt files).
   NUM_PROCESSES  The number of parallel processes to use.

Options:
   -h --help
'''

import os, glob, logging
from functools import reduce
from docopt import docopt
from wordcount_mapreduce import count_words, merge_counts
from multiprocessing import Pool

logging.basicConfig(level=logging.DEBUG)

if __name__ == '__main__':
   args = docopt(__doc__)
   if not os.path.exists(args['DATA_DIR']):
      raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
   num_processes = int(args['NUM_PROCESSES'])

   pool = Pool(processes=num_processes)

   per_doc_counts = pool.map(count_words,
                             glob.glob(os.path.join(args['DATA_DIR'],
                             '*.txt')))
   counts = reduce(merge_counts, [{}] + per_doc_counts)
   logging.debug(counts)

Benchmarking

To time each of the examples, enter it into its own Python file and use Linux’s time command:

$ time python wordcount.py docs-1000-10000

The output contains the real and user run times. real is wall-clock time: the time from start to finish of the call. user is the amount of CPU time spent in user-mode code (outside the kernel) within the process; that is, only the CPU time actually used in executing the process.
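The same distinction can be observed from inside Python: time.perf_counter measures wall-clock (real) time, while time.process_time measures CPU time consumed by the process, so a sleeping process accumulates real time but almost no CPU time. A small sketch:

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

time.sleep(0.2)  # idle: wall-clock time passes, CPU time barely moves

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print('real: %.3fs, user (CPU): %.3fs' % (wall, cpu))
```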

Exercises

E.python.wordcount.1:

Run the three different programs (serial, serial w/ map and reduce, parallel) and answer the following questions:

  1. Is there any performance difference between the different versions of the program?
  2. Does user time significantly differ from real time for any of the versions of the program?
  3. Experiment with different numbers of processes for the parallel example, starting with 1. What is the performance gain when you go from 1 to 2 processes? From 2 to 3? When do you stop seeing improvement? (This will depend on your machine architecture.)


2.1.10.2 - NumPy

Gregor von Laszewski (laszewski@gmail.com)

NumPy is a popular library that is used by many other Python packages such as Pandas, SciPy, and scikit-learn. It provides a fast, simple-to-use way of interacting with numerical data organized in vectors and matrices. In this section, we will provide a short introduction to NumPy.

Installing NumPy

The most common way of installing NumPy, if it wasn’t included with your Python installation, is to install it via pip:

$ pip install numpy

If NumPy has already been installed, you can update to the most recent version using:

$ pip install -U numpy

You can verify that NumPy is installed by trying to use it in a Python program:

import numpy as np

Note that, by convention, we import NumPy using the alias ‘np’ - whenever you see ‘np’ sprinkled in example Python code, it’s a good bet that it is using NumPy.

NumPy Basics

At its core, NumPy is a container for n-dimensional data. Typically, 1-dimensional data is called an array and 2-dimensional data is called a matrix. Beyond 2-dimensions would be considered a multidimensional array. Examples, where you’ll encounter these dimensions, may include:

  • 1 Dimensional: time-series data such as audio, stock prices, or a single observation in a dataset.
  • 2 Dimensional: connectivity data between network nodes, user-product recommendations, and database tables.
  • 3+ Dimensional: network latency between nodes over time, video (RGB+time), and version-controlled datasets.

All of these data can be placed into NumPy’s array object, just with varying dimensions.

Data Types: The Basic Building Blocks

Before we delve into arrays and matrices, we will start with the most basic element of those: a single value. NumPy can represent data utilizing many different standard datatypes such as uint8 (an 8-bit unsigned integer), float64 (a 64-bit float), or str (a string). An exhaustive listing can be found at:

Before moving on, it is important to know about the tradeoff made when using different datatypes. For example, a uint8 can only contain values between 0 and 255. This, however, contrasts with float64 which can express any value from +/- 1.80e+308. So why wouldn’t we just always use float64s? Though they allow us to be more expressive in terms of numbers, they also consume more memory. If we were working with a 12-megapixel image, for example, storing that image using uint8 values would require 3000 * 4000 * 8 = 96 million bits, or about 11.44 MiB of memory. If we were to store the same image utilizing float64, our image would consume 8 times as much memory: 768 million bits, or about 91.55 MiB. It is important to use the right data type for the job to avoid consuming unnecessary resources or slowing down processing.
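You can verify such footprints directly, since every NumPy array reports its memory consumption through the nbytes attribute; a quick sketch of the single-channel 12-megapixel comparison above:

```python
import numpy as np

img_u8 = np.zeros((3000, 4000), dtype=np.uint8)     # 1 byte per pixel
img_f64 = np.zeros((3000, 4000), dtype=np.float64)  # 8 bytes per pixel

print(img_u8.nbytes)   # 12000000
print(img_f64.nbytes)  # 96000000
print(img_f64.nbytes // img_u8.nbytes)  # 8
```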

Finally, while NumPy will conveniently convert between datatypes, one must be aware of overflows when using smaller data types. For example:

a = np.array([6], dtype=np.uint8)
print(a)
>>>[6]
a = a + np.array([7], dtype=np.uint8)
print(a)
>>>[13]
a = a + np.array([245], dtype=np.uint8)
print(a)
>>>[2]

In this example, it makes sense that 6+7=13. But how does 13+245=2? Put simply, the object type (uint8) simply ran out of space to store the value and wrapped back around to the beginning. An 8-bit number is only capable of storing 2^8, or 256, unique values. An operation that results in a value above that range will ‘overflow’ and cause the value to wrap back around to zero. Likewise, anything below that range will ‘underflow’ and wrap back around to the end. In our example, 13+245 became 258, which was too large to store in 8 bits and wrapped back around to 0 and ended up at 2.
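The exact range of an integer dtype can be queried with np.iinfo (np.finfo for floating-point types), which makes the wraparound arithmetic explicit:

```python
import numpy as np

info = np.iinfo(np.uint8)
print(info.min, info.max)  # 0 255

# uint8 arithmetic is modulo 256, so 13 + 245 = 258 wraps to 2:
print((13 + 245) % (info.max + 1))  # 2
```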

NumPy will, generally, try to avoid this situation by dynamically retyping to whatever datatype will support the result:

a = a + 260
print(a)
>>>[262]

Here, our addition caused our array, ‘a,’ to be upcast to uint16 instead of uint8. Finally, NumPy offers convenience functions akin to Python’s range() function to create arrays of sequential numbers:

X = np.arange(0.2, 1, .1)
print(X)
>>>[0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]

We can use this function to also generate parameters spaces that can be iterated on:

P = 10.0 ** np.arange(-7,1,1)
print(P)

for x,p in zip(X,P):
    print('%f, %f' % (x, p))

Arrays: Stringing Things Together

With our knowledge of datatypes in hand, we can begin to explore arrays. Simply put, arrays can be thought of as a sequence of values (not necessarily numbers). Arrays are 1-dimensional and can be created and accessed simply:

a = np.array([1, 2, 3])
print(type(a))
>>><class 'numpy.ndarray'>
print(a)
>>>[1 2 3]
print(a.shape)
>>>(3,)
a[0]
>>>1

Arrays (and, later, matrices) are zero-indexed. This makes it convenient when, for example, using Python’s range() function to iterate through an array:

for i in range(3):
    print(a[i])
>>>1
>>>2
>>>3

Arrays are, also, mutable and can be changed easily:

a[0] = 42
print(a)
>>>array([42, 2, 3])

NumPy also includes incredibly powerful broadcasting features. This makes it very simple to perform mathematical operations on arrays that also makes intuitive sense:

a * 3
>>>array([3, 6, 9])
a**2
>>>array([1, 4, 9], dtype=int32)

Arrays can also interact with other arrays:

b = np.array([2, 3, 4])
print(a * b)
>>>array([ 2,  6, 12])

In this example, the result of multiplying together two arrays is to take the element-wise product while multiplying by a constant will multiply each element in the array by that constant. NumPy supports all of the basic mathematical operations: addition, subtraction, multiplication, division, and powers. It also includes an extensive suite of mathematical functions, such as log() and max(), which are covered later.

Matrices: An Array of Arrays

Matrices can be thought of as an extension of arrays - rather than having one dimension, matrices have 2 (or more). Much like arrays, matrices can be created easily within NumPy:

m = np.array([[1, 2], [3, 4]])
print(m)
>>>[[1 2]
>>> [3 4]]

Accessing individual elements is similar to how we did it for arrays. We simply need to pass in a number of arguments equal to the number of dimensions:

m[1][0]
>>>3

In this example, our first index selected the row and the second selected the column, giving us our result of 3. Matrices can be extended to any number of dimensions by simply using more indices to access specific elements (though use cases beyond 4 dimensions may be somewhat rare).
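As a side note, NumPy also accepts a single tuple of indices, which is the more idiomatic spelling since it performs one lookup instead of creating an intermediate row array:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
print(m[1][0])  # 3 (chained indexing: select row 1, then element 0)
print(m[1, 0])  # 3 (tuple indexing: same result in a single lookup)
```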

Matrices support all of the normal mathematical operations such as +, -, *, and /. A special note: the * operator results in an element-wise multiplication; use @ or np.matmul() for matrix multiplication:

print(m-m)
print(m*m)
print(m/m)
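Broadcasting, which we saw earlier with scalars, also lets a 1-D array combine with a matrix when their shapes are compatible; in the sketch below the length-2 row vector is stretched across both rows of the matrix:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
row = np.array([10, 20])

# The row vector is broadcast across each row of the 2x2 matrix.
print(m + row)
# [[11 22]
#  [13 24]]
```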

More complex mathematical functions can typically be found within the NumPy library itself:

print(np.sin(m))
print(np.sum(m))

A full listing can be found at: https://docs.scipy.org/doc/numpy/reference/routines.math.html

Slicing Arrays and Matrices

As one can imagine, accessing elements one at a time is both slow and can potentially require many lines of code to iterate over every dimension in the matrix. Thankfully, NumPy incorporates a very powerful slicing engine that allows us to access ranges of elements easily:

m[1, :]
>>>array([3, 4])

The ‘:’ value tells NumPy to select all elements in the given dimension. Here, we’ve requested all elements in row 1 (the second row, since indexing starts at zero). We can also use indexing to request elements within a given range:

a = np.arange(0, 10, 1)
print(a)
>>>[0 1 2 3 4 5 6 7 8 9]
a[4:8]
>>>array([4, 5, 6, 7])

Here, we asked NumPy to give us elements 4 through 7 (ranges in Python are inclusive at the start and non-inclusive at the end). We can even go backwards:

a[-5:]
>>>array([5, 6, 7, 8, 9])

In the previous example, the negative value is asking NumPy to return the last 5 elements of the array. Had the argument been ‘:-5,’ NumPy would’ve returned everything BUT the last five elements:

a[:-5]
>>>array([0, 1, 2, 3, 4])

Becoming more familiar with NumPy’s accessor conventions will allow you to write more efficient, clearer code, as it is easier to read a simple one-line accessor than a multi-line, nested loop when extracting values from an array or matrix.
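Two more slicing idioms worth knowing are step values and reversal; a short self-contained sketch (the array is redefined here):

```python
import numpy as np

a = np.arange(0, 10, 1)
print(a[::2])    # every second element: [0 2 4 6 8]
print(a[::-1])   # reversed: [9 8 7 6 5 4 3 2 1 0]
print(a[1:8:3])  # indices 1 through 7 in steps of 3: [1 4 7]
```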

Useful Functions

The NumPy library provides several convenient mathematical functions. These functions have several advantages over code written by users:

  • They are open source and typically have multiple contributors checking for errors.
  • Many of them utilize a C interface and will run much faster than native Python code.
  • They’re written to be very flexible.

NumPy arrays and matrices contain many useful aggregating functions such as max(), min(), mean(), etc. These functions usually run an order of magnitude faster than looping through the object, so it’s important to understand what functions are available to avoid ‘reinventing the wheel.’ In addition, many of the functions can sum or average across axes, which makes them extremely useful if your data has inherent grouping. To return to a previous example:

m = np.array([[1, 2], [3, 4]])
print(m)
>>>[[1 2]
>>> [3 4]]
m.sum()
>>>10
m.sum(axis=1)
>>>[3, 7]
m.sum(axis=0)
>>>[4, 6]

In this example, we created a 2x2 matrix containing the numbers 1 through 4. The sum of the matrix returned the element-wise addition of the entire matrix. Summing along axis 1 returned a new array with the sum of each row, while summing along axis 0 returned the sum of each column.

Linear Algebra

Perhaps one of the most important uses for NumPy is its robust support for linear algebra functions. Like the aggregation functions described in the previous section, these functions are optimized to be much faster than user implementations and can utilize processor-level features to provide very quick computations. These functions can be accessed very easily from the NumPy package:

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.matmul(a, b))
>>>[[19 22]
    [43 50]]

Included within np.linalg are functions for calculating the Eigendecomposition of square matrices and symmetric matrices. Finally, to give a quick example of how easy it is to implement algorithms in NumPy, we can easily use it to calculate the cost and gradient when using simple Mean-Squared-Error (MSE):

cost = np.power(Y - np.matmul(X, weights), 2).mean(axis=1)
gradient = np.matmul(X.T, np.matmul(X, weights) - Y)

Finally, more advanced functions are easily available to users via the linalg library of NumPy as:

from numpy import linalg

A = np.diag((1,2,3))

w,v = linalg.eig(A)

print ('w =', w)
print ('v =', v)
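As a sanity check, each eigenpair returned by linalg.eig should satisfy the defining relation A·v = w·v. The sketch below verifies this for the diagonal matrix used above:

```python
import numpy as np
from numpy import linalg

A = np.diag((1, 2, 3))
w, v = linalg.eig(A)

# Each eigenpair satisfies A @ v_i == w_i * v_i (up to floating-point error).
for i in range(len(w)):
    assert np.allclose(A @ v[:, i], w[i] * v[:, i])

print(sorted(w.real))
```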

NumPy Resources

2.1.10.3 - Scipy

Gregor von Laszewski (laszewski@gmail.com)

SciPy is a library built around NumPy that has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (such as integration), statistics, linear algebra, image processing, signal processing, and machine learning.

To achieve this, SciPy bundles a number of useful open-source software for mathematics, science, and engineering. It includes the following packages:

NumPy,

for managing N-dimensional arrays

SciPy library,

to access fundamental scientific computing capabilities

Matplotlib,

to conduct 2D plotting

IPython,

for an Interactive console (see jupyter)

Sympy,

for symbolic mathematics

pandas,

for providing data structures and analysis

Introduction

First, we add the usual scientific computing modules with the typical abbreviations, including sp for scipy. We could invoke scipy’s statistical package as sp.stats, but for the sake of laziness, we abbreviate that too.

import numpy as np # import numpy
import scipy as sp # import scipy
from scipy import stats # refer directly to stats rather than sp.stats
import matplotlib as mpl # for visualization
from matplotlib import pyplot as plt # refer directly to pyplot
                                     # rather than mpl.pyplot

Now we create some random data to play with. We generate 100 samples from a Gaussian distribution centered at zero.

s = np.random.randn(100)  # sp.randn was removed from recent SciPy versions

How many elements are in the set?

print ('There are',len(s),'elements in the set')

What is the mean (average) of the set?

print ('The mean of the set is',s.mean())

What is the minimum of the set?

print ('The minimum of the set is',s.min())

What is the maximum of the set?

print ('The maximum of the set is',s.max())

We can use library functions too (older SciPy re-exported NumPy’s median, std, and var at the top level, but recent versions do not, so we call them via np). What’s the median?

print ('The median of the set is',np.median(s))

What about the standard deviation and variance?

print ('The standard deviation is',np.std(s),
       'and the variance is',np.var(s))

Isn’t the variance the square of the standard deviation?

print ('The square of the standard deviation is',np.std(s)**2)

How close are the measures? The difference is tiny, as the following calculation shows:

print ('The difference is',abs(np.std(s)**2 - np.var(s)))

print ('And in decimal form, the difference is %0.16f' %
       (abs(np.std(s)**2 - np.var(s))))

How does this look as a histogram? See Figure 1, Figure 2, Figure 3

plt.hist(s) # yes, one line of code for a histogram
plt.show()

Figure 1: Histogram 1

Let us add some titles.

plt.clf() # clear out the previous plot

plt.hist(s)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")

plt.show()

Figure 2: Histogram 2

Typically we do not include titles when we prepare images for inclusion in LaTeX. There we use the caption to describe what the figure is about.

plt.clf() # clear out the previous plot

plt.hist(s)
plt.xlabel("Value")
plt.ylabel("Frequency")

plt.show()

Figure 3: Histogram 3

Let us try out some linear regression or curve fitting. See Figure 4.

import random

def F(x):
    return 2*x - 2

def add_noise(x):
    return x + random.uniform(-1,1)

X = range(0,10,1)

Y = []
for i in range(len(X)):
    Y.append(add_noise(X[i]))

plt.clf() # clear out the old figure
plt.plot(X,Y,'.')
plt.show()

Figure 4: Result 1

Now let’s try linear regression to fit the curve.

m, b, r, p, est_std_err = stats.linregress(X,Y)

What is the slope and y-intercept of the fitted curve?

print ('The slope is',m,'and the y-intercept is', b)

def Fprime(x): # the fitted curve
    return m*x + b

Now let’s see how well the curve fits the data. We’ll call the fitted curve F'.

X = range(0,10,1)

Yprime = []
for i in range(len(X)):
    Yprime.append(Fprime(X[i]))

plt.clf() # clear out the old figure

# the observed points, blue dots
plt.plot(X, Y, '.', label='observed points')

# the interpolated curve, connected red line
plt.plot(X, Yprime, 'r-', label='estimated points')

plt.title("Linear Regression Example") # title
plt.xlabel("x") # horizontal axis title
plt.ylabel("y") # vertical axis title
# legend labels to plot
plt.legend(['observed points', 'estimated points'])

# comment out so that you can save the figure
#plt.show()

To save images into a PDF file for inclusion into LaTeX documents you can save the images as follows. Other formats such as png are also possible, but the quality is naturally not sufficient for inclusion in papers and documents. For that, you certainly want to use PDF. The save of the figure has to occur before you use the show() command. See Figure 5

plt.savefig("regression.pdf", bbox_inches='tight')

plt.savefig('regression.png')

plt.show()

Figure 5: Result 2

References

For more information about SciPy we recommend that you visit the following link

https://www.scipy.org/getting-started.html#learning-to-work-with-scipy

Additional material and inspiration for this section are from

  • Prasanth. “Simple statistics with SciPy.” Comfort at 1 AU. February 28, 2011. https://oneau.wordpress.com/2011/02/28/simple-statistics-with-scipy/.
  • SciPy Cookbook. Last updated: 2015. http://scipy-cookbook.readthedocs.io/.

2.1.10.4 - Scikit-learn

Gregor von Laszewski (laszewski@gmail.com)


Learning Objectives

  • Exploratory data analysis
  • Pipeline to prepare data
  • Full learning pipeline
  • Fine tune the model
  • Significance tests

Introduction to Scikit-learn

Scikit-learn is a machine-learning library for Python that can be used for data mining and analysis. It is built on top of NumPy, SciPy, and matplotlib. Scikit-learn features classification, regression, clustering, and dimensionality-reduction algorithms. It also features model selection using grid search, cross-validation, and metrics.

Scikit-learn also enables users to preprocess data for machine learning using modules such as preprocessing and feature_extraction.

In this section we demonstrate how simple it is to use k-means in scikit learn.

Installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

$ pip install -U numpy
$ pip install -U scipy
$ pip install -U scikit-learn

Supervised Learning

Supervised learning is used in machine learning when we already know a set of output predictions based on input characteristics, and based on those we need to predict the target for a new input. Training data is used to train the model, which then can be used to predict the output from a bounded set.

Problems can be of two types:

  1. Classification : Training data belongs to a finite set of classes/categories, and based on the labels we want to predict the class/category for unlabeled data.
  2. Regression : The desired output consists of one or more continuous values, for example predicting the price of a house from its attributes.
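To make the distinction concrete, here is a minimal sketch on tiny made-up data; the numbers are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])

# Classification: discrete labels as targets.
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.5]]))   # predicts a class label

# Regression: continuous values as targets.
y_reg = np.array([0.1, 1.9, 4.2, 5.8])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[2.5]]))   # predicts a continuous value
```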

Unsupervised Learning

Unsupervised Learning is used in machine learning when we have the training set available but without any corresponding target. The outcome of the problem is to discover groups within the provided input. It can be done in many ways.

A few of them are listed here:

  1. Clustering : Discover groups with similar characteristics.
  2. Density Estimation : Find the distribution of data within the input space.
  3. Dimensionality Reduction : Project the data from a high-dimensional space down to two or three dimensions, for example for visualization.

Building an end-to-end pipeline for supervised machine learning using Scikit-learn

A data pipeline is a set of processing components that are sequenced to produce meaningful data. Pipelines are commonly used in machine learning, since there is a lot of data transformation and manipulation that needs to be applied to make data useful for machine learning. All components are sequenced in a way that the output of one component becomes the input for the next, and each component is self-contained. Components interact with each other using data.

Even if a component breaks, the downstream components can run normally using the last output. Sklearn provides the ability to build pipelines that can be transformed and modeled for machine learning.
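A minimal sketch of such a pipeline in scikit-learn, on made-up data: the scaler's output feeds the classifier, and the fitted pipeline is then used as a single estimator:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data, made up purely for illustration.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 40.0], [4.0, 20.0]])
y = np.array([0, 0, 1, 1])

# Each step's output feeds the next; the final step is the model.
pipe = Pipeline([
    ('scale', StandardScaler()),      # transformation component
    ('model', LogisticRegression())   # estimator component
])
pipe.fit(X, y)
print(pipe.predict([[2.5, 100.0]]))
```

Because the whole chain is one estimator, fit, predict, and cross-validation all apply the same preprocessing consistently.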

Steps for developing a machine learning model

  1. Explore the domain space
  2. Extract the problem definition
  3. Get the data that can be used to make the system learn to solve the problem definition.
  4. Discover and Visualize the data to gain insights
  5. Feature engineering and prepare the data
  6. Fine tune your model
  7. Evaluate your solution using metrics
  8. Once proven, launch and maintain the model.

Exploratory Data Analysis

Example project = Fraud detection system

The first step is to load the data into a dataframe so that a proper analysis can be done on the attributes.

data = pd.read_csv('dataset/data_file.csv')
data.head()

Perform the basic analysis on the data shape and null value information.

print(data.shape)
print(data.info())
data.isnull().values.any()

Here are examples of a few visual data analysis methods.

Bar plot

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.

Bar graphs are used:

  • To make comparisons between variables
  • To visualize any trend in the data, i.e., to show the dependence of one variable on another
  • To estimate values of a variable

plt.ylabel('Transactions')
plt.xlabel('Type')
data.type.value_counts().plot.bar()

Figure 1: Example of scikit-learn barplots

Correlation between attributes

Attributes in a dataset can be related in different ways. For example, one attribute may depend on another, or two attributes may be loosely or tightly coupled; two variables may also both be associated with a third one.

In order to understand the relationship between attributes, correlation is the best visual way to get an insight. A positive correlation means both attributes move in the same direction. A negative correlation means they move in opposite directions: as one attribute’s values increase, the other’s decrease. Zero correlation means the attributes are unrelated.

# compute the correlation matrix
corr = data.corr()

# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in newer NumPy
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))

# generate a custom diverging color map
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);

Figure 2: scikit-learn correlation array

Histogram Analysis of dataset attributes

A histogram consists of a set of counts that represent the number of times some event occurred.

%matplotlib inline
data.hist(bins=30, figsize=(20,15))
plt.show()

Figure 3: scikit-learn

Box plot Analysis

Box plot analysis is useful in detecting whether a distribution is skewed and detect outliers in the data.

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
tmp = data.loc[(data.type == 'TRANSFER'), :]

a = sns.boxplot(x = 'isFlaggedFraud', y = 'amount', data = tmp, ax=axs[0][0])
axs[0][0].set_yscale('log')
b = sns.boxplot(x = 'isFlaggedFraud', y = 'oldbalanceDest', data = tmp, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))
c = sns.boxplot(x = 'isFlaggedFraud', y = 'oldbalanceOrg', data=tmp, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))
d = sns.regplot(x = 'oldbalanceOrg', y = 'amount', data=tmp.loc[(tmp.isFlaggedFraud ==1), :], ax=axs[1][1])
plt.show()

Figure 4: scikit-learn

Scatter plot Analysis

The scatter plot displays values of two numerical variables as Cartesian coordinates.

plt.figure(figsize=(12,8))
sns.pairplot(data[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']], hue='isFraud')

Figure 5: scikit-learn scatter plots

Data Cleansing - Removing Outliers

If the transaction amount is lower than the 5th percentile of all transactions AND does not exceed USD 3,000, we will exclude it from our analysis to reduce Type 1 costs. If the transaction amount is higher than the 95th percentile of all transactions AND exceeds USD 500,000, we will exclude it from our analysis and use a blanket review process for such transactions (similar to the isFlaggedFraud column in the original dataset) to reduce Type 2 costs.

low_exclude = np.round(np.minimum(fin_samp_data.amount.quantile(0.05), 3000), 2)
high_exclude = np.round(np.maximum(fin_samp_data.amount.quantile(0.95), 500000), 2)

###Updating Data to exclude records prone to Type 1 and Type 2 costs
low_data = fin_samp_data[fin_samp_data.amount > low_exclude]
data = low_data[low_data.amount < high_exclude]

Pipeline Creation

A machine learning pipeline is used to help automate machine learning workflows. Pipelines operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

Defining DataFrameSelector to separate Numerical and Categorical attributes

A sample class to separate out numerical and categorical attributes.

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
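Applied to a toy DataFrame (hypothetical values), the selector simply returns the chosen columns as a NumPy array; the class is repeated here so the sketch is self-contained:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    # Same class as above, repeated so this sketch runs on its own.
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Made-up two-row DataFrame for illustration.
df = pd.DataFrame({'amount': [10.0, 20.0], 'type': ['CASH', 'TRANSFER']})
num = DataFrameSelector(['amount']).fit_transform(df)
print(num)
```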

Feature Creation / Additional Feature Engineering

During EDA we identified that there are transactions where the balances do not tally after the transaction is completed. We believe these could potentially be cases where fraud is occurring. To account for this error in the transactions, we define two new features, “errorBalanceOrig” and “errorBalanceDest,” calculated by adjusting the amount with the before and after balances for the Originator and Destination accounts.

Below, we create a function that allows us to create these features in a pipeline.

from sklearn.base import BaseEstimator, TransformerMixin

# column index
amount_ix, oldbalanceOrg_ix, newbalanceOrig_ix, oldbalanceDest_ix, newbalanceDest_ix = 0, 1, 2, 3, 4

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self): # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        errorBalanceOrig = X[:,newbalanceOrig_ix] +  X[:,amount_ix] -  X[:,oldbalanceOrg_ix]
        errorBalanceDest = X[:,oldbalanceDest_ix] +  X[:,amount_ix]-  X[:,newbalanceDest_ix]

        return np.c_[X, errorBalanceOrig, errorBalanceDest]
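On a single made-up transaction row (column order as assumed above), the same arithmetic as transform() yields zero for both error features when the balances tally:

```python
import numpy as np

# Columns in the order assumed above:
# amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
X = np.array([[100.0, 500.0, 400.0, 50.0, 150.0]])

# Same arithmetic as CombinedAttributesAdder.transform():
errorBalanceOrig = X[:, 2] + X[:, 0] - X[:, 1]   # 400 + 100 - 500
errorBalanceDest = X[:, 3] + X[:, 0] - X[:, 4]   # 50 + 100 - 150
out = np.c_[X, errorBalanceOrig, errorBalanceDest]
print(out.shape)
```

A nonzero value in either new column flags a transaction whose balances do not add up, which is exactly the signal the EDA suggested.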

Creating Training and Testing datasets

The training set includes the input examples that the model will be fit to, or trained on, by adjusting its parameters. The testing dataset is critical for testing the generalizability of the model. By using this set, we can get the working accuracy of our model.

The testing set should not be exposed to the model until training has been completed. This way, the results from testing are more reliable.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=42, stratify=y)

Creating pipeline for numerical and categorical attributes

Identifying columns with Numerical and Categorical characteristics.

X_train_num = X_train[["amount","oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount","oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest","type"]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler())
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # OneHotEncoder replaces the pre-release CategoricalEncoder;
        # on scikit-learn < 1.2 use sparse=False instead of sparse_output
        ('cat_encoder', OneHotEncoder(sparse_output=False))
    ])

Selecting the algorithm to be applied

Algorithm selection primarily depends on the objective you are trying to achieve and what kind of dataset is available. There are different types of algorithms that can be applied, and we will look into a few of them here.

Linear Regression

This algorithm can be applied when you want to compute some continuous value. To predict some future value of a process that is currently running, you can go with a regression algorithm.

Examples where linear regression can be used are:

  1. Predict the time taken to go from one place to another
  2. Predict the sales for a future month
  3. Predict sales data and improve yearly projections.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time
scl= StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)
start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train) #SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time()-start

Logistic Regression

This algorithm can be used to perform binary classification. It can be used if you want a probabilistic framework, or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

  1. Customer churn prediction.
  2. Credit scoring and fraud detection, which is the example problem we are trying to solve in this chapter.
  3. Calculating the effectiveness of marketing campaigns.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)

y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results

Decision trees

Decision trees handle feature interactions, and they’re non-parametric. However, they don’t support online learning, so the entire tree needs to be rebuilt when a new training dataset comes in, and memory consumption can be very high.

They can be used for the following cases:

  1. Investment decisions
  2. Customer churn
  3. Banks loan defaulters
  4. Build vs Buy decisions
  5. Sales lead qualifications
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start

K Means

This algorithm is used when we are not aware of the labels and they need to be created based on the features of the objects. An example would be dividing a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to know exactly the number of clusters or groups required. It can take a lot of iteration to come up with the best K.
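In contrast to the KNeighborsClassifier snippet below (a supervised algorithm with a similar name), plain k-means clustering looks like this minimal sketch. The synthetic blob data and the candidate K range are made up for illustration; the falling inertia values hint at the usual ‘elbow’ method for choosing K:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Three synthetic, well-separated groups (hypothetical data).
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [5, 5], [10, 0])])

# Inertia (within-cluster sum of squares) shrinks as K grows;
# the 'elbow' where it stops improving much suggests a good K.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 6)]
print(inertias)
```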

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)

y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)

results.loc[len(results)] = ["KNN Arbitrary Sklearn", np.round(acc, 3)]
results

Support Vector Machines

SVM is a supervised ML technique used for pattern recognition and classification problems when your data has exactly two classes. It is popular in text classification problems.

A few cases where SVM can be used are:

  1. Detecting persons with common diseases.
  2. Hand-written character recognition
  3. Text categorization
  4. Stock market price prediction

Naive Bayes

Naive Bayes is used for large datasets. This algorithm works well even when we have limited CPU and memory available. It works by calculating a bunch of counts and requires relatively little training data. The algorithm cannot, however, learn interactions between features.

Naive Bayes can be used in real-world applications such as:

  1. Sentiment analysis and text classification
  2. Recommendation systems like Netflix, Amazon
  3. To mark an email as spam or not spam
  4. Face recognition
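A minimal Gaussian Naive Bayes sketch on made-up one-dimensional data; training really is just per-class statistics, which is why it is so cheap:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset for illustration: two well-separated classes.
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)          # fitting computes per-class means and variances
print(model.predict([[1.1], [5.1]]))
```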

Random Forest

Random forest is an ensemble of decision trees. It can be used for both regression and classification problems with large datasets.

A few cases where it can be applied:

  1. Predict patients for high risks.
  2. Predict parts failures in manufacturing.
  3. Predict loan defaulters.
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=400, criterion='squared_error',  # formerly 'mse'
                               random_state=1, n_jobs=-1)
start = time.time()
forest.fit(X_train_std, y_train)
y_train_pred = forest.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = forest.predict(X_test_std)
test_time = time.time() - start

Neural networks

A neural network works based on the weights of the connections between neurons. The weights are trained, and based on them the neural network can be used to predict a class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

  1. Applied to unsupervised learning tasks, such as feature extraction.
  2. Extracts features from raw images or speech with much less human intervention

Deep Learning using Keras

Keras is one of the most powerful and easy-to-use Python libraries for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.

XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.

Scikit Cheat Sheet

Scikit-learn provides a very in-depth and well-explained flow chart to help you choose the right algorithm, which I find very handy.

Figure 6: scikit-learn

Parameter Optimization

Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters and finding the best combination of parameters can be treated as a search problem.

A parameter is a configuration variable that is part of the model and whose value can be estimated from the given data.

  1. Required by the model when making predictions.
  2. Values define the skill of the model on your problem.
  3. Estimated or learned from data.
  4. Often not set manually by the practitioner.
  5. Often saved as part of the learned model.

Hyperparameter optimization/tuning algorithms

Grid search is an approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.

Random search provides a statistical distribution for each hyperparameter, from which values are randomly sampled.
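A minimal sketch contrasting the two, using scikit-learn's GridSearchCV and RandomizedSearchCV on synthetic data; the estimator, parameter ranges, and iteration count are illustrative choices, not recommendations:

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=200, random_state=42)

# Grid search: evaluates every combination in an explicit grid.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.01, 1.0, 100]}, cv=3).fit(X, y)

# Random search: samples hyperparameter values from a distribution.
rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          {'C': uniform(0.01, 100)}, n_iter=5,
                          cv=3, random_state=42).fit(X, y)

print(grid.best_params_, rand.best_params_)
```

Random search often finds good settings with far fewer fits when only a few hyperparameters actually matter.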

Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)

Creating a parameter grid

grid_param = [
                [{   #LogisticRegression
                   'model__penalty':['l1','l2'],
                   'model__C': [0.01, 1.0, 100]
                }],

                [{#keras
                    'model__optimizer': optimizer,
                    'model__loss': loss
                }],

                [{  #SVM
                   'model__C' :[0.01, 1.0, 100],
                   'model__gamma': [0.5, 1],
                   'model__max_iter':[-1]
                }],

                [{   #XGBClassifier
                    'model__min_child_weight': [1, 3, 5],
                    'model__gamma': [0.5],
                    'model__subsample': [0.6, 0.8],
                    'model__colsample_bytree': [0.6],
                    'model__max_depth': [3]

                }]
            ]

Implementing grid search with the models and also creating metrics from each of the models.

Pipeline(memory=None,
     steps=[('preparation', FeatureUnion(n_jobs=None,
       transformer_list=[('num_pipeline', Pipeline(memory=None,
     steps=[('selector', DataFrameSelector(attribute_names=['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest'])), ('attribs_adder', CombinedAttributesAdder()...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC

test_scores = []
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
        linear_model.LogisticRegression(),
        keras_model,
        SVC(),
        XGBClassifier()

      ]

#create table to compare MLA metrics
MLA_columns = ['Name', 'Score', 'Accuracy_Score','ROC_AUC_score','final_rmse','Classification_error','Recall_Score','Precision_Score', 'mean_test_score', 'mean_fit_time', 'F1_Score']
MLA_compare = pd.DataFrame(columns = MLA_columns)
Model_Scores = pd.DataFrame(columns = ['Name','Score'])

row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'Name'] = MLA_name
    #MLA_compare.loc[row_index, 'Parameters'] = str(alg.get_params())


    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),  # combination of numerical and categorical pipelines
        ("model", alg)
    ])

    grid_search = GridSearchCV(full_pipeline_with_predictor, grid_param[row_index], cv=4, verbose=2, scoring='f1', return_train_score=True)

    grid_search.fit(X_train[X_model_col], y_train)
    y_pred = grid_search.predict(X_test)

    MLA_compare.loc[row_index, 'Accuracy_Score'] = np.round(accuracy_score(y_pred, y_test), 3)
    MLA_compare.loc[row_index, 'ROC_AUC_score'] = np.round(metrics.roc_auc_score(y_test, y_pred),3)
    MLA_compare.loc[row_index,'Score'] = np.round(grid_search.score(X_test, y_test),3)

    # best_score_ here is an F1 score (scoring='f1'), not a negative MSE,
    # so we compute RMSE directly from the test predictions instead
    final_mse = mean_squared_error(y_test, y_pred)
    final_rmse = np.sqrt(final_mse)
    MLA_compare.loc[row_index, 'final_rmse'] = final_rmse

    confusion_matrix_var = confusion_matrix(y_test, y_pred)
    TP = confusion_matrix_var[1, 1]
    TN = confusion_matrix_var[0, 0]
    FP = confusion_matrix_var[0, 1]
    FN = confusion_matrix_var[1, 0]
    MLA_compare.loc[row_index,'Classification_error'] = np.round(((FP + FN) / float(TP + TN + FP + FN)), 5)
    MLA_compare.loc[row_index,'Recall_Score'] = np.round(metrics.recall_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index,'Precision_Score'] = np.round(metrics.precision_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index,'F1_Score'] = np.round(f1_score(y_test,y_pred), 5)


    MLA_compare.loc[row_index, 'mean_test_score'] = grid_search.cv_results_['mean_test_score'].mean()
    MLA_compare.loc[row_index, 'mean_fit_time'] = grid_search.cv_results_['mean_fit_time'].mean()

    Model_Scores.loc[row_index,'Name'] = MLA_name
    Model_Scores.loc[row_index,'Score'] = np.round(metrics.roc_auc_score(y_test, y_pred),3)

    #Collect Mean Test scores for statistical significance test
    test_scores.append(grid_search.cv_results_['mean_test_score'])
    row_index+=1

The results table from the model evaluation with metrics:

Figure 7: scikit-learn

ROC AUC Score

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
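A toy illustration of the AUC extremes (the scores are made up): ranking every positive above every negative gives 1.0, and the exact reverse ranking gives 0.0:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Scores that rank every positive above every negative: perfect AUC of 1.0.
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Scores that rank them in exactly the wrong order: AUC of 0.0.
reversed_auc = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

print(perfect, reversed_auc)
```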

Figure 8: scikit-learn

Figure 9: scikit-learn

K-means in scikit learn

K-means Algorithm

In this section we demonstrate how simple it is to use k-means in scikit learn.

Import

    from time import time
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale

Create samples

    np.random.seed(42)

    digits = load_digits()
    data = scale(digits.data)

    n_samples, n_features = data.shape
    n_digits = len(np.unique(digits.target))
    labels = digits.target

    sample_size = 300

    print("n_digits: %d, \t n_samples %d, \t n_features %d"
          % (n_digits, n_samples, n_features))


    print(79 * '_')
    print('% 9s' % 'init'
          '    time  inertia    homo   compl  v-meas     ARI AMI  silhouette')


    def bench_k_means(estimator, name, data):
        t0 = time()
        estimator.fit(data)
        print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
              % (name, (time() - t0), estimator.inertia_,
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels,  estimator.labels_),

                 metrics.silhouette_score(data, estimator.labels_,
                                          metric='euclidean',
                                          sample_size=sample_size)))

    bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
                  name="k-means++", data=data)

    bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
                  name="random", data=data)


    # in this case the seeding of the centers is deterministic, hence we run the
    # kmeans algorithm only once with n_init=1
    pca = PCA(n_components=n_digits).fit(data)

    bench_k_means(KMeans(init=pca.components_,
                         n_clusters=n_digits, n_init=1),
                  name="PCA-based", data=data)
    print(79 * '_')

Visualize

See Figure 10

    reduced_data = PCA(n_components=2).fit_transform(data)
    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    kmeans.fit(reduced_data)

    # Step size of the mesh. Decrease to increase the quality of the VQ.
    h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

    # Plot the decision boundary. For that, we will assign a color to each point in the mesh.
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in mesh. Use last trained model.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')

    plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
    # Plot the centroids as a white X
    centroids = kmeans.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
              'Centroids are marked with white cross')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()

Figure 10: Result


2.1.10.5 - Dask - Random Forest Feature Detection

Gregor von Laszewski (laszewski@gmail.com)

Setup

First we need our tools. pandas gives us the DataFrame, very similar to R’s DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.
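As a tiny, self-contained illustration before we load the real data (the toy columns below are made up and are not the wine data), a DataFrame can be built from a dict, summarized with .describe(), and combined with numpy functions:

```python
import pandas as pd
import numpy as np

# a toy DataFrame built from a dict; column names are made up
df = pd.DataFrame({'acidity': [4.6, 7.9, 15.9],
                   'quality': [3, 6, 8]})

# .describe() gives basic statistics for every numeric column,
# much like R's summary()
stats = df.describe()
print(stats.loc['mean', 'acidity'])

# numpy functions work directly on DataFrame columns
print(np.round(df['quality'].mean(), 2))
```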

Dataset

We are using the wine quality dataset, archived at UCI’s Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

import pandas as pd
import numpy as np

Now we will load our data. pandas makes it easy!

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv',sep=';',header=0, index_col=False)

# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv',sep=';',header=0,index_col=False)

# rose? other fruit wines? plum wine? :(

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.

# for red wines
red_df.describe()
|       | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density  | pH       | sulphates | alcohol   | quality  |
|-------|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|----------|----------|-----------|-----------|----------|
| count | 1599.000000   | 1599.000000      | 1599.000000 | 1599.000000    | 1599.000000 | 1599.000000       | 1599.000000          | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean  | 8.319637      | 0.527821         | 0.270976    | 2.538806       | 0.087467  | 15.874922           | 46.467792            | 0.996747 | 3.311113 | 0.658149  | 10.422983 | 5.636023 |
| std   | 1.741096      | 0.179060         | 0.194801    | 1.409928       | 0.047065  | 10.460157           | 32.895324            | 0.001887 | 0.154386 | 0.169507  | 1.065668  | 0.807569 |
| min   | 4.600000      | 0.120000         | 0.000000    | 0.900000       | 0.012000  | 1.000000            | 6.000000             | 0.990070 | 2.740000 | 0.330000  | 8.400000  | 3.000000 |
| 25%   | 7.100000      | 0.390000         | 0.090000    | 1.900000       | 0.070000  | 7.000000            | 22.000000            | 0.995600 | 3.210000 | 0.550000  | 9.500000  | 5.000000 |
| 50%   | 7.900000      | 0.520000         | 0.260000    | 2.200000       | 0.079000  | 14.000000           | 38.000000            | 0.996750 | 3.310000 | 0.620000  | 10.200000 | 6.000000 |
| 75%   | 9.200000      | 0.640000         | 0.420000    | 2.600000       | 0.090000  | 21.000000           | 62.000000            | 0.997835 | 3.400000 | 0.730000  | 11.100000 | 6.000000 |
| max   | 15.900000     | 1.580000         | 1.000000    | 15.500000      | 0.611000  | 72.000000           | 289.000000           | 1.003690 | 4.010000 | 2.000000  | 14.900000 | 8.000000 |
# for white wines
white_df.describe()
|       | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density  | pH       | sulphates | alcohol   | quality  |
|-------|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|----------|----------|-----------|-----------|----------|
| count | 4898.000000   | 4898.000000      | 4898.000000 | 4898.000000    | 4898.000000 | 4898.000000       | 4898.000000          | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
| mean  | 6.854788      | 0.278241         | 0.334192    | 6.391415       | 0.045772  | 35.308085           | 138.360657           | 0.994027 | 3.188267 | 0.489847  | 10.514267 | 5.877909 |
| std   | 0.843868      | 0.100795         | 0.121020    | 5.072058       | 0.021848  | 17.007137           | 42.498065            | 0.002991 | 0.151001 | 0.114126  | 1.230621  | 0.885639 |
| min   | 3.800000      | 0.080000         | 0.000000    | 0.600000       | 0.009000  | 2.000000            | 9.000000             | 0.987110 | 2.720000 | 0.220000  | 8.000000  | 3.000000 |
| 25%   | 6.300000      | 0.210000         | 0.270000    | 1.700000       | 0.036000  | 23.000000           | 108.000000           | 0.991723 | 3.090000 | 0.410000  | 9.500000  | 5.000000 |
| 50%   | 6.800000      | 0.260000         | 0.320000    | 5.200000       | 0.043000  | 34.000000           | 134.000000           | 0.993740 | 3.180000 | 0.470000  | 10.400000 | 6.000000 |
| 75%   | 7.300000      | 0.320000         | 0.390000    | 9.900000       | 0.050000  | 46.000000           | 167.000000           | 0.996100 | 3.280000 | 0.550000  | 11.400000 | 6.000000 |
| max   | 14.200000     | 1.100000         | 1.660000    | 65.800000      | 0.346000  | 289.000000          | 440.000000           | 1.038980 | 3.820000 | 1.080000  | 14.200000 | 9.000000 |

Sometimes it is easier to understand the data visually. A histogram of the citric acid column of the white wine quality data is shown next. You can of course visualize other columns or other datasets; just replace the DataFrame and column name (see Figure 1).

import matplotlib.pyplot as plt

def extract_col(df,col_name):
    return list(df[col_name])

col = extract_col(white_df,'citric acid') # can replace with another dataframe or column
plt.hist(col)

#TODO: add axes and such to set a good example

plt.show()

Figure 1: Histogram

Figure 1: Histogram

Detecting Features

Let us try out some elementary machine learning models. These models are not only for prediction; they are also useful for finding which features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

Data Preparation

Let us assume we want to study what features are most correlated with pH. pH of course is real-valued, and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).

# refresh to make Jupyter happy
red_df = pd.read_csv('winequality-red.csv',sep=';',header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv',sep=';',header=0,index_col=False)

#TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or
# less than the mean pH, respectively
M = np.mean(list(red_df['pH'])) # expect inelegant code in these mappings
Lf = lambda p: int(p < M)*'lo' + int(p >= M)*'hi' # some C-style hackery

# create the new classifiable variable
red_df['pH-hi-lo'] = list(map(Lf, red_df['pH']))  # list() needed in Python 3

# and remove the predecessor
del red_df['pH']

Now we specify which dataset and variable we want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets us create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df # selected dataset
TARGET_VAR = 'pH-hi-lo' # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR] # no cheating

#TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df,target,test_size=0.2)

# set up the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn’s documentation and the many examples and tutorials. Random Forests are data science workhorses and the go-to method for many data scientists. Be careful relying on them, though: they tend to overfit. We try to avoid overfitting by separating the training and test datasets.

Random Forest

# pick a classifier

from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor,ExtraTreeClassifier,ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier

clf = RandomForestClassifier()

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-learn models.

# test it out

model = clf.fit(X_train,y_train)
pred = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test,pred)

var_score = clf.score(X_test,y_test)

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually exponentially decay. The first few highly-ranked features are usually the most important.

# for the sake of clarity
num_features = X_train.shape[1]
features = [df.columns[x] for x in indices]
feature_importances = [importances[x] for x in indices]

print('Feature ranking:\n')

for i in range(num_features):
    feature_name = features[i]
    feature_importance = feature_importances[i]
    print('%s%f' % (feature_name.ljust(30), feature_importance))
Feature ranking:

fixed acidity                 0.269778
citric acid                   0.171337
density                       0.089660
volatile acidity              0.088965
chlorides                     0.082945
alcohol                       0.080437
total sulfur dioxide          0.067832
sulphates                     0.047786
free sulfur dioxide           0.042727
residual sugar                0.037459
quality                       0.021075

Sometimes it is easier to visualize. We will use a bar chart; see Figure 2.

plt.clf()
plt.bar(range(num_features),feature_importances)
plt.xticks(range(num_features),features,rotation=90)
plt.ylabel('relative importance (a.u.)')
plt.title('Relative importances of most predictive features')
plt.show()

Figure 2: Result

Figure 2: Result

Finally, the same CSV loading step can be written with Dask, whose dask.dataframe module mirrors the pandas API and scales to datasets that do not fit in memory:

import dask.dataframe as dd

red_df = dd.read_csv('winequality-red.csv',sep=';',header=0)
white_df = dd.read_csv('winequality-white.csv',sep=';',header=0)

Acknowledgement

This notebook was developed by Juliette Zerick and Gregor von Laszewski

2.1.10.6 - Parallel Computing in Python

Gregor von Laszewski (laszewski@gmail.com)

In this module, we review the available Python modules that can be used for parallel computing. Parallel computing can take the form of either multi-threading or multi-processing. In the multi-threading approach, the threads run in the same shared memory heap, whereas in multi-processing the memory heaps of the processes are separate and independent; therefore, the communication between processes is a little more complex.

Multi-threading in Python

Threading in Python is perfect for I/O operations where the process is expected to be idle regularly, e.g. web scraping. This is a very useful feature because several applications and scripts spend the majority of their runtime waiting for network or data I/O. In several cases, e.g. web scraping, the resources, i.e. the downloads from different websites, are most of the time independent of each other. Therefore the processor can download them in parallel and join the results at the end.

Thread vs Threading

There are two built-in modules in Python that are related to threading, namely thread and threading. The former has been deprecated since Python 2, and in Python 3 it was renamed to _thread to mark it as a low-level implementation detail. The _thread module provides the low-level threading API for multi-threading in Python, whereas the threading module builds a high-level threading interface on top of it.

Thread() is the main entry point of the threading module; its two important arguments are target, for specifying the callable object to run, and args, for passing the arguments to that target callable. We illustrate these in the following example:

import threading

def hello_thread(thread_num):
    print ("Hello from Thread ", thread_num)

if __name__ == '__main__':
    for thread_num in range(5):
        t = threading.Thread(target=hello_thread,args=(thread_num,))
        t.start()

This is the output of the previous example, saved as threading_example.py (naming the script threading.py would shadow the standard library module it imports):

In [1]: %run threading_example.py
Hello from Thread  0
Hello from Thread  1
Hello from Thread  2
Hello from Thread  3
Hello from Thread  4

In case you are not familiar with the if __name__ == '__main__': statement, what it does is make sure that the code nested under this condition is run only when you run your module as a program; it will not run when your module is imported into another file.
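A minimal sketch of this behavior (the file and function names are hypothetical):

```python
# suppose this file is saved as greet.py (hypothetical name)

def greet():
    # importable from other modules without side effects
    return "Hello"

if __name__ == '__main__':
    # this branch runs only when greet.py is executed directly,
    # not when another module does `import greet`
    print(greet())
```

Running `python greet.py` prints the greeting, while `import greet` in another file defines greet() silently.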

Locks

As mentioned prior, the memory space is shared between the threads. This is at the same time beneficial and problematic: it is beneficial in a sense that the communication between the threads becomes easy, however, you might experience a strange outcome if you let several threads change the same variable without caution, e.g. thread 2 changes variable x while thread 1 is working with it. This is when lock comes into play. Using lock, you can allow only one thread to work with a variable. In other words, only a single thread can hold the lock. If the other threads need to work with that variable, they have to wait until the other thread is done and the variable is “unlocked.”

We illustrate this with a simple example:

import threading

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print ("Counter is %d"%counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print ("Counter is now %d"%counter)


if __name__ == '__main__':
    t1 = threading.Thread(target = incrementer1)
    t2 = threading.Thread(target = incrementer2)

    t1.start()
    t2.start()

Suppose we want to print the multiples of 3 between 1 and 12, i.e. 3, 6, 9 and 12. For the sake of argument, we try to do this using 2 threads and a nested for loop. We create a global variable called counter and initialize it with 0. Then whenever the incrementer1 or incrementer2 function is called, the counter is incremented by 3 twice (i.e. by 6 in each function call). If you run the previous code, you would have to be really lucky to get the following as part of your output:

Counter is now 3
Counter is now 6
Counter is now 9
Counter is now 12

The reason is the conflict that happens between the threads while incrementing the counter in the nested for loop. As you probably noticed, each iteration of the outer for loop is supposed to add 3 to the counter, and the conflict happens not at that level but inside the nested for loop. Accordingly, the output of the previous code is different in every run. This is an example output:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 4
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 8
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 10
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

We can fix this issue using a lock: whenever one of the function is going to increment the value by 3, it will acquire() the lock and when it is done the function will release() the lock. This mechanism is illustrated in the following code:

import threading

increment_by_3_lock = threading.Lock()

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print ("Counter is %d"%counter)
        increment_by_3_lock.release()

def incrementer2():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print ("Counter is %d"%counter)
        increment_by_3_lock.release()

if __name__ == '__main__':
    t1 = threading.Thread(target = incrementer1)
    t2 = threading.Thread(target = incrementer2)

    t1.start()
    t2.start()

No matter how many times you run this code, the output would always be in the correct order:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 3
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 6
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 9
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

Using the threading module increases both the overhead associated with thread management and the complexity of the program, which is why in many situations employing the multiprocessing module might be a better approach.

Multi-processing in Python

We already mentioned that multi-threading might not be sufficient in many applications and that we might need multiprocessing sometimes, or better to say most of the time. That is why we are dedicating this subsection to this particular module. It provides an API for spawning processes the same way you spawn threads with the threading module. Moreover, some functionality is not even available in the threading module, e.g. the Pool class, which allows you to run a batch of jobs using a pool of worker processes.

Process

Similar to threading module which was employing thread (aka _thread) under the hood, multiprocessing employs the Process class. Consider the following example:

from multiprocessing import Process
import os

def greeter (name):
    proc_idx = os.getpid()
    print ("Process {0}: Hello {1}!".format(proc_idx,name))

if __name__ == '__main__':
    name_list = ['Harry', 'George', 'Dirk', 'David']
    process_list = []
    for name_idx, name in enumerate(name_list):
        current_process = Process(target=greeter, args=(name,))
        process_list.append(current_process)
        current_process.start()
    for process in process_list:
        process.join()

In this example, after importing Process we created a greeter() function that takes a name and greets that person. It also prints the pid (process identifier) of the process that is running it. Note that we used the os module to get the pid. At the bottom of the code, after checking the __name__ == '__main__' condition, we create a series of Processes and start them. Finally, in the last for loop, using the join method, we tell Python to wait for the processes to terminate. This is one of the possible outputs of the code:

$ python3 process_example.py
Process 23451: Hello Harry!
Process 23452: Hello George!
Process 23453: Hello Dirk!
Process 23454: Hello David!

Pool

Consider the Pool class as a pool of worker processes. There are several ways to assign jobs to a Pool, and we introduce the most important ones in this section. These methods are categorized as blocking or non-blocking. The former means that after calling the API, the call blocks the thread/process until the result or answer is ready, and control returns only when the call completes. In the non-blocking case, on the other hand, control returns immediately.

Synchronous Pool.map()

We illustrate the Pool.map method by re-implementing our previous greeter example using Pool.map:

from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid,name))

if __name__ == '__main__':
    names = ['Jenna', 'David','Marry', 'Ted','Jerry','Tom','Justin']
    pool = Pool(processes=3)
    sync_map = pool.map(greeter,names)
    print("Done!")

As you can see, we have seven names here, but we do not want to dedicate a separate process to each greeting. Instead, we do the whole job of “greeting seven people” using three processes. We create a pool of 3 processes with the Pool(processes=3) syntax and then map an iterable called names to the greeter function using pool.map(greeter,names). As we expected, the greetings in the output are printed from three different processes:

$ python poolmap_example.py
Process 30585: Hello Jenna!
Process 30586: Hello David!
Process 30587: Hello Marry!
Process 30585: Hello Ted!
Process 30585: Hello Jerry!
Process 30587: Hello Tom!
Process 30585: Hello Justin!
Done!

Note that Pool.map() is in blocking category and does not return the control to your script until it is done calculating the results. That is why Done! is printed after all of the greetings are over.

Asynchronous Pool.map_async()

As the name implies, you can use the map_async method when you want to assign many function calls to a pool of worker processes asynchronously. Note that unlike map, the order of the results is not guaranteed and control returns immediately. We now implement the previous example using map_async:

from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid,name))

if __name__ == '__main__':
    names = ['Jenna', 'David','Marry', 'Ted','Jerry','Tom','Justin']
    pool = Pool(processes=3)
    async_map = pool.map_async(greeter,names)
    print("Done!")
    async_map.wait()

As you probably noticed, the only difference (clearly apart from the map_async method name) is calling the wait() method in the last line. The wait() method tells your script to wait for the result of map_async before terminating:

$ python poolmap_example.py
Done!
Process 30740: Hello Jenna!
Process 30741: Hello David!
Process 30740: Hello Ted!
Process 30742: Hello Marry!
Process 30740: Hello Jerry!
Process 30741: Hello Tom!
Process 30742: Hello Justin!

Note that the order of the printed results is not preserved. Moreover, Done! is printed before any of the results, meaning that if we do not use the wait() method, you probably will not see the results at all.

Locks

The way multiprocessing module implements locks is almost identical to the way the threading module does. After importing Lock from multiprocessing all you need to do is to acquire it, do some computation and then release the lock. We will clarify the use of Lock by providing an example in next section about process communication.

Process Communication

Process communication in multiprocessing is one of the most important, yet complicated, features of this module. As opposed to threading, Process objects do not have access to any shared variable by default, i.e. there is no shared memory space between the processes by default. This effect is illustrated in the following example:

from multiprocessing import Process, Lock, Value
import time

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print ("Greeter1: Counter is %d"%counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print ("Greeter2: Counter is %d"%counter)


if __name__ == '__main__':

    t1 = Process(target = incrementer1 )
    t2 = Process(target = incrementer2 )
    t1.start()
    t2.start()

Probably you already noticed that this is almost identical to our example in threading section. Now, take a look at the strange output:

$ python communication_example.py
Greeter1: Counter is 3
Greeter1: Counter is 6
Greeter2: Counter is 3
Greeter2: Counter is 6

As you can see, it is as if the processes do not see each other. Instead of having two processes, one counting to 6 and the other counting from 6 to 12, we have two processes each counting to 6.

Nevertheless, there are several ways that Processes from multiprocessing can communicate with each other, including Pipe, Queue, Value, Array and Manager. Pipe and Queue are appropriate for inter-process message passing. To be more specific, Pipe is useful for process-to-process scenarios, while Queue is more appropriate for processes-to-processes ones. Value and Array both provide synchronized access to shared data (very much like shared memory), and Manager can be used with many different data types. In the following sub-sections, we cover both Value and Array since they are lightweight, yet useful, approaches.

Value

The following example re-implements the broken example in the previous section. We fix the strange output, by using both Lock and Value:

from multiprocessing import Process, Lock, Value
import time

increment_by_3_lock = Lock()


def incrementer1(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        print ("Greeter1: Counter is %d"%counter.value)
        increment_by_3_lock.release()

def incrementer2(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        print ("Greeter2: Counter is %d"%counter.value)
        increment_by_3_lock.release()


if __name__ == '__main__':

    counter = Value('i',0)
    t1 = Process(target = incrementer1, args=(counter,))
    t2 = Process(target = incrementer2 , args=(counter,))
    t2.start()
    t1.start()

The usage of the Lock object in this example is identical to the example in the threading section. The usage of counter is, on the other hand, the novel part. First, note that counter is no longer a global variable; instead, it is a Value, which returns a ctypes object allocated from memory shared between the processes. The first argument 'i' indicates a signed integer, and the second argument defines the initial value. In this case we are assigning a signed integer in shared memory, initialized to 0, to the counter variable. We then modified our two functions to accept this shared variable as an argument. Finally, we changed the way we increment the counter, since counter is no longer a Python integer but a ctypes signed integer, whose value we access through its value attribute. The output of the code is now as we expected:

$ python mp_lock_example.py
Greeter2: Counter is 3
Greeter2: Counter is 6
Greeter1: Counter is 9
Greeter1: Counter is 12
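Worth noting (not shown in the example above): a Value also carries its own lock, reachable via get_lock(), so for simple atomic updates a separate Lock object is not strictly needed. A minimal sketch with invented helper names:

```python
from multiprocessing import Process, Value

def add_100(counter):
    for _ in range(100):
        # get_lock() returns the lock bundled with the Value,
        # making the read-modify-write atomic across processes.
        with counter.get_lock():
            counter.value += 1

def run(nworkers=4):
    """Spawn nworkers processes that each increment the shared counter 100 times."""
    counter = Value('i', 0)
    workers = [Process(target=add_100, args=(counter,)) for _ in range(nworkers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value

if __name__ == '__main__':
    print(run())  # 400
```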

The last example related to parallel processing illustrates the use of both Value and Array, as well as a technique for passing multiple arguments to a function by bundling them into a single tuple. This technique is particularly useful with map or map_async, which pass only a single argument to the mapped function:

from multiprocessing import Process, Lock, Value, Array
import time
from ctypes import c_char_p


increment_by_3_lock = Lock()


def incrementer1(counter_and_names):
    counter=  counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        name_idx = counter.value//3 -1
        print ("Greeter1: Greeting {0}! Counter is {1}".format(names.value[name_idx],counter.value))
        increment_by_3_lock.release()

def incrementer2(counter_and_names):
    counter=  counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        name_idx = counter.value//3 -1
        print ("Greeter2: Greeting {0}! Counter is {1}".format(names.value[name_idx],counter.value))
        increment_by_3_lock.release()


if __name__ == '__main__':
    counter = Value('i',0)
    names = Array (c_char_p,4)
    names.value = ['James','Tom','Sam', 'Larry']
    t1 = Process(target = incrementer1, args=((counter,names),))
    t2 = Process(target = incrementer2 , args=((counter,names),))
    t2.start()
    t1.start()

In this example, we created a multiprocessing.Array() object and assigned it to a variable called names. As we mentioned before, the first argument is the ctypes data type, and since we want to create an array of strings with a length of 4 (the second argument), we imported c_char_p and passed it as the first argument.

Instead of passing the arguments separately, we merged both the Value and Array objects in a tuple and passed the tuple to the functions. We then modified the functions to unpack the objects in the first two lines in both functions. Finally, we changed the print statement in a way that each process greets a particular name. The output of the example is:

$ python3 mp_lock_example.py
Greeter2: Greeting James! Counter is 3
Greeter2: Greeting Tom! Counter is 6
Greeter1: Greeting Sam! Counter is 9
Greeter1: Greeting Larry! Counter is 12
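A caveat not covered above: sharing Python strings through Array and c_char_p is fragile (the array stores pointers, which do not transfer cleanly between processes in Python 3); numeric arrays are the more common use. A small sketch with a shared integer array, names invented for the example:

```python
from multiprocessing import Process, Array

def fill_squares(arr):
    # Write into the shared memory; the parent process sees the changes.
    for i in range(len(arr)):
        arr[i] = i * i

def run():
    arr = Array('i', 5)  # five signed ints, zero-initialized, in shared memory
    p = Process(target=fill_squares, args=(arr,))
    p.start()
    p.join()
    return list(arr)

if __name__ == '__main__':
    print(run())  # [0, 1, 4, 9, 16]
```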

2.1.10.7 - Dask

Gregor von Laszewski (laszewski@gmail.com)

Dask is a Python-based parallel computing library for analytics. Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved concurrently.

Dask is composed of two components:

  1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  2. Big Data collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Dask emphasizes the following virtues:

  • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.
  • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
  • Native: Enables distributed computing in Pure Python with access to the PyData stack.
  • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
  • Scales up: Runs resiliently on clusters with 1000s of cores
  • Scales down: Trivial to set up and run on a laptop in a single process
  • Responsive: Designed with interactive computing in mind it provides rapid feedback and diagnostics to aid humans

The section is structured in a number of subsections addressing the following topics:

Foundations:

an explanation of what Dask is, how it works, and how to use lower level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users.

Distributed Features:

information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops.

Collections:

convenient abstractions giving a familiar feel to big data.

Bags:

Python iterators with a functional paradigm, such as found in functools/itertools and toolz; generalizes lists/generators to big data. This will seem very familiar to users of PySpark’s RDD

Array:

massive multi-dimensional numerical data, with Numpy functionality

Dataframe:

massive tabular data, with Pandas functionality

How Dask Works

Dask is a computation tool for larger-than-memory datasets, parallel execution or delayed/background execution.

We can summarize the basics of Dask as follows:

  • process data that does not fit into memory by breaking it into blocks and specifying task chains
  • parallelize execution of tasks across cores and even nodes of a cluster
  • move computation to the data rather than the other way around, to minimize communication overheads

We use for-loops to build basic tasks, Python iterators, and the Numpy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.

Dask allows us to construct a prescription for the calculation we want to carry out. A module named Dask.delayed lets us parallelize custom code. It is useful whenever our problem doesn’t quite fit a high-level parallel object like dask.array or dask.dataframe but could still benefit from parallelism. Dask.delayed works by delaying our function evaluations and putting them into a dask graph. Here is a small example:

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

Here we have used the delayed annotation to show that we want these functions to operate lazily - to save the set of inputs and execute only on demand.
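To make the laziness concrete, here is a toy re-implementation of the idea in plain Python. This is an illustration of the concept only, not dask's actual implementation; all names here are invented:

```python
class Delayed:
    """A toy stand-in for dask.delayed: record a call, run it on demand."""
    def __init__(self, fn, args):
        self.fn = fn
        self.args = args

    def compute(self):
        # Recursively evaluate any Delayed arguments first -- this walk
        # is the moral equivalent of executing a dask task graph.
        resolved = [a.compute() if isinstance(a, Delayed) else a
                    for a in self.args]
        return self.fn(*resolved)

def delayed(fn):
    def wrapper(*args):
        return Delayed(fn, args)
    return wrapper

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Nothing runs yet: total is a recorded computation, not a number.
total = add(inc(1), inc(2))
print(total.compute())  # 5
```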

Dask Bag

Dask-bag excels in processing data that can be represented as a sequence of arbitrary inputs. We’ll refer to this as “messy” data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. The functional programming style fits very nicely with standard Python iteration, such as can be found in the itertools module.

Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be JSON, CSV, XML, or any other format that does not enforce strict structure and datatypes. For this reason, the initial data massaging and processing is often done with Python lists, dicts, and sets.

These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators/generator expressions or libraries like itertools or toolz let us process large volumes in a small space. If we combine this with parallel processing then we can churn through a fair amount of data.

Dask.bag is a high level Dask collection to automate common workloads of this form. In a nutshell

dask.bag = map, filter, toolz + parallel execution

You can create a Bag from a Python sequence, from files, from data on S3, etc.

# each element is an integer
import dask.bag as db
b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# each element is a text file of JSON lines
import os
b = db.read_text(os.path.join('data', 'accounts.*.json.gz'))

# Requires `s3fs` library
# each element is a remote CSV text file
b = db.read_text('s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv')

Bag objects hold the standard functional API found in projects like the Python standard library, toolz, or pyspark, including map, filter, groupby, etc.

As with Array and DataFrame objects, operations on Bag objects create new bags. Call the .compute() method to trigger execution.

def is_even(n):
    return n % 2 == 0

b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
c = b.filter(is_even).map(lambda x: x ** 2)
c

# blocking form: wait for completion (which is very fast in this case)
c.compute()
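For reference, the value that c.compute() produces here matches what the same filter-then-map pipeline yields with plain Python builtins:

```python
def is_even(n):
    return n % 2 == 0

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Keep the even numbers, then square them -- the same logic
# dask.bag would distribute across partitions.
result = [x ** 2 for x in data if is_even(x)]
print(result)  # [4, 16, 36, 64, 100]
```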

For more details on Dask Bag check https://dask.pydata.org/en/latest/bag.html

Concurrency Features

Dask supports a real-time task framework that extends Python’s concurrent.futures interface. This interface is good for arbitrary task scheduling, like dask.delayed, but is immediate rather than lazy, which provides some more flexibility in situations where the computations may evolve. These features depend on the second-generation task scheduler found in dask.distributed (which, despite its name, runs very well on a single machine).
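Since the interface mirrors the standard library's concurrent.futures, the submit/result pattern can be sketched without a Dask cluster at all (the function names here are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def run():
    # submit() returns a future immediately; result() blocks until done.
    with ThreadPoolExecutor(max_workers=4) as ex:
        futures = [ex.submit(square, n) for n in range(5)]
        return [f.result() for f in futures]

print(run())  # [0, 1, 4, 9, 16]
```

Dask's Client exposes the same submit/result vocabulary, so code written this way carries over to the distributed scheduler with little change.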

Dask allows us to simply construct graphs of tasks with dependencies. We can find that graphs can also be created automatically for us using functional, Numpy, or Pandas syntax on data collections. None of this would be very useful if there weren’t also a way to execute these graphs, in a parallel and memory-aware way. Dask comes with four available schedulers:

  • dask.threaded.get: a scheduler backed by a thread pool
  • dask.multiprocessing.get: a scheduler backed by a process pool
  • dask.async.get_sync: a synchronous scheduler, good for debugging
  • distributed.Client.get: a distributed scheduler for executing graphs on multiple machines.

Here is a simple program for dask.distributed library:

from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()

For more details on Concurrent Features by Dask check https://dask.pydata.org/en/latest/futures.html

Dask Array

Dask arrays implement a subset of the NumPy interface on large arrays using blocked algorithms and task scheduling. These behave like numpy arrays, but break a massive job into tasks that are then executed by a scheduler. The default scheduler uses threading but you can also use multiprocessing or distributed or even serial processing (mainly for debugging). You can tell the dask array how to break the data into chunks for processing.

import h5py
import dask.array as da
f = h5py.File('myfile.hdf5')
x = da.from_array(f['/big-data'], chunks=(1000, 1000))
x - x.mean(axis=1).compute()
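The blocked-algorithm idea behind dask.array can be sketched in plain Python: reduce each chunk independently, then combine the partial results. This is a conceptual illustration with an invented function name, not dask code:

```python
def chunked_sum(data, chunk_size):
    """Sum a sequence block by block, the way dask.array splits one
    large reduction into many small per-chunk tasks plus a combine step."""
    partials = [sum(data[i:i + chunk_size])
                for i in range(0, len(data), chunk_size)]
    return sum(partials)

print(chunked_sum(list(range(10)), 3))  # 45
```

In dask the per-chunk tasks run in parallel on the scheduler; here they run sequentially, but the decomposition is the same.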

For more details on Dask Array check https://dask.pydata.org/en/latest/array.html

Dask DataFrame

A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These Pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. Dask.dataframe implements a commonly used subset of the Pandas interface, including elementwise operations, reductions, grouping operations, joins, time-series algorithms, and more. It copies the Pandas interface for these operations exactly, so it should be very familiar to Pandas users. Because Dask.dataframe operations merely coordinate Pandas operations, they usually exhibit performance characteristics similar to those of Pandas. To run the following code, save the ‘student.csv’ file on your machine.

import pandas as pd
df = pd.read_csv('student.csv')
d = df.groupby(df.HID).Serial_No.mean()
print(d)

ID
101     1
102     2
104     3
105     4
106     5
107     6
109     7
111     8
201     9
202    10
Name: Serial_No, dtype: int64

import dask.dataframe as dd
df = dd.read_csv('student.csv')
dt = df.groupby(df.HID).Serial_No.mean().compute()
print (dt)

ID
101     1.0
102     2.0
104     3.0
105     4.0
106     5.0
107     6.0
109     7.0
111     8.0
201     9.0
202    10.0
Name: Serial_No, dtype: float64

For more details on Dask DataFrame check https://dask.pydata.org/en/latest/dataframe.html

Dask DataFrame Storage

Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.

Decompressing text and parsing CSV files is expensive. One of the most effective strategies with medium data is to use a binary storage format like HDF5.

# be sure to shut down other kernels running distributed clients
from dask.distributed import Client
client = Client()

Create data if we don’t have any

from prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

First we read our csv data as before.

CSV and other text-based file formats are the most common storage for data from many sources, because they require minimal pre-processing, can be written line-by-line, and are human-readable. Since Pandas' read_csv is well-optimized, CSVs are a reasonable input, but far from optimal, since reading requires extensive text parsing.

import os
filename = os.path.join('data', 'accounts.*.csv')
filename

import dask.dataframe as dd
df_csv = dd.read_csv(filename)
df_csv.head()

HDF5 and netCDF are binary array formats very commonly used in the scientific realm.

Pandas contains a specialized HDF5 format, HDFStore. The dd.DataFrame.to_hdf method works exactly like the pd.DataFrame.to_hdf method.

target = os.path.join('data', 'accounts.h5')
target

%time df_csv.to_hdf(target, '/data')

df_hdf = dd.read_hdf(target, '/data')
df_hdf.head()

For more information on Dask DataFrame Storage, click http://dask.pydata.org/en/latest/dataframe-create.html

2.1.11 - Applications

Gregor von Laszewski (laszewski@gmail.com)

2.1.11.1 - Fingerprint Matching

Gregor von Laszewski (laszewski@gmail.com)


Warning: Please note that NIST has temporarily removed the fingerprint data set. We, unfortunately, do not have a copy of the dataset. If you have one, please notify us.

Python is a flexible and popular language for running data analysis pipelines. In this section, we will implement a solution for fingerprint matching.

Overview

Fingerprint recognition refers to the automated method for verifying a match between two fingerprints and that is used to identify individuals and verify their identity. Fingerprints (Figure 1) are the most widely used form of biometric used to identify individuals.

Figure 1: Fingerprints


Automated fingerprint matching generally requires the detection of different fingerprint features (aggregate characteristics of ridges, and minutiae points) and then the use of a fingerprint matching algorithm, which can perform both one-to-one and one-to-many matching operations. Based on the number of matches, a proximity score (distance or similarity) can be calculated.

We use the following NIST dataset for the study:

Special Database 14 - NIST Mated Fingerprint Card Pairs 2. (http://www.nist.gov/itl/iad/ig/special_dbases.cfm)

Objectives

Match the fingerprint images from a probe set to a gallery set and report the match scores.

Prerequisites

For this work we will use the NBIS tools mindtct and bozorth3.

In order to follow along, you must have the NBIS tools installed. If you are on Ubuntu 16.04 Xenial, the following steps will accomplish this:

$ sudo apt-get update -qq
$ sudo apt-get install -y build-essential cmake unzip
$ wget "http://nigos.nist.gov:8080/nist/nbis/nbis_v5_0_0.zip"
$ unzip -d nbis nbis_v5_0_0.zip
$ cd nbis/Rel_5.0.0
$ ./setup.sh /usr/local --without-X11
$ sudo make

Implementation

  1. Fetch the fingerprint images from the web
  2. Call out to external programs to prepare and compute the match scores
  3. Store the results in a database
  4. Generate a plot to identify likely matches.

We will use the following modules for downloading, unpacking, and verifying the dataset:

import urllib
import zipfile
import hashlib

We will be interacting with the operating system and manipulating files and their pathnames.

import os.path
import os
import sys
import shutil
import tempfile

Some generally useful utilities

import itertools
import functools
import types
from pprint import pprint

Using the attrs library provides some nice shortcuts to defining objects

import attr

We will be randomly dividing the entire dataset, based on user input, into probe and gallery sets.

import random

We will need to call out to the NBIS software, and we will use multiple processes to take advantage of all the cores on our machine.

import subprocess
import multiprocessing

As for plotting, we will use matplotlib, though there are many alternatives.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Finally, we will write the results to a database.

import sqlite3

Utility functions

Next, we will define some utility functions:

def take(n, iterable):
    "Returns a generator of the first **n** elements of an iterable"
    return itertools.islice(iterable, n )


def zipWith(function, *iterables):
    "Zip a set of **iterables** together and apply **function** to each tuple"
    for group in itertools.izip(*iterables):
        yield function(*group)


def uncurry(function):
    "Transforms an N-arry **function** so that it accepts a single parameter of an N-tuple"
    @functools.wraps(function)
    def wrapper(args):
        return function(*args)
    return wrapper


def fetch_url(url, sha256, prefix='.', checksum_blocksize=2**20, dryRun=False):
    """Download a url.

    :param url: the url to the file on the web
    :param sha256: the SHA-256 checksum. Used to determine if the file was previously downloaded.
    :param prefix: directory to save the file
    :param checksum_blocksize: blocksize to used when computing the checksum
    :param dryRun: boolean indicating that calling this function should do nothing
    :returns: the local path to the downloaded file
    :rtype:

    """

    if not os.path.exists(prefix):
        os.makedirs(prefix)

    local = os.path.join(prefix, os.path.basename(url))

    if dryRun: return local

    if os.path.exists(local):
        print ('Verifying checksum')
        chk = hashlib.sha256()
        with open(local, 'rb') as fd:
            while True:
                bits = fd.read(checksum_blocksize)
                if not bits: break
                chk.update(bits)
        if sha256 == chk.hexdigest():
            return local

    print ('Downloading', url)

    def report(sofar, blocksize, totalsize):
        msg = '{}%\r'.format(100 * sofar * blocksize / totalsize)
        sys.stderr.write(msg)

    urllib.urlretrieve(url, local, report)

    return local
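Note that the listings in this section are Python 2 (urllib.urlretrieve, itertools.izip, xrange). The block-wise checksum idea inside fetch_url carries over to Python 3 unchanged; here is a small self-contained sketch (the function name is ours, not from the original):

```python
import hashlib

def sha256_blockwise(data, blocksize=2**20):
    """Hash bytes in fixed-size blocks, as fetch_url does, so a large
    file never has to be held in memory all at once."""
    chk = hashlib.sha256()
    for i in range(0, len(data), blocksize):
        chk.update(data[i:i + blocksize])
    return chk.hexdigest()

print(sha256_blockwise(b'hello', blocksize=2))
```

Feeding the hash in blocks produces exactly the same digest as hashing the whole byte string at once.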

Dataset

We will now define some global parameters.

First, the fingerprint dataset

DATASET_URL = 'https://s3.amazonaws.com/nist-srd/SD4/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip'
DATASET_SHA256 = '4db6a8f3f9dc14c504180cbf67cdf35167a109280f121c901be37a80ac13c449'

We’ll define how to download the dataset. This function is general enough that it could be used to retrieve most files, but we will default it to use the values defined previously.

def prepare_dataset(url=None, sha256=None, prefix='.', skip=False):
    url = url or DATASET_URL
    sha256 = sha256 or DATASET_SHA256
    local = fetch_url(url, sha256=sha256, prefix=prefix, dryRun=skip)

    if not skip:
        print ('Extracting', local, 'to', prefix)
        with zipfile.ZipFile(local, 'r') as zip:
            zip.extractall(prefix)

    name, _ = os.path.splitext(local)
    return name


def locate_paths(path_md5list, prefix):
    with open(path_md5list) as fd:
        for line in itertools.imap(str.strip, fd):
            parts = line.split()
            if not len(parts) == 2: continue
            md5sum, path = parts
            chksum = Checksum(value=md5sum, kind='md5')
            filepath = os.path.join(prefix, path)
            yield Path(checksum=chksum, filepath=filepath)


def locate_images(paths):

    def predicate(path):
        _, ext = os.path.splitext(path.filepath)
        return ext in ['.png']

    for path in itertools.ifilter(predicate, paths):
        yield image(id=path.checksum.value, path=path)

Data Model

We will define some classes so we have a nice API for working with the dataflow. We set slots=True so that the resulting objects will be more space-efficient.

Utilities

Checksum

The checksum consists of the actual hash value (value) as well as a string representing the hashing algorithm. The validator enforces that the algorithm can only be one of the listed acceptable methods

@attr.s(slots=True)
class Checksum(object):
  value = attr.ib()
  kind = attr.ib(validator=lambda o, a, v: v in 'md5 sha1 sha224 sha256 sha384 sha512'.split())

Path

Path refers to an image's file path and its associated Checksum. We get the checksum “for free” since the MD5 hash is provided for each image in the dataset.

@attr.s(slots=True)
class Path(object):
    checksum = attr.ib()
    filepath = attr.ib()

Image

The start of the data pipeline is the image. An image has an id (the md5 hash) and the path to the image.

@attr.s(slots=True)
class image(object):
    id = attr.ib()
    path = attr.ib()

Mindtct

The next step in the pipeline is to apply the mindtct program from NBIS. A mindtct object, therefore, represents the results of applying mindtct on an image. The xyt output is needed for the next step, and the image attribute represents the image id.

@attr.s(slots=True)
class mindtct(object):
    image = attr.ib()
    xyt = attr.ib()

    def pretty(self):
        d = dict(id=self.image.id, path=self.image.path)
        return pprint(d)

We need a way to construct a mindtct object from an image object. A straightforward way of doing this would be a from_image @staticmethod or @classmethod, but that does not work well with multiprocessing: objects sent to worker processes must be serialized (pickled), and top-level functions work best for this.

def mindtct_from_image(image):
    imgpath = os.path.abspath(image.path.filepath)
    tempdir = tempfile.mkdtemp()
    oroot = os.path.join(tempdir, 'result')

    cmd = ['mindtct', imgpath, oroot]

    try:
        subprocess.check_call(cmd)

        with open(oroot + '.xyt') as fd:
            xyt = fd.read()

        result = mindtct(image=image.id, xyt=xyt)
        return result

    finally:
        shutil.rmtree(tempdir)
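The pickling constraint mentioned above is worth a minimal demonstration: functions handed to a Pool must be picklable, which in practice means defined at module top level (lambdas and most nested functions fail). The names below are invented for the sketch:

```python
from multiprocessing import Pool

def double(x):
    # Top-level function: picklable, so it can be shipped to worker processes.
    return 2 * x

def run():
    """Map a top-level function over a list using two worker processes."""
    with Pool(2) as pool:
        return pool.map(double, [1, 2, 3])

if __name__ == '__main__':
    print(run())  # [2, 4, 6]
```

Replacing double with a lambda in the pool.map call would raise a pickling error, which is exactly why mindtct_from_image is defined at top level rather than as a classmethod.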

Bozorth3

The final step in the pipeline is running the bozorth3 from NBIS. The bozorth3 class represents the match being done: tracking the ids of the probe and gallery images as well as the match score.

Since we will be writing these instances out to a database, we provide some static methods for SQL statements. While there are many Object-Relational-Model (ORM) libraries available for Python, this approach keeps the current implementation simple.

@attr.s(slots=True)
class bozorth3(object):
    probe = attr.ib()
    gallery = attr.ib()
    score = attr.ib()

    @staticmethod
    def sql_stmt_create_table():
        return 'CREATE TABLE IF NOT EXISTS bozorth3' \
             + '(probe TEXT, gallery TEXT, score NUMERIC)'

    @staticmethod
    def sql_prepared_stmt_insert():
        return 'INSERT INTO bozorth3 VALUES (?, ?, ?)'

    def sql_prepared_stmt_insert_values(self):
        return self.probe, self.gallery, self.score

In order to work well with multiprocessing, we define a class representing the input parameters to bozorth3 and a helper function to run bozorth3. This way the pipeline definition stays simple: one map to create the inputs and another map to run the program.

As NBIS bozorth3 can be called to compare one-to-one or one-to-many, we will dynamically choose between these approaches depending on whether the gallery attribute is a list or a single object.

@attr.s(slots=True)
class bozorth3_input(object):
    probe = attr.ib()
    gallery = attr.ib()

    def run(self):
        if isinstance(self.gallery, mindtct):
            return bozorth3_from_one_to_one(self.probe, self.gallery)
        elif isinstance(self.gallery, types.ListType):
            return bozorth3_from_one_to_many(self.probe, self.gallery)
        else:
            raise ValueError('Unhandled type for gallery: {}'.format(type(self.gallery)))

Next is the top-level function for running bozorth3. It accepts an instance of bozorth3_input and is implemented as a simple top-level wrapper so that it can be easily passed to the multiprocessing library.

def run_bozorth3(input):
    return input.run()
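One porting note: the list-vs-single dispatch above relies on Python 2's types.ListType, which no longer exists in Python 3; isinstance(x, list) is the replacement. A tiny sketch of the same branching (the function name is ours):

```python
def choose_mode(gallery):
    """Mirror bozorth3_input.run's branching in Python 3 terms."""
    if isinstance(gallery, list):
        return 'one-to-many'
    return 'one-to-one'

print(choose_mode(['a', 'b']))  # one-to-many
print(choose_mode('a'))         # one-to-one
```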

Running Bozorth3

There are two cases to handle:

  1. One-to-one probe to gallery sets
  2. One-to-many probe to gallery sets

Both approaches are implemented next. The implementations follow the same pattern:

  1. Create a temporary directory within which to work
  2. Write the probe and gallery images to files in the temporary directory
  3. Call the bozorth3 executable
  4. The match score is written to stdout, which is captured and then parsed
  5. Return a bozorth3 instance for each match
  6. Make sure to clean up the temporary directory

One-to-one

def bozorth3_from_one_to_one(probe, gallery):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFile = os.path.join(tempdir, 'gallery.xyt')

    with open(probeFile,   'wb') as fd: fd.write(probe.xyt)
    with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', probeFile, galleryFile]

    try:
        result = subprocess.check_output(cmd)
        score = int(result.strip())
        return bozorth3(probe=probe.image, gallery=gallery.image, score=score)
    finally:
        shutil.rmtree(tempdir)

One-to-many

def bozorth3_from_one_to_many(probe, galleryset):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFiles = [os.path.join(tempdir, 'gallery%d.xyt' % i)
                    for i,_ in enumerate(galleryset)]

    with open(probeFile, 'wb') as fd: fd.write(probe.xyt)
    for galleryFile, gallery in itertools.izip(galleryFiles, galleryset):
        with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', '-p', probeFile] + galleryFiles

    try:
        result = subprocess.check_output(cmd).strip()
        scores = map(int, result.split('\n'))
        return [bozorth3(probe=probe.image, gallery=gallery.image, score=score)
               for score, gallery in zip(scores, galleryset)]
    finally:
        shutil.rmtree(tempdir)

Plotting

For plotting, we will operate only on the database. We will select a small number of probe images and plot the score between them and the rest of the gallery images.

The mk_short_labels helper function will be defined next.

def plot(dbfile, nprobes=10):
    conn = sqlite3.connect(dbfile)
    results = pd.read_sql(
        "SELECT DISTINCT probe FROM bozorth3 ORDER BY score LIMIT '%s'" % nprobes,
        con=conn
    )
    shortlabels = mk_short_labels(results.probe)
    plt.figure()

    for i, probe in results.probe.iteritems():
        stmt = 'SELECT gallery, score FROM bozorth3 WHERE probe = ? ORDER BY gallery DESC'
        matches = pd.read_sql(stmt, params=(probe,), con=conn)
        xs = np.arange(len(matches), dtype=np.int)
        plt.plot(xs, matches.score, label='probe %s' % shortlabels[i])

    plt.ylabel('Score')
    plt.xlabel('Gallery')
    plt.legend(bbox_to_anchor=(0, 0, 1, -0.2))
    plt.show()

The image ids are long hash strings. In order to minimize the space the labels occupy on the figure, we provide a helper function to create a short label that still uniquely identifies each probe image in the selected sample.

def mk_short_labels(series, start=7):
    for size in xrange(start, len(series[0])):
        if len(series) == len(set(map(lambda s: s[:size], series))):
            break
    return map(lambda s: s[:size], series)
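mk_short_labels as written is Python 2 (xrange; map returning a list). A Python 3 port, kept behavior-for-behavior (the _py3 suffix is our naming, not from the original):

```python
def mk_short_labels_py3(series, start=7):
    """Find the shortest prefix length (>= start) at which all labels
    are still distinct, then truncate every label to that length."""
    size = start
    for size in range(start, len(series[0])):
        # A set collapses duplicates: if the set of prefixes is as large
        # as the input, every prefix is unique at this length.
        if len(series) == len({s[:size] for s in series}):
            break
    return [s[:size] for s in series]

print(mk_short_labels_py3(['aaaa1111', 'aaaa2222'], start=2))  # ['aaaa1', 'aaaa2']
```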

Putting it all Together

First, set up a temporary directory in which to work:

pool = multiprocessing.Pool()
prefix = '/tmp/fingerprint_example/'
if not os.path.exists(prefix):
    os.makedirs(prefix)

Next, we download and extract the fingerprint images from NIST:

%%time
dataprefix = prepare_dataset(prefix=prefix)
Verifying checksum
Extracting /tmp/fingerprint_example/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip to /tmp/fingerprint_example/
CPU times: user 3.34 s, sys: 645 ms, total: 3.99 s
Wall time: 4.01 s

Next, we will configure the location of the MD5 checksum file that comes with the download

md5listpath = os.path.join(prefix, 'NISTSpecialDatabase4GrayScaleImagesofFIGS/sd04/sd04_md5.lst')

Load the images from the downloaded files to start the analysis pipeline

%%time
print('Loading images')
paths = locate_paths(md5listpath, dataprefix)
images = locate_images(paths)
mindtcts = pool.map(mindtct_from_image, images)
print('Done')
Loading images
Done
CPU times: user 187 ms, sys: 17 ms, total: 204 ms
Wall time: 1min 21s

We can examine one of the loaded images. Note that image refers to the MD5 checksum that came with the image, and the xyt attribute holds the minutiae template (x, y, theta data) produced by mindtct.

print(mindtcts[0].image)
print(mindtcts[0].xyt[:50])
98b15d56330cb17f1982ae79348f711d 14 146 214 6 25 238 22 37 25 51 180 20
30 332 214

For example purposes we will use only a small percentage of the database, randomly selected, for our probe and gallery datasets.

perc_probe = 0.001
perc_gallery = 0.1
%%time
print('Generating samples')
probes  = random.sample(mindtcts, int(perc_probe   * len(mindtcts)))
gallery = random.sample(mindtcts, int(perc_gallery * len(mindtcts)))
print('|Probes| =', len(probes))
print('|Gallery|=', len(gallery))
Generating samples
|Probes| = 4
|Gallery|= 400
CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 993 µs

We can now compute the matching scores between the probe and gallery sets. This will use all cores available on this workstation.

%%time
print('Matching')
input = [bozorth3_input(probe=probe, gallery=gallery)
         for probe in probes]
bozorth3s = pool.map(run_bozorth3, i