Project Proposal

Team

For the final project I will be working independently. I based this decision on my experience with an earlier iteration of this course during my undergraduate career, where I found a project of this kind much more manageable to complete on my own.

Topic

The focus of my project will be big data in demographics. Demographic data consists of massive datasets describing countries' populations and their various attributes. With continuing advancements in analytics technology, data engineers are now able to expose patterns and trends that were previously invisible. The need to apply big data techniques to demographics reflects how large and complex this data has become. Analysis of this data can yield valuable information that helps us understand the changing world around us. However, while it is simple to state that patterns can be recognized, it must still be determined which of these patterns are of most interest. In my opinion the most important patterns involve physical population shifts, number of housing units, income, and employment status. Focusing on these data points should lead to more meaningful conclusions.

When considering this topic, it was necessary to determine what exactly I wanted to analyze. Based on the available datasets, I believe it would be interesting to attempt to predict income from demographic data by region. As an extension, it would be interesting to analyze how effective this algorithm is across multiple regions, or even to attempt to build a single algorithm able to predict income for all regions.

An alternative analysis, although I struggled to find supporting datasets, would be to analyze and predict population shifts based on contributing factors such as natural disasters, economic events, or political events.

A final alternative would be a slightly different type of analysis: determining whether different populations “self-segregate” based on factors such as age, race, religion, occupation, income, and education. The data, in this case, would be represented visually as a map with a color key. I am not sure whether this would qualify for the project. On one hand it would require the analysis of large datasets with multiple factors, but on the other it would not rely as heavily on algorithmic analysis.
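As a rough sketch of what the cross-region comparison for the income-prediction idea could look like, the snippet below trains a deliberately simple baseline on one region and scores it on every region. Everything here is an assumption for illustration: the rows are invented toy data, the `region`, `education`, and `high_income` fields are hypothetical, and the majority-vote baseline merely stands in for whatever algorithm I end up choosing.

```python
from collections import Counter, defaultdict

# Toy rows invented purely for illustration: (region, education, high_income).
# The field names and values are assumptions, not taken from any real dataset.
ROWS = [
    ("north", "college", True),
    ("north", "college", True),
    ("north", "high_school", False),
    ("north", "high_school", False),
    ("south", "college", True),
    ("south", "college", False),
    ("south", "college", False),
    ("south", "high_school", False),
    ("south", "high_school", False),
    ("south", "high_school", True),
]


def fit_majority(rows):
    """Learn the most common income label for each education level."""
    votes = defaultdict(Counter)
    for _region, education, high_income in rows:
        votes[education][high_income] += 1
    return {edu: counts.most_common(1)[0][0] for edu, counts in votes.items()}


def accuracy(model, rows, default=False):
    """Fraction of rows whose label matches the model's prediction."""
    if not rows:
        return 0.0
    correct = sum(model.get(edu, default) == label for _r, edu, label in rows)
    return correct / len(rows)


def by_region(rows):
    """Group rows by their region field."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return dict(grouped)


def cross_region_scores(rows):
    """Train on each region in turn and score the model on every region."""
    regions = by_region(rows)
    scores = {}
    for train_name, train_rows in regions.items():
        model = fit_majority(train_rows)
        for test_name, test_rows in regions.items():
            scores[(train_name, test_name)] = accuracy(model, test_rows)
    return scores


if __name__ == "__main__":
    for (train, test), score in sorted(cross_region_scores(ROWS).items()):
        print(f"train={train} test={test} accuracy={score:.2f}")
```

A large gap between the same-region score and the other-region scores would suggest that a single model for all regions is not enough on its own, which is the question the extension is meant to answer.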

Dataset

For the analysis of demographics, the most obvious datasets that come to mind are those of the Census Bureau. I found multiple datasets on Kaggle that would be appropriate. These datasets are CSV files, which are simple and familiar to work with in Python. From a quick look, they typically have about 30k data points each. However, many of them appear to be segments of more complete sets of data, so it may be possible to combine them into larger sets. Additionally, this data is typically self-reported, and some entries may be missing data points. Because of this, it will be necessary to spend extra time making sure the data has been appropriately prepared so that the algorithm can be as accurate as possible.

https://www.kaggle.com/econdata/predicting-earnings-from-census-data
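A minimal sketch of the preparation step described above, using only Python's standard `csv` module. The inline sample, the column names, and the set of missing-value markers (empty string, `?`, `NA`) are assumptions for illustration; the real files would need to be inspected to see how missing answers are actually encoded.

```python
import csv
import io

# Tiny inline sample standing in for a downloaded CSV file. The column names
# and the "?" missing-value marker are assumptions for illustration only.
SAMPLE_CSV = """age,education,occupation,income
39,Bachelors,Adm-clerical,<=50K
50,Bachelors,?,<=50K
,HS-grad,Handlers-cleaners,<=50K
52,HS-grad,Exec-managerial,>50K
"""

# Values treated as "no answer given" (assumed, not confirmed for any dataset).
MISSING_MARKERS = {"", "?", "NA"}


def load_rows(handle):
    """Read CSV rows as dicts, mapping missing-value markers to None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for key, value in raw.items():
            value = (value or "").strip()
            row[key] = None if value in MISSING_MARKERS else value
        rows.append(row)
    return rows


def missing_counts(rows):
    """Count how many rows are missing a value in each column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


def drop_incomplete(rows, required):
    """Keep only rows that have a value in every required column."""
    return [row for row in rows if all(row[col] is not None for col in required)]


if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE_CSV))
    print("missing per column:", missing_counts(rows))
    complete = drop_incomplete(rows, ["age", "occupation", "income"])
    print(f"kept {len(complete)} of {len(rows)} rows")
```

For a real file, the `io.StringIO(SAMPLE_CSV)` handle would be replaced with `open(path, newline="")`. Dropping incomplete rows is only one option; whether to drop or fill in missing values would depend on how many entries are affected.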

Get a Good Grade

I understand that there are a few ground rules for getting a great grade. First, the final report must be submitted as a Markdown file and must follow the proper submission requirements for GitHub. Also, because I am taking the graduate course, I understand that I must submit a software component with the report. Additionally, I understand that the analysis of the dataset should be unique and bring about a new point of conversation. As I mentioned in the Topic section of this writeup, I have not yet committed to the task I would like to attempt. I would appreciate any guidance or suggestions you may have in the area of demographics that would help make the project original. I would also appreciate your comments on my proposal of a scientific visualization as the final product rather than a statistical analysis. This path would still require significant analysis of the data to prepare it for presentation.