The Data Analysis Process in Python and How to Use It
There are many ways to analyze data in a project. Statisticians typically use R as their main tool, while some software development approaches focus instead on, for example, optimizing data structures.
Applications of Data Analysis
Some examples of data analysis in use appear in the following cases:
- Paper by Facebook on exposure to ideologically diverse information
- OKCupid blog post on the best questions to ask on a first date
- How Walmart used data analysis to increase sales
- How Bill James applied data analysis to baseball
- A pharmaceutical company uses data analysis to predict which chemical compounds are likely to make effective drugs
Running Python Locally
In this course, we’ll assume you have the ability to run Python locally. Whether you already have Python installed on your computer or not, we recommend downloading and installing Anaconda. This is a scientific Python installation that comes with a lot of libraries and tools we’ll be using in this course, some of which are otherwise very difficult to install.
If you haven’t already, go through our short course on Anaconda and Jupyter notebooks to set up your computer. I’ve included an environment file you can use to create a conda environment that will provide all the necessary packages and versions for this course. If the resource links open up the environment files as text files in your browser, you can use right-click (Win) or control-click (Mac) to open up a menu to “Save as…” to download the file. If you’d like to set up your own environment, this course requires Python 2.7, numpy, pandas, matplotlib, and seaborn.
Downloading Data Files
You should also download the data files from the Resources section. Make sure you save these in the same directory as your IPython notebook. The files we’ll be using in this lesson are enrollments.csv, daily_engagement.csv, and project_submissions.csv. daily_engagement_full.csv contains more detailed data than daily_engagement.csv, but it's a larger file (about 500 MB), so downloading and using this file is optional.
You should also download and read table_descriptions.txt, which describes what data is present in each file (or table) and what columns are present. The data has been anonymized, and contains a random selection of Data Analyst Nanodegree students who had completed the first project at the time the data was collected, as well as a random selection of students who had not.
Supporting Materials
ipython_notebook_tutorial.ipynb
Reminder: You should download the notebook and data files from the Resources section to follow along with the analysis performed through the rest of this lesson. You can open this section by clicking on the icon to the upper right of the classroom.
Python’s csv Module
This page contains documentation for Python’s csv module. Instead of csv, you'll be using unicodecsv in this course. unicodecsv works exactly the same as csv, but it comes with Anaconda and has support for unicode. The csv documentation page is still the best way to learn how to use the unicodecsv library, since the two libraries work exactly the same way.
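Since the two libraries share an API, a minimal sketch of reading rows into a list of dictionaries might look like the following. It uses the standard library's csv with an in-memory string so the snippet is self-contained; with unicodecsv you would instead open the actual data file in binary mode, e.g. open('enrollments.csv', 'rb').

```python
import csv
import io

# A tiny in-memory stand-in for enrollments.csv
data = "account_key,status\n448,canceled\n448,current\n"

# DictReader yields one dictionary per row, keyed by the header row
reader = csv.DictReader(io.StringIO(data))
rows = list(reader)
```
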
Iterators in Python
This page explains the difference between iterators and lists in Python, and how to use iterators.
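As a quick illustration of the difference: a list holds all its elements at once, while an iterator yields them one at a time and is exhausted after a single pass.

```python
nums = [1, 2, 3]   # a list: all elements available at once
it = iter(nums)    # an iterator over the list

first = next(it)   # advances the iterator, returning the first element
rest = list(it)    # consumes what's left: [2, 3]
again = list(it)   # the iterator is now exhausted: []
```
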
Solutions
If you want to check our solution for the problem, look at the end of this lesson for Quiz Solutions.
Removing an Element from a Dictionary
If you’re not sure how to remove an element from a dictionary, this post might be helpful.
Solutions
If you want to check our solution for the problem, look at the end of this lesson for Quiz Solutions.
Updated Code for Previous Exercise
After running the above code, Caroline also rewrites the solution from the previous exercise as the following code:
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

len(enrollments)
unique_enrolled_students = get_unique_students(enrollments)
len(unique_enrolled_students)

len(daily_engagement)
unique_engagement_students = get_unique_students(daily_engagement)
len(unique_engagement_students)

len(project_submissions)
unique_project_submitters = get_unique_students(project_submissions)
len(unique_project_submitters)
Adding labels and titles
In matplotlib, you can add axis labels using plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For histograms, you usually only need an x-axis label, but for other plot types a y-axis label may also be needed. You can also add a title using plt.title("Title of plot").
Making plots look nicer with seaborn
You can automatically make matplotlib plots look nicer using the seaborn library. This library is not automatically included with Anaconda, but Anaconda includes something called a package manager to make it easier to add new libraries. The package manager is called conda, and to use it, you should open the Command Prompt (on a PC) or terminal (on Mac or Linux) and type the command conda install seaborn.
If you are using a different Python installation than Anaconda, you may have a different package manager. The most common ones are pip and easy_install, and you can use them with the commands pip install seaborn or easy_install seaborn, respectively.
Once you have installed seaborn, you can import it anywhere in your code using the line import seaborn as sns. Then any plot you make afterwards will automatically look better. Give it a try!
If you’re wondering why the abbreviation for seaborn is sns, it’s because seaborn was named after the character Samuel Norman Seaborn from the show The West Wing, and sns are his initials.
The seaborn package also includes some extra functions you can use to make complex plots that would be difficult in matplotlib. We won’t be covering those in this course, but if you’d like to see what functions seaborn has available, you can look through the documentation.
Adding extra arguments to your plot
You’ll also frequently want to add some arguments to your plot to tune how it looks. You can see what arguments are available on the documentation page for the hist function. One common argument to pass is the bins argument, which sets the number of bins used by your histogram. For example, plt.hist(data, bins=20) would make sure your histogram has 20 bins.
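Putting these pieces together, a minimal sketch might look like the following (the data list and all labels are made up for illustration):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up data for illustration
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 10]

# bins=20 gives the histogram 20 bins
counts, bin_edges, patches = plt.hist(data, bins=20)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Distribution of values")
```
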
Improving one of your plots
Use these techniques to improve at least one of the plots you made earlier.
Sharing your findings
Finally, decide which of the discoveries you made this lesson you would most want to communicate to someone else, and write a forum post sharing your findings.
Solution Code
A notebook containing all code shown in this lesson is available in the Downloadables section, as well as the Quiz Solutions page at the end of the lesson.
Supporting Materials
Gapminder data
The data in this lesson was obtained from the site gapminder.org. The variables included are:
- Aged 15+ Employment Rate (%)
- Life Expectancy (years)
- GDP/capita (US$, inflation adjusted)
- Primary school completion (% of boys)
- Primary school completion (% of girls)
You can also obtain the data to analyze on your own from the Downloadables section.
Bitwise Operations
See this article for more information about bitwise operations.
In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and: for example, if you wanted to check whether matching elements in two integer vectors were both non-zero, a bitwise and could give the wrong answer. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.
Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
For the quiz, assume that the numbers of males and females are equal, i.e., we can take a simple average to get an overall completion rate.
In the solution, we may want to use / 2. instead of just / 2. This is because in Python 2, dividing an integer by another integer (2) drops fractions, so if our inputs are also integers, we may end up losing information. If we divide by a float (2.) then we will definitely retain decimal values.
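A quick illustration. Note that this behavior changed in Python 3, where / always performs true division and // is the operator that drops fractions:

```python
# In Python 3, / is always true division, so the trailing dot is no
# longer strictly necessary; // reproduces Python 2's integer division.
true_div = 7 / 2      # true division
floor_div = 7 // 2    # drops the fraction, like Python 2's 7 / 2
float_div = 7 / 2.    # dividing by a float always retains decimals
```
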
Erratum: The output of cell [3] in the solution video is incorrect: it appears that the male variable has not been set to the proper value set in cell [2]. All values except for the first will be different. The correct output in cell Out[3]: should instead start with:
array([ 192.83205, 205.28855, 202.82258, 186.63257, 206.91115,
Pandas idxmax()
Note: The argmax() function mentioned in the videos has been renamed to idxmax(), and returns the index of the first maximally-valued element. You can find documentation for the idxmax() function in Pandas here.
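For example, with a made-up Series of values indexed by label:

```python
import pandas as pd

# Hypothetical completion rates indexed by country (for illustration)
completion = pd.Series([85.2, 97.1, 92.4],
                       index=['Argentina', 'Brazil', 'Chile'])

top_country = completion.idxmax()  # index label of the maximum value
top_value = completion.max()       # the maximum value itself
```
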
Remember that Jupyter notebooks will just print out the results of the last expression run in a code cell as though a print expression was run. If you want to save the results of your operations for later, remember to assign the results to a variable or, for some Pandas functions like .dropna(), use inplace=True to modify the starting object without needing to reassign it.
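Both patterns are sketched below with a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Option 1: assign the result to a variable
cleaned = s.dropna()

# Option 2: modify the Series in place; note that with inplace=True
# the call returns None, so don't assign its result
s.dropna(inplace=True)
```
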
Note: The grader will execute your finished reverse_names(names) function on some test names Series when you submit your answer. Make sure that this function returns another Series with the transformed names.
split()
You can find documentation for Python’s split() function here.
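As a quick sketch of how split() could feed into the reverse_names quiz (assuming names look like "First Last"; the exact format of the course data may differ):

```python
# str.split() breaks a string into a list of pieces
name = 'Grace Hopper'
parts = name.split(' ')                      # ['Grace', 'Hopper']
reversed_name = parts[1] + ', ' + parts[0]   # 'Hopper, Grace'
```
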
Plotting in Pandas
If the variable data is a NumPy array or a Pandas Series, just like if it is a list, the code
import matplotlib.pyplot as plt
plt.hist(data)
will create a histogram of the data.
Pandas also has built-in plotting that uses matplotlib behind the scenes, so if data is a Series, you can create a histogram using data.hist().
There’s no difference between these two in this case, but sometimes the Pandas wrapper can be more convenient. For example, you can make a line plot of a series using data.plot(). The index of the Series will be used for the x-axis and the values for the y-axis.
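Here is a small self-contained sketch of both forms, using a made-up Series indexed by year:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical variable measured over time
data = pd.Series([50.1, 52.3, 55.0, 57.8],
                 index=[1990, 1995, 2000, 2005])

plt.hist(data)    # works on a Series just like on a list
data.hist()       # the Pandas wrapper around the same matplotlib call

plt.figure()      # start a fresh figure for the line plot
ax = data.plot()  # line plot: index on the x-axis, values on the y-axis
```
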
In the following quiz, we’ve created Series containing the various variables we’ve been looking at this lesson. Pick a country you’re interested in, and make a plot of each variable over time.
The Udacity editor will only show one plot each time you click “Test Run”, so you can look at multiple plots by clicking “Test Run” multiple times. If you’re running plotting code locally, you may need to add the line plt.show() depending on your setup.
Memory Layout
This page describes the memory layout of 2D NumPy arrays.
Understanding and Interpreting Correlations
- This page contains some scatterplots of variables with different values of correlation.
- This page lets you use a slider to change the correlation and see how the data might look.
- Pearson’s r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson’s r will be for those relationships.
Corrected vs. Uncorrected Standard Deviation
By default, Pandas’ std() function computes the standard deviation using Bessel's correction. Calling std(ddof=0) ensures that Bessel's correction will not be used.
Previous Exercise
The exercise where you used a simple heuristic to estimate correlation was the “Pandas Series” exercise in the previous lesson, “NumPy and Pandas for 1D Data”.
Pearson’s r in NumPy
NumPy’s corrcoef() function can be used to calculate Pearson’s r, also known as the correlation coefficient.
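For example, corrcoef() returns a 2x2 correlation matrix, and Pearson's r for the pair is either off-diagonal entry:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly linearly related to x

# np.corrcoef returns the full correlation matrix;
# the [0, 1] entry is Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]
```
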
Pandas shift()
Documentation for the Pandas shift() function is here. If you’re still not sure how the function works, try it out and see!
Alternative Solution
As an alternative to using vectorized operations, you could also use the code return entries_and_exits.diff() to calculate the answer in a single step.
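Both approaches are sketched below on a made-up Series of cumulative counts:

```python
import pandas as pd

# Hypothetical cumulative entries at successive readings
cumulative = pd.Series([10, 40, 95, 150])

# shift(1) moves each value down one position, so subtracting it
# gives the change between consecutive readings
hourly_shift = cumulative - cumulative.shift(1)

# diff() performs the same calculation in a single step
hourly_diff = cumulative.diff()
```
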
Note: The grader will execute your finished convert_grades(grades) function on some test grades DataFrames when you submit your answer. Make sure that this function returns a DataFrame with the converted grades.
Hint: You may need to define a helper function to use with .applymap().
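One possible shape for such a solution (the grade cutoffs below are made up and may not match the exercise's grading scheme):

```python
import pandas as pd

def convert_grade(grade):
    # Hypothetical cutoffs, for illustration only
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    else:
        return 'C'

def convert_grades(grades):
    # applymap applies the helper to every element of the DataFrame
    return grades.applymap(convert_grade)

grades = pd.DataFrame({'exam1': [95, 82, 70],
                       'exam2': [88, 91, 60]})
converted = convert_grades(grades)
```
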
Note: In order to get the proper computations, we should actually be setting the value of the "ddof" parameter to 0 in the .std() function.
Note that the type of standard deviation calculated by default is different between NumPy's .std() and Pandas' .std() functions. By default, NumPy calculates a population standard deviation, with "ddof = 0". On the other hand, Pandas calculates a sample standard deviation, with "ddof = 1". If we know all of the scores, then we have a population, so to standardize using Pandas, we need to set "ddof = 0".
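The difference is easy to see on a small made-up list of values:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

np_std = np.std(values)                     # population std (ddof=0)
pd_std = pd.Series(values).std()            # sample std (ddof=1)
pd_pop_std = pd.Series(values).std(ddof=0)  # matches NumPy's default
```
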
Using groupby() to Calculate Hourly Entries and Exits
In the quiz where you calculated hourly entries and exits, you did so for a single set of cumulative entries. However, in the original data, there was a separate set of numbers for each station.
Thus, to correctly calculate the hourly entries and exits, it was necessary to group by station and day, then calculate the hourly entries and exits within each day.
Write a function to do that. You should use the apply() function to call the function you wrote previously. You should also make sure you restrict your grouped data to just the entries and exits columns, since your function may cause an error if it is called on non-numerical data types.
If you would like to learn more about using groupby() in Pandas, this page contains more details.
Note: You will not be able to reproduce the ENTRIESn_hourly and EXITSn_hourly columns in the full dataset using this method. When creating the dataset, we did extra processing to remove erroneous values.
To clarify the structure of the data: the original data recorded the cumulative number of entries at each station at four-hour intervals. For the quiz, you just need to look at the differences between consecutive measurements at each station: by “hourly entries”, we simply mean the number of new tallies recorded between consecutive readings, in contrast to the “cumulative entries”.
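A sketch of the grouped calculation on a miniature made-up version of the data (the real dataset's column names may differ):

```python
import pandas as pd

# Hypothetical miniature version of the subway data
df = pd.DataFrame({
    'UNIT':     ['R001', 'R001', 'R001', 'R002', 'R002', 'R002'],
    'ENTRIESn': [10, 40, 95, 3, 9, 20],
    'EXITSn':   [5, 20, 50, 1, 4, 10],
})

def get_hourly(cumulative):
    # Differences between consecutive cumulative readings
    return cumulative.diff()

# Group by station and restrict to the numeric columns before applying
hourly = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(get_hourly)
```

Each station's first reading has no predecessor, so its difference is NaN.
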
Plotting with DataFrames
Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame. This can be a convenient way to get a quick look at your data, especially for small DataFrames, but for more complicated plots you will usually want to use matplotlib directly.
In the following quiz, create a plot of your choice showing something interesting about the New York subway data. For example, you might create:
- Histograms of subway ridership on both days with rain and days without rain
- A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
  - If you choose this option, you may wish to use the as_index=False argument to groupby(). There is example code in the following quiz.
- A scatterplot with subway ridership on one axis and precipitation or temperature on the other
If you’re not sure how to make the plot you want, try searching on Google or take a look at the matplotlib documentation. Once you’ve created a plot you’re happy with, share what you’ve found on the forums!
Three-Dimensional Data
Now that you’ve worked with one-dimensional and two-dimensional data, you might be wondering how to work with three or more dimensions.
3D data in NumPy
NumPy arrays can have arbitrarily many dimensions. Just like you can create a 1D array from a list, and a 2D array from a list of lists, you can create a 3D array from a list of lists of lists, and so on. For example, the following code would create a 3D array:
import numpy as np

a = np.array([
    [['A1a', 'A1b', 'A1c'], ['A2a', 'A2b', 'A2c']],
    [['B1a', 'B1b', 'B1c'], ['B2a', 'B2b', 'B2c']]
])
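Indexing such an array takes one index per dimension. Repeating the array so this snippet stands on its own:

```python
import numpy as np

a = np.array([
    [['A1a', 'A1b', 'A1c'], ['A2a', 'A2b', 'A2c']],
    [['B1a', 'B1b', 'B1c'], ['B2a', 'B2b', 'B2c']]
])

shape = a.shape       # (2, 2, 3): 2 blocks, 2 rows each, 3 elements per row
element = a[0, 1, 2]  # block 0, row 1, element 2
```
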
3D data in Pandas
Pandas has a data structure called a Panel, which is similar to a DataFrame or a Series, but for 3D data. If you would like, you can learn more about Panels here.
Pandas Links
Project Overview
Note: This course is currently only available for free, so you won’t be able to submit your work for review. We encourage you to use the specifications and evaluation tools to complete it, then self-assess and seek feedback from family, friends, and your social networks. Use their feedback to improve and you’ll have a great example of your work to show off anytime!
In this project, you will analyze a dataset and then communicate your findings about it. You will use the Python libraries NumPy, Pandas, and Matplotlib to make your analysis easier.
What do I need to install?
You will need an installation of Python, plus the following libraries:
- pandas
- numpy
- matplotlib
- csv or unicodecsv
We recommend installing Anaconda, which comes with all of the necessary packages, as well as IPython notebook. You can find installation instructions here.
Why this Project?
This project will introduce you to the data analysis process. In this project, you will go through the entire process so that you know how all the pieces fit together. Other courses in the Data Analyst Nanodegree focus on individual pieces of the data analysis process. In this project, you will also gain experience using the Python libraries NumPy, Pandas, and Matplotlib, which make writing data analysis code in Python a lot easier!
What will I learn?
After completing the project, you will:
- Know all the steps involved in a typical data analysis process
- Be comfortable posing questions that can be answered with a given dataset and then answering those questions
- Know how to investigate problems in a dataset and wrangle the data into a format you can use
- Have practice communicating the results of your analysis
- Be able to use vectorized operations in NumPy and Pandas to speed up your data analysis code
- Be familiar with Pandas’ Series and DataFrame objects, which let you access your data more conveniently
- Know how to use Matplotlib to produce plots showing your findings
Why is this Important to my Career?
This project will show off a variety of data analysis skills, as well as showing potential employers that you know how to go through the entire data analysis process.
Introduction
For the final project, you will conduct your own data analysis and create a file to share that documents your findings. You should start by taking a look at your dataset and brainstorming what questions you could answer using it. Then you should use Pandas and NumPy to answer the questions you are most interested in, and create a report sharing the answers. You will not be required to use statistics or machine learning to complete this project, but you should make it clear in your communications that your findings are tentative. This project is open-ended in that we are not looking for one right answer.
Step One — Choose Your Data Set
Choose one of the following datasets to analyze for your project:
- Titanic Data — Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.
- Baseball Data — A data set containing complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. This dataset contains many files, but you can choose to analyze only the one(s) you are most interested in.
- Choose the comma-delimited version, which contains CSV files.
Step Two — Get Organized
Eventually you’ll want to share your project with friends, family, and employers. Get organized before you begin. We recommend creating a single folder that will eventually contain:
- The report communicating your findings
- Any Python code you wrote as part of your analysis
- The data set you used (which you will not need to submit)
You may wish to use IPython notebook, in which case you can share both the code you wrote and the report of your findings in the same document. Otherwise, you will need to store your report and code separately.
Step Three — Analyze Your Data
Brainstorm some questions you could answer using the data set you chose, then start answering those questions. Here are some ideas to get you started:
- Titanic Data
- What factors made people more likely to survive?
- Baseball Data
- What is the relationship between different performance metrics? Do any have a strong negative or positive relationship?
- What are the characteristics of baseball players with the highest salaries?
Make sure you use NumPy and Pandas where they are appropriate!
Step Four — Share Your Findings
Once you have finished analyzing the data, create a report that shares the findings you found most interesting. You might wish to use IPython notebook to share your findings alongside the code you used to perform the analysis, but you can also use another tool if you wish.
Step Five — Review
Use the Project Rubric to review your project. If you are happy with your project, then you’re finished! If you see room for improvement, keep working to improve your project.
Supporting Materials
Evaluation
Use the Project Rubric to review your project. If you are happy with your project, then you are ready to share it with others for feedback! If you see room for improvement in any category in which you do not meet specifications, keep working!
You may wish to ask those who review your work to give feedback according to the same Project Rubric.
Sharing your work
Ready to share your work? Send an email to the person who will give you feedback, with the following:
- A PDF or HTML file containing your analysis. This file should include:
  - A note specifying which dataset you analyzed
  - A statement of the question(s) you posed
  - A description of what you did to investigate those questions
  - Documentation of any data wrangling you did
  - Summary statistics and plots communicating your final results
- If the code you used to perform your analysis is not included in the above, you can attach the code separately in .py file(s).
- A list of Web sites, books, forums, blog posts, github repositories, etc. that you referred to or used in creating your submission (add N/A if you did not use any such resources).
IPython notebook instructions
If you used IPython notebook to create your analysis, you can download your notebook as an HTML file. Click on File -> Download as -> HTML (.html) within the notebook. This way, your reviewer will not need to have IPython notebook installed to view your work. If you get an error about “No module name”, then open a terminal and try installing the missing module using pip install <module_name> (don't include the "<" or ">" or any words following a period in the module name).