Pandas

By Asabeneh Yetayeh

Pandas is an open-source, high-performance, easy-to-use Python library for data structures and data analysis. It adds data structures and tools designed to work with table-like data, namely Series and DataFrames, and it provides tools for data manipulation.

If you are using Anaconda, you do not have to install pandas.

If you prefer videos, a video tutorial is also available.

Installing Pandas

For Mac:

pip install pandas
# or, if you use conda (Anaconda or Miniconda):
conda install pandas

For Windows:

pip install pandas

Pandas Series and DataFrames

Pandas data structures are based on Series and DataFrames.

Pandas Series

A Series is a single column, and a DataFrame is a two-dimensional table made up of a collection of Series. To create a pandas Series, we can use a one-dimensional NumPy array or a Python list. Let us see an example of a Series:

Names Pandas Series

[Figure: a pandas Series of names]

Countries Series

[Figure: a pandas Series of countries]

Cities Series

[Figure: a pandas Series of cities]

As you can see, a pandas Series is just one column of data. If we want multiple columns, we use DataFrames.

Let us see an example of a pandas DataFrame:

[Figure: a pandas DataFrame]

A DataFrame is a collection of rows and columns. The table below has many more columns than the example above:

[Figure: a pandas DataFrame with more columns]

Next, we will see how to import pandas and how to create Series and DataFrames using pandas.

Importing Pandas

import pandas as pd  # importing pandas as pd
import numpy as np   # importing numpy as np

Creating Pandas Series from list
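
A minimal sketch of building a Series from a Python list. The values are illustrative, not taken from the original lesson.

```python
import pandas as pd

nums = [1, 2, 3, 4, 5]
s = pd.Series(nums)  # pandas assigns the default index 0, 1, 2, ...
print(s)
```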

Getting the index from the Pandas Series
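
As a small illustration (with made-up values), the index of a Series is available through its `index` attribute.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s.index)  # RangeIndex(start=0, stop=5, step=1)
```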

Creating Pandas Series with custom index
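
One way to sketch a custom index is to pass the `index` argument when creating the Series. The labels and values below are illustrative.

```python
import pandas as pd

fruits = pd.Series(['Orange', 'Banana', 'Mango'], index=['a', 'b', 'c'])
print(fruits)
```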

Creating Pandas Series from a Dictionary
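
A minimal sketch with a made-up dictionary: the keys become the index labels and the values become the data.

```python
import pandas as pd

# Illustrative data only
person = {'name': 'Asabeneh', 'country': 'Finland', 'city': 'Helsinki'}
s = pd.Series(person)
print(s)
```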

Creating a Constant Pandas Series
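
Passing a single value together with an index repeats that value for every label. A short sketch with illustrative numbers:

```python
import pandas as pd

s = pd.Series(10, index=[1, 2, 3])  # the scalar 10 is repeated for each label
print(s)
```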

Creating a Pandas Series Using Linspace
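
A sketch using NumPy's `linspace`, which returns evenly spaced values between a start and an end point (both included by default). The range 5 to 20 and the count 10 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.linspace(5, 20, 10))  # 10 evenly spaced values from 5 to 20
print(s)
```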

Accessing specific item from Pandas Series

We can use the label to access Pandas series values. We can give labels to each item using the index argument
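
A minimal sketch of label-based access, assuming a Series with string labels (the labels and values are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

single = s['b']             # one value, selected by its label
several = s.loc[['a', 'c']]  # several values, selected by a list of labels
print(single)
print(several)
```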

Copying a Pandas Series with copy method

Working on the copy does not affect the original data
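
The following sketch, with illustrative values, shows that changing the copy leaves the original Series untouched.

```python
import pandas as pd

original = pd.Series([1, 2, 3])
duplicate = original.copy()  # an independent copy of the data

duplicate.iloc[0] = 100      # change only the copy
print(original.iloc[0])      # the original still holds 1
```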

DataFrames

A pandas DataFrame has both rows and columns. It is a two-dimensional data structure, like a 2-dimensional NumPy array or a table. It can be created in different ways.

Creating DataFrames from List of Lists
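
A minimal sketch: each inner list is one row, and the column names are passed separately. The names and places are illustrative.

```python
import pandas as pd

data = [
    ['Asabeneh', 'Finland', 'Helsinki'],
    ['David', 'UK', 'London'],
    ['John', 'Sweden', 'Stockholm'],
]
df = pd.DataFrame(data, columns=['Name', 'Country', 'City'])
print(df)
```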

Creating DataFrame Using Dictionary
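
A sketch with a dictionary of lists, where each key becomes a column name and each list becomes that column's values. The data is made up.

```python
import pandas as pd

data = {
    'Name': ['Asabeneh', 'David', 'John'],
    'Country': ['Finland', 'UK', 'Sweden'],
    'City': ['Helsinki', 'London', 'Stockholm'],
}
df = pd.DataFrame(data)
print(df)
```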

Creating DataFrames from a List of Dictionaries
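
A sketch with a list of dictionaries, where each dictionary is one row and its keys are the column names. The data is made up.

```python
import pandas as pd

data = [
    {'Name': 'Asabeneh', 'Country': 'Finland', 'City': 'Helsinki'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'},
]
df = pd.DataFrame(data)
print(df)
```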

Reading different file formats Using Pandas

We can read txt, json, csv, tsv and xls files using the pandas reader functions, such as read_csv().

Reading CSV File Using Pandas

Loading a CSV file
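
A minimal sketch of `read_csv()`. To keep the example self-contained, it reads from an in-memory string with made-up values and assumed column names; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = "Gender,Height,Weight\nMale,73.8,241.9\nFemale,58.9,102.1\n"

df = pd.read_csv(io.StringIO(csv_text))  # a path such as 'some_file.csv' also works
print(df)
```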

Data Exploration

Data exploration is an initial stage of data analysis in which we explore and visualize the data to get early insights or to identify patterns for further analysis.

Reading the first few records of a dataset using head()

The head() method gives 5 records by default; however, an argument can be passed to the head() method.

With an argument, head() returns that many records. If we pass 10 to head(), we get the first 10 records.
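
A short sketch on a made-up DataFrame with 20 rows:

```python
import pandas as pd

df = pd.DataFrame({'value': range(20)})

first_five = df.head()    # 5 rows by default
first_ten = df.head(10)   # the first 10 rows
print(first_five)
```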

Reading the last records of a dataset

To explore the last five records of the dataset, we use the tail() method. We can get fewer or more records by changing the argument we pass to tail().
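
The same idea for `tail()`, again on a made-up DataFrame with 20 rows:

```python
import pandas as pd

df = pd.DataFrame({'value': range(20)})

last_five = df.tail()    # 5 rows by default
last_three = df.tail(3)  # the last 3 rows
print(last_three)
```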

Number of Columns

Knowing the fields or attributes of the dataset is one part of data exploration. In this dataset there are only three columns, but most of the time the number of columns is larger than this. Therefore, it is good to know how to get the column names and the number of columns. We will use the .columns DataFrame attribute to get the column names.
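
A sketch with an illustrative three-column DataFrame (the column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female'],
                   'Height': [73.8, 58.9],
                   'Weight': [241.9, 102.1]})

column_names = list(df.columns)  # the column labels as a plain list
column_count = len(df.columns)   # how many columns there are
print(column_names, column_count)
```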

DataFrame shape

The DataFrame shape helps us understand the dataset better. It gives the number of rows and columns.
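
For example, with a small made-up DataFrame, `shape` is a tuple of the form (rows, columns):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female'],
                   'Height': [73.8, 58.9],
                   'Weight': [241.9, 102.1]})

rows, cols = df.shape
print(rows, cols)  # 2 3
```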

Descriptive Statistics

Descriptive statistics summarize a given data set, which can be either the entire population or a sample of it. Descriptive statistics are divided into measures of central tendency and measures of variability (spread).

Measures of central tendency include the mean, median, and mode.

Measures of variability include the standard deviation, variance, and range (minimum and maximum).

The pandas describe() method provides descriptive statistics of a dataset. The method takes a few optional arguments (versions of pandas before 2.0 also accepted datetime_is_numeric):

DataFrame.describe(percentiles=None, include=None, exclude=None)
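
A minimal sketch of `describe()` on made-up numeric data. The result is itself a DataFrame whose rows are the statistics.

```python
import pandas as pd

df = pd.DataFrame({'Height': [1.70, 1.80, 1.90],
                   'Weight': [60, 70, 80]})

stats = df.describe()  # count, mean, std, min, quartiles and max per numeric column
print(stats)
print(stats.loc['mean', 'Weight'])
```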

Get information about the dataset

It is also possible to get some information about the dataset using the info() method, which takes a few optional arguments (versions of pandas before 2.0 also accepted null_counts, an older name for show_counts):

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
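
A sketch of `info()` on made-up data. By default it prints a summary and returns None; here the `buf` argument is used to capture the text so it can be inspected.

```python
import io
import pandas as pd

df = pd.DataFrame({'Height': [1.70, 1.80, 1.90],
                   'Weight': [60, 70, 80]})

df.info()                  # prints the summary to standard output

buffer = io.StringIO()
df.info(buf=buffer)        # writes the same summary into the buffer instead
summary = buffer.getvalue()
```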

Modifying a DataFrame

There are several ways to modify a DataFrame:

- We can create a new DataFrame
- We can create a new column and add it to the DataFrame
- We can remove an existing column from a DataFrame
- We can modify an existing column in a DataFrame
- We can change the data type of column values in the DataFrame

Creating a DataFrame

As we have seen before, it is possible to create a DataFrame from a list of lists, a list of dictionaries, or a dictionary.

As always, first we import the necessary packages. Now, let's import pandas and numpy, two best friends ever.

Adding a New Column

Let's add a weight column to the DataFrame

Let's add a height column to the DataFrame as well
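
A sketch of adding columns by assignment. The names, weights (kg) and heights (cm) are illustrative values.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John']})

df['Weight'] = [74, 78, 69]     # new column, one value per row
df['Height'] = [173, 175, 169]  # heights in centimeters
print(df)
```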

As you can see in the DataFrame above, we added new columns, Weight and Height. Let's add one additional column called BMI (Body Mass Index), calculated from each person's mass and height. BMI is mass divided by height squared, with height in meters: Weight / (Height * Height).

As you can see, the height is in centimeters, so we should change it to meters. Let's modify the height column.

Modifying column values
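
A minimal sketch, assuming weights in kilograms and heights in centimeters (illustrative values): the height column is converted to meters and then used to compute BMI.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Weight': [74, 78, 69],
                   'Height': [173, 175, 169]})

df['Height'] = df['Height'] * 0.01                        # centimeters -> meters
df['BMI'] = df['Weight'] / (df['Height'] * df['Height'])  # mass / height squared
print(df)
```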

Formatting DataFrame columns

The BMI column values of the DataFrame are floats with many digits after the decimal point. Let's round them to one digit after the decimal point.
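
One way to do this is the `round()` method of a Series. The BMI values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'BMI': [24.7252, 25.4694, 24.1588]})

df['BMI'] = df['BMI'].round(1)  # keep one digit after the decimal point
print(df)
```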

The information in the DataFrame seems incomplete; let's add birth year and current year columns.

Copying a DataFrame

Sometimes we want to work on a copy of the original DataFrame and keep the original intact.
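
A short sketch with illustrative data: changes made to the copy do not reach the original DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David'], 'Weight': [74, 78]})

df_copy = df.copy()            # an independent copy
df_copy.loc[0, 'Weight'] = 100  # modify only the copy

print(df.loc[0, 'Weight'])      # the original still holds 74
```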

Deleting a DataFrame Column

Deleting Columns

To delete one or more DataFrame columns, we pass the column names to the drop() method and set the axis to 1.
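
A sketch using `drop()` with made-up columns. Note that `drop()` returns a new DataFrame, so the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David'],
                   'Weight': [74, 78],
                   'Height': [1.73, 1.75]})

df = df.drop(['Weight', 'Height'], axis=1)  # axis=1 means "columns"
print(df)
```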

Deleting Rows

The seventh row does not have full information, and it is not important to keep it in the dataset. Let's remove the seventh row.
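
A sketch of removing a row by its index label, assuming a made-up DataFrame with the default integer index, so the seventh row has label 6.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'Age': [20, 21, 22, 23, 24, 25, None]})  # last row is incomplete

df = df.drop(index=6)  # remove the row whose index label is 6
print(df)
```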

Renaming Columns
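
Columns can be renamed with `rename()` and a mapping from old names to new names. The column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David'], 'Country': ['Finland', 'UK']})

df = df.rename(columns={'Name': 'Full Name'})  # old name -> new name
print(df.columns)
```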

Checking data types of Column values

Now do the same for the current year:

Now the column values of birth year and current year are integers, so we can calculate the age.
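
The steps above can be sketched as follows, assuming the year columns were stored as strings. The birth years are chosen to reproduce the ages mentioned in the text (251, 35 and 30) with 2020 as the current year; the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Birth Year': ['1769', '1985', '1990'],
                   'Current Year': ['2020', '2020', '2020']})

print(df.dtypes)  # inspect the data type of each column

df['Birth Year'] = df['Birth Year'].astype(int)      # string -> integer
df['Current Year'] = df['Current Year'].astype(int)  # same for the current year

df['Ages'] = df['Current Year'] - df['Birth Year']
print(df)
```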

The person in the first row has lived for 251 years so far. It is unlikely for someone to live that long. Either it is a typo or the data is cooked. So let's fill that value with the average of the column, excluding the outlier.

mean = (35 + 30) / 2

We can use the iloc indexer to impute the value: DataFrame.iloc[row, col]
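
A minimal sketch of the imputation. The ages are stored as floats here (an assumption) so that the mean 32.5 can be written into the column without changing its type.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Ages': [251.0, 35.0, 30.0]})

mean = (35 + 30) / 2                # average without the outlier: 32.5
col = df.columns.get_loc('Ages')    # integer position of the Ages column
df.iloc[0, col] = mean              # row 0, column col
print(df)
```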

Selecting Column(s)

We can select one or more columns from a pandas DataFrame.
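
A sketch with made-up data: a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David'],
                   'Country': ['Finland', 'UK'],
                   'City': ['Helsinki', 'London']})

names = df['Name']             # one column -> Series
subset = df[['Name', 'City']]  # list of columns -> DataFrame
print(subset)
```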

Boolean Indexing

We can use boolean indexing to select a subset of the rows. In the example below, we are selecting rows with birth year less than 1900.
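
A sketch of that selection on illustrative data (the column name `Birth Year` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Birth Year': [1769, 1985, 1990]})

mask = df['Birth Year'] < 1900  # a boolean Series, one True/False per row
older = df[mask]                # keep only the rows where the mask is True
print(older)
```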

Using apply method to modify pandas data frame
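
As a small illustration with made-up names, `apply()` calls a function on every value of a column and returns the results as a new Series.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John']})

df['Name'] = df['Name'].apply(str.upper)       # call str.upper on each name
df['Name Length'] = df['Name'].apply(len)      # any one-argument function works
print(df)
```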

Accessing row(s) using loc attribute

Pandas has a loc attribute that returns one or more rows by their index labels. If the DataFrame does not have named indexes, the labels are the default non-negative integers that start from zero. However, if the DataFrame has named indexes, we can use those names with the loc attribute to access the row(s).

Accessing a single row

Accessing Multiple Rows

To access multiple rows, we pass a list of the row indexes.
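
A sketch of both cases on a made-up DataFrame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Country': ['Finland', 'UK', 'Sweden']})

one_row = df.loc[0]         # a single label -> that row as a Series
two_rows = df.loc[[0, 2]]   # a list of labels -> a DataFrame with those rows
print(one_row)
print(two_rows)
```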

Named Indexes in a Pandas DataFrame
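
A minimal sketch, assuming a DataFrame whose rows are labeled with the made-up names 'a', 'b' and 'c':

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Country': ['Finland', 'UK', 'Sweden']},
                  index=['a', 'b', 'c'])

one_row = df.loc['b']           # access a row by its name
two_rows = df.loc[['a', 'c']]   # access several rows by their names
print(two_rows)
```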

Accessing DataFrame values using iloc method

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:

- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.

.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
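
The input types and the out-of-bounds behavior described above can be sketched on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Age': [35, 30, 25]})

first_row = df.iloc[0]                   # an integer -> one row as a Series
picked = df.iloc[[2, 0]]                 # a list of integers -> rows in that order
sliced = df.iloc[0:2]                    # a slice -> rows at positions 0 and 1
masked = df.iloc[[True, False, True]]    # a boolean array -> rows 0 and 2
by_callable = df.iloc[lambda d: [0, 1]]  # a callable that returns a valid indexer
clipped = df.iloc[1:100]                 # an out-of-bounds slice is allowed

try:
    df.iloc[10]                          # an out-of-bounds integer is not
except IndexError as error:
    print('IndexError:', error)
```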

Concatenating DataFrames

We can use the concat() function to join different DataFrames, either vertically (row-wise, the default) or horizontally (with axis=1).
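
A sketch of `pd.concat()` in both directions, using made-up DataFrames. By default rows are stacked; `axis=1` places the frames side by side.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Asabeneh', 'David'], 'Country': ['Finland', 'UK']})
df2 = pd.DataFrame({'Name': ['John', 'Peter'], 'Country': ['Sweden', 'Norway']})
ages = pd.DataFrame({'Age': [35, 30]})

stacked = pd.concat([df1, df2], ignore_index=True)  # rows of df2 below rows of df1
side_by_side = pd.concat([df1, ages], axis=1)       # columns of ages next to df1
print(stacked)
print(side_by_side)
```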

Cleaning Data

Data cleaning takes up most of the time in a data analysis workflow, and it is one of the first stages of that workflow.

Dropping null values
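
A minimal sketch with `dropna()`, which by default removes every row that contains at least one missing value. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Age': [30, None, 40]})  # None becomes NaN in a numeric column

cleaned = df.dropna()  # returns a new DataFrame without the incomplete row
print(cleaned)
```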

Imputation
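
Instead of dropping rows, missing values can be filled in. One common choice, sketched here on made-up data, is to fill them with the column mean using `fillna()`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'John'],
                   'Age': [30, None, 40]})

mean_age = df['Age'].mean()              # NaN is skipped: (30 + 40) / 2
df['Age'] = df['Age'].fillna(mean_age)   # replace missing values with the mean
print(df)
```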

Removing Duplicates

We can check for duplicates using the duplicated() method. It returns True for each row that duplicates an earlier row and False otherwise. The data below has a duplicate.
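
A sketch on made-up data in which the third row repeats the first. `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Asabeneh', 'David', 'Asabeneh'],
                   'Country': ['Finland', 'UK', 'Finland']})

flags = df.duplicated()         # True for rows that repeat an earlier row
deduped = df.drop_duplicates()  # keep only the first occurrence of each row
print(flags)
print(deduped)
```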

Using descriptive Statistics methods
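
Besides `describe()`, individual statistics are available as methods on a column. A short sketch with illustrative weights:

```python
import pandas as pd

df = pd.DataFrame({'Weight': [60, 70, 80]})

mean_weight = df['Weight'].mean()
median_weight = df['Weight'].median()
std_weight = df['Weight'].std()      # sample standard deviation
min_weight = df['Weight'].min()
max_weight = df['Weight'].max()
print(mean_weight, median_weight, std_weight, min_weight, max_weight)
```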

Use value_counts method
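
`value_counts()` counts how often each distinct value occurs in a column. The countries below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Finland', 'UK', 'Finland', 'Sweden', 'Finland']})

counts = df['Country'].value_counts()  # most frequent values come first
print(counts)
```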

Using groupby method to group columns
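
A minimal sketch of `groupby()` on made-up data: rows are grouped by country and the mean weight is computed per group.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Finland', 'UK', 'Finland'],
                   'Weight': [74, 78, 70]})

mean_by_country = df.groupby('Country')['Weight'].mean()
print(mean_by_country)
```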

Visualization by plotting data

Exercises

  1. Read the hacker_news.csv file from the data directory
  2. Get the first five rows
  3. Get the last five rows
  4. Get the title column as a pandas Series
  5. Count the number of rows and columns
    • Filter the titles which contain python
    • Filter the titles which contain JavaScript
    • Explore the data and make sense of it