Python pandas DataFrame

A Pandas DataFrame in Python is a two-dimensional data structure that stores data in a tabular format of rows and columns. In this article, we show how to create a Pandas DataFrame, access its data, and alter its rows and columns. Next, we discuss transposing a DataFrame, iterating over DataFrame rows, and so on.

How to create a DataFrame in Python?

In practice, we use a Pandas DataFrame to load data from SQL Server, text files, Excel files, or CSV files. Next, we slice and dice that data as per our requirements. Once the data is in the required format, we use it to create reports, charts, or graphs using the matplotlib module.
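As a minimal sketch of loading data, the example below reads CSV text into a DataFrame with read_csv. The CSV content here is made up for illustration and is held in memory; in practice you would usually pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here in place of a real file
csv_text = "name,Age,Salary\nJohn,25,1000000\nMike,32,1200000\n"

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```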

Create an Empty DataFrame in pandas

This is a simple example of creating a DataFrame in Python. Here, we create an empty DataFrame.

import pandas as pd
 
data = pd.DataFrame()
print(data)

Creating an Empty DataFrame in Python output

Empty DataFrame
Columns: []
Index: []

Create pandas DataFrame from List

Here, we create a list of Python integer values. Next, we use the pandas.DataFrame function to convert that list to a DataFrame.

import pandas as pd
 
table = [1, 2, 3, 4, 5]

data = pd.DataFrame(table)
print(data)

Create DataFrame from List

   0
0  1
1  2
2  3
3  4
4  5

Creating a pandas DataFrame from a mixed list. Here, we use a nested list, so the DataFrame has multiple rows and columns.

import pandas as pd
 
table = [[1, 'Suresh'], [2, 'Python'], [3, 'Hello']]

data = pd.DataFrame(table)
print(data)

Convert Mixed List to DataFrame output

   0       1
0  1  Suresh
1  2  Python
2  3   Hello

Assign names to the columns of a Pandas DataFrame using the columns argument.

import pandas as pd
 
table = [[1, 'Suresh'], [2, 'Python'], [3, 'Hello']]
data = pd.DataFrame(table, columns = ['S.No', 'Name'])
print(data)

DataFrame column Names output

   S.No    Name
0     1  Suresh
1     2  Python
2     3   Hello

Python DataFrame of Random Numbers

To create a Pandas DataFrame of random numbers, we use the numpy random.randn function to generate an 8 x 4 array of random numbers. Next, we use the DataFrame function to convert that array to a DataFrame.

import numpy as np
import pandas as pd
 
d_frame = pd.DataFrame(np.random.randn(8, 4))
print(d_frame)

DataFrame of Random Numbers output

          0         1         2         3
0 -0.492116 -0.824771 -0.869890 -1.753722
1 -0.733930  0.947616  0.089861  0.888474
2 -0.948483 -1.002449 -0.283761 -0.207897
3  0.013346  2.059951  1.064830  0.830474
4  0.289157 -0.418271 -0.770464  0.223895
5 -0.781827 -0.396441  0.123848 -0.824002
6  0.667090  0.183589  1.212163  0.231251
7  1.067570 -0.615639  0.461147 -1.365541
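The numbers above change on every run. If you need repeatable values, one option is a seeded NumPy generator. This is a sketch that swaps in np.random.default_rng for the randn call used above; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(seed=42)
d_frame = pd.DataFrame(rng.standard_normal((8, 4)))

# A second generator with the same seed yields an identical DataFrame
same_frame = pd.DataFrame(np.random.default_rng(seed=42).standard_normal((8, 4)))
print(d_frame.equals(same_frame))
```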

Python Pandas DataFrame from dict

Pandas allows you to create a DataFrame from a dict (dictionary), and it is straightforward. All you have to do is declare a dictionary of values and then use the DataFrame function to convert that dictionary to a DataFrame.

import pandas as pd
 
table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
 
data = pd.DataFrame(table)
print(data)

To create a Pandas DataFrame from a dict, the values of all keys must have the same length; otherwise, it raises an error. Next, if you pass index values, their length must match the length of the values; otherwise, it raises an error. If you don’t pass any index values, pandas automatically creates an index from 0 to n-1.

     name   Salary
0    John  1000000
1    Mike  1200000
2  Suresh   900000
3   Tracy  1100000
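To illustrate the length rule, the following sketch passes a dictionary whose lists have different lengths and catches the resulting ValueError. The exact error text depends on the pandas version, so it is printed rather than assumed.

```python
import pandas as pd

# 'name' has three values but 'Salary' has only two
table = {'name': ['John', 'Mike', 'Suresh'],
         'Salary': [1000000, 1200000]
         }

try:
    pd.DataFrame(table)
    error_message = None
except ValueError as err:
    # pandas refuses to build a DataFrame from unequal-length lists
    error_message = str(err)

print(error_message)
```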

Let me take another example of creating a pandas DataFrame from a dictionary. This time, we convert a dictionary with four columns of data to a DataFrame.

import pandas as pd
table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data = pd.DataFrame(table)
print(data)

How to create pandas DataFrame from dict of lists

If placing everything in a single statement is confusing, divide it into parts. Here, we declare four lists of items and then assign one to each column.

import pandas as pd

names = ['John', 'Mike', 'Suresh', 'Tracy']
ages = [25, 32, 30, 26]
professions = ['Developer', 'Analyst', 'Admin', 'HR']
salaries = [1000000, 1200000, 900000, 1100000]

table = {'name': names,
         'Age': ages,
         'Profession': professions,
         'Salary': salaries
         }

data = pd.DataFrame(table)
print(data)

Creating DataFrame from dict of lists output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

Python Pandas DataFrame of Dates

Using the pandas module, you can also create a DataFrame with a series of dates. Let me create a DataFrame of dates from 2019-01-01 to 2019-01-08. By changing the periods value, you can generate a longer or shorter date sequence.

import numpy as np
import pandas as pd

dates = pd.date_range('20190101', periods = 8)
print(dates)
print()

# Pass the column names as a list: a set has no defined order
d_frame = pd.DataFrame(np.random.randn(8, 4), index = dates,
                       columns = ['kiwis', 'apples', 'oranges', 'bananas'])
print(d_frame)

DataFrame of Dates output

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08'],
              dtype='datetime64[ns]', freq='D')

               kiwis    apples   oranges   bananas
2019-01-01 -0.393538 -0.406943  1.612431  1.089230
2019-01-02  1.070080 -1.565538  0.727056  1.677534
2019-01-03 -1.324169  0.256827  1.332544 -2.952971
2019-01-04  0.419778 -0.562119  0.507846 -0.223730
2019-01-05  0.175785  1.566511 -1.832633  2.035536
2019-01-06  0.541516 -0.113477  0.444046  0.387718
2019-01-07  0.247760 -1.143530  0.615681  0.400743
2019-01-08 -0.242328  0.913758 -0.088591 -0.533690

Pandas DataFrame Columns

This example shows how to reorder the columns in a DataFrame. By default, a DataFrame uses the column order of the actual data. However, you can use the columns argument to alter the position of any column. Let me move Age from the 2nd position to the 4th.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data1 = pd.DataFrame(table)
print(data1)

print('\n---- After Changing the Column Order-----')
data2 = pd.DataFrame(table, columns = ['name', 'Profession', 'Salary', 'Age'])
print(data2)

Please be careful while using the columns argument. If you specify a non-existing column name or make a typo, that column is filled with NaN. Let me use the Qualification column name (which doesn’t exist).

print('\n---- Using Wrong Column -----')
data3 = pd.DataFrame(table, columns = ['name', 'Qualification', 'Salary', 'Age'])
print(data3)

DataFrame columns argument output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

---- After Changing the Column Order-----
     name Profession   Salary  Age
0    John  Developer  1000000   25
1    Mike    Analyst  1200000   32
2  Suresh      Admin   900000   30
3   Tracy         HR  1100000   26

---- Using Wrong Column -----
     name Qualification   Salary  Age
0    John           NaN  1000000   25
1    Mike           NaN  1200000   32
2  Suresh           NaN   900000   30
3   Tracy           NaN  1100000   26
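The columns argument applies when building a DataFrame from a dictionary. For a DataFrame that already exists, a common alternative is to select the columns in the order you want. The sketch below also shows that, unlike the columns argument, this selection raises a KeyError for a column that doesn't exist.

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary': [1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)

# Selecting with a list of names returns the columns in that order
reordered = data[['name', 'Profession', 'Salary', 'Age']]
print(reordered)

# A missing column raises KeyError instead of producing NaN
try:
    data[['name', 'Qualification']]
    missing_allowed = True
except KeyError:
    missing_allowed = False
print(missing_allowed)
```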

The DataFrame columns attribute returns the column labels of a DataFrame as an Index, in the same order as the columns. The dtype shown is the data type of that Index of labels, not of the DataFrame data.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data1 = pd.DataFrame(table)
print(data1)

data2 = pd.DataFrame(table, columns = ['name', 'Profession', 'Salary', 'Age'])

data3 = pd.DataFrame(table, columns = ['name', 'Qualification', 'Salary', 'Age'])

print(data1.columns)
print(data2.columns)
print(data3.columns)

DataFrame columns attribute output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000
Index(['name', 'Age', 'Profession', 'Salary'], dtype='object')
Index(['name', 'Profession', 'Salary', 'Age'], dtype='object')
Index(['name', 'Qualification', 'Salary', 'Age'], dtype='object')

Pandas DataFrame Index

By default, pandas assigns index values from 0 to n-1, where n is the number of rows. However, you can replace those default index values using the index argument. Here, we use it to assign the letters a to d as the index values.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

#Without Index Values - Uses Default Values
data1 = pd.DataFrame(table)
print(data1)

# Index Values are a, b, c, d
data2 = pd.DataFrame(table, index = ['a', 'b', 'c', 'd'])
print('\n----After Setting Index Values----')
print(data2)

DataFrame Indexes output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

----After Setting Index Values----
     name  Age Profession   Salary
a    John   25  Developer  1000000
b    Mike   32    Analyst  1200000
c  Suresh   30      Admin   900000
d   Tracy   26         HR  1100000
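The index argument sets the labels at creation time. As a small sketch, you can also assign new labels to an existing DataFrame through its index attribute, provided the number of labels matches the number of rows.

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26]
         }
data = pd.DataFrame(table)

# Replace the default 0 to 3 index with letters
data.index = ['a', 'b', 'c', 'd']
print(data)
```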

You can use the DataFrame set_index function to set a column as the index. Here, we use set_index to set name as the index. Next, we use loc to show that we can extract information using an index label.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)
print(data)

print('\n---Setting name as an index---')
new_data = data.set_index('name')
print(new_data)

print('\n---Return Index John Details---')
print(new_data.loc['John'])

DataFrame set_index function output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

---Setting name as an index---
        Age Profession   Salary
name                           
John     25  Developer  1000000
Mike     32    Analyst  1200000
Suresh   30      Admin   900000
Tracy    26         HR  1100000

---Return Index John Details---
Age                  25
Profession    Developer
Salary          1000000
Name: John, dtype: object
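To undo set_index, the reset_index function moves the index back into a regular column and restores the default 0 to n-1 index. A minimal sketch:

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary': [1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)

new_data = data.set_index('name')

# reset_index turns the name index back into an ordinary column
restored = new_data.reset_index()
print(restored)
```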

Pandas DataFrame Attributes

The following sections cover some of the available attributes of a Pandas DataFrame.

Python DataFrame shape attribute

The Pandas DataFrame shape attribute returns a tuple of the number of rows and columns in a DataFrame.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data = pd.DataFrame(table)
print(data)
print('\n---Shape or Size of a DataFrame---')
print(data.shape)

DataFrame shape attribute output

     name  Age   Salary
0    John   25  1000000
1    Mike   32  1200000
2  Suresh   30   900000
3   Tracy   26  1100000

---Shape or Size of a DataFrame---
(4, 3)
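Because shape is a plain tuple, you can unpack it into separate row and column counts. The sketch below also shows len and the size attribute, which give the row count and the total number of cells.

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Salary': [1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)

rows, cols = data.shape
print(rows, cols)   # 4 3
print(len(data))    # 4, the number of rows
print(data.size)    # 12, rows multiplied by columns
```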

Python DataFrame values attribute

The DataFrame values attribute returns the DataFrame data (without column names) as a two-dimensional array.

import pandas as pd
table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data2 = pd.DataFrame(table, columns = ['name', 'Profession', 'Salary', 'Age'])

data3 = pd.DataFrame(table, columns = ['name', 'Qualification', 'Salary', 'Age'])

print('---Data2 Values--- ')
print(data2.values)

print('\n---Data3 Values--- ')
print(data3.values)

DataFrame values attribute output

---Data2 Values--- 
[['John' 'Developer' 1000000 25]
 ['Mike' 'Analyst' 1200000 32]
 ['Suresh' 'Admin' 900000 30]
 ['Tracy' 'HR' 1100000 26]]

---Data3 Values--- 
[['John' nan 1000000 25]
 ['Mike' nan 1200000 32]
 ['Suresh' nan 900000 30]
 ['Tracy' nan 1100000 26]]

Both arrays above have dtype object because these DataFrames hold mixed content (integers and strings). When every column shares one numeric type, values returns an array of that type instead. To show this, we use an integer-only DataFrame.

import pandas as pd

table = {'Age': [25, 32, 30, 26],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data4 = pd.DataFrame(table)
print(data4.values)

Integer DataFrame values output

[[     25 1000000]
 [     32 1200000]
 [     30  900000]
 [     26 1100000]]
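The pandas documentation recommends the to_numpy function as the alternative to values. The sketch below uses it to check the dtype of the resulting array for a mixed DataFrame and an integer-only one; the two small DataFrames are illustrative.

```python
import pandas as pd

mixed = pd.DataFrame({'name': ['John', 'Mike'], 'Age': [25, 32]})
numbers = pd.DataFrame({'Age': [25, 32], 'Salary': [1000000, 1200000]})

# Mixed strings and integers fall back to a generic object array
print(mixed.to_numpy().dtype)

# Integer-only columns keep an integer dtype
print(numbers.to_numpy().dtype)
```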

Pandas DataFrame name attribute

The index and the columns of a DataFrame each have a name attribute, which allows you to assign a name to the index or to the column headers.

import pandas as pd
table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data1 = pd.DataFrame(table)

table = {'Age': [25, 32, 30, 26],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data4 = pd.DataFrame(table)

data1.index.name = 'Emp No'
print(data1)
print()

data4.index.name = 'Cust No'
print(data4)

DataFrame name attribute output

          name  Age Profession   Salary
Emp No                                 
0         John   25  Developer  1000000
1         Mike   32    Analyst  1200000
2       Suresh   30      Admin   900000
3        Tracy   26         HR  1100000

         Age   Salary
Cust No              
0         25  1000000
1         32  1200000
2         30   900000
3         26  1100000

Similarly, we can use the columns name attribute to assign a name to the column headers.

import pandas as pd
table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data1 = pd.DataFrame(table)

table = {'Age': [25, 32, 30, 26],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }

data4 = pd.DataFrame(table)

data1.columns.name = 'Employee Details'
print(data1)
 
data4.columns.name = 'Customers Information'
print(data4)

DataFrame column headers output

Employee Details    name  Age Profession   Salary
0                   John   25  Developer  1000000
1                   Mike   32    Analyst  1200000
2                 Suresh   30      Admin   900000
3                  Tracy   26         HR  1100000
Customers Information  Age   Salary
0                       25  1000000
1                       32  1200000
2                       30   900000
3                       26  1100000

Python DataFrame dtypes attribute

The DataFrame dtypes attribute returns the data type of each column in a DataFrame.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[22.55, 12.66, 119.470, 200.190],
         'Salary':[10000, 12000, 9000, 11000]
         }

data = pd.DataFrame(table)
print(data)

print('\n---dtype attribute result---')
print(data.dtypes)

DataFrame dtypes attribute output

     name  Age Profession    Sale  Salary
0    John   25  Developer   22.55   10000
1    Mike   32    Analyst   12.66   12000
2  Suresh   30      Admin  119.47    9000
3   Tracy   26         HR  200.19   11000

---dtype attribute result---
name           object
Age             int64
Profession     object
Sale          float64
Salary          int64
dtype: object
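Once you know the column types, you can convert them with the astype function. This sketch converts the Age column to float64; astype returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

data = pd.DataFrame({'Age': [25, 32, 30, 26],
                     'Sale': [22.55, 12.66, 119.470, 200.190]})

# Convert only the Age column; other columns keep their types
converted = data.astype({'Age': 'float64'})

print(data.dtypes)
print(converted.dtypes)
```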

Python DataFrame describe function

Use the DataFrame describe function to get quick summary statistics for the numeric columns of a DataFrame.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[10000, 12000, 9000, 11000]
         }

data1 = pd.DataFrame(table)
print(data1)

print('\n---describe function result---')
print(data1.describe())

DataFrame describe function output

     name  Age Profession  Salary
0    John   25  Developer   10000
1    Mike   32    Analyst   12000
2  Suresh   30      Admin    9000
3   Tracy   26         HR   11000

---describe function result---
             Age        Salary
count   4.000000      4.000000
mean   28.250000  10500.000000
std     3.304038   1290.994449
min    25.000000   9000.000000
25%    25.750000   9750.000000
50%    28.000000  10500.000000
75%    30.500000  11250.000000
max    32.000000  12000.000000
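By default, describe skips the text columns, which is why name and Profession are missing above. As a sketch, passing include='all' adds them, with rows such as unique, top and freq that apply to non-numeric data.

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary': [10000, 12000, 9000, 11000]
         }
data1 = pd.DataFrame(table)

# include='all' summarises every column, not only the numeric ones
summary = data1.describe(include='all')
print(summary)
```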

How to access Python DataFrame Data?

The data in a Pandas DataFrame is stored in a tabular format of rows and columns. This means you can access DataFrame items using columns and rows.

Accessing Pandas DataFrame Columns

You can access DataFrame columns in two ways: by specifying the column name inside [] or by using dot notation. Both methods return the specified column as a Series.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[10000, 12000, 9000, 11000]
         }

data1 = pd.DataFrame(table)
data2 = pd.DataFrame(table, columns = ['name', 'Profession', 'Salary', 'Age'])

print('-----Accessing DataFrame Columns-----')
print(data1.Age)
print(data1['name'])
print(data2.Salary)

# We can also access multiple DataFrame columns 
print('-----Accessing Multiple DataFrame Columns-----')
print(data1[['Age', 'Profession']])
print(data2[['name', 'Salary']])

Accessing DataFrame columns output

-----Accessing DataFrame Columns-----
0    25
1    32
2    30
3    26
Name: Age, dtype: int64
0      John
1      Mike
2    Suresh
3     Tracy
Name: name, dtype: object
0    10000
1    12000
2     9000
3    11000
Name: Salary, dtype: int64
-----Accessing Multiple DataFrame Columns-----
   Age Profession
0   25  Developer
1   32    Analyst
2   30      Admin
3   26         HR
     name  Salary
0    John   10000
1    Mike   12000
2  Suresh    9000
3   Tracy   11000
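Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses two illustrative column names, S.No and count, where the [] form is the reliable choice.

```python
import pandas as pd

data = pd.DataFrame({'S.No': [1, 2], 'count': [10, 20]})

# A name containing a dot can only be reached with []
print(data['S.No'])

# data.count is the built-in DataFrame method, not the column
print(callable(data.count))   # True

# [] always returns the column
print(data['count'])
```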

This is another example of accessing pandas DataFrame columns.

import pandas as pd
table = {'name': ['Kane', 'John', 'Suresh', 'Tracy', 'Steve'],
         'Age': [35, 25, 32, 30, 29],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 119.470, 200.190, 44.55],
         'Salary':[12000, 10000, 14000, 11000, 14000]
         }
data = pd.DataFrame(table)
print(data)

print('\n---Select name column from DataFrame---')
print(data['name'])

print('\n---Select Profession and Sale column from DataFrame---')
print(data[['Profession', 'Sale']])

print('\n---Select Profession column from DataFrame---')
print(data.Profession)

Accessing DataFrame columns output 2

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000
3   Tracy   30      Admin  200.19   11000
4   Steve   29         HR   44.55   14000

---Select name column from DataFrame---
0      Kane
1      John
2    Suresh
3     Tracy
4     Steve
Name: name, dtype: object

---Select Profession and Sale column from DataFrame---
  Profession    Sale
0    Manager  422.19
1  Developer   22.55
2    Analyst  119.47
3      Admin  200.19
4         HR   44.55

---Select Profession column from DataFrame---
0      Manager
1    Developer
2      Analyst
3        Admin
4           HR
Name: Profession, dtype: object

Access Pandas DataFrame Rows

The rows of a Pandas DataFrame can also be accessed directly. Here, we use index slicing to return the required rows from a DataFrame. data[1:] returns all the rows from position 1 to n-1, and data[1:3] returns the rows at positions 1 and 2 (the end position, 3, is excluded).

import pandas as pd
table = {'name': ['Kane', 'John', 'Suresh', 'Tracy', 'Steve'],
         'Age': [35, 25, 32, 30, 29],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 119.470, 200.190, 44.55],
         'Salary':[12000, 10000, 14000, 11000, 14000]
         }
data = pd.DataFrame(table)
print(data)

print('\n---Select all rows from 1 to N in a DataFrame---')
print(data[1:])

print('\n---Select rows from 1 to 2 in a DataFrame---')
print(data[1:3])

print('\n---Select rows from 0 to 3 in a DataFrame---')
print(data[0:4])

print('\n---Select last row in a DataFrame---')
print(data[-1:])

Accessing DataFrame Rows output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000
3   Tracy   30      Admin  200.19   11000
4   Steve   29         HR   44.55   14000

---Select all rows from 1 to N in a DataFrame---
     name  Age Profession    Sale  Salary
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000
3   Tracy   30      Admin  200.19   11000
4   Steve   29         HR   44.55   14000

---Select rows from 1 to 2 in a DataFrame---
     name  Age Profession    Sale  Salary
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000

---Select rows from 0 to 3 in a DataFrame---
     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000
3   Tracy   30      Admin  200.19   11000

---Select last row in a DataFrame---
    name  Age Profession   Sale  Salary
4  Steve   29         HR  44.55   14000
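As a short sketch of the same idea, the head and tail functions return the first and last rows without writing a slice. The example also confirms that the end position of a slice is excluded.

```python
import pandas as pd

data = pd.DataFrame({'name': ['Kane', 'John', 'Suresh', 'Tracy', 'Steve'],
                     'Age': [35, 25, 32, 30, 29]})

# Positions 1 and 2 only: the end position 3 is excluded
middle = data[1:3]
print(middle)

# head(2) matches data[:2] and tail(1) matches data[-1:]
print(data.head(2))
print(data.tail(1))
```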

Pandas DataFrame loc Example

The Pandas DataFrame loc indexer is an important thing to understand. You can use loc[] to select more than one column and more than one row at a time, or to select a portion of a DataFrame by label. Here, we use loc to select rows from a DataFrame.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table, index = ['a', 'b', 'c', 'd'])
print(data)

print('\n---Select b row from a DataFrame---')
print(data.loc['b'])

print('\n---Select c row from a DataFrame---')
print(data.loc['c'])

print('\n---Select b and d rows from a DataFrame---')
print(data.loc[['b', 'd']])

DataFrame loc output

     name  Age Profession   Salary
a    John   25  Developer  1000000
b    Mike   32    Analyst  1200000
c  Suresh   30      Admin   900000
d   Tracy   26         HR  1100000

---Select b row from a DataFrame---
name             Mike
Age                32
Profession    Analyst
Salary        1200000
Name: b, dtype: object

---Select c row from a DataFrame---
name          Suresh
Age               30
Profession     Admin
Salary        900000
Name: c, dtype: object

---Select b and d rows from a DataFrame---
    name  Age Profession   Salary
b   Mike   32    Analyst  1200000
d  Tracy   26         HR  1100000

In the next example, the first statement, data.loc[:, ['name', 'Sale']], returns all the rows of the name and Sale columns. The last statement, data.loc[1:3, ['name', 'Profession', 'Salary']], returns the rows with index values 1 to 3 for the name, Profession and Salary columns. Unlike plain slicing, loc slices by label and includes the end label.

import pandas as pd
table = {'name': ['Kane', 'John', 'Suresh', 'Tracy', 'Steve'],
         'Age': [35, 25, 32, 30, 29],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 119.470, 200.190, 44.55],
         'Salary':[12000, 10000, 14000, 11000, 14000]
         }
data = pd.DataFrame(table)
print(data)

print('\n---Select name, Sale column in a DataFrame---')
print(data.loc[:, ['name', 'Sale']])

print('\n---Select name, Profession, Salary in a DataFrame---')
print(data.loc[:, ['name', 'Profession', 'Salary']])

# Note: loc includes the end label, so this returns rows 1, 2 and 3, not just 1 and 2
print('\n---Select rows from 1 to 2 in a DataFrame---')
print(data.loc[1:3, ['name', 'Profession', 'Salary']])

DataFrame loc function output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2  Suresh   32    Analyst  119.47   14000
3   Tracy   30      Admin  200.19   11000
4   Steve   29         HR   44.55   14000

---Select name, Sale column in a DataFrame---
     name    Sale
0    Kane  422.19
1    John   22.55
2  Suresh  119.47
3   Tracy  200.19
4   Steve   44.55

---Select name, Profession, Salary in a DataFrame---
     name Profession  Salary
0    Kane    Manager   12000
1    John  Developer   10000
2  Suresh    Analyst   14000
3   Tracy      Admin   11000
4   Steve         HR   14000

---Select rows from 1 to 2 in a DataFrame---
     name Profession  Salary
1    John  Developer   10000
2  Suresh    Analyst   14000
3   Tracy      Admin   11000

Pandas DataFrame iloc Example

Similar to loc[], a Pandas DataFrame has iloc[]. However, iloc accepts only integer positions to return data from a DataFrame. Positions start at 0, so iloc[1] returns the second row and iloc[3] returns the fourth.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table, index = ['a', 'b', 'c', 'd'])
print(data)

print('\n---Select 1st row from a DataFrame---')
print(data.iloc[1])

print('\n---Select 3rd row from a DataFrame---')
print(data.iloc[3])

print('\n---Select 1 and 3 rows from a DataFrame---')
print(data.iloc[[1, 3]])

DataFrame iloc function output

     name  Age Profession   Salary
a    John   25  Developer  1000000
b    Mike   32    Analyst  1200000
c  Suresh   30      Admin   900000
d   Tracy   26         HR  1100000

---Select 1st row from a DataFrame---
name             Mike
Age                32
Profession    Analyst
Salary        1200000
Name: b, dtype: object

---Select 3rd row from a DataFrame---
name            Tracy
Age                26
Profession         HR
Salary        1100000
Name: d, dtype: object

---Select 1 and 3 rows from a DataFrame---
    name  Age Profession   Salary
b   Mike   32    Analyst  1200000
d  Tracy   26         HR  1100000
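The difference between the two indexers is easiest to see with slices. In this sketch, loc slices by label and includes the end label, while iloc slices by position and excludes the end position, so the two statements return the same rows.

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26]
         }
data = pd.DataFrame(table, index=['a', 'b', 'c', 'd'])

# Label slice: rows b and c, end label included
by_label = data.loc['b':'c']

# Position slice: positions 1 and 2, end position 3 excluded
by_position = data.iloc[1:3]

print(by_label)
print(by_position)
```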

You can use loc, iloc, at and iat to access a single value in a DataFrame, as the following example shows.

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)
print(data)

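# Pass the row and column together instead of chaining two [] lookups
print('\nitem at 0, 0 in DataFrame   = ', data.iloc[0, 0])

# Column 1 is Age; loc needs the column label rather than its position
print('item at 0, 1 in DataFrame      = ', data.loc[0, 'Age'])

print('item at 1, Profession is       = ', data.loc[1, 'Profession'])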

print('item at 2, 3 in DataFrame      = ', data.iat[2, 3])

print('item at 0, Salary in DataFrame = ', data.at[0, 'Salary'])

Output of loc, iloc, at and iat to access a value from a DataFrame

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

item at 0, 0 in DataFrame   =  John
item at 0, 1 in DataFrame      =  25
item at 1, Profession is       =  Analyst
item at 2, 3 in DataFrame      =  900000
item at 0, Salary in DataFrame =  1000000
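The at and iat accessors can also set a single value, not just read it. A minimal sketch with illustrative replacement values:

```python
import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Salary': [1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)

# Set by row label and column label
data.at[0, 'Salary'] = 1050000

# Set by row position and column position (row 2, column 1 is Age)
data.iat[2, 1] = 31

print(data)
```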

How to add a New Column to Pandas DataFrame?

In this example, we show how to add a new column to an existing DataFrame. data['Sale'] = [422.19, 200.190, 44.55] adds a completely new column called Sale. data['Income'] = data['Salary'] + data['basic'] adds a new column, Income, by adding the values in the Salary and basic columns. The last statement adds New_Salary, a calculated column equal to Salary plus 25 percent.

import pandas as pd

table = {'name': ['Kane', 'Suresh', 'Tracy'],
         'Age': [35, 25, 29],
         'Profession': ['Manager', 'Developer', 'HR'],
         'Salary': [10000, 14000, 11000],
         'basic': [4000, 6000, 4500]
        }

data = pd.DataFrame(table)
print(data)

# Add New Column to DataFrame
data['Sale'] = [422.19, 200.190, 44.55]
print('\n---After adding New Column DataFrame---')
print(data)

# Add New Column using existing
data['Income'] = data['Salary'] + data['basic']
print('\n---Total Salary in a DataFrame---')
print(data)

# Add New Calculated Column to DataFrame
data['New_Salary'] = data['Salary'] + data['Salary'] * 0.25
print('\n---After adding New Column DataFrame---')
print(data)

Adding a New Column to Pandas DataFrame output

     name  Age Profession  Salary  basic
0    Kane   35    Manager   10000   4000
1  Suresh   25  Developer   14000   6000
2   Tracy   29         HR   11000   4500

---After adding New Column DataFrame---
     name  Age Profession  Salary  basic    Sale
0    Kane   35    Manager   10000   4000  422.19
1  Suresh   25  Developer   14000   6000  200.19
2   Tracy   29         HR   11000   4500   44.55

---Total Salary in a DataFrame---
     name  Age Profession  Salary  basic    Sale  Income
0    Kane   35    Manager   10000   4000  422.19   14000
1  Suresh   25  Developer   14000   6000  200.19   20000
2   Tracy   29         HR   11000   4500   44.55   15500

---After adding New Column DataFrame---
     name  Age Profession  Salary  basic    Sale  Income  New_Salary
0    Kane   35    Manager   10000   4000  422.19   14000     12500.0
1  Suresh   25  Developer   14000   6000  200.19   20000     17500.0
2   Tracy   29         HR   11000   4500   44.55   15500     13750.0
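Assigning with [] changes the DataFrame in place. If you would rather keep the original untouched, the assign function is an alternative that returns a new DataFrame with the extra column. A sketch:

```python
import pandas as pd

table = {'name': ['Kane', 'Suresh', 'Tracy'],
         'Salary': [10000, 14000, 11000],
         'basic': [4000, 6000, 4500]
         }
data = pd.DataFrame(table)

# assign returns a copy with the new Income column
with_income = data.assign(Income=data['Salary'] + data['basic'])

print(with_income)
print(data.columns.tolist())   # the original has no Income column
```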

Delete a Column from a DataFrame in Python

There are several ways to delete a column from a Pandas DataFrame: the del statement, the pop function, and the drop function. In this example, we use all three.

Here, del(data['basic']) deletes the basic column from the DataFrame. x = data.pop('Age') removes the Age column from the DataFrame and returns it, and we print that popped column as well. Next, we use the DataFrame drop function to delete the Sale column; drop returns a new DataFrame and leaves the original unchanged.

import pandas as pd

table = {'name': ['Kane', 'Suresh', 'Tracy'],
         'Age': [35, 25, 29],
         'Profession': ['Manager', 'Developer', 'HR'],
         'Salary': [10000, 14000, 11000],
         'basic': [4000, 6000, 4500],
         'Sale': [422.19, 200.190, 44.55]
        }

data = pd.DataFrame(table)
print(data)

# Delete existing Columns from DataFrame
del(data['basic'])
print('\n---After Deleting basic Column DataFrame---')
print(data)

x = data.pop('Age')
print('\n---After Deleting Age Column DataFrame---')
print(data)
print('\n---pop Column from DataFrame---')
print(x)

y = data.drop(columns = 'Sale')
print('\n---After Deleting Sale Column DataFrame---')
print(y)

Delete Column from a DataFrame output

     name  Age Profession  Salary  basic    Sale
0    Kane   35    Manager   10000   4000  422.19
1  Suresh   25  Developer   14000   6000  200.19
2   Tracy   29         HR   11000   4500   44.55

---After Deleting basic Column DataFrame---
     name  Age Profession  Salary    Sale
0    Kane   35    Manager   10000  422.19
1  Suresh   25  Developer   14000  200.19
2   Tracy   29         HR   11000   44.55

---After Deleting Age Column DataFrame---
     name Profession  Salary    Sale
0    Kane    Manager   10000  422.19
1  Suresh  Developer   14000  200.19
2   Tracy         HR   11000   44.55

---pop Column from DataFrame---
0    35
1    25
2    29
Name: Age, dtype: int64

---After Deleting Sale Column DataFrame---
     name Profession  Salary
0    Kane    Manager   10000
1  Suresh  Developer   14000
2   Tracy         HR   11000
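
The difference between the three approaches can be sketched as follows. This is a minimal illustration, and the names frame, popped and trimmed are made up for this sketch rather than taken from the example above. del and pop change the DataFrame in place, while drop returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

frame = pd.DataFrame({'name': ['Kane', 'Suresh'],
                      'Age': [35, 25],
                      'basic': [4000, 6000],
                      'Sale': [422.19, 200.19]})

del frame['basic']                      # removes the column in place
popped = frame.pop('Age')               # removes the column in place and returns it as a Series
trimmed = frame.drop(columns=['Sale'])  # returns a new DataFrame; frame is unchanged

print(list(frame.columns))    # ['name', 'Sale']
print(list(trimmed.columns))  # ['name']
print(popped.tolist())        # [35, 25]
```

If you want drop to behave like del, assign the result back, for example frame = frame.drop(columns=['Sale']).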

How to delete DataFrame Row in Python?

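In this Python example, we use the Pandas drop function to delete a DataFrame row by its index label. The drop function returns a new DataFrame and leaves the original unchanged, which is why row b is still present when we drop row a.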

import pandas as pd

table = {'name': ['Kane', 'Suresh', 'Tracy'],
         'Profession': ['Manager', 'Developer', 'HR'],
         'Salary': [10000, 14000, 11000],
         'Sale': [422.19, 200.190, 44.55]
        }

data = pd.DataFrame(table, index = ['a', 'b', 'c'])
print(data)

x = data.drop('b')
print('\n---After Deleting b row DataFrame---')
print(x)

y = data.drop('a')
print('\n---After Deleting a row DataFrame---')
print(y)

Deleting DataFrame Row output

     name Profession  Salary    Sale
a    Kane    Manager   10000  422.19
b  Suresh  Developer   14000  200.19
c   Tracy         HR   11000   44.55

---After Deleting b row DataFrame---
    name Profession  Salary    Sale
a   Kane    Manager   10000  422.19
c  Tracy         HR   11000   44.55

---After Deleting a row DataFrame---
     name Profession  Salary    Sale
b  Suresh  Developer   14000  200.19
c   Tracy         HR   11000   44.55
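
As a further sketch, drop also accepts a list of labels, and rows can be removed by condition with a boolean filter. The DataFrame staff and the 10500 threshold below are assumptions chosen only for illustration.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Kane', 'Suresh', 'Tracy'],
                      'Salary': [10000, 14000, 11000]},
                     index=['a', 'b', 'c'])

# Drop several rows at once by passing a list of index labels
without_a_c = staff.drop(index=['a', 'c'])

# Remove rows by condition: keep only the rows where Salary is above 10500
high_paid = staff[staff['Salary'] > 10500]

print(without_a_c)
print(high_paid)
```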

How to rename Pandas DataFrame Column?

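In Python, use the Pandas rename function to rename one or more columns of a DataFrame. Here, we first rename the Profession column to Qualification, and then rename Salary to Income. With inplace = True, rename modifies the existing DataFrame instead of returning a new one. The second call still lists Profession, which no longer exists at that point; rename ignores unknown labels by default, so only Salary changes.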

import pandas as pd

table = {'name': ['John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [25, 32, 30, 26],
         'Profession': ['Developer', 'Analyst', 'Admin', 'HR'],
         'Salary':[1000000, 1200000, 900000, 1100000]
         }
data = pd.DataFrame(table)
print(data)

# data = data.rename(columns = {'Profession': 'Qualification'})
data.rename(columns = {'Profession': 'Qualification'}, inplace = True)
print('\n---After Renaming Column in a DataFrame---')
print(data)

data.rename(columns =
                {'Profession': 'Qualification',
                'Salary': 'Income'},
            inplace = True)
print('\n---After Renaming two Column in a DataFrame---')
print(data)

Rename DataFrame Column output

     name  Age Profession   Salary
0    John   25  Developer  1000000
1    Mike   32    Analyst  1200000
2  Suresh   30      Admin   900000
3   Tracy   26         HR  1100000

---After Renaming Column in a DataFrame---
     name  Age Qualification   Salary
0    John   25     Developer  1000000
1    Mike   32       Analyst  1200000
2  Suresh   30         Admin   900000
3   Tracy   26            HR  1100000

---After Renaming two Column in a DataFrame---
     name  Age Qualification   Income
0    John   25     Developer  1000000
1    Mike   32       Analyst  1200000
2  Suresh   30         Admin   900000
3   Tracy   26            HR  1100000
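
Besides a dictionary, rename also accepts a function that is applied to every column label. The sketch below uses a hypothetical people DataFrame; the 'Bonus' key is deliberately absent from it to show that unknown labels are ignored by default.

```python
import pandas as pd

people = pd.DataFrame({'name': ['John', 'Mike'],
                       'Age': [25, 32],
                       'Salary': [1000000, 1200000]})

# A callable is applied to every column label
upper = people.rename(columns=str.upper)

# Labels missing from the DataFrame ('Bonus' here) are ignored by default
renamed = people.rename(columns={'Salary': 'Income', 'Bonus': 'Extra'})

print(list(upper.columns))    # ['NAME', 'AGE', 'SALARY']
print(list(renamed.columns))  # ['name', 'Age', 'Income']
```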

Python pandas head and tail

If you are coming from R programming, you might be familiar with the head and tail functions. The Pandas DataFrame head function accepts an integer as an argument and returns that many records from the top of the DataFrame. For example, head(5) returns the first 5 records. Similarly, the tail function returns records from the bottom of a DataFrame: tail(5) returns the last 5 records. When called without an argument, both functions return 5 records.

import pandas as pd
table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy', 'Steve'],
         'Age': [35, 25, 32, 30, 26, 29],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR', 'HOD'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190, 44.55],
         'Salary':[12000, 10000, 8000, 14000, 11000, 14000]
	    }
data = pd.DataFrame(table)
print(data)

print('\n---First Five records DataFrame head()---')
print(data.head())

print('\n---First two records DataFrame head(2)---')
print(data.head(2))

print('\n---Bottom Five records DataFrame tail()---')
print(data.tail())

print('\n---last two records DataFrame tail(2)---')
print(data.tail(2))

pandas DataFrame head and tail output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000
5   Steve   29        HOD   44.55   14000

---First Five records DataFrame head()---
     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000

---First two records DataFrame head(2)---
   name  Age Profession    Sale  Salary
0  Kane   35    Manager  422.19   12000
1  John   25  Developer   22.55   10000

---Bottom Five records DataFrame tail()---
     name  Age Profession    Sale  Salary
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000
5   Steve   29        HOD   44.55   14000

---last two records DataFrame tail(2)---
    name  Age Profession    Sale  Salary
4  Tracy   26         HR  200.19   11000
5  Steve   29        HOD   44.55   14000
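
Two edge cases are worth a quick sketch: a negative argument, and an argument larger than the number of rows. The numbers DataFrame below is a made-up example.

```python
import pandas as pd

numbers = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60]})

all_but_last_two = numbers.head(-2)   # every row except the last two
all_but_first_two = numbers.tail(-2)  # every row except the first two
everything = numbers.head(100)        # asking for more rows than exist returns them all

print(all_but_last_two)
print(all_but_first_two)
print(len(everything))  # 6
```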

Transpose pandas DataFrame in Python

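A Pandas DataFrame has built-in support for transposing, that is, swapping its rows and columns, just like transposing a matrix. For this, use the DataFrame.T attribute (or the equivalent transpose() function).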

import pandas as pd
table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 32, 30, 26],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190],
         'Salary':[12000, 10000, 8000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)

print('\n---Transposed DataFrame---')
print(data.T)

Transpose DataFrame output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000

---Transposed DataFrame---
                  0          1        2       3       4
name           Kane       John     Mike  Suresh   Tracy
Age              35         25       32      30      26
Profession  Manager  Developer  Analyst   Admin      HR
Sale         422.19      22.55    12.66  119.47  200.19
Salary        12000      10000     8000   14000   11000
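
One side effect to be aware of, sketched below with a small made-up DataFrame named mixed: when the columns hold different data types, every column of the transposed result has to hold a mix of values, so pandas stores them with the generic object dtype.

```python
import pandas as pd

mixed = pd.DataFrame({'name': ['Kane', 'John'], 'Age': [35, 25]})
flipped = mixed.T   # rows become columns and columns become rows

print(mixed.dtypes)    # Age is an integer column here
print(flipped.dtypes)  # after transposing, each column is object
print(flipped)
```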

Python DataFrame groupby

The Pandas DataFrame groupby function is similar to the GROUP BY clause in SQL Server. You can use it to group data by one or more columns and compute aggregated results for the other columns. This is one of the most important functions when working with real-world data.

In this example, we created a DataFrame with columns of different data types. Next, we used the groupby function on that DataFrame. The first statement groups the DataFrame by the Profession column and calculates the sum of Sale, Salary and Age for each group. The second statement groups the DataFrame by both the Profession and Age columns and calculates the sum of Sale and Salary. Note that the handling of string columns such as name depends on the pandas version: older releases silently drop them from the sum, as in the output below, while pandas 2.0 and later concatenate the strings unless you pass numeric_only=True to sum().

import pandas as pd
table = {'name': ['Kane', 'Dave', 'Ram', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 25, 35, 25, 35, 35],
         'Profession': ['Analyst', 'HR', 'Analyst', 'Admin', 'HR', 'Admin', 'HR'],
         'Sale':[422, 22, 55, 12, 119, 470, 200],
         'Salary':[12000, 9000, 10000, 8000, 14000, 20000, 11000]
	    }
data = pd.DataFrame(table)
print(data)

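# numeric_only=True sums only the numeric columns and skips string columns such as name
print('\n---DataFrame groupby Profession---')
print(data.groupby('Profession').sum(numeric_only=True))

print('\n---DataFrame groupby Profession and Age---')
print(data.groupby(['Profession', 'Age']).sum(numeric_only=True))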

DataFrame groupby output

     name  Age Profession  Sale  Salary
0    Kane   35    Analyst   422   12000
1    Dave   25         HR    22    9000
2     Ram   25    Analyst    55   10000
3    John   35      Admin    12    8000
4    Mike   25         HR   119   14000
5  Suresh   35      Admin   470   20000
6   Tracy   35         HR   200   11000

---DataFrame groupby Profession---
            Age  Sale  Salary
Profession                   
Admin        70   482   28000
Analyst      60   477   22000
HR           85   341   34000

---DataFrame groupby Profession and Age---
                Sale  Salary
Profession Age              
Admin      35    482   28000
Analyst    25     55   10000
           35    422   12000
HR         25    141   23000
           35    200   11000
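
sum() is only one of the available aggregations. As a minimal sketch, the agg function with named aggregation lets each output column use its own function; the output names total_sale, average_salary and headcount are chosen here for illustration only.

```python
import pandas as pd

sales = pd.DataFrame({'Profession': ['Analyst', 'HR', 'Analyst', 'HR'],
                      'Sale': [422, 22, 55, 119],
                      'Salary': [12000, 9000, 10000, 14000]})

# Each keyword is an output column: (input column, aggregation function)
summary = sales.groupby('Profession').agg(
    total_sale=('Sale', 'sum'),
    average_salary=('Salary', 'mean'),
    headcount=('Salary', 'count'),
)
print(summary)
```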

Python DataFrame stack

The Pandas DataFrame stack function reshapes a DataFrame from wide to long form: it moves the column labels into a new innermost level of the row index, so each row and column pair becomes one entry. For a DataFrame with a single level of columns, the result is a Series with a MultiIndex. To use it, simply call data_to_stack.stack(). In this example, we apply the stack function to grouped data (the groupby function result).

import pandas as pd
table = {'name': ['Kane', 'Dave', 'Ram', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 25, 35, 25, 35, 35],
         'Profession': ['Analyst', 'HR', 'Analyst', 'Admin', 'HR', 'Admin', 'HR'],
         'Sale':[422, 22, 55, 12, 119, 470, 200],
         'Salary':[12000, 9000, 10000, 8000, 14000, 20000, 11000]
	    }
data = pd.DataFrame(table)

grouped_data1 = data.groupby('Profession').sum(numeric_only=True)
stacked_data1 = grouped_data1.stack()
print('\n---Stacked DataFrame groupby Profession---')
print(stacked_data1)

grouped_data2 = data.groupby(['Profession', 'Age']).sum(numeric_only=True)
stacked_data2 = grouped_data2.stack()
print('\n---Stacked DataFrame groupby Profession and Age---')
print(stacked_data2)

pandas DataFrame stack output


---Stacked DataFrame groupby Profession---
Profession        
Admin       Age          70
            Sale        482
            Salary    28000
Analyst     Age          60
            Sale        477
            Salary    22000
HR          Age          85
            Sale        341
            Salary    34000
dtype: int64

---Stacked DataFrame groupby Profession and Age---
Profession  Age        
Admin       35   Sale        482
                 Salary    28000
Analyst     25   Sale         55
                 Salary    10000
            35   Sale        422
                 Salary    12000
HR          25   Sale        141
                 Salary    23000
            35   Sale        200
                 Salary    11000
dtype: int64

Python DataFrame unstack

The DataFrame unstack function undoes the operation done by the stack function. It moves the innermost level of the row index back into the column labels, turning the stacked Series (the .stack() result) into a DataFrame again. To use this function, simply call stacked_data.unstack().

import pandas as pd
table = {'name': ['Kane', 'Dave', 'Ram', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 25, 35, 25, 35, 35],
         'Profession': ['Analyst', 'HR', 'Analyst', 'Admin', 'HR', 'Admin', 'HR'],
         'Sale':[422, 22, 55, 12, 119, 470, 200],
         'Salary':[12000, 9000, 10000, 8000, 14000, 20000, 11000]
	    }
data = pd.DataFrame(table)

grouped_data1 = data.groupby('Profession').sum(numeric_only=True)
stacked_data1 = grouped_data1.stack()
unstacked_data1 = stacked_data1.unstack()
# print('\n---Stacked DataFrame groupby Profession---')
# print(stacked_data1)
print('\n---Unstacked DataFrame groupby Profession---')
print(unstacked_data1)

grouped_data2 = data.groupby(['Profession', 'Age']).sum(numeric_only=True)
stacked_data2 = grouped_data2.stack()
unstacked_data2 = stacked_data2.unstack()
# print('\n---Stacked DataFrame groupby Profession and Age---')
# print(stacked_data2)
print('\n---Unstacked DataFrame groupby Profession and Age---')
print(unstacked_data2)

pandas DataFrame unstack


---Unstacked DataFrame groupby Profession---
            Age  Sale  Salary
Profession                   
Admin        70   482   28000
Analyst      60   477   22000
HR           85   341   34000

---Unstacked DataFrame groupby Profession and Age---
                Sale  Salary
Profession Age              
Admin      35    482   28000
Analyst    25     55   10000
           35    422   12000
HR         25    141   23000
           35    200   11000
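
unstack does not have to pivot the innermost level. As a small sketch, the level argument selects which index level moves into the columns; the totals DataFrame below is built by hand to mimic part of the grouped result above.

```python
import pandas as pd

totals = pd.DataFrame(
    {'Sale': [482, 55, 422], 'Salary': [28000, 10000, 12000]},
    index=pd.MultiIndex.from_tuples(
        [('Admin', 35), ('Analyst', 25), ('Analyst', 35)],
        names=['Profession', 'Age']))

stacked = totals.stack()               # Series indexed by Profession, Age and column name
round_trip = stacked.unstack()         # innermost level goes back to the columns
by_age = stacked.unstack(level='Age')  # pivot the Age level into the columns instead

print(round_trip)
print(by_age)   # Admin has no Age 25 entry, so those cells are NaN
```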

Python DataFrame Concatenation

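The Pandas concat function is used to combine or concatenate DataFrame objects. First, we declared two DataFrames of random values, each with 4 rows and 6 columns. Next, we used the concat function to concatenate them. By default, concat places the rows of the second DataFrame below the first and keeps the original index labels, so the labels 0 to 3 appear twice in the result.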

import pandas as pd
import numpy as np

dataframe_one = pd.DataFrame(np.random.randn(4, 6))
print(dataframe_one)

dataframe_two = pd.DataFrame(np.random.randn(4, 6))
print(dataframe_two)

print('\n---DataFrame concatenation---')
print(pd.concat([dataframe_one, dataframe_two]))

pandas DataFrame Concatenation

          0         1         2         3         4         5
0  0.170510 -0.549890 -0.076595 -1.666645 -0.500168 -0.837365
1 -1.056680 -0.296667 -1.418145 -0.357668 -0.319350  2.131726
2  1.359241  0.913525 -0.590698 -0.460282  1.198779 -0.900188
3  0.550750 -0.186552  0.543404  1.520353  0.288910  0.563674
          0         1         2         3         4         5
0  0.748928 -0.095618 -0.490589  0.950306 -0.786737  0.968456
1 -0.561079  0.204682  1.356939 -1.907207 -0.625462  0.163865
2  0.391494  0.881150  0.871912 -0.448490  0.589685  0.271900
3  0.179141 -0.589593 -0.335848 -0.348342  0.516758  0.691327

---DataFrame concatenation---
          0         1         2         3         4         5
0  0.170510 -0.549890 -0.076595 -1.666645 -0.500168 -0.837365
1 -1.056680 -0.296667 -1.418145 -0.357668 -0.319350  2.131726
2  1.359241  0.913525 -0.590698 -0.460282  1.198779 -0.900188
3  0.550750 -0.186552  0.543404  1.520353  0.288910  0.563674
0  0.748928 -0.095618 -0.490589  0.950306 -0.786737  0.968456
1 -0.561079  0.204682  1.356939 -1.907207 -0.625462  0.163865
2  0.391494  0.881150  0.871912 -0.448490  0.589685  0.271900
3  0.179141 -0.589593 -0.335848 -0.348342  0.516758  0.691327

In the above example, we concatenated two DataFrame objects of the same size. However, you can also use the Pandas concat function to combine more than two DataFrame objects of different sizes. For this, we used three DataFrames of randomly generated numbers with different shapes and concatenated them. Columns that are missing from a DataFrame are filled with NaN for its rows.

import numpy as np
import pandas as pd

dataframe_one = pd.DataFrame(np.random.randn(4, 6))
print(dataframe_one)

dataframe_two = pd.DataFrame(np.random.randn(4, 5))
print(dataframe_two)

dataframe_three = pd.DataFrame(np.random.randn(3, 4))
print(dataframe_three)

print('\n-----DataFrame concatenation-----')
print(pd.concat([dataframe_one, dataframe_two, dataframe_three]))

Concatenating three DataFrame objects of different sizes output

          0         1         2         3         4         5
0 -0.071220  0.286829  0.726730 -1.046570  1.114306 -0.622870
1 -0.137455 -1.237104 -2.567032 -0.773737  0.446680  1.241036
2  0.417368 -0.544948 -1.368237 -0.409373 -1.757377  1.481192
3 -0.958583  0.116646  0.491579  1.018028  0.591651  1.072710
          0         1         2         3         4
0  2.525100 -0.172472 -2.364648 -2.312990  0.264522
1  0.041258  0.688158  1.192806  1.590377 -0.549352
2  0.723508 -1.246208 -0.497221  0.174042 -0.634088
3 -0.394750  1.186304  0.575888 -1.201602  0.851508
          0         1         2         3
0  0.038201 -0.987624 -1.347281  0.968429
1 -0.268102 -0.981864  0.378091  0.193392
2  2.287503  0.834575 -0.774165  1.244232

-----DataFrame concatenation-----
          0         1         2         3         4         5
0 -0.071220  0.286829  0.726730 -1.046570  1.114306 -0.622870
1 -0.137455 -1.237104 -2.567032 -0.773737  0.446680  1.241036
2  0.417368 -0.544948 -1.368237 -0.409373 -1.757377  1.481192
3 -0.958583  0.116646  0.491579  1.018028  0.591651  1.072710
0  2.525100 -0.172472 -2.364648 -2.312990  0.264522       NaN
1  0.041258  0.688158  1.192806  1.590377 -0.549352       NaN
2  0.723508 -1.246208 -0.497221  0.174042 -0.634088       NaN
3 -0.394750  1.186304  0.575888 -1.201602  0.851508       NaN
0  0.038201 -0.987624 -1.347281  0.968429       NaN       NaN
1 -0.268102 -0.981864  0.378091  0.193392       NaN       NaN
2  2.287503  0.834575 -0.774165  1.244232       NaN       NaN
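
A few concat options are often useful here. The sketch below uses two small hand-written DataFrames, top and bottom, instead of random numbers so that the results are predictable.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1
stacked_rows = pd.concat([top, bottom], ignore_index=True)

# join='inner' keeps only the columns present in every input, so there is no NaN padding
shared_only = pd.concat([top, bottom], join='inner', ignore_index=True)

# axis=1 places the DataFrames side by side, aligned on the row index
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked_rows)
print(shared_only)
print(side_by_side)
```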

Math operations on DataFrame in Python

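In this example, we use a few of the Pandas DataFrame mathematical functions. For demo purposes, we find the mean and median of each column and of each row. To get the mean or median of each row, pass 1 (that is, axis=1) to the function. Because the name and Profession columns hold strings, only the numeric columns take part in the calculation; in pandas 2.0 and later you have to request this explicitly with numeric_only=True.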

import pandas as pd
table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 32, 30, 26],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190],
         'Salary':[12000, 10000, 8000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)

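# numeric_only=True skips the string columns (name, Profession);
# pandas 2.0 and later raise a TypeError without it
print('\n---DataFrame Mean of Columns---')
print(data.mean(numeric_only=True))

print('\n---DataFrame Mean of Rows---')
print(data.mean(axis=1, numeric_only=True))

print('\n---DataFrame Median of Columns---')
print(data.median(numeric_only=True))

print('\n---DataFrame Median of Rows---')
print(data.median(axis=1, numeric_only=True))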

math operations on DataFrame output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000

---DataFrame Mean of Columns---
Age          29.600
Sale        155.412
Salary    11000.000
dtype: float64

---DataFrame Mean of Rows---
0    4152.396667
1    3349.183333
2    2681.553333
3    4716.490000
4    3742.063333
dtype: float64

---DataFrame Median of Columns---
Age          30.00
Sale        119.47
Salary    11000.00
dtype: float64

---DataFrame Median of Rows---
0    422.19
1     25.00
2     32.00
3    119.47
4    200.19
dtype: float64

Next, we calculate the sum of each column and of each row, the minimum and maximum value of each column, and the maximum value in each row using the sum(), min() and max() functions. For string columns, sum() concatenates the values, while min() and max() compare them alphabetically.

import pandas as pd
table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 32, 30, 26],
         'Profession': ['Manager', 'Developer', 'Analyst', 'Admin', 'HR'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190],
         'Salary':[12000, 10000, 8000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)

print('\n---DataFrame sum of Columns---')
print(data.sum())

print('\n---DataFrame sum of Rows---')
print(data.sum(axis=1, numeric_only=True))  # strings and numbers cannot be added, so skip the strings

print('\n---DataFrame Minimum of Columns---')
print(data.min())

print('\n---DataFrame Maximum of Columns---')
print(data.max())

print('\n---DataFrame Maximum of Rows---')
print(data.max(axis=1, numeric_only=True))  # compare only the numeric values in each row

The Sum of DataFrame rows and columns output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25  Developer   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30      Admin  119.47   14000
4   Tracy   26         HR  200.19   11000

---DataFrame sum of Columns---
name                 KaneJohnMikeSureshTracy
Age                                      148
Profession    ManagerDeveloperAnalystAdminHR
Sale                                  777.06
Salary                                 55000
dtype: object

---DataFrame sum of Rows---
0    12457.19
1    10047.55
2     8044.66
3    14149.47
4    11226.19
dtype: float64

---DataFrame Minimum of Columns---
name           John
Age              25
Profession    Admin
Sale          12.66
Salary         8000
dtype: object

---DataFrame Maximum of Columns---
name            Tracy
Age                35
Profession    Manager
Sale           422.19
Salary          14000
dtype: object

---DataFrame Maximum of Rows---
0    12000.0
1    10000.0
2     8000.0
3    14000.0
4    11000.0
dtype: float64
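
An alternative that avoids version differences is to select the numeric columns first and then apply the functions. This is a sketch with a smaller, made-up staff DataFrame; select_dtypes and agg are standard pandas functions, while the variable names are illustrative.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Kane', 'John', 'Mike'],
                      'Age': [35, 25, 32],
                      'Salary': [12000, 10000, 8000]})

numeric = staff.select_dtypes(include='number')   # leaves out the string column

column_means = numeric.mean()      # one value per column (axis=0 is the default)
row_sums = numeric.sum(axis=1)     # one value per row
summary = numeric.agg(['min', 'max', 'median'])   # several statistics at once

print(column_means)
print(row_sums.tolist())   # [12035, 10025, 8032]
print(summary)
```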

Arithmetic Operations on Python Pandas DataFrame

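Arithmetic operators applied between a DataFrame and a single number work element-wise: the operation is applied to every value in the DataFrame. Here, we add, subtract and multiply a DataFrame by a number.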

import pandas as pd
table = {'Age': [25, 32, 30],
         'Sale':[422.19, 119.470, 200.190],
         'Salary':[12000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)

print('\n---Add 20 to DataFrame---')
print(data + 20)

print('\n---Subtract 10 from DataFrame---')
print(data - 10)

print('\n---Multiply DataFrame by 2---')
print(data * 2)

Arithmetic Operations on Pandas DataFrame output

   Age    Sale  Salary
0   25  422.19   12000
1   32  119.47   14000
2   30  200.19   11000

---Add 20 to DataFrame---
   Age    Sale  Salary
0   45  442.19   12020
1   52  139.47   14020
2   50  220.19   11020

---Subtract 10 from DataFrame---
   Age    Sale  Salary
0   15  412.19   11990
1   22  109.47   13990
2   20  190.19   10990

---Multiply DataFrame by 2---
   Age    Sale  Salary
0   50  844.38   24000
1   64  238.94   28000
2   60  400.38   22000
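
The same operators also work between columns, row by row. A minimal sketch follows; the pay DataFrame and the new column names Total and Sale_share are assumptions made for this illustration.

```python
import pandas as pd

pay = pd.DataFrame({'Salary': [12000, 14000, 11000],
                    'Sale': [400, 100, 200]})

pay['Total'] = pay['Salary'] + pay['Sale']       # element-wise addition of two columns
pay['Sale_share'] = pay['Sale'] / pay['Total']   # division returns floats
halved = pay[['Salary', 'Sale']].div(2)          # method form of the / operator

print(pay)
print(halved)
```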

Python Pandas DataFrame Nulls

The isnull function checks every value in a DataFrame and returns True where the value is Null (NaN or None), otherwise False. The Pandas notnull function is its inverse: it returns True where the value is not Null, otherwise False.

import pandas as pd
import numpy as np

table = {'name': ['Kane', 'Suresh', np.nan],
         'Profession': ['Manager', np.nan, 'HR'],
         'Salary': [np.nan, 14000, 11000],
         'Sale': [422.19, np.nan, 44.55]
        }

data = pd.DataFrame(table)
print(data)

print('\n---Checking Nulls in a DataFrame---')
print(data.isnull())

print('\n---Checking Not Nulls in a DataFrame---')
print(data.notnull())

pandas DataFrame Nulls output

     name Profession   Salary    Sale
0    Kane    Manager      NaN  422.19
1  Suresh        NaN  14000.0     NaN
2     NaN         HR  11000.0   44.55

---Checking Nulls in a DataFrame---
    name  Profession  Salary   Sale
0  False       False    True  False
1  False        True   False   True
2   True       False   False  False

---Checking Not Nulls in a DataFrame---
    name  Profession  Salary   Sale
0   True        True   False   True
1   True       False    True  False
2  False        True    True   True
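
In practice, the True and False values are usually counted or used as a filter. The sketch below uses a made-up records DataFrame with one complete row.

```python
import pandas as pd
import numpy as np

records = pd.DataFrame({'name': ['Kane', 'Suresh', np.nan, 'Tracy'],
                        'Salary': [np.nan, 14000, 11000, 9000],
                        'Sale': [422.19, np.nan, 44.55, 10.0]})

missing_per_column = records.isnull().sum()          # True counts as 1
complete_rows = records.dropna()                     # keep rows without any missing value
has_salary = records[records['Salary'].notnull()]    # keep rows where Salary is present

print(missing_per_column)
print(complete_rows)
print(has_salary)
```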

Replace Nulls in pandas DataFrame

We can also replace those Null values with meaningful values. For this, use the DataFrame fillna function or the replace function. Both return a new DataFrame and leave the original unchanged.

import pandas as pd
import numpy as np

table = {'Age': [20, 35, np.nan],
         'Salary': [np.nan, 14000, 11000],
         'Sale': [422.19, np.nan, 44.55]
        }

data = pd.DataFrame(table)
print(data)

print('\n---Fill Missing Values DataFrame---')
print(data.fillna(30))

print('\n---Replace Missing Values DataFrame---')
print(data.replace({np.nan:66}))

Replacing Nulls in DataFrame output

    Age   Salary    Sale
0  20.0      NaN  422.19
1  35.0  14000.0     NaN
2   NaN  11000.0   44.55

---Fill Missing Values DataFrame---
    Age   Salary    Sale
0  20.0     30.0  422.19
1  35.0  14000.0   30.00
2  30.0  11000.0   44.55

---Replace Missing Values DataFrame---
    Age   Salary    Sale
0  20.0     66.0  422.19
1  35.0  14000.0   66.00
2  66.0  11000.0   44.55
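
A single replacement value for every column is rarely what you want. As a sketch, fillna also accepts a dictionary or a Series keyed by column name; the replacement values 0 and 10000 below are arbitrary.

```python
import pandas as pd
import numpy as np

figures = pd.DataFrame({'Age': [20, 35, np.nan],
                        'Salary': [np.nan, 14000, 11000]})

# A dict gives each column its own replacement value
per_column = figures.fillna({'Age': 0, 'Salary': 10000})

# A Series of column means fills each column with its own mean
mean_filled = figures.fillna(figures.mean())

print(per_column)
print(mean_filled)
```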

Pandas DataFrame pivot

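The DataFrame pivot function reshapes an existing DataFrame. The unique values of one column become the new row index (the index argument), the unique values of another column become the new column labels (the columns argument), and the values argument names the column that fills the cells. Combinations that do not occur in the data are filled with NaN. If you omit values, all the remaining columns are pivoted and the result has hierarchical columns, as in the second statement.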

import pandas as pd

table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 32, 30, 26],
         'Profession': ['Manager', 'HR', 'Analyst', 'Manager', 'HR'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190],
         'Salary':[12000, 10000, 8000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)
print('\n--- After DataFrame Pivot---')
data2 = data.pivot(index = 'name', columns = 'Profession', values = 'Salary')
print(data2)

print('\n--- After DataFrame Pivot---')
data3 = data.pivot(index = 'name', columns = 'Profession')
print(data3)

Pandas DataFrame pivot output

     name  Age Profession    Sale  Salary
0    Kane   35    Manager  422.19   12000
1    John   25         HR   22.55   10000
2    Mike   32    Analyst   12.66    8000
3  Suresh   30    Manager  119.47   14000
4   Tracy   26         HR  200.19   11000

--- After DataFrame Pivot---
Profession  Analyst       HR  Manager
name                                 
John            NaN  10000.0      NaN
Kane            NaN      NaN  12000.0
Mike         8000.0      NaN      NaN
Suresh          NaN      NaN  14000.0
Tracy           NaN  11000.0      NaN

--- After DataFrame Pivot---
               Age                  Sale  ...          Salary                  
Profession Analyst    HR Manager Analyst  ... Manager Analyst       HR  Manager
name                                      ...                                  
John           NaN  25.0     NaN     NaN  ...     NaN     NaN  10000.0      NaN
Kane           NaN   NaN    35.0     NaN  ...  422.19     NaN      NaN  12000.0
Mike          32.0   NaN     NaN   12.66  ...     NaN  8000.0      NaN      NaN
Suresh         NaN   NaN    30.0     NaN  ...  119.47     NaN      NaN  14000.0
Tracy          NaN  26.0     NaN     NaN  ...     NaN     NaN  11000.0      NaN

[5 rows x 9 columns]
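
pivot requires every index and column pair to be unique. When the data contains duplicates, the related pivot_table function aggregates them instead. The sketch below uses a hypothetical orders DataFrame with an invented Year column to trigger the duplicate case.

```python
import pandas as pd

orders = pd.DataFrame({'Profession': ['Manager', 'Manager', 'HR'],
                       'Year': [2020, 2020, 2020],
                       'Salary': [12000, 14000, 10000]})

pivot_failed = False
try:
    # (2020, 'Manager') occurs twice, so pivot cannot place both values in one cell
    orders.pivot(index='Year', columns='Profession', values='Salary')
except ValueError as error:
    pivot_failed = True
    print('pivot failed:', error)

# pivot_table combines the duplicates with an aggregation function
averaged = orders.pivot_table(index='Year', columns='Profession',
                              values='Salary', aggfunc='mean')
print(averaged)
```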

How to save DataFrame to CSV and Text File?

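To save (write) the data from a Pandas DataFrame to a csv file or text file, use the Pandas to_csv function. By default, it separates the values with a comma and writes the row index as the first column; the sep argument changes the separator.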

import pandas as pd

table = {'name': ['Kane', 'John', 'Mike', 'Suresh', 'Tracy'],
         'Age': [35, 25, 32, 30, 26],
         'Profession': ['Manager', 'HR', 'Analyst', 'Manager', 'HR'],
         'Sale':[422.19, 22.55, 12.66, 119.470, 200.190],
         'Salary':[12000, 10000, 8000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)
# write the DataFrame to a text file (comma separated by default)
data.to_csv('user_info.txt')
# write the DataFrame to a csv file with comma separator
data.to_csv('user_info.csv')
# write the DataFrame to a csv file with Tab separator
data.to_csv('user_info_new.csv', sep = '\t')
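
To see what to_csv writes without creating files, here is a small sketch that writes to an in-memory buffer and reads the text back with read_csv. The io.StringIO buffer is only a stand-in for a real file, and index=False is an optional argument that leaves out the row index.

```python
import io
import pandas as pd

info = pd.DataFrame({'name': ['Kane', 'John'], 'Salary': [12000, 10000]})

buffer = io.StringIO()             # in-memory stand-in for a file on disk
info.to_csv(buffer, index=False)   # index=False omits the 0, 1, ... row labels
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```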

Iterate over Python DataFrame Rows

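In Python, use the iterrows or itertuples function to iterate over the rows of a DataFrame. The iterrows function returns each row as a pair of the index label and a Series, whereas itertuples returns each row as a named tuple. The items function (called iteritems in older pandas releases) iterates over the columns rather than the rows.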

import pandas as pd

table = {'name': ['Kane', 'John', 'Mike'],
         'Age': [35, 25, 32],
         'Profession': ['Manager', 'HR', 'Analyst'],
         'Sale':[422.19, 119.470, 200.190],
         'Salary':[12000, 14000, 11000]
	    }
data = pd.DataFrame(table)
print(data)
print('\n---Iterating Rows---')
for index, row in data.iterrows():
    # iterrows yields (index label, row as a Series) pairs
    print(index, row)
    print()

Iterating DataFrame rows output

   name  Age Profession    Sale  Salary
0  Kane   35    Manager  422.19   12000
1  John   25         HR  119.47   14000
2  Mike   32    Analyst  200.19   11000

---Iterating Rows---
0 name             Kane
Age                35
Profession    Manager
Sale           422.19
Salary          12000
Name: 0, dtype: object

1 name            John
Age               25
Profession        HR
Sale          119.47
Salary         14000
Name: 1, dtype: object

2 name             Mike
Age                32
Profession    Analyst
Sale           200.19
Salary          11000
Name: 2, dtype: object
