Posts Tagged ‘dataframes’

Python Pandas for SQL fans Part 1: Creating DataFrames

panda holding SQL heartI have worked with data for years, and I am very familiar with SQL. When I started exploring data science I found a number of my SQL skills translated well into data science. Pandas tends to be the most popular python library for manipulating data for those familiar with SQL, so I figured I would write a series of posts on how to use Pandas written from a SQL perspective!

In this post we will focus on how to create a DataFrame this is the equivalent of creating a table in a database.

Part 2 shows you How to insert data into a DataFrame.

Pre-requisites

You need to import the python pandas and numpy libraries to use any of the code examples below:

import pandas as pd
import numpy as np

CREATE TABLE

If you want to query data in a database, you need to create a table. If you want to query data in Pandas, you need to create a DataFrame.

If I want to create a database table to hold information about hockey players I would use the CREATE TABLE statement:

CREATE TABLE players (
first_name   VARCHAR(30),
last_name VARCHAR(30),
team VARCHAR(3),
position VARCHAR(2),
jersey_number INT,
salary DECIMAL,
birthdate DATETIME)

Create Pandas DataFrame

To create a DataFrame to hold the same information in pandas, all I need to do is define the column names and create a DataFrame using the column name:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)

Creates an empty dataframe:

Empty DataFrame
Columns: [FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate]

Specifying data types

Now you probably noticed I didn’t specify the datatypes for my columns when I created the dataframe. You can display the datatypes for your DataFrame using dtypes:

players.dtypes

which returns:

FirstName object
LastName object
Team object
Position object
JerseyNumber object
Salary object
Birthdate object

Any column with a datatype of string will be listed as object because in Python a string is actually a sequence of characters which does allow you some neat slicing to retrieve substrings.

You can modify the datatypes of the columns after the DataFrame is created using astype:

players = players.astype({'FirstName':str,
'LastName':str,
'Team':str,
'Position':str,
'JerseyNumber':int,
'Salary':float,
'Birthdate':np.datetime64})

Now I have the datatypes:

FirstName object
LastName object
Team object
Position object
JerseyNumber int32
Salary float64
Birthdate datetime64[ns]

Wondering what datatypes are available? Chris Moffitt wrote a nice post summarizing Pandas data types. You can also refer to the official Pandas documentation on dtypes.

If you are wondering, why you can’t specify datatypes for each column when a DataFrame is created, that’s because unlike when you work with database tables, you usually create DataFrames from a dataset and the datatype is inferred from the data. The DataFrame constructor does accept a datatype argument, but you can only use it to specify a datatype to use for all columns in the DataFrame, you cannot specify different datatypes for each column.

Converting all columns to a single datatype

If all your columns will be the same dataype, you can use astype to convert all columns to a new datatype:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)
players = players.astype(int)

would give me a DataFrame with every column as an integer:

FirstName int32
LastName int32
Team int32
Position int32
JerseyNumber int32
Salary int32
Birthdate int32

Converting one column to a new datatype

You can also convert a single column in a DataFrame to a new datatype using to_datetime, to_numeric, to timedelta

NOTE: Timedelta is a datatype in python used to specify a gap between two dates and times.

The following code shows how I could use these functions to convert the columns that are not storing strings:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
players  = pd.DataFrame(columns=column_names)
players['Birthdate']= pd.to_datetime(players['Birthdate'])
players['JerseyNumber']=pd.to_numeric(players['JerseyNumber'],downcast='integer')
players['Salary']=pd.to_numeric(players['Salary'],downcast='float')

Would give me a DataFrame with the data types:

FirstName object
LastName object
Team object
Position object
JerseyNumber int8
Salary float32
Birthdate datetime64[ns]

Note that to_numeric requires an additional parameter downcast to specify the type of numeric datatype required.

Coming next

In the Part 2 you learn the equivalent of the INSERT statement, and learn how to add data to your DataFrame.