
Pandas for SQL lovers – Part 3 Reading a CSV file / BULK INSERT

In this post I will explain how to populate a Pandas DataFrame from a CSV file. If you are not familiar with how to create a Pandas DataFrame, check out Part 1: Creating DataFrames.

If you have ever used BULK INSERT to read a CSV file into a database table, you are going to fall in love with the functionality for reading a csv file into a Pandas DataFrame!

Let’s start with how to:

  • Read a file
  • Handle error rows
  • Specify the index column
  • Specify column delimiters
  • Specify column headers
  • Select a subset of rows or columns

We load a csv file into a Pandas DataFrame using read_csv.

read_csv

As usual, the first thing we need to do is import the pandas and numpy libraries:


import pandas as pd
import numpy as np

I have the following csv file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

Use the read_csv method to load a comma separated file into a DataFrame:

players = pd.read_csv('HockeyPlayers.csv')

This creates the following DataFrame:

[Image: the resulting players DataFrame]

A few things to note:

  • Column names were read by default from the first row in the file
  • An index column was added numbering each row
  • The Salary for Daniel & Henrik Sedin was blank in the original file and appears as NaN (we will talk more about handling blanks and nulls in a later post)
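If you want to see which rows came in with blank values, you can filter on isna(). A minimal sketch, recreating the sample file from the post so it is self-contained:

```python
import pandas as pd

# Recreate the sample csv file shown above
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26
"""
with open('HockeyPlayers.csv', 'w') as f:
    f.write(csv_text)

players = pd.read_csv('HockeyPlayers.csv')

# Blank Salary values in the file were read in as NaN
missing = players[players['Salary'].isna()]
print(missing[['FirstName', 'LastName']])
```

This shows the two Sedin rows, whose Salary values were blank in the file.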

Handling invalid rows

By default, invalid rows will raise an error; for example, if my data has rows with too many values.

I have an extra ‘,’ in Connor McDavid’s name and in Carey Price’s salary:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,Mc,David,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10,500,000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

If I try to load the data I get an error:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 8

If you specify error_bad_lines=False, invalid rows are skipped:


players = pd.read_csv('HockeyPlayersInvalidRows.csv', error_bad_lines=False)

[Image: DataFrame with the invalid rows skipped]

In Jupyter Notebooks you will see a message informing you lines were skipped:

Skipping line 3: expected 7 fields, saw 8\nSkipping line 5: expected 7 fields, saw 9
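Note that error_bad_lines was deprecated in pandas 1.3 and later removed; current versions use the on_bad_lines parameter instead. A rough equivalent, recreating the file with the bad rows so the sketch is self-contained:

```python
import pandas as pd

# Recreate the file with the extra commas described above
csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,Mc,David,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10,500,000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26
"""
with open('HockeyPlayersInvalidRows.csv', 'w') as f:
    f.write(csv_text)

# pandas >= 1.3: skip rows with too many fields instead of raising
players = pd.read_csv('HockeyPlayersInvalidRows.csv', on_bad_lines='skip')
print(players)
```

The two malformed rows (McDavid and Price) are dropped, leaving four rows.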

Specifying your own index column

As we saw in Part 2 of the series, Populating a DataFrame, every DataFrame has an index column. If you do not want an autogenerated index column, you can specify a column in your csv file to use as the index column instead. The index column does not need to be a numeric value.

NOTE: unlike a PRIMARY KEY in a database table, an index column in a DataFrame can contain duplicate values.


players = pd.read_csv('HockeyPlayers.csv', index_col='LastName')

Would give me a DataFrame that uses LastName as the index column:

[Image: DataFrame with LastName as the index column]

You can specify a combination of columns to use as your index:


players = pd.read_csv('HockeyPlayers.csv', index_col=['LastName','FirstName'])

Will return the following DataFrame:

[Image: DataFrame with a LastName/FirstName multi-column index]
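With a multi-column index you can look up a row by a tuple of index values using .loc. A minimal sketch (note I trimmed the stray space after ‘Sidney’ from the sample file so the label matches exactly):

```python
import pandas as pd

csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
"""
with open('HockeyPlayers.csv', 'w') as f:
    f.write(csv_text)

players = pd.read_csv('HockeyPlayers.csv', index_col=['LastName', 'FirstName'])

# Look up a single row by its (LastName, FirstName) index tuple
crosby = players.loc[('Crosby', 'Sidney')]
print(crosby['Team'])
```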

Column delimiters

The default column delimiter is a comma. If your file uses a different delimiter, all the values in each row are read into a single column.

If I try to read a data file using semi-colons as column delimiters

FirstName;LastName;Team;Position;JerseyNumber;Salary;Birthdate
Joe;Pavelski;SJ;C;8;6000000;1984-07-11
Connor;McDavid;EDM;C;97;925000;1997-01-13
Sidney ;Crosby;PIT;C;87;8700000;1987-08-07
Carey;Price;MTL;G;31;10500000;1987-08-16
Daniel;Sedin;VAN;LW;22;;1980-09-26
Henrik;Sedin;VAN;C;33;;1980-09-26


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv')

I get the following DataFrame:

[Image: all the values read into a single column]

Use the delimiter parameter (or its alias sep) to specify the correct delimiter character:


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv', delimiter=';')

returns the DataFrame:

[Image: DataFrame read correctly with the semicolon delimiter]

Column names

If your file does not contain a row with column names, the first row by default will be treated as column headers, so you might end up with a DataFrame like this:

[Image: first data row treated as column headers]

If you specify header=None, columns will be identified by number:


players = pd.read_csv('HockeyPlayersNoHeaderRow.csv', header=None)

[Image: columns identified by number]

You can specify column names using the names parameter:

players = pd.read_csv('HockeyPlayersNoHeaderRow.csv',
                      names=['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will return the DataFrame:

[Image: DataFrame with the specified column names]

If your csv file contains column headers you can use the names parameter to rename the columns.


players = pd.read_csv('HockeyPlayers.csv',
                      names=['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

But be warned, if the first row of your file contains column names and you specify the names parameter, the first row is read as a row of data:

[Image: header row read as a row of data]

You need to specify the header parameter to indicate that the first row of the file is a header row and not a data row:

players = pd.read_csv('HockeyPlayers.csv',
                      header=0,
                      names=['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will give you the desired DataFrame:

[Image: DataFrame with the new column names]

Reading a subset of the csv file

You can specify which columns to read from the file with the usecols parameter:


players = pd.read_csv('HockeyPlayers.csv',
                      usecols=['LastName','FirstName','Team'])

[Image: DataFrame with only the requested columns]
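usecols also accepts column positions instead of names, which is handy when the file has no header row. A quick sketch:

```python
import pandas as pd

csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
"""
with open('HockeyPlayers.csv', 'w') as f:
    f.write(csv_text)

# Select the first three columns by position rather than by name
players = pd.read_csv('HockeyPlayers.csv', usecols=[0, 1, 2])
print(players.columns.tolist())
```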

Use skiprows to skip rows at the start of the file. Of course if you skip the row containing the column headers, you need to specify the column names as well.


players = pd.read_csv('HockeyPlayers.csv',
                      skiprows=3,
                      names=['FirstName','LastName','Team','Position','JerseyNumber','Salary','Birthdate'])

will give you the DataFrame:

[Image: DataFrame with the first rows skipped]

Use skipfooter to skip rows at the bottom of the file. If you use skipfooter you must also specify the parameter engine='python'. read_csv supports a C engine and a Python engine. The C engine is faster, but does not support all the features; the Python engine supports all the features of read_csv, including skipfooter.

Now when I use the Python engine, I noticed in Jupyter Notebooks that the first column name gets stray characters prepended to it (typically a byte order mark at the start of the file; specifying encoding='utf-8-sig' is another way to handle it).


players = pd.read_csv('HockeyPlayers.csv',
                      skipfooter=2,
                      engine='python')

[Image: first column name with stray characters]

This might just be something odd with my file, or with my notebook, but I can fix it easily enough by just specifying my own column names and skipping the row containing the column names:


players = pd.read_csv('HockeyPlayers.csv',
                      skiprows=1,
                      skipfooter=2,
                      engine='python',
                      names=['FirstName','LastName','Team','Position','JerseyNumber','Salary','Birthdate'])

[Image: DataFrame with the last rows skipped]

If you want to read in a specific number of rows you can use the nrows parameter:


players = pd.read_csv('HockeyPlayers.csv', nrows=3)

[Image: DataFrame with only the first three rows]

If you have a very large data file you can also read it in chunks using the chunksize parameter and store each chunk separately for analysis or processing.


for playerchunk in pd.read_csv('HockeyPlayers.csv', chunksize=2):
    print(playerchunk.shape)
    print(playerchunk)

[Image: output of printing each chunk]
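Each chunk is itself a DataFrame, so you can process the chunks one at a time and combine the results. A sketch that collects the chunks and glues them back together with pd.concat:

```python
import pandas as pd

csv_text = """FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
"""
with open('HockeyPlayers.csv', 'w') as f:
    f.write(csv_text)

chunks = []
for playerchunk in pd.read_csv('HockeyPlayers.csv', chunksize=2):
    # each chunk is a DataFrame with up to 2 rows
    chunks.append(playerchunk)

# Combine the processed chunks into a single DataFrame
players = pd.concat(chunks, ignore_index=True)
print(players.shape)
```

With four rows and chunksize=2 you get two chunks of two rows each.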

Summary

Okay, we’ve covered how to:

    • Read a csv file
    • Control behaviour for error rows
    • Specify an index column
    • Specify column names
    • Specify which rows and columns to load

Still to come: we will look at how to use read_csv to control column datatypes, handle nulls, and handle dates. Stay tuned!

Pandas for SQL Lovers Part 2: INSERT / Populating a DataFrame

Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.

In this post I will explain how to populate a Pandas DataFrame. If you are not familiar with how to create a Pandas DataFrame, check out Part 1: Creating DataFrames.

INSERT INTO

In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')

OR you can insert multiple rows with a single INSERT statement


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C',97,925000.00,'1997-01-13')

Populating a DataFrame when created

When you create your DataFrame, you can provide data and populate it immediately.


column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        ['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        ['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        ['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

This creates the following DataFrame:

[Image: the populated players DataFrame]

Inferred Datatypes

The advantage of populating the DataFrame when it is created is that Pandas will infer the datatypes based on the data. If I run the command:


players.dtypes

I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because a string is stored as a sequence of characters, the datatype displayed is object):

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object

Explicit Datatypes

If you want Birthdate to have a date datatype, you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:

players['Birthdate']= pd.to_datetime(players['Birthdate'])

Now the Birthdate column stores the datatype datetime:

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]
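Once Birthdate is a datetime, the .dt accessor lets you pull out the date parts. A minimal self-contained sketch (using two of the players above, with an added BirthYear column for illustration):

```python
import pandas as pd

data = [['Joe', 'Pavelski', '1984-07-11'],
        ['Connor', 'McDavid', '1997-01-13']]
players = pd.DataFrame(data, columns=['FirstName', 'LastName', 'Birthdate'])
players['Birthdate'] = pd.to_datetime(players['Birthdate'])

# With a datetime column you can extract date parts via the .dt accessor
players['BirthYear'] = players['Birthdate'].dt.year
print(players['BirthYear'].tolist())
```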

You can find more details on how to assign datatypes explicitly in Part 1 of this series: how to create DataFrames.

Indexes in DataFrames

You may have noticed that I did not have any sort of ‘playerid’ value for the rows I inserted, but you can see a number beside each row. This column is called the index. Pandas will automatically create an index for each row in the DataFrame.

[Image: DataFrame with the autogenerated index]

Setting your own column as index

If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then uses set_index to make PlayerId the index column.

column_names = ['PlayerId',
                'FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        [2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        [3,'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        [4,'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

players.set_index('PlayerId', inplace=True)

This produces a DataFrame with PlayerId as the index column:

[Image: DataFrame with PlayerId as the index]

Using non-numeric columns as an index

You are not limited to numeric fields as indexes; you can use any field as your index:

players.set_index('LastName', inplace=True)

[Image: DataFrame indexed by LastName]

Duplicate values in index columns

Unlike a PRIMARY KEY in a database, the index on a DataFrame will allow duplicate values. If you decide to use LastName as your index column and you have Henrik & Daniel Sedin in your DataFrame, you will see duplicate index values.

[Image: duplicate ‘Sedin’ index values]
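A consequence of duplicate index values: .loc returns every matching row, not just one. A small sketch with both Sedins indexed by LastName:

```python
import pandas as pd

data = [['Daniel', 'Sedin', 'VAN'],
        ['Henrik', 'Sedin', 'VAN'],
        ['Carey', 'Price', 'MTL']]
players = pd.DataFrame(data, columns=['FirstName', 'LastName', 'Team'])
players.set_index('LastName', inplace=True)

# A duplicated label returns a DataFrame of all matching rows
sedins = players.loc['Sedin']
# A unique label returns a single Series
price = players.loc['Price']
print(len(sedins))
```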

Adding rows to an existing DataFrame

If you want to add rows to a DataFrame after it is created, use append. In the code below let’s recreate the populated DataFrame with the autogenerated index:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
        ['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
        ['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
        ['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

[Image: the recreated players DataFrame]

Now you can use append to add a row. A couple of things to be aware of:

  • Using append creates a new DataFrame with the added row; if you want the row added to your existing DataFrame, you need to assign the result of append back to your original DataFrame.
  • You must specify ignore_index=True when you are providing explicit values to use for the new row being appended.

players = players.append({'FirstName':'Erik',
                          'LastName':'Karlsson',
                          'Team':'SJ',
                          'Position':'D',
                          'JerseyNumber':65,
                          'Salary':11500000.00,
                          'Birthdate':'1990-05-31'},
                         ignore_index=True)

[Image: DataFrame with the appended row]
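One caveat for readers on newer versions: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. The current equivalent is to wrap the new row in its own DataFrame and use pd.concat. A sketch of the same insert (shortened to one existing row to keep it compact):

```python
import pandas as pd

players = pd.DataFrame(
    [['Joe', 'Pavelski', 'SJ', 'C', 8, 6000000.00, '1984-07-11']],
    columns=['FirstName', 'LastName', 'Team', 'Position',
             'JerseyNumber', 'Salary', 'Birthdate'])

new_row = pd.DataFrame([{'FirstName': 'Erik',
                         'LastName': 'Karlsson',
                         'Team': 'SJ',
                         'Position': 'D',
                         'JerseyNumber': 65,
                         'Salary': 11500000.00,
                         'Birthdate': '1990-05-31'}])

# pd.concat replaces the removed DataFrame.append
players = pd.concat([players, new_row], ignore_index=True)
print(len(players))
```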

Summary

Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.

Python Pandas for SQL fans Part 1: Creating DataFrames

I have worked with data for years, and I am very familiar with SQL. When I started exploring data science I found a number of my SQL skills translated well into data science. Pandas tends to be the most popular Python library for manipulating data, so I figured I would write a series of posts on how to use Pandas, written from a SQL perspective!

In this post we will focus on how to create a DataFrame, which is the equivalent of creating a table in a database.

Part 2 shows you How to insert data into a DataFrame.

Pre-requisites

You need to import the python pandas and numpy libraries to use any of the code examples below:

import pandas as pd
import numpy as np

CREATE TABLE

If you want to query data in a database, you need to create a table. If you want to query data in Pandas, you need to create a DataFrame.

If I want to create a database table to hold information about hockey players I would use the CREATE TABLE statement:

CREATE TABLE players (
    first_name VARCHAR(30),
    last_name VARCHAR(30),
    team VARCHAR(3),
    position VARCHAR(2),
    jersey_number INT,
    salary DECIMAL,
    birthdate DATETIME)

Create Pandas DataFrame

To create a DataFrame to hold the same information in pandas, all I need to do is define the column names and create a DataFrame using those column names:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)

This creates an empty DataFrame:

Empty DataFrame
Columns: [FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate]

Specifying data types

Now you probably noticed I didn’t specify the datatypes for my columns when I created the dataframe. You can display the datatypes for your DataFrame using dtypes:

players.dtypes

which returns:

FirstName object
LastName object
Team object
Position object
JerseyNumber object
Salary object
Birthdate object

Any column with a datatype of string will be listed as object because in Python a string is actually a sequence of characters, which does allow you some neat slicing to retrieve substrings.

You can modify the datatypes of the columns after the DataFrame is created using astype:

players = players.astype({'FirstName':str,
                          'LastName':str,
                          'Team':str,
                          'Position':str,
                          'JerseyNumber':int,
                          'Salary':float,
                          'Birthdate':np.datetime64})

Now I have the datatypes:

FirstName object
LastName object
Team object
Position object
JerseyNumber int32
Salary float64
Birthdate datetime64[ns]

Wondering what datatypes are available? Chris Moffitt wrote a nice post summarizing Pandas data types. You can also refer to the official Pandas documentation on dtypes.

If you are wondering why you can’t specify datatypes for each column when a DataFrame is created, that’s because, unlike when you work with database tables, you usually create DataFrames from a dataset and the datatypes are inferred from the data. The DataFrame constructor does accept a dtype argument, but you can only use it to specify a single datatype for all columns in the DataFrame; you cannot specify different datatypes for each column.

Converting all columns to a single datatype

If all your columns will be the same datatype, you can use astype to convert all columns to a new datatype:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)
players = players.astype(int)

would give me a DataFrame with every column as an integer:

FirstName int32
LastName int32
Team int32
Position int32
JerseyNumber int32
Salary int32
Birthdate int32

Converting one column to a new datatype

You can also convert a single column in a DataFrame to a new datatype using to_datetime, to_numeric, or to_timedelta.

NOTE: Timedelta is a datatype in Python used to represent a gap between two dates or times.

The following code shows how I could use these functions to convert the columns that are not storing strings:

column_names = ['FirstName',
                'LastName',
                'Team',
                'Position',
                'JerseyNumber',
                'Salary',
                'Birthdate']
players = pd.DataFrame(columns=column_names)
players['Birthdate'] = pd.to_datetime(players['Birthdate'])
players['JerseyNumber'] = pd.to_numeric(players['JerseyNumber'], downcast='integer')
players['Salary'] = pd.to_numeric(players['Salary'], downcast='float')

Would give me a DataFrame with the data types:

FirstName object
LastName object
Team object
Position object
JerseyNumber int8
Salary float32
Birthdate datetime64[ns]

Note that to_numeric accepts an optional downcast parameter to request the smallest numeric datatype of the specified kind; that is how JerseyNumber ended up as int8 and Salary as float32.
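The post mentions to_timedelta but doesn’t demonstrate it, so here is a small sketch: subtracting two datetimes produces a timedelta, and to_timedelta converts strings or numbers into one (using two of the birthdates from the examples above):

```python
import pandas as pd

birthdates = pd.to_datetime(pd.Series(['1984-07-11', '1997-01-13']))

# Subtracting two datetimes yields a Timedelta
gap = birthdates[1] - birthdates[0]
print(gap.days)

# to_timedelta converts a string into a Timedelta you can add to dates
week = pd.to_timedelta('7 days')
print(birthdates + week)
```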

Coming next

In Part 2 you learn the equivalent of the INSERT statement and how to add data to your DataFrame.