Posts Tagged ‘code sample’

Pandas for SQL Lovers part 4: Handling Nulls read from CSV

pandaSQL4In part 3 of the series I covered how to load a CSV file into a Pandas DataFrame. In this post I’ll focus on how to deal with NULL or missing values read from CSV files.

Previously in this series:

If you work with SQL you have probably had your share of challenges caused by missing/blank and NULL values. Let’s see how read_csv helps us manage these troublemakers when we populate a DataFrame from a csv file.

Let’s start with the following data in a CSV file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney,Crosby,NULL,C,87,8700000,1987-08-07
Carey,Price,Unknown,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,NA,22,,1980-09-26
Henrik,Sedin,VAN,N/A,33,,1980-09-26

I have deliberately provided a variety of values that can be construed as missing values

  • Sidney Crosby: Team is NULL
  • Carey Price” Team is Unknown
  • Daniel Sedin: Position is NA and salary is not provided
  • Henrik Sedin: Position is N/A and salary is not provided

Leave NULL or missing values untouched

If you specify na_filter=false then read_csv will read in all values exactly as they are:


players = pd.read_csv('HockeyPlayersNulls.csv',na_filter=False)

returns:
Dataset with missing and null values highlighted

Replace default missing values with NaN

In Pandas, the equivalent of NULL is NaN. By default, read_csv will replace blanks, NULL, NA, and N/A with NaN:


players = pd.read_csv('HockeyPlayersNulls.csv')

returns:

loaded data frame with blanks and missing values replaced by NaN, the word Unkown was not recognized as null

You can see that most of the ‘missing’ values in my csv files are replaced by NaN, except the value ‘Unknown’ which was not recognized as a missing value.

How to designate values as missing

If there are values in your data which are not recognized as missing, you can use the na_values parameter to specify values you want treated as missing:


players = pd.read_csv('HockeyPlayersNulls.csv',na_values=['Unknown'])

returns:

DataFrame populated with missing values replaced by NaN

Handling blank lines

By default if a blank line is encountered in the CSV file, it is skipped.

So if you have the following file:
FirstName,LastName,Team,Position,JerseyNumber,Salary
Joe,Pavelski,SJ,C,8,6000000
Connor,McDavid,EDM,C,97,925000

Sidney ,Crosby,PIT,C,87,8700000
Carey,Price,MTL,G,31,10500000
Daniel,Sedin,VAN,LW,22,
Henrik,Sedin,VAN,C,33,

You get the following DataFrame:
Data Frame populated with all rows , there is no row where the blank line in the file appeared
If you want the blank line to appear you can specify skip_blank_lines=False
players = pd.read_csv('HockeyPlayersBlankLines.csv', skip_blank_lines=False)

returns:

DataFrame containing a row whose values all show as NaN

Next up

We still need to look at how to control datatypes and how to deal with Dates when using read_csv to populate a DataFrame. Stay tuned!

 

Pandas for SQL lovers – Part 3 Reading a CSV file / BULK INSERT

Panda and heart with the word SQLIn this post I will explain how to populate a Pandas DataFrame from a CSV file. If you are not familiar with how to create a Pandas DataFrame check out Part 1 Creating DataFrames. 

If you have ever used BULK INSERT to read a CSV file into a database table, you are going to fall in love with the functionality for reading a csv file into a Pandas DataFrame!

Let’s start with how to:

  • Read a file
  • Handle error rows
  • Specify the index column
  • Specify column delimiters
  • Specify column headers
  • Select a subset of rows or columns

We load a csv file into a Pandas dataframe using read_csv

read_csv

As usual the first thing we need to do is import the numpy and pandas libraries


import pandas as pd
import numpy as np

I have the following csv file:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,McDavid,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10500000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

Use the read_csv method to load a comma separated file into a DataFrame:

players = pd.read_csv('HockeyPlayers.csv')

Creates the following DataFrame:

HockeyPlayerData

A few things to note:

  • Column names were read by default from the first row in the file
  • An index column was added numbering each row
  • The Salary for Daniel & Henrik Sedin was blank in the original file and appears as NaN (we will talk more about handling blanks and nulls in a later post)

Handling invalid rows

By default invalid rows will raise an error, for example if my data has rows with too many values.

I have extra ‘,’ in Connor McDavid’s name and in Carey Price salary:

FirstName,LastName,Team,Position,JerseyNumber,Salary,Birthdate
Joe,Pavelski,SJ,C,8,6000000,1984-07-11
Connor,Mc,David,EDM,C,97,925000,1997-01-13
Sidney ,Crosby,PIT,C,87,8700000,1987-08-07
Carey,Price,MTL,G,31,10,500,000,1987-08-16
Daniel,Sedin,VAN,LW,22,,1980-09-26
Henrik,Sedin,VAN,C,33,,1980-09-26

If I try to load the data I get an error:

ParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 8

If you specify error_bad_lines=False, invalid rows are skipped:


players = pd.read_csv('HockeyPlayersInvalidRows.csv',error_bad_lines=False)

InvalidRowsSkipped

In Jupyter Notebooks you will see a message informing you lines were skipped:

'Skipping line 3: expected 7 fields, saw 8\nSkipping line 5: expected 7 fields, saw 9

Specifying your own index column

As we saw in part 2 of the series populating a DataFrame, every DataFrame has an index column, but if you do not want an index column created, you can specify a column in your csv file to use as the index column. The index column does not need to be a numeric value.

NOTE: unlike a PRIMARY KEY in a database table, an index column in a DataFrame can contain duplicate values.


players = pd.read_csv('HockeyPlayers.csv', index_col='LastName')

Would give me a DataFrame that uses LastName as the index column:

lastNameindexCol

You can specify a combination of columns to use as your index:


players = pd.read_csv('HockeyPlayers.csv', index_col=['LastName','FirstName'])

Will return the following DataFrame:

MultiColumnIndex

Column delimiters

The default column delimiter is a comma. If you use a different delimiter all the columns are read as a single column.

If I try to read a data file using semi-colons as column delimiters

FirstName;LastName;Team;Position;JerseyNumber;Salary;Birthdate
Joe;Pavelski;SJ;C;8;6000000;1984-07-11
Connor;McDavid;EDM;C;97;925000;1997-01-13
Sidney ;Crosby;PIT;C;87;8700000;1987-08-07
Carey;Price;MTL;G;31;10500000;1987-08-16
Daniel;Sedin;VAN;LW;22;;1980-09-26
Henrik;Sedin;VAN;C;33;;1980-09-26


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv')

I get the following DataFrame:

HockeyPlayerWrongDelimiter

Use the delimiter parameter to specify the correct delimiter character


players = pd.read_csv('HockeyPlayersSemiColonDelimiter.csv', delimiter=';')

returns the DataFrame:

semiColonDelimiter

Column names

If your file does not contain a row with column names, the first row by default will be treated as column headers, so you might end up with a DataFrame like this:

NoHeaderRow

If you specify header=None, columns will be identified by number:


players = pd.read_csv('HockeyPlayersNoHeaderRow.csv', header=None)

NumericColumnNames

You can specify column names using the names parameter

players = pd.read_csv('HockeyPlayersNoHeaderRow.csv',
names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will return the DataFrame:

SpecifyColumnNames

If your csv file contains column headers you can use the names parameter to rename the columns.


players = pd.read_csv('HockeyPlayers.csv',

names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

But be warned, if the first row of your file contains column names and you specify the names parameter, the first row is read as a row of data:

ColumnNamesReadAsData

You need to specify the header parameter and indicate the first row of the file is a header row and not a data row


players = pd.read_csv('HockeyPlayers.csv',
header=0,
names = ['First','Last','TeamCode','Pos','JerseyNbr','Salary','Birthdate'])

Will give you the desired DataFrame:

NewColumnNames

Reading a subset of the csv file

You can specify which columns to read from the file with the usecols parameter:


players = pd.read_csv('HockeyPlayers.csv',

   usecols=['LastName','FirstName','Team'])

SpecifyColumnsToInclude

Use skiprows to skip rows at the start of the file. Of course if you skip the row containing the column headers, you need to specify the column names as well.


players = pd.read_csv('HockeyPlayers.csv',
skiprows=3,
names=['FirstName','LastName','Team','Position','Salary','BirthDate'])

will give you the DataFrame:

SKipRows

Use skipfooter to skip rows at the bottom of the file. If you use skipfooter you must also specify the parameter engine=Python. read_csv supports a C, and a Python engine. The C engine is faster, but does not support all the features. The Python engine supports all the features of read_csv.

Now when I use the Python engine, I noticed in Jupyter Notebooks that the first column name gets weird characters appended to it.


players = pd.read_csv('HockeyPlayers.csv',


skipfooter=2,


engine='python')

MEssedUpColumnName

This might just be something odd with my file, or with my notebook, but I can fix it easily enough by just specifying my own column names and skipping the row containing the column names:


players = pd.read_csv('HockeyPlayers.csv',
skiprows=1,
skipfooter=2,
engine='python',
names=['FirstName','LastName','Team','Position','Salary','BirthDate'])

SkipFooter

If you want to read in a specific number of rows you can use the nrows parameter:


players = pd.read_csv('HockeyPlayers.csv', nrows=3)

nrows

If you have a very large data file you can also read it in chunks using the chunksize parameter and store each chunk separately for analysis or processing.


for playerchunk in pd.read_csv('HockeyPlayers.csv', chunksize=2):
print(playerchunk.shape)
print(playerchunk)

Chunks

Summary

Okay we’ve covered how to

    • Read a csv file
    • Control behaviour for error rows
    • Specify an index column
    • Specify column names
    • Specify which rows and columns to load

Still to come we will look at how you work with read_csv to control the column, handle nulls, and date handling. Stay tuned!