Posts Tagged ‘INSERT’

Pandas for SQL Lovers Part 2: INSERT / Populating a DataFrame

SQLPanda2Know SQL and trying to learn Python Pandas for data science? Many concepts are the same. In this series I explain Pandas using SQL as a reference point.

In this post I will explain how to populate a Pandas DataFrame. If you are not familiar with how to create a Pandas DataFrame check out Part 1 Creating DataFrames

INSERT INTO

In SQL if you want to insert rows into a table you use the INSERT statement. You can insert a single row


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11')

OR you can insert multiple rows with a single INSERT statement


INSERT INTO players
(FirstName, LastName, Team, Position, JerseyNumber, Salary, Birthdate)
VALUES
('Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'),
('Connor','McDavid','EDM','C','97,925000.00,'1997-01-13')

Populating a DataFrame when created

When you create your DataFrame, you can provide data and populate it immediately.


column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data, columns=column_names)

This creates the following DataFrame

PlayersDataFrame

Inferred Datatypes

The advantage to populating the DataFrame when it is created, is that Pandas will infer the datatypes based on the data. If I run the command:


players.dtypes

I can see the DataFrame assigned integer and float datatypes to JerseyNumber and Salary. All the other columns are strings (because strings are stored as a sequence the datatype displayed is object):

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate object

Explicit Datatypes

If you want BirthDate to be a date, datatype you will need to convert it explicitly. The line of code below uses to_datetime to convert the Birthdate column to a datetime:

players['Birthdate']= pd.to_datetime(players['Birthdate'])

Now the Birthdate column stores the datatype datetime:

FirstName object
LastName object
Team object
Position object
JerseyNumber int64
Salary float64
Birthdate datetime64[ns]

You can find more details on how to assign datatypes explicitly in the Part 1 of this series: how to create DataFrames.

Indexes in DataFrames

You may have noticed that I did not have any sort of ‘playerid’ value for the rows I inserted. But you can see a number beside each row.  This column is called the index. Pandas will automatically create an index for each row in the DataFrame.

PlayersDataFrame

Setting your own column as index

If you want to use your own column for the index, you can use set_index. The example below creates a DataFrame with a PlayerId and then users set_index to make PlayerId the index column.

column_names = ['PlayerId',
'FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [[1,'Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
[2,'Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
[3, 'Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
[4, 'Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players  = pd.DataFrame(data,columns=column_names)

players.set_index(‘PlayerId’, inplace=True)

This produces a DataFrame with PlayerId as the Index column

PlayerIdIndex

Using non numeric columns as an index

You are not limited to numeric fields as indexes, you can use any field as your index:

 players.set_index('LastName', inplace=True)

LastNameIndex

Duplicate values in index columns

Unlike an Primary Key in a database, the index on a DataFrame will allow duplicate values. If you decide to use LastName as your index column and you have the Henrik & Daniel Sedin in your DataFrame you will see duplicate indexes.

duplicateIndex

Adding rows to an existing DataFrame

If you want to add rows to a DataFrame after it is created use append. In the code below let’s recreate the populated DataFrame with the autogenerated index:

column_names = ['FirstName',
'LastName',
'Team',
'Position',
'JerseyNumber',
'Salary',
'Birthdate']
data = [['Joe','Pavelski','SJ','C',8,6000000.00,'1984-07-11'],
['Connor','McDavid','EDM','C',97,925000.00,'1997-01-13'],
['Sidney','Crosby','PIT','C',87,8700000.00,'1987-08-07'],
['Carey','Price','MTL','G',31,10500000.00,'1987-08-16']]
players = pd.DataFrame(data, columns=column_names)

PlayersDataFrame

Now you can use append to add a row. A couple of things to be aware of:

  • Using append creates a new DataFrame with the added row, if you want to append the row to your existing DataFrame you need to assign the result of the append to your original DataFrame.
  • You must specify ignore_index = True when you are providing explicit values to use for the new row being appended

players = players.append({'FirstName':'Erik',
'LastName':'Karlsson',
'Team':'SJ',
'Position':'D',
'JerseyNumber':65,
'Salary':11500000.00,
'Birthdate':'1990-05-31'},
ignore_index=True)

AppendedOneRow

Summary

Now you can add rows to your DataFrame. In upcoming posts we will look at how to populate your DataFrame from a CSV file or from a database table.

How to Move Data from One Table to Another

I recently saw a post on Stack Overflow asking how do you copy specific table data from one database to another in SQL Server. It struck me I should share the solution to this with others because it is such a handy trick. Often I set up test data and want to quickly copy it to another table, or a co-worked wants a copy of my data, or I want to copy some data from production to a local database.

If all you want to do is copy data from one table to another in the same database, just use the INSERT clause on the SELECT statement.

INSERT INTO PlayerBackups
SELECT * FROM NhlPlayer

If you do not have a second table and you want to make a quick and dirty backup of some test data, you can create a table based on the data you choose in your select statement.

SELECT * INTO PlayerBackups
FROM NhlPlayer

If you want to move data between tables across databases, you will have to use a fully qualified name

INSERT INTO YourDatabase.YourSchema.PlayerBackups
SELECT * FROM MyDatabase.MySchema.NhlPlayer

If you want to move data across servers, you will need to set up a linked server, this will require working with the DBA because there are authentication issues around linked servers (how will your account log in to the other server? what permissions will you have on the other server?) Once you have a linked server set up, then you can just use the fully qualified name to specify the server name.

INSERT INTO YourServer.YourDatabase.YourSchema.PlayerBackups
SELECT * FROM MyServer.MyDatabase.MySchema.NhlPlayer

I am always forgetting the syntax for these commands, so thought I would share them. Don’t forget, if you know SQL, you know SQL Azure! Try it out now