Move data to a Database with Python: SQLAlchemy
When analyzing data, a common task is moving data from a source to a database. However, external data isn't always straightforward, and sometimes you need a tool with flexibility and granular control over the process. Python is an excellent choice for this kind of task.
In the first article of this series, we looked at how to move data to a Database with just Python. In this article, we'll look at another method using SQLAlchemy, a popular Python library for working with databases.
SQLAlchemy is an excellent choice of ETL tool because it abstracts away a lot of the database-specific code and lets you work just in Python.
The article is not intended to be a definitive tutorial on SQLAlchemy but aims to show how you can take advantage of its features to move data into a database.
Prerequisites:
In writing this, I'm assuming a couple of things:
- Familiarity with Python, installing packages, importing them into your code, and running a Python script
- Familiarity with SQL and working with databases.
Setting up
For this tutorial, we'll be using the New York City Airbnb Open Data from Kaggle's excellent public dataset repository.
Each row of the CSV represents a listing on Airbnb and includes a range of data types. I've chosen this dataset because its typical of the kinds of datasets encountered in the real world.
You can find the CSV file and final code in the accompanying repository for this tutorial on my GitHub.
Let's get started.
ORMs
Before using SQLAlchemy, we need to cover off a related concept: ORMs.
What's an ORM?
Languages like Python are Object-Oriented, a programming paradigm where code is organized into 'Objects' that have data (attributes) and methods (functions or behavior). However, in a relational database, data exists in tables with columns and rows. To bridge across these two different systems, we use an ORM or Object Relational Mapper.
ORMs allow us to map Objects in our code to tables and columns in a database. Objects can be stored (persisted) to and loaded from the database.
As you'll see later, we'll create Python Objects from our CSV data and use SQLAlchemy to store them. We won't need to write any SQL to do this or worry about database specifics. SQLAlchemy will handle this for us.
Now, onto the code.
Creating a Database Engine
Start by creating a new file. I've called mine sa.py
When working with SQLAlchemy, the starting point is the Engine
. The Engine gives us database connectivity and related functionality.
We create an Engine
through the create_engine
function passing in a database URL (akin to a connection string).
from sqlalchemy import create_engine
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
The optional echo
flag sets up SQLAlchemy logging. We'll enable it to see the SQL statements produced by SQLAlchemy.
While the Engine
is our source of database connectivity, we have not yet connected or interacted with the database.
Setting up a mapping
To enable SQLAlchemy to move our data, we need to declare a mapping between our data and a database table. We do this with SQLAlchemy's Declarative system.
To create a mapping, we first define a Base
Class. We will associate our mapped Classes with the Base
. We generate the Base
Class through the declaritive_base
function.
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
Now we have what we need to create a mapped class.
Create a mapped class
The work of mapping our data to a table occurs with a mapped Class. This Class describes the table we'll be mapping to, its name, columns, and data types.
The mapped Class has a couple of basic requirements. It must inherit from our Base
Class, have a __tablename__
attribute (the table to map to), and a primary key column.
Let's add this Class to our file. We'll call it Listing
, to indicate what each row represents.
We create the table's columns and data types by defining attributes on Listing
and assigning each one a new Column
object, passing in the respective data type.
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
class Listing(Base):
__tablename__ = 'listings_sqlalchemy'
id = Column(Integer, primary_key=True)
name = Column(String(200))
host_id = Column(Integer)
host_name = Column(String(50))
neighbourhood_group = Column(String(20))
neighbourhood = Column(String(20))
latitude = Column(Float)
longitude = Column(Float)
room_type = Column(String(20))
price = Column(Integer)
minimum_nights = Column(Integer)
number_of_reviews = Column(Integer)
last_review = Column(Date, nullable=True)
reviews_per_month = Column(Integer)
calculated_host_listings_count = Column(Integer)
availability_365 = Column(Integer)
Because SQLAlchemy is mapping an Object to a row in a table, it expects that each mapped Class defines a primary key column. Without a primary key, SQLAlchemy has no way of knowing which Objects map to which rows.
Later you'll see how we can create rows in our table by creating objects from Listing
. Let's look at how we can leverage the Listing
Class to create a table in the database.
Create the table on the database
Recall from above that a mapped Class was required to inherit from the Base
Class. When we did this, our mapped Class got placed into a registry on the Base
Class called metadata.
To create the table defined by our mapped Class we call the Base.metadata.create_all
function passing in our database engine
.
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
class Listing(Base):
__tablename__ = 'listings_sqlalchemy'
id = Column(Integer, primary_key=True)
name = Column(String(200))
host_id = Column(Integer)
host_name = Column(String(50))
neighbourhood_group = Column(String(20))
neighbourhood = Column(String(20))
latitude = Column(Float)
longitude = Column(Float)
room_type = Column(String(20))
price = Column(Integer)
minimum_nights = Column(Integer)
number_of_reviews = Column(Integer)
last_review = Column(Date, nullable=True)
reviews_per_month = Column(Integer)
calculated_host_listings_count = Column(Integer)
availability_365 = Column(Integer)
Base.metadata.create_all(engine)
Let's stop at this point and execute the script. Watch what happens. You should see some SQL statements printed to your terminal. If you inspect the database, you should see a new table named listings_sqlalchemy
with column names corresponding to those on the Listing
Class.
Pretty neat, right? We haven't had to write any SQL to make this happen. We described the table to SQLAlchemy, and it handled the rest.
Everything is now in place to start moving our data across.
Insert the data
To insert data into the database, we'll need to do the following:
- Define a database session
- Loop through each line of the CSV and turn its contents into a
Listing
Object - Pass a list of those objects to a database session to insert them into the database table.
Define a database session
To store the data, we'll need a database session. The session handles the specifics of talking to the database and inserting data.
To define a session, we use the sessionmaker
method and bind our engine
object to it.
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
class Listing(Base):
# ...
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
Although we have our Session
defined and ready to go, we haven't made any connections to the database yet.
Create mapped Objects
We can now start creating mapped Objects from our CSV data.
To do this, we first open the CSV file using the with
statement (context manager).
To read the CSV contents, we use csv.DictReader
. The DictReader
returns each row as a dictionary with key names from the header row. By using dictionaries, we'll also be able to use the dictionary unpacking operator when creating our objects.
With a list comprehension, we loop through each row of the csvreader
. The function prepare_listing
performs some light transformation of the data and returns a Listing
Object made from our previously defined mapped Class.
import csv
from dateutil.parser import parse
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
class Listing(Base):
# ...
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
def parse_none(dt):
try:
return parse(dt)
except:
return None
def prepare_listing(row):
row["last_review"] = parse_none(row["last_review"])
return Listing(**row)
with open('AB_NYC_2019.csv', encoding='utf-8', newline='') as csv_file:
csvreader = csv.DictReader(csv_file, quotechar='"')
listings = [prepare_listing(row) for row in csvreader]
Store in the database
Now we have a list of mapped Objects we have everything we need to insert the data in the database.
We do this by creating a new Session
object from the Session
class defined earlier.
Because our list contains thousands of records, we'll use the session.add_all method
, one of the bulk insert options available in SQLAlchemy. The add_all
method will perform batches of inserts rather than inserting one at a time.
import csv
from dateutil.parser import parse
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///ab_nyc.sqlite3', echo=True)
Base = declarative_base()
class Listing(Base):
# ...
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
def parse_none(dt):
try:
return parse(dt)
except:
return None
def prepare_listing(row):
row["last_review"] = parse_none(row["last_review"])
return Listing(**row)
with open('AB_NYC_2019.csv', encoding='utf-8', newline='') as csv_file:
csvreader = csv.DictReader(csv_file, quotechar='"')
listings = [prepare_listing(row) for row in csvreader]
session = Session()
session.add_all(listings)
session.commit()
To complete the database transaction and store the data, we call session.commit
Lastly, all that's left to do is run our file and inspect the results in our database browser. You should see the data added as rows in the table.
Conclusion
There we have it. You've successfully moved data to a database with SQLAlchemy.
The neat thing about this method is we didn't have to write any SQL and kept everything inside Python. And it didn't even matter that SQLite only supports limited data types, SQLAlchemy took care of this for us.
Check back soon for the final article in this series, where we'll take a look at the most straightforward method for moving data to a database - with pandas.