How I Learned to Stop Worrying and Love JSON
Oftentimes, I find myself in situations where I need to store complex nested metadata in a database. This can be a daunting task, as the first thing that comes to mind is having to create a host of new fact and dimension tables. However, this level of complexity can often be condensed into a single column by using the JSON
or JSONB
datatypes.
Note: This post contains code specific to Postgres and Python, however these concepts can be applied to a variety of database frameworks.
JSON and JSONB
JSON
stands for JavaScript Object Notation, and is a human readable text based object derived from Javascript. A JSON file looks like this.
{"First Name": "Matt",
"Last Name": "Guan",
"Age": "25"}
This is pretty much the same thing as a Python dictionary. There is a name
(or key
) on the outer level, with a value
that can be accessed inside.
JSONB
or JSON Blob is a unique Postgres datatype that is almost identical to JSON. The only real difference is that JSONB is stored in a decomposed binary format vs. text, which processes faster. Postgres also has a lot of built in stored procedures that work specifically with JSONB.
JSON and JSONB both have unique functions and operators, that allow for a lot of flexibility in both updating and calling data. These formats are supported with Postgres version 9.3 and above.
Using Postgres in Python
Throughout this post, I will be writing a lot of queries in pure SQL. These can be wrapped as a multi-line string in Python and be executed by arguments within either Pandas or your database cursor.
For executing simple postgres queries, I usually just use the default psycogs2
cursor object. However, sqlalchemy
is a little more user friendly for inserting pandas
dataframes directly into your database. Below is the syntax for both.
import psycopg2
conn = psycopg2.connect("dbname=test, user=postgres")
cur = conn.cursor()
s='''
INSERT INTO important_data (id, data)
VALUES (%(id)s, %(data)s);
'''
cur.execute(s, {'id':1, 'data':'Trade Secrets'})
import sqlalchemy
import pandas as pd
config = pd.DataFrame({'id':[1], 'data':['Trade Secrets']})
db_str = f'postgresql://{user}@{host}:{port}/{database}'
db = sqlalchemy.create_engine(db_str)
with db.connect() as conn:
config.to_sql(name='important_data', con=conn, if_exists='append', index=False)
For reading SQL in Python, the easiest method is to use the pandas
read_sql_query
argument and pass in the database connection. (The psycopg2 cursor will return output in a hard to use nested tuple format)
import pandas as pd
s='''
SELECT *
FROM important_data
WHERE id=%(id)s
'''
pd.read_sql_query(s, conn, params={'id':1})
Creating a JSON Field
Creating a table with a JSON field is relatively simple - just use the built in JSON or JSONB datatype. The one you choose doesn’t really matter. JSONB is the preferred choice, but if you already created a table with a JSON column, there is no need to recreate it as Postgres has a built in stored procedure to convert JSON to JSONB (more on that later).
CREATE TABLE user_metadata
(userId INT PRIMARY KEY,
metadata JSONB,
publish_time TIMESTAMP DEFAULT current_timestamp);
Generating Valid JSON in Python
Creating your JSON object to insert into Postgres is pretty easy. Just create a Python dictionary of any level of complexity, and then use the dumps
argument from the json
module to convert it to a string of valid JSON.
Below is an example of creating a JSON file from a nested dictionary. I imagine that this is a small subset of what actually exists in some database at Google.
Note that we can condense all of this data into one column! In a typical relational database, this could potentially span up to 5+ tables across various schemas.
import json
metadata_dict = {
'name': {'first': 'Matt',
'last': 'Guan'},
'demo': {'ethnicity': 'asian',
'age': '25',
'education': ['high school', 'bachelors']},
'interests': {'lifestyle': ['fitness', 'healthy living', 'travel', 'slick deals'],
'hobbies': ['board games', 'live music', 'movies', 'anime betrayals']}
}
json_str = json.dumps(metadata_dict)
Updating Your JSON Column
So now we’ve generated created our table, and populated our JSON column. So how do we easily update it? Luckily there is a lot of built in functionality for this.
The easiest option is just to pull the entire JSON column into Python using pandas
, update the entire JSON, and load it back in. This is super easy, but can get cumbersome for more complex operations.
Below is an example of inserting an entire JSON object back into the user_metadata
table using the primary key of user_id
along with publish_time
to ensure that I am only overwriting one entry.
UPDATE user_metadata
SET metadata = %(insert_json)s::json
WHERE userId = %(user_id)
AND
publish_time = %(time)s
We can also update an element nested within a JSONB using jsonb_set
- a built in Postgres function with the following parameters - jsonb_set(target, path, new_value, create_missing)
'''
Parameters
----------
target: jsonb
the JSONB that will be modified
path: text
Indicates the location of the field to update
new_value: jsonb
the new JSONB that you want to insert
create_missing: bool
If TRUE, creates that path within the target JSONB
if it did not exist before
'''
This function only works on JSONB
columns though! But like I said earlier, it’s fine if your column is in JSON
- you can just use another built in stored procedure to_jsonb
to temporarily convert your column.
This method can be useful for modifying multiple elements at once that reside deep within the JSONB, that would otherwise be a pain to parse out.
Below is the code to update only the interests
dictionary. Note that in this example, I am using the to_jsonb
stored procedure and then converting the column back to JSON using ::json
. You would ignore these if your column was already JSONB.
UPDATE user_metadata
SET metadata = jsonb_set(to_jsonb(metadata), '{interests}', %(insert_dict)s, false)::json
WHERE userId = %(user_id)
AND
publish_time = %(time)s
We can even take this a step further and just modify just the array within the nested dictionary by extending the target
argument.
UPDATE user_metadata
SET metadata = jsonb_set(to_jsonb(metadata), '{interests, hobbies}', %(insert_array)s, false)::json
WHERE userId = %(user_id)
AND
publish_time = %(time)s
Querying JSON
Lastly, there is built in functionality to query nested JSON data directly! Namely, ->
in addition to a key
within the JSON will return the nested value
. ->>
does the exact same thing, but returns the value as text. I usually just use ->>
for the final call if I’m trying to get to a deeply nested element.
Below is a query I would use to return the entire nested dictionary under the demo
key.
SELECT metadata->>'demo' as demographics_dict
FROM user_metadata
WHERE user_id=%(user_id)s
If I want to take it a step further and just extract the age component, I would simply add a ->
SELECT metadata->'demo'->>'age' as age
FROM user_metadata
WHERE user_id=%(user_id)s
Lastly, we can filter an entire table on a nested field by using the ->
notation with the WHERE
statement. Just take a second to think about how useful this is. If you have an extremely robust metadata column, you can condense a wall of SQL code full of joins and subqueries into just around 3 lines of code!
SELECT *
FROM user_metadata
WHERE metadata->'demo'->>'ethnicity' IN ('asian', 'white')
That should be enough to get you started on building out tables in Postgres with JSON. I hope this was helpful. Thanks for reading!
Leave a Comment
Your email address will not be published. Required fields are marked *