Data Normalization Tips: How to Weave Together Public Datasets to Make Sense of the World
Public datasets can help us gain insight into our business and our world. Combining public datasets, either together or with our own data, often requires a series of steps to clean up (or “normalize”) the data. This blog post walks you through some data normalization techniques.
I enjoy using public datasets, combining them with metrics and business data, and seeing if there are any correlations. It’s kind of a personal hobby. I wrote a blog post about analyzing COVID-19 data in that vein.
My projects usually aim to combine data from one source with data from another source to learn something interesting. However, one problem you are likely to encounter is normalizing your data.
Formally, data normalization is the process of organizing data to reduce redundancy and improve data integrity. It typically involves dividing a database into smaller, related tables and defining relationships between them, following rules known as normal forms to help ensure consistent and efficient data storage and retrieval. In practice, when combining datasets, the term is often used more loosely: cleaning and standardizing values (such as date formats) so that records from different sources can be joined. That looser sense is the one we’ll use here.
Normalizing your data can be especially challenging in the realm of publicly available, open datasets. In my latest project, I wanted to analyze two public datasets: the San Francisco police incident database and the National Oceanic and Atmospheric Administration (NOAA) historic weather database (for the San Francisco area). There were two challenges:
- The format for a “date” in each dataset was different, making it difficult to join the information for analysis. For example, if I wanted to know how many car prowls took place on occasions with more than 1 cm of rain, I’d need to join both databases on their date columns: `incident_date` and `weather_date`, respectively. In the police database, dates are formatted as `yyyy-mm-dd`, while in the weather database, it’s `mm/dd/yyyy`.
- The weather data had numerous gaps, which made time-series graphs difficult to chart.
Let’s walk through how to solve both (all too common) issues.
Data Normalization: Cleaning Up Your Data Format
The data normalization technique you use to clean data depends on how frequently you want to update your dataset and run your analysis.
If you’re looking to download your data once before importing and using it, then you can use anything from Excel (for smaller datasets) to Unix tools like `awk`.
First, download your data in CSV format (step 0). In Excel, open your CSV file, highlight a column, and in the Ribbon (top-nav menu), open the ‘Number Format’ drop-down (it usually says “General”) and choose ‘More Number Formats…’. Select ‘Custom’ and, in the ‘Type:’ field, enter your preferred date format.
In this case, we want to reformat the weather data to match the police dataset. Thus, we want to reformat the date field to `yyyy-mm-dd`, which we can enter directly. Press ‘OK’ and your column is now formatted however you’d like.
While Excel works for moderately sized datasets, if you have a large dataset, a dataset you want to update frequently, or data that you retrieve programmatically at periodic intervals, you need something that you can run on the command line and/or automate.
For this, we’ll use `awk`, a programming language designed for manipulating text-based data. `awk` is ideal for formatting dates in CSV files. This is the `awk` command for reformatting a date from `mm/dd/yyyy` into `yyyy-mm-dd` format:
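(A sketch of the command, reconstructed from the dissection that follows; the sample input row is invented purely so the example runs standalone, and only the position of the date field matters.)

```shell
# An illustrative input row: the 7th comma-separated field holds the date.
echo 'f1,f2,f3,f4,f5,f6,01/06/2018,f8' > Weather_SFO.csv

# Split field 7 on "/" and reassemble it as yyyy-mm-dd;
# -F sets the input field separator, OFS the output separator.
awk -F ',' '{ split($7, a, "/"); $7 = a[3] "-" a[1] "-" a[2]; print }' OFS=',' Weather_SFO.csv
```

In a real run, `Weather_SFO.csv` is the NOAA file you downloaded, and the `echo` line is unnecessary.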
Let’s dissect this command:
- `-F` is the field separator. In this case, we have a comma-separated file, so the parameter for this argument is a comma (i.e., a comma separates each of our fields).
- `split`: In our original comma-separated file, the 7th item in each row is the date we are trying to reformat. Therefore, we’ll split the 7th (`$7`) field into strings, based on the forward slash character, and store each part in an array called `a`.
- We will then rearrange the date format to be the third (`a[3]`), first (`a[1]`), and second (`a[2]`) elements in the array, separated by dashes.
- We’ll continue processing the file and, in the output (`OFS`), use commas to separate them.
- And the file we will work on is `Weather_SFO.csv`.
Once you run the `awk` command, you’ll see the output in your Terminal, which you can then pipe into another file.
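For example, redirecting the output into a new file (the output file name is just a suggestion, and the first two lines only fabricate a sample row so the sketch runs standalone):

```shell
# An illustrative input row, standing in for the downloaded NOAA file.
echo 'f1,f2,f3,f4,f5,f6,01/06/2018,f8' > Weather_SFO.csv

# Redirect the reformatted rows into a new file instead of the Terminal.
awk -F ',' '{ split($7, a, "/"); $7 = a[3] "-" a[1] "-" a[2]; print }' OFS=',' Weather_SFO.csv > Weather_SFO_normalized.csv
```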
Filling in Data Gaps (NULL Values)
When we use public data, we often see “gaps” or times when the dataset doesn’t show any reported values for a given area, time, or other variable. This can be due to faulty sensors, systems maintenance or outage, human error in creating the tables, or a slew of other reasons.
For example, in the San Francisco weather data we’ve been using as an example in this blog post, we see a NULL in the second row (mid-temperature on January 6, 2018):
If you’re trying to visualize data, this presents a problem: we need values for each row and can’t have instances where there isn’t any data to visualize. We see this with monitoring metrics as well; to view systems performance, we need to plot values at each desired interval, or our trend lines and graphs look faulty.
Fortunately, this is an area where TimescaleDB excels (no pun intended), with capabilities like gap filling and `time_bucket` (a more powerful version of SQL’s `date_trunc`). By gap-filling the information in our weather dataset, we fill in every NULL value with the most recent non-NULL value for that sensor by date. We call this “last observation carried forward,” or `locf`.
To do so, we use the Timescale `time_bucket_gapfill` function:
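(A sketch of such a query; the `weather` table and its `weather_date` and `temp` columns are assumed names, and the date range is illustrative.)

```sql
SELECT
  time_bucket_gapfill('1 day', weather_date) AS day,
  locf(avg(temp)) AS temp
FROM weather
WHERE weather_date >= '2018-01-01' AND weather_date < '2018-02-01'
GROUP BY day
ORDER BY day;
```

Note that `time_bucket_gapfill` needs time bounds, either in the `WHERE` clause as here or as explicit start/finish arguments, so it knows which buckets to generate.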
You can read more about `time_bucket_gapfill` in our documentation and blog post.
Now, when we run the query, we can see that our NULL value in the second row is replaced with the temperature value from the previous day:
You can also use averages or your own computation to perform your gap-filling operation. Consult the Timescale documentation for more ideas based on your use case.
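For instance, swapping `locf()` for TimescaleDB’s `interpolate()` fills each gap by linear interpolation between the neighboring values instead (the `weather` table and column names are, again, assumptions for illustration):

```sql
SELECT
  time_bucket_gapfill('1 day', weather_date) AS day,
  interpolate(avg(temp)) AS temp
FROM weather
WHERE weather_date >= '2018-01-01' AND weather_date < '2018-02-01'
GROUP BY day
ORDER BY day;
```

Interpolation tends to suit smoothly varying signals like temperature, while `locf` suits values that hold steady until a new reading arrives.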
Conclusion
Public datasets are amazing. Whether you combine them with other public datasets or with your own proprietary data, they give you a wealth of resources to query, analyze, and use to better understand your world (whatever that means to you).
Hopefully, this post has given you some ideas and ways to overcome a few data normalization obstacles, so you can power through your data preparation and get to the best part: analysis, visualization, and sharing your findings with others.
And if you need a high-performance database to power those, one of the benefits of using a PostgreSQL time-series database like TimescaleDB is the ability to combine analytics and metrics (this is inherently time-series data, as we care about the specific time we collected each value) with other relational data.
For example, you may want to combine historic weather data with business performance to identify correlations that could predict future business results. You’d need a high-performance time-series database that also supports the traditional relational data for your business (for example, products, customers, and orders). Voilà: Timescale! Combining your external metadata with your time-series data is just a SQL statement. Want to try it for yourself? Create a free Timescale account today.