Intake is a Python package for loading, investigating, and organizing data. From the perspective of a data analyst, it has the following features:
- Loads data into containers you already use: NumPy arrays, Pandas/Dask Dataframes, etc.
- Reduces boilerplate code
- Facilitates reusable workflows
- Installs datasets as Python packages
- "Self-describing" data sources
- "Quick look" plotting
Below we'll import a built-in catalog of data sources from Intake and then list its contents.
from intake import cat
list(cat)
Each entry in the list above is a data source; most of them take parameters from the user to modify exactly what data gets loaded. The data sources pull data from remote SQL servers as well as from locally installed CSV files. Exactly where the data comes from is intentionally abstracted away from the data analyst, freeing them to focus on the workflows that use the data.
We'll later show how these data sources have methods for investigating their user parameters and descriptions, but for now let's quickly look at an example of loading some data.
Here we will load the total production histories (oil, gas, and water) for a well identified by the API number '33007000110000'. This command executes a SQL query against a database and returns a pandas DataFrame.
In these tutorials we try not to use packages that haven't been introduced yet, and we will cover pandas in detail later. However, because Intake is used to load data in essentially every tutorial, you'll have to excuse the premature use of pandas, as well as a few plotting utilities, which appear here only for the purpose of demonstrating Intake.
df = cat.production_by_api(api='33007000110000').read()
df.set_index(['date']).head()
The following code is required to reproduce the functionality we've demonstrated with a one-line Intake command. The steps include loading the required modules, reading environment variables, formatting the URL for the PostgreSQL database connection, formatting a SQL statement given an API number as an argument to the function query_api(), and finally using the pandas read_sql() function to load the data into a DataFrame object.
import os
import pandas as pd

# Database credentials are read from environment variables
username = os.environ['BAZEAN_POSTGRES_USERNAME']
password = os.environ['BAZEAN_POSTGRES_PASSWORD']

# SQLAlchemy-style connection URL for the PostgreSQL database
engine = 'postgresql://{}:{}@daytum-server.bazean.com:5432/daytum'.format(username, password)


def query_api(api):
    """
    Return a SQL statement to get the production history given an API number
    """
    return "SELECT date,volume_oil_formation_bbls," \
           "volume_gas_formation_mcf," \
           "volume_water_formation_bbls " \
           "FROM public.production_all WHERE api='{}'".format(api)

pd.read_sql(query_api(api='33007000140000'), engine, parse_dates=['date']).head(3)
Compare with
cat.production_by_api(api='33007000140000').read().head(3)
Sometimes you may have data that comes from different sources but would like to reuse a common workflow to analyze it. For example, if one set of data lives in a PostgreSQL database and another set is stored in CSV files, the data engineer can set up catalogs so that the data has the same structure, and the data analyst can use an identical workflow.
Below is a simple plotting workflow that takes an Intake catalog and an API number as arguments and formats a plot showing the oil and gas production histories for a well identified by the API number.
def my_fancy_plotting_workflow(catalog, api):
    """
    Create a production plot for the given catalog and API number; the data
    source must return a dataframe with a column labeled 'date'
    """
    dataframe = catalog.production_by_api(api=api).read()
    # Oil on the left axis, gas on a secondary axis on the right
    ax1 = dataframe.set_index(['date']).plot(y=['volume_oil_formation_bbls'])
    ax2 = dataframe.set_index(['date']).plot(y=['volume_gas_formation_mcf'], ax=ax1, secondary_y=True)
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Production (BBLS)')
    ax2.set_ylabel('Production (MCF)')
    # Collect the legend handles from both axes into a single legend
    handles = []
    for ax in [ax1, ax2]:
        for h, _ in zip(*ax.get_legend_handles_labels()):
            handles.append(h)
    ax1.legend(handles, ['Oil', 'Gas (right)'])
    return ax1
Now let's use the same catalog introduced earlier with our plotting function my_fancy_plotting_workflow().
my_fancy_plotting_workflow(cat, api='33007000110000');
Now we'll play data engineer momentarily. We'll read in the production data from a couple of wells and immediately write it out to CSV files stored in a local directory named datasets. The detailed syntax of the following is not important at the moment; just understand that we are reading data from the PostgreSQL database and then storing it locally.
cat.production_by_api(api='33007000110000').read().to_csv('datasets/production_33007000110000.csv', index=False)
cat.production_by_api(api='33007000140000').read().to_csv('datasets/production_33007000140000.csv', index=False)
We can verify the files were written by listing the files in the datasets directory.
ls datasets/production*
We can inspect the contents of one of the files by looking at its first 5 lines.
!head -n 5 datasets/production_33007000110000.csv
Now we'll use Intake's open_csv() function to read one of the CSV files back in as a new data source called csv_cat. This object has a method .yaml() that produces a description of the source using YAML syntax. We can then minimally edit this output to create a local catalog that we can use for reading in CSV files with an arbitrary API number.
import intake  # so far we've only imported cat; now we need the top-level module

csv_cat = intake.open_csv('datasets/production_33007000110000.csv', csv_kwargs={'parse_dates': ['date']})
print(csv_cat.yaml())
You can inspect the changes between the output above and the edited version below. We rename the data source "csv" to "production_by_api" and, using the replacement syntax, create a user parameter api that replaces the explicit API number above.
%%file "datasets/nd_production.yml"
sources:
  production_by_api:
    args:
      csv_kwargs:
        parse_dates:
        - date
      urlpath: '{{ CATALOG_DIR }}/production_{{ api }}.csv'
    description: 'Returns production history given an API number'
    driver: intake.source.csv.CSVSource
    metadata: {}
csv_cat = intake.open_catalog("datasets/nd_production.yml")
list(csv_cat)
my_fancy_plotting_workflow(csv_cat, api='33007000140000');
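As a quick sanity check (assuming the CSV files were written to the datasets directory as above), reading from the CSV-backed catalog returns the same kind of table we got from the database earlier.
# Same call signature as the database-backed catalog, but the data now comes
# from the local CSV file written above.
csv_cat.production_by_api(api='33007000140000').read().head(3)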
As noted in the feature list above, data sources can also be installed as Python packages. For example, with conda:
conda install -c daytum data
Here we use the package manager conda to install a package named "data" from the "daytum" channel. This is extremely useful from a data engineering perspective because you can "version control" data and take advantage of conda update, etc.
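As a rough sketch of how you might confirm that an installed data package was picked up, you can look for its entries in the builtin catalog; note that attributing the production_by_state source to this particular package is an assumption made here for illustration.
import intake

# Catalog files installed by data packages are picked up by Intake's builtin
# catalog, so their sources appear alongside the ones listed earlier.
# Attributing 'production_by_state' to the "data" package is an assumption.
print('production_by_state' in list(intake.cat))
Each data source is also self-describing; its describe() method reports details such as the container type, description, and any user parameters. Here it is for the production_by_state source: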
cat.production_by_state.describe()
There is a built-in GUI that also allows for exploration of data sources. The search feature is particularly useful.
intake.gui
If the plotting package Holoviews is installed, you can create quick-look plots of the data, much like with pandas. Line, bar, scatter, and several other types of plots are built in; see the documentation for more information. The data engineer can also define custom plot specifications in the metadata of the catalog.
Below is an example of creating a scatter plot with the poro_perm data source without any arguments or customization.
import hvplot.intake
cat.poro_perm.plot.scatter()
A more useful plot is defined as my_scatter() in the catalog.
cat.poro_perm.plot.my_scatter()
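The my_scatter() specification lives in the catalog itself (in the source's metadata, as mentioned above), so the analyst does not have to spell out any plot options. As a rough sketch of the kind of call such a predefined plot might encode, a direct hvplot call could look like the following; the column names porosity and permeability are assumptions for illustration and may not match the actual poro_perm columns.
# Roughly what a catalog-defined scatter plot might encode, written as a
# direct call; the x/y column names here are hypothetical.
cat.poro_perm.plot.scatter(x='porosity', y='permeability')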
Calling the method .persist() on a data source will create a local copy in a storage format that is most suitable for the container. This is useful for large queries across networks or when working with big datasets (assuming you have the local hard drive storage capacity). Once the data has been persisted, it will load much faster in subsequent calls. This can be demonstrated by using the Jupyter notebook's %timeit magic function.
%timeit df = cat.production_by_api(api='33007000110000', persist='never').read()
To create a local copy, just call the method .persist()
cat.production_by_api(api='33007000110000').persist()
%timeit df = cat.production_by_api(api='33007000110000').read()