Quickstart

This is a short introduction to using FeatherStore and an overview of its basic functionality. For a complete guide to FeatherStore's classes, functions, and methods, please visit the API reference.

The project is hosted on PyPI at:

https://pypi.org/project/FeatherStore/

Installation

To install FeatherStore from PyPI, use pip:

pip install featherstore

Alternatively, install the latest version from GitHub:

pip install git+https://github.com/hakonmh/featherstore.git

Starting Up

import featherstore as fs

To create and connect to a new database simply use:

fs.create_database('/path/to/database_folder')
fs.connect('/path/to/database_folder')

You can later disconnect from the database by calling fs.disconnect().
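
If you want the disconnect to happen even when an error occurs in between, the connect/disconnect pair can be wrapped in a context manager. The sketch below is not part of FeatherStore: `connected` is a hypothetical helper, and it only assumes that the module passed in exposes the `connect(path)` and `disconnect()` calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def connected(fs_module, path):
    """Hypothetical helper: connect on entry, always disconnect on exit."""
    fs_module.connect(path)
    try:
        yield fs_module
    finally:
        # Runs even if the body of the with-block raises
        fs_module.disconnect()


# Intended usage (requires featherstore and an existing database folder):
# import featherstore as fs
# with connected(fs, '/path/to/database_folder'):
#     print(fs.list_stores())
```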

Working with Stores

A database consists of one or more stores. A store is the basic unit of organization and is where your tables are kept.

fs.create_store('store_1')
fs.create_store('store_2')
fs.list_stores()

>> ['store_1', 'store_2']

fs.drop_store('store_2')
fs.rename_store('store_1', 'example_store')
# Connect to store
store = fs.Store('example_store')
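
In scripts that run more than once, it can be handy to create a store only when it is missing. A minimal sketch, assuming nothing beyond the `list_stores()`, `create_store()` and `Store()` calls shown above; `get_or_create_store` is a hypothetical helper, not a FeatherStore function, and how `create_store` behaves for an existing name is not covered here.

```python
def get_or_create_store(fs_module, name):
    """Hypothetical helper: return Store(name), creating the store if needed."""
    if name not in fs_module.list_stores():
        fs_module.create_store(name)
    return fs_module.Store(name)


# Intended usage (requires featherstore and an active connection):
# import featherstore as fs
# store = get_or_create_store(fs, 'example_store')
```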

Reading and Writing Tables

FeatherStore supports reading and writing of Pandas DataFrames/Series, Polars DataFrames/Series and PyArrow tables.

First, let's create a DataFrame to store.

import pandas as pd
from numpy.random import randn

dates = pd.date_range("2021-01-01", periods=5)
df = pd.DataFrame(randn(5, 4), index=dates, columns=list("ABCD"))
df

>>                 A         B         C         D
2021-01-01  0.402138 -0.016436 -0.565256  0.520086
2021-01-02 -1.071026 -0.326358 -0.692681  1.188319
2021-01-03  0.777777 -0.665146  1.017527 -0.064830
2021-01-04 -0.835711 -0.575801 -0.650543 -0.411509
2021-01-05 -0.649335 -0.830602  1.191749  0.396745

FeatherStore stores the tables as partitioned Feather files. The size of each partition is defined by using the partition_size parameter when writing a table.

PARTITION_SIZE = 128  # bytes
store.write_table('example_table', df, partition_size=PARTITION_SIZE)
store.list_tables()

>> ['example_table']
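
To get a feel for what a 128-byte partition size means for the DataFrame above, here is a back-of-the-envelope calculation. It assumes that only the raw column data counts toward the limit (four float64 columns at 8 bytes each) and ignores the index and any file metadata, so the partitioning FeatherStore actually produces may differ. `estimate_partitions` is a hypothetical function written for this illustration.

```python
import math

BYTES_PER_FLOAT64 = 8


def estimate_partitions(n_rows, n_cols, partition_size):
    """Rough estimate of (rows per partition, number of partitions).

    Assumption: every column is float64 and only column data is counted.
    """
    row_bytes = n_cols * BYTES_PER_FLOAT64
    rows_per_partition = max(1, partition_size // row_bytes)
    n_partitions = math.ceil(n_rows / rows_per_partition)
    return rows_per_partition, n_partitions


# 5 rows x 4 float64 columns with 128-byte partitions
print(estimate_partitions(5, 4, 128))  # (4, 2)
```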

The advantage of using partitioned Feather files is that you can perform many operations without loading the full data.

# Creating a new DataFrame
new_dates = pd.date_range("2021-01-06", periods=1)
df1 = pd.DataFrame(randn(1, 4), index=new_dates, columns=list("ABCD"))
# Appending to a FeatherStore table only loads in the last partition
store.append_table('example_table', df1)

FeatherStore uses sorted indices to keep track of which partitions to open during a given operation.
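
As an illustration of why a sorted index helps, the sketch below keeps the first index value of each partition in a sorted list and uses binary search to find the partitions that can contain a given range. This is a simplified model, not FeatherStore's actual implementation; `partitions_for_range` and the example layout are hypothetical.

```python
from bisect import bisect_right


def partitions_for_range(first_keys, start, end):
    """Return positions of partitions that may hold keys in [start, end].

    first_keys: sorted list with the smallest index value of each partition.
    """
    if start > end:
        return []
    # Last partition whose first key is <= start (or 0 if start precedes all)
    lo = max(bisect_right(first_keys, start) - 1, 0)
    # One past the last partition whose first key is <= end
    hi = bisect_right(first_keys, end)
    return list(range(lo, hi))


# Three partitions starting at index values 1, 5 and 9
first_keys = [1, 5, 9]
print(partitions_for_range(first_keys, 6, 7))  # [1]
print(partitions_for_range(first_keys, 4, 9))  # [0, 1, 2]
```

Only the partitions returned by the lookup would need to be opened; the rest can be skipped without reading them.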

We can now read the stored data as a Pandas DataFrame, a Polars DataFrame or a PyArrow Table.

store.read_pandas('example_table')
# store.read_arrow('example_table') for reading to Arrow Tables
# store.read_polars('example_table') for reading to Polars DataFrames

>>                 A         B         C         D
2021-01-01  0.402138 -0.016436 -0.565256  0.520086
2021-01-02 -1.071026 -0.326358 -0.692681  1.188319
2021-01-03  0.777777 -0.665146  1.017527 -0.064830
2021-01-04 -0.835711 -0.575801 -0.650543 -0.411509
2021-01-05 -0.649335 -0.830602  1.191749  0.396745
2021-01-06 -0.408125 -0.420920  0.632606  0.606950

We can also query parts of the data. FeatherStore uses predicate filtering to only load the partitions and columns specified by the query.

Because the index is sorted, FeatherStore supports range queries on rows using {'before': end}, {'after': start} and {'between': [start, end]}.

store.read_pandas('example_table', rows={'after': '2021-01-05'}, cols=['D', 'A'])

# All range queries are inclusive
>>                 D         A
2021-01-05  0.396745 -0.649335
2021-01-06  0.606950 -0.408125
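
The inclusive behaviour of the three range forms can be modelled on a plain sorted list. The sketch below is only a conceptual model of the semantics described above, not FeatherStore code; `select_rows` is a hypothetical function, and ISO date strings stand in for the datetime index.

```python
from bisect import bisect_left, bisect_right


def select_rows(index, rows):
    """Return the index values matched by an inclusive range query.

    index: sorted list of index values.
    rows: {'before': end}, {'after': start} or {'between': [start, end]}.
    """
    kind, value = next(iter(rows.items()))
    if kind == 'before':
        return index[:bisect_right(index, value)]
    if kind == 'after':
        return index[bisect_left(index, value):]
    if kind == 'between':
        start, end = value
        return index[bisect_left(index, start):bisect_right(index, end)]
    raise ValueError(f"unknown range query: {kind!r}")


index = ['2021-01-01', '2021-01-02', '2021-01-03',
         '2021-01-04', '2021-01-05', '2021-01-06']
print(select_rows(index, {'after': '2021-01-05'}))
# ['2021-01-05', '2021-01-06']
print(select_rows(index, {'between': ['2021-01-02', '2021-01-03']}))
# ['2021-01-02', '2021-01-03']
```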

Inserting, Updating and Deleting Data

First, let’s create a new table to work with:

index = [1, 3, 5, 6]
df = pd.DataFrame(randn(4, 2), index=index, columns=list("AB"))
df

>>        A         B
1 -0.041727  0.957139
3 -0.272294 -1.758717
5 -0.353684  1.550073
6  1.275938  1.054702

We can use Store.select_table() to select a Table object, which contains more features for working with tables.

table = store.select_table('example_table2')
table.exists  # False
table.write(df)
table.exists

>> True

One of those features is Table.insert(), which allows for adding extra rows into the table.

Note

You can use Table.add_columns() to add extra columns.

df2 = pd.DataFrame(randn(2, 2), index=[4, 2], columns=list("AB"))
table.insert(df2)  # Must have the same index and col dtypes as the stored df
table.read_pandas()

# The data will be inserted into its sorted index position
>>        A         B
1 -0.041727  0.957139
2  2.163615 -0.708871
3 -0.272294 -1.758717
4 -1.263981 -0.961670
5 -0.353684  1.550073
6  1.275938  1.054702
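
Conceptually, inserting rows with unsorted index values amounts to merging them into the already sorted index. A minimal sketch of that idea using plain Python lists; it models the outcome shown above, not how FeatherStore performs the insert internally, and `insert_sorted` is a hypothetical name.

```python
from heapq import merge


def insert_sorted(stored_rows, new_rows):
    """Merge new (index, value) rows into rows already sorted by index."""
    # Sort the incoming rows first, then merge the two sorted sequences
    new_sorted = sorted(new_rows, key=lambda row: row[0])
    return list(merge(stored_rows, new_sorted, key=lambda row: row[0]))


stored = [(1, 'a'), (3, 'c'), (5, 'e'), (6, 'f')]
new = [(4, 'd'), (2, 'b')]
print([idx for idx, _ in insert_sorted(stored, new)])
# [1, 2, 3, 4, 5, 6]
```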

Other features include Table.update() and Table.drop(), which update and delete data.

df3 = pd.DataFrame([[0, 1], [2, 3]], index=[1, 2], columns=list("AB"))
#    A  B
# 1  0  1
# 2  2  3
table.update(df3)  # Overwrites the rows with index 1 and 2
table.drop(rows={'after': 5})  # Inclusive, so rows 5 and 6 are dropped
# You can also drop columns using table.drop(cols=['col1', 'col2'])
table.read_pandas()

>>        A         B
1  0.000000  1.000000
2  2.000000  3.000000
3 -0.272294 -1.758717
4 -1.263981 -0.961670
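
The combined effect of the update and the inclusive drop can be traced with a plain dictionary keyed by index value. This is a conceptual model only and does not call FeatherStore; `update_rows` and `drop_after` are hypothetical functions, and the model assumes that an update may only target index values that already exist.

```python
def update_rows(table, updates):
    """Overwrite rows whose index already exists in the table."""
    missing = [key for key in updates if key not in table]
    if missing:
        # Assumption of this model: updates cannot add new index values
        raise KeyError(f"index values not in table: {missing}")
    return {**table, **updates}


def drop_after(table, start):
    """Drop all rows with index >= start (the range is inclusive)."""
    return {key: row for key, row in table.items() if key < start}


table = {
    1: (-0.041727, 0.957139),
    2: (2.163615, -0.708871),
    3: (-0.272294, -1.758717),
    4: (-1.263981, -0.961670),
    5: (-0.353684, 1.550073),
    6: (1.275938, 1.054702),
}
table = update_rows(table, {1: (0.0, 1.0), 2: (2.0, 3.0)})
table = drop_after(table, 5)
print(list(table))  # [1, 2, 3, 4]
print(table[2])     # (2.0, 3.0)
```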