From CSVs to Tables: Infer Schema Data Types From Raw Spreadsheets

Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy was I wrong.

The task seems somewhat reasonable from the surface. Surely we can spot columns where the data is always in integers, or match the expected format of a date, right? If anything, we'll fall back to text or varchar and call it a day. Hell, even MongoDB's Compass does a great job of this by merely uploading a CSV... this has got to be some trivial task handled by third-party libraries by now.

For one reason or another, searching for a solution to this problem almost always comes up empty. Software developers probably have little need for dynamically generated tables if their applications run solely on self-defined models. Full-time Data Scientists have access to plenty of expensive tools which seem to claim this functionality, yet it all seems so... inaccessible.

Is This NOT a Job For Pandas?

From my experience, no. Pandas does offer hope but doesn't seem to get the job done quite right. Let's start with a dataset so you can see what I mean. Here's a bunch of fake identities I'll be using to mimic the outcome I experienced when working with real data:

id	initiated	hiredate	email	firstname	lastname	title	department	location	country	type
100035435	2015-12-11T09:16:20.722-08:00	3/22/67	GretchenRMorrow@jourrapide.com	Gretchen	Morrow	Power plant operator	Physical Product	Britling Cafeterias	United Kingdom	Employee
100056435	2015-12-15T10:11:24.604-08:00	6/22/99	ElizabethLSnow@armyspy.com	Elizabeth	Snow	Oxygen therapist	Physical Product	Grade A Investment	United States of America	Employee
100037955	2015-12-16T14:31:32.765-08:00	5/31/74	AlbertMPeterson@einrot.com	Albert	Peterson	Psychologist	Physical Product	Grass Roots Yard Services	United States of America	Employee
100035435	2016-01-20T11:15:47.249-08:00	9/9/69	JohnMLynch@dayrep.com	John	Lynch	Environmental hydrologist	Physical Product	Waccamaw's Homeplace	United States of America	Employee
100057657	2016-01-21T12:45:38.261-08:00	4/9/83	TheresaJCahoon@teleworm.us	Theresa	Cahoon	Personal chef	Physical Product	Cala Foods	United States of America	Employee
100056747	2016-02-01T11:25:39.317-08:00	6/26/98	KennethHPayne@dayrep.com	Kenneth	Payne	Central office operator	Frontline	Magna Consulting	United States of America	Employee
100035435	2016-02-01T11:28:11.953-08:00	4/16/82	LeifTSpeights@fleckens.hu	Leif	Speights	Staff development director	Frontline	Rivera Property Maintenance	United States of America	Employee
100035435	2016-02-01T12:21:01.756-08:00	8/6/80	JamesSRobinson@teleworm.us	James	Robinson	Scheduling clerk	Frontline	Discount Furniture Showcase	United States of America	Employee
100074688	2016-02-01T13:29:19.147-08:00	12/14/74	AnnaDMoberly@jourrapide.com	Anna	Moberly	Playwright	Physical Product	The Wiz	United States of America	Employee
100665778	2016-02-04T14:40:05.223-08:00	9/13/66	MarjorieBCrawford@armyspy.com	Marjorie	Crawford	Court, municipal, and license clerk	Physical Product	The Serendipity Dip	United Kingdom	Employee
100876876	2016-02-24T12:39:25.872-08:00	12/19/67	LyleCHackett@fleckens.hu	Lyle	Hackett	Airframe mechanic	Physical Product	Infinity Investment Plan	United States of America	Employee
100658565	2016-02-29T15:52:12.933-08:00	11/17/83	MaryJDensmore@jourrapide.com	Mary	Densmore	Employer relations representative	Frontline	One-Up Realtors	United States of America	Employee
100766547	2016-03-01T12:32:53.357-08:00	10/1/87	CindyRDiaz@armyspy.com	Cindy	Diaz	Student affairs administrator	Physical Product	Mr. AG's	United States of America	Employee
100045677	2016-03-02T12:07:44.264-08:00	8/16/65	AndreaTLigon@einrot.com	Andrea	Ligon	Railroad engineer	Central Growth	Robinson Furniture	United States of America	Employee

There are some juicy datatypes in there: integers , timestamps , dates , strings.... and those are only the first four columns! Let's load this thing into a DataFrame and see what information we can get that way:

import pandas as pd

csv = 'data/fake.csv'

workers_df = pd.read_csv(csv, header=0, encoding='utf-8')
meta = workers_df.info(verbose=True)
print(meta)

Using Pandas' info() should do the trick! This returns a list of columns and their data types:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
id 14 non-null float64
initiated 14 non-null object
hiredate 14 non-null object
email 14 non-null object
firstname 14 non-null object
lastname 14 non-null object
title 14 non-null object
department 14 non-null object
location 14 non-null object
country 14 non-null object
type 14 non-null object
dtypes: float64(1), object(10)
memory usage: 2.0+ KB
None

...Or not. What is this garbage? Only one of our 11 columns identified a data type, and it was incorrectly listed as a float! Okay, so maybe Pandas doesn't have a secret one-liner for this. So who does?

What about PySpark?

It's always been a matter of time before we'd turn to Apache's family of aged data science products. Hadoop, Spark, Kafka... all of them have a particular musty stench about them that tastes like "I feel like I should be writing in Java right now." Heads up: they do want you to write in Java. Misery loves company.

Nonetheless, PySpark does support reading data as DataFrames in Python, and also comes with the elusive ability to infer schemas. Installing Hadoop and Spark locally still kind of sucks for solving this one particular problem. Cue Databricks: a company that spun off from the Apache team way back in the day, and offers free cloud notebooks integrated with- you guessed it: Spark.

With Databricks, we can upload our CSV and load it into a DataFrame by spinning up a free notebook. The source looks something like this:

# File location and type
file_location = "/FileStore/tables/fake.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

Let's see out the output looks:

df:pyspark.sql.dataframe.DataFrame
id:integer
initiated:timestamp
hiredate:string
email:string
firstname:string
lastname:string
title:string
department:string
location:string
country:string
type:string

Not bad! We correctly 'upgraded' our ID from float to integer, and we managed to get the timestamp correct also. With a bit of messing around, we could probably have even gotten the date correct too, given that we stated the format beforehand.

A look at the Databricks Notebook interface.

And Yet, This Still Kind of Sucks

Even though we can solve our problem in a notebook, we still haven't solved the use case: I want a drop-in solution to create tables out of CSVs... whenever I want! I want to accomplish this while writing any app, at the drop of a hat without warning. I don't want to install Hadoop and have Java errors coming back at me through my terminal. Don't EVER let me see Java in my terminal. UGH:

py4j.protocol.Py4JJavaError: An error occurred while calling o43.count.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
        at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)

        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
        at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
        at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
        at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)

Python's "tableschema" Library

Thankfully, there's at least one other person out there who has shared this desire. That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy. Let's give it a go:

import csv
from tableschema import Table

data = 'data/fake.csv'
schema = infer(data, limit=500, headers=1, confidence=0.85)
print(schema)

If our dataset is particularly large, we can use the limit attribute to limit the sample size to the first X number of rows. Another nice feature is the confidence attribute: a 0-1 ratio for allowing casting errors during the inference. Here's what comes back:

{
  "fields": [{
    "name": "id",
    "type": "integer",
    "format": "default"
  }, {
    "name": "initiated",
    "type": "string",
    "format": "default"
  }, {
    "name": "hiredate",
    "type": "date",
    "format": "default"
  }, {
    "name": "email",
    "type": "string",
    "format": "default"
  }, {
    "name": "firstname",
    "type": "string",
    "format": "default"
  }, {
    "name": "lastname",
    "type": "string",
    "format": "default"
  }, {
    "name": "title",
    "type": "string",
    "format": "default"
  }, {
    "name": "department",
    "type": "string",
    "format": "default"
  }, {
    "name": "location",
    "type": "string",
    "format": "default"
  }, {
    "name": "country",
    "type": "string",
    "format": "default"
  }, {
    "name": "type",
    "type": "string",
    "format": "default"
  }],
  "missingValues": [""]
}

Hey, that's good enough for me! Now let's automate the shit out this.

Creating a Table in SQLAlchemy With Our New Schema

I'm about to throw a bunch in your face right here. Here's a monster of a class:

from sqlalchemy import create_engine
import config
import pandas as pd
import psycopg2
from tableschema import Table, infer, Schema
from functions.recursivejson import extract_values
from sqlalchemy.types import Integer, Text, Date

class CreateTablesFromCSVs:
    """Infer a table schema from a CSV."""

    __uri = config.PG_URI
    __engine = create_engine(__uri, convert_unicode=True, echo=True)
    __data = 'data/fake.csv'

    @classmethod
    def get_data(cls):
        """Pull latest data."""
        test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8')
        return test_df

    @classmethod
    def get_schema_from_csv(cls, csv):
        """Infers schema from CSV."""
        table = Table(csv)
        table.infer(limit=500, confidence=0.55)
        schema = table.schema.descriptor
        names = cls.get_column_names(schema, 'name')
        datatypes = cls.get_column_datatypes(schema, 'type')
        schema_dict = dict(zip(names, datatypes))
        return schema_dict

    @classmethod
    def get_column_names(cls, schema, key):
        """Get names of columns."""
        names = extract_values(schema, key)
        return names

    @classmethod
    def get_column_datatypes(cls, schema, key):
        """Convert schema to recognizable by SQLAlchemy."""
        values = extract_values(schema, key)
        for i, value in enumerate(values):
            if value == 'integer':
                values[i] = Integer
            elif value == 'string':
                values[i] = Text
            elif value == 'date':
                values[i] = Date
        return values

    @classmethod
    def create_new_table(cls, data, schema):
          """Create new table from CSV and generated schema."""
          workday_table.to_sql('faketable',
                               con=cls.__engine,
                               schema='testschema',
                               if_exists='replace',
                               chunksize=300,
                               dtype=schema)

data = CreateTablesFromCSVs.get_schema_from_csv()
schema = CreateTablesFromCSVs.get_schema_from_csv(data)
CreateTablesFromCSVs.create_new_table(data, schema)

The first thing worth mentioning is I'm importing a function from my personal secret library to extract values from JSON objects. I've spoken about it before.

Let's break down this class:

get_data() reads our CSV into a Pandas DataFrame.
get_schema_from_csv() kicks off building a Schema that SQLAlchemy can use to build a table.
get_column_names() simply pulls column names as half our schema.
get_column_datatypes() manually replaces the datatype names we received from tableschema and replaces them with SQLAlchemy datatypes.
create_new_table Uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped.

Promising Potential, Room to Grow

While tableschema works some of the time, it isn't perfect. The base of what we accomplish still stands: we now have a reliable formula for how we would create schemas on the fly if we trust our schemas to be accurate.

Just wait until next time when we introduce Google BigQuery into the mix.

Blog