Types for Tables: A Language Design Benchmark

chrisaycock · on Dec 11, 2021

I created Empirical specifically to have statically typed Dataframes (tables).

  let trades = load("trades.csv")

  from trades select where symbol == "AAPL" and size > 1000

Empirical uses type providers and automatic compile-time function evaluation to sample the CSV file and infer a type ahead of time. The above can be run in a REPL or in a stand-alone script, and the result is statically typed! (There are no "gradual" or "optional" types; it is fully static.)

Empirical has generic types, which are syntactic sugar for templates. Here is an example of a five-minute volume-weighted average price (VWAP):

  func wavg(ws, vs) = sum(ws * vs) / sum(ws)

  from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)
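
For readers who don't know Empirical, the query above can be approximated in plain Python. This is only a sketch of the computation, not of how Empirical implements it: the dict-based rows, the `bar` helper, and `vwap_by_symbol_and_bar` are hypothetical names made up for illustration, and nothing here is statically typed.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def wavg(ws, vs):
    """Weighted average of vs, using ws as the weights."""
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)

def bar(ts, width):
    """Floor a timestamp to the start of its bar (assumed meaning of bar())."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // width) * width

def vwap_by_symbol_and_bar(trades, width=timedelta(minutes=5)):
    """Group rows by (symbol, bar start) and compute wavg(size, price) per group."""
    groups = defaultdict(lambda: ([], []))
    for t in trades:
        sizes, prices = groups[(t["symbol"], bar(t["timestamp"], width))]
        sizes.append(t["size"])
        prices.append(t["price"])
    return {key: wavg(sizes, prices) for key, (sizes, prices) in groups.items()}
```

Each key in the result is a `(symbol, bar_start)` pair, which mirrors the `by symbol, bar(timestamp, 5m)` clause.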

I created Empirical for time-series analysis (I work in finance) and began with the question, "What if q/kdb+ were more like Haskell?" The result is very different from either language, but I wanted to make table actions feel dynamic even though they are fully static.

To run a script that takes a command-line argument for the location of the CSV file, just supply the type ahead of time since there isn't a compile-time path to the file.

  data Trade:
    symbol: String,
    timestamp: Timestamp,
    price: Float64,
    size: Int64
  end

  let trades = csv_load{Trade}(argv[1])

Feel free to try it out.

https://www.empirical-soft.com

gavinray · on Dec 11, 2021

This is pretty interesting.

Are you familiar with the "Morel" dialect of SML?

It's written by Julian Hyde, who has authored several RDBMSs and is the lead maintainer of the Apache Calcite relational algebra framework. Your tool reminds me a bit of the work in Morel, where LINQ-like operators have been integrated into the core of the language:

https://github.com/julianhyde/morel

chrisaycock · on Dec 11, 2021

I took a quick look through the example for relational extensions. I don't see anything about reading a file, much less inferring a type from a file. Empirical's key innovation is that it can infer a type from an external source as long as the source can be derived at compile time.

  let path = "/path/to/files"
  let fname = "quotes.csv"

  let quotes = load(path + "/" + fname)

Since load() requires a compile-time parameter, Empirical automatically resolves the file's location ahead of time. Specifically, Empirical determines that the variables are constants and that the plus operator is pure; computing the resulting value has no side effects. Thus, the Empirical compiler accommodates load() by computing the parameter during semantic analysis. This is automatic compile-time function evaluation.
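
The folding step can be illustrated with a toy evaluator. This is a minimal Python sketch under stated assumptions, not Empirical's compiler: `try_const_eval` is a hypothetical helper, it parses Python expression syntax rather than Empirical's, and it treats only literals, known constant names, and `+` as foldable.

```python
import ast

def try_const_eval(expr, constants):
    """Return (True, value) if expr folds to a constant, else (False, None).

    `constants` maps variable names to known immutable values. Any other
    name is treated as a runtime value, which blocks folding.
    """
    def fold(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name) and node.id in constants:
            return constants[node.id]
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            # `+` is assumed pure, so evaluating it early has no side effects
            return fold(node.left) + fold(node.right)
        raise ValueError("not a compile-time constant")

    try:
        return True, fold(ast.parse(expr, mode="eval").body)
    except (ValueError, TypeError):
        return False, None
```

With `path` and `fname` registered as constants, `path + "/" + fname` folds to a single string, while an expression that mentions an unknown name such as `argv` is rejected.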

As for load(), it is a macro whose expansion includes inferring the schema of the table by sampling the first few lines of the file. This is a type provider.
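
The sampling idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not Empirical's actual inference rules: `infer_schema` and `infer_column_type` are hypothetical names, the type names are borrowed from the `Trade` example, and only integers, floats, and strings are distinguished (no timestamps).

```python
import csv
import io
from itertools import islice

def infer_column_type(values):
    """Pick the narrowest type name that parses every sampled value."""
    for name, parse in (("Int64", int), ("Float64", float)):
        try:
            for v in values:
                parse(v)
            return name
        except ValueError:
            continue
    return "String"

def infer_schema(text, sample_rows=5):
    """Infer {column: type name} from the header plus the first few rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(islice(reader, sample_rows))
    return {col: infer_column_type([r[i] for r in rows])
            for i, col in enumerate(header)}
```

In a real type provider this step would run during compilation and the resulting schema would become a static type. Here it simply returns a dict at runtime.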

Now contrast that logic against something like Apache Spark. It's written in a statically typed language (Scala), but its table semantics are dynamically typed! That's because Spark holds the schema as a runtime value and addresses column names as strings. Spark's mechanism for inferring type means it can't maintain static typing.

Empirical is the only system I know of that can infer types while staying static.

sideeffffect · on Dec 12, 2021

> Empirical is the only system I know of that can infer types while staying static.

F#'s Type Providers can do that too.

https://docs.microsoft.com/en-us/dotnet/fsharp/tutorials/typ...

chrisaycock · on Dec 12, 2021

F# is how I learned about type providers:

http://fsprojects.github.io/FSharp.Data/library/CsvProvider....

But look at that syntax:

  CsvProvider<"file.csv">.Load("file.csv")

The file name is listed twice! Conceivably a macro could hide that, but we'd still have the issue that the string has to be written fully. I.e., I can't just call a function to generate the string.

F# has Literals that mitigate some of this, but my understanding is that they are pretty limited:

https://github.com/fsharp/fslang-suggestions/issues/539

On the other hand, Empirical can perform automatic compile-time function evaluation on any legal expression, including across variables, user-defined functions, user-defined datatypes, etc.

  func filename(n):
    let value = String(n + 1)
    return "/path/to/" + value + ".csv"
  end

  load(filename(7))

I could put the filename in a Dataframe embedded in my source, then reference the row/column I want. Empirical's automatic CTFE handles it.

chusk3 · on Dec 12, 2021

Nitpick here: the `Load` member takes a path to _any_ file of the same schema as the sample, but you can also just use the sample at runtime with the use of `Load()`. So it doesn't have to be as redundant as you've shown here; the flexibility is all about loading other instances of the same data shape.

acbart · on Dec 11, 2021

I am so excited about this work! I think it's a wonderful big step in this area. However, I am still a little weirded out by the term "sort" being used to describe the type of a cell. Their goal is to avoid preconceptions about the use of the word, but... there's already a lot of baggage around that word "sort"!

skrishnamurthi · on Dec 11, 2021

We spent a long time on the choice. But it was really important in this setting to avoid the word "type", and sort is a technically correct term too — e.g., as used in many-sorted logics [https://en.wikipedia.org/wiki/Many-sorted_logic]. Given the precedent from mathematics, it seemed like a pretty good choice (and to me, still does).

sideeffffect · on Dec 12, 2021

Could you explain why you wanted to avoid the term (excuse the pun :) ) "type" so much?

skrishnamurthi · on Dec 12, 2021

Because we didn't want to limit people's thinking to current type systems.

In my experience, people who have only been exposed to rather weak type systems tend to have strong (negative) beliefs about what types supposedly can and can't do — witness the number of pointless "debates" about "typed" vs "dynamic" languages. We therefore expressly wanted people to shed their baggage and think about what programmers want to be able to express and work from there, not see something hard or unfamiliar and say "oh, «types» can't do this".

Picking a term that keeps harkening back to that baggage would then be counter-productive.

sbr464 · on Dec 11, 2021

(unrelated design notes)

Columns with numerical data should be right-aligned. Column header alignment should match the data.

https://alistapart.com/article/web-typography-tables/

garmaine · on Dec 11, 2021

Uh… isn’t this exactly what the (typed) relational model is?

johnmyleswhite · on Dec 11, 2021

The relational model's type system is contained inside of this work as the schema component.