Data exploration
as a programming problem








Tomas Petricek, University of Kent
http://tomasp.net | tomas@tomasp.net | @tomaspetricek

Data exploration

Motivation

Simple, trustworthy and accountable data exploration


Approach

Theoretical programming research in a new context

Spreadsheets

Easy to use

Small tabular data

Not reproducible

Error prone

Programming

Requires expert skills

Internet-scale

Reproducible & open

Formally verifiable

Best of both worlds

Can we make data exploration tools


As simple as spreadsheets?

As flexible as programming tools?

Formally verifiable like programs?

DEMO

Exploring Olympic medals ECOOP 2017 Programming 2020

Working with data

Reading data

Unsafe data access in a typed language


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
var url = "http://owm.org/forecast/daily?q=Prague";
var forecast = JsonValue.Load(url);
var days = forecast["list"];

foreach(JsonValue day in days) {
  var hi = (double)day["temp"]["high"];
  Console.WriteLine(hi);
}
Not found!

Reading data

Unsafe data access in a typed language


1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
var url = "http://owm.org/forecast/daily?q=Prague";
var forecast = JsonValue.Load(url);
var days = forecast["list"];

foreach(JsonValue day in days) {
  var hi = (double)day["temp"]["max"];
  Console.WriteLine(hi);
}

Reading data

Accessing data from external data sources


Languages do not understand data

There is rarely explicit schema

Manually defined types can capture it

Easier in dynamic languages!



Querying data

Athletes by number of gold medals from Rio 2016

1: 
2: 
3: 
4: 
5: 
6: 
olympics = pd.read_csv("olympics.csv")
olympics[olympics["Games"] == "Rio (2016)"]
  .groupby("Athlete")
  .agg({"Gold": sum})
  .sort_values(by="Gold", ascending=False)
  .head(8)
Unknown file
Column name

Querying data

Language and data source features you need to know


Python dictionaries {"key": value}

Generalised indexers .[ condition ]

Operation names sort_values

Data column names "Athlete"



Type providers

\(\emptyset \vdash e : \tau\)

\(\pi(~~~~~~~) \vdash e : \tau\)

DEMO

Parsing JSON weather forecast PLDI 2016

F# Data

Interesting theoretical aspects of data access


Pragmatic structural shape inference

Predictable handling of schema change

Relative type safety property





{title : string, author : {age : int}} {author : {age : float}}

{ title : option<string>,  author : {age : float} }

{ coordinates : {lng:num, lat:num} } string

{ coordinates : {lng:num, lat:num} } + string 

Schema change

Provided type can change only in limited ways


\(C[e] \rightarrow C[e.M]\)

\(C[e] \rightarrow C[{\sf match}~e~{\sf with}~\ldots]\)

\(C[e] \rightarrow C[int(e)]\)





Dot-driven development

Encoding complex logic via simple member access



The Gamma

Interesting theoretical aspects of data querying


Laziness for scaling to large hierarchies

Fancy types for the masses ECOOP 2017

Efficient on-the-fly code evaluation Programming 2020

Can non-experts really do this? WiP 2022



Fancy types

Row types track names and types of fields

\[\definecolor{cc}{RGB}{204,82,34} \definecolor{mc}{RGB}{0,0,153} \frac {\Gamma \vdash e : {\color{cc}[f_1:\tau_1, \ldots, f_n:\tau_n]}} {\Gamma \vdash e.\text{drop}~f_i : {\color{cc} [f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n]}}\]

Embed row types in provided nominal types

\[\frac {\Gamma \vdash e : {\color{mc} C_1}} {\Gamma \vdash e.\text{drop}~f_i : {\color{mc} C_2}}\]

\[\begin{array}{l} \\[-0.5em] {fields({\color{mc} C_1}) = {\color{mc} \{f_1:\tau_1, \ldots, f_n:\tau_n\}}}\\ {fields({\color{mc} C_2}) = {\color{mc} \{f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n\}}} \end{array}\]

AI assistants

Data wrangling

Getting data into the right format


Manual process taking 80% of analyst's time

Obtaining, merging and fixing data

Automatic AI tools still need some help!

Ad-hoc interfaces and feedback mechanisms

Wrattler

Research platform for The Gamma
and AI assistants


Mix languages, build interactive tools, analyse
code provenance

DEMO

Wrattler and outlier detection TaPP 2018

AI assistants

Semi-interactive tools for data wrangling Submitted 2022


A tuple \((\mathit{best}, \mathit{choices}, f)\) such that

  • \(\mathit{best}_X(H)=e\) – recommends best expression
  • \(\mathit{choices}_X(H)=(H_1, \ldots, H_n)\) – offers constraints
  • \(f(e, X)=Y\) – evaluation function

DEMO

Datadiff AI assistant Submitted 2022

AI assistants

Empirical evaluation of number of interactions

Data visualization

Compost.js

Composable data visualization library JFP 2021


Functional ideas applied to data visualization



Fluid

Linked visualizations via Galois dependencies POPL 2022


Uses program slicing to make linking automatic



Future work

Notebooks for data exploration


Programming as interaction with a stateful environment

Programming languages

Programs as expressions in a formal grammar


Programming systems

Programs as lists of interactions with the evnrionment

\(e := e_1 + e_2~|~e_1~e_2~|~\lambda x.e\)



\(p := a_1;~a_2;~\ldots;~a_n\)



Data exploration

Programming as a sequence of interactions


Accommodates manual data edits

Supports user interface interactions

Amenable to formal analysis

Correctness, provenance etc. ICFP 2014



Summary

Thank you!


Programming research in a new context

Data access, AI assistants, data visualization

From languages to programming systems




Tomas Petricek, University of Kent
http://tomasp.net | tomas@tomasp.net | @tomaspetricek