Tomas Petricek, University of Kent
http://tomasp.net
tomas@tomasp.net
@tomaspetricek
Simple, trustworthy and accountable data exploration
Theoretical programming research in a new context
Spreadsheets: Easy to use, small tabular data, not reproducible, error prone
Programming: Requires expert skills, Internet-scale, reproducible & open, formally verifiable
Can we make data exploration tools
As simple as spreadsheets?
As flexible as programming tools?
Formally verifiable like programs?
Exploring Olympic medals (ECOOP 2017, Programming 2020)
Unsafe data access in a typed language
Accessing data from external data sources
Languages do not understand data
There is rarely explicit schema
Manually defined types can capture it
Easier in dynamic languages!
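The points above can be illustrated with a minimal sketch. It assumes a hypothetical JSON forecast document; the field names `main` and `temp` are made up for illustration and are not taken from any real service. Access is easy in a dynamic language, but nothing checks that the keys match the data.

```python
import json

# Hypothetical forecast document (field names are assumptions)
raw = '{"main": {"temp": 5.0}}'
doc = json.loads(raw)

def temperature(d):
    # Works only because the keys happen to match the data
    return d["main"]["temp"]

def temperature_typo(d):
    # Misspelled key: no checker complains, fails with KeyError at run time
    return d["main"]["temperature"]
```

Calling `temperature(doc)` returns the value, while `temperature_typo(doc)` raises `KeyError` only when it is run.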
Athletes by number of gold medals from Rio 2016
Language and data source features you need to know
Python dictionaries {"key": value}
Generalised indexers .[ condition ]
Operation names sort_values
Data column names "Athlete"
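A hedged sketch of how these four features show up together in a pandas query. The rows, the `Games` and `Gold` column names, and the athlete labels are placeholders invented for illustration, not the real medal data; only the `"Athlete"` column name and `sort_values` come from the slide.

```python
import pandas as pd

# Placeholder rows standing in for the real medal data (assumption)
medals = pd.DataFrame([
    {"Games": "Rio (2016)", "Athlete": "Athlete A", "Gold": 1},
    {"Games": "Rio (2016)", "Athlete": "Athlete B", "Gold": 1},
    {"Games": "Rio (2016)", "Athlete": "Athlete B", "Gold": 1},
    {"Games": "London (2012)", "Athlete": "Athlete A", "Gold": 1},
])

# Generalised indexer: a boolean condition inside [ ]
rio = medals[medals["Games"] == "Rio (2016)"]

# Python dictionary {"key": value} selects the aggregation per column;
# the column names are plain strings that nothing checks
golds = rio.groupby("Athlete").agg({"Gold": "sum"})

# Operation name the user has to remember
ranking = golds.sort_values(by="Gold", ascending=False)
```

A typo in any of the string column names or in the operation name is only discovered when the line runs.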
\(\emptyset \vdash e : \tau\)
\(\pi(~~~~~~~) \vdash e : \tau\)
Parsing JSON weather forecast (PLDI 2016)
Interesting theoretical aspects of data access
Pragmatic structural shape inference
Predictable handling of schema change
Relative type safety property
{title : string, author : {age : int}} and {author : {age : float}} generalise to
{ title : option<string>, author : {age : float} }
{ coordinates : {lng:num, lat:num} } and string generalise to
{ coordinates : {lng:num, lat:num} } + string
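A simplified sketch of shape inference that reproduces the two examples above. It is not the algorithm from the PLDI 2016 paper: the tuple encoding of shapes, the function names `infer`, `optional` and `join`, and the use of `float` in place of `num` are all assumptions made for illustration.

```python
def infer(value):
    """Infer a shape from one sample value."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return ("record", {k: infer(v) for k, v in value.items()})
    raise TypeError(f"unsupported sample: {value!r}")

def optional(shape):
    """Wrap a shape in option<...> unless it already is one."""
    if isinstance(shape, tuple) and shape[0] == "option":
        return shape
    return ("option", shape)

def is_record(shape):
    return isinstance(shape, tuple) and shape[0] == "record"

def join(a, b):
    """Find a common shape that both a and b fit into."""
    if a == b:
        return a
    if (a, b) in (("int", "float"), ("float", "int")):
        return "float"  # int widens to float
    if is_record(a) and is_record(b):
        fa, fb = a[1], b[1]
        fields = {}
        for name in list(fa) + [n for n in fb if n not in fa]:
            if name in fa and name in fb:
                fields[name] = join(fa[name], fb[name])
            else:
                # Field missing in one sample becomes optional
                present = fa[name] if name in fa else fb[name]
                fields[name] = optional(present)
        return ("record", fields)
    # No common structure: fall back to a labelled union (the "+" above)
    return ("union", [a, b])

# Hypothetical sample values matching the shapes on the slide
book = join(infer({"title": "T", "author": {"age": 30}}),
            infer({"author": {"age": 30.5}}))
place = join(infer({"coordinates": {"lng": 1.0, "lat": 2.0}}),
             infer("somewhere"))
```

Here `book` has an optional `title` and a `float` age, and `place` is a union of the record shape and `string`.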
Provided type can change only in limited ways
\(C[e] \rightarrow C[e.M]\)
\(C[e] \rightarrow C[{\sf match}~e~{\sf with}~\ldots]\)
\(C[e] \rightarrow C[int(e)]\)
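One possible reading of these rewrites, sketched in Python under stated assumptions: when a new sample changes the inferred shape, existing code needs only local edits. The documents and function names are hypothetical. Only the option-handling and `int(e)` rewrites are shown; the member-access rewrite \(C[e.M]\) has no direct analogue in this untyped sketch.

```python
# Hypothetical samples: the second one drops "title" and uses a float age
old_doc = {"title": "A", "author": {"age": 30}}
new_doc = {"author": {"age": 30.5}}

# Code written against the shape inferred from old_doc only
def title_old(doc):
    return doc["title"]

def age_old(doc):
    return doc["author"]["age"]

# C[e] -> C[match e with ...]: title became optional, so the
# missing case has to be handled explicitly
def title_new(doc):
    title = doc.get("title")
    if title is None:
        raise ValueError("title missing")
    return title

# C[e] -> C[int(e)]: age became float, so the old integer result
# is recovered by an explicit conversion
def age_new(doc):
    return int(doc["author"]["age"])
```

The surrounding context `C` (everything that consumes the result) stays unchanged in both cases.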
Encoding complex logic via simple member access
Interesting theoretical aspects of data querying
Laziness for scaling to large hierarchies
Fancy types for the masses (ECOOP 2017)
Efficient on-the-fly code evaluation (Programming 2020)
Can non-experts really do this? (WiP 2022)
Row types track names and types of fields
\[\definecolor{cc}{RGB}{204,82,34} \definecolor{mc}{RGB}{0,0,153} \frac {\Gamma \vdash e : {\color{cc}[f_1:\tau_1, \ldots, f_n:\tau_n]}} {\Gamma \vdash e.\text{drop}~f_i : {\color{cc} [f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n]}}\]
Embed row types in provided nominal types
\[\frac {\Gamma \vdash e : {\color{mc} C_1}} {\Gamma \vdash e.\text{drop}~f_i : {\color{mc} C_2}}\]
\[\begin{array}{l} \\[-0.5em] {fields({\color{mc} C_1}) = {\color{mc} \{f_1:\tau_1, \ldots, f_n:\tau_n\}}}\\ {fields({\color{mc} C_2}) = {\color{mc} \{f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n\}}} \end{array}\]
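The two rules above can be sketched as follows. This is a minimal illustration, not the implementation used in The Gamma: the `Provider` class, its method names, the generated names `C1`, `C2`, and the `Gold` and `Team` columns are hypothetical. The plain `drop` function plays the role of the row-type rule; `Provider` hides the row behind a nominal name and keeps the `fields` mapping on the side.

```python
def drop(row, field):
    """Row-type rule: the result has every field except `field`."""
    if field not in row:
        raise KeyError(f"no field {field!r} in row type")
    return {name: ty for name, ty in row.items() if name != field}

class Provider:
    """Hands out nominal type names and remembers fields(C) for each."""

    def __init__(self):
        self._fields = {}  # nominal name -> row type

    def provide(self, row):
        # Reuse the existing name if this row type was provided before
        for name, known in self._fields.items():
            if known == row:
                return name
        name = f"C{len(self._fields) + 1}"
        self._fields[name] = dict(row)
        return name

    def fields(self, name):
        return self._fields[name]

    def drop(self, name, field):
        # Nominal rule: e : C1 gives e.drop f : C2, where fields(C2)
        # is fields(C1) without f
        return self.provide(drop(self.fields(name), field))

provider = Provider()
c1 = provider.provide({"Athlete": "string", "Gold": "int", "Team": "string"})
c2 = provider.drop(c1, "Team")
```

The user of the provided types only ever sees the names `C1` and `C2` and their members; the row types stay inside the provider.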
Getting data into the right format
Manual process taking 80% of analyst's time
Obtaining, merging and fixing data
Automatic AI tools still need some help!
Ad-hoc interfaces and feedback mechanisms
Research platform for The Gamma and AI assistants
Mix languages, build interactive tools, analyse code provenance
Wrattler and outlier detection (TaPP 2018)
Semi-interactive tools for data wrangling (Submitted 2022)
A tuple \((\mathit{best}, \mathit{choices}, f)\) such that
Datadiff AI assistant (Submitted 2022)
Empirical evaluation of number of interactions
Composable data visualization library (JFP 2021)
Functional ideas applied to data visualization
Linked visualizations via Galois dependencies (POPL 2022)
Uses program slicing to make linking automatic
Programming as interaction with a stateful environment
Programs as expressions in a formal grammar
Programs as lists of interactions with the environment
\(e := e_1 + e_2~|~e_1~e_2~|~\lambda x.e\)
\(p := a_1;~a_2;~\ldots;~a_n\)
Programming as a sequence of interactions
Accommodates manual data edits
Supports user interface interactions
Amenable to formal analysis
Correctness, provenance etc. (ICFP 2014)
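A minimal sketch of a program as a list of interactions \(a_1; a_2; \ldots; a_n\) replayed against a state. The `Interaction` type, the `run` function and the three example interactions are hypothetical; they only illustrate how a manual data edit fits into the same sequence as ordinary operations and how the recorded trace gives a simple form of provenance.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass(frozen=True)
class Interaction:
    label: str                    # human-readable description
    apply: Callable[[Any], Any]   # state -> new state

def run(program: List[Interaction], state: Any) -> Tuple[Any, list]:
    """Replay interactions in order, recording every intermediate state."""
    trace = [("start", state)]
    for interaction in program:
        state = interaction.apply(state)
        trace.append((interaction.label, state))
    return state, trace

program = [
    Interaction("load data", lambda _: [3, 1, 2]),
    # A manual edit is just another interaction in the list
    Interaction("manual edit: set row 0 to 5", lambda xs: [5] + xs[1:]),
    Interaction("sort descending", lambda xs: sorted(xs, reverse=True)),
]

final, trace = run(program, None)
```

Because the program is the list itself, it can be replayed, truncated or inspected step by step.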
Programming research in a new context
Data access, AI assistants, data visualization
From languages to programming systems