Programming
as Human Data Interaction







Tomas Petricek, University of Kent

http://tomasp.net | tomas@tomasp.net | @tomaspetricek

DATA SCIENCE

Motivation

Open, reproducible and accessible data science


Approach

Human-centric programming languages and tools

Spreadsheets

Easy to use

Simple problems only

Not reproducible

Programming

Requires expert skills

Internet-scale

Reproducible & open

What makes spreadsheets successful?

How can programming tools learn?

Think of programming as interaction!

DEMO

Exploring World Bank with type providers DDFP 2013

WORKING WITH DATA

Reading data

Unsafe dynamic access in a typed language

1: 
2: 
3: 
4: 
5: 
6: 
7: 
var url = "http://dvd.netflix.com/Top100RSS";
var rss = XDocument.Load(topRssFeed);
var channel = rss.Element("rss").Element("channel");

foreach(var item in channel.Elements("item")) {    
  Console.WriteLine(item.Element("text").Value);
}
Not found!

Reading data

Unsafe dynamic access in a typed language

1: 
2: 
3: 
4: 
5: 
6: 
7: 
var url = "http://dvd.netflix.com/Top100RSS";
var rss = XDocument.Load(topRssFeed);
var channel = rss.Element("rss").Element("channel");

foreach(var item in channel.Elements("item")) {    
  Console.WriteLine(item.Element("title").Value);
}

Reading data

Accessing data from external data sources

Languages do not understand data

There is rarely explicit schema

Manually define types to caputre it

Easier in dynamic languages

Aggregating data

Athletes by number of gold medals from Rio 2016

1: 
2: 
3: 
4: 
5: 
6: 
olympics = pd.read_csv("olympics.csv")
olympics[olympics["Games"] == "Rio (2016)"]
  .groupby("Athlete")
  .agg({"Gold": sum})
  .sort_values(by="Gold", ascending=False)
  .head(8)
Unknown file
Column name

Aggregating data

Language and data source features you need to know

Python dictionaries {"key": value}

Generalised indexers .[ condition ]

Operation names sort_values

Data column names "Athlete"

TYPE PROVIDERS

\(\emptyset \vdash e : \tau\)

\(\pi(~~~~~~~) \vdash e : \tau\)

DEMO

Reading data from an RSS feed PLDI 2016

F# Data library

Type providers for structured data

Structural shape inference

Language integration via type providers

Relative type safety

{title : string, author : {age : int}} {author : {age : float}}

{ title : option<string>,  author : {age : float} }

{ coordinates : {lng:num, lat:num} } string

{ coordinates : {lng:num, lat:num} } + string 

Shape inference

Pragmatic design choices for usability

Prefers records for tooling

Predictable and stable

Open world assumption about sums

DEMO

Aggregating Olympic medallists ECOOP 2017

Dot-driven development

Encoding complex logic via simple member access

Type providers for member generation

Laziness for scaling to large hierarchies

Fancy types for the masses!

BEHIND THE SCENES

Fancy types for the masses

Row types to track names and types of fields

\[\definecolor{cc}{RGB}{204,82,34} \definecolor{mc}{RGB}{0,0,153} \frac {\Gamma \vdash e : {\color{cc}[f_1:\tau_1, \ldots, f_n:\tau_n]}} {\Gamma \vdash e.\text{drop}~f_i : {\color{cc} [f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n]}}\]

Embed row types in provided nominal types

\[\frac {\Gamma \vdash e : {\color{mc} C_1}} {\Gamma \vdash e.\text{drop}~f_i : {\color{mc} C_2}} \quad{\small \text{where}}\]

\[\begin{array}{l} \\[-0.5em] {fields({\color{mc} C_1}) = {\color{mc} \{f_1:\tau_1, \ldots, f_n:\tau_n\}}}\\ {fields({\color{mc} C_2}) = {\color{mc} \{f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n\}}} \end{array}\]

Fancy types for the masses!

Powerful idea that works in other contexts

Row types and phantom types

Session types for communication

Add your own fancy type here!

Relative type safety


Well typed programs do not go wrong.

As long as the world is well-behaved.

F# Data and safety

Given representative samples and an input value

\(S(d)\sqsubset S(d_1, \ldots, d_n)\)

Any program written using a type provider reduces

\(e_{user}[x\leftarrow {\color{mc}\text{new}}~C(d)] \rightsquigarrow^* v\)

DEMO

Handling schema change and errors

F# Data and schema change

Provided type can change only in limited ways


\(C[e] \rightarrow C[e.M]\)

\(C[e] \rightarrow C[{\sf match}~e~{\sf with}~\ldots]\)

\(C[e] \rightarrow C[int(e)]\)


Structure of a type provider


Context \(L\) maps names to definitions and nested contexts

\(L(C) = {\color{mc}\text{type}}~C(x:\tau) = \overline{m}, L'\)

Pivot takes schema and provides a class with context

\(\text{pivot}(F) = C, L\)

Pivot type provider

Generate classes that drop individual columns

XML/JSON type provider

Generate class corresponding to a record shape

HUMAN DATA INTERACTION

Programming for data science

1 Data analytics is an interactive process

2 Program against data, not abstract symbols

DEMO

Learning from spreadsheets Programming 2017

Future work

Making programming with data easier

Learning from spreadsheets

Understanding programmer interactions

Calculus of interactions

Cognitive cost of interactions

SUMMARY

Thank you!

Programming as Human Data Interaction

Interaction Study programming not programs

Type providers Dot driven with fancy types

Properties Relative safety but also cognitive cost



Tomas Petricek, University of Kent

tomas@tomasp.net | @tomaspetricek | tomasp.net
thegamma.net | fslab.org | gamma.turing.ac.uk

References

Don Syme, Keith Battocchi, Kenji Takeda, Donna Malayeri and Tomas Petricek. Themes in Information-Rich Functional Programming for Internet-Scale Data Sources. In proceedings of DDFP 2013

Tomas Petricek, Gustavo Guerra and Don Syme. Types from data: Making structured data first-class citizens in F#. PLDI 2016

Tomas Petricek. Data exploration through dot-driven development. In proceedings of ECOOP 2017