Open, reproducible and accessible data science
Human-centric programming languages and tools
SpreadsheetsEasy to use Simple problems only Not reproducible |
ProgrammingRequires expert skills Internet-scale Reproducible & open |
Exploring World Bank with type providers DDFP 2013
Unsafe dynamic access in a typed language
1: 2: 3: 4: 5: 6: 7: |
|
Unsafe dynamic access in a typed language
1: 2: 3: 4: 5: 6: 7: |
|
Accessing data from external data sources
Languages do not understand data
There is rarely explicit schema
Manually define types to caputre it
Easier in dynamic languages
Athletes by number of gold medals from Rio 2016
1: 2: 3: 4: 5: 6: |
|
Language and data source features you need to know
Python dictionaries {"key": value}
Generalised indexers .[ condition ]
Operation names sort_values
Data column names "Athlete"
\(\emptyset \vdash e : \tau\)
\(\pi(~~~~~~~) \vdash e : \tau\)
Reading data from an RSS feed PLDI 2016
Type providers for structured data
Structural shape inference
Language integration via type providers
Relative type safety
{title : string, author : {age : int}} {author : {age : float}}
{ title : option<string>, author : {age : float} }
{ coordinates : {lng:num, lat:num} } string
{ coordinates : {lng:num, lat:num} } + string
Pragmatic design choices for usability
Prefers records for tooling
Predictable and stable
Open world assumption about sums
Aggregating Olympic medallists ECOOP 2017
Encoding complex logic via simple member access
Type providers for member generation
Laziness for scaling to large hierarchies
Fancy types for the masses!
Row types to track names and types of fields
\[\definecolor{cc}{RGB}{204,82,34} \definecolor{mc}{RGB}{0,0,153} \frac {\Gamma \vdash e : {\color{cc}[f_1:\tau_1, \ldots, f_n:\tau_n]}} {\Gamma \vdash e.\text{drop}~f_i : {\color{cc} [f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n]}}\]
Embed row types in provided nominal types
\[\frac {\Gamma \vdash e : {\color{mc} C_1}} {\Gamma \vdash e.\text{drop}~f_i : {\color{mc} C_2}} \quad{\small \text{where}}\]
\[\begin{array}{l} \\[-0.5em] {fields({\color{mc} C_1}) = {\color{mc} \{f_1:\tau_1, \ldots, f_n:\tau_n\}}}\\ {fields({\color{mc} C_2}) = {\color{mc} \{f_1:\tau_1, \ldots, f_{i-1}:\tau_{i-1}, f_{i+1}:\tau_{i+1}, \ldots, f_n:\tau_n\}}} \end{array}\]
Powerful idea that works in other contexts
Row types and phantom types
Session types for communication
Add your own fancy type here!
Well typed programs do not go wrong.
As long as the world is well-behaved.
Given representative samples and an input value
\(S(d)\sqsubset S(d_1, \ldots, d_n)\)
Any program written using a type provider reduces
\(e_{user}[x\leftarrow {\color{mc}\text{new}}~C(d)] \rightsquigarrow^* v\)
Handling schema change and errors
Provided type can change only in limited ways
\(C[e] \rightarrow C[e.M]\)
\(C[e] \rightarrow C[{\sf match}~e~{\sf with}~\ldots]\)
\(C[e] \rightarrow C[int(e)]\)
Context \(L\) maps names to definitions and nested contexts
\(L(C) = {\color{mc}\text{type}}~C(x:\tau) = \overline{m}, L'\)
Pivot takes schema and provides a class with context
\(\text{pivot}(F) = C, L\)
Generate classes that drop individual columns
Generate class corresponding to a record shape
1 Data analytics is an interactive process
2 Program against data, not abstract symbols
Learning from spreadsheets Programming 2017
Making programming with data easier
Learning from spreadsheets
Understanding programmer interactions
Calculus of interactions
Cognitive cost of interactions
Interaction Study programming not programs
Type providers Dot driven with fancy types
Properties Relative safety but also cognitive cost
Tomas Petricek, University of Kent
tomas@tomasp.net | @tomaspetricek | tomasp.net
thegamma.net | fslab.org | gamma.turing.ac.uk
Don Syme, Keith Battocchi, Kenji Takeda, Donna Malayeri and Tomas Petricek. Themes in Information-Rich Functional Programming for Internet-Scale Data Sources. In proceedings of DDFP 2013
Tomas Petricek, Gustavo Guerra and Don Syme. Types from data: Making structured data first-class citizens in F#. PLDI 2016
Tomas Petricek. Data exploration through dot-driven development. In proceedings of ECOOP 2017