The Wallace Line.

Blog

Index

R and XQuery 3.0

No self-respecting datasmith should be without R in their toolbox. The open source statistical language R was first developed 20 years ago, coincidently at the same university (Auckland) where I studied statistics even longer ago. I thought I'd join a Coursera module and get the basics under my belt. Besides I'm interesting in how a MOOC language course would work - we could do with one for eXist and XQuery.

Selecting data

The first exericse in the course is based on a sample csv file which we load, extract various subsets and get basic statistics. All pretty striaghtforward but the syntax for subsetting is rather confused and confusng in my view. This is just the kind of thing XPath is good for so I spent a short time reworking the exercise in XQuery instead.

Here are some of tasks and the R and XQuery equivalents:

in R the CSV file is loaded into a dataframe:

d <- read.csv("hwi.csv")

In XQuery the CSV file was converted (via a custom function) to XML and saved in the database. Here is an extract:

let $d := doc("/db/apps/r/data/hw1.xml")/table/row (: all rows :)

Ignoring the issue of missing values, here are some of the questions:

1) the first two rows:

R: d[1:2, ] //the traiilng comma is needed because we don't want to restrict the second dimenison , the column dimension - it is easy to forget to include it

XPath $d[position() <= 2 ]

2) the last two rows:

R: d [seq(from=dim(d)[1]-1,length=2), ] // I guess this could be simpler - I'm only an newb

XPath: $d[position() >= count($d) - 1]

3) the value of Ozone in row 47

R: d$Ozone[47]

XPath: $d[47]/Ozone

4) the mean value of Ozone

R : mean (d$Ozone)

XPath: stats:mean ($d/Ozone) (: stats is a custom stats library see https://github.com/KitWallace/stats

5) the mean of the Solar.R values when Ozone is greater than 31 and Temp is greater than 90:

R: mean(subset(d, Ozone > 31 & Temp > 90)$Solar.R)

XPath: stats:mean($d[Ozone > 31 and Temp > 90]/Solar.R)

6) the mean temperture in Month 6 :

R : mean(subset(d,Month=6)$Temp)

XPath: stats:mean($d[Month=6]/Temp

In R, filtering uses a few different forms d[..], d$var, seq(), subset() whereas XPath is more uniform, and, thanks to it being an expression language, more expressive too.

The raw data contains NA values which cause R functions to return NA but cause an error in the XQuery stats functions,so we must remove all NA values:

We can remove individual NA values from the rows:

XQuery: let $dp := for $row in $data return element row {$row/*[not (. = "NA")]}

(: surely that could be simplified?- still learning XQuery too :)

to create a dataset which is safe to perform numerical calculations on (provided the data is numeric)

Or we could remove all rows containing missing values to work with complete rows :

R: complete.cases(d)

XPath : $d[not(* = "NA")]

Vector arithmetic and Higher Order Functions

What is nice in R is vector-wise arithmetic :

R: x <-c(1,2,3,4,5,6)# Create ordered collection (vector)

y <- x^2 # square each element of x

mean(y)

In XQuery 2.0 this would have required an explicit loop but XQuery 3.0 supports higher-order functions )HOFs] and the latest version of eXist RC2 supports much of XQuery 3.0.

So we now can write:

     let $x := (1,2,3,4,5,6)
     let $y:= map(math:power(?,2), $x)
     return stats:mean($y)

Here the function math:power() is being partially applied, specifiying the second parameter as 2. This partial function is then evaluated for each of the items in the sequence $x to create the sequence $y.

Functions in R are first-class objects so we could also say:

sqr <- function(n) {n * n}

y <- sqr(x)

whilst in XQuery 3.0:
let $f := function($x) {$x * $x}
let yt := map($f, $y)

So the languages are closer than I first thought.

Of course R has an extensive community-contributed library of modules covering every conceivable statistical analysis, graphing and lots more besides, but it is a pity that value selection has not been able to use the XPath model.