Statistic Computing

Table of Contents

1 R lang

1.2 Setting up

1.2.1 Packages

  • library() see the list of installed packages.
  • library(class) load the package class.
  • search() see the list of loaded packages.
  • install.packages() and update.packages() install and update packages.

1.2.2 Emacs ESS

To start ESS session, run command S. ESS will create a command interface as a buffer. Execute ?foo will open the R-doc for the function foo.

There's a babel in org mode for R, so just C-c C-c would work. This will prompt to create a session. One special for this babel is you can evaluate by line, using C-<RET> in the edit buffer.

Keep the session using the header:

#+PROPERTY: session *R*

To export a graph:

:file image.png :results output graphics :exports both

To export ordinary result:

:exports both

To export some summary data:

:exports both :results output org

1.2.3 Interactive command

We need to know what's going on in the current workspace. getwd() and setwd() get and set the current dir. ls() list the objects currently stored. rm(x, y, z) remove objects rm(list=ls()) remove all objects. objects() create and store current objects.

We can perform some IO by save.image("mydata.Rdata") and load("mydata.Rdata") will save and load workspace in current directory respectively. source("a.R") loads a script. sink("a.lis") redirects the output to a file, and sink() restore that to standard output.

You can view documentation by calling help help(lm). ?lm and ??solve also shows documentation, while example(topic), as its name indicates, shows the examples. help.start() opens the html documentation page.

Finally, q() quit R.

The print function can output the value of a variable.

cat("world", "!", "\n")

To enter the debugger, a call to browser function suffices. This allows you to browse the values at that point. A more powerful debugger is by a call to debug with the function name as argument. Each time that function is called, you enter the debug and can control the execution. Tracing can be registered by trace or untrace with the name of the function. It might need to be quoted in some case, so you'd better quote it, with double quotes. Every time the function is invoked, the return value will be printed as trace.

1.2.4 Commonly used functions

  • str show the structure of arbitrary type
  • summary print the summary

To see the data, you can use:

  • dim()
  • length()
  • head()
  • tail()
  • summary(dataset) shows some information like max,min,mean
  • class(data$col) get type
  • levels(data$col) if it is factor, get the values
  • stopifnot(): the assert

1.3 Scope

The scope is pretty weired, it is lexical scoping, but assign to the variable has no effect. You have to use <<- to assign.

f <- function() {
  var <- 0
    print = function() {
    add1 = function() {
      var <- var + 1
    sub1 = function() {
      var <<- var - 1
l <- f()

The function parameter is always local, changing it have no effect outside:

f <- function (var) {
  var[1] <- 8
x <- c(1,2,3)

It is call by value, for all objects. You can NOT even change a global variable, even using <<-:

x <- 1
f <- function() {
  x <<- 2

1.4 Types

1.4.1 primitives

Looks like all numbers are double by default:


1.4.2 Vector

Create a vector by c(), which is append:

x <- c(1,2,3)
c(x, 0, x)

Vectors are the array of objects of the same mode (type).

c(1, "hello")

To create a sequence:


Colon operator has higher priority:


The more powerful sequence function is seq:

seq(-1, 1, by=.2)
seq(length=10, from=-5, by=.2)

Repeating something:

x <- c(1,2,3)
rep(x, times=3)
rep(x, each=3)

The length, mode, typeof

x <- c(1,2,3)

1.4.3 Indexing

Vectors can be indexed by integers, starting from 1.

x <- c(1,2,3,4)

It can also be indexed by vector of integers

x <- c(1,2,3,4)

Negative index selects the elements other than those index. The index 0 will return empty (HEBI: ??).

x <- c(1,2,3,4)
x[c(-1, -3)]

Indexing by logical vector

x <- c(1,2,3,4)

Or by its name, where the string is compared partially (HEBI: ??)

x <- c(1,2,3)
names(x) <- c("hello", "world", "lala")
x[c("hello", "world")]

1.4.4 cbind & rbind

This is to combine (for matrix, similar to append for vector) arguments in row-wise or column-wise, and creates matrix. This will apply broadcast.

cbind(0, rbind(1, 1:3))

1.4.5 matrix

Matrix can be created by the matrix function.

x <- matrix(1:8, nrow=2)

Specify dimnames:

matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), LETTERS[1:3]))

When indexing, the drop argument (defaulting to TRUE), if TRUE, the result is coerced to the lowest possible dimension:

m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), LETTERS[1:3]))
m[1, , drop = FALSE]

1.4.6 list

list is a misnomer, it is a dict

l <- list(hello=1, "world"=2)

When indexing lists, [] retains names, while [[]] returns only the element:

c(abc = 123)[1]
c(abc = 123)[[1]]

1.4.7 Data frame

can omit the NA values in data frame

A data frame is a list of equal-length vectors. When getting the data from read.csv, the result is a data frame. Use names to work on data frames will emit the names.

  • Since it is a list, using [] to index will give also the list, a.k.a. data frame, retaining names. You can use a vector as index.
  • Using [[]] to index will give the value, dropping names. You cannot use a vector as index.

1.4.8 type conversion

you can change a type of a vector by

  • as.factor(x)
  • as.numeric()

1.4.9 TODO data loading

  • read.csv
  • write
  • write.table
  • write.csv
  • read.table("filename", header=TRUE, sep=",")
    • this ignores blank lines,
    • and expect the header to be one field less than the body.
    • # as comments
  • read.delim
  • cat outputs the data, no index, no newline
  • attach(data): make the columns into this namespace
  • detach(data): remove those

1.4.10 TODO missing value

The missing values are NA, tested by is.na. Illegal computations produces NaN, e.g. 0/0.

1.4.11 TODO set


1.4.12 TODO string


1.4.13 Evaluation rules (broadcast)

  • recycling rules: the shortest list is recycled to the length of longest.
  • dimensional attributes: the dimension of matrix must match. No recycle for a matrix.

1.5 Operators

  • arithmetic: +-*/, ^ for exp, %% for modulus
  • matrix: %*% matrix product, %o% outer product
  • logic: !, &, | for vector, &&, || for no vector
  • relative: >, <, ==, <=, >=
  • general: <-, -> assignments, $ list subset, : sequence, ~ for model formula

Built-in functions:

  • log, exp, sin, cos, tan, sqrt
  • min, max
  • range: same as c(min(x),max(x))
  • length(x), sum(x), prod(x) (product)
  • mean(x): sum(x)/length(x)
  • var(x): sum((x-mean(x))^2)/(length(x)-1)
  • sort(x): increasing order
  • order() or sort.list()
  • paste(sep" ")= function takes an arbitrary number of arguments and concatenates them one by one into character strings. The argument can be numeric.
  • toString(8): convert integer to string
  • round(x, digits=0)

1.6 Control Structure

The compound statements are the same as C, can be a single statement without the braces.

1.6.1 Conditional

  • if: if (STMT) STMT else if (STMT) STMT else STMT

The switch is quite interesting. It is

switch(x, list...)

The semantic is to find x in the list and return the value. If x is not in list, return NULL.

An unnamed list has implicit index as name:


Using name:

switch("hello", hello=1, world=2)

not found, return NULL:

is.null(switch("nothing", hello=1))

Interestingly, even for named list, we can still use the index:

switch(2, hello=1, world=2)

1.6.2 Loop

  • for loop: for (NAME in VECTOR) STMT
  • while loop: while (STMT) STMT
  • repeat: repeat STMT
  • break, next
ret <- c()
for (i in 1:5) {
  ret <- c(ret, i)

1.7 Function

function (ARGLIST) BODY

The argument list can be a symbol, a symbol=value, or a .... The body is a compound expression, surrounded with {}. Function can be assigned to a symbol.

The matching of formals and actual are pretty tricky.

  1. exact matching on tags
  2. partial matching on tags
  3. positional matching for ... Partial matching result must be unique, but the exact matched ones are excluded before this step is entered.
f <- function(a, b, c) {
  return(c(a, b+c))

1.8 Quote (!!)

The quote will wrap the expression into an object without evaluating it. The resulting object has the mode of call. The eval is used to evaluate it.

  • quote
  • substitute
  • eval
  • call

1.9 Models

1.9.1 Linear model

A linear model is created and fitted by lm() function, with the model formula and data frame. For example:

df = data.frame(x=c(1,2,3), y=c(2,4,6))
fm = lm(y ~ x, data=df)

The fitted model in the variable fm can be accessed by:

  • coef: extract the coefficients
  • deviance: the Residual Sum of Square
  • formula: extract the model formula
  • plot: produce four plots: residuals, fitted values, diagnostics.
  • predict(OBJECT, newdata=DATA.FRAME): use the model to predict
  • residuals: extract the residuals
  • summary()

The models can be updated, if the formula only changes a little bit. In the following example, the . means the corresponding part of the original formula.

fs <- lm(y ~ x1 + x2, data=mydf)
fs <- update(fs, . ~ . + x3)
fs <- update(fs, sqrt(.) ~ .)

1.10 Plot

Process data:

  • table
  • cut(data, breaks=c(1,3,8))

1.10.1 Decoration

  • box
  • axis
  • las attribute
  • legend
  • par
  • text
  • mtext
  • points

1.10.2 Plot Types plot
  • lines
  • abline barplot pie boxplot
  • quantile hist
  • lines(density(data)) TODO stem TODO mosaicplot pairs

1.10.3 Device Driver

When outputting some image, you have to tell R which format you want to use. The default on linux is X11, that's why it opens a image immediately after plotting. To use a device, call the device function, and after that all graphics output will be sent to that device.

  • X11
  • pdf
  • png
  • jpeg

    When you have finished with a device, terminate it by dev.off().

    To output to a file TODO to open plot in emacs:

      dev.control(displaylist = "enable")
      dev.copy(pdf, "test2.pdf")
      # should now have a valid test2.pdf
      dev.off() # finished

1.11 FFI

To call a C function in R, first load the C shared library:


Then, you can directly call a C function:

.C("foo", as.double(a), as.integer(b))

Or, if you need to pass a R object to the C function, you use

.Call("foo", a, b)
.External("foo", a, b)

R also provides a command to create shared object, instead of using raw gcc:


1.12 Packages

1.12.1 ggplot2

qplot(totbill, tip, geom="point", data=tips) # scatter plot
qplot(totbill, tip, geom="point", data=tips) + geom_smooth(method="lm") # with linear relationship line
qplot(tip, geom="histogram", data=tip) # histogram
qplot(tip, geom="histogram", binwidth=1, data=tips) # with custom binwidth
# box plots
qplot(sex, tipperc, geom="boxplot", data=tips)
qplot(smoker, tipperc, geom="boxplot", data=tips)
qplot(sex:smoker, tipperc, geom="boxplot", data=tips) # combine! plot the two sets of graph in two one graph
qplot(totbill, tip, geom="point", colour=day, data=tips) # scatter plot with colors, in regard to "day" column

1.12.2 plot(x, y, …)

Possible ... arguments:

  • type what type of plot:
    • p for points,
    • l for lines,
    • b for both,
    • h for histogram like (or high-density) vertical lines,
  • main an overall title for the plot: see title.
  • xlab a title for the x axis: see title.
  • ylab a title for the y axis: see title.

1.12.3 dplyr

A Grammar of Data Manipulation

2 Julia

2.1 Macros

Julia macros are hygienic:

First, variables within a macro result are classified as either local or global. A variable is considered local if it is assigned to (and not declared global), declared local, or used as a function argument name. Otherwise, it is considered global.

Local variables are then renamed to be unique (using the gensym function, which generates new symbols), and global variables are resolved within the macro definition environment.

And you can escape the hygienic by esc, e.g.

macro zerox()
    return esc(:(x = 0))

A common pattern:

for op = (:sin, :cos, :tan, :log, :exp)
    eval(:(Base.$op(a::MyNumber) = MyNumber($op(a.x))))

2.2 Julia Libraries

2.2.1 reference

2.2.4 ML library mpastell/LIBSVM.jl

Interface to libsvm

2.2.5 Optimizers Optim.jl

optimization JuMP.jl

another optimizer with more solvers

2.2.7 Datasets Metalhead.jl

Some vision models and dataset DataFrames.jl

2.2.8 Graph LightGraphs.jl

A great package for

  • just the graph
  • generate different random graphs
  • traversal
  • plotting
  • algorithms:
    • shortest path
    • minimum spanning tree
  • distance metrics MetaGraphs.jl

LightGraphs with arbitrary data on nodes. Compose.jl

The racket/pict for Julia. GraphLayout.jl


2.2.9 Images ColorTypes.jl Images.jl

colorview, channelview, RGB

2.2.10 Compiler tools MacroTools.jl PackageCompiler.jl

To remove JIT compile overhead

2.2.11 Probablistic packages Distributions.jl GLM.jl (!!!)

2.3 Using Pkg

using Pkg
Pkg.add(PackageSpec(url="https://github.com/lihebi/julia-repl", rev="master"))

To develop a project:


Then view the current pkg status:


You will see:

EmacsREPL v0.1.0 [`~/.julia/dev/EmacsREPL`]

2.4 Useful functions

  • sortperm(v): Return a permutation vector I that puts v[I] in sorted order.
  • findfirst(predicate::Function, A): Return the index or key of the first element of A for which predicate returns true.
  • mapreduce(f, op, itrs...; [init]): Apply function f to each element(s) in itrs, and then reduce the result using the binary function op

intuitive ones:

  • reverse
  • abs
  • median

3 Back

  • For vectors, [] returns the element.
  • For lists, [] will return the the element inside a list, while [[]] will return the single element.
  • $ can be used for indexing with character.
  • The empty index [] will returns the entire vector with irrelevant attributes removed. The only retained ones are the names, dim and dimnames attributes.
dim(z) <- c(3,5,100)~

3.0.1 data example

## (HEBI: Command line arguments)
args = commandArgs(trailingOnly=TRUE)
csvfile = args[1]
csv = read.csv(csvfile, header=TRUE)

total_test <- dim(csv)[[1]]
sub = subset(csv, reach_code>=5)
total_reach_poi <- dim(sub)[[1]]
sub = subset(csv, reach_code==5 & status_code == 1)
total_fail_poi <- dim(sub)[[1]]

sub <- sub[1:(length(csv)-2)]
## (HEBI: calling a function)
funcs = TransferFunction(sub);

## (HEBI: define a function)
Constant <- function(data) {
  ## (HEBI: return value as a vector)
  ret <- c()
  i <- 1
  ## (HEBI: a for loop using the vector as range)
  for (i in c(1:length(data))) {
    col = data[i];
    ## (HEBI: Get the name of a column)
    name = names(col);
    if (substr(name, 1, 6) == "output") {
      ## (HEBI: remove of NA)
      newcol = col[!is.na(col)];
      if (length(newcol) > 2) {
        value <- newcol[1]
        ## (HEBI: check the value of the vector is all the same)
        if (length(newcol[newcol != value]) == 0) {
          ## (HEBI: pushing a new value to the return vector)
          ret <- c(ret, paste("name = ",  value))}}}}