combining and operating on matrices twice nested in a list - r

If xmpl is a list where each element has an integer age and a list data, and data contains three matrices of equal size (a to c),
what is the best way to do something like
cor( xmpl[[:]]$data[[:]][c('a','b','c')], xmpl[[:]]$age)
where the result would be a 3 x length(a) array or list that reflects age correlated with each element of a (row 1), b (row 2), and c (row 3), taken across all entries of xmpl?
I am reading in matrices that represent the output of different pipelines. There are 3 of these per subject and a whole lot of subjects. Currently, I've built a list of subjects that has among other things a list of pipeline matrices.
The structure looks like:
str(xmpl)
$ :List of 4
..$ id : int 5
..$ age : num 10
..$ data :List of 3
.. ..$ a: num [1:10, 1:10] 0.782 1.113 3.988 0.253 4.118 ...
.. ..$ b: num [1:10, 1:10] 5.25 5.31 5.28 5.43 5.13 ...
.. ..$ c: num [1:10, 1:10] 1.19e-05 5.64e-03 7.65e-01 1.65e-03 4.50e-01 ...
..$ otherdata: chr "ignorefornow"
#[...]
I want to correlate every element of a across all subjects with the age of subjects. Then do the same for b and c and put the results into a list.
I think I am approaching this in a way that is awkward for R. I'm interested in what the "R way" of storing and retrieving this data would be.
Data Structure and desired output http://dl.dropbox.com/u/56019781/linked/struct-2012-12-19.svg
library(plyr)
## example structure
xmpl.mat <- function(){ matrix(runif(100),nrow=10) }
xmpl.list <- function(x){ list( id=x, age=2*x, data=list( a=x*xmpl.mat(), b=x+xmpl.mat(), c=xmpl.mat()^x ), otherdata='ignorefornow' ) }
xmpl <- lapply( 1:5, xmpl.list )
## extract
ages <- laply(xmpl,'[[','age')
data <- llply(xmpl,'[[','data')
# to get the cor for one set of matrices is easy enough
# though it would be nice to do: a <- xmpl[[:]]$data$a
x.a <- sapply(data,'[[','a')
x.a.corr <- apply(x.a,1,cor,ages)
# ...
#xmpl.corr <- list(x.a.corr,x.b.corr,x.c.corr)
# and by loop, not R like?
xmpl.corr<-list()
for (i in 1:length(names(data[[1]])) ){
x <- sapply(data,'[[',i)
xmpl.corr[[i]] <- apply(x,1,cor,ages)
}
names(xmpl.corr) <- names(data[[1]])
Final output:
str(xmpl.corr)
List of 3
$ a: num [1:100] 0.712 -0.296 0.739 0.8 0.77 ...
$ b: num [1:100] 0.98 0.997 0.974 0.983 0.992 ...
$ c: num [1:100] -0.914 -0.399 -0.844 -0.339 -0.571 ..

Here's a solution. It should be short enough.
ages <- sapply(xmpl, "[[", "age") # extract ages
data <- sapply(xmpl, function(x) unlist(x[["data"]])) # combine all matrices
corr <- apply(data, 1, cor, ages) # calculate correlations
xmpl.corr <- split(corr, substr(names(corr), 1, 1)) # split the vector
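A variation on the same idea, as a rough sketch only: it assumes the three matrices are all the same size (as stated in the question) and groups the results by an explicit label vector instead of the first character of the row name, which should keep working if the matrix names are longer than one letter. The names cells, n.cells and grp are made up for this illustration.

```r
# Toy data in the same shape as the question (seed added for reproducibility)
set.seed(1)
xmpl.mat <- function() matrix(runif(100), nrow = 10)
xmpl <- lapply(1:5, function(x) list(
  id = x, age = 2 * x,
  data = list(a = x * xmpl.mat(), b = x + xmpl.mat(), c = xmpl.mat()^x)
))

ages  <- vapply(xmpl, function(s) s$age, numeric(1))
# One column per subject, one row per matrix cell (a1..a100, b1..b100, c1..c100)
cells <- sapply(xmpl, function(s) unlist(s$data))

# cor() on a matrix correlates each column with ages, so transpose first
corr <- cor(t(cells), ages)[, 1]

# Label every row with the matrix it came from (assumes equal-sized matrices)
n.cells   <- length(xmpl[[1]]$data[[1]])
grp       <- rep(names(xmpl[[1]]$data), each = n.cells)
xmpl.corr <- split(unname(corr), grp)
str(xmpl.corr)
```

Calling cor() once on the transposed matrix gives the same numbers as apply(data, 1, cor, ages), just without the row-by-row loop.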

Instead of x.a, x.b, x.c you would probably want to have all of these in one list.
# First, get a list of the items in data
abc <- names(xmpl[[1]]$data) # in case variables change in future
names(abc) <- abc # these are the same names that will be used for the final list. You can use whichever names make sense
## use lapply to keep as list, use sapply to "simplify" the list
x.data.list <- lapply(abc, function(z)
sapply(xmpl, function(xm) c(xm$data[[z]])) )
ages <- sapply(xmpl, `[[`, "age")
# Then compute the correlations. Note that on each element of x.data.list we are apply'ing per row
correlations <- lapply(x.data.list, apply, 1, cor, ages)

Related

Remove outlier from five-number summary statistics

How can I force fivenum function to not put outliers as my maximum/minimum values?
I want to be able to see the upper and lower whisker numbers on my boxplot.
My code:
boxplot(data$`Weight(g)`)
text(y=fivenum(data$`Weight(g)`),labels=fivenum(data$`Weight(g)`),x=1.25, title(main = "Weight(g)"))
boxplot returns a named-list that includes things you can use to remove outliers in your call to fivenum:
$out includes the literal outliers. It can be tempting to use setdiff(data$`Weight(g)`, bp$out) (with bp being the value returned by boxplot), but that may be prone to problems with floating-point equality (see R FAQ 7.31), so I recommend against this; instead,
$stats includes the numbers used for the boxplot itself without the outliers. I suggest we work with this.
(BTW, title(.) does its work via side-effect and its result is not used by text(.), so I suggest you move that call out of text(.).)
Reproducible data/code:
vec <- c(1, 10:20, 30)
bp <- boxplot(vec)
str(bp)
# List of 6
# $ stats: num [1:5, 1] 10 12 15 18 20
# $ n : num 13
# $ conf : num [1:2, 1] 12.4 17.6
# $ out : num [1:2] 1 30
# $ group: num [1:2] 1 1
# $ names: chr "1"
five <- fivenum(vec[ vec >= min(bp$stats) & vec <= max(bp$stats)])
text(x=1.25, y=five, labels=five)
title("Weight(g)")
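If you only need the numbers and not the plot, a minimal sketch along the same lines is to call boxplot.stats() (from grDevices, which is what boxplot uses to compute the whiskers). The variable names bs and inliers are just for illustration.

```r
vec <- c(1, 10:20, 30)

# boxplot.stats() computes the same whisker ends and outliers without drawing
bs <- boxplot.stats(vec)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values flagged as outliers

# Keep only values within the whiskers, then summarise those
inliers <- vec[vec >= bs$stats[1] & vec <= bs$stats[5]]
fivenum(inliers)
```

Note that fivenum() on the trimmed data recomputes the hinges, so its middle values can differ slightly from bs$stats; if you want the labels to match the drawn box exactly, label with bs$stats instead.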

Is there an R package with a generalized class of data.frame in which a column can be an array (or how do I define such a class)?

I have been wondering about this for a long time. The data.frame class in base R only allows the columns to be vectors. I was looking for a package which generalizes this so that each "column" can be a 2-d or even n-d array, with methods similar to those of the original data.frame class, such as subsetting with "[]", merge, aggregate, etc.
My reason for such a class is to deal with Monte Carlo simulation data. For example, for each simulation the result can be expressed as a data frame in which the row indices are dates, and columns include character and numeric. If I simulate 1000 times then I get 1000 such data frames. If there is a class in R with which I can store the results in one object and has the convenience of most of the data.frame methods, it'll make my coding a lot easier.
As I couldn't find such a package I attempted to create my own with no success. I came across this package "S4Vectors" with a "DataFrame" class, which "supports the storage of any type of object (with length and [ methods) as columns." Here is my attempt.
library(S4Vectors)
test <- matrix(1:6,2,3)
test1 <- matrix(7:12,2,3)
setClass("Column", slots=list(), contains = "matrix")
setMethod("length", "Column", function(x) {nrow(x)})
'[.Column' <- function(x, i, j, ...) {
i <- ((i-1)*ncol(x)+1):(i*(ncol(x)))
NextMethod()
}
testColumn <- new("Column", test)
testColumn1 <- new("Column", test1)
length(testColumn)
testColumn[1]
testDataFrame <- DataFrame(Col1 = testColumn, Col2 = testColumn1)
I did get the length and [ method to work but the last statement gives an error "cannot coerce class "Column" to a DataFrame".
Has anyone ever tried to do something similar?
Update: Thanks to G. Grothendieck I now know a data frame can take a matrix as a column by using the I() function. Now I am wondering if there is a way to preserve such a structure in all operations. An example would be to aggregate the data frame
data.frame(v = c(1,1,2,2), m = I(diag(4)))
by v so that the result is
data.frame(v = c(1,2), m = I(matrix(c(1,1,0,0,0,0,1,1), 2, 4, byrow = T))).
Data frames do allow matrix columns:
m <- diag(4)
v <- 1:4
DF <- data.frame(v, m = I(m))
str(DF)
giving:
'data.frame': 4 obs. of 2 variables:
$ v: int 1 2 3 4
$ m: 'AsIs' num [1:4, 1:4] 1 0 0 0 0 1 0 0 0 0 ...
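As a small sketch of how such a column behaves afterwards (an illustration, not part of the original answer): assigning the matrix with $<- also stores it as a single column, and row subsetting keeps the matrix shape. The name sub is made up here.

```r
DF <- data.frame(v = 1:4)
DF$m <- diag(4)          # assigning a matrix with $<- keeps it as one column
ncol(DF)                 # still 2 columns: v and m

sub <- DF[DF$v > 2, ]    # row subsetting carries the matrix column along
dim(sub$m)               # 2 rows, 4 columns
```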
Update 1
The R aggregate function can create matrix columns. For example,
DF <- data.frame(v = 1:4, g = c(1, 1, 2, 2))
ag <- aggregate(v ~ g, DF, function(x) c(sum = sum(x), mean = mean(x)))
str(ag)
giving:
'data.frame': 2 obs. of 2 variables:
$ g: num 1 2
$ v: num [1:2, 1:2] 3 7 1.5 3.5
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "sum" "mean"
Update 2
I don't think the aggregation discussed in the comments is nicely supported in R but you can use the following workaround:
m <- matrix(1:16, 4)
v <- c(1, 1, 2, 2)
DF <- data.frame(v, m = I(m))
nr <- nrow(DF)
ag2 <- aggregate(list(sum = 1:nr), DF["v"], function(ix) colSums(DF$m[ix, ]))
str(ag2)
giving:
'data.frame': 2 obs. of 2 variables:
$ v : num 1 2
$ sum: num [1:2, 1:4] 3 7 11 15 19 23 27 31
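For the specific case of summing a matrix column by group, another possible sketch uses base R's rowsum(), which sums the rows of a matrix within each group. This reproduces the result asked for in the question's update; sums and ag are illustrative names.

```r
v <- c(1, 1, 2, 2)
m <- diag(4)

# rowsum() adds up the rows of m within each level of v
sums <- rowsum(m, group = v)

# Rebuild a data frame with one key column and one matrix column
ag <- data.frame(v = as.numeric(rownames(sums)), m = I(unname(sums)))
str(ag)
```

This only covers sums; for other summaries the index-based aggregate workaround above is more general.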

Best way to make factor levels uniform over a number of data frames

I have a number of data.frames that each have a factor. I want to make sure that they all use the same levels. What is the proper way to do this?
In the code below you'll see that I reassign the factor for each case using levels from the overall set of levels with a small convenience function changeFactor. I would expect that there is a better way to do this though.
library(ggplot2)
set.seed(1234)
b<-round(runif(100,1,10),digits=2)
set.seed(2345)
b2<-round(runif(100,11,20),digits=2)
set.seed(3456)
b3<-round(runif(50,15,18),digits=2)
#.. all potential levels
bt<-factor(sort(c(b,b2,b3)))
lvls<-levels(bt)
t1<-as.data.frame(table(sample(b,5)))
t2<-as.data.frame(table(sample(b,1)))
t3<-as.data.frame(table(sample(b,1)))
t4<-as.data.frame(table(sample(b,8)))
t5<-as.data.frame(table(sample(b2,20)))
t6<-as.data.frame(table(sample(b3,18)))
t1<-cbind(t1,p="A")
t2<-cbind(t2,p="B")
t3<-cbind(t3,p="C")
t4<-cbind(t4,p="D")
t5<-cbind(t5,p="E")
t6<-cbind(t6,p="F")
d<-data.frame()
d<-rbind(d,t2,t3,t6,t4,t5,t1)
#.. out of order bins
ggplot(d,aes(x=factor(Var1),fill=factor(p))) +
geom_bar(aes(weight=Freq)) +
facet_grid( p ~ ., margins=T)+
ggtitle("out of order bins")
changeFactor<-function(t,lvls){
temp<-as.numeric(as.character(t))
factor(temp,levels=lvls)
}
t1$Var1<-changeFactor(t1$Var1,lvls)
t2$Var1<-changeFactor(t2$Var1,lvls)
t3$Var1<-changeFactor(t3$Var1,lvls)
t4$Var1<-changeFactor(t4$Var1,lvls)
t5$Var1<-changeFactor(t5$Var1,lvls)
t6$Var1<-changeFactor(t6$Var1,lvls)
d<-data.frame()
d<-rbind(d,t2,t3,t6,t4,t5,t1)
#.. in order bins
ggplot(d,aes(x=factor(Var1),fill=factor(p))) +
geom_bar(aes(weight=Freq)) +
facet_grid( p ~ ., margins=T)+
ggtitle("in order bins")
short answer: keep your data in lists and learn the *pply family
library(ggplot2)
set.seed(1234)
b<-round(runif(100,1,10),digits=2)
set.seed(2345)
b2<-round(runif(100,11,20),digits=2)
set.seed(3456)
b3<-round(runif(50,15,18),digits=2)
#.. all potential levels
bt<-factor(sort(c(b,b2,b3)))
lvls<-levels(bt)
options(stringsAsFactors = FALSE)
f <- function(x, y, z)
cbind(data.frame(table(sample(x, y))), p = z)
datl <- Map(f, list(b,b,b,b,b2,b3), c(5,1,1,8,20,18), LETTERS[1:6])
changeFactor<-function(t,lvls){
temp<-as.numeric(as.character(t))
factor(temp,levels=lvls)
}
datl <- lapply(rapply(datl, f = function(x) changeFactor(x, lvls),
classes = 'factor', how = 'replace'),
data.frame)
d <- do.call(rbind, datl[c(2, 3, 6, 4, 5, 1)])
#.. in order bins
ggplot(d,aes(x=factor(Var1),fill=factor(p))) +
geom_bar(aes(weight=Freq)) +
facet_grid( p ~ ., margins=T)+
ggtitle("in order bins")
long answer:
set.seed(1234)
b<-round(runif(100,1,10),digits=2)
set.seed(2345)
b2<-round(runif(100,11,20),digits=2)
set.seed(3456)
b3<-round(runif(50,15,18),digits=2)
#.. all potential levels
bt<-factor(sort(c(b,b2,b3)))
lvls<-levels(bt)
first, I don't want any unexpected factors popping up, so stringsAsFactors = FALSE
then write a function, f, to do what you want, and check to make sure it works
options(stringsAsFactors = FALSE)
f <- function(x, y, z)
cbind(data.frame(table(sample(x, y))), p = z)
f(b, 5, 'A')
# Var1 Freq p
# 1 1.13 1 A
# 2 1.46 1 A
# 3 2.09 1 A
# 4 2.5 1 A
# 5 7.02 1 A
seems to work, so just Map it to lists of arguments and check the output
datl <- Map(f, list(b,b,b,b,b2,b3), c(5,1,1,8,20,18), LETTERS[1:6])
str(datl)
# List of 6
# $ :'data.frame': 5 obs. of 3 variables:
# ..$ Var1: Factor w/ 5 levels "2.02","3.09",..: 1 2 3 4 5
# ..$ Freq: int [1:5] 1 1 1 1 1
# ..$ p : chr [1:5] "A" "A" "A" "A" ...
# $ :'data.frame': 1 obs. of 3 variables:
# ..$ Var1: Factor w/ 1 level "1.63": 1
# ..$ Freq: int 1
# ..$ p : chr "B"
so combine everything to use with ggplot
d <- do.call(rbind, datl[c(2, 3, 6, 4, 5, 1)])
library(ggplot2)
#.. out of order bins
ggplot(d,aes(x=factor(Var1),fill=factor(p))) +
geom_bar(aes(weight=Freq)) +
facet_grid( p ~ ., margins=T)+
ggtitle("out of order bins")
changeFactor<-function(t,lvls){
temp<-as.numeric(as.character(t))
factor(temp,levels=lvls)
}
again making sure the function does what it is supposed to do on one data frame
changeFactor(datl[[1]]$Var1, lvls)
# [1] 2.02 3.09 3.79 3.89 8.3
# 234 Levels: 1.09 1.12 1.13 1.24 1.36 1.38 1.41 1.46 1.63 1.66 1.81 1.95 ... 19.86
so apply it again to them all at once and check the output
datl <- lapply(rapply(datl, f = function(x) changeFactor(x, lvls),
classes = 'factor', how = 'replace'),
data.frame)
str(datl)
# List of 6
# $ :'data.frame': 5 obs. of 3 variables:
# ..$ Var1: Factor w/ 234 levels "1.09","1.12",..: 13 28 41 45 81
# ..$ Freq: int [1:5] 1 1 1 1 1
# ..$ p : chr [1:5] "A" "A" "A" "A" ...
# $ :'data.frame': 1 obs. of 3 variables:
# ..$ Var1: Factor w/ 234 levels "1.09","1.12",..: 9
# ..$ Freq: int 1
# ..$ p : chr "B"
# ...
combine again and plot
d <- do.call(rbind, datl[c(2, 3, 6, 4, 5, 1)])
#.. in order bins
ggplot(d,aes(x=factor(Var1),fill=factor(p))) +
geom_bar(aes(weight=Freq)) +
facet_grid( p ~ ., margins=T)+
ggtitle("in order bins")
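If the full set of levels is not known up front, one possible sketch is to collect it from the data frames themselves and then re-level each one. This assumes the factor column is called Var1 and that its labels are numbers stored as text; dfs and all_lvls are made-up names.

```r
# Two small data frames whose factors have different levels
dfs <- list(
  data.frame(Var1 = factor(c("1.5", "2.5")), Freq = c(1L, 2L)),
  data.frame(Var1 = factor(c("2.5", "10")),  Freq = c(3L, 1L))
)

# Union of all levels, ordered numerically rather than alphabetically
all_lvls <- unique(unlist(lapply(dfs, function(d) levels(d$Var1))))
all_lvls <- all_lvls[order(as.numeric(all_lvls))]

# Rebuild every factor on the shared level set
dfs <- lapply(dfs, function(d) {
  d$Var1 <- factor(as.character(d$Var1), levels = all_lvls)
  d
})
levels(dfs[[1]]$Var1)
```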
I think your way of "reading" the factors with as.character is the best way when you don't know all their "true" levels.
But since you do know them (they are all stored inside lvls), why not use them directly when you build your ti$Var1 vectors? That is, instead of:
ti = as.data.frame(table(sample(b,5))); # automatically creates a factor vector ti$Var1 with what is found inside the sample as levels
ti$Var1 = factor(as.character(ti$Var1), levels = lvls); # replaces it with a new factor, created by reading each value of the previous one and assigning it a level from lvls
(which is ultimately what you do),
do directly:
tab = table(sample(b,5));
ti = data.frame(myVar = factor(names(tab), lvls) # creates directly the right factor vector with levels drawn from lvls
, myFreq = as.numeric(tab)
);
(which is ultimately what you want, and it even gives you better control over the names of ti's columns)
Or else, but you will then get empty lines:
factoredSample = factor(sample(b,5), lvls); # directly associates each drawn value with a level from lvls
ti = as.data.frame(table(factoredSample)); # and table will then also count the non-represented levels within factoredSample
(By the way, I don't know whether or not this was only for asking-the-question purposes, but if you really have to handle so many almost-identical data.frames in your script, you are probably using the wrong data structure.)
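To make the "empty lines" remark concrete, here is a tiny sketch with made-up levels: tabulating a factor that already carries the full level set yields one row per level, including zero counts, which you can filter afterwards if you do not want them. The names smp, ti and ti_nonzero are illustrative.

```r
lvls <- c("1", "2", "3", "4")
smp  <- factor(c("2", "4", "4"), levels = lvls)

ti <- as.data.frame(table(smp))   # one row per level, zeros included
ti

ti_nonzero <- ti[ti$Freq > 0, ]   # drop the empty rows, keep all levels
levels(ti_nonzero$smp)
```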

Add dataframes to each list element

I am reading a series of files that end up in a list of dataframes. After doing that, I'm interested in attaching some additional information to each dataframe. So, I want to add some additional elements to each element of my dataframe list.
My attempt was to actually build the list of "extra stuff" and then try to merge it with the list of dataframes.
Example code:
set.seed(42)
#Building my list of data.frames. In my specific case this is coming from files
A <- data.frame(x=rnorm(10), y=rnorm(10))
B <- data.frame(x=rnorm(10), y=rnorm(10))
ListD <- list(A, B)
names(ListD)<- c("A", "B") #some names to know what is what
#now my attributes. Each data.frame has some properties that I want to keep track of.
newList <- list(A=c("Color"=123, "Date"=321), B=c("Color"=111, "Date"=111))
#My desired output is a list where each element of the list has
#"Color", "Date" and a dataframe
#I tried something like:
lapply(ListD, append, values=newList)
As far as I can tell, all you have to do is change your initialization of ListD to:
ListD <- list(list(A), list(B))
That is because the data structure you want is a list of lists, with the inner lists holding a data.frame and two further attributes. I can't guarantee this is exactly the result you desire, but essentially this is where your problem is located.
OK, I thought this would be straightforward with mapply, but I can't get the lists to play together well... maybe someone else can. So here's a for solution:
#preallocate list
updatedList <- vector(mode = "list", length = length(ListD))
names(updatedList) <- names(ListD)
for(i in 1:length(updatedList)) {
updatedList[[i]] <- c(ListD[i], newList[[i]])
}
updatedList$A
# $A
# x y
# 1 -0.51690823 0.4521443
# 2 0.97544933 -0.7212561
# 3 0.98909668 -0.2258737
# 4 -1.72753947 -0.7643175
# 5 -1.31050478 -3.2526437
# 6 -0.63845053 1.1263407
# 7 -0.09010858 -0.9386608
# 8 -0.53933869 -0.6882866
# 9 0.54668290 1.7227261
# 10 -0.87948586 -0.2413344
#
# $Color
# [1] 123
#
# $Date
# [1] 321
Alternatively, if you take @Яaffael's suggestion, mapply works, but that will depend on how you're building the list from the files in the first place:
ListD <- list(list(A), list(B))
updatedList <- mapply(c, ListD, newList, SIMPLIFY = FALSE)
I used @Яaffael's suggestion of a list of lists.
Since changing my list of dataframes is not very easy because of the way I'm reading them from files, I made the list of lists with the extra data and then joined the dataframes like this:
newList <- list(A=list("Color"=123, "Date"=321), B=list("Color"=111, "Date"=111))
for(n in names(newList)){
newList[[n]]$Dataframe <- ListD[[n]]
}
the structure of my output:
> str(newList)
List of 2
$ A:List of 3
..$ Color : num 123
..$ Date : num 321
..$ Dataframe:'data.frame': 10 obs. of 2 variables:
.. ..$ x: num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
.. ..$ y: num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
$ B:List of 3
..$ Color : num 111
..$ Date : num 111
..$ Dataframe:'data.frame': 10 obs. of 2 variables:
.. ..$ x: num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
.. ..$ y: num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
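For what it's worth, the same structure can probably be built without a loop and without changing how ListD is created. This is only a sketch: it keeps newList as the named numeric vectors from the question, matches the two lists by name, and the name combined is made up.

```r
set.seed(42)
ListD <- list(A = data.frame(x = rnorm(10), y = rnorm(10)),
              B = data.frame(x = rnorm(10), y = rnorm(10)))
newList <- list(A = c(Color = 123, Date = 321), B = c(Color = 111, Date = 111))

# Pair each data frame with its extras by name; wrapping df in list()
# keeps it as a single element instead of splicing in its columns
combined <- Map(function(df, extra) c(as.list(extra), list(Dataframe = df)),
                ListD, newList[names(ListD)])
str(combined, max.level = 2)
```

Indexing newList by names(ListD) guards against the two lists being in a different order.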

Why does sapply return a matrix that I need to transpose, and then the transposed matrix will not attach to a dataframe?

I would appreciate insight into why this happens and how I might do this more elegantly.
When I use sapply, I would like it to return a 3x2 matrix, but it returns a 2x3 matrix. Why is this? And why is it difficult to attach this to another data frame?
a <- data.frame(id=c('a','b','c'), var1 = c(1,2,3), var2 = c(3,2,1))
out <- sapply(a$id, function(x) out = a[x, c('var1', 'var2')])
#out is 2x3, but I would like it to be 3x2
#I then want to append t(out) (out as a 3x2 matrix) to b, a 3x1 dataframe
b <- data.frame(var3=c(0,0,0))
when I try to attach these,
b[,c('col2','col3')] <- t(out)
The error that I get is:
Warning message:
In `[<-.data.frame`(`*tmp*`, , c("col2", "col3"), value = list(1, :
provided 6 variables to replace 2 variables
although the following appears to give the desired result:
rownames(out) <- c('col1', 'col2')
b <- cbind(b, t(out))
I can not operate on the variables:
b$var1/b$var2
returns
Error in b$var1/b$var2 : non-numeric argument to binary operator
Thanks!
To expand on DWin's answer: it would help to look at the structure of your out object. It explains why b$var1/b$var2 doesn't do what you expect.
> out <- sapply(a$id, function(x) out = a[x, c('var1', 'var2')])
> str(out) # this isn't a data.frame or a matrix...
List of 6
$ : num 1
$ : num 3
$ : num 2
$ : num 2
$ : num 3
$ : num 1
- attr(*, "dim")= int [1:2] 2 3
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "var1" "var2"
..$ : NULL
The apply family of functions are designed to work on vectors and arrays, so you need to take care when using them with data.frames (which are usually lists of vectors). You can use the fact that data.frames are lists to your advantage with lapply.
> out <- lapply(a$id, function(x) a[x, c('var1', 'var2')]) # list of data.frames
> out <- do.call(rbind, out) # data.frame
> b <- cbind(b,out)
> str(b)
'data.frame': 3 obs. of 4 variables:
$ var3: num 0 0 0
$ var1: num 1 2 3
$ var2: num 3 2 1
$ var3: num 0 0 0
> b$var1/b$var2
[1] 0.3333333 1.0000000 3.0000000
First a bit of R notation. If you look at the code for sapply, you will find the answer to your question. The sapply function checks to see if the list lengths are all equal, and if so, it first "unlist()"s them and then passes the result as the data argument to array(). Since array() (like matrix()) by default arranges its values in column-major order, that is what you get: each returned element becomes a column, so the lists get turned on their side. If you don't like it, you can define a new function tsapply that returns the transposed values:
> tsapply <- function(...) t(sapply(...))
> out <- tsapply(a$id, function(x) out = a[x, c('var1', 'var2')])
> out
var1 var2
[1,] 1 3
[2,] 2 2
[3,] 3 1
... a 3 x 2 matrix.
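A stripped-down sketch of the same behaviour, using plain numeric vectors instead of data frame rows so that the result is an ordinary numeric matrix (res and tres are illustrative names):

```r
# Each call returns a length-2 named vector; sapply binds them as columns
res <- sapply(1:3, function(i) c(var1 = i, var2 = 4 - i))
dim(res)         # 2 rows (var1, var2) by 3 columns (one per input)

tres <- t(res)   # 3 x 2, one row per input

# A numeric matrix binds cleanly onto a data frame
b <- data.frame(var3 = c(0, 0, 0))
b <- cbind(b, tres)
b$var1 / b$var2
```

Because the function returns atomic vectors rather than one-row data frames, the columns of b are numeric and the division works.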
Have a look at ddply from the plyr package
a <- data.frame(id=c('a','b','c'), var1 = c(1,2,3), var2 = c(3,2,1))
library(plyr)
ddply(a, "id", function(x){
out <- cbind(O1 = rnorm(nrow(x), x$var1), O2 = runif(nrow(x)))
out
})
