As we all know, R isn't the most efficient platform for running large analyses.
If I had a large data frame containing three parameters:
GROUP X Y
A 1 2
A 2 2
A 2 3
...
B 1 1
B 2 3
B 1 4
...
millions of rows
and I wanted to run a computation on each group (e.g. compute Pearson's r on X, Y) and store the results in a new data frame, I could do it like this:
df = loadDataFrameFrom( someFile )
results = data.frame()
for (g in unique(df$GROUP)) {
gdf <- subset( df, df$GROUP == g )
partialRes <- slowStuff( gdf$X,gdf$Y )
results = rbind( results, data.frame( GROUP = g, RES = partialRes ) )
}
# results contains all the results here.
useResults(results)
The obvious problem is that this is VERY slow, even on a powerful multi-core machine.
My question is: is it possible to parallelise this computation, having for example a separate thread for each group or a block of groups?
Is there a clean R pattern to solve this simple divide et impera problem?
Thanks,
Mulone
First off, R is not necessarily slow. Its speed depends largely on using it correctly, just like any language. There are a few things that can speed up your code without altering much: preallocate your results data.frame before you begin; use a list and matrix or vector construct instead of a data.frame; switch to data.table; the list goes on, but The R Inferno is an excellent place to start.
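For instance, here is a minimal data.table sketch for the per-group correlation (assuming df has the GROUP, X and Y columns shown in the question):
library(data.table)
# a sketch: one Pearson r per group, no explicit loop or rbind
dt <- as.data.table(df)
results <- dt[, .(RES = cor(X, Y)), by = GROUP]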
Also, take a look here. It provides a good summary on how to take advantage of multi-core machines.
The "clean R pattern" was succinctly solved by Hadley Wickam with his plyr package and specifically ddply:
library(plyr)
library(doMC)
registerDoMC()
ddply(df, .(GROUP), your.function, .parallel=TRUE)
However, it is not necessarily fast. You can use something like:
library(parallel)
mclapply(unique(df$GROUP), function(x, df) ...)
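Filled in for the question's task, a sketch could look like this (it assumes a Unix-like system, since mclapply relies on forking):
library(parallel)
# a sketch: one task per group, combined with a single rbind at the end
res.list <- mclapply(unique(df$GROUP), function(g) {
  d <- df[df$GROUP == g, ]
  data.frame(GROUP = g, RES = cor(d$X, d$Y))
}, mc.cores = detectCores())
results <- do.call(rbind, res.list)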
Or finally, you can use the foreach package:
foreach(g = unique(df$GROUP), ...) %dopar% {
your.analysis
}
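A completed sketch of that foreach version for the question's correlation task (assuming a parallel backend such as doMC has been registered):
library(foreach)
# a sketch: .combine = rbind stacks the per-group one-row data frames
results <- foreach(g = unique(df$GROUP), .combine = rbind) %dopar% {
  d <- df[df$GROUP == g, ]
  data.frame(GROUP = g, RES = cor(d$X, d$Y))
}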
To back up my comment: 10 million rows, 26 groups. Done in under 3 seconds on a single-core 3.3 GHz CPU, using only base R. No parallelization needed.
> set.seed(21)
> x <- data.frame(GROUP=sample(LETTERS,1e7,TRUE),X=runif(1e7),Y=runif(1e7))
> system.time( y <- do.call(rbind, lapply(split(x,x$GROUP),
+ function(d) data.frame(GROUP=d$GROUP[1],cor=cor(d$X,d$Y)))) )
user system elapsed
2.37 0.56 2.94
> y
GROUP cor
A A 2.311493e-03
B B -1.020239e-03
C C -1.735044e-03
D D 1.355110e-03
E E -8.027199e-04
F F 8.234086e-04
G G 2.337217e-04
H H -5.861781e-04
I I 7.799191e-04
J J 1.063772e-04
K K 7.174137e-04
L L 4.151059e-04
M M 4.440694e-04
N N 2.568411e-03
O O -3.827366e-04
P P -1.239380e-03
Q Q -1.057020e-03
R R 1.079676e-03
S S -1.819232e-03
T T -3.577533e-04
U U -1.084114e-03
V V 6.686503e-05
W W -1.631912e-03
X X 8.668508e-04
Y Y -6.460281e-04
Z Z 1.614978e-03
By the way, parallelization will only help if your slowStuff function is the bottleneck. Your use of rbind in a loop is likely the bottleneck, unless you do something similar in slowStuff.
I think your slowness is partly due to writing non-idiomatic R. The following gives you correlations per group (I used the mtcars data set and divided it by cyl group) and does it pretty fast:
by(mtcars, mtcars$cyl, cor)
Related
I have three populations stored as individual vectors. I need to run a statistical test (wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value is the result of one test and is a double).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1, pop2, pop3) {
  populations <- list(pop1 = pop1, pop2 = pop2, pop3 = pop3)
  colLabels <- c("pop1", "pop2", "pop3")
  # This line makes a data frame where each column is a pair of labels
  perms <- data.frame(t(permutations(colLabels, 2)))
  pvals <- vector()
  # This for loop gets each column of that data frame
  for (pair in perms[, ]) {
    pair <- as.vector(pair)
    p1 <- as.numeric(unlist(populations[pair[1]]))
    p2 <- as.numeric(unlist(populations[pair[2]]))
    pvals <- append(pvals, wilcox.test(p1, p2, alternative = c("less"))$p.value)
  }
  return(pvals)
}
What is a more R appropriate way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is indeed common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
pairwise.table(function(i, j) {
  wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3)
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
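For example (a sketch; "bonferroni" is just one of the built-in adjustment methods):
# same pairwise test as above, but with an explicit adjustment method
with(d, pairwise.wilcox.test(x, g, p.adjust.method = "bonferroni"))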
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and isn't true for the Wilcoxon test you mentioned) but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function(compare.levels, level.names, p.adjust.method) {
  ix <- setNames(seq_along(level.names), level.names)
  pp <- outer(ix, ix, function(ivec, jvec)
    sapply(seq_along(ivec), function(k) {
      i <- ivec[k]; j <- jvec[k]
      if (i != j) compare.levels(i, j) else NA
    }))
  pp[] <- p.adjust(pp[], p.adjust.method)
  pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method = p.adjust.methods) {
  p.adjust.method <- match.arg(p.adjust.method)
  level.names <- if (!is.null(names(dat))) names(dat) else seq_along(dat)
  PVAL <- pairwise.table.all(function(i, j) {
    wilcox.test(dat[[i]], dat[[j]])$p.value
  }, level.names, p.adjust.method = p.adjust.method)
  ans <- list(method = "Lazerbeam's special method",
              data.name = paste(level.names, collapse = ", "),
              p.value = PVAL, p.adjust.method = p.adjust.method)
  class(ans) <- "pairwise.htest"
  ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn(), which has a function argument that makes it easy to apply wilcox.test() to all combinations of variables.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x)
  list(data = c(paste(x[1], x[2])),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value),
  simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251
Why, in R, is
e=list(a,b,c,d)
different from:
e=list(a,b)
e=list(e,c)
e=list(e,d)
?
The second approach can easily be used in a for loop, but it produces a different result. I create one object per iteration, so I can't use the first approach. Any hints?
If you absolutely want to use this approach, you can do this:
# Make up some data
a <- 1:3; b <- 4:5; c <- 6:10; d <- 11:17
# Build up the lists
e0 <- list(a, b, c, d)
e <- list(a, b)
e <- c(e, list(c))
e <- c(e, list(d))
# Compare the two
identical(e0, e) # TRUE
In a real-life case, however, instead of using a loop, you would probably be better off using a function from the *apply family, such as lapply(), which returns a list of outputs directly.
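For instance, a minimal sketch, assuming a hypothetical make_vector(i) that produces one vector per iteration:
# make_vector() is a hypothetical placeholder for whatever creates each element
e <- lapply(1:4, function(i) make_vector(i))
# equivalent loop version, growing the list by index instead of nesting it
e <- list()
for (i in 1:4) e[[i]] <- make_vector(i)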
I have a list of elemental compositions and I'd like to display a count for the number of times an element is included in a composition mapped onto the periodic table (e.g. CH4 would increase the count on H and C by one).
How can I do this with ggplot? Is there a map I can use?
With a bit of searching I found information about the periodic table in this example code project. They had an Access Database with element information. I've exported it to this gist. You can import the data using the httr library with
library(httr)
dd <- read.table(text=content(GET("https://gist.githubusercontent.com/MrFlick/c1183c911bc5398105d4/raw/715868fba2d0d17a61a8081de17c468bbc525ab1/elements.txt")), sep=",", header=TRUE)
(You should probably create your own local version for easier loading in the future.)
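For instance, a minimal caching sketch (the file name is an assumption):
# save a local copy once, then load it in later sessions
write.csv(dd, "elements.csv", row.names = FALSE)
dd <- read.csv("elements.csv")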
Then your other challenge is decomposing something like "CH4" into the raw element counts. I've created this helper function which I think does what you need.
decompose <- function(x) {
  m <- gregexpr("([A-Z][a-z]?)(\\d*)", x, perl = TRUE)
  dx <- Map(function(x, y) {
    ElementSymbol <- gsub("\\d", "", x)
    cnt <- as.numeric(gsub("\\D", "", x))
    cnt[is.na(cnt)] <- 1
    cbind(Sym = y, as.data.frame(xtabs(cnt ~ ElementSymbol)))
  }, regmatches(x, m), x)
  do.call(rbind, dx)
}
Here I test the function
test_input <- c("H2O","CH4")
decompose(test_input)
# Sym ElementSymbol Freq
# 1 H2O H 2
# 2 H2O O 1
# 3 CH4 C 1
# 4 CH4 H 4
Now we can combine the data and the reference information to make a plot
library(ggplot2)
ggplot(merge(decompose("CH4"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))
Clearly there are opportunities for improvement but this should give you a good start.
You might look for a more robust decomposition function. It looks like the CHNOSZ package has one:
library(CHNOSZ)
data(thermo)
decompose <- function(x) {
  do.call(rbind, lapply(x, function(x) {
    z <- makeup(x)
    cbind(data.frame(ElementSymbol = names(z), Freq = z), Sym = x)
  }))
}
ggplot(merge(decompose("CaAl2Si2O7(OH)2*H2O"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))
I am trying to create a set of correlation matrices by different levels of a factor variable.
This question has previously been answered (spearman correlation by group in R) but not for a matrix and the vector result doesn't seem to generalize as far as I can see.
The code below works, but can't be written to a csv as by() outputs a list - the error is "cannot coerce class ""by"" to a data.frame"
cor1<- by(data, INDICES=data$factor0, FUN = function(x) cor(x[,c("x","y","z","a",
"b","c")],method="spearman",use="pairwise"))
So I am looking for a method to either coerce the above into a data.frame so I can write it to a csv, or to produce the above result by an alternative method which outputs a data frame
Any help greatly appreciated
The reason you get a list is because if x is a matrix then cor(x) will be a matrix as well, not a scalar. In this case it will be a 6x6 matrix. So the result is a list of 6x6 matrices, one for each factor level.
This is the natural way to represent the result, it seems to me. You can make it into a single data frame if you want, though I'm not sure what you want the rows and columns to represent exactly. Here is one option.
data<-matrix(rnorm(500),100,5)
colnames(data)<-letters[1:5]
factors<-sample(LETTERS[1:3],100,T)
cors<-by(data,factors,cor)
cors[[1]]
# a b c d e
# a 1.00000000 0.05389618 -0.16944040 0.25747174 0.21660217
# b 0.05389618 1.00000000 0.22735796 -0.06002965 -0.30115444
# c -0.16944040 0.22735796 1.00000000 -0.06625523 -0.01120225
# d 0.25747174 -0.06002965 -0.06625523 1.00000000 0.10402791
# e 0.21660217 -0.30115444 -0.01120225 0.10402791 1.00000000
corsMatrix<-do.call(rbind,lapply(cors,function(x)x[upper.tri(x)]))
names<-outer(colnames(data),colnames(data),paste,sep="X")
colnames(corsMatrix)<-names[upper.tri(names)]
corsMatrix
# aXb aXc bXc aXd bXd cXd
# A 0.05389618 -0.16944040 0.22735796 0.25747174 -0.06002965 -0.06625523
# B -0.34231682 -0.14225269 0.20881053 -0.14237661 0.25970138 0.27254840
# C 0.27199944 -0.01333377 0.06402734 0.02583126 -0.03336077 -0.02207024
# aXe bXe cXe dXe
# A 0.216602173 -0.3011544 -0.01120225 0.10402791
# B 0.347006942 -0.2207421 0.33123175 -0.05290809
# C 0.007748369 -0.1257357 0.23048709 0.16037247
I'm not sure if this is what you are looking for. Another option is to export each correlation matrix to its own csv file.
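That could look like this (a sketch; the file-name pattern is an assumption):
# one csv per factor level, using the names of the by() result
for (lev in names(cors)) {
  write.csv(cors[[lev]], file = paste0("cor_", lev, ".csv"))
}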
You can use ddply from the plyr package:
library(plyr)
n <- 1e2
mdat <- data.frame(factor0 = factor(LETTERS[sample(26, n, TRUE)]), x = rnorm(n),
                   y = rnorm(n), z = rnorm(n), a = rnorm(n), b = rnorm(n),
                   c = rnorm(n))
ddply(mdat, .(factor0), function(d) {
  ret <- as.data.frame(cor(d[, letters[c(1:3, 24:26)]], method = "spearman", use = "pairwise"))
  ret$col <- letters[c(1:3, 24:26)]
  ret[, c(7, 1:6)]
})
Your query is not that clear, at least to me. If I understood it correctly, you may need to build a pairwise matrix first before computing the correlations.
You may want to try the following function from the SciencesPo package.
require(SciencesPo)
m<-rprob(mtcars, df = nrow(mtcars) - 2)
The following will stack your matrix, so it becomes easier to check r and the related p-values.
rstack(m)
I am calculating sums of matrix columns to each group, where the corresponding group values are contained in matrix columns as well. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
for (i in 1:2) {
tapply(x[,i], index[,i], sum)
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations without a loop? On top, the real data is large (e.g. 500 x 10000), therefore it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~., ag)
# 2
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index has at least one of each level and also requires that there be at least 2 levels; however, it's faster.
This is ugly and works but there's a much better way to do it that is more generalizable. Just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4],size = 500*1000,replace = TRUE),500,10000)
x <- matrix(sample(1:10,500*10000,replace = TRUE),500,10000)
rs <- matrix(NA,4,10000)
rownames(rs) <- LETTERS[1:4]
for (i in LETTERS[1:4]) {
  tmp <- x
  tmp[index != i] <- 0
  rs[i,] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
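To illustrate that idea, here is a sketch of the matrix-multiplication approach for a single column i, using the index and x matrices defined above: an indicator matrix of groups times the column gives the group sums.
# a sketch for one column i: rows of ind are groups, columns are observations
i <- 1
grp <- sort(unique(index[, i]))
ind <- 1 * outer(grp, index[, i], "==")
rownames(ind) <- grp
ind %*% x[, i]   # group sums for column i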
You used tapply. If you add mapply, you can complete your objective.
It does the same thing as that for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
mapply( function(i) tapply(x[,i], index[,i], sum), 1:2 )
result:
[,1] [,2]
A 3 15
B 7 11