I have a data frame with n columns and want to apply a function to each combination of columns. This is very similar to how the cor() function takes a data frame as input and produces a correlation matrix as output, for example:
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
cor(X)
Which will generate this output:
> cor(X)
A B C
A 1.00000000 -0.01199511 0.02337429
B -0.01199511 1.00000000 0.07918920
C 0.02337429 0.07918920 1.00000000
However, I have a custom function that I need to apply to each combination of columns. I am now using a solution that uses nested for loops, which works:
f <- function(x, y) sum((x+y)^2) # some placeholder function
out <- matrix(NA, ncol = ncol(X), nrow = ncol(X)) # pre-allocate
for(i in seq_along(X)) {
for(j in seq_along(X)) {
out[i, j] <- f(X[, i], X[, j]) # apply f() to each combination
}
}
Which produces:
> out
[,1] [,2] [,3]
[1,] 422.4447 207.0833 211.4198
[2,] 207.0833 409.1242 218.2430
[3,] 211.4198 218.2430 397.5321
I am currently trying to transition into the tidyverse and would prefer to avoid using for loops. Could someone show me a tidy solution for this situation? Thanks!
You could do
library(tidyverse)
f <- function(x, y) sum((x+y)^2)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
as.list(X) %>%
expand.grid(., .) %>%
mutate(out = map2_dbl(Var1, Var2, f)) %>%
as_tibble()
This isn’t a tidyverse solution, but it does avoid using for loops. We use RcppAlgos (I am the author) to generate all pair-wise permutations of columns and apply your custom function to each of these. After that, we coerce to a matrix.
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
library(RcppAlgos)
matrix(permuteGeneral(ncol(X), 2, repetition = TRUE, FUN = function(y) {
sum((X[,y[1]] + X[,y[2]])^2)
}), ncol = ncol(X))
# [,1] [,2] [,3]
# [1,] 429.8549 194.4271 179.4449
# [2,] 194.4271 326.8032 197.2585
# [3,] 179.4449 197.2585 409.6313
Using base R you could do:
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
OUT = diag(colSums((X+X)^2))
OUT[lower.tri(OUT)] = combn(X, 2, function(x) sum(do.call('+', x)^2)) #combn(X,2,function(x)sum(rowSums(x)^2))
OUT[upper.tri(OUT)] = t(OUT)[upper.tri(OUT)] # mirror via the transpose; copying lower.tri directly only lines up for 3 columns
OUT
[,1] [,2] [,3]
[1,] 429.8549 194.4271 179.4449
[2,] 194.4271 326.8032 197.2585
[3,] 179.4449 197.2585 409.6313
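Another base R option, offered only as a sketch: outer() over the column indices. The helper name pair_f is made up for illustration, and the Vectorize() wrapper is there because outer() hands the function whole index vectors rather than one pair at a time.

```r
set.seed(42)
X <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100))
f <- function(x, y) sum((x + y)^2)  # placeholder function from the question

# pair_f() is a hypothetical helper: it maps one pair of column indices to f()
pair_f <- function(i, j) f(X[[i]], X[[j]])

# outer() calls its function once with full index vectors,
# so the scalar helper is wrapped in Vectorize()
out <- outer(seq_along(X), seq_along(X), Vectorize(pair_f))
dimnames(out) <- list(names(X), names(X))
out
```

This still calls f() once per pair, so it is mostly a readability gain over the nested loops, not a speed-up.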
So I have a matrix TMatrix that I'm cycling through, and I want to put the row and column names for every cell that contains a non-finite value into a table. I've tried doing the following, but I keep getting NA for the row and column names. What's going on?
AA <- 1:rowlength
BB <- 1:ncol(Nmatrix)
for(i in AA){
for(j in BB){
if (is.finite(TMatrix[i,j])==FALSE){
TNS <- matrix(data=NA,nrow=1,ncol=4)
TNS[1,1] <- TMatrix[i,j]
TNS[1,2] <- Nmatrix[i,j]
TNS[1,3] <- paste(rownames(TMatrix)[TMatrix[i,j]])
TNS[1,4] <- paste(colnames(TMatrix)[TMatrix[i,j]])
TMinf <- rbind(TMinf,TNS)
}
PMatrix[i,j] <- pt(TMatrix[i,j],n1+n2-2)
}
}
No idea what this is doing because you provided zero of the objects we would need to run this, but it sounds like you are wanting to do something in the following example:
mat <- matrix(rnorm(20), nrow = 4)
mat[1, 4] <- mat[3, 2] <- NA
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.11025848 1.1021023 -0.3098129 NA -0.1358902
# [2,] 0.00351275 0.1440906 1.2141437 0.2601651 0.2504035
# [3,] -1.11565805 NA 0.1483867 -0.4102958 -0.3104319
# [4,] 0.34785864 1.5319365 1.2750632 0.1259548 -0.7594117
which(!is.finite(mat), arr.ind = TRUE)
# row col
# [1,] 3 2
# [2,] 1 4
If you have the rows/columns named:
colnames(mat) <- LETTERS[1:5]
rownames(mat) <- letters[1:4]
# A B C D E
# a 0.11025848 1.1021023 -0.3098129 NA -0.1358902
# b 0.00351275 0.1440906 1.2141437 0.2601651 0.2504035
# c -1.11565805 NA 0.1483867 -0.4102958 -0.3104319
# d 0.34785864 1.5319365 1.2750632 0.1259548 -0.7594117
idx <- which(!is.finite(mat), arr.ind = TRUE)
rownames(mat)[idx[ , 'row']]
# [1] "c" "a"
colnames(mat)[idx[ , 'col']]
# [1] "B" "D"
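To get the table the question asks for, the pieces above can be combined into one data frame. This is a minimal sketch on a made-up matrix; the name bad and its column names are my own choice, not something from the question.

```r
# Small made-up matrix with one Inf and one NA
mat <- matrix(c(1, Inf, 3, 4, NA, 6), nrow = 2,
              dimnames = list(c("a", "b"), c("A", "B", "C")))

# Row/column positions of every non-finite cell
idx <- which(!is.finite(mat), arr.ind = TRUE)

# One row per non-finite cell: its value plus the row and column names
bad <- data.frame(value = mat[idx],
                  row   = rownames(mat)[idx[, "row"]],
                  col   = colnames(mat)[idx[, "col"]],
                  stringsAsFactors = FALSE)
bad
```

Because everything is built in one step, there is no need to rbind() inside a loop.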
Never mind, I figured it out. I had the index wrong. It should be like this:
AA <- 1:rowlength
BB <- 1:ncol(Nmatrix)
for(i in AA){
for(j in BB){
if (is.finite(TMatrix[i,j])==FALSE){
TNS <- matrix(data=NA,nrow=1,ncol=4)
TNS[1,1] <- TMatrix[i,j]
TNS[1,2] <- Nmatrix[i,j]
TNS[1,3] <- rownames(TMatrix)[i]
TNS[1,4] <- colnames(TMatrix)[j]
TMinf <- rbind(TMinf,TNS)
}
PMatrix[i,j] <- pt(TMatrix[i,j],n1+n2-2)
}
}
I am looking for ways to speed up my code. I am looking into the apply/ply methods as well as data.table. Unfortunately, I am running into problems.
Here is a small sample data:
ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA,NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
data <- data.frame(col1 = ids1, col2 = ids2,
col3 = chars1, col4 = chars2,
stringsAsFactors = FALSE)
Here is a solution using loops:
library("plyr")
cols_to_fix <- c("col3","col4")
for (i in 1:length(cols_to_fix)) {
data[,cols_to_fix[i]] <- gsub("_", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- gsub(" ", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- ifelse(data[,cols_to_fix[i]]=="n/a", NA, data[,cols_to_fix[i]])
}
I initially looked at ddply, but some methods I want to use only take vectors. Hence, I cannot figure out how to do ddply across just certain columns one-by-one.
Also, I have been looking at laply, but I want to return the original data.frame with the changes. Can anyone help me? Thank you.
Based on the suggestions from earlier, here is what I tried to use from the plyr package.
Option 1:
data[,cols_to_fix] <- aaply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text",.drop = FALSE)
Option 2:
data[,cols_to_fix] <- alply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
Option 3:
data[,cols_to_fix] <- adply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
None of these are giving me the correct answer.
apply works great, but my data is very large and the progress bars from the plyr package would be very nice. Thanks again.
Here's a data.table solution using set.
require(data.table)
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
DT
# col1 col2 col3 col4
# 1: 1 1 aa vv
# 2: 1 2 bb ww
# 3: 1 3 cc xx
# 4: 1 4 dd yy
# 5: 2 1 ee zz
# 6: 2 2 NA NA
# 7: 2 3 NA NA
# 8: 2 4 NA NA
First line reads: set in DT for all i(=NULL), and column=j the value gsub(..).
Second line reads: set in DT where i(=condn) and column=j with value NA_character_.
Note: Using PCRE (perl=TRUE) gives a nice speed-up, especially on bigger vectors.
Here is a data.table solution, should be faster if your table is large.
The concept of := is an "update" of the columns. I believe that because of this you aren't copying the table internally again as a "normal" dataframe solution would.
require(data.table)
DT <- data.table(data)
fxn = function(col) {
col = gsub("[ _]", "", col, perl = TRUE)
col[which(col == "n/a")] <- NA_character_
col
}
cols = c("col3", "col4");
# lapply your function
DT[, (cols) := lapply(.SD, fxn), .SDcols = cols]
print(DT)
No need for loops (for or *ply):
tmp <- gsub("[_ ]", "", as.matrix(data[,cols_to_fix]), perl=TRUE)
tmp[tmp=="n/a"] <- NA
data[,cols_to_fix] <- tmp
Benchmarks
I only benchmark Arun's data.table solution and my matrix solution. I assume that many columns need to be fixed.
Benchmark code:
options(stringsAsFactors=FALSE)
set.seed(45)
K <- 1000; N <- 1e5
foo <- function(K) paste(sample(c(letters, "_", " "), 8, replace=TRUE), collapse="")
bar <- function(K) replicate(K, foo(), simplify=TRUE)
data <- data.frame(id1=sample(5, K, TRUE),
id2=sample(5, K, TRUE)
)
data <- cbind(data, matrix(sample(bar(K), N, TRUE), ncol=N/K))
cols_to_fix <- as.character(seq_len(N/K))
library(data.table)
benchfun <- function() {
time1 <- system.time({
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
})
data2 <- data
time2 <- system.time({
tmp <- gsub("[_ ]", "", as.matrix(data2[,cols_to_fix]), perl=TRUE)
tmp[tmp=="n/a"] <- NA
data2[,cols_to_fix] <- tmp
})
list(identical= identical(as.data.frame(DT), data2),
data.table_timing= time1[[3]],
matrix_timing=time2[[3]])
}
replicate(3, benchfun())
Benchmark results:
#100 columns to fix, nrow=1e5
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 6.001 5.571 5.602
#matrix_timing 17.906 17.21 18.343
#1000 columns to fix, nrow=1e4
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 4.509 4.574 4.857
#matrix_timing 13.604 14.219 13.234
#1000 columns to fix, nrow=100
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 0.052 0.052 0.055
#matrix_timing 0.134 0.128 0.127
#100 columns to fix, nrow=1e5 and including
#data1 <- as.data.frame(DT) in the timing
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#identical TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#data.table_timing 5.642 5.58 5.762 5.382 5.419 5.633 5.508 5.578 5.634 5.397
#data.table_returnDF_timing 5.973 5.808 5.817 5.705 5.736 5.841 5.759 5.833 5.689 5.669
#matrix_timing 20.89 20.3 19.988 20.271 19.177 19.676 20.836 20.098 20.005 19.409
data.table is faster only by a factor of three. This advantage could probably be even smaller, if we decide to change the data structure (as the data.table solution does) and keep it a matrix.
I think you can do this with regular old apply, which will call your cleanup function on each column (margin=2):
fxn = function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
return(col)
}
data[,cols_to_fix] <- apply(data[,cols_to_fix], 2, fxn)
data
# col1 col2 col3 col4
# 1 1 1 aa vv
# 2 1 2 bb ww
# 3 1 3 cc xx
# 4 1 4 dd yy
# 5 2 1 ee zz
# 6 2 2 <NA> <NA>
# 7 2 3 <NA> <NA>
# 8 2 4 <NA> <NA>
Edit: it sounds like you're requiring the use of the plyr package. I'm not an expert in plyr, but this seemed to work:
library(plyr)
data[,cols_to_fix] <- t(laply(data[,cols_to_fix], fxn))
Here's a benchmark of all the different answers:
First, all the answers as separate functions:
1) Arun's
arun <- function(data, cols_to_fix) {
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
return(DT)
}
2) Martin's
martin <- function(data, cols) {
DT <- data.table(data)
colfun = function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
}
DT[, (cols) := lapply(.SD, colfun), .SDcols = cols]
return(DT)
}
3) Roland's
roland <- function(data, cols_to_fix) {
tmp <- gsub("[_ ]", "", as.matrix(data[,cols_to_fix]))
tmp[tmp=="n/a"] <- NA
data[,cols_to_fix] <- tmp
return(data)
}
4) BrodieG's
brodieg <- function(data, cols_to_fix) {
fix_fun <- function(x) gsub("(_| )", "", ifelse(x == "n/a", NA_character_, x))
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, fix_fun)
return(data)
}
5) Josilber's
josilber <- function(data, cols_to_fix) {
colfun2 <- function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
return(col)
}
data[,cols_to_fix] <- apply(data[,cols_to_fix], 2, colfun2)
return(data)
}
2) benchmarking function:
We'll run this function 3 times and take the minimum over the runs (removes cache effects) to be the runtime:
bench <- function(data, cols_to_fix) {
ans <- c(
system.time(arun(data, cols_to_fix))["elapsed"],
system.time(martin(data, cols_to_fix))["elapsed"],
system.time(roland(data, cols_to_fix))["elapsed"],
system.time(brodieg(data, cols_to_fix))["elapsed"],
system.time(josilber(data, cols_to_fix))["elapsed"]
)
}
3) On (slightly) big data with just 2 cols to fix (like in OP's example here):
require(data.table)
set.seed(45)
K <- 1000; N <- 1e5
foo <- function(K) paste(sample(c(letters, "_", " "), 8, replace=TRUE), collapse="")
bar <- function(K) replicate(K, foo(), simplify=TRUE)
data <- data.frame(id1=sample(5, N, TRUE),
id2=sample(5, N, TRUE),
col3=sample(bar(K), N, TRUE),
col4=sample(bar(K), N, TRUE)
)
rown <- c("arun", "martin", "roland", "brodieg", "josilber")
coln <- paste("run", 1:3, sep="")
cols_to_fix <- c("col3","col4")
ans <- matrix(0L, nrow=5L, ncol=3L)
for (i in 1:3) {
print(i)
ans[, i] <- bench(data, cols_to_fix)
}
rownames(ans) <- rown
colnames(ans) <- coln
# run1 run2 run3
# arun 0.149 0.140 0.142
# martin 0.643 0.629 0.621
# roland 1.741 1.708 1.761
# brodieg 1.926 1.919 1.899
# josilber 2.067 2.041 2.162
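The minimum-of-three step mentioned earlier is not in the code itself, so here is a small sketch of it. The timings are copied from the table above; the names best and relative are just illustrative.

```r
# Elapsed timings copied from the table above (seconds)
ans <- matrix(c(0.149, 0.140, 0.142,
                0.643, 0.629, 0.621,
                1.741, 1.708, 1.761,
                1.926, 1.919, 1.899,
                2.067, 2.041, 2.162),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("arun", "martin", "roland", "brodieg", "josilber"),
                              paste0("run", 1:3)))

best     <- apply(ans, 1, min)           # minimum over the three runs, per answer
relative <- round(best / min(best), 1)   # slowdown relative to the fastest answer
cbind(best, relative)
```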
The apply version is the way to go. Looks like @josilber came up with the same answer, but this one is slightly different (note regexp).
fix_fun <- function(x) gsub("(_| )", "", ifelse(x == "n/a", NA_character_, x))
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, fix_fun)
More importantly, generally you want to use ddply and data.table when you want to do split-apply-combine analysis. In this case, all your data belongs to the same group (there aren't any subgroups you're doing anything different with), so you might as well use apply.
The 2 at the center of the apply statement means we want to subset the input by the 2nd dimension, and pass the result (in this case vectors, each representing a column from your data frame in cols_to_fix) to the function that does the work. apply then re-assembles the result, and we assign it back to the columns in cols_to_fix. If we had used 1 instead, apply would have passed the rows in our data frame to the function. Here is the result:
data
# col1 col2 col3 col4
# 1 1 1 aa vv
# 2 1 2 bb ww
# 3 1 3 cc xx
# 4 1 4 dd yy
# 5 2 1 ee zz
# 6 2 2 <NA> <NA>
# 7 2 3 <NA> <NA>
# 8 2 4 <NA> <NA>
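To make the margin argument concrete, here is a tiny sketch on a made-up integer matrix (not the data from the question):

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

by_col <- apply(m, 2, sum)  # margin 2: the function sees one column at a time
by_row <- apply(m, 1, sum)  # margin 1: the function sees one row at a time

by_col
# c1 c2 c3
#  3  7 11
by_row
# r1 r2
#  9 12

# When the function returns a vector per row, apply() stacks the results
# as columns, so a margin-1 call comes back transposed (3 x 2 here)
dim(apply(m, 1, identity))
```

That transposition is worth keeping in mind if you ever switch the cleanup above from columns to rows.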
If you do have sub-groups, then I recommend you use data.table. Once you get used to the syntax it's hard to beat for convenience and speed. It will also do efficient joins across data sets.
I have a question on the following issue:
Suppose I have some matrices
A1 <- matrix(runif(rowsA1*T), rowsA1, T)
…
AD <- matrix(runif(rowsAD*T), rowsAD, T)
The number of matrices is variable (but most certainly not too large).
Is there a way to perform the following more efficiently (but in a set-up that allows for a variable number of matrices):
f1 <- function(A1, A2, ..., AD) {
for(i in 1:nrow(A1)) {
for(j in 1:nrow(A2)) {
...
for(d in 1:nrow(AD)) {
ret[i,j,...,d] <- \sum_{t=1}^T (A1[i,t]*A2[j,t]*...*AD[d,t])
}
...
}
}
ret
}
Thank you very much for your help!
Romain
---------------------------------- Edit with example ----------------------------------
A1 <- |a b c| A2 <- |j k l| A3 <- |s t u|
|d e f| |m n o| |v w x|
|g h i| |p q r| |y z ä|
And I want for instance to get the following:
ret[1,1,1] <- a*j*s + b*k*t + c*l*u
ret[2,1,3] <- d*j*y + e*k*z + f*l*ä
Hopefully this makes my point clearer.
---------------------------------- Edit Nov. 26th, 2013 -------------------------------
Hi @flodel. I tried to implement your code, but there seems to be an issue once one has more than three matrices.
Suppose, I have the following matrices
A1 <- matrix(runif(4*3), nrow = 4, ncol = 3)
A2 <- matrix(runif(3*3), nrow = 3, ncol = 3)
A3 <- matrix(runif(2*3), nrow = 2, ncol = 3)
A4 <- matrix(runif(1*3), nrow = 1, ncol = 3)
and plugging them into your code
output.f1 <- f1(A1,A2,A3,A4)
provides the correct number of dimensions
dim(output.f1)
# [1] 4 3 2 1
but the output is full of NAs
output.f1
# , , 1, 1
# [,1] [,2] [,3]
# [1,] 0.13534704 NA NA
# [2,] 0.07360135 NA NA
# [3,] 0.07360135 NA NA
# [4,] 0.07360135 NA NA
# , , 2, 1
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] NA NA NA
# [3,] NA NA NA
# [4,] NA NA NA
Thanks for some help...
Best,
Romain
Give this a try. With a big apply loop, it might be slow with large matrices, but it will do the job as far as being general to any number of matrices without necessarily the same number of rows:
f1 <- function(...) {
args <- list(...)
nrows <- sapply(args, nrow)
idx <- do.call(expand.grid, lapply(nrows, seq.int))
get.row <- function(i, mat) mat[i, ]
get.val <- function(i.vec) sum(Reduce(`*`, Map(get.row, i.vec, args)))
idx$val <- apply(idx, 1, get.val)
ret <- array(NA, dim = nrows)
ret[as.matrix(idx[, seq_along(args)])] <- idx$val
ret
}
Example usage:
A1 <- matrix(1:12, nrow = 4, ncol = 3)
A2 <- matrix(1:9, nrow = 3, ncol = 3)
A3 <- matrix(1:6, nrow = 2, ncol = 3)
out <- f1(A1, A2, A3)
Check:
identical(out[3, 2, 1],
sum(A1[3, ] * A2[2, ] * A3[1, ]))
# [1] TRUE
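If the apply loop turns out to be too slow, one possible alternative (a sketch only, and f2 is a name I made up) is to build, for each column, the outer product of that column across all matrices and then add the results up. It assumes every matrix has the same number of columns, as in the question.

```r
# f2() is a hypothetical alternative to f1(): for each column k it takes the
# outer product of the k-th columns of all matrices, then sums over k
f2 <- function(...) {
  args  <- list(...)
  n_col <- ncol(args[[1]])  # assumes all matrices share this column count
  slices <- lapply(seq_len(n_col), function(k) {
    Reduce(outer, lapply(args, function(m) m[, k]))
  })
  Reduce(`+`, slices)
}

A1 <- matrix(1:12, nrow = 4, ncol = 3)
A2 <- matrix(1:9,  nrow = 3, ncol = 3)
A3 <- matrix(1:6,  nrow = 2, ncol = 3)
A4 <- matrix(1:3,  nrow = 1, ncol = 3)

out2 <- f2(A1, A2, A3, A4)
dim(out2)      # one dimension per input matrix
anyNA(out2)    # checks the four-matrix case from the question's edit
all.equal(out2[3, 2, 1, 1],
          sum(A1[3, ] * A2[2, ] * A3[1, ] * A4[1, ]))
```

The trade-off is memory: each slice is a full array of the output size, so this only pays off when the number of columns is modest.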
I would like to extract the coefficients and standard errors of each lm object and combine them into a data.frame, with NA filled in for the missing predictors.
set.seed(12345)
x<-matrix(rnorm(1000),nrow=100,ncol=10)
colnames(x)<-paste("x",1:10,sep="")
df<-data.frame(y=rnorm(100),x)
m1<-vector('list', 10)
for ( i in 2:11){
eqn <- as.formula(paste("y ~", paste(colnames(df)[2:i], collapse='+')))
m1[[i-1]] <- lm(eqn, df)
}
Any suggestions would be much appreciated!
This should do the trick:
cList <- lapply(m1, coef)
nms <- names(cList[[length(cList)]]) # the last (largest) model has every coefficient name
cMat <- do.call(rbind, lapply(cList, function(X) X[nms]))
cDF <- as.data.frame(cMat); names(cDF) <- nms # Pretty up the results
cDF[1:5, 1:6]
# (Intercept) x1 x2 x3 x4 x5
# 1 -0.2345084 0.2027485 NA NA NA NA
# 2 -0.2334043 0.2074812 -0.05006297 NA NA NA
# 3 -0.2299977 0.2099620 -0.03892985 0.09777829 NA NA
# 4 -0.2095798 0.2221179 -0.02710201 0.06403695 -0.1184191 NA
# 5 -0.2060406 0.2180674 -0.01062671 0.06632922 -0.1045128 0.130937
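In case the NA padding looks like magic: indexing a named vector by a name it does not have returns NA, which is what X[nms] relies on. A tiny sketch with made-up numbers (short and padded are illustrative names):

```r
short <- c("(Intercept)" = 1.5, x1 = 0.2)  # made-up coefficients of a small model
nms   <- c("(Intercept)", "x1", "x2")      # names from the largest model

padded <- short[nms]   # "x2" is not in short, so that slot comes back as NA
padded
is.na(padded[3])
```

This is also why the column names get reassigned afterwards: the names of the padded slots are NA too.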
Edit:
To collect the standard errors into a similar structure, just do something like this:
seList <- lapply(m1, function(X) coef(summary(X))[,2]) # column 2 holds the standard errors
seMat <- do.call(rbind, lapply(seList, function(X) X[nms]))
seDF <- as.data.frame(seMat); names(seDF) <- nms
Here is an approach using merge and Reduce:
m2 <- lapply(m1[-1], function(x) as.data.frame(coef(summary(x))) )
tmpfun <- function(x,y) {
n <- as.character(nrow(y)-1)
xn <- if( 'Row.names' %in% colnames(x) ) 1 else 0
merge(x,y,by.x=xn, by.y=0, suffixes=c('',n), all=TRUE)
}
out <- Reduce(tmpfun, m2)
You may want to reorder the columns, or drop some of the columns in m2, or transpose the result.