Say we have a set of matrices of different dimensions, but with common row and column names. We would like to find the element-wise means of the matrices. xtabs() is a convenient function for this.
However, inside of for(), as.table() fails to recognize the expression calling each matrix. Creating a list of the matrices first and then calling each element of that list fails just the same.
MWE: Create matrices:
m1 <- matrix(1:9,nrow=3,ncol=3)
colnames(m1) <- c("A","B","C")
rownames(m1) <- c("A","B","C")
m2 <- matrix(10:18,nrow=3,ncol=3)
colnames(m2) <- c("A","B","C")
rownames(m2) <- c("A","B","C")
m3 <- matrix(19:22,nrow=2,ncol=2)
colnames(m3) <- c("A","B")
rownames(m3) <- c("A","B")
Use one of the matrices as a foundation to build from:
A <- m1
Join and find means:
for(i in 2:3){
  mat <- noquote(paste0("m", i))
  B <- rbind(as.data.frame(as.table(A)), as.data.frame(as.table(mat)))
  A <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, B, mean))
}
The problem is with as.table(mat), resulting in an error:
Error in as.table.default(mat) : cannot coerce to a table
This is just a working example; the real application repeats this over thousands of matrices with different naming conventions. Inserting noquote(paste0("m", i)) directly into as.table() also fails.
Simply replacing mat with the matrix object directly works fine (i.e. as.table(m2)). Thanks!
paste0() only builds a character string such as "m2", and noquote() merely changes how that string is printed. To retrieve the object whose name is stored in the string, use get():
for(i in 2:3){
  mat <- get(paste0("m", i))
  B <- rbind(as.data.frame(as.table(A)), as.data.frame(as.table(mat)))
  A <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, B, mean))
}
A
#     Var2
# Var1     A     B     C
#    A 12.25 14.75 11.50
#    B 13.25 15.75 12.50
#    C  7.50 10.50 13.50
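Note that averaging inside the loop gives a running pairwise mean: each pass averages the accumulated table with the next matrix, so later matrices carry more weight. For cell A,A that is (1 + 10)/2 = 5.5, then (5.5 + 19)/2 = 12.25. If the plain element-wise mean over all matrices is wanted instead (10 for that cell), one possible sketch is to stack all matrices once and aggregate once. It assumes the matrices are named m1 to m3 as in the question; mget() is the multi-object counterpart of get().

```r
# Matrices from the question, built with dimnames in one step
m1 <- matrix(1:9,   nrow = 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m2 <- matrix(10:18, nrow = 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m3 <- matrix(19:22, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Fetch all objects by name at once (assumes they live in the current environment)
mats <- mget(paste0("m", 1:3))

# Convert each matrix to long format (Var1, Var2, Freq) and stack the pieces
long <- do.call(rbind, lapply(mats, function(m) as.data.frame(as.table(m))))

# One aggregate over the stacked data gives every cell equal weight per matrix
means <- xtabs(Freq ~ Var1 + Var2, aggregate(Freq ~ Var1 + Var2, long, mean))
means["A", "A"]  # 10, i.e. (1 + 10 + 19) / 3
```

Cells that exist in only some matrices, such as row C, are averaged over the matrices that contain them.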
Related
I use map()/reduce() from purrr on a vector that sometimes has only one element. In that case reduce() returns a numeric vector instead of a matrix, whereas do.call() works as expected:
library(purrr)
test1 <- 1
map_l1 <- map(test1, ~ .x*c(1,2,3))
r1 <- reduce(map_l1, rbind)
r1
d1 <- do.call("rbind", map_l1)
d1
map()/reduce() on a vector with more than one element works as expected:
test2 <- c(1,2)
map_l2 <- map(test2, ~ .x*c(1,2,3))
r2 <- reduce(map_l2, rbind)
r2
d2 <- do.call("rbind", map_l2)
d2
As I do not know the size of the initial vector a priori, I use do.call().
Is there a way to get the same result with reduce()?
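One possible approach is to supply an initial value. With a single element and no initial value, a reduce has nothing to combine, so it returns that element unchanged and rbind is never called. Starting from NULL forces at least one rbind call. The sketch below uses base R's Reduce() so that it runs without extra packages. purrr's reduce() has an analogous .init argument, but whether .init = NULL behaves identically there is an assumption worth checking. The helper scale_rows is hypothetical and only mirrors the map() step from the question.

```r
# Hypothetical helper mirroring map(v, ~ .x * c(1, 2, 3))
scale_rows <- function(v) lapply(v, function(x) x * c(1, 2, 3))

l1 <- scale_rows(1)         # list with one element
l2 <- scale_rows(c(1, 2))   # list with two elements

# Without an initial value, a one-element list is returned as a plain vector
r1_plain <- Reduce(rbind, l1)

# With init = NULL, rbind is called even for a single element;
# unname() drops any row labels that rbind may attach
r1 <- unname(Reduce(rbind, l1, NULL))
r2 <- unname(Reduce(rbind, l2, NULL))

dim(r1)  # 1 3
dim(r2)  # 2 3
```

do.call("rbind", ...) remains the simpler choice when all pieces are already in a list; the initial-value form is mainly useful when a reduce is wanted for other reasons.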
I have three populations stored as individual vectors. I need to run a statistical test (wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value is the result of one test and is a double).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1,pop2,pop3) {
  populations <- list(pop1=pop1,pop2=pop2,pop3=pop3)
  colLabels <- c("pop1", "pop2", "pop3")
  #This line makes a data frame where each column is a pair of labels
  perms <- data.frame(t(permutations(colLabels,2)))
  pvals <- vector()
  #This for loop gets each column of that data frame
  for (pair in perms[,]) {
    pair <- as.vector(pair)
    p1 <- as.numeric(unlist(populations[pair[1]]))
    p2 <- as.numeric(unlist(populations[pair[2]]))
    pvals <- append(pvals, wilcox.test(p1, p2,alternative=c("less"))$p.value)
  }
  return(pvals)
}
What is a more idiomatic R way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is common; so common that R has a built-in function for exactly this scenario: pairwise.table. Note that its p.adjust.method argument has no default, so it must be supplied.
p <- list(pop1, pop2, pop3)
pairwise.table(function(i, j) {
  wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3, p.adjust.method = "holm")
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
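To illustrate what the adjustment does, here is a small sketch with made-up p-values (they are not results from any data in this thread). p.adjust() is the base R function that the pairwise.* helpers call internally.

```r
# Three illustrative raw p-values (invented for the example)
p_raw <- c(0.01, 0.02, 0.04)

# Holm: multiply the sorted p-values by 3, 2, 1 and enforce monotonicity
p_holm <- p.adjust(p_raw, method = "holm")        # 0.03 0.04 0.04

# Bonferroni: multiply every p-value by the number of tests
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # 0.03 0.06 0.12
```

Holm is the default in pairwise.wilcox.test and is never less powerful than Bonferroni.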
Per your comments, you're interested in tests where the order matters. That does not apply to a two-sided Wilcoxon test, although it does to a one-sided alternative such as alternative="less". In any case...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function (compare.levels, level.names, p.adjust.method) {
  ix <- setNames(seq_along(level.names), level.names)
  pp <- outer(ix, ix, function(ivec, jvec)
    sapply(seq_along(ivec), function(k) {
      i <- ivec[k]; j <- jvec[k]
      if (i != j) compare.levels(i, j) else NA }))
  pp[] <- p.adjust(pp[], p.adjust.method)
  pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method=p.adjust.methods) {
  p.adjust.method <- match.arg(p.adjust.method)
  level.names <- if(!is.null(names(dat))) names(dat) else seq_along(dat)
  PVAL <- pairwise.table.all(function(i, j) {
    wilcox.test(dat[[i]], dat[[j]])$p.value
  }, level.names, p.adjust.method = p.adjust.method)
  ans <- list(method = "Lazerbeam's special method",
              data.name = paste(level.names, collapse=", "),
              p.value = PVAL, p.adjust.method = p.adjust.method)
  class(ans) <- "pairwise.htest"
  ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn(), whose FUN argument makes it easy to apply wilcox.test() to every pair of variables.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x) {
  list(data = c(paste(x[1], x[2])),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value)
}, simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251
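For the case raised in the question (vectors of different lengths, a one-sided alternative, and therefore 6 ordered pairs rather than 3), a possible base R sketch is to enumerate the ordered pairs with expand.grid() and loop over them with mapply(). The data values below are invented for illustration, and the names pops and pairs are hypothetical.

```r
# Invented populations of different sizes, stored as a named list
pops <- list(pop1 = c(1.2, 3.4, 2.2, 5.1),
             pop2 = c(2.5, 4.8, 6.0, 7.3, 5.5),
             pop3 = c(9.1, 8.4, 7.7))

# All ordered pairs of names, excluding a population paired with itself
pairs <- expand.grid(first = names(pops), second = names(pops),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$first != pairs$second, ]

# One-sided test for each ordered pair; order matters for alternative = "less"
pairs$p.value <- mapply(function(a, b) {
  wilcox.test(pops[[a]], pops[[b]], alternative = "less")$p.value
}, pairs$first, pairs$second, USE.NAMES = FALSE)

pairs
```

The result is a data frame with one row per ordered pair, which keeps each p-value next to the labels it belongs to. The p-values here are unadjusted; p.adjust() can be applied to the p.value column afterwards.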
Suppose there is a data.frame where some variables are coded as integers:
a <- c(1,2,3,4,5)
b <- as.integer(c(2,3,4,5,6))
c <- as.integer(c(5,1,0,9,2))
d <- as.integer(c(5,6,7,3,1))
e <- c(2,6,1,2,3)
df <- data.frame(a,b,c,d,e)
str(df)
Suppose I want to convert columns b to d to numeric:
varlist <- names(df)[2:4]
lapply(varlist, function(x) {
df$x <- as.numeric(x, data=x)
})
str(df)
does not work.
I tried:
df$b <- as.numeric(b, data=df)
df$c <- as.numeric(c, data=df)
df$d <- as.numeric(d, data=df)
str(df)
which works fine.
Questions:
How do I do this in a loop, or better, with lapply? (I'm a Stata person and as such used to writing loops.)
And more generally: how do I apply any function to a list of variables in a data.frame, e.g. multiply each variable on the list by some other variable that always stays the same?
BONUS: what if the other variable changes with each variable on the list?
For the first question you can use sapply:
df[2:4] <- sapply(df[2:4],as.numeric)
For the second you can use mapply. For example, to multiply the three variables (columns 2 to 4) by three different random scalars:
df[2:4] <- mapply(function(x,y)df[[x]]*y,2:4,rnorm(3))
df[,2:4] <- sapply(df[,2:4], as.numeric)
As for your second question, if you want to, say, multiply column c by 5:
df$c <- df$c * 5
Or by any vector the same length as c, for example a new column multiplying c by d:
df$cd <- df$c * df$d
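The answers above can be combined into one sketch that selects columns by name. It uses lapply() for the conversion and for a fixed multiplier, and Map() for the bonus case where the multiplier changes per column. The data is a shortened version of the question's data frame, and the multiplier values in mult are invented for illustration.

```r
# Shortened version of the question's data: b, c, d are integer columns
df <- data.frame(a = c(1, 2, 3), b = 2:4, c = c(5L, 1L, 0L), d = 5:7)
varlist <- c("b", "c", "d")

# Convert the listed columns to numeric; lapply keeps the list/data.frame shape
df[varlist] <- lapply(df[varlist], as.numeric)

# Multiply each listed column by the same column a
scaled <- df
scaled[varlist] <- lapply(df[varlist], function(col) col * df$a)

# Bonus: a different (invented) multiplier for each listed column
mult <- c(b = 10, c = 100, d = 1000)
scaled2 <- df
scaled2[varlist] <- Map(function(col, m) col * m, df[varlist], mult[varlist])

str(df)
```

The original attempt failed because df$x refers to a column literally named "x", and because lapply() returns its results rather than modifying df; assigning the result back with df[varlist] <- ... addresses both points.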
Packages (wmtsa, gtools, caret)
I have the following vector and below the following function
z2 <- c(-0.1100, 0.1800, 0.0620, 0.1000,-0.0730,-0.1310, 0.2170,-0.0680,-0.0840,
-0.1350,-0.3070, 0.0670, 0.1360, 0.1000,-0.0150, 0.2450,0.1190,-0.0560, -0.0600,
-0.1400, -0.0420, 0.1250, 0.0060, -0.0280,-0.0620, -0.0010, 0.0880, -0.0180, 0.0720,
0.3160, -0.0270, -0.0460,0.0600, -0.0300, 0.0550, -0.0560, 0.1780, 0.0200, 0.0070,
0.0430)
Wavs4 <- wavMODWT(z2, wavelet="s4", n.levels=ilogb(length(z2), base=2),
                  position=list(from=1, by=1, units=character()), units=character(),
                  title.data=character(), documentation=character(), keep.series=FALSE)
MRD4<-wavMRD(Wavs4, level=NULL, osc=NULL)
wavs4access<-as.matrix(MRD4)
wavs4access<-as.matrix(wavs4access)
Dxu4 <- wavs4access [,"D1"]
Dxi4 <- wavs4access [,"D2"]
Dxa4 <- wavs4access [,"D3"]
#From above selected combinations
a <- c("Dxi4","Dxu4","Dxa4")
b <- combinations(3, 2, a, set=TRUE, repeats.allowed=FALSE)
#I extract the coefficients from comb...
d1 <- c(b[[1,1]],b[[1,2]])
d2 <- c(b[[2,1]],b[[2,2]])
d3 <- c(b[[3,1]],b[[3,2]])
#I create a list
dlist<-list(d1=d1,d2=d2,d3=d3)
I would like to use dlist in the pastevar expression below, so that a function can loop over all the d values, rather than writing the expression three times and running the function each time.
pastevar <- paste("z2[1:length(z2)] ~", paste(d1, collapse=" + "))
X <- model.matrix(as.formula(pastevar))[,-1]
X <- data.frame(X)
Y <- z2[1:length(z2)]
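One way this could be looped, sketched with base R only: build each formula with reformulate() inside lapply(). Because the wavelet packages may not be available, the three detail series below are random stand-ins (an assumption, not the real wavMRD output), and base combn() replaces gtools::combinations(), so the order of the pairs may differ from the original b.

```r
set.seed(1)
z2 <- rnorm(40)                       # stand-in for the series in the question
Dxu4 <- rnorm(40)                     # stand-ins for the D1, D2, D3 details
Dxi4 <- rnorm(40)
Dxa4 <- rnorm(40)

# All pairs of variable names, as a named list d1, d2, d3
dlist <- combn(c("Dxi4", "Dxu4", "Dxa4"), 2, simplify = FALSE)
names(dlist) <- paste0("d", seq_along(dlist))

# For each pair, build the formula and the design matrix without the intercept
designs <- lapply(dlist, function(vars) {
  f <- reformulate(vars, response = "z2")   # e.g. z2 ~ Dxi4 + Dxu4
  X <- data.frame(model.matrix(f)[, -1, drop = FALSE])
  list(formula = f, X = X, Y = z2)
})

names(designs$d1$X)
```

Each element of designs then holds the X and Y that the downstream function needs, so that function can be applied with another lapply() over designs.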
Thank you
I have data.frames which are NULL in my results, but I don't want them to be NULL. I want them to be the same as the beginning (unchanged). I'm working on a list of files and the aim of my code is to fill all the NA with data from my other data.frames (according to the best correlation coefficient). Here's a small example:
Imagine these are my 3 input data frames (10 rows each):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,NA,NA,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
The aim is, for example: if there are NAs in ST1, they must be filled with data from the file best correlated with ST1 (either ST2 or ST3 in this example).
As ST3 has no data here, no correlation coefficient can be computed for it. So the NAs in ST3 cannot be filled, and ST3 cannot be used to fill another file either. ST3 is of no use, if you like. Nevertheless, I want ST3 to stay unchanged throughout my code.
So the problem in my code comes from data.frames that contain only NAs.
At the moment my code gives this for "refill" (the end of my code, with the NAs in my data.frames filled):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- NULL
What I actually want in "refill" is this:
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
So data.frames that contain only NAs should not become NULL in "refill"; they should be identical to the input. I need the data.frames to have the same dimensions in the inputs and the outputs.
If they are NULL (as they are at the moment, which I don't understand and want to change), the data.frame has 0 rows instead of 10 rows like the other data.frames.
So I think there is something wrong in my code in the function "process.all" or "na.fill", or maybe in "lst".
Here is my code. It is a reproducible example so that you can see my error (in head(refill), ST2 is set to NULL).
Sorry that it is a bit long, but my error depends on other functions used earlier. I hope you have understood my problem and what I'm trying to do. Thanks for your help!
(For information, in the functions "process.all" and "na.fill": x is the data.frame I want to fill, and y is the file that will be used to fill x, i.e. the file best correlated with x.)
Geoffrey
# my data for example
DF1 <- data.frame(x1=c(NA,NA,rnorm(3:20)),x2=c(31:50))
write.table(DF1,"ST001_2008.csv",sep=";")
DF2 <- data.frame(x1=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,rnorm(1:10)),x2=c(1:20))
write.table(DF2,"ST002_2008.csv",sep=";")
DF3 <- data.frame(x1=rnorm(81:100),x2=NA)
write.table(DF3,"ST003_2008.csv",sep=";")
DF4 <- data.frame(x1=c(21:40),x2=rnorm(1:20))
write.table(DF4,"ST004_2008.csv",sep=";")
# Correlation table
corhiver2008capt1 <- read.table(text=" ST001 ST002 ST003 ST004
ST001 1.0000000 NA -0.4350665 0.3393549
ST002 NA NA NA NA
ST003 -0.4350665 NA 1.0000000 -0.4992513
ST004 0.3393549 NA -0.4992513 1.0000000",header=T)
lst <- lapply(list.files(pattern="\\_2008.csv$"), read.table,sep=";", header=TRUE, stringsAsFactors=FALSE)
Stations <-c("ST001","ST002","ST003","ST004")
names(lst) <- Stations
# searching the highest correlation for each data.Frame
get.max.cor <- function(station, mat){
  mat[row(mat) == col(mat)] <- -Inf
  m <- max(mat[station, ],na.rm=TRUE)
  if (is.finite(m)) {return(which( mat[station, ] == m ))}
  else {return(NA)}
}
# fill the data.frame with the data.frame which has the highest correlation coefficient
na.fill <- function(x, y){
  if(all(!is.finite(y[1:10,1]))) return(y)
  i <- is.na(x[1:10,1])
  xx <- y[1:10,1]
  new <- data.frame(xx=xx)
  x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude),new)[i]
  x
}
process.all <- function(df.list, mat){
  f <- function(station)
    na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
  g <- function(station){
    x <- df.list[[station]]
    if(any(!is.finite(x[1:10,1]))){
      mat[row(mat) == col(mat)] <- -Inf
      nas <- which(is.na(x[1:10,1]))
      ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
      for(y in ord){
        if(all(!is.na(df.list[[y]][1:10,1][nas]))){
          xx <- df.list[[y]][1:10,1]
          new <- data.frame(xx=xx)
          x[1:10,1][nas] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[nas]
          break
        }
      }
    }
    x
  }
  n <- length(df.list)
  nms <- names(df.list)
  max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
  df.list <- lapply(seq.int(n), f)
  df.list <- lapply(seq.int(n), g)
  names(df.list) <- nms
  df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill) ########## HERE IS THE PROBLEM ######
refill
How about
if(sum(!is.na(ST3)) == 0) {
  # skip whatever you normally would do and go to the next vector
}
This assumes, of course, that you don't have any problems with, say, a vector of 1999 NAs and one numerical value.
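A minimal sketch of such a guard, written as a variant of the question's na.fill. It rests on an assumption about where the NULL comes from: for a station whose correlations are all NA, get.max.cor returns NA, indexing a list with [[NA]] yields NULL, and na.fill then hits its early return(y) and hands back that NULL. The name na_fill_safe and its rows argument are hypothetical. The function returns x untouched whenever there is nothing to fill from or nothing to fit on.

```r
na_fill_safe <- function(x, y, rows = 1:10) {
  # No usable source (missing partner or all-NA partner): keep x as it is
  if (is.null(y) || !any(is.finite(y[rows, 1]))) return(x)
  # x itself has no observed values, so no regression can be fitted: keep x
  if (!any(is.finite(x[rows, 1]))) return(x)
  i  <- is.na(x[rows, 1])
  xx <- y[rows, 1]
  fit <- lm(x[rows, 1] ~ xx, na.action = na.exclude)
  # Replace only the missing entries with the regression predictions
  x[rows, 1][i] <- predict(fit, data.frame(xx = xx))[i]
  x
}

# The three small data frames from the question
ST1 <- data.frame(x1 = c(1:10))
ST2 <- data.frame(x2 = c(1:5, NA, NA, 8:10))
ST3 <- data.frame(x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))

filled2 <- na_fill_safe(ST2, ST1)   # NAs in ST2 predicted from ST1
kept3   <- na_fill_safe(ST3, ST1)   # all-NA input comes back unchanged
kept1   <- na_fill_safe(ST1, NULL)  # missing partner: unchanged as well
```

In process.all, calling this in place of na.fill inside f should keep all-NA stations at their original dimensions, though that has only been checked here on the small example above.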