How to run all "N take 2" iterations of a function? - r

I've written a simple correlation function that takes in three variables. "A" and "B" are numerical vectors of equal length, and "n" is the length.
Corr.fxn <- function(A, B, n){
Correlation <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
return(Correlation)
}
The function works well enough, but I have many vectors I want to process. What's the best way to modify this code to process all "N take 2" unique analyses for my set of vectors "N"?
EDIT:
Example data showing the structure of the vectors:
A <- c(-1, 0, 1, -1, 0, 1, -1, 0, 1)
B <- c(1, 1, -1, 0, 1, -1, 0, 0, 1)
...
n <- length(A)
So let's say I have vectors A through Z and I want to modify my code to output a new vector containing all {26 take 2} correlation values.

Here is one possible way you can do it assuming you have a bunch of numeric vectors in a list v as follows:
v <- list()
for (i in 1:10) {
v[[i]] <- sample(1:10, 10, replace = TRUE)
}
apply(combn(1:10, 2), 2, function(x) Corr.fxn(v[[x[1]]], v[[x[2]]], length(v[[x[1]]])))
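As a cross-check, the hand-written formula is the Pearson correlation, so base R's cor() applied to the vectors bound as columns should reproduce the same values. The sketch below assumes all vectors have the same length; the data and the names by_hand and from_cor are made up for illustration.

```r
# Hand-written correlation from the question
Corr.fxn <- function(A, B, n) {
  (sum((A - mean(A)) * (B - mean(B))) / (n - 1)) / (sd(A) * sd(B))
}

set.seed(1)
v <- lapply(1:5, function(i) rnorm(10))  # five made-up vectors of length 10

# All "5 take 2" pairs, in the column order produced by combn()
by_hand <- apply(combn(length(v), 2), 2, function(x) {
  Corr.fxn(v[[x[1]]], v[[x[2]]], length(v[[x[1]]]))
})

# Same values from the built-in correlation matrix: because the matrix is
# symmetric, its lower triangle read column by column follows combn() order
cm <- cor(do.call(cbind, v))
from_cor <- cm[lower.tri(cm)]

length(by_hand)               # choose(5, 2) pairs
all.equal(by_hand, from_cor)
```

If the full matrix is more useful than a vector, cm can be used directly.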

In this answer, I assume 2 things. First, you want to write a function yourself, since otherwise you can use Hmisc::rcorr. Second, you want the "N take 2" part to be inside the function, otherwise the ways suggested earlier are correct. In that case, you can do this:
Corr.fxn <- function(vectors, n){
  pairs <- combn(length(vectors), 2)
  # square matrix: one row and one column per vector
  cor.mat <- matrix(NA, nrow = length(vectors), ncol = length(vectors))
  for (i in 1:ncol(pairs)){
    A <- vectors[[pairs[1, i]]]
    B <- vectors[[pairs[2, i]]]
    cor.mat[pairs[1, i], pairs[2, i]] <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
  }
  # mirror the upper triangle into the lower triangle
  cor.mat[lower.tri(cor.mat)] <- t(cor.mat)[lower.tri(cor.mat)] ###
  diag(cor.mat) <- 1 ###
  cor.mat <- data.frame(cor.mat) ###
  row.names(cor.mat) <- colnames(cor.mat) <- names(vectors) ###
  return(cor.mat)
}
The lines that end in ### are there for decorative reasons. The main input is a list called "vectors". So it works as follows:
A<- runif(100, 1, 100)
B<- runif(100, 30, 50)
C<- runif(100, 120, 200)
> Corr.fxn(list(A=A, B=B, C=C), n=100)
           A           B           C
A  1.0000000 -0.11800104 -0.13980458
B -0.1180010  1.00000000  0.04933581
C -0.1398046  0.04933581  1.00000000

Related

How to create list of functions with multiple parameters from dataframes in R?

Long time reader, first time poster. I have not found any previous questions about my current problem. I would like to create multiple linear functions, which I can later apply to variables. I have a data frame of slopes, df_slope, and a data frame of constants, df_constant.
Dummy data:
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
I would like to construct functions such as
myfunc <- function(slope, constant, trvalue){
result <- trvalue*slope+constant
return(result)}
where the slope and constant values are
slope<- df_slope[i,j]
constant<- df_constant[i,j]
I have tried many ways, for example like this, creating a dataframe of functions with for loop
myfunc_all<-data.frame()
for(i in 1:5){
for(j in 1:3){
myfunc_all[i,j]<-function (x){ x*df_slope[i,j]+df_constant[i,j] }
full_func[[i]][j]<- func_full
}
}
without success. The slope-constant values are paired up, such as df_slope[i,j] is paired with df_constant[i,j]. The desired end result would be some kind of data frame, from where I can call a function by giving it the coordinates, for example like this:
myfunc_all[i,j]
but any form would be great. For example
myfunc_all[2,1]
in our case would be
function (x){ x*2+4 }
which I can apply to different x values. I hope my problem is clear.
So you have a slight problem with lazy evaluation and variable scopes when you are using a for loop to build functions (see here for more info). It's a bit safer to use something like mapply which will create closures for you. Try
myfunc_all <- with(expand.grid(1:5, 1:3), mapply(function(i, j) {
function(x) {
x*df_slope[i,j]+df_constant[i,j]
}
},Var1, Var2))
dim(myfunc_all) <- c(5,3)
This will create an array-like object. The only difference is that you need to use double brackets to extract the function. For example
myfunc_all[[2,1]](0)
# [1] 4
myfunc_all[[5,3]](0)
# [1] -1
Alternatively, you can choose to write a function that returns a function. That would look like
myfunc_all <- (function(slopes, constants) {
function(i, j)
function(x) x*slopes[i,j]+constants[i,j]
})(df_slope, df_constant)
Then, rather than using brackets, you call the function with parentheses.
myfunc_all(2,1)(0)
# [1] 4
myfunc_all(5,3)(0)
# [1] -1
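To see why building closures directly in a for loop goes wrong, here is a minimal sketch. The names adders and safe_adders are hypothetical and not part of the question. Every function created in the loop looks up the same variable i when it is called, so all of them see its final value unless each closure gets its own environment.

```r
# Closures built in a for loop all look up the same loop variable `i`
adders <- list()
for (i in 1:3) {
  adders[[i]] <- function(x) x + i
}
adders[[1]](0)  # uses the final value of i (3), not 1

# local() gives each closure its own environment holding a private copy `k`
safe_adders <- list()
for (i in 1:3) {
  safe_adders[[i]] <- local({
    k <- i
    function(x) x + k
  })
}
safe_adders[[1]](0)  # 1
```

The mapply and function-returning-a-function approaches above avoid the problem in the same way: each call creates a fresh environment for i and j.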
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
functions = vector(mode = "list", length = nrow(df_slope))
for (i in 1:nrow(df_slope)) {
functions[[i]] = function(i,x) { df_slope[i]*x + df_constant[i]}
}
f = function(i, x) {
functions[[i]](i, x)
}
f(1, 1:10)
f(3, 5:10)

R - Averaging specific matrix indices over matrix

I have two matrices. The first, m1, is 100x100 and contains numbers with decimal places and the other, m2, is 300x100 and is sparsely populated with integers, like so:
m1 <- matrix(rexp(10000, rate = .1), ncol = 100)
m2 <- matrix(sample(c(rep(0, 1000), rep(1, 10), rep(2, 1)), 300 * 100, replace = TRUE), 300, 100)
Each row in m1 corresponds to the column of the same number in m2. Each column of m2 holds the number of occurrences of the corresponding row of m1 in that observation.
For each row of m2, I want the column means of the rows of m1 that occur in it, each weighted by how many times it appears in that row of m2. The result should be a 300x100 matrix. I want to know the most efficient way of doing this.
It's a complex operation but hopefully you understand what I mean. If you need any clarification I can give it. If it helps, what I'm trying to do is to get a document features matrix from a word feature matrix and a document-term matrix.
dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222, 32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)
dtm
wvm
t(apply(dtm, 1, function(dtm_row) {
vs <- wvm[dtm_row > 0, ] * dtm_row[dtm_row > 0]
if (is.matrix(vs)) { colMeans(vs) } else vs
}))
Solved my own problem. But if anyone wants to improve my method, I'll mark their answer as the correct one.
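One possible improvement, offered as a sketch rather than a benchmarked answer: the row-by-row apply can be written as a single matrix product. For each document, dtm %*% wvm sums the word vectors weighted by their counts, and dividing by the number of distinct words present reproduces the colMeans step. The names by_row and by_matmul are made up here.

```r
dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222,
                32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)

# Row-by-row version from the post
by_row <- t(apply(dtm, 1, function(dtm_row) {
  vs <- wvm[dtm_row > 0, ] * dtm_row[dtm_row > 0]
  if (is.matrix(vs)) colMeans(vs) else vs
}))

# Vectorised version: count-weighted sums of the word vectors, divided by
# the number of distinct words present in each document
by_matmul <- (dtm %*% wvm) / rowSums(dtm > 0)

all.equal(by_row, by_matmul)
```

A document with no words at all gives 0/0, so its row comes out as NaN in this version.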

Combining lists into a dataframe more efficiently

I am running multiple chains of a MCMCglmm() model and I am trying to find the most efficient way to synthesize my output.
I am using mclapply() to run 4 chains and then combining each of the 4 chains into a list with lapply().
Here is my model and code to clean up and combine the chains. I am using this helpful tutorial for running the chains: https://github.com/tmalsburg/MCMCglmm-intro
Model:
library(parallel)
chains <- mclapply(1:4, function(i) {
MCMCglmm(outcome ~ 1 + pretest + race + satisfaction*race, data = data,
random = ~ provider,
prior = prior.1,
verbose = TRUE,
family = "gaussian",
nitt = 10000,
burnin = 5000,
thin = 10)
}, mc.cores=4)
My cleanup is a little clunky. Is there a way to run a lapply command (or I think what is needed is mapply) on both the fixed and random effects to combine them into the same list and subsequent data frame? In the end, I am hoping to have a data frame so I can add/ subtract posterior distributions and run summary statistics on them.
fixed <- lapply(chains, function(m) m$Sol) # Sol = fixed effects
fixed <- do.call(mcmc.list, fixed)
summary(fixed)
random <- lapply(chains, function(m) m$VCV) # VCV = variance
random <- do.call(mcmc.list, random)
summary(random)
fixed_df <- do.call(rbind, Map(data.frame, fixed))
random_df <- do.call(rbind, Map(data.frame, random))
chains_df <- cbind(fixed_df, random_df)
Ultimately, I am hoping to run one lapply() or mapply() and have a single fixed.random list of lists. I believe I can use the Map(data.frame, fixed.random) on that to create my data frame. My knowledge of the apply function is limited, so I'm hoping to learn more and apply it (no pun intended) to my datasets.
Unfortunately, the models output MCMC objects, so I am unable to create the exact structure. This is the best I can come up with:
list1 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list2 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list3 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list4 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list5 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list6 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list7 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list8 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
fixed <- list(list1, list2, list3, list4)
random <- list(list5, list6, list7, list8)
Would the following do?
Say your four_mcmc is a list of models of the class "MCMCglmm" (chain1, chain2, etc.) and extract is the list of elements you want to read from the chains (in your case the fixed ("Sol") and random terms ("VCV")).
## The list of mcmcs
four_mcmc <- list(chain1, chain2, chain3, chain4)
## Which elements to extract from the MCMCs
extract <- c("VCV", "Sol")
You can use a get.elements function to extract a list of single elements from a single chain:
## Extracting some specific elements from a chain
get.elements <- function(extract, mcmc) {
  ## Extracting each requested element by name
  sapply(extract, function(element) mcmc[which(names(mcmc) == element)])
}
## Extracting the VCV and Sol from one chain
str(get.elements(extract, chain1))
You can then simply apply this function to your list of chains:
## Applying get.elements to each chain
all_elements <- lapply(four_mcmc, function(mcmc, extract) get.elements(extract, mcmc), extract)
You can then easily summarise the result for each term as a data frame, with the terms as rows and the chains as columns:
## Fixed terms table ("Sol" is the second element of extract)
fixed_terms <- as.data.frame(lapply(all_elements, function(X) X[[2]]))
## Random terms table ("VCV" is the first element of extract)
random_terms <- as.data.frame(lapply(all_elements, function(X) X[[1]]))
This code is simplified from the read.mulTree function from https://github.com/TGuillerme/mulTree.
[edit]
@headpoint suggested simply using:
as.data.frame(lapply(chains, function(m) cbind(m$Sol, m$VCV)))
Which is more elegant but could be less portable.
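For the dummy lists in the question, pairing the fixed and random parts chain by chain can be sketched with Map. This is an illustration on made-up data only, not tested on real MCMCglmm output; the names per_chain and the chain column are assumptions added here.

```r
set.seed(42)
# Dummy stand-ins for four chains, shaped like the lists in the question
fixed  <- lapply(1:4, function(i) list(a = rnorm(100), b = rnorm(100)))
random <- lapply(1:4, function(i) list(d = rnorm(100), e = rnorm(100)))

# Map() walks both lists in parallel: one data frame per chain
per_chain <- Map(function(f, r, chain) data.frame(chain = chain, f, r),
                 fixed, random, seq_along(fixed))

# Stack the four chains into a single data frame
chains_df <- do.call(rbind, per_chain)

dim(chains_df)    # 400 rows, 5 columns
names(chains_df)  # "chain" "a" "b" "d" "e"
```

Keeping a chain column makes it possible to summarise the posterior per chain or pooled across chains later.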

calculate mean (or other function) per column for subsets of a matrix based on another matrix

I am working in R with a classifier that outputs a matrix of real values with one column for each class that I am classifying. I then apply a function to the output matrix and my class label matrix (one column per class) to calculate an error per class (column).
This worked well with small datasets and equal distributions of class and non-class rows but breaks down when I use bigger files with a skewed distribution of class versus non-class. Typically my files contain less than 0.3% class versus 99.7% non-class and in this case my classifier tends to simply output the non-class value (0).
I want to try a different error (cost) function to try to balance this out. I will also try up and down sampling but they have other issues. A possible simple change that I would like to try is to calculate the error for class 1 separately from class 0 and then combine those errors in such a way that the class errors are not buried by the overwhelming non-class errors.
I am including a minimum working example to help demonstrate what I want.
L1 <- runif(13, min=0, max=1)
L2 <- runif(13, min=0, max=1)
predy <- cbind(L1, L2) # simulated output from the classifier
#predy
L1 <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
L2 <- c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0)
classy <- cbind(L1, L2) # Simulated class matrix
#classy
# Now compute error showing existing method
mse <- apply((predy - classy)^2, 2, mean)
nrmse <- sqrt(mse / apply(classy, 2, var))
#
#nrmse
# L1 L2
# 1.343796 1.062442
#
# Sort-of-code for what I would like to have
# mse0 <- apply((predy - classy)^2, 2, mean) where x=0
# mse1 <- apply((predy - classy)^2, 2, mean) where x=1
# mse <- (mse0 + mse1) / 2 # or some similar way of combining them of my choice
# nrmse <- sqrt(mse / apply(classy, 2, var))
In addition, my files are large and my classifier model is huge and so doing this in a computationally efficient manner would be very helpful.
I managed to do it using a for loop (below), can anyone help translate this to apply?
mean.ones <- matrix(0, dim(classy)[2])
mean.zeros <- matrix(0, dim(classy)[2])
for (ix in 1:dim(classy)[2]) {
ix.ones <- classy[, ix]==1
mean.ones[ix] <- mean(predy[ix.ones, ix])
mean.zeros[ix] <- mean(predy[!ix.ones, ix])
}
The code above doesn't do the same thing as the original: it only calculates conditional means. However, the code flow seems correct.
Here's a solution that takes advantage of (1) lexical scoping, so you don't have to pass the matrices to the summary function passed to the first lapply(), and (2) the fact that predy and classy have the same dimensions.
Here's the calculation of the conditional means:
# calculation of means
temp <- lapply(seq.int(ncol(predy)),
               function(i) tapply(predy[, i], classy[, i], mean))
# presumably each column has members of both classes,
# but if not, we make sure that there are two members
# in each element of the list 'temp', as follows:
temp <- lapply(temp, function(x) x[match(0:1, names(x))])
# bind the outputs together by column.
mean_mx <- do.call(cbind, temp)
all(mean_mx[1, ] == mean.zeros)
all(mean_mx[2, ] == mean.ones)
Here's the calculation of the mean squared errors:
# calculation of MSE
temp <- lapply(seq.int(ncol(predy)),
               function(i) tapply((predy[, i] - classy[, i])^2, classy[, i], mean))
# presumably each column has members of both classes,
# but if not, we make sure that there are two members
# in each element of the list 'temp', as follows:
temp <- lapply(temp, function(x) x[match(0:1, names(x))])
# bind the outputs together by column.
mse_mx <- do.call(cbind, temp)
mse0 <- mse_mx[1,]
mse1 <- mse_mx[2,]
mse <- (mse0 + mse1) / 2
nrmse <- sqrt(mse / apply(classy, 2, var))
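Since efficiency on large files was a concern, a fully vectorised variant may be worth trying. The sketch below assumes classy contains only 0 and 1 and that every column has at least one row of each class (otherwise the division yields NaN). The names sq_err and is_one are made up here.

```r
set.seed(1)
# Simulated classifier output and 0/1 class labels, as in the question
predy  <- cbind(L1 = runif(13), L2 = runif(13))
classy <- cbind(L1 = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
                L2 = c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0))

sq_err <- (predy - classy)^2
is_one <- classy == 1

# Masked column sums divided by the class counts give the per-class MSE
mse1 <- colSums(sq_err * is_one)    / colSums(is_one)
mse0 <- colSums(sq_err * (!is_one)) / colSums(!is_one)

mse   <- (mse0 + mse1) / 2
nrmse <- sqrt(mse / apply(classy, 2, var))
```

This avoids the per-column lapply/tapply calls, at the cost of building a few temporary matrices the size of predy.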

Passing multiple arguments to Reduce

I have a list of data.frames, and would like to operate on their columns, using various weights.
For example, subtracting the first column from the second column (solved, see below); or subtracting the first and third from twice the second (unsolved).
Thanks to the generous help obtained in response to this question, I have a solution to the problem in two dimensions without weights using Reduce.
I would like to have the flexibility to operate with weights, and in higher dimensions.
What I have so far is:
priceList <- data.frame(aaa = rnorm(100, 100, 10), bbb = rnorm(100, 100, 10),
ccc = rnorm(100, 100, 10), ddd = rnorm(100, 100, 10),
eee = rnorm(100, 100, 10), fff = rnorm(100, 100, 10),
ggg = rnorm(100, 100, 10)
)
colDiff <- function(x)
{
Reduce('-', rev(x))
}
tradeLegsList <- combn(names(priceList), 3, function(x) priceList[x], simplify = FALSE)
tradeList <- lapply(tradeLegsList, colDiff)
From what I can tell, Reduce is not designed to take multiple arguments.
I can do this the long way with 2* tradeLegsList[[1]]$bbb - tradeLegsList[[1]]$aaa - tradeLegsList[[1]]$ccc, and some loops, but it doesn't seem like the R way.
Is there a way to pass in a weight vector?
Ideally, I would like to pass an argument such as w = c(-1, 2, -1) to the colDiff (or Reduce) function ... or something similar.
True, Reduce is not geared to allow multiple arguments, just two for each reduction. Therefore it is easiest to premultiply the elements in the list you are Reduce-ing.
Below is a solution that does this using mapply within your colDiff function definition.
Change your definition of colDiff to allow a weight vector, and apply this using mapply with SIMPLIFY = F.
EDIT
In light of the comments: the weighting depends on the number of columns, and there is no need for rev.
The weighting by length:
length(x) == 1 -> w = 1
length(x) == 2 -> w = c(-1, 1),
length(x) == 3 -> w = c(-1, 2, -1),
length(x) == 4 -> w = c(-1, 1, -1, +1)
weighting <- function(i){
switch(i, 1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
}
colDiff <- function(x)
{
w = weighting(length(x))
Reduce('+', mapply('*', x, e2 = w, SIMPLIFY = F))
}
Then something like this would work
tradeList <- lapply(tradeLegsList, colDiff)
You could also keep with the functional programming theme and use Map, which is a simple wrapper for mapply with SIMPLIFY = F:
colDiff <- function(x)
{
w = weighting(length(x))
Reduce('+', Map('*', x , e2 = w))
}
You could also predefine the weighting within the function colDiff (which may be easier).
weighting[[2]] is weighting for when there are 2 columns, weighting[[3]] when there are 3.
colDiff <- function(x)
{
weighting <- list(1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
w = weighting[[length(x)]]
Reduce('+', Map('*', x , e2 = w))
}
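Because a weighted sum of columns is a matrix-vector product, another option is to skip Reduce entirely. The following is a sketch on made-up data; legs stands in for one element of tradeLegsList, and via_reduce and via_matmul are hypothetical names.

```r
set.seed(1)
# Three made-up price columns standing in for one element of tradeLegsList
legs <- data.frame(aaa = rnorm(100, 100, 10),
                   bbb = rnorm(100, 100, 10),
                   ccc = rnorm(100, 100, 10))
w <- c(-1, 2, -1)

# Reduce/Map approach, as in the answer
via_reduce <- Reduce('+', Map('*', legs, w))

# Matrix product: each row of prices is combined with the weights at once
via_matmul <- as.vector(as.matrix(legs) %*% w)

all.equal(via_reduce, via_matmul)
all.equal(via_matmul, 2 * legs$bbb - legs$aaa - legs$ccc)
```

With this form, colDiff could take w as a second argument and be applied with lapply(tradeLegsList, colDiff, w = w), provided length(w) matches the number of columns.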
