R - Averaging specific matrix indices over matrix

I have two matrices. The first, m1, is 100x100 and contains numbers with decimal places; the other, m2, is 300x100 and is sparsely populated with integers, like so:
m1 <- matrix(rexp(100 * 100, rate = .1), ncol = 100)
m2 <- matrix(sample(c(rep(0, 1000), rep(1, 10), rep(2, 1)), 300 * 100, replace = TRUE), 300, 100)
Each row in m1 corresponds to the column of the same number in m2. Each entry in a column of m2 gives the number of occurrences of the corresponding row of m1 in that observation (row of m2).
For each row of m2, I want the column means of the rows of m1 that appear in that row of m2, with each m1 row weighted by its count. The result should be a 300x100 matrix. What is the most efficient way of doing this?
It's a complex operation but hopefully you understand what I mean. If you need any clarification I can give it. If it helps, what I'm trying to do is to get a document features matrix from a word feature matrix and a document-term matrix.

dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222, 32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)
dtm
wvm
t(apply(dtm, 1, function(dtm_row) {
  present <- dtm_row > 0
  # drop = FALSE keeps a single selected row as a one-row matrix
  vs <- wvm[present, , drop = FALSE] * dtm_row[present]
  colMeans(vs)
}))
Solved my own problem. But if anyone wants to improve my method, I'll mark their answer as the correct one.
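For what it's worth, the row-by-row apply() above can probably be replaced by a single matrix product. The sketch below assumes, as in the posted code, that each document's count-weighted word vectors are summed and then divided by the number of distinct terms in the document. The names n_terms and doc_features are introduced here for illustration only.

```r
dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222,
                32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)

# Number of distinct terms in each document (hypothetical helper name)
n_terms <- rowSums(dtm > 0)

# Count-weighted sum of word vectors per document, then divide each row
# by its number of distinct terms. A document with no terms gives NaN (0 / 0).
doc_features <- (dtm %*% wvm) / n_terms
doc_features
```

With a matrix as sparse as m2, storing it in a sparse format (for example with the Matrix package) might help further, but that is not tested here.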

Related

Replacing Values in Randomly Generated Matrix with Additional Random Numbers

I'm working with a matrix where I need to replace values coded as 1 with new randomly generated numbers.
The starting point is a matrix like this:
set.seed(38921)
p <- matrix(nrow = 10, ncol = 25)
for(i in 1:10){
p[i, seq(1, floor(runif(1, min = 1, max = 25)), 1)] = 1
}
In the resulting p matrix, on each row I need the value of 1 to be replaced with a randomly generated integer bound between 1 and 25, where the numbers cannot repeat.
For example, on the first row of the matrix, there should be 6 randomly drawn numbers between 1 and 25, where none of the numbers are repeated, and 19 NA columns. On row two, there should be 12 randomly drawn numbers between 1 and 25 with no repeats and 13 NA columns.
Any help is greatly appreciated.
You can multiply your matrix by another matrix of random numbers in which every row is its own permutation of 1:25, so no value repeats within a row. NAs will remain NA.
p * t(replicate(10, sample(1:25)))
Or if the dimensions change:
p * t(replicate(nrow(p), sample(1:25, ncol(p))))
Where you have:
p[i, seq(1, floor(runif(1, min = 1, max = 25)), 1)] = 1
You're assigning to a range of positions, so instead of assigning 1 you need to assign an appropriately sized vector containing the elements you want. This can be generated with: sample(1:25, desiredLength, replace = FALSE)
set.seed(38921)
p <- matrix(nrow = 10, ncol = 25)
for(i in 1:10){
n = floor(runif(1, min = 1, max = 25))
p[i, seq(1, n, 1)] = sample(1:25, n, replace=F)
}
Assuming you've created your initial matrix, here's one way to do it, replacing the non-NA entries of each row with a fresh sample drawn without replacement:
t(apply(p, 1, function(x) replace(x, !is.na(x), sample(1:25, sum(!is.na(x))))))
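As a quick way to confirm that a result meets the "no repeats within a row" requirement, here is a minimal sketch that builds the matrix without a loop and then checks each row. It does not reproduce the exact matrix from the question, because the random numbers are drawn in a different order; n_filled and row_ok are names made up for this example.

```r
set.seed(38921)

# Number of filled cells per row, drawn the same way as in the question
n_filled <- floor(runif(10, min = 1, max = 25))

# Each row: n unique draws from 1:25, padded with NA up to 25 columns
p <- t(sapply(n_filled, function(n) c(sample(1:25, n), rep(NA, 25 - n))))

# TRUE for every row whose non-NA values contain no duplicates
row_ok <- apply(p, 1, function(x) !anyDuplicated(x[!is.na(x)]))
all(row_ok)
```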

Automate use of SSMcustom on KFAS for Time Series daily effect

I want to quantify the structural effect of an event in a Time Series analysis. For doing so I'm using what the KFAS package has to offer.
The problem is that the code gets tricky if I want to model n different dates, and therefore n different events, simultaneously.
Here's an example that I hope clarifies things:
library(fpp2)
library(KFAS)
# required libraries
y <- hyndsight # just for the example
plot(hyndsight)
abline(v = c(19, 35, 47), col = "red", lwd = 2)
Let's say that I want to check whether each of the "red-line" events is a "permanent shock" (i.e. it changes the level of the hyndsight series). Here's the model that can do that with KFAS functions:
# Z and T component for event on pos 19
aZ1 <- array(0, c(1, 1, length(y)))
aZ1[1, 1, (19 + 1):length(y)] <- 1 # 1s from pos 19(+1) forward
aT1 <- array(1, c(1, 1, length(y)))
# Z and T component for event on pos 35
aZ2 <- array(0, c(1, 1, length(y)))
aZ2[1, 1, (35 + 1):length(y)] <- 1 # 1s from pos 35(+1) forward
aT2 <- array(1, c(1, 1, length(y)))
# Z and T component for event on pos 47
aZ3 <- array(0, c(1, 1, length(y)))
aZ3[1, 1, (47 + 1):length(y)] <- 1 # 1s from pos 47(+1) forward
aT3 <- array(1, c(1, 1, length(y)))
And here's the actual model:
mod <- SSModel(y~0+SSMtrend(2, list(NA, NA))+SSMseasonal(12, NA)+
SSMcustom(Z = aZ1, T = aT1,
R = matrix(0, 1, 0), Q = matrix(0, 0, 0),
a1 = 0, P = matrix(0), P1inf = matrix(1))+ # first event
SSMcustom(Z = aZ2, T = aT2,
R = matrix(0, 1, 0), Q = matrix(0, 0, 0),
a1 = 0, P = matrix(0), P1inf = matrix(1))+ # second event
SSMcustom(Z = aZ3, T = aT3,
R = matrix(0, 1, 0), Q = matrix(0, 0, 0),
a1 = 0, P = matrix(0), P1inf = matrix(1)), # third event
H = NA)
initial_val <- c(0,0,0,0,0,0,0) # the first 4 are always there
fit <- fitSSM(mod, inits = initial_val)
Now, the question is, how can I "automate" this process depending on the number of events that I want to model?
As you can see, for each event I need to create an array aZ and an array aT, and pass them to the model via the SSMcustom function. What if I have a new time series and need to evaluate just two events, or four or more?
The problem is that I cannot keep adding SSMcustom terms to the model by hand. I want to pass a new time series with a vector of n events and automatically build the same model, except for the number of events to evaluate. Can I build a single SSMcustom for all the events?
Also, initial_val has to change, but that's less complicated: its length is always 4 + n, with n the number of events.
I know this question is very specific and may be better suited to CrossValidated, but I'm not sure.
Introduce "permanent shock" variables equal to 0 before the shock and equal to 1 after the shock (one variable per shock). Then add these variables as explanatory variables (regressors) in your model and check whether they are significant.
Note that each of these variables reduces the degrees of freedom, so you may also want to test a model that uses the same coefficient for all the shocks.
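The step variables described above can be generated for any number of events in one call. The following is a minimal base-R sketch; make_shock_dummies is a hypothetical helper, and the series length of 60 is an arbitrary stand-in rather than the actual length of hyndsight. How the resulting columns are passed to the model (for instance as ordinary regressors in the SSModel formula) is an assumption that should be checked against the KFAS documentation.

```r
# Hypothetical helper: one 0/1 step column per event position.
# A column is 0 up to and including the event position and 1 afterwards,
# matching the "(19 + 1):length(y)" convention used in the question.
make_shock_dummies <- function(n_obs, events) {
  shocks <- outer(seq_len(n_obs), events, ">") + 0  # logical -> numeric
  colnames(shocks) <- paste0("shock_", events)
  shocks
}

shocks <- make_shock_dummies(n_obs = 60, events = c(19, 35, 47))
head(shocks[18:22, ])
```

Because outer() always returns a matrix, the helper also works for a single event.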

Combining lists into a dataframe more efficiently

I am running multiple chains of a MCMCglmm() model and I am trying to find the most efficient way to synthesize my output.
I am using mclapply() to run 4 chains and then combining each of the 4 chains into a list with lapply().
Here is my model and code to clean up and combine the chains. I am using this helpful tutorial for running the chains: https://github.com/tmalsburg/MCMCglmm-intro
Model:
library(parallel)
chains <- mclapply(1:4, function(i) {
MCMCglmm(outcome ~ 1 + pretest + race + satisfaction*race, data = data,
random = ~ provider,
prior = prior.1,
verbose = TRUE,
family = "gaussian",
nitt = 10000,
burnin = 5000,
thin = 10)
}, mc.cores=4)
My cleanup is a little clunky. Is there a way to run a lapply command (or I think what is needed is mapply) on both the fixed and random effects to combine them into the same list and subsequent data frame? In the end, I am hoping to have a data frame so I can add/ subtract posterior distributions and run summary statistics on them.
fixed <- lapply(chains, function(m) m$Sol) # Sol = fixed effects
fixed <- do.call(mcmc.list, fixed)
summary(fixed)
random <- lapply(chains, function(m) m$VCV) # VCV = variance
random <- do.call(mcmc.list, random)
summary(random)
fixed_df <- do.call(rbind, Map(data.frame, fixed))
random_df <- do.call(rbind, Map(data.frame, random))
chains_df <- cbind(fixed_df, random_df)
Ultimately, I am hoping to run one lapply() or mapply() and have a single fixed.random list of lists. I believe I can use the Map(data.frame, fixed.random) on that to create my data frame. My knowledge of the apply function is limited, so I'm hoping to learn more and apply it (no pun intended) to my datasets.
Unfortunately, the models output MCMC objects, so I am unable to create the exact structure. This is the best I can come up with:
list1 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list2 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list3 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list4 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list5 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list6 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list7 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list8 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
fixed <- list(list1, list2, list3, list4)
random <- list(list5, list6, list7, list8)
Would the following do?
Say your four_mcmc is a list of models of the class "MCMCglmm" (chain1, chain2, etc.) and extract is the list of elements you want to read from the chains (in your case the fixed ("Sol") and random terms ("VCV")).
## The list of mcmcs
four_mcmc <- list(chain1, chain2, chain3, chain4)
## Which elements to extract from the MCMCs
extract <- c("Sol", "VCV") # fixed effects first, then variances, matching the [[1]] / [[2]] indexing used further down
You can use a get.elements function to extract a list of selected elements from a single chain:
## Extracting some specific elements from a chain
get.elements <- function(extract, mcmc) {
  ## Keep only the elements whose names are listed in 'extract'
  sapply(extract, function(name) mcmc[which(names(mcmc) == name)])
}
## Extracting the VCV and Sol from one chain
str(get.elements(extract, chain1))
You can then simply apply this function to your list of chains:
## Applying get.elements to each chain, for each element to extract
all_elements <- lapply(four_mcmc, function(mcmc, extract) get.elements(extract, mcmc), extract)
You can then easily summarise the result for each term as a data frame with the terms as rows and the chains as columns:
## Fixed terms table
fixed_terms <- as.data.frame(lapply(all_elements, function(X) X[[1]]))
## Random terms table
random_terms <- as.data.frame(lapply(all_elements, function(X) X[[2]]))
This code is simplified from the read.mulTree function from https://github.com/TGuillerme/mulTree.
[edit]
@headpoint suggested simply using:
as.data.frame(lapply(chains, function(m) cbind(m$Sol, m$VCV)))
This is more elegant but could be less portable.
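Using the toy lists from the question rather than real MCMCglmm output, a single Map() call can pair each chain's fixed and random parts and build one data frame. This is only a sketch under that assumption; the chain column and the name per_chain are additions for illustration.

```r
set.seed(1)

# Stand-ins for the per-chain fixed (a, b) and random (d, e) draws
fixed  <- replicate(4, list(a = rnorm(100), b = rnorm(100)), simplify = FALSE)
random <- replicate(4, list(d = rnorm(100), e = rnorm(100)), simplify = FALSE)

# Walk over both lists in parallel; 'id' records which chain a row came from
per_chain <- Map(function(f, r, id) data.frame(chain = id, f, r),
                 fixed, random, seq_along(fixed))

# Stack the four per-chain data frames
chains_df <- do.call(rbind, per_chain)
str(chains_df)
```

Keeping the chain identifier makes it possible to summarise per chain later, for example with aggregate(. ~ chain, data = chains_df, FUN = mean).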

How to run all "N take 2" iterations of a function?

I've written a simple correlation function that takes in three variables. "A" and "B" are numerical vectors of equal length, and "n" is the length.
Corr.fxn <- function(A, B, n){
Correlation <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
return(Correlation)
}
The function works well enough, but I have many vectors I want to process. What's the best way to modify this code to process all "N take 2" unique analyses for my set of vectors "N"?
EDIT:
Example data showing the structure of the vectors:
A <- c(-1, 0, 1, -1, 0, 1, -1, 0, 1)
B <- c(1, 1, -1, 0, 1, -1, 0, 0, 1)
...
n <- length(A)
So let's say I have vectors A through Z and I want to modify my code to output a new vector containing all {26 take 2} correlation values.
Here is one possible way you can do it assuming you have a bunch of numeric vectors in a list v as follows:
v <- list()
for (i in 1:10) {
v[[i]] <- sample(1:10, 10, replace = TRUE)
}
apply(combn(1:10, 2), 2, function(x) Corr.fxn(v[[x[1]]], v[[x[2]]], length(v[[x[1]]])))
In this answer, I assume 2 things. First, you want to write a function yourself, since otherwise you can use Hmisc::rcorr. Second, you want the "N take 2" part to be inside the function, otherwise the ways suggested earlier are correct. In that case, you can do this:
Corr.fxn <- function(vectors, n){
  pairs <- combn(length(vectors), 2)
  # square matrix: one row and one column per vector
  cor.mat <- matrix(NA, nrow = length(vectors), ncol = length(vectors))
  for (i in 1:ncol(pairs)){
    A <- vectors[[pairs[1, i]]]
    B <- vectors[[pairs[2, i]]]
    cor.mat[pairs[1, i], pairs[2, i]] <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
  }
  # mirror the upper triangle via the transpose so positions line up for any size
  cor.mat[lower.tri(cor.mat)] <- t(cor.mat)[lower.tri(cor.mat)] ###
  diag(cor.mat) <- 1 ###
  cor.mat <- data.frame(cor.mat) ###
  row.names(cor.mat) <- colnames(cor.mat) <- names(vectors) ###
  return(cor.mat)
}
The lines that end in ### are there for decorative reasons. The main input is a list called "vectors". So it works as follows:
A<- runif(100, 1, 100)
B<- runif(100, 30, 50)
C<- runif(100, 120, 200)
> Corr.fxn(list(A=A, B=B, C=C), n=100)
A B C
A 1.0000000 -0.11800104 -0.13980458
B -0.1180010 1.00000000 0.04933581
C -0.1398046 0.04933581 1.00000000
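Since Corr.fxn computes the ordinary Pearson correlation, the "N take 2" values can also be obtained with the built-in cor(). The sketch below uses made-up vectors A to D and labels each pair by name; pair_cor and full_cor are illustrative names.

```r
set.seed(42)
v <- list(A = rnorm(20), B = rnorm(20), C = rnorm(20), D = rnorm(20))

# All unordered pairs of vector names: a 2 x choose(4, 2) character matrix
pairs <- combn(names(v), 2)

# One correlation per pair, labelled e.g. "A-B"
pair_cor <- apply(pairs, 2, function(p) cor(v[[p[1]]], v[[p[2]]]))
names(pair_cor) <- apply(pairs, 2, paste, collapse = "-")
pair_cor

# Alternative: the full symmetric matrix in one call
full_cor <- cor(do.call(cbind, v))
```

The named vector is convenient when only the unique pairs are needed; the matrix form is easier to index by name, as in full_cor["A", "C"].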

calculate mean (or other function) per column for subsets of a matrix based on another matrix

I am working in R with a classifier that outputs a matrix of real values with one column for each class that I am classifying. I then apply a function to the output matrix and my class label matrix (one column per class) to calculate an error per class (column).
This worked well with small datasets and equal distributions of class and non-class rows but breaks down when I use bigger files with a skewed distribution of class versus non-class. Typically my files contain less than 0.3% class versus 99.7% non-class and in this case my classifier tends to simply output the non-class value (0).
I want to try a different error (cost) function to try to balance this out. I will also try up and down sampling but they have other issues. A possible simple change that I would like to try is to calculate the error for class 1 separately from class 0 and then combine those errors in such a way that the class errors are not buried by the overwhelming non-class errors.
I am including a minimum working example to help demonstrate what I want.
L1 <- runif(13, min=0, max=1)
L2 <- runif(13, min=0, max=1)
predy <- cbind(L1, L2) # simulated output from the classifier
#predy
L1 <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
L2 <- c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0)
classy <- cbind(L1, L2) # Simulated class matrix
#classy
# Now compute error showing existing method
mse <- apply((predy - classy)^2, 2, mean)
nrmse <- sqrt(mse / apply(classy, 2, var))
#
#nrmse
# L1 L2
# 1.343796 1.062442
#
# Sort-of-code for what I would like to have
# mse0 <- apply((predy - classy)^2, 2, mean) where x=0
# mse1 <- apply((predy - classy)^2, 2, mean) where x=1
# mse <- (mse0 + mse1) / 2 # or some similar way of combining them of my choice
# nrmse <- sqrt(mse / apply(classy, 2, var))
In addition, my files are large and my classifier model is huge and so doing this in a computationally efficient manner would be very helpful.
I managed to do it using a for loop (below), can anyone help translate this to apply?
mean.ones <- matrix(0, dim(classy)[2])
mean.zeros <- matrix(0, dim(classy)[2])
for (ix in 1:dim(classy)[2]) {
ix.ones <- classy[, ix]==1
mean.ones[ix] <- mean(predy[ix.ones, ix])
mean.zeros[ix] <- mean(predy[!ix.ones, ix])
}
The code above doesn't do the same thing as the original, it just calculates conditional means, however the code flow seems correct.
Here's a solution that takes advantage of (1) lexical scoping, so you don't have to pass the matrices to the summary function passed to the first lapply(), and (2) the fact that predy and classy have the same dimensions.
Here's the calculation of the conditional means:
# calculation of means
temp <- lapply(seq.int(ncol(predy)),
function(i)tapply(predy[,i],
classy[,i],
mean))
# presumably each column has members of both classes,
# but if not, we'll ensure that there are two members
# in each element of the list 'temp', as follows:
temp <- lapply(temp,
function(x)x[match(0:1,names(x))])
# bind the outputs together by column.
mean_mx = do.call(cbind,temp)
all(mean_mx[1,]==mean.zeros)
all(mean_mx[2,]==mean.ones)
Here's the calculation of the mean squared errors:
# calculation of MSE
temp <- lapply(seq.int(ncol(predy)),
function(i)tapply((predy[,i] - classy[,i])^2,
classy[,i],
mean))
# presumably each column has members of both classes,
# but if not, we'll ensure that there are two members
# in each element of the list 'temp', as follows:
temp <- lapply(temp,
function(x)x[match(0:1,names(x))])
# bind the outputs together by column.
mse_mx = do.call(cbind,temp)
mse0 <- mse_mx[1,]
mse1 <- mse_mx[2,]
mse <- (mse0 + mse1) / 2
nrmse <- sqrt(mse / apply(classy, 2, var))
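For large matrices, the per-class means can also be computed without lapply() or tapply() by masking the squared errors and using colSums(). This is a sketch under the assumption that classy contains only 0s and 1s; the simulated data and the names sq_err and is_one are illustrative.

```r
set.seed(1)
predy  <- cbind(L1 = runif(13), L2 = runif(13))  # simulated classifier output
classy <- cbind(L1 = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
                L2 = c(0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0))

sq_err <- (predy - classy)^2
is_one <- classy == 1

# Per-column mean squared error within each class:
# sum of masked squared errors divided by the class size
mse1 <- colSums(sq_err * is_one)    / colSums(is_one)
mse0 <- colSums(sq_err * (!is_one)) / colSums(!is_one)

mse   <- (mse0 + mse1) / 2
nrmse <- sqrt(mse / apply(classy, 2, var))
nrmse
```

A column with no members of one class produces NaN for that class (0 / 0), so such columns would need separate handling.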
