I am working with the consumer price index (CPI), and to calculate it I have to multiply the index matrix by the corresponding weights:
grossCPI77_10 <- grossIND1977 %*% weights1910/100
grossCPI82_10 <- grossIND1982 %*% weights1910/100
Of course, I would rather have code like the one below:
grossIND1982 <- replicate(20, cbind(1:61))
grossIND1993 <- replicate(20, cbind(1:61))
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
totalCPI <- mapply("*", grossIND_list, weights1910_sc)
The problem is that it gives me a 1220x20 matrix. I expected a normal matrix (61x20) by vector (20x1) multiplication, which should result in a 61x1 vector. Could you explain what I am doing wrong? Thanks.
Part of your problem is that you don't have matrices but 3D arrays, with one singleton dimension. The other issue is that mapply tries to combine the results into a matrix, and constant arguments should be passed via MoreArgs. But really, this is more a case for lapply.
grossIND1982 <- replicate(20, cbind(1:61))[,1,]
grossIND1993 <- replicate(20, cbind(1:61))[,1,]
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
totalCPI <- mapply("*", grossIND_list, MoreArgs = list(e2 = weights1910_sc), SIMPLIFY = FALSE)
# or, more simply, with lapply:
totalCPI <- lapply(grossIND_list, "*", e2 = weights1910_sc)
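For completeness, a quick sanity check with the dummy data from the question that the lapply version keeps the expected 61x20 shape (building the list explicitly rather than via mget, so the snippet is self-contained):

```r
# dummy data as in the question, with the singleton dimension dropped
grossIND1982 <- replicate(20, cbind(1:61))[, 1, ]
grossIND1993 <- replicate(20, cbind(1:61))[, 1, ]
weights1910_sc <- 1:20
grossIND_list <- list(grossIND1982 = grossIND1982, grossIND1993 = grossIND1993)

totalCPI <- lapply(grossIND_list, "*", e2 = weights1910_sc)
sapply(totalCPI, dim)  # each element keeps its 61 x 20 shape
```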
I am not sure if I understood all aspects of your problem (especially concerning what should be columns, what should be rows, and in which order the cross product should be applied), but I will try at least to cover some aspects. See the comments in the code below for clarification of what you did and what you might want. I hope it helps; let me know if this is what you need.
#instead of using mget, I recommend to use a list structure
#otherwise you might capture other variables with similar names
#that you do not want
INDlist <- sapply(c("1990", "1991"), function(x) {
#this is how to set up a matrix correctly, check `?matrix`
#I think your combination of replicate and cbind did not give you what you wanted
matrix(rep(1:61, 20), nrow = 61)
}, USE.NAMES = TRUE, simplify = F)
weights <- list(c(1:20))
#the first argument of mapply needs to be a function, in this case of two variables
#the body of the function calculates the cross product
#you feed the arguments (both lists) in the following part of mapply
#I have repeated your weights, but you might assign different weights for each year
res <- mapply(function(x, y) {x %*% y}, INDlist, rep(weights, length(INDlist)))
dim(res)
#[1] 61 2
When trying to find the maximum values of a split list, I run into serious performance issues.
Is there a way I can optimize the following code:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind(y, x)
my_data <- data.frame(my_data)
# This is the critical part I would like to optimize
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
I want to get the rows where a given column hits its maximum for a given group (it should be easier to understand from the code).
I know that splitting into a list is probably the reason for the slow performance, but I don't know how to circumvent it.
This may not be immediately clear, so let me walk through it. There is a base function max.col doing something similar, except that it finds the position index of the maximum along each matrix row (not column). So if you transpose your original matrix x, you will be able to use this function.
Complexity steps in when you want to do max.col by group. The split-lapply convention is needed. But, if after the transpose, we convert the matrix to a data frame, we can do split.default. (Note it is not split or split.data.frame. Here the data frame is treated as a list (vector), so the split happens among the data frame columns.) Finally, we do an sapply to apply max.col by group and cbind the result into a matrix.
tx <- data.frame(t(x))
tx.group <- split.default(tx, y) ## note the `split.default`, not `split`
pos <- sapply(tx.group, max.col)
The resulting pos is something like a look-up table. It has 9000 rows and 100 columns (groups). The pos[i, j] gives the index you want for the i-th column (of your original non-transposed matrix) and j-th group. So your final extraction for the 50-th column and all groups is
max_values <- Map("[[", tx.group, pos[50, ])
You just generate the look-up table once, and make arbitrary extraction at any time.
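As a small self-contained sketch of that reuse (with much smaller dimensions than in the question, for brevity), building pos once and then extracting for an arbitrary column:

```r
set.seed(1)
x <- matrix(runif(12 * 30), nrow = 12, ncol = 30)  # 12 rows, 30 columns
y <- rep(1:4, each = 3)                            # 4 groups of 3 rows each

tx <- data.frame(t(x))
tx.group <- split.default(tx, y)   # split the 12 columns of tx into 4 groups
pos <- sapply(tx.group, max.col)   # 30 x 4 look-up table, built once

# extraction for, say, the 7th (original) column across all groups
max_values_col7 <- Map("[[", tx.group, pos[7, ])
```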
Disadvantage of this method:
After the split, data in each group are stored in a data frame rather than a matrix. That is, for example, tx.group[[1]] is a 9000 x 9 data frame. But max.col expects a matrix so it will convert this data frame into a matrix internally.
Thus, the major performance / memory overhead includes:
initial matrix transposition;
matrix to data frame conversion;
data frame to matrix conversion (per group).
I am not sure whether we can eliminate all of the above with functions from the matrixStats package; I look forward to seeing a solution along those lines. But in any case, this answer is already much faster than what the OP originally does.
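For what it's worth, if only a single column (here the 50th of x) is ever of interest, a base-R sketch that avoids the transposition and both data-frame conversions is to split the row indices rather than the data itself:

```r
set.seed(1)
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)

# split row indices, not data: find each group's argmax in column 50 of x
idx <- vapply(split(seq_along(y), y),
              function(i) i[which.max(x[i, 50])],
              integer(1))
max_values <- x[idx, ]   # one row per group, still a plain matrix
```

Because no data frame is ever built, the per-group work is just a which.max over a few numbers.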
A solution using {dplyr}:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind.data.frame(y, x)
# This is the critical part I would like to optimize
system.time({
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
})
# Using {dplyr} is 9 times faster, but you get results in a slightly different format
library(dplyr)
system.time({
max_values2 <- my_data %>%
group_by(y) %>%
do(max_values = .[which.max(.[[50]]), ])
})
all.equal(max_values[[1]], max_values2$max_values[[1]], check.attributes = FALSE)
I have a dataframe with binary values like so:
df<-data.frame(a=rep(c(1,0),9),b=rep(c(0,1,0),6),c=rep(c(0,1),9))
Purpose is to first obtain all pairwise combinations :
combos <- function(df, n) {
unlist(lapply(n, function(x) combn(df, x, simplify=F)), recursive=F)
}
combos(df,2)->j
Next I want to get the proportion of pairs for which both columns in each dataframe in list j have either (0,0) or (1,1). I can get the proportions like so:
lapply(j, function(x) data.frame(new = rowSums(x[,1:2])))->k
lapply(k, function(x) data.frame(prop1 = length(which(x==1))/18,prop2=length(which(x==0|x==2))/18))
However this seems slow and complicated for larger lists. Couple of questions:
1) Is there a faster/better method than this? My actual list is 20 dataframes, each with dim 250 x 400. I tried dist(df, method = "binary"), but it looks like the binary method does not take (0,0) instances into account.
2) Also, why does dividing by length(x[1]) or lengths(x[1]) not give me 18? In the example I divided by specifying the length of the vector new.
Any help is very much appreciated!
#Get the combinations
j = combn(x = df, m = 2, simplify = FALSE)
#Get the Proportions
sapply(j, function(x) length(which(x[1] == x[2]))/NROW(x))
As #thelatemail commented, if you are not concerned with storing the intermediate combinations, you can just do at once using
combn(x = df, m = 2, FUN=function(x) length(which(x[1] == x[2]))/NROW(x))
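Regarding question 2): single-bracket indexing on a data frame returns a one-column data frame, and a data frame is a list, so its length is its number of columns, not its number of rows:

```r
df <- data.frame(a = rep(c(1, 0), 9), b = rep(c(0, 1, 0), 6), c = rep(c(0, 1), 9))
length(df[1])    # 1  -- df[1] is still a (one-column) data frame, i.e. a list of length 1
length(df[[1]])  # 18 -- [[ ]] extracts the column as a plain vector
NROW(df)         # 18 -- row count, which is what the division needs
```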
(Please feel free to change the title to something more appropriate)
I would like to extract all reciprocal pairs from an asymmetric square matrix.
Some dummy data to clarify:
m <- matrix(c(NA,0,1,0,0,-1,NA,1,-1,0,1,1,NA,-1,-1,-1,1,0,NA,0,-1,1,0,0,NA), ncol=5, nrow=5)
colnames(m) <- letters[seq(ncol(m))]
rownames(m) <- letters[seq(nrow(m))]
require(reshape2)
m.m <- melt(m) # get all pairs
m.m <- m.m[complete.cases(m.m),] # remove NAs
How would I now extract all "reciprocal duplicates" from m.m (or directly from m)?
This is what I mean with reciprocal duplicate:
Var1 Var2 value
b a 0
a b -1
And I would like to store each value combination, i.e. {1,1},{-1,-1},{1,0},{-1,0},{0,0} in a list with its Var combination {a,b},{a,c},{a,d},{a,e},{b,c},{b,d},{b,e},{c,d},{c,e},{d,e} pointing to it, something like
$`a,b`
[1] 0,-1
I haven't managed to solve this. It feels like it could be possible with merge() or inner_join. Also, I apologize for not providing the best example.
Any pointers would be highly appreciated.
Here's an approach based on the object m.m:
# extract the unique combinations
levs <- apply(m.m[-3], 1, function(x) paste(sort(x), collapse = ","))
# create a list of values for these combinations
split(m.m$value, levs)
Using the matrix representation, you can get vectors of each triangle of the matrix (which align as you wish) using:
m[upper.tri(m)]
t(m)[upper.tri(m)]
To name them:
nm <- matrix(paste("(",rep(rownames(m),times=nrow(m)), ",",rep(rownames(m),each=nrow(m)),")",sep=""), nrow=nrow(m))
nm[as.vector(upper.tri(m))]
Finally, to convert to a list as you wish: first I put them in a new 10 x 2 matrix, then I used lapply to create the list structure.
pairs<- cbind(m[upper.tri(m)], t(m)[upper.tri(m)] )
rownames(pairs) <- nm[as.vector(upper.tri(m))]
pairs
m.list <- lapply(seq_len(nrow(pairs)),function(i) pairs[i,])
names(m.list) <- rownames(pairs)
m.list
So, I'm trying to generate random numbers from multivariate normal distributions with different means. I'm also trying to use the apply functions and not for loops, which is where the problem occurs. Here is my code:
library(MASS)
set.seed(123)
# X and Y means
Means<-cbind(c(.2,.2,.8),c(.2,.6,.8))
Means
Sigma<-matrix(c(.01,0,0,.01),nrow=2)
Sigma
data<-apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
data
Instead of getting two vectors with the X and Y points for the three means, I get three vectors with the X and Y points stacked. What is the best way to get the two vectors? I know I could unstack them manually, but I feel R should have some slick way of getting this done.
I'm not sure if it's what I would call 'slick', but if you really want to use apply (instead of lapply, as previously mentioned), you can force apply to return your results as a list of matrices. Then it's just a matter of sticking the results together. I expect that this would be less error-prone than trying to rebuild a two-column matrix.
data <- apply(Means, 1, function(x) {
list(mvrnorm(n=10, mu=x, Sigma=Sigma))
})
data <- do.call('rbind', unlist(data, recursive=FALSE))
Try:
set.seed(42)
res1 <- lapply(seq_len(nrow(Means)), function(i) mvrnorm(n = 10, mu = Means[i, ], Sigma = Sigma))
Checking with the results of apply
set.seed(42)
res2 <- apply(X=Means,MARGIN=1,FUN=mvrnorm,n=10,Sigma=Sigma)
dim(res2) <- c(10,2, 3)
res3 <-lapply(1:dim(res2)[3], function(i) res2[,,i])
all.equal(res3, res1, check.attributes=FALSE)
#[1] TRUE
I have 1000 matrices named A1, A2, A3,...A1000.
In a for loop I would like to simply take the colMeans() of each matrix:
for (i in 1:1000){
means[i,]<-colMeans(A1)
}
I would like to do this for each matrix Ax. Is there a way to put Ai instead of A1 in the for loop?
So, one way is:
for (i in 1:1000){
means[i,]<-colMeans(get(paste('A', i, sep = '')))
}
but I think that misses the point of some of the comments, i.e., you probably had to do something like this:
csvs = lapply(list.files('.', pattern = '^A[0-9]+\\.csv$'), function(fname) {
read.csv(fname)
})
Then the answer to your question is:
means = lapply(csvs, colMeans)
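If you then want the means matrix from the question (one row of column means per matrix), the list can be bound together with do.call(rbind, ...). A sketch using simulated matrices as stand-ins for the CSVs:

```r
# simulated stand-ins for the matrices read from the CSV files
csvs <- lapply(1:3, function(i) matrix(rnorm(12), ncol = 3))

means_list <- lapply(csvs, colMeans)
means <- do.call(rbind, means_list)  # row i holds the column means of matrix i
```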
I don't completely understand, but maybe you have assigned each matrix to a different variable name? That is not the best structure, but you can recover from it:
# Simulate the awful data structure.
matrix.names<-paste0('A',1:1000)
for (name in matrix.names) assign(name,matrix(rnorm(9),ncol=3))
# Pull it into an appropriate list
list.of.matrices<-lapply(matrix.names,get)
# Calculate the column means
column.mean.by.matrix<-sapply(list.of.matrices,colMeans)
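Note that sapply puts each matrix's column means into a column of the result; if you want one row per matrix, as the means[i,] indexing in the question suggests, transpose it. A small self-contained sketch with 3 matrices in place of 1000:

```r
matrix.names <- paste0('A', 1:3)
for (name in matrix.names) assign(name, matrix(rnorm(9), ncol = 3))

list.of.matrices <- mget(matrix.names)                       # fetch them into a named list
column.mean.by.matrix <- sapply(list.of.matrices, colMeans)  # one column per matrix
means <- t(column.mean.by.matrix)                            # now means[i, ] is colMeans of Ai
```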
Your initial question asks for a 'for loop' solution. However, there is an easy way to get the desired result if we use an 'apply' function.
Perhaps putting the matrices into a list, and then applying a function would prove worthwhile.
### Create matrices
A1 <- matrix(1:4, nrow = 2, ncol = 2)
A2 <- matrix(5:8, nrow = 2, ncol = 2)
A3 <- matrix(11:14, nrow = 2, ncol = 2)
### Create a vector of names
names <- paste0('A', 1:3)
### Create a list of matrices, and assign names
list <- lapply(names, get)
names(list) <- names
### Apply the function 'colMeans' to every matrix in our list
sapply(list, colMeans)
I hope this was useful!
As others wrote already, using a list is perhaps your best option. First you'll need to place your 1000 matrices in a list, most easily accomplished using a for-loop (see several posts above). Your next step is more important: using another for-loop to calculate the summary statistics (colMeans).
To apply a for-loop to an R object, you generally have one of two options:
Loop over indices, for example:
for(i in 1:10){head(mat[i])} #simplistic example
Loop "directly":
for(i in mat){print(i)} #simplistic example
When looping through R lists, the FIRST option will be much easier to set up. Here is the idea adapted to your example:
column_means <- matrix(NA, nrow = length(list_of_matrices), ncol = ncol(list_of_matrices[[1]])) #matrix to store column means
for (i in 1:length(list_of_matrices)){
mat <- list_of_matrices[[i]] #temporarily store individual matrices
##be sure also to use double brackets!
column_means[i, ] <- colMeans(mat) #fill row i rather than growing a vector
}