So I've been pondering how to do this without a for loop and I couldn't come up with a good answer. Here is an example of what I mean:
sampleData <- matrix(rnorm(25,0,1),5,5)
meanVec <- vector(length=length(sampleData[,1]))
for (i in 1:length(sampleData[,1])) {
  subMat <- sampleData[1:i,]
  ifelse(i == 1, sumVec <- sum(subMat), sumVec <- apply(subMat, 2, sum))
  meanVec[i] <- mean(sumVec)
}
meanVec
The actual matrix I want to do this to is reasonably large, and to be honest, for this application it won't make a huge difference in speed, but it's a question I think should be answered:
How can I get rid of that for loop and replace with some *ply call?
Edit: In the example given, I generate sample data and define a vector whose length equals the number of rows in the matrix.
The for loop does the following steps:
1) takes a submatrix, from row 1 to row i
2) if i is 1, it just sums up the values in that vector
3) if i is not 1, it gets the sum of each row, then gets the mean of those sums and stores that in position i of the vector meanVec.
Finally, it prints out meanVec.
This does what you describe:
cumsum(rowSums(sampleData))/seq_len(nrow(sampleData))
However, your code doesn't do the same thing: apply(subMat, 2, sum) takes column sums, not row sums.
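If you specifically want the loop replaced by a *ply call, a direct sapply() translation of the loop body (reproducing its output, including the i == 1 special case) would be a sketch like:

sapply(seq_len(nrow(sampleData)), function(i) {
  subMat <- sampleData[1:i, ]
  # mirror the original: plain sum for the first row, column sums afterwards
  if (i == 1) sum(subMat) else mean(apply(subMat, 2, sum))
})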
I have a data frame composed of numeric values. I calculated the standard deviation and mean for each column and created Upper_Bound and Lower_Bound vectors as follows:
std_devs = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, sd)
means = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, mean)
Upper_Bound = means + 3*std_devs
Lower_Bound = means - 3*std_devs
Now I want to detect the rows that have at least one value that does not fall between the relevant upper and lower bounds. For example, a value in column j must be greater than or equal to Lower_Bound[j] and less than or equal to Upper_Bound[j]; if at least one value in row i violates this condition, I want to save the index of that row (I also have row names; saving row names would be fine too). What I want to obtain is a vector of indices (or row names) identifying all rows which violate the rule. I tried the following:
outliers = apply(my_data ,1, between(x,Lower_Bound, Upper_Bound,incbounds = TRUE))
But I guess it was too much to expect between() to automatically go over every value in a row and compare it with the relevant bounds. This was my second hopeless attempt, which did not work:
outliers = apply(exp_vars_numeric,1, apply(x,2,between(x,Lower_Bound, Upper_Bound, incbounds = TRUE)))
I know that I can do it with a for loop, but I am hoping for a more efficient solution. Any suggestion is highly appreciated.
Thanks in advance.
Consider keeping everything in one data frame by adding lower- and upper-bound columns, with help of ave() for inline aggregation of sd and mean. Then run a conditional ifelse() to flag such rows.
num_cols <- sapply(exp_vars,is.numeric)
num_names <- colnames(exp_vars)[num_cols]
means    <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=mean))  # column mean, repeated per row
std_devs <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=sd))    # column sd, repeated per row
exp_vars[,paste0(num_names, "_lower")] <- means - 3*std_devs
exp_vars[,paste0(num_names, "_upper")] <- means + 3*std_devs
# CONDITIONALLY ASSIGN FLAG COLS
exp_vars[,paste0(num_names, "_flag")] <- ifelse(exp_vars[,num_names] >= exp_vars[,paste0(num_names, "_lower")] &
exp_vars[,num_names] <= exp_vars[,paste0(num_names, "_upper")], 1, 0)
# ADD ALL FLAG COLS HORIZONTALLY: A ROW VIOLATES THE RULE IF ANY FLAG IS 0
exp_vars$index <- ifelse(rowSums(exp_vars[,paste0(num_names, "_flag")]) < length(num_names),
                         row.names(exp_vars), NA)
exp_vars[!is.na(exp_vars$index), ]
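For reference, the same check can also be written without the helper columns, reusing the Lower_Bound and Upper_Bound vectors from the question. A minimal sketch with sweep():

num_mat <- as.matrix(exp_vars[, sapply(exp_vars, is.numeric)])
viol <- sweep(num_mat, 2, Lower_Bound, "<") | sweep(num_mat, 2, Upper_Bound, ">")
row.names(exp_vars)[rowSums(viol) > 0]  # rows with at least one out-of-bounds value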
It is recommended to include a small example of what your data looks like, so that it is easier for us to respond to your question :) I generated data frames based on your description, and it seems that the following solves your problem:
library(data.table)  # for between()
df <- data.frame(a=c(1:10), b=c(5:14))
ncols <- ncol(df)
bounds <- data.frame(lower=seq(.5,5,.5),upper=seq(6.5,11,.5))
one_plus_fall_outside <- sapply(1:nrow(df),
                                function(i)
                                  sum(between(df[i,], bounds$lower[i], bounds$upper[i])) / ncols < 1)
which(one_plus_fall_outside)
You can check whether this works by looking at all the columns together:
cbind(df,bounds,one_plus_fall_outside)
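Note that in the question the bounds are per column rather than per row. A sketch of the column-wise variant, where col_lower and col_upper are hypothetical vectors holding one bound per column of df:

col_lower <- c(0.5, 6.5)   # hypothetical per-column lower bounds
col_upper <- c(5.0, 11.0)  # hypothetical per-column upper bounds
out <- mapply(function(col, lo, hi) col < lo | col > hi, df, col_lower, col_upper)
which(rowSums(out) > 0)    # rows with at least one value outside its column's bounds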
I have a big data table called "dt", and I want to produce a data table of the same dimensions which gives the deviation from the row mean of each entry in dt.
This code works, but it seems very slow to me. I hope there's a way to do it faster. Maybe I'm building my table wrong, so I'm not taking advantage of by-reference assignment, or maybe this is as good as it gets?
(I'm an R novice, so any other tips are appreciated!)
Here is my code:
library(data.table)
r <- 100  # number of rows
c <- 100  # number of columns

# build a data table with random cols
# (maybe not the best way to build it, but this isn't important)
dt <- data.table(rnorm(r))
for (i in c(1:(c-1))) {
  dt <- cbind(dt, rnorm(r))
}
colnames(dt) <- as.character(c(1:c))

devs <- copy(dt)
means <- rowMeans(dt)
for (i in c(1:nrow(devs))) {
  devs[i, colnames(devs) := abs(dt[i,] - means[[i]])]
}
If you subtract a vector from a data.frame (or data.table), that vector is subtracted from every column of the data.frame (assuming they're all numeric). Numeric functions like abs() also work on all-numeric data.frames, so you can compute devs with
devs <- abs(dt - rowMeans(dt))
You don't need a loop to create dt either: you can use replicate(), which evaluates its second argument the number of times specified by the first argument and arranges the results in a matrix (unless simplify = FALSE is given as an argument):
dt <- as.data.table(replicate(c, rnorm(r)))  # c columns of r random values each
Not sure if it's what you are looking for, but the sweep() function will help you apply operations combining matrices and vectors (like your row means).
table <- matrix(rnorm(r*c), nrow=r, ncol=c) # generate random matrix
means <- apply(table, 1, mean) # compute row means
devs <- abs(sweep(table, 1, means, "-")) # compute by row the deviation from the row mean
Would someone be able to explain why this apply doesn't work correctly? I wanted to normalise the values in each row by the sum of that row, such that each row sums to 1. However, when I did this using an apply function, the answer is incorrect.
data <- data.frame(Sample=c("A","B","C"),val1=c(1235,34567,234346),val2=c(3445,23446,234235),val3=c(457643,234567,754234))
norm <- function(x) {
  x / sum(x)
}
applymeth <- data
applymeth[,2:4] <- apply(applymeth[,2:4], 1, norm)
rowSums(applymeth[,2:4])
loopmeth <- data
for (i in 1:nrow(data)) {
  loopmeth[i,2:4] <- norm(loopmeth[i,2:4])
}
rowSums(loopmeth[,2:4])
Thanks.
apply() fills its result matrix column by column, in your case from row-by-row input, so you have to transpose the result:
applymeth <- data
applymeth[,2:4] <- t(apply(applymeth[,2:4], 1, norm))
rowSums(applymeth[,2:4])
Have a look at
apply(matrix(1:12, 3), 1, norm)
The reason for this result of apply() is a convention: in a matrix or a multidimensional array, the index of the first dimension runs fastest, then the second, and so on. Example:
array(1:12, dim=c(2,2,3))
So (without any reorganisation of the data) apply() produces one column after the other. This behaviour does not depend on the MARGIN= parameter of apply().
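A small illustration: each call's result becomes one column of the output, so a row-wise apply() comes back transposed:

m <- matrix(1:6, nrow = 2)       # 2 x 3
apply(m, 1, function(r) r * 10)  # 3 x 2: the result for each row becomes a column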
Suppose X is a vector of length 100 holding the X position of 100 individuals. All agents start at position 0:
X <- rep(0,100)
but they are embedded in a world with boundaries. I have a function that randomly changes the X positions of all the agents at a given time:
Store <- X
X <- X + runif(100)
Eventually, an agent will reach the boundary and, at that point, it should stay within the limits. The simplest way to do this is to loop through the vector and check with if (in pseudo-code):
for (i in 1:length(X)) {
  if (X[i] >= lower && X[i] <= upper) {  # lower/upper: the world boundaries
    # keep the new X[i]
  } else {
    X[i] <- Store[i]  # revert to the stored position
  }
}
This is fine for 100 individuals, but the for loop adds too much computational time if the number of individuals (and the length of the vector) increases to, for example, 1,000,000.
Is there a more straightforward way to do it? I was thinking that maybe I could skip the explicit reassignment of values that exceed the threshold during:
X <- X + runif(100)
EDIT: Also, imagine that X is not a vector but a matrix.
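A vectorized sketch of the revert step, assuming lower and upper hold the (otherwise unspecified) boundary values; logical indexing like this works unchanged when X is a matrix:

Store <- X
X <- X + runif(length(X))       # propose new positions
ok <- X >= lower & X <= upper   # TRUE where the move stays within the boundaries
X[!ok] <- Store[!ok]            # revert out-of-bounds moves to the stored positions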
I realize this question is relatively old, but I just had the same question, so I didn't want to leave it unanswered.
Limiting a vector or matrix to values within a certain range can be done concisely by combining an apply statement with the min and max functions, as shown in the example below.
# Create sample vector
X <- c(1:100); print(X)
# Create sample matrix
M <- matrix(c(1:100),nrow=10); print(M)
# Set limits
minV <- 15; maxV <- 85;
# Limit vector
sapply(X, function(y) min(max(y,minV),maxV))
# Limit matrix
apply(M, c(1, 2), function(x) min(max(x,minV),maxV))
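For large inputs, the fully vectorized pmin()/pmax() pair gives the same clamping without a function call per element, and works unchanged on vectors and matrices:

pmin(pmax(X, minV), maxV)  # limit vector
pmin(pmax(M, minV), maxV)  # limit matrix, shape preserved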
For further information on the apply functionality I would refer to the R documentation and this article on R-Bloggers:
https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
When I first came across apply statements I found it a difficult concept to wrap my head around, but would now consider it one of R's most powerful features.
I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400,000 rows. The first 30 columns hold values, and the 31st column holds a column number. I need to retrieve, for each row, the value in the column indicated by that row's 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3, i.e. 72, is the one sought)
The too slow loop looks like this:
a <- rep(0, nrow(data))  # to pre-allocate memory
for (i in 1:nrow(data)) {
  a[i] <- data[i, data[i,31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can index a matrix both in array format and in vector format (here a 30 * 400000-long vector after taking columns 1:30), counting column-wise first. To count row-wise, you use the transpose.
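A small demonstration of the two indexing modes:

m <- matrix(1:6, nrow = 2)  # columns: (1,2), (3,4), (5,6)
m[5]     # 5, i.e. m[1,3]: linear indexing runs down the columns
t(m)[2]  # 3, i.e. m[1,2]: transposing first makes it run along the rows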
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## single index = nrow * (column - 1) + row
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try changing the code to work one column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
  a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This sets all elements of a to the values from column i wherever the last column equals i. It took longer to build the matrix than to calculate the vector a.
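To check the timing claim on your own machine (no numbers asserted here), system.time() can wrap each step:

system.time(M <- matrix(rpois(30*400000, 50), 400000, 30))  # building the matrix
system.time({                                               # filling a column at a time
  a <- rep(0, nrow(MM))
  for (i in 1:(ncol(MM) - 1)) {
    sel <- MM[, ncol(MM)] == i
    a[sel] <- MM[sel, i]
  }
})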