I have a dataset (x) that contains pairs of start and end rows:
1 10
20 30
34 38
59 83
...
I have a big n x 1 matrix (mat). I want to assign the value 1 to every row covered by a range in x. For example:
mat[1:10,1] = 1
mat[20:30,1] = 1
etc...
x is quite big, and in R the following loop takes a while:
for ( j in 1:dim(x)[1] ) {
mat[x[j,1]:x[j,2], 1] <- 1
}
Please help me if there is a faster way to do this. Thanks.
You can easily make a list of the rows you want to assign a value of 1 to in your big matrix, using apply on x with seq.int to get the row numbers like this...
rows <- unlist( apply( x , 1 , FUN = function(x){ seq.int(x[1],x[2])}) )
rows
# [1] 1 2 3 4 5 6 7 8 9 10 20 21 22 23 24 25 26 27 28 29 30 34 35 36 37 38 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
Then assign with a single subsetting operation, which will be much faster, like this:
mat[ rows , 1 ] <- 1
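As a quick sanity check, here is a minimal sketch that runs both versions on the four ranges from the question and confirms they fill the same rows. The matrix length n = 100 is an assumption (the question does not give it), and mat_loop / mat_vec are illustrative names.

```r
# Ranges from the question: one (start, stop) pair per row
x <- matrix(c(1, 10, 20, 30, 34, 38, 59, 83), ncol = 2, byrow = TRUE)
n <- 100  # assumed length of the n x 1 matrix

# Original approach: one assignment per range
mat_loop <- matrix(0, nrow = n, ncol = 1)
for (j in seq_len(nrow(x))) {
  mat_loop[x[j, 1]:x[j, 2], 1] <- 1
}

# Vectorized approach: expand all ranges to row numbers, then assign once
rows <- unlist(apply(x, 1, function(r) seq.int(r[1], r[2])))
mat_vec <- matrix(0, nrow = n, ncol = 1)
mat_vec[rows, 1] <- 1

identical(mat_loop, mat_vec)
```

The gain comes from replacing many small assignments with one, so it should matter most when x has many rows.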
If m is your set of start and stop locations:
m <- matrix(scan(), ncol=2)
#------
1: 1 10
3: 20 30
5: 34 38
7: 59 83
9:
Read 8 items
mapply( seq.int, m[,1], m[,2])
rx1[ unlist( mapply( seq.int, m[,1], m[,2]) ), 1] <- 1
(Trivially different than SimonO101's earlier contribution.)
data.table usually excels in cases like this. Here is a data.table-based solution:
library(data.table)
indexes<-data.table(istart=c(1L,20L,34L,59L), istop=c(10L,30L,38L,83L))
mat<-data.table(val=sample(1L:1e5L,1e5))
mat[indexes[,list(i=seq(istart,istop)),by="istart"][,i],val:=1L]
How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i])-df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set from initial_Z and the first value of Y. The rest of the values of Z are calculated from the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
if (i == 1) return(prevZ - Y[i])
prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses for loops internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce only iterates over one value, and we need both X and Y, the indices (rows) are a required step.
The same can be accomplished using accumulate2 from purrr. Note that these are still just for loops under the hood. If the loop is actually causing a problem in R, you should consider writing it in Rcpp.
library(dplyr)
library(purrr)
df %>%
mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
Z is a list column here, because accumulate2 returns a list. You could unlist it to get a plain numeric column:
df %>%
mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000))[-1])
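For this particular recursion there may also be a way to avoid the loop entirely. Multiplying Z[i] = Z[i-1]*X[i-1]/X[i] - Y[i] through by X[i] gives Z[i]*X[i] = Z[i-1]*X[i-1] - X[i]*Y[i], which is a plain running sum, so Z[i] = (X[1]*initial_Z - cumsum(X*Y)[i]) / X[i]. The sketch below is not from the answers above; it assumes no X is zero and uses the X and Y values printed in the question. The names z_loop and z_vec are illustrative.

```r
initial_Z <- 1000
X <- c(20, 69, 92, 28, 10, 70, 53, 81, 96, 11)
Y <- c(27, 49, 32, 56, 26, 20, 39, 89, 55, 84)

# Reference: the explicit loop from the question
z_loop <- numeric(length(X))
z_loop[1] <- initial_Z - Y[1]
for (i in 2:length(X)) {
  z_loop[i] <- z_loop[i - 1] * X[i - 1] / X[i] - Y[i]
}

# Closed form: Z * X follows a cumulative sum (assumes all X != 0)
z_vec <- (X[1] * initial_Z - cumsum(X * Y)) / X

# The two agree up to floating point rounding
all.equal(z_loop, z_vec)
```

Since cumsum is vectorized, this form should suit a simulation that repeats the calculation many times, though the two versions can differ in the last few floating point digits.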
I would like to perform a rowSums based on specific values for multiple columns (i.e. multiple conditions). I know how to rowSums based on a single condition (see example below) but can't seem to figure out multiple conditions.
# rowSums with single, global condition
set.seed(100)
df <- data.frame(a = sample(0:100,10),
b = sample(0:100,10),
c = sample(0:100,10),
d = sample(0:100,10))
print(df)
a b c d
1 31 63 54 49
2 25 88 71 92
3 54 27 53 34
4 5 39 73 93
5 45 73 40 67
6 46 64 16 85
7 77 19 97 17
8 34 33 82 59
9 50 93 51 99
10 15 100 25 11
Single Condition Works
df$ROWSUMS <- rowSums(df[,1:4] <= 50)
# And produces
a b c d ROWSUMS
1 31 63 54 49 2
2 25 88 71 92 1
3 54 27 53 34 2
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 2
8 34 33 82 59 2
9 50 93 51 99 1
10 15 100 25 11 3
Multiple Conditions Don't Work
df$ROWSUMS_Multi <- rowSums(df[,1] <= 50 | df[,2] <= 25 | df[,3] <= 75)
Error in rowSums(df[, 1] <= 50 | df[, 2] <= 25 | df[, 3] <= 75) :
'x' must be an array of at least two dimensions
Desired Output
a b c d ROWSUMS_Multi
1 31 63 54 49 2
2 25 88 71 92 2
3 54 27 53 34 1
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 1
8 34 33 82 59 1
9 50 93 51 99 2
10 15 100 25 11 2
I could just be sub-setting incorrectly, but I haven't been able to find a fix.
One problem with [ when selecting a single column in the row, column form (e.g. df[, 1]) is that it coerces the data.frame to a vector. From ?Extract:
x[i, j, ... , drop = TRUE]
NOTE, drop is TRUE by default
and later in the documentation
drop - For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
To avoid that, either use drop = FALSE or simply omit the comma. Indexing without a comma, e.g. df[1], returns a single-column data.frame, because for a data.frame an index without a comma is treated as a column index rather than a row index.
rowSums(df[1] <= 50 | df[2] <= 25 | df[3] <= 75)
Update
Based on the expected output, the rowSums can be written as
df$ROWSUMS <- rowSums(df[1:3] <= c(50, 25, 75)[col(df[1:3])])
df$ROWSUMS
#[1] 2 2 1 2 2 2 1 1 2 2
NOTE: The earlier part of this answer only explained why rowSums didn't work; I hadn't checked the expected output. Here, we need to compare 3 columns against different values. When we do
df[1] <= 50
the result is a single column of TRUE/FALSE values.
When we combine comparisons with |, as in
df[1] <= 50 | df[2] <= 25
the result is still a single column of TRUE/FALSE. The only difference is that a row containing TRUE/FALSE or FALSE/TRUE has been collapsed to a single TRUE. The same holds when we chain n logical comparisons with |. Instead, use +, which does an elementwise sum:
((df[1] <= 50)+ (df[2] <= 25) + (df[3] <= 75))[,1] # note it is a matrix
We can also do this with vectors, i.e. using the comma form:
((df[, 1] <= 50)+ (df[, 2] <= 25) + (df[, 3] <= 75)) # vector output
The only drawback is having to repeat the + for every column. If we use rowSums instead, make sure the comparison values are replicated (using col) to the same dimensions as the subset of the data.frame. Another option is Map:
Reduce(`+`, Map(`<=`, df[1:3], c(50, 25, 75)))
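To make the difference between | and + concrete, here is a small self-contained sketch using columns a, b and c as printed in the question. It also shows sweep as one more way to compare each column against its own threshold; sweep is not used in the answers above and is offered only as an alternative. The names any_true, n_true and n_sweep are illustrative.

```r
df <- data.frame(a = c(31, 25, 54, 5, 45, 46, 77, 34, 50, 15),
                 b = c(63, 88, 27, 39, 73, 64, 19, 33, 93, 100),
                 c = c(54, 71, 53, 73, 40, 16, 97, 82, 51, 25))
limits <- c(50, 25, 75)  # one threshold per column

# `|` only answers "is at least one condition met?" for each row
any_true <- df$a <= 50 | df$b <= 25 | df$c <= 75

# `+` counts how many conditions are met for each row
n_true <- (df$a <= 50) + (df$b <= 25) + (df$c <= 75)

# Same count with sweep: compare column k against limits[k], then rowSums
n_sweep <- rowSums(sweep(as.matrix(df), 2, limits, `<=`))

n_true
```

With this data any_true is TRUE for every row, which is why the | version cannot reproduce the desired counts.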
We can also use cbind to build a matrix from the multiple conditions, using column positions or column names, and then use rowSums as usual, e.g.
> rowSums(cbind(df[,'a'] <= 50 ,df[,'b'] <= 25 ,df[,'c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
> rowSums(cbind(df['a'] <= 50 ,df['b'] <= 25 ,df['c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
Using dplyr
library(dplyr)
df %>% mutate(ROWSUMS=rowSums(cbind(.['a'] <= 50 ,.['b'] <= 25 ,.['c'] <= 75), na.rm = TRUE))
Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
So I have 4 observations.
Now I want to select 2 observations randomly out of d two times (without repetition). For each of these two times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get two numbers, which I want to calculate an average of. Can someone show how to do this with a loop function in R?
Example:
Suppose observations 1 and 3 were selected the first time, and 2 and 3 the second time (of course, this could be different). This will give me the following results:
0.9829268 0.9725806
Since (here I have written the observations manually):
> d1 <- data.frame(x1=c(100,30), x2=c(300,10), y=c(75,3000))
> e1 <- dea(d1[c('x1', 'x2')], d1$y)
> weighted.mean(eff(e1), d1$y)
[1] 0.9829268
>
> d2 <- data.frame(x1=c(200,30), x2=c(200,10), y=c(100,3000))
> e2 <- dea(d2[c('x1', 'x2')], d2$y)
> weighted.mean(eff(e2), d2$y)
[1] 0.9725806
And the mean of these two numbers is:
0.9777537
My suggestion:
I have tried with:
for (r in 1:2)
{
a <- (1:4)
s <- sample(a, 2, replace = FALSE)
es <- dea([s, c('x1', 'x2')], y[s])
esav[i] <- weighted.mean(eff(es), y[s])
}
mean(esav)
But this does not work. Can someone help me?
Here's a possible approach (if I understood you correctly) :
library(Benchmarking)
set.seed(123) # just to reproduce this case
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
# generate all possible couples of row indexes
allPossibleRowIndexes <- combn(1:nrow(d),2,simplify=FALSE)
# select the first maxcomb couples randomly (without repetition)
maxcomb <- 3 # I chose 3... you can also test all the possibilities
rowIndexesRand <- sample(allPossibleRowIndexes,min(maxcomb,length(allPossibleRowIndexes)))
esav <- NULL
for (rowIdxs in rowIndexesRand){
es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
esav <- c(esav,weighted.mean(eff(es), d$y[rowIdxs]))
}
avg <- mean(esav)
# or alternatively using sapply instead of loop
avg <- mean(sapply(rowIndexesRand,function(rowIdxs){
es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
esav <- weighted.mean(eff(es), d$y[rowIdxs])
return(esav)
}))
Results :
> esav
[1] 0.9829268 0.9725806 0.9058824
> avg
[1] 0.9537966
> rowIndexesRand
[[1]]
[1] 1 3
[[2]]
[1] 2 3
[[3]]
[1] 3 4
EDIT :
As per comment, you can generate unique random indexes without generating all combinations using the following function.
Of course this is not very efficient since it samples multiple times in case the combination has been already extracted before...
# function that (not very efficiently) returns n unique random samples
# of size=k, taken from the set : 1...size
getRandomSamples <- function(size,k,n){
# ensure n is <= the number of combinations
n <- min(n,choose(size,k))
env <- new.env()
for(i in seq_len(n)){
# sample until it's not a duplicate
while(TRUE){
set <- sort(sample.int(size,k))
key <- paste(set,collapse=',')
if(is.null(env[[key]])){
env[[key]] <- set
break
}
}
}
unname(as.list(env))
}
# usage example
set.seed(1234) # for reproducibility
getRandomSamples(60,36,5)
[[1]]
[1] 1 2 4 7 8 10 11 12 13 14 15 16 17 18 20 21 22 23 24 26 30 31 32 33 34 35 36 37 42 43 44 46 47 55 58 59
[[2]]
[1] 3 4 5 8 10 11 12 13 14 16 17 18 19 20 22 23 24 25 26 29 32 33 35 38 40 43 44 45 47 48 49 50 51 55 56 58
[[3]]
[1] 1 2 4 5 6 7 8 9 10 11 14 18 19 22 25 27 28 30 36 37 38 39 40 43 46 47 49 50 51 53 54 55 57 58 59 60
[[4]]
[1] 1 2 5 7 8 9 10 12 13 14 18 19 27 29 30 31 35 36 37 38 42 43 44 46 47 48 49 51 52 53 55 56 57 58 59 60
[[5]]
[1] 3 5 6 7 9 11 12 13 15 16 19 20 21 22 24 26 27 30 31 32 35 36 37 39 40 42 43 44 45 46 49 50 51 54 55 60
I have multiple data frames; for this example, suppose there are 2.
Each frame comprises 2 columns: an index column and a value column.
sz<-5;
frame_1<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_2<-data.frame(index=sort(sample(1:10,sz,replace=F)),value=rpois(sz,50));
frame_1:
index value
1 49
6 62
7 58
8 30
10 50
frame_2:
index value
4 60
5 64
6 48
7 46
9 57
The goal is to create a third frame, frame_3, whose indices will be the union of those in frame_1 and frame_2,
frame_3<-data.frame(index = sort(union(frame_1$index,frame_2$index)));
and which will comprise two additional columns, value_1 and value_2.
frame_3$value_1 will be filled out from frame_1$value, frame_3$value_2 will be filled out from frame_2$value;
These should be filled out like so:
frame_3:
index value_1 value_2
1 49 NA
4 49 60 # value_1 is filled through with previous value
5 49 64 # value_1 is filled through with previous value
6 62 48
7 58 46
8 30 46 # value_2 is filled through with previous value
9 30 57 # value_1 is filled through with previous value
10 50 57 # value_2 is filled through with previous value
I'm looking for an efficient solution, as I'm dealing with records in the hundreds of thousands.
This problem screams for data.table. You can use a loop to recursively construct columns one by one using x[y, roll=TRUE].
require(data.table)
dt1 <- data.table(frame_1)
dt2 <- data.table(frame_2)
setkey(dt1, index)
setkey(dt2, index)
dt3 <- data.table(index = sort(unique(c(dt1$index, dt2$index))))
> dt1[dt2[dt3, roll=TRUE], roll=TRUE]
# index value value.1
# 1: 1 49 NA
# 2: 4 49 60
# 3: 5 49 64
# 4: 6 62 48
# 5: 7 58 46
# 6: 8 30 46
# 7: 9 30 57
# 8: 10 50 57
If your data.frames aren't very large, you can just use merge combined with zoo::na.locf.
R> library(zoo)
R> frame_3 <- merge(frame_1, frame_2, by="index",
+ all=TRUE, suffixes=paste(".",1:2,sep=""))
R> (frame_3 <- na.locf(frame_3))
index value.1 value.2
1 1 49 NA
2 4 49 60
3 5 49 64
4 6 62 48
5 7 58 46
6 8 30 46
7 9 30 57
8 10 50 57
Or, just use zoo objects to begin with, assuming your "value" columns are all one type (like a matrix, you can't mix types in zoo objects).
R> z1 <- zoo(frame_1$value, frame_1$index)
R> z2 <- zoo(frame_2$value, frame_2$index)
R> (z3 <- na.locf(merge(z1, z2)))
z1 z2
1 49 NA
4 49 60
5 49 64
6 62 48
7 58 46
8 30 46
9 30 57
10 50 57
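If adding a package is not an option, the same fill-forward can be sketched in base R with findInterval, which for each target index returns the position of the last source index that is less than or equal to it. This is a swapped-in technique, not one from the answers above. It assumes each frame's index column is sorted in increasing order, and the helper name locf_lookup is hypothetical.

```r
frame_1 <- data.frame(index = c(1, 6, 7, 8, 10), value = c(49, 62, 58, 30, 50))
frame_2 <- data.frame(index = c(4, 5, 6, 7, 9),  value = c(60, 64, 48, 46, 57))

# Hypothetical helper: value of `frame` carried forward to each index in `at`
locf_lookup <- function(frame, at) {
  pos <- findInterval(at, frame$index)  # 0 means no earlier index exists
  pos[pos == 0] <- NA                   # nothing to carry forward yet
  frame$value[pos]
}

idx <- sort(union(frame_1$index, frame_2$index))
frame_3 <- data.frame(index   = idx,
                      value_1 = locf_lookup(frame_1, idx),
                      value_2 = locf_lookup(frame_2, idx))
frame_3
```

findInterval is vectorized, so this should scale reasonably to hundreds of thousands of rows, but it has not been benchmarked against the data.table or zoo versions here.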
I want to write some simple code in R to generate a matrix like the one below.
I know you start with something like:
for(i in 1:10){
But I am not sure where to go from here. I know I can just use outer(1:10,1:10), but I want to do it with a for() statement.
1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
To explain what outer does: you need 2 loops (at least, that is how I would do it if I wanted to implement it).
v <- c()
for(i in 1:10)
for(j in 1:10)
v <- c(v,i*j) ## This is SLOW! (naughty)
matrix(v,ncol=10,nrow=10)
EDIT
In my first implementation the vector grows dynamically inside the loops, which is very slow. It is better to allocate the matrix before the loops, something like this:
xx <- matrix(NA,ncol=10,nrow=10)
for(i in 1:10)
for(j in 1:10)
xx[i,j] <- i*j
Another way to do it, using vectors and matrix multiplication:
I <- 1:10
J <- 1:10
I %*% t(J)
Disclaimer: This is a kinda silly thing to do as outer is a way better option. For fun and educational purposes:
#Define dimensions
matx <- 10
maty <- 10
mat <- matrix(NA, matx, maty)
#For loop
for(i in seq_len(maty)){
mat[,i] <- i * seq_len(matx)
}
Note that I've defined the size of the matrix before looping. This will prevent the horrors of growing an object as described in the R Inferno.
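As a small check on the preallocation approach, the sketch below fills the matrix with two nested loops, labels rows and columns as in the question's desired output, and compares the result with outer. The variable n and the dimnames labels are illustrative choices, not something the answers above specify.

```r
n <- 10
labels <- as.character(seq_len(n))

# Allocate the full matrix once, before looping
mat <- matrix(NA_integer_, nrow = n, ncol = n,
              dimnames = list(labels, labels))

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    mat[i, j] <- i * j  # fill in place; the object never grows
  }
}

# Every cell should match the built-in result
all(mat == outer(seq_len(n), seq_len(n)))
```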
Another suggestion:
lines = 10
output= seq(from=1,by=1,length.out=lines)
i=2
while (i<lines+1){
temp2 = seq(from=i, by=i, length.out=lines)
output= cbind(output, temp2)
i = i+1
}