Vectorizing lagged operations - r

How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i])-df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set from initial_Z and the first value of Y. The remaining values of Z are calculated from the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.

I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
if (i == 1) return(prevZ - Y[i])
prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses for loops internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce only iterates over one vector, and we need both X and Y, iterating over the indices (rows) is a required step.

The same can be accomplished using accumulate2 from the purrr package. Note that these are still just for loops. If the loop is really causing a performance problem in R, you should consider writing it in Rcpp.
df %>%
mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
Z is a list column here. To get a plain numeric column, you could unlist the result:
df %>%
mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 -..2, .init = 1000))[-1])
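For this particular recurrence the loop can also be removed algebraically. Multiplying both sides of Z[i] = Z[i-1]*X[i-1]/X[i] - Y[i] by X[i] gives Z[i]*X[i] = Z[i-1]*X[i-1] - X[i]*Y[i], so the product Z*X is simply a running sum. A minimal sketch, assuming no value of X is zero (the names Z_loop and Z_vec are mine, and the result may differ from the loop by floating-point rounding):

```r
# Data typed in from the printed output in the question
X <- c(20, 69, 92, 28, 10, 70, 53, 81, 96, 11)
Y <- c(27, 49, 32, 56, 26, 20, 39, 89, 55, 84)
initial_Z <- 1000

# Reference: the original for loop
Z_loop <- numeric(length(X))
Z_loop[1] <- initial_Z - Y[1]
for (i in 2:length(X)) {
  Z_loop[i] <- Z_loop[i - 1] * X[i - 1] / X[i] - Y[i]
}

# Vectorized: Z * X equals initial_Z * X[1] minus the running sum of X * Y
Z_vec <- (initial_Z * X[1] - cumsum(X * Y)) / X

all.equal(Z_loop, Z_vec)
```

Inside a dplyr pipeline the same expression could be written as mutate(Z = (initial_Z * first(X) - cumsum(X * Y)) / X).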

Related

How to bin columns based on the minimum and maximum of a column

I've got a dataset whose scores need to be converted from a continuous scale to a categorical one. Each value will be put into one of 10 equal-width categories based on the minimum and maximum of that column. So if the minimum = 1 and the maximum = 100 there will be 10 categories so that any value from 1-10 = 1, and 11-20 = 2, 21-30 = 3, ..., 91-100 = 10. Here's what my data looks like
df <- as.data.frame(cbind(test1 = sample(13:52, 15),
test2 = sample(16:131, 15)))
> df
test1 test2
1 44 131
2 26 83
3 74 41
4 6 73
5 83 20
6 63 110
7 23 29
8 42 64
9 41 40
10 10 96
11 2 39
12 14 24
13 67 30
14 51 59
15 66 37
So far I have a function:
trail.bin <- function(data, col, min, max) {
for(i in 1:10) {
for(e in 0:9) {
x <- as.data.table(data)
mult <- (max - min)/10
x[col >= min+(e*mult) & col < min+(i*mult),
col := i]
}
}
return(x)
}
What I'm trying to do is take the minimum and maximum, find what the spacing of intervals would be (mult), then use two loops on a data.table reference syntax. The outcome I'm hoping for is:
df2
test1 test2
1 5 131
2 3 83
3 8 41
4 1 73
5 9 20
6 7 110
7 3 29
8 5 64
9 5 40
10 2 96
11 1 39
12 2 24
13 7 30
14 6 59
15 7 37
Thanks!
You could create a function using cut
library(data.table)
trail.bin <- function(data, col, n) {
data[, (col) := lapply(.SD, cut, n, labels = FALSE), .SDcols = col]
return(data)
}
setDT(df)
trail.bin(df, 'test1', 10)
You can also pass multiple columns
trail.bin(df, c('test1', 'test2'), 10)
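To see what cut does when breaks is a single number, here is a small base R sketch on a made-up vector (scores is a hypothetical name). cut splits the observed range into that many equal-width intervals and, with labels = FALSE, returns the interval number. The intervals are derived from the range of the data, so the boundaries will not fall exactly on 10, 20, 30 and so on:

```r
scores <- c(1, 10, 11, 50, 100)

# 10 equal-width bins spanning range(scores);
# labels = FALSE returns integer bin codes instead of a factor
bins <- cut(scores, breaks = 10, labels = FALSE)
bins
# [1]  1  1  2  5 10
```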

Resampling with a loop in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
So I have 4 observations.
Now I want to select 2 observations randomly out of d two times (without repetition). For each of these two times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get two numbers, which I want to calculate an average of. Can someone show how to do this with a loop function in R?
Example:
Suppose that observations 1 and 3 were selected the first time, and 2 and 3 were selected the second time (of course, this could be different). This will give me the following results:
0.9829268 0.9725806
Since (here I have written the observations manually):
> d1 <- data.frame(x1=c(100,30), x2=c(300,10), y=c(75,3000))
> e1 <- dea(d1[c('x1', 'x2')], d1$y)
> weighted.mean(eff(e1), d1$y)
[1] 0.9829268
>
> d2 <- data.frame(x1=c(200,30), x2=c(200,10), y=c(100,3000))
> e2 <- dea(d2[c('x1', 'x2')], d2$y)
> weighted.mean(eff(e2), d2$y)
[1] 0.9725806
And the mean of these two numbers is:
0.9777537
My suggestion:
I have tried with:
for (r in 1:2)
{
a <- (1:4)
s <- sample(a, 2, replace = FALSE)
es <- dea([s, c('x1', 'x2')], y[s])
esav[i] <- weighted.mean(eff(es), y[s])
}
mean(esav)
But this does not work. Can someone help me?
Here's a possible approach (if I understood you correctly) :
library(Benchmarking)
set.seed(123) # just to reproduce this case
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
# generate all possible couples of row indexes
allPossibleRowIndexes <- combn(1:nrow(d),2,simplify=FALSE)
# select the first maxcomb couples randomly (without repetition)
maxcomb <- 3 # I chose 3... you can also test all the possibilities
rowIndexesRand <- sample(allPossibleRowIndexes,min(maxcomb,length(allPossibleRowIndexes)))
esav <- NULL
for (rowIdxs in rowIndexesRand){
es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
esav <- c(esav,weighted.mean(eff(es), d$y[rowIdxs]))
}
avg <- mean(esav)
# or alternatively using sapply instead of loop
avg <- mean(sapply(rowIndexesRand,function(rowIdxs){
es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
esav <- weighted.mean(eff(es), d$y[rowIdxs])
return(esav)
}))
Results :
> esav
[1] 0.9829268 0.9725806 0.9058824
> avg
[1] 0.9537966
> rowIndexesRand
[[1]]
[1] 1 3
[[2]]
[1] 2 3
[[3]]
[1] 3 4
EDIT :
As per comment, you can generate unique random indexes without generating all combinations using the following function.
Of course this is not very efficient since it samples multiple times in case the combination has been already extracted before...
# function that (not very efficiently) returns n unique random samples
# of size=k, taken from the set : 1...size
getRandomSamples <- function(size,k,n){
# ensure n is <= than the number of combinations
n <- min(n,choose(size,k))
env <- new.env()
for(i in seq_len(n)){
# sample until it's not a duplicate
while(TRUE){
set <- sort(sample.int(size,k))
key <- paste(set,collapse=',')
if(is.null(env[[key]])){
env[[key]] <- set
break
}
}
}
unname(as.list(env))
}
# usage example
set.seed(1234) # for reproducibility
getRandomSamples(60,36,5)
[[1]]
[1] 1 2 4 7 8 10 11 12 13 14 15 16 17 18 20 21 22 23 24 26 30 31 32 33 34 35 36 37 42 43 44 46 47 55 58 59
[[2]]
[1] 3 4 5 8 10 11 12 13 14 16 17 18 19 20 22 23 24 25 26 29 32 33 35 38 40 43 44 45 47 48 49 50 51 55 56 58
[[3]]
[1] 1 2 4 5 6 7 8 9 10 11 14 18 19 22 25 27 28 30 36 37 38 39 40 43 46 47 49 50 51 53 54 55 57 58 59 60
[[4]]
[1] 1 2 5 7 8 9 10 12 13 14 18 19 27 29 30 31 35 36 37 38 42 43 44 46 47 48 49 51 52 53 55 56 57 58 59 60
[[5]]
[1] 3 5 6 7 9 11 12 13 15 16 19 20 21 22 24 26 27 30 31 32 35 36 37 39 40 42 43 44 45 46 49 50 51 54 55 60

MCA from dataframe

I have dataframe
name a b c d e f
1 220-volt 1 8 12 17 22 8
2 aliexpress 7 133 317 372 358 349
3 bonprix 0 3 14 13 21 11
4 citilink 1 20 40 31 29 30
5 dns 1 16 37 34 39 38
6 ebay 3 32 65 50 55 58
7 eldorado 0 19 76 44 42 56
8 kupivip 0 8 17 24 11 18
9 labirint 0 15 30 34 36 32
10 lamoda 3 25 66 73 68 55
and I try to build mca plot.
I use FactoMineR and use code
library(FactoMineR)
df <- read.table("info.csv", header = TRUE, sep=';')
row.names(df) = df$name
df = df[,-1]
res.mca <- MCA(df)
but it returns
Error in which(unlist(lapply(listModa, is.numeric))) : argument to 'which' is not logical
How can I avoid this error?
I downloaded the code and reproduced your data.frame (please use dput, or another reproducible example) and got the same error.
When you ?MCA you will find that x has to be:
a data frame with n rows (individuals) and p columns (categorical variables)
After I changed the columns to factors the function runs.
Try this:
df[] <- lapply(df, factor)
Tip: use row.names = 1 to set the first column as row names for your data.frame when you read the data.
df <- read.table("info.csv", header = T, sep = ";", row.names = 1)
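The conversion step can be checked on a small data frame without FactoMineR. This sketch uses three rows typed in from the question (only columns a and b, as an illustration) and confirms that every column is a factor afterwards:

```r
df <- data.frame(
  a = c(1, 7, 0),
  b = c(8, 133, 3),
  row.names = c("220-volt", "aliexpress", "bonprix")
)

sapply(df, is.numeric)   # numeric columns, as in the data that triggered the error

# df[] keeps the data.frame structure (and row names) while replacing every column
df[] <- lapply(df, factor)

sapply(df, is.factor)    # every column is now a factor
```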

How to calculate the standard deviation of subsets of data down a column in R

I would like to calculate the standard deviation of every 4 values down a column from the first to the last observation. I have found lots of answers for moving SD functions, but I simply need a line of code that will calculate the sd() for every 4 data values and write the answers into a new column in the data frame as below:
Example data:
Obs Count
1 56
2 29
3 66
4 62
5 49
6 12
7 65
8 81
9 73
10 66
11 71
12 59
Desired output:
Obs Count SD
1 56 16.68
2 29 16.68
3 66 16.68
4 62 16.68
5 49 29.55
6 12 29.55
7 65 29.55
8 81 29.55
9 73 6.24
10 66 6.24
11 71 6.24
12 59 6.24
I tried the below code, but this is obviously incorrect:
a <- for(i in 1: length(df)) sd(df$Count[i:(i+3)])
This should be a very easy task, but I have not been able to find an answer. I am still learning and any help would be appreciated.
In base R, you can use the following to create an index of "every 4 rows":
(seq_len(nrow(mydf))-1) %/% 4
# [1] 0 0 0 0 1 1 1 1 2 2 2 2
Using that, you can use ave to get the desired result:
mydf$SD <- ave(mydf$Count, (seq_len(nrow(mydf))-1) %/% 4, FUN = sd)
mydf
# Obs Count SD
# 1 1 56 16.680827
# 2 2 29 16.680827
# 3 3 66 16.680827
# 4 4 62 16.680827
# 5 5 49 29.545163
# 6 6 12 29.545163
# 7 7 65 29.545163
# 8 8 81 29.545163
# 9 9 73 6.238322
# 10 10 66 6.238322
# 11 11 71 6.238322
# 12 12 59 6.238322
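One property of the ave approach worth noting is that it does not need the number of rows to be a multiple of 4: a shorter last group simply gets its own standard deviation. A small sketch with the first six counts from the question (the names counts, grp and sds are mine):

```r
counts <- c(56, 29, 66, 62, 49, 12)

# Group index: 0 0 0 0 1 1
grp <- (seq_along(counts) - 1) %/% 4

# ave() returns one value per element, repeated within each group
sds <- ave(counts, grp, FUN = sd)
sds
```

If the last group contains a single value, its entry is NA, because sd() of one number is NA.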
An alternative is to use rollapply from the zoo package in combination with rep.
> library(zoo)
> N <- 4 # every four values
> SDs <- rollapply(df[,2], width=N, by=N, sd)
> df$SD <- rep(SDs, each=N)
> df
Obs Count SD
1 1 56 16.680827
2 2 29 16.680827
3 3 66 16.680827
4 4 62 16.680827
5 5 49 29.545163
6 6 12 29.545163
7 7 65 29.545163
8 8 81 29.545163
9 9 73 6.238322
10 10 66 6.238322
11 11 71 6.238322
12 12 59 6.238322
You might want to do it all at once:
df$SD <- rep( rollapply(df[,2], width=N, by=N, sd), each=N)
This looks faster (I didn't test it, though):
# mydf = your data
idxs = rep(1:nrow(mydf), each = 4, length = nrow(mydf))
mydf = within(mydf, {
Sd = rep(tapply(Count, idxs, sd), each = 4)
})
print(mydf)

Merging matrix index in R

I have a dataset (x) that contains
1 10
20 30
34 38
59 83
...
I have a big n x 1 matrix (mat). For each row of x, I want to assign the value 1 to the rows of mat from the start index through the stop index. For example
mat[1:10,1] = 1
mat[20:30,1] = 1
etc...
x is quite large, and the following loop takes a while in R:
for ( j in 1:dim(x)[1] ) {
mat[x[j,1]:x[j,2], 1] <- 1
}
Please help me if there is a faster way to do this. Thanks.
You can easily make a list of the rows you want to assign a value of 1 to in your big matrix, using apply on x with seq.int to get the row numbers like this...
rows <- unlist( apply( x , 1 , FUN = function(x){ seq.int(x[1],x[2])}) )
rows
# [1] 1 2 3 4 5 6 7 8 9 10 20 21 22 23 24 25 26 27 28 29 30 34 35 36 37 38 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
And then use subsetting which will be much faster, like this
mat[ rows , 1 ] <- 1
If m is your set of start and stop locations:
m <- matrix(scan(), ncol=2)
#------
1: 1 10
3: 20 30
5: 34 38
7: 59 83
9:
Read 8 items
mapply( seq.int, m[,1], m[,2])
rx1[ unlist( mapply( seq.int, m[,1], m[,2]) ), 1] <- 1
(Trivially different than SimonO101's earlier contribution.)
data.table usually excels in cases like this. Here is a data.table-based solution:
library(data.table)
indexes<-data.table(istart=c(1L,20L,34L,59L), istop=c(10L,30L,38L,83L))
mat<-data.table(val=sample(1L:1e5L,1e5))
mat[indexes[,list(i=seq(istart,istop)),by="istart"][,i],val:=1L]
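The row indices can also be built without calling a function once per range, using rep and sequence from base R. A sketch under the assumption that every start is less than or equal to its stop (mat here is a made-up 100 x 1 matrix of zeros, and len and rows are names I chose):

```r
x <- matrix(c(1, 10,
              20, 30,
              34, 38,
              59, 83), ncol = 2, byrow = TRUE)
mat <- matrix(0, nrow = 100, ncol = 1)

len <- x[, 2] - x[, 1] + 1                    # number of rows in each range
rows <- rep(x[, 1], len) + sequence(len) - 1  # start, start + 1, ..., stop

mat[rows, 1] <- 1
sum(mat)
# [1] 51
```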
