Resampling with a loop in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
So I have 4 observations.
Now I want to select 2 observations randomly out of d two times (without repetition). For each of these two times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get two numbers, of which I want to calculate the average. Can someone show how to do this with a loop in R?
Example:
Suppose observations 1 and 3 were selected the first time, and observations 2 and 3 the second time (of course, this could be different). That would give me the following results:
0.9829268 0.9725806
Since (here I have written the observations manually):
> d1 <- data.frame(x1=c(100,30), x2=c(300,10), y=c(75,3000))
> e1 <- dea(d1[c('x1', 'x2')], d1$y)
> weighted.mean(eff(e1), d1$y)
[1] 0.9829268
>
> d2 <- data.frame(x1=c(200,30), x2=c(200,10), y=c(100,3000))
> e2 <- dea(d2[c('x1', 'x2')], d2$y)
> weighted.mean(eff(e2), d2$y)
[1] 0.9725806
And the mean of these two numbers is:
0.9777537
My suggestion:
I have tried with:
for (r in 1:2)
{
  a <- (1:4)
  s <- sample(a, 2, replace = FALSE)
  es <- dea([s, c('x1', 'x2')], y[s])
  esav[i] <- weighted.mean(eff(es), y[s])
}
mean(esav)
But this does not work. Can someone help me?

Here's a possible approach (if I understood you correctly):
library(Benchmarking)
set.seed(123) # just to reproduce this case
d <- data.frame(x1=c(100,200,30,500), x2=c(300,200,10,50), y=c(75,100,3000,400))
# generate all possible couples of row indexes
allPossibleRowIndexes <- combn(1:nrow(d),2,simplify=FALSE)
# select the first maxcomb couples randomly (without repetition)
maxcomb <- 3 # I chose 3... you can also test all the possibilities
rowIndexesRand <- sample(allPossibleRowIndexes,min(maxcomb,length(allPossibleRowIndexes)))
esav <- NULL
for (rowIdxs in rowIndexesRand){
  es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
  esav <- c(esav, weighted.mean(eff(es), d$y[rowIdxs]))
}
avg <- mean(esav)
# or alternatively using sapply instead of loop
avg <- mean(sapply(rowIndexesRand, function(rowIdxs){
  es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
  esav <- weighted.mean(eff(es), d$y[rowIdxs])
  return(esav)
}))
Results:
> esav
[1] 0.9829268 0.9725806 0.9058824
> avg
[1] 0.9537966
> rowIndexesRand
[[1]]
[1] 1 3
[[2]]
[1] 2 3
[[3]]
[1] 3 4
EDIT:
As per the comment, you can generate unique random indexes without generating all the combinations, using the following function.
Of course this is not very efficient, since it keeps sampling whenever a combination has already been extracted before...
# function that (not very efficiently) returns n unique random samples
# of size=k, taken from the set 1...size
getRandomSamples <- function(size, k, n){
  # ensure n is <= the number of possible combinations
  n <- min(n, choose(size, k))
  env <- new.env()
  for(i in seq_len(n)){
    # sample until it's not a duplicate
    while(TRUE){
      set <- sort(sample.int(size, k))
      key <- paste(set, collapse=',')
      if(is.null(env[[key]])){
        env[[key]] <- set
        break
      }
    }
  }
  unname(as.list(env))
}
# usage example
set.seed(1234) # for reproducibility
getRandomSamples(60,36,5)
[[1]]
[1] 1 2 4 7 8 10 11 12 13 14 15 16 17 18 20 21 22 23 24 26 30 31 32 33 34 35 36 37 42 43 44 46 47 55 58 59
[[2]]
[1] 3 4 5 8 10 11 12 13 14 16 17 18 19 20 22 23 24 25 26 29 32 33 35 38 40 43 44 45 47 48 49 50 51 55 56 58
[[3]]
[1] 1 2 4 5 6 7 8 9 10 11 14 18 19 22 25 27 28 30 36 37 38 39 40 43 46 47 49 50 51 53 54 55 57 58 59 60
[[4]]
[1] 1 2 5 7 8 9 10 12 13 14 18 19 27 29 30 31 35 36 37 38 42 43 44 46 47 48 49 51 52 53 55 56 57 58 59 60
[[5]]
[1] 3 5 6 7 9 11 12 13 15 16 19 20 21 22 24 26 27 30 31 32 35 36 37 39 40 42 43 44 45 46 49 50 51 54 55 60
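To plug these samples back into the original calculation, a minimal sketch (reusing d and the dea()/eff() calls from the question, with 3 samples of size 2):
# draw 3 unique pairs of row indexes and average the weighted efficiencies
rowIndexesRand <- getRandomSamples(nrow(d), 2, 3)
avg <- mean(sapply(rowIndexesRand, function(rowIdxs){
  es <- dea(d[rowIdxs, c('x1', 'x2')], d$y[rowIdxs])
  weighted.mean(eff(es), d$y[rowIdxs])
}))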


Vectorizing lagged operations

How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
  df$Z[i] <- (df$Z[i-1]*df$X[i-1]/df$X[i]) - df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set first, based on initial_Z and the first value of Y. The rest of the values of Z are calculated using the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I prefer implementing this using dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
  if (i == 1) return(prevZ - Y[i])
  prevZ*X[i-1]/X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses for loops internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce only iterates over one value, and we need both X and Y, the indices (rows) are a required step.
The same can be accomplished using purrr::accumulate2. Note that these are just for loops under the hood. You should consider writing the for loop in Rcpp if it is really causing a performance problem in R (a sketch is at the end of this answer).
library(purrr)
df %>%
  mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
You could unlist(Z):
df %>%
  mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000))[-1])
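If the loop really does become the bottleneck in the simulation, here is a minimal Rcpp sketch of the same recursion (this assumes Rcpp is installed; calcZ is just a hypothetical helper name):
library(Rcpp)
cppFunction('
NumericVector calcZ(NumericVector X, NumericVector Y, double initZ) {
  int n = X.size();
  NumericVector Z(n);
  Z[0] = initZ - Y[0];                          // first row: initial_Z - Y
  for (int i = 1; i < n; ++i) {
    Z[i] = Z[i - 1] * X[i - 1] / X[i] - Y[i];   // lagged recursion
  }
  return Z;
}')
df$Z <- calcZ(df$X, df$Y, initial_Z)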

Generate a list from a vector in R

I have a vector like a below. I would like to generate a list b from a; I have only typed the first 4 components of the list. Any suggestions of quick ways to achieve this would be appreciated:
a <- seq(from =0, to = 359, by = 8)
b <- list(c(0:7), c(8:(8+7)), c(16:(16+7)), c(24:(24+7)))
> b
[[1]]
[1] 0 1 2 3 4 5 6 7
[[2]]
[1] 8 9 10 11 12 13 14 15
[[3]]
[1] 16 17 18 19 20 21 22 23
[[4]]
[1] 24 25 26 27 28 29 30 31
You can create a sequence from the minimum value of a to the maximum and then use findInterval or cut to split the sequence into the intervals defined by a (a cut() variant is sketched after the output below).
tmp <- seq(min(a), max(a))
split(tmp, findInterval(tmp, a))
#$`1`
#[1] 0 1 2 3 4 5 6 7
#$`2`
#[1] 8 9 10 11 12 13 14 15
#$`3`
#[1] 16 17 18 19 20 21 22 23
#$`4`
#[1] 24 25 26 27 28 29 30 31
#$`5`
#[1] 32 33 34 35 36 37 38 39
#...
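A sketch of the cut() variant mentioned above, using the same tmp; the extra break at max(a) + 8 keeps max(a) inside the last interval:
split(tmp, cut(tmp, breaks = c(a, max(a) + 8), right = FALSE))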
Another way, using Map (note that this pairs consecutive elements of a, so the final block starting at 352 is dropped; Map(seq, a, a + 7) would cover it as well):
Map(seq, a[-length(a)], a[-1] - 1)
This will achieve the desired result:
list1 <- list()
for (i in 1:45) {
  base <- i*8 - 8
  list1[[i]] <- base + 0:7
}
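The same idea without the explicit loop, as a sketch that assumes every block is the 8 consecutive integers starting at an element of a:
b <- lapply(a, function(s) s + 0:7)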

Loss of dimensions of dataframe after applying rowMeans() in R

I subset a data frame and applied rowMeans() to it, but the dimensions of the resulting variable ('y') are lost and I am not able to use 'y' in my further code.
dim(mtcars)
# [1] 32 11
y = rowMeans((mtcars[,3:6]))
dim(y)
# NULL
Why is 'y' no longer a data frame? And what can I do to get its dimensions back?
I tried the following, but it didn't work:
as.data.frame(y)
# or
data.frame(y)
When you apply rowMeans() you are creating a vector out of a data frame, so you are going from n rows and k columns to an n x 1 vector.
For a case with n = 8 and k = 5 we would have:
> a=as.data.frame(matrix(1:40,8,5))
> a
V1 V2 V3 V4 V5
1 1 9 17 25 33
2 2 10 18 26 34
3 3 11 19 27 35
4 4 12 20 28 36
5 5 13 21 29 37
6 6 14 22 30 38
7 7 15 23 31 39
8 8 16 24 32 40
> rowMeans(a)
[1] 17 18 19 20 21 22 23 24
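A minimal sketch of getting a data frame back: the as.data.frame() call from the question works, it just has to be assigned to something.
y <- rowMeans(mtcars[, 3:6])
y <- as.data.frame(y)   # or: y <- data.frame(mean = rowMeans(mtcars[, 3:6]))
dim(y)
# [1] 32  1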

MCA from dataframe

I have a data frame
name a b c d e f
1 220-volt 1 8 12 17 22 8
2 aliexpress 7 133 317 372 358 349
3 bonprix 0 3 14 13 21 11
4 citilink 1 20 40 31 29 30
5 dns 1 16 37 34 39 38
6 ebay 3 32 65 50 55 58
7 eldorado 0 19 76 44 42 56
8 kupivip 0 8 17 24 11 18
9 labirint 0 15 30 34 36 32
10 lamoda 3 25 66 73 68 55
and I am trying to build an MCA plot.
I use FactoMineR with the following code:
library(FactoMineR)
df <- read.table("info.csv", header = TRUE, sep=';')
row.names(df) = df$name
df = df[,-1]
res.mca <- MCA(df)
but it returns
Error in which(unlist(lapply(listModa, is.numeric))) : argument to 'which' is not logical
How can I avoid this error?
I downloaded the code and reproduced your data.frame (please use dput, or another reproducible example) and got the same error.
When you run ?MCA you will find that x has to be:
a data frame with n rows (individuals) and p columns (categorical variables)
After I changed the columns to factors, the function runs.
Try this:
df[] <- lapply(df, factor)
Tip: use row.names = 1 to set the first column as row names for your data.frame when you read the data.
df <- read.table("info.csv", header = T, sep = ";", row.names = 1)
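Putting both pieces together, a minimal sketch (assuming info.csv is laid out as shown above):
library(FactoMineR)
df <- read.table("info.csv", header = TRUE, sep = ";", row.names = 1)
df[] <- lapply(df, factor)   # MCA() expects categorical variables
res.mca <- MCA(df)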

Assign weights in lpSolveAPI to prioritise variables

I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data; the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that each day holds the maximum number of sessions, preferably in order of their weight, with each day capped at a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
#Set up matrix to hold results; each row represents day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r){
  for(j in 1:c){
    lp <- make.lp(0, nrow(data))
    set.type(lp, 1:nrow(data), "binary")
    set.objfn(lp, rep(1, nrow(data)))
    lp.control(lp, sense = "max")
    add.constraint(lp, data$min, "<=", 400)
    set.branch.weights(lp, data$weight)
    solve(lp)
    a <- get.variables(lp)*data$id
    b <- a[a!=0]
    tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
    if(dim(data[!data$id == a,])[1] > 0) {
      data <- data[!data$id== a,]
      row <- row + 1
    }
    break
  }
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup works to maximise the number of sessions so that the total minutes in a day are as close to 400 as possible, it doesn't follow the weights given. I expect my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order, etc., but for some reason my setup doesn't seem to enforce set.branch.weights.
I have read the documentation for set.branch.weights from lpSolveAPI, but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
Should be
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480 minutes.
My simple-minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume sorted by weight
daynr = 1
daymax = 480
dayusd = 0
for (i in 1:nrow(df))
{
  v = df$min[i]
  dayusd = dayusd + v
  if (dayusd > daymax)
  {
    daynr = daynr + 1
    dayusd = v
  }
  df$day[[i]] = daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
>
I will concentrate on the first solve. We basically solve a knapsack problem (objective + one constraint): maximise sum(x[i]) subject to sum(min[i] * x[i]) <= 400, with each x[i] binary (x[i] = 1 means session i is scheduled on that day).
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective so that the weights break ties between solutions with the same session count (for example, coefficients 1 + weight[i] instead of all ones; since the weights sum to 1, this cannot reduce the count), I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.
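To make the objective change concrete, here is a minimal lpSolveAPI sketch of the single-day knapsack; the 1 + data$weight coefficients are my reconstruction of the modified objective, not necessarily the exact formula used above.
library(lpSolveAPI)
n <- nrow(data)
lp <- make.lp(0, n)
set.type(lp, 1:n, "binary")
add.constraint(lp, data$min, "<=", 400)
lp.control(lp, sense = "max")

# objective 1: maximise the number of scheduled sessions
set.objfn(lp, rep(1, n))
solve(lp)
x1 <- get.variables(lp)

# objective 2 (assumed): same count, ties broken toward higher weights
set.objfn(lp, 1 + data$weight)
solve(lp)
x2 <- get.variables(lp)

c(count = sum(x2), weight = sum(data$weight * x2))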
