R data.frame add a column depending on row-values

In R, I have a data.frame that looks like this:
X Y
20 7
25 84
15 62
22 12
60 24
40 10
60 60
12 50
11 17
Now I want a new column, let's call it "SumX", that adds each value of X to the previous value of X, and a "SumY" column that does the same for Y. So the resulting data.frame would look like this:
X   Y   SumX  SumY
20  7   20    7     # first row: X, Y
25  84  45    91    # X0 + X1, Y0 + Y1
15  62  40    146   # X1 + X2, Y1 + Y2
22  12  37    74    # X2 + X3, Y2 + Y3
60  24  82    36    # X3 + X4, Y3 + Y4
40  10  100   34    # X4 + X5, Y4 + Y5
60  60  100   70    # and so on
12  50  72    110
11  17  23    67
I can do simple X + Y into a new column with
myFrame$SumXY <- with(myFrame, X+Y)
But is there a simple way to add two X values (n + (n-1)) into SumX, and two Y values (n + (n-1)) into SumY? Even a while-loop would do, though I would prefer a simpler way (it's a lot of data like this). Any help is much appreciated! (I'm still pretty new to R)

The rollapply function from the zoo package will work here.
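The following code creates the rolling sum of each 2 adjacent values. rollapply returns one value fewer than its input here, so the first X is prepended to match the number of rows.
require(zoo)
myFrame$SumX <- c(myFrame$X[1], rollapply(myFrame$X, 2, sum)) # first row as is, then a rolling sum of every 2 values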
You could add by = 2 as an argument to rollapply in order to not have a rolling sum (i.e. it sums values 1+2, then 3+4, then 5+6 etc.).
Look up ?rollapply for more info.
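For a window of exactly two, the same sums can also be had in base R without zoo. This is a minimal sketch, assuming the question's X and Y values and treating the missing predecessor of the first row as 0; pair_sum is a hypothetical helper name.

```r
# Data from the question
myFrame <- data.frame(
  X = c(20, 25, 15, 22, 60, 40, 60, 12, 11),
  Y = c(7, 84, 62, 12, 24, 10, 60, 50, 17)
)

# Hypothetical helper: each value plus the one before it (0 before the first)
pair_sum <- function(v) v + c(0, head(v, -1))

myFrame$SumX <- pair_sum(myFrame$X)
myFrame$SumY <- pair_sum(myFrame$Y)
myFrame
```

Because the helper is vectorised, it should scale to long columns without an explicit loop.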

Here's a dplyr approach.
Use mutate() to add a new column and var + lag(var, default = 0) to compute your variable. Example:
library(dplyr)
d <- data.frame(
x = 1:10,
y = 11:20,
z = 21:30
)
mutate(d, sumx = x + lag(x, default = 0))
#> x y z sumx
#> 1 1 11 21 1
#> 2 2 12 22 3
#> 3 3 13 23 5
#> 4 4 14 24 7
#> 5 5 15 25 9
#> 6 6 16 26 11
#> 7 7 17 27 13
#> 8 8 18 28 15
#> 9 9 19 29 17
#> 10 10 20 30 19
More variables can be handled similarly:
mutate(d, sumx = x + lag(x, default = 0), sumy = y + lag(y, default = 0))
#> x y z sumx sumy
#> 1 1 11 21 1 11
#> 2 2 12 22 3 23
#> 3 3 13 23 5 25
#> 4 4 14 24 7 27
#> 5 5 15 25 9 29
#> 6 6 16 26 11 31
#> 7 7 17 27 13 33
#> 8 8 18 28 15 35
#> 9 9 19 29 17 37
#> 10 10 20 30 19 39
If you know that you want to do this for many, or even EVERY column in your data frame, then here's a standard evaluation approach with mutate_() that uses a custom function I adapted from this blog post (note you need to have the lazyeval package installed). The function gets applied to each column in a for loop (which could probably be optimised).
f <- function(df, col, new_col_name) {
mutate_call <- lazyeval::interp(~ x + lag(x, default = 0), x = as.name(col))
df %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
for (var in names(d)) {
d <- f(d, var, paste0('sum', var))
}
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
Just to continue the tidyverse theme, here's a solution using the purrr package (again, it works for all columns, but you can subset columns if you need to):
library(purrr)
# Create new columns in new data frame.
# Subset `d` here if only want select columns.
# dplyr::lag is spelled out because stats::lag does not shift a plain vector.
sum_d <- map_df(d, ~ . + dplyr::lag(., default = 0))
# Set names correctly and
# bind back to original data
names(sum_d) <- paste0("sum", names(sum_d))
d <- cbind(d, sum_d)
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
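mutate_() and the lazyeval interface are deprecated in current dplyr releases. Below is a minimal sketch of the same all-columns operation with across(), assuming dplyr 1.0.0 or later and a fresh copy of d; the name d2 is only for illustration.

```r
library(dplyr)

d <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# across() applies the same lagged sum to every column;
# .names builds the new column names (sumx, sumy, sumz)
d2 <- mutate(
  d,
  across(everything(), ~ .x + dplyr::lag(.x, default = 0), .names = "sum{.col}")
)
d2
```

Replacing everything() with a selection such as c(x, y) would restrict the operation to chosen columns.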

You can use the lag function from dplyr to achieve something like this (base R's stats::lag does not shift a plain vector, so the dplyr version is named explicitly):
myFrame$SumX <- myFrame$X + dplyr::lag(myFrame$X, default = 0)

#SumX
cumsum(df$X) - c(0, 0, cumsum(df$X)[1:(nrow(df)-2)])
#[1] 20 45 40 37 82 100 100 72 23
#SumY
cumsum(df$Y) - c(0, 0, cumsum(df$Y)[1:(nrow(df)-2)])
#[1] 7 91 146 74 36 34 70 110 67
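The cumsum trick works because the sum of the last two values equals the running total minus the running total two rows earlier. Here is a sketch that generalises this to a window of k values, assuming numeric input without NAs; roll_sum_cumsum is a hypothetical name.

```r
roll_sum_cumsum <- function(x, k = 2) {
  cs <- cumsum(x)
  # Running total k positions earlier, with zeros where no such position exists
  lagged <- c(rep(0, k), cs)[seq_along(cs)]
  cs - lagged
}

X <- c(20, 25, 15, 22, 60, 40, 60, 12, 11)
roll_sum_cumsum(X)         # same values as the SumX line above
roll_sum_cumsum(X, k = 3)  # each value plus the two before it
```

For the first k - 1 positions the window is simply shorter, which matches the question's first row.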

Related

Looping a loop to do everything at once

I am trying to simulate the following "game":
There is a population of 100 units
You randomly sample 10 of these units, record the id's of the units you saw, and then put them back into the population
You then take a second sample, record the id's of the units you saw in this second sample along with the first sample, and then put the second sample back into the population
Repeat this many times
I wrote the following code in R that performs the above procedure:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
The result looks something like this:
> summary_file
iteration_i Count Cumulative unobserved
1 1 10 10 90
2 2 8 18 82
3 3 9 27 73
4 4 8 35 65
5 5 6 41 59
6 6 5 46 54
7 7 7 53 47
8 8 7 60 40
9 9 4 64 36
10 10 3 67 33
11 11 4 71 29
12 12 4 75 25
13 13 1 76 24
14 14 4 80 20
15 15 1 81 19
16 16 2 83 17
17 17 2 85 15
18 18 1 86 14
19 20 1 87 13
20 22 1 88 12
21 23 2 90 10
22 24 1 91 9
23 25 1 92 8
24 27 2 94 6
25 28 1 95 5
26 30 1 96 4
27 35 1 97 3
28 37 1 98 2
29 44 1 99 1
30 46 1 100 0
I would now like to repeat this "game" many times.
I would like to keep the "summary_file" for each "game" (e.g. summary_file_1, summary_file_2, summary_file_3, etc.)
I would then like to create a "total" summary file that shows the number of iterations that were required in each game to observe all units.
This total_summary_file would look something like this:
game_id iterations_required
1 game_1 47
2 game_2 45
3 game_3 44
4 game_4 42
5 game_5 42
Currently, I am just copy/pasting my earlier code several times and storing the results, then I append everything at the end and calculate the summary statistics - but I am trying to find a way to "loop the loop" and do everything at once. I do not know if it is possible to introduce a command like "results_df_i <- do.call(rbind.data.frame, results_i)" into the loop and efficiently create everything at the same time instead of manually copy/pasting the earlier loop.
You're making this a lot less efficient than it could be. To get, say, 100 repeated samples of 10 from the set 1:100 (with replacement), we can do replicate(100, sample(100, 10, TRUE)).
We can then coerce this into a vector and count the number of unique values every 10 entries along the vector until we get to 100. This gives us the number of iterations required to exhaust the samples.
If we put this inside an sapply, we don't even need an explicit loop, which means we can create the results data frame in a single call:
set.seed(1)
n_games <- 10
results <- data.frame(game_id = paste("game", seq(n_games), sep = "_"),
iterations_required = sapply(seq(n_games), function(x) {
samp <- c(replicate(100, sample(100, 10, TRUE)))
# count the iterations that end with unseen units, plus the one that completes the set
sum(sapply(1:100 * 10, function(n) length(unique(samp[1:n]))) < 100) + 1
}))
results
#> game_id iterations_required
#> 1 game_1 60
#> 2 game_2 45
#> 3 game_3 55
#> 4 game_4 60
#> 5 game_5 58
#> 6 game_6 59
#> 7 game_7 97
#> 8 game_8 61
#> 9 game_9 72
#> 10 game_10 34
Created on 2022-06-11 by the reprex package (v2.0.1)
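Note that sample(100, 10, TRUE) allows the same unit to appear twice within one draw, whereas the question draws 10 distinct units and only then puts them back. The sketch below stays closer to the question's rules, assuming a game simply runs until every unit has been seen; play_game and its arguments are hypothetical names.

```r
play_game <- function(n_units = 100, draw_size = 10) {
  seen <- logical(n_units)   # TRUE once a unit has been observed
  iterations <- 0L
  while (!all(seen)) {
    # draw_size distinct units per iteration (no replacement within a draw)
    seen[sample(n_units, draw_size)] <- TRUE
    iterations <- iterations + 1L
  }
  iterations
}

set.seed(1)
results <- data.frame(
  game_id = paste0("game_", 1:5),
  iterations_required = replicate(5, play_game())
)
results
```

Since the loop stops as soon as everything has been seen, there is no fixed cap of 100 iterations to tune.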
There are lots of ways to get your desired outcome; wrapping your loop in a function and running the function multiple times is another potential solution:
library(dplyr)
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,1,10)
var_3 = rnorm(100,5,10)
response = rnorm(100,1,1)
my_data = data.frame(var_1, var_2, var_3, response)
my_data$id = 1:100
results <- list()
results2<- list()
sample_func <- function(output_file_name) {
for (i in 1:100)
{
iteration_i = i
sample_i = my_data[sample(nrow(my_data), 10), ]
results_tmp = data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file = data.frame(test_1 %>% group_by(iteration_i) %>% summarise(Count = n()))
cumulative = cumsum(summary_file$Count)
summary_file$Cumulative = cumulative
summary_file$unobserved = 100 - cumulative
output_file_name <- summary_file
return(output_file_name)
}
list_of_names <- paste0("game_", 1:10)
output <- lapply(list_of_names, sample_func)
names(output) <- list_of_names
head(output, n = 2)
#> $game_1
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 9 19 81
#> 3 3 8 27 73
#> 4 4 8 35 65
#> 5 5 6 41 59
#> 6 6 6 47 53
#> 7 7 6 53 47
#> 8 8 4 57 43
#> 9 9 4 61 39
#> 10 10 3 64 36
#> 11 11 2 66 34
#> 12 12 1 67 33
#> 13 13 4 71 29
#> 14 14 1 72 28
#> 15 15 2 74 26
#> 16 16 2 76 24
#> 17 17 3 79 21
#> 18 18 4 83 17
#> 19 19 2 85 15
#> 20 20 2 87 13
#> 21 21 1 88 12
#> 22 24 1 89 11
#> 23 25 2 91 9
#> 24 26 1 92 8
#> 25 27 1 93 7
#> 26 30 1 94 6
#> 27 31 1 95 5
#> 28 33 1 96 4
#> 29 34 1 97 3
#> 30 36 1 98 2
#> 31 41 1 99 1
#> 32 66 1 100 0
#>
#> $game_2
#> iteration_i Count Cumulative unobserved
#> 1 1 10 10 90
#> 2 2 10 20 80
#> 3 3 7 27 73
#> 4 4 7 34 66
#> 5 5 8 42 58
#> 6 6 5 47 53
#> 7 7 7 54 46
#> 8 8 5 59 41
#> 9 9 1 60 40
#> 10 10 7 67 33
#> 11 11 3 70 30
#> 12 12 3 73 27
#> 13 13 1 74 26
#> 14 14 3 77 23
#> 15 15 4 81 19
#> 16 16 3 84 16
#> 17 17 2 86 14
#> 18 18 1 87 13
#> 19 19 2 89 11
#> 20 20 1 90 10
#> 21 21 2 92 8
#> 22 22 1 93 7
#> 23 25 2 95 5
#> 24 27 1 96 4
#> 25 29 2 98 2
#> 26 30 1 99 1
#> 27 41 1 100 0
The output is a list, so you can use purrr to apply functions to each element (e.g. https://purrr.tidyverse.org/reference/lmap.html) or use the bind_rows() function to create a single dataframe for further use, e.g.
df2 <- bind_rows(output, .id = "game") %>%
group_by("Game" = factor(game, levels = list_of_names)) %>%
summarise(rows_in_output = n(),
number_of_iterations = max(iteration_i))
df2
#> # A tibble: 10 × 3
#> Game rows_in_output number_of_iterations
#> <fct> <int> <int>
#> 1 game_1 32 66
#> 2 game_2 27 41
#> 3 game_3 27 48
#> 4 game_4 32 50
#> 5 game_5 27 35
#> 6 game_6 27 71
#> 7 game_7 28 68
#> 8 game_8 27 48
#> 9 game_9 29 43
#> 10 game_10 29 66
Created on 2022-06-17 by the reprex package (v2.0.1)
Or you can use list2env() to have each individual dataframe in your environment, e.g.
list2env(output, envir = .GlobalEnv)
ls()
#> [1] "df2" "game_1" "game_10" "game_2" "game_3" "game_4"
#> [7] "game_5" "game_6" "game_7" "game_8" "game_9" "list_of_names"
#> [13] "my_data" "output" "response" "results" "results2" "sample_func"
#> [19] "var_1" "var_2" "var_3"
This seems easily solvable using recursion:
fun <- function(x, i=1, size = 10){
a <- setdiff(x,sample(100, 10, TRUE)) # Remove the seen from x
if(length(a)) Recall(a, i+1) else i # if we have unobserved, call fun again
}
Now we can have as many games as we want:
data.frame(game = paste0('game',seq(10)), results = replicate(10, fun(1:100)))
game results
1 game1 62
2 game2 40
3 game3 51
4 game4 50
5 game5 34
6 game6 83
7 game7 38
8 game8 40
9 game9 53
10 game10 41
You could also do
hist(replicate(1000, fun(1:100)), breaks = 30)
Edit:
Note that this can be edited to take in any size and vector, e.g.:
fun <- function(x, size = 10, y=x, i=1){
a <- setdiff(x,sample(y, size, TRUE))
cat('i', i, '\t a: ',a, '\n')
if(length(a)>0) Recall(a, size, y, i+1) else i
}
set.seed(117);fun(1:10, 1)
i 1 a: 1 2 4 5 6 7 8 9 10 # 3 removed
i 2 a: 1 2 4 5 7 8 9 10 # 6 removed
i 3 a: 1 2 4 5 7 8 9 # 10 removed
i 4 a: 1 2 4 5 8 9 # 7 removed
i 5 a: 1 2 5 8 9 # 4 removed
i 6 a: 1 2 5 8 9 # Nothing removed
i 7 a: 1 5 8 9 # 2 removed
i 8 a: 1 5 8 # 9 removed
i 9 a: 1 5 # Nothing removed
i 10 a: 1 5 # Nothing removed
i 11 a: 5 # 1 removed
i 12 a: # 5 removed
Using Markov chains, we can produce the cumulative distribution function for the number of iterations required for a game (up to machine precision). The resulting CDF can be sampled directly using findInterval.
We can simplify things slightly by starting with the second iteration, since the first iteration will always result in 90 unseen units.
First, set up a matrix for all possible transitions:
m <- matrix(c(rep(90:1, each = 11), sequence(rep(11,90), 90:1, -1)), ncol = 2, dimnames = list(NULL, c("from", "to")))
m <- m[m[,2] >= 0L,]
Then create a transition matrix with row 1 representing the state where all units have been seen and row 91 representing the state where 10 units have been seen:
mTrans <- matrix(0, 91, 91)
The number of previously unseen units selected follows the hypergeometric distribution.
mTrans[m + 1L] <- dhyper(m[,1] - m[,2], m[,1], 100L - m[,1], 10L)
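To see why the hypergeometric distribution applies: with u units still unseen out of 100, a draw of 10 distinct units picks k new ones with probability dhyper(k, u, 100 - u, 10). Here is a small check for a single state, assuming u = 90 (the state after the first iteration).

```r
u <- 90                              # unseen units before the draw
k <- 0:10                            # possible numbers of newly seen units
probs <- dhyper(k, u, 100 - u, 10)   # one row of transition probabilities

sum(probs)       # probabilities over all outcomes sum to 1 (up to rounding)
sum(k * probs)   # expected number of new units, 10 * u / 100
```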
Row 1 represents an absorbing state since all units have been seen.
mTrans[1, 1] <- 1
mTrans contains the probabilities of each state after the second iteration.
Initialize a while loop and calculate the CDF.
mm <- mTrans %*% mTrans
maxIter <- 1000L
p <- numeric(maxIter)
iter <- 3L
while (p[iter] < 1) {
if ((iter <- iter + 1L) > maxIter) {
p <- c(p, numeric(maxIter))
maxIter <- maxIter*2L
}
mm <- mm %*% mTrans
p[iter] <- mm[91, 1]
}
p <- p[1:iter]
iter
#> [1] 345
Machine precision limits the CDF to less than 345 iterations. Plot the CDF:
plot(p, xlab = "iterations", ylab = "cumulative probability")
Using findInterval we can quickly generate a large number of random samples of the iterations required.
ngames <- 1e6L # one million games
results <- data.frame(game_id = 1:ngames, iterations_required = findInterval(runif(ngames), p) + 1L) # + 1L: a draw u with p[i] <= u < p[i + 1] means the game ends at iteration i + 1
head(results)
#> game_id iterations_required
#> 1 1 73
#> 2 2 69
#> 3 3 40
#> 4 4 41
#> 5 5 44
#> 6 6 43
Get a histogram of the sample number of iterations required.
hist(results$iterations_required)
OP here! I think I was able to find an answer to my own question:
library(dplyr)
var_1 <- rnorm(100, 10, 10)
var_2 <- rnorm(100, 1, 10)
var_3 <- rnorm(100, 5, 10)
response <- rnorm(100, 1, 1)
my_data <- data.frame(var_1, var_2, var_3, response)
my_data$id <- 1:100
simulate <- function() {
results <- list()
results2 <- list()
for (i in 1:100) {
iteration_i <- i
sample_i <- my_data[sample(nrow(my_data), 10), ]
results_tmp <- data.frame(iteration_i, sample_i)
results[[i]] <- results_tmp
}
results_df <- do.call(rbind.data.frame, results)
test_1 <- data.frame(results_df %>%
group_by(id) %>%
filter(iteration_i == min(iteration_i)) %>%
distinct)
summary_file <- data.frame(test_1 %>%
group_by(iteration_i) %>%
summarise(Count=n()))
cumulative <- cumsum(summary_file$Count)
summary_file$Cumulative <- cumulative
summary_file$unobserved <- 100 - cumulative
return(summary_file)
}
# now, loop 10 times!
results <- list()
for (i in 1:10) {
game_i <- i
s_i <- simulate()
results_tmp <- data.frame(game_i, s_i)
results[[i]] <- results_tmp
}
final_file <- do.call(rbind.data.frame, results)
Thanks for your help everyone!
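The final_file above stacks the per-game summaries but stops short of the total summary the question asked for. Below is a base R sketch of that last step, using a small hypothetical stand-in for final_file (a population of 5 units instead of 100) so that it runs on its own.

```r
# Hypothetical stand-in for final_file: two games over a population of 5 units
final_file <- data.frame(
  game_i      = c(1, 1, 1, 2, 2),
  iteration_i = c(1, 2, 4, 1, 3),
  Count       = c(3, 1, 1, 4, 1),
  Cumulative  = c(3, 4, 5, 4, 5),
  unobserved  = c(2, 1, 0, 1, 0)
)

# Keep only the rows where nothing is left unobserved, then take the
# first iteration at which that happens in each game
done <- final_file[final_file$unobserved == 0, ]
total_summary <- aggregate(iteration_i ~ game_i, data = done, FUN = min)
names(total_summary) <- c("game_id", "iterations_required")
total_summary$game_id <- paste0("game_", total_summary$game_id)
total_summary
```

A game that never reaches zero unobserved units within its iterations would drop out of this table, which is arguably safer than reporting its last productive iteration.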

tidyverse: binding list elements efficiently

I want to bind data.frames of same number of rows from a list as given below.
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
ls <- list(df1, df2, df3, df4, df5)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
bind_cols(ls[1], ls[2], .id = NULL)
#> New names:
#> * A1 -> A1...1
#> * A1 -> A1...3
#> A1...1 B1 A1...3 C1
#> 1 1 11 1 21
#> 2 2 12 2 22
#> 3 3 13 3 23
#> 4 4 14 4 24
#> 5 5 15 5 25
#> 6 6 16 6 26
#> 7 7 17 7 27
#> 8 8 18 8 28
#> 9 9 19 9 29
#> 10 10 20 10 30
bind_cols(ls[3], ls[4], .id = NULL)
#> New names:
#> * A2 -> A2...1
#> * A2 -> A2...4
#> A2...1 B2 C2 A2...4 D2 E2
#> 1 1 11 31 1 11 51
#> 2 2 12 32 2 12 52
#> 3 3 13 33 3 13 53
#> 4 4 14 34 4 14 54
#> 5 5 15 35 5 15 55
#> 6 6 16 36 6 16 56
#> 7 7 17 37 7 17 57
#> 8 8 18 38 8 18 58
#> 9 9 19 39 9 19 59
#> 10 10 20 40 10 20 60
#> 11 11 21 41 11 21 61
#> 12 12 22 42 12 22 62
#> 13 13 23 43 13 23 63
#> 14 14 24 44 14 24 64
#> 15 15 25 45 15 25 65
In my actual list, I have about twenty data.frames of different number of rows. I wonder if there is a more efficient way of binding data.frames of same number of rows without giving the name and index of list elements.
It is easier to do this by splitting. Create a grouping index with gl
grp <- as.integer(gl(length(ls), 2, length(ls)))
and then use split
library(dplyr)
library(purrr)
library(stringr)
split(ls, grp) %>% # // split by the grouping index
map(bind_cols) %>% # // loop over the `list` and use `bind_cols`
set_names(str_c('df', seq_along(.))) %>% # // name the `list`
list2env(.GlobalEnv) # // create objects in global env
-output
head(df1)
# A1...1 B1 A1...3 C1
#1 1 11 1 21
#2 2 12 2 22
#3 3 13 3 23
#4 4 14 4 24
#5 5 15 5 25
#6 6 16 6 26
head(df2)
# A2...1 B2 C2 A2...4 D2 E2
#1 1 11 31 1 11 51
#2 2 12 32 2 12 52
#3 3 13 33 3 13 53
#4 4 14 34 4 14 54
#5 5 15 35 5 15 55
#6 6 16 36 6 16 56
head(df3)
# A tibble: 1 x 1
# ...1
# <dbl>
#1 5
NOTE:
It is better to keep the elements in the list instead of creating objects in the global environment i.e. list2env
ls is a function name and naming an object with function name is not a good option as it can lead to buggy situations
Maybe not the optimal approach, but you can use a loop and bind the data frames with the same number of rows into new data frames. The main idea of this code is to get the number of rows of each data frame and create a vector of the unique values. Then in the loop you can subset the data frames in ls and bind their columns. Here is the code (updated to handle the little df5; the trick is to treat it as a data frame):
library(dplyr)
#Data
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
#List
ls <- list(df1, df2, df3, df4,df5)
#Index
index <- sapply(ls,function(x)dim(as.data.frame(x))[1])
m <- unique(index)
#Loop
for(i in 1:length(m))
{
assign(paste0('df',i),do.call(bind_cols,ls[lapply(ls,function(x) dim(as.data.frame(x))[1]==m[i])==T]))
}
Output:
df1
A1...1 B1 A1...3 C1
1 1 11 1 21
2 2 12 2 22
3 3 13 3 23
4 4 14 4 24
5 5 15 5 25
6 6 16 6 26
7 7 17 7 27
8 8 18 8 28
9 9 19 9 29
10 10 20 10 30
df2
A2...1 B2 C2 A2...4 D2 E2
1 1 11 31 1 11 51
2 2 12 32 2 12 52
3 3 13 33 3 13 53
4 4 14 34 4 14 54
5 5 15 35 5 15 55
6 6 16 36 6 16 56
7 7 17 37 7 17 57
8 8 18 38 8 18 58
9 9 19 39 9 19 59
10 10 20 40 10 20 60
11 11 21 41 11 21 61
12 12 22 42 12 22 62
13 13 23 43 13 23 63
14 14 24 44 14 24 64
15 15 25 45 15 25 65
df3
...1
1 5
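The same grouping by row count can be written without a loop or assign(), keeping the results in a list. This is a base R sketch, assuming cbind() is an acceptable substitute for bind_cols() (duplicate column names are kept as they are rather than repaired); lst and bound are hypothetical names.

```r
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
lst <- list(df1, df2, df3, df4, df5)   # `lst` avoids masking base::ls

# NROW() also works for the bare number 5 (it counts as one row)
n_rows <- sapply(lst, NROW)

# Split the list by row count, then column-bind each group
bound <- lapply(split(lst, n_rows), function(g) do.call(cbind, g))

names(bound)         # one element per distinct row count
dim(bound[["10"]])
```

Unlike the gl()-based index, this does not rely on matching data frames sitting next to each other in the list.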

Assigning Values based on row value

I have a large vector (a column of a data frame) whose values are integers from 1 to 30. I want to replace numbers from 1 to 5 with 1, 6 to 10 with 5, 11 to 15 with 9...
> x3 <- sample(1:30, 100, rep=TRUE)
> x3
[1] 13 24 16 30 10 6 15 10 3 17 18 22 11 13 29 7 25 28 17 27 1 5 6 20 15 15 8 10 13 26 27 24 3 24 5 7 10 6 28 27 1 4 22 25 14 13 2 10 4 29 23 24 30 24 29 11 2 28 23 1 1 2
[63] 3 23 13 26 21 22 11 4 8 26 17 11 20 23 6 14 24 5 15 21 11 13 6 14 20 11 22 9 6 29 4 30 20 30 4 24 23 29
As I mentioned this is a column in a data frame and with above assignment I want to create a different column. If I do the following I have to do this 30 times.
myFrame$NewColumn[myFrame$oldColumn==1] <- 1
myFrame$NewColumn[myFrame$oldColumn==2] <- 1
myFrame$NewColumn[myFrame$oldColumn==3] <- 1
...
What's a better way to do this?
We can do this with cut (assuming the pattern continues with 16 to 20 becoming 13, 21 to 25 becoming 17 and 26 to 30 becoming 21):
x4 <- cut(x3,
breaks = c(seq(1, 30, 5), 30), right = F, include.lowest = T, # generate correct intervals
labels = 4 * (0:5) + 1) # number to fill
# x4 is factor. We should convert it to character first then to the number
x4 <- as.numeric(as.character(x4))
Did you try:
myFrame$NewColumn[myFrame$oldColumn > 0 & myFrame$oldColumn < 6] <- 1
myFrame$NewColumn[myFrame$oldColumn > 5 & myFrame$oldColumn < 11] <- 5
...
Or even better:
myFrame$NewColumn <- as.integer((myFrame$oldColumn - 1) / 5) * 4 + 1
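Here is a quick check of the arithmetic form on every value from 1 to 30, assuming the pattern continues in steps of 4 (13, 17, 21) beyond the three groups given in the question. Integer division (%/%) is used here in place of as.integer(); for positive inputs the two give the same block index.

```r
old <- 1:30

# Integer division gives the block index (0 for 1-5, 1 for 6-10, ...);
# each block then maps to 1, 5, 9, ...
new <- (old - 1) %/% 5 * 4 + 1

# Spot-check the block boundaries
data.frame(old, new)[c(1, 5, 6, 10, 11, 30), ]
```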

Assign weights in lpSolveAPI to prioritise variables

I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data; the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that there is a maximum number of sessions in a day, preferably by their weight, with each day capped at a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
#Set up matrix to hold results; each row represents day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r){
for(j in 1:c){
lp <- make.lp(0, nrow(data))
set.type(lp, 1:nrow(data), "binary")
set.objfn(lp, rep(1, nrow(data)))
lp.control(lp, sense = "max")
add.constraint(lp, data$min, "<=", 400)
set.branch.weights(lp, data$weight)
solve(lp)
a <- get.variables(lp)*data$id
b <- a[a!=0]
tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
if(dim(data[!data$id == a,])[1] > 0) {
data <- data[!data$id== a,]
row <- row + 1
}
break
}
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup works to maximise number of sessions so that the total minutes in a day are close to 400 as possible, the setup doesn't follow the weights given. I expect my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order etc. but for some reason my setup doesn't seem to enforce "set.branch.weights".
I have read the documentation for "set.branch.weights" from lpSolveAPI but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
Should be
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480m.
My simple minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume sorted by weight
daynr = 1
daymax = 480
dayusd = 0
for (i in 1:nrow(df))
{
v = df$min[i]
dayusd = dayusd + v
if (dayusd>daymax)
{
daynr = daynr + 1
dayusd = v
}
df$day[[i]] = daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
>
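The same greedy fill can be wrapped in a function so the cap is a parameter and the result vector is allocated up front. This is a sketch, assuming sessions are already sorted by weight and no single session exceeds the cap; assign_days is a hypothetical name.

```r
# Hypothetical helper: walk through sessions in order and start a new day
# whenever adding the next session would exceed the cap
assign_days <- function(minutes, cap) {
  day <- integer(length(minutes))
  current_day <- 1L
  used <- 0
  for (i in seq_along(minutes)) {
    if (used + minutes[i] > cap) {
      current_day <- current_day + 1L
      used <- 0
    }
    used <- used + minutes[i]
    day[i] <- current_day
  }
  day
}

minutes <- c(67, 72, 36, 91, 80, 44, 76, 58, 84, 96, 21, 1, 41,
             66, 89, 62, 11, 42, 68, 25, 44, 90, 4, 33, 31)
day <- assign_days(minutes, cap = 480)
table(day)                  # sessions per day
tapply(minutes, day, sum)   # minutes used per day
```

This keeps the strict weight order but does not maximise the number of sessions per day; it only never splits the order.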
I will concentrate on the first solve. We basically solve a knapsack problem (objective + one constraint): maximise sum_i x_i subject to sum_i min_i * x_i <= 400, with every x_i binary.
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective to
I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.
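The modified objective itself is not shown above. One objective with the described behaviour, offered here purely as an assumption, is sum_i (1 + weight_i) * x_i: because the weights sum to 1, the weight term can only break ties between selections of equal count. The sketch below checks this by brute force on hypothetical toy data (8 sessions, 100-minute cap) rather than with lpSolveAPI, so it runs without a solver.

```r
# Hypothetical toy data, not taken from the question
min_needed <- c(30, 40, 30, 50, 20, 60, 40, 30)
weight <- (1:8) / sum(1:8)   # weights sum to 1, as in the question
cap <- 100

# Enumerate all 2^8 binary selections and keep the feasible ones
sel <- as.matrix(expand.grid(rep(list(0:1), 8)))
ok <- drop(sel %*% min_needed) <= cap
feasible <- sel[ok, , drop = FALSE]

count <- rowSums(feasible)           # plain objective: number of sessions
wsum <- drop(feasible %*% weight)    # total weight of each selection

best_plain <- max(count)             # best count under the plain objective
best_idx <- which.max(count + wsum)  # assumed objective: sum((1 + w_i) * x_i)
chosen <- unname(which(feasible[best_idx, ] == 1))

best_plain
count[best_idx]   # the count is not sacrificed for weight
chosen            # sessions picked under the assumed objective
```

In lpSolveAPI terms this would correspond to passing 1 + data$weight to set.objfn instead of rep(1, nrow(data)), again under the assumption that this matches the answer's intent.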

R iterate over a data frame to add a new column with sequential values

Here is my data frame "data.frame"
X Y
1 10 12
2 20 22
3 30 32
Below what I want.
1) add a new column named "New_col"
2) each cell of a given id is a sequence from X-value to Y-value (step of 1).
X Y New_col
1 10 12 10
11
12
2 20 22 20
21
22
3 30 32 30
31
32
Then fill the empty cells
X Y New_col
1 10 12 10
1 10 12 11
1 10 12 12
2 20 22 20
2 20 22 21
2 20 22 22
3 30 32 30
3 30 32 31
3 30 32 32
I tried the following:
New_col<-seq(from = data.frame$X, to = data.frame$Y, by = 1)
The problem is that this code computes the sequence only for the first row.
Then I tried a loop:
for (i in 1: length(data.frame$X))
{
New_col <-seq(from = data.frame$X, to = data.frame$Y, by = 1)
}
This is the error I got:
Error in seq.default(from = data.frame$X, to = data.frame$Y, by = 1) :
'from' must be of length 1
Thank you for your help.
You can use apply:
do.call(rbind, apply(dat, 1, function(x)
data.frame(X = x[1], Y = x[2], New_col = seq(x[1], x[2]))))
where dat is the name of your data frame. You can ignore the warnings.
X Y New_col
1.1 10 12 10
1.2 10 12 11
1.3 10 12 12
2.1 20 22 20
2.2 20 22 21
2.3 20 22 22
3.1 30 32 30
3.2 30 32 31
3.3 30 32 32
This is a good use case for the data.table package (which you would have to install first):
dat = read.table(text=" X Y
1 10 12
2 20 22
3 30 32")
library(data.table)
dt = as.data.table(dat)
Once you've got your data table set up, by makes this operation easy:
dt2 = dt[, list(New_col=seq(X, Y)), by=c("X", "Y")]
# X Y New_col
# 1: 10 12 10
# 2: 10 12 11
# 3: 10 12 12
# 4: 20 22 20
# 5: 20 22 21
# 6: 20 22 22
# 7: 30 32 30
# 8: 30 32 31
# 9: 30 32 32
(The only disclaimer is that this will not work if there are duplicate (X, Y) pairs in your original data frame).
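Here is a base R sketch that also copes with duplicate (X, Y) pairs, since it expands row by row rather than grouping by value. It assumes X <= Y in every row; the duplicated first row in this hypothetical data is there only to illustrate the point.

```r
# Hypothetical data with a duplicated (X, Y) pair
dat <- data.frame(X = c(10, 10, 20), Y = c(12, 12, 21))

# One sequence per row
seqs <- Map(seq, dat$X, dat$Y)

# Repeat each row once per element of its sequence, then attach the values
out <- dat[rep(seq_len(nrow(dat)), lengths(seqs)), ]
out$New_col <- unlist(seqs)
rownames(out) <- NULL
out
```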
