Split into groups based on (multiple) conditions? - r
I have a set of marbles of different colors and weights, and I want to split them into groups based on their weight and color.
The conditions are:
A group cannot weigh more than 100 units
A group cannot have more than 5 different-colored marbles.
A reproducible example:
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
color weight
1 1 22
2 15 33
3 13 35
4 11 13
5 6 26
6 8 15
7 10 3
8 16 22
9 14 21
10 3 16
11 4 26
12 20 30
13 9 31
14 2 16
15 7 12
16 17 13
17 19 19
18 5 17
19 12 12
20 18 40
And what I want is this group column:
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
TIA.
The approach below doesn't produce an optimal assignment to the groups; it just works sequentially through the data frame. It uses rowwise and might not be the most efficient way, as it's not a vectorized approach.
library(dplyr)
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
Below I create a function that is applied row by row using dplyr:
assign_group <- function(color, weight) {
  # Conditions, evaluated as if this marble were added to the current group
  clists <- append(color_list, color)
  sum_val <- group_sum + weight
  num_colors <- length(unique(clists))  # count colors including the new marble
  assign_condition <- (sum_val <= 100 & num_colors <= 5)
  # assign globals: extend the current group, or start a new one with this marble
  cval <- if (assign_condition) clists else c(color)
  sval <- ifelse(assign_condition, sum_val, weight)
  gval <- ifelse(assign_condition, group_number, group_number + 1)
  assign("color_list", cval, envir = .GlobalEnv)
  assign("group_sum", sval, envir = .GlobalEnv)
  assign("group_number", gval, envir = .GlobalEnv)
  return(gval)
}
I then set up a few global variables to track the allocation of the marbles to each group. These must be reset before every run.
# globals
color_list <- c()
group_sum <- 0
group_number <- 1
Finally run this function using mutate
test <- marbles %>% rowwise() %>% mutate(group = assign_group(color,weight)) %>% data.frame()
Which results in the below
color weight group
1 6 27 1
2 12 16 1
3 15 32 1
4 20 25 1
5 19 5 2
6 2 21 2
7 16 39 2
8 17 4 2
9 11 16 2
10 7 7 3
11 10 5 3
12 1 30 3
13 13 7 3
14 9 39 3
15 14 7 4
16 8 17 4
17 18 9 4
18 4 36 4
19 3 1 4
20 5 3 5
And seems to meet the constraints
test %>% group_by(group) %>% summarise(tot_w = sum(weight), n_c = length(unique(color)) )
group tot_w n_c
<dbl> <int> <int>
1 1 100 4
2 2 85 5
3 3 88 5
4 4 70 5
5 5 3 1
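The same sequential assignment can also be written without global variables. The following is only a sketch of that idea using a plain loop; assign_groups, max_weight and max_colors are names made up for illustration, and the data are the 20 marbles printed in the question.

```r
# Sequential (greedy) grouping without global state.
# assign_groups, max_weight and max_colors are illustrative names.
assign_groups <- function(color, weight, max_weight = 100, max_colors = 5) {
  group <- integer(length(weight))
  current <- 1L        # group currently being filled
  colors_seen <- c()   # distinct colors in the current group
  total <- 0           # total weight of the current group
  for (i in seq_along(weight)) {
    new_colors <- union(colors_seen, color[i])
    fits <- total + weight[i] <= max_weight && length(new_colors) <= max_colors
    if (!fits && length(colors_seen) > 0) {
      # start a new group that holds only this marble
      current <- current + 1L
      new_colors <- color[i]
      total <- 0
    }
    colors_seen <- new_colors
    total <- total + weight[i]
    group[i] <- current
  }
  group
}

# The marbles as printed in the question
marbles <- data.frame(
  color  = c(1, 15, 13, 11, 6, 8, 10, 16, 14, 3, 4, 20, 9, 2, 7, 17, 19, 5, 12, 18),
  weight = c(22, 33, 35, 13, 26, 15, 3, 22, 21, 16, 26, 30, 31, 16, 12, 13, 19, 17, 12, 40)
)
marbles$group <- assign_groups(marbles$color, marbles$weight)
marbles$group
# [1] 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5
```

Like the rowwise version, this is greedy and not an optimal packing. A single marble heavier than max_weight would still get a group of its own.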
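In base R you could write a recursive function as shown below:
create_group = function(df, a){
  # initial guess: cut wherever the running weight passes a multiple of 100
  if (missing(a)) a = cumsum(df$weight) %/% 100
  # b: TRUE for every 6th row within a group (one marble too many)
  b = !ave(df$color, a, FUN = seq_along) %% 6
  # d: TRUE where the running weight within a shifted group exceeds 100
  d = ave(df$weight, a + b, FUN = cumsum) > 100
  a = a + b + d
  # repeat until no row has to be moved
  if (any(b | d)) create_group(df, a) else cbind(df, group = a + 1)
}
create_group(marbles)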
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
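Note that b in the function above counts rows per group, which equals the number of colors only because every color appears once in this example. Either way, it can be worth checking a result against both constraints. A minimal base R sketch follows; check_groups and its arguments are illustrative names, and res simply retypes the output printed above.

```r
# Summarise each group and flag whether it satisfies both constraints.
# check_groups, max_weight and max_colors are illustrative names.
check_groups <- function(df, max_weight = 100, max_colors = 5) {
  tot_w <- tapply(df$weight, df$group, sum)
  n_c   <- tapply(df$color, df$group, function(x) length(unique(x)))
  data.frame(
    group = as.integer(names(tot_w)),
    tot_w = as.vector(tot_w),
    n_c   = as.vector(n_c),
    ok    = as.vector(tot_w <= max_weight & n_c <= max_colors)
  )
}

# The grouped result printed above, typed in by hand
res <- data.frame(
  color  = c(1, 15, 13, 11, 6, 8, 10, 16, 14, 3, 4, 20, 9, 2, 7, 17, 19, 5, 12, 18),
  weight = c(22, 33, 35, 13, 26, 15, 3, 22, 21, 16, 26, 30, 31, 16, 12, 13, 19, 17, 12, 40),
  group  = rep(1:5, times = c(3, 5, 4, 5, 3))
)
chk <- check_groups(res)
chk
```

Every row of chk should have ok == TRUE if the grouping respects both limits.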
Related
Adding a sequential value using a loop
I have a large data frame which is effectively the combined output from a nested list, built with do.call(rbind, nested_list). The output has the same number of rows for each list element (e.g. 5 rows per list), and I need to add a column which has a unique numeric code for each list (or group).
How can I write a loop to reproduce the Group column I have included in the example below, where the first five rows have Group == 1, rows 6 to 10 have Group == 2, and rows 11 to 15 have Group == 3?
df <- data.frame("ID" = 1:15)
df$Var_A <- c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29)
df$Var_B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16)
df$Var_C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$Var_D <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$New_A <- c(2,5,5,8,11,14,15,17,20,21,22,23,25,25,27)
df$New_B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16)
df$New_C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$New_D <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$Group <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
   ID Var_A Var_B Var_C Var_D New_A New_B New_C New_D Group
1   1     1    10    10    10     2    10    10    10     1
2   2     3     0    12    12     5     0    12    12     1
3   3     5     0    14    14     5     0    14    14     1
4   4     7     0    16    16     8     0    16    16     1
5   5     9    12    10    10    11    12    10    10     1
6   6    11    12    12    12    14    12    12    12     2
7   7    13    12    14    14    15    12    14    14     2
8   8    15    12    16    16    17    12    16    16     2
9   9    17     0    10    10    20     0    10    10     2
10 10    19    14    12    12    21    14    12    12     2
11 11    21    NA    14    14    22    NA    14    14     3
12 12    23    14    16    16    23    14    16    16     3
13 13    25    16    10    10    25    16    10    10     3
14 14    27    16    12    12    25    16    12    12     3
15 15    29    16    14    14    27    16    14    14     3
You can use the ceiling function:
df <- data.frame("ID" = 1:15)
df$Var_A <- c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29)
df$Var_B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16)
df$Var_C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$Var_D <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$New_A <- c(2,5,5,8,11,14,15,17,20,21,22,23,25,25,27)
df$New_B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16)
df$New_C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$New_D <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14)
df$Group <- ceiling(as.numeric(df$ID)/5)
df
#    ID Var_A Var_B Var_C Var_D New_A New_B New_C New_D Group
# 1   1     1    10    10    10     2    10    10    10     1
# 2   2     3     0    12    12     5     0    12    12     1
# 3   3     5     0    14    14     5     0    14    14     1
# 4   4     7     0    16    16     8     0    16    16     1
# 5   5     9    12    10    10    11    12    10    10     1
# 6   6    11    12    12    12    14    12    12    12     2
# 7   7    13    12    14    14    15    12    14    14     2
# 8   8    15    12    16    16    17    12    16    16     2
# 9   9    17     0    10    10    20     0    10    10     2
# 10 10    19    14    12    12    21    14    12    12     2
# 11 11    21    NA    14    14    22    NA    14    14     3
# 12 12    23    14    16    16    23    14    16    16     3
# 13 13    25    16    10    10    25    16    10    10     3
# 14 14    27    16    12    12    25    16    12    12     3
# 15 15    29    16    14    14    27    16    14    14     3
Without adding an ID or row numbers, we can do this using nrow and knowledge of the group length.
group_len <- 5
groups <- nrow(df)/group_len
df$group <- rep(1:groups, each = group_len)
# Example:
# rep(1:3, each = 5)
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
I'd use dplyr::mutate with dplyr::row_number:
library(dplyr)
df %>% mutate(Group=ceiling(row_number() / 5))
Output:
   ID Var_A Var_B Var_C Var_D New_A New_B New_C New_D Group
1   1     1    10    10    10     2    10    10    10     1
2   2     3     0    12    12     5     0    12    12     1
3   3     5     0    14    14     5     0    14    14     1
4   4     7     0    16    16     8     0    16    16     1
5   5     9    12    10    10    11    12    10    10     1
6   6    11    12    12    12    14    12    12    12     2
7   7    13    12    14    14    15    12    14    14     2
8   8    15    12    16    16    17    12    16    16     2
9   9    17     0    10    10    20     0    10    10     2
10 10    19    14    12    12    21    14    12    12     2
11 11    21    NA    14    14    22    NA    14    14     3
12 12    23    14    16    16    23    14    16    16     3
13 13    25    16    10    10    25    16    10    10     3
14 14    27    16    12    12    25    16    12    12     3
15 15    29    16    14    14    27    16    14    14     3
An option would be to combine cumsum with rep.
cumsum(rep_len(c(TRUE, rep(FALSE, 4)), nrow(df)))
#cumsum(rep_len(c(TRUE, FALSE, FALSE, FALSE, FALSE), nrow(df))) #Alternative
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Or making use of auto repeat.
df$Group <- c(TRUE, rep(FALSE, 4))
df$Group <- cumsum(df$Group)
df$Group
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Or create a sequence with length of nrow and make an integer division %/%.
0:(nrow(df)-1) %/% 5
#seq(0, nrow(df)-1) %/% 5 #Alternative
#(seq_len(nrow(df))-1) %/% 5 #Alternative
# [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
Or using rep:
rep(1:ceiling(nrow(df)/5), each=5, length.out=nrow(df))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
You could use the cut function with labels = FALSE to return an integer to use for the group.
n_per_group <- 5
df$group <- cut(x = df$ID, breaks = nrow(df) / n_per_group, labels = FALSE)
df$group
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
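When breaks is a single number, cut splits the range of the values into equal-width intervals, so the call above depends on ID running evenly from 1 to nrow(df). A hedged variant is to cut the row positions at explicit break points, which also copes with a row count that is not a multiple of the group size. The 12-row count below is a made-up example.

```r
n_per_group <- 5
n <- 12  # hypothetical number of rows, deliberately not a multiple of 5

# Break points 0, 5, 10, 15: intervals (0,5], (5,10], (10,15]
brk <- seq(0, n + n_per_group, by = n_per_group)
grp <- cut(seq_len(n), breaks = brk, labels = FALSE)
grp
# [1] 1 1 1 1 1 2 2 2 2 2 3 3
```

With a data frame, seq_len(nrow(df)) would take the place of seq_len(n).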
Fill zeros for missing values in R
I am trying to deal with this problem. I have a df with a date column and I want to count the occurrences per hour. Here is what I've done:
x <- df %>% mutate(hora = hour(date)) %>% select(hora) %>% count(hora)
That gives as a result:
> x
# A tibble: 19 x 2
    hora     n
   <int> <int>
 1     0     1
 2     1     1
 3     3     1
 4     8     4
 5     9     7
 6    10    10
 7    11    14
 8    12    10
 9    13     8
10    14     4
11    15     5
12    16    12
13    17     4
14    18    12
15    19     9
16    20     5
17    21     2
18    22     4
19    23     4
As you can see, there are hours that don't show up that would have n=0, like 2 or 4:7. What I want is to add the hours that are not in x with n=0 so the table is complete. The expected output should be something like this:
   hora  n
1     0 12
2     1  3
3     2  5
4     3  7
5     4  8
6     5  1
7     6  0
8     7 11
9     8  6
10    9 10
11   10  9
12   11  0
13   12  0
14   13  3
15   14  0
16   15  7
17   16  8
18   17  1
19   18  2
20   19 11
21   20  6
22   21 10
23   22  9
24   23  4
I tried creating a table with hours 0:23 and all n=0 and summing the two tables, but obviously that didn't work. I also tried x$hour <- 0:23, thinking that the missing values would be added, but that didn't work either.
You could convert hora to factor and use .drop = FALSE in count:
library(dplyr)
library(lubridate)
df %>%
  mutate(hora = factor(hour(date), levels = 0:23)) %>%
  count(hora, .drop = FALSE)
Another option is to use complete:
df %>%
  mutate(hora = hour(date)) %>%
  count(hora) %>%
  tidyr::complete(hora = 0:23, fill = list(n = 0))
A solution in base R merges a vector of hours with the summarized data, and sets the missing counts to 0.
textFile <- "row hour count
1 0 1
2 1 1
3 3 1
4 8 4
5 9 7
6 10 10
7 11 14
8 12 10
9 13 8
10 14 4
11 15 5
12 16 12
13 17 4
14 18 12
15 19 9
16 20 5
17 21 2
18 22 4
19 23 4"
data <- read.table(text = textFile,header = TRUE)[-1]
hours <- data.frame(hour = 0:23)
merged <- merge(data,hours,all.y = TRUE)
merged[is.na(merged$count),"count"] <- 0
...and the output:
> head(merged)
  hour count
1    0     1
2    1     1
3    2     0
4    3     1
5    4     0
6    5     0
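If the raw hours are still available, another base R route is to count them with table() on a factor that lists all 24 levels, so empty hours show up as 0 directly. This is only a sketch; the hora vector below is made-up sample data standing in for hour(df$date).

```r
# Hypothetical hours extracted from a date column
hora <- c(0, 1, 3, 8, 8, 9, 23)

# A factor with all 24 levels makes table() report empty hours as 0
counts <- table(factor(hora, levels = 0:23))
x <- data.frame(hora = 0:23, n = as.vector(counts))
head(x, 4)
#   hora n
# 1    0 1
# 2    1 1
# 3    2 0
# 4    3 1
```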
Is there any method to sort the matrix by both column and row in R?
Could you help me? I have a matrix like this. The first column and the first row are the IDs. I need to sort it by both column ID and row ID like this. Thanks!
Two thoughts:
mat <- matrix(1:25, nr=5, dimnames=list(c('4',3,5,2,1), c('4',3,5,2,1)))
mat
#   4  3  5  2  1
# 4 1  6 11 16 21
# 3 2  7 12 17 22
# 5 3  8 13 18 23
# 2 4  9 14 19 24
# 1 5 10 15 20 25
If you want a strictly alphabetic ordering, then this will work:
mat[order(rownames(mat)),order(colnames(mat))]
#    1  2  3 4  5
# 1 25 20 10 5 15
# 2 24 19  9 4 14
# 3 22 17  7 2 12
# 4 21 16  6 1 11
# 5 23 18  8 3 13
This will not work well if the names are intended to be ordered numerically:
mat <- matrix(1:30, nr=3, dimnames=list(c('2',1,3), c('4',3,5,2,1,6,7,8,9,10)))
mat
#   4 3 5  2  1  6  7  8  9 10
# 2 1 4 7 10 13 16 19 22 25 28
# 1 2 5 8 11 14 17 20 23 26 29
# 3 3 6 9 12 15 18 21 24 27 30
mat[order(rownames(mat)),order(colnames(mat))]
#    1 10  2 3 4 5  6  7  8  9
# 1 14 29 11 5 2 8 17 20 23 26
# 2 13 28 10 4 1 7 16 19 22 25
# 3 15 30 12 6 3 9 18 21 24 27
The columns come out in the order 1, 10, 2, ... because the names are compared as strings. For numeric ordering, you need a slight modification:
mat[order(as.numeric(rownames(mat))),order(as.numeric(colnames(mat)))]
#    1  2 3 4 5  6  7  8  9 10
# 1 14 11 5 2 8 17 20 23 26 29
# 2 13 10 4 1 7 16 19 22 25 28
# 3 15 12 6 3 9 18 21 24 27 30
R - Index position with condition
I have a data frame like this:
w<-c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)
I would like an index position starting at each value 1.
Output: NA,NA,NA,NA,NA,1,2,3,4,5,6,7,1,2,3,4,5,1,2,3,4,5,6,7,8,9
Ideally applicable to a data frame. Thanks
Edit: w is a data frame, roughly built like this:
m<-as.data.frame(w)
m[m!=1] <- row(m)[m!=1]
m
    w
1   1
2   2
3   3
4   4
5   5
6   1
7   7
8   8
9   9
10 10
11 11
12 12
13  1
14 14
15 15
16 16
17 17
18  1
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
but with the index returning to 1 wherever the value is 1.
> m
    w wanted
1   1     NA
2   2     NA
3   3     NA
4   4     NA
5   5     NA
6   1      1
7   7      2
8   8      3
9   9      4
10 10      5
11 11      6
12 12      7
13  1      1
14 14      2
15 15      3
16 16      4
17 17      5
18  1      1
19 19      2
20 20      3
21 21      4
22 22      5
23 23      6
24 24      7
25 25      8
26 26      9
Thanks
This assumes that the data is ordered in the way shown in the example.
m$wanted <- with(m, ave(w, cumsum(c(TRUE,diff(w) <0)), FUN=seq_along))
m$wanted
#[1] 1 2 3 4 5 1 2 3 4 5 6 7 1 2 3 4 5 1 2 3 4 5 6 7 8 9
For the given data including repeated 1's and non-sequential input, the following works:
m[9,1] <- 100
m[3,1] <- 55
m[14,1] <- 60
m[25,1] <- 1
m[19,1] <- 1
m$result <- 1:nrow(m) - which(m$w == 1)[cumsum(m$w == 1)] + 1
But if the data does not start on 1:
m[1,1] <- 2
Then this works:
firstone <- which(m$w == 1)[1]
subindex <- m[firstone:nrow(m),'w'] == 1
m$result <- c(rep(NA,firstone-1),1:length(subindex) - which(subindex)[cumsum(subindex)] + 1)
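For the original 0/1 vector from the question, the NA prefix can also be produced directly. This is a sketch of one way to do it, not either answerer's method: cumsum(w == 1) labels the stretch that starts at each 1, ave numbers the positions within each stretch, and everything before the first 1 is set to NA. The names run and wanted are chosen for illustration.

```r
w <- c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)

run <- cumsum(w == 1)   # 0 before the first 1, then 1, 2, 3 for each stretch
wanted <- ave(seq_along(w), run, FUN = seq_along)  # position within each stretch
wanted[run == 0] <- NA  # nothing to count before the first 1
wanted
#  [1] NA NA NA NA NA  1  2  3  4  5  6  7  1  2  3  4  5  1  2  3  4  5  6  7  8  9
```

Because it only tests w == 1, it would apply equally to the column m$w after the row-number substitution shown in the question, provided no other row holds the value 1.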
Changing every set of 5 rows in R
I have a data frame that looks like this:
df <- list()
df$a <- 1:20
df$b <- 2:21
df$c <- 3:22
df <- as.data.frame(df)
> df
    a  b  c
1   1  2  3
2   2  3  4
3   3  4  5
4   4  5  6
5   5  6  7
6   6  7  8
7   7  8  9
8   8  9 10
9   9 10 11
10 10 11 12
11 11 12 13
12 12 13 14
13 13 14 15
14 14 15 16
15 15 16 17
16 16 17 18
17 17 18 19
18 18 19 20
19 19 20 21
20 20 21 22
I would like to add another column to the data frame (df$d) so that every block of 5 rows (df$d[seq(1, nrow(df), 4)]) takes the value of df$a from the first row of that block. I have tried the manual way, but was wondering if there is a for loop or shorter way that can do this easily. I'm new to R, so I apologize if this seems trivial to some people.
"Manual" way:
df$d[1:5] <- df$a[1]
df$d[6:10] <- df$a[6]
df$d[11:15] <- df$a[11]
df$d[16:20] <- df$a[16]
> df
    a  b  c  d
1   1  2  3  1
2   2  3  4  1
3   3  4  5  1
4   4  5  6  1
5   5  6  7  1
6   6  7  8  6
7   7  8  9  6
8   8  9 10  6
9   9 10 11  6
10 10 11 12  6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
I have tried
for (i in 1:nrow(df)) {df$d[i:(i+4)] <- df$a[seq(1, nrow(df), 4)]}
But this is not going the way I want it to. What am I doing wrong?
This should work:
df$d <- rep(df$a[seq(1,nrow(df),5)],each=5)
And here's a data.table solution:
library(data.table)
dt = data.table(df)
dt[, d := a[1], by = (seq_len(nrow(dt))-1) %/% 5]
I'd use logical indexing after initializing to NA
df$d <- NA
df$d <- rep(df$a[ c(TRUE, rep(FALSE,4)) ], each=5)
df
#--------
    a  b  c  d
1   1  2  3  1
2   2  3  4  1
3   3  4  5  1
4   4  5  6  1
5   5  6  7  1
6   6  7  8  6
7   7  8  9  6
8   8  9 10  6
9   9 10 11  6
10 10 11 12  6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
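The rep(..., each = 5) answers rely on the number of rows being a multiple of 5. A sketch that drops this assumption computes, for every row, the index of the first row of its block and looks df$a up there. The 22-row data frame and the name block_start are made up for illustration.

```r
# Hypothetical 22-row data frame: the last block has only 2 rows
df <- data.frame(a = 1:22, b = 2:23, c = 3:24)

# Row index at which each row's block of 5 begins: 1, 1, 1, 1, 1, 6, 6, ...
block_start <- (seq_len(nrow(df)) - 1) %/% 5 * 5 + 1
df$d <- df$a[block_start]
tail(df, 3)
#     a  b  c  d
# 20 20 21 22 16
# 21 21 22 23 21
# 22 22 23 24 21
```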