Use dplyr::group_by in combination with purrr::pmap

I have the following dataframe:
df <- data.frame(a = c(1:20),
b = c(2:21),
c = as.factor(c(rep(1,5), rep(2,10), rep(3,5))))
and I want to do the following:
df1 <- df %>% group_by(c) %>% mutate(a = lead(b))
but my real data has many variables to which I need to apply lead() after grouping with group_by(). I tried purrr::pmap() to achieve this:
df2 <- pmap(list(df[,1],df[,2],df[,3]), function(x,y,z) group_by(z) %>% lead(y))
Unfortunately this results in an error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"

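pmap() calls the function once per row with single values, so group_by() receives a bare value rather than a data frame, which is why it fails. You can instead use mutate_at() with named arguments to funs(), which creates new columns rather than overwriting existing ones. Note that this leaves a unchanged; you can rename the columns afterwards as desired.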
df <- data.frame(
a = c(1:20),
b = c(2:21),
b2 = 3:22,
b3 = 4:23,
c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)
library(tidyverse)
df %>%
group_by(c) %>%
mutate_at(vars(starts_with("b")), funs(lead = lead(.)))
#> # A tibble: 20 x 8
#> # Groups: c [3]
#> a b b2 b3 c b_lead b2_lead b3_lead
#> <int> <int> <int> <int> <fct> <int> <int> <int>
#> 1 1 2 3 4 1 3 4 5
#> 2 2 3 4 5 1 4 5 6
#> 3 3 4 5 6 1 5 6 7
#> 4 4 5 6 7 1 6 7 8
#> 5 5 6 7 8 1 NA NA NA
#> 6 6 7 8 9 2 8 9 10
#> 7 7 8 9 10 2 9 10 11
#> 8 8 9 10 11 2 10 11 12
#> 9 9 10 11 12 2 11 12 13
#> 10 10 11 12 13 2 12 13 14
#> 11 11 12 13 14 2 13 14 15
#> 12 12 13 14 15 2 14 15 16
#> 13 13 14 15 16 2 15 16 17
#> 14 14 15 16 17 2 16 17 18
#> 15 15 16 17 18 2 NA NA NA
#> 16 16 17 18 19 3 18 19 20
#> 17 17 18 19 20 3 19 20 21
#> 18 18 19 20 21 3 20 21 22
#> 19 19 20 21 22 3 21 22 23
#> 20 20 21 22 23 3 NA NA NA
Created on 2018-09-07 by the reprex package (v0.2.0).
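In newer dplyr releases funs() is deprecated and mutate_at() is superseded by across(). As a minimal sketch, assuming dplyr 1.0.0 or later and a reduced version of the example data, the same grouped lead could be written like this (the name res is only illustrative):

```r
library(dplyr)

# Reduced version of the example data
df <- data.frame(
  a = 1:20,
  b = 2:21,
  b2 = 3:22,
  c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)

res <- df %>%
  group_by(c) %>%
  # apply lead() to every column starting with "b" and
  # store the result in new columns named <column>_lead
  mutate(across(starts_with("b"), ~ lead(.x), .names = "{.col}_lead")) %>%
  ungroup()
```

The .names argument plays the role of the named funs() argument above: it keeps the original columns and appends the suffix to the new ones.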

Related

Add rows using name columns

Is it possible to create two rows using the name of the columns?
I need to separate the DX columns from the SX columns into separate rows, and keep track of which side each row came from by adding a DX/SX column. Some columns are common to both, in this case X, and Num is the key.
a = read.table(text="
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8
14 35 24 1 7 18 2 8
15 10 1 5 16 -10 5 3 ",h=T)
b= read.table(text="Num X ST AB XB DX/SX
12 3 9 5 3 DX
12 3 11 3 7 SX
13 35 24 1 7 DX
13 35 18 2 8 SX
14 35 24 1 7 DX
14 35 18 2 8 SX
15 10 1 5 16 DX
15 10 -10 5 3 SX",h=T)
My idea was to split the data and then join it back together, but that is cumbersome.
I have tried this code:
c <- sapply(c("DX", "SX",""),
function(x) a[endsWith(names(a),x)],
simplify = FALSE)
But the problem is X and Num, because I would like to keep them in the same data frame together with the DX and SX columns.
There are more elegant and compact approaches for sure, but here's an example of how you could achieve this by simple renaming and row binding:
a = read.table(text="
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8
14 35 24 1 7 18 2 8
15 10 1 5 16 -10 5 3 ",h=T)
library(dplyr)
library(stringr)
dx <- a %>%
select(1,2,ends_with("DX")) %>%
rename_with(~ str_remove(.x, "DX$"), .cols = -c(1:2)) %>%
mutate(`DX/SX` = "DX" )
dx
#> Num X ST AB XB DX/SX
#> 1 12 3 9 5 3 DX
#> 2 13 35 24 1 7 DX
#> 3 14 35 24 1 7 DX
#> 4 15 10 1 5 16 DX
sx <- a %>%
select(1,2,ends_with("SX")) %>%
rename_with(~ str_remove(.x, "SX$"), .cols = -c(1:2)) %>%
mutate(`DX/SX` = "SX" )
sx
#> Num X ST AB XB DX/SX
#> 1 12 3 11 3 7 SX
#> 2 13 35 18 2 8 SX
#> 3 14 35 18 2 8 SX
#> 4 15 10 -10 5 3 SX
bind_rows(dx,sx) %>%
arrange(Num)
#> Num X ST AB XB DX/SX
#> 1 12 3 9 5 3 DX
#> 2 12 3 11 3 7 SX
#> 3 13 35 24 1 7 DX
#> 4 13 35 18 2 8 SX
#> 5 14 35 24 1 7 DX
#> 6 14 35 18 2 8 SX
#> 7 15 10 1 5 16 DX
#> 8 15 10 -10 5 3 SX
Created on 2022-10-12 with reprex v2.0.2
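As one of the more compact approaches alluded to above, here is a sketch using tidyr::pivot_longer() with the special ".value" name. It assumes that every column to be split ends in exactly DX or SX; the regular expression and the name long are illustrative choices, not part of the original answer.

```r
library(tidyr)

a <- read.table(text = "
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8
14 35 24 1 7 18 2 8
15 10 1 5 16 -10 5 3", header = TRUE)

long <- pivot_longer(
  a,
  cols = -c(Num, X),                # Num and X are shared by both sides
  names_to = c(".value", "DX/SX"),  # prefix becomes the column, suffix the label
  names_pattern = "^(.+)(DX|SX)$"
)
```

Each original row becomes two rows, one labelled DX and one labelled SX, with the ST, AB and XB values taken from the matching columns.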

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows whose BB value equals that of both the previous and the next row, so that only the first and last occurrence of each run of equal BB values is kept. I also want to keep NA rows. I arrived at this code, which is close to what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last 21 in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any Idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
group_by(grp = data.table::rleid(BB)) %>%
slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
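For context, the original filter() drops the row with AA = 4 because its neighbour comparison involves NA: BB != lag(BB) is FALSE there, BB != lead(BB) is NA, and FALSE | NA | FALSE evaluates to NA, which filter() treats as "do not keep". The run-based idea avoids comparing against NA neighbours. A base R sketch of the same idea, assuming only rle() and duplicated() (the names runs, grp and lighten_df are illustrative):

```r
AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)

# Run id for consecutive equal values; rle() starts a new run at every NA
runs <- rle(df$BB)
grp <- rep(seq_along(runs$lengths), runs$lengths)

# Keep the first and the last row of each run
keep <- !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
lighten_df <- df[keep, ]
```

With this particular data no run is longer than two rows, so every row is kept, which agrees with the expected result in the question.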

Duplicate clustered observations and create a unique identifiers for the duplicated clusters

Consider the small dataset df1. There are 5 clusters identified by ID, row_numbers contains a unique value for each observation, and weights specifies how many copies of each cluster we want.
df1 <-data.frame(ID=c("10","20","30","30","30", "40", "40","50","50","50","50"), row_numbers = c(1,2,3,4,5,6,7,8,9,10,11),weights=c(4,3,2,2,2,3,3,2,2,2,2))
df1
#> ID row_numbers weights
#> 1 10 1 4
#> 2 20 2 3
#> 3 30 3 2
#> 4 30 4 2
#> 5 30 5 2
#> 6 40 6 3
#> 7 40 7 3
#> 8 50 8 2
#> 9 50 9 2
#> 10 50 10 2
#> 11 50 11 2
The expected output is df2
The most important part of df2 is the new variable newID, which stores a unique identifier for each duplicated cluster as consecutive integers starting from 1.
df2 <-data.frame(ID=c("10","10","10","10","20","20","20","30","30","30","30","30","30", "40", "40","40", "40","40", "40","50","50","50","50","50","50","50","50"), row_numbers = c(1,1,1,1,2,2,2,3,3,4,4,5,5,6,6,6,7,7,7,8,8,9,9,10,10,11,11),weights=c(4,4,4,4,3,3,3,2,2,2,2,2,2,3,3,3,3,3,3,2,2,2,2,2,2,2,2), newID= c(1,2,3,4,5,6,7,8,8,8,9,9,9,10,10,11,11,12,12,13,13,13,13,14,14,14,14))
df2
#> ID row_numbers weights newID
#> 1 10 1 4 1
#> 2 10 1 4 2
#> 3 10 1 4 3
#> 4 10 1 4 4
#> 5 20 2 3 5
#> 6 20 2 3 6
#> 7 20 2 3 7
#> 8 30 3 2 8
#> 9 30 3 2 8
#> 10 30 4 2 8
#> 11 30 4 2 9
#> 12 30 5 2 9
#> 13 30 5 2 9
#> 14 40 6 3 10
#> 15 40 6 3 10
#> 16 40 6 3 11
#> 17 40 7 3 11
#> 18 40 7 3 12
#> 19 40 7 3 12
#> 20 50 8 2 13
#> 21 50 8 2 13
#> 22 50 9 2 13
#> 23 50 9 2 13
#> 24 50 10 2 14
#> 25 50 10 2 14
#> 26 50 11 2 14
#> 27 50 11 2 14
Here's a solution using a split-apply-bind approach:
df3 <- do.call(rbind, lapply(split(df1, df1$ID), function(x)
{
group_size <- nrow(x)
n_groups <- x$weights[1]
if(is.na(n_groups)) n_groups <- 1
if (n_groups < 1) n_groups <- 1
group_labels <- rep(paste(x$ID[1], seq(n_groups)), each = group_size)
x <- x[rep(seq(group_size), each = n_groups), ]
x$newID <- group_labels
x
}))
df3$newID <- as.numeric(as.factor(df3$newID))
df3 <- `rownames<-`(df3, seq(nrow(df3)))
Which matches your expected output:
df3
#> ID row_numbers weights newID
#> 1 10 1 4 1
#> 2 10 1 4 2
#> 3 10 1 4 3
#> 4 10 1 4 4
#> 5 20 2 3 5
#> 6 20 2 3 6
#> 7 20 2 3 7
#> 8 30 3 2 8
#> 9 30 3 2 8
#> 10 30 4 2 8
#> 11 30 4 2 9
#> 12 30 5 2 9
#> 13 30 5 2 9
#> 14 40 6 3 10
#> 15 40 6 3 10
#> 16 40 6 3 11
#> 17 40 7 3 11
#> 18 40 7 3 12
#> 19 40 7 3 12
#> 20 50 8 2 13
#> 21 50 8 2 13
#> 22 50 9 2 13
#> 23 50 9 2 13
#> 24 50 10 2 14
#> 25 50 10 2 14
#> 26 50 11 2 14
#> 27 50 11 2 14
And we can show this is identical to your desired result:
identical(df2, df3)
#> [1] TRUE
A solution with data.table:
library(data.table)
df1 <-data.frame(ID=c("10","20","30","30","30", "40", "40","50","50","50","50"), row_numbers = c(1,2,3,4,5,6,7,8,9,10,11),weights=c(4,3,2,2,2,3,3,2,2,2,2))
dt1 <- data.table(df1)
# .x is a data.table with columns ID, row_numbers and weights
duplicate_weight <- function(.x) {
  # rows kept as a single copy: missing, zero or unit weights
  # (rep(..., .N) keeps these pieces empty when no row matches)
  untouched <- list(
    .x[is.na(weights), .(ID, row_numbers, weights = rep(1, .N), repetition_position = rep(1L, .N))],
    .x[weights == 0, .(ID, row_numbers, weights = rep(1, .N), repetition_position = rep(1L, .N))],
    .x[weights == 1, .(ID, row_numbers, weights = rep(1, .N), repetition_position = rep(1L, .N))]
  )
  # distinct weights greater than 1
  weights_list <- sort(unique(.x[["weights"]]))
  weights_list <- weights_list[weights_list > 1]
  # repeat the rows according to their weight
  repeated <- lapply(weights_list, function(.y) { # for each weight
    rbindlist( # stack the copies into one data.table
      lapply(seq_len(.y), function(.z) { # one copy per repetition
        .x[weights == .y, .(ID, row_numbers, weights = 1, repetition_position = .z)]
      })
    )
  })
  result <- rbindlist(c(untouched, repeated))
  setorder(result, ID, repetition_position)
  result[, new_id := .GRP, by = .(ID, repetition_position)]
  result[, repetition_position := NULL]
  result
}
duplicate_weight(dt1)
It's similar to Allan Cameron's answer.
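Another way to sketch the duplication, assuming tidyr and dplyr (1.0.0 or later) are available, is to let tidyr::uncount() repeat each row and use the copy index to number the duplicated clusters. The column name copy and the object name df_dup are illustrative. Note that this pairs rows with newID so that every duplicated cluster contains each original row exactly once, so the row-by-row layout is not identical to the df2 listing in the question, although the number of rows and of newID values is the same.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  ID = c("10", "20", "30", "30", "30", "40", "40", "50", "50", "50", "50"),
  row_numbers = 1:11,
  weights = c(4, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2)
)

df_dup <- df1 %>%
  # repeat each row `weights` times; `copy` records which repetition it is
  uncount(weights, .remove = FALSE, .id = "copy") %>%
  # one new cluster per (original cluster, repetition) pair
  group_by(ID, copy) %>%
  mutate(newID = cur_group_id()) %>%
  ungroup() %>%
  arrange(newID, row_numbers)
```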

Split into groups based on (multiple) conditions?

I have set of marbles, of different colors and weights, and I want to split them into groups based on their weight and color.
The conditions are:
A group cannot weigh more than 100 units
A group cannot have more than 5 different-colored marbles.
A reproducible example:
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
color weight
1 1 22
2 15 33
3 13 35
4 11 13
5 6 26
6 8 15
7 10 3
8 16 22
9 14 21
10 3 16
11 4 26
12 20 30
13 9 31
14 2 16
15 7 12
16 17 13
17 19 19
18 5 17
19 12 12
20 18 40
And what I want is this group column:
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5
TIA.
The approach below doesn't find an optimal assignment to the groups; it just works sequentially through the data frame. It uses rowwise() and might not be the most efficient way, as it's not a vectorized approach.
library(dplyr)
marbles <- data.frame(color=sample(1:20, 20), weight=sample(1:40, 20, replace=T))
Below I create a rowwise function which we can apply using dplyr
assign_group <- function(color, weight) {
# Conditions
clists = append(color_list, color)
sum_val = group_sum + weight
num_colors = length(unique(clists)) # count colors including the new marble
assign_condition = (sum_val <= 100 & num_colors <= 5)
#assign globals
cval <- if(assign_condition) clists else c(color)
sval <- ifelse(assign_condition, sum_val, weight)
gval <- ifelse(assign_condition, group_number, group_number + 1)
assign("color_list", cval, envir = .GlobalEnv)
assign("group_sum", sval, envir = .GlobalEnv)
assign("group_number", gval, envir = .GlobalEnv)
res = group_number
return(res)
}
I then set up a few global variables to track the allocation of the marbles to each group.
# globals
color_list <<- c()
group_sum <<- 0
group_number <<- 1
Finally run this function using mutate
test <- marbles %>% rowwise() %>% mutate(group = assign_group(color,weight)) %>% data.frame()
Which results in the below
color weight group
1 6 27 1
2 12 16 1
3 15 32 1
4 20 25 1
5 19 5 2
6 2 21 2
7 16 39 2
8 17 4 2
9 11 16 2
10 7 7 3
11 10 5 3
12 1 30 3
13 13 7 3
14 9 39 3
15 14 7 4
16 8 17 4
17 18 9 4
18 4 36 4
19 3 1 4
20 5 3 5
And seems to meet the constraints
test %>% group_by(group) %>% summarise(tot_w = sum(weight), n_c = length(unique(color)) )
group tot_w n_c
<dbl> <int> <int>
1 1 100 4
2 2 85 5
3 3 88 5
4 4 70 5
5 5 3 1
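The same sequential (greedy, not optimal) assignment can also be sketched without global variables, using a plain loop that carries the running weight and the colors of the current group. The function name assign_groups and its arguments max_weight and max_colors are illustrative, and the data are copied from the table in the question.

```r
# Data copied from the question's table
marbles <- data.frame(
  color  = c(1, 15, 13, 11, 6, 8, 10, 16, 14, 3, 4, 20, 9, 2, 7, 17, 19, 5, 12, 18),
  weight = c(22, 33, 35, 13, 26, 15, 3, 22, 21, 16, 26, 30, 31, 16, 12, 13, 19, 17, 12, 40)
)

assign_groups <- function(color, weight, max_weight = 100, max_colors = 5) {
  group <- integer(length(weight))
  current <- 1L   # id of the group being filled
  total <- 0      # weight already in the current group
  colors <- c()   # distinct colors already in the current group
  for (i in seq_along(weight)) {
    new_colors <- union(colors, color[i])
    # open a new group if adding this marble would break either limit
    if (length(colors) > 0 &&
        (total + weight[i] > max_weight || length(new_colors) > max_colors)) {
      current <- current + 1L
      total <- 0
      new_colors <- color[i]
    }
    total <- total + weight[i]
    colors <- new_colors
    group[i] <- current
  }
  group
}

marbles$group <- assign_groups(marbles$color, marbles$weight)
```

On the question's data this reproduces the group column requested there.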
In base R you could write a recursive function as shown below:
create_group = function(df,a){
if(missing(a)) a = cumsum(df$weight)%/%100
b = !ave(df$color,a,FUN=seq_along)%%6
d = ave(df$weight,a+b,FUN=cumsum)>100
a = a+b+d
if (any(b|d)) create_group(df,a) else cbind(df,group = a+1)
}
create_group(marbles)
color weight group
1 1 22 1
2 15 33 1
3 13 35 1
4 11 13 2
5 6 26 2
6 8 15 2
7 10 3 2
8 16 22 2
9 14 21 3
10 3 16 3
11 4 26 3
12 20 30 3
13 9 31 4
14 2 16 4
15 7 12 4
16 17 13 4
17 19 19 4
18 5 17 5
19 12 12 5
20 18 40 5

creating new tibble columns based on mapping plus user data

I am trying to generate new columns in a tibble from the output of a function that takes as input several existing columns of that tibble plus user data. As a simplified example, I would want to use this function
addup <- function(x, y, z){x + y + z}
and use it to add the numbers in the existing columns in this tibble...
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
# A tibble: 12 x 2
num1 num2
<int> <int>
1 8 5
2 6 3
3 7 7
4 3 11
5 1 2
6 2 1
7 11 6
8 10 9
9 4 8
10 9 12
11 5 10
12 12 4
...together with user input. For instance, if a user defines the vector
vec <- c(3,6,4)
I would like to generate one new column per item in vec, adding the mapped values with the user input values.
The desired result in this case would look something like:
# A tibble: 12 x 5
num1 num2 `3` `6` `4`
<int> <int> <dbl> <dbl> <dbl>
1 5 7 15 18 16
2 8 2 13 16 14
3 7 9 19 22 20
4 1 11 15 18 16
5 3 3 9 12 10
6 9 12 24 27 25
7 6 6 15 18 16
8 10 10 23 26 24
9 11 4 18 21 19
10 12 5 20 23 21
11 4 1 8 11 9
12 2 8 13 16 14
If I know vec beforehand, I could achieve this by
tib %>%
mutate("3" = map2_dbl(num1, num2, ~addup(.x, .y, 3)),
"6" = map2_dbl(num1, num2, ~addup(.x, .y, 6)),
"4" = map2_dbl(num1, num2, ~addup(.x, .y, 4)))
but as the length of vec can vary, I do not know how to generalize this. I found the answer to "repeated mutate in tidyverse", but there the functions are repeated over the existing columns rather than using several existing columns for mapping.
Any ideas?
Since we don't have to have the function or the colnames as arguments, this is relatively simple. You just need to iterate over vec with a function that returns the summed column, and then combine with the original table. If you have an addup function that accepts vector inputs then you can skip the whole map2 part; in fact this one does but I don't know if your real function does.
library(tidyverse)
vec <- c(3,6,4)
set.seed(1)
tib <- tibble(num1 = sample(12), num2 = sample(12))
addup <- function(c1, c2, z) {c1 + c2 + z}
addup_vec <- function(df, vec) {
new_cols <- map_dfc(
.x = vec,
.f = function(v) {
map2_dbl(
.x = df[["num1"]],
.y = df[["num2"]],
.f = ~ addup(.x, .y, v)
)
}
)
colnames(new_cols) <- vec
bind_cols(df, new_cols)
}
tib %>%
addup_vec(vec)
#> # A tibble: 12 x 5
#> num1 num2 `3` `6` `4`
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 4 9 16 19 17
#> 2 5 5 13 16 14
#> 3 6 8 17 20 18
#> 4 9 11 23 26 24
#> 5 2 6 11 14 12
#> 6 7 7 17 20 18
#> 7 10 3 16 19 17
#> 8 12 4 19 22 20
#> 9 3 12 18 21 19
#> 10 1 1 5 8 6
#> 11 11 2 16 19 17
#> 12 8 10 21 24 22
Created on 2019-01-16 by the reprex package (v0.2.0).
This uses lapply to apply the function to each element of your vector then binds the result to the original data frame and adds column names.
# Given example
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
addup <- function(x, y, z){x + y + z}
vec <- c(3,6,4)
# Add columns and bind to original data frame
foo <- cbind(tib, lapply(vec, function(x)addup(tib$num1, tib$num2, x)))
# Correct column names
colnames(foo)[(ncol(tib)+1):ncol(foo)] <- vec
# Print result
print(foo)
# num1 num2 3 6 4
# 1 4 9 16 19 17
# 2 5 5 13 16 14
# 3 6 8 17 20 18
# 4 9 11 23 26 24
# 5 2 6 11 14 12
# 6 7 7 17 20 18
# 7 10 3 16 19 17
# 8 12 4 19 22 20
# 9 3 12 18 21 19
# 10 1 1 5 8 6
# 11 11 2 16 19 17
# 12 8 10 21 24 22
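If the function is vectorized, as addup() is here, the two ideas above can be combined into a short tidyverse sketch: map over a named copy of vec so the new columns get their names automatically, then bind them to the tibble. The three-row tibble below is illustrative data chosen to avoid depending on the random seed, and new_cols and result are illustrative names.

```r
library(tibble)
library(purrr)
library(dplyr)

addup <- function(x, y, z) x + y + z
vec <- c(3, 6, 4)

# Small illustrative data instead of the sampled columns
tib <- tibble(num1 = c(8, 6, 7), num2 = c(5, 3, 7))

# One list element per user value, named after the value itself
new_cols <- map(
  set_names(vec, as.character(vec)),
  function(v) addup(tib$num1, tib$num2, v)
)

result <- bind_cols(tib, as_tibble(new_cols))
```

Because the names are attached before mapping, no separate colnames() step is needed, and the code works for any length of vec.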
