creating new tibble columns based on mapping plus user data - r

I am trying to generate new columns in a tibble from the output of a function that takes as input several existing columns of that tibble plus user data. As a simplified example, I would like to use this function
addup <- function(x, y, z){x + y + z}
and use it to add the numbers in the existing columns in this tibble...
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
# A tibble: 12 x 2
num1 num2
<int> <int>
1 8 5
2 6 3
3 7 7
4 3 11
5 1 2
6 2 1
7 11 6
8 10 9
9 4 8
10 9 12
11 5 10
12 12 4
...together with user input. For instance, if a user defines the vector
vec <- c(3,6,4)
I would like to generate one new column per item in vec, adding the mapped values with the user input values.
The desired result in this case would look something like:
# A tibble: 12 x 5
num1 num2 `3` `6` `4`
<int> <int> <dbl> <dbl> <dbl>
1 5 7 15 18 16
2 8 2 13 16 14
3 7 9 19 22 20
4 1 11 15 18 16
5 3 3 9 12 10
6 9 12 24 27 25
7 6 6 15 18 16
8 10 10 23 26 24
9 11 4 18 21 19
10 12 5 20 23 21
11 4 1 8 11 9
12 2 8 13 16 14
If I know vec beforehand, I could achieve this by
tib %>%
mutate("3" = map2_dbl(num1, num2, ~addup(.x, .y, 3)),
"6" = map2_dbl(num1, num2, ~addup(.x, .y, 6)),
"4" = map2_dbl(num1, num2, ~addup(.x, .y, 4)))
but as the length of vec can vary, I do not know how to generalize this. I found the answer to "repeated mutate in tidyverse", but there the functions are repeated over the existing columns, whereas I need to map over several existing columns at once.
Any ideas?

Since we don't have to have the function or the colnames as arguments, this is relatively simple. You just need to iterate over vec with a function that returns the summed column, and then combine with the original table. If you have an addup function that accepts vector inputs then you can skip the whole map2 part; in fact this one does but I don't know if your real function does.
library(tidyverse)
vec <- c(3,6,4)
set.seed(1)
tib <- tibble(num1 = sample(12), num2 = sample(12))
addup <- function(c1, c2, z) {c1 + c2 + z}
addup_vec <- function(df, vec) {
new_cols <- map_dfc(
.x = vec,
.f = function(v) {
map2_dbl(
.x = df[["num1"]],
.y = df[["num2"]],
.f = ~ addup(.x, .y, v)
)
}
)
colnames(new_cols) <- vec
bind_cols(df, new_cols)
}
tib %>%
addup_vec(vec)
#> # A tibble: 12 x 5
#> num1 num2 `3` `6` `4`
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 4 9 16 19 17
#> 2 5 5 13 16 14
#> 3 6 8 17 20 18
#> 4 9 11 23 26 24
#> 5 2 6 11 14 12
#> 6 7 7 17 20 18
#> 7 10 3 16 19 17
#> 8 12 4 19 22 20
#> 9 3 12 18 21 19
#> 10 1 1 5 8 6
#> 11 11 2 16 19 17
#> 12 8 10 21 24 22
Created on 2019-01-16 by the reprex package (v0.2.0).
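As the answer notes, the map2 step can be skipped when the function is vectorised. A minimal sketch of that shortcut, assuming addup() accepts whole columns (true for the example function, not necessarily for a real one); the three-row tibble is a cut-down illustration, not the original data:

```r
library(tibble)

addup <- function(x, y, z) x + y + z

# Small illustrative tibble (first three rows of the answer's output)
tib <- tibble(num1 = c(4L, 5L, 6L), num2 = c(9L, 5L, 8L))
vec <- c(3, 6, 4)

# addup() is vectorised, so each call returns a whole column at once
new_cols <- lapply(vec, function(v) addup(tib$num1, tib$num2, v))
names(new_cols) <- as.character(vec)

# Attach one new column per element of vec
result <- tib
result[names(new_cols)] <- new_cols
result
```

Because the column names come from vec itself, the same code works for a vec of any length.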

This uses lapply to apply the function to each element of your vector then binds the result to the original data frame and adds column names.
# Given example
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
addup <- function(x, y, z){x + y + z}
vec <- c(3,6,4)
# Add columns and bind to original data frame
foo <- cbind(tib, lapply(vec, function(x)addup(tib$num1, tib$num2, x)))
# Correct column names
colnames(foo)[(ncol(tib)+1):ncol(foo)] <- vec
# Print result
print(foo)
# num1 num2 3 6 4
# 1 4 9 16 19 17
# 2 5 5 13 16 14
# 3 6 8 17 20 18
# 4 9 11 23 26 24
# 5 2 6 11 14 12
# 6 7 7 17 20 18
# 7 10 3 16 19 17
# 8 12 4 19 22 20
# 9 3 12 18 21 19
# 10 1 1 5 8 6
# 11 11 2 16 19 17
# 12 8 10 21 24 22

Related

R, How to generate additional observations denoted by numbered sequence

I'm currently a bit stuck, since I'm a bit unsure of how to even formulate my problem.
What I have is a dataframe of observations with a few variables.
Let's say this is my initial dataset:
test <- data.frame(var1=c("a","b"),var2=c(15,12))
What I want to end up with is something like:
test2 <- data.frame(var1_p=c("a","a","a","a","a","b","b","b","b","b"),
var2=c(15,15,15,15,15,12,12,12,12,12),
var3=c(1,2,3,4,5,1,2,3,4,5))
However, the number of initial observations and the fact that I need the numbering to run from 0 to 9 make it rather tedious to do by hand.
Does anybody have a nice alternative solution?
Thank you.
What I tried so far was:
a)
testdata$C <- 0
testdata <- for (i in testdata$Combined_Number) {add_row(testdata,C=seq(0,9))}
which results in the dataset to be empty.
b)
testdata$C <- with(testdata, ave(Combined_Number,flur, FUN = seq(0,9)))
which gives the following error code:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Perhaps crossing helps
library(tidyr)
crossing(df, var3 = 0:9)
-output
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
With dplyr this is one approach
library(dplyr)
df %>%
group_by(var1) %>%
summarize(var2, var3 = 0:9, .groups="drop")
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
Data
df <- structure(list(var1 = c("a", "b"), var2 = c(15, 12)), class = "data.frame", row.names = c(NA,
-2L))
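In dplyr 1.1.0 and later, a summarise() call that returns more than one row per group triggers a deprecation warning, and reframe() is the intended replacement. A sketch of the same idea with reframe(), assuming dplyr >= 1.1.0 is installed:

```r
library(dplyr)

df <- data.frame(var1 = c("a", "b"), var2 = c(15, 12))

# reframe() allows any number of rows per group and returns an ungrouped result;
# the length-1 var2 is recycled against the ten values of var3 within each group
out <- df %>%
  group_by(var1) %>%
  reframe(var2 = var2, var3 = 0:9)
out
```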

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows where the BB value is equal to both the previous and the next row, so that only the first and last occurrence of each run of BB values is kept. I also want to keep the NA rows. I came up with this code, which is close to what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep first and last 21 value for col BB. That's the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
group_by(grp = data.table::rleid(BB)) %>%
slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
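The row with AA = 4 disappears in the original attempt because a comparison with NA yields NA, and filter() drops rows whose condition is NA. The sketch below reproduces that in base R and then builds the run-based grouping without data.table. One assumption to keep in mind: rle() treats every NA as its own run, which matches rleid() here only because the NAs are isolated.

```r
AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)

# Why the filter() condition fails for row 3 (AA = 4)
prev <- c(NA, head(BB, -1))   # same as lag(BB)
nxt  <- c(tail(BB, -1), NA)   # same as lead(BB)
cond <- BB != prev | BB != nxt | is.na(BB)
cond[3]  # FALSE | NA | FALSE, which evaluates to NA

# Run id per row: consecutive equal values share an id
runs <- rle(BB)
grp  <- rep(seq_along(runs$lengths), runs$lengths)

# Keep the first and the last row of every run
keep <- !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
lighten_df <- df[keep, ]
lighten_df
```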

Sidestepping for-loops using dplyr 1.0.0

I am just starting to appreciate the power of the new dplyr 1.0.0. But after reading the vignettes I need to read some more, and of course there aren't any more so I turn once again to SO.
Say I have the following dataset
# using rowwise and c_across to calculate new variables
rm(list = ls())
library(tidyverse)
set.seed(1)
df <- tibble(d_1_a = round(sample(1:10,10,replace=T)),
d_1_b = round(sample(1:10,10,replace=T)),
d_1_c = round(sample(1:10,10,replace=T)),
d_1_d = round(sample(1:10,10,replace=T)),
d_2_a = round(sample(1:10,10,replace=T)),
d_2_b = round(sample(1:10,10,replace=T)),
d_2_c = round(sample(1:10,10,replace=T)),
d_2_d = round(sample(1:10,10,replace=T)))
And I want to calculate row sums for a subset of columns within the dataset and add them to the existing dataset. I came up with the following for-loop
for (i in 1:2) {
namesCols <- grep(paste0("^d_",i,"_[a-z]$"), names(df), perl = T) # indexes of subset of columns
newDF <- df %>% select(all_of(namesCols)) # extract subset of columns from main
totDF <- newDF %>% rowwise() %>%
mutate(!!paste0("sum_",i) := sum(c_across(everything()))) %>% # new column from old
select(starts_with("sum")) # now extract just the new column as a dataframe
df <- cbind(df,totDF) # binds the new column to the old dataframe
}
Now if we call the original dataset
df
d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
1 9 5 5 10 9 2 6 7 29 24
2 4 10 5 6 7 2 8 6 25 23
3 7 6 2 4 8 6 7 1 19 22
4 1 10 10 4 6 6 1 5 25 18
5 2 7 9 10 10 1 4 6 28 21
6 7 9 1 9 7 3 8 1 26 19
7 2 5 4 7 3 3 9 9 18 24
8 3 5 3 6 10 8 9 7 17 34
9 1 9 6 9 6 6 7 7 25 26
10 5 9 10 8 8 7 4 3 32 22
We can see the two sum columns, each calculated from a different subset of the existing columns from the original dataset and then added on the end of that dataset.
But I am keen to learn some of the new dplyr/purrr voodoo but am ignorant of how the syntax works.
Can anyone suggest a tidyverse version of my for-loop?
A literal translation of the for loop would be:
library(dplyr)
library(purrr)
bind_cols(df, map_dfc(1:2, function(i) {
df %>%
transmute(!!paste0("sum_",i) := rowSums(
select(., matches(paste0("^d_",i,"_[a-z]$")))))
}))
# d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 9 5 5 10 9 2 6 7 29 24
# 2 4 10 5 6 7 2 8 6 25 23
# 3 7 6 2 4 8 6 7 1 19 22
# 4 1 10 10 4 6 6 1 5 25 18
# 5 2 7 9 10 10 1 4 6 28 21
# 6 7 9 1 9 7 3 8 1 26 19
# 7 2 5 4 7 3 3 9 9 18 24
# 8 3 5 3 6 10 8 9 7 17 34
# 9 1 9 6 9 6 6 7 7 25 26
#10 5 9 10 8 8 7 4 3 32 22
However, we can also use split.default -
bind_cols(df, df %>%
split.default(sub('.*(\\d+).*', '\\1', names(.))) %>%
imap_dfc(~.x %>% transmute(!!paste0("sum_",.y) := rowSums(.))))
where the sub() call extracts the group label that decides how the columns are split:
sub('.*(\\d+).*', '\\1', names(df))
#[1] "1" "1" "1" "1" "2" "2" "2" "2"
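The split-by-column-group idea can also be sketched in base R. The small data frame and the anchored pattern "^d_(\\d+)_" are assumptions for illustration; anchoring the pattern may be safer if the group number can have more than one digit, since a greedy leading `.*` leaves only the last digit for the capture group.

```r
# Hypothetical cut-down data with the same naming scheme
df <- data.frame(d_1_a = c(9, 4), d_1_b = c(5, 10),
                 d_2_a = c(9, 7), d_2_b = c(2, 2))

# Group label for each column: the number between "d_" and the next "_"
grp <- sub("^d_(\\d+)_.*$", "\\1", names(df))

# Split the columns by label and take row sums within each piece
sums <- lapply(split.default(df, grp), rowSums)
names(sums) <- paste0("sum_", names(sums))

out <- cbind(df, as.data.frame(sums))
out
```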

r - use dplyr::group_by in combination with purrr::pmap

I have the following dataframe:
df <- data.frame(a = c(1:20),
b = c(2:21),
c = as.factor(c(rep(1,5), rep(2,10), rep(3,5))))
and I want to do the following:
df1 <- df %>% group_by(c) %>% mutate(a = lead(b))
but originally I have many variables to which I need to apply the lead() function in combination with group_by() on multiple variables. I'm trying the purrr::pmap() to achieve this:
df2 <- pmap(list(df[,1],df[,2],df[,3]), function(x,y,z) group_by(z) %>% lead(y))
Unfortunately this results in error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"
You can do this with mutate_at and named arguments to funs(), which creates new columns instead of overwriting them. Note that this does nothing to a but you can rename the columns after this as desired.
df <- data.frame(
a = c(1:20),
b = c(2:21),
b2 = 3:22,
b3 = 4:23,
c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)
library(tidyverse)
df %>%
group_by(c) %>%
mutate_at(vars(starts_with("b")), funs(lead = lead(.)))
#> # A tibble: 20 x 8
#> # Groups: c [3]
#> a b b2 b3 c b_lead b2_lead b3_lead
#> <int> <int> <int> <int> <fct> <int> <int> <int>
#> 1 1 2 3 4 1 3 4 5
#> 2 2 3 4 5 1 4 5 6
#> 3 3 4 5 6 1 5 6 7
#> 4 4 5 6 7 1 6 7 8
#> 5 5 6 7 8 1 NA NA NA
#> 6 6 7 8 9 2 8 9 10
#> 7 7 8 9 10 2 9 10 11
#> 8 8 9 10 11 2 10 11 12
#> 9 9 10 11 12 2 11 12 13
#> 10 10 11 12 13 2 12 13 14
#> 11 11 12 13 14 2 13 14 15
#> 12 12 13 14 15 2 14 15 16
#> 13 13 14 15 16 2 15 16 17
#> 14 14 15 16 17 2 16 17 18
#> 15 15 16 17 18 2 NA NA NA
#> 16 16 17 18 19 3 18 19 20
#> 17 17 18 19 20 3 19 20 21
#> 18 18 19 20 21 3 20 21 22
#> 19 19 20 21 22 3 21 22 23
#> 20 20 21 22 23 3 NA NA NA
Created on 2018-09-07 by the reprex package (v0.2.0).
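funs() has been deprecated since dplyr 0.8.0, and mutate_at() was superseded by across() in dplyr 1.0.0. A sketch of the same grouped lead with across(), assuming dplyr >= 1.0.0; the six-row data frame is a shortened illustration of the one above:

```r
library(dplyr)

df <- data.frame(
  a  = 1:6,
  b  = 2:7,
  b2 = 3:8,
  c  = factor(rep(1:2, each = 3))
)

# Apply lead() to every column starting with "b", within each group of c;
# .names keeps the originals and adds <column>_lead
out <- df %>%
  group_by(c) %>%
  mutate(across(starts_with("b"), ~ lead(.x), .names = "{.col}_lead")) %>%
  ungroup()
out
```

The last row of each group gets NA, as in the mutate_at() output.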

Using purrr::map2 when one variable is not part of the function

If I had a function like this:
foo <- function(var) {
if(length(var) > 5) stop("can't be greater than 5")
data.frame(var = var)
}
Where this worked:
df <- 1:20
foo(var = df[1:5])
But this didn't:
foo(var = df)
The desired output is:
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
If I know that I can only run this function in chunk of 5 rows, what would be the best approach if I wanted to evaluate all 20 rows? Can I use purrr::map() for this? Assume that the 5 row constraint is rigid.
Thanks in advance.
We split df into chunks of 5 elements each, use purrr::map_dfr to apply foo to every chunk, and bind the results together by rows.
library(tidyverse)
foo <- function(var) {
if(length(var) > 5) stop("can't be greater than 5")
data.frame(var = var)
}
df <- 1:20
df_split <- split(df, (seq(length(df))-1) %/% 5)
df_split
map_dfr(df_split, ~ foo(.x))
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
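The chunk index is the key step: (position - 1) %/% 5 maps positions 1 to 5 to chunk 0, positions 6 to 10 to chunk 1, and so on. A base R sketch of the same split, apply, and bind pattern, with x standing in for the vector called df above:

```r
foo <- function(var) {
  if (length(var) > 5) stop("can't be greater than 5")
  data.frame(var = var)
}

x <- 1:20

# Integer division of the zero-based position gives the chunk id
chunk_id <- (seq_along(x) - 1) %/% 5
chunks <- split(x, chunk_id)

# Apply foo() to each chunk of at most 5 elements, then stack the results
out <- do.call(rbind, lapply(chunks, foo))
rownames(out) <- NULL
out
```

The index is built from positions (seq_along), so it does not depend on the values stored in x.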
You can use dplyr::group_by or tapply:
data.frame(df) %>%
mutate(grp = (row_number()-1) %/% 5) %>%
group_by(grp) %>%
mutate(var = foo(df)$var) %>%
ungroup %>%
select(var)
# # A tibble: 20 x 1
# var
# <int>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# 10 10
# 11 11
# 12 12
# 13 13
# 14 14
# 15 15
# 16 16
# 17 17
# 18 18
# 19 19
# 20 20
data.frame(var=unlist(tapply(df,(df-1) %/% 5,foo)))
# var
# 01 1
# 02 2
# 03 3
# 04 4
# 05 5
# 11 6
# 12 7
# 13 8
# 14 9
# 15 10
# 21 11
# 22 12
# 23 13
# 24 14
# 25 15
# 31 16
# 32 17
# 33 18
# 34 19
# 35 20
