Dplyr, join successive dataframes to pre-existing columns, summing their values - r

I want to perform multiple joins to an original data frame, from the same source but with a different ID each time. Specifically, I only need to do two joins, but when I perform the second join, the columns being joined already exist in the input df. Rather than adding these columns under new names with the .x/.y suffixes, I want to add their values to the existing columns. See the code below for the desired output.
# Input data:
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
> values
# A tibble: 10 x 3
id variable1 variable2
<chr> <int> <dbl>
1 A 1 10
2 B 2 20
3 C 3 30
4 D 4 40
5 E 5 50
6 F 6 60
7 G 7 70
8 H 8 80
9 I 9 90
10 J 10 100
> df
# A tibble: 5 x 1
twin_id
<chr>
1 A/F
2 B/G
3 C/H
4 D/I
5 E/J
These are the two joins:
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
left_join(values, by = c("right_id" = "id"))
> joined_df
# A tibble: 5 x 7
twin_id left_id right_id variable1.x variable2.x variable1.y variable2.y
<chr> <chr> <chr> <int> <dbl> <int> <dbl>
1 A/F A F 1 10 6 60
2 B/G B G 2 20 7 70
3 C/H C H 3 30 8 80
4 D/I D I 4 40 9 90
5 E/J E J 5 50 10 100
And this is the output I want, using the only way I can see to get it:
output_df_wanted <- joined_df %>%
mutate(
variable1 = variable1.x + variable1.y,
variable2 = variable2.x + variable2.y) %>%
select(twin_id, left_id, right_id, variable1, variable2)
> output_df_wanted
# A tibble: 5 x 5
twin_id left_id right_id variable1 variable2
<chr> <chr> <chr> <int> <dbl>
1 A/F A F 7 70
2 B/G B G 9 90
3 C/H C H 11 110
4 D/I D I 13 130
5 E/J E J 15 150
I can see how to get what I want using a mutate statement, but I will have a much larger number of variables in the actual dataset. I am wondering if this is the best way to do this.

You can try reshaping your data and using dplyr::summarise_at:
library(tidyr)
library(dplyr)
df %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
pivot_longer(-twin_id) %>%
left_join(values, by = c("value" = "id")) %>%
group_by(twin_id) %>%
summarise_at(vars(starts_with("variable")), sum) %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE)
## A tibble: 5 x 5
# twin_id left_id right_id variable1 variable2
# <chr> <chr> <chr> <int> <dbl>
#1 A/F A F 7 70
#2 B/G B G 9 90
#3 C/H C H 11 110
#4 D/I D I 13 130
#5 E/J E J 15 150
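In dplyr 1.0.0 and later, summarise_at() is superseded by across(). The following is a minimal sketch of the same reshape-and-sum idea, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0. The column names side and id are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

values <- tibble(
  id = LETTERS[1:10],
  variable1 = 1:10,
  variable2 = (1:10) * 10
)
df <- tibble(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"))

summed <- df %>%
  # split "A/F" into two id columns, keeping twin_id as the grouping key
  separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
  # one row per (twin_id, member id)
  pivot_longer(c(left_id, right_id), names_to = "side", values_to = "id") %>%
  left_join(values, by = "id") %>%
  group_by(twin_id) %>%
  # sum every column whose name starts with "variable"
  summarise(across(starts_with("variable"), sum), .groups = "drop")

summed
```

Because the columns are selected by name pattern, the same pipeline should work unchanged when values has many more variable columns.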

If using a GitHub package is acceptable, you can use my package safejoin.
The idea is that you have conflicting columns. dplyr and base R deal with conflicts by renaming the columns, while safejoin is more flexible: you choose the function to apply when columns conflict. Here you want to add them, so we'll use conflict = `+`; conflict = ~ .x + .y or conflict = ~ ..1 + ..2 would have the same effect.
# remotes::install_github("moodymudskipper/safejoin")
library(tidyverse)
library(safejoin)
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
safe_left_join(values, by = c("right_id" = "id"), conflict = `+`)
joined_df
#> # A tibble: 5 x 5
#> twin_id left_id right_id variable1 variable2
#> <chr> <chr> <chr> <int> <dbl>
#> 1 A/F A F 7 70
#> 2 B/G B G 9 90
#> 3 C/H C H 11 110
#> 4 D/I D I 13 130
#> 5 E/J E J 15 150
Created on 2020-04-29 by the reprex package (v0.3.0)
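If installing a GitHub package is not an option, a similar "add on conflict" rule can be approximated in base R after an ordinary join. This is only a sketch: it assumes the conflicting columns carry the default .x/.y suffixes, and it uses a two-row toy version of joined_df rather than the full data.

```r
# Toy version of joined_df from the question, with the default .x/.y suffixes
joined_df <- data.frame(
  twin_id = c("A/F", "B/G"),
  variable1.x = c(1L, 2L), variable2.x = c(10, 20),
  variable1.y = c(6L, 7L), variable2.y = c(60, 70)
)

# Stems that exist on both sides, e.g. "variable1" and "variable2"
x_cols <- grep("\\.x$", names(joined_df), value = TRUE)
stems <- sub("\\.x$", "", x_cols)
stems <- stems[paste0(stems, ".y") %in% names(joined_df)]

# Keep the non-conflicting columns, then add one summed column per stem
suffixed <- c(paste0(stems, ".x"), paste0(stems, ".y"))
summed <- joined_df[!names(joined_df) %in% suffixed]
for (s in stems) {
  summed[[s]] <- joined_df[[paste0(s, ".x")]] + joined_df[[paste0(s, ".y")]]
}

summed
```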


R: create new rows from preexistent dataframe

I want to create new rows based on the values of pre-existing rows in my dataset. There are two catches: first, some cell values need to remain constant while others have to increase by 1. Second, I need to cycle through every row the same number of times.
I think it will be easier to understand with data.
Here is where I am starting from:
mydata <- data.frame(id=c(10012000,10012002,10022000,10022002),
col1=c(100,201,44,11),
col2=c("A","C","B","A"))
Here is what I want:
mydata2 <- data.frame(id=c(10012000,10012001,10012002,10012003,10022000,10022001,10022002,10022003),
col1=c(100,100,201,201,44,44,11,11),
col2=c("A","A","C","C","B","B","A","A"))
Note how I add +1 in the id column cell for each new row but col1 and col2 remain constant.
Thank you
library(tidyverse)
mydata |>
mutate(id = map(id, \(x) c(x, x+1))) |>
unnest(id)
#> # A tibble: 8 × 3
#> id col1 col2
#> <dbl> <dbl> <chr>
#> 1 10012000 100 A
#> 2 10012001 100 A
#> 3 10012002 201 C
#> 4 10012003 201 C
#> 5 10022000 44 B
#> 6 10022001 44 B
#> 7 10022002 11 A
#> 8 10022003 11 A
Created on 2022-04-14 by the reprex package (v2.0.1)
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
uncount(2) %>%
mutate(id = first(id) + row_number() - 1) %>%
ungroup()
This returns
# A tibble: 8 x 3
id col1 col2
<dbl> <dbl> <chr>
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
library(data.table)
setDT(mydata)
final <- setorder(rbind(copy(mydata), mydata[, id := id + 1]), id)
# id col1 col2
# 1: 10012000 100 A
# 2: 10012001 100 A
# 3: 10012002 201 C
# 4: 10012003 201 C
# 5: 10022000 44 B
# 6: 10022001 44 B
# 7: 10022002 11 A
# 8: 10022003 11 A
I think this should do it:
library(dplyr)
df1 <- arrange(rbind(mutate(mydata, id = id + 1), mydata), id, col2)
Gives:
id col1 col2
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
In base R, for nostalgic reasons:
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = 2)))
mydata2$id <- mydata2$id + 0:1
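The base R idea generalises to any number of copies per row. Below is a minimal sketch; expand_ids and its n argument are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: repeat every row n times and add 0, 1, ..., n - 1 to id
expand_ids <- function(data, n = 2) {
  out <- data[rep(seq_len(nrow(data)), each = n), , drop = FALSE]
  out$id <- out$id + rep(seq_len(n) - 1, times = nrow(data))
  rownames(out) <- NULL
  out
}

mydata <- data.frame(
  id = c(10012000, 10012002, 10022000, 10022002),
  col1 = c(100, 201, 44, 11),
  col2 = c("A", "C", "B", "A")
)

expanded <- expand_ids(mydata, n = 2)
expanded
```

With this particular data, n greater than 2 would produce duplicate ids (for example 10012002 twice), so the increment only makes sense while the gaps between existing ids are large enough.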

Can I extract the value from several columns by column's name?

library(dplyr)
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","c","a","a"))
mydf
My data is mydf, and I would like to produce mydf2 from it.
The prefix column holds, for each row, the prefix of the column that contains the value to be extracted.
I want to use this column to look up that value.
mydf2 <- data.frame(a_x=c(1,2,3,4,5),
b_x=c(8,9,10,11,12),
prefix=c("a","b","c","a","a"),
desired_x_value = c(1,9,NA,4,5),
desired_y_value = c('k','bb',NA,'d','z'))
mydf2
I've used 'get' and 'paste0' but it doesn't work. Can I solve this problem through 'dplyr' chain?
mydf %>% mutate(desired_x_value = get(paste0(prefix,"_x")),
desired_y_value = get(paste0(prefix,"_y")))
So basically you want to create new columns (desired_x_value and desired_y_value) whose values depend on a condition. With dplyr I prefer case_when because it is the most readable way to do this, but you could also use (nested) ifelse statements. It reads as "if X meets condition A do Y, if X meets condition B do Z, and so on".
mydf %>%
dplyr::mutate(
# rows whose prefix matches no column fall through to NA
desired_x_value = case_when(
prefix == "a" ~ a_x,
prefix == "b" ~ b_x,
TRUE ~ NA_real_),
desired_y_values = case_when(
prefix == "a" ~ a_y,
prefix == "b" ~ b_y,
TRUE ~ NA_character_))
You can remove the columns you no longer need in a second step if you want. The code above results in this table:
a_x b_x a_y b_y prefix desired_x_value desired_y_values
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA <NA>
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can write a helper function for this :
get_value <- function(data, prefix, group) {
data[cbind(1:nrow(data), match(paste(prefix, group, sep = '_'), names(data)))]
}
mydf %>%
mutate(desired_x_value = get_value(select(., ends_with('_x')), prefix, 'x'),
desired_y_value = get_value(select(., ends_with('_y')), prefix, 'y'))
# a_x b_x a_y b_y prefix desired_x_value desired_y_value
#1 1 8 k aa a 1 k
#2 2 9 b bb b 9 bb
#3 3 10 a cc c NA <NA>
#4 4 11 d dd a 4 d
#5 5 12 z ee a 5 z
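The helper above relies on matrix indexing: subsetting with a two-column matrix of (row, column) pairs returns one element per pair, and a pair containing NA yields NA. A small base R sketch of that mechanism on the numeric _x columns only (the names m, col_idx and picked are illustrative):

```r
mydf <- data.frame(
  a_x = c(1, 2, 3, 4, 5),
  b_x = c(8, 9, 10, 11, 12),
  prefix = c("a", "b", "c", "a", "a")
)

# Numeric matrix holding only the candidate columns
m <- as.matrix(mydf[c("a_x", "b_x")])

# Column to read in each row; "c_x" does not exist, so match() gives NA
col_idx <- match(paste0(mydf$prefix, "_x"), colnames(m))

# One (row, column) pair per row of the index matrix
picked <- m[cbind(seq_len(nrow(m)), col_idx)]
picked
```

This is why no explicit check for missing prefixes is needed in that approach: the unmatched row simply comes back as NA.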
A simple rowwise also works (str_detect comes from stringr).
library(stringr)
mydf %>% rowwise() %>%
mutate(desired_x = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'x', sep = '_')), NA),
desired_y = ifelse(any(str_detect(names(mydf)[-5], prefix)),
get(paste(prefix, 'y', sep = '_')), NA))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
If every value in prefix matches an existing column prefix, the ifelse check can be dropped:
mydf <- data.frame(a_x = c(1,2,3,4,5),
b_x = c(8,9,10,11,12),
a_y = c("k",'b','a','d','z'),
b_y = c('aa','bb','cc','dd','ee'),
prefix=c("a","b","a","a","a"))
mydf %>% rowwise() %>%
mutate(desired_x = get(paste(prefix, 'x', sep = '_')),
desired_y = get(paste(prefix, 'y', sep = '_')))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc a 3 a
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
First, I am not presenting this as a good solution; the other proposed solutions are much better and simpler. However, since you brought up the get function, I wanted to show how to use it to get your desired output. Some values in your prefix column, such as c, have no match among your column names, so get throws an error and terminates execution; unlike mget, it has no ifnotfound argument. You therefore need to work around that error with an ifelse:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(glue)
mydf1 %>%
mutate(desired_x_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_x")), NA)),
desired_y_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(glue("{.x}_y")), NA))) %>%
unnest(cols = c(desired_x_value, desired_y_value))
# A tibble: 5 x 7
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can also use the paste function instead of glue, and if we already know the output types of the desired columns, map_dbl and map_chr let us drop the final unnest step:
mydf1 %>%
mutate(desired_x_value = map_dbl(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "x", sep = "_")), NA)),
desired_y_value = map_chr(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
get(paste(.x, "y", sep = "_")), NA)))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
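Since mget and its ifnotfound argument were mentioned, here is a base R sketch of how that argument could be used for the same lookup. It assumes the columns are first copied into an environment; cols_env and pick_by_prefix are hypothetical names introduced for illustration.

```r
mydf <- data.frame(
  a_x = c(1, 2, 3, 4, 5),
  b_x = c(8, 9, 10, 11, 12),
  a_y = c("k", "b", "a", "d", "z"),
  b_y = c("aa", "bb", "cc", "dd", "ee"),
  prefix = c("a", "b", "c", "a", "a"),
  stringsAsFactors = FALSE
)

# Each column becomes a variable in its own environment
cols_env <- list2env(as.list(mydf))

# Hypothetical helper: fetch the column named <prefix>_<suffix> for every row,
# falling back to NA when no such column exists, then keep the i-th element
pick_by_prefix <- function(suffix) {
  cols <- mget(paste(mydf$prefix, suffix, sep = "_"),
               envir = cols_env, ifnotfound = list(NA))
  unname(mapply(function(col, i) col[i], cols, seq_along(cols)))
}

desired_x_value <- pick_by_prefix("x")
desired_y_value <- pick_by_prefix("y")
```

This fetches a whole column per row, so it is meant to illustrate ifnotfound rather than to be efficient on large data.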

How to replace NA with set of values

I have the following data frame:
library(dplyr)
library(tibble)
df <- tibble(
source = c("a", "b", "c", "d", "e"),
score = c(10, 5, NA, 3, NA ) )
df
It looks like this:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10 # current max value
2 b 5
3 c NA
4 d 3
5 e NA
What I want to do is replace each NA in the score column with a value counting up from the existing maximum: max + n, where n = 1, 2, ... for each successive NA (so n is at most the number of rows of df).
Resulting in this (hand-coded) :
source score
a 10
b 5
c 11 # obtained from 10 + 1
d 3
e 12 # obtained from 10 + 2
How can I achieve that?
Another option :
transform(df, score = pmin(max(score, na.rm = TRUE) +
cumsum(is.na(score)), score, na.rm = TRUE))
# source score
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
If you want to do this in dplyr
library(dplyr)
df %>% mutate(score = pmin(max(score, na.rm = TRUE) +
cumsum(is.na(score)), score, na.rm = TRUE))
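To see why the pmin version works, it can help to look at the intermediate vectors. A short base R walk-through on the question's score column (the name fill is illustrative):

```r
score <- c(10, 5, NA, 3, NA)

# Running count of NAs seen so far: 0 0 1 1 2
na_count <- cumsum(is.na(score))

# Candidate value for every position: 10 10 11 11 12
fill <- max(score, na.rm = TRUE) + na_count

# fill is never below the maximum, so pmin keeps the original score
# wherever it exists and falls back to fill where score is NA
result <- pmin(fill, score, na.rm = TRUE)
result
```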
A base R solution
df$score[is.na(df$score)] <- seq_along(which(is.na(df$score))) + max(df$score, na.rm = TRUE)
such that
> df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
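One detail worth checking in index-based fills like this is the case of a single NA: seq() applied to one number n returns 1:n, whereas seq_along() always returns one index per element. A small base R illustration with a made-up three-element vector:

```r
score <- c(10, NA, 3)
na_pos <- which(is.na(score))      # a single position: 2

n_seq <- length(seq(na_pos))             # seq(2) is 1:2, so length 2
n_seq_along <- length(seq_along(na_pos)) # one index per NA, so length 1

# seq_along() gives exactly one offset per missing value
score[na_pos] <- max(score, na.rm = TRUE) + seq_along(na_pos)
score
```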
Here is a dplyr approach,
df %>%
mutate(score = replace(score,
is.na(score),
(max(score, na.rm = TRUE) + (cumsum(is.na(score))))[is.na(score)])
)
which gives,
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
With dplyr:
library(dplyr)
df %>%
mutate_at("score", ~ ifelse(is.na(.), max(., na.rm = TRUE) + cumsum(is.na(.)), .))
Result:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
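mutate_at() is superseded in dplyr 1.0.0 and later. A sketch of the same idea with a plain mutate() and if_else(), assuming a recent dplyr; if_else() is stricter about types than ifelse(), which is fine here because both branches are doubles.

```r
library(dplyr)

df <- tibble(
  source = c("a", "b", "c", "d", "e"),
  score = c(10, 5, NA, 3, NA)
)

filled <- df %>%
  mutate(score = if_else(
    is.na(score),
    # existing maximum plus the running count of NAs
    max(score, na.rm = TRUE) + cumsum(is.na(score)),
    score
  ))

filled
```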
A dplyr solution.
df %>%
mutate(na_count = cumsum(is.na(score)),
score = ifelse(is.na(score), max(score, na.rm = TRUE) + na_count, score)) %>%
select(-na_count)
## A tibble: 5 x 2
# source score
# <chr> <dbl>
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
Another one, quite similar to ThomasIsCoding's solution:
> df$score[is.na(df$score)] <- max(df$score, na.rm = TRUE) + seq_len(sum(is.na(df$score)))
> df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
Not quite as elegant as the base R solutions, but still possible with data.table:
library(data.table)
setDT(df)
max.score = df[, max(score, na.rm = TRUE)]
df[is.na(score), score :=(1:.N) + max.score]
Or in one line but a bit slower:
df[is.na(score), score := (1:.N) + df[, max(score, na.rm = TRUE)]]
df
source score
1: a 10
2: b 5
3: c 11
4: d 3
5: e 12

Is there a better way to do a group_by for each value in a list?

I am trying to find the best way to iterate through each column of a data frame, group by that column, and produce a summary.
Here is my attempt:
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
myfunction <- function(x) {
groupVars <- select_if(x, is.factor) %>% colnames()
results <- list()
for(i in 1:length(groupVars)) {
results[[i]] <- x %>%
group_by_at(.vars = vars(groupVars[i])) %>%
summarise(
n = n()
)
}
return(results)
}
test <- myfunction(data)
The function returns:
[[1]]
# A tibble: 3 x 2
a n
<fct> <int>
1 A 37
2 B 34
3 C 29
...
...
...
My question is, is this the best way to do this? Is there a way to avoid using a for loop? Can I use purrr and map somehow to do this?
Thank you
An option is to use map
library(tidyverse)
map(data[1:4], ~data.frame(x = {{.x}}) %>% count(x))
#$a
## A tibble: 3 x 2
# x n
# <fct> <int>
#1 A 39
#2 B 32
#3 C 29
#
#$b
## A tibble: 8 x 2
# x n
# <fct> <int>
#1 A 14
#2 B 11
#3 C 16
#4 D 10
#5 E 12
#6 F 10
#7 G 13
#8 H 14
#...
The output is a list. Note that I have ignored the last column of data, as it doesn't seem to be relevant here.
If you want columns in the list data.frames to be named according to the columns from your original data, we can use imap
imap(data[1:4], ~tibble(!!.y := {{.x}}) %>% count(!!sym(.y)))
#$a
## A tibble: 3 x 2
# a n
# <fct> <int>
#1 A 23
#2 B 35
#3 C 42
#
#$b
## A tibble: 8 x 2
# b n
# <fct> <int>
#1 A 15
#2 B 10
#3 C 13
#4 D 5
#5 E 19
#6 F 9
#7 G 13
#8 H 16
#...
Or making use of tibble::enframe (thanks @camille)
imap(data[1:4], ~enframe(.x, value = .y) %>% count(!!sym(.y)))
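If the goal is to keep the group_by/summarise shape of the original function, one option is to map over column names and group with across(all_of()). This is a sketch assuming dplyr >= 1.0.0; group_cols and counts are illustrative names. It selects character or factor columns, since data.frame() only creates factors by default in R versions before 4.0.

```r
library(dplyr)
library(purrr)

set.seed(1)
data <- data.frame(
  a = sample(LETTERS[1:3], 100, replace = TRUE),
  b = sample(LETTERS[1:8], 100, replace = TRUE),
  value = rnorm(100)
)

# Columns to group by: character or factor, whichever this R version produced
is_group_col <- vapply(data, function(col) is.character(col) || is.factor(col), logical(1))
group_cols <- names(data)[is_group_col]

# One summary tibble per grouping column, named after that column
counts <- map(set_names(group_cols), function(col) {
  data %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
})

counts$a
```

More summary statistics, for example mean(value), can be added inside summarise() without touching the loop.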
You could reshape the data and group by both the column and the letter. This gives you one dataframe instead of a list of them, but you could get the list if you really want it with split.
set.seed(123)
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
data %>%
pivot_longer(cols = -value, names_to = "column", values_to = "letter") %>%
group_by(column, letter) %>%
summarise(n = n())
#> # A tibble: 35 x 3
#> # Groups: column [4]
#> column letter n
#> <chr> <fct> <int>
#> 1 a A 33
#> 2 a B 32
#> 3 a C 35
#> 4 b A 8
#> 5 b B 11
#> 6 b C 12
#> 7 b D 14
#> 8 b E 8
#> 9 b F 17
#> 10 b G 16
#> # … with 25 more rows
Created on 2019-10-30 by the reprex package (v0.3.0)
You can simply call:
apply(data, 2,table)
You can drop the last list element if you want.
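Note that apply() first converts the data frame to a matrix, so the numeric value column is tabulated as text as well. A base R sketch that avoids this by applying table() only to the chosen columns (group_cols is an illustrative name):

```r
set.seed(123)
data <- data.frame(
  a = sample(LETTERS[1:3], 100, replace = TRUE),
  b = sample(LETTERS[1:8], 100, replace = TRUE),
  value = rnorm(100)
)

# Tabulate every column except the numeric one
group_cols <- setdiff(names(data), "value")
counts <- lapply(data[group_cols], table)

counts$a
```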

dplyr - How to obtain the order of one column within a group?

Example data:
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
> tibbly
# A tibble: 12 x 4
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 10 A X 1
2 30 A X 2
3 50 A X 3
4 10 A Y 4
5 30 A Y 4
6 50 A Y 6
7 10 B X 2
8 30 B X 5
9 50 B X 3
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
Question:
How can I obtain the order of rows for each group in a data frame? I can use dplyr to arrange the data in an appropriate form to visualize what I am interested in:
> tibbly %>%
group_by(grouping1, grouping2) %>%
arrange(grouping1, grouping2, desc(value))
# A tibble: 12 x 4
# Groups: grouping1, grouping2 [4]
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 50 A X 3
2 30 A X 2
3 10 A X 1
4 50 A Y 6
5 10 A Y 4
6 30 A Y 4
7 30 B X 5
8 50 B X 3
9 10 B X 2
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
In the end I am interested in the order of the age column for each group, based on the value column. Is there an elegant way to do this with dplyr? Something like summarise(), but based on the order of rows rather than the actual values.
library(dplyr)
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
tibbly %>%
group_by(grouping1, grouping2) %>% # for each group
arrange(desc(value)) %>% # arrange value descending
summarise(order = paste0(age, collapse = ",")) %>% # get the order of age as a string
ungroup() # forget the grouping
# # A tibble: 4 x 3
# grouping1 grouping2 order
# <chr> <chr> <chr>
# 1 A X 50,30,10
# 2 A Y 50,10,30
# 3 B X 30,50,10
# 4 B Y 10,30,50
With data.table
library(data.table)
setDT(tibbly)[order(-value), .(order = toString(age)),.(grouping1, grouping2)]
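For completeness, a base R sketch of the same idea: sort by the grouping columns and descending value, then collapse age within each group using aggregate(). The names dat, sorted and age_order are illustrative, and the data is rebuilt here as a plain data.frame.

```r
dat <- data.frame(
  age = rep(c(10, 30, 50), times = 4),
  grouping1 = rep(c("A", "B"), each = 6),
  grouping2 = rep(rep(c("X", "Y"), each = 3), times = 2),
  value = c(1, 2, 3, 4, 4, 6, 2, 5, 3, 6, 3, 2)
)

# order() is stable, so ties in value keep their original row order
sorted <- dat[order(dat$grouping1, dat$grouping2, -dat$value), ]

# Collapse the ages of each group into one comma-separated string
age_order <- aggregate(
  age ~ grouping1 + grouping2,
  data = sorted,
  FUN = function(a) paste(a, collapse = ",")
)

age_order
```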
