Joining 3 dataframes, in which duplicate entries can be present - r

I have a data frame like so...
df = tibble(id = c(64512, 64513, 64514, 64515),
customer=c("a", "a", "b", "b"))
and want to join two further data frames by id like these...
uvp_new = tibble(id=c(64512, 64513, 64514), uvp=c(12, 14, 16))
uvp_old = tibble(id=c(64512, 64515), uvp=c(10, 18))
with the following logic: whenever uvp_new has an entry for an id, I want to take that uvp (ignoring uvp_old); if uvp_new has no entry for that id, I want to take the uvp from uvp_old.
Any help appreciated

You can left_join() uvp_old and then use rows_update() with uvp_new:
library(dplyr)
df %>%
left_join(uvp_old, by = "id") %>%
rows_update(uvp_new, by = "id")
# A tibble: 4 x 3
id customer uvp
<dbl> <chr> <dbl>
1 64512 a 12
2 64513 a 14
3 64514 b 16
4 64515 b 18
Or, if df might contain duplicated ids, it may be safer to use rows_upsert() first and then join the result to df:
uvp_old %>%
rows_upsert(uvp_new, by = "id") %>%
right_join(df, by = "id")
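To illustrate the duplicated-id case, here is a minimal sketch. df_dup is a hypothetical variant of df in which id 64512 appears twice, and left_join() is used in place of right_join() only so that the row order of df_dup is kept:

```r
library(dplyr)

# Hypothetical variant of df in which id 64512 appears twice
df_dup <- tibble(id = c(64512, 64512, 64513, 64514, 64515),
                 customer = c("a", "a", "a", "b", "b"))
uvp_new <- tibble(id = c(64512, 64513, 64514), uvp = c(12, 14, 16))
uvp_old <- tibble(id = c(64512, 64515), uvp = c(10, 18))

# Build one lookup table with unique ids: values from uvp_new
# overwrite matching rows of uvp_old, unmatched rows are appended
uvp_lookup <- rows_upsert(uvp_old, uvp_new, by = "id")

# Join the lookup onto df_dup; both rows for 64512 receive the same uvp
result <- left_join(df_dup, uvp_lookup, by = "id")
result
```

Because the lookup table has exactly one row per id, the join cannot multiply rows of df_dup.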

Here is a base R option using transform + merge
transform(
merge(merge(df, uvp_new, by = "id", all.x = TRUE), uvp_old, by = "id", all.x = TRUE),
uvp = ifelse(is.na(uvp.x), uvp.y, uvp.x)
)[c("id","customer","uvp")]
which gives
id customer uvp
1 64512 a 12
2 64513 a 14
3 64514 b 16
4 64515 b 18

You can join the three together using two joins, keeping track of which data.frame the uvp column came from with suffixes. Then, you can select the first non-NA one with coalesce.
df %>%
left_join(uvp_new, by = "id") %>%
left_join(uvp_old, by = "id", suffix = c("_new", "_old")) %>%
mutate(uvp = coalesce(uvp_new, uvp_old))
# id customer uvp_new uvp_old uvp
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 64512 a 12 10 12
# 2 64513 a 14 NA 14
# 3 64514 b 16 NA 16
# 4 64515 b NA 18 18
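To see what coalesce() does in isolation, here is a small sketch on two plain vectors that mirror the uvp_new and uvp_old columns above (the vector names are made up for illustration):

```r
library(dplyr)

# Values as they appear in the joined columns above
uvp_from_new <- c(12, 14, 16, NA)
uvp_from_old <- c(10, NA, NA, 18)

# For each position, take the first argument that is not NA
uvp <- coalesce(uvp_from_new, uvp_from_old)
uvp
# 12 14 16 18
```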

You can do a full join between uvp_new and uvp_old to collect all the ids in one data frame, then join this combined data frame with df and use coalesce to select the new uvp value if present, or else the old one.
library(dplyr)
uvp_new %>%
rename(uvp_n = uvp) %>%
full_join(uvp_old %>%
rename(uvp_o = uvp), by = 'id') %>%
right_join(df, by = 'id') %>%
mutate(uvp = coalesce(uvp_n, uvp_o))
# id uvp_n uvp_o customer uvp
# <dbl> <dbl> <dbl> <chr> <dbl>
#1 64512 12 10 a 12
#2 64513 14 NA a 14
#3 64514 16 NA b 16
#4 64515 NA 18 b 18
You can remove uvp_n and uvp_o columns if they are not needed.
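One way to drop the helper columns is a final select(). This is a sketch of the same pipeline with that step added, not the only way to do it:

```r
library(dplyr)

df <- tibble(id = c(64512, 64513, 64514, 64515),
             customer = c("a", "a", "b", "b"))
uvp_new <- tibble(id = c(64512, 64513, 64514), uvp = c(12, 14, 16))
uvp_old <- tibble(id = c(64512, 64515), uvp = c(10, 18))

result <- uvp_new %>%
  rename(uvp_n = uvp) %>%
  full_join(rename(uvp_old, uvp_o = uvp), by = "id") %>%
  right_join(df, by = "id") %>%
  mutate(uvp = coalesce(uvp_n, uvp_o)) %>%
  # keep only the columns of interest, dropping uvp_n and uvp_o
  select(id, customer, uvp)
result
```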

Related

Use a specific value in summarise (dplyr) without filtering it out

I am trying to compare a new algorithm result versus an old one. I need to know approximately how many days of a difference the new algorithm has in predicting a "D" versus the old one.
I can't seem to figure out how to point to the first row (day) that contains a 'D' (min(day) and new == 'D') without filtering (I was able to grab the row using a double filter due to the grouping, but not use it). I want to use it in summarise using dplyr, which is why I have included pseudocode similar to where I currently am in my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d = data %>%
group_by(id)%>%
arrange(day,.by_group=T)%>%
add_tally(new=='S',name='S')%>%
add_tally(new=='D',name='D')%>%
group_by(id,S,D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id, count the occurrences of 'S' and 'D' in new, and take the difference between the positions of the first 'D' in old and in new.
library(dplyr)
data %>%
group_by(id) %>%
summarise(S = sum(new == 'S'),
D = sum(new == 'D'),
diff = which.max(old == 'D') - which.max(new == 'D'))
#OR if there could be id without D use
#diff = which(old == 'D')[1] - which(new == 'D')[1])
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
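The reason for the alternative in the comment is that which.max() returns 1 when there is no match at all, which is easy to mistake for "found on the first row". A small base R sketch with a made-up vector that contains no 'D':

```r
# Hypothetical group in which 'D' never occurs
no_d <- c("S", "S", "S")

pos_which_max <- which.max(no_d == "D")   # 1: first position of the maximum (all FALSE)
pos_which     <- which(no_d == "D")[1]    # NA: no TRUE positions exist
pos_match     <- match("D", no_d)         # NA: no match found

c(pos_which_max, pos_which, pos_match)
```

Note also that these are row positions within the group, so they equal the day only when day runs 1, 2, 3, ... without gaps, as in the example data.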
We can first create a grouping column holding the difference in 'day' between the first occurrence of 'D' in the 'old' and 'new' columns, then summarise to get the frequency counts and reshape them with pivot_wider
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
group_by(diff = day[match("D", old)] - day[match("D", new)],
new, .add = TRUE) %>% # 'add' was renamed to '.add' in dplyr 1.0.0
summarise(n = n()) %>%
ungroup %>%
pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1

Tidyeval: apply function to data frames extracted from list

This is a simplified version of a problem involving a large list containing complex tables. I want to extract the tables from the list and apply a function to each one. Here we can create a simple list containing small named data frames:
library(tidyverse)
table_names <- c('dfA', 'dfB', 'dfC')
dfA <- tibble(a = 1:3, b = 4:6, c = 7:9)
dfB <- tibble(a = 10:12, b = 13:15, c = 16:18)
dfC <- tibble(a = 19:21, b = 22:24, c = 25:27)
df_list <- list(dfA, dfB, dfC) %>% setNames(table_names)
Here is a simplified example of the kind of operation I would like to apply:
dfA_mod <- df_list$dfA %>%
mutate(name = 'dfA') %>%
select(name, everything())
In this example, I extract one of three tables in the list df_list$dfA, create a new column with the same value in each row mutate(name = 'dfA'), and re-order the columns so that the new column appears in the left-most position select(name, everything()). The resulting object is assigned to dfA_mod.
To solve the larger problem, I want to use one of the purrr::map() variants to apply the function over the character vector table_names, which was initiated in the first block of code above. The elements of table_names serve two purposes: 1) naming the tables held in the list; and 2) supplying values for the name column in the modified table.
I could write a function such as:
fun <- function(x) {
df_list$x %>%
mutate(name = x) %>%
select(name, everything()) %>%
assign(paste0(x, '_mod'), ., envir = .GlobalEnv)
}
And then use map() to create a new list of modified tables:
new_list <- df_list %>% map(table_name, fun(x))
But of course this code does not work, with the main obstacle being (for me at least) figuring out how to quote and unquote the right terms within the function. I'm a beginner at tidy evaluation, and I could use some help in specifying the function and using map properly.
Here is the desired output (for one modified table):
# A tibble: 3 x 4
name a b c
<chr> <int> <int> <int>
1 dfA 1 4 7
2 dfA 2 5 8
3 dfA 3 6 9
Thanks in advance for any help!
We can use purrr::imap, which passes both each element of the list (.x) and its name (.y) to the function
library(dplyr)
library(purrr)
df_out <- imap(df_list, ~.x %>% mutate(name = .y) %>% select(name, everything()))
df_out
#$dfA
# A tibble: 3 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfA 1 4 7
#2 dfA 2 5 8
#3 dfA 3 6 9
#$dfB
# A tibble: 3 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfB 10 13 16
#....
#....
This gives a list of the desired data frames. If you want them as separate data frames in the global environment, you can do
names(df_out) <- paste0(names(df_out), "_mod")
list2env(df_out, .GlobalEnv)
We can also do it using base R Map
df_out <- Map(function(x, y) transform(x, name = y)[c('name', names(x))],
df_list, names(df_list))
and then rename the list elements and export them in the same way as above.
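The function from the question can also be repaired directly. A likely culprit is df_list$x, which looks for an element literally named "x"; df_list[[x]] looks up the name stored in x. A minimal sketch, where add_name is a hypothetical helper name and no tidy-evaluation quoting is needed:

```r
library(dplyr)
library(purrr)

table_names <- c("dfA", "dfB", "dfC")
df_list <- list(
  dfA = tibble(a = 1:3,   b = 4:6,   c = 7:9),
  dfB = tibble(a = 10:12, b = 13:15, c = 16:18),
  dfC = tibble(a = 19:21, b = 22:24, c = 25:27)
)

# add_name is a hypothetical helper: x is the table name as a string
add_name <- function(x) {
  df_list[[x]] %>%            # [[x]] uses the value of x, unlike $x
    mutate(name = x) %>%
    select(name, everything())
}

# Map over the names; the result is a list named dfA_mod, dfB_mod, dfC_mod
new_list <- map(setNames(table_names, paste0(table_names, "_mod")), add_name)
new_list$dfA_mod
```

Returning a named list avoids calling assign() inside the function; list2env() can still export the elements afterwards if separate objects are needed.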
We can convert it to a single data.frame with map while passing the .id
library(purrr)
map_dfr(df_list, I, .id = 'name')
Or with bind_rows
library(dplyr)
bind_rows(df_list, .id = 'name')
# A tibble: 9 x 4
# name a b c
# <chr> <int> <int> <int>
#1 dfA 1 4 7
#2 dfA 2 5 8
#3 dfA 3 6 9
#4 dfB 10 13 16
#5 dfB 11 14 17
#6 dfB 12 15 18
#7 dfC 19 22 25
#8 dfC 20 23 26
#9 dfC 21 24 27

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
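If each id is known to map to exactly one species and study (an assumption that has to be checked on the real, messier data), another option is to carry those columns through summarise() with first(). A sketch under that assumption:

```r
library(dplyr)

set.seed(123)
df1 <- data.frame(
  id      = sort(rep(letters[1:5], 2)),
  species = c("dog", "dog", "cat", "cat", "bird", "bird", "cat", "cat", "bee", "bee"),
  study   = "UK",
  freq    = rpois(10, lambda = 12)
)

df2 <- df1 %>%
  group_by(id) %>%
  summarise(
    species  = first(species),  # assumes one species per id
    study    = first(study),    # assumes one study per id
    meanFreq = mean(freq),
    minFreq  = min(freq)
  )
df2
```

If an id can have several species, first() silently keeps only one of them, so the distinct()-based approaches are the safer choice in that case.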

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I count each unique value of idB only once? Is there anything else I can use instead of group_by in a dplyr %>% pipe?
I can't use summarise as I need to keep the values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
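An equivalent way to write the first approach is with !duplicated(idB), which keeps the first row for each idB within a group. A minimal sketch on the data from the question:

```r
library(dplyr)

data <- data.frame(
  grpA   = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB    = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

result <- data %>%
  group_by(grpA) %>%
  # sum valueC once per distinct idB within each grpA
  mutate(newCol = sum(valueC[!duplicated(idB)])) %>%
  ungroup()
result$newCol
# 40 40 40 40 40 120 120 120
```

All rows and the otherD column are kept, since mutate() rather than summarise() is used.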

Using tidyr complete() with column names specified in variables

I am having trouble using the tidyr::complete() function with column names as variables.
The built-in example works as expected:
df <- data_frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name))
However, when I try to provide the column names as character strings, it produces an error.
gr="group"
id="item_id"
name="item_name"
df %>% complete_(gr, nesting_(id, name),fill = list(NA))
Even a little more simply, df %>% complete(!!!syms(gr), nesting(!!!syms(id), !!!syms(name))) now gets it done in tidyr 1.0.2
I think it's a bug that complete_ can't work with data.frames or list columns like complete can, but here's a workaround using unite_ and separate to simulate nesting:
df %>% unite_('id_name', c(id, name)) %>%
complete_(c(gr, 'id_name')) %>%
separate(id_name, c(id, name))
## # A tibble: 4 × 5
## group item_id item_name value1 value2
## * <dbl> <chr> <chr> <int> <int>
## 1 1 1 a 1 4
## 2 1 2 b 3 6
## 3 2 1 a NA NA
## 4 2 2 b 2 5
Now that tidyr has adopted tidy evaluation, the underscore variants (i.e. complete_) have been deprecated since their behavior can be handled by the standard variants (complete).
However, complete, crossing and nesting use data-masking, so the way to convert variables into names is via the .data[[var]] pronoun (per the docs), so your case becomes:
suppressPackageStartupMessages(
library(tidyr)
)
df <- data.frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
gr <- "group"
id <- "item_id"
name <- "item_name"
df %>% complete(
.data[[gr]],
nesting(.data[[id]],
.data[[name]])
)
#> # A tibble: 4 x 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <fct> <int> <int>
#> 1 1 1 a 1 4
#> 2 1 2 b 3 6
#> 3 2 1 a NA NA
#> 4 2 2 b 2 5
Created on 2020-02-28 by the reprex package (v0.3.0)
Not very elegant, but it gets the job done.
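If this is needed in several places, the !!!syms() form mentioned earlier can be wrapped in a small function. This is only a sketch: complete_by is a hypothetical helper, not part of tidyr, and it assumes a tidyr version that supports tidy evaluation in complete():

```r
library(tidyr)
library(rlang)

df <- data.frame(
  group     = c(1:2, 1),
  item_id   = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1    = 1:3,
  value2    = 4:6
)

# complete_by is a hypothetical helper: column names are passed as strings
complete_by <- function(data, group_col, nested_cols) {
  complete(
    data,
    !!sym(group_col),               # one column, turned into a symbol
    nesting(!!!syms(nested_cols))   # several columns, spliced into nesting()
  )
}

completed <- complete_by(df, "group", c("item_id", "item_name"))
completed
```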
