Merge three datasets using left_join and get unique suffixes in R

Can I merge three datasets using left_join and get unique suffixes for all three datasets?
My dummy data:
df1 <- tibble(x = 1:5, y = c("names", "matches", "multiple", "rows", "different"))
df2 <- tibble(x = 3:5, y = c("first", "second", "third"))
df3 <- tibble(x = 2:4, y = 1:3)
left_join(df1, df2, by='x', suffix = c(".first", ".second")) %>%
left_join(., df3 , by='x', suffix = c("third", "third"))
# # A tibble: 5 x 4
# x y.first y.second y
# <int> <chr> <chr> <int>
# 1 1 names <NA> NA
# 2 2 matches <NA> 1
# 3 3 multiple first 2
# 4 4 rows second 3
# 5 5 different third NA
What I'm looking to obtain (note the '.third' suffix on the last column):
# # A tibble: 5 x 4
# x y.first y.second y.third
# <int> <chr> <chr> <int>
# 1 1 names <NA> NA
# 2 2 matches <NA> 1
# 3 3 multiple first 2
# 4 4 rows second 3
# 5 5 different third NA

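Try this. Suffixes are only applied to columns whose names collide, so leave the second table's suffix empty in the first join. Its column is then still called y, collides with df3's y in the second join, and both suffixes are applied there: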
left_join(df1, df2, by='x', suffix = c(".first", "")) %>%
left_join(., df3 , by='x', suffix = c(".second", ".third"))
x y.first y.second y.third
<int> <chr> <chr> <int>
1 1 names NA NA
2 2 matches NA 1
3 3 multiple first 2
4 4 rows second 3
5 5 different third NA
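If more tables are involved, chaining suffixes gets awkward. A minimal sketch of an alternative, assuming every table shares the key x: rename the non-key columns first, then join. The helper add_suffix is hypothetical, not part of dplyr.

```r
library(dplyr)

df1 <- tibble(x = 1:5, y = c("names", "matches", "multiple", "rows", "different"))
df2 <- tibble(x = 3:5, y = c("first", "second", "third"))
df3 <- tibble(x = 2:4, y = 1:3)

# Hypothetical helper: append `suffix` to every column except the join key.
add_suffix <- function(df, suffix, key = "x") {
  rename_with(df, ~ paste0(.x, suffix), .cols = -all_of(key))
}

tables   <- list(df1, df2, df3)
suffixes <- c(".first", ".second", ".third")

# Rename each table, then fold the list into one table with successive left joins.
renamed <- Map(add_suffix, tables, suffixes)
result  <- Reduce(function(a, b) left_join(a, b, by = "x"), renamed)
names(result)
```

Because no column names collide after renaming, the suffix argument of left_join is never needed.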

Related

How to convert a column to a different type using NSE?

I'm writing a function that takes a data frame and a column name as arguments, and returns the data frame with the indicated column converted to character type. However, I'm stuck at the non-standard evaluation part of dplyr.
My current code:
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
mutate(df, "{col}" := as.character({{ col }}))
# # A tibble: 5 x 2
# id value
# <chr> <int>
# 1 id 6
# 2 id 7
# 3 id 8
# 4 id 9
# 5 id 10
As you can see, instead of transforming the contents of the column to character type, the column values are replaced by the column names. {{ col }} isn't evaluated like I expected it to be. What I want is a dynamic equivalent of this:
mutate(df, id = as.character(id))
# # A tibble: 5 x 2
# id value
# <chr> <int>
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
I've tried to follow the instructions provided in dplyr's programming vignette, but I'm not finding a solution that works. What am I doing wrong?
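Because col holds a string, {{ col }} just inserts that string, so as.character() receives "id" rather than the column. Use the .data pronoun to look the column up by name -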
library(dplyr)
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
mutate(df, "{col}" := as.character(.data[[col]]))
# id value
# <chr> <int>
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Some other alternatives -
mutate(df, "{col}" := as.character(get(col)))
mutate(df, "{col}" := as.character(!!sym(col)))
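Since the question asks for a function, here is a minimal sketch of the .data approach wrapped in one. The name to_character is made up for illustration, and col is assumed to be a single column name passed as a string.

```r
library(dplyr)

# Hypothetical wrapper: convert the column named by the string `col` to character.
to_character <- function(data, col) {
  mutate(data, "{col}" := as.character(.data[[col]]))
}

df  <- tibble(id = 1:5, value = 6:10)
out <- to_character(df, "id")
str(out)
```

The glue-style name on the left of := and the .data[[col]] lookup on the right both read the same string, so the column is overwritten in place.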
We may use across(), which can also do this on multiple columns:
library(dplyr)
df %>%
mutate(across(all_of(col), as.character))
# A tibble: 5 x 2
id value
<chr> <int>
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
data
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
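To illustrate the multiple-column case, a short sketch where cols (a name chosen here for illustration) holds several column names:

```r
library(dplyr)

df   <- tibble(id = 1:5, value = 6:10)
cols <- c("id", "value")  # assumed: a character vector of existing column names

# all_of() selects exactly the named columns and errors if one is missing.
out <- df %>%
  mutate(across(all_of(cols), as.character))
str(out)
```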

R: Joining Tibbles Using Derived Column Values

Consider the following tibbles:
library(tidyverse)
tbl_base_ids = tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"), base_id_length = c(3, 6, 9), record_id_length = c(10, 12, 15))
tbl_records = tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))
I'd like to join matching rows to produce a third tibble:
tbl_records_with_base
record_id
base_id
base_id_length
record_id_length
As you can see, this is not a matter of joining one or more variables from each of the first two. This requires matching variable derivatives. In SQL, I'd do this:
SELECT A.record_id,
B.base_id,
B.base_id_length,
B.record_id_length
FROM tbl_records A
JOIN tbl_base_ids B
ON LENGTH(a.record_id) = B.record_id_length
AND LEFT(a.record_id, B.base_id_length) = B.base_id
I've tried variations of dplyr joins and using the match function, but to no avail. Can someone help? Thank you.
You need some logic to separate base_id from record_id, because joining only on record_id_length would not be enough. For this example we can get base_id by removing the digits from record_id; adjust this to fit your actual dataset.
Once we do that, we can join tbl_records with tbl_base_ids by base_id and record_id_length.
library(dplyr)
tbl_records %>%
mutate(base_id = sub('\\d+', '', record_id),
record_id_length = nchar(record_id)) %>%
inner_join(tbl_base_ids, by = c("base_id", "record_id_length")) -> result
result
# record_id base_id record_id_length base_id_length
# <chr> <chr> <dbl> <dbl>
#1 ABC1234567 ABC 10 3
#2 ABCDEF123456 ABCDEF 12 6
#3 ABCDEFGHI123456 ABCDEFGHI 15 9
I suggest using the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(tbl_records, ., by = c("record_id" = "record_ptn"))
# # A tibble: 3 x 5
# record_id base_id base_id_length record_id_length record_ptn
# <chr> <chr> <dbl> <dbl> <chr>
# 1 ABC1234567 ABC 3 10 ^ABC.{7}$
# 2 ABCDEF123456 ABCDEF 6 12 ^ABCDEF.{6}$
# 3 ABCDEFGHI123456 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$
A note about this: the order of the tables matters, because the regex must be on the right-hand side of the by= setting. For instance, this does not work if we reverse it:
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(., tbl_records, by = c("record_ptn" = "record_id"))
# # A tibble: 6 x 5
# base_id base_id_length record_id_length record_ptn record_id
# <chr> <dbl> <dbl> <chr> <chr>
# 1 ABC 3 10 ^ABC.{7}$ <NA>
# 2 ABCDEF 6 12 ^ABCDEF.{6}$ <NA>
# 3 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$ <NA>
# 4 <NA> NA NA <NA> ABC1234567
# 5 <NA> NA NA <NA> ABCDEF123456
# 6 <NA> NA NA <NA> ABCDEFGHI123456
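The SQL in the question can also be translated fairly literally. This is a sketch, assuming the tables are small enough that building every record/base pair is acceptable: cross the two tables, then keep the rows that satisfy both ON conditions.

```r
library(dplyr)
library(tidyr)

tbl_base_ids <- tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"),
                       base_id_length = c(3, 6, 9),
                       record_id_length = c(10, 12, 15))
tbl_records <- tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))

# Cartesian product of both tables, filtered by the two SQL join conditions.
tbl_records_with_base <- crossing(tbl_records, tbl_base_ids) %>%
  filter(nchar(record_id) == record_id_length,              # LENGTH(record_id) = record_id_length
         substr(record_id, 1, base_id_length) == base_id)   # LEFT(record_id, n) = base_id
tbl_records_with_base
```

The cross product has nrow(tbl_records) * nrow(tbl_base_ids) rows before filtering, so for large tables one of the other approaches is likely preferable.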

R dplyr::Filter dataframe by group and numeric vector?

I have dataframe df1 containing data and groups, and df2 which stores the same groups, and one value per group.
I want to filter rows of df1 by df2, keeping rows where the lag by group is higher than the value indicated for that group.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
mort = c(5,16,40,4,5,6,10,108),
distance = rep(c("a", "b"), each = 4))
df2 = data.frame(distance = c("a", "b"),
my.median = c(12,1))
Now calculate the lag between values (creates new column) and filter df1 based on column values of df2:
# calculate lag between years
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > df2$my.median) ##
This however does not produce expected results:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works well when applied to a single value, but how can I adapt it to a vector, especially one with a value per group (given that the order of the groups could change)?
Is this what you're trying to do?
df1 %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
left_join(df2, by = "distance") %>%
filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1
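The original filter goes wrong because df2$my.median holds one value per group rather than one per row, so R recycles it along the rows of each group. A small base R sketch of what happens inside group "a" (the variable names are chosen here for illustration):

```r
# yearLag for group "a": mort minus lagged mort, with the first lag set to 0
year_lag_a <- c(5, 16, 40, 4) - c(0, 5, 16, 40)

# df2$my.median: length 2, recycled to 12, 1, 12, 1 across the four rows
medians <- c(12, 1)

# Rows 2 and 4 are compared with 1 (the threshold for "b") instead of 12
recycled <- year_lag_a > medians
recycled
```

Joining df2 onto df1 avoids this, because every row then carries the threshold of its own group.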
Here is a data.table approach:
library(data.table)
# create data.tables
setDT(df1);setDT(df2)
#create yearLag variable
df1[, yearLag := mort - shift( mort, type = "lag", fill = 0 ), by = .(distance) ]
#update join and filter wanted rows
df1[ df2, median.value := i.my.median, on = .(distance)][ yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1
Came to the same conclusion. You should left_join the data frames.
df1 %>% left_join(df2, by="distance") %>%
group_by(distance) %>%
dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98
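If adding a column is not wanted, a named lookup vector is another option. This is a sketch rather than a definitive solution; thresholds is a name introduced here, and df2 is deliberately reordered to show that the lookup goes by group name, not position.

```r
library(dplyr)

df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("b", "a"),   # order swapped on purpose
                  my.median = c(1, 12))

# Named vector: names are the groups, values are the thresholds.
thresholds <- setNames(df2$my.median, df2$distance)

result <- df1 %>%
  group_by(distance) %>%
  mutate(yearLag = mort - lag(mort, default = 0)) %>%
  # as.character() also covers the case where distance is a factor
  filter(yearLag > unname(thresholds[as.character(distance)])) %>%
  ungroup()
result
```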

Add together 2 dataframes in R without losing columns

I have 2 dataframes in R (df1, df2).
df1 as
A C D
1 1 1
2 2 2
df2 as
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
Columns are sorted and column values are added. Both DFs have same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
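The question also asks for sorted columns. A minimal sketch of one way to get that, assuming the two data frames share at least one column name: build the combined frame, then reorder it by column name.

```r
df1 <- data.frame(A = 1:2, C = 1:2, D = 1:2)
df2 <- data.frame(A = 1:2, B = 1:2, C = 1:2)

common <- intersect(names(df1), names(df2))

res <- cbind(df1[setdiff(names(df1), common)],  # columns only in df1
             df1[common] + df2[common],         # shared columns, summed
             df2[setdiff(names(df2), common)])  # columns only in df2

# Reorder the columns alphabetically: A, B, C, D
res <- res[order(names(res))]
res
```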
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
By using left_join() from dplyr you won't lose any columns (note that this joins on the shared column rather than summing it):
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.

Removing mirrored combinations of variables in a data frame

I'm looking to get each unique combination of two variables:
library(purrr)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`)
# A tibble: 6 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 1 2
4 3 2
5 1 3
6 2 3
How do I remove the mirrored combinations? That is, I want only one of rows 1 and 3 in the data frame above, only one of rows 2 and 5, and only one of rows 4 and 6. My desired output would be something like:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
I don't care if a particular id value is in id1 or id2, so the below is just as acceptable as the output:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 1 2
2 1 3
3 2 3
A tidyverse version of Dan's answer:
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`) %>%
mutate(min = pmap_int(., min), max = pmap_int(., max)) %>% # Find the min and max in each row
unite(check, c(min, max), remove = FALSE) %>% # Combine them in a "check" variable
distinct(check, .keep_all = TRUE) %>% # Remove duplicates of the "check" variable
select(id1, id2)
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
A Base R approach:
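# df is assumed to hold the id1/id2 pairs from the question
# create a string with the sorted elements of the row; the "_" separator
# keeps multi-digit ids unambiguous (e.g. 1,23 vs 12,3)
df$temp <- apply(df, 1, function(x) paste(sort(x), collapse="_"))
# then you can simply keep rows with a unique sorted-string value
df[!duplicated(df$temp), 1:2]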
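Two further base R sketches, offered as alternatives under the assumption that the ids are numeric: the first removes mirrored rows from an existing table by comparing the row-wise smaller and larger id, the second generates the unordered pairs directly with combn(). The variable names are illustrative.

```r
# All ordered pairs of distinct ids, in the same order as in the question
pairs <- expand.grid(id1 = seq_len(3), id2 = seq_len(3))
pairs <- pairs[pairs$id1 != pairs$id2, ]

# Mirrored rows share the same (smaller id, larger id) key
key <- data.frame(lo = pmin(pairs$id1, pairs$id2),
                  hi = pmax(pairs$id1, pairs$id2))
deduped <- pairs[!duplicated(key), ]
deduped

# Alternatively, build each unordered pair once from the start
combos <- as.data.frame(t(combn(seq_len(3), 2)))
names(combos) <- c("id1", "id2")
combos
```

The first form keeps whichever orientation appears first in the data; the second always puts the smaller id in id1.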
