Problem: I have two tables I'd like to join. However, the column on which I wish to join the second table to the first varies, depending on parsing the second data frame to identify which column and row to join on.
Request: I have found a solution to the problem (see below), but it does not seem to me to be very computationally efficient. Not a problem for the reproducible example below, but potentially less ideal when stepped up to a larger-scale problem, i.e. ~200,000+ rows/observations.
I'm wondering if anyone might be able to help identify something better, ideally utilising functionality from dplyr.
Reproducible Example:
# Equipment alias table
alias1 <- c('a1a1', 'a2a2', 'a3a3', 'a4a4', 'a5a5', 'a6a6')
alias2 <- c('bc001', 'bc002', 'bc003', 'bc004', 'bc005', 'bc006')
alias3 <- c('e1o1', 'e202', 'e303', 'e404', 'e505', 'e606')
df_alias <- data.frame(alias1, alias2, alias3)
# Attribute table
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
df_att <- data.frame(equip, att1, att2)
Desired Outcome:
I'm looking to achieve the following...
# DESIRED OUTPUT - combining the equipment alias table into the attribute table based on a string match between equip and any one of the columns in the equipment alias table
equip <- c('a1a1','bc006', 'e404')
att1 <- c('a', 'b', 'c')
att2 <- c('1', '2', '3')
alias1 <- c('a1a1','a6a6', 'a4a4')
alias2 <- c('bc001','bc006', 'bc004')
alias3 <- c('e1o1','e606', 'e404')
df_att <- data.frame(equip, att1, att2, alias1, alias2, alias3)
Current Solution:
library(dplyr)
# Cross join every attribute row to every alias row, then keep matching rows
left_join(df_att, df_alias, by = character()) %>%
  filter(equip == alias1 | equip == alias2 | equip == alias3)
Effective, but not exactly elegant: it creates a great deal of duplication only for the filter to then undo that duplication.
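As a point of comparison, here is a minimal base-R sketch that avoids the cross join entirely by locating, for each equip value, the single df_alias row that contains it (it assumes every equip value occurs in exactly one row of df_alias):
hits <- sapply(df_alias, function(col) match(df_att$equip, col)) # one column of row indices per alias
row_idx <- apply(hits, 1, function(r) r[!is.na(r)][1])           # first (only) matching row per equip
df_att2 <- cbind(df_att, df_alias[row_idx, ])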
An option is to filter with if_any and then bind the subset rows to df_att:
library(dplyr)
df_att2 <- df_alias %>%
  filter(if_any(everything(), ~ .x %in% df_att$equip)) %>%
  arrange(na.omit(unlist(across(everything(), ~ match(df_att$equip, .x))))) %>%
  bind_cols(df_att, .)
-checking with OP's expected output (the object name 'df_att' was changed to 'out' to avoid any confusion)
> all.equal(df_att2, out)
[1] TRUE
I don't know how it compares efficiency-wise, but one idea is to pivot a copy of each alias column so that you can left_join against a single column instead of multiple ones.
library(tidyr)
library(dplyr)
df_alias %>%
  # duplicate each alias column, prefixing the copy's name with "_"
  mutate(across(everything(), ~ ., .names = "_{.col}")) %>%
  pivot_longer(starts_with('_'), names_to = NULL, values_to = 'equip') %>%
  left_join(df_att, .)
#> Joining, by = "equip"
#> equip att1 att2 alias1 alias2 alias3
#> 1 a1a1 a 1 a1a1 bc001 e1o1
#> 2 bc006 b 2 a6a6 bc006 e606
#> 3 e404 c 3 a4a4 bc004 e404
I have a dataframe with a mix of characters and numbers in each column; the columns end up being character columns, like this:
df1 <- data.frame(
Group = c('Type', 'State', 'Roads'),
Value1 = c('A', 'Florida', 107.188887)
)
I want to round the numeric data points to one decimal place (the tenths digit), but this doesn't seem possible given they are intermingled with other data types. Is there a way to do this rounding in R? The result would look like this:
df_desired <- data.frame(
Group = c('Type', 'State', 'Roads'),
Value1 = c('A', 'Florida', 107.2)
)
I'd prefer to avoid pivoting the df if possible.
Find the elements that are entirely numeric and do the rounding in base R itself:
i1 <- grep("^[0-9.]+$", df1$Value1)
df1$Value1[i1] <- round(as.numeric(df1$Value1[i1]), 1)
-output
> df1
Group Value1
1 Type A
2 State Florida
3 Roads 107.2
If it is an entire dataset, use lapply
df1[] <- lapply(df1, function(x) {
  i1 <- grep("^[0-9.]+$", x)
  x[i1] <- round(as.numeric(x[i1]), 1)
  x
})
-output
> df1
Group Value1
1 Type A
2 State Florida
3 Roads 107.2
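One caveat with the round()-then-character round trip: a value whose rounded form ends in .0 prints without the decimal (round(107.0, 1) becomes "107" once coerced back to character). If a fixed one-decimal display matters, a small sprintf variant of the same idea:
i1 <- grep("^[0-9.]+$", df1$Value1)
df1$Value1[i1] <- sprintf("%.1f", as.numeric(df1$Value1[i1])) # always keep one decimal place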
First str_detect the numeric value, then str_extract it, convert it to numeric with as.numeric, and finally round it:
library(stringr)
library(dplyr)
df1 %>%
  mutate(Value1 = ifelse(str_detect(Value1, "^[\\d.]+$"),
                         round(as.numeric(str_extract(Value1, "^[\\d.]+$")), 1),
                         Value1))
Group Value1
1 Type A
2 State Florida
3 Roads 107.2
EDIT:
If this type of edit needs to be done in several columns, you can use mutate(across(...)):
df1 %>%
  mutate(across(starts_with("V"),
                ~ ifelse(str_detect(., "^[\\d.]+$"),
                         round(as.numeric(str_extract(., "^[\\d.]+$")), 1),
                         .)))
Example data with a second value column:
df1 <- data.frame(
  Group = c('Type', 'State', 'Roads'),
  Value1 = c('A', 'Florida', 107.188887),
  Value2 = c('B', 'California', 234.1229997)
)
This much more concise method works too. The warnings can be ignored: ifelse() applies as.numeric() to the whole column, so non-numeric entries emit NA-coercion warnings, but those positions take the unchanged branch anyway:
df1 %>%
  mutate(across(starts_with("V"),
                ~ ifelse(str_detect(., "^[\\d.]+$"),
                         round(as.numeric(.), 1),
                         .)))
I am starting to learn how to use dplyr's pipe (%>%) for manipulating data frames. I like that it seems much more streamlined. However, I just encountered a problem that I could not solve with pipes alone.
I have a data frame which holds relationship (network) data which looks like this:
The first two columns indicate what items (genes) there is a relationship between, and the third column contains information about that relationship:
a b c
1 Gene_1 Gene_2 X
2 Gene_2 Gene_3 R
3 Gene_1 Gene_4 X
My goal is to get a list of unique genes that share the same attribute. If the attribute X in col 3 is selected, I would get this data frame:
a b c
1 Gene_1 Gene_2 X
3 Gene_1 Gene_4 X
And I would want to end with this list of unique genes:
genes = c("Gene_1" "Gene_2" "Gene_4")
It does not matter if the item (Gene) comes from the first column or the second, I just want a unique list. I came up with this solution:
library(dplyr)
net = tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
             b = c("Gene_2", "Gene_3", "Gene_4"),
             c = c("X", "R", "X"))
df = net %>%
  filter(c == "X") %>%
  select(c(1, 2))
genes = unique(c(df$a, df$b))
but am not satisfied, as I was not able to do everything within the dplyr pipe chain. I had to make a vector outside of the pipe chain and then call unique on it.
Is there a way to accomplish this task with a call to another pipe? I could not find any way to do this. Thanks.
1) Use {...} like this:
net %>%
  filter(c == "X") %>%
  select(c(1, 2)) %>%
  { unique(c(.$a, .$b)) }
## [1] "Gene_1" "Gene_2" "Gene_4"
2) or use magrittr's %$% pipe:
library(magrittr)
net %>%
  filter(c == "X") %>%
  select(c(1, 2)) %$%
  unique(c(a, b))
## [1] "Gene_1" "Gene_2" "Gene_4"
3) or use with:
net %>%
  filter(c == "X") %>%
  select(c(1, 2)) %>%
  with(unique(c(a, b)))
## [1] "Gene_1" "Gene_2" "Gene_4"
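4) or, as a sketch assuming base R >= 4.2 (where a parenthesized anonymous function may close a native-pipe chain), use |> for the final step:
net |>
  filter(c == "X") |>
  select(c(1, 2)) |>
  (\(d) unique(c(d$a, d$b)))()
## [1] "Gene_1" "Gene_2" "Gene_4"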
Since the result is not a data frame, it's best not to call it df.
The unlist() function is probably what you are looking for.
Quoting from the built in documentation for ?unlist: "Given a list structure x, unlist simplifies it to produce a vector which contains all the atomic components which occur in x."
Since R data frames (and tibbles) are implemented as lists of column vectors with equal lengths, the unlist function will effectively convert a data frame into a vector.
Subset for the desired rows and columns with filter and select, then pipe the result through unlist() and then unique(). The result will be a vector with the distinct elements.
library(dplyr)
# The example data
tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
       b = c("Gene_2", "Gene_3", "Gene_4"),
       c = c("X", "R", "X")) %>%
  # Subset data for desired feature
  filter(c == "X") %>%
  # Select identifier columns
  select(a, b) %>%
  # Convert to a vector
  unlist() %>%
  # Derive unique elements
  unique()
Result
[1] "Gene_1" "Gene_2" "Gene_4"
I would suggest using tidyr::pivot_longer to reshape the two distinct gene columns into a single value column (which we care about) plus a name column (referencing the original column name, which we can ignore). Then distinct gets the unique matches, and finally we filter on the desired value of column c:
net %>%
  pivot_longer(-c) %>%
  distinct(c, value) %>%
  filter(c == "X")
If you want the result as a vector, you could add %>% pull(value).
One benefit of this approach is that we have already calculated every distinct set of genes for every value of column c, and the last filter step just narrows it to one example value (see the sketch after the result below).
Result
c value
<chr> <chr>
1 X Gene_1
2 X Gene_2
3 X Gene_4
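As a usage sketch of that last point, dropping the final filter yields the distinct gene set for every attribute at once:
net %>%
  pivot_longer(-c) %>%
  distinct(c, value)
#> # A tibble: 5 x 2
#>   c     value
#>   <chr> <chr>
#> 1 X     Gene_1
#> 2 X     Gene_2
#> 3 R     Gene_2
#> 4 R     Gene_3
#> 5 X     Gene_4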
[Note: I made a = c("Gene_1", "Gene_2", "Gene_1") and b = c("Gene_2", "Gene_3", "Gene_4") to match example.]
I realize this question has several answers, but I would have gone a slightly different way with it. Perhaps it will be useful to someone?
I created a data set to demonstrate, as well.
library(tidyverse)
library(stringi) # only used in data generation
# data set creation 100 rows
a = paste0("Gene_",1:100)
b = paste0("Gene_",round(runif(100, 10, 99),digits = 0))
cC = paste0(stringi::stri_rand_strings(100, 1, '[A-Z]'))
# put it together and strip the information
data.frame(a = a, b = b, cC = cC) %>% # collect the data
  filter(cC == "X") %>%               # filter for attribute
  select(-cC) %>%                     # remove attribute column
  unlist() %>%                        # collapse the data frame into a vector
  unique()                            # show me what's unique
# output example
# [1] "Gene_10" "Gene_12" "Gene_28" "Gene_77" "Gene_22" "Gene_41" "Gene_75"
# [8] "Gene_19"
library(tidyverse)
net <- tibble(
  a = c("Gene_1", "Gene_1", "Gene_3"),
  b = c("Gene_2", "Gene_4", "Gene_5"),
  c = c("X", "R", "X")
)
df <- net %>%
  filter(c == "X") %>%
  select(a, b)
df
#> # A tibble: 2 x 2
#> a b
#> <chr> <chr>
#> 1 Gene_1 Gene_2
#> 2 Gene_3 Gene_5
genes <- df %>%
  unlist() %>%
  unique()
genes
#> [1] "Gene_1" "Gene_3" "Gene_2" "Gene_5"
Though many enlightening answers have been proposed (and accepted by the OP too), I just want to add that, in case you want it simultaneously for all values in c, you can do this:
library(tidyverse)
net %>%
  group_split(c, .keep = FALSE) %>%
  setNames(sort(unique(net$c))) %>% # group_split returns the groups in sorted order
  map(~ .x %>% unlist() %>% unique())
$R
[1] "Gene_2" "Gene_3"

$X
[1] "Gene_1" "Gene_2" "Gene_4"
Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. One variable (column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the "rowSum" variable (I call it "s1").
Problem:
Each setting has a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (this works), and
(2) compute rowSums over the variables of the setting (this does not work; why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = rowSums(!!! subvars, na.rm = TRUE)) # this step fails
  return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ inside rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  v1 <- map_chr(subvars, as_name)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = rowSums(.[v1], na.rm = TRUE))
  return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be to select the quosured columns and then do the rowSums:
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!! subvars) %>%
    dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
                    rowSums(na.rm = TRUE))
  return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
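With dplyr >= 1.0.0, a similar function can be sketched without quosures at all by forwarding the dots into across() (mkr.sumsetting2 is a hypothetical name for this variant):
mkr.sumsetting2 <- function(..., dataset) {
  dataset %>%
    dplyr::select(day, N, ...) %>%
    dplyr::mutate(s1 = rowSums(dplyr::across(c(...)), na.rm = TRUE)) # sum only the forwarded columns
}
mkr.sumsetting2(col1, col2, dataset = df1)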
I have a list of records that I need to dedupe. The rows look like combinations of the same sets of values (with columns A and B swapped), but the regular functions to deduplicate records do not work because the two columns are not exact duplicates. Below is a reproducible example.
df <- data.frame(A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
                 B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame(A = c("2","2","2","331","391","481"),
                       B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find where the elements in column A first appear in column B:
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx):
df[is.na(idx) | seq_along(idx) < idx,]
Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column:
library(tidyverse)
df %>%
  mutate(idx = match(A, B)) %>%
  filter(is.na(idx) | seq_along(idx) < idx) %>%
  select(-idx)
You can remove all rows which would be duplicates under some reordering with:
require(dplyr)
df %>%
  apply(1, sort) %>% t %>%
  data.frame %>%
  group_by_all %>%
  slice(1)
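Note that the apply/sort step replaces each kept row with its sorted version, so A and B may come back swapped relative to the input. A base-R sketch that drops the same reordered duplicates while keeping each surviving row exactly as it appears in df:
key <- apply(df, 1, function(r) paste(sort(r), collapse = "|")) # order-independent row key
df[!duplicated(key), ]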
I'd like to pull some data from a SQL database with a dynamic filter. I'm using the great R package dplyr in the following way:
#Create the filter
filter_criteria = ~ column1 %in% some_vector
#Connect to the database
connection <- src_mysql(dbname = "mydbname",
                        user = "myusername",
                        password = "mypwd",
                        host = "myhost")
#Get data
data <- connection %>%
  tbl("mytable") %>%                   #Specify which table
  filter_(.dots = filter_criteria) %>% #Non-standard evaluation filter
  collect()                            #Pull data
This piece of code works fine, but now I'd like to loop it over all the columns of my table, so I'd like to write the filter as:
#Dynamic filter
i <- 2 #With a loop on this i for instance
which_column <- paste0("column",i)
filter_criteria <- ~ which_column %in% some_vector
And then reapply the first code with the updated filter.
Unfortunately this approach doesn't give the expected results: it raises no error, but it doesn't pull any results into R.
In particular, I looked a bit into the SQL query generated by the two pieces of code and there is one important difference.
While the first, working, code generates a query of the form:
SELECT ... FROM ... WHERE
`column1` IN ....
(backticks around the column name), while the second one generates a query of the form:
SELECT ... FROM ... WHERE
'column1' IN ....
(single quotes around the column name).
Does anyone have any suggestion on how to formulate the filtering condition to make it work?
It's not really related to SQL. This example in R does not work either:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
df %>% filter_(~ "v1" == 1)
It does not work because you need to pass filter_ the expression ~ v1 == 1, not the expression ~ "v1" == 1.
To solve the problem, simply use the quoting function quo and the unquoting operator !!:
library(dplyr)
which_column <- quo(v1)
df %>% filter(!!which_column == 1)
An alternative solution: as of dplyr version 0.5.0 (possibly implemented earlier than that), it is possible to pass a composed string as the .dots argument, which I find more readable than the lazyeval::interp solution:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_col <- "v1"
which_val <- 1
df %>% filter_(.dots = paste0(which_col, " == ", which_val))
v1 v2
1 1 1
2 1 2
3 1 4
UPDATE for dplyr 0.6 and later:
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
df %>% filter(UQ(rlang::sym(which_col))==which_val)
#OR
df %>% filter((!!rlang::sym(which_col))==which_val)
(Similar to @Matthew's response for dplyr 0.6, but here I assume that which_col is a string variable.)
2nd UPDATE: Edwin Thoen created a nice cheatsheet for tidy evaluation: https://edwinth.github.io/blog/dplyr-recipes/
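With current dplyr (>= 1.0.0), the .data pronoun offers perhaps the most direct spelling when the column name is held as a string; a minimal sketch:
df %>% filter(.data[[which_col]] == which_val)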
Here's a slightly less verbose solution, one which uses the typical behavior of the extract function `[` to select a column by character value rather than converting it to a language element:
df %>% filter(., '['(., which_column) == 1)
set.seed(123)
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_column <- "v1"
df %>% filter(., '['(., which_column) == 1)
# v1 v2
#1 1 5