regex_left_join (fuzzyjoin) not working as expected


I am trying to perform a join in R based on a regex pattern from one table. From what I understand, the fuzzyjoin package should be exactly what I need, but I can't get it to work. Here is an example of what I'm trying to do:
library(tidyverse)
library(fuzzyjoin)
(typing_table <- tribble(
~typing,
"DPB02:01",
"DPB04:02"
)
)
(P_group_table <- tribble(
~P_group, ~Alleles,
"DP1", "DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
"DP2", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
"DP3", "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
"DP4", "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"
)
)
I am trying to join the P_group_table to the typing_table by searching for the "typing" value in the "Alleles" string. I have used the following expression:
(typing_table %>% regex_left_join(P_group_table, by = c("typing" = "Alleles")))
Which results in a join, but the values from the right table are empty. I assume I must be misunderstanding the syntax of the regex_left_join expression, but I can't figure it out. I have verified that the "typing" value can be used as a regex pattern with the following code:
(typing_table_2 <- typing_table %>% slice_head)
(P_group_table %>% filter(str_detect(Alleles, typing_table_2$typing)))

We could make use of str_detect as match_fun in fuzzy_.*_join
library(fuzzyjoin)
library(stringr)
fuzzy_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"),
match_fun = str_detect)
Or we may use regex_right_join directly:
regex_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"))
# A tibble: 2 × 3
P_group Alleles typing
<chr> <chr> <chr>
1 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03 DPB02:01
2 <NA> <NA> DPB04:02
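Note the difference when we switch the string and the pattern. The regex_*_join functions treat the left table's column as the string and the right table's column as the regex, which is why the original regex_left_join with "typing" on the left found no matches: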
> str_detect("DPB02:01", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03")
[1] FALSE
> str_detect("DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03", "DPB02:01")
[1] TRUE
One option to do the left_join is by getting the substring from the 'P_group_table' before doing the join
left_join(typing_table, P_group_table %>%
mutate(typing = substr(Alleles, 1, 8)), by = "typing")
# A tibble: 2 × 3
typing P_group Alleles
<chr> <chr> <chr>
1 DPB02:01 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03
2 DPB04:02 <NA> <NA>
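If the typing table has to stay on the left, one possible sketch is to pass fuzzy_left_join a custom match_fun that swaps its two arguments, so that the right-hand "Alleles" column is the string being searched and the left-hand "typing" column supplies the pattern. The anonymous function below is an illustration rather than part of the original answer, and it assumes the typing values are safe to use as regular expressions.

```r
library(tibble)
library(stringr)
library(fuzzyjoin)

typing_table <- tribble(
  ~typing,
  "DPB02:01",
  "DPB04:02"
)

P_group_table <- tribble(
  ~P_group, ~Alleles,
  "DP1", "DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
  "DP2", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
  "DP3", "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
  "DP4", "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"
)

# match_fun is called as match_fun(left values, right values).
# Swapping them makes Alleles the string and typing the pattern.
result <- fuzzy_left_join(
  typing_table, P_group_table,
  by = c("typing" = "Alleles"),
  match_fun = function(x, y) str_detect(y, x)
)

print(result)
```

This keeps every row of typing_table and fills P_group and Alleles with NA where no allele string contains the typing value.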

Related

R dplyr mutate: creating one variable from multiple column variables using "or" logic

I am trying to do something that I think is straightforward but I am having an issue with.
I have several medication-related column variables (med_1, med_2, med_3 for example). These are character variables- so they have text for the name of medications
I want to combine them all into variable anymed using or logic, so that I can then use anymed to look at any medications reported across all medication related fields.
I am trying the following, for dataset FinalData.
FinalData <- FinalData %>% mutate(anymed = med_1 | med_2 | med_3)
I am receiving this error:
*Error: Problem with `mutate()` column `anymed`.
ℹ `anymed = |...`.
x operations are possible only for numeric, logical or complex types*
Could someone help explain what code I should use instead since these are characters? Do I need to convert to factors?

Are you looking for this kind of solution:
# data:
df <- tibble(med_1 = "A", med_2 = "B", med_3 = "C")
library(dplyr)
df %>%
mutate(any_med = paste(c(med_1, med_2, med_3), collapse = " | "))
med_1 med_2 med_3 any_med
<chr> <chr> <chr> <chr>
1 A B C A | B | C
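The error in the question arises because `|` only works on logical (or numeric) values, not on character strings. A minimal sketch of two alternatives follows, assuming that "any medication" means at least one of the three columns is non-missing. The medication names are made up for illustration. Note also that paste() with collapse, as used above, merges all rows into a single string once the data has more than one row, whereas paste() with sep works row by row.

```r
library(dplyr)

# Hypothetical example data; NA means no medication was reported in that field
df <- tibble(
  med_1 = c("aspirin", NA, NA),
  med_2 = c(NA, "ibuprofen", NA),
  med_3 = c("statin", NA, NA)
)

result <- df %>%
  mutate(
    # Convert each column to a logical first, then combine with `|`
    any_med = !is.na(med_1) | !is.na(med_2) | !is.na(med_3),
    # Row-wise text combination; missing values appear as the text "NA"
    med_list = paste(med_1, med_2, med_3, sep = " | ")
  )

print(result)
```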

You want to use pivot_longer from tidyverse to get them all in the same column. I also dropped the column name (i.e., col), but you could remove that line if you want to know what column the medication came from. I'm unsure what your data looks like, so I just made a small example to show how to do it.
library(tidyverse)
FinalData %>%
pivot_longer(-ind, names_to = "col", values_to = "anymed") %>%
select(-col)
Output
# A tibble: 6 × 2
ind anymed
<dbl> <chr>
1 1 meda
2 1 meda
3 1 meda
4 2 medb
5 2 medb
6 2 medb
It's a little unclear what your expected output is. But if you are wanting to combine all medications in each row, then you can also use unite.
FinalData %>%
unite("any_med", c("med_1", "med_2", "med_3"), sep = " | ")
Output
ind any_med
1 1 meda | meda | meda
2 2 medb | medb | medb
Data
FinalData <-
structure(
list(
ind = c(1, 2),
med_1 = c("meda", "medb"),
med_2 = c("meda",
"medb"),
med_3 = c("meda", "medb")
),
class = "data.frame",
row.names = c(NA,-2L)
)

Splitting with pipe and additional spaces around this symbol if any using separate in R

How to separate a column into many, based on a symbol "|" and any additional spaces around this symbol if any:
input <- tibble(A = c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3"))
output <- tibble(B = c("Ae1 tt1", "Be1") , C = c("Ae2 tt2", "Be2"), D = c(NA, "Be3"))
I tried :
input %>%
separate(A, c("B","C","D"))
#separate(A, c("B","C","D"), sep = "|.")
#mutate(B = str_split(A, "*|")) %>% unnest
What is the syntax with regex ?
Solution from R - separate with specific symbol, vertical bare, | (and tidyr::separate() producing unexpected results) does not provide expected output and produces a warning:
input %>% separate(col=A, into=c("B","C","D"), sep = '\\|')
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 "Ae1 tt1 " " Ae2 tt2" <NA>
2 "Be1 " " Be2 " " Be3"
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
Using separate from tidyr with different length vectors does not seem related unfortunately.

You can use
output <- input %>%
separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
R test:
> input %>% separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 Ae1 tt1 Ae2 tt2 <NA>
2 Be1 Be2 Be3
The \s*\|\s* pattern matches a pipe char with any zero or more whitespace chars on both ends of the pipe.
The fill="right" argument fills with missing values on the right.
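To see what the pattern does on its own, here is a small base R sketch that applies the same regex with strsplit. It only illustrates the splitting behaviour and is not a replacement for separate, since it returns a list instead of columns.

```r
x <- c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3")

# Split on a pipe together with any whitespace on either side of it
parts <- strsplit(x, "\\s*\\|\\s*", perl = TRUE)

print(parts)
```

Because the surrounding whitespace is consumed by the separator, the pieces come back already trimmed, and spaces inside a piece such as "Ae1 tt1" are left alone.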

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
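As a self-contained illustration of the grepl approach, the sketch below uses a small made-up data frame (the column names id, x and y are hypothetical) and then converts one cleaned column to numeric as a possible follow-up step.

```r
# Hypothetical example data mixing numbers stored as text with "Invalid..." markers
df <- data.frame(
  id = 1:3,
  x = c("1.5", "Invalid(N/A)", "2.5"),
  y = c("Invalid(1.23456)", "3", "4"),
  stringsAsFactors = FALSE
)

# Build a logical matrix that is TRUE wherever a cell starts with "Invalid"
mask <- sapply(df, function(col) grepl("^Invalid", col))

# Set the flagged cells to NA
is.na(df) <- mask

# The cleaned character column can now be converted without coercion warnings
df$x <- as.numeric(df$x)

print(df)
```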

The str_replace function is designed to edit part of a character string rather than replace the whole value, and the pattern "invalid" is case-sensitive, so it never matches "Invalid". Also, across(everything()) targets all of the columns, including the numeric id. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA

How can we split string and extract the text between round brackets

I need to split the string in dataframe to two columns, the first one contains the value before the round brackets and the second column contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = data_frame(study_name,results)
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expexted_df = data_frame(study, reference)

You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
If you want to extract the text in the parentheses, using extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested

We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
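For comparison, a base R sketch without additional packages is shown below. It assumes each value contains exactly one bracketed part at the end of the string, as in the example data.

```r
study_name <- c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)")

# Drop everything from the optional whitespace before "(" to the end of the string
study <- sub("\\s*\\(.*$", "", study_name)

# Drop everything before the first "(" and keep the bracketed part
reference <- sub("^[^(]*", "", study_name)

result <- data.frame(study, reference, stringsAsFactors = FALSE)
print(result)
```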

find all the records belongs to particular Id are decommissioned

I need to find the records which are decommissioned based on date.
below is the example data frame:
input date: 2020-08-01(YYYY-MM-DD)
df <- data.frame(cel = c("cel12", "cel34", "cel05", "cel98", "cel67",
"cel35", "cel05", "cel45", "cel12","cel99","cel45"),
sect = c("sect56", "sect56", "sect56", "sect78", "sect78",
"sect60", "sect51", "sect51",
"sect98", "sect98", "sect98"),
site = c("site14","site14", "site08", "site08", "site08",
"site89", "site89", "site08", "site24",
"site24", "site36"),
decomdate = c(as.Date("2020-02-01"),as.Date("2020-03-01"), as.Date("2020-12-01"), as.Date("2020-05-01"), NA, NA, as.Date("2020-12-01"), as.Date("2020-07-01"), as.Date("2020-06-01"), NA, NA))
If all the 'cel' values in a particular 'sect' that belong to a particular 'site' are decommissioned (i.e. decomdate < input date), then that 'sect' is decommissioned.
Expected Output: sect column with decommissioned sects
sect
sect56
sect51

For each sect we can check if all the decomdate is less than input date.
library(dplyr)
input <- as.Date('2020-08-01')
df %>% group_by(sect) %>% filter(all(input > decomdate))
# cel sect decomdate
# <chr> <chr> <date>
#1 cel12 sect56 1964-06-20
#2 cel34 sect56 1964-06-19
#3 cel05 sect56 1964-06-17
#4 cel05 sect51 1964-06-17
#5 cel45 sect51 1964-06-15
To get only sect back we can use distinct :
df %>% group_by(sect) %>% filter(all(input > decomdate)) %>% distinct(sect)
# sect
# <chr>
#1 sect56
#2 sect51
This can also be done using base R with subset and ave :
unique(subset(df, ave(input > decomdate, sect, FUN = all), select = sect))
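With the data exactly as posted in the question, grouping by sect alone returns no rows, because every sect contains at least one cel with a later or missing decomdate. One reading of the question is that the check applies to each combination of sect and site. The sketch below follows that assumption and treats a missing decomdate as "not decommissioned". The column name all_decom is introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  cel = c("cel12", "cel34", "cel05", "cel98", "cel67",
          "cel35", "cel05", "cel45", "cel12", "cel99", "cel45"),
  sect = c("sect56", "sect56", "sect56", "sect78", "sect78",
           "sect60", "sect51", "sect51", "sect98", "sect98", "sect98"),
  site = c("site14", "site14", "site08", "site08", "site08",
           "site89", "site89", "site08", "site24", "site24", "site36"),
  decomdate = as.Date(c("2020-02-01", "2020-03-01", "2020-12-01", "2020-05-01",
                        NA, NA, "2020-12-01", "2020-07-01", "2020-06-01", NA, NA))
)

input <- as.Date("2020-08-01")

decommissioned <- df %>%
  group_by(sect, site) %>%
  # A group counts only if every cel has a non-missing date before the input date
  summarise(all_decom = all(!is.na(decomdate) & decomdate < input),
            .groups = "drop") %>%
  filter(all_decom) %>%
  distinct(sect)

print(decommissioned)
```

On this data the result contains sect51 and sect56, the two sects listed in the expected output.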
