I need to split the string in dataframe to two columns, the first one contains the value before the round brackets and the second column contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = data_frame(study_name,results)
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expexted_df = data_frame(study, reference)
You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
If you want to extract the text in the parentheses, using extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested
We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
Related
I am trying to perform a join in R based on a regex pattern from one table. From what I understand, the fuzzyjoin package should be exactly what I need, but I can't get it to work. Here is an example of what I'm trying to do:
library(tidyverse)
library(fuzzyjoin)
(typing_table <- tribble(
~typing,
"DPB02:01",
"DPB04:02"
)
)
(P_group_table <- tribble(
~P_group, ~Alleles,
"DP1", "DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
"DP2", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
"DP3", "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
"DP4", "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"
)
)
I am trying to join the P_group_table to the typing_table by searching for the "typing" value in the "Alleles" string. I have used the following expression:
(typing_table %>% regex_left_join(P_group_table, by = c("typing" = "Alleles")))
Which results in a join, but the values from the right table are empty. I assume I must be misunderstanding the syntax of the regex_left_join expression, but I can't figure it out. I have verified that the "typing" value can be used as a regex pattern with the following code:
(typing_table_2 <- typing_table %>% slice_head)
(P_group_table %>% filter(str_detect(Alleles, typing_table_2$typing)))
We could make use of str_detect as match_fun in fuzzy_.*_join
library(fuzzyjoin)
library(stringr)
fuzzy_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"),
match_fun = str_detect)
Or may use
regex_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"))
# A tibble: 2 × 3
P_group Alleles typing
<chr> <chr> <chr>
1 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03 DPB02:01
2 <NA> <NA> DPB04:02
Note the difference when we switch the pattern
> str_detect("DPB02:01", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03")
[1] FALSE
> str_detect("DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03", "DPB02:01")
[1] TRUE
One option to do the left_join is by getting the substring from the 'P_group_table' before doing the join
left_join(typing_table, P_group_table %>%
mutate(typing = substr(Alleles, 1, 8)), by = "typing")
# A tibble: 2 × 3
typing P_group Alleles
<chr> <chr> <chr>
1 DPB02:01 DP2 DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03
2 DPB04:02 <NA> <NA>
Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Each individual is separated by a . There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Hélène),female10/06/2011#Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013#Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input -
result = data.frame(first_names = c('Hélène Andrew','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Joëli Pau Hajil'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('male','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
Here's a tidy solution with tidyr's functions separate_rows and extract:
library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted:
regex = "([A-Z]+)\\(([\\w,]+)\\),([a-z]+)([\\d/]+)#(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist.(\\d+),Dt\\.([\\d/]+)")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
txt %>%
# Replace "." sep with newline
stringr::str_replace_all(
"\\.[A-Z]",
function(x) stringr::str_replace(x, "\\.", "\n")
) %>%
# Replace all commas in (First[,Middle1,Middle2,...]) with space
stringr::str_replace_all(
# Match anything inside brackets, but as few times as possible, so we don't
# match multiple brackets
"\\(.*?\\)",
# Inside the regex that was matched, replace comma with space
function(x) stringr::str_replace_all(x, ",", " ")
) %>%
# Replace ( with ,
stringr::str_replace_all("\\(", ",") %>%
# Remove )
stringr::str_remove_all("\\)") %>%
# Replace # with ,
stringr::str_replace_all("#", ",") %>%
# Remove the last "."
stringr::str_replace_all("\\.$", "\n") %>%
# Add , after female/male
stringr::str_replace_all("male", "male,") %>%
# Read as comma delimited file (works since string contains \n)
readr::read_delim(
file = .,
delim = ",",
col_names = FALSE,
show_col_types = FALSE
)
# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Hélène female 10/06/2… Geneva Switzerland PPF
#> 2 BOUKAR Mohamed El-Hadi male 04/12/1… London England PPF
#> 3 HARIMA Olak N’nassik G… female 25/06/2… Paris France PPF
#> 4 THOMAS Hajil Pau Joëli female 03/03/1… Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Created on 2022-03-20 by the reprex package (v2.0.1)
Note that there probably exists more efficient regex'es one could use, but I believe this is simpler and easier to change later.
I am trying to do something that I think is straightforward but I am having an issue with.
I have several medication-related column variables (med_1, med_2, med_3 for example). These are character variables- so they have text for the name of medications
I want to combine them all into variable anymed using or logic, so that I can then use anymed to look at any medications reported across all medication related fields.
I am trying the following, for dataset FinalData.
FinalData <- FinalData %>% mutate(anymed = med_1 | med_2 | med_3)
I am receiving this error:
*Error: Problem with `mutate()` column `anymed`.
ℹ `anymed = |...`.
x operations are possible only for numeric, logical or complex types*
Could someone help explain what code I should use instead since these are characters? Do I need to convert to factors?
Are you looking for this kind of solution:
# data:
df <- tibble(med_1 = "A", med_2 = "B", med_3 = "C")
library(dplyr)
df %>%
mutate(any_med = paste(c(med_1, med_2, med_3), collapse = " | "))
med_1 med_2 med_3 any_med
<chr> <chr> <chr> <chr>
1 A B C A | B | C
You want to use pivot_longer from tidyverse to get them all in the same column. I also dropped the column name (i.e., col), but you could remove that line if you want to know what column the medication came from. I'm unsure what your data looks like, so I just made a small example to show how to do it.
library(tidyverse)
FinalData %>%
pivot_longer(-ind, names_to = "col", values_to = "anymed") %>%
select(-col)
Output
# A tibble: 6 × 2
ind anymed
<dbl> <chr>
1 1 meda
2 1 meda
3 1 meda
4 2 medb
5 2 medb
6 2 medb
It's a little unclear what your expected output is. But if you are wanting to combine all medications in each row, then you can also use unite.
FinalData %>%
unite("any_med", c("med_1", "med_2", "med_3"), sep = " | ")
Output
ind any_med
1 1 meda | meda | meda
2 2 medb | medb | medb
Data
FinalData <-
structure(
list(
ind = c(1, 2),
med_1 = c("meda", "medb"),
med_2 = c("meda",
"medb"),
med_3 = c("meda", "medb")
),
class = "data.frame",
row.names = c(NA,-2L)
)
So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place
Number
Name
PHI
80
J.Matthews
NE
5
J.Mills
KC
10
T. Hill
Edit for further explanation:
Most strings include other people in the same format so "downed by" needs to be incorporated in someway to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place
Number
Name
KC
10
T. Hill
This solution simply extract the components based on the logic OP mentioned i.e. capture the characters that are needed as three groups - 1) one or more upper case letter ([A-Z]+) followed by a dash (-), 2) then one or more digits (\\d+), and finally 3) non-whitespace characters (\\S+) that follow the dash
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-ouputt
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the the target substring, then split it into columns:
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
Here is an alternative way using sub with regex "([A-Za-z]+\\.[A-Za-z]+).*", "\\1" that removes the string after the second point.
separate that splits the string by by, and finally again separate to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
I have couple of unstructured sentences like below. Description below is column name
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split this sentence from Col1 to Col5 and count there occurrence like below
Col1 Col2 Col3 Col4
Automatic_lever lever_for for_a a_machine
Vaccum_chamber chamber_with with_additional additional_spare
Glove_box box_for for_R&D R&D
The_Mini Mini_Guage Guage_5 5_sets
Vacuum_chamber chamber_only only
Automatic_lever lever_only only
Also from above columns, can i have the occurence of these words. Like, Vaccum_chamber and Automatic_lever are repeated twice here. Similarly, the occurence of other words?
Here is a tidyverse option
df %>%
rowid_to_column("row") %>%
mutate(words = map(str_split(Description, " "), function(x) {
if (length(x) %% 2 == 0) words <- c(words, "")
idx <- 1:(length(words) - 1)
map_chr(idx, function(i) paste0(x[i:(i + 1)], collapse = "_"))
})) %>%
unnest() %>%
group_by(row) %>%
mutate(
words = str_replace(words, "_NA", ""),
col = paste0("Col", 1:n())) %>%
filter(words != "NA") %>%
spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups: row [6]
# row Description Col1 Col2 Col3 Col4
# <int> <fct> <chr> <chr> <chr> <chr>
#1 1 Automatic lever for a mac… Automatic_… lever_for for_a a_machine
#2 2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3 3 Glove box for R&D Glove_box box_for for_R&D R&D
#4 4 The Mini Guage 5 sets The_Mini Mini_Guage Guage_5 5_sets
#5 5 Vacuum chamber only Vacuum_cha… chamber_o… only ""
#6 6 Automatic lever only Automatic_… lever_only only ""
Explanation: We split the sentences in Description on a single whitespace " ", then concatenate every two words together with a sliding window approach, making sure that there are always an odd odd number of words per sentence; the rest is just a long-to-wide transformation.
Not pretty but it reproduces your expected output; instead of the manual sliding window approach you could also you zoo::rollapply.
Sample data
df <- read.table(text =
"Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = T)