How to split strings based on backslash with regex in R - r

string_dat <- structure(list(ID = c(2455, 2455), Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")",
"Vertex, Inc.")), class = "data.frame", row.names = c(NA, -2L
))
> string_dat
ID Location
1 2455 c("Southside of Dune", "The Hogwarts Express")
2 2455 Vertex, Inc.
I would like to expand the data.frame above based on Location.
library(tidyr)
> string_dat %>% tidyr::separate_rows(Location, sep = ",")
# A tibble: 4 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 " \"The Hogwarts Express\")"
3 2455 "Vertex"
4 2455 " Inc."
Splitting only on , wrongly splits Vertex, Inc. into two entries. It also leaves the c(\" prefix and the \") suffix in the first two strings.
I also tried to remove the c(\" at the beginning by using gsub, but it gave me the following error.
> gsub('c(\"', "", x = string_dat$Location)
Error in gsub("c(\"", "", x = string_dat$Location) :
invalid regular expression 'c("', reason 'Missing ')''
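The error occurs because ( is a regex metacharacter that opens a group, so the pattern has an unclosed parenthesis. A minimal sketch of two ways around it, assuming the goal is only to strip the literal prefix; the vector x is an illustrative stand-in for string_dat$Location:

```r
x <- c('c("Southside of Dune", "The Hogwarts Express")', "Vertex, Inc.")

# Option 1: treat the pattern as a literal string, so "(" needs no escaping
literal <- gsub('c("', "", x, fixed = TRUE)

# Option 2: keep regex matching, escape the parenthesis, anchor at the start
escaped <- gsub('^c\\("', "", x)

print(literal)
print(escaped)
```

Either call removes only the leading prefix; the closing quote and parenthesis would still need separate handling.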
My desired output is
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "Southside of Dune"
2 2455 "The Hogwarts Express"
3 2455 "Vertex, Inc."
********** Edit **********
library(tidyverse)
string_dat %>%
mutate(
# mark twin elements with `;`:
Location = str_replace(Location, '",', '";'),
# remove string-first `c` and all non-alphanumeric characters
# except `,`, `.`, and `;`:
Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 "c(\"Southside of Dune\""
2 2455 "\"The Hogwarts Express\")"
3 2455 "Vertex, Inc."

Here's an approach that combines data cleaning with separate_rows:
library(tidyverse)
string_dat %>%
  mutate(
    # mark twin elements with `;`:
    Location = str_replace(Location, '",', '";'),
    # remove string-first `c` and all non-alphanumeric characters
    # except `,`, `.`, `;`, and the space:
    Location = str_replace_all(Location, '^c|(?![.,; ])\\W', '')) %>%
  separate_rows(Location, sep = '; ')
# A tibble: 3 × 2
ID Location
<dbl> <chr>
1 2455 Southside of Dune
2 2455 The Hogwarts Express
3 2455 Vertex, Inc.
How the regex ^c|(?![.,; ])\\W works:
^c: matches literal c at the beginning of the string
|: initiates alternation (i.e., "OR")
(?![.,; ])\\W: matches any non-alphanumeric character (\\W with upper-case "W") unless it is a period, comma, semicolon, or space; the negative lookahead implements this exception to the \\W character class
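The effect of the pattern can be checked in isolation. The sketch below uses base R's gsub with perl = TRUE (needed for the lookahead) rather than stringr, and starts from strings in which the '",' has already been turned into '";'; the names x, cleaned and pieces are illustrative only:

```r
x <- c('c("Southside of Dune"; "The Hogwarts Express")', "Vertex, Inc.")

# Drop a leading "c" and every non-word character except . , ; and space
cleaned <- gsub('^c|(?![.,; ])\\W', '', x, perl = TRUE)

# Split the cleaned strings on the "; " marker
pieces <- unlist(strsplit(cleaned, "; ", fixed = TRUE))
print(pieces)
```

Note that the c in "Inc." survives because ^c only matches at the start of the string.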

The Location column has a strange data format. In the 1st element, it stores R code, because it's using the c("s1", "s2") syntax for a two-element character vector. For the 2nd element, you're missing escaped quotes for this to be valid R code for a one-element character vector.
If I manually edit the 2nd element to add these quotation marks, then we can easily evaluate the R code contained in the Location column, and then unnest the resulting list column. This might be easier than attempting to edit the strings programmatically?
library(tidyverse)
string_dat <- data.frame(
ID = c(2455, 2455),
Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", "\"Vertex, Inc.\"")
)
string_dat %>%
  rowwise() %>%
  mutate(Location = list(eval(parse(text = Location)))) %>%
  unnest(cols = Location)
#> # A tibble: 3 × 2
#> ID Location
#> <dbl> <chr>
#> 1 2455 Southside of Dune
#> 2 2455 The Hogwarts Express
#> 3 2455 Vertex, Inc.
Created on 2022-09-23 by the reprex package (v2.0.1)
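If editing the data by hand is not practical, one possible variation is to evaluate only the entries that look like a c(...) call and keep everything else as a plain string. This is a sketch in base R, not part of the answer above; parse_location is a hypothetical helper name, and eval(parse()) should only ever be applied to data you trust:

```r
string_dat <- data.frame(
  ID = c(2455, 2455),
  Location = c("c(\"Southside of Dune\", \"The Hogwarts Express\")", "Vertex, Inc."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate entries that start with "c(", keep the rest verbatim.
# eval(parse()) runs arbitrary code, so this assumes trusted input.
parse_location <- function(s) {
  if (startsWith(s, "c(")) eval(parse(text = s)) else s
}

locations <- lapply(string_dat$Location, parse_location)

# Repeat each ID once per extracted location
expanded <- data.frame(
  ID = rep(string_dat$ID, lengths(locations)),
  Location = unlist(locations),
  stringsAsFactors = FALSE
)
print(expanded)
```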

Related

How many elements in common on multiple lists?

Hi, I'm working with a dataset that has a column named "genres", a list of string vectors containing all the genre tags of each film. I want to create a plot that shows the popularity of each genre.
structure(list(anime_id = c("10152", "11061", "11266", "11757",
"11771"), Name.x = c("Kimi ni Todoke 2nd Season: Kataomoi", "Hunter x Hunter (2011)",
"Ao no Exorcist: Kuro no Iede", "Sword Art Online", "Kuroko no Basket"
), genres = list("Romance", c("Action", " Adventure", " Fantasy"
), "Fantasy", c("Action", " Adventure", " Fantasy", " Romance"
), "Sports")), row.names = c(NA, 5L), class = "data.frame")
Initially, the genres column was a string with genres separated by commas, for example: ['action', 'drama', 'fantasy']. To work with it, I ran this code to edit the column:
AnimeList2022new$genres <- gsub("\\[|\\]|'", "",
                                as.character(AnimeList2022new$genres))
AnimeList2022new$genres <- strsplit(AnimeList2022new$genres, ",")
I don't know how to compare all the vectors in order to find out how many times each tag appears.
I'm trying with group_by and summarise
genresdata <-MyAnimeList %>%
group_by(genres) %>%
summarise( count = n() ) %>%
arrange( -count)
but obviously this code groups identical vectors, not the identical strings contained in the vectors.
This is the output: [image not included]
Your genres column is of class list, so it sounds like you want the length() of each row in it. Generally, we could do that like this:
MyAnimeList %>%
mutate(n_genres = sapply(genres, length))
But this is a special case where there is a nice convenience function lengths() (notice the s at the end) built-in to R that gives us the same result, so we can simply do
MyAnimeList %>%
mutate(n_genres = lengths(genres))
The above will give the number of genres for each row.
In the comments I see you say you want "for example how many times "Action" appears in the whole column". For that, we can unnest() the genre list column and then count:
library(tidyr)
MyAnimeList %>%
unnest(genres) %>%
count(genres)
# # A tibble: 7 × 2
# genres n
# <chr> <int>
# 1 " Adventure" 2
# 2 " Fantasy" 2
# 3 " Romance" 1
# 4 "Action" 2
# 5 "Fantasy" 1
# 6 "Romance" 1
# 7 "Sports" 1
Do notice that some of your genres have leading white space. It's probably best to solve this problem "upstream", wherever the genre column was created, but we could do it now using trimws to trim whitespace:
MyAnimeList %>%
unnest(genres) %>%
count(trimws(genres))
# # A tibble: 5 × 2
# `trimws(genres)` n
# <chr> <int>
# 1 Action 2
# 2 Adventure 2
# 3 Fantasy 3
# 4 Romance 2
# 5 Sports 1
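For the "upstream" fix, one option is to split on a comma plus any following whitespace, so that no leading spaces are produced in the first place. A minimal base R sketch; raw_genres is made-up sample input in the bracketed format described in the question:

```r
# Made-up sample in the original bracketed string format
raw_genres <- c("['Romance']", "['Action', 'Adventure', 'Fantasy']", "['Fantasy']")

# Strip brackets and quotes, as in the question
stripped <- gsub("\\[|\\]|'", "", raw_genres)

# Split on a comma followed by optional whitespace, so tags come out already trimmed
genres <- strsplit(stripped, ",\\s*")

# Count how often each tag appears across all rows
counts <- table(unlist(genres))
print(counts)
```

The table() call plays the same role as unnest() followed by count(), just without the tidyverse dependency.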

R convert character string to a dataframe

Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Individuals are separated by a period (.). There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Hélène),female10/06/2011#Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013#Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input -
result = data.frame(first_names = c('Hélène Andrew','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Joëli Pau Hajil'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('male','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
Here's a tidy solution with tidyr's functions separate_rows and extract:
library(tidyr)
data.frame(txt) %>%
  # separate `txt` into rows using the dot `.` *if*
  # preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
  separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
  extract(
    # select column from which to extract:
    txt,
    # define column names into which to extract:
    into = c("family_name","first_names","gender",
             "birthday","birth_city","birth_country",
             "acc_type","acc_num","district","code"),
    # describe the string exhaustively using capturing groups
    # `(...)` to delimit what's to be extracted;
    # `[^)]+` (rather than `[\\w,]+`) also admits hyphens and
    # apostrophes in first names, and the dot in `dist\\.` is escaped:
    regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)#(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist\\.(\\d+),Dt\\.([\\d/]+)")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
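As a cross-check, the same two steps (split on the record-ending dot, then capture the ten fields) can be sketched in base R. The sample txt_small is a shortened, ASCII-only variant of txt made up for illustration, and the names records, pattern, fields and result are assumptions. Unlike the tidyr call above, this sketch keeps the dist. and Dt. prefixes inside the captured groups and replaces the commas between first names with spaces:

```r
# Shortened sample in the same layout as `txt` (illustrative, not the full data)
txt_small <- paste0(
  "BOUKAR(Mohamed,El-Hadi),male04/12/1956#London(England),PPF,2001X005729,dist.097,Dt.043/997.",
  "THOMAS(Hajil,Pau),female03/03/1980#Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
)

# Split on a dot only when it directly follows a code such as Dt.043/997
records <- strsplit(txt_small, "(?<=Dt\\.\\d{3}/\\d{3})\\.", perl = TRUE)[[1]]

# One capturing group per attribute; [^)]+ admits hyphens inside first names
pattern <- paste0(
  "^([A-Z]+)\\(([^)]+)\\),([a-z]+)([0-9/]+)#(\\w+)\\((\\w+)\\),",
  "([A-Z]+),(\\w+),(dist\\.\\d+),(Dt\\.[0-9/]+)$"
)
fields <- regmatches(records, regexec(pattern, records, perl = TRUE))

# Drop the full-match column and name the ten captured groups
result <- as.data.frame(do.call(rbind, fields)[, -1, drop = FALSE],
                        stringsAsFactors = FALSE)
names(result) <- c("family_name", "first_names", "gender", "birthday",
                   "birth_city", "birth_country", "acc_type", "acc_num",
                   "district", "code")
result$first_names <- gsub(",", " ", result$first_names, fixed = TRUE)
print(result)
```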
Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
  txt %>%
  # Replace "." sep with newline
  stringr::str_replace_all(
    "\\.[A-Z]",
    function(x) stringr::str_replace(x, "\\.", "\n")
  ) %>%
  # Replace all commas in (First[,Middle1,Middle2,...]) with space
  stringr::str_replace_all(
    # Match anything inside brackets, but as few times as possible, so we don't
    # match multiple brackets
    "\\(.*?\\)",
    # Inside the regex that was matched, replace comma with space
    function(x) stringr::str_replace_all(x, ",", " ")
  ) %>%
  # Replace ( with ,
  stringr::str_replace_all("\\(", ",") %>%
  # Remove )
  stringr::str_remove_all("\\)") %>%
  # Replace # with ,
  stringr::str_replace_all("#", ",") %>%
  # Replace the last "." with a newline
  stringr::str_replace_all("\\.$", "\n") %>%
  # Add , after female/male
  stringr::str_replace_all("male", "male,") %>%
  # Read as comma delimited file (works since string contains \n)
  readr::read_delim(
    file = .,
    delim = ",",
    col_names = FALSE,
    show_col_types = FALSE
  )
# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Hélène female 10/06/2… Geneva Switzerland PPF
#> 2 BOUKAR Mohamed El-Hadi male 04/12/1… London England PPF
#> 3 HARIMA Olak N’nassik G… female 25/06/2… Paris France PPF
#> 4 THOMAS Hajil Pau Joëli female 03/03/1… Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Created on 2022-03-20 by the reprex package (v2.0.1)
Note that there are probably more efficient regexes one could use, but I believe this is simpler and easier to change later.

Splitting with pipe and additional spaces around this symbol if any using separate in R

How to separate a column into many, based on a symbol "|" and any additional spaces around this symbol if any:
input <- tibble(A = c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3"))
output <- tibble(B = c("Ae1 tt1", "Be1") , C = c("Ae2 tt2", "Be2"), D = c(NA, "Be3"))
I tried :
input %>%
separate(A, c("B","C","D"))
#separate(A, c("B","C","D"), sep = "|.")
#mutate(B = str_split(A, "*|")) %>% unnest
What is the syntax with regex ?
Solution from R - separate with specific symbol, vertical bare, | (and tidyr::separate() producing unexpected results) does not provide expected output and produces a warning:
input %>% separate(col=A, into=c("B","C","D"), sep = '\\|')
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 "Ae1 tt1 " " Ae2 tt2" <NA>
2 "Be1 " " Be2 " " Be3"
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
Using separate from tidyr with different length vectors does not seem related unfortunately.
You can use
output <- input %>%
separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
R test:
> input %>% separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 Ae1 tt1 Ae2 tt2 <NA>
2 Be1 Be2 Be3
The \s*\|\s* pattern matches a pipe char with any zero or more whitespace chars on both ends of the pipe.
The fill="right" argument fills with missing values on the right.
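The same pattern also works outside tidyr. Below is a minimal base R sketch, assuming a plain data.frame instead of a tibble; the intermediate names parts and padded are illustrative:

```r
input <- data.frame(A = c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3"),
                    stringsAsFactors = FALSE)

# Split on a pipe together with any whitespace on either side of it
parts <- strsplit(input$A, "\\s*\\|\\s*", perl = TRUE)

# Indexing past the end of a vector yields NA, which pads short rows on the right
padded <- t(sapply(parts, function(p) p[1:3]))

output <- as.data.frame(padded, stringsAsFactors = FALSE)
names(output) <- c("B", "C", "D")
print(output)
```

Padding with p[1:3] mirrors what fill="right" does in separate: rows with fewer pieces get NA in the trailing columns.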

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place  Number  Name
PHI    80      J.Matthews
NE     5       J.Mills
KC     10      T. Hill
Edit for further explanation:
Most strings include other people in the same format, so "grabbed by" needs to be incorporated in some way to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place  Number  Name
KC     10      T. Hill
This solution simply extracts the components based on the logic the OP described, i.e. it captures the needed characters as three groups: 1) one or more upper-case letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+) followed by another dash, and 3) the non-whitespace characters (\\S+) that follow the second dash, up to the last dot of that token.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
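The capture-group idea can also be sketched with base R's regexec and regmatches, which return the groups directly. The vector col1 below is a subset of the data above, the name pattern assumes the "Initial.Surname" shape seen in the examples, and the object names m and parsed are illustrative:

```r
col1 <- c(
  "blah, grabbed by PHI-80-J.Matthews.",
  "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)

# Anchor on "grabbed by" so that other PLACE-NUMBER-NAME triples are ignored
pattern <- "grabbed by ([A-Z]+)-([0-9]+)-([A-Za-z]\\.[A-Za-z]+)"
m <- regmatches(col1, regexec(pattern, col1))

# Column 1 holds the full match; columns 2 to 4 hold the captured groups
parsed <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                        stringsAsFactors = FALSE)
names(parsed) <- c("Place", "Number", "Name")
parsed$Number <- as.integer(parsed$Number)
print(parsed)
```

Because the name group stops at the surname, no trailing dot has to be stripped afterwards.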
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
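library(dplyr)   # for mutate()
library(tidyr)
library(stringr)
df1 %>%
  # the lookbehind keeps only the triple that follows "grabbed by ",
  # so earlier triples such as OAK-4-D.Carr are skipped:
  mutate(col1 = str_extract(col1, "(?<=grabbed by )\\w+-\\w+-\\w\\.\\w+")) %>%
  separate(col1,
           into = c("Place", "Number", "Name"),
           sep = "-")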
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
Here is an alternative: first use sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" to remove everything after the name (that is, from the second dot onwards).
Then use separate to split the string at " by ", and separate once more to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
  separate(test1, c('remove', 'keep'), sep = " by ") %>%
  separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
  select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill

How can we split string and extract the text between round brackets

I need to split the string in dataframe to two columns, the first one contains the value before the round brackets and the second column contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = data_frame(study_name,results)
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expexted_df = data_frame(study, reference)
You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
If you want to extract the text in the parentheses, using extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested
We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
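For comparison, the same split can be sketched in base R without any packages. This assumes every string contains exactly one parenthesised chunk (regmatches silently drops elements without a match); the vector names mirror those used in the question:

```r
study_name <- c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)")

# Everything before the whitespace that precedes the first "("
study <- sub("\\s*\\(.*$", "", study_name)

# The first parenthesised chunk, parentheses included
reference <- regmatches(study_name, regexpr("\\([^()]+\\)", study_name))

result <- data.frame(study, reference, stringsAsFactors = FALSE)
print(result)
```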

Resources