I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, i.e. Zoe Boston and Jane Rome, would go into a new column called name.
I tried to get the positions of the spaces in every id with str_locate_all and then use those positions with str_sub. However, I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
                      "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")) %>%
  mutate(coor = str_locate_all(id, "\\s"),
         name = str_sub(id, start = coor[[1]], end = coor[[3]]))
You can use a regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract the 1st and 2nd words after the leading #-id.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - a hash at the start of the string
\\w+ - a word
\\s - whitespace
( - start of capture group
\\w+ - a word
\\s - whitespace
\\w+ - another word
) - end of capture group
.* - the remaining string
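If you prefer to stay within stringr, the same pattern should also work with str_replace, which supports backreferences (a quick sketch, assuming the tibble with the id column is stored in data as above):
library(dplyr)
library(stringr)
# same regex; the backreference \1 keeps only the capture group
data %>%
  mutate(name = str_replace(id, "^#\\w+\\s(\\w+\\s\\w+).*", "\\1"))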
The str_locate_all approach is more involved: it returns the positions of every whitespace, so you need to take the position just after the 1st whitespace as the start and the position just before the 3rd whitespace as the end, and then pass those to str_sub to extract the text in between.
library(dplyr)
library(stringr)
library(purrr)
data %>%
  mutate(coor = str_locate_all(id, "\\s"),
         start = map_dbl(coor, `[`, 1) + 1,   # position just after the 1st space
         end = map_dbl(coor, `[`, 3) - 1,     # position just before the 3rd space
         name = str_sub(id, start, end)) %>%
  select(-coor, -start, -end)                 # drop the helper columns
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
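To see why indexing the matrices with 1 and 3 works, this is roughly what str_locate_all() returns for the first id (assuming data holds the tibble above); the matrix is indexed column-major, so element 1 is the start of the 1st space and element 3 is the start of the 3rd space:
str_locate_all(data$id[1], "\\s")[[1]]
#      start end
# [1,]     9   9
# [2,]    13  13
# [3,]    20  20
# [4,]    26  26
# [5,]    30  30
# [6,]    38  38
# so the name sits between positions 9 + 1 = 10 and 20 - 1 = 19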
Another possible solution, using the stringr, purrr and dplyr packages:
library(stringr)
library(purrr)
library(dplyr)
data %>%
  mutate(name = map_chr(str_split(id, " "), ~ paste(unlist(.)[2:3], collapse = " ")))
Explanation:
str_split(id, " ") creates, for each id, a list of the terms that are separated by a whitespace
map_chr takes each of these lists and applies a function to it: unlist it, take the elements in positions 2 and 3 (which are the name we want) and collapse them into a single string with a whitespace between them
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
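To make this concrete, the split for the first id looks roughly like this (again assuming the tibble is stored in data); elements 2 and 3 are the name:
str_split(data$id, " ")[[1]]
# [1] "#1265746"  "Zoe"  "Boston"  "58962"  "st."  "Victory"  "cont_1.0)"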
Related
I have the following text, and I want to extract 5 words after a specific word from a string vector:
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remains fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
I want to extract the next 5 words after the word Brazil or Argentina and then combine them into a single string.
I have used the following script, but it only gets the exact word, not the phrase after it:
pattern <- paste(my_teams, collapse = "|")
v <- unlist(str_extract_all(tolower(my_text), pattern))
paste(v, collapse=' ')
Any suggestions would be appreciated. Thanks!
You can use
library(stringr)
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remain fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
pattern <- paste0("(?i)\\b(?:", paste(my_teams, collapse = "|"), ")\\s+(\\S+(?:\\s+\\S+){4})")
res <- lapply(str_match_all(my_text, pattern), function (m) m[,2])
v <- unlist(res)
paste(v, collapse=' ')
# => [1] "from the top of the won the final within 90"
Note the use of str_match_all, which keeps the captured texts.
Details:
(?i) - case insensitive matching on
\b - a word boundary
(?:Brazil|Argentina) - one of the countries
\s+ - one or more whitespaces
(\S+(?:\s+\S+){4}) - Group 1: one or more non-whitespace characters, followed by four repetitions of one or more whitespaces plus one or more non-whitespace characters.
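For reference, this is roughly the matrix str_match_all() returns for the text above; column 2 holds capture group 1, which is what the lapply() call extracts:
str_match_all(my_text, pattern)[[1]]
#      [,1]                                [,2]
# [1,] "Brazil from the top of the"        "from the top of the"
# [2,] "Argentina won the final within 90" "won the final within 90"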
Maybe not the best possible, but:
Split into a vector of words, remove non-word characters, lowercase (to match targets):
words <- strsplit(my_text, '\\s', perl = TRUE)[[1]] |>
  gsub(pattern = "\\W", replacement = "", perl = TRUE) |>
  tolower()
Find locations of targets, get strings, paste back together:
loc <- which(words %in% my_teams)
sapply(loc, \(i) words[(i + 1):(i + 5)], simplify = FALSE) |>
  sapply(paste, collapse = " ")
[1] "have failed to dislodge brazil" "from the top of the"
[3] "won the final within 90" "in the last eight tournaments"
[5] "the 1998 finalists getting beyond"
Maybe you need one more paste(collapse = " ") at the end?
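That extra collapse might look like this (a sketch reusing the words and loc objects from above):
sapply(loc, \(i) words[(i + 1):(i + 5)], simplify = FALSE) |>
  sapply(paste, collapse = " ") |>
  paste(collapse = " ")
# [1] "have failed to dislodge brazil from the top of the won the final within 90 in the last eight tournaments the 1998 finalists getting beyond"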
Here is an alternative approach:
transform the vector to a tibble
use separate_rows to get one word per row
create a helper column x with the lower-case word
start a new group at each word matching brazil or argentina (cumsum over a grepl match)
remove group == 0 (the words before the first match)
take words 2 to 6 in each group
finally summarise:
my_teams <- tolower(c("Brazil", "Argentina"))
library(dplyr)
library(tidyr)
tibble(my_text = my_text) %>%
  separate_rows(my_text, sep = " ") %>%
  mutate(x = tolower(my_text)) %>%
  group_by(group = cumsum(grepl(paste(my_teams, collapse = "|"), x))) %>%
  filter(group > 0) %>%
  slice(2:6) %>%
  summarise(x = paste(my_text, collapse = " "))
group x
<int> <chr>
1 1 have failed to dislodge
2 2 from the top of the
3 3 won the final within 90
4 4 In the last eight tournaments
5 5 the 1998 finalists, getting beyond
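If, as in the question, a single combined string is wanted, the grouped result (stored here in a hypothetical object res) can be collapsed once more:
library(dplyr)
# res is assumed to hold the summarised tibble from the pipeline above
res %>%
  pull(x) %>%
  paste(collapse = " ")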
I've been beating my head against this for a while and was hoping for some suggestions. I'm trying to extract semicolon-delimited text from a row in a data frame, perform an internal lookup on a string in that row based on the extracted values, and then output the result (along with another extracted variable) in long format, repeating this for every row in the data frame. I can do the first and last manipulations with str_split, and I think I could loop everything with apply, but the internal lookup (join?) has me tied in knots. I'd like to imagine that I could do this with dplyr, but I haven't managed it.
Starting with a data frame:
name<-"Adam, B.C.; Dave, E.F.; Gerald, H."
school<-"[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index<-12345
foo<-data.frame(name,school,index)
foo
name school index
1 Adam, B.C.; Dave, E.F.; Gerald, H. [Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia 12345
Desired output:
name school index
Adam, B.C. U.Penn 12345
Dave, E.F. U.Georgia 12345
Gerald, H. U.Penn 12345
etc. etc. etc.
thanks!
A mixture of tidyr::separate() and tidyr::separate_rows() could do the trick:
library(tidyverse)
foo |>
  # split the school string at every "[" (the first piece is empty)
  tidyr::separate_rows(school, sep = "\\[", convert = T) |>
  # split each piece at "]" into the list of names and the school
  tidyr::separate(col = school, into = c("name", "school"), sep = "]") |>
  # one row per name
  tidyr::separate_rows(name, sep = ";", convert = T) |>
  # drop the empty first row created by the leading "["
  slice(-1) |>
  mutate(across(everything(), trimws)) |>
  # strip any trailing ";" that is left over (e.g. on U.Penn)
  mutate(across(everything(), str_remove, ";"))
Output:
# A tibble: 3 x 3
index name school
<chr> <chr> <chr>
1 12345 Adam, B.C. U.Penn
2 12345 Gerald, H. U.Penn
3 12345 Dave, E.F. U.Georgia
I have a dataframe like so:
df = data.frame('name' = c('California parks', 'bear lake', 'beautiful tree house', 'banana plant'), 'extract' = c('parks', 'bear', 'tree', 'plant'))
How do I remove the strings of the 'extract' column from the name column to get the following result:
name_new = California, lake, beautiful house, banana
I suspect this demands a combination of str_extract and lapply but I can't quite figure it out.
Thanks!
str_remove and str_replace are vectorized over both string and pattern. So, since we have two columns, we can pass 'name' and 'extract' as the string and pattern arguments to remove the substrings from 'name' elementwise. Once those substrings are removed, there may be leftover spaces before or after, which can be handled with trimws (for leading/trailing spaces) and str_replace_all (to collapse any doubled spaces).
library(dplyr)
library(stringr)
df %>%
  mutate(name_new = str_remove(name, extract),
         name_new = str_replace_all(trimws(name_new), "\\s{2,}", " "))
# name extract name_new
#1 California parks parks California
#2 bear lake bear lake
#3 beautiful tree house tree beautiful house
#4 banana plant plant banana
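To see the elementwise pairing of string and pattern on its own:
library(stringr)
# the i-th pattern is removed from the i-th string
str_remove(c("California parks", "bear lake"), c("parks", "bear"))
# [1] "California " " lake"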
A base R option using gsub + Vectorize
within(df, name_new <- trimws(Vectorize(gsub)(paste0("\\s*", extract, "\\s*"), " ", name)))
which gives
name extract name_new
1 California parks parks California
2 bear lake bear lake
3 beautiful tree house tree beautiful house
4 banana plant plant banana
I have a large number of text files. Each file is stored as an observation in a dataframe. Each observation contains multiple fields so there is some structure in each object. I'm looking to split each based on the structured information within each file.
Data is currently in the following structure (simplified):
a <- c("Name: John Doe Age: 50 Address Please give full address 22 Main Street, New York")
b <- c("Name: Jane Bloggs Age: 42 Address Please give full address 1 Lower Street, London")
df <- data.frame(rawtext = c(a,b))
I'd like to split each observation into individual variable columns. It should end up looking like this:
Name Age Address
John Doe 50 22 Main Street, New York
Jane Bloggs 42 1 Lower Street, London
I thought that this could be done fairly simply using a pre-defined vector of delimiters, since each text object is structured. I have tried using stringr and str_split(), but this doesn't handle the vector input, e.g.:
delims <- c("Name:", "Age", "Address Please give full address")
str_split(df$rawtext, delims)
I'm perhaps trying to oversimplify here. The only other approach I can think of is to loop through each observation and extract all text after delims[1] and before delims[2] (and so on) for all fields.
e.g. the following bodge would get me the name field based on the delimiters:
sub(paste0(".*", delims[1]), "", df$rawtext[1]) %>% sub(paste0(delims[2], ".*"), "", .)
[1] " John Doe "
This feels extremely inefficient. Is there a better way that I'm missing?
A tidyverse solution:
library(tidyverse)
delims <- c("Name", "Age", "Address Please give full address")
df %>%
  mutate(rawtext = str_remove_all(rawtext, ":")) %>%
  separate(rawtext, c("x", delims), sep = paste(delims, collapse = "|"), convert = T) %>%
  mutate(across(where(is.character), str_squish), x = NULL)
# # A tibble: 2 x 3
# Name Age `Address Please give full address`
# <chr> <dbl> <chr>
# 1 John Doe 50 22 Main Street, New York
# 2 Jane Bloggs 42 1 Lower Street, London
Note: convert = T in separate() converts Age from character to numeric ignoring leading/trailing whitespaces.
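The separator passed to separate() here is simply the delimiters collapsed into one alternation pattern:
paste(delims, collapse = "|")
# [1] "Name|Age|Address Please give full address"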
I have a problem with a database of people's names. I want to abbreviate the first names but not the last names. The last name is separated from the first name by a comma, and different people are separated from each other by a semicolon, as in this example:
Michael, Jordan; Bird, Larry;
If the name is a single word, the code would be like this:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is with compound names. With this code, the name:
Jordan, Michael Larry;
It would be:
Jordan, Michael L.;
Could someone tell me how to remove all the lowercase letters between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by ",", while the parts of the first names are followed by a space or ";". Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two, into = c("First", "Middle"), sep = " ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the First and Middle initials
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
Last First Middle First_i Middle_i Initials Final
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Michael Jordan NA J "" J. Michael, J.
2 Jordan Michael Larry M L. M.L. Jordan, M.L.
3 Bird Larry NA L "" L. Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There will probably be a better way to do this, but I managed to get it to work using the stringr, tibble and dplyr packages.
library(stringr)
library(tibble)
library(dplyr)
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into last names and first names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches for each run of lowercase letters in the first names and replaces it with a single "."
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
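Written out, that recombination step is:
str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ')
# [1] "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W."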
...also, did you mean Michael Jordan, not Jordan Michael? lol
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"