I have a problem with a database of people's names. I want to abbreviate the given names but not the last names. The last name is separated from the given name by a comma, and different people are separated from each other by a semicolon, as in this example:
Michael, Jordan; Bird, Larry;
If the given name is a single word, this code works:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is with compound names. With this code, the name:
Jordan, Michael Larry;
becomes:
Jordan, Michael L.;
Could someone tell me how to remove all lowercase letters between the comma and the semicolon, so that it looks like this?
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by a comma, while the parts of the given names are followed by a space or a semicolon. Here I use the lookahead (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
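The two steps above can be wrapped into one function so they apply to a whole column at once. This is only a sketch: abbreviate_given is a hypothetical helper name, and it assumes single-word surnames (a surname such as "Van Dyke" would have its first word abbreviated too, because that word is followed by a space).

```r
# Hypothetical helper combining the two gsub() steps from this answer.
abbreviate_given <- function(x) {
  # Step 1: shorten every capitalised word that is followed by a space or ";"
  initials <- gsub("([A-Z])[a-z]+(?=[ ;])", "\\1.", x, perl = TRUE)
  # Step 2: drop the space left between consecutive initials
  gsub("\\. ", ".", initials)
}

abbreviate_given(c("Michael, Jordan; Bird, Larry;", "Jordan, Michael Larry;"))
# [1] "Michael, J.; Bird, L.;" "Jordan, M.L.;"
```

Because gsub() is vectorised, the same call should work directly on a column such as breve$autor.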
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two, into=c("First", "Middle"), sep=" ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the first and middle initials
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
  Last    First   Middle First_i Middle_i Initials Final
  <chr>   <chr>   <chr>  <chr>   <chr>    <chr>    <chr>
1 Michael Jordan  NA     J       ""       J.       Michael, J.
2 Jordan  Michael Larry  M       L.       M.L.     Jordan, M.L.
3 Bird    Larry   NA     L       ""       L.       Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There is probably a better way to do this, but I managed to get it to work using the stringr and tibble packages (plus dplyr for pull()).
library(stringr)
library(tibble)
library(dplyr)  # provides pull()
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into last and given names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches the given names (column V2) for runs of lowercase letters and replaces each run with a single .
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
...also, did you mean Michael Jordan, not Jordan Michael? lol
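The split, abbreviate, and rejoin steps of this answer can also be sketched in base R without any packages. The variable names here (names_str, people, parts, and so on) are illustrative, and the sketch assumes every entry has exactly one ", " between the last name and the given names.

```r
names_str <- "Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker"

# Split into people, then into last name / given names
people <- strsplit(names_str, "; ", fixed = TRUE)[[1]]
parts  <- strsplit(people, ", ", fixed = TRUE)
last   <- vapply(parts, `[`, character(1), 1)
given  <- vapply(parts, `[`, character(1), 2)

# Replace each run of lowercase letters in the given names with "."
given_abbr <- gsub("[a-z]+", ".", given)

# Rejoin into a single string
result <- paste(paste(last, given_abbr, sep = ", "), collapse = "; ")
result
# [1] "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W."
```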
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"
I have the following text, and I want to extract 5 words after a specific word from a string vector:
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remains fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
I want to extract the next 5 words after the word Brazil or Argentina and then combine them as an entire string.
I have used the following script to get the exact word, but not the phrases after a specific word:
pattern <- paste(my_teams, collapse = "|")
v <- unlist(str_extract_all(tolower(my_text), pattern))
paste(v, collapse=' ')
Any suggestions would be appreciated. Thanks!
You can use
library(stringr)
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remain fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
pattern <- paste0("(?i)\\b(?:", paste(my_teams, collapse = "|"), ")\\s+(\\S+(?:\\s+\\S+){4})")
res <- lapply(str_match_all(my_text, pattern), function (m) m[,2])
v <- unlist(res)
paste(v, collapse=' ')
# => [1] "from the top of the won the final within 90"
Note the use of str_match_all, which returns the captured groups along with the full match.
Details:
(?i) - case insensitive matching on
\b - a word boundary
(?:Brazil|Argentina) - one of the countries
\s+ - one or more whitespaces
(\S+(?:\s+\S+){4}) - Group 1: one or more non-whitespaces and then four repetitions of one or more whitespaces followed with one or more non-whitespaces.
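As a rough illustration of how the pattern pieces above fit together, here is a base R sketch that builds the same kind of pattern for an arbitrary number of words. The function name words_after and its arguments are hypothetical, and it assumes each keyword is a single word with no whitespace. A keyword followed directly by punctuation (such as "Argentina,") is not matched, because the pattern requires whitespace right after the keyword.

```r
# Hypothetical helper: return the n words that follow any of the keywords.
words_after <- function(text, keywords, n = 5) {
  pattern <- paste0("(?i)\\b(?:", paste(keywords, collapse = "|"), ")",
                    "\\s+\\S+(?:\\s+\\S+){", n - 1, "}")
  # Full matches include the keyword itself ...
  m <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # ... so strip the leading keyword and the whitespace after it
  sub("^\\S+\\s+", "", m)
}

txt <- "Brazil beat Chile two nil yesterday while Argentina, sadly, lost."
words_after(txt, c("brazil", "argentina"), n = 3)
# [1] "beat Chile two"
```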
Maybe not the best possible, but:
Split into a vector of words, remove non-word characters, lowercase (to match targets):
words <- strsplit(my_text,'\\s', perl= TRUE)[[1]] |>
gsub(pattern = "\\W", replacement = "", perl = TRUE) |>
tolower()
Find locations of targets, get strings, paste back together:
loc <- which(words %in% my_teams)
sapply(loc, \(i) words[(i+1):(i+5)], simplify= FALSE) |>
sapply(paste, collapse=" ")
[1] "have failed to dislodge brazil" "from the top of the"
[3] "won the final within 90" "in the last eight tournaments"
[5] "the 1998 finalists getting beyond"
Maybe you need one more paste(., collapse = " ") at the end ?
Here is an alternative approach:
transform vector to tibble
use separate_rows to get one word in row
create helper x with lower case word
make groups starting with brazil or argentina
remove group == 0
get word 2 to 6 in each group
finally, summarise:
my_teams <- tolower(c("Brazil", "Argentina"))
library(dplyr)
library(tidyr)
tibble(my_text = my_text) %>%
separate_rows(my_text, sep = " ") %>%
mutate(x = tolower(my_text)) %>%
group_by(group = cumsum(grepl(paste(my_teams, collapse = "|"), x))) %>%
filter(group > 0) %>%
slice(2:6) %>%
summarise(x = paste(my_text, collapse = " "))
group x
<int> <chr>
1 1 have failed to dislodge
2 2 from the top of the
3 3 won the final within 90
4 4 In the last eight tournaments
5 5 the 1998 finalists, getting beyond
I've been beating my head against this for a while and was hoping for some suggestions. I'm trying to extract semicolon-delimited text from a row in a data frame, perform an internal lookup on a string in that row based on the extracted values, and then output that (along with another extracted variable) in long format, repeating this for every row in the data frame. I can do the first and last manipulations with str_split, and I think I could just loop everything with apply, but the internal lookup (join?) has me tied in knots. I'd like to imagine that I could do this w/ dplyr but
Starting with a data frame:
name<-"Adam, B.C.; Dave, E.F.; Gerald, H."
school<-"[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index<-12345
foo<-data.frame(name,school,index)
foo
name school index
1 Adam, B.C.; Dave, E.F.; Gerald, H. [Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia 12345
Desired output:
name school index
Adam, B.C. U.Penn 12345
Dave, E.F. U.Georgia 12345
Gerald, H. U.Penn 12345
etc. etc. etc.
thanks!
A mixture of tidyr::separate() and tidyr::separate_rows() could do the trick:
library(tidyverse)
foo |>
tidyr::separate_rows(school, sep = "\\[", convert = T) |>
tidyr::separate(col = school, into = c("name", "school"), sep = "]") |>
tidyr::separate_rows(name, sep = ";", convert = T) |>
slice(-1) |>
mutate(across(everything(), trimws)) |>
mutate(across(everything(), str_remove, ";" ))
Output:
# A tibble: 3 x 3
index name school
<chr> <chr> <chr>
1 12345 Adam, B.C. U.Penn
2 12345 Gerald, H. U.Penn
3 12345 Dave, E.F. U.Georgia
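The "internal lookup" described in the question can also be sketched in base R for a single row: build a named vector that maps each person to a school, then index it by the names from the name column. This is only a sketch under the assumption that every school entry has the form [names]school and that the names inside the brackets match the name column exactly; the variable names are illustrative.

```r
name   <- "Adam, B.C.; Dave, E.F.; Gerald, H."
school <- "[Adam, B.C.; Gerald, H.]U.Penn; [Dave, E.F.]U.Georgia"
index  <- 12345

# One element per "[names]school" group
groups <- regmatches(school,
                     gregexpr("\\[[^\\]]*\\][^;]*", school, perl = TRUE))[[1]]

# People inside the brackets, and the school after the closing bracket
members <- strsplit(sub("^\\[([^\\]]*)\\].*$", "\\1", groups, perl = TRUE), ";\\s*")
schools <- trimws(sub("^\\[[^\\]]*\\]", "", groups, perl = TRUE))

# Named lookup vector: person -> school
lookup <- setNames(rep(schools, lengths(members)), trimws(unlist(members)))

# Look up each person from the name column, keeping the original order
people <- trimws(strsplit(name, ";")[[1]])
out <- data.frame(name = people, school = unname(lookup[people]),
                  index = index, stringsAsFactors = FALSE)
out
```

For a data frame with many rows, the same steps could be wrapped in a function and applied row by row, then combined with rbind.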
I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, should go into a new column, name.
I tried to get the positions of the spaces in every id with str_locate_all and then pass those positions to str_sub. However, I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>%
mutate(coor = str_locate_all(id, "\\s"),
name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )
You can use regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract 1st and 2nd word.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with hash
\\w+ - A word
\\s - Whitespace
( - start of capture group
\\w+ - A word
followed by \\s - whitespace
\\w+ - another word
) - end of capture group.
.* - remaining string.
The str_locate_all approach is more complex: it returns the positions of every whitespace character, so you need to take the position after the 1st whitespace and the position before the 3rd, then use str_sub to extract the text between them.
library(dplyr)
library(stringr)
library(purrr)
data %>%
mutate(coor = str_locate_all(id, "\\s"),
start = map_dbl(coor, `[`, 1) + 1,
end = map_dbl(coor, `[`, 3) - 1,
name = str_sub(id, start, end)) %>%
# keep only the columns shown in the output below
select(id, name)
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
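The same position-based idea can be sketched in base R with gregexpr() and substring(), which avoids the list column. This is a sketch that assumes every id contains at least three whitespace characters.

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# Positions of every whitespace character, one integer vector per string
spaces <- gregexpr("\\s", id)

# One past the 1st whitespace, one before the 3rd
start <- vapply(spaces, function(p) p[1], integer(1)) + 1L
end   <- vapply(spaces, function(p) p[3], integer(1)) - 1L

name <- substring(id, start, end)
name
# [1] "Zoe Boston" "Jane Rome"
```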
Another possible solution using stringr and purrr packages
library(stringr)
library(purrr)
library(dplyr)
data %>%
mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
in str_split(id, " ") we create a list of the terms that are separated inside id by a whitespace
map_chr is useful to take each one of these lists, and apply the following function to them: unlist the list, take the elements in positions 2 and 3 (which are the name we want) and then collapse them with a whitespace between them
Output
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
I have a dataframe with several columns, and one of those columns is populated by pipes "|" and information that I am trying to obtain.
For example:
View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"
And so on for about 120K rows.
What I am trying to extract is everything between the 5th pipe and the 6th pipe in each string, but in its own column vector, so the end result looks like this:
View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron
I don't want to use RegEx. My tools are only limited to R here. Would you guys happen to have any advice how to overcome this?
Thank you.
1) Assuming Table as defined reproducibly in the Note at the end, use read.table as shown. No regular expressions or packages are used.
read.table(text = Table$Column, sep = "|", header = FALSE,
as.is = TRUE, fill = TRUE)[6]
giving:
V6
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
2) This alternative does use a regular expression (which the question asked not to) but just in case here is a tidyr solution. Note that it requires tidyr 0.8.2 or later since earlier versions of tidyr did not support NA in the into= argument.
library(dplyr)
library(tidyr)
Table %>%
separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")
giving:
commodity
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split= argument of strsplit is "" then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels | and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |) and collapse them together using paste.
data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
wx <- which(cumsum(chars == "|") == 5)
paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)
giving:
commodities
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
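To see how the labelling in (3) works, here is the intermediate state for a single string. This is a sketch for illustration only; it uses wx[-1] to drop the leading pipe, which is a small variation on the seq() call above.

```r
chars <- strsplit("|1||KK|12|Gold||4K|", "")[[1]]

# Running count of pipes: every character is labelled with the number
# of the most recent pipe at or before it
field <- cumsum(chars == "|")

# Positions labelled 5: the 5th pipe and the characters of the field after it
wx <- which(field == 5)
chars[wx]
# [1] "|" "G" "o" "l" "d"

# Drop the pipe itself and glue the rest back together
commodity <- paste(chars[wx[-1]], collapse = "")
commodity
# [1] "Gold"
```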
Note
Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||",
"|1||KK|Y|Iron|y|12||",
"|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F"), stringsAsFactors = FALSE)
You can try this:
df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]
1) We can use strsplit from base R on the delimiter | and extract the 6th element from the list of vectors
sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
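The fixed = TRUE argument matters here because | is the alternation operator in a regular expression. A quick sketch of the difference on one of the strings from the question:

```r
x <- "|1||KK|12|Gold||4K|"

# As a regex, "|" means "empty OR empty", which matches the empty string
# everywhere, so the string falls apart into single characters
as_regex <- strsplit(x, "|")[[1]]
length(as_regex) == nchar(x)
# [1] TRUE

# As a fixed string, "|" is an ordinary delimiter
fields <- strsplit(x, "|", fixed = TRUE)[[1]]
fields[6]
# [1] "Gold"
```

The first field is the empty string before the leading pipe, which is why the text between the 5th and 6th pipes is element 6.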
2) Or using regex (again from base R), use sub to extract the 6th word
sub("^([|][^|]+){4}[|]([^|]*).*", "\\2",
gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
data
Table <- structure(list(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA,
-6L))
My dataset looks like this below
Id Col1
--------------------
133 Mary 7E
281 Feliz 2D
437 Albert 4C
What I am trying to do is take the first two characters of the first word in Col1 and the whole second word, and then merge them.
My final expected dataset should look like this below
Id Col1
--------------------
133 MA7E
281 FE2D
437 AL4C
Any suggestions on how to accomplish this is much appreciated.
You can do
my_data$Col1 <- sub("(\\w{2})(\\w* )(\\b\\w+\\b)", "\\1\\3", my_data$Col1)
my_data$Col1 <- toupper(my_data$Col1)
my_data
# Id Col1
# 1 133 MA7E
# 2 281 FE2D
# 3 437 AL4C
The parentheses delimit the capture groups that are matched; only the first and the third are retained. \\w matches letters, digits and the underscore, and \\b matches a word boundary.
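To inspect what each capture group holds, regexec() with regmatches() returns the full match followed by the groups. A small sketch on one of the values from the question:

```r
pattern <- "(\\w{2})(\\w* )(\\b\\w+\\b)"

# Element 1 is the full match; elements 2 to 4 are the three groups
groups <- regmatches("Albert 4C", regexec(pattern, "Albert 4C"))[[1]]
groups
# [1] "Albert 4C" "Al"        "bert "     "4C"

# Keeping groups 1 and 3 and upper-casing gives the requested code
code <- toupper(paste0(groups[2], groups[4]))
code
# [1] "AL4C"
```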
We can also paste0 together the output of substr and str_split within a dplyr pipe chain. Using simplify = TRUE makes str_split return a matrix, so [, 2] picks the second word of every row rather than only the first row's:
df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
library(dplyr)
library(stringr)
df %>%
mutate(Col1 = toupper(paste0(substr(Col1, 1, 2),
str_split(Col1, ' ', simplify = TRUE)[, 2])))
You can do this in several steps. First split by space, take the first two letters of the name and capitalize them, then paste that together with the second part. The result is in the column final. You could keep all these intermediate steps or chain the commands into fewer statements, whatever floats your boat.
xy <- data.frame(id = c(133, 281, 437),
name = c("Mary 7E", "Feliz 2D", "Albert 4C"),
stringsAsFactors = FALSE)
xy$first <- sapply(strsplit(xy$name, " "), "[", 1)
xy$second <- sapply(strsplit(xy$name, " "), "[", 2)
xy$first_upper <- toupper(substr(x = xy$first, start = 1, stop = 2))
xy$final <- paste(xy$first_upper, xy$second, sep = "")
xy
   id      name  first second first_upper final
1 133   Mary 7E   Mary     7E          MA  MA7E
2 281  Feliz 2D  Feliz     2D          FE  FE2D
3 437 Albert 4C Albert     4C          AL  AL4C
Here is another variation using sub. We can use lookarounds in Perl mode to selectively remove everything except for the first two, and last two, characters. Then, make a call to toupper() to capitalize all letters.
df$Col1 <- toupper(sub("(?<=^..).*(?=..$)", "", df$Col1, perl=TRUE))
[1] "MA7E" "FE2D" "AL4C"
Rather than a one-line solution, this version is easy to interpret and modify:
library(dplyr)
library(stringi)
xx_df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
xx_df %>%
mutate(xpart1 = stri_split_fixed(Col1, " ", simplify = T)[,1]) %>%
mutate(xpart2 = stri_split_fixed(Col1, " ", simplify = T)[,2]) %>%
mutate(Col1_new = paste0(substr(xpart1,1,2), substr(xpart2, 1, 2))) %>%
select(id, Col1 = Col1_new) %>%
mutate(Col1 = toupper(Col1))
The result is:
id Col1
1 133 MA7E
2 281 FE2D
3 437 AL4C
For this solution, use substr to take the first 2 characters of each string and the last 2. To select the last 2 we need the string lengths from nchar, computed here via sapply. Then paste0 the two pieces together, using toupper to capitalize the letters.
l2 <- sapply(df$Col1, function(x) nchar(x))
paste0(toupper(substr(df$Col1,1,2)), substr(df$Col1, l2-1, l2))
[1] "MA7E" "FE2D" "AL4C"