I've used this method, but it doesn't work.
My data includes values like:
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
So I tried:
clients <- gsub("$\\s*<U\\+\\w+>", "", clients)
But it doesn't work.
clients <- gsub("[<].*[>]", "", clients)
You have a $ as the first character of your pattern. $ matches the end of the string, so nothing placed after it can ever match. Move it to the end of the pattern:
> gsub("\\s*<U\\+\\w+>$", "", clients)
[1] "Greg Smith" "John Coolman" "Mr. Brown"
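To see why the anchor position matters, here is a small sketch that runs both patterns on the question's data. The names `misplaced` and `anchored` are only illustrative. With `$` first, the rest of the pattern would have to match after the end of the string, so it should never match anything.

```r
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")

# `$` first: everything after it would have to match past the end of the string
misplaced <- gsub("$\\s*<U\\+\\w+>", "", clients)

# `$` last: the marker (plus any whitespace before it) must end the string
anchored <- gsub("\\s*<U\\+\\w+>$", "", clients)

identical(misplaced, clients)  # TRUE when the first pattern changed nothing
anchored
```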
If you want to remove only the literal text <U+2032> (note that the space before it is left behind):
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
clients <- gsub("<U\\+2032>", "", clients)
clients
# [1] "Greg Smith " "John Coolman" "Mr. Brown "
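The output above keeps a trailing space. A possible variation, assuming the marker is stored as the literal text `<U+2032>` as in the question, is to match it as a fixed string and trim afterwards. The name `cleaned` is just for illustration.

```r
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")

# fixed = TRUE treats "<U+2032>" as plain text, so `+` needs no escaping;
# trimws() then drops the space left in front of the removed marker
cleaned <- trimws(gsub("<U+2032>", "", clients, fixed = TRUE))
cleaned
```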
Related
I have a vector of names like so:
x <- c("bob smith", "greg a taylor", "lindsey louise brown")
Each entry is a first name and a last name with either nothing in between, a middle initial, or a whole middle name. I want to remove the middle name wherever it exists, so I should get
"bob smith", "greg taylor", "lindsey brown"
as the output. How is this possible in R?
Thanks!
We could use capture groups
sub('^(\\w+).*\\b(\\w+)$', '\\1 \\2', x)
#[1] "bob smith" "greg taylor" "lindsey brown"
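If the first and last names are wanted as separate columns rather than one string, the same two capture groups can be pulled out with `regexec` and `regmatches`. This is a sketch rather than part of the original answer; `name_parts` is a made-up name, and `perl = TRUE` is my choice here.

```r
x <- c("bob smith", "greg a taylor", "lindsey louise brown")

# Each list element holds the full match followed by the two captured groups
m <- regmatches(x, regexec("^(\\w+).*\\b(\\w+)$", x, perl = TRUE))

name_parts <- data.frame(
  first = vapply(m, function(g) g[2], character(1)),
  last  = vapply(m, function(g) g[3], character(1)),
  stringsAsFactors = FALSE
)
name_parts
```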
Use sub with a lookahead (this needs perl = TRUE):
sub("\\s+\\S+(?=\\s)", "", x, perl = TRUE)
or with a capture group that puts the space back:
sub("\\s+\\S+(\\s)", "\\1", x, perl = TRUE)
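One thing to keep in mind is that `sub` replaces only the first match. The sketch below uses a made-up entry with two middle names to contrast it with `gsub`; the variables `y`, `first_only` and `all_middle` are illustrative.

```r
y <- c("greg a taylor", "mary ann lee jones")  # second entry is a made-up test case

# sub() drops only the first middle token
first_only <- sub("\\s+\\S+(?=\\s)", "", y, perl = TRUE)

# gsub() drops every token that is followed by more whitespace
all_middle <- gsub("\\s+\\S+(?=\\s)", "", y, perl = TRUE)

first_only
all_middle
```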
I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText  John Smith is a 60 y.o. MoreText"
b <- "YetMoreText  Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: removes a match anchored at the start (^) made of zero or more letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: removes a match anchored at the end ($) made of a whitespace character (\\s), the sequence is, and any number of non-newline characters (.*).
library(stringr)  # also provides the %>% pipe
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
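If stringr is not available, roughly the same alternation can be passed to base R's `gsub`. This is a sketch, not part of the original answer: it assumes the POSIX class is written as `[[:alpha:]]`, and it collects the three strings in a vector named `texts` for convenience.

```r
texts <- c(
  "SomeOtherText John Smith is a 60 y.o. MoreText",
  "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
  "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
)

# Remove the leading word plus whitespace, and everything from " is" onwards
extracted <- gsub("^[[:alpha:]]*\\s+|\\sis.*$", "", texts)
extracted
```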
You can use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters.
(.*) is a capture group that captures everything until
"is", followed by a number and either y.o or y/o, is encountered.
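Note that `sub` returns its input unchanged when the pattern does not match, which can hide failures. A possible guard, sketched here with a hypothetical helper `extract_name_or_na`, returns `NA` instead. The test strings assume two spaces before each name, as the `\\s{2,}` part of the pattern requires.

```r
extract_name_or_na <- function(x) {
  pattern <- ".*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*"
  # Only substitute where the pattern matches; otherwise report NA
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

texts <- c(
  "SomeOtherText  John Smith is a 60 y.o. MoreText",
  "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree",
  "No name or age here"  # made-up non-matching input
)
result <- extract_name_or_na(texts)
result
```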
I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:
"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
Would become:
         name                         text
1   Bob Smith Hi Steve. How are you doing?
2 Steve Brown      Hi Bob. I'm doing well!
Question: How do I split the statements from the names? I tried splitting on the colon:
data <- strsplit(data, split=":")
But then I get this:
"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"
When what I want is this:
"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"
I doubt this will fix all of your parsing needs, but your most immediate question can be solved with strsplit and lookaround assertions. These require Perl-compatible regex (perl = TRUE).
Here strsplit splits on either a colon followed by a space, or a space that has a punctuation character immediately before it and only word characters or spaces between it and the next colon. \\pP matches punctuation characters and \\w matches word characters. The lookbehind (?<=...) and lookahead (?=...) only check their conditions without consuming text, so the punctuation stays attached to the statement.
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data,split="(: |(?<=\\pP) (?=[\\w ]+:))",perl=TRUE)
[[1]]
[1] "Bob Smith" "Hi Steve. How are you doing?" "Steve Brown"
[4] "Hi Bob. I'm doing well!"
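Since the split result alternates between speaker and statement, it can be folded into the data frame the question asks for. The following is a sketch that assumes the vector always has that strict name/text alternation; `parts` and `debate` are illustrative names.

```r
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Odd positions are speakers, even positions are their statements
debate <- data.frame(
  name = parts[c(TRUE, FALSE)],
  text = parts[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
debate
```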
We can extract these with regex using the stringr package. You then directly have the columns of speaker and quote you are looking for.
a <- "Bob: Hi Steve. Steve: Hi Bob."
library(stringr)
str_match_all(a, "([A-Za-z]*?): (.*?\\.)")
#> [[1]]
#>      [,1]             [,2]    [,3]
#> [1,] "Bob: Hi Steve." "Bob"   "Hi Steve."
#> [2,] "Steve: Hi Bob." "Steve" "Hi Bob."
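A rough base R counterpart, in case stringr is not installed, is to pull out each "speaker: sentence" chunk with `gregexpr` and then split it at the first colon. This is only a sketch under the same assumptions as the pattern above (single-word speaker names, statements ending at the first period); `chunks` and `dialogue` are made-up names.

```r
a <- "Bob: Hi Steve. Steve: Hi Bob."

# One chunk per "Speaker: statement." match
chunks <- regmatches(a, gregexpr("[A-Za-z]+: .*?\\.", a, perl = TRUE))[[1]]

dialogue <- data.frame(
  speaker   = sub(":.*", "", chunks),        # text before the first colon
  statement = sub("^[^:]*: ", "", chunks),   # text after "Speaker: "
  stringsAsFactors = FALSE
)
dialogue
```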
The 2000 names I have are a mix of "first name middle name last name" and "first name last name". My code only works for the names that have a middle name. Please see the toy example.
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")
last.name <- gsub("[A-Z]+ [A-Z]*","\\", names)
last.name is
" SMITH" "" "" " CARLO" "-YOUNG"
LOVE JOY and JACKY LEE don't have any results.
P.S. This is not a duplicate post, since the previous ones do not use gsub.
Replace everything up to the last space with the empty string. No packages are used.
sub(".* ", "", names)
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
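The pattern works because `.*` is greedy and therefore runs up to the last space. A mirrored pattern keeps everything before the last space, which may be handy if the given names are needed too. This is a sketch; `last` and `given` are illustrative names.

```r
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")

last  <- sub(".* ", "", names)       # greedy .* consumes up to the last space
given <- sub(" [^ ]+$", "", names)   # drop the last space and the final word

data.frame(given, last, stringsAsFactors = FALSE)
```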
Note:
Regarding the comment below on two-word last names: that does not appear to be part of the question as stated. If it were, suppose the first word of such a last name is either DEL or VAN. Replace the space after it with a colon, say, perform the sub above, and then revert the colon back to a space.
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
"EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN") # test data
sub(":", " ", sub(".* ", "", sub(" (DEL|VAN) ", " \\1:", names2)))
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO"
## [7] "VAN ALLEN"
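The three nested calls can be wrapped in a small helper so the list of particles is easy to change. This is a sketch with a hypothetical function `last_name` and a hypothetical `particles` argument; it assumes no name contains a colon and that at most one particle precedes the last name.

```r
last_name <- function(x, particles = c("DEL", "VAN")) {
  # Build " (DEL|VAN) " from the particle list
  pattern <- paste0(" (", paste(particles, collapse = "|"), ") ")
  # Protect the space after the particle, e.g. "ARTHUR DEL:GATO"
  protected <- sub(pattern, " \\1:", x)
  # Keep what follows the last space, then restore the protected space
  sub(":", " ", sub(".* ", "", protected))
}

names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
            "EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN")
last_name(names2)
```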
Alternatively, extract everything after the last space:
library(stringr)
str_extract(names, '[^ ]+$')
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Or, as mikeck suggests, split the string on spaces and take the last word:
sapply(strsplit(names, " "), tail, 1)
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
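The stringr call above also has a base R counterpart using `regexpr` and `regmatches`. One caveat in this sketch is that `regmatches` silently drops elements with no match (for example an empty string), so the result can be shorter than the input.

```r
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")

# Extract the run of non-space characters at the end of each string
last_words <- regmatches(names, regexpr("[^ ]+$", names))
last_words
```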
My vector has around 3000 observations like:
clients <- c("Greg Smith", "John Coolman", "Mr. Brown", "John Nightsmith (father)", "2 Nicolas Cage")
How can I keep only the entries that consist solely of letters? For example, only Greg Smith and John Coolman (no characters like 0-9 . ? : [ } etc.).
We can use grep to match only upper- or lower-case letters and spaces from the start (^) to the end ($) of the string.
grep('^[A-Za-z ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
Or use the [:alpha:] character class together with a space:
grep('^[[:alpha:] ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
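Since the question mentions subsetting rows, the same pattern can be used with `grepl`, which returns a logical vector suitable for indexing a data frame. This sketch assumes the names live in a column called `client` of a data frame `df`; both names are made up for illustration.

```r
clients <- c("Greg Smith", "John Coolman", "Mr. Brown",
             "John Nightsmith (father)", "2 Nicolas Cage")
df <- data.frame(client = clients, stringsAsFactors = FALSE)

# TRUE only for entries made up entirely of letters and spaces
letters_only <- grepl("^[[:alpha:] ]+$", df$client)
kept <- df[letters_only, , drop = FALSE]
kept
```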