As the title says, I have a string where I want to add a period after any capital letter that is followed by whitespace, e.g.:
"Smith S Kohli V "
would become:
"Smith S. Kohli V. "
This is as close as I got:
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "[[:upper:]] ", ". ")
"Smith . Kohli . "
I can see I need to add some more code to keep the capital letter, but I can't figure it out. Any help is much appreciated.
You can capture the capital letter only where it is followed by a space character (checked with a lookahead, so the space itself is not consumed) and then replace the match with the captured letter plus a dot (.).
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "([A-Z](?= ))", "\\1.")
Regex: https://regex101.com/r/uriEYS/1
Demo: https://rextester.com/ELKM47734
Base R using gsub :
v <- c("Smith S Kohli V ")
gsub('([A-Z])\\s', '\\1. ', v)
#[1] "Smith S. Kohli V. "
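One caveat with the pattern above: it also fires on the last letter of an all-caps word. A minimal sketch of one way around that, assuming only standalone one-letter initials should get a period (the "NASA" input is made up for illustration):

```r
# Assumption: only single-letter initials should be followed by a period.
x <- c("Smith S Kohli V ", "NASA J Smith ")

# Without a word boundary, the trailing "A" of "NASA" is also matched.
loose <- gsub("([A-Z])\\s", "\\1. ", x)
loose
# [1] "Smith S. Kohli V. " "NASA. J. Smith "

# \\b requires a word boundary before the capital, so only
# one-letter words qualify.
strict <- gsub("\\b([A-Z])\\s", "\\1. ", x, perl = TRUE)
strict
# [1] "Smith S. Kohli V. " "NASA J. Smith "
```

Whether the stricter version is wanted depends on the data; for the example string in the question both give the same result.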
Using base R
gsub("(?<=[A-Z])\\s", ". ", v, perl = TRUE)
#[1] "Smith S. Kohli V. "
data
v <- c("Smith S Kohli V ")
Related
Does anybody know how I can replace "\" in R?
Other answers have posted something like:
l <- "1120190\neconomic"
gsub("\\", "", l, fixed=TRUE)
But that didn't work in my case.
\n is the escape sequence for a newline, which is a single character, so the string contains no backslash at all. If you want to replace the newline with a space, you can use the following:
l <- "1120190\neconomic"
cat(l)
gsub("\n", " ", l, fixed=TRUE)
Note, that the output would be:
1120190
economic
[1] "1120190 economic"
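To make the difference between a newline and a literal backslash concrete, here is a small sketch. The second string, l2, is a made-up example that really does contain a backslash:

```r
l <- "1120190\neconomic"
nchar(l)
# [1] 16   ("\n" counts as one character)

# There is no backslash in l, so the fixed-string search finds nothing.
has_backslash <- grepl("\\", l, fixed = TRUE)
has_backslash
# [1] FALSE

# Hypothetical input with a literal backslash followed by "n".
l2 <- "1120190\\neconomic"
nchar(l2)
# [1] 17

# Here removing the backslash with fixed = TRUE does work.
cleaned <- gsub("\\", "", l2, fixed = TRUE)
cleaned
# [1] "1120190neconomic"
```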
I've been searching for hours. This should be very easy but I don't see how :(
I have a dataframe called ds that contains a column structured like:
name
"Doe, Mr. John"
"Worth, Miss. Jane"
I want to extract the middle word and put it into a new column.
#This is how I'm doing it now
ds$title <- NA
mr <- grep(", Mr. ", ds$name)
miss <- grep(", Miss. ", ds$name)
ds$title[mr] <- ", Mr. "
ds$title[miss] <- ", Miss. "
I'm trying to generalize this with regex so that it'll take any middle word matching the pattern of "comma space word period space"
This is my best guess but it only removes the pattern:
gsub(", .+\\.+ ", "", ds$name)
How do I keep the pattern and remove the rest?
Thank you!
You can use a capture group. Basically, you match the whole pattern, use a capture group to match the part you want to keep, and replace the whole match with the capture group:
# I often specify perl = TRUE, though it isn't necessary here
(ds$title <- gsub(".+(, .+\\.+ ).+", "\\1", ds$name, perl = TRUE))
#[1] ", Mr. " ", Miss. "
The capture group is what's in the parentheses ((, .+\\.+ )), and you refer back to it with \\1. If you had a second capture group, you'd refer to it as \\2.
Note that if you want to catch comma, space, word, period, space, then you could modify the capture group to (, .+\\. ). You only need to match one period, not one or more.
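As a rough illustration of several capture groups at once, the sketch below pulls the title out without its punctuation and also reorders the name parts. The patterns assume every name has the exact form "Last, Title. First", which may not hold for real data:

```r
ds <- data.frame(name = c("Doe, Mr. John", "Worth, Miss. Jane"),
                 stringsAsFactors = FALSE)

# One group: keep only the word between ", " and ". ".
title <- sub(".*, (\\w+)\\. .*", "\\1", ds$name)
title
# [1] "Mr"   "Miss"

# Three groups: \\1 = last name, \\2 = title, \\3 = first name.
reordered <- sub("^(.+), (\\w+)\\. (.+)$", "\\2 \\3 \\1", ds$name)
reordered
# [1] "Mr John Doe"     "Miss Jane Worth"
```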
A straightforward stringi alternative that does not use capture groups is stri_extract_first_regex (or in this case stri_extract_last_regex or stri_extract_all_regex work fine)
library(stringi)
ds$title <- stri_extract_first_regex(ds$name, ", .+\\. ")
#[1] ", Mr. " ", Miss. "
and as thelatemail pointed out in a comment you can do a similar thing with base R, too, but it's a little harder to remember how to use the regmatches and regexpr functions:
regmatches(ds$name, regexpr(", .+\\. ", ds$name))
#[1] ", Mr. " ", Miss. "
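One thing to watch with regmatches: it silently drops elements that have no match, so the result can be shorter than the input and cannot always be assigned straight into a column. A possible workaround, sketched with a made-up third name that has no title:

```r
# "Cher" is a hypothetical entry without a title.
nm <- c("Doe, Mr. John", "Worth, Miss. Jane", "Cher")

m <- regexpr(", [^ ]+\\. ", nm)   # -1 where nothing matched
length(regmatches(nm, m))
# [1] 2

# Pre-fill with NA and write matches only where one was found.
title <- rep(NA_character_, length(nm))
title[m != -1] <- regmatches(nm, m)
title
# [1] ", Mr. "   ", Miss. " NA
```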
Matched capture groups are your BFF:
library(stringi)
library(purrr)
ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)
nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"
stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>%
map_chr(2)
## [1] "Mr." "Miss."
For your "add column to a data frame" needs:
library(stringi)
library(dplyr)
library(purrr)
ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)
nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"
mutate(ds, title=stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>% map_chr(2))
##                name title
## 1     Doe, Mr. John   Mr.
## 2 Worth, Miss. Jane Miss.
I don't seem to understand gsub or stringr.
Example:
> a<- "a book"
> gsub(" ", ".", a)
[1] "a.book"
Okay. BUT:
> a<-"a.book"
> gsub(".", " ", a)
[1] "      "
I would have expected
"a book"
I'm replacing the full stop with a space.
Also, stringr: str_replace(a, ".", " ") returns:
" .book"
and str_replace_all(a, ".", " ") returns
"      "
I can use stringi: stri_replace(a, " ", fixed="."):
"a book"
I'm just wondering why gsub (and str_replace) don't act as I'd have expected. They work when replacing a space with another character, but not the other way around.
That's because the first argument to gsub, namely pattern, is actually a regex. In a regex the period . is a metacharacter that matches any single character; see ?base::regex. In your case you need to escape the period in the following way:
gsub("\\.", " ", a)
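A quick way to see what each pattern actually matches is to list the matches with gregexpr and regmatches. This is only a diagnostic sketch using the same string as in the question:

```r
a <- "a.book"

# Unescaped: "." matches every single character.
all_chars <- regmatches(a, gregexpr(".", a))[[1]]
all_chars
# [1] "a" "." "b" "o" "o" "k"

# Escaped: only the literal period is matched.
only_dot <- regmatches(a, gregexpr("\\.", a))[[1]]
only_dot
# [1] "."

fixed_out <- gsub("\\.", " ", a)
fixed_out
# [1] "a book"
```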
I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] "       " "       " "         "
> gsub(".", " ", df$n)
[1] "       " "       " "         "
You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
#   m         n
# 1 1   one one
# 2 2   one two
# 3 3 one three
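chartr() can also translate several characters in one call, as long as the old and new sets line up position by position. A short sketch with made-up strings that mix two separators:

```r
# Hypothetical input using both "." and "_" as separators.
x <- c("one.two_three", "a.b_c")

# Each character in the first string maps to the character at the
# same position in the second one (here both map to a space).
out <- chartr("._", "  ", x)
out
# [1] "one two three" "a b c"
```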
You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"
Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.
I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
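To show how \\W+ behaves on different separators, here is a small sketch with made-up inputs. Note that the underscore counts as a word character, so it is left alone:

```r
# Hypothetical strings with various separators.
x <- c("one.two", "one--two", "one . two", "one_two")

# A run of non-word characters collapses into a single space.
out <- gsub("\\W+", " ", x)
out
# [1] "one two" "one two" "one two" "one_two"
```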
Starting with this: "L Hernandez"
From a vector containing the following:
[1] "HernandezOlaf " "HernandezLuciano " "HernandezAdrian "
I tried this:
'subset(ABC, str_detect(ABC, "L Hernandez") == TRUE)'
The desired output is any Hernandez name that includes a capital L anywhere.
In this case, the desired output is HernandezLuciano.
Maybe this helps:
vec1 <- c("L Hernandez", "HernandezOlaf ","HernandezLuciano ", "HernandezAdrian ")
grep("L ?Hernandez|Hernandez ?L",vec1,value=T)
#[1] "L Hernandez" "HernandezLuciano "
Update
variable <- "L Hernandez"
v1 <- gsub(" ", " ?", variable) #replace space with a space and question mark
v2 <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 ?\\1", variable) #reverse the order of words in the string and add question mark
You can also use strsplit to split variable, as @rawr commented
grep(paste(v1,v2, sep="|"), vec1,value=T)
#[1] "L Hernandez" "HernandezLuciano "
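The strsplit idea could be sketched as a small helper. The function name match_either_order is made up for illustration, and the sketch assumes the search string contains no regex metacharacters (they are not escaped here):

```r
# Hypothetical helper: match the words of `name` in either order,
# with an optional space between them.
match_either_order <- function(name, x) {
  parts <- strsplit(name, " ", fixed = TRUE)[[1]]
  fwd <- paste(parts, collapse = " ?")        # "L ?Hernandez"
  bwd <- paste(rev(parts), collapse = " ?")   # "Hernandez ?L"
  grep(paste(fwd, bwd, sep = "|"), x, value = TRUE)
}

vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
res <- match_either_order("L Hernandez", vec1)
res
# [1] "L Hernandez"       "HernandezLuciano "
```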
You could use the agrep function for approximate string matching.
If you simply run this function with its defaults, it matches every string:
agrep("L Hernandez", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
but if you modify this a little "L Hernandez" -> "Hernandez L"
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
and change the max distance
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "),0.01)
[1] 2
you get the right answer. This is only an idea, it might work for you :)
You could modify the following if you only want full names after a capital L:
vec1[grepl("Hernandez", vec1) & grepl("L\\.*", vec1)]
[1] "L Hernandez" "HernandezLuciano "
or
vec1[grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)]
[1] "HernandezLuciano "
The first expression looks for a match on "Hernandez" and then checks for a capital "L". Because \\.* matches zero or more literal periods, any string containing a capital "L" passes that check. The second version requires a letter right after the capital "L".
BTW, escaping the opening bracket breaks the pattern: \\[ matches a literal "[", so nothing is returned.
vec1[grepl("Hernandez", vec1) & grepl("L\\[[:alpha:]]", vec1)]
character(0)
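If a single call is preferred over combining two grepl results with &, the two conditions can be folded into one pattern with lookaheads. This is just a sketch of that alternative; it needs perl = TRUE:

```r
vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# Both lookaheads must succeed from the start of the string:
# one for "Hernandez", one for a capital L followed by a letter.
keep <- grepl("^(?=.*Hernandez)(?=.*L[[:alpha:]])", vec1, perl = TRUE)
res <- vec1[keep]
res
# [1] "HernandezLuciano "
```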