Finding the best string match with R

Starting with this string: L Hernandez
and a vector containing the following:
[1] "HernandezOlaf " "HernandezLuciano " "HernandezAdrian "
I tried this:
subset(ABC, str_detect(ABC, "L Hernandez") == TRUE)
I want the element that contains the name Hernandez together with a capital L anywhere in it.
The desired output is HernandezLuciano

Maybe this helps:
vec1 <- c("L Hernandez", "HernandezOlaf ","HernandezLuciano ", "HernandezAdrian ")
grep("L ?Hernandez|Hernandez ?L", vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
Update
variable <- "L Hernandez"
v1 <- gsub(" ", " ?", variable) # make the space optional
v2 <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 ?\\1", variable) # swap the two words and make the space optional
You can also use strsplit to split variable, as @rawr commented.
grep(paste(v1, v2, sep = "|"), vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
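The strsplit route mentioned above can be sketched as follows. This is only an illustration: either_order is a hypothetical helper name, and it assumes the query consists of words separated by single spaces.

```r
# Hypothetical helper: build a pattern that matches the words of `query`
# in the given or the reversed order, with an optional space between them.
either_order <- function(query) {
  parts <- strsplit(query, " ", fixed = TRUE)[[1]]
  forward <- paste(parts, collapse = " ?")
  reversed <- paste(rev(parts), collapse = " ?")
  paste(forward, reversed, sep = "|")
}

candidates <- c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
pattern <- either_order("L Hernandez")
pattern
# [1] "L ?Hernandez|Hernandez ?L"
result <- grep(pattern, candidates, value = TRUE)
result
# [1] "HernandezLuciano "
```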

You could use the agrep function for approximate string matching.
If you simply run this function, it matches every string:
agrep("L Hernandez", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
The same happens if you reorder the pattern from "L Hernandez" to "Hernandez L":
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
But if you also lower the maximum distance,
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "), max.distance = 0.01)
[1] 2
you get the right answer. This is only an idea; it might work for you :)

You could modify the following if you only want full names after a capital L:
vec1[grepl("Hernandez", vec1) & grepl("L\\.*", vec1)]
[1] "L Hernandez" "HernandezLuciano "
or
vec1[grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)]
[1] "HernandezLuciano "
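The expression looks for a match on "Hernandez" and then checks for a capital "L". In the first version, \\.* matches zero or more literal periods, so any capital "L" qualifies. The second version requires a letter right after the capital "L".
BTW, an extra backslash before the bracket breaks the pattern: \\[ matches a literal "[", so nothing in the vector matches.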
vec1[grepl("Hernandez", vec1) & grepl("L\\[[:alpha:]]", vec1)]
character(0)
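If a single call is preferred over two grepl calls joined with &, the two conditions can be folded into one pattern with lookaheads. A minimal sketch, assuming perl = TRUE is acceptable (lookaheads need the PCRE engine):

```r
vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# Two separate tests combined with a logical AND
two_calls <- grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)

# The same two conditions as lookaheads anchored at the start of the string
one_call <- grepl("^(?=.*Hernandez)(?=.*L[[:alpha:]])", vec1, perl = TRUE)

identical(two_calls, one_call)
# [1] TRUE
vec1[one_call]
# [1] "HernandezLuciano "
```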

Related

Add a period after capital letters followed by a white space

As the title says, I have a string where I want to add a period after any capital letter that is followed by a whitespace, e.g.:
"Smith S Kohli V "
would become:
"Smith S. Kohli V. "
This is as close as I got:
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "[[:upper:]] ", ". ")
"Smith . Kohli . "
I can see I need to add some more code to keep the capital letter, but I can't figure it out. Any help is much appreciated.
You can capture each capital letter that is followed by a space character and then replace the match with itself plus a dot (.):
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "([A-Z](?= ))", "\\1.")
Regex: https://regex101.com/r/uriEYS/1
Demo: https://rextester.com/ELKM47734
Base R using gsub :
v <- c("Smith S Kohli V ")
gsub('([A-Z])\\s', '\\1. ', v)
#[1] "Smith S. Kohli V. "
Using base R
gsub("(?<=[A-Z])\\s", ". ", v, perl = TRUE)
#[1] "Smith S. Kohli V. "
data
v <- c("Smith S Kohli V ")
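For comparison, the three ideas used in the answers above (a lookahead, a capture group, and a lookbehind) can all be written with base R's gsub. This is a sketch on the question's sample string only; the variable names are illustrative.

```r
v <- "Smith S Kohli V "

# Lookahead: match the capital letter only, the space is not consumed
lookahead <- gsub("([A-Z])(?= )", "\\1.", v, perl = TRUE)

# Capture group: match letter and space, put both back with a dot in between
captured <- gsub("([A-Z]) ", "\\1. ", v)

# Lookbehind: match only the space that follows a capital letter
lookbehind <- gsub("(?<=[A-Z]) ", ". ", v, perl = TRUE)

c(lookahead, captured, lookbehind)
# [1] "Smith S. Kohli V. " "Smith S. Kohli V. " "Smith S. Kohli V. "
```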

Return the beginning of a string up to and including either of two characters

I have a character vector that looks like this:
a <- c("Bob/7", "What is this?", "Seventeen")
I want to extract the beginning of the string up to and including either a slash (/) or whitespace ( ). The result should look something like this:
b
[1] "Bob/" "What " NA
The non-matching items can also be empty strings or dropped instead of returning NA.
I have tried with grep("^.+?[/ ]", a, value = TRUE), but that returns the matching elements instead of the matching substrings.
Here's another approach using only sub:
a <- c("Bob/7", "What is this?", "Seventeen", "AA 1", "AA 7", " AA 7")
sub("(.*?[/ ]|).*", "\\1", a)
# [1] "Bob/" "What " "" "AA " "AA " " "
So, here .*?[/ ] is almost exactly what you had: I replaced + with * for cases like the last one in my a vector. Next, | corresponds to OR so that a|b matches a or b. Now having .*?[/ ]| matches what we want or, if it wasn't there, we match an empty string "". Without it we would get:
sub("(.*?[/ ]).*", "\\1", a)
# [1] "Bob/" "What " "Seventeen" "AA " "AA " " "
Namely, there was nothing to be done with Seventeen, so it remained unchanged, while with the actual solution we replace it with an empty string.
Found the solution:
b <- regmatches(a, regexpr("^.+?[/ ]", a))
b
[1] "Bob/" "What "
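Note that regmatches drops the elements that do not match. If the result should stay aligned with the input and hold NA for non-matches, as in the question's desired output, one possible sketch is to fill a preallocated vector:

```r
a <- c("Bob/7", "What is this?", "Seventeen")

# regexpr returns -1 for elements without a match
m <- regexpr("^.+?[/ ]", a, perl = TRUE)

# Start with all NA, then fill in only the positions that matched
b <- rep(NA_character_, length(a))
b[m != -1] <- regmatches(a, m)
b
# [1] "Bob/"  "What " NA
```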

Insert blank space between letters of word

I'm trying to create a function able to return various versions of the same string but with blank spaces between the letters.
something like:
input <- "word"
returning:
w ord
wo rd
wor d
We first break the string into every character using strsplit. We then append an empty space at every position using sapply.
input <- "word"
input_break <- strsplit(input, "")[[1]]
c(input, sapply(seq(1,nchar(input)-1), function(x)
paste0(append(input_break, " ", x), collapse = "")))
#[1] "word" "w ord" "wo rd" "wor d"
?append gives us append(x, values, after = length(x)),
where x is the vector, values is the value to be inserted (here " ") and after is the position after which the value is inserted.
Here is an option using sub
sapply(seq_len(nchar(input)-1), function(i) sub(paste0('^(.{', i, '})'), '\\1 ', input))
#[1] "w ord" "wo rd" "wor d"
Or with substring
paste(substring(input, 1, 1:3), substring(input, 2:4, 4))
#[1] "w ord" "wo rd" "wor d"
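Since the question asks for a function, the substring approach can be generalised so the positions are not hard-coded. A minimal sketch; space_variants is a hypothetical name, and words shorter than two characters are assumed to have no variants.

```r
# Hypothetical helper: return every version of `word` with one space
# inserted between two adjacent letters.
space_variants <- function(word) {
  n <- nchar(word)
  if (n < 2) return(character(0))  # nothing to split
  cuts <- seq_len(n - 1)           # positions after which to insert the space
  paste(substring(word, 1, cuts), substring(word, cuts + 1, n))
}

space_variants("word")
# [1] "w ord" "wo rd" "wor d"
space_variants("ab")
# [1] "a b"
```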

substitute word separators with space

I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] " " " " " "
> gsub(".", " ", df$n)
[1] " " " " " "
You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
#   m         n
# 1 1   one one
# 2 2   one two
# 3 3 one three
You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"
Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.
I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
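The reason the original attempt fails is that an unescaped "." in a regular expression matches any character. The following sketch puts the variants from the answers side by side; "-" is used as the replacement in the first call only to make the effect visible.

```r
x <- c("one.one", "one.two", "one.three")

# Unescaped dot: every character is replaced
all_replaced <- gsub(".", "-", x)
all_replaced
# [1] "-------"   "-------"   "---------"

# Four ways to target the literal dot only
fixed_dot   <- gsub(".", " ", x, fixed = TRUE)
escaped_dot <- gsub("\\.", " ", x)
class_dot   <- gsub("[.]", " ", x)
translated  <- chartr(".", " ", x)
fixed_dot
# [1] "one one"   "one two"   "one three"

# \\W+ collapses a run of non-word characters into a single space
collapsed <- gsub("\\W+", " ", "one..two--three")
collapsed
# [1] "one two three"
```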

unlist keeping the same number of elements (vectorized)

I am trying to extract all hashtags from some tweets, and obtain for each tweet a single string with all hashtags.
I am using str_extract from stringr, so I obtain a list of character vectors. My problem is that I do not manage to unlist it and keep the same number of elements of the list (that is, the number of tweets).
Example:
This is a vector of tweets of length 3:
a <- "rt @ugh_toulouse: #mondial2014 : le top 5 des mannequins brésiliens http://www.ladepeche.fr/article/2014/06/01/1892121-mondial-2014-le-top-5-des-mannequins-bresiliens.html #brésil "
b <- "rt @30millionsdamis: beauté de la nature : 1 #baleine sauve un naufragé ; elles pourtant tellement menacées par l'homme... http://goo.gl/xqrqhd #instinctanimal "
c <- "rt @onlyshe31: elle siège toujours!!!!!!! marseille. nouveau procès pour la députée - 01/06/2014 - ladépêche.fr http://www.ladepeche.fr/article/2014/06/01/1892035-marseille-nouveau-proces-pour-la-deputee.html #toulouse "
all <- c(a, b, c)
Now I use str_extract_all to extract the hashtags:
ex <- str_extract_all(all, "#(.+?)[ |\n]")
If I now use unlist I get a vector of length 5:
undesired <- unlist(ex)
> undesired
[1] "#mondial2014 " "#brésil "
[3] "#baleine " "#instinctanimal "
[5] "#toulouse "
What I want is something like the following. However this is very inefficient, because it is not vectorized, and it takes forever (really!) on a smallish data frame of tweets:
desired <- c()
for (i in 1:length(ex)){
  desired[i] <- paste(ex[[i]], collapse = " ")
}
> desired
[1] "#mondial2014 #brésil "
[2] "#baleine #instinctanimal "
[3] "#toulouse "
Help!
You could use stringi which may be faster for big datasets
library(stringi)
sapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'), paste, collapse=' ')
#[1] "#mondial2014 #brésil " "#baleine #instinctanimal "
#[3] "#toulouse "
A for loop can also be fast if you preallocate the output vector desired:
desired <- character(length(ex)) # character, since the results are strings
for (i in seq_along(ex)){
  desired[i] <- paste(ex[[i]], collapse = " ")
}
Or you could use vapply, which would be faster than sapply and a bit safer (contributed by @Richie Cotton):
vapply(ex, toString, character(1))
#[1] "#mondial2014 , #brésil " "#baleine , #instinctanimal "
#[3] "#toulouse "
Or, as suggested by @Ananda Mahto:
vapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'),
       stri_flatten, character(1L), collapse = " ")
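The same idea also works without extra packages. The sketch below uses base R's gregexpr and regmatches on made-up sample tweets (not the ones from the question) and a slightly different pattern that stops before whitespace, so the results carry no trailing space. A tweet without hashtags yields an empty string, so the output keeps one element per tweet.

```r
# Illustrative input, assumed for this example only
tweets <- c("rt @user: #one : some text #two ", "no tags here", "#three ")

# One character vector of hashtags per tweet (possibly empty)
tags <- regmatches(tweets, gregexpr("#[^ \n]+", tweets))

# Collapse each vector into a single string; vapply enforces one string per tweet
flat <- vapply(tags, paste, character(1), collapse = " ")
flat
# [1] "#one #two" ""          "#three"
length(flat) == length(tweets)
# [1] TRUE
```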

Resources