Finding the best string match with R

I am starting with this string: L Hernandez
and want to find it in a vector containing the following:
[1] "HernandezOlaf " "HernandezLuciano " "HernandezAdrian "
I tried this:
subset(ABC, str_detect(ABC, "L Hernandez") == TRUE)
A match is any name that contains Hernandez with a capital L anywhere in it.
The desired output is HernandezLuciano

Maybe this helps:
vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
grep("L ?Hernandez|Hernandez ?L", vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
Update
variable <- "L Hernandez"
v1 <- gsub(" ", " ?", variable) #replace space with a space and question mark
v2 <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 ?\\1", variable) #reverse the order of words in the string and add question mark
You can also use strsplit to split variable, as @rawr commented.
grep(paste(v1, v2, sep = "|"), vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "
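The strsplit route mentioned above is not spelled out, so here is a minimal sketch of how it might look. The helper name build_pattern is hypothetical (not from the answer), and it assumes the query is made of words separated by single spaces and contains no regex metacharacters.

```r
# Hypothetical helper: turn "L Hernandez" into "L ?Hernandez|Hernandez ?L"
build_pattern <- function(query) {
  parts <- strsplit(query, " ", fixed = TRUE)[[1]]  # split on literal spaces
  forward <- paste(parts, collapse = " ?")          # original word order, optional space
  backward <- paste(rev(parts), collapse = " ?")    # reversed word order, optional space
  paste(forward, backward, sep = "|")
}

vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
pat <- build_pattern("L Hernandez")
matches <- grep(pat, vec1, value = TRUE)
print(pat)
print(matches)
```

Because rev() reverses all the words, this should also cope with queries of more than two words, although that case is untested here.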

You could use the agrep function for approximate string matching.
If you simply run this function with its default settings, it matches every string:
agrep("L Hernandez", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
If you reorder the pattern a little ("L Hernandez" -> "Hernandez L"), it still matches everything:
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3
but if you also lower the maximum distance
agrep("Hernandez L", c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian "), max.distance = 0.01)
[1] 2
you get the right answer. This is only an idea, but it might work for you :)
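To see why lowering the distance helps, it may be useful to look at the edit distances directly. This is only a sketch: it uses adist from utils with partial = TRUE (distance to the closest substring), which is my addition and not part of the answer. As far as I read ?agrep, a fractional max.distance is scaled by the pattern length and rounded up, so 0.01 on an 11-character pattern behaves like one allowed edit.

```r
x <- c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# Edit distance from the pattern to the closest substring of each element
d <- drop(adist("Hernandez L", x, partial = TRUE))
closest <- x[which.min(d)]

# An integer max.distance is taken as an absolute number of edits
hits <- agrep("Hernandez L", x, max.distance = 1, value = TRUE)

print(d)
print(closest)
print(hits)
```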

You could modify the following if you only want full names after a capital L:
vec1[grepl("Hernandez", vec1) & grepl("L\\.*", vec1)]
[1] "L Hernandez" "HernandezLuciano "
or
vec1[grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)]
[1] "HernandezLuciano "
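The expression looks for a match on "Hernandez" and then checks whether there is a capital "L" anywhere in the string. Note that "L\\.*" means a capital "L" followed by zero or more literal periods, which is why "L Hernandez" also matches. The second version requires a letter directly after the capital "L".
BTW, the backslash cannot be kept in front of the character class: "\\[" is then read as a literal "[", so nothing matches.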
vec1[grepl("Hernandez", vec1) & grepl("L\\[[:alpha:]]", vec1)]
character(0)
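A small sketch that separates the two conditions may make this easier to follow. The variable names has_surname, has_L_letter and any_L are mine, and fixed = TRUE for the surname is an assumption that the surname should be matched literally.

```r
vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# Condition 1: the surname appears somewhere (literal match, no regex)
has_surname <- grepl("Hernandez", vec1, fixed = TRUE)

# Condition 2: a capital L that is directly followed by a letter
has_L_letter <- grepl("L[[:alpha:]]", vec1)

result <- vec1[has_surname & has_L_letter]
print(result)

# "L\\.*" only requires a capital L; the periods after it are optional
any_L <- grepl("L\\.*", c("L", "L.", "abc"))
print(any_L)
```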

Related

Add a period after capital letters followed by a white space

As the title says, I have a string where I want to add a period after any capital letter that is followed by a whitespace, e.g.:
"Smith S Kohli V "
would become:
"Smith S. Kohli V. "
This is as close as I got:
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "[[:upper:]] ", ". ")
"Smith . Kohli . "
I can see I need to add some more code to keep the capital letter, but I can't figure it out, any help much appreciated.

You can capture each capital letter that is followed by a space character (the lookahead checks for the space without consuming it) and then replace the match with itself plus a dot (.).
v <- c("Smith S Kohli V ")
stringr::str_replace_all(v, "([A-Z](?= ))", "\\1.")
Regex: https://regex101.com/r/uriEYS/1
Demo: https://rextester.com/ELKM47734

Base R using gsub :
v <- c("Smith S Kohli V ")
gsub('([A-Z])\\s', '\\1. ', v)
#[1] "Smith S. Kohli V. "

Using base R
gsub("(?<=[A-Z])\\s", ". ", v, perl = TRUE)
#[1] "Smith S. Kohli V. "
data
v <- c("Smith S Kohli V ")
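For what it is worth, the capture-group and lookbehind versions can be checked against each other on a vector with more than one element. This is a sketch in base R only; the second element "Doe J" is made up to show that a capital letter at the very end of a string, with no whitespace after it, is left alone by both patterns.

```r
v <- c("Smith S Kohli V ", "Doe J")

# Capture the capital letter and put it back, followed by ". "
with_capture <- gsub("([A-Z])\\s", "\\1. ", v)

# Replace only the whitespace that is preceded by a capital letter
with_lookbehind <- gsub("(?<=[A-Z])\\s", ". ", v, perl = TRUE)

print(with_capture)
print(with_lookbehind)
```

If trailing initials should also get a period, the pattern would need an extra alternative for the end of the string; that is not covered here.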

Return the beginning of a string up to and including either of two characters

I have a character vector that looks like this:
a <- c("Bob/7", "What is this?", "Seventeen")
I want to extract the beginning of the string up to and including either a slash (/) or a whitespace character ( ). The result should look something like this:
b
[1] "Bob/" "What " NA
The non-matching items can also be empty strings or dropped instead of returning NA.
I have tried with grep("^.+?[/ ]", a, value = TRUE), but that returns the matching elements instead of the matching substrings.

Here's another approach using only sub:
a <- c("Bob/7", "What is this?", "Seventeen", "AA 1", "AA 7", " AA 7")
sub("(.*?[/ ]|).*", "\\1", a)
# [1] "Bob/" "What " "" "AA " "AA " " "
So, here .*?[/ ] is almost exactly what you had: I replaced + with * for cases like the last one in my a vector. Next, | corresponds to OR so that a|b matches a or b. Now having .*?[/ ]| matches what we want or, if it wasn't there, we match an empty string "". Without it we would get:
sub("(.*?[/ ]).*", "\\1", a)
# [1] "Bob/" "What " "Seventeen" "AA " "AA " " "
Namely, there was nothing to be done with Seventeen, so it remained unchanged, while with the actual solution we replace it with an empty string.

Found the solution:
b <- regmatches(a, regexpr("^.+?[/ ]", a))
b
[1] "Bob/" "What "
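Note that regmatches drops the elements that do not match. If the NA placeholders from the question are wanted, one possible sketch is to fill a pre-built NA vector at the matching positions. The name m is mine, and perl = TRUE is my choice to make the lazy quantifier explicit.

```r
a <- c("Bob/7", "What is this?", "Seventeen")

# regexpr returns -1 for elements without a match
m <- regexpr("^.+?[/ ]", a, perl = TRUE)

# Start with all NA, then fill in the positions that matched
b <- rep(NA_character_, length(a))
b[m != -1] <- regmatches(a, m)
print(b)
```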

Insert blank space between letters of word

I'm trying to create a function able to return various versions of the same string but with blank spaces between the letters.
something like:
input <- "word"
returning:
w ord
wo rd
wor d

We first break the string into single characters using strsplit. We then insert a space at every inner position using sapply and append.
input <- "word"
input_break <- strsplit(input, "")[[1]]
c(input, sapply(seq(1,nchar(input)-1), function(x)
paste0(append(input_break, " ", x), collapse = "")))
#[1] "word" "w ord" "wo rd" "wor d"
?append gives us append(x, values, after = length(x))
where x is the vector, values is what gets inserted (here " "), and after is the position after which the values are inserted.

Here is an option using sub
sapply(seq_len(nchar(input)-1), function(i) sub(paste0('^(.{', i, '})'), '\\1 ', input))
#[1] "w ord" "wo rd" "wor d"
Or with substring
paste(substring(input, 1, 1:3), substring(input, 2:4, 4))
#[1] "w ord" "wo rd" "wor d"
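The substring version above hard-codes the length of "word". Since the question asks for a function, here is a sketch that works for any single string; the name insert_space is hypothetical.

```r
# Hypothetical helper: all versions of `word` with one space inserted
insert_space <- function(word) {
  n <- nchar(word)
  if (n < 2) return(character(0))  # nothing to split for 0 or 1 characters
  cut <- seq_len(n - 1)            # positions after which the space goes
  paste(substring(word, 1, cut), substring(word, cut + 1, n))
}

res <- insert_space("word")
print(res)
print(insert_space("a"))
```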

substitute word separators with space

I just want to replace some word separators with a space. Any hints on this? Doesn't work after converting to character either.
df <- data.frame(m = 1:3, n = c("one.one", "one.two", "one.three"))
> gsub(".", "\\1 \\2", df$n)
[1] " " " " " "
> gsub(".", " ", df$n)
[1] " " " " " "

You don't need to use regex for one-to-one character translation. You can use chartr().
df$n <- chartr(".", " ", df$n)
df
# m n
# 1 1 one one
# 2 2 one two
# 3 3 one three

You can try
gsub("[.]", " ", df$n)
#[1] "one one" "one two" "one three"

Set fixed = TRUE if you are looking for an exact match and don't need a regular expression.
gsub(".", " ", df$n, fixed = TRUE)
#[1] "one one" "one two" "one three"
That's also faster than using an appropriate regex for such a case.

I suggest doing it like this:
gsub("\\.", " ", df$n)
OR
gsub("\\W", " ", df$n)
\\W matches any non-word character. \\W+ matches one or more non-word characters. Use \\W+ if necessary.
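A quick sketch of how these patterns differ, on a made-up vector x (not the data from the question). An unescaped "." matches any character, which is why the attempts in the question returned only spaces. Also note that the underscore counts as a word character, so \\W leaves it alone.

```r
x <- c("one.two", "one--two", "one_two")

every_char <- gsub(".", " ", x[1])   # "." as a regex matches every character
only_dot   <- gsub("\\.", " ", x)    # escaped: only literal periods
non_word   <- gsub("\\W", " ", x)    # each non-word character becomes a space
non_word_1 <- gsub("\\W+", " ", x)   # each run of non-word characters becomes one space

print(every_char)
print(only_dot)
print(non_word)
print(non_word_1)
```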

unlist keeping the same number of elements (vectorized)

I am trying to extract all hashtags from some tweets, and obtain for each tweet a single string with all hashtags.
I am using str_extract_all from stringr, so I obtain a list of character vectors. My problem is that I cannot unlist it while keeping the same number of elements as the list (that is, the number of tweets).
Example:
This is a vector of tweets of length 3:
a <- "rt @ugh_toulouse: #mondial2014 : le top 5 des mannequins brésiliens http://www.ladepeche.fr/article/2014/06/01/1892121-mondial-2014-le-top-5-des-mannequins-bresiliens.html #brésil "
b <- "rt @30millionsdamis: beauté de la nature : 1 #baleine sauve un naufragé ; elles pourtant tellement menacées par l'homme... http://goo.gl/xqrqhd #instinctanimal "
c <- "rt @onlyshe31: elle siège toujours!!!!!!! marseille. nouveau procès pour la députée - 01/06/2014 - ladépêche.fr http://www.ladepeche.fr/article/2014/06/01/1892035-marseille-nouveau-proces-pour-la-deputee.html #toulouse "
all <- c(a, b, c)
Now I use str_extract_all to extract the hashtags:
ex <- str_extract_all(all, "#(.+?)[ |\n]")
If I now use unlist I get a vector of length 5:
undesired <- unlist(ex)
> undesired
[1] "#mondial2014 " "#brésil "
[3] "#baleine " "#instinctanimal "
[5] "#toulouse "
What I want is something like the following. However this is very inefficient, because it is not vectorized, and it takes forever (really!) on a smallish data frame of tweets:
desired <- c()
for (i in 1:length(ex)){
desired[i] <- paste(ex[[i]], collapse = " ")
}
> desired
[1] "#mondial2014 #brésil "
[2] "#baleine #instinctanimal "
[3] "#toulouse "
Help!

You could use stringi which may be faster for big datasets
library(stringi)
sapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'), paste, collapse=' ')
#[1] "#mondial2014 #brésil " "#baleine #instinctanimal "
#[3] "#toulouse "
A for loop can also be fast if you preallocate the output vector desired:
desired <- character(length(ex))
for (i in seq_along(ex)) {
desired[i] <- paste(ex[[i]], collapse = " ")
}
Or you could use vapply, which would be faster than sapply and a bit safer (contributed by @Richie Cotton)
vapply(ex, toString, character(1))
#[1] "#mondial2014 , #brésil " "#baleine , #instinctanimal "
#[3] "#toulouse "
Or as suggested by @Ananda Mahto
vapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'),
stri_flatten, character(1L), collapse = " ")
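If neither stringr nor stringi is available, the same idea can be sketched in base R with gregexpr and regmatches. This is a swapped-in technique, not what the answers above use. The vector tweets is made-up sample data, and the pattern "#\\S+" is my assumption that a hashtag is "#" followed by a run of non-whitespace characters, so unlike the original pattern it does not keep the trailing space.

```r
tweets <- c("rt @user: #one some text #two ", "no tags here", "#three ")

# One character vector of hashtags per tweet (empty if there are none)
tags <- regmatches(tweets, gregexpr("#\\S+", tweets))

# Collapse each vector to a single string; the result has one entry per tweet
per_tweet <- vapply(tags, paste, character(1), collapse = " ")
print(per_tweet)
```

A tweet without hashtags yields an empty string here, so the result always has the same length as the input.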
