Remove both English and Non-English names from a dataframe - r

I am working with several hundred rows of junk data. A dummy dataset is as follows:
foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
"Rawiri Herewini is my name", "Ajibade Smith is my man", NA)
I need to remove all names (both English and non-English first names and family names) such that my desired output will be:
[1] "is not here" "is not a nice person" "is my name"
[4] "is my man" NA
However, using the textclean package, I was only able to remove the English names, leaving the non-English ones:
library(textclean)
textclean::replace_names(foo_data)
[1] " is not here" "Wiremu is not a nice person" "Rawiri Herewini is my name"
[4] "Ajibade is my man" NA
Any help will be appreciated.

You could do:
s <- textclean::replace_names(foo_data)
# hunspell() flags words not in its dictionary (here, the remaining
# non-English names); build a regex from those words and strip them out
trimws(gsub(sprintf('\\b(%s)\\b',
                    paste0(unlist(hunspell::hunspell(s)), collapse = '|')),
            '', s))
[1] "is not here" "is not a nice person" "is my name" "is my man" NA
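If the dummy data is representative, the names always appear as one or two capitalized words at the start of each string. Under that assumption (which won't generalize to names elsewhere in a sentence), a dictionary-free base-R sketch is:

```r
foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
              "Rawiri Herewini is my name", "Ajibade Smith is my man", NA)

# Assumption: each name is one or two capitalized words at the start
# of the string; strip that prefix and keep the rest. NA passes through.
sub("^([A-Z][a-z]+ ){1,2}", "", foo_data)
# [1] "is not here" "is not a nice person" "is my name" "is my man" NA
```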

Related

To find match between two strings based on word sequence in R Language

Please suggest any R functions that can match two words in sequence against another string (instead of matching single word to single word).
I'm not sure what you want as a result, but using lubridate::intersect (base intersect works too) and strsplit may help:
df <- data.frame(
string1 = c("Raj ate food", "Raj is working", "Raj ate food"),
string2 = c("Raj ate nice food", "Car is driven by Raj", "Raj us not having food")
)
for (i in seq_len(nrow(df))) {
  print(lubridate::intersect(strsplit(df$string1, " ")[[i]], strsplit(df$string2, " ")[[i]]))
}
[1] "Raj" "ate" "food"
[1] "Raj" "is"
[1] "Raj" "food"
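The loop can be avoided: Map() pairs up the two lists of word vectors row by row, and base intersect() returns the words common to each pair. A vectorized sketch:

```r
df <- data.frame(
  string1 = c("Raj ate food", "Raj is working", "Raj ate food"),
  string2 = c("Raj ate nice food", "Car is driven by Raj", "Raj us not having food")
)

# Map() walks the two lists of split word vectors in parallel;
# intersect() keeps the words each pair has in common.
Map(intersect, strsplit(df$string1, " "), strsplit(df$string2, " "))
```

This returns a list with one element per row, which is usually easier to work with downstream than printed output.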

Find and replace numbers in an R txt file

I am attempting to find all numbers of any format in a text file in R and surround them with hashtags.
For example, take the input below:
ex <- c("I have $5.78 in my account","Hello my name is blank","do you want 1,785 puppies?",
"I love stack overflow!","My favorite numbers are 3, 14,568, and 78")
As the output of the function, I'm looking for:
> "I have #$5.78# in my account"
> "do you want #1,785# puppies?"
> "My favorite numbers are #3#, #14,568#, and #78#"
Surrounding numbers is straightforward, assuming that digits, periods, commas, and dollar signs are all included:
gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex)
# [1] "I have $#5.78# in my account"
# [2] "Hello my name is blank"
# [3] "do you want #1,785# puppies?"
# [4] "I love stack overflow!"
# [5] "My favorite numbers are #3#, #14,568#, and #78#"
To filter out just the numbered entries:
grep("\\d", gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex), value = TRUE)
# [1] "I have $#5.78# in my account"
# [2] "do you want #1,785# puppies?"
# [3] "My favorite numbers are #3#, #14,568#, and #78#"
We can use gsub
gsub("(?<=\\s)(?=[$0-9])|(?<=[0-9])(?=,?[ ]|$)", "#", ex, perl = TRUE)
#[1] "I have #$5.78# in my account" "Hello my name is blank"
#[3] "do you want #1,785# puppies?" "I love stack overflow!"
#[5] "My favorite numbers are #3#, #14,568#, and #78#"
Another step-by-step approach: use grep to identify the elements of the text file containing the pattern "[0-9]", subset the elements with numeric entries using ex[...], and pass the subset with the pipe operator %>% from library(dplyr) to gsub, applying @r2evans' logic to place hashtags around the numeric entries:
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
The do.call(grep,list("[0-9]",ex)) portion of the code returns the indices for the text elements in ex with numeric entries.
Output:
[1] "I have $#5.78# in my account" "do you want #1,785# puppies?"
[3] "My favorite numbers are #3#, #14,568#, and #78#"
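If you want the dollar sign captured inside the hashes (as in the question's desired output, `#$5.78#`), one option is a tighter pattern that anchors on an actual digit, so stray punctuation is never wrapped on its own. A sketch:

```r
ex <- c("I have $5.78 in my account", "Hello my name is blank",
        "do you want 1,785 puppies?", "I love stack overflow!",
        "My favorite numbers are 3, 14,568, and 78")

# Require at least one digit in the match, and let an optional leading
# $ or - ride along with the number; trailing commas stay outside.
gsub("([$-]?[0-9][0-9.,]*[0-9]|[$-]?[0-9])", "#\\1#", ex)
# [1] "I have #$5.78# in my account"
# [2] "Hello my name is blank"
# [3] "do you want #1,785# puppies?"
# [4] "I love stack overflow!"
# [5] "My favorite numbers are #3#, #14,568#, and #78#"
```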

R - put space at word begins with capital letter, for full column

I have a column imported from XLSX into R where each row contains a sentence without spaces, but each word begins with a capital letter. I tried
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
but this only works if I convert each row one at a time.
Example
1 HowDoYouWorkOnThis
2 ThisIsGreatExample
3 ProgrammingIsGood
Expected is
1 How Do You Work On This
2 This Is Great Example
3 Programming Is Good
Is this what you're after?
s <- c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood")
sapply(s, function(x) trimws(gsub("([A-Z])", " \\1", x)))
# HowDoYouWorkOnThis ThisIsGreatExample ProgrammingIsGood
#"How Do You Work On This" "This Is Great Example" "Programming Is Good"
Or using stringr::str_replace_all:
library(stringr)
trimws(str_replace_all(s, "([A-Z])", " \\1"))
#[1] "How Do You Work On This" "This Is Great Example"
#[3] "Programming Is Good"
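A zero-width variant avoids trimws() altogether: insert a space only at the boundary between a lowercase letter and the capital that follows it, so nothing is ever prepended to the first word. A sketch using Perl lookarounds:

```r
s <- c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood")

# The lookbehind/lookahead pair matches the empty position between a
# lowercase and an uppercase letter; gsub() inserts a space there.
gsub("(?<=[a-z])(?=[A-Z])", " ", s, perl = TRUE)
# [1] "How Do You Work On This" "This Is Great Example" "Programming Is Good"
```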

Extract last name from a full name using R [duplicate]

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 5 years ago.
The 2000 names I have are a mix of "first name middle name last name" and "first name last name". My code only works for those with middle names. Please see the toy example.
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")
last.name <- gsub("[A-Z]+ [A-Z]*","\\", names)
last.name is
" SMITH" "" " CARLO" "-YOUNG"
LOVE JOY and JACKY LEE don't have any results.
P.S. This is not a duplicate post, since the previous ones do not use gsub.
Replace everything up to the last space with the empty string. No packages are used.
sub(".* ", "", names)
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Note:
Regarding the comment below on two word last names that does not appear to be part of the question as stated but if it were then suppose the first word is either DEL or VAN. Then replace the space after either of them with a colon, say, and then perform the sub above and then revert the colon back to space.
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
"EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN") # test data
sub(":", " ", sub(".* ", "", sub(" (DEL|VAN) ", " \\1:", names2)))
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO"
## [7] "VAN ALLEN"
Alternatively, extract everything after the last space:
library(stringr)
str_extract(names, '[^ ]+$')
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Or, as mikeck suggests, split the string on spaces and take the last word:
sapply(strsplit(names, " "), tail, 1)
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
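For completeness, the same extraction works in base R without stringr: regexpr() locates the final run of non-space characters, and regmatches() pulls it out. A sketch:

```r
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")

# "[^ ]+$" matches the last space-free run in each string;
# regmatches() extracts the matched substrings.
regmatches(names, regexpr("[^ ]+$", names))
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
```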

Extract words only with R

I have strings like these:
x <-c("DATE TODAY d. 011 + e. 0030 + r. 1061","Now or never d. 003 + e. 011 + g. 021", "Long term is long time (e. 104 to d. 10110)","Time is everything (1012) - /1072, 091A/")
Desired output:
d <- c("DATE TODAY","Now or never","Long term is long time","Time is everything")
After an hour of searching SO, I just could not do it. Any help is appreciated.
This bit uses stringr to extract anything containing two or more alphabeticals:
> library(stringr)
> unlist(lapply(str_extract_all(x,"[a-zA-Z][a-zA-Z]+"),paste,collapse=" "))
[1] "DATE TODAY" "Now or never"
[3] "Long term is long time to" "Time is everything"
I'm hoping the "to" missing from your desired output is a mistake on your part. It's a perfectly good word, and you said you wanted to extract words.
The pattern is not very clear, but based on the example shown, here are a couple of ways to get the expected result:
sub('( .\\.| \\().*', '', x)
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
or
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
If "to" is a valid word and the expected result had a typo:
pat1 <- '[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never"
#[3] "Long term is long time to" "Time is everything"
I agree with the others that "to" is a valid word. Here's a stringi approach
library(stringi)
stri_replace_all_regex(x, "\\s?[A-Za-z]?[+[:punct:]0-9]", "")
# [1] "DATE TODAY" "Now or never"
# [3] "Long term is long time to" "Time is everything"
