Splitting text into separate fuctions - r

I have a spreadsheet in "R" where a column is a full Name (last name, first name middle name) ex Thompson, Billy Red. I am trying to run a function that separates that operates them and will produce only first name.
Example code that I am using. I need the "get_first_name" created any ideas?
top.25 <- head( d$Full.Name,25 )
first.25 <- get_first_name( name=top.25 )
data.frame( top.25, first.25 ) %>% pander()

This should work:
get_first_name <- function(x) {
sub(" .*", "", sub(".*, ", "", x))}
Example:
Full.Name <- c("Thompson, Billy Red", "Smith, John", "St Croix, Some Body")
get_first_name(Full.Name)
Result:
[1] "Billy" "John" "Some"
Rationale:
First, the function extracts everything that is after ", " (comma + space):
sub(".*, ", "", x)
Second, it extracts everything that is before " " (space):
sub(" .*", "", y)
Replace the first step as the "y" in the second step, and assign it to your function.

Related

Formatting UK Postcodes in R

I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
One approach:
Convert to uppercase
extract the alphanumeric characters
Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything what is not a word caracter \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters with (.{3})$ and \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
# Option 1 using regex:
res1 <- gsub(
"(\\w+)(\\d[[:upper:]]\\w+$)",
"\\1 \\2",
gsub(
"\\W+",
" ",
postcodes
)
)
# Option 2 using substrings:
res2 <- vapply(
trimws(
gsub(
"\\W+",
" ",
postcodes
)
),
function(ir){
paste(
trimws(
substr(
ir,
1,
nchar(ir) -3
)
),
substr(
ir,
nchar(ir) -2,
nchar(ir)
)
)
},
character(1),
USE.NAMES = FALSE
)

replacement of words in strings

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string, a word that matches and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful for full in a1
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b it anchors the pattern to a word boundary. \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use an approximate string matching (see stringdist::amatch). Here is an example
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"

Handling string search and substitution in R

I am a beginner in R, used Matlab before and I have been searching around for a solution to my problem but I do not appear to find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help on this? Thank you very much, find myself a bit lost!
First remove blank spaces i.e. elements like "CAT 6" to "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces Remove the digits and spaces ("[0-9 ]") in the first gsub and remove the non-digits ("\\D") in the second gsub converting to numeric to avoid leading zeros and then paste together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or if a single output stirng is wanted add this:
paste(y, collapse = " ")
3) vector input without embedded spaces Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs. That is if we replace x3 with x1 or x2 it still works and as with (2) then if a single output string is wanted then add paste(y, collapse = " ")

how to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. Here, the idea is to select the characters inside the brackets by matching the opening bracket (\\( have to escape the bracket as it is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), capture it as a group within the brackets. In the replacement, we split the group of characters (x) with strsplit, unlist the list output, get the unique elements and paste it together
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
Take the answer above. This is more straightforward, but you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace this | due to some issus during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as unit. Then remove the duplicates and replace the old unit with the updated.
a longer but more elaborate try
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if( length(j1)==0) {
next
} else {
ifelse(length(j1)>1,
newsentence <- c(newsentence,paste("(",paste(j1,collapse="|"),")",sep="")),
newsentence <- c(newsentence,j1[1]))
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

Simple Comparing of two texts in R

I want to compare two texts to similarity, therefore i need a simple function to list clearly and chronologically the words and phrases occurring in both texts. these words/sentences should be highlighted or underlined for better visualization)
on the base of #joris Meys ideas, i added an array to divide text into sentences and subordinate sentences.
this is how it looks like:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works, sometimes it doesn't.
I WOULD like to have results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas i get none results.
There are some problems with the answer of #Chase :
differences in capitalization are not taken into account
interpunction can mess up results
if there is more than one word similar, then you get a lot of warnings due to the gsub call.
Based on his idea, there is the following solution that makes use of tolower() and some nice functionalities of regular expressions :
compareSentences <- function(sentence1, sentence2) {
# split everything on "not a word" and put all to lowercase
x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
commonWords <- intersect(x1, x2)
#add word beginning and ending and put words between ()
# to allow for match referencing in gsub
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
# replace the match by the match with star added
sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
}
return(list(sentence1,sentence2))
}
This gives following result :
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
I am sure that there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that, or add something else.
sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."
compareSentences <- function(sentence1, sentence2) {
sentence1 <- unlist(strsplit(sentence1, " "))
sentence2 <- unlist(strsplit(sentence2, " "))
commonWords <- intersect(sentence1, sentence2)
return(list(
sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
, sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
))
}
> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."
$sentence2
[1] "Dick Cheney shot* a man."

Resources