Handling string search and substitution in R - r

I am a beginner in R, used Matlab before and I have been searching around for a solution to my problem but I do not appear to find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help on this? Thank you very much, find myself a bit lost!

First remove blank spaces i.e. elements like "CAT 6" to "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)

The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces Remove the digits and spaces ("[0-9 ]") in the first gsub and remove the non-digits ("\\D") in the second gsub converting to numeric to avoid leading zeros and then paste together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or if a single output stirng is wanted add this:
paste(y, collapse = " ")
3) vector input without embedded spaces Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs. That is if we replace x3 with x1 or x2 it still works and as with (2) then if a single output string is wanted then add paste(y, collapse = " ")

Related

Match two character strings by location in R

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string
[1] " A B C"
text
[1] " I love chocolate pudding"
I am trying to match letters in "string" with text in "text" such that to the letter A corresponds the text "I love" to B corresponds "chocolate" and to C "pudding". Ideally, I would like to put A, B, C in column 1 and three different rows of a dataframe (or tibble) and the text in column 2 and the corresponding rows. Any suggestion?
It is hard to know whether the strings in which you are trying to manipulate and then collate into columns in a data.frame follow a pattern. But for the example you posted, I suggest creating a list with the strings (strings):
strings <- list(string, text)
Then use lapply() which will in turn create a list for each element in strings.
res <-lapply(strings, function(x){
grep(x=trimws(unlist(strsplit(x, "\\s\\s"))), pattern="[[:alpha:]]", value=TRUE)
})
In the code above, strsplit() splits the string whenever two spaces are found (\\s\\s). But the resulting split is a list with the strings as inner elements. Therefore you need to use unlist() so you can use it with grep(). grep() will select only those strings with an alphanumeric character --which is what you want.
You can then use do.call(cbind, list) to bind the elements in the resulting lapply() list into columns. The dimension must match for this work.
do.call(cbind, res)
Result:
> do.call(cbind, res)
[,1] [,2]
[1,] "A" "I love"
[2,] "B" "chocolate"
[3,] "C" "pudding"
You can wrap it up into a as.data.frame() for instance to get the desired result:
> as.data.frame(do.call(cbind, res), stringsAsFactors = FALSE)
V1 V2
1 A I love
2 B chocolate
3 C pudding
You can use read.fwf and get the positions using nchar.
read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1]
# V2 V3 V4
#1 I love chocolate pudding
In case the white spaces should be removed use also trimws:
trimws(read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1])
#[1] "I love" "chocolate" "pudding"
Based on your data, I came up with this workaround by using the package stringr. This only works with that kind of pattern, so in case you have erratic ones you need to adjust it.
The output is a data.frame with two columns given by your two input data and rows according to the matches.
library(stringr)
string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string_nospace <- str_replace_all( string, "\\s{1,20}", " " )
string_nospace <- str_trim( string_nospace )
string_nospace <- data.frame( string = t(str_split(string_nospace, "\\s", simplify = TRUE)))
text_nospace <- str_replace_all( text, "\\s{2,20}", "_" )
text_nospace <- str_sub(text_nospace, start = 2)
text_nospace <- data.frame(text = t(str_split(text_nospace, "_", simplify = TRUE)))
df = data.frame(string = string_nospace,
text = text_nospace )
df
#> string text
#> 1 A I love
#> 2 B chocolate
#> 3 C pudding
Created on 2020-06-08 by the reprex package (v0.3.0)

pattern matching a formula in R

I want to do a pattern matching of variables in a formula. The ideal solution should be able to perform as below:
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456' and output should be variable_1, variable_2,variable_3, variable_4, variable_5.
Note: variable name can contain character, underscore (_), numbers only and operations are limited to +,-,*,/. formula may contain constants as well (like here it is 456). The output should contain only variables names and should ignore any numeric constants.
I have tried the below codes. I was only able to check for the variable name containing only character and minus operation (-) does not work as well.
formula <- "variableX +variableY*VariableZ"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives below output
[,1]
[1,] "variableX"
[2,] ""
[3,] "variableY"
[4,] ""
[5,] "VariableZ"
which is correct BUT when i include minus operation (-), the strapplyc gives wrong results
formula <- "variableX -variableY"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T) gives below output
[,1]
[1,] "variableX"
[2,] "-"
[3,] "variableY"
I would appreciate if anyone could help me on ideal solution.
You can use regular expressions for this:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"
gsub("[\\+\\*\\-\\/]", ", ", formula)
Explanation of the regex:
[ and ] start and end a group of characters that you want to select
\\+ escapes the + sign, with you want to replace with ", "
\\* escapes the * sign, with you want to replace with ", "
\\- escapes the - sign, with you want to replace with ", "
\\/ escapes the / sign, with you want to replace with ", "
Edit to reflect OP's updated request
Another way would be just to extract your variables. The below works if you hold the format lowercaseletters_numberfor your variable name:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5+34+brigadeiro_5"
paste(regmatches(formula, gregexpr("variable_[0-9]", formula))[[1]],
collapse = ", ")
You can also use the stringr package if you want the code to look a little cleaner:
library(stringr)
str_extract_all(formula, "[a-z]*_[0-9]*")
You could use strsplit() with some extras.
res <- trimws(el(strsplit(formula, "\\+|\\-|\\*|\\/")))
Thereafter we want those elements yielding NA when we try to coerce them as.numeric().
res[is.na(suppressWarnings(as.numeric(res)))]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
Data
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'

replacement of words in strings

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string, a word that matches and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful for full in a1
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b it anchors the pattern to a word boundary. \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use an approximate string matching (see stringdist::amatch). Here is an example
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"

Replace single or multiple spaces with single/multiple zeros

Ultimately I want to convert this value into as.numeric but before this I want to replace space for zeros. I do 2 step sub here as I can do only single space at the time. Can do this in one command?
x <- c(' 3','1 2','12 ') ## could be leading, trailing or in the mid
x
as.numeric(x) ## <#><< NAs introduced by coercion
x <- sub(' ','0',sub(' ','0',x))
as.numeric(x)
This approach can replace all leading white space to leading 0.
# Load package
library(stringr)
# Create example strings with leading white space and number
x <- c(" 3", " 4", " 12")
x %>%
# Trim the leading white space
str_trim(side = "left") %>%
# Add leading 0, the length is based on the original stringlength
str_pad(width = str_length(x), side = "left", pad = "0")
#[1] "003" "00004" "000012"
Update
My previous attempt is actually convoluted. Using gsub can achieve the same thing with only one line.
gsub(" ", "0", x)
#[1] "003" "00004" "000012"
And it does not have to be leading white space. Taking the update from the OP as an example.
x <- c(' 3','1 2','12 ')
gsub(" ", "0", x)
#[1] "003" "102" "120"

R Match And Sub On Space Between Specific Characters

I need a little help with a regular expression using gsub. Take this object:
x <- "4929A 939 8229"
I want to remove the space in between "A" and "9", but I am not sure how to match on only the space between them and not on the second space. I essentially need something like this:
x <- gsub("A 9", "", x)
But I am not sure how to write the regular expression to not match on the "A" and "9" and only the space between them.
Thanks in advance!
You may use the following regex in sub:
> x <- "4929A 939 8229"
> sub("\\s+", "", x)
[1] "4929A939 8229"
The \\s+ will match 1 or more whitespace symbols.
The replacement part is an empty string.
See the online R demo
gsub matches/uses all regex found whereas sub only matches/uses the first one. So
sub(" ", "", "4929A 939 8229") # returns "4929A939 8229"
Will do the job
Removing second/nth occurence
You can do that e.g. by using strsplit as follows:
x <- c("4929A 939 8229", "4929A 9398229")
collapse_nth <- function(x_split, split, nth, replacement){
left <- paste(x_split[seq_len(nth)], collapse = split)
right <- paste(x_split[-seq_len(nth)], collapse = split)
paste(left, right, sep = replacement)
}
remove_nth <- function(x, nth, split, replacement = ""){
x_split <- strsplit(x, split, fixed = TRUE)
x_len <- vapply(x_split, length, integer(1))
out <- x
out[x_len>nth] <- vapply(x_split[x_len>nth], collapse_nth, character(1), split, nth, replacement)
out
}
Which gives you:
# > remove_nth(x, 2, " ")
# [1] "4929A 9398229" "4929A 9398229"
and
# > remove_nth(x, 2, " ", "---")
# [1] "4929A 939---8229" "4929A 9398229"

Resources