Combining words in tm (R) is not achieving the desired result

I am trying to combine a few words so that they count as one.
In this example I want val and valuati to be counted as valuation.
The code I have been using to try and do this is below:
#load in package
library(tm)
replaceWords <- function(x, from, keep){
regex_pat <- paste(from, collapse = "|")
gsub(regex_pat, keep, x)
}
oldwords <- c("val", "valuati")
newword <- c("valuation")
TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)
However, this does not work as expected. Any time val appears inside a word, it is replaced with valuation. For example, equivalent becomes equivaluation. How do I get around this and achieve my desired result?

Try this function -
replaceWords <- function(x, from, keep){
regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
gsub(regex_pat, keep, x)
}
The pattern val matches inside equivalent. Adding word boundaries (\\b) stops that from happening.
grepl('val', 'equivalent')
#[1] TRUE
grepl('\\bval\\b', 'equivalent')
#[1] FALSE
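As a quick check of the word-boundary version, here is a minimal sketch that applies it to a plain character vector instead of a tm corpus. The sample sentences are made up for illustration. When the function is passed to tm_map on a corpus, custom functions are usually wrapped in content_transformer().

```r
replaceWords <- function(x, from, keep) {
  # \\b anchors each alternative so only whole words are replaced
  regex_pat <- sprintf("\\b(%s)\\b", paste(from, collapse = "|"))
  gsub(regex_pat, keep, x)
}

# Hypothetical sample text, not taken from the original question
sample_text <- c("the val is equivalent", "a valuati report")
result <- replaceWords(sample_text, from = c("val", "valuati"), keep = "valuation")
print(result)
```

The word equivalent is left alone because val is not a whole word there.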

Related

Using lapply with gsub to replace words in a dataframe using another dataframe as a 'dictionary'

I have a dataframe called data where I want to replace some words in specific columns A & B.
I have a second dataframe called dict that plays the role of a dictionary/hash containing the words and the values to use for replacement.
I think it could be done with purrr's map(), but I want to use apply. It's for a package and I don't want to have to load another package.
The following code is not working, but it gives you the idea. I'm stuck.
columns <- c("A", "B" )
data[columns] <- lapply(data[columns], function(x){x}) %>% lapply(dict, function(y){
gsub(pattern = y[,2], replacement = y[,1], x)})
This works for changing one word, but I'm not able to pass the list of changes contained in the dictionary.
data[columns] <- lapply(data[columns], gsub, pattern = "FLT1", replacement = "flt1")
@Gregor_Thomas is right, you need a for loop so that the replacements are applied one after another; otherwise you only replace one value at a time.
df <- data.frame("A"=c("PB1","PB2","OK0","OK0"),"B"=c("OK3","OK4","PB1","PB2"))
dict <- data.frame("pattern"=c("PB1","PB2"), "replacement"=c("OK1","OK2"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x <- gsub(pattern = dict$pattern[i], replacement = dict$replacement[i],x)
}
return(x)
})
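Note that apply() over columns returns a character matrix rather than a data frame. If the goal is to update the original columns in place, a possible variant (a sketch only; replace_all is a hypothetical helper name, and fixed = TRUE assumes the dictionary holds literal strings rather than regular expressions) is to combine the same loop with lapply():

```r
df <- data.frame(A = c("PB1", "PB2", "OK0", "OK0"),
                 B = c("OK3", "OK4", "PB1", "PB2"),
                 stringsAsFactors = FALSE)
dict <- data.frame(pattern = c("PB1", "PB2"),
                   replacement = c("OK1", "OK2"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: apply every dictionary entry to one column in turn
replace_all <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    x <- gsub(dict$pattern[i], dict$replacement[i], x, fixed = TRUE)
  }
  x
}

columns <- c("A", "B")
# Assigning back into df[columns] keeps df a data frame
df[columns] <- lapply(df[columns], replace_all, dict = dict)
print(df)
```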
Or, if your dict is long, you can generate all the gsub() calls you need by using paste0() as a code generator:
paste0("df[,'A'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'A'])")
It generates all the gsub lines for the "A" column:
"df[,'A'] <- gsub(pattern = 'PB1', replacement = 'OK1',df[,'A'])"
"df[,'A'] <- gsub(pattern = 'PB2', replacement = 'OK2',df[,'A'])"
Then you evaluate the code and wrap it in an lapply for the various columns:
lapply(c("A","B"), FUN = function(v) { eval(parse(text=paste0("df[,'", v,"'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'",v,"'])"))) })
It's ugly but it works fine to avoid long loops.
Edit: for exact matching between df and dict, you should probably use a logical selection with == instead of gsub().
(I don't use match() here because it only returns the first match.)
df <- data.frame("A"=c("PB1","PB2","OK0","OK0","OK"),"B"=c("OK3","OK4","PB1","PB2","AB"))
dict <- data.frame("pattern"=c("PB1","PB2","OK"), "replacement"=c("OK1","OK2","ZE"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x[x==dict$pattern[i]] <- dict$replacement[i]
}
return(x)
})
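For exact matching, the loop can also be avoided if the lookup goes from the data into the dictionary, so that each value is searched for once in dict$pattern. This is only a sketch on a single vector, assuming the patterns in dict are unique:

```r
dict <- data.frame(pattern = c("PB1", "PB2", "OK"),
                   replacement = c("OK1", "OK2", "ZE"),
                   stringsAsFactors = FALSE)
x <- c("PB1", "PB2", "OK0", "OK0", "OK")

# Position of each value of x in the dictionary, NA when it has no entry
idx <- match(x, dict$pattern)
hit <- !is.na(idx)

# Replace only the values that have a dictionary entry
x[hit] <- dict$replacement[idx[hit]]
print(x)
```

Values such as "OK0" stay untouched because only whole-string matches count.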

How to move patterns in a string in r?

I am trying to code a function that would allow me to move certain patterns in a string in r. For example, if my strings are pattern_string1, pattern_string2, pattern_string3, pattern_string4, I want to mutate them to string1_pattern, string2_pattern, string3_pattern, string4_pattern.
In order to achieve this, I tried the following:
string_flip <- function(x, pattern){
if(str_detect(x, pattern)==TRUE){
str_remove(x, pattern) %>%
paste(x, "pattern", sep = "_")
}
}
However, when I try to apply this onto a vector of strings by the following code:
stringvector <- c("pattern_string1", "pattern_string2", "pattern_string3", "pattern_string4", "string5", "string6")
string_flip(stringvector, "pattern")
it returns a warning and changes all elements, not only those that contain "pattern". In addition, it does not only add pattern to the end of the string; it also duplicates the string itself, so I get the following result:
[1] "_string1_pattern_string1_pattern" "_string2_pattern_string2_pattern" "_string3_pattern_string3_pattern"
[4] "_string4_pattern_string4_pattern" "string5_string5_pattern" "string6_string6_pattern"
Can anybody help me with this?
Thanks a lot in advance!
Your function string_flip is not vectorised. It works for only one string at a time.
You also pass x to paste() a second time, which is why the string is doubled.
In paste, pattern should not be in quotes.
Try this function.
library(stringr)
string_flip <- function(x, pattern){
trimws(ifelse(str_detect(x, pattern),
str_remove(x, pattern) %>% paste(pattern, sep = "_"), x), whitespace = '_')
}
stringvector <- c('pattern_string1', 'pattern_string2', 'pattern_string3', 'pattern_string4')
string_flip(stringvector, "pattern")
#[1] "string1_pattern" "string2_pattern" "string3_pattern" "string4_pattern"
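A base R alternative, sketched here under the assumption that the pattern always sits at the start of the string, is followed by an underscore, and contains no regex metacharacters, is to swap two capture groups with sub(). The name flip_prefix is hypothetical:

```r
flip_prefix <- function(x, pattern) {
  # Group 1 is the prefix, group 2 is everything after the underscore
  regex_pat <- sprintf("^(%s)_(.*)$", pattern)
  # Swap the two groups; strings that do not match are returned unchanged
  sub(regex_pat, "\\2_\\1", x)
}

stringvector <- c("pattern_string1", "pattern_string2", "string5", "string6")
flipped <- flip_prefix(stringvector, "pattern")
print(flipped)
```

Because sub() is vectorised, no ifelse() is needed for the strings without the pattern.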

Modify the object without using return in R function

I am trying to reverse a string without using extra space in R. Below is the code for the same. My question is how to get the ReverseString function change the input without using extra space. I even tried using <<- without any luck.
ReverseString <- function(TestString){
TestString <- unlist(strsplit(TestString, ""))
Left <- 1
Right <- length(TestString)
while (Left < Right){
Temp <- TestString[Left]
TestString[Left] <- TestString[Right]
TestString[Right] <- Temp
Left <- Left + 1
Right <- Right - 1
}
return(paste(TestString, collapse = ""))
}
## Input
a = "StackOverFlow"
## OutPut
ReverseString(a)
"wolFrevOkcatS"
##
a
"StackOverFlow"
It is usually better to take advantage of vectorization in R (instead of for or while loops). In base R, without any packages, it would be something like:
ReverseString <- function(x) {
#splitstring splits every character, and rev reverses the order
out <- rev(strsplit(x, split = '')[[1]])
#paste to paste them together
paste(out, collapse = '')
}
a <- "StackOverFlow"
ReverseString(a)
#[1] "wolFrevOkcatS"
According to your comment you want to reverse the string without calling any function that does the reversal, i.e. no rev and co. Both of the solutions below do this.
I think you are also trying to modify global a from within the function, which is why you tried <<-. I'm not sure why it didn't work for you, but you might have used it incorrectly.
You should know that using <<- alone does not mean that you are using less space. To really save space you would have to call or modify global a at each step in your function where you call or modify TestString. This would entail some combination of assign, do.call, eval and parse, not to mention all the pasting you would have to do to access elements of a by integer position. Your function would end up bulky, nearly unreadable, and very likely less efficient due to the numerous function calls, despite having saved a negligible amount of space by not storing a copy of a. If you're dead set on creating such an abomination, then take a look at the functions I just listed and figure out how to use them.
Your energy would be better spent by improving upon your string-reversing function in other ways. For example, you can shorten it quite a bit by using a numerical sequence such as 13:1 in sapply:
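library(stringr)  # provides str_split()
reverse_string <- function(string) {
# split into single characters
vec <- str_split(string, "")[[1]]
# pick the characters from last to first and glue them back together
paste(sapply(length(vec):1, function(i) vec[i]), collapse = "")
}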
reverse_string("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
If your interviewers also have a problem with reverse sequences then here's another option that's closer to your original code, just a little cleaner. I also did my best to eliminate other areas where "extra space" was being used (indices stored in single vector, no more Temp):
reverse_string2 <- function(string){
vec <- str_split(string, "")[[1]]
i_vec <- c(1, length(vec))
while(i_vec[1] < i_vec[2]) {
vec[i_vec] <- vec[c(i_vec[2], i_vec[1])]
i_vec <- i_vec + c(1, -1)
}
paste(vec, collapse = "")
}
reverse_string2("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
It can be done easily with stringi
library(stringi)
a <- "StackOverFlow"
stri_reverse(a)
#[1] "wolFrevOkcatS"
I'm not sure I understood exactly the problem, but I think you're looking for a way to reverse the string object and automatically assign it to the original object without having to do a <- ReverseString(a) (assuming this is the reason why you tried using <<-). My solution to this is using deparse(substitute()) to read the original variable name inside the function and assign (using envir = .GlobalEnv) to assign your result over the original variable.
ReverseString <- function(TestString){
nm <- deparse(substitute(TestString))
TestString <- unlist(strsplit(TestString, ""))
Left <- 1
Right <- length(TestString)
while (Left < Right){
Temp <- TestString[Left]
TestString[Left] <- TestString[Right]
TestString[Right] <- Temp
Left <- Left + 1
Right <- Right - 1
}
assign(nm, paste(TestString, collapse = ""), envir = .GlobalEnv)
}
## Input
a = "StackOverFlow"
ReverseString(a)
a
#[1] "wolFrevOkcatS"
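R functions see their arguments with copy-on-modify semantics, so a function cannot normally change a caller's variable in place. Environments are the main exception, since they are passed by reference. The sketch below shows one possible way to use that; reverse_in_env and e are hypothetical names introduced here, and rev() is used for brevity instead of the manual swap loop.

```r
# Hypothetical helper: reverse the string stored under `name` inside `env`
reverse_in_env <- function(env, name) {
  chars <- strsplit(get(name, envir = env), "")[[1]]
  # Overwrite the stored value; nothing needs to be returned
  assign(name, paste(rev(chars), collapse = ""), envir = env)
  invisible(NULL)
}

e <- new.env()
assign("a", "StackOverFlow", envir = e)
reverse_in_env(e, "a")
print(get("a", envir = e))
```

This avoids writing to .GlobalEnv, at the cost of keeping the value inside an explicit environment.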

I need to check for data entry through R ...how do I validate that it is in the correct format

E.g., the data has to follow the convention xxxx-xx-xx-xxx-xxx-xxx-xxx-xxx.
The right data format is 7448-06-93-030-001 or 7448-06-93-030-001-010-030-060,
but not 7448-060-030-070. I hope that makes sense.
Assuming that the "correct format" means the correct number of numeric digits between dashes, here is one solution:
test_format <- function(x) {
#get number of characters of each bunch of digits
x <- paste0("-", x, "-")
dash_pos <- unlist(gregexpr("-", x))
n <- length(dash_pos)
lens <- dash_pos[2:n] - dash_pos[1:(n-1)] - 1
#check that this matches the correct convention
correct_lens <- c(4,2,2,3,3,3,3,3)
isTRUE(all.equal(lens, correct_lens[1:(n-1)]))
}
test_format("7448-06-93-030-001") #should be true
test_format("7448-06-93-030-001-010-030-060") #should be true
test_format("7448-060-030-070") #should be false
This regular expression should work, assuming you want the full pattern of x's shown in the question:
\d{4}-(\d{2}-){2}(\d{3}-){4}\d{3}
https://regular-expressions.mobi/rlanguage.html?wlr=1
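To use such a pattern from R, it can be anchored and passed to grepl(). The sketch below assumes that only the five-group and the eight-group forms from the question are valid (the question does not say whether six or seven groups are allowed); is_valid_code is a hypothetical name.

```r
is_valid_code <- function(x) {
  # 4 digits, two 2-digit groups, two 3-digit groups,
  # then optionally exactly three more 3-digit groups
  pat <- "^\\d{4}(-\\d{2}){2}(-\\d{3}){2}((-\\d{3}){3})?$"
  grepl(pat, x, perl = TRUE)
}

codes <- c("7448-06-93-030-001",
           "7448-06-93-030-001-010-030-060",
           "7448-060-030-070")
valid <- is_valid_code(codes)
print(valid)
```

The ^ and $ anchors matter here: without them, a valid code embedded in a longer string would also pass.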
Are you looking for a blanket gsub? These two work for those two scenarios. You could use an ifelse to determine which one to use.
df <- c("74-486993-030-001")
df <- gsub("-", "", df)
dfa <- gsub("(\\d{4})(\\d{2})(\\d{2})(\\d{3})(\\d{3})$", "\\1-\\2-\\3-\\4-\\5", df)
"7448-69-93-030-001"
df2 <- c("74480693-030-00-10-10-030-060")
df2 <- gsub("-", "", df2)
dfb <- gsub("(\\d{4})(\\d{2})(\\d{2})(\\d{3})(\\d{3})(\\d{3})(\\d{3})(\\d{3})$", "\\1-\\2-\\3-\\4-\\5-\\6-\\7-\\8", df2)
"7448-06-93-030-001-010-030-060"

Assigning new strings with conditional match

I have an issue with conditionally replacing strings with new ones.
Below is a short version of my real problem. So far it works, but I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace strings with replace_strings: the first item in strings is replaced with the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final_string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But if I have A1000 down to A1 and want to replace them with A1 up to A1000, I would need 1000 rows of ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First obtain everything that follows the _ using regexpr, then replace it as shown below:
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very readable and easy to maintain:
for (i in 1:length(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
# keep only the part before the underscore
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
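When the mapping depends on the old code rather than on the position, a named character vector can serve as the lookup table. This is a minimal sketch; lookup, prefix, and old_code are names introduced here for illustration, and it assumes every string contains exactly one underscore.

```r
strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30")

# Names are the old codes, values are the new codes
lookup <- c(A33 = "A1", A32 = "A2", A31 = "A3", A30 = "A4")

prefix   <- sub("_.*$", "", strings)   # part before the underscore
old_code <- sub("^.*_", "", strings)   # part after the underscore

# Index the lookup vector by name and rebuild the strings
new_strings <- paste(prefix, unname(lookup[old_code]), sep = "_")
print(new_strings)
```

For a long range of codes, the lookup vector could be built programmatically, for example with setNames(), instead of being typed out.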
Using the dplyr package:
library(dplyr)
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
Yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
