R: compare and subset two strings [duplicate]

This question already has answers here:
Replace specific characters within strings
(7 answers)
Closed 7 years ago.
Is there a function in R that meets this requirement:
if string1 occurs in string2, then remove string1 from string2?
I have spent a day searching for such a function, so any help would be appreciated.
Edit:
I have a dataframe. Here's a part of it:
mark      name                                 ChekMark
Caudalie  Caudalie Eau démaquillante 200ml     TRUE
Mustela   Mustela Bébé lait hydra corps 300ml  TRUE
Lierac    Lierac Phytolastil gel prévention    TRUE
I want to create a new data frame in which the mark no longer appears in the product name.
That's my final goal.

You can use gsub and work with regular expressions:
gsub(" this part ", " ", "A Text where this part should be removed")
# [1] "A Text where should be removed"
gsub(" this part ", " ", "A Text where this 1 part should be removed")
# [1] "A Text where this 1 part should be removed"
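For the data frame in the question, a minimal base R sketch could look like the following. It assumes the mark should be matched literally (fixed = TRUE) and that only its first occurrence should be removed. The helper remove_mark and the column name product are hypothetical names introduced here, not part of the original post.

```r
# Sample rows copied from the question (ChekMark column omitted).
df <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention"),
  stringsAsFactors = FALSE
)

# sub() is not vectorised over `pattern`, so each mark is paired with its
# own name. fixed = TRUE treats the mark as a literal string, not a regex.
remove_mark <- function(mark, name) {
  trimws(sub(mark, "", name, fixed = TRUE))
}

# `product` is a hypothetical column holding the name without the mark.
df$product <- unname(mapply(remove_mark, df$mark, df$name))
df$product
```

If the mark can appear more than once in a name, gsub could be swapped in for sub.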

Are you looking for string2.replace(string1, '')? (Note that this is Python syntax, not R.)
Or you could define:
>>> R = lambda string1, string2: string2.replace(string1, '')
>>> R('abc', 'AAAabcBBB')
'AAABBB'

Related

Delete initial (matching) pattern from a string [duplicate]

This question already has answers here:
Regular expression to match characters at beginning of line only
(8 answers)
Closed 2 years ago.
I have the following vector of character strings:
v<-c("RT #name1: hello world", "Hi guys, how are you?", "Hello RT I have no text", "RT #name2: Hello!")
I would like to delete only those RT that are positioned at the beginning of strings and store the results in another vector, e.g., w:
> w
"#name1: hello world" "Hi guys, how are you?" "Hello RT I have no text" "#name2: Hello!"
Maybe I could use function str_extract_all from the package stringr, but I can't apply it to my problem.
Use gsub and the 'anchor' ^, which signifies the beginning of a string:
w <- gsub("^RT\\s", "", v)
Alternatively, with str_replace from stringr:
w <- stringr::str_replace(v, "^RT\\s", "")

Keep rows that contain a specific word [duplicate]

This question already has answers here:
Filter rows which contain a certain string
(5 answers)
Closed 3 years ago.
This command keeps the rows whose value is exactly the specific word:
df[df$ID == "interesting", ]
If the word appears in a row with other words around it, how can I detect it and keep that row?
Example input
data.frame(text = c("interesting", " I am interesting for this", "remove"))
Expected output
data.frame(text = c("interesting", " I am interesting for this"))
Example data:
df <- data.frame(text = c("interesting", " I am interesting for this", "remove"),
stringsAsFactors = FALSE)
Solution using base R. Indexing using grepl:
df[grepl("interesting", df$text), ]
This returns:
[1] "interesting" " I am interesting for this"
Edit 1
Change code so that it returns a data.frame and not a vector.
df[grep("interesting", df$text), , drop = FALSE]
This now returns:
                        text
1                interesting
2  I am interesting for this
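If only the whole word should count (so that, say, "uninteresting" is not kept), one option is to add word boundaries to the pattern. This is a sketch only; the extra "uninteresting" row and the names keep and res are added here purely for illustration.

```r
df <- data.frame(text = c("interesting", " I am interesting for this",
                          "remove", "uninteresting"),
                 stringsAsFactors = FALSE)

# \\b marks a word boundary, so "uninteresting" does not match.
keep <- grepl("\\binteresting\\b", df$text)

# drop = FALSE keeps the result a data.frame even with a single column.
res <- df[keep, , drop = FALSE]
res
```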

How to extract first 2 words from a string in R?

I need to extract the first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words; if it contains fewer than 2 words, it should return the string as it is.
I've tried using the word function from the stringr package, but it doesn't give the desired output when the string has fewer than 2 words.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To return the original string when it has fewer than 2 words, a simple if-else can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
  words[1:2]
} else {
  a
}
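The if-else above handles a single string and returns the two words as separate elements. For a whole column, a vectorised variant might look like the sketch below. The function name first_two_words is hypothetical, and it assumes words are separated by single spaces.

```r
# Return the first two space-separated words of each element of x,
# or the whole element when it has fewer than two words.
first_two_words <- function(x) {
  vapply(
    strsplit(x, " ", fixed = TRUE),
    # head() returns at most two words, so short strings pass through.
    function(words) paste(head(words, 2), collapse = " "),
    character(1)
  )
}

first_two_words(c("Auto Loan (Personal)", "Others"))
# [1] "Auto Loan" "Others"
```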
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
which also works if the text has only one word, because sub returns the input unchanged when the pattern does not match
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
Explanation:
The pattern inside the parentheses, (\\w+\\s+\\w+), means: \\w+ (one word), followed by \\s+ (whitespace), followed by \\w+ (another word), so in total two words are captured.
The trailing .* matches the rest of the string, and the backreference \\1 in sub replaces the whole match with the two captured words.

Extract Text Starting and Ending with Punctuations in R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I want to extract a group of strings between two punctuations using RStudio.
I tried to use str_extract command, but whenever I tried to use anchors (^ for starting char, and $ for ending char), it failed.
Here is the sample problem:
> text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"
Here is the sample code I used:
> str_extract(text,"(Name : )(.+)?( ;)")
> str_match(str_extract(text,"(Name : )(.+)?( ;)"),"(Name : )(.+)?( ;)")[3]
But it seemed too verbose, and not flexible.
I only want to extract "Dr. CHARLES DOWNING MAP".
Can anyone help with my problem?
Can I tell the regex to start at the first non-whitespace character after "Name : " and end before " ; POB"?
This seems to work.
> gsub(".*Name :(.*) ;.*", "\\1", text)
[1] " Dr. CHARLES DOWNING MAP"
With str_match
stringr::str_match(text, "^Name : (.*) ;")[, 2]
#[1] "Dr. CHARLES DOWNING MAP"
[, 2] is to get the contents from the capture group.
There is also qdapRegex::ex_between to extract string between left and right markers
qdapRegex::ex_between(text, "Name : ", ";")[[1]]
#[1] "Dr. CHARLES DOWNING MAP"
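A base R variant without extra packages is sketched below. It assumes the field always starts with "Name :" at the beginning of the string and ends at the first semicolon; the variable name is illustrative.

```r
text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"

# (.*?) is a lazy capture, so it stops at the first ";" after "Name :".
# perl = TRUE selects the PCRE engine for the lazy quantifier.
# trimws() then drops the spaces left around the captured text.
name <- trimws(sub("^Name :(.*?);.*$", "\\1", text, perl = TRUE))
name
# [1] "Dr. CHARLES DOWNING MAP"
```

Because the capture is lazy, later semicolons in the string (after "London" and "1958") do not affect the result.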

">" is not matched by "[[:punct:]]" when using `stringr::str_replace_all`? [duplicate]

This question already has answers here:
R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
(2 answers)
Closed 4 years ago.
I find this really odd :
pattern <- "[[:punct:][:digit:][:space:]]+"
string <- "a . , > 1 b"
gsub(pattern, " ", string)
# [1] "a b"
library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"
str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"
Is this expected ?
Still working on this, but ?"stringi-search-charclass" says:
Beware of using POSIX character classes, e.g. ‘[:punct:]’. ICU
User Guide (see below) states that in general they are not
well-defined, so may end up with something different than you
expect.
In particular, in POSIX-like regex engines, ‘[:punct:]’ stands for
the character class corresponding to the ‘ispunct()’
classification function (check out ‘man 3 ispunct’ on UNIX-like
systems). According to ISO/IEC 9899:1990 (ISO C90), the
‘ispunct()’ function tests for any printing character except for
space or a character for which ‘isalnum()’ is true. However, in a
POSIX setting, the details of what characters belong into which
class depend on the current locale. So the ‘[:punct:]’ class does
not lead to portable code (again, in POSIX-like regex engines).
So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in
‘ICU’. You have been warned.
Copying from the issue posted above,
string <- "a . , > 1 b"
mypunct <- "[[\\p{P}][\\p{S}]]"
stringr::str_remove_all(string, mypunct)
I can appreciate stuff being locale-specific, but it still surprises me that [:punct:] doesn't even work in a C locale ...
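One way to see what the stringi documentation is describing is to query the Unicode general categories directly. The sketch below uses base R's PCRE engine (perl = TRUE) as a stand-in for stringr's ICU engine, and assumes PCRE was built with Unicode property support, which is the usual case. The variable names are illustrative.

```r
# ">" belongs to the Unicode Symbol category (S), not Punctuation (P),
# which is why a class that only covers \p{P} does not match it.
gt_is_punct  <- grepl("\\p{P}", ">", perl = TRUE)
gt_is_symbol <- grepl("\\p{S}", ">", perl = TRUE)
comma_is_punct <- grepl("\\p{P}", ",", perl = TRUE)

# Covering both categories (plus digits and whitespace) collapses the
# whole run between "a" and "b" into a single space.
cleaned <- gsub("[\\p{P}\\p{S}\\d\\s]+", " ", "a . , > 1 b", perl = TRUE)

c(gt_is_punct, gt_is_symbol, comma_is_punct)
cleaned
```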
