I have the following data and am looking to substitute only single characters.
A,Apple
A
I want to produce the output such that
Banana,Apple
Banana
In other words I want to substitute anything that has an A, or just an A with banana. But if another word starting with A comes in, I want to ignore that.
I tried
gsub("A", "Banana"),
gsub("A[^,;]","Banana"),
But this won't work for the first example; the output I get is
Banana,Bpple
Any ideas on how I can achieve this?
Thanks!
If the value is always surrounded by punctuation or line start/end:
text = "A,Apple\nA\nAvocado"
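text2 = gsub("\\bA\\b", "Banana", text, perl = TRUE)
cat(text2)
This matches "A" only where it stands alone as a word. \b is a zero-width word-boundary assertion: it consumes no characters, so the surrounding punctuation is left in place and no capture groups or backreferences are needed. perl = TRUE selects PCRE, and ignore.case is left at its default of FALSE so that a lowercase "a" is not replaced.
Output:
Banana,Apple
Banana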
Avocado
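To make the effect of the word boundary concrete, here is a minimal sketch that runs the naive pattern and the bounded pattern side by side. The third element of x is an extra test case made up for illustration; it is not from the question.

```r
# "Avocado;A" is a made-up extra case to show a different separator.
x <- c("A,Apple", "A", "Avocado;A")

# Naive pattern: every capital "A" is replaced, including the one in "Apple".
naive <- gsub("A", "Banana", x)

# Bounded pattern: only an "A" that forms a whole word is replaced.
bounded <- gsub("\\bA\\b", "Banana", x, perl = TRUE)

print(naive)
#[1] "Banana,Bananapple"   "Banana"              "Bananavocado;Banana"
print(bounded)
#[1] "Banana,Apple"   "Banana"         "Avocado;Banana"
```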
A non-regex solution could be to split the string on comma (,), change each piece to "Banana" if it is equal to "A", and paste the pieces back together:
sapply(strsplit(x, ","), function(p) paste(ifelse(p == "A", "Banana", p), collapse = ","))
#[1] "Banana,Apple" "Banana"
data
x <- c("A,Apple", "A")
Related
Is there an R function to get only the part of a string before the 2nd capital character appears?
For example:
Example <- "MonkeysDogsCats"
Expected output should be:
"Monkeys"
Maybe something like
stringr::str_extract("MonkeysDogsCats", "[A-Z][a-z]*")
#[1] "Monkeys"
Here is an alternative approach: first put a space before every uppercase letter that follows a lowercase one, then extract the first word:
library(stringr)
word(gsub("([a-z])([A-Z])","\\1 \\2", Example), 1)
[1] "Monkeys"
A base solution with sub():
x <- "MonkeysDogsCats"
sub("(?<=[a-z])[A-Z].*", "", x, perl = TRUE)
# [1] "Monkeys"
Another way using stringr::word():
stringr::word(x, 1, sep = "(?=[A-Z])\\B")
# [1] "Monkeys"
If the goal is strictly to capture everything before the 2nd capital character, you might want to pick a solution that also works with all types of strings, including those containing numbers and special characters.
strings <- c("MonkeysDogsCats",
"M4DogsCats",
"M?DogsCats")
stringr::str_remove(strings, "(?<=.)[A-Z].*")
Output:
[1] "Monkeys" "M4" "M?"
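The same lookbehind pattern also works in base R with sub() and perl = TRUE. The sketch below wraps it in a helper; the name before_second_capital is made up for illustration, and it assumes each string starts with a capital letter (otherwise it cuts at the first capital that is not in first position).

```r
# Hypothetical helper: drop everything from the first capital letter that
# is preceded by at least one character.
before_second_capital <- function(x) {
  sub("(?<=.)[A-Z].*", "", x, perl = TRUE)
}

# The last element has no second capital, so it is returned unchanged.
strings <- c("MonkeysDogsCats", "M4DogsCats", "M?DogsCats", "Monkeys")
res <- before_second_capital(strings)
print(res)
#[1] "Monkeys" "M4"      "M?"      "Monkeys"
```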
It depends on what you want to allow to match. You can for example match an uppercase char [A-Z] optionally followed by any character that is not an uppercase character [^A-Z]*
If you don't want to allow whitespace chars, you can exclude them [^A-Z\\s]*
library(stringr)
str_extract("MonkeysDogsCats", "[A-Z][^A-Z]*")
Output
[1] "Monkeys"
If there should be an uppercase character following, and there are only lowercase characters allowed:
str <- "MonkeysDogsCats"
regmatches(str, regexpr("[A-Z][a-z]*(?=[A-Z])", str, perl = TRUE))
Output
[1] "Monkeys"
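The three patterns discussed above agree on "MonkeysDogsCats" but diverge once the first word contains a digit. This sketch compares them in base R; first_match is a hypothetical convenience wrapper, and "M4DogsCats" is borrowed from the earlier answer as a second test string.

```r
# Hypothetical wrapper: return the first match of `pattern` in each string.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

s <- c("MonkeysDogsCats", "M4DogsCats")

# Uppercase followed by lowercase letters only: stops at the digit.
only_lower <- first_match("[A-Z][a-z]*", s)
# Uppercase followed by anything that is not uppercase: keeps the digit.
not_upper <- first_match("[A-Z][^A-Z]*", s)
# Lowercase only, and an uppercase letter must follow: "M4" cannot match,
# so the first match in the second string is "Dogs".
with_lookahead <- first_match("[A-Z][a-z]*(?=[A-Z])", s)

print(only_lower)      #[1] "Monkeys" "M"
print(not_upper)       #[1] "Monkeys" "M4"
print(with_lookahead)  #[1] "Monkeys" "Dogs"
```

Which one is right depends on what is allowed between the capitals, as the answer says.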
I'm trying to check whether string B is contained by string A and this is what I tried:
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
str_detect(string_a, string_b)
I would expect a match (TRUE), because I want to ignore the part of string_b that comes after the "/" and before the next white space, which is why I put "/xxxx".
In other words, the "xxxx" should stand for any letters or digits in that position. Is there a notation for ignoring parts of a string when matching like this?
Yes, in regex you can use .* to match zero or more characters.
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
string_c <- "free/.*a bird"
str_detect(string_a, string_c)
#[1] TRUE
If you cannot change string_b at source, you may use str_replace_all or gsub to replace xxxx with '.*'.
str_detect(string_a, str_replace_all(string_b, 'x+', '.*'))
#[1] TRUE
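If the template may contain characters that are special in a regex (parentheses, dots and so on), one option is to quote the literal parts and only turn the placeholder into a wildcard. The following is a sketch under assumptions, not a definitive implementation: the helper name template_to_regex is made up, the placeholder is assumed to be the literal text "xxxx", the template is assumed not to contain "\E", and \S* (non-whitespace only) is used instead of .* because the question says the ignored part ends at a white space.

```r
# Hypothetical helper: quote the literal parts with PCRE \Q...\E and join
# them with \S*, which matches any run of non-whitespace characters.
template_to_regex <- function(template, placeholder = "xxxx") {
  parts <- strsplit(template, placeholder, fixed = TRUE)[[1]]
  paste0("\\Q", parts, "\\E", collapse = "\\S*")
}

string_a <- "something else free/1a2b a bird yes"
pattern <- template_to_regex("free/xxxx a bird")

hit <- grepl(pattern, string_a, perl = TRUE)                   # \Q...\E needs PCRE
miss <- grepl(pattern, "free/1a2b not a bird", perl = TRUE)    # extra words in between
literal_parens <- grepl(template_to_regex("price (xxxx)"), "price (12)", perl = TRUE)

print(c(hit, miss, literal_parens))
#[1]  TRUE FALSE  TRUE
```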
I read about regex and came across word boundaries. I found a question about the difference between \b and \B. Using the code from that question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
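You can use regexpr and regmatches to get the match. grep only reports which elements contain a match (their indices or, with value = TRUE, the whole elements), not the matched text itself. You can also use sub.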
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
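The difference between the functions is easiest to see on a vector with one matching and one non-matching element. This is a small sketch; the second sentence in x is made up for illustration.

```r
x <- c("The cat scattered his food all over the room.", "No felines here.")

idx <- grep("\\bcat\\b", x)                    # indices of matching elements
whole <- grep("\\bcat\\b", x, value = TRUE)    # the matching elements themselves
flags <- grepl("\\bcat\\b", x)                 # one TRUE/FALSE per element
found <- regmatches(x, gregexpr("\\bcat\\b", x))  # the matched text, per element

print(idx)    #[1] 1
print(flags)  #[1]  TRUE FALSE
print(found)  # a list: "cat" for the first element, character(0) for the second
```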
grep returns the whole string because it just looks to see if the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from package stringr:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
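The difference between \b and \B is that \b marks a word boundary whereas \B is its negation. That is, \\bcat\\b matches only if cat stands alone as a word (bounded by white space, punctuation, or the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word, with a word character on both sides. For example: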
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
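A quick way to check what counts as a boundary is to test both patterns against a few short strings. The test strings below are made up for illustration, and perl = TRUE is used so that the PCRE definition of a word character (letters, digits and underscore) applies.

```r
tests <- c("cat", "cat,dog", "(cat)", "scatter", "cat_food")

# \b...\b: "cat" must be delimited by non-word characters or the string ends.
# The underscore is a word character, so "cat_food" does not match.
as_word <- grepl("\\bcat\\b", tests, perl = TRUE)

# \B...\B: "cat" must have a word character on both sides.
inside_word <- grepl("\\Bcat\\B", tests, perl = TRUE)

print(as_word)
#[1]  TRUE  TRUE  TRUE FALSE FALSE
print(inside_word)
#[1] FALSE FALSE FALSE  TRUE FALSE
```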
I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace with just the single captured character. Note that this also collapses legitimate double letters, e.g. "book" becomes "bok".
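The general backreference pattern can be tried on a few inputs to see both its use and its side effect. In this sketch the second and third strings are made up for illustration, and perl = TRUE is an optional choice made here, not something the answer requires.

```r
x <- c("something-------another--thing", "aaabbbccc", "book")

# Collapse any run of a repeated character into a single occurrence.
squeezed <- gsub("(.)\\1+", "\\1", x, perl = TRUE)

print(squeezed)
#[1] "something-another-thing" "abc"                     "bok"
```

The last result shows why the hyphen-specific pattern is usually the safer choice for ordinary text.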
You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, the basic pattern is a single hyphen, pattern = "-". Adding the plus quantifier makes it match one or more consecutive hyphens, so each run is replaced by a single hyphen: pattern = "-+"
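The patterns "-{2,}" and "-+" differ only in whether a lone hyphen counts as a match; since a lone hyphen is replaced by itself, the results are the same. A minimal base R check, where the second test string is made up for illustration:

```r
x <- c("something-------another--thing", "already-clean")

two_or_more <- gsub("-{2,}", "-", x)  # only touches runs of 2+ hyphens
one_or_more <- gsub("-+", "-", x)     # also "replaces" single hyphens with "-"

print(two_or_more)
#[1] "something-another-thing" "already-clean"
print(identical(two_or_more, one_or_more))
#[1] TRUE
```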
I would like to count how many words in a given string start with a given letter.
For example, in the phrase: "that pattern is great but pigs likes milk"
if I want to find the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary to keep the letter from matching anywhere other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
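As a cross-check that does not rely on \b, the count can also be computed by splitting each string into words and testing the first letter. This is a sketch under assumptions: the helper name count_words_starting is made up, words are taken to be runs of alphabetic characters, and the second test sentence is invented for illustration.

```r
# Hypothetical helper: split on anything that is not a letter, then count
# the words whose first letter matches (case-insensitively).
count_words_starting <- function(text, letter) {
  words <- strsplit(tolower(text), "[^[:alpha:]]+")
  vapply(words, function(w) sum(startsWith(w, tolower(letter))), integer(1))
}

x <- c("that pattern is great but pogintless", "Good grief, George!")

by_split <- count_words_starting(x, "g")
by_regex <- lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))

print(by_split)
#[1] 1 3
print(by_regex)
#[1] 1 3
```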
You can also do it with the stringr library. str_count() works on the string directly, so there is no need to split it first:
library(stringr)
str_count(x, "\\bg")