How to specify a repeated pattern within one string using `sub()` instead of `gsub()` in R - r

I am aware of the many answers showing how to match multiple occurrences within a single string. However, I couldn't yet find an answer that would provide context as to why the following doesn't work:
## A string for which I want to replace `red` and `Red` with `RED`
x <- c("redflag flagred red and Red")
## This one works using `gsub()`
gsub("\\b(?:red|Red)\\b", "RED", x)
#[1] "redflag flagred RED and RED"
But is there a way to use sub() instead? The following doesn't work. It only matches the first occurrence and then stops:
sub("\\b(?:red|Red)\\b", "RED", x)
#[1] "redflag flagred RED and Red"
When I check the pattern on regex101, it matches as expected: https://regex101.com/r/X7DSB0/1. I am assuming this has something to do with the "global flag"?
I also tried adding a + or {1,} to get multiple matches but that doesn't work either:
## using a `+` doesn't work either
sub("\\b(?:red|Red)+\\b", "RED", x)
#[1] "redflag flagred RED and Red"
## using `{1,}` doesn't work either
sub("\\b(?:red|Red){1,}\\b", "RED", x)
#[1] "redflag flagred RED and Red"
What am I not understanding? How could I use sub() instead of gsub() for such an operation?

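The g in gsub stands for "global": it tells the regex engine to replace every match in the string. sub, on the other hand, replaces only the first match it encounters, which is what regex101 shows when its global flag is switched off. Adding + or {1,} does not help, because a quantifier only lets a single match span consecutive repetitions (such as redRed); the two standalone words in x are still separate matches.
So the answer to your question is that you should use gsub if you intend to make every possible replacement: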
gsub("\\b(?:red|Red)\\b", "RED", x)
[1] "redflag flagred RED and RED"
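As a rough illustration of the difference, the sketch below contrasts the two functions and shows what a quantifier actually changes. The variable names are made up for this example, and the while loop is only a hypothetical workaround for emulating gsub() with sub(), not a recommended idiom:

```r
x <- "redflag flagred red and Red"
pattern <- "\\b(?:red|Red)\\b"

# sub() replaces the first match only; gsub() replaces every match.
first_only  <- sub(pattern, "RED", x)
all_matches <- gsub(pattern, "RED", x)

# A quantifier lets ONE match span consecutive repetitions, nothing more:
# "redRed" is consumed as a single match, the later "red" is left alone.
adjacent <- sub("\\b(?:red|Red)+\\b", "RED", "redRed and red")

# Emulating gsub() with sub(): repeat until nothing matches any more.
# This terminates only because the replacement "RED" does not itself
# match the (case-sensitive) pattern.
looped <- x
while (grepl(pattern, looped)) {
  looped <- sub(pattern, "RED", looped)
}
```

The loop ends up with the same string as the single gsub() call, which is why gsub() remains the simpler choice here.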

Related

Is there a list of all available color options for `col` in R plot? [duplicate]

Short question, if I have a string, how can I test if that string is a valid color representation in R?
Two things I tried, first uses the function col2rgb() to test if it is a color:
isColor <- function(x)
{
res <- try(col2rgb(x),silent=TRUE)
return(!"try-error"%in%class(res))
}
> isColor("white")
[1] TRUE
> isColor("#000000")
[1] TRUE
> isColor("foo")
[1] FALSE
This works, but it doesn't seem very pretty and isn't vectorized. The second is to check whether the string is in the colors() vector or is a # followed by 6 to 8 hexadecimal digits:
isColor2 <- function(x)
{
return(x%in%colors() | grepl("^#(\\d|[a-f]){6,8}$",x,ignore.case=TRUE))
}
> isColor2("white")
[1] TRUE
> isColor2("#000000")
[1] TRUE
> isColor2("foo")
[1] FALSE
Which works though I am not sure how stable it is. But it seems that there should be a built in function to make this check?
Your first idea (using col2rgb() to test color names' validity for you) seems good to me, and just needs to be vectorized. As for whether it seems pretty or not ... lots/most R functions aren't particularly pretty "under the hood", which is a major reason to create a function in the first place! Hides all those ugly internals from the user.
Once you've defined areColors() below, using it is easy as can be:
areColors <- function(x) {
sapply(x, function(X) {
tryCatch(is.matrix(col2rgb(X)),
error = function(e) FALSE)
})
}
areColors(c(NA, "black", "blackk", "1", "#00", "#000000"))
# <NA> black blackk 1 #00 #000000
# TRUE TRUE FALSE TRUE FALSE TRUE
Update, given the edit
?par gives a thorough description of the ways in which colours can be specified in R. Any test for a valid colour must consider:
A named colour as listed in colors()
A hexadecimal representation, as a character, of the form "#RRGGBBAA" specifying the red, green, blue and alpha channels. The alpha channel is for transparency, which not all devices support; hence, whilst it is valid to specify a colour in this way with 8 hex digits, it may not be valid on a specific device.
NA is a valid "colour". It means transparent, but as far as R is concerned it is a valid colour representation.
Likewise "transparent" is also valid, but not in colors(), so that needs to be handled as well
1 is a valid colour representation as it is the index of a colour in a small palette of colours as returned by palette()
> palette()
[1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow"
[8] "gray"
Hence you need to cope with 1:8. Why is this important? ?par tells us that it is also valid to represent the index for these colours as a character, hence you need to capture "1" as a valid colour representation. However (as noted by @hadley in the comments) this is just for the default palette. A user may have set another palette, in which case you will have to consider a character index to an element of a vector of the maximum allowed length for your version of R.
Once you've handled all those you should be good to go ;-)
To the best of my knowledge there isn't a user-visible function that does this. All of this is buried away inside the C code that does the plotting; very quickly you end up in .Internal(....) land and there be dragons!
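The cases listed above can be tried against col2rgb() directly. This is a minimal sketch rather than a definitive validator: is_valid_colour is a name invented for this example, it reuses the tryCatch idea from areColors() with vapply(), and the result for "1" depends on the palette in use:

```r
# Vectorised check built on col2rgb(); `is_valid_colour` is a hypothetical
# helper name, not a function provided by R.
is_valid_colour <- function(x) {
  vapply(x, function(col) {
    tryCatch(is.matrix(col2rgb(col)), error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)
}

# One candidate per case: named colour, hex with alpha, NA,
# "transparent", palette index as a character, and two invalid strings.
candidates <- c("goldenrod", "#FF000080", NA, "transparent", "1",
                "blackk", "#00")
valid <- is_valid_colour(candidates)

# "transparent" is accepted by col2rgb() even though colors() does not list it.
in_colors <- "transparent" %in% colors()
```

Here valid is a logical vector aligned with candidates, so candidates[!valid] lists the strings that col2rgb() rejected.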
Original
[To be pedantic #000000 isn't a colour name in R.]
The only colour names R knows are those returned by colors(). Yes, #000000 is one of the colour representations that R understands but you specifically ask about a name and the definitive list or solution is x %in% colors() as you have in your second example.
This is about as stable as it gets. When you use a colour like col = "goldenrod", internally R matches this with a "proper" representation of the colour for whichever device you are plotting on. colors() returns the list of colour names for which R can do this lookup. If it isn't in colors() then it isn't a colour name.

Extract substring match from agrep

My goal is to identify whether a given text contains a target string, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).
Example:
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."
Desired Output:
I would like to have target strlng as the output, since it is very close to the target (Levenshtein distance of 1). Next I want to use target strlng to extract the word Butter (this part I have covered; I only mention it to give a detailed spec).
What i tried:
Using adist did not work, since it compares two whole strings, not substrings.
Next I took a look at agrep, which seems very close. I can get the output that my target was found, but not the substring that "caused" the match.
I tried value = TRUE, but it seems to work at the array (vector element) level. I don't think I can switch to an array, because I cannot split by spaces (my target string might contain spaces, ...).
agrep(
pattern = target,
x = text,
value = TRUE
)
Use aregexec. Its use is similar to that of regexpr/regmatches (or gregexpr) for extracting exact matches.
m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
This can be wrapped in a function that uses the arguments of both aregexec and regmatches. Note that in the latter case, the function argument invert comes after the dots argument ... so it must be a named argument.
aregextract <- function(pattern, text, ..., invert = FALSE){
m <- aregexec(pattern, text, ...)
regmatches(text, m, invert = invert)
}
aregextract(target, text)
#[[1]]
#[1] "target strlng"
aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "
#[2] ": Butter. this text i dont want to extract."
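Building on the wrapper above, the following sketch shows one way the match position might be used to get at the text that follows the approximate match. It assumes a tolerance of one edit (max.distance = 1L) and a simple "first word after the match" rule for the asker's Butter step; both are choices made for this example, not part of the original answer:

```r
target <- "target string"
text   <- "the target strlng: Butter. this text i dont want to extract."

# max.distance = 1L allows at most one insertion, deletion or substitution.
m   <- aregexec(target, text, max.distance = 1L)
hit <- regmatches(text, m)[[1]]

# Start position and length of the approximate match, then the text after it.
start <- m[[1]][1]
len   <- attr(m[[1]], "match.length")[1]
rest  <- substring(text, start + len)

# First word following the match (skipping leading punctuation and spaces).
next_word <- sub("^\\W*(\\w+).*$", "\\1", rest)
```

If no approximate match is found, hit is empty and start is -1, so real code would need to check for that before using rest.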

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How can I achieve this in a single call of str_subset? What is the appropriate syntax for the pattern "not age"? As you can see, I am not very experienced with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
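Two further options may be worth knowing, sketched here under stated assumptions: a "not age" pattern can be written as a negative lookahead (which needs perl = TRUE in base R), and stringr 1.4.0 or later has a negate argument on str_subset(). The stringr call is guarded because the package and that version are assumed, not guaranteed, to be installed:

```r
x <- c("age_mean", "n_aitd", "n_sle", "age_sd", "n_poly", "n_sero", "child_age")

# Base R: invert the match.
not_age_base <- grep("age", x, invert = TRUE, value = TRUE)

# A regex that itself means "does not contain age": negative lookahead
# anchored at the start of the string (requires the PCRE engine).
not_age_lookahead <- grep("^(?!.*age)", x, perl = TRUE, value = TRUE)

# stringr >= 1.4.0 (assumed): a single str_subset() call with negate = TRUE.
if (requireNamespace("stringr", quietly = TRUE) &&
    utils::packageVersion("stringr") >= "1.4.0") {
  not_age_stringr <- stringr::str_subset(x, "age", negate = TRUE)
}
```

All three approaches select the same elements; the negate argument is the closest to the single str_subset() call asked for.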

Use the grep function to find words which contain either blue or red

I'm using the grep function to select certain column heads. The heads I want to select should contain exactly "red" or "blue".
I got the red part to work using the following (I stored the column names in a variable called x):
x <- c("Red", "Blue", "blue", "green")
grep("^red$", x, varnames=TRUE)
But I can't figure out how to look for red OR blue... Any thoughts?
grep("^(red|blue)$", x, varnames=TRUE)
This doesn't seem to work...
If the search is not supposed to be case-sensitive, then I'd suggest the following:
> x <- c("Red", "Blue", "blue", "green")
> grep("^(red|blue)$",tolower(x))
[1] 1 2 3
grep("red|blue", x, ignore.case=T, value=T) # returns [1] "Red" "Blue" "blue"
If you require the match to be case-sensitive, remove the ignore.case=T.
If you require a case-sensitive match to the entire string (which is what you get when you use the assertions ^ and $) then you are basically asking for x[x=="blue"|x=="red"], which may be more efficient than a regex.
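The variants discussed above can be compared side by side. This is a small sketch using the same x; the result variable names are invented for the example, and %in% is used as an equivalent of the x == "blue" | x == "red" comparison:

```r
x <- c("Red", "Blue", "blue", "green")

# Whole-string match, ignoring case: anchors plus ignore.case.
any_case <- grep("^(red|blue)$", x, ignore.case = TRUE, value = TRUE)

# Whole-string match, case-sensitive: only the lower-case "blue" qualifies.
exact_case <- grep("^(red|blue)$", x, value = TRUE)

# No regex at all: compare against the set of wanted names.
by_set          <- x[x %in% c("red", "blue")]
by_set_any_case <- x[tolower(x) %in% c("red", "blue")]
```

Without the ^ and $ anchors, "red|blue" would also match longer names that merely contain those letters, such as "bored".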

using gsub to find all values that are NOT equal in R

I am trying to use gsub to change values in an igraph vertex variable to colors before I plot a network graph.
The issue is that my graph has 3 values that I care about, and many others that I'd just like to group as "other" and assign 1 color to.
For example, if I had data that looks like this:
Name........Value
A............1
B............2
C............3
D............4
E............5
and I had code like this:
V(g)$color=V(g)$value #assign the "Value" attribute as the vertex color
V(g)$color=gsub("1","red",V(g)$color) #1 will be red
V(g)$color=gsub("2","blue",V(g)$color) #2 will be blue
V(g)$color=gsub("3", "yellow", V(g)$color) #3 is yellow
What line of code could I add to make 4 and 5 into some other color, (green for example)? Thanks so much for any help you might have!
I would avoid gsub (this is not about matching patterns) and use a lookup with match instead:
my.colors <- c("red", "blue", "yellow", "green")
V(g)$color <- my.colors[match(V(g)$value, c(1, 2, 3), nomatch = 4)]
It looks like this suffices for what you want to do:
x <- c("1","2","3","4","5")
gsub("4|5", "green", x)
[1] "1" "2" "3" "green" "green"
Or this
gsub("[^1-3]", "green", x)
[1] "1" "2" "3" "green" "green"
However as pointed out in other answers it looks like a better idea to set up a lookup table mapping numbers to colors and use match to determine the color.
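A lookup table of this kind might look like the sketch below. It works on a plain vector standing in for V(g)$value, so igraph is not needed to run it; colour_map and the other names are made up for the example. The last line also shows why gsub() is fragile here, since it replaces digits inside larger numbers:

```r
value <- c(1, 2, 3, 4, 5)   # stand-in for V(g)$value

# Named vector as a lookup table: names are the values, elements the colours.
colour_map <- c("1" = "red", "2" = "blue", "3" = "yellow")

# Look each value up by name; anything not in the table comes back as NA.
vertex_colour <- unname(colour_map[as.character(value)])
vertex_colour[is.na(vertex_colour)] <- "green"   # the "other" group

# The same result with match(), as in the answer above.
my_colours <- c("red", "blue", "yellow", "green")
via_match  <- my_colours[match(value, c(1, 2, 3), nomatch = 4)]

# Pitfall of pattern replacement: the "1" inside "10" is replaced too.
pitfall <- gsub("1", "red", "10")
```

With the lookup, adding a fourth named colour only means adding one entry to colour_map, and values such as 10 fall into the "other" group instead of being partially rewritten.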
Assuming that after you have made the initial substitutions, the only numbers left are the ones you want to be one uniform color, you can use a regex to match all contiguous digits and put the same color for them.
V(g)$color=gsub("\\d+", "green",V(g)$color)
See this page for gsub regular expressions.
