R: Drop all non-matching letters of a string vector

I have a string vector
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
I want to remove all characters from this vector that do not
match "a-z", "A-Z" and "'"
I tried to use
gsub("![a-zA-z'], "", d)
but that doesn't work.

We could make your replacement pattern even tighter by doing a case-insensitive replacement:
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
gsub("[^a-z]", "", d, ignore.case=TRUE)
[1] "sladfjrn" "assadfn" "Hlbasdflk"

We can use ^ inside the square brackets to match every character except the ones specified within the brackets
gsub("[^a-zA-Z]", "", d)
#[1] "sladfjrn" "assadfn" "Hlbasdflk"
data
d <- c("sladfj0923rn2", "ääas230ß0sadfn", "823Höl32basdflk")
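
Neither pattern above keeps the apostrophe that the question also asked for, and the original attempt used the range A-z rather than A-Z. The following is a minimal sketch of both points. The vector d2 is a made-up example (not from the question), and perl = TRUE is used in the second call only so that the range is interpreted by code point.

```r
# Hypothetical input containing apostrophes and ASCII punctuation
d2 <- c("don't stop 123", "O'Neil_99", "it's [ok]^")

# Keep ASCII letters and the apostrophe; drop everything else
keep_letters <- gsub("[^a-zA-Z']", "", d2)
keep_letters
# [1] "don'tstop" "O'Neil"    "it'sok"

# A-z spans every code point from "A" to "z", which also covers [ \ ] ^ _ `
too_wide <- gsub("[^a-zA-z']", "", d2, perl = TRUE)
too_wide
# [1] "don'tstop" "O'Neil_"   "it's[ok]^"
```

The second result still contains the underscore, brackets and caret, which is why A-Z is the range to use for uppercase letters.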

Related

Cut a string from right to left until a certain character is met R

I am looking to manipulate/cut a character string from right to left until a particular character is met
I want to take this:
a <- "L1.L2.L3.L4.L5"
And output this: a <- "L5"
I have specifically worded this problem as needing to cut the string from right to left because the strings can be of variable length and the output string can be of variable length as well
For example the code needs to work on:
b <- "L1.L555"
c <- "L1.L2.L3.L4.L5.L6.LLLL"
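We can use sub to match any characters (.*) up to and including a literal dot and replace the match with an empty string (""). Because . is a metacharacter that matches any character, we escape it (\\.) to match it literally. Because .* is greedy, the match extends to the last dot in the string.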
sub(".*\\.", "", a)
#[1] "L5"
sub(".*\\.", "", b)
#[1] "L555"
sub(".*\\.", "", c)
#[1] "LLLL"
Or using trimws (the whitespace argument requires R 3.6.0 or later)
trimws(a, whitespace = ".*\\.")
#[1] "L5"
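
Because sub is vectorised, one call handles all three strings at once. As a cross-check that avoids regular expressions, one could also split on the literal dot and keep the last piece. This is only a sketch, and the names strings, pieces and last_piece are introduced here for illustration.

```r
strings <- c("L1.L2.L3.L4.L5", "L1.L555", "L1.L2.L3.L4.L5.L6.LLLL")

# Greedy .* consumes everything up to the last literal dot
by_regex <- sub(".*\\.", "", strings)

# Split on a literal "." (fixed = TRUE) and keep the final piece
pieces <- strsplit(strings, ".", fixed = TRUE)
last_piece <- vapply(pieces, function(p) tail(p, 1), character(1))

by_regex
# [1] "L5"   "L555" "LLLL"
identical(by_regex, last_piece)
# [1] TRUE
```

The two approaches can differ on edge cases such as a string that ends in a dot, so it is worth checking them against the real data.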

How to match distinct repeated characters

I'm trying to come up with a regex in R to match strings in which there is repetition of two distinct characters.
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
This regex matches all of the above, including strings such as "mmmm" and "ohhhh" where the repeated letter is the same in the first and the second repetition:
grep(".*([a-z])\\1.*([a-z])\\2", x, value = T)
What I'd like to match in x are these strings where the repeated letters are distinct:
"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"
How can the regex be tweaked to make sure the second repeated character is not the same as the first?
You may restrict the second char pattern with a negative lookahead:
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
#                   ^^^^^^^
(?!\\1)([a-z]) means match and capture into Group 2 any lowercase ASCII letter if it is not the same as the value in Group 1.
R demo:
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
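
To see what the lookahead changes, it may help to run the pattern with and without it on a few of the strings. This is a small sketch on a subset of x (the names y, loose and strict are introduced here). The leading .* is dropped because grepl already searches anywhere in the string.

```r
y <- c("cooee", "mmmm", "ohhhh", "sshh")

# Two doubled letters, with no constraint that they differ
loose <- grepl("([a-z])\\1.*([a-z])\\2", y)

# (?!\\1) rejects a second letter equal to Group 1; lookaheads need perl = TRUE
strict <- grepl("([a-z])\\1.*(?!\\1)([a-z])\\2", y, perl = TRUE)

data.frame(y, loose, strict)
```

All four strings pass the loose pattern, while only "cooee" and "sshh" pass the strict one.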
If you can avoid regex altogether, then I think that's the way to go. A rough example:
nrep <- sapply(
  strsplit(x, ""),
  function(y) {
    run_lengths <- rle(y)
    length(unique(run_lengths$values[run_lengths$lengths >= 2]))
  }
)
x[nrep > 1]
# [1] "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
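
The intermediate values for a single string show why this works: rle collapses consecutive identical characters into runs, and the answer counts how many distinct characters have a run of length two or more. A sketch for one element (the names chars, runs and repeated are introduced here):

```r
chars <- strsplit("whoopee", "")[[1]]
runs <- rle(chars)

runs$values
# [1] "w" "h" "o" "p" "e"
runs$lengths
# [1] 1 1 2 1 2

# Distinct characters that occur in a run of length >= 2
repeated <- unique(runs$values[runs$lengths >= 2])
repeated
# [1] "o" "e"
```

Note that, unlike the [a-z] regex, this counts runs of any character, not only lowercase ASCII letters.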

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
  is_char <- grepl("[[:alpha:]]", x)
  has_number <- grepl("\\d", x)
  mixed <- is_char & has_number
  x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
  grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
  unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for elements where numbers and letters are mixed together and remove the numbers (and dots). Then we define groups for collapsing by checking whether an element with letters comes directly after another one, so that consecutive letter elements share a group. Finally, we collapse all the values in the same group.
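
The grouping step is the least obvious part, so here is a sketch of its intermediate values. The names prev_is_char and starts_group are introduced for illustration, and the condition is a logically equivalent simplification of the one used in clean().

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')

is_char <- grepl("[[:alpha:]]", myVec)
# Whether the previous element contained letters (FALSE for the first element)
prev_is_char <- c(FALSE, head(is_char, -1))

# An element starts a new group unless it and its predecessor both contain letters
starts_group <- !(is_char & prev_is_char)
grp <- cumsum(starts_group)
grp
# [1] 1 2 2 3 4 5 5 6 7 8
```

Elements 2 and 3 share group 2 and elements 6 and 7 share group 5, which is exactly where tapply pastes the pieces together.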
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
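
Printing the intermediate strings makes the last step easier to follow. The sketch below repeats the same steps with named intermediates (cleaned, joined, merged and result are names introduced here). It assumes that no element contains the | delimiter itself.

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')

# Strip non-letters from every element that contains a letter
cleaned <- myVec
has_letters <- grepl("[A-Za-z]", cleaned)
cleaned[has_letters] <- gsub("[^A-Za-z]", "", cleaned[has_letters])

# Join with "|" as a temporary delimiter
joined <- paste(cleaned, collapse = "|")
joined
# [1] "1.2|asd|gkd|232|4343|zyz|fva|3213|1232|dasd"

# Drop each "|" that has a letter on both sides, then split again
merged <- gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", joined, perl = TRUE)
result <- strsplit(merged, "|", fixed = TRUE)[[1]]
result
# [1] "1.2"    "asdgkd" "232"    "4343"   "zyzfva" "3213"   "1232"   "dasd"
```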

How to substring a char vector using patterns in R?

I have this kind of char vector:
"MODIS.evi.2013116.yL2.BOKU.tif"
The number in the middle of the string is going to change, and the word evi will sometimes change to ndvi.
I want to use substr (or maybe another function) to take the part of the string after the second dot (.), i.e. just 2013116.yL2.BOKU.tif, whether the string is MODIS.evi.2013116.yL2.BOKU.tif or MODIS.ndvi.2013116.yL2.BOKU.tif.
We can use sub to match two instances of one or more characters that are not a . followed by a ., anchored at the start (^) of the string, and replace them with an empty string ("")
sub("^([^.]+\\.){2}", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
If the part to keep always starts with a digit, the above can be simplified to match one or more non-digit characters at the start (^) of the string and replace them with an empty string
sub("^\\D+", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
data
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
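
The two patterns agree on str1 but are not interchangeable in general. As a sketch, consider a made-up file name (not from the question) whose second field contains a digit. The names f, by_fields and by_digits are introduced here.

```r
# Hypothetical name: the second field "evi2" contains a digit
f <- "MODIS.evi2.2013116.yL2.BOKU.tif"

# Remove exactly two "field." prefixes
by_fields <- sub("^([^.]+\\.){2}", "", f)
by_fields
# [1] "2013116.yL2.BOKU.tif"

# Remove leading non-digits: this stops at the first digit, inside "evi2"
by_digits <- sub("^\\D+", "", f)
by_digits
# [1] "2.2013116.yL2.BOKU.tif"
```

If the first two fields may ever contain digits, the field-counting pattern is the safer choice.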
This deletes all leading non-digit characters in s:
sub("^\\D*", "", s)
If s is as in the Note at the end then the result of running the above is:
[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
Note:
s <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
l = c("MODIS.evi.2013116.yL2.BOKU.tif","MODIS.ndvi.2013116.yL2.BOKU.tif")
sapply(l, function(x) strsplit(x, "vi.", fixed = T)[[1]][2])

Extract first X Numbers from Text Field using Regex

I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only replaced the first match. .? matches any single character, and it's optional (a plain . would fail on the element without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference the others with '\\2' and so on. Inside the group (where you already had the syntax correct) I want only digits, and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
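
Base R can also extract the match directly instead of replacing everything around it. A sketch using regexpr and regmatches, which returns the first run of four digits in each string (first_four is a name introduced here). Note that regmatches silently drops elements with no match, so the result can be shorter than the input.

```r
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")

# regexpr finds the first match per element; regmatches pulls out the matched text
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
first_four
# [1] "2134" "0983" "8723"
```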
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above line down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group-capturing approach that doesn't assume 4 digits:
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
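
The third group in that pattern requires at least one non-digit after the digits, so a string that ends in its first run of digits is returned unchanged. A hedged variant that anchors at the start and makes the trailing part optional is sketched below. The vector y is a made-up example, and original and variant are names introduced here.

```r
y <- c("P2134.asfsafasfs", "P12345", "ab77cd88")

# Original pattern: no match for "P12345", so it comes back unchanged
original <- gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", y)
original
# [1] "2134"   "P12345" "77"

# Variant: skip leading non-digits, capture the first run of digits, drop the rest
variant <- sub("^\\D*(\\d+).*", "\\1", y)
variant
# [1] "2134"  "12345" "77"
```

Like the original, the variant leaves a string with no digits at all unchanged, so such inputs would need to be filtered separately.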
