deleting multiple substrings of string - r

I'm using R and I have a vector of strings with 1 and 2.
Examples of strings could be the following:
"11111111******111"
"11111111111***2222222"
"1111*****22222**111*****1111"
where "*" denote a gap.
I'm interested in deleting runs of gaps whose length is at most a given number n.
Example with sequences above:
I decided that n=3, so...
1. "11111111******111"
2. "111111111112222222"
3. "1111*****22222111*****1111"
In the second and third strings the "function" deleted a run of 3 gaps and a run of 2 gaps, because I wanted to delete every run of gaps whose length is less than or equal to 3.

Maybe we can do
n <-3
pat <- sprintf("(?<=[0-9])\\*{1,%d}(?=[0-9])", n)
gsub(pat, "", v1, perl = TRUE)
#[1] "11111111******111" "111111111112222222"
#[3] "1111*****22222111*****1111"
data
v1 <- c("11111111******111", "11111111111***2222222", "1111*****22222**111*****1111")

Similar to @akrun's answer:
x<- list("11111111******111",
"11111111111***2222222",
"1111*****22222**111*****1111")
lapply(x, function(x) gsub("(\\d)\\*{1,3}(\\d)", "\\1\\2", x, perl = TRUE))

gsub('(?<=\\d)(\\*{1,3})(?=\\d)','',v1,perl=T)
[1] "11111111******111" "111111111112222222" "1111*****22222111*****1111"
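For reuse, the lookaround pattern from the answers above can be wrapped in a small function that takes n as an argument. This is only a sketch: the name drop_short_gaps is made up for illustration, and it assumes gaps are always written as "*" and sit between digits (leading or trailing gaps are left alone).

```r
# Hypothetical helper: remove runs of "*" of length 1..n that sit between digits.
drop_short_gaps <- function(x, n) {
  # Lookbehind/lookahead require perl = TRUE
  pat <- sprintf("(?<=[0-9])\\*{1,%d}(?=[0-9])", n)
  gsub(pat, "", x, perl = TRUE)
}

v1 <- c("11111111******111", "11111111111***2222222",
        "1111*****22222**111*****1111")
drop_short_gaps(v1, 3)
```

Because both ends of the run are anchored on a digit, a run longer than n cannot be partially matched, so it stays intact.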

Related

R padding 0's inside a string after the hyphen

I have the following data
GT_BUC-01_BUCST-19
ADT_BURC-1_BUCST-09
BT_BUDDC-1_BUDSCST-29
CAST_BUC-31_BUCST-9
CAST_BUC-1_BUCST-9
How do I use R to pad the numbers after both hyphens with leading zeros so that each has two digits? The resulting format should look like this:
GT_BUC-01_BUCST-19
ADT_BURC-01_BUCST-09
BT_BUDDC-01_BUDSCST-29
CAST_BUC-31_BUCST-09
CAST_BUC-01_BUCST-09
One option would be to use stringr::str_replace_all
x <- c('GT_BUC-01_BUCST-19', 'ADT_BURC-1_BUCST-09',
'BT_BUDDC-1_BUDSCST-29', 'CAST_BUC-31_BUCST-9', 'CAST_BUC-1_BUCST-9')
stringr::str_replace_all(x, '\\d+', function(m) sprintf('%02d', as.integer(m)))
#[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09"
#[3] "BT_BUDDC-01_BUDSCST-29" "CAST_BUC-31_BUCST-09"
#[5] "CAST_BUC-01_BUCST-09"
You could try using gsub as follows:
x <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl=TRUE)
x
[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09" "BT_BUDDC-01_BUDSCST-29"
[4] "CAST_BUC-31_BUCST-09" "CAST_BUC-01_BUCST-09"
Data:
x <- c("GT_BUC-01_BUCST-19",
"ADT_BURC-1_BUCST-09",
"BT_BUDDC-1_BUDSCST-29",
"CAST_BUC-31_BUCST-9",
"CAST_BUC-1_BUCST-9")
The regex pattern used here matches a dash followed by exactly one digit (the negative lookahead rules out a second digit). We then replace the match by prepending a zero to that single digit.
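If the target width might change, a base R variant can locate each number after a hyphen and reformat it numerically. This is a rough sketch, not taken from the answers above: the function name pad_after_hyphen and the width argument are assumptions made for illustration.

```r
# Hypothetical helper: zero-pad every digit run that directly follows a hyphen.
pad_after_hyphen <- function(x, width = 2) {
  # Find digit runs preceded by "-" (the lookbehind needs perl = TRUE)
  m <- gregexpr("(?<=-)\\d+", x, perl = TRUE)
  fmt <- paste0("%0", width, "d")
  # Replace each matched run with its zero-padded form
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(d) sprintf(fmt, as.integer(d)))
  x
}

x <- c("GT_BUC-01_BUCST-19", "ADT_BURC-1_BUCST-09", "BT_BUDDC-1_BUDSCST-29",
       "CAST_BUC-31_BUCST-9", "CAST_BUC-1_BUCST-9")
pad_after_hyphen(x)
```

Numbers that already have at least width digits come back unchanged in length, since the format only sets a minimum width.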

How to match distinct repeated characters

I'm trying to come up with a regex in R to match strings in which there is repetition of two distinct characters.
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
This regex matches all of the above, including strings such as "mmmm" and "ohhhh" where the repeated letter is the same in the first and the second repetition:
grep(".*([a-z])\\1.*([a-z])\\2", x, value = T)
What I'd like to match in x are these strings where the repeated letters are distinct:
"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"
How can the regex be tweaked to make sure the second repeated character is not the same as the first?
You may restrict the second char pattern with a negative lookahead:
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
#                   ^^^^^^^
(?!\\1)([a-z]) means match and capture into Group 2 any lowercase ASCII letter if it is not the same as the value in Group 1.
R demo:
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
If you can avoid regex altogether, then I think that's the way to go. A rough example:
nrep <- sapply(
strsplit(x, ""),
function(y) {
run_lengths <- rle(y)
length(unique(run_lengths$values[run_lengths$lengths >= 2]))
}
)
x[nrep > 1]
# [1] "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
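As a sanity check, the two approaches can be run side by side on the sample data. The sketch below assumes all input is lowercase ASCII; the names by_regex, by_rle and count_repeated are made up for illustration.

```r
x <- c("aaaaaaah", "aaaah", "ahhhh", "cooee", "helloee", "mmmm", "noooo",
       "ohhhh", "oooaaah", "ooooh", "sshh", "ummmmm", "vroomm", "whoopee",
       "yippee")

# Regex route: a doubled letter, later followed by a different doubled letter
by_regex <- grep("([a-z])\\1.*(?!\\1)([a-z])\\2", x, value = TRUE, perl = TRUE)

# Run-length route: count distinct characters that occur in a run of 2 or more
count_repeated <- function(s) {
  runs <- rle(strsplit(s, "")[[1]])
  length(unique(runs$values[runs$lengths >= 2]))
}
by_rle <- x[vapply(x, count_repeated, integer(1)) > 1]

identical(by_regex, by_rle)
```

The leading and trailing .* from the original pattern are dropped here because grep only needs a match somewhere in the string.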

fastest way to split strings into fixed-length elements in R

How to split a string into elements of fixed length in R is a commonly asked question, and typical answers rely on either substring(x) or strsplit(x, split = "") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { #can process a vector of strings. "size" is the length of desired substrings
chars <- strsplit(string,"",T)
lengths <- nchar(string)
nFullGroups <- floor(lengths/size) #the number of complete substrings of the desired size
#here we prepare a vector of separators (commas), which we will replace by the characters, except at the positions that will have to separate substring groups of length "size". Assumes that the string doesn't have any commas.
seps <- Map(rep, ",", lengths + nFullGroups) #so the seps vector is longer than the chars vector, because there are separators (as many as there are groups)
indices <- Map(seq, 1, lengths + nFullGroups) #the positions at which separators will be replaced by the characters
indices <- lapply(indices, function(x) which(x %% (size+1) != 0)) #those exclude the positions at which we want to retain the separators (I haven't found a better way to generate such vector of indices)
temp <- function(x,y,z) { #a function describing the replacement, because we call it in the Map() call below
x[y] <- z
x
}
res <- Map(temp, seps, indices, chars) #so now we have a vector of chars with separators interspersed
res <- sapply(res, paste, collapse="", USE.NAMES=F) #collapses the characters and separators
res <- strsplit(res, ",", T) #and at last, we can split the strings into elements of the desired length
}
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
So if you can find something faster, let me know. If not, well my function may be of some use. :)
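For reference, the matrix variant mentioned above (the one reported as much slower) might look roughly like this. It is a sketch only, handles a single non-empty string, and the name splitViaMatrix is invented here.

```r
# Hypothetical sketch of the matrix approach, for one non-empty string.
splitViaMatrix <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  # Pad with empty strings so the characters fill whole columns
  pad <- (-length(chars)) %% size
  mat <- matrix(c(chars, rep("", pad)), nrow = size)
  # Each column holds one chunk; collapse it back into a string
  apply(mat, 2, paste, collapse = "")
}

splitViaMatrix("azertyuiop", 3)
```

The padding with "" is what lets the last, shorter chunk survive without matrix() recycling characters from the start of the string.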
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's (as str_extract_all uses the same function under the hood, IIRC).
library(stringr)
tensiSplit <- function(string,size) {
str_extract_all(string, paste0('.{1,',size,'}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind to match the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
pat <- paste0('(?<=.{',size,'})')
strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Another approach is to use stri_extract_all_regex from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
pat <- paste0('.{1,', size, '}')
stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed = TRUE)
Here using a space as separator, so the string must not contain spaces.
(didn't think about [[:alnum:]]{}).
How can I mark my own question as a duplicate? :(
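The benchmark above calls a gsubSplit() whose definition is not shown anywhere in the thread. A plausible reconstruction, based on the gsub one-liner, is sketched below. The function body and the sep argument are guesses, and it assumes the input is purely alphanumeric and does not contain the separator.

```r
# Guessed reconstruction of gsubSplit(); not the benchmarked original.
gsubSplit <- function(string, size, sep = " ") {
  # Build the pattern with the chunk size pasted in, e.g. "([[:alnum:]]{3})"
  pat <- paste0("([[:alnum:]]{", size, "})")
  # Append the separator after every full chunk, then split on it
  strsplit(gsub(pat, paste0("\\1", sep), string), sep, fixed = TRUE)
}

gsubSplit("azertyuiop", 3)
```

When the string length is an exact multiple of size, the trailing separator produces no extra empty element, because strsplit drops a match at the end of the string.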

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote that occur before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw the "before the first u" requirement. In that case, we can get the position of the first 'u' with either which or match.
Then use grepl with ignore.case=TRUE on the elements of 'chars' (the strsplit output from the OP's code) up to that position (ind) to get a logical vector marking each 'r' or 'R', and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two substitutions on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub matches every character except R and r ([^Rr]) and replaces it with ''. We can then use nchar to count the characters that remain.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r's in the entire string, use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length.
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
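One caveat with the length(gregexpr(...)) idiom: when there is no match, gregexpr returns -1, so the length is 1 rather than 0. A more defensive sketch that counts only real matches, and also copes with strings that contain no 'u', could look like this. The name count_r_before_u is made up for illustration.

```r
# Hypothetical helper: count 'r'/'R' before the first 'u' (whole string if no 'u').
count_r_before_u <- function(s) {
  # Position of the first 'u', or -1 when absent
  u_pos <- as.integer(regexpr("u", s, fixed = TRUE))
  head_part <- ifelse(u_pos > 0, substr(s, 1, u_pos - 1), s)
  hits <- gregexpr("r", head_part, ignore.case = TRUE)
  # gregexpr yields -1 for "no match", so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

rquote <- "R's internals are irrefutably intriguing"
count_r_before_u(rquote)
```

The function is vectorised over s, so it can be applied to a whole character vector at once.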

Extract first X Numbers from Text Field using Regex

I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only get the first match. The .? matches any single character and makes it optional (a plain . would fail on the case without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of (), I could reference them too with '\\2' and so on. Inside the group (and you had that syntax correct) I want only digits, and exactly 4 of them. The final piece, .*, matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above line down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps the positions of the first four digits in each x
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group-capturing approach, which doesn't assume exactly 4 digits.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
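A base R alternative that extracts the first run of n consecutive digits directly, and returns NA when there is none, is sketched below. The function name first_digits and its n argument are assumptions for illustration, not part of any answer above.

```r
# Hypothetical helper: first run of n consecutive digits, NA if none is found.
first_digits <- function(x, n = 4) {
  m <- regexpr(paste0("\\d{", n, "}"), x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the elements that matched, in order
  out[m != -1] <- regmatches(x, m)
  out
}

x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")
first_digits(x)
```

Unlike the sub() approaches, which hand back the input unchanged when nothing matches, this makes a failed match visible as NA.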
