Regular Expressions and substring in R

I'm trying to clean and change the data into a specific format.
All data should have the following format: 2 digits, 3 letters (MKT), 4 digits, 1 underscore and 1 digit (for example 66MKT1234_1)
Let's assume that I have the following data:
V <- c("66MKT030_2", "66MGT1220_2", "66MGT063_1", "66MKT350_2","22233366698","66MKT3500_2", "9999999")
What to correct:
a) The 1st, 3rd and 4th elements of the vector have only 3 digits after the 3 letters (MTG). In this case, I need to add one 0 digit right after the letters
b) The 2nd and 3rd elements need to be corrected from "MGT" to "MTG"
c) the 5th and 7th element need to be removed.
My approach was to:
step 1 - remove the data that do not match the format (2 digits, 3 letters (MKT), 4 digits, 1 underscore and 1 digit)
aa <- grepl("\\d{2}\\w{3}\\d{3,4}[:punct:]\\d{1}", V)
V2 <- V[aa]
step 2 - use gsub to correct "MGT" to "MTG"
step 3 - find a way to add the digit 0 after the letters if the digit field has length 3 (for example, the first element should be changed from 66MKT030_2 to 66MKT0030_2)
I am stuck in step 1, as my code fails to remove the 5th ("22233366698") and 7th ("9999999") elements from the vector.
Can you please help me do this in a clever way?
Thanks

You may use
sub("^(\\d{2}[[:alpha:]]{3})(\\d{3}\\D)", "\\10\\2", sub("MGT", "MTG", grep("^\\d+$", V, value=TRUE, invert=TRUE), fixed=TRUE))
In separate steps:
V <- grep("^\\d+$", V, value=TRUE, invert=TRUE)
V <- sub("MGT", "MTG", V, fixed=TRUE)
sub("^(\\d{2}[[:alpha:]]{3})(\\d{3}\\D)", "\\10\\2", V)
Output:
[1] "66MKT0030_2" "66MTG1220_2" "66MTG0063_1" "66MKT0350_2" "66MKT3500_2"
Details
grep("^\\d+$", V, value=TRUE, invert=TRUE) - filters out all items that only consist of digits (invert=TRUE reverses the result of the ^\d+$ regex)
sub("MGT", "MTG", V, fixed=TRUE) - replaces MGT with MTG (fixed=TRUE makes this replacement on literal strings with no regex engine involved, which usually speeds up the process)
sub("^(\\d{2}[[:alpha:]]{3})(\\d{3}\\D)", "\\10\\2", V) - adds a 0 before the 3rd field that consists of three digits only.
The third step regex details:
^ - start of string
(\d{2}[[:alpha:]]{3}) - Group 1: two digits (\d{2}), three letters ([[:alpha:]]{3})
(\d{3}\D) - Group 2: three digits (\d{3}) and a non-digit (\D)
\10\2 - Group 1, 0, Group 2 value.
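The three steps above can be wrapped into one function. This is only a minimal sketch: the name clean_codes and the strict final format check are assumptions added for illustration, not part of the answer.

```r
# Hypothetical helper (name and final check are assumptions, not from the answer)
clean_codes <- function(v) {
  v <- grep("^\\d+$", v, value = TRUE, invert = TRUE)           # drop digit-only items
  v <- sub("MGT", "MTG", v, fixed = TRUE)                       # literal replacement
  v <- sub("^(\\d{2}[[:alpha:]]{3})(\\d{3}\\D)", "\\10\\2", v)  # pad a 3-digit field with 0
  # Keep only items that now match: 2 digits, 3 letters, 4 digits, "_", 1 digit
  v[grepl("^\\d{2}[[:alpha:]]{3}\\d{4}_\\d$", v)]
}

V <- c("66MKT030_2", "66MGT1220_2", "66MGT063_1", "66MKT350_2",
       "22233366698", "66MKT3500_2", "9999999")
result <- clean_codes(V)
print(result)
```

The anchored grepl at the end also rejects malformed items that are not purely numeric, which the digit-only filter alone would let through.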

Related

How to read a comma-separated numerical string and perform various functions on it

I have a column with numerical comma-separated strings, e.g., '0,1,17,200,6,0,1'.
I want to create new columns for the counts of those numbers (or substrings) in the strings that are not equal to 0.
I can use something like this to count the non-zero numbers in the whole string:
df1$F1 <- sapply(strsplit(df1$a, ","), function(x) length(which(x>0)))
[1] 5
This outputs '5' for the example string above, which is correct, as the number of non-zero substrings in '0,1,17,200,6,0,1' is indeed 5.
The challenge, however, is to restrict the number of substrings considered. For example, how can I get the count for only the first 3 or 6 substrings in the string?

You can use gsub and backreference to cut the string to the desired length before you count how many substrings are > 0:
DATA:
df1 <- data.frame(a = "0,1,17,200,6,0,1")
df1$a <- as.character(df1$a)
SOLUTION:
First cut the string to whatever number of substrings you want--here, I'm cutting it to three numeric substrings (the first two of which are followed by a comma)--and store the result in a new vector:
df1$a_3 <- gsub("^(\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
df1$a_3
[1] "0,1,17"
Now insert the new vector into your sapply statement to count how many substrings are greater than 0:
sapply(strsplit(df1$a_3, ","), function(x) length(which(x>0)))
[1] 2
To vary the number of substrings, vary the number of repetitions of \\d+ in the pattern accordingly. For example, this works for 6 substrings:
df1$a_6 <- gsub("^(\\d+,\\d+,\\d+,\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
sapply(strsplit(df1$a_6, ","), function(x) length(which(x>0)))
[1] 4
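As an alternative to writing one regex per length, the same count can be sketched with head() on the split string. The function name count_nonzero_first is hypothetical, and the sketch assumes every field is a plain number; it converts to numeric before comparing, so a field such as "00" counts as zero.

```r
# Hypothetical helper: count non-zero values among the first n comma-separated fields
count_nonzero_first <- function(x, n) {
  sapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    vals <- as.numeric(head(parts, n))  # keep at most n fields, compare as numbers
    sum(vals != 0)
  })
}

a <- "0,1,17,200,6,0,1"
first3 <- count_nonzero_first(a, 3)
first6 <- count_nonzero_first(a, 6)
all7 <- count_nonzero_first(a, 7)
print(c(first3, first6, all7))
```

Because n is a parameter, no new pattern has to be written when the number of substrings changes.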
EDIT TO ACCOUNT FOR NEW SET OF QUESTIONS:
To compute the maximum value of substrings > 0, exemplified here for df1$a, the string as a whole (for the restricted strings, just use the relevant vector accordingly, e.g., df1$a_3, df1$a_6 etc.):
First split the string using strsplit, then unlist the resulting list using unlist, and finally convert the resulting vector from character to numeric, storing the result in a vector, e.g., string_a:
string_a <- as.numeric(unlist(strsplit(df1$a, ",")))
string_a
[1] 0 1 17 200 6 0 1
On that vector you can perform all sorts of functions, including max for the maximum value, and sum for the sum of the values:
max(string_a)
[1] 200
sum(string_a)
[1] 225
Re the number of values that are equal to 0, adjust your sapply statement by setting x == 0:
sapply(strsplit(df1$a, ","), function(x) length(which(x == 0)))
[1] 2
Hope this helps!

Find numbers whose second character is 9 using R

I have a list of numbers and I want to find the numbers whose second character is 9. grep() finds any number that contains a 9, but I am looking for code that finds only the numbers whose second character is 9. The code below returns:
p <- c(34405, 09098424, 6908347, 8900333, 453434)
grep(9, p)
[1] 1 2 3 4
I am looking for something that return:
[1] 2 3 4
Thanks
Majran

We can use substr to extract the 2nd digit and check whether (==) that is equal to 9, get the numeric index by wrapping with which.
which(substr(p,2,2)=="9")
#[1] 2 3 4
Or another option is grep where we match the pattern ^.9 (where ^ suggests the start of the string, . can be any character followed by 9 i.e. the second character)
grep("^.9", p)
#[1] 2 3 4
NOTE: Here I am assuming that the OP's vector is character class because numeric elements don't have 0 padded on the left.
data
p <- c("34405", "09098424", "6908347", "8900333", "453434")
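The point in the NOTE can be illustrated with a small sketch. The variable names p_num and p_chr are assumptions introduced here; the sketch only shows that a numeric vector loses the leading zero, so the position of the "second character" shifts.

```r
# Numeric literals drop the leading zero: 09098424 is stored as 9098424
p_num <- c(34405, 09098424, 6908347, 8900333, 453434)
p_chr <- c("34405", "09098424", "6908347", "8900333", "453434")

# Same test on both versions of the data
idx_num <- which(substr(as.character(p_num), 2, 2) == "9")
idx_chr <- which(substr(p_chr, 2, 2) == "9")

print(idx_num)  # the second element no longer qualifies
print(idx_chr)
```

If the leading zeros matter, the values should be read in as character from the start, for example with colClasses = "character" in read.table.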

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'?
myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607

We can use sub. From the start of the string (^) we match one or more characters that are not : ([^:]+), followed by a :, followed by one or more characters that are not : ([^:]+), and place all of this in a capture group, i.e. within the parentheses. We then replace the whole match with the capture group (\\1).
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For general cases to remove after the nth delimiter,
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n'
n <- 3
and repeating the same steps (pat has to be rebuilt with the new n)
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Or another option would be to split by : and then paste the n number of components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
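The split-and-paste option can be sketched as a reusable function. The name keep_first_fields and the sep argument are assumptions for illustration; head() is used instead of x[seq.int(n)] so that strings with fewer than n fields are returned unchanged rather than padded with NA.

```r
# Hypothetical helper: keep the first n fields of each delimited string
keep_first_fields <- function(x, n, sep = ":") {
  vapply(strsplit(x, sep, fixed = TRUE),
         function(parts) paste(head(parts, n), collapse = sep),
         character(1L))
}

myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
res2 <- keep_first_fields(myvec, 2)
print(res2)
```

Splitting with fixed = TRUE treats the delimiter literally, so the sketch also works for delimiters such as "." or "|" that are special in a regex.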

Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
Note 2: The regular expression for k equal to 3 is:
^((.*?:){2}.*?):.*
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
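The paste0(myvec, ":") adjustment from Note 3 can be sketched as follows for k = n = 6. The variable names padded, res_sub and res_pos are assumptions; perl = TRUE is added here only so that the lazy quantifiers are handled by the PCRE engine.

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 6                        # number of fields to keep (here: all of them)
padded <- paste0(myvec, ":")  # add a trailing delimiter so a kth colon exists

# (2) sprintf/sub applied to the padded input
res_sub <- sub(sprintf("^((.*?:){%d}.*?):.*", k - 1), "\\1", padded, perl = TRUE)

# (4) locate the kth colon in the padded input, cut the original string there
res_pos <- substr(myvec, 1, sapply(gregexpr(":", padded, fixed = TRUE), "[", k) - 1)

print(res_sub)
print(res_pos)
```

Both results equal the original strings, since all six fields are retained.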
Note 4: We compare performance:
library(rbenchmark)
benchmark(
.read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
.sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
.for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
.gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
order = "elapsed", replications = 1000)[1:4]
giving:
          test replications elapsed relative
2 .sprintf.sub         1000    0.11    1.000
4    .gregexpr         1000    0.14    1.273
3         .for         1000    0.15    1.364
1  .read.table         1000    2.16   19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
ADDED Added additional solutions and additional notes.

Sorting a string by specific values

I have the following string:
str1<-"{a{c}{b{{e}{d}}}}"
In addition, I have a list of integers:
str_d <- c(1, 2, 2, 4, 4)
There is a one-to-one relation between the list and the string.
That is:
a 1
c 2
b 2
e 4
d 4
I would like to sort in alphabetic order only the characters of str1 that have same level.
It means to sort c, b (which have the same value 2) will yield b,c
and to sort e, d (which have the same value 4) will yield d,e.
The required result will be:
str2<-"{a{b}{c{{d}{e}}}}"
In addition a,b,c,d and e can be not only characters, but might be words, such as:
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
How can I do this while keeping the braces in place?

brkts <- gsub("\\w+", "%s", str1)
strings <- regmatches(str1,gregexpr("[^{}]+",str1))[[1]]
fixed <- ave(strings, str_d, FUN=function(x) sort(x))
do.call(sprintf, as.list(c(brkts, fixed)))
[1] "{a{b}{c{{d}{e}}}}"
and
[1] "{NSP{ARD}{BOS{{COR}{DUD}}}}"
It will work for the first and second case. We first replace each word with %s using gsub, which produces a template that is used later by sprintf. Next we extract the words with regmatches and gregexpr, matching runs of characters that are not braces. We then sort the words within each group defined by str_d using ave, and save the result in the vector fixed. Lastly, we call sprintf on the brkts template that we created at the beginning and the sorted strings.
Data
str_d <- c(1, 2, 2, 4, 4)
str1<-"{a{c}{b{{e}{d}}}}"
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
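The same steps can be sketched as a single function that handles both cases. The name sort_within_levels and the length check are assumptions added for illustration; the sketch also assumes the input contains no literal % character, since the template is passed to sprintf.

```r
# Hypothetical helper: sort labels alphabetically within each level, braces stay in place
sort_within_levels <- function(s, levels) {
  template <- gsub("[^{}]+", "%s", s)                  # braces kept, labels become %s
  labels <- regmatches(s, gregexpr("[^{}]+", s))[[1]]  # labels in order of appearance
  stopifnot(length(labels) == length(levels))          # one level per label
  sorted <- ave(labels, levels, FUN = sort)            # sort only within equal levels
  do.call(sprintf, as.list(c(template, sorted)))
}

str_d <- c(1, 2, 2, 4, 4)
out1 <- sort_within_levels("{a{c}{b{{e}{d}}}}", str_d)
out2 <- sort_within_levels("{NSP{ARD}{BOS{{DUD}{COR}}}}", str_d)
print(c(out1, out2))
```

Because ave writes each sorted group back into the positions that group occupied, labels never move across levels.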

One possible solution (using the stringr package):
library(stringr)
words <- str_extract_all(str1, '\\w+')[[1]]
ordered <- words[order(paste(str_d, words))]
formatter <- str_replace_all(str1, '\\w+', '%s')
do.call(sprintf, as.list(c(formatter, ordered)))
words holds the words extracted from between the braces. I order them by sorting on the combination of str_d and the word, i.e. on these keys:
1 a
2 c
2 b
4 e
4 d
Then I slap it all back together with sprintf().

Finding repeated substrings with R

I have the following code for finding out a pattern (consecutively repeated substring) in a string, say 0110110110000. The output patterns are 011 and 110, since they are both repeated within the string. What changes can be done to the following code?
I'd like to identify substrings that start from any position in a given string, and which repeat for at least a threshold number of times. In the above mentioned string, the threshold is three (th = 3). The repeated string should be the maximal repeated string. In the above string, 110 and 011 both satisfy these conditions.
Here's my attempt at doing this:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
pat <- paste0("(.{", k, "})", reps("\\1", th-1))
r <- regexpr(pat, string, perl = TRUE)
if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}

You can do this with regex:
s <- '0110110110000'
thr <- 3
m <- gregexpr(sprintf('(?=(.+)(?:\\1){%s,})', thr-1), s, perl=TRUE)
unique(mapply(function(x, y) substr(s, x, x+y-1),
attr(m[[1]], 'capture.start'),
attr(m[[1]], 'capture.length')))
# [1] "011" "110" "0"
The pattern in the gregexpr uses a positive lookahead to prevent characters from being consumed by the match (and so allowing overlapping matches, such as with the 011 and 110). We use a repeated (at least thr - 1 times) backreference to the captured group to look for repeated substrings.
Then we can extract the matched substrings by taking start positions and lengths from the attributes of the result of gregexpr, i.e. the object m.
You didn't specify a minimum string length, so this returns 0 as one of the repeated substrings. If you have a minimum and/or maximum substring length in mind, you can modify the first subexpression of the regex. For example, the following would match only substrings with at least 2 characters.
sprintf('(?=(.{2,})(?:\\1){%s,})', thr-1)
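Putting the pieces together, the approach can be sketched as a function with the threshold and minimum length as parameters. The name find_repeats and the argument min_len are assumptions for illustration; the no-match branch is added because gregexpr returns -1 when nothing is found.

```r
# Hypothetical helper: substrings of length >= min_len repeated consecutively >= thr times
find_repeats <- function(s, thr = 3, min_len = 1) {
  # Lookahead keeps matches zero-width, so overlapping candidates are all reported
  pat <- sprintf("(?=(.{%d,})(?:\\1){%d,})", min_len, thr - 1)
  m <- gregexpr(pat, s, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))          # no repeated substring found
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  unique(substring(s, starts, starts + lens - 1))
}

s <- "0110110110000"
reps_any <- find_repeats(s, thr = 3)
reps_min2 <- find_repeats(s, thr = 3, min_len = 2)
print(reps_any)
print(reps_min2)
```

With min_len = 2 the single-character repeat "0" is dropped and only "011" and "110" remain.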
