How can I extract matched part of multiple strings? - r

I have multiple strings, and I want to extract the part that matches.
In practice my strings are directories, and I need to choose where to write a file, which is the location that matches in all strings. For example, if you have a vector with three strings:
data.dir <- c("C:\\data\\files\\subset1\\", "C:\\data\\files\\subset3\\", "C:\\data\\files\\subset3\\")
...the part that matches in all strings is "C:\data\files\". How can I extract this?

strsplit and intersect the overlapping parts recursively using Reduce. You can then piece it back together by paste-ing.
paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse="\\")
#[1] "C:\\data\\files"
As #g-grothendieck notes, this will fail in certain circumstances like:
data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\")
An ugly hack might be something like:
tail(
Reduce(
intersect,
lapply(strsplit(data.dir, "\\\\"),
function(x) sapply(1:length(x), function(y) paste(x[1:y], collapse="\\") )
)
),
1)
...which will deal with either case.
Alternatively, use dirname if you only ever have one extra directory level:
unique(dirname(data.dir))
#[1] "C:/data/files"

g contains the character positions to successive backslashes in data.dir[1]. From this create a logical vector ok whose ith element is TRUE if the first g[i] characters of all elements in data.dir are the same, i.e. all elements of substr(data.dir, 1, g[i]) are the same. If ok[1] is TRUE then there is a non-zero length common prefix whose length is given by the first g[k] characters of data.dir[1] where k (which equals rle(ok)$lengths[1]) is the leading number of TRUE values in ok; otherwise, there is no common prefix so return "".
g <- gregexpr("\\", data.dir[1], fixed = TRUE)[[1]]
ok <- sapply(g, function(i) all(substr(data.dir[1], 1, i) == substr(data.dir, 1, i)))
if (ok[1]) substr(data.dir[1], 1, g[rle(ok)$lengths[1]]) else ""
For data.dir defined in the question the last line gives:
[1] "C:\\data\\files\\"

Related

gsub to remove unwanted precision

Could anyone please help to achieve the following with gsub in R?
input string: a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02
desired output: a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200, g=850.02
Practically, removing the redundant 0s after the decimal point if they are all just 0s, don't remove if real fractions exist.
I couldn't get this to work using gsub alone, but we can try splitting your input vector on comma, and then using an apply function with gsub:
x <- "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
input <- sapply(unlist(strsplit(x, ",")), function(x) gsub("(?<=\\d)\\.$", "", gsub("(\\.[1-9]*)0+$", "\\1", x), perl=TRUE))
input <- paste(input, collapse=",")
input
[1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Demo
I actually make two calls to gsub. The first call strips off all trailing zeroes appearing after a decimal point, should the number have one. And the second call removes stray decimal points, in the case of a number like 5.00, which the first call would leave as 5. and not 5, the latter which we want.
To remove trailing 0s after the decimal, try this:
EDIT Forgot 5.00
x = c('5.00', '0.500', '120', '0.0003', '0.02', '5.20', '1200', '850.02')
gsub("\\.$" "", gsub("(\\.(|[1-9]+))0+$", "\\1", x))
# [1] "5" "0.5" "120" "0.0003" "0.02" "5.2" "1200" "850.02"
HT #TimBiegeleisen: I misread input as a vector of strings. For a single-string input, convert to vector of strings, which you can call gsub on, then collapse output back to a single string:
paste(
gsub("\\.$", "", gsub("(\\.(|[1-9]+))0+$", "\\1",
unlist(strsplit(x, ", ")))),
collapse=", ")
[1] "a=5, b=0.5, c=120, d=0.0003, e=0.02, f=5.2, g=1200, h=850.02"
gsub is a text processing tool that works on character level. It’s ignorant of any semantic interpretation.
However, you are specifically interested in manipulating this semantic interpretation, namely, the precision of numbers encoded in your text.
So use that: parse the numbers in the text, and write them out with the desired precision:
parse_key_value_pairs = function (text) {
parse_pair = function (pair) {
pair = strsplit(pair, "\\s*=\\s*")[[1]]
list(key = pair[1], value = as.numeric(pair[2]))
}
pairs = unlist(strsplit(text, "\\s*,\\s*"))
structure(lapply(pairs, parse_pair), class = 'kvp')
}
as.character.kvp = function (x, ...) {
format_pair = function (pair) {
sprintf('%s = %g', pair[1], pair[2])
}
pairs = vapply(x, format_pair, character(1))
paste(pairs, collapse = ", ")
}
And use it as follows:
text = "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
parsed = parse_key_value_pairs(text)
as.character(parsed)
This uses several interesting features of R:
For text processing, it still uses regular expressions (inside strsplit).
To process multiple values, use lapply to apply a parsing function to parts of the string in turn
To reconstruct a key–value pair, format the string using sprintf. sprintf is a primitive text formatting tool adapted from C. But it’s fairly universal and it works OK in our case.
The parsed value is tagged with an S3 class name. This is how R implements object orientation.
Provide an overload of the standard generic as.character for our type. This means that any existing function that takes an object and displays it via as.character can deal with our parsed data type. In particular, this works with the {glue} library:
> glue::glue("result: {parsed}")
result: a = 5, b = 120, c = 0.0003, d = 0.02, e = 5.2, f = 1200, g = 850.02
This is probably not the most ideal solution, but for educational purposes, here is one way to call gsub only once using conditional regex:
x = 'a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02'
gsub('(?!\\d+(?:,|$))(\\.[0-9]*[1-9])?(?(1)0+\\b|\\.0+(?=(,|$)))', '\\1', x, perl = TRUE)
# [1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Notes:
(?!\\d+(?:,|$)) is a negative lookbehind that matches a digit one or more times following a comma or end of string. This effectively excludes the pattern from the overall regex match.
(\\.[0-9]*[1-9])? matches a literal dot, a digit zero or more times and a digit (except zero). The ? makes this pattern optional, and is crucial to how the conditional handles the back reference.
(?(1)0+\\b|\\.0+(?=(,|$))) is a conditional with the logic (?(IF)THEN|ELSE)
(1) is the (IF) part which checks if capture group 1 matched. This refers to (\\.[0-9]*[1-9])
0+\\b is the (THEN) part which matches only if (IF) is TRUE. In this case, only if (\\.[0-9]*[1-9]) matched, will the regex try to match a zero one or more times following a word boundary
\\.0+(?=(,|$)) is the (ELSE) part which matches only if (IF) is FALSE. In this case only if (\\.[0-9]*[1-9]) didn't match, will the regex try to match a literal dot, a zero one or more times following a comma or end of string
If we put 2. and 3. together, we get either (\\.[0-9]*[1-9])0+\\b or \\.0+(?=(,|$))
\\1 as a replacement therefore turns either (\\.[0-9]*[1-9])0+\\b to the pattern matched by (\\.[0-9]*[1-9]) or \\.0+(?=(,|$)) to blank. which translates to:
5.20 to 5.2 for the former
5.00 to 5 and 1200.0 to 1200 for the latter

String manipulation in R

I have a list of DNA sequences, for example, "AGAACCTTATTGGGTCAAGT". If I were wanting to create a list with all possible strings that could sequentially happen in the sequence of a given length (for example 4) how would this be done in R?
In this case, the first string would be "AGAA". The second would be "GAAC", the third, "AACC", and so forth.
x = "AGAACCTTATTGGGTCAAGT"
sapply(1:(nchar(x)-3), function(i) substr(x, i, i+3))
#[1] "AGAA" "GAAC" "AACC" "ACCT" "CCTT" "CTTA" "TTAT" "TATT" "ATTG" "TTGG" "TGGG" "GGGT" "GGTC" "GTCA" "TCAA" "CAAG" "AAGT"

string split at the last (also at any nth) delimiter in R and remove the string before the delimiter

I have a vector vec. I need to remove the part before last "/" and get the remaining string and get the result. Please also note that I can't use Perl-compatible regexps (i.e. perl=FALSE). I would also like to see it for the nth delimiter.
vec<-c("/apple/pineapple/mango/reg.sh_ttgs.pos","/apple/pipple/mgo/deh_ttgs.pos")
Result for the last delimiter
reg.sh_ttgs.pos , deh_ttgs.pos
Result for the 2nd delimiter
pineapple/mango/reg.sh_ttgs.pos, pipple/mgo/deh_ttgs.pos
and so on..
Alternatively, you can use char2end() in the qdap package. You specify a delimiter and which delimiter you want to use (1st, 2nd, etc.) in the function.
library(qdap)
For the 2nd delimiter,
char2end(vec, "/", 2)
#[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"
For the last delimiter,
char2end(vec, "/", 4)
#[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"
One way could be to use a function like this (using gregexpr to get the location of a string and substring to subset the string accordingly):
get_string <- function(vec, n) {
if(n == 'last'){
positions <- lapply(gregexpr(pattern ='/',vec), function(x) x[length(x)] + 1)
}else{
positions <- lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1)
}
substring(vec, positions)
}
Output:
> get_string(vec, 2)
[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"
> get_string(vec, 'last')
[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"
You either specify the nth '/' or just specify 'last' if you want just the last part of the path.
Note: I am using an if-else statement above just in case the position of the last '/' is different in the various elements of your actual vector. If the number of /s will always be the same across all elements only lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1) is needed.

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character, and returns the first position at which at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum amount of matches you want in the n argument, it defaults to 1. You can also specify if you want the position or the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "C" "A"

String split and expand the (vector) at the delimiter: R

I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H
You can try this, using strsplit as mentioned by #Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=T),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1])), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12DV" "IID:WE:G12DA" "GH:SQ:p.R172W" "GH:SQ:p.R172WG" "HH:WG:p.S122F" "HH:WG:p.S122FH"

Resources