Could anyone please help to achieve the following with gsub in R?
input string: a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02
desired output: a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200, g=850.02
Practically, removing the redundant 0s after the decimal point if they are all just 0s, don't remove if real fractions exist.
I couldn't get this to work using gsub alone, but we can try splitting your input vector on comma, and then using an apply function with gsub:
x <- "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
input <- sapply(unlist(strsplit(x, ",")), function(x) gsub("(?<=\\d)\\.$", "", gsub("(\\.[1-9]*)0+$", "\\1", x), perl=TRUE))
input <- paste(input, collapse=",")
input
[1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Demo
I actually make two calls to gsub. The first call strips off all trailing zeroes appearing after a decimal point, should the number have one. And the second call removes stray decimal points, in the case of a number like 5.00, which the first call would leave as 5. and not 5, the latter which we want.
To remove trailing 0s after the decimal, try this:
EDIT Forgot 5.00
x = c('5.00', '0.500', '120', '0.0003', '0.02', '5.20', '1200', '850.02')
gsub("\\.$" "", gsub("(\\.(|[1-9]+))0+$", "\\1", x))
# [1] "5" "0.5" "120" "0.0003" "0.02" "5.2" "1200" "850.02"
HT #TimBiegeleisen: I misread input as a vector of strings. For a single-string input, convert to vector of strings, which you can call gsub on, then collapse output back to a single string:
paste(
gsub("\\.$", "", gsub("(\\.(|[1-9]+))0+$", "\\1",
unlist(strsplit(x, ", ")))),
collapse=", ")
[1] "a=5, b=0.5, c=120, d=0.0003, e=0.02, f=5.2, g=1200, h=850.02"
gsub is a text processing tool that works on character level. It’s ignorant of any semantic interpretation.
However, you are specifically interested in manipulating this semantic interpretation, namely, the precision of numbers encoded in your text.
So use that: parse the numbers in the text, and write them out with the desired precision:
parse_key_value_pairs = function (text) {
parse_pair = function (pair) {
pair = strsplit(pair, "\\s*=\\s*")[[1]]
list(key = pair[1], value = as.numeric(pair[2]))
}
pairs = unlist(strsplit(text, "\\s*,\\s*"))
structure(lapply(pairs, parse_pair), class = 'kvp')
}
as.character.kvp = function (x, ...) {
format_pair = function (pair) {
sprintf('%s = %g', pair[1], pair[2])
}
pairs = vapply(x, format_pair, character(1))
paste(pairs, collapse = ", ")
}
And use it as follows:
text = "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
parsed = parse_key_value_pairs(text)
as.character(parsed)
This uses several interesting features of R:
For text processing, it still uses regular expressions (inside strsplit).
To process multiple values, use lapply to apply a parsing function to parts of the string in turn
To reconstruct a key–value pair, format the string using sprintf. sprintf is a primitive text formatting tool adapted from C. But it’s fairly universal and it works OK in our case.
The parsed value is tagged with an S3 class name. This is how R implements object orientation.
Provide an overload of the standard generic as.character for our type. This means that any existing function that takes an object and displays it via as.character can deal with our parsed data type. In particular, this works with the {glue} library:
> glue::glue("result: {parsed}")
result: a = 5, b = 120, c = 0.0003, d = 0.02, e = 5.2, f = 1200, g = 850.02
This is probably not the most ideal solution, but for educational purposes, here is one way to call gsub only once using conditional regex:
x = 'a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02'
gsub('(?!\\d+(?:,|$))(\\.[0-9]*[1-9])?(?(1)0+\\b|\\.0+(?=(,|$)))', '\\1', x, perl = TRUE)
# [1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Notes:
(?!\\d+(?:,|$)) is a negative lookbehind that matches a digit one or more times following a comma or end of string. This effectively excludes the pattern from the overall regex match.
(\\.[0-9]*[1-9])? matches a literal dot, a digit zero or more times and a digit (except zero). The ? makes this pattern optional, and is crucial to how the conditional handles the back reference.
(?(1)0+\\b|\\.0+(?=(,|$))) is a conditional with the logic (?(IF)THEN|ELSE)
(1) is the (IF) part which checks if capture group 1 matched. This refers to (\\.[0-9]*[1-9])
0+\\b is the (THEN) part which matches only if (IF) is TRUE. In this case, only if (\\.[0-9]*[1-9]) matched, will the regex try to match a zero one or more times following a word boundary
\\.0+(?=(,|$)) is the (ELSE) part which matches only if (IF) is FALSE. In this case only if (\\.[0-9]*[1-9]) didn't match, will the regex try to match a literal dot, a zero one or more times following a comma or end of string
If we put 2. and 3. together, we get either (\\.[0-9]*[1-9])0+\\b or \\.0+(?=(,|$))
\\1 as a replacement therefore turns either (\\.[0-9]*[1-9])0+\\b to the pattern matched by (\\.[0-9]*[1-9]) or \\.0+(?=(,|$)) to blank. which translates to:
5.20 to 5.2 for the former
5.00 to 5 and 1200.0 to 1200 for the latter
I have a list of DNA sequences, for example, "AGAACCTTATTGGGTCAAGT". If I were wanting to create a list with all possible strings that could sequentially happen in the sequence of a given length (for example 4) how would this be done in R?
In this case, the first string would be "AGAA". The second would be "GAAC", the third, "AACC", and so forth.
x = "AGAACCTTATTGGGTCAAGT"
sapply(1:(nchar(x)-3), function(i) substr(x, i, i+3))
#[1] "AGAA" "GAAC" "AACC" "ACCT" "CCTT" "CTTA" "TTAT" "TATT" "ATTG" "TTGG" "TGGG" "GGGT" "GGTC" "GTCA" "TCAA" "CAAG" "AAGT"
I have a vector vec. I need to remove the part before last "/" and get the remaining string and get the result. Please also note that I can't use Perl-compatible regexps (i.e. perl=FALSE). I would also like to see it for the nth delimiter.
vec<-c("/apple/pineapple/mango/reg.sh_ttgs.pos","/apple/pipple/mgo/deh_ttgs.pos")
Result for the last delimiter
reg.sh_ttgs.pos , deh_ttgs.pos
Result for the 2nd delimiter
pineapple/mango/reg.sh_ttgs.pos, pipple/mgo/deh_ttgs.pos
and so on..
Alternatively, you can use char2end() in the qdap package. You specify a delimiter and which delimiter you want to use (1st, 2nd, etc.) in the function.
library(qdap)
For the 2nd delimiter,
char2end(vec, "/", 2)
#[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"
For the last delimiter,
char2end(vec, "/", 4)
#[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"
One way could be to use a function like this (using gregexpr to get the location of a string and substring to subset the string accordingly):
get_string <- function(vec, n) {
if(n == 'last'){
positions <- lapply(gregexpr(pattern ='/',vec), function(x) x[length(x)] + 1)
}else{
positions <- lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1)
}
substring(vec, positions)
}
Output:
> get_string(vec, 2)
[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"
> get_string(vec, 'last')
[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"
You either specify the nth '/' or just specify 'last' if you want just the last part of the path.
Note: I am using an if-else statement above just in case the position of the last '/' is different in the various elements of your actual vector. If the number of /s will always be the same across all elements only lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1) is needed.
I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character, and returns the first position at which at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum amount of matches you want in the n argument, it defaults to 1. You can also specify if you want the position or the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "C" "A"
I have this vector (it's big in size) myvec. I need to split them matching at / and create another result vector resvector. How can I get this done in R?
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
resvector
IID:WE:G12D, IID:WE:G12V,IID:WE:G12A,GH:SQ:p.R172W,GH:SQ:p.R172G,HH:WG:p.S122F,HH:WG:p.S122H
You can try this, using strsplit as mentioned by #Tensibai:
sp_vec <- strsplit(myvec, "/") # split the element of the vector by "/" : you will get a list where each element is the decomposition (vector) of one element of your vector, according to "/"
ts_vec <- lapply(sp_vec, # for each element of the previous list, do
function(x){
base <- sub("\\w$", "", x[1]) # get the common beginning of the column names (so first item of vector without the last letter)
x[-1] <- paste0(base, x[-1]) # paste this common beginning to the rest of the vector items (so the other letters)
x}) # return the vector
resvector <- unlist(ts_vec) # finally, unlist to get the needed vector
resvector
# [1] "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
Here is a concise answer with regex and some functional programming:
x = gsub('[A-Z]/.+','',myvec)
y = strsplit(gsub('[^/]+(?=[A-Z]/.+)','',myvec, perl=T),'/')
unlist(Map(paste0, x, y))
# "IID:WE:G12D" "IID:WE:G12V" "IID:WE:G12A" "GH:SQ:p.R172W" "GH:SQ:p.R172G" "HH:WG:p.S122F" "HH:WG:p.S122H"
myvec<-c("IID:WE:G12D/V/A","GH:SQ:p.R172W/G", "HH:WG:p.S122F/H")
custmSplit <- function(str){
splitbysep <- strsplit(str, '/')[[1]]
splitbysep[-1] <- paste0(substr(splitbysep[1], 1, nchar(splitbysep[1])), splitbysep[-1])
return(splitbysep)
}
do.call('c', lapply(myvec, custmSplit))
# [1] "IID:WE:G12D" "IID:WE:G12DV" "IID:WE:G12DA" "GH:SQ:p.R172W" "GH:SQ:p.R172WG" "HH:WG:p.S122F" "HH:WG:p.S122FH"