gsub and pad inside of a parenthesis


I have a vector like this:
x <- c("20(0.23)", "15(0.2)", "16(0.09)")
I don't want to touch the numbers outside the parentheses, but I want to remove the leading zero from the numbers inside and give them all two decimal digits. The output should look like:
"20(.23)", "15(.20)", "16(.09)"
Useful information:
I can remove leading zero and retain 2 digits using the function below taken from: LINK
numformat <- function(val) { sub("^(-?)0.", "\\1.", sprintf("%.2f", val)) }
numformat(c(0.2, 0.26))
#[1] ".20" ".26"
I know gsub can be used but I don't know how. I'll provide a strsplit answer but that's hackish at best.

The gsubfn package allows you to replace anything matched by a regex with the result of a function applied to the match. So we could combine that with your numformat function:
library(gsubfn)
# Note that I added as.numeric in because what will be passed in
# is a character string
numformat <- function(val){sub("^(-?)0.", "\\1.", sprintf("%.2f", as.numeric(val)))}
gsubfn("0\\.\\d+", numformat, x)
#[1] "20(.23)" "15(.20)" "16(.09)"
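For anyone who would rather not load a package, the same idea (find each number inside the parentheses, run it through numformat, write it back) can probably be sketched in base R with gregexpr() and the replacement form of regmatches(). This is only a sketch, assuming every parenthesised value is a plain decimal number; the lookaround pattern and the name `fixed` are my own choices, and I escaped the dot in numformat so it only matches a literal ".".

```r
x <- c("20(0.23)", "15(0.2)", "16(0.09)")

# Same helper as above: format to two decimals, then drop the leading zero
numformat <- function(val) {
  sub("^(-?)0\\.", "\\1.", sprintf("%.2f", as.numeric(val)))
}

# Match only the number that sits between "(" and ")"
m <- gregexpr("(?<=\\()-?[0-9]*\\.?[0-9]+(?=\\))", x, perl = TRUE)

# Replace each match with its reformatted version
fixed <- x
regmatches(fixed, m) <- lapply(regmatches(fixed, m), numformat)
fixed
#[1] "20(.23)" "15(.20)" "16(.09)"
```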

pad.fix<-function(x){
y<-gsub('\\.(\\d)\\)','\\.\\10\\)',x)
gsub('0\\.','\\.',y)
}
The first gsub adds a trailing zero if needed; the second gsub removes the leading zero.

That is yet another of these Tyler questions that seem to be complicated just for complication's sake :)
So here you go:
R> x <- c("20(0.23)", "15(0.2)", "16(0.09)")
R> sapply(strsplit(gsub("^(\\d+)\\((.*)\\)$", "\\1 \\2", x), " "),
+ function(x) sprintf("%2d(.%02d)",
+ as.numeric(x[1]),
+ as.numeric(x[2])*100))
[1] "20(.23)" "15(.20)" "16(.09)"
R>
We do a few things here:
The gsub() picks off the two numbers: first the one before the parens, then the one inside the parens. [With hindsight, should have picked after the decimal, see below.]
This prints them out just with whitespace, e.g. "20 0.23" for the first.
We then use a standard strsplit() on this.
We then use sapply to process the list we get from strsplit
We print the first number as a two-digit int.
The second one is more tricky -- the (s)printf() family cannot suppress a leading zero, so we print the decimal point and then print two digits of an integer -- and convert the second number accordingly.
It is all concise and in one line, but it would be clearer broken out.
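Broken out step by step, the one-liner might look roughly like this. The intermediate names (`spaced`, `parts`, `out`) are made up for illustration, and the round() call is my own addition: it guards against the scaled decimal not being a whole number when it is handed to %d.

```r
x <- c("20(0.23)", "15(0.2)", "16(0.09)")

# Step 1: "20(0.23)" -> "20 0.23"
spaced <- gsub("^(\\d+)\\((.*)\\)$", "\\1 \\2", x)

# Step 2: split each string on the space
parts <- strsplit(spaced, " ")

# Step 3: rebuild each element; round() is an added guard so that the
# scaled decimal is a whole number before it is formatted with %d
out <- sapply(parts, function(p) {
  outside <- as.integer(p[1])
  inside  <- as.integer(round(as.numeric(p[2]) * 100))
  sprintf("%2d(.%02d)", outside, inside)
})
out
#[1] "20(.23)" "15(.20)" "16(.09)"
```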
Edit: I don't often provide the fastest solutions, but when I do, at least I can gloat:
R> dason <- function(x) { numformat <- function(val){sub("^(-?)0.", "\\1.", sprintf("%.2f", as.numeric(val)))}; gsubfn("0\\.\\d+", numformat, x) }
R> dirk <- function(x) { sapply(strsplit(gsub("^(\\d+)\\((.*)\\)$", "\\1 \\2", x), " "), function(x) sprintf("%2d(.%02d)", as.numeric(x[1]), as.numeric(x[2])*100)) }
R>
R> dason(x)
[1] "20(.23)" "15(.20)" "16(.09)"
R> dirk(x)
[1] "20(.23)" "15(.20)" "16(.09)"
R>
R> res <- benchmark(dason(x), dirk(x), replications=1000, order="relative")
R> res
      test replications elapsed relative user.self sys.self user.child sys.child
2  dirk(x)         1000   0.133    1.000     0.132    0.000          0         0
1 dason(x)         1000   2.026   15.233     1.960    0.064          0         0
R>
So that's about 15 times faster. Not that it matters in this context, but speed never hurt anyone in the long run.

Non gsub answer that's ugly at best.
x <- c("20(0.23)", "15(0.2)", "16(0.09)")
numformat <- function(val) { sub("^(-?)0.", "\\1.", sprintf("%.2f", val)) }
z <- do.call(rbind, strsplit(gsub("\\)", "", x), "\\("))
z[, 2] <- numformat(as.numeric(z[, 2]))
paste0(z[, 1], "(", z[, 2], ")")

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?

I'm not aware of anything in base R, but it's straightforward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"

If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"

Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.

str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"

Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so meaning "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.+(?=.{', n, '})', sep=''), '', string, perl=T)
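As a small sketch of the same lookahead idea wrapped in a function: `last_n` is a hypothetical name of mine, and sprintf() is used here instead of paste() only to build the pattern string. If n is larger than the string, nothing matches and the string comes back unchanged.

```r
# Hypothetical helper: keep the last n characters using the lookahead idea
last_n <- function(s, n) {
  # e.g. n = 3 gives the pattern ".*(?=.{3}$)"
  pattern <- sprintf(".*(?=.{%d}$)", as.integer(n))
  sub(pattern, "", s, perl = TRUE)
}

last_n("some text in a string", 6)
# [1] "string"
last_n(c("12345", "ABCDE"), 2)
# [1] "45" "DE"
```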

UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.

A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This takes advantage of basically being substr() underneath but has a default end value of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"

Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give:
[1] "string"

An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)

I use substr too, but in a different way. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of these characters can be accessed by tail(splits[[1]], n=6)[x], where x is 1 to 6.

Someone earlier used a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want to have only the last word "string" with 6 letters
> n<-5 #as the last character will be counted with nchar(), here we discount 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This will bring the last characters as desired.

For those coming from Microsoft Excel or Google Sheets, you would have seen functions like LEFT(), RIGHT(), and MID(). I have created a package known as forstringr and its development version is currently on Github.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and then extracts n characters
str_right(): counts from the right and then extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"

I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can play with the nchar(stringOfInterest) to figure out how to get last few characters.

A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. And it suggests a counterpart for the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"

Just in case a range of characters needs to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?

Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
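One thing to watch with this pattern: when the length of the string is an exact multiple of 5, the last group also gets a space appended, so the result ends with a trailing space. A possible variant, sketched here with a Perl-style lookahead (the function name `split5` is hypothetical), only adds the space when at least one more character follows:

```r
s1 <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"  # 28 characters
s2 <- "ABCDEFGHIJ"                    # exactly 10 characters

# Only add the space when at least one more character follows the group
split5 <- function(s) gsub("(.{5})(?=.)", "\\1 ", s, perl = TRUE)

split5(s1)
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
split5(s2)
# [1] "ABCDE FGHIJ"
```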

Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"

You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
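Put together, the final paste step could look like this (the names `pieces` and `joined` are just for illustration):

```r
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
l <- seq(from = 5, to = nchar(s), by = 5)

# Chop the string into pieces of up to 5 characters, as above
pieces <- mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))

# Glue the pieces back together with a single space between them
joined <- paste(pieces, collapse = " ")
joined
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
```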

No *apply stringi solution:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.

You can also use substring without a loop. substring is the vectorized version of substr:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

Convert HH:MM:SS to hours (for more than 24 hours) in R

I would like to convert hours more than 24 hours in R.
For example, I have a dataframe which contains hours and minutes like [HH:MM]:
[1] "111:15" "221:15" "111:15" "221:15" "42:05"
I want them to be converted in hours like this:
"111.25" "221.25" "111.25" "221.25" "42.08333333"
The as.POSIXct() function works for general purposes, but not for more than 24 hours.

You can split the strings with strsplit and use sapply to transform all values.
vec <- c("111:15", "221:15", "111:15", "221:15", "42:05")
sapply(strsplit(vec, ":"), function(x) {
x <- as.numeric(x)
x[1] + x[2] / 60
})
The result:
[1] 111.25000 221.25000 111.25000 221.25000 42.08333
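Since the title mentions HH:MM:SS, the same strsplit idea might be extended to an optional seconds field by dividing each successive field by another factor of 60. This is only a sketch; `to_hours` is a hypothetical name and it assumes every element has two or three colon-separated numeric fields.

```r
# Hypothetical helper: hours + minutes/60 + seconds/3600 (seconds optional)
to_hours <- function(v) {
  sapply(strsplit(v, ":"), function(p) {
    p <- as.numeric(p)
    # field 1 is divided by 1, field 2 by 60, field 3 by 3600
    sum(p / 60^(seq_along(p) - 1))
  })
}

hrs <- to_hours(c("111:15", "42:05", "30:30:36"))
hrs
```

Here "30:30:36" works out to 30 + 30/60 + 36/3600 = 30.51 hours.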

I would just parse the strings with regex. Grab the bit before the : then add on the bit after the : divided by 60
> foo = c("111:15", "221:15", "111:15", "221:15", "42:05")
> foo
[1] "111:15" "221:15" "111:15" "221:15" "42:05"
> as.numeric(gsub("([^:]+).*", "\\1", foo)) + as.numeric(gsub(".*:([0-9]{2})$", "\\1", foo))/60
[1] 111.25000 221.25000 111.25000 221.25000 42.08333

Another possibility is a vectorized function such as:
FUN <- function(time){
hours <- sapply(time,FUN=function(x) as.numeric(strsplit(x,split=":")[[1]][1]))
minutes <- sapply(time,FUN=function(x) as.numeric(strsplit(x,split=":")[[1]][2]))
result <- hours+(minutes/60)
return(as.numeric(result))
}
Where you use strsplit to extract the hours and minutes, of which you then take the sum after dividing the minutes by 60.
You can then use the function like this:
FUN(c("111:15","221:15","111:15","221:15","42:05"))
[1] 111.25000 221.25000 111.25000 221.25000 42.08333

strapply. Here is a solution using strapply in the gsubfn package. It passes the matches to each of the parenthesized regular expressions (i.e. the hours and the minutes) to the function described in the third argument. The function can be specified using the usual R function notation, and it also supports a short form using a formula (used here) where the right hand side of the formula is the function body and the left hand side represents the arguments and defaults to the free variables (h, m) in the right hand side. We suppose that the original character vector is ch.
library(gsubfn)
strapply(ch, "(\\d+):(\\d+)", ~ as.numeric(h) + as.numeric(m)/60, simplify = TRUE)
numeric processing. Another way is to replace the : with a . and manipulate it numerically into what we want:
num <- as.numeric(chartr(":", ".", ch))
trunc(num) + 100 * (num %% 1) / 60
sub. This is yet another approach:
h <- as.numeric(sub(":.*", "", ch))
m <- as.numeric(sub(".*:", "", ch))
h + m / 60
Each of the code snippets above gives a numeric result, but we could wrap each in as.character(...) if a character result were desired.
read.table. This one reads the two fields as columns and takes a matrix product:
as.matrix(read.table(text = ch, sep = ":")) %*% c(1, 1/60)
eval/parse. This one manipulates each element into an R expression, which is then evaluated. This one is short, but the use of eval is often frowned upon:
sapply(parse(text = sub(":", "+(1/60)*", ch)), eval)
ADDED additional solutions.

format numeric without leading zero

What's the best way to format a numeric so that it does NOT show leading zero. For example:
test = .006
sprintf/format/formatC( ??? ) # should result in ".006"

I believe I answered this once before but can't find it. You cannot tell sprintf() et al about a format that drops the leading zero ... so you have to do it yourself, eg via substring():
R> val <- 0.006
R> aa <- substring(sprintf("%4.3f", val), 2)
R> aa
[1] ".006"
R>

f <- function(x) gsub("^(\\s*[+|-]?)0\\.", "\\1.", as.character(x))
f(0.006)
# ".006"
f(-0.006)
# "-.006"
f("+0.006")
# "+.006"
f(" 0.006")
# " .006"
f(10.05)
# "10.05"

You can always fix it up yourself with regular expression search-and-replace:
library(stringr)
test = .006
str_replace(as.character(test), "^0\\.", ".")
Not the most elegant answer, but it works. Substitute whatever string conversion you like for as.character, such as sprintf with your preferred floating point format.
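For example, a base-R sketch that swaps in sprintf for the string conversion and sub for str_replace might look like this. `no_lead` and its `digits` argument are hypothetical names, and the optional minus sign in the pattern is my addition so negative values are handled too.

```r
# Hypothetical helper: format with a fixed number of decimals, then strip
# the zero that sits directly before the decimal point
no_lead <- function(x, digits = 3) {
  formatted <- sprintf(paste0("%.", digits, "f"), x)
  sub("^(-?)0\\.", "\\1.", formatted)
}

no_lead(0.006)
# [1] ".006"
no_lead(-0.5, digits = 2)
# [1] "-.50"
no_lead(12.5, digits = 2)
# [1] "12.50"
```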

R grep: is there an AND operator?

Suppose I have the following data frame:
User.Id Tags
34234 imageUploaded,people.jpg,more,comma,separated,stuff
34234 imageUploaded
12345 people.jpg
How might I use grep (or some other tool) to only grab rows that include both "imageUploaded" and "people"? In other words, how might I create a subset that includes just the rows with the strings "imageUploaded" AND "people.jpg", regardless of order.
I have tried:
data.people<-data[grep("imageUploaded|people.jpg",results$Tags),]
data.people<-data[grep("imageUploaded?=people.jpg",results$Tags),]
Is there an AND operator? Or perhaps another way to get the intended result?

Thanks to this answer, this regex seems to work. You want to use grepl() which returns a logical to index into your data object. I won't claim to fully understand the inner workings of the regex, but regardless:
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff", "imageUploaded", "people.jpg")
grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)
#-----
[1] TRUE FALSE FALSE
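As far as I understand it, each (?=...) group is a zero-width lookahead: it checks from the start of the string that the term appears somewhere ahead without consuming any characters, so the two checks are independent and the order of the tags does not matter. A small sketch with made-up test strings (the vector `y` is not from the question):

```r
pat <- "(?=.*imageUploaded)(?=.*people\\.jpg)"

y <- c("people.jpg,imageUploaded",   # both tags, reversed order
       "imageUploaded,other",        # only one of the tags
       "people_jpg,imageUploaded")   # "." is escaped, so "_" does not match

res <- grepl(pat, y, perl = TRUE)
res
# [1]  TRUE FALSE FALSE
```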

I love @Chase's answer, and it makes good sense to me, but it can be a bit dangerous to use constructs that one doesn't totally understand.
This answer is meant to reassure anyone who'd like to use @thelatemail's more straightforward approach that it works just as well and is completely competitive speedwise. It's certainly what I'd use in this case. (It's also reassuring that the more sophisticated Perl-compatible-regex pays no performance cost for its power and easy extensibility.)
library(rbenchmark)
x <- paste0(sample(letters, 1e6, replace=T), ## A longer vector of
sample(letters, 1e6, replace=T)) ## possible matches
## Both methods give identical results
tlm <- grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE)
pat <- "(?=.*a)(?=.*b)"
Chase <- grepl(pat, x, perl=TRUE)
identical(tlm, Chase)
# [1] TRUE
## Both methods are similarly fast
benchmark(
tlm = grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE),
Chase = grepl(pat, x, perl=TRUE))
#          test replications elapsed relative user.self sys.self
# 2       Chase          100    9.89    1.105      9.80     0.10
# 1 thelatemail          100    8.95    1.000      8.47     0.48

For readability's sake, you could just do:
x <- c(
"imageUploaded,people.jpg,more,comma,separated,stuff",
"imageUploaded",
"people.jpg"
)
xmatches <- intersect(
grep("imageUploaded",x,fixed=TRUE),
grep("people.jpg",x,fixed=TRUE)
)
x[xmatches]
[1] "imageUploaded,people.jpg,more,comma,separated,stuff"
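Applied to the data frame from the question, the same readable approach could be sketched by combining two fixed-string grepl() calls with & and using the result to index the rows. The data frame name `df` and the `keep` vector are illustrative; the column names are taken from the question.

```r
df <- data.frame(
  User.Id = c(34234, 34234, 12345),
  Tags = c("imageUploaded,people.jpg,more,comma,separated,stuff",
           "imageUploaded",
           "people.jpg"),
  stringsAsFactors = FALSE
)

# TRUE only for rows whose Tags contain both strings, in any order
keep <- grepl("imageUploaded", df$Tags, fixed = TRUE) &
        grepl("people.jpg", df$Tags, fixed = TRUE)

data.people <- df[keep, ]
data.people
```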

Below is an alternative to grep using hadley's stringr::str_detect(). This avoids the use of perl=TRUE (as noted by @jan-stanstrup). Additionally, dplyr::filter() returns the rows of the data frame itself, so you never need to leave the data frame.
library(stringr)
library(dplyr)
x <- data.frame(User.Id =c(34234,34234,12345),
Tags=c("imageUploaded,people.jpg,more,comma,separated,stuff",
"imageUploaded",
"people.jpg"))
data.people <- x %>% filter(str_detect(Tags,"(?=.*imageUploaded)(?=.*people\\.jpg)"))
data.people
# returns
# User.Id Tags
# 1 34234 imageUploaded,people.jpg,more,comma,separated,stuff
This is simpler and works if "people.jpg" always follows "imageUploaded":
str_extract(x$Tags,"imageUploaded.*people\\.jpg")
