Why does this regex not match decimal numbers? - r

([.[:digit:]]+)
I am thinking this should match decimal numbers like 25.8 or 0.6 ..., but it seems to give up at the "non-digit" part of the match... so I only get 25 or 0
I have tried to escape the "." with \. and .
I am doing this in R, using gregexpr().
Here is a minimal reproducible example:
test
[1] " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname
[1] "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([\\.[:digit:]]+)[:blank:]*?"
> gregexpr( LABregexlabname, test)
[[1]]
[1] 11
attr(,"match.length")
[1] 46
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
substring( test, 11, 11+46)
[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10"

Put the last [:blank:] inside brackets, as [[:blank:]], and use perl=TRUE. POSIX classes such as [:blank:] are only valid inside a bracket expression.
test <- " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)[[:blank:]]*?"
regmatches(test, regexpr(LABregexlabname, test, perl=TRUE))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99"
It looks like TRE switches to minimal matching everywhere when the pattern ends with a lazy ?. If you remove the ?, TRE also returns the whole number, but together with all the trailing spaces. So maybe drop [[:blank:]]* as well?
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)[[:blank:]]*"
regmatches(test, regexpr(LABregexlabname, test))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)"
regmatches(test, regexpr(LABregexlabname, test))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99"
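If only the number is wanted rather than the whole matched line, a capture group can be pulled out with regexec(). This is a minimal sketch, assuming the value is always preceded by a blank. The extra [[:blank:]] in front of the group and the names pattern and value are additions for illustration, not part of the answer above. Without that blank, the greedy [[:print:][:blank:]]+ (with perl = TRUE) would consume most of the number and leave only its last character for the group.

```r
test <- " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "

# Assumption: a blank always separates the lab text from the value.
pattern <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+[[:blank:]]([.[:digit:]]+)"

# regexec() returns the full match followed by each capture group.
m <- regmatches(test, regexec(pattern, test, perl = TRUE))

# Element 1 is the whole match, element 2 is the first capture group.
value <- m[[1]][2]
print(value)
print(as.numeric(value))
```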

Related

Regex: Capturing Numbers at Beginning and Negating Numbers After Characters

I need to capture the 3.93, 4.63999..., and -5.35. I've tried all kinds of variations, but have been unable to grab the correct set of numbers.
Copay: 20.30
3.93
TAB 8.6MG Qty:60
4.6399999999999997
-5.35
2,000UNIT TAB Qty:30
AMOUNT
Qty:180
CAP 4MG
x = c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG");
grep("^[\\-]?\\d+[\\.]?\\d+$", x);
Output (see ?grep):
[1] 2 4 5
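One thing to be aware of with the pattern above: \\d+[\\.]?\\d+ needs at least two digits, so a lone "7" would not match. The following is a hedged sketch of a variant that makes the fractional part optional as a group. The name relaxed and the extra "7" entry are hypothetical additions for illustration.

```r
x <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997",
       "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG",
       "7")  # "7" is a hypothetical extra entry, not in the original data

strict  <- "^[\\-]?\\d+[\\.]?\\d+$"   # pattern from the answer
relaxed <- "^-?\\d+(\\.\\d+)?$"       # assumed variant: fraction is optional

strict_hits  <- grep(strict, x, perl = TRUE)
relaxed_hits <- grep(relaxed, x, perl = TRUE)

print(strict_hits)    # indices matched by the original pattern
print(relaxed_hits)   # additionally includes the single-digit entry

# Convert the matched entries to numbers
numbers <- as.numeric(x[relaxed_hits])
print(numbers)
```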
If leading/trailing spaces are allowed, change the regex in the grep call to
"^\\s*[\\-]?\\d+[\\.]?\\d+\\s*$"
Try this
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
library(stringr)
ans <- str_extract_all(S, "-?[[:digit:]]*(\\.|,)?[[:digit:]]+", simplify=TRUE)
clean <- ans[ans!=""]
Output
[1] "20.30" "3.93" "8.6"
[4] "4.6399999999999997" "-5.35" "2,000"
[7] "180" "4" "60"
[10] "30"
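The same extraction can probably be done without stringr. This is a sketch in base R, reusing the pattern from the answer. Note that unlist() returns the matches in input order (string by string), which differs from the column-wise order of the simplified matrix above.

```r
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997",
       "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")

pattern <- "-?[[:digit:]]*(\\.|,)?[[:digit:]]+"

# gregexpr() finds every match per string; regmatches() extracts them as a list.
matches <- regmatches(S, gregexpr(pattern, S, perl = TRUE))

# Strings without a match contribute character(0), which unlist() drops.
clean <- unlist(matches)
print(clean)
```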

Issue with strsplit not storing searched field

I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the field used to split the string. Is there any way I can preserve it?
Thanks
You are almost there, just append \\K\\s* to your regex and prepend with the ^, start of string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far. So after matching 1+ digits, optionally followed by a - surrounded by 0+ whitespaces and then 1+ digits, at the start of the string, all of that text is dropped from the match. Only the 0+ whitespaces end up in the match value, and they are what gets split on.
This outputs:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however, also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possible range) followed by rest of the address", you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street
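As a variation on the two sub() calls, both parts can be captured in a single pass with regexec(). This is only a sketch, assuming every address starts with a number or number range. The anchored pattern and the intermediate name parts are illustrative choices, not taken from the answers above.

```r
df <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
        "95 - 99 Fake Street", "99 Fake Street")

# Group 1: leading number or range; group 2: everything after the separating whitespace.
pattern <- "^(\\d+(?:\\s*-\\s*\\d+)?)\\s*(.*)$"

# Each list element is c(full match, group 1, group 2); rbind stacks them into a matrix.
# Assumption: every string matches, otherwise rows would be missing or misaligned.
parts <- do.call(rbind, regmatches(df, regexec(pattern, df, perl = TRUE)))

splitDf <- data.frame(numberPart = parts[, 2], rest = parts[, 3],
                      stringsAsFactors = FALSE)
print(splitDf)
```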

Split string in R

I am trying to split the output of the "ls -lrt" command from Linux, but strsplit treats only a single space as the delimiter. Where there are two consecutive spaces, the second one produces an empty value. So I think I need to collapse multiple spaces into one. Does anybody have any idea how to do this?
> a <- try(system("ls -lrt | grep -i .rds", intern = TRUE))
> a
[1] "-rw-r--r-- 1 u7x9573 sashare 2297 Jun 9 16:10 abcde.RDS"
[2] "-rw-r--r-- 1 u7x9573 sashare 86704 Jun 9 16:10 InputSource2.rds"
> str(a)
chr [1:6] "-rw-r--r-- 1 u7x9573 sashare 2297 Jun 9 16:10 abcde.RDS" ...
>
>c = strsplit(a," ")
>c
[[1]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare" ""
[6] "2297" "Jun" "" "9" "16:10"
[11] "abcde.RDS"
[[2]]
[1] "-rw-r--r--" "1" "u7x9573" "sashare"
[5] "86704" "Jun" "" "9"
[9] "16:10" "InputSource2.rds"
In next step I needed just file name and I used following code which worked fine:
mtrl_name <- try(system("ls | grep -i .rds", intern = TRUE))
This returns that info in a data frame for the indicated files:
file.info(list.files(pattern = "[.]rds$", ignore.case = TRUE))
or if we knew the extensions were lower case:
file.info(Sys.glob("*.rds"))
strsplit takes a regular expression so we can use those to help out. For more info read ?regex
> x <- "Spaces everywhere right? "
> # Not what we want
> strsplit(x, " ")
[[1]]
[1] "Spaces" "" "" "everywhere" "right?"
[6] ""
> # Use " +" to tell it to split on 1 or more space
> strsplit(x, " +")
[[1]]
[1] "Spaces" "everywhere" "right?"
> # If we want to be more explicit and catch the possibility of tabs, new lines, ...
> strsplit(x, "[[:space:]]+")
[[1]]
[1] "Spaces" "everywhere" "right?"
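Applied to the ls output from the question, the same idea might look like the sketch below. The vector a is typed in by hand with runs of spaces (an assumption about how the original output was aligned), and file_names is a hypothetical name. File names that contain spaces would break this approach.

```r
# Hand-typed stand-in for the system("ls -lrt ...") output; spacing is assumed.
a <- c("-rw-r--r-- 1 u7x9573 sashare  2297 Jun  9 16:10 abcde.RDS",
       "-rw-r--r-- 1 u7x9573 sashare 86704 Jun  9 16:10 InputSource2.rds")

# Split on runs of whitespace so consecutive spaces do not create empty fields.
fields <- strsplit(a, "[[:space:]]+")
print(lengths(fields))

# The file name is the last field of each line.
file_names <- vapply(fields, function(f) f[length(f)], character(1))
print(file_names)
```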

How to format numbers in R, specifying the number of significant digits but keep significant zeroes and integer part?

I've been struggling with formatting numbers in R using what I feel are very sensible rules. What I would want is to specify a number of significant digits (say 3), keep significant zeroes, and also keep all digits before the decimal point, some examples (with 3 significant digits):
1.23456 -> "1.23"
12.3456 -> "12.3"
123.456 -> "123"
1234.56 -> "1235"
12345.6 -> "12346"
1.50000 -> "1.50"
1.49999 -> "1.50"
Is there a function in R that does this kind of formatting? If not, how could it be done?
I feel these are quite sensible formatting rules, yet I have not managed to find a function that formats in this way in R. As far as I googled this is not a duplicate of many similar questions such as this
Edit:
Inspired by the two good answers I put together a function myself that I believe works for all cases:
sign_digits <- function(x,d){
s <- format(x,digits=d)
if(grepl("\\.", s) && ! grepl("e", s)) {
n_sign_digits <- nchar(s) -
max( grepl("\\.", s), attr(regexpr("(^[-0.]*)", s), "match.length") )
n_zeros <- max(0, d - n_sign_digits)
s <- paste(s, paste(rep("0", n_zeros), collapse=""), sep="")
}
s
}
format(num, digits=3) comes very close.
format(1.23456,digits=3)
# [1] "1.23"
format(12.3456,digits=3)
# [1] "12.3"
format(123.456,digits=3)
# [1] "123"
format(1234.56,digits=3)
# [1] "1235"
format(12345.6,digits=3)
# [1] "12346"
format(1.5000,digits=3)
# [1] "1.5"
format(1.4999,digits=3)
# [1] "1.5"
Your rules are not actually internally consistent. You want 1234.56 to round down to 1234, yet you want 1.4999 to round up to 1.5.
EDIT: This appears to deal with the very valid point made by @Henrik.
sigDigits <- function(x,d){
z <- format(x,digits=d)
if (!grepl("[.]",z)) return(z)
require(stringr)
return(str_pad(z,d+1,"right","0"))
}
z <- c(1.23456, 12.3456, 123.456, 1234.56, 12345.6, 1.5000, 1.4999)
sapply(z,sigDigits,d=3)
# [1] "1.23" "12.3" "123" "1235" "12346" "1.50" "1.50"
As @jlhoward points out, your rounding rule is not consistent. Hence you should use a conditional statement:
x <- c(1.23456, 12.3456, 123.456, 1234.56, 12345.6, 1.50000, 1.49999)
ifelse(x >= 100, sprintf("%.0f", x), ifelse(x < 100 & x >= 10, sprintf("%.1f", x), sprintf("%.2f", x)))
# "1.23" "12.3" "123" "1235" "12346" "1.50" "1.50"
It's hard to say the intended usage, but it might be better to use consistent rounding. Exponential notation could be an option:
sprintf("%.2e", x)
[1] "1.23e+00" "1.23e+01" "1.23e+02" "1.23e+03" "1.23e+04" "1.50e+00" "1.50e+00"
sig0=\(x,y){
dig=abs(pmin(0,floor(log10(abs(x)))-y+1))
dig[is.infinite(dig)]=y-1
sprintf(paste0("%.",dig,"f"),x)
}
> v=c(1111,111.11,11.1,1.1,1.99,.01,.001,0,-.11,-.9,-.000011)
> paste(sig0(v,2),collapse=" ")
[1] "1111 111 11 1.1 2.0 0.010 0.0010 0.0 -0.11 -0.90 -0.000011"
Or the following is almost the same with the exception that 0 is converted to 0 and not 0.0 (fg is a special version of f where the digits specify significant digits and not digits after the decimal point, and the # flag causes fg to not drop trailing zeroes):
> paste(sub("\\.$","",formatC(v,digits=2,format="fg",flag="#")),collapse=" ")
[1] "1111 111 11 1.1 2.0 0.010 0.0010 0 -0.11 -0.90 -0.000011"
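To see how the sig0 approach behaves on the examples from the question, a small check along these lines may help. It restates sig0 with function() so it also runs on R versions older than 4.1. The names expected and result are illustrative.

```r
# Same logic as sig0 above, written with function() instead of the \(x, y) shorthand.
sig0 <- function(x, y) {
  # Decimals needed so that y significant digits are shown (never negative).
  dig <- abs(pmin(0, floor(log10(abs(x))) - y + 1))
  # log10(0) is -Inf; fall back to y - 1 decimals for zero.
  dig[is.infinite(dig)] <- y - 1
  # sprintf() is vectorised over the format string as well as the values.
  sprintf(paste0("%.", dig, "f"), x)
}

x <- c(1.23456, 12.3456, 123.456, 1234.56, 12345.6, 1.50000, 1.49999)
expected <- c("1.23", "12.3", "123", "1235", "12346", "1.50", "1.50")

result <- sig0(x, 3)
print(result)
print(identical(result, expected))
```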

How to write the proper regular expression to extract value from the string?

> str=" 9.48 12.89 13.9 6.79 "
> strsplit(str,split="\\s+")
[[1]]
[1] "" "9.48" "12.89" "13.9" "6.79"
> unlist(strsplit(str,split="\\s+"))->y
> y[y!=""]
[1] "9.48" "12.89" "13.9" "6.79"
How can I get this result with a single regular expression in strsplit, without the extra step
y[y!=""]?
I would just trim the string before splitting it:
strsplit(gsub("^\\s+|\\s+$", "", str), "\\s+")[[1]]
# [1] "9.48" "12.89" "13.9" "6.79"
Alternatively, it is pretty direct to use scan in this case:
scan(text=str)
# Read 4 items
# [1] 9.48 12.89 13.90 6.79
If you want to extract just the numbers, perhaps the following regex would do.
regmatches(str, gregexpr("[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "6.79"
To capture negative numbers you can use the following:
str = " 9.48 12.89 13.9 --6.79 "
regmatches(str, gregexpr("\\-{0,1}[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "-6.79"
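Note that [0-9.]+ would also accept a lone "." or something like "1.2.3". A somewhat stricter pattern, sketched here under the assumption that numbers have an optional sign and an optional fractional part, can be combined with as.numeric(). The names pattern, tokens and values are illustrative.

```r
str <- " 9.48 12.89 13.9 -6.79 "

# Optional minus sign, digits, then optionally a dot followed by more digits.
pattern <- "-?[0-9]+(\\.[0-9]+)?"

tokens <- regmatches(str, gregexpr(pattern, str))[[1]]
print(tokens)

# Convert the extracted strings to numeric values.
values <- as.numeric(tokens)
print(values)
```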
