I imported a vector of p-values from an Excel table. The numbers are given as superscripted Unicode strings. After hours of trying I still struggle to convert them into numbers.
See example below. Simple conversion with as.numeric() doesn't work. I also tried to use Regex to capture the superscripted numbers, but it turned out that each superscripted number has a distinct Unicode code, for which there is no translation.
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484", "0.000223")
as.numeric(test)
Does somebody know of an R-package which could do the translation painlessly, or do I have to translate the codes one by one into digits?
This kind of formatting is definitely not very portable... Here's one possible solution though, for the exercise...
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484",
"0.000223")
library(utf8)
library(stringr)
# normalize, i.e. map compatibility characters (such as superscripts) to "normal text"
testnorm <- utf8_normalize(test, map_case = TRUE, map_compat = TRUE)
# replace exponent part
# \\N{Minus Sign} is the unicode name of the minus sign symbol
# (see [ICU regex](http://userguide.icu-project.org/strings/regexp))
# it is necessary because the "-" is not a plain text minus sign...
testnorm <- str_replace_all(testnorm, "x10\\N{Minus Sign}", "e-")
# convert the cleaned-up strings to numbers
# (as.numeric() understands "4.26e-14", so eval(parse()) is not needed)
p_vals <- as.numeric(testnorm)
# format() uses one common layout for the whole vector, so every value
# is printed with enough decimals to show the "e-48" element...
format(p_vals, digits = 2, scientific = FALSE)
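If installing extra packages is not an option, the superscripts can also be translated by hand in base R. The following is only a minimal sketch, not part of the answer above: to_plain, sup_codes and plain_codes are hypothetical names, and it assumes the strings are UTF-8 encoded and always use the "x10" form shown in the question.

```r
# Code points of the superscript characters, in the order 0-9, minus, plus
# (superscript 1, 2 and 3 live in a different Unicode block than the rest).
sup_codes   <- c(0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074:0x2079, 0x207B, 0x207A)
plain_codes <- utf8ToInt("0123456789-+")

# Hypothetical helper: replace every superscript character in one string
# by its plain ASCII counterpart and leave all other characters alone.
to_plain <- function(s) {
  cp  <- utf8ToInt(s)
  hit <- match(cp, sup_codes)
  cp[!is.na(hit)] <- plain_codes[hit[!is.na(hit)]]
  intToUtf8(cp)
}

test  <- c("0.0126", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸", "4.35x10⁻⁹")
plain <- vapply(test, to_plain, character(1), USE.NAMES = FALSE)

# "4.26x10-14" becomes "4.26e-14", which as.numeric() can read directly.
p_vals <- as.numeric(sub("x10", "e", plain, fixed = TRUE))
print(p_vals)
```

Strings without an exponent part, such as "0.0126", pass through unchanged because none of their characters appear in the lookup table.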
A small snippet of my overall code rounds a given vector to a specified number of decimal places. The rounded value is then converted to scientific notation, e.g. "1.2e-01".
The following does the rounding which works fine.
values <- c(0.1234, 0.5678)
dig <- 2
rounded_vals <- round(values, dig) %>% str_trim()
When I run the following code I expect to see the same output for both lines.
format(rounded_vals, scientific = TRUE)
format(c(0.12, 0.56), scientific = TRUE)
What I actually get is:
> format(rounded_vals, scientific = TRUE)
[1] "0.12" "0.57"
> format(c(0.12, 0.56), scientific = TRUE)
[1] "1.2e-01" "5.6e-01"
Why doesn't format(rounded_vals, scientific = TRUE) return the same output and how can I adjust the code to do so?
Would appreciate any input :)
Edit: I missed out a bit of code that was causing the problem. It seems that str_trim() converts to character.
I think rounded_vals might have been stored as character values. Try converting them to numbers using as.numeric() and then put it into the format() function.
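A minimal sketch of that suggestion, using base R's trimws() as a stand-in for str_trim() so that it runs without stringr (the variable names rounded_chr and rounded_num are made up for illustration):

```r
values <- c(0.1234, 0.5678)
dig <- 2

# Trimming whitespace works on strings, so the result is a character vector.
rounded_chr <- trimws(as.character(round(values, dig)))
is.character(rounded_chr)

# format() only applies scientific notation to numbers, so convert back first.
rounded_num <- as.numeric(rounded_chr)
out <- format(rounded_num, scientific = TRUE)
print(out)
```

With the values from the question this should give "1.2e-01" and "5.7e-01". Dropping the trimming step altogether avoids the round trip through character.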
I have a data frame that I want to round to two decimal places; it currently has 16. This is how I do it:
##rounding numbers
p32_us1rounding = read.csv("p32_us1_ff6.csv")
#print(p32_us1rounding)
dfp32_us1 = data.frame(p32_us1rounding[-1])
dfp32_us1 <- round(dfp32_us1, digits = 2)
write.csv(dfp32_us1, "32 Fama French for US Market on or before 01-08-2005.csv")
This works perfectly, but a few rows contain an extremely small number that I would like to keep in scientific notation. For example, the first column of the 8th row is 7.12206157653355e-67. This obviously rounds down to zero, but I would like it to say 7.12E-67. Is there any way to do this while still rounding the other numbers?
You will have to convert the numbers to character strings to use different formatting. Then just use the quote=FALSE argument in write.csv to prevent them from being surrounded by quotation marks:
set.seed(42)
x <- runif(100)*100/39
x <- matrix(x, 10, 10)
y <- ifelse(x > .1, sprintf("%.2f", x), sprintf("%.2e", x))
write.csv(y, "Test.csv", quote=FALSE, row.names=FALSE)
cat(readLines("Test.csv", 4), sep="\n")
# V1,V2,V3,V4,V5,V6,V7,V8,V9,V10
# 2.35,1.17,2.32,1.89,0.97,0.85,1.73,0.11,1.49,1.71
# 2.40,1.84,0.36,2.08,1.12,0.89,2.52,0.36,0.40,6.13e-04
# 0.73,2.40,2.54,1.00,9.60e-02,1.02,1.95,0.55,0.92,0.53
The last value in the second data row (6.13e-04) and the fifth value in the third data row (9.60e-02) are in scientific notation.
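The same idea can be applied column by column to a data frame like the one in the question. This is only a sketch: format_mixed is a hypothetical helper, the 0.01 threshold is an arbitrary assumption, and the small data frame stands in for the real CSV file.

```r
# Hypothetical helper: fixed notation for ordinary values, scientific
# notation (upper-case E) for non-zero values smaller than the threshold.
format_mixed <- function(x, digits = 2, threshold = 0.01) {
  tiny <- x != 0 & abs(x) < threshold
  ifelse(tiny,
         sprintf(paste0("%.", digits, "E"), x),
         sprintf(paste0("%.", digits, "f"), x))
}

# Stand-in data; the first value is the one quoted in the question.
df <- data.frame(a = c(7.12206157653355e-67, 1.23456),
                 b = c(0.5, 12.3456))

# Replace every column by its formatted (character) version.
df[] <- lapply(df, format_mixed)

# With no file argument, write.csv() prints to the console.
write.csv(df, quote = FALSE, row.names = FALSE)
```

The columns are character after this step, so it is best done last, right before writing the file.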
My x-axis labels look like [100000,250000], which makes the numbers hard to read at first sight. I want them to look like [100.000,250.000]. I know that the cut2 function has a formatfun parameter, but I don't think I know how to use it properly.
Try using the "formatC" function on your cut data. e.g.
formatC(my_cuts, big.mark = ".", decimal.mark = ",")
Let's create an example to work on:
x <- cut(seq(0,1,length.out=8) + 1e6, 3)
This is a factor. Although at bottom it's a numeric array, you don't want to format its values; you want to format its levels, which are the strings associated with its values. This is what the levels look like in the example (calling head to prevent lots of printing in case x has many distinct levels):
(head(levels(x)))
[1] "(1000000,1000000.3]" "(1000000.3,1000000.7]" "(1000000.7,1000001]"
To format the levels, we need to pick them apart into their numeric components (which are separated by a comma ","), format each component, and reassemble the results.
Here's the picking-apart-and-formatting step in one go, using only base R functionality. It calls gsub and strsplit on the first line (for cleaning out the "(" and "]" characters and splitting each pair of numeric strings into two strings) and employs prettyNum on the second line (for the formatting), which conveniently will format any character string that looks like a number:
s <- lapply(strsplit(gsub("]|[(]", "", levels(x)), ","),
prettyNum, big.mark=".", decimal.mark=",", input.d.mark=".", preserve.width="individual")
(The input.d.mark argument is needed because the input strings use "." as the decimal point, as you could see above, while prettyNum takes input.d.mark from decimal.mark by default, which is set to "," here. The docs list "individual" as the default output width for formatC, but with prettyNum I had to specify it explicitly.)
The paste* functions will perform the reassembly, whose results we simply re-assign to the levels of x:
levels(x) <- paste0("(", sapply(s, function(a) paste0(a, collapse="; ")), "]")
(Since each number potentially already includes "," and "." delimiters, I have specified a third punctuation mark, ";", to separate the numbers themselves -- but you may use what you wish, of course.)
Let's display the new levels to verify the results:
(head(levels(x)))
[1] "(1.000.000; 1.000.000,3]" "(1.000.000,3; 1.000.000,7]" "(1.000.000,7; 1.000.001]"
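The steps above can be wrapped into a small reusable function. This is a sketch under stated assumptions: format_interval_labels is a hypothetical name, each label is assumed to have the form bracket, number, comma, number, bracket, and the numbers are assumed to use "." as the decimal point on input. Unlike the code above, it keeps whatever opening and closing brackets the labels already have.

```r
# Hypothetical helper: reformat interval labels such as "(100000,250000]".
format_interval_labels <- function(labels, big.mark = ".", decimal.mark = ",") {
  n     <- nchar(labels)
  open  <- substr(labels, 1, 1)        # "(" or "["
  close <- substr(labels, n, n)        # "]" or ")"
  inner <- substr(labels, 2, n - 1)    # "100000,250000"
  parts <- strsplit(inner, ",", fixed = TRUE)
  formatted <- vapply(parts, function(p) {
    paste(prettyNum(p, big.mark = big.mark, decimal.mark = decimal.mark,
                    input.d.mark = ".", preserve.width = "individual"),
          collapse = "; ")
  }, character(1))
  paste0(open, formatted, close)
}

labels <- c("(100000,250000]", "[250000,500000)")
new_labels <- format_interval_labels(labels)
print(new_labels)
```

For a factor x produced by cut, the result would be assigned back with levels(x) <- format_interval_labels(levels(x)).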
I would like to convert this:
AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1
to this:
ELA-3
I tried this function:
str_extract(., pattern = ":?(ELA).*(\\d\\-)")
it printed this:
"ELA-NH-COMBINED-3-"
I need to get rid of the text or anything between the two extracts. The number will be a number between 3 and 9. How should I modify my expression in pattern =?
Thanks!
1) Match everything up to -ELA followed by anything (.*) up to - followed by captured digits (\\d+) followed by - followed by anything. Then replace that with ELA- followed by the captured digits. No packages are used.
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub(".*-ELA.*-(\\d+)-.*", "ELA-\\1", x)
## [1] "ELA-3"
2) Another approach if there is only one numeric field is that we can read in the fields, grep out the numeric one and preface it with ELA- . No packages are used.
s <- scan(text = x, what = "", quiet = TRUE, sep = "-")
paste("ELA", grep("^\\d+$", s, value = TRUE), sep = "-")
## [1] "ELA-3"
TL;DR
You can't do that with a single call to str_extract, because a single match operation cannot cover discontinuous portions of a text: a match is always one contiguous substring, so pieces that are separated by other text cannot end up in one match or one group.
Work-arounds/Solutions
There are two solutions:
Capture parts of text you need and then join them (2 operations: match + join)
Capture parts of text you need and then replace with backreferences to the groups needed (1 replace operation)
Capturing groups store the parts of the text they match in separate buffers, but you also need a function that can access these chunks.
Here, in R, str_extract drops them, but str_match keeps them in the result.
library(stringr)
s <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
m <- str_match(s, ":?(ELA).*-(\\d+)")
paste0(m[,2], "-", m[,3])
This prints ELA-3.
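The same match-then-join idea can be sketched without stringr, using regexec and regmatches from base R. The second and third input strings are made up for illustration, and the pattern assumes the wanted number is a single digit from 3 to 9 surrounded by hyphens, as described in the question.

```r
x <- c("AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1",
       "AIR-GEN-SUM-UD-ELA-NH-COMBINED-7-SEG2",   # made-up second example
       "NO-MATCH-HERE")                           # made-up non-matching example

# regexec() records the positions of the full match and of each capturing
# group; regmatches() turns them into strings (character(0) when nothing matched).
m <- regmatches(x, regexec("-(ELA)-.*-([3-9])-", x))

# Join group 1 and group 2; return NA where the pattern did not match.
result <- vapply(m, function(g) {
  if (length(g) == 3) paste(g[2], g[3], sep = "-") else NA_character_
}, character(1))
print(result)
```

Each element of m holds the full match first, followed by the two groups, which is why the groups sit at positions 2 and 3.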
Another way is to replace while capturing the parts you need to keep and then using backreferences to those parts in the replacement pattern:
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub("^.*-ELA.*?-([^-]+)-[^-]+$", "ELA-\\1", x)
I have a file with internationally formatted numbers (i.e. strings) including units of measurement. In this case the decimal mark is "," and the thousands separator is "." (i.e. German number formats).
a <- c('2.200.222 €',
' 180.109,3 €')
or
b <- c('28,42 m²',
'47,70 m²')
I'd like to convert these strings efficiently to numeric. I've tried to filter out numbers by codes like
require(stringr)
str_extract(a, pattern='[0-9]+.[0-9]+.[0-9]+')
str_extract(b, pattern='[0-9]+,[0-9]+')
However, this seems too error-prone, and I guess there must be a more standardized way. So here's my question: is there a custom function, package or something else that can handle such a problem?
Thank you very much!
Here is a function that uses gsub to deal with the sample data you posted:
x <- c('2.200.222 €', ' 180.109,3 €', '28,42 m²', '47,70 m²')
strip <- function(x){
z <- gsub("[^0-9,.]", "", x)
z <- gsub("\\.", "", z)
gsub(",", ".", z)
}
as.numeric(strip(x))
[1] 2200222.00 180109.30 28.42 47.70
It works like this:
First, strip out every character that is not a digit, a comma, or a period (currency symbols, units, and whitespace).
Then strip out all periods.
Finally, convert commas to periods.
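If other number formats turn up, the three steps can be parameterised. This is a hedged sketch rather than a tested general-purpose parser: parse_localized_number is a hypothetical name, it assumes both marks are single punctuation characters that need no escaping inside a character class, and it additionally keeps a minus sign, which the original function does not.

```r
# Hypothetical helper: same three steps as strip(), with configurable marks.
parse_localized_number <- function(x, decimal_mark = ",", grouping_mark = ".") {
  # 1. Keep only digits, the two marks and a minus sign.
  keep <- paste0("[^0-9", decimal_mark, grouping_mark, "-]")
  z <- gsub(keep, "", x)
  # 2. Drop the grouping mark.
  z <- gsub(grouping_mark, "", z, fixed = TRUE)
  # 3. Turn the decimal mark into the "." that as.numeric() expects.
  z <- sub(decimal_mark, ".", z, fixed = TRUE)
  as.numeric(z)
}

x <- c('2.200.222 €', ' 180.109,3 €', '28,42 m²', '47,70 m²')
german <- parse_localized_number(x)
print(german)

# English-style input: "," groups thousands and "." is the decimal mark.
english <- parse_localized_number("1,234.5 USD", decimal_mark = ".", grouping_mark = ",")
print(english)
```

Using fixed = TRUE in steps 2 and 3 means the marks are treated as literal characters, so the "." does not have to be escaped there.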