Converting internationally formatted strings to numeric - r

I have a file with internationally formatted numbers (i.e., strings) that include units of measurement. In this case the decimal mark is "," and the thousands separator is "." (i.e., German number formats).
a <- c('2.200.222 €',
' 180.109,3 €')
or
b <- c('28,42 m²',
'47,70 m²')
I'd like to convert these strings efficiently to numeric. I've tried to extract the numbers with code like
require(stringr)
str_extract(a, pattern='[0-9]+.[0-9]+.[0-9]+')
str_extract(b, pattern='[0-9]+,[0-9]+')
However, this seems too error-prone, and I guess there must be a more standardized way. So here's my question: Is there a custom function, package or something else that can handle such a problem?
Thank you very much!

Here is a function that uses gsub to deal with the sample data you posted:
x <- c('2.200.222 €', ' 180.109,3 €', '28,42 m²', '47,70 m²')
strip <- function(x){
  z <- gsub("[^0-9,.]", "", x)
  z <- gsub("\\.", "", z)
  gsub(",", ".", z)
}
as.numeric(strip(x))
[1] 2200222.00 180109.30 28.42 47.70
It works like this:
First strip out every character that is not a digit, comma, or period (this removes the spaces and the units).
Then strip out all periods.
Finally, convert commas to periods.
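To make the three steps easier to follow, here is a minimal sketch that runs them one at a time on the sample data and keeps the intermediate results. The names step1, step2 and step3 are only illustrative and are not part of the answer above.

```r
x <- c("2.200.222 €", " 180.109,3 €", "28,42 m²", "47,70 m²")

# Step 1: keep only digits, commas and periods (drops spaces and units)
step1 <- gsub("[^0-9,.]", "", x)
# Step 2: remove the periods used as thousands separators
step2 <- gsub(".", "", step1, fixed = TRUE)
# Step 3: turn the decimal comma into a decimal point
step3 <- gsub(",", ".", step2, fixed = TRUE)

step1  # "2.200.222" "180.109,3" "28,42" "47,70"
step2  # "2200222" "180109,3" "28,42" "47,70"
step3  # "2200222" "180109.3" "28.42" "47.70"
as.numeric(step3)
```

Note that this only works when "." is always a thousands separator and "," is always the decimal mark; strings in the English convention would be converted incorrectly.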

Related

String Manipulation in R data frames

I just learnt R and was trying to clean data for analysis with string manipulation, using the code below on the Amount_USD column of a table. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
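To see what the negative lookahead contributes, the following sketch compares the pattern with and without it. It uses base R with perl = TRUE (PCRE) instead of stringr, which is an assumption made only to keep the example free of dependencies; the variable names are illustrative.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# With the lookahead, the match cannot start at the "0" that ends "\xa0"
with_lookahead <- regmatches(vec, regexpr("(?!0)[\\d,]+$", vec, perl = TRUE))
# Without it, that trailing "0" of the prefix is swallowed into the number
without_lookahead <- regmatches(vec, regexpr("[\\d,]+$", vec, perl = TRUE))

with_lookahead     # "10,000,000" "16,200,000" "19,350,000"
without_lookahead  # "010,000,000" "016,200,000" "019,350,000"
```

The lookahead only guards the first character of the match, so this relies on the amounts themselves never starting with 0.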
There were a few minor issues with your code.
The main one is two unneeded backslashes in the string you compare against. This also leads to a counting error in your first str_sub(), where you should take the first 8 characters, not 10. Finally, the substring you keep should start at the character right after the matched text (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
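The character counts above can be checked directly. This is a small sketch on a made-up vector (amounts, prefix and cleaned are hypothetical names, not columns from the question), using only base R:

```r
# The unwanted text is 8 literal characters: \ x c 2 \ x a 0
prefix <- "\\xc2\\xa0"
nchar(prefix)  # 8

# Illustrative data; the last element has no prefix at all
amounts <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "19,350,000")

# Drop the prefix only where it is present, starting at position 9
cleaned <- ifelse(startsWith(amounts, prefix),
                  substring(amounts, nchar(prefix) + 1),
                  amounts)

# Remove the thousands separators and convert to numeric
values <- as.numeric(gsub(",", "", cleaned, fixed = TRUE))
values
```

Computing the start position from nchar(prefix) avoids hard-coding 8 and 9, which is where the original counting error came from.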

Add thousand separator to levels in cut function

My x axis labels look like [100000,250000], which makes it hard to read the number at first sight. I want them to look like [100.000,250.000]. I know that the cut2 function has a formatfun parameter, but I don't think I know how to use it properly.
Try using the "formatC" function on your cut data. e.g.
formatC(my_cuts, big.mark = ".", decimal.mark = ",")
Let's create an example to work on:
x <- cut(seq(0,1,length.out=8) + 1e6, 3)
This is a factor. Although at bottom it's a numeric array, you don't want to format its values; you want to format its levels, which are the strings associated with its values. This is what the levels look like in the example (calling head to prevent lots of printing in case x has many distinct levels):
(head(levels(x)))
[1] "(1000000,1000000.3]" "(1000000.3,1000000.7]" "(1000000.7,1000001]"
To format the levels, we need to pick them apart into their numeric components (which are separated by a comma ","), format each component, and reassemble the results.
Here's the picking-apart-and-formatting step in one go, using only base R functionality. It calls gsub and strsplit on the first line (for cleaning out the "(" and "]" characters and splitting each pair of numeric strings into two strings) and employs prettyNum on the second line (for the formatting), which conveniently will format any character string that looks like a number:
s <- lapply(strsplit(gsub("]|[(]", "", levels(x)), ","),
prettyNum, big.mark=".", decimal.mark=",", input.d.mark=".", preserve.width="individual")
(You might not need the input.d.mark argument, but I did because my locale uses "." for a decimal point, as you could see above. The docs say "individual" is the default for setting the output width, but that just isn't the case on my system: I had to specify it explicitly.)
The paste* functions will perform the reassembly, whose results we simply re-assign to the levels of x:
levels(x) <- paste0("(", sapply(s, function(a) paste0(a, collapse="; ")), "]")
(Since each number potentially already includes "," and "." delimiters, I have specified a third punctuation mark, ";", to separate the numbers themselves -- but you may use what you wish, of course.)
Let's display the new levels to verify the results:
(head(levels(x)))
[1] "(1.000.000; 1.000.000,3]" "(1.000.000,3; 1.000.000,7]" "(1.000.000,7; 1.000.001]"
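The pick-apart, format and reassemble steps can also be wrapped in a small function. The sketch below is one possible variant, not the code from the answer: format_cut_levels is a hypothetical helper name, it converts the endpoints to numeric and uses format() instead of prettyNum(), and it assumes the default "(a,b]" style of levels.

```r
# Hypothetical helper: reformat "(a,b]" style levels produced by cut()
format_cut_levels <- function(f, big.mark = ".", decimal.mark = ",", sep = "; ") {
  # Drop the brackets and split each level into its two endpoints
  ends <- strsplit(gsub("[][()]", "", levels(f)), ",", fixed = TRUE)
  new_levels <- vapply(ends, function(p) {
    # Format each endpoint on its own so that no padding is added
    nums <- vapply(as.numeric(p), format, character(1),
                   big.mark = big.mark, decimal.mark = decimal.mark,
                   scientific = FALSE, trim = TRUE)
    paste0("(", paste(nums, collapse = sep), "]")
  }, character(1))
  levels(f) <- new_levels
  f
}

# dig.lab = 10 keeps cut() from abbreviating the breaks to 3 significant digits
x <- cut(c(100000, 150000, 250000, 400000),
         breaks = c(0, 250000, 500000), dig.lab = 10)
levels(format_cut_levels(x))
# [1] "(0; 250.000]"       "(250.000; 500.000]"
```

Because the endpoints go through as.numeric(), this variant would not work on levels that already contain formatted numbers.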

What is the best way in R to identify the first character in a string?

I am trying to find a way to loop through some data in R that contains both numbers and letters and, starting from the first letter found, return everything after it. For example:
column
000HU89
87YU899
902JUK8
result
HU89
YU899
JUK8
I have tried str_detect / grepl, but the value of the first letter is by nature unknown, so I am having difficulty pulling it out.
We could use str_extract
stringr::str_extract(x, "[A-Z].*")
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
Ronak's answer is simple.
Though I would also like to provide another method:
column <- c("000HU89", "87YU899", "902JUK8")
# Find the location of the first non-digit character in each string
loc <- as.integer(regexpr("[^[:digit:]]", column))
# Extract everything from that character to the right
substring(column, loc)
We can use sub from base R to match one or more digits (\\d+) at the start (^) of the string and replace with blank ("")
sub("^\\d+", "", x)
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
In base R we can do
x <- c("000HU89", "87YU899", "902JUK8")
regmatches(x, regexpr("\\D.+", x))
# [1] "HU89" "YU899" "JUK8"
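The approaches above agree on the sample data but behave differently on edge cases. The following sketch uses a made-up vector x2 (not from the question) containing a string without letters and a string without leading digits:

```r
x2 <- c("000HU89", "12345", "AB12")

# sub() always returns one result per input element
res_sub <- sub("^\\d+", "", x2)
res_sub
# [1] "HU89" ""     "AB12"

# regmatches() silently drops elements that have no match
res_reg <- regmatches(x2, regexpr("\\D.+", x2))
res_reg
# [1] "HU89" "AB12"
```

If the result has to line up with the rows of a data frame, the sub() version is therefore the safer choice.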

Convert superscripted numbers from string into scientific notation (from Unicode, UTF8)

I imported a vector of p-values from an Excel table. The numbers are given as superscripted Unicode strings. After hours of trying I still struggle to convert them into numbers.
See example below. Simple conversion with as.numeric() doesn't work. I also tried to use Regex to capture the superscripted numbers, but it turned out that each superscripted number has a distinct Unicode code, for which there is no translation.
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484", "0.000223")
as.numeric(test)
Does somebody know of an R-package which could do the translation painlessly, or do I have to translate the codes one by one into digits?
This kind of formatting is definitely not very portable... Here's one possible solution though, for the exercise...
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484",
"0.000223")
library(utf8)
library(stringr)
# normalize, i.e. map everything to "normal text"
testnorm <- utf8_normalize(test, map_case = TRUE, map_compat = TRUE)
# replace exponent part
# \\N{Minus Sign} is the unicode name of the minus sign symbol
# (see [ICU regex](http://userguide.icu-project.org/strings/regexp))
# it is necessary because the "-" is not a plain text minus sign...
testnorm <- str_replace_all(testnorm, "x10\\N{Minus Sign}", "e-")
# evaluate these character strings
p_vals <- sapply(X = testnorm,
FUN = function(x) eval(parse(text = x)),
USE.NAMES = FALSE
)
# format() uses a common number of decimals, so every value is printed
# with the digits needed for the "e-48" element
format(p_vals, digits = 2, scientific = FALSE)
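If adding packages is not an option, a base R variant of the same idea is possible. This is only a sketch under the assumption that the strings always follow the "x10" plus superscript pattern shown in the question and that the session handles UTF-8; the names superscripts and plain are illustrative.

```r
# Shortened example data, with the superscripts written as Unicode escapes
test <- c("0.0126", "4.26x10\u207b\u00b9\u2074",
          "6.36x10\u207b\u2074\u2078", "4.35x10\u207b\u2079")

# Superscript digits 0-9 followed by the superscript minus sign
superscripts <- "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207b"

# Translate each superscript character to its plain counterpart
plain <- chartr(superscripts, "0123456789-", test)

# "4.26x10-14" becomes "4.26e-14", which as.numeric() understands
p_vals <- as.numeric(sub("x10", "e", plain, fixed = TRUE))
p_vals
```

Using as.numeric() here also avoids evaluating the strings as R code.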

Return the character associated with the specified Ascii code in R

Good afternoon,
I'm trying to create a cartesian product in R with the letters of the alphabet.
What I'm actually trying is this:
First I create a matrix with the letters
a <- as.matrix(seq(97,122,by=1))
Then I create a data frame with 2 columns with all the combinations
b <- expand.grid(a, a)
Finally I combine the 2 columns
apply(b,1,paste,collapse=" ")
The problem I have is that I can't find a way to transform those "decimals" to their ASCII characters.
I have tried several things like rawToChar and gsub unsuccessfully.
Can somebody point me in the right direction?
Thanks
A very easy way to return a character based on its ASCII code is the function intToUtf8. It also works for vectors including multiple integers and returns the corresponding characters as one string.
vec <- 97:122
intToUtf8(vec)
# [1] "abcdefghijklmnopqrstuvwxyz"
intToUtf8(65)
# [1] "A"
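For the cartesian product in the question, one character per code is needed rather than a single string. A minimal sketch using the multiple argument of intToUtf8 (chars and pairs are illustrative names):

```r
# multiple = TRUE returns one string per code instead of one long string
chars <- intToUtf8(97:122, multiple = TRUE)

# All ordered pairs; the first column of expand.grid varies fastest
pairs <- do.call(paste, expand.grid(chars, chars, stringsAsFactors = FALSE))

length(pairs)   # 676
head(pairs, 3)  # "a a" "b a" "c a"
```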
First direct method:
res <- do.call(paste, expand.grid(letters, letters))
If you've some other ascii values and you want to get equivalent characters:
val <- 65:96 # whatever values you want the equivalent characters for
mode(val) <- "raw" # set mode to raw
# alternatively, val <- as.raw(65:96)
a <- sapply(val, rawToChar)
res <- do.call(paste, expand.grid(a, a))
To print an ASCII char in R you can use the print function with a backslash \ before the character's code written in octal. For example, print("\150") prints "h", because octal 150 is decimal 104, the ASCII code of "h".
Or for your example above you could try:
a <- sapply(97:122,function(x) rawToChar(as.raw(x)))
b <- expand.grid(a,a)
c <- t(apply(b,1,function(x) paste(x[1],x[2])))
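One detail about the backslash escape mentioned above: R reads the digits after the backslash as an octal number, not a decimal one. A short sketch to check this (esc is an illustrative name):

```r
# "\150" is an octal escape: 1*64 + 5*8 + 0 = 104
esc <- "\150"
esc                       # "h"

strtoi("150", base = 8L)  # 104
utf8ToInt(esc)            # 104
intToUtf8(104)            # "h"
```

For decimal codes, intToUtf8() or rawToChar(as.raw()) as used above is less error-prone than writing escapes by hand.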
