Convert characters with time units (ms, s, us) into numerics - r

One of the columns in my data frame is a character vector with time span values represented as number+suffix, like so:
c("16.14ms", "7.58ms", "8.38ms", "7.29ms", "6.40ms", "5.76ms",
"5.56ms", "5.27us", "5.12ms", "5.03us", "4.91ms", "4.76ms", "16.12ms",
"7.56ms", "8.59ms", "7.16ms", "6.59ms", "5.91s", "5.62ms", "5.44ms"
)
The units are limited to micro us, milli ms, and full seconds s.
Is there a simple way to make this into a numeric column with all values being either in milliseconds or seconds?

Here are some approaches. We suppose x is the input vector shown in the question.
1) Remove the s, replace m with e-3 and replace u with e-6. Then convert to numeric:
as.numeric(sub("u", "e-6", sub("m", "e-3", sub("s", "", x))))
2) This could also be done neatly using gsubfn. First we match the suffix and then use a replacement list as shown:
library(gsubfn)
as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", s = "e0"), x))
This would be particularly convenient if it were desired to extend the problem to many time units as it would just be a matter of extending the list.
Note that at the top of page 4 of the gsubfn vignette there is an example which is very close to this one.
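If the result should be in milliseconds rather than seconds, a base R lookup table is one option. This is only a minimal sketch: the names to_ms, value and unit are made up for illustration, and it assumes every element is a plain decimal number followed by exactly one of the three suffixes.

```r
x <- c("16.14ms", "5.27us", "5.91s")

# Hypothetical lookup table: multiplier that converts each unit to milliseconds
to_ms <- c(us = 1e-3, ms = 1, s = 1e3)

# Split each element into its numeric part and its unit suffix
value <- as.numeric(sub("[a-z]+$", "", x))
unit  <- sub("^[0-9.]+", "", x)

# Look up the multiplier by unit name and drop the names from the result
ms <- unname(value * to_ms[unit])
ms
```

Supporting another unit would then only mean adding one entry to to_ms.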

Related

Remove part of a string and turn it into a number?

I have a dataframe called "Camera_data" and a column called "Numeric_time"
My "Numeric_time" column is in character format and includes hours, minutes and seconds, it looks like this: 08:40:01
I need to remove the numbers that pertain to seconds and replace the colons with periods to make a decimal number for my time. I need it to look like this: 08.40 in order to turn my time into radians for an analysis I'm running.
I've looked for a few solutions in stringr, but so far can't work out how to consistently take off the last three characters. I think once I have removed the seconds and replaced the : with a . I can just use as.numeric to turn the character column into a numerical column, but would really appreciate any help!
We could do
Camera_data$Numeric_time <- as.numeric(chartr(":", ".",
sub(":\\d{2}$", "", Camera_data$Numeric_time )))
Or use substr
Camera_data$Numeric_time <- as.numeric(chartr(":", ".",
  substr(Camera_data$Numeric_time, 1, nchar(Camera_data$Numeric_time) - 3)))
Using gsub with two capture groups.
as.numeric(gsub('(\\d+):(\\d+).*', '\\1.\\2', x))
# [1] 8.40 18.41 0.00
Data:
x <- c('08:40:01', '18:41:01', '00:00:01')
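One caveat if the end goal is radians: 8.40 read as a decimal number is not the same as 8 hours 40 minutes (which is about 8.67 hours). The sketch below shows one possible way to go from the original strings to radians. It assumes the analysis wants the time of day as a fraction of a 24-hour day scaled to 2*pi, which the question does not state; the names parts, hours and radians are illustrative only.

```r
x <- c("08:40:01", "18:41:01", "00:00:01")

# Split on ":" and turn each piece into a number (one row per time)
parts <- do.call(rbind, lapply(strsplit(x, ":"), as.numeric))

# Decimal hours, dropping the seconds as the question asks
hours <- parts[, 1] + parts[, 2] / 60

# Assumption: a full day corresponds to one full circle (2 * pi)
radians <- hours / 24 * 2 * pi
radians
```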

Add thousand separator to levels in cut function

My x axis labels look like [100000,250000], which makes it hard to understand the number at first sight. I want it to look like [100.000,250.000]. I know that the cut2 function has a formatfun parameter, but I don't think I know how to use it properly.
Try using the "formatC" function on your cut data. e.g.
formatC(my_cuts, big.mark = ".", decimal.mark = ",")
Let's create an example to work on:
x <- cut(seq(0,1,length.out=8) + 1e6, 3)
This is a factor. Although at bottom it's a numeric array, you don't want to format its values; you want to format its levels, which are the strings associated with its values. This is what the levels look like in the example (calling head to prevent lots of printing in case x has many distinct levels):
(head(levels(x)))
[1] "(1000000,1000000.3]" "(1000000.3,1000000.7]" "(1000000.7,1000001]"
To format the levels, we need to pick them apart into their numeric components (which are separated by a comma ","), format each component, and reassemble the results.
Here's the picking-apart-and-formatting step in one go, using only base R functionality. It calls gsub and strsplit on the first line (for cleaning out the "(" and "]" characters and splitting each pair of numeric strings into two strings) and employs prettyNum on the second line (for the formatting), which conveniently will format any character string that looks like a number:
s <- lapply(strsplit(gsub("]|[(]", "", levels(x)), ","),
prettyNum, big.mark=".", decimal.mark=",", input.d.mark=".", preserve.width="individual")
(You might not need the input.d.mark argument, but I did because my locale uses "." for a decimal point, as you could see above. The docs say "individual" is the default for setting the output width, but that just isn't the case on my system: I had to specify it explicitly.)
The paste* functions will perform the reassembly, whose results we simply re-assign to the levels of x:
levels(x) <- paste0("(", sapply(s, function(a) paste0(a, collapse="; ")), "]")
(Since each number potentially already includes "," and "." delimiters, I have specified a third punctuation mark, ";", to separate the numbers themselves -- but you may use what you wish, of course.)
Let's display the new levels to verify the results:
(head(levels(x)))
[1] "(1.000.000; 1.000.000,3]" "(1.000.000,3; 1.000.000,7]" "(1.000.000,7; 1.000.001]"
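When the break points are known in advance, another possibility is to build the labels before calling cut, so nothing has to be parsed back out of the levels. This is a sketch under that assumption; the breaks and values below are invented for illustration and contain whole numbers only.

```r
# Hypothetical break points and data
breaks <- c(0, 100000, 250000, 500000)
values <- c(50000, 120000, 300000)

# Format each break as an integer with "." as the thousands separator
fmt <- formatC(breaks, format = "d", big.mark = ".", decimal.mark = ",")

# One label per interval: (lower; upper]
labels <- paste0("(", head(fmt, -1), "; ", tail(fmt, -1), "]")

binned <- cut(values, breaks = breaks, labels = labels)
levels(binned)
```

The labels have to follow the same open/closed convention that cut uses (right-closed by default), since cut does not check custom labels against the intervals.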

How to extract the max value of a string in R

I have a vector of strings like this:
"1111111221111122111111UUUUUUUUUUUUUUUUUU"
"---1-1---1--111111"
"1111112111 1111" (with blank spaces)
Each string has a different length, and I want to extract the maximum value of each one. For the three examples above the maxima would be (2, 1, 2). I don't know how to handle the letters, dashes and blank spaces: all three should count as the minimum, i.e. 1 is bigger than "U", "-" and " ", and those three are equal to one another.
Any advice?
Best regards
Decompose the problem into independent, solvable steps:
Transform the input into a suitable format
Find the maximum
Then we get:
# Separate strings into individual characters
digits_str = strsplit(input, '')
# Convert to correct type
digits = lapply(digits_str, as.integer)
# Perform actual logic, on each input string in turn.
result = vapply(digits, max, integer(1L), na.rm = TRUE)
This uses the lapply and vapply functions which allow you to perform an operation (here first as.integer and then max) on all values in a vector/list.
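For a version that can be pasted and run as is, the variant below defines the input from the question and strips the non-digit characters first, which avoids the coercion warnings that as.integer gives for "U", "-" and " ". It is a sketch that assumes every string contains at least one digit; a string without any digits would make max fail inside vapply.

```r
input <- c("1111111221111122111111UUUUUUUUUUUUUUUUUU",
           "---1-1---1--111111",
           "1111112111 1111")

# Drop everything that is not a digit, then split into single characters
digits_only <- gsub("[^0-9]", "", input)
digits <- lapply(strsplit(digits_only, ""), as.integer)

# Maximum digit per string
result <- vapply(digits, max, integer(1L))
result
# [1] 2 1 2
```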

Easiest way to extract numbers from currency amounts in different formats on R?

I have a dataset with a "value (in millions of USD)" column that I want to manipulate. Entries are strings in different formats - either with a dollar sign and followed by an M, e.g. "$1.3M," or followed by a K, e.g. "$450K," or some that I've already turned into proper numerical entries (e.g. 40 for 40 million USD).
I want to: get rid of the $ and extract only the numerical value for each row in millions.
Probably looking at some kind of column splitter based on values containing M or K, with an "ifelse" resembling something like: ifelse(PL$'VALUE (M)' contains M, extract.numeric from PL$'VALUE (M)', PL$'VALUE (M)' * 10^-3).
Haven't quite figured out the easiest way to do this on R though. Help would be appreciated!
You can use gsubfn to specify how to match the currency to numeric.
x <- c("$1.3M", "$450K")
library(gsubfn)
as.numeric(
gsubfn( "\\D", list( "$"="", "M" = "e6", "K" = "e3"), x)
)
#1300000 450000
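Since the question asks for values in millions and some entries are already plain numbers, a base R variant along these lines may be closer to what is needed. It is a sketch that assumes the only suffixes are M and K and that entries without a suffix are already in millions; the names value, number, suffix and millions are illustrative.

```r
value <- c("$1.3M", "$450K", "40")

# Keep only digits and the decimal point, then convert
number <- as.numeric(gsub("[^0-9.]", "", value))

# Keep only the unit letter ("" when there is none)
suffix <- gsub("[^MK]", "", value)

# K entries are thousands, so scale them down to millions
millions <- number * ifelse(suffix == "K", 1e-3, 1)
millions
```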

adding or retaining leading zeros without converting to character format

Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code and the last two lines gave the wrong answer. I just thought maybe if I could have leading zeros with numeric format obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7=c(0,1,1,1); # 4
a77 = '1110' ; b77='0111' ; # 4
a777 = 1110 ; b777=0111 ; # 4
length(b7[(b7 %in% intersect(a7,b7))])
# From: "R - count matches between characters of one string and another, no replacement"
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])
You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
You could always create your own class of objects that has one slot for the value of the number (but if it is stored as numeric then what we see as 123 will actually be stored as a binary value, something like 01111011 (though probably with more leading 0's)) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, sig digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
But this seems a bit overkill in most cases (though I know that some fields make a big deal about indicating number of significant digits so that leading 0's could be important). It may be simpler to use the conversion to character methods that you already know about, but just do the printing in a way that does not look obviously like a number, see the cat and print functions for the options.
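The custom-class idea can be sketched with an S3 class. Everything here is hypothetical (the class name padded, its constructor and its fields are invented for illustration): the object keeps the number and a display width separately, and only the format and print methods add the zeros.

```r
# Hypothetical constructor: a numeric value plus the width to print it at
padded <- function(value, width) {
  structure(list(value = value, width = width), class = "padded")
}

# Zero-pad only when formatting for display
format.padded <- function(x, ...) {
  formatC(x$value, width = x$width, format = "d", flag = "0")
}

print.padded <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

x <- padded(123, width = 5)
print(x)        # displays 00123
x$value + 1     # arithmetic uses the stored number, giving 124
```

The padded text still only exists as a character string at print time, which is the point the answer makes: the stored number itself has no leading zeros.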
