Remove part of a string and turn it into a number? - r

I have a dataframe called "Camera_data" and a column called "Numeric_time"
My "Numeric_time" column is in character format and includes hours, minutes and seconds, it looks like this: 08:40:01
I need to remove the seconds and replace the colons with periods to make a decimal number for my time. I need it to look like this: 08.40, so that I can turn my time into radians for an analysis I'm running.
I've looked for a few solutions in stringr, but so far can't work out how to consistently take off the last three characters. I think once I have removed the seconds and replaced the : with a . I can just use as.numeric to turn the character column into a numerical column, but would really appreciate any help!

We could do
Camera_data$Numeric_time <- as.numeric(chartr(":", ".",
sub(":\\d{2}$", "", Camera_data$Numeric_time )))
Or use substr to drop the last three characters:
Camera_data$Numeric_time <- substr(Camera_data$Numeric_time, 1, nchar(Camera_data$Numeric_time)-3)
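As written, the substr line only drops the seconds and leaves a character value such as "08:40". A minimal sketch of the full conversion, using a small made-up vector times in place of Camera_data$Numeric_time, could look like this:

```r
# Hypothetical stand-in for Camera_data$Numeric_time
times <- c("08:40:01", "18:41:59", "00:00:01")

# Drop the last three characters (":ss"), leaving "hh:mm"
hhmm <- substr(times, 1, nchar(times) - 3)

# Swap the remaining colon for a period and convert to numeric
hh_mm_numeric <- as.numeric(chartr(":", ".", hhmm))
hh_mm_numeric
```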

Using gsub with two capture groups.
as.numeric(gsub('(\\d+):(\\d+).*', '\\1.\\2', x))
# [1] 8.40 18.41 0.00
Data:
x <- c('08:40:01', '18:41:01', '00:00:01')
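Note that 8.40 read as a decimal number is not 8 hours 40 minutes (that would be about 8.67 hours). If the goal is radians on a 24-hour clock, one possible sketch is to work from hours and minutes directly. It assumes the analysis expects time of day scaled to the range 0 to 2*pi, and the names parts, hours, minutes, decimal_hours and radians are illustrative only.

```r
x <- c("08:40:01", "18:41:01", "00:00:01")

# Split "hh:mm:ss" into its pieces; seconds are ignored, as in the question
parts   <- strsplit(x, ":", fixed = TRUE)
hours   <- as.numeric(vapply(parts, `[`, character(1), 1))
minutes <- as.numeric(vapply(parts, `[`, character(1), 2))

# Decimal hours, then the fraction of a 24-hour day expressed in radians
decimal_hours <- hours + minutes / 60
radians <- decimal_hours / 24 * 2 * pi
```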

Related

String Manipulation in R data frames

I just learnt R and was trying to clean data for analysis with string manipulation, using the code given below on the Amount_USD column of a table. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([d,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
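To see what the lookahead contributes, the sketch below compares the pattern with and without it. It uses base R's regexpr with perl = TRUE rather than stringr, on the assumption that the lookahead behaves the same way there; the helper extract_tail is hypothetical.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Hypothetical helper: return the first match of `pattern` in each string
extract_tail <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Without the lookahead the match starts at the trailing 0 of "a0"
without_lookahead <- extract_tail("[\\d,]+$", vec)

# With (?!0) the match cannot start on a 0, so it starts at the first real digit
with_lookahead <- extract_tail("(?!0)[\\d,]+$", vec)
```

The pattern therefore relies on the amount itself never starting with 0.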
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in your matching string. These also lead to a counting error in your first str_sub(), which should take the first 8 characters, not 10. Finally, the substring you keep should start at the character right after the text you are matching (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
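The character counts mentioned above can be checked directly. This is a small sketch using a made-up vector amounts in place of csv_file2$Amount_USD, and it assumes the prefix is the literal eight-character text \xc2\xa0 rather than a real non-breaking space:

```r
# Hypothetical stand-in for csv_file2$Amount_USD
amounts <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# In R source, "\\xc2\\xa0" is 8 characters: each "\\" is one literal backslash
prefix_length <- nchar("\\xc2\\xa0")

# Remove the prefix and the thousands separators, then convert to numeric
cleaned <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", "", amounts))
```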

regex single digit

I have a question which I think is solved by regex use in R.
I have a set of dates (as chr) which I would like in a different format (as chr).
I have tried to fool around with the below examples where the first (new_dates) gives the right format for months 1-9 and wrong for 10-12 and (new_dates2) gives the right format for 10-12 but nothing for 1-9.
I see that the code in the first case matches a single digit twice for 10-12, but don't really know how to tell it to match only single digit.
The final vector of correct dates shows the result I would like.
dates <- c("1/2016", "2/2016", "3/2016", "4/2016", "5/2016", "6/2016", "7/2016", "8/2016", "9/2016", "10/2016", "11/2016", "12/2016", "1/2017")
new_dates <- sub("(\\d)[:/:](\\d{4})","\\2M0\\1", dates)
new_dates2 <- sub("(\\d{2})[:/:](\\d{4})","\\2M\\1", dates)
correctdates <- c("2016M01", "2016M02", "2016M03", "2016M04", "2016M05", "2016M06", "2016M07", "2016M08", "2016M09", "2016M10", "2016M11", "2016M12", "2017M01")
Here's a base R method that will return the desired format:
format(as.Date(paste0("1/",dates), "%d/%m/%Y"), "%YM%m")
[1] "2016M01" "2016M02" "2016M03" "2016M04" "2016M05" "2016M06" "2016M07" "2016M08" "2016M09"
[10] "2016M10" "2016M11" "2016M12" "2017M01"
The idea is to first convert to a Date object and then use the format function to create the desired character representation. I pasted on 1/ so that a day is present in each element.
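As @a p o m said, it might be better to look for another solution if you are manipulating dates, but if you want to stick with regular expressions you can try this one:
([02-9]|1[0-2]?)[:\/](\d{4})
To pad only the single-digit months, anchor the pattern with ^ so that (\\d) cannot match the second digit of 10-12, then handle the two-digit months separately:
new_dates <- sub("^(\\d{2})/(\\d{4})$", "\\2M\\1", sub("^(\\d)/(\\d{4})$", "\\2M0\\1", dates))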
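The asker's first attempt fails because an unanchored (\\d) is free to match the second digit of a two-digit month. The sketch below shows that failure and one way to avoid the padding question in the regex altogether, by splitting on / and letting sprintf pad the month. The variable names are illustrative.

```r
dates <- c("1/2016", "9/2016", "10/2016", "12/2016", "1/2017")

# Unanchored single-digit pattern: for "10/2016" it matches only "0/2016"
broken <- sub("(\\d)/(\\d{4})", "\\2M0\\1", dates)

# Split into month and year, then zero-pad the month to two digits
parts <- strsplit(dates, "/", fixed = TRUE)
month <- as.integer(vapply(parts, `[`, character(1), 1))
year  <- vapply(parts, `[`, character(1), 2)
padded_dates <- sprintf("%sM%02d", year, month)
```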

Sprintf Function and Character Dates

I have a data set in which I want to pad zeroes in front of a set of dates that don't have six characters. For example, I have a date that reads 91003 (October 3rd, 2009) and I want it to read 091003, as well as any other date that is missing a zero in front. When I use the sprintf function, the code is:
data1$entrydate <- sprintf("%06d", data1$entrydate)
But what it spits out is something like 000127, or some other seemingly random number, for all the other dates in the column. I don't understand what's going on, and I would appreciate some help on the issue. Thanks.
PS. I am sometimes also getting an error message that sprintf is only for character values; I don't know if there is any code for numerical values.
I guess you got different results than expected because the column class was factor. You can convert the column to numeric with either as.numeric(as.character(datacolumn)) or as.numeric(levels(datacolumn))[datacolumn]. According to ?factor:
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
So, you can use
levels(data1$entrydate) <- sprintf('%06d', as.numeric(levels(data1$entrydate)))
Example
Here is an example that shows the problem
v1 <- factor(c(91003, 91104,90103))
sprintf('%06d', v1)
#[1] "000002" "000003" "000001"
Or, it is equivalent to
sprintf('%06d', as.numeric(v1)) #the formatted numbers are
# the numeric index of factor levels.
#[1] "000002" "000003" "000001"
When the levels are converted back to numeric, it works as expected:
sprintf('%06d', as.numeric(levels(v1)))
#[1] "090103" "091003" "091104"
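Note that sprintf on levels(v1) returns the padded levels in sorted order, not in the order of the original data. To get a padded character vector aligned with the rows, one option, following the ?factor recommendation quoted above, is sketched below; original and padded are illustrative names.

```r
v1 <- factor(c(91003, 91104, 90103))

# Recover the original numbers, in row order, from the factor
original <- as.numeric(levels(v1))[v1]

# Pad to six digits; %d accepts whole-valued doubles
padded <- sprintf("%06d", original)
padded
```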

Convert characters with time units (ms, s, us) into numerics

One of the columns in my data frame is a character vector with time span values represented as number+suffix, as so:
c("16.14ms", "7.58ms", "8.38ms", "7.29ms", "6.40ms", "5.76ms",
"5.56ms", "5.27us", "5.12ms", "5.03us", "4.91ms", "4.76ms", "16.12ms",
"7.56ms", "8.59ms", "7.16ms", "6.59ms", "5.91s", "5.62ms", "5.44ms"
)
The units are limited to micro us, milli ms, and full seconds s.
Is there a simple way to make this into a numeric column with all values being either in milliseconds or seconds?
Here are some approaches. We suppose x is the input vector shown in the question.
1) Remove the s, replace m with e-3 and replace u with e-6. Then convert to numeric, which gives the values in seconds:
as.numeric(sub("u", "e-6", sub("m", "e-3", sub("s", "", x))))
2) This could also be done neatly using gsubfn. First we match the suffix and then use a replacement list as shown:
library(gsubfn)
as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", s = "e0"), x))
This would be particularly convenient if it were desired to extend the problem to many time units as it would just be a matter of extending the list.
Note that at the top of page 4 of the gsubfn vignette there is an example which is very close to this one.
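If milliseconds are wanted instead of seconds, a base R sketch is to separate the number from the unit and multiply by a lookup table. The names to_ms, unit, value and milliseconds are illustrative, and the sketch assumes every element ends in exactly one of us, ms or s.

```r
x <- c("16.14ms", "5.27us", "5.91s")

# Multipliers that convert each unit to milliseconds
to_ms <- c(us = 1e-3, ms = 1, s = 1e3)

unit  <- sub("^[0-9.]+", "", x)              # trailing letters, e.g. "ms"
value <- as.numeric(sub("[a-z]+$", "", x))   # leading number, e.g. 16.14

# Look up each element's multiplier by unit name and drop the names
milliseconds <- unname(value * to_ms[unit])
```

Supporting another unit is then a matter of adding one entry to to_ms.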

Adding conditional leading or trailing zeros

I need help conditionally adding leading or trailing zeros.
I have a dataframe with one column containing icd9 diagnoses. as a vector, the column looks like:
"33.27" "38.45" "9.25" "4.15" "38.45" "39.9" "84.1" "41.5" "50.3"
I need all the values to have a length of 5, including the period in the middle (not counting the quotes). If a value has one digit before the period, it needs a leading zero. If a value has one digit after the period, it needs a trailing zero. So the result should look like this:
"33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Here is the vector for R:
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3" )
This does it in one line
formatC(as.numeric(icd9),width=5,format='f',digits=2,flag='0')
ICD-9 codes have some formatting quirks which can lead to misinterpretation with simple string processing. The icd package on CRAN takes care of all the corner cases when doing ICD processing, and has been battle-tested over about six years of use by many R users.
Another option is a small function called change that takes the target number of characters as an argument. Note that it only pads on the left, so "39.9" becomes "039.9" rather than "39.90":
change<-function(x, n=max(nchar(x))) gsub(" ", "0", formatC(x, width=n))
icd92<-gsub(" ","",paste(change(icd9,5)))
You can also use sprintf after converting the vector into numeric.
sprintf("%05.2f", as.numeric(icd9))
[1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Notes
See the examples in ?sprintf to work out the proper format.
There is some risk of introducing errors due to numerical precision here, though it works well in the example.
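To sidestep the numerical precision concern, the padding can also be done on the strings themselves. This is only a sketch: it assumes every code contains exactly one period with one or two digits on each side, and the names parts, whole, frac and padded are illustrative.

```r
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3")

# Split each code at the period without converting to numeric
parts <- strsplit(icd9, ".", fixed = TRUE)
whole <- vapply(parts, `[`, character(1), 1)
frac  <- vapply(parts, `[`, character(1), 2)

# Left-pad the integer part and right-pad the fractional part with zeros
padded <- paste0(strrep("0", 2 - nchar(whole)), whole, ".",
                 frac, strrep("0", 2 - nchar(frac)))
```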

Resources