Convert Dollar Data from Character to Numeric - r

How would I change a column that has character data in the format "33 dollars 14 cents" to numeric data formatted like "33.14"?
Thanks for any help!

You can use the stringr library to extract the numeric components and then paste them together. This assumes that each string always contains exactly two numbers. The result is a character vector, so wrap it in as.numeric() if you need numbers.
library(stringr)
s <- c("33 dollars 14 cents", "35 dollars 50 cents")
sapply(str_extract_all(s,"\\d+"), function(x) paste(x, collapse = "."))
[1] "33.14" "35.50"

You may use sub
x <- "33 dollars 14 cents"
as.numeric(sub("^(\\d+)\\s+dollars\\s+(\\d+)\\s+cents$", "\\1.\\2", x))
# [1] 33.14
as.numeric(sub("^(\\d+).*?(\\d+).*", "\\1.\\2", x))
# [1] 33.14
or
as.numeric(paste(str_extract_all(x, "\\d+")[[1]], collapse="."))
# [1] 33.14

Assuming your data is all the same format you can use gsub().
This is clumsy but it works:
as.numeric(gsub(" cents","",gsub(" dollars ",".",data)))
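Since the question is about a column, here is a minimal sketch of the same nested gsub() call applied to a data frame. The data frame df and the column names price and price_num are made up for illustration, and the approach still assumes every value follows the "N dollars M cents" pattern exactly.

```r
# Hypothetical example data; the names df, price and price_num are assumptions
df <- data.frame(
  price = c("33 dollars 14 cents", "35 dollars 50 cents"),
  stringsAsFactors = FALSE
)

# Replace " dollars " with a decimal point, drop " cents", then convert
df$price_num <- as.numeric(gsub(" cents", "", gsub(" dollars ", ".", df$price)))
df$price_num
# [1] 33.14 35.50
```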

It's always worthwhile to write a simple function to handle cases where you need several little steps. Here's an inelegant example that's easy to read:
numerify <- function(x) { # convert a string of the form "33 dollars 14 cents" to numeric 33.14
  x <- gsub('[a-z]', '', x) # remove letters
  x <- gsub(' $', '', x)    # remove trailing space
  x <- gsub(' +', '.', x)   # insert decimal point
  return(as.numeric(x))     # convert to numeric
}
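
One caveat applies to all of the approaches above: gluing the two numbers together with a decimal point turns "33 dollars 5 cents" into 33.5 rather than 33.05. If single-digit cent values can occur, a safer variant is to treat the second number as hundredths. The following is only a sketch; dollars_to_numeric is a hypothetical helper name, and it returns NA for any string that does not contain exactly two numbers.

```r
# Hypothetical helper: reads the first number as dollars and the second
# as cents (hundredths), so "5 cents" becomes 0.05 rather than 0.5
dollars_to_numeric <- function(x) {
  parts <- regmatches(x, gregexpr("[0-9]+", x))  # all digit runs per string
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)         # unexpected format
    as.numeric(p[1]) + as.numeric(p[2]) / 100
  }, numeric(1))
}

dollars_to_numeric(c("33 dollars 14 cents", "33 dollars 5 cents"))
# [1] 33.14 33.05
```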

Related

Given the time in hh:mm:ss format, compute the number of seconds using base R

I have a character object with the time in hh:mm:ss-like format.
A given vector a for example purposes:
[1] "01|15|59" "1|47|16" "01|17|20" "1|32|34" "2|17|17"
I want to compute the number of seconds for each measurement.
For the first one you can simply do:
a <- as.numeric(strsplit(a, "|", fixed=TRUE)[[1]])
a[1]*60*60 + a[2]*60 + a[3]
I could use a loop to iterate through all of the elements of the char object, but ideally I would like to use an apply function. Not sure how to implement it.
You can use sapply():
sapply(strsplit(a, "\\|"), \(x) sum(as.numeric(x) * 60^(2:0)))
# [1] 4559 6436 4640 5554 8237
or parse the time string into a date-time with strptime() and pass it to as.numeric() to get the seconds elapsed since 1970-01-01 00:00:00 UTC:
as.numeric(strptime(paste('1970-01-01', a), '%F %H|%M|%S', 'UTC'))
# [1] 4559 6436 4640 5554 8237
difftime(strptime(a, '%H|%M|%S', 'UTC'), Sys.Date(), units = 'sec')
# Time differences in secs
# [1] 4559 6436 4640 5554 8237
Data
a <- c("01|15|59", "1|47|16", "01|17|20", "1|32|34", "2|17|17")
Note that | (vertical bar) is a regex metacharacter, so it needs to be escaped in strsplit() unless you pass fixed = TRUE. Otherwise it acts as a logical "or", resulting in a split on each character. Use [|] or \\|.
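As a small illustration of the point about the vertical bar, the sketch below compares the three ways of splitting on a literal "|". The variable names by_fixed, by_escape and by_class are only for this example.

```r
a <- c("01|15|59", "1|47|16")

# Literal match: no escaping needed when fixed = TRUE
by_fixed <- strsplit(a, "|", fixed = TRUE)

# Regex match: escape the bar or put it in a character class
by_escape <- strsplit(a, "\\|")
by_class <- strsplit(a, "[|]")

# All three produce the same pieces
identical(by_fixed, by_escape) && identical(by_fixed, by_class)
# [1] TRUE
```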
With a list object created by a <- strsplit(a, '[|]'), you could also extract each list element with the `[[` function within sapply(). For example:
#Splits your vector on vertical pipe, with list output
a <- strsplit(a, '[|]')
#Apply extract function across list
3600*as.numeric( sapply(a,`[[`,1)) +
60*as.numeric( sapply(a,`[[`,2))+
as.numeric( sapply(a,`[[`,3))
[1] 4559 6436 4640 5554 8237
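
The same weighting of hours, minutes and seconds can also be written as a single matrix product, which avoids calling sapply() once per component. This is a sketch that assumes every element has exactly three "|"-separated fields; the names parts and secs are made up for the example.

```r
a <- c("01|15|59", "1|47|16", "01|17|20", "1|32|34", "2|17|17")

# One row per time, columns are hours, minutes, seconds
parts <- do.call(rbind, lapply(strsplit(a, "|", fixed = TRUE), as.numeric))

# Weight each column by the number of seconds it represents
secs <- as.vector(parts %*% c(3600, 60, 1))
secs
# [1] 4559 6436 4640 5554 8237
```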

extract number only after specific word by skipping other word in between it

I'm looking for a two-digit number that comes before the word "years" and a seven- or eight-digit number that comes after the word "years." An example of the data is shown below.
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
data <- as.list(data)
I tried this approach and was successful in getting two digit numbers before the word "years" :
stringr::str_extract_all(data,regex(".\\d{2}\\s(?:year)"))
I also tried this method to get the number after word "years" :
str_extract_all(data,regex(".\\d{2}\\s(?:year).\\d{7,8}"))
I managed to get the number that appear directly after the word years :
" 57 year 7654321"
However, I was unsuccessful in getting eight digit numbers following the word "years" that included other characters in between the number and the word "years".
How can I retrieve the number only after the word "years" by skipping this other word/character?
I really appreciate your help
We may use str_replace_all to replace the non-digits before and after 'years' with a space, and then extract the digits before and after 'years', including the word 'years' itself.
library(stringr)
str_extract_all(str_replace_all(data,
"(?<=years)\\D+|(\\D+)(?=years)", " "), "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
[1] "45 years 12345678" "57 years 7654321"
Or another option is to capture the digits, along with the 'years' substring with str_match and then paste them together
library(purrr)
library(dplyr)
str_match_all(data, "(\\d{2})\\D+(years)\\D+(\\d{7,8})")[[1]][,-1] %>%
as.data.frame %>%
invoke(str_c, sep =" ", .)
[1] "45 years 12345678" "57 years 7654321"
data
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
Here is a base R approach:
Create a list with strsplit separating by ,
define a function my_func that takes a string and searches for numeric before year and after year and then pastes all together.
Use lapply to apply your function to the list.
Use toString() to get the expected output.
my_list <- strsplit(data, ",")
my_func <- function(x){
a <- as.integer(sub(".*?(\\d+)\\s*year.*", "\\1", x))
b <- as.integer(sub(".*?year.*?(\\d+).*", "\\1", x))
paste(a, "year", b)
}
result <- lapply(my_list, my_func)
lapply(result, toString)
Output:
[[1]]
[1] "45 year 12345678, 57 year 7654321"
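
If the two numbers are wanted as separate values rather than pasted strings, a base R sketch along the same lines is shown below. It assumes only non-digit characters sit between "years" and the long number, and that the age is always the two digits immediately before "years"; the names pattern, hits, ages and numbers are illustrative.

```r
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Two digits, "years", any run of non-digits, then 7 or 8 digits
pattern <- "(\\d{2})\\s+years\\D*(\\d{7,8})"

# Every full match in the string
hits <- regmatches(data, gregexpr(pattern, data, perl = TRUE))[[1]]

# Pull each capture group out of the matched pieces
ages <- as.integer(sub(pattern, "\\1", hits, perl = TRUE))
numbers <- sub(pattern, "\\2", hits, perl = TRUE)

ages
# [1] 45 57
numbers
# [1] "12345678" "7654321"
```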

Regex: Match first two digits of a four digit number

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'
Manipulating datetimes is better using the dedicated date/time functions.
You can convert the variable to date and use format to get the output in any format.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.
You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"
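
One practical difference between the answers: as.Date() with %b matches month abbreviations in the current locale, while the regex versions only look at the digits. The sketch below applies the sub() approach to a vector and returns NA for anything that does not look like ddMonyyyy. The validation pattern and the names ok and short are assumptions added for illustration.

```r
x <- c("30Jun2021", "01Jan1999", "2021-06-30")

# Only rewrite strings shaped like two digits, three letters, four digits
ok <- grepl("^\\d{2}[A-Za-z]{3}\\d{4}$", x)

# Drop the first two digits of the trailing four-digit year
short <- ifelse(ok, sub("\\d{2}(\\d{2})$", "\\1", x), NA_character_)
short
# [1] "30Jun21" "01Jan99" NA
```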

How to calculate duration from a "start - end" string more succinctly [duplicate]

I have times stamps indicating the time an event started and the time it ended:
x <- "00:01:00.000 - 00:01:10.500"
I need to calculate the event's duration. Using hms from the package lubridate as well as lapply and strsplit does give me the expected output:
library(lubridate)
unlist(lapply(strsplit(x, split=" - "), function(x) as.numeric(hms(x))))[2] - unlist(lapply(strsplit(x, split=" - "), function(x) as.numeric(hms(x))))[1]
[1] 10.5
But I feel the code is utterly inelegant and anything but succinct. Is there any better way to get the duration?
EDIT:
What if, as is indeed the case, there are many more than just one value in x, such as:
x <- c("00:01:00.000 - 00:01:10.500", "00:12:12.000 - 00:13:10.500")
I've come up with this solution:
timepoints <- lapply(strsplit(x, split=" - "), function(x) as.numeric(hms(x)))
duration <- lapply(timepoints, function(x) x[2]-x[1])
duration
[[1]]
[1] 10.5
[[2]]
[1] 58.5
But, again, there's surely a nicer and shorter one.
Here is a way :
as.numeric(diff(lubridate::hms(strsplit(x, split=" - ")[[1]])))
#[1] 10.5
Keeping it in base R :
as.numeric(diff(as.POSIXct(strsplit(x, split=" - ")[[1]], format = '%H:%M:%OS')))
#[1] 10.5
For multiple values, we can use sapply :
library(lubridate)
sapply(strsplit(x, " - "), function(y) diff(period_to_seconds(hms(y))))
#[1] 10.5 58.5
and in base R :
sapply(strsplit(x, " - "), function(y) {
x1 <- as.POSIXct(y, format = '%H:%M:%OS')
difftime(x1[2], x1[1], units = "secs")
})
Assuming that x can be a character vector, read it into a data frame using read.table, convert the relevant columns with hms, then take their difference and convert it to numeric, giving the vector shown. You might need the as.is = TRUE argument to read.table if you are using a version of R prior to 4.0.
library(lubridate)
# test input
x <- c("00:01:00.000 - 00:01:10.500", "00:01:00.000 - 00:01:10.500")
with(read.table(text = x), as.numeric(hms(V3) - hms(V1)))
## [1] 10.5 10.5
or using magrittr and the same input x as above:
library(lubridate)
library(magrittr)
x %>%
read.table(text = .) %$%
as.numeric(hms(V3) - hms(V1))
## [1] 10.5 10.5
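
For completeness, here is a vectorized base R sketch that needs neither lubridate nor date-time classes: it splits each string into start and end, converts both to seconds with a matrix product, and subtracts. It assumes every element has the form "hh:mm:ss.sss - hh:mm:ss.sss"; to_seconds is a hypothetical helper name.

```r
x <- c("00:01:00.000 - 00:01:10.500", "00:12:12.000 - 00:13:10.500")

# Hypothetical helper: "hh:mm:ss.sss" strings to seconds
to_seconds <- function(t) {
  p <- do.call(rbind, lapply(strsplit(t, ":", fixed = TRUE), as.numeric))
  as.vector(p %*% c(3600, 60, 1))
}

# Column 1 holds the start times, column 2 the end times
parts <- do.call(rbind, strsplit(x, " - ", fixed = TRUE))
duration <- to_seconds(parts[, 2]) - to_seconds(parts[, 1])
duration
# [1] 10.5 58.5
```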

extracting numbers and dates from text (a vector of sentence-like strings) using R

I am trying to pull numbers and dates from text using R. Say I have a vector of text strings, V.text. The text strings are sentences that contain numbers and dates. For example:
"listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
I want to extract the amounts and the dates as separate vectors.
So the output would be two vectors:
1 150000 160000
2 2/14/2015 3/1/2015
I tried using scan() but couldn't get the desired result. I would appreciate any help
How about:
txt <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
lapply(c('[0-9,]{5,}',
'[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}'),
function(re) {
matches <- gregexpr(re, txt)
gsub(',', '', regmatches(txt, matches)[[1]])
})
## [[1]]
## [1] "150000" "160000"
## [[2]]
## [1] "2/14/2015" "3/1/2015"
(The first pattern assumes amounts have 5 digits or more. If yours have fewer, this simple regular expression will collide with the year of the date(s).)
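One way to sidestep the collision with the year is to blank out the dates before looking for amounts, so the five-digit minimum is no longer needed. The sketch below does that and also converts the results to Date and numeric values. It assumes the dates are month/day/year; the variable names are made up for the example.

```r
txt <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
date_re <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"

# Dates: extract, then parse (assumes month/day/year order)
date_strings <- regmatches(txt, gregexpr(date_re, txt))[[1]]
dates <- as.Date(date_strings, format = "%m/%d/%Y")

# Amounts: remove the dates first, then take digit runs that may contain commas
no_dates <- gsub(date_re, "", txt)
amount_strings <- regmatches(no_dates, gregexpr("[0-9][0-9,]*", no_dates))[[1]]
amounts <- as.numeric(gsub(",", "", amount_strings))

dates
# [1] "2015-02-14" "2015-03-01"
amounts
# [1] 150000 160000
```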
First split out the "words". Then the ones with a slash are dates and the ones with only $, digit or comma are numbers. In the latter case strip off the non-digit characters and convert to numeric:
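x <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015" # input string from the question
s <- strsplit(x, " ")[[1]]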
grep("/", s, value = TRUE) # dates
## [1] "2/14/2015" "3/1/2015"
as.numeric(gsub("\\D", "", grep("^[$0-9,]+$", s, value = TRUE)))
## [1] 150000 160000
If negative numbers or decimal numbers are possible then change the last line of code to:
as.numeric(gsub("[^-0-9.]", "", grep("^-?[$0-9,.]+$", s, value = TRUE)))
Quick-and-dirty approach:
x<-"listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
mydate<-regmatches(x,gregexpr("\\d{1,2}/\\d{1,2}/\\d{4}",x,perl=TRUE))
mynumber<-regmatches(sub(",","",x),gregexpr("\\d{6}",sub(",","",x),perl=TRUE))