Slow String to Date Conversion Function - r

I wrote the following function to convert a vector of Strings to a vector of Dates (the code inside the for loop was inspired by this post: R help converting factor to date). When I pass in a vector of size 1000, this takes about 30 seconds. Not terribly slow, but I ultimately need to pass in about 100,000 so this could be a problem. Any ideas why this is slow and/or how to speed it up?
toDate <- function (dates)
{
theDates <- vector()
for(i in 1:length(dates))
{
temp <- factor(dates[i])
temp <- as.Date(temp, format = "%m/%d/%Y")
theDates[i] <- temp
}
class(theDates) <- "Date"
return(theDates)
}

Just do:
as.Date(dates, format = "%m/%d/%Y")
You don't need to loop over the dates vector, because as.Date() can handle a whole character vector in a single call. Your function makes length(dates) separate calls to factor() and as.Date(), and it grows theDates by one element on every iteration; all of that overhead is unnecessary.
You don't want to convert each individual date to a factor. You don't want to convert them at all (as.Date() will just convert them back to characters). If you did want to convert them, factor() is also vectorised, so you could (but you don't need this at all, anywhere in your function) remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all!
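As a minimal sketch of the point above (the sample strings are made up for illustration), a single as.Date() call converts the whole vector, and it gives the same result whether the input is character or factor:

```r
# Made-up sample input in the question's month/day/year format
dates <- c("01/31/2011", "02/28/2012", "12/25/2013")

# One vectorised call replaces the whole loop
parsed <- as.Date(dates, format = "%m/%d/%Y")

# A factor input is converted to character internally, so the result is the same
parsed_from_factor <- as.Date(factor(dates), format = "%m/%d/%Y")

print(parsed)
print(class(parsed))
```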

Related

Using ifelse statement in a sapply function

I am trying to make a new column in my data.table. I have two columns, one with a start date and one with an end date. The starting date always is 2016-02-28. The end date in some cases is 2014-12-31 and in others it is 2020-12-31 (all in YYYY-MM-DD format).
In the first case it's evident that I should get a negative difference in dates. In the second case it is positive.
I want to use the sapply function with an ifelse statement to determine the difference in dates. Any time the difference is negative, I want R to replace it with the value 1.
I do this as follows.
sapply(df$end.date, function(x) { ifelse(df$end.date>start_date, as.integer(length(seq(from=start_date, to=as.POSIXct(x,format="%Y-%m-%d"), by ='month')) ), 1) } )
Unfortunately, I get the following error
Error in seq.POSIXt(from = start_date, to = as.POSIXct(df$end.date, :
'from' must be of length 1
How can I make this work?
PS: both start_date and df$end.date are in POSIXct format in a data.table.
ifelse is already vectorised, doubling up sapply and ifelse is redundant.
Unfortunately ifelse won’t work here because we cannot get the month difference for negative dates (as per your comment). So we just use if in combination with mapply instead:
months_between = function (start, end) {
    if (end > start)
        length(seq(start, end, by = 'month'))
    else
        1
}
df$new_column = mapply(months_between, df$start.date, df$end.date)
I’m also pretty sure that there’s a better way to write months_between but I’m not versed in the base R date manipulation functions since they are generally quite bad; I recommend using the ‹lubridate› package instead.
I think your approach is overly complicated. If you're going to use sapply, you ought to be able to avoid ifelse, since you can focus on one value at a time (this assumes you are running a vector through sapply; it might not hold if you run a list through sapply). If you really want to use an apply function, however, you'd be better off using mapply with an if ... else clause.
But the apply function isn't necessary at all. In fact, the ifelse function isn't necessary. You can simplify the process a great deal with:
# Borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404
elapsed_months <- function(end_date, start_date) {
  # mapply simplifies the per-pair results to a numeric vector by default,
  # so the result can be stored as an ordinary numeric column
  mapply(
    function(end_date, start_date){
      ed <- as.POSIXlt(end_date)
      sd <- as.POSIXlt(start_date)
      12 * (ed$year - sd$year) + (ed$mon - sd$mon)
    },
    end_date,
    start_date
  )
}
DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
end = as.Date(c("2014-12-31", "2020-12-31")))
DFrame$diff <- elapsed_months(DFrame$end, DFrame$start)
DFrame$diff[DFrame$diff < 0] <- 1
DFrame
All I did was calculate the difference for all of the variables, obtain an index for the negative values, and replace them with 1.
An alternative approach would be to do the indexing up front. This way you aren't calculating the difference in dates for any values you will eventually change. This might have a benefit if you have a few million rows, but I would guess the performance increase would be small.
DFrame$diff2 <- vector("numeric", nrow(DFrame))
end_first <- DFrame$end < DFrame$start
DFrame$diff2[!end_first] <- elapsed_months(DFrame$end[!end_first], DFrame$start[!end_first])
DFrame$diff2[end_first] <- 1
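Because as.POSIXlt() accepts a whole vector, the mapply() wrapper can probably be dropped altogether. The following is a sketch of that idea, not the answer's own code; the name elapsed_months_vec is hypothetical, and it counts calendar-month differences in the same way as the function above:

```r
# Hypothetical vectorised variant: no mapply, no ifelse inside the helper
elapsed_months_vec <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
                     end   = as.Date(c("2014-12-31", "2020-12-31")))

month_diff <- elapsed_months_vec(DFrame$end, DFrame$start)

# Replace negative differences with 1, as the question asks
DFrame$diff <- ifelse(month_diff < 0, 1, month_diff)
print(DFrame)
```

For the two example rows this gives 1 (the raw difference is -14) and 58.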

Transforming time-type variables for calculation in R

I have a matrix of time variables in the following format.
time <- matrix(c('01:11', '01:20', '00:51', '01:09',
'01:11', '01:00', '01:19', '00:14',
'00:57', '01:12', '01:14', '00:43',
'01:10', '01:19', '01:03', '00:27',
'00:59', '01:04', '00:46', '00:52',
'01:05', '01:13', '01:01', '00:48'), ncol=3)
Where the values before ':' are minutes and after that are seconds.
I want to convert all the values into seconds. But I am not sure how I should transform the data so that those values with minutes can be converted into seconds, and those already in seconds can then be used as a numeric value.
I tried with the chron package, but my dataset seems to be in the wrong format.
Use strsplit with apply. If the values are not character, you may want to convert it to character.
apply(time, 2, function(x) sapply(strsplit(x,":"), function(y)
as.numeric(y[1])*60 + as.numeric(y[2])))
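As an alternative sketch, the minutes and seconds can be pulled out with sub(), which is vectorised over the whole matrix. This assumes every entry has the form "MM:SS"; the small matrix below is a made-up subset of the question's data:

```r
# Made-up 2 x 2 subset of the question's matrix
time <- matrix(c("01:11", "00:51", "01:20", "00:14"), ncol = 2)

# Everything before the colon is minutes, everything after it is seconds
mins <- as.numeric(sub(":.*$", "", time))
secs <- as.numeric(sub("^.*:", "", time))

# as.numeric() drops the matrix dimensions, so restore them explicitly
total_seconds <- matrix(mins * 60 + secs, nrow = nrow(time))
print(total_seconds)
```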

R how to format date for an element of a list

This is what I want to do:
First, I randomly generate a sequence of dates.
Then, I assign the earliest date to the variable.
site_start<-list()
for(i in 1:l0){
for(j in 1:10){
date<-seq.Date(from="1900-01-01",to="2000-01-01",by=week)
site_start[[i]][j]<-sample(date,1)
}
}
Now, let us assume the date variable is correctly generated. The reason I say this is because in my real case, I acquired the date variable from dozens of other steps that is irrelevant here.
My question is, why the site_start[[i]][j] I generated, kept on coming out as POSIXct, and R requires me to provide 'origin'? I format it with origin of 1970-01-01, it is still a numeric date, such as 15600. I simply don't know how to format this number anymore.
Any help is appreciated!!
Why don't you use this vectorized approach:
date.pool <- seq(from=as.Date("1900-01-01"), to=as.Date("2000-01-01"), by="1 week")
site_start <- replicate(10, sample(date.pool, 10, rep=T), simplify=F)
This produces a list with 10 items, each of which is a length-10 vector of random dates drawn from date.pool. Here are the first two items (site_start[1:2]):
[[1]]
[1] "1969-09-15" "1955-10-10" "1959-04-13" "1992-02-10" "1905-07-31" "1901-09-23"
[7] "1926-10-18" "1959-06-01" "1924-06-02" "1906-05-14"
[[2]]
[1] "1979-01-01" "1998-02-23" "1929-09-02" "1968-07-01" "1924-03-17" "1914-11-02"
[7] "1928-02-13" "1937-10-25" "1915-02-08" "1974-05-06"
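The question also asks for the earliest date per site. A minimal sketch of that step, assuming site_start is built as above (set.seed() is only there to make the sketch reproducible): do.call(c, ...) is used rather than sapply(), because sapply() would return the underlying day counts instead of Date objects.

```r
set.seed(1)  # only so the sketch is reproducible

date.pool <- seq(from = as.Date("1900-01-01"), to = as.Date("2000-01-01"), by = "1 week")
site_start <- replicate(10, sample(date.pool, 10, replace = TRUE), simplify = FALSE)

# min() keeps the Date class; combining the results with c() keeps it too
earliest <- do.call(c, lapply(site_start, min))

print(class(earliest))
print(earliest)
```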
In the past, when I have had to grab the oldest or most-recent entry, I have used arrange. E.g.,
library(plyr)      # arrange(), ddply()
library(lubridate) # mdy_hm()
# read dataset
enforce <- read.csv(paste(input.dir, "provider_enforcement.csv", sep="/"))
# use lubridate package to parse date format
enforce$SNAPSHOT_DATE <- mdy_hm(enforce$SNAPSHOT_DATE)
# this function sorts a data.frame and returns a data.frame with one row containing the most recent SNAPSHOT
MostRecent <- function(data) {
  return(arrange(data, SNAPSHOT_DATE, decreasing=TRUE)[1, ])
}
# use plyr to apply MostRecent to my dataset for each provider
enforce <- ddply(enforce, .(PROVIDER_IDNO), MostRecent)

Applying a set of operations across several data frames in r

I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~ 100 csv files and need to perform the exact same set of operations on each of them. I've read them in as separate objects (which I assume is probably improper r style) but I've been unable to write a function that can loop through them. Each csv is a dataframe that contains information, including a column with dates in decimal year form. I need to create 2 new columns containing year and day of year. I've figured out how to do it manually, but I would like to find a way to automate the process. Here's what I've been doing:
#setup
library(lubridate) #Used to check for leap years
df.00 <- data.frame( site = seq(1:10), date = runif(10,1980,2000 ))
#what I need done
df.00$doy <- NA # make an empty column which I'm going to place the day of the year
df.00$year <- floor(df.00$date) # grabs the year from the date column
df.00$dday <- df.00$date - df.00$year # get the year fraction. intermediate step.
# multiply the fraction year by 365 or 366 if it's a leap year to give me the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing. Which is what I would expect from my understanding of how functions work in r but if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times, which tells me that the loop isn't treating the dataframe as a single unit but is perhaps looking at each column in the frame.
I would really appreciate some insight on what I should be doing. I feel that I could have solved this easily using bash and awk but I would like to be less incompetent using r
The most efficient and direct way is to use a list.
1. Put all of your CSVs into one folder.
2. Grab a list of the files in that folder,
   eg: files <- dir('path/to/folder', full.names=TRUE)
3. Iteratively read all of those files into a list of data.frames,
   eg: df.list <- lapply(files, read.csv, <additional args>)
4. Apply your function iteratively over each data.frame,
   eg: lapply(df.list, myFunc, <additional args>)
Since your df's are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
Then take everything you have in the #what I need done portion and put inside a function, eg:
myFunc <- function(DF) {
# what you want done to a single DF
return(DF)
}
And then lapply accordingly
df.list <- lapply(df.list, myFunc)
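Putting the pieces together, here is a rough sketch of what a filled-in myFunc could look like. The names is_leap and add_year_doy are hypothetical, the leap-year test is a base R stand-in for lubridate's leap_year(), the two small data frames are made up, and the intermediate dday column from the question is kept as a local variable rather than stored:

```r
# Hypothetical base R stand-in for lubridate::leap_year()
is_leap <- function(year) (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0

# Hypothetical myFunc: add year and day-of-year columns to ONE data frame
add_year_doy <- function(DF) {
  DF$year <- floor(DF$date)                       # calendar year
  frac <- DF$date - DF$year                       # fraction of the year elapsed
  days_in_year <- ifelse(is_leap(DF$year), 366, 365)
  DF$doy <- round(frac * days_in_year)            # day of year
  DF                                              # return the modified copy
}

# Made-up stand-ins for df.00, df.01, ...
df.list <- list(
  df.00 = data.frame(site = 1:2, date = c(1980.5, 1981.25)),
  df.01 = data.frame(site = 1:2, date = c(2000.5, 1900.25))
)

df.list <- lapply(df.list, add_year_doy)
print(df.list$df.00)
```

Because the result of lapply() is assigned back to df.list, every data frame in the list ends up with the new columns.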
On a separate note, regarding functions:
The reason your funtest "does nothing" is that it never hands the modified data frame back. R gives the function its own copy of x, so the function does add the doy column, but only to that local copy, which is discarded when the function finishes.
You need to include a return(.) statement in the function and assign the result back, e.g. df.00 <- funtest(df.00). Alternatively, the output of last line of the function, if not assigned to an object, will be used as the return value -- but this last sentence is only loosely true and hence one needs to be cautious. The cleanest option (in my opinion) is to use return(.)
Regarding the for loop over the data.frame:
As you observed, using for (i in someDataFrame) {...} iterates over the columns of the data.frame.
You can iterate over the rows using apply. Each row arrives as a plain vector, so index it with [ ] rather than $:
apply(myDF, MARGIN=1, function(x) { x["doy"] <- ...; return(x) } ) # don't forget to return
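A small sketch of the difference, using a made-up three-row data frame: looping over the data frame itself visits one column per iteration, while looping over seq_len(nrow(df)) visits one row index per iteration.

```r
df <- data.frame(site = 1:3, date = c(1980.5, 1990.25, 2000.75))

# for () over a data frame yields its columns, one per iteration
col_classes <- character(0)
for (col in df) {
  col_classes <- c(col_classes, class(col))
}
print(col_classes)  # one entry per column

# To work row by row, loop over the row indices instead
df$year <- NA
for (i in seq_len(nrow(df))) {
  df$year[i] <- floor(df$date[i])
}
print(df)
```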

How to validate date in R

I have a date in the format dd-mm-yyyy HH:mm:ss
What is the best and easiest way to validate this date?
I tried
d <- format.Date(date, format="%d-%m-%Y %H:%M:%S")
But how can I catch the error when an illegal date is passed?
Simple way:
d <- try(as.Date(date, format="%d-%m-%Y %H:%M:%S"))
if("try-error" %in% class(d) || is.na(d)) {
print("That wasn't correct!")
}
Explanation: format.Date uses as.Date internally to convert date into an object of the Date class. However, it does not pass the format option on, so as.Date uses its default formats, trying %Y-%m-%d and then %Y/%m/%d.
The format option from format.Date is used only for the output, not for the parsing. Quoting from the as.Date man page:
The ‘as.Date’ methods accept character strings, factors, logical
‘NA’ and objects of classes ‘"POSIXlt"’ and ‘"POSIXct"’. (The
last is converted to days by ignoring the time after midnight in
the representation of the time in specified timezone, default
UTC.) Also objects of class ‘"date"’ (from package ‘date’) and
‘"dates"’ (from package ‘chron’). Character strings are processed
as far as necessary for the format specified: any trailing
characters are ignored.
However, when you directly call as.Date with a format specification, nothing else will be allowed than what fits your format.
See also: ?as.Date
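Since as.Date() only keeps the date part, one way to also check the time of day is to parse with strptime() instead. The following is a minimal sketch under that assumption; the function name is_valid_datetime is hypothetical, and tz = "UTC" is chosen only to avoid daylight-saving gaps:

```r
# Hypothetical helper: TRUE where the string parses as dd-mm-yyyy HH:MM:SS
is_valid_datetime <- function(x, format = "%d-%m-%Y %H:%M:%S") {
  parsed <- as.POSIXct(strptime(x, format = format, tz = "UTC"))
  !is.na(parsed)  # strptime() yields NA for anything that does not fit
}

candidates <- c("31-12-1999 23:59:59",  # valid
                "31-12-1999 25:00:00",  # hour out of range
                "31-02-1999 10:00:00",  # no 31st of February
                "not a date")
result <- is_valid_datetime(candidates)
print(result)
```

Nothing here throws an error for bad input; invalid strings simply come back as FALSE, so no try() is needed around the call.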
You may want to look at the gsubfn package. It has functions (gsubfn specifically) that work like other regular expression functions to match pieces of a string, but then call a user-supplied function and pass the matching pieces to it. So you would write your own function that looks at the year, month, and day and makes sure that they are in the correct ranges (and the range for day can depend on the passed month and year).
This might be helpful if flexibility is desired in a date-time entry.
I have a function where I want to allow either a date-only entry or a date-time entry, then set a flag for use inside the function only. I'm calling this flag date_type. The flag will be used later in the larger function to select units for getting a difference in two dates with difftime. (In most cases, the function will be perfectly fine with date only, but in some cases a user might need a shorter time frame. I don't want to inconvenience users with the shorter time frame if they don't need it.)
I am posting this for two reasons: 1) to help anyone trying to allow flexibility in date arguments and 2) to welcome sanity checks in case there's a problem with the method, since this is going into a function in an R package.
dat_time_check_fn <- function(dat_time) {
if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d %H:%M:%S"))) date_type <- 1
else if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d"))) date_type <- 2
else stop("Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59' ")
date_type
}
Date-time case:
date5 <- "1999-12-31 23:59:59"
date_type <- dat_time_check_fn(date5)
date_type
[1] 1
Date only case:
date6 <- "1999-12-31"
date_type <- dat_time_check_fn(date6)
date_type
[1] 2
Note that if the order above in the function is reversed, the longer date-time can be inadvertently coerced to the shorter version and both types result in date_type = 1.
My larger function has more than one date, but I need them to be compatible. Below, I'm checking the two dates checked above, where one was type 1 and one was type 2. Combining types gives the result with date only (type 2):
date_type <- dat_time_check_fn(c(date5, date6))
date_type
[1] 2
Here's a non-compliant version:
date7 <- "1/31/2011"
date_type <- dat_time_check_fn(date7)
Error in dat_time_check_fn(date7) :
Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59'
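To illustrate how such a flag might then feed into difftime, here is a rough sketch. The helper name diff_by_type and the mapping of type 1 to seconds and type 2 to days are assumptions for illustration, not part of the package function described above:

```r
# Hypothetical helper: choose difftime units from the flag
# (assumed mapping: 1 = date-time input -> seconds, 2 = date-only input -> days)
diff_by_type <- function(start, end, date_type) {
  units <- if (date_type == 1) "secs" else "days"
  difftime(as.POSIXct(end, tz = "UTC"),
           as.POSIXct(start, tz = "UTC"),
           units = units)
}

d_days <- diff_by_type("1999-12-31", "2000-01-02", date_type = 2)
d_secs <- diff_by_type("1999-12-31 23:59:00", "1999-12-31 23:59:59", date_type = 1)
print(d_days)
print(d_secs)
```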
Many solutions here are prone to SQL injection. They return TRUE for date = "2020-08-11; DROP * FROM my_table". Here is a vectorized base R function that works with NA:
is_date = function(x, format = NULL) {
formatted = try(as.Date(x, format), silent = TRUE)
is_date = as.character(formatted) == x & !is.na(formatted) # valid and identical to input
is_date[is.na(x)] = NA # Insert NA for NA in x
return(is_date)
}
Let's try:
> is_date(c("2020-08-11", "2020-13-32", "2020-08-11; DROP * FROM table", NA), format = "%Y-%m-%d")
## TRUE FALSE FALSE NA
I believe that what you are looking for is the tryCatch function.
The following is an excerpt from a script I wrote which accepts any .csv file with two series that share a common x axis. The first column in 'data' is the common x axis variable, and columns 2 & 3 are the y axis variables. I needed the tryCatch statement to make sure the script would create a plot regardless of whether the x axis data is a time series or some other type of variable.
### READ DATA FROM A CSV FILE
data = read.csv("STLDvsNEM2.csv", header = TRUE)
#CONVERT FIRST COLUMN OF DATA (IN MY CASE, THE COLUMN INTENDED TO BE THE X AXIS)
#TO AN ACCEPTABLE DATE FORMAT
#IF FIRST COLUMN OF DATA IS NOT IN AN ACCEPTABLE DATE FORMAT
#USE THE VALUE WITHOUT ANY TRANSFORMATION
x <- tryCatch({
    as.Date(data[,1])},
  warning = function(w) {},
  error = function(e) {
    x <- data[,1]
  })
y1 <- data[,2]
y2 <- data[,3]
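The same pattern can be wrapped in a small function. This is a sketch only; the name as_date_or_original is hypothetical, and it relies on as.Date() raising an error for character input that is not in one of its default formats:

```r
# Hypothetical helper: return Dates when x parses, otherwise x unchanged
as_date_or_original <- function(x) {
  tryCatch(
    as.Date(x),
    error = function(e) x  # fall back to the untouched input
  )
}

parsed   <- as_date_or_original(c("2020-01-31", "2020-02-29"))
fallback <- as_date_or_original(c("alpha", "beta"))
print(class(parsed))
print(fallback)
```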
