library(PerformanceAnalytics)
to get the edhec data set
edhec['2000-12-31::2001-12-31',1]
is what I'm trying to obtain.
So far I have tried :
date_begin_test <- as.Date("2000-12-31")
date_end_test <- as.Date("2001-12-31")
I have tried as.POSIXct as well as plain strings
edhec[date_begin_test::date_end_test,1]
edhec[date_begin_test/date_end_test,1]
edhec[paste("'",date_begin_test,'::',date_end_test,"'",sep=''),1]
edhec[noquote(paste("'",date_begin_test,'::',date_end_test,"'",sep='')),1]
The last one is the most puzzling. It gives me every value from the beginning and stops at date_end_test.
You were close, this works:
edhec[paste(date_begin_test, '::', date_end_test, sep = ""), 1]
Personally, I would use:
edhec[paste(date_begin_test, date_end_test, sep="::"), 1]
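To see why the quoted attempts from the question misbehave, it can help to look at the strings that paste() actually builds. This is a base R sketch only and does not touch xts; the names with_quotes and without_quotes are made up for illustration. My guess is that the literal quote characters keep the start of the range from being read as a date, but I have not checked how xts parses it.

```r
date_begin_test <- as.Date("2000-12-31")
date_end_test <- as.Date("2001-12-31")

# The attempt from the question: the quote characters become part of the string
with_quotes <- paste("'", date_begin_test, "::", date_end_test, "'", sep = "")

# The working version: just the two dates joined by "::"
without_quotes <- paste(date_begin_test, date_end_test, sep = "::")

print(with_quotes)
print(without_quotes)
```

The first string carries a leading and a trailing apostrophe; the second is the plain range string that the hard-coded call used.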
Or use this:
x.subset=seq.Date(date_begin_test+1,date_end_test+1,by="month")-1
edhec[as.character(x.subset),1]
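The seq.Date line above is compact, so here is the same idea spelled out step by step in base R. The names month_starts and month_ends are mine, not part of the answer; the point is only to show which dates end up in the index.

```r
date_begin_test <- as.Date("2000-12-31")
date_end_test <- as.Date("2001-12-31")

# Shift to the first day of the following month, step by month,
# then shift back one day to land on each month end
month_starts <- seq(date_begin_test + 1, date_end_test + 1, by = "month")
month_ends <- month_starts - 1

print(month_ends)
```

This relies on both endpoints being month ends, as they are in the question; with arbitrary dates the shifted sequence would not line up with the index of a monthly series.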
A slightly different approach with lubridate
require(lubridate)
edhec[index(edhec) %within% (ymd("2000-12-31") %--% ymd("2001-12-31")), 1]
I have a column of military time values, df1$appt_times, in the format "13:30". All of them are 5 characters long, like "00:00". I have tried POSIXct, but it added today's date to the values. I have also tried lubridate and couldn't get that to work. Most recently I have been trying chron, so far also without success.
The goal is that once this is done I am going to group the times into factor levels. I cannot perform any conditional operations on them currently, unless I am wrong about that as well ;)
> df1$Time <- chron(times = df1$appt_time)
Error in convert.times(times., fmt) : format h:m:s may be incorrect
In addition: Warning message:
In unpaste(times, sep = fmt$sep, fnames = fmt$periods, nfields = 3) :
106057 entries set to NA due to wrong number of fields
Also, df1$Time <- chron(times(df1$appt_time)) gives the same error as above,
as do different attempts at being explicit with the format:
> df1$appt_time <- chron(df1$appt_time, format = "h:m")
Error in widths[, fmt$periods, drop = FALSE] : subscript out of bounds
I would be very grateful if someone could point out my error or suggest a better way to accomplish this task.
You can use as.POSIXct :
df1$date_time <- as.POSIXct(df1$appt_time, format = '%H:%M', tz = 'UTC')
Since you don't have dates, this will assign today's date, with the time of day taken from appt_time.
For example -
as.POSIXct('13:30', format = '%H:%M', tz = 'UTC')
#[1] "2021-02-01 13:30:00 UTC"
One way to overcome this problem if you need to perform arithmetic on the times prior to grouping them is to treat the minutes as a fraction of the hour:
# If you need to do some extra arithmetic prior to coercing to factor:
as.numeric(substr(test1, 1, 2)) + (as.numeric(substr(test1, 4, 5))/60)
# Otherwise:
as.factor(test1)
Here test1 stands in for df1$appt_times:
test1 <- c('13:30','13:45', '14:00', '14:15', '14:30', '14:45', '15:00')
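Building on the fraction-of-the-hour idea, here is a rough sketch of how the decimal hours could be binned into factor levels with cut(). The sample values, the helper name to_decimal_hours, and the cut points are all assumptions for illustration; the real ranges would come from your own data.

```r
# Hypothetical sample values standing in for df1$appt_times
appt_time <- c("08:15", "11:45", "12:00", "13:30", "16:59", "23:10")

# Convert "HH:MM" strings to decimal hours (minutes as a fraction of the hour)
to_decimal_hours <- function(x) {
  as.numeric(substr(x, 1, 2)) + as.numeric(substr(x, 4, 5)) / 60
}

hours <- to_decimal_hours(appt_time)

# Assumed ranges: [8, 12) Morning, [12, 15) Mid-Day, [15, 18) Afternoon.
# Anything outside the breaks becomes NA.
appt_group <- cut(hours,
                  breaks = c(8, 12, 15, 18),
                  labels = c("Morning", "Mid-Day", "Afternoon"),
                  right = FALSE)

print(data.frame(appt_time, hours, appt_group))
```

Using right = FALSE makes each interval include its lower bound, so "12:00" falls into Mid-Day rather than Morning.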
Not being able to find a way to work with the times as I had hoped, I came up with this DIIIIIRRRRRRRRRRRTY solution.
#converted appt_time to POSIXct format, which added today's date
df9$appt_time <- as.POSIXct(df9$appt_time, format = '%H:%M')
#Since I am only interested in creating a value based on whether the time falls within a specific range, I decided I could output this new value, 'unclassed', to a column and then manually eyeball the values that corresponded to my ranges
df9$convert <- unclass(df9$appt_time)
#Using the manually obtained unclassed values, I was able to create the factor levels I wanted
group_appt_time <- function(convert){
ifelse (convert >= 1612624500 & convert <= 1612637100, 'Morning',
ifelse (convert >= 1612638000 & convert <= 1612647900, 'Mid-Day',
ifelse (convert >= 1612648800 & convert <= 1612658700, 'Afternoon',
'Invalid Time')))
}
df9$appt_time_grouped <- as.factor(group_appt_time(df9$convert))
This is a research project, not something I need to recreate in an ongoing manner, so it works.
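The unclassed values above are seconds since 1970 and so include the date on which the code was run. A variant that would not depend on the run date is to work in seconds since midnight instead. This is only a sketch: the sample times and the range boundaries below are invented, since the real cut-offs are not stated in the post.

```r
# Hypothetical sample values; the real df9$appt_time is not shown
appt_time <- c("07:15", "10:45", "11:00", "13:45", "14:00", "16:45", "20:00")

# Parse to POSIXlt only to pull out the hour and minute fields
parsed <- as.POSIXlt(appt_time, format = "%H:%M", tz = "UTC")
secs_since_midnight <- parsed$hour * 3600 + parsed$min * 60

# Assumed ranges, for illustration only
group_appt_time <- function(secs) {
  ifelse(secs >= 7 * 3600 & secs < 11 * 3600, "Morning",
  ifelse(secs >= 11 * 3600 & secs < 14 * 3600, "Mid-Day",
  ifelse(secs >= 14 * 3600 & secs < 17 * 3600, "Afternoon",
         "Invalid Time")))
}

appt_time_grouped <- as.factor(group_appt_time(secs_since_midnight))
print(appt_time_grouped)
```

Because the date never enters the comparison, the same thresholds keep working whenever the script is rerun.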
A lot of people seem to have this issue; however, I was not able to find a satisfying answer. If you will indulge me, I would like to be sure I understand what's happening.
I have dates in various formats in a data frame (also a common issue), so I have built a small function to handle them for me:
dateHandler <- function(inputString){
if(grepl("-",inputString)==T){
lubridate::dmy(inputString, tz="GMT")
}else{
as.POSIXct(as.numeric(inputString)*60*60*24, origin="1899-12-30", tz="GMT")
}
}
When using it on one element it works fine:
myExample <-c("18-Mar-11","42433")
> dateHandler(myExample[1])
[1] "2011-03-18 GMT"
> dateHandler(myExample[2])
[1] "2016-03-04 GMT"
However when using it on a whole column it does not work:
myDf <- as.data.frame(myExample)
> myDf <- myDf %>%
+ dplyr::mutate(dateClean=dateHandler(myExample))
Warning messages:
1: In if (grepl("-", inputString) == T) { :
the condition has length > 1 and only the first element will be used
2: 1 failed to parse.
From reading on the forum, my current understanding is that R passes a vector with all the elements of myDf$myExample to the function, which is not built to handle vectors of length > 1. If that is correct, the next step is to understand what to do from there. Many people recommend using ifelse rather than if, but I do not understand how this would help me. I have also read that ifelse returns something of the same format as its input, which does not work for me in this case.
Thank you in advance for answering this question for the 10000th time.
Nicolas
You have two options on where to go from there. One is to apply your current function to each element using lapply (note that this makes dateClean a list column), as in:
myDf$dateClean <- lapply(myDf$myExample, function(x) dateHandler(x))
The other option is to build a vectorized function that is designed to take a vector as an input rather than a single data point. Here is a simple example:
dateHandlerVectorized <- function(inputVector){
  # Work on character input so as.numeric() is not applied to factor codes
  inputVector <- as.character(inputVector)
  # Pre-allocate an all-NA POSIXct vector in GMT to fill in below
  output <- as.POSIXct(rep(NA_real_, length(inputVector)), origin="1970-01-01", tz="GMT")
  UseLubridate <- grepl("-", inputVector)
  output[UseLubridate] <- lubridate::dmy(inputVector[UseLubridate], tz="GMT")
  output[!UseLubridate] <- as.POSIXct(as.numeric(inputVector[!UseLubridate])*60*60*24, origin="1899-12-30", tz="GMT")
  output
}
myDf <- myDf %>% dplyr::mutate(dateClean=dateHandlerVectorized(myExample))
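For comparison, the same logical-indexing pattern can be written with base R only. This is a sketch, not a replacement for the function above: the name date_handler_base is made up, it returns Date rather than POSIXct, and the "%d-%b-%y" format assumes English month abbreviations (the %b specifier depends on the locale).

```r
date_handler_base <- function(x) {
  x <- as.character(x)

  # Start from an all-NA Date vector and fill each group in place
  out <- as.Date(rep(NA_real_, length(x)), origin = "1970-01-01")
  has_dash <- grepl("-", x, fixed = TRUE)

  # Strings such as "18-Mar-11"
  out[has_dash] <- as.Date(x[has_dash], format = "%d-%b-%y")

  # Spreadsheet-style serial day numbers such as "42433"
  out[!has_dash] <- as.Date(as.numeric(x[!has_dash]), origin = "1899-12-30")

  out
}

result <- date_handler_base(c("18-Mar-11", "42433"))
print(result)
```

Each branch only ever sees the elements it can parse, which is why neither the length-1 `if` warning nor the "failed to parse" warning comes up.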
I am trying to make a new column in my data.table. I have two columns, one with a start date and one with an end date. The starting date always is 2016-02-28. The end date in some cases is 2014-12-31 and in others it is 2020-12-31 (all in YYYY-MM-DD format).
In the first case it's evident that I should get a negative difference in dates. In the second case it is positive.
I want to use the sapply function with an ifelse statement to determine the difference in dates. Any time the difference is negative, I want R to replace it with the value 1.
I do this as follows.
sapply(df$end.date, function(x) { ifelse(df$end.date>start_date, as.integer(length(seq(from=start_date, to=as.POSIXct(x,format="%Y-%m-%d"), by ='month')) ), 1) } )
Unfortunately, I get the following error
Error in seq.POSIXt(from = start_date, to = as.POSIXct(df$end.date, :
'from' must be of length 1
How can I make this work?
PS: both start_date and df$end.date are in POSIXct format in a data.table.
ifelse is already vectorised; doubling up sapply and ifelse is redundant.
Unfortunately ifelse won’t work here because we cannot get the month difference for negative dates (as per your comment). So we just use if in combination with mapply instead:
months_between = function (start, end) {
if (end > start)
length(seq(start, end, by = 'month'))
else
1
}
df$new_column = mapply(months_between, df$start.date, df$end.date)
I’m also pretty sure that there’s a better way to write months_between but I’m not versed in the base R date manipulation functions since they are generally quite bad; I recommend using the ‹lubridate› package instead.
I think your approach is overly complicated. If you're going to use sapply, you ought to be able to avoid ifelse, since you will be able to focus on one value at a time (this assumes you are running a vector through sapply; it might not hold true if you are running a list through sapply). If you really want to use an apply function, however, you'd be better off using mapply with an if ... else clause.
But the apply function isn't necessary at all. In fact, the ifelse function isn't necessary. You can simplify the process a great deal with:
# Borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404
elapsed_months <- function(end_date, start_date) {
mapply(
function(end_date, start_date){
ed <- as.POSIXlt(end_date)
sd <- as.POSIXlt(start_date)
12 * (ed$year - sd$year) + (ed$mon - sd$mon)
},
end_date,
start_date,
SIMPLIFY = TRUE  # return a plain numeric vector rather than a list
)
}
DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
end = as.Date(c("2014-12-31", "2020-12-31")))
DFrame$diff <- elapsed_months(DFrame$end, DFrame$start)
DFrame$diff[DFrame$diff < 0] <- 1
DFrame
All I did was calculate the difference for every row, find the negative values, and replace them with 1.
An alternative approach would be to do the indexing up front. This way you aren't calculating the difference in dates for any values you will eventually change. This might have a benefit if you have a few million rows, but I would guess the performance increase would be small.
DFrame$diff2 <- vector("numeric", nrow(DFrame))
end_first <- DFrame$end < DFrame$start
DFrame$diff2[!end_first] <- elapsed_months(DFrame$end[!end_first], DFrame$start[!end_first])
DFrame$diff2[end_first] <- 1
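Since the POSIXlt components are themselves vectors, the month arithmetic can also be done without mapply at all. The sketch below uses a made-up function name, elapsed_months_vec, and the same two example dates. Note that it counts calendar-month boundaries crossed, which is not the same number you would get from length(seq(..., by = "month")).

```r
elapsed_months_vec <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  # $year and $mon are vectors, so this works element-wise
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

start <- as.Date(c("2016-02-28", "2016-02-28"))
end <- as.Date(c("2014-12-31", "2020-12-31"))

diff_months <- elapsed_months_vec(end, start)

# Replace negative differences with 1, as the question asks
diff_months[diff_months < 0] <- 1
print(diff_months)
```

For the two example rows this gives 1 (the raw difference was -14) and 58.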
I have a csv data set with a column that contains dates. After importing the data set to R, we need to subset the data set based on certain date range.
app1110 <- read.csv("file_11102015.csv")
app1110$appcom_date2 <- app1110$APPLICATION..COMPLETED..DATE
Then we tried:
app1110$appcom_date2 <- format(as.POSIXct(app1110$appcom_date2, format= "%m/%d/%Y"), format="%m/%d/%Y")
subset(app1110, as.Date(appcom_date2 < "12/30/2013"))
The error message:
Error in as.Date.default(appcom_date2 < "12/30/2013") : do not know
how to convert 'appcom_date2 < "12/30/2013"' to class “Date”
So how can I subset data based on the date range?
Without seeing your data, I suspect you need to change this:
as.Date(appcom_date2 < "12/30/2013")
to this:
appcom_date2 < as.Date("12/30/2013", "%m/%d/%Y")
Or better still:
appcom_date2 < as.Date("2013-12-30")
The key point being that you need to coerce the string ("12/30/2013") to a Date object and then make the comparison.
Thanks, the problem was comparing character to date types. This fixed it:
app1110$appcom_date2 <- as.Date(app1110$appcom_date2, format="%m/%d/%Y")
subset(app1110, appcom_date2 < as.Date("2013-12-31") & appcom_date2 > as.Date("2013-06-01"))
I have another question: when subsetting, I am using the appcom_date2 variable as the criterion to set the period. How do I also exclude all NA values in that variable?
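On the NA follow-up, one possible approach is to add !is.na() to the condition when indexing with square brackets. The data frame below is invented for illustration (the column name is taken from the thread, the values are not). As far as I know, subset() already drops rows where the condition evaluates to NA, so the explicit check mainly matters with bracket indexing.

```r
# Hypothetical data standing in for app1110
app <- data.frame(
  id = 1:5,
  appcom_date2 = c("07/15/2013", "01/02/2013", NA, "12/30/2013", "03/01/2014"),
  stringsAsFactors = FALSE
)

# Convert the character column to Date before comparing
app$appcom_date2 <- as.Date(app$appcom_date2, format = "%m/%d/%Y")

# Keep rows inside the period and explicitly drop missing dates
in_range <- !is.na(app$appcom_date2) &
  app$appcom_date2 > as.Date("2013-06-01") &
  app$appcom_date2 < as.Date("2013-12-31")

result <- app[in_range, ]
print(result)
```

Without the !is.na() term, bracket indexing would return an all-NA row for the missing date instead of dropping it.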
I have a date in the format dd-mm-yyyy HH:mm:ss
What is the best and easiest way to validate this date?
I tried
d <- format.Date(date, format="%d-%m-%Y %H:%M:%S")
But how can I catch the error when an illegal date is passed?
Simple way:
d <- try(as.Date(date, format="%d-%m-%Y %H:%M:%S"))
if("try-error" %in% class(d) || is.na(d)) {
print("That wasn't correct!")
}
Explanation: format.Date uses as.Date internally to convert date into an object of the Date class. However, it does not pass the format option along for parsing, so as.Date falls back on its default formats, trying %Y-%m-%d and then %Y/%m/%d.
The format option from format.Date is used only for the output, not for the parsing. Quoting from the as.Date man page:
The ‘as.Date’ methods accept character strings, factors, logical
‘NA’ and objects of classes ‘"POSIXlt"’ and ‘"POSIXct"’. (The
last is converted to days by ignoring the time after midnight in
the representation of the time in specified timezone, default
UTC.) Also objects of class ‘"date"’ (from package ‘date’) and
‘"dates"’ (from package ‘chron’). Character strings are processed
as far as necessary for the format specified: any trailing
characters are ignored.
However, when you directly call as.Date with a format specification, nothing else will be allowed than what fits your format.
See also: ?as.Date
You may want to look at the gsubfn package. It has functions (gsubfn specifically) that work like other regular expression functions to match pieces of a string, but then call a user-supplied function and pass the matching pieces to it. So you would write your own function that looks at the year, month, and day and makes sure that they are in the correct ranges (and the range for day can depend on the passed month and year).
This might be helpful if flexibility is desired in a date-time entry.
I have a function where I want to allow either a date-only entry or a date-time entry, then set a flag - for use inside the function only. I'm calling this flag data_type. The flag will be used later in the larger function to select units for getting a difference in two dates with difftime. (In most cases, the function will be perfectly fine with date only, but in some cases a user might need a shorter time frame. I don't want to inconvenience users with the shorter time frame if they don't need it.)
I am posting this for two reasons: 1) to help anyone trying to allow flexibility in date arguments and 2) to welcome sanity checks in case there's a problem with the method, since this is going into a function in an R package.
dat_time_check_fn <- function(dat_time) {
if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d %H:%M:%S"))) date_type <- 1
else if (!anyNA(as.Date(dat_time, format= "%Y-%m-%d"))) date_type <- 2
else stop("Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59' ")
date_type
}
Date-time case
date5 <- "1999-12-31 23:59:59"
date_type <- dat_time_check_fn(date5)
date_type
[1] 1
Date only case:
date6 <- "1999-12-31"
date_type <- dat_time_check_fn(date6)
date_type
[1] 2
Note that if the order above in the function is reversed, the longer date-time can be inadvertently coerced to the shorter version and both types result in date_type = 1.
My larger function has more than one date, but I need them to be compatible. Below, I'm checking the two dates checked above, where one was type 1 and one was type 2. Combining types gives the result with date only (type 2):
date_type <- dat_time_check_fn(c(date5, date6))
date_type
[1] 2
Here's a non-compliant version:
date7 <- "1/31/2011"
date_type <- dat_time_check_fn(date7)
Error in dat_time_check_fn(date7) :
Error: dates must either be in format '1999-12-31' or '1999-12-31 23:59:59'
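To show how such a flag might feed into difftime later on, here is a small sketch. The helper pick_units and the mapping of 1 to "secs" and 2 to "days" are assumptions made for this example; the post does not say which units the larger function actually uses.

```r
# Hypothetical mapping from the flag to difftime units:
# 1 = date-time input, 2 = date-only input
pick_units <- function(date_type) {
  if (date_type == 1) "secs" else "days"
}

start <- as.POSIXct("1999-12-31", tz = "UTC")
end <- as.POSIXct("2000-01-02", tz = "UTC")

elapsed_days <- difftime(end, start, units = pick_units(2))
elapsed_secs <- difftime(end, start, units = pick_units(1))

print(elapsed_days)
print(elapsed_secs)
```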
Many solutions here are prone to SQL injection. They return TRUE for date = "2020-08-11; DROP * FROM my_table". Here is a vectorized base R function that works with NA:
is_date = function(x, format = NULL) {
formatted = try(as.Date(x, format), silent = TRUE)
is_date = as.character(formatted) == x & !is.na(formatted) # valid and identical to input
is_date[is.na(x)] = NA # Insert NA for NA in x
return(is_date)
}
Let's try:
> is_date(c("2020-08-11", "2020-13-32", "2020-08-11; DROP * FROM table", NA), format = "%Y-%m-%d")
## TRUE FALSE FALSE NA
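The same round-trip idea can be extended to the date-time format from the question (dd-mm-yyyy HH:mm:ss). This is a sketch along the lines of is_date, with a made-up name is_datetime; it parses with strptime in UTC and accepts a value only if formatting the parsed result reproduces the input exactly.

```r
is_datetime <- function(x, format = "%d-%m-%Y %H:%M:%S") {
  parsed <- as.POSIXct(strptime(x, format = format, tz = "UTC"))

  # Valid only if parsing succeeded and the round trip matches the input,
  # which rejects trailing characters that strptime would otherwise ignore
  ok <- !is.na(parsed) & format(parsed, format) == x

  ok[is.na(x)] <- NA  # keep NA for missing input
  ok
}

checks <- is_datetime(c("31-12-1999 23:59:59",
                        "31-13-1999 10:00:00",
                        "31-12-1999 23:59:59; junk",
                        NA))
print(checks)
```

A side effect of the exact comparison is that unpadded input such as "1-2-1999 3:04:05" is rejected even though strptime can parse it, which may or may not be what you want.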
I believe that what you are looking for is the tryCatch function.
The following is an excerpt from a script I wrote which accepts any .csv file with two series that have a common x axis. The first column in 'data' is the common x axis variable, and columns 2 & 3 are the y axis variables. I needed the tryCatch statement to make sure the script would create a plot regardless of whether the x axis data is a time series or some other type of variable.
### READ DATA FROM A CSV FILE
data = read.csv("STLDvsNEM2.csv", header = TRUE)
#CONVERT FIRST COLUMN OF DATA (IN MY CASE, THE COLUMN INTENDED TO BE THE X AXIS)
#TO AN ACCEPTABLE DATE FORMAT
#IF FIRST COLUMN OF DATA IS NOT IN AN ACCEPTABLE DATE FORMAT
#USE THE VALUE WITHOUT ANY TRANSFORMATION
x <- tryCatch({
as.Date(data[,1])},
warning = function(w) {},
error = function(e) {
x <- data[,1]
})
y1 <- data[,2]
y2 <- data[,3]
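The same fallback can be wrapped in a small helper so it is reusable for any column. This is a sketch with a made-up name, as_date_or_original. Unlike the excerpt above, where the warning handler returns NULL, this version falls back to the original values on a warning as well as on an error.

```r
as_date_or_original <- function(x) {
  tryCatch(
    as.Date(x),
    error = function(e) x,    # not parseable as dates: keep the input as is
    warning = function(w) x   # treat warnings the same way
  )
}

dates_in <- c("2020-01-01", "2020-02-01")
other_in <- c("a", "b")

x_dates <- as_date_or_original(dates_in)
x_other <- as_date_or_original(other_in)

print(class(x_dates))
print(x_other)
```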