Create date column from datetime in R

I am new to R. I am an avid SAS programmer and am having a difficult time wrapping my head around R.
Within a data frame I have a date-time column stored as POSIXct, with values that appear as "2013-01-01 00:53:00". I would like to create a date column using a function that extracts the date, and another column that holds the hour. In an ideal world I would like to extract the date, year, day, month, time and hour and add each of them as an additional column within the data frame.

It is wise to always be careful with as.Date(as.POSIXct(...)):
E.g., for me in Australia:
df <- data.frame(dt=as.POSIXct("2013-01-01 00:53:00"))
df
# dt
#1 2013-01-01 00:53:00
as.Date(df$dt)
#[1] "2012-12-31"
This is problematic because the dates don't match. The mismatch appears whenever the POSIXct object is not in the UTC time zone, because as.Date defaults to tz="UTC" for this class. See here for more info: as.Date(as.POSIXct()) gives the wrong date?
To be safe you probably need to match your timezones:
as.Date(df$dt,tz=Sys.timezone()) #assuming you've just created df in the same session:
#[1] "2013-01-01"
Or safer option #1:
df <- data.frame(dt=as.POSIXct("2013-01-01 00:53:00",tz="UTC"))
as.Date(df$dt)
#[1] "2013-01-01"
Or safer option #2:
as.Date(df$dt,tz=attr(df$dt,"tzone"))
#[1] "2013-01-01"
Or alternatively use format to extract parts of the POSIXct object:
as.Date(format(df$dt,"%Y-%m-%d"))
#[1] "2013-01-01"
as.numeric(format(df$dt,"%Y"))
#[1] 2013
as.numeric(format(df$dt,"%m"))
#[1] 1
as.numeric(format(df$dt,"%d"))
#[1] 1
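Putting these pieces together, here is a minimal sketch that adds all of the requested columns to the data frame in base R. It assumes the column is called dt, as above, and reuses the time zone stored on the column so that the date and the clock fields agree. The new column names (date, year, month, day, hour, time) are only illustrative.

```r
# Example data; tz = "UTC" is set explicitly so the result is reproducible
df <- data.frame(dt = as.POSIXct("2013-01-01 00:53:00", tz = "UTC"))

# Time zone attached to the column ("" means the session's local time zone)
tz <- attr(df$dt, "tzone")

df$date  <- as.Date(df$dt, tz = tz)           # calendar date in the column's own zone
df$year  <- as.integer(format(df$dt, "%Y"))
df$month <- as.integer(format(df$dt, "%m"))
df$day   <- as.integer(format(df$dt, "%d"))
df$hour  <- as.integer(format(df$dt, "%H"))
df$time  <- format(df$dt, "%H:%M:%S")         # time of day as a character string

df
```

format() on a POSIXct vector uses the vector's own tzone attribute by default, which is why only as.Date() needs the explicit tz argument here.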

Use the lubridate package. For example, if df is a data.frame with a column dt of type POSIXct, then you could:
library(lubridate) # provides year(), month(), day(), hour(), ...
df$date = as.Date(as.POSIXct(df$dt, tz="UTC"))
df$year = year(df$dt)
df$month = month(df$dt)
df$day = day(df$dt)
# and so on...
If you can store your data in a data.table, then this is even easier:
df[, `:=`(date = as.Date(as.POSIXct(dt, tz="UTC")), year = year(dt), ...)]

Related

Convert character to date format and then compute difference in days

I know this question has probably been answered in different ways, but I am still struggling with this. I am working with a dataset where the first date column, date1, has the format '2/1/2000', '5/12/2000', '6/30/2015' and its class() is character. The second date column, date2, has the format '2015-07-06', '2015-08-01', '2017-10-09' and its class() is "POSIXct" "POSIXt".
I am attempting to standardize both columns so I can compute the difference in days between them using something like this
abs(difftime(date1 ,date2 , units = c("days")))
I have tried numerous ways of converting date1 into the same class using strptime, lubridate etc. What's the best way to standardize both columns and compute the difference in days?
sample data
x <- c('2/1/2000', '5/12/2000', '6/30/2015')
y <- as.POSIXct(c('2015-07-06', '2015-08-01', '2017-10-09'))
code
#make both posixct
x2 <- as.POSIXct(x, format = "%m/%d/%Y")
abs(x2 - y)
# Time differences in days
# [1] 5633.958 5559.000 832.000
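The fractional value 5633.958 above is consistent with a daylight-saving offset between the two timestamps. If only whole calendar days matter, one option is to compare Date objects instead of POSIXct values. The following is a sketch under the assumption that the time of day in date2 is irrelevant; tz = "UTC" is chosen here only to make the example reproducible.

```r
x <- c("2/1/2000", "5/12/2000", "6/30/2015")
y <- as.POSIXct(c("2015-07-06", "2015-08-01", "2017-10-09"), tz = "UTC")

# Convert both columns to Date, so no clock time or DST shift is involved
d1 <- as.Date(x, format = "%m/%d/%Y")
d2 <- as.Date(y, tz = "UTC")   # use the same zone the POSIXct values were created in

days_apart <- abs(as.numeric(difftime(d2, d1, units = "days")))
days_apart
# [1] 5634 5559  832
```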

Convert columns to posixct using sapply and keeping datetime format in R

I want to use sapply (or something similar) to convert certain columns to POSIXct in an R data.frame but maintain the datetime format of the columns. When I do it currently, it converts the format to numeric. How can I do this? An example is below.
#sample dataframe
df <- data.frame(
var1=c(5, 2),
char1=c('he', 'she'),
timestamp1=c('2019-01-01 20:30:08', '2019-01-02 08:27:34'),
timestamp2=c('2019-01-01 12:24:54', '2019-01-02 10:57:47'),
stringsAsFactors = F
)
#Convert only columns with 'timestamp' in name to POSIXct
df[grep('timestamp', names(df))] <- sapply(df[grep('timestamp', names(df))], function(x) as.POSIXct(x, format='%Y-%m-%d %H:%M:%S'))
df
var1 char1 timestamp1 timestamp2
1 5 he 1546392608 1546363494
2 2 she 1546435654 1546444667
Note: I can use as.POSIXlt instead of as.POSIXct and it works, but I want the data in POSIXct format. I also tried converting to POSIXlt first and then to POSIXct, but that also ended up converting the columns to numeric.
Use lapply rather than sapply. The "s" in sapply stands for "simplify": it tries to turn the result into a matrix. A matrix cannot hold POSIXct values, so the class is dropped and you get a plain numeric matrix. If you keep the result as a list, the class is preserved.
df[grep('timestamp', names(df))] <- lapply(df[grep('timestamp', names(df))], function(x) as.POSIXct(x, format='%Y-%m-%d %H:%M:%S'))
You could also do this fairly easily with dplyr
library(dplyr)
df %>% mutate_at(vars(contains("timestamp")), as.POSIXct)
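To see the difference between the two functions directly, the sketch below runs both on a cut-down version of the sample data and inspects the result. The object names cols and simplified are illustrative, and tz = "UTC" is an assumption made only so the example does not depend on the local time zone.

```r
df <- data.frame(
  timestamp1 = c("2019-01-01 20:30:08", "2019-01-02 08:27:34"),
  timestamp2 = c("2019-01-01 12:24:54", "2019-01-02 10:57:47"),
  stringsAsFactors = FALSE
)
cols <- grep("timestamp", names(df))

# sapply simplifies the per-column results into one matrix, dropping the class
simplified <- sapply(df[cols], as.POSIXct, tz = "UTC")
is.matrix(simplified)            # TRUE
inherits(simplified, "POSIXct")  # FALSE: only the underlying seconds remain

# lapply returns a list of POSIXct vectors, which can be assigned back as columns
df[cols] <- lapply(df[cols], as.POSIXct, tz = "UTC")
sapply(df, function(col) class(col)[1])
```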

Convert multiple date format factor to date type in R

I have a variable in a data frame which holds dates (month-year) in different formats, for example Jan-62, 98-Apr, March-1987.
The variable type is FACTOR at this point. I need help converting this variable to Date or POSIXct. I tried the function parse_date_time from the lubridate package. It helped a little, but Jan-62 is read as 01/01/2062 when it should be 01/01/1962. I tried the function cutoff_2000 but I'm not getting the desired output.
Request your help.
Regards,
Aravindan S
Use parse_date_time and then subtract off 100 years from those components having a year beyond 2019:
library(lubridate) # provides parse_date_time() and year()
x <- factor( c("Jan-62", "98-Apr", "March-1987") ) # input
p <- parse_date_time(x, c("my", "ym", "mY"))
year(p) <- year(p) - 100 * (year(p) > 2019)
p
## [1] "1962-01-01 UTC" "1998-04-01 UTC" "1987-03-01 UTC"
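The same "subtract a century" idea can be sketched in base R for the two-digit-year values. This is not the method from the answer above, only an illustration: fix_century and its cutoff argument are hypothetical names, the example covers only inputs of the form Mon-yy, and it assumes an English locale so that %b matches "Jan" and "Apr".

```r
# Hypothetical helper: move any date later than `cutoff` back by 100 years
fix_century <- function(d, cutoff = as.Date("2019-12-31")) {
  lt <- as.POSIXlt(d)
  late <- !is.na(d) & d > cutoff
  lt$year[late] <- lt$year[late] - 100
  as.Date(lt)
}

# as.Date needs a day, so prepend "01-"; %y maps 00-68 to 20xx and 69-99 to 19xx
d <- as.Date(paste0("01-", c("Jan-62", "Apr-98")), format = "%d-%b-%y")
fixed <- fix_century(d)
fixed
```

The cutoff plays the same role as the 2019 threshold in the answer: any parsed year beyond it is assumed to belong to the previous century.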
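You can use the function as.Date with a format string that matches your data:
yourvariable <- as.Date(yourvariable, "%m/%d/%Y")
(%m is the month number)
(%d is the day of the month)
(%Y is the four-digit year; %y is the two-digit year)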

r intersect of date in with year and month

I would like to find the intersection of two dataframes based on the date column.
Previously, I have been using this command to find the intersect of a yearly date column (where the date only contained the year)
common_rows <-as.Date(intersect(df1$Date, df2$Date), origin = "1970-01-01")
But now my date column for df1 is of type date and looks like this:
1985-01-01
1985-04-01
1985-07-01
1985-10-01
My date column for df2 is also of type date and looks like this (notice the days are different)
1985-01-05
1985-04-03
1985-07-07
1985-10-01
The above command works fine when I keep the format like this (i.e. year, month and day), but since my days are different and I am interested in the monthly intersection, I dropped the days like this. That produces an error when I look for the intersection:
df1$Date <- format(as.Date(df1$Date), "%Y-%m")
common_rows <-as.Date(intersect(df1$Date, df2$Date), origin = "1970-01-01")
Error in charToDate(x) :
character string is not in a standard unambiguous format
Is there a way to find the intersection of the two datasets, based on the year and month, while ignoring the day?
The problem is the as.Date() call wrapping your final output: a year-month string such as "1985-01" has no day, so as.Date() cannot parse it. If you are fine with simple strings (with both columns formatted as "%Y-%m") then use common_rows <-intersect(df1$Date, df2$Date). Otherwise, try:
common_rows <-as.Date(paste(intersect(df1$Date, df2$Date),'-01',sep = ''), origin = "1970-01-01")
Try this:
date1 <- c('1985-01-01','1985-04-01','1985-07-01','1985-10-01')
date2 <- c('1985-01-05','1985-04-03','1985-07-07','1985-10-01')
# keep only the year-month part (first 7 characters); substr is vectorized
date1 <- substr(date1, 1, 7)
date2 <- substr(date2, 1, 7)
print(intersect(date1, date2))
[1] "1985-01" "1985-04" "1985-07" "1985-10"
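If the goal is to keep the matching rows of each data frame rather than just the month labels, the same idea can be applied to Date columns with format(). This is a sketch: the column name Date follows the question, the helper vectors ym1 and ym2 are illustrative, and the last date of df2 has been changed to November here so that the intersection is not trivially every row.

```r
df1 <- data.frame(Date = as.Date(c("1985-01-01", "1985-04-01", "1985-07-01", "1985-10-01")))
df2 <- data.frame(Date = as.Date(c("1985-01-05", "1985-04-03", "1985-07-07", "1985-11-01")))

# Year-month keys as strings; the original Date columns stay untouched
ym1 <- format(df1$Date, "%Y-%m")
ym2 <- format(df2$Date, "%Y-%m")

common_months <- intersect(ym1, ym2)

# Rows of each data frame that fall in a month present in both
df1_common <- df1[ym1 %in% common_months, , drop = FALSE]
df2_common <- df2[ym2 %in% common_months, , drop = FALSE]
```

Keeping the keys in separate vectors avoids overwriting the Date column with strings, so later date arithmetic on df1 and df2 still works.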

Use dplyr::mutate and lubridate::force_tz based on arguments from data frame columns

I am trying to use lubridate::force_tz to add timezone information to timestamps (date+time) formatted as strings (as.character()). Both are stored as two columns in a data frame:
require(lubridate)
require(dplyr)
row1<-c(as.character(now()),"Etc/UTC")
row2<-c(as.character(now()+5),"America/Chicago")
df<-as.data.frame(rbind(row1,row2))
names(df)<-c("dt","tz")
x<-force_tz(as.POSIXct(as.character(now())),"Etc/UTC") #works
df<-df%>%mutate(newDT=force_tz(as.POSIXct(dt),tz)) #fails
I get: Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('matrix', 'character')"
Following Stibu's comments, I tried (an un-R like) approach with an iteration:
for (i in seq(from=1,to=length(df$dt))){
timestamp<-as.character(df[i,1])
tz<-as.character(df[i,2])
print(tz)
newdt<-force_tz(as.POSIXct(timestamp),tz)
df[i,3]<-newdt
print(attr(df[i,3],"tzone"))
df$timezone<-attr(df[i,3],"tzone")
}
This extracts the values correctly, but seems to get stuck with setting the value of the tz to the first value encountered - weirdly:
[1] "Etc/UTC"
[1] "Etc/UTC"
[1] "America/Chicago"
[1] "Etc/UTC"
I would have expected the last printout to result in "America/Chicago"
The df then looks like:
> df
dt tz newDT timezone
1 2016-04-13 23:07:45 Etc/UTC 2016-04-13 23:07:45 Etc/UTC
2 2016-04-13 23:07:50 America/Chicago 2016-04-14 04:07:50 Etc/UTC
There are actually two issues in your code, which I will discuss separately below.
dplyr works with data frames
Your df is a matrix, not a data frame. But mutate() (and functions from dplyr in general) works with data frames. The error message simply tells you that mutate() does not know what to do with a matrix.
You can solve this by converting df to a data frame:
df <- as.data.frame(df)
names(df)<-c("dt","tz")
A remark regarding names(): This function can be used to get/set the column names of a data frame. For matrices, the corresponding function is colnames(). You used names() on a matrix, which did not set the column names of the matrix. Therefore, the names of the data frame are also not set after conversion.
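The names() versus colnames() point can be checked with a small sketch. The matrix below uses fixed strings in place of now() so that it is reproducible, and the object names m, m_named and df_fixed are illustrative.

```r
# A character matrix built row-wise, similar to rbind(row1, row2) in the question
m <- rbind(row1 = c("2016-04-13 23:07:45", "Etc/UTC"),
           row2 = c("2016-04-13 23:07:50", "America/Chicago"))

# names() on a matrix labels individual elements, not columns
m_named <- m
names(m_named) <- c("dt", "tz")
colnames(m_named)   # still NULL

# colnames() sets the column names, which as.data.frame() then carries over
colnames(m) <- c("dt", "tz")
df_fixed <- as.data.frame(m, stringsAsFactors = FALSE)
names(df_fixed)
```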
You could also create a data frame from the start as follows:
df <- data.frame(dt = as.character(c(now(), now() + 5)),
tz = c("Etc/UTC", "America/Chicago"),
stringsAsFactors = FALSE)
Note that you need to define the contents column-wise, not row-wise as you did.
If you use the data frame df, there will be no error from mutate().
One time zone per vector
Unfortunately, there is a second issue. What you want to do simply cannot be done. The reason is the following.
Let's convert the first column of df to POSIXct with time zone CET:
ts <- as.POSIXct(df$dt, tz = "CET")
ts
## [1] "2016-04-13 14:42:26 CEST" "2016-04-13 14:42:31 CEST"
Let's try to do the same with two time zones:
ts <- as.POSIXct(df$dt, tz = c("CET", "UTC"))
## Error in strptime(xx, f <- "%Y-%m-%d %H:%M:%OS", tz = tz) :
## invalid 'tz' value
This does not work. The reason is that there is a single time zone per vector and not a time zone per element in the vector. Look at the attributes of ts:
attributes(ts)
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "CET"
The time zone is set as an attribute of the entire vector and it is not a property of each element.
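Given that restriction, one possible workaround (a sketch, not part of the answer above) is to interpret each string in its own zone, keep only the resulting instant, and store all instants in a single POSIXct column with one common zone such as UTC. The tz column then remains available for displaying a row in its local time. The column name utc and the helper vector secs are illustrative.

```r
df <- data.frame(
  dt = c("2016-04-13 23:07:45", "2016-04-13 23:07:50"),
  tz = c("Etc/UTC", "America/Chicago"),
  stringsAsFactors = FALSE
)

# Parse each string in its own zone; keep only seconds since the epoch
secs <- mapply(function(d, z) as.numeric(as.POSIXct(d, tz = z)),
               df$dt, df$tz, USE.NAMES = FALSE)

# One POSIXct vector, one time zone attribute for the whole column
df$utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Per-row local display is still possible by formatting element by element
df$local <- mapply(function(t, z) format(as.POSIXct(t, origin = "1970-01-01", tz = z),
                                         "%Y-%m-%d %H:%M:%S"),
                   secs, df$tz, USE.NAMES = FALSE)
df
```

The second row comes out as 2016-04-14 04:07:50 in the utc column, which matches the newDT value shown in the question's output.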
