I have two dataframes:
dat is a 9752x8 dataframe that contains some POSIXlt dates
trips.df is a 35772x28 dataframe that contains hourly temperature data
I would like to save the corresponding temperature for each date in dat.
I have tried:
trips.df$temperature<-lapply(trips.df$fin, function(x){
dat_meteo[dat_meteo$Date.Heure==round(x,"hours"),7]})
But I got this error, which makes me think that x is not passed as a datetime variable
Error in round(x, "hours") :
non-numeric argument to mathematical function
I have also tried this:
merge(trips.df,dat_meteo[,c(1,7)])
But I also got an error:
Error: cannot allocate vector of size 653.8 Mb
Any advice on how to retrieve data on dat_meteo by dates?
I am using R version 3.4.0 with RStudio Version 1.0.143 on Windows 10
And here is an excerpt of my data:
> head(trips.df$fin)
[1] "2013-06-25 16:34:16 EDT" "2013-06-25 16:34:16 EDT" "2013-06-26 13:00:05 EDT"
[4] "2013-06-29 12:52:21 EDT" "2013-06-29 15:34:13 EDT" "2013-06-29 17:39:29 EDT"
> dat_meteo[1870:1875,c(1,7)]
Date.Heure Temp...C.
1870 2013-03-19 18:00:00 -1,2
1871 2013-03-19 19:00:00 -1,7
1872 2013-03-19 20:00:00 -2,1
1873 2013-03-19 21:00:00 -2,8
1874 2013-03-19 22:00:00 -3,0
1875 2013-03-19 23:00:00 -3,7
You may want to take a slightly different approach and use data.table.
library(data.table)
trips.dt <- data.table(trips.df)
dat <- data.table(dat)
trips.dt <- trips.dt[ , dates.a := strptime(as.POSIXct(fin,format='%m/%d/%Y %H:%M:%S'),format='%m/%d/%Y')][,dates.b := dates.a]
# the leading comma is required: := goes in the j argument
dat <- dat[ , dates.dat.a := strptime(as.POSIXct(Date.Heure, format = '%m/%d/%Y %H:%M:%S'),format='%m/%d/%Y')][, dates.dat.b := dates.dat.a]
setkey(trips.dt, id, dates.a, dates.b)
setkey(dat , id, dates.dat.a, dates.dat.b)
# foverlaps() works on keyed data.tables, so pass trips.dt rather than trips.df
combo <- foverlaps(trips.dt, dat, type = "within")
This creates date ranges for both trips.df and dat after converting them to data.tables, then joins the trips to dat with foverlaps and stores the result as combo.
Make sure that the two time columns you want to match have the same format (POSIXct). It is more straightforward to use the POSIXct format within a dataframe, as the POSIXlt format actually corresponds to a list of named elements whereas POSIXct is in vector form.
dat_meteo$Date.Heure=as.POSIXct(dat_meteo$Date.Heure,format="%Y-%m-%d %H:%M:%S")
Create a column in trips.df of times rounded to the closest hours, converting it to POSIXct too, as round converts POSIXct to POSIXlt:
trips.df$fin_r=as.POSIXct(round(trips.df$fin,"hours"))
Then use merge:
res=merge(trips.df,dat_meteo[,c(1,7)],by.x="fin_r",by.y ="Date.Heure")
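A minimal, self-contained sketch of the same round-then-merge steps follows. The toy data frames `trips` and `meteo`, their values, the `Temp` column name and the use of `all.x = TRUE` are assumptions made for illustration; they are not taken from the question's data.

```r
# Hypothetical toy data standing in for trips.df and dat_meteo
trips <- data.frame(
  fin = as.POSIXct(c("2013-03-19 18:20:00", "2013-03-19 20:40:00"), tz = "UTC")
)
meteo <- data.frame(
  Date.Heure = as.POSIXct(c("2013-03-19 18:00:00", "2013-03-19 21:00:00"), tz = "UTC"),
  Temp = c(-1.2, -2.8)
)

# round() on a POSIXct returns a POSIXlt, so convert back before merging
trips$fin_r <- as.POSIXct(round(trips$fin, "hours"))

# all.x = TRUE (an addition here) keeps trips without a matching hourly reading
res <- merge(trips, meteo, by.x = "fin_r", by.y = "Date.Heure", all.x = TRUE)
res
```

Because both key columns are POSIXct vectors, merge matches on a single column instead of on every column name the two data frames happen to share.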
Related
I have a date column in a dataframe. I have read this df into R using openxlsx. The column is 'seen' as a character vector when I use typeof(df$date).
The column contains date information in several formats and I am looking to get this into the one format.
#Example
date <- c("43469.494444444441", "12/31/2019 1:41 PM", "12/01/2019 16:00:00")
#What I want -updated
fixed <- c("2019-04-01", "2019-12-31", "2019-12-01")
I have tried many workarounds including openxlsx::convertToDate, lubridate::parse_date_time, lubridate::date_decimal
openxlsx::convertToDate so far works best but it will only take 1 format and coerce NAs for the others
update
I realized I actually had one of the above output dates wrong.
Value 43469.494444444441 should convert to 2019-04-01.
Here is one way to do this in two steps: parse the text dates with parse_date_time first, then convert the remaining Excel serial numbers separately. If you have more date formats, add them to the orders passed to parse_date_time.
temp <- lubridate::parse_date_time(date, c('mdY IMp', 'mdY HMS'))
temp[is.na(temp)] <- as.Date(as.numeric(date[is.na(temp)]), origin = "1899-12-30")
temp
#[1] "2019-01-04 11:51:59 UTC" "2019-12-31 13:41:00 UTC" "2019-12-01 16:00:00 UTC"
as.Date(temp)
#[1] "2019-01-04" "2019-12-31" "2019-12-01"
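For reference, the Excel serial conversion can be checked by hand in base R. This is only a sketch: it assumes the Windows Excel origin of 1899-12-30 used above, and the constant 25569 (the number of days from that origin to 1970-01-01) and the variable names are introduced here for illustration.

```r
serial <- 43469.494444444441   # Excel serial number: whole days plus a day fraction

# The integer part gives the calendar date
d <- as.Date(serial, origin = "1899-12-30")

# 25569 days separate 1899-12-30 from 1970-01-01; the rest converts to seconds
secs <- round((serial - 25569) * 86400)
dt <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(d)                        # "2019-01-04"
format(dt, "%Y-%m-%d %H:%M:%S")  # "2019-01-04 11:52:00"
```

Rounding to whole seconds is a design choice here; without it the day fraction lands just short of 11:52:00, which is why the output above shows 11:51:59.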
You could use a helper function to normalize the dates which might be slightly faster than lubridate.
There are weird origins in MS Excel that depend on platform. So if the data are imported from different platforms, you may want to work with dummy variables.
normDate <- Vectorize(function(x) {
  if (!is.na(suppressWarnings(as.numeric(x))))  # Win excel
    as.Date(as.numeric(x), origin="1899-12-30")
  else if (grepl("A|P", x))
    as.Date(x, format="%m/%d/%Y %I:%M %p")
  else
    as.Date(x, format="%m/%d/%Y %R")
})
For additional date formats just add another else if. Format specifications can be found with ?strptime.
Then just use as.Date() with usual origin.
res <- as.Date(normDate(date), origin="1970-01-01")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019-01-04" "2019-12-31" "2019-12-01"
class(res)
# [1] "Date"
Edit: To achieve a specific output format, use format, e.g.
format(res, "%Y-%d-%m")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019-04-01" "2019-31-12" "2019-01-12"
format(res, "%Y/%d/%m")
# 43469.494444444441 12/31/2019 1:41 PM 12/01/2019 16:00:00
# "2019/04/01" "2019/31/12" "2019/01/12"
To lookup the codes type ?strptime.
I want to convert numeric values to times without the date, for data like 1215,1423,1544,1100,0645,1324 in R.
These data have to read like 12:15,14:23,15:44.
I was trying as.POSIXct.
We can use strptime with format
format(strptime(sprintf("%04d", v1), "%H%M"), "%H:%M")
The above output is of class character. If we need a times class instead, we can use times from chron on an "HH:MM:SS" string created with sub or from the above code
library(chron)
times(sub("(.{2})(.{2})","\\1:\\2:", sprintf("%04d00", v1)))
#[1] 12:15:00 14:23:00 15:44:00 11:00:00 06:45:00 13:24:00
Or
times(format(strptime(sprintf("%04d", v1), "%H%M"), "%H:%M:%S"))
data
v1 <- c( 1215,1423,1544,1100,0645,1324)
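If only the "HH:MM" text is needed, a plain arithmetic sketch avoids date parsing altogether. This is an alternative to the strptime approach above rather than part of it, and it assumes every value is a valid HHMM number; the name `hhmm` is made up for the example.

```r
v1 <- c(1215, 1423, 1544, 1100, 645, 1324)  # 0645 is stored as 645

# Integer division gives the hours, the remainder gives the minutes
hhmm <- sprintf("%02d:%02d", v1 %/% 100, v1 %% 100)
hhmm
# [1] "12:15" "14:23" "15:44" "11:00" "06:45" "13:24"
```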
I have a time stamp column that I am converting into a POSIXct. The problem is that there are two different formats in the same column, so if I use the more common conversion the other gets converted into NA.
MC$Date
12/1/15 22:00
12/1/15 23:00
12/2/15
12/2/15 1:00
12/2/15 2:00
I use the following code to convert to a POSIXct:
MC$Date <- as.POSIXct(MC$Date, tz='MST', format = '%m/%d/%Y %H:%M')
The results:
MC$Date
15-12-01 22:00:00
15-12-01 23:00:00
NA
15-12-02 01:00:00
15-12-02 02:00:00
I have tried using a logic vector to identify the issue then correct it but can't find an easy solution.
The lubridate package was designed to deal with situations like this.
dt <- c(
"12/1/15 22:00",
"12/1/15 23:00",
"12/2/15",
"12/2/15 1:00",
"12/2/15 2:00"
)
dt
[1] "12/1/15 22:00" "12/1/15 23:00" "12/2/15" "12/2/15 1:00" "12/2/15 2:00"
lubridate::mdy_hm(dt, truncated = 2)
[1] "2015-12-01 22:00:00 UTC" "2015-12-01 23:00:00 UTC" "2015-12-02 00:00:00 UTC"
[4] "2015-12-02 01:00:00 UTC" "2015-12-02 02:00:00 UTC"
The truncated parameter indicates how many formats can be missing.
You may add the tz parameter to specify which time zone to parse the date with if UTC is not suitable.
I think the logic vector approach could work, maybe in tandem with a temporary vector for holding the parsed dates without clobbering the unparsed ones. Note that the years in your data have two digits, so the format needs %y rather than %Y. Something like this:
dates <- as.POSIXct(MC$Date, tz='MST', format = '%m/%d/%y %H:%M')
dates[is.na(dates)] <- as.POSIXct(MC$Date[is.na(dates)], tz='MST', format = '%m/%d/%y')
MC$Date <- dates
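The two-pass idea can be tried on a small stand-alone vector. This is a sketch only: the vector `x`, the helper name `unparsed` and the final format call are assumptions added for illustration.

```r
x <- c("12/1/15 22:00", "12/2/15", "12/2/15 1:00")

# First pass: date plus time; entries without a time part come back as NA
dates <- as.POSIXct(x, tz = "MST", format = "%m/%d/%y %H:%M")

# Second pass: re-parse only the failures with the date-only format
unparsed <- is.na(dates)
dates[unparsed] <- as.POSIXct(x[unparsed], tz = "MST", format = "%m/%d/%y")

format(dates, "%Y-%m-%d %H:%M")
# [1] "2015-12-01 22:00" "2015-12-02 00:00" "2015-12-02 01:00"
```

Date-only entries end up at midnight of that day, which matches what the truncated argument of lubridate::mdy_hm produces.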
Since all of your datetimes are separated with a space between date and time, you could use strsplit to extract only the date part.
extractDate <- function(x){ strsplit(x, split = " " )[[1]][1] }
MC$Date <- sapply( MC$Date, extractDate )
Then go ahead and convert any way you like, without worrying about the time part getting in the way.
I'm trying to store some intervals in a dataframe. A cut down version of the code that does this is here:
DateHired <- c("29/09/14", "07/04/08", "18/06/09", "09/03/15", "30/05/11", "05/11/07", "08/09/08", "30/09/13", "10/08/09", "13/08/14", "18/09/06", "21/01/08", "05/12/11", "28/06/10", "19/07/10", "05/05/14", "26/08/09", "21/04/08", "19/10/09")
TerminationDate <- c("11/06/10", "10/02/10", "06/10/09", "02/04/15", "30/06/11", "10/11/07", "17/04/14", "04/10/13", "08/02/12", "11/06/10", "03/07/09", "11/06/10", "08/08/13", "23/12/10", "20/12/13", "11/06/10", "11/06/10", "05/12/08", "01/03/10")
tenures = data.frame(DateHired, TerminationDate, stringsAsFactors=FALSE)
tenures$isoStart <- as.Date(tenures$DateHired, format="%d/%m/%Y")
tenures$isoFinish <- as.Date(tenures$TerminationDate, format="%d/%m/%Y")
tenures$periods = apply(tenures, 1, function(x) interval(x['isoStart'], x['isoFinish']) )
This ends up with this result:
> tenures$periods
[1] -135734400 58233600 9504000 2073600 2678400 432000 176860800 345600 78796800 -131673600 88041600 75340800
[13] 52876800 15379200 108000000 -123033600 24969600 19699200 11491200
When I do the same but manually. I.e.
> interval(as.Date("29/09/14", format="%d/%m/%Y"),as.Date("29/09/15", format="%d/%m/%Y") )
[1] 14-09-29 10:04:52 LMT--15-09-29 10:04:52 LMT
it gives a lubridate interval.
There are ways that I can probably solve this in other ways, but I was hoping to use the intervals in the next part of the puzzle!
tenures$isoStart <- as.Date(tenures$DateHired, format="%d/%m/%y")
tenures$isoFinish <- as.Date(tenures$TerminationDate, format="%d/%m/%y")
tenures$periods = interval(tenures$isoStart, tenures$isoFinish)
Your date format "%d/%m/%Y" did not reflect the two-digit years in your data. The capital %Y is for four-digit years.
Also, the interval function is vectorized, meaning it will take the first element of each vector and create an interval, then move on to the second of each, and continue to the end.
head(tenures$periods)
#[1] 2014-09-28 20:00:00 EDT--2010-06-10 20:00:00 EDT 2008-04-06 20:00:00 EDT--2010-02-09 19:00:00 EST
#[3] 2009-06-17 20:00:00 EDT--2009-10-05 20:00:00 EDT 2015-03-08 20:00:00 EDT--2015-04-01 20:00:00 EDT
#[5] 2011-05-29 20:00:00 EDT--2011-06-29 20:00:00 EDT 2007-11-04 19:00:00 EST--2007-11-09 19:00:00 EST
Why didn't your first function work? Well, it did work in a sense. The output is the span between the two dates, but the format/class was unexpected. Instead of the interval output, the number of seconds between the two dates was given.
For more on the coercion, see ?apply:
If X is not an array but an object of a class with a non-null dim
value (such as a data frame), apply attempts to coerce it to an array
via as.matrix if it is two-dimensional (e.g., a data frame) or via
as.array.
The function will work on data.frames, but with a warning that the results may not be what you expect after coercing to matrix. lapply is friendlier towards data frames and in this case, the function is already vectorised.
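The coercion is easy to see on a small example. This is a base R sketch with a made-up data frame `df`; it uses plain Date subtraction rather than lubridate's interval, so it only illustrates the class loss, not the original function.

```r
df <- data.frame(
  start  = as.Date(c("2014-09-29", "2008-04-07")),
  finish = as.Date(c("2015-09-29", "2010-02-10"))
)

# apply() calls as.matrix() first, and a matrix can hold only one type,
# so the Date columns arrive in the function as character strings
m <- as.matrix(df)
is.character(m)   # TRUE

# Vectorised column arithmetic never leaves the data frame, so classes survive
span <- df$finish - df$start
class(span)       # "difftime"
```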
I have one date variable that I formatted in the following way:
date <- as.POSIXct(date, "%m-%d-%Y-%X")
For example, this can be the most recent date:
"2014-03-04 23:59:59 EST"
Now I have a data.table DT, in which a column time indicates some other date and is also formatted with as.POSIXct (format: "%m-%d-%Y-%X"). Now I want to replace some missing values (NA) in DT[,time] with my date variable "date":
library(data.table)
DT <- DT[is.na(time), time:= date]
However, the dates that were replaced in the data.table are now "1970-01-01 14:30:24" (and not "2014-03-04 23:59:59").
What am I missing?
R: 3.02
Data.table: 1.9.2
The problem is not data.table itself; it arises because you are mixing a date-time type with another type in the same vector. This reproduces the error:
library(lubridate) ## I am using lubridate for smart date conversion
origin <- mdy_hms("01-01-1970-00:00:01") ## Using origin as default value for dates
date <- mdy_hms("3-11-2014-09:12:30")
time = c(NA,1)
ifelse(is.na(time),date,origin)
[1] 1394529150 1 ## date is converted to numeric
One solution is to convert to string first and then convert back to a datetime:
ymd_hms(ifelse(is.na(time),paste(date),paste(origin))) ## paste used as as.character
using data.table , you can get the same result :
dt = data.table(time=time,date = date)
dt[,time:=ymd_hms(ifelse(is.na(time),paste(date),
paste(origin)))]
time date
1: 2014-03-11 09:12:30 2014-03-11 09:12:30
2: 1970-01-01 00:00:01 2014-03-11 09:12:30
PS: it is better not to coerce the time variable here, and instead to handle missing values at the point where you operate on them.
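A base R sketch of the same effect, without lubridate or data.table, is shown here. The example values and the name `stripped` are assumptions for illustration; the point is that ifelse drops the POSIXct class while replacement by logical index keeps it.

```r
date <- as.POSIXct("2014-03-04 23:59:59", tz = "UTC")
time <- as.POSIXct(c(NA, "2014-01-01 12:00:00"), tz = "UTC")

# ifelse() builds its result from the logical test vector,
# so the date-time class is lost and bare seconds remain
stripped <- ifelse(is.na(time), date, time)
class(stripped)   # "numeric"

# Assigning into the existing POSIXct vector preserves class and time zone
time[is.na(time)] <- date
format(time[1], "%Y-%m-%d %H:%M:%S")   # "2014-03-04 23:59:59"
```

Indexed replacement avoids the round trip through character strings, provided both sides are already POSIXct.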