Parsing dates in multiple formats in R using lubridate

I have data with dates in MM/DD/YY HH:MM format and others in plain old MM/DD/YY format. I want to parse all of them into the same format as "2010-12-01 12:12 EST." How should I go about doing that? I tried the following ifelse statement and it gave me a bunch of long integers and told me a large number of my data points failed to parse:
df_prime$date <- ifelse(!is.na(mdy_hm(df$date)), mdy_hm(df$date), mdy(df$date))
df_prime is a duplicate of the data frame df that I initially loaded in.
IEN date admission_number KEY_PTF_45 admission_from discharge_to
1 12 3/3/07 18:05 1 252186 OTHER DIRECT
2 12 3/9/07 12:10 1 252186 RETURN TO COMMUNITY- INDEPENDENT
3 12 3/10/07 15:08 2 252382 OUTPATIENT TREATMENT
4 12 3/14/07 10:26 2 252382 RETURN TO COMMUNITY-INDEPENDENT
5 12 4/24/07 19:45 3 254343 OTHER DIRECT
6 12 4/28/07 11:45 3 254343 RETURN TO COMMUNITY-INDEPENDENT
...
1046334 23613488506 2/25/14 NA NA
1046335 23613488506 2/25/14 11:27 NA NA
1046336 23613488506 2/28/14 NA NA
1046337 23613488506 3/4/14 NA NA
1046338 23613488506 3/10/14 11:30 NA NA
1046339 23613488506 3/10/14 12:32 NA NA
Sorry if some of the formatting isn't right, but the date column is the most important one.
EDIT: Below is some code for a portion of my data frame via a dput command:
structure(list(IEN = c(23613488506, 23613488506, 23613488506, 23613488506, 23613488506, 23613488506), date = c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14", "3/10/14 11:30", "3/10/14 12:32")), .Names = c("IEN", "date"), row.names = 1046334:1046339, class = "data.frame")

Have you tried the function guess_formats() in the lubridate package?
A reproducible example to build a dataframe like yours could be helpful!
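For reference, a minimal sketch of what that could look like on the sample dates above (the candidate orders strings are assumptions based on the two formats described; parse_date_time() is the companion lubridate function that parses a vector against several candidate formats in one call):
library(lubridate)
x <- c("2/25/14", "2/25/14 11:27", "3/10/14 11:30")
# report which of the candidate orders each string matches
guess_formats(x, orders = c("mdy", "mdy HM"))
# parse everything in one pass, trying the fuller format first
parse_date_time(x, orders = c("mdy HM", "mdy"))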

The lubridate package's mdy_hm has a truncated parameter that lets you supply dates that might not have all the bits. For your example:
> mdy_hm(d$date,truncated=2)
[1] "2014-02-25 00:00:00 UTC" "2014-02-25 11:27:00 UTC"
[3] "2014-02-28 00:00:00 UTC" "2014-03-04 00:00:00 UTC"
[5] "2014-03-10 11:30:00 UTC" "2014-03-10 12:32:00 UTC"

Related

R Error in match.fun(FUN) : object 'Hour' not found after replacement inside column 'Hour'

I have the data frame (df) below, from ENTSO-E, showing German power prices. I created the "Hour" column with the lubridate function hour(df$date); the output was the range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I need to work on an hourly basis, so I filtered each hour from 1 to 24, but I cannot filter the replaced hour, H24:
H1 <- df %>%
filter(Hour==1)
H24 <- df %>%
filter(Hour==24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column, and its class is numeric, but I cannot do any calculations with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement. Is there any other way to produce a result that works for H24?
date                    price   Hour
2019-01-01 01:00:00     28.32      1
2019-01-01 02:00:00     10.07      2
2019-01-01 03:00:00     -4.08      3
2019-01-01 04:00:00     -9.91      4
2019-01-01 05:00:00     -7.41      5
2019-01-01 06:00:00    -12.55      6
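This question appears here only as a related one, but the symptom is a common one: assigning the character string "24" into a numeric column coerces the whole column to character, after which comparisons and mean() no longer behave as expected. A sketch of a replacement that keeps Hour numeric, assuming the df and date column from the question:
library(lubridate)
df$Hour <- hour(df$date)
df$Hour[df$Hour == 0] <- 24          # numeric replacement, no coercion to character
H24 <- dplyr::filter(df, Hour == 24) # explicit namespace avoids stats::filter() masking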

Using logicals to change date format in R

New R user here - I have many .csv files containing time stamps (date_time) in one column and temperature readings in the other. I am trying to write a function that detects the date_time format, and then changes it to a different format. Because of the way the data was collected, the date/time format is different in some of the .csv files. I want the function to change the date_time for all files to the same format.
Date_time format I want: %m/%d/%y %H:%M:%S
Date_time format I want changed to the above: %y-%m-%d %H:%M:%S
> head(file1data)
x date_time temperature coupler_d coupler_a host_con stopped EoF
1 1 18-07-10 09:00:00 41.137 Logged
2 2 18-07-10 09:15:00 41.322
3 3 18-07-10 09:30:00 41.554
4 4 18-07-10 09:45:00 41.832
5 5 18-07-10 10:00:00 42.156
6 6 18-07-10 10:15:00 42.755
> head(file2data)
x date_time temperature coupler_d coupler_a host_con stopped EoF
1 1 07/10/18 01:00:00 PM 8.070 Logged
2 2 07/10/18 01:15:00 PM 8.095
3 3 07/10/18 01:30:00 PM 8.120
4 4 07/10/18 01:45:00 PM 8.120
5 5 07/10/18 02:00:00 PM 8.020
6 6 07/10/18 02:15:00 PM 7.795
file2data is in the correct format. file1data is incorrect.
I have tried using logicals to detect and replace the date format, e.g.:
file1data %>%
if(str_match_all(date_time,"([0-9][0-9]{2})[-.])")){
format(as.POSIXct(date_time,format="%y-%m-%d %H:%M:%S"),"%m/%d/%y %H:%M:%S")
}else{format(date_time,"%m/%d/%y %H:%M:%S")}
but this has not worked, I get the following errors:
Error in if (.) str_match_all(date_time, "([0-9][0-9]{2})[-.])") else { :
argument is not interpretable as logical
In addition: Warning message:
In if (.) str_match_all(date_time, "([0-9][0-9]{2})[-.])") else { :
the condition has length > 1 and only the first element will be used
Any ideas?
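The two error messages point at the core problem: a data frame piped into if() is not interpretable as a logical, and if() is not vectorized in any case (hence "the condition has length > 1"). A vectorized test applied only to the rows that need re-parsing is one way around this; a minimal base R sketch, assuming only the two layouts shown above occur:
# the column may be a factor when read with read.csv(); make it character first
file1data$date_time <- as.character(file1data$date_time)
# rows that look like "yy-mm-dd ..." need re-parsing
is_iso <- grepl("^\\d{2}-\\d{2}-\\d{2}", file1data$date_time)
file1data$date_time[is_iso] <- format(
  as.POSIXct(file1data$date_time[is_iso], format = "%y-%m-%d %H:%M:%S"),
  "%m/%d/%y %H:%M:%S"
)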

Convert sub-hourly data to hourly and round up time in R

I have a very big dataframe in R, containing weather data with the following format.
valid temp
1 17/08/2014 00:20 14
2 17/08/2014 00:50 14
3 17/08/2014 01:20 13.5
4 17/08/2014 01:50 13
5 17/08/2014 02:20 12
6 17/08/2014 02:50 10
I would like to convert these sub-hourly data to hourly, like the following.
valid tmpc
1 2014-08-17 00:00:00 14
2 2014-08-17 01:00:00 13.75
3 2014-08-17 02:00:00 12.5
The class of df$valid is 'factor'. I tried first converting it to Date through POSIXct, but that gives only NA values. I have also tried changing the system locale, and I still get NAs.
We can do this in base R by converting to POSIXlt, setting the minutes to 0, converting back to POSIXct, and aggregating to get the mean of 'temp':
# parse the factor/character timestamps into POSIXlt
df1$valid <- strptime(df1$valid, "%d/%m/%Y %H:%M")
# zero out the minutes so every reading falls on the hour it started in
df1$valid$min <- 0
# back to POSIXct, then average 'temp' within each hour
df1$valid <- as.POSIXct(df1$valid)
aggregate(temp ~ valid, df1, FUN = mean)
Option 1: the lubridate solution, using ceiling_date() or round_date(). It's not clear from your data frame and expected results whether you want to round or to take the ceiling; for instance, the first row looks rounded while the third looks like a ceiling. Anyway, here is an example:
library(lubridate)
df <- data.frame(i = 1, valid= "17/08/2014 01:28", temp = 14)
df$valid <- dmy_hm(df$valid)
df$valid_round <- ceiling_date(df$valid , unit="hours")
Results:
i valid temp valid_round
1 1 2014-08-17 01:28:00 14 2014-08-17 02:00:00
Option 2: using the base functions. Use:
df$valid <- as.POSIXct(strptime(df$valid, "%d/%m/%Y %H:%M", tz ="UTC"))
and then round it.
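For the rounding step, base R also has a round() method for date-times; a short sketch, assuming df$valid has already been parsed to POSIXct by either option above (note that round() on POSIXct returns POSIXlt, hence the as.POSIXct() wrapper):
df$valid_round <- as.POSIXct(round(df$valid, "hours"))
# hourly means, as in the question's expected output:
aggregate(temp ~ valid_round, df, FUN = mean)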

Associate numbers to datetime/timestamp

I have a dataframe df with a certain number of columns. One of them, ts, is timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370.
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see, my data frame contains data across several days. Let's say there are 3. I would like to add a column containing the numbers 1, 2, or 3: 1 if the row belongs to the first day, 2 for the second day, etc.
Thank you very much in advance,
Clement
One way to do this is to keep track of total days elapsed each time the date changes, as demonstrated below.
# Fake data
library(lubridate)  # for tz() and the tz()<- replacement used below
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
                                  as.POSIXct("2016-05-05 01:03:11"), length.out = 6),
                              seq(as.POSIXct("2016-05-09 01:09:11"),
                                  as.POSIXct("2016-05-16 02:03:11"), length.out = 4)))
tz(dat$datetime) = "UTC"
Note, if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
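For instance, a sketch for the ts column from the question, assuming (as the magnitudes suggest) that the values are milliseconds since the Unix epoch and that "Europe/Paris" is an appropriate zone for the CEST times shown:
df$datetime <- as.POSIXct(df$ts / 1000, origin = "1970-01-01", tz = "Europe/Paris")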
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
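If consecutive day numbers are wanted even when calendar days are skipped (a dense 1, 2, 3, ... index rather than days elapsed), one alternative sketch, assuming the rows are ordered by datetime:
days <- as.Date(dat$datetime, tz = "UTC")
dat$day_dense <- match(days, unique(days))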
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05

Subsetting a data frame according to a factor date

I have a data frame (df) where one of the columns is a date column. However, that column's type is factor:
> head(df$date)
[1] 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01
1519 Levels: 2010-11-27 2010-11-28 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-04 ... 2015-02-07
I want to subset this data frame by date. For example, I want to create a second data frame (df2) that is a subset of df containing only the rows with dates earlier than 2014-03-30.
How can I do that using R? I would be very glad for any help. Thanks a lot.
You could begin exploring the lubridate library. It makes working with dates very simple.
df <- data.frame(date = c("2013-01-01", "2014-04-01", "2014-01-01",
"2011-06-01", "2012-03-01", "2014-08-01"))
df
date
1 2013-01-01
2 2014-04-01
3 2014-01-01
4 2011-06-01
5 2012-03-01
6 2014-08-01
library(lubridate)
# ymd - year-month-day
df$date <- ymd(df$date)
with(df, df[date < ymd("2014-03-30"),])
[1] "2013-01-01 UTC" "2014-01-01 UTC" "2011-06-01 UTC" "2012-03-01 UTC"
