How to parse year from a date in R [duplicate]

I have a data set with 53,000 dates and I want to extract only the year from the date variable.
Does anyone know how I can do this?
My data are as follows:
OPN_DT_TM
18/07/2003 10:55
12/06/2004 6:00
9/06/2007 12:20
29/06/2001 16:00
6/06/2000 7:55
27/11/2006 10:15
17/11/2001 17:00
12/05/2004 22:00
16/04/2005 22:00
18/03/2005 8:40
13/06/2006 11:10
30/07/2006 12:00
16/07/2002 6:10
16/07/2002 7:15
3/09/2004 6:00
9/11/2004 15:20
25/08/2005 14:15
24/11/2001 19:10
15/04/2002 6:30
20/06/2002 6:30
17/03/2003 7:00
15/01/2005 13:00
23/03/2007 1:00
21/01/2001 10:30
...

This can be achieved by converting the entries into Date format and extracting the year, for instance like this:
> format(as.Date("15/01/2005 13:00", format="%d/%m/%Y %H:%M"),"%Y")
[1] "2005"
To get in-depth knowledge about dates and times in R, please see this.

Related

local daylight saving time to standard UTC time using python or R

I have a time series file containing 10 years of data that includes daylight saving time. The timestamps are naive local times for St. Louis, USA, where the clock switches between standard time and daylight saving time during the year. A sample of the time series is here:
local_time flow
11/3/12 23:30 58145400
11/4/12 0:00 58147200
11/4/12 0:30 58149000
11/4/12 1:00 58150800
11/4/12 1:30 58152600
11/4/12 1:00 58150800
11/4/12 1:30 58152600
11/4/12 2:00 58154400
11/4/12 2:30 58156200
11/4/12 3:00 58158000
11/4/12 3:30 58159800
11/4/12 4:00 58161600
11/4/12 4:30 58163400
If you look closely, after 11/4/12 1:30 58152600 the time becomes 11/4/12 1:00 again. It's a Sunday and the clock goes back 1 hour.
If there were no daylight saving time, the time series would have looked like this:
local_time flow
11/3/2012 23:30 58145400
11/4/12 0:00 58147200
11/4/12 0:30 58149000
11/4/12 1:00 58150800
11/4/12 1:30 58152600
11/4/12 2:30 58150800
11/4/12 3:00 58152600
11/4/12 3:30 58154400
11/4/12 4:00 58156200
11/4/12 4:30 58158000
11/4/12 5:30 58159800
11/4/12 6:00 58161600
11/4/12 6:30 58163400
Now, there are several instances like this in my original file. I want to convert the local data into UTC or CST where there will be no daylight saving time jump like the local time series data.
I tried this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'test_dst.xlsx', sheet_name='Sheet1', header=0)
ts_naive=df.iloc[:,0]
ts_cst = ts_naive.dt.tz_localize('America/Chicago') # 'America/Chicago' uses CDT
but it gives an error: AmbiguousTimeError: Cannot infer dst time from 2012-11-04 01:00:00, try using the 'ambiguous' argument
If I use the following it gives me wrong output:
ts_cst = ts_naive.dt.tz_localize('UTC').dt.tz_convert('America/Chicago')
because I am assigning 'UTC' time zone to a local data which is wrong.
My ultimate goal is to remove the daylight saving time jump from the time series so that I can convert it into an ever-increasing series in seconds. My model can only take time in Julian seconds, and the time series can only increase. Thanks. Here is a sample Excel file: test_dst.xlsx
There's a useful section on this in the documentation, specifically the ambiguous="infer" argument:
df.local_time = pd.to_datetime(df.local_time)
df.local_time = df.local_time.dt.tz_localize('America/Chicago', ambiguous='infer')
print(df.local_time)
print(df.local_time.dt.tz_convert("UTC"))
Output:
0 2012-11-03 23:30:00-05:00
1 2012-11-04 00:00:00-05:00
2 2012-11-04 00:30:00-05:00
3 2012-11-04 01:00:00-05:00
4 2012-11-04 01:30:00-05:00
5 2012-11-04 01:00:00-06:00
6 2012-11-04 01:30:00-06:00
7 2012-11-04 02:00:00-06:00
8 2012-11-04 02:30:00-06:00
9 2012-11-04 03:00:00-06:00
10 2012-11-04 03:30:00-06:00
11 2012-11-04 04:00:00-06:00
12 2012-11-04 04:30:00-06:00
Name: local_time, dtype: datetime64[ns, America/Chicago]
0 2012-11-04 04:30:00+00:00
1 2012-11-04 05:00:00+00:00
2 2012-11-04 05:30:00+00:00
3 2012-11-04 06:00:00+00:00
4 2012-11-04 06:30:00+00:00
5 2012-11-04 07:00:00+00:00
6 2012-11-04 07:30:00+00:00
7 2012-11-04 08:00:00+00:00
8 2012-11-04 08:30:00+00:00
9 2012-11-04 09:00:00+00:00
10 2012-11-04 09:30:00+00:00
11 2012-11-04 10:00:00+00:00
12 2012-11-04 10:30:00+00:00
Name: local_time, dtype: datetime64[ns, UTC]
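As a follow-up to the asker's stated goal (an ever-increasing series in seconds), here is a minimal standard-library sketch that does not need pandas. It assumes the rows are in chronological order, and it resolves a repeated hour by picking the later occurrence (fold=1) whenever time would otherwise run backwards. The helper name to_epoch_seconds and the format string are my own choices for illustration and are not part of the original answer.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Sample local timestamps from the question (naive St. Louis local time).
local_times = [
    "11/3/12 23:30", "11/4/12 0:00", "11/4/12 0:30", "11/4/12 1:00",
    "11/4/12 1:30", "11/4/12 1:00", "11/4/12 1:30", "11/4/12 2:00",
    "11/4/12 2:30", "11/4/12 3:00", "11/4/12 3:30", "11/4/12 4:00",
    "11/4/12 4:30",
]


def to_epoch_seconds(stamps, tz_name="America/Chicago", fmt="%m/%d/%y %H:%M"):
    """Convert chronologically ordered naive local stamps to UTC epoch seconds.

    Hypothetical helper: assumes the input order is the true time order.
    """
    tz = ZoneInfo(tz_name)
    result = []
    for stamp in stamps:
        local = datetime.strptime(stamp, fmt).replace(tzinfo=tz)
        # fold=0 (the default) means the first occurrence of a repeated hour.
        seconds = local.timestamp()
        # If time would run backwards, this must be the second occurrence.
        if result and seconds <= result[-1]:
            seconds = local.replace(fold=1).timestamp()
        result.append(int(seconds))
    return result


epoch = to_epoch_seconds(local_times)
# Seconds elapsed since the first record, which the model could consume.
elapsed = [s - epoch[0] for s in epoch]
print(elapsed)
```

For the sample above, elapsed steps by 1800 seconds per row with no backward jump. With pandas, the same numbers should be obtainable from the UTC-converted column in the answer.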

How to know if an as.POSIXct date time is AM/PM in R?

I have a column with date and time in the as.POSIXct format, e.g. "2019-02-23 12:45". I want to identify whether the time is AM or PM and add AM or PM to the date and time.
The following code creates an example dataset for illustration:
ID <- data.frame(c(1,2,3,4))
DATE <- data.frame(as.POSIXct(c("2019-02-25 07:30", "2019-03-25 14:30", "2019-03-25 12:00", "2019-03-25 00:00"),format="%Y-%m-%d %H:%M"))
DATEAMPM <- data.frame(c("2019-02-25 07:30 AM", "2019-03-25 14:30 PM", "2019-03-25 12:00 PM", "2019-03-25 00:00 AM"))
AMPMFLAG <- data.frame(c(0,1,1,0))
test <- cbind(ID,DATE,DATEAMPM,AMPMFLAG)
names(test) <- c("PID","DATE","DATEAMPM","AMPMFLAG")
I would like to create the DATEAMPM and AMPMFLAG columns as represented in the code above.
I have seen character strings of the form "2019-09-23 08:45 PM" converted to "2019-09-23 20:45" by specifying the format argument as below, but I do not know how to go the other way around and incorporate AM/PM into the date time.
as.POSIXct(strptime(,format="%Y-%m-%d %I:%M %p"))
Appreciate your help
We can use format to get the data with AM/PM:
test$DATEAMPM <- format(test$DATE, "%Y-%m-%d %I:%M %p")
test$AMPMFLAG <- +(grepl("PM", test$DATEAMPM))
test
#   PID                DATE            DATEAMPM AMPMFLAG
# 1   1 2019-02-25 07:30:00 2019-02-25 07:30 AM        0
# 2   2 2019-03-25 14:30:00 2019-03-25 02:30 PM        1
# 3   3 2019-03-25 12:00:00 2019-03-25 12:00 PM        1
# 4   4 2019-03-25 00:00:00 2019-03-25 12:00 AM        0
Also note that when you convert 14:30:00 to AM/PM format it becomes 02:30 PM, not 14:30 PM.

How to add hourly rows on a daily sequence dataframe?

If I have daily data in the following format:
A:
DD-MM-YYYY
01-01-2000
02-01-2000
03-01-2000
04-01-2000
...
31-12-2010
31-12-2010
31-12-2010
31-12-2010
how can I add hourly values to all the days and obtain a new A like:
A:
DD-MM-YYYY hour
01-01-2000 00:00
01-01-2000 01:00
01-01-2000 02:00
01-01-2000 03:00
...
01-01-2000 21:00
01-01-2000 22:00
01-01-2000 23:00
...
...
31-12-2010 21:00
31-12-2010 22:00
31-12-2010 23:00
This will stick 00:00 to 23:00 on to each of your days (the rows come out ordered by hour first, so sort by day afterwards if the order matters):
expand.grid(day = A$`DD-MM-YYYY`, hour = sprintf("%02d:00", 0:23))
However, in the real world you might prefer to use seq.POSIXt, which will account for leap years, daylight saving time, etc.

Listing pairwise overlaps of Date time elements in R

I have a list of Lectures for a university course stored in a data-frame. This is a large complex table with over 1000 rows. I have used simple time in the example, but this is actually date time in the format %d %b %Y %H:%M. I think I should be able to extrapolate to the more complex usage.
essentially:
ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist
I am trying to create an output data frame which determines which modules clash in the timetable and at which times. For example:
ModuleCode1 StartTime EndTime ModuleCode2 StartTime EndTime
11A 10:30 12:30 15B 09:40 10:40
11A 10:30 12:30 13C 11:45 12:45
13A 14:30 15:30 17B 14:00 15:00
There are a multitude of questions on date time overlaps, but the ones that I can find seem to either work with 2 dataframes, or I can't understand them. I have come across the lubridate and IRanges packages, but cannot work out this specific implementation with date time in a single data frame. It seems as though something which would be generally useful, and most likely would have a simple implementation I am missing. Grateful for any help.
Here is an sqldf solution. The intervals do NOT overlap iff a.StartTime > b.EndTime or a.EndTime < b.StartTime so they do overlap exactly when the negation of this statement is true, hence:
library(sqldf)
sqldf("select a.ModuleCode1, a.StartTime, a.EndTime, b.ModuleCode1, b.StartTime, b.EndTime
from DF a join DF b on a.ModuleCode1 < b.ModuleCode1 and
a.StartTime <= b.EndTime and
a.EndTime >= b.StartTime")
giving:
ModuleCode1 StartTime EndTime ModuleCode1 StartTime EndTime
1 11A 10:30 12:30 13C 11:45 12:45
2 11A 10:30 12:30 15B 09:40 10:40
3 13A 14:30 15:30 17B 14:00 15:00
Note: The input in reproducible form is:
Lines <- "ModuleCode1 ModuleName Lecturer StartTime EndTime Course
11A Hist1 Bob 10:30 12:30 Hist
13A Hist2 Bob 14:30 15:30 Hist
13C Hist3 Steve 11:45 12:45 Hist
15B Hist4 Bob 09:40 10:40 Hist
17B Hist5 Bob 14:00 15:00 Hist"
DF <- read.table(text = Lines, header = TRUE)

data handling outlier with conditional in R

I have 2 data frames (data by hour and data by day).
I want to flag hourly points as outliers with a condition: an hourly PH value is OK if it falls within the range from standard1 to standard2 for that day; otherwise it is an outlier.
Example:
PH at 11-09-13 10:00 (hourly) = 49.14068
compare with the range for 11-09-13: 20 to 40
and 49.14068 > 40 => outlier
I want to run this comparison automatically in R.
I searched for this question but found no results.
So, please help me!
Data by Hour
DateTime PH
11-09-13 10:00 49.14068
11-09-13 11:00 52.53494167
11-09-13 12:00 24.8525
11-09-13 13:00 8.56055
11-09-13 14:00 23.77944167
11-09-13 15:00 25.13243333
11-09-13 16:00 35.2913
11-09-13 17:00 20.58211667
11-09-13 18:00 18.605975
11-09-13 19:00 59.16179167
11-09-13 20:00 72.06908333
11-09-13 21:00 43.47536667
11-09-13 22:00 44.73696667
11-09-13 23:00 38.7266
12-09-13 0:00 41.12040833
12-09-13 1:00 33.67845833
12-09-13 2:00 38.49083333
12-09-13 3:00 46.20168333
12-09-13 4:00 40.03630833
12-09-13 5:00 41.10841667
12-09-13 6:00 43.753475
12-09-13 7:00 45.077675
12-09-13 8:00 57.53141667
12-09-13 9:00 45.17694167
12-09-13 10:00 41.106525
12-09-13 11:00 30.08048333
12-09-13 12:00 24.70255833
12-09-13 13:00 15.60813333
12-09-13 14:00 14.09729167
........ n day(24h/day)
Data by Day (aggregated from Data by Hour)
DateTime standard1 standard2
11-09-13 20 40
12-09-13 12 50
13-09-13 16 30
....... n day
