I want to create a new column that indicates which dates fall in the same week.
A data.table DATA_SET contains Date information, like:
DATA_SET <- data.table(transday = seq(from = (Sys.Date()-64), to = Sys.Date(), by = 1))
For example, '2017-03-01' and '2017-03-02' are in the same week; '2017-03-01' and '2017-03-08' are both Wednesdays, but they are not in the same week.
If "2016-01-01" is in the first week of 2016 and "2017-01-01" is in the first week of 2017, both would get the value 1, yet they are not in the same week. So I want a value that uniquely identifies "the same week" across years.
The answer to this question depends strongly on
the definition of the first day of the week (usually Sunday or Monday) and
the numbering of the weeks within the year (starting with the first Sunday, Monday, or Thursday of the year, or on 1st January, etc).
A selection of different options can be seen from the example below:
dates isoweek day week_iso week_US week_UK DT_week DT_iso lub_week lub_iso cut.Date
2015-12-25 2015-W52 5 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-26 2015-W52 6 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-27 2015-W52 7 2015-W52 52 51 52 52 52 52 2015-12-21
2015-12-28 2015-W53 1 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-29 2015-W53 2 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-30 2015-W53 3 2015-W53 52 52 53 53 52 53 2015-12-28
2015-12-31 2015-W53 4 2015-W53 52 52 53 53 53 53 2015-12-28
2016-01-01 2015-W53 5 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-02 2015-W53 6 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-03 2015-W53 7 2015-W53 01 00 1 53 1 53 2015-12-28
2016-01-04 2016-W01 1 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-05 2016-W01 2 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-06 2016-W01 3 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-07 2016-W01 4 2016-W01 01 01 2 1 1 1 2016-01-04
2016-01-08 2016-W01 5 2016-W01 01 01 2 1 2 1 2016-01-04
which is created by this code:
library(data.table)
dates <- as.Date("2016-01-01") + (-7:7)
print(data.table(
  dates,
  isoweek  = ISOweek::ISOweek(dates),
  day      = ISOweek::ISOweekday(dates),
  week_iso = format(dates, "%G-W%V"),
  week_US  = format(dates, "%U"),
  week_UK  = format(dates, "%W"),
  DT_week  = data.table::week(dates),
  DT_iso   = data.table::isoweek(dates),
  lub_week = lubridate::week(dates),
  lub_iso  = lubridate::isoweek(dates),
  cut.Date = cut.Date(dates, "week")
), row.names = FALSE)
The format YYYY-Www used in some of the columns is one of the ISO 8601 week formats. It includes the year, which is required to distinguish weeks in different years, as requested by the OP.
The ISO week definition is the only one that guarantees every week consists of 7 days, even across New Year. The other definitions may start or end the year with "weeks" of fewer than 7 days. Because of this seamless partitioning of the year, the ISO week-numbering year differs slightly from the traditional Gregorian calendar year; e.g., 2016-01-01 belongs to the last ISO week of 2015 (2015-W53).
As suggested here, cut.Date() might be the best option for the OP.
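Applied to the OP's data, a minimal sketch of the cut.Date() approach could look like this (week_start is a column name I chose; each date is labelled with the Monday that begins its week, so equal values mean "same week", also across year boundaries):
library(data.table)
DATA_SET <- data.table(transday = seq(from = Sys.Date() - 64, to = Sys.Date(), by = 1))
DATA_SET[, week_start := as.Date(cut.Date(transday, "week"))]  # Monday of each week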
Disclosure: I'm the maintainer of the ISOweek package, which was published at a time when strptime() did not recognize the %G and %V format specifications for output in the Windows versions of R. (Even today they are not recognized on input.)
You can use the week() function of the lubridate package in R.
library(lubridate)
DATA_SET$week <- week(DATA_SET$transday)
This will give you a new column, week. Dates within the same week will have the same week number.
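Building on that, if the value also needs to be unique across years (as the question asks), one option is to prefix the week number with the year. A minimal sketch (week_id is a name I made up):
library(lubridate)
# combines year and week number so values do not repeat across years
DATA_SET$week_id <- paste0(year(DATA_SET$transday), "-W",
                           sprintf("%02d", week(DATA_SET$transday)))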
Related
There are lots of questions about daylight savings conversion and POSIXct/POSIXlt, date/time classes, etc., but none that I have found addresses the approach I want to take to daylight savings.
I am interested in analyzing daily load curves for energy use, and approaches that simply cut the spring hour out of the dataset do not work for me. I need an approach that shifts all measurements to the subsequent hour after the spring daylight savings change and to the prior hour after the fall adjustment. See below for a clear example.
EnergyUse <- data.table("Date"= c("1997-04-06", "1997-04-06", "1997-04-06", "1997-04-06"), "Hour"= 1:4, "Power"=c(30,40,60,80))
print(EnergyUse)
# Date Hour Power
#1: 1997-04-06 1 30
#2: 1997-04-06 2 40 #when daylight savings kicked in for 1997
#3: 1997-04-06 3 60
#4: 1997-04-06 4 80
The "Hour" field ranges from 0 to 23 for every day of the year, i.e. "local standard time". It happens to be Pacific Time, as you will see below, but I would have the same question for any time zone that implemented daylight savings.
Now I need to merge date and time field into single date_time field formatted as date and time and incorporating daylight savings, as I am interested in the hourly power patterns (i.e. load curves), which shift both based on relative time (e.g. when people go to/get off work) and absolute time (e.g. when it gets cold/hot or when sun sets).
EnergyUseAdj <- EnergyUse[, Date_Time := as.POSIXct(paste(Date, Hour), format="%Y-%m-%d %H", tz="America/Los_Angeles")]
which results in:
print(EnergyUseAdj)
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 40 <NA>
#3: 1997-04-06 3 60 1997-04-06 03:00:00
#4: 1997-04-06 4 80 1997-04-06 04:00:00
This, however, makes the "Power" data for the new daylight savings 3am and 4am incorrect. The actual power production figure for the daylight adjusted 3am would instead be that which was listed for 2am standard time (i.e. 40), and that for 4am would then be 60.
The correct way to adjust for this, albeit likely more computationally expensive for large datasets, would be to shift the entire time series by a positive offset of 1 hour in spring and a negative offset of 1 hour in fall, as below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 <NA> <NA>
#3: 1997-04-06 3 40 1997-04-06 03:00:00
#4: 1997-04-06 4 60 1997-04-06 04:00:00
Or, perhaps smoother for use in other algorithms because there are no NA rows, as below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 3 40 1997-04-06 03:00:00
#3: 1997-04-06 4 60 1997-04-06 04:00:00
#4: 1997-04-06 5 80 1997-04-06 05:00:00
After toying around with POSIXct and reading through a bunch of similar questions on this adjustment, I could not find a great solution. Any ideas?
EDIT: Per GregorThomas' request, see below for a larger sample of data in case you wish to use two days' worth.
# OP_DATE OP_HOUR Power
# 1: 1997-04-05 0 71
# 2: 1997-04-05 1 61
# 3: 1997-04-05 2 54
# 4: 1997-04-05 3 57
# 5: 1997-04-05 4 68
# 6: 1997-04-05 5 76
# 7: 1997-04-05 6 89
# 8: 1997-04-05 7 106
# 9: 1997-04-05 8 148
#10: 1997-04-05 9 154
#11: 1997-04-05 10 143
#12: 1997-04-05 11 123
#13: 1997-04-05 12 105
#14: 1997-04-05 13 94
#15: 1997-04-05 14 85
#16: 1997-04-05 15 86
#17: 1997-04-05 16 84
#18: 1997-04-05 17 83
#19: 1997-04-05 18 99
#20: 1997-04-05 19 105
#21: 1997-04-05 20 94
#22: 1997-04-05 21 95
#23: 1997-04-05 22 81
#24: 1997-04-05 23 66
#25: 1997-04-06 0 75
#26: 1997-04-06 1 70
#27: 1997-04-06 2 62
#28: 1997-04-06 3 56
#29: 1997-04-06 4 55
#30: 1997-04-06 5 57
#31: 1997-04-06 6 51
#32: 1997-04-06 7 57
#33: 1997-04-06 8 59
#34: 1997-04-06 9 61
#35: 1997-04-06 10 64
#36: 1997-04-06 11 63
#37: 1997-04-06 12 63
#38: 1997-04-06 13 63
#39: 1997-04-06 14 60
#40: 1997-04-06 15 68
#41: 1997-04-06 16 69
#42: 1997-04-06 17 69
#43: 1997-04-06 18 91
#44: 1997-04-06 19 120
#45: 1997-04-06 20 100
#46: 1997-04-06 21 74
#47: 1997-04-06 22 56
#48: 1997-04-06 23 55
If your data is reliably every hour, you can calculate a sequence of hours of the appropriate length. The implementation of POSIX datetimes accounts for daylight savings time, leap years, etc.
Simplifying the method in my comment, I'd recommend calculating the sequence based on the start time and the length.
EnergyUseAdj <- EnergyUse[,
  Date_Time := seq(
    from = as.POSIXct(paste(Date[1], Hour[1]), format = "%Y-%m-%d %H", tz = "America/Los_Angeles"),
    length.out = .N,
    by = "1 hour"
  )]
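To see why this gives the desired shift: a sequence built from elapsed hours simply has no 2 AM on the spring-forward day, so the readings that followed the change line up with 3 AM, 4 AM, and so on. A small sketch (the commented labels are what I would expect in the America/Los_Angeles zone):
seq(from = as.POSIXct("1997-04-06 00:00", tz = "America/Los_Angeles"),
    length.out = 4, by = "1 hour")
# "1997-04-06 00:00:00 PST" "1997-04-06 01:00:00 PST"
# "1997-04-06 03:00:00 PDT" "1997-04-06 04:00:00 PDT"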
I successfully split the timestamp 2018-12-31 11:45:00 AM into 2018-12-31 and 11:45:00 AM.
However, I am having difficulty converting "11:45:00 AM" to 24-hour time.
I know there are several ways to do that; for example, the most popular way is to use strptime with format="%I:%M:%S %p". I did that several times and double-checked again and again... but still get NA in my column. Here, crimeData is my dataset name, and toSplitHrs contains the time, which is "11:45:00 AM" as mentioned:
crimeData$toSplitHrs = strptime(crimeData$SplitHrs, format="%I:%M:%S %p")
Police.Beats SplitMs SplitHrs year month days hours mins sec toSplitHrs
1 28 2018-12-31 11:45:00 2018 12 31 11 45 00 <NA>
2 177 2018-12-31 11:42:00 2018 12 31 11 42 00 <NA>
3 233 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
4 91 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
5 73 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
6 232 2018-12-31 11:27:00 2018 12 31 11 27 00 <NA>
but I still get NA results from that...
Also, this dataset contains over 10k observations, so I really cannot change them one by one... any suggestions are appreciated!
You can try the format %r for the time, taking into account the am/pm specification (see ?strptime):
strptime("2018-12-31 11:45:00 am", format="%F %r")
#[1] "2018-12-31 11:45:00 CET"
strptime("2018-12-31 11:45:00 pm", format="%F %r")
#[1] "2018-12-31 23:45:00 CET"
This question already has answers here:
Plot separate years on a common day-month scale
(3 answers)
Closed 6 years ago.
I'm trying to drop the year from a multiyear data frame and plot day-month on the x axis, with geom_smooth() calculated separately for different years.
My data structure initially looks like this:
> str(pmWaw)
'data.frame': 52488 obs. of 5 variables:
$ date : POSIXct, format: "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 01:00:00" ...
$ stacja: Factor w/ 273 levels "DsWrocKorzA",..: 26 27 129 26 27 129 26 27 129 26 ...
$ pm25 : num 100 63 NA 69 36 NA 41 31 NA 37 ...
$ pm10 : num 122 68 79 77 38 90 43 32 39 38 ...
$ season: Ord.factor w/ 4 levels "spring (MAM)"<..: 4 4 4 4 4 4 4 4 4 4 ...
Using lubridate I added year and month as separate variables:
library(lubridate)
pmWaw$year<- year(pmWaw$date)
pmWaw$month<- month(pmWaw$date)
Next, using code found here on Stack Overflow, I calculated a month-day variable in %m-%d format:
pmWaw$month.day<-format(pmWaw$date, format="%m-%d")
#check new variable type:
> typeof(pmWaw$month.day)
[1] "character"
Eventually data frame I work with is this:
> head(pmWaw)
date stacja pm25 pm10 season year month month.day
1 2014-01-01 00:00:00 MzWarNiepodKom 100 122 winter (DJF) 2014 1 01-01
2 2014-01-01 00:00:00 MzWarszUrsynow 63 68 winter (DJF) 2014 1 01-01
3 2014-01-01 00:00:00 MzWarTarKondra NA 79 winter (DJF) 2014 1 01-01
4 2014-01-01 01:00:00 MzWarNiepodKom 69 77 winter (DJF) 2014 1 01-01
5 2014-01-01 01:00:00 MzWarszUrsynow 36 38 winter (DJF) 2014 1 01-01
6 2014-01-01 01:00:00 MzWarTarKondra NA 90 winter (DJF) 2014 1 01-01
> tail(pmWaw)
date stacja pm25 pm10 season year month month.day
52483 2015-12-30 22:00:00 MzWarAlNiepo 36 47 winter (DJF) 2015 12 12-30
52484 2015-12-30 22:00:00 MzWarKondrat 26 29 winter (DJF) 2015 12 12-30
52485 2015-12-30 22:00:00 MzWarWokalna 36 44 winter (DJF) 2015 12 12-30
52486 2015-12-30 23:00:00 MzWarAlNiepo 39 59 winter (DJF) 2015 12 12-30
52487 2015-12-30 23:00:00 MzWarKondrat 36 39 winter (DJF) 2015 12 12-30
52488 2015-12-30 23:00:00 MzWarWokalna 40 49 winter (DJF) 2015 12 12-30
Passing new values to ggplot gives me three issues:
ggplot(pmWaw, aes(x=month.day, y=pm25)) +
geom_jitter(alpha=0.5) +
geom_smooth()
First (minor) problem: month.day is a character variable and ggplot won't recognize its original time-series nature. This I can probably overcome by manually setting scale labels to months.
Second (major) problem: geom_smooth() is not calculated at all and I can't figure out why.
Third (major) problem: I can't work out how to add year as a grouping variable to get two separate smoothed lines (mostly because geom_smooth is not there at all).
My guess is that the source of all the problems lies in the way I extracted the month-day format and ended up with a character-class variable.
Could anyone help me fix it? Any hints appreciated.
Looks like I found a solution to work with:
ggplot(pmWaw, aes(x=month.day, y=pm25, group = year)) +
geom_point(alpha=0.5) +
geom_smooth(aes(color=factor(year)))
This solves issues 2 and 3: geom_smooth is there and I can distinguish years. Probably not the best solution, but it might be a good place to start.
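For the first (axis) problem, one commonly suggested approach is to map every observation onto a common dummy year so the x axis becomes a real Date scale. A minimal sketch, reusing the columns above (plot.date is a name I made up, and 2000 is an arbitrary dummy year):
library(ggplot2)
pmWaw$plot.date <- as.Date(format(pmWaw$date, "2000-%m-%d"))  # keeps only month-day
ggplot(pmWaw, aes(x = plot.date, y = pm25, group = year)) +
  geom_point(alpha = 0.5) +
  geom_smooth(aes(color = factor(year))) +
  scale_x_date(date_labels = "%b")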
The last day of 2017 (2017-12-31) falls on a Sunday, meaning the last week of the year contains only 1 day if I consider Sunday the start of my week. Instead, I would like 2017-12-31 through 2018-01-06 to be associated with week 53, and week 1 to start on 2018-01-07, which falls on a Sunday.
I have the following data frame structure:
require(lubridate)
range <- seq(as.Date('2017-12-26'), by = 1, len = 10)
df <- data.frame(range)
df$WKN <- as.integer(format(df$range + 1, '%V'))
df$weekday <- weekdays(df$range)
df$weeknum <- wday(df$range)
This would give me this:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 01 Sunday 1
2018-01-01 01 Monday 2
2018-01-02 01 Tuesday 3
2018-01-03 01 Wednesday 4
2018-01-04 01 Thursday 5
What I would like to have is:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 53 Sunday 1
2018-01-01 53 Monday 2
2018-01-02 53 Tuesday 3
2018-01-03 53 Wednesday 4
2018-01-04 53 Thursday 5
.
.
2018-01-07 01 Sunday 1
Can anyone point me in a right direction?
@alistaire had provided a solution here: Start first day of week of the year on Sunday and end last day of week of the year on Saturday. But I did not foresee this blip.
Got it.
Little adjustments to this should serve my purpose!
df$WKN <- as.integer(format(df$range, '%U'))
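One caveat with %U: days before a year's first Sunday are reported as week 00, so the early-January dates would come out as 0 rather than carrying on the previous year's week 53. A hedged sketch of one possible adjustment, reusing df from above:
df$WKN <- as.integer(format(df$range, '%U'))   # Sunday-based week of year
jan0 <- df$WKN == 0                            # days before the year's first Sunday
if (any(jan0)) {
  # give those days the previous year's final week number instead
  prev_year_end <- as.Date(paste0(format(df$range[jan0], "%Y"), "-01-01")) - 1
  df$WKN[jan0] <- as.integer(format(prev_year_end, '%U'))
}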
I have a data set in long format that includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform these data to short (wide) format with time bands based on month (i.e. Fall = August-October; Winter = January-February; Spring = March-May). Some bands will include more than one observation per participant and, as such, will need a "spill over" band. An example transformation for the Fl scores is below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into the Fall2Fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl, Er, and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
## "mydf" is the long data from the question, e.g. read in from the linked CSV
library(lubridate)  # for month() and year()
library(reshape2)   # for dcast()

## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format = "%m-%d-%Y %H:%M")

## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels = 1:12)
levels(Months) <- list(Fall   = c(8:10),
                       Winter = c(1:2),
                       Spring = c(3:5),
                       Other  = c(6, 7, 11, 12))
mydf$Seasons <- Months

## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]

## Add a "Year" column
mydf$Year <- year(mydf$Date)

## Add a "Times" column (observation counter within ID and year)
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
                             mydf$ID, mydf$Year, FUN = seq_along))

## Use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
dcast(mydf, ID ~ Seasons + Times, value.var = "Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
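If you want all three measures in one step, data.table's dcast() accepts several value.var columns at once. A minimal sketch, assuming the short column names Fl, Er, and Cmp from the sample table (the linked CSV may use longer names such as "Fluency", so adjust accordingly):
library(data.table)
mydf$Date <- as.POSIXct(mydf$Date)  # data.table does not allow POSIXlt columns
setDT(mydf)
data.table::dcast(mydf, ID ~ Seasons + Times, value.var = c("Fl", "Er", "Cmp"))
This yields one block of season/occurrence columns per measure (e.g. Fl_Fall_1, Er_Fall_1, and so on).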