Start the Weeknum on Sunday as the first day of the week - r

The last day of 2017 (2017-12-31) falls on a Sunday, which means the last week of the year contains only one day if I treat Sunday as the first day of the week. I would like 2017-12-31 to 2018-01-06 to be associated with week 53, and week 1 to start on 2018-01-07, which falls on a Sunday.
I have the following data frame structure:
require(lubridate)
range <- seq(as.Date('2017-12-26'), by = 1, len = 10)
df <- data.frame(range)
df$WKN <- as.integer(format(df$range + 1, '%V'))
df$weekday <- weekdays(df$range)
df$weeknum <- wday(df$range)
This would give me this:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 01 Sunday 1
2018-01-01 01 Monday 2
2018-01-02 01 Tuesday 3
2018-01-03 01 Wednesday 4
2018-01-04 01 Thursday 5
What I would like to have is:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 53 Sunday 1
2018-01-01 53 Monday 2
2018-01-02 53 Tuesday 3
2018-01-03 53 Wednesday 4
2018-01-04 53 Thursday 5
.
.
2018-01-07 01 Sunday 1
Can anyone point me in a right direction?
@alistaire provided a solution in "Start first day of week of the year on Sunday and end last day of week of the year on Saturday", but I did not foresee this edge case.

Got it.
A small adjustment to this should serve my purpose:
df$WKN <- as.integer(format(df$range, '%U'))
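Note that %U numbers the days before the year's first Sunday as week 0, so on its own it returns 0 rather than 53 for 2018-01-01. Here is a minimal sketch of one possible adjustment, assuming those days should simply take over the last week number of the previous year. The helper name sunday_week is hypothetical and not part of the original answer.

```r
# Sunday-based week number in which days before the year's first Sunday
# take over the last week number of the previous year instead of 0.
# `sunday_week` is a hypothetical helper name.
sunday_week <- function(d) {
  wk <- as.integer(format(d, "%U"))
  # 31 December of the previous year; only used where wk == 0
  prev_dec31 <- as.Date(paste0(as.integer(format(d, "%Y")) - 1L, "-12-31"))
  ifelse(wk == 0L, as.integer(format(prev_dec31, "%U")), wk)
}

dates <- seq(as.Date("2017-12-26"), by = 1, length.out = 13)
data.frame(range = dates, WKN = sunday_week(dates))
```

For the dates in the question this gives 52 up to 2017-12-30, 53 from 2017-12-31 to 2018-01-06, and 1 on 2018-01-07.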

Related

How to make a loop that changes values in specific rows from a dictionary

I'm new and can't figure out how to solve this problem.
I have a data.frame = schedule
Week_number  Start  End
1            09:00  15:00
1            09:00  15:00
1            09:00  15:00
1            09:00  15:00
1            09:00  15:00
1            NA     NA
1            NA     NA
2            09:00  15:00
2            09:00  15:00
2            09:00  15:00
2            09:00  15:00
2            09:00  15:00
2            NA     NA
2            NA     NA
3            09:00  15:00
3            09:00  15:00
3            09:00  15:00
3            09:00  15:00
3            09:00  15:00
3            NA     NA
3            NA     NA
..
52
I have a shift dictionary :
> start_vec <- c("06:00", "08:00", "14:00")
> end_vec <- c("14:00", "16:00", "22:00")
My loop should go through all 52 weeks and replace 09:00 and 15:00 with values from the dictionary.
But the problem is that the values should not be repeated, i.e. each week should get a different shift than the week before.
For example, I start the year with : 08:00 - 16:00. The year can start with any shift.
Please find an example below :
Week_number  Start  End
1            08:00  16:00
1            08:00  16:00
1            08:00  16:00
1            08:00  16:00
1            08:00  16:00
1            NA     NA
1            NA     NA
2            14:00  22:00
2            14:00  22:00
2            14:00  22:00
2            14:00  22:00
2            14:00  22:00
2            NA     NA
2            NA     NA
3            06:00  14:00
3            06:00  14:00
3            06:00  14:00
3            06:00  14:00
3            06:00  14:00
3            NA     NA
3            NA     NA
..
52
I tried to write a nested loop, or to build a week_number vector so that I could replace all non-NA rows of week 1 with a specific value.
> rd_dt <- data.frame()
> for (i in 1:length(schedule$Week_number)){
> for (s in start_vec){
> for (e in end_vec){
> dt <- schedule[i,]
> if (schedule$Start == NA){
> next
> else {
Thanks in advance for any hint.
I don't think you need a loop for this. Here is one approach that may help. Use ifelse to check for NA; where the value is not NA, look up the substitute in start_vec and end_vec. Week_number serves as the index into the vectors, and the %% modulus operator (3 being the length of your vectors) makes the index wrap around to the beginning once it exceeds the vector length.
library(dplyr)
schedule %>%
  mutate(Start = ifelse(is.na(Start), NA, start_vec[1 + Week_number %% 3]),
         End = ifelse(is.na(End), NA, end_vec[1 + Week_number %% 3]))
Output
Week_number Start End
1 1 08:00 16:00
2 1 08:00 16:00
3 1 08:00 16:00
4 1 08:00 16:00
5 1 08:00 16:00
6 1 <NA> <NA>
7 1 <NA> <NA>
8 2 14:00 22:00
9 2 14:00 22:00
10 2 14:00 22:00
11 2 14:00 22:00
12 2 14:00 22:00
13 2 <NA> <NA>
14 2 <NA> <NA>
15 3 06:00 14:00
16 3 06:00 14:00
17 3 06:00 14:00
18 3 06:00 14:00
19 3 06:00 14:00
20 3 <NA> <NA>
21 3 <NA> <NA>
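For anyone who wants to try this without dplyr, here is a self-contained base R sketch of the same modulus idea. The three-week schedule built here is a hypothetical stand-in for the asker's data, and shift_idx and working are names introduced only for illustration.

```r
start_vec <- c("06:00", "08:00", "14:00")
end_vec   <- c("14:00", "16:00", "22:00")

# Hypothetical 3-week schedule: 5 working days followed by 2 days off per week
schedule <- data.frame(
  Week_number = rep(1:3, each = 7),
  Start = rep(c(rep("09:00", 5), NA, NA), times = 3),
  End   = rep(c(rep("15:00", 5), NA, NA), times = 3),
  stringsAsFactors = FALSE
)

# Position of each week's shift in the dictionary; cycles 2, 3, 1, 2, 3, 1, ...
shift_idx <- 1 + schedule$Week_number %% length(start_vec)

# Overwrite only the working days so that the NA rows stay NA
working <- !is.na(schedule$Start)
schedule$Start[working] <- start_vec[shift_idx[working]]
schedule$End[working]   <- end_vec[shift_idx[working]]

schedule
```

Using length(start_vec) instead of a hard-coded 3 keeps the rotation correct if more shifts are added to the dictionary later.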

how to convert 12 hour to 24 hour in r

I successfully split the time from 2018-12-31 11:45:00 AM into 2018-12-31 and 11:45:00 AM.
However, I have difficulty converting "11:45:00 AM" to 24-hour time.
I know there are several ways to do that; the most popular is to use strptime with format="%I:%M:%S %p". I tried that several times and double-checked again and again, but I still get NA in my column. Here crimeData is my dataset, and SplitHrs contains the time, e.g. "11:45:00 AM" as mentioned above:
crimeData$toSplitHrs = strptime(crimeData$SplitHrs, format="%I:%M:%S %p")
Police.Beats SplitMs SplitHrs year month days hours mins sec toSplitHrs
1 28 2018-12-31 11:45:00 2018 12 31 11 45 00 <NA>
2 177 2018-12-31 11:42:00 2018 12 31 11 42 00 <NA>
3 233 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
4 91 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
5 73 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
6 232 2018-12-31 11:27:00 2018 12 31 11 27 00 <NA>
But I still get NA from that.
Also, this dataset contains over 10k observations, so I really cannot change them one by one. Any suggestions are appreciated!
You can try the format %r for the time, taking into account the am/pm specification (see ?strptime):
strptime("2018-12-31 11:45:00 am", format="%F %r")
#[1] "2018-12-31 11:45:00 CET"
strptime("2018-12-31 11:45:00 pm", format="%F %r")
#[1] "2018-12-31 23:45:00 CET"
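To get plain 24-hour strings for a whole column at once, the parsed values can be passed to format(). This is a minimal sketch with hypothetical sample strings; times_12h, parsed and times_24h are names made up for the example. It assumes a locale in which %p recognises AM/PM, which ?strptime notes is locale-dependent.

```r
# Hypothetical sample values in 12-hour notation
times_12h <- c("11:45:00 AM", "11:45:00 PM", "12:05:00 AM")

# Parse with the 12-hour hour field (%I) plus the AM/PM indicator (%p);
# a fixed time zone avoids surprises around daylight saving changes
parsed <- strptime(times_12h, format = "%I:%M:%S %p", tz = "UTC")

# Write the time back out on the 24-hour clock
times_24h <- format(parsed, "%H:%M:%S")
times_24h
```

Both calls are vectorised, so this works on 10k rows without changing anything by hand. Note also that the printed SplitHrs column in the question shows no AM/PM suffix; if the stored strings really lack it, %p has nothing to match, which would be one possible reason for the NA results.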

Subset dataframe based on rule - sequence

Subset dataframe w/ sequence of observations
I am experimenting with a large dataset. I would like to subset this data frame into intervals of Monday through Friday. However, I see that some weeks have missing days (for example, Thursday is missing in one week).
If one sequence of days, i.e. Monday to Friday, is incomplete, I would like to exclude that whole sequence from my sample.
Would this be possible?
week.nr <- data$week.nr[1:20]
week.day<- data$week.day[1:20]
date <- data$specific.date[1:20]
price <- data$price[1:20]
data.frame(date,week.nr,week.day,price)
date week.nr week.day price
1 2019-01-28 05 Monday 62.6
2 2019-01-25 04 Friday 63.8
3 2019-01-24 04 Thursday 64.2
4 2019-01-23 04 Wednesday 64.0
5 2019-01-22 04 Tuesday 64.0
6 2019-01-21 04 Monday 63.4
7 2019-01-18 03 Friday 62.6
8 2019-01-17 03 Thursday 62.6
9 2019-01-16 03 Wednesday 64.0
10 2019-01-15 03 Tuesday 64.4
11 2019-01-14 03 Monday 65.2
12 2019-01-11 02 Friday 66.4
13 2019-01-10 02 Thursday 66.2
14 2019-01-09 02 Wednesday 68.2
15 2019-01-08 02 Tuesday 68.8
16 2019-01-07 02 Monday 67.8
17 2019-01-04 01 Friday 67.4
18 2019-01-03 01 Thursday 68.0
19 2019-01-02 01 Wednesday 69.6
20 2018-12-28 52 Friday 71.0
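One possible way to do this in base R, sketched under the assumption that a week counts as complete when all five days from Monday to Friday are present. The data frame below is a hypothetical sample built from some of the rows above; week.key, wday, n_weekdays and complete are names introduced for the example.

```r
# Hypothetical sample built from some of the rows shown above
df <- data.frame(
  date = as.Date(c("2019-01-28", "2019-01-25", "2019-01-24", "2019-01-23",
                   "2019-01-22", "2019-01-21", "2019-01-18", "2019-01-17",
                   "2019-01-16", "2019-01-15", "2019-01-14", "2019-01-04",
                   "2019-01-03", "2019-01-02")),
  price = c(62.6, 63.8, 64.2, 64.0, 64.0, 63.4, 62.6, 62.6, 64.0, 64.4,
            65.2, 67.4, 68.0, 69.6)
)

# ISO year-week key and ISO weekday number (1 = Monday); both are
# independent of the locale, unlike weekday names
df$week.key <- format(df$date, "%G-W%V")
df$wday <- as.integer(format(df$date, "%u"))

# For every row, count the distinct Monday-to-Friday days present in its week
n_weekdays <- ave(df$wday, df$week.key,
                  FUN = function(x) length(unique(x[x <= 5])))

# Keep only the rows that belong to complete Monday-to-Friday weeks
complete <- df[n_weekdays == 5, ]
complete
```

Grouping on a key that includes the year rather than on week.nr alone keeps week 01 of one year from being mixed with week 01 of another.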

Slow performance of split() function

I have a csv file consisting of around 200.000 rows of transactions. Here is the import and a little preprocessing of the data:
data <- read.csv("bitfinex_data/trades.csv", header=T)
data$date <- as.character(data$date)
data$date <- substr(data$date, 1, 10)
data$date <- as.numeric(data$date)
data$date <- as.POSIXct(data$date, origin="1970-01-01", tz = "GMT")
head(data)
id exchange symbol date price amount sell
1 24892563 bf btcusd 2018-01-02 00:00:00 13375 0.05743154 False
2 24892564 bf btcusd 2018-01-02 00:00:01 13374 0.12226129 False
3 24892565 bf btcusd 2018-01-02 00:00:02 13373 0.00489140 False
4 24892566 bf btcusd 2018-01-02 00:00:02 13373 0.07510860 False
5 24892567 bf btcusd 2018-01-02 00:00:02 13373 0.11606086 False
6 24892568 bf btcusd 2018-01-02 00:00:03 13373 0.47000000 False
My goal is to obtain hourly sums of the amount of tokens being traded. For this I need to split my data by hour, which I did in the following way:
tmp <- split(data, cut(data$date,"hour"))
However this is taking way too long (up to 1 hour) and I wonder whether or not this is normal behaviour for functions such as split() and cut()? Is there any alternative to using those two functions?
UPDATE:
After using the great suggestion by @Maurits Evers, my output table looks like this:
# A tibble: 25 x 2
date_hour amount.sum
<chr> <dbl>
1 1970-01-01 00 48.2
2 2018-01-02 00 2746.
3 2018-01-02 01 1552.
4 2018-01-02 02 2010.
5 2018-01-02 03 2171.
6 2018-01-02 04 3640.
7 2018-01-02 05 1399.
8 2018-01-02 06 836.
9 2018-01-02 07 856.
10 2018-01-02 08 819.
# ... with 15 more rows
This is exactly what I wanted, except for the first row, where the date is from the year 1970. Any suggestion on what might be causing the problem? I tried to change the origin parameter of the as.POSIXct() function but that did not solve the problem.
I agree with @Roland's comment. To illustrate, here is an example.
Let's generate some data with 200000 entries in one minute time intervals.
set.seed(2018)
df <- data.frame(
  date = seq(from = as.POSIXct("2018-01-01 00:00"), by = "min", length.out = 200000),
  amount = runif(200000))
head(df)
# date amount
#1 2018-01-01 00:00:00 0.33615347
#2 2018-01-01 00:01:00 0.46372327
#3 2018-01-01 00:02:00 0.06058539
#4 2018-01-01 00:03:00 0.19743361
#5 2018-01-01 00:04:00 0.47431419
#6 2018-01-01 00:05:00 0.30104860
We now (1) create a new column date_hour that includes the date & hour part of the full date&time, (2) group_by column date_hour, and (3) sum entries from column amount to give amount.sum.
library(dplyr)
df %>%
  mutate(date_hour = format(date, "%Y-%m-%d %H")) %>%
  group_by(date_hour) %>%
  summarise(amount.sum = sum(amount))
## A tibble: 3,333 x 2
# date_hour amount.sum
# <chr> <dbl>
# 1 2018-01-01 00 28.9
# 2 2018-01-01 01 26.4
# 3 2018-01-01 02 32.7
# 4 2018-01-01 03 29.9
# 5 2018-01-01 04 29.7
# 6 2018-01-01 05 28.5
# 7 2018-01-01 06 34.2
# 8 2018-01-01 07 33.8
# 9 2018-01-01 08 30.7
#10 2018-01-01 09 27.7
## ... with 3,323 more rows
This is very fast (it takes around 0.3 seconds on my 2012 MacBook Air), and you should be able to easily adjust this example to your particular case.
You can compute hourly sums without any packages, by using tapply. I use the random data as suggested by Maurits Evers:
set.seed(2018)
df <- data.frame(
date = seq(from = as.POSIXct("2018-01-01 00:00"),
by = "min", length.out = 200000),
amount = runif(200000))
head(df)
## date amount
## 1 2018-01-01 00:00:00 0.33615347
## 2 2018-01-01 00:01:00 0.46372327
## 3 2018-01-01 00:02:00 0.06058539
## 4 2018-01-01 00:03:00 0.19743361
## 5 2018-01-01 00:04:00 0.47431419
## 6 2018-01-01 00:05:00 0.30104860
tapply(df$amount,
format(df$date, "%Y-%m-%d %H"),
sum)
## 2018-01-01 00 2018-01-01 01 2018-01-01 02
## 28.85825 26.39385 32.73600
## 2018-01-01 03 2018-01-01 04 2018-01-01 05
## 29.88545 29.74048 28.46781
## ...
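Since the timestamps in the question start out as epoch seconds, another option is to floor them to the hour with integer division before any date formatting takes place. This is a minimal sketch with hypothetical values; secs, amount, hour_start and hourly are names made up for the example.

```r
# Hypothetical epoch-second timestamps and traded amounts
secs   <- c(1514851200, 1514851201, 1514854800, 1514854805, 1514858400)
amount <- c(0.5, 0.25, 1, 2, 4)

# Floor each timestamp to the start of its hour (3600 seconds per hour)
hour_start <- as.POSIXct(3600 * (secs %/% 3600),
                         origin = "1970-01-01", tz = "GMT")

# Sum the amounts within each hour
hourly <- aggregate(amount, by = list(hour = hour_start), FUN = sum)
names(hourly)[2] <- "amount.sum"
hourly
```

The grouping column stays a POSIXct value rather than a character string, so the result can be sorted or plotted as a time series directly.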

R: How to judge Date in the same week?

I want to create a new column that shows which dates are in the same week.
A data.table DATA_SET contains date information, like:
DATA_SET<- data.table(transday = seq(from = (Sys.Date()-64), to = Sys.Date(), by = 1))
For example, '2017-03-01' and '2017-03-02' are in the same week; '2017-03-01' and '2017-03-08' are both Wednesdays, but they are not in the same week.
If "2016-01-01" is in the first week of 2016 and "2017-01-01" is in the first week of 2017, both get the value 1, but they are not in the same week. So I want a unique value that specifies "the same week".
The answer to this question depends strongly on
the definition of the first day of the week (usually Sunday or Monday) and
the numbering of the weeks within the year (starting with the first Sunday, Monday, or Thursday of the year, or on 1st January, etc).
A selection of different options can be seen from the example below:
dates isoweek day week_iso week_US week_UK DT_week DT_iso lub_week lub_iso cut.Date
2015-12-25 2015-W52 5 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-26 2015-W52 6 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-27 2015-W52 7 2015-W52 52 51 52 52 52 52 2015-12-21
2015-12-28 2015-W53 1 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-29 2015-W53 2 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-30 2015-W53 3 2015-W53 52 52 53 53 52 53 2015-12-28
2015-12-31 2015-W53 4 2015-W53 52 52 53 53 53 53 2015-12-28
2016-01-01 2015-W53 5 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-02 2015-W53 6 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-03 2015-W53 7 2015-W53 01 00 1 53 1 53 2015-12-28
2016-01-04 2016-W01 1 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-05 2016-W01 2 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-06 2016-W01 3 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-07 2016-W01 4 2016-W01 01 01 2 1 1 1 2016-01-04
2016-01-08 2016-W01 5 2016-W01 01 01 2 1 2 1 2016-01-04
which is created by this code:
library(data.table)
dates <- as.Date("2016-01-01") + (-7:7)
print(data.table(
dates,
isoweek = ISOweek::ISOweek(dates),
day = ISOweek::ISOweekday(dates),
week_iso = format(dates, "%G-W%V"),
week_US = format(dates, "%U"),
week_UK = format(dates, "%W"),
DT_week = data.table::week(dates),
DT_iso = data.table::isoweek(dates),
lub_week = lubridate::week(dates),
lub_iso = lubridate::isoweek(dates),
cut.Date = cut.Date(dates, "week")
), row.names = FALSE)
The format YYYY-Www used in some of the columns is one of the ISO 8601 week formats. It includes the year which is required to distinguish different weeks in different years as requested by the OP.
The ISO week definition is the only format which ensures that each week always consists of 7 days, also across New Year. The other definitions may start or end the year with "weeks" of fewer than 7 days. Due to the seamless partitioning of the year, the ISO week-numbering year is slightly different from the traditional Gregorian calendar year, e.g., 2016-01-01 belongs to the last ISO week 53 of 2015 (2015-W53).
As suggested here, cut.Date() might be the best option for the OP.
Disclosure: I'm maintainer of the ISOweek package which was published at a time when strptime() did not recognize the %G and %V format specifications for output in the Windows versions of R. (Still today they aren't recognized on input).
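To illustrate the cut.Date() suggestion, here is a minimal sketch using the dates mentioned in the question. The vector transday is hypothetical sample data and week_start is a name introduced for the example. With the default start.on.monday = TRUE each date is labelled with the Monday that starts its week, which gives a key that stays unique across years.

```r
# Hypothetical sample: the dates discussed in the question
transday <- as.Date(c("2017-03-01", "2017-03-02", "2017-03-08",
                      "2016-01-01", "2017-01-01"))

# cut() labels each date with the first day (Monday) of its week;
# converting the label back to Date gives a week key unique across years
week_start <- as.Date(as.character(cut(transday, "week")))

data.frame(transday, week_start)
```

Two dates are in the same week exactly when their week_start values are equal. Passing start.on.monday = FALSE to cut() should give Sunday-based weeks instead.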
You can use the week() function of the lubridate package in R.
library(lubridate)
DATA_SET$week <- week(DATA_SET$transday)
This will give you a new column week. Dates within the same week will have same week number.
