Subset dataframe w/ sequence of observations
I am experimenting with a large dataset. I would like to subset this data frame into Monday-through-Friday intervals. However, some weeks have missing days (for example, Thursday is missing in one week).
If a sequence of days, i.e. Monday to Friday, is incomplete, I would like to exclude that whole sequence from my sample.
Would this be possible?
week.nr <- data$week.nr[1:20]
week.day<- data$week.day[1:20]
date <- data$specific.date[1:20]
price <- data$price[1:20]
data.frame(date,week.nr,week.day,price)
date week.nr week.day price
1 2019-01-28 05 Monday 62.6
2 2019-01-25 04 Friday 63.8
3 2019-01-24 04 Thursday 64.2
4 2019-01-23 04 Wednesday 64.0
5 2019-01-22 04 Tuesday 64.0
6 2019-01-21 04 Monday 63.4
7 2019-01-18 03 Friday 62.6
8 2019-01-17 03 Thursday 62.6
9 2019-01-16 03 Wednesday 64.0
10 2019-01-15 03 Tuesday 64.4
11 2019-01-14 03 Monday 65.2
12 2019-01-11 02 Friday 66.4
13 2019-01-10 02 Thursday 66.2
14 2019-01-09 02 Wednesday 68.2
15 2019-01-08 02 Tuesday 68.8
16 2019-01-07 02 Monday 67.8
17 2019-01-04 01 Friday 67.4
18 2019-01-03 01 Thursday 68.0
19 2019-01-02 01 Wednesday 69.6
20 2018-12-28 52 Friday 71.0
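One way to approach this, as a minimal sketch rather than a definitive answer: count the distinct weekdays present in each week and keep only the weeks that have all five. The small `df` below is a hypothetical excerpt of the data shown above, and grouping on `week.nr` alone assumes the data covers a single year (with several years, the grouping key would also need the year).

```r
# Hypothetical excerpt of the data above: week 04 is complete,
# week 01 has only three days, week 52 has only one.
df <- data.frame(
  date = as.Date(c("2019-01-25", "2019-01-24", "2019-01-23", "2019-01-22",
                   "2019-01-21", "2019-01-04", "2019-01-03", "2019-01-02",
                   "2018-12-28")),
  week.nr = c("04", "04", "04", "04", "04", "01", "01", "01", "52"),
  week.day = c("Friday", "Thursday", "Wednesday", "Tuesday", "Monday",
               "Friday", "Thursday", "Wednesday", "Friday"),
  price = c(63.8, 64.2, 64.0, 64.0, 63.4, 67.4, 68.0, 69.6, 71.0)
)

# For every row, the number of distinct weekdays observed in that row's week
days_in_week <- ave(seq_along(df$week.day), df$week.nr,
                    FUN = function(i) length(unique(df$week.day[i])))

# Keep only the rows that belong to weeks with all five weekdays
complete_weeks <- df[days_in_week == 5, ]
complete_weeks
```

Only the rows of week 04 remain in this excerpt, because weeks 01 and 52 are incomplete.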
When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3- and 4-digit numbers, e.g. 2300 or 115, to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. However this is actually producing a character vector:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
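To make the two steps of that one-liner easier to follow, here is a small sketch that splits them apart: first zero-pad every value to four digits, then insert the colon. The `Time` vector is a hypothetical excerpt of the question's column, and the result is still a character vector, not a time class.

```r
# Hypothetical excerpt of the Time column from the question
Time <- c(2345, 2300, 2130, 2400, 115, 30)

# Step 1: left-pad with zeros so every value has exactly four digits
padded <- sprintf("%04d", as.integer(Time))

# Step 2: capture the two-digit hour and minute and join them with a colon
hhmm <- sub("(\\d{2})(\\d{2})", "\\1:\\2", padded)
hhmm
```

The padding step is what lets 3-digit values such as 115 and 2-digit values such as 30 split correctly into hour and minute.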
I have a set of data taken every 5 minutes consisting of the following structure:
>df1
Date X1
01/01/2017 0:00 1
01/01/2017 0:30 32
01/01/2017 1:00 65
01/01/2017 1:30 14
01/01/2017 2:00 25
01/01/2017 2:30 14
01/01/2017 3:00 85
01/01/2017 3:30 74
01/01/2017 4:00 74
01/01/2017 4:30 52
01/01/2017 5:00 25
01/01/2017 5:30 74
01/01/2017 6:00 45
01/01/2017 6:30 52
01/01/2017 7:00 21
01/01/2017 7:30 41
01/01/2017 8:00 74
01/01/2017 8:30 11
01/01/2017 9:00 2
01/01/2017 9:30 52
Another vector is given consisting of only dates, but with a different time frequency:
>V1
Date2
1/1/2017 1:30:00
1/1/2017 3:30:00
1/1/2017 5:30:00
1/1/2017 9:30:00
I would like to calculate the moving average of X1, but in the end the only values I really need are the ones associated with the dates in V1 (while preserving the smoothing generated by the moving average).
Would you recommend calculating the moving average of X1, then associating each value with the corresponding date in V1 and re-applying a moving average? Or do you know a function in R that would help me achieve this?
Thank you, I really appreciate your help!
Sofía
filter (from the stats package) is a convenient way to construct moving averages.
Assuming you want a simple arithmetic moving average, you'll need to decide how many elements you'd like to average together, and if you'd like a one or two-sided average. Arbitrarily, I'll pick 5 and one-sided.
elements <- 5
df1$x1.smooth <- filter(df1$X1, filter = rep(1/elements, elements), sides=1)
Note that the first elements-1 values of x1.smooth will be NA, because the moving average is undefined until there are elements items to average.
df1 is now
Date X1 x1.smooth
1 01/01/2017 0:00 1 NA
2 01/01/2017 0:30 32 NA
3 01/01/2017 1:00 65 NA
4 01/01/2017 1:30 14 NA
5 01/01/2017 2:00 25 27.4
6 01/01/2017 2:30 14 30.0
7 01/01/2017 3:00 85 40.6
8 01/01/2017 3:30 74 42.4
9 01/01/2017 4:00 74 54.4
10 01/01/2017 4:30 52 59.8
11 01/01/2017 5:00 25 62.0
12 01/01/2017 5:30 74 59.8
13 01/01/2017 6:00 45 54.0
14 01/01/2017 6:30 52 49.6
15 01/01/2017 7:00 21 43.4
16 01/01/2017 7:30 41 46.6
17 01/01/2017 8:00 74 46.6
18 01/01/2017 8:30 11 39.8
19 01/01/2017 9:00 2 29.8
20 01/01/2017 9:30 52 36.0
Now you need only merge the two data frames on Date = Date2 or else subset df1 where Date is %in% V1$Date2
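Because Date and Date2 are written in different text formats, matching them as strings would find nothing, so it may help to parse both first. The sketch below is one way to do the subset step; it assumes the dates are day/month/year (the sample is ambiguous, since every row is 01/01), uses a hypothetical excerpt of df1 and V1, and adds a helper column named `when` that is not in the original data.

```r
# Hypothetical excerpt of df1 and V1 from the question
df1 <- data.frame(
  Date = c("01/01/2017 0:00", "01/01/2017 0:30", "01/01/2017 1:00",
           "01/01/2017 1:30", "01/01/2017 2:00", "01/01/2017 2:30",
           "01/01/2017 3:00", "01/01/2017 3:30"),
  X1 = c(1, 32, 65, 14, 25, 14, 85, 74)
)
V1 <- data.frame(Date2 = c("1/1/2017 1:30:00", "1/1/2017 3:30:00"))

# One-sided moving average over 5 elements, as in the answer above
elements <- 5
df1$x1.smooth <- as.numeric(
  stats::filter(df1$X1, filter = rep(1 / elements, elements), sides = 1)
)

# Parse both date columns to POSIXct (assumption: day/month/year order)
df1$when <- as.POSIXct(df1$Date, format = "%d/%m/%Y %H:%M", tz = "UTC")
V1$when <- as.POSIXct(V1$Date2, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Keep only the smoothed rows whose timestamp appears in V1
result <- df1[df1$when %in% V1$when, c("Date", "X1", "x1.smooth")]
result
```

The smoothing is computed on the full series before subsetting, so the values kept for the V1 dates still reflect their neighbours.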
Another option could be to use the zoo package. One can use rollapply to add another column to the data frame that holds the moving average of X1.
A moving average of width 4 (i.e. every 2 hours) can be implemented as:
library(zoo)
#Add another column with mean value
df$mean <- rollapply(df$X1, 4, mean, align = "right", fill = NA)
df
# Date X1 mean
# 1 2017-01-01 00:00:00 1 NA
# 2 2017-01-01 00:30:00 32 NA
# 3 2017-01-01 01:00:00 65 NA
# 4 2017-01-01 01:30:00 14 28.00
# 5 2017-01-01 02:00:00 25 34.00
# 6 2017-01-01 02:30:00 14 29.50
# 7 2017-01-01 03:00:00 85 34.50
# 8 2017-01-01 03:30:00 74 49.50
# 9 2017-01-01 04:00:00 74 61.75
# 10 2017-01-01 04:30:00 52 71.25
# 11 2017-01-01 05:00:00 25 56.25
# 12 2017-01-01 05:30:00 74 56.25
# 13 2017-01-01 06:00:00 45 49.00
# 14 2017-01-01 06:30:00 52 49.00
# 15 2017-01-01 07:00:00 21 48.00
# 16 2017-01-01 07:30:00 41 39.75
# 17 2017-01-01 08:00:00 74 47.00
# 18 2017-01-01 08:30:00 11 36.75
# 19 2017-01-01 09:00:00 2 32.00
# 20 2017-01-01 09:30:00 52 34.75
I'm borrowing the reproducible example given here:
Aggregate daily level data to weekly level in R
since it's pretty much close to what I want to do.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
In that question, the author asks how to aggregate on weekly intervals; what I'd like to do is aggregate on a "day of the week" basis.
So I'd like to have a table similar to this one, summing the values for each day of the week:
Day of the week value
1 "Sunday" 60000
2 "Monday" 50000
3 "Tuesday" 60000
4 "Wednesday" 50000
5 "Thursday" 60000
6 "Friday" 50000
7 "Saturday" 60000
You can try:
aggregate(d$value, list(weekdays(as.Date(d$Interval))), sum)
We can group them by day of the week using weekdays:
library(dplyr)
df %>%
group_by(Day_Of_The_Week = weekdays(as.Date(Interval))) %>%
summarise(value = sum(value))
# Day_Of_The_Week value
# <chr> <int>
#1 Friday 16903
#2 Monday 26368
#3 Saturday 4738
#4 Sunday 2975
#5 Thursday 17858
#6 Tuesday 23772
#7 Wednesday 13560
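The output above is sorted alphabetically, while the desired table runs from Sunday to Saturday. One possible way to get that order, sketched here in base R, is to turn the day name into a factor with explicit levels. The data frame `d` is a hypothetical two-week excerpt of the data, and `day_names` and `dow` are names introduced only for this illustration; the day number comes from `format(..., "%w")` so the result does not depend on the locale.

```r
# Hypothetical two-week excerpt of the data (2012-06-10 is a Sunday)
d <- data.frame(
  Interval = as.Date("2012-06-10") + 0:13,
  value = c(552, 4850, 4642, 4132, 4190, 4186, 1139,
            490, 5156, 4430, 4447, 4256, 3856, 1163)
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# "%w" gives the weekday as a number, 0 = Sunday ... 6 = Saturday
day_index <- as.integer(format(d$Interval, "%w")) + 1

# A factor with explicit levels keeps the Sunday-to-Saturday order
d$dow <- factor(day_names[day_index], levels = day_names)

# Sum the values per day of the week; rows follow the factor levels
totals <- aggregate(value ~ dow, data = d, FUN = sum)
totals
```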
We can do this with data.table
library(data.table)
setDT(df1)[, .(value = sum(value)), .(Dayofweek = weekdays(as.Date(Interval)))]
# Dayofweek value
#1: Sunday 2975
#2: Monday 26368
#3: Tuesday 23772
#4: Wednesday 13560
#5: Thursday 17858
#6: Friday 16903
#7: Saturday 4738
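Using lubridate (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html):
library(lubridate)
library(data.table)
# label = TRUE returns the weekday name instead of its number
df1$Weekday <- wday(as.Date(df1$Interval), label = TRUE)
df1 <- data.table(df1)
df1[, sum(value), Weekday]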
The last day of 2017 (2017-12-31) falls on a Sunday, meaning the last week of the year contains only 1 day if I consider Sunday the start of my week. Now, I would like 2017-12-31 to 2018-01-06 to be associated with week 53, and week 1 to start on 2018-01-07, which falls on a Sunday.
I have the following data frame structure:
require(lubridate)
range <- seq(as.Date('2017-12-26'), by = 1, len = 10)
df <- data.frame(range)
df$WKN <- as.integer(format(df$range + 1, '%V'))
df$weekday <- weekdays(df$range)
df$weeknum <- wday(df$range)
This would give me this:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 01 Sunday 1
2018-01-01 01 Monday 2
2018-01-02 01 Tuesday 3
2018-01-03 01 Wednesday 4
2018-01-04 01 Thursday 5
What I would like to have is:
df:
range WKN weekday weeknum
2017-12-26 52 Tuesday 3
2017-12-27 52 Wednesday 4
2017-12-28 52 Thursday 5
2017-12-29 52 Friday 6
2017-12-30 52 Saturday 7
2017-12-31 53 Sunday 1
2018-01-01 53 Monday 2
2018-01-02 53 Tuesday 3
2018-01-03 53 Wednesday 4
2018-01-04 53 Thursday 5
.
.
2018-01-07 01 Sunday 1
Can anyone point me in a right direction?
@alistaire had provided a solution here: "Start first day of week of the year on Sunday and end last day of week of the year on Saturday". But I did not foresee this blip.
Got it.
Little Adjustments to this should serve my purpose!
df$WKN <- as.integer(format(df$range, '%U'))
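For reference, `%U` numbers weeks from the first Sunday of the year, so the days before that Sunday come out as week 0. A hedged sketch of one possible "little adjustment" is to give those week-0 days the `%U` week of the preceding 31 December, which continues the previous year's last week. The names `wk`, `prev_dec31` and `wk_adj` are introduced here for illustration and are not from the original post.

```r
# Thirteen days spanning the year boundary, as in the question
range <- seq(as.Date("2017-12-26"), by = 1, length.out = 13)

# Week of the year with Sunday as the first day of the week (00-53)
wk <- as.integer(format(range, "%U"))

# 31 December of the year before each date
prev_dec31 <- as.Date(paste0(as.integer(format(range, "%Y")) - 1, "-12-31"))

# Days in "week 0" continue the last week of the previous year
wk_adj <- ifelse(wk == 0L, as.integer(format(prev_dec31, "%U")), wk)

data.frame(range, wk, wk_adj)
```

With this adjustment, 2017-12-31 through 2018-01-06 share one week number, and 2018-01-07 starts week 1.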
I have a data frame like this:
0 weekday day month year hour basal bolus carb period.h
1 Tuesday 01 03 2016 0.0 0.25 NA NA 0
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
4 Tuesday 01 03 2016 12.0 0.30 NA NA 12
5 Tuesday 01 03 2016 17.0 0.50 NA NA 17
6 Tuesday 01 03 2016 17.6 NA NA 33 17
7 Tuesday 01 03 2016 17.6 NA 1.35 NA 17
8 Tuesday 01 03 2016 18.6 NA NA 44 18
9 Tuesday 01 03 2016 18.6 NA 1.80 NA 18
10 Tuesday 01 03 2016 18.9 NA NA 17 18
11 Tuesday 01 03 2016 18.9 NA 0.70 NA 18
12 Tuesday 01 03 2016 22.0 0.40 NA NA 22
13 Wednesday 02 03 2016 0.0 0.25 NA NA 0
14 Wednesday 02 03 2016 9.7 NA NA 39 9
15 Wednesday 02 03 2016 9.7 NA 2.65 NA 9
16 Wednesday 02 03 2016 11.2 NA NA 13 11
17 Wednesday 02 03 2016 11.2 NA 0.30 NA 11
18 Wednesday 02 03 2016 12.0 0.30 NA NA 12
19 Wednesday 02 03 2016 12.0 NA NA 16 12
20 Wednesday 02 03 2016 12.0 NA 0.65 NA 12
If you look at lines 2 and 3, you will notice that they correspond to exactly the same day and time: in line 2 only "carb" is not NA, and in line 3 only "bolus" is not NA (these are data about diabetes).
I want to merge such lines into a single one:
2 Tuesday 01 03 2016 10.9 NA NA 67 10
3 Tuesday 01 03 2016 10.9 NA 4.15 NA 10
->
2 Tuesday 01 03 2016 10.9 NA 4.15 67 10
I could of course do a brute-force double loop over each line, but I am looking for a cleverer and faster way.
You can group your data frame by the common identifier columns (here weekday, day, month, year, hour and period.h), then sort each of the remaining columns you would like to merge and take its first element. By default, sort() removes NAs from the vector being sorted, so you end up with a non-NA element for each column within each group; if all elements in a column are NA, sort(col)[1] returns NA:
library(dplyr)
df %>%
group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(sort(.)[1]))
# weekday day month year hour period.h basal bolus carb
# <fctr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
# 1 Tuesday 1 3 2016 0.0 0 0.25 NA NA
# 2 Tuesday 1 3 2016 10.9 10 NA 4.15 67
# 3 Tuesday 1 3 2016 12.0 12 0.30 NA NA
# 4 Tuesday 1 3 2016 17.0 17 0.50 NA NA
# 5 Tuesday 1 3 2016 17.6 17 NA 1.35 33
# 6 Tuesday 1 3 2016 18.6 18 NA 1.80 44
# 7 Tuesday 1 3 2016 18.9 18 NA 0.70 17
# 8 Tuesday 1 3 2016 22.0 22 0.40 NA NA
# 9 Wednesday 2 3 2016 0.0 0 0.25 NA NA
# 10 Wednesday 2 3 2016 9.7 9 NA 2.65 39
# 11 Wednesday 2 3 2016 11.2 11 NA 0.30 13
# 12 Wednesday 2 3 2016 12.0 12 0.30 0.65 16
Instead of sort(), maybe a more appropriate function to use here is na.omit():
df %>% group_by(weekday, day, month, year, hour, period.h) %>%
summarise_all(funs(na.omit(.)[1]))
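Note that funs() has been deprecated in newer dplyr releases, so the calls above may emit warnings there. As a dependency-free alternative, the same "first non-NA value per group" idea can be sketched in base R with aggregate(). The data frame below is a hypothetical four-row excerpt of the question's data, and `first_non_na`, `keys` and `vals` are names introduced only for this illustration.

```r
# Hypothetical excerpt of the data: rows 2 and 3 share the same day and time
df <- data.frame(
  weekday = rep("Tuesday", 4),
  day = 1, month = 3, year = 2016,
  hour = c(0.0, 10.9, 10.9, 12.0),
  basal = c(0.25, NA, NA, 0.30),
  bolus = c(NA, NA, 4.15, NA),
  carb = c(NA, 67, NA, NA),
  period.h = c(0, 10, 10, 12)
)

keys <- c("weekday", "day", "month", "year", "hour", "period.h")
vals <- c("basal", "bolus", "carb")

# First non-NA element of a vector, or NA if every element is NA
first_non_na <- function(v) v[!is.na(v)][1]

# Collapse the rows of each group into one, column by column
merged <- aggregate(df[vals], by = df[keys], FUN = first_non_na)
merged <- merged[order(merged$hour), ]
merged
```

Unlike sort(.)[1], this keeps the first non-NA value in row order rather than the smallest one, which only matters if a group holds more than one non-NA value in the same column.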