My data frame looks like this:
Date Time Consumption kVARh kW weekday
2 2016-12-13 0:15:00 90.144 0.000 360.576 Tue
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
4 2016-12-13 0:45:00 91.584 0.000 366.336 Tue
5 2016-12-13 1:00:00 93.888 0.000 375.552 Tue
6 2016-12-13 1:15:00 88.416 0.000 353.664 Tue
7 2016-12-13 1:30:00 88.704 0.000 354.816 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
I got the data from a csv with Date as a factor, which I converted with as.character and then as.Date. Then I added a column giving me the day of week using
sigEx1DF$weekday <- format(as.Date(sigEx1DF$Date), "%a")
which I then converted to an ordered factor from Sunday through Saturday.
This is granular data from a smart meter that measures usage (Consumption) at 15-minute intervals; kW is Consumption*4. I need to average each weekday and then take the max of those averages, but after I subset, the data frame looks like this:
Date Time Consumption kVARh kW weekday
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
13 2016-12-13 3:00:00 93.600 0.000 374.400 Tue
18 2016-12-13 4:15:00 93.312 0.000 373.248 Tue
23 2016-12-13 5:30:00 107.424 0.000 429.696 Tue
28 2016-12-13 6:45:00 103.968 0.000 415.872 Tue
33 2016-12-13 8:00:00 108.576 0.000 434.304 Tue
Several of the 15-minute intervals are now missing (rows 4-7, for instance). I don't see anything different about rows 4-7, yet they disappear after the subset.
This is the code I used to subset:
bldg1_Wkdy <- subset(sort.df, weekday == c("Mon","Tue","Wed","Thu","Fri"),
select = c("Date","Time","Consumption","kVARh","kW","weekday"))
Here's the data frame structure before the subset:
'data.frame': 72888 obs. of 6 variables:
$ Date : Date, format: "2016-12-13" "2016-12-13" "2016-12-13" ...
$ Time : Factor w/ 108 levels "0:00:00","0:15:00",..: 2 3 4 5 6 7 8 49 50 51 ...
$ Consumption: num 90.1 90.1 91.6 93.9 88.4 ...
$ kVARh : num 0 0 0 0 0 0 0 0 0 0 ...
$ kW : num 361 361 366 376 354 ...
$ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
I go from 72,888 observations to only 10,427 for the weekdays and 10,368 for the weekends, with many rows that seem to be randomly missing, as noted above. Some of the intervals have zero consumption (electricity may have been out due to a storm or other reasons), but those rows do show up in the subset data, so zeroes don't seem to be causing the problem. Thanks for your help!
Instead of weekday == c("Mon","Tue","Wed","Thu","Fri") you should use weekday %in% c("Mon","Tue","Wed","Thu","Fri"). With ==, R recycles the right-hand vector and compares it element by element against the weekday column, so each row is only tested against one of the five day names; that is why seemingly random rows drop out. Below is a minimal test (with x defined so you can run it) which shows how %in% works as expected:
> x <- data.frame(weekday = "Tue")
> subset(x, weekday == c("Mon","Tue","Wed","Thu","Fri"))
   weekday
NA    <NA>
> subset(x, weekday %in% c("Mon","Tue","Wed","Thu","Fri"))
  weekday
1     Tue
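Once the subset is fixed, the averaging step the question asks about (average each weekday, then take the max of the averages) can be sketched as follows. This is a minimal sketch with made-up toy data standing in for bldg1_Wkdy; only the column names are taken from the question:

```r
# toy stand-in for bldg1_Wkdy (the real frame comes from the question)
bldg1_Wkdy <- data.frame(
  weekday = c("Mon", "Mon", "Tue", "Tue"),
  kW      = c(360, 380, 400, 420)
)

# mean kW per weekday, then the weekday with the largest mean
wkdy_means <- aggregate(kW ~ weekday, data = bldg1_Wkdy, FUN = mean)
wkdy_means[which.max(wkdy_means$kW), ]
```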
I have a data frame as shown below:
str(Rainfall_Complete)
'data.frame': 8221 obs. of 18 variables:
$ Date : Date, format: "1985-04-29" "1985-04-30" "1985-05-01" ...
$ Month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..:
$ Season : Factor w/ 4 levels "Monsoon","PostMonsoon",..:.
$ Year : chr "1985" "1985" "1985" "1985" ...
$ Stn A : num 0 8.8 0 15 26.2 0 2.5 0 0 0 ...
$ Stn B : num 0 0 26 11 13.8 20 0.26 0 0 0 ...
$ Stn C : num 0.1 0 0 0 13.5 27 16 5 0 0 …
I want to convert this daily time series to a monthly time series, so the data looks something like this:
Year Month StnA StnB StnC……..
1985 Jan 150 100 120
1985 Feb 120 98 58
….
2010 Jan 200 100 87
2010 Feb 140 145 120
I tried the following, but it works only for a univariate series:
library(dplyr)
Monthly_rainfall <- Rainfall_Complete %>% group_by(Year,Month)%>% summarise()
Any help would be appreciated
Your attempt uses dplyr, but the question is tagged with xts and lubridate, so here is a solution using those packages, with a reprex built on the anscombe data set.
library(lubridate)
library(xts)
## Create some basic data
ans6 <- xts(anscombe[, 1:6], order.by = as.Date("2018-01-28") + 1:nrow(anscombe))
## Summary by month
mon6 <- apply.monthly(ans6, FUN = mean)
## Re-format
df6 <- as.data.frame(mon6)
df6$year <- year(rownames(df6))
df6$month <- month(rownames(df6), label = TRUE)
df6
## x1 x2 x3 x4 y1 y2 year month
## 2018-01-31 10.33333 10.33333 10.33333 8.000 7.523333 8.673333 2018 Jan
## 2018-02-08 8.50000 8.50000 8.50000 9.375 7.492500 7.061250 2018 Feb
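For completeness, since the original attempt used dplyr: the grouped summary can also be finished in dplyr itself. This is my own sketch, under the assumption that every station column starts with "Stn" (as in the str() output above); the toy data below stands in for Rainfall_Complete:

```r
library(dplyr)

# toy stand-in for Rainfall_Complete
Rainfall_Complete <- data.frame(
  Year    = c("1985", "1985", "1985"),
  Month   = c("Jan", "Jan", "Feb"),
  `Stn A` = c(1, 2, 3),
  `Stn B` = c(4, 5, 6),
  check.names = FALSE
)

# sum every station column within each Year/Month group
Monthly_rainfall <- Rainfall_Complete %>%
  group_by(Year, Month) %>%
  summarise(across(starts_with("Stn"), sum), .groups = "drop")
```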
This question already has answers here:
Plot separate years on a common day-month scale
(3 answers)
Closed 6 years ago.
I'm trying to drop the year from a multiyear data frame and plot day-month on the x axis, with geom_smooth() calculated separately for the different years.
My data structure, initially looks like this:
> str(pmWaw)
'data.frame': 52488 obs. of 5 variables:
$ date : POSIXct, format: "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 01:00:00" ...
$ stacja: Factor w/ 273 levels "DsWrocKorzA",..: 26 27 129 26 27 129 26 27 129 26 ...
$ pm25 : num 100 63 NA 69 36 NA 41 31 NA 37 ...
$ pm10 : num 122 68 79 77 38 90 43 32 39 38 ...
$ season: Ord.factor w/ 4 levels "spring (MAM)"<..: 4 4 4 4 4 4 4 4 4 4 ...
Using lubridate I added year and month as separate variables:
library(lubridate)
pmWaw$year<- year(pmWaw$date)
pmWaw$month<- month(pmWaw$date)
Next, using code found here on Stack Overflow, I computed a month-day variable in %m-%d format:
pmWaw$month.day<-format(pmWaw$date, format="%m-%d")
#check new variable type:
> typeof(pmWaw$month.day)
[1] "character"
The data frame I end up working with is this:
> head(pmWaw)
date stacja pm25 pm10 season year month month.day
1 2014-01-01 00:00:00 MzWarNiepodKom 100 122 winter (DJF) 2014 1 01-01
2 2014-01-01 00:00:00 MzWarszUrsynow 63 68 winter (DJF) 2014 1 01-01
3 2014-01-01 00:00:00 MzWarTarKondra NA 79 winter (DJF) 2014 1 01-01
4 2014-01-01 01:00:00 MzWarNiepodKom 69 77 winter (DJF) 2014 1 01-01
5 2014-01-01 01:00:00 MzWarszUrsynow 36 38 winter (DJF) 2014 1 01-01
6 2014-01-01 01:00:00 MzWarTarKondra NA 90 winter (DJF) 2014 1 01-01
> tail(pmWaw)
date stacja pm25 pm10 season year month month.day
52483 2015-12-30 22:00:00 MzWarAlNiepo 36 47 winter (DJF) 2015 12 12-30
52484 2015-12-30 22:00:00 MzWarKondrat 26 29 winter (DJF) 2015 12 12-30
52485 2015-12-30 22:00:00 MzWarWokalna 36 44 winter (DJF) 2015 12 12-30
52486 2015-12-30 23:00:00 MzWarAlNiepo 39 59 winter (DJF) 2015 12 12-30
52487 2015-12-30 23:00:00 MzWarKondrat 36 39 winter (DJF) 2015 12 12-30
52488 2015-12-30 23:00:00 MzWarWokalna 40 49 winter (DJF) 2015 12 12-30
Passing the new variables to ggplot gives me three issues:
ggplot(pmWaw, aes(x=month.day, y=pm25)) +
geom_jitter(alpha=0.5) +
geom_smooth()
First (minor) problem: month.day is a character variable, so ggplot doesn't recognize its original time-series nature. I can probably overcome this by manually setting the scale labels to months.
Second (major) problem: geom_smooth() is not drawn at all, and I can't figure out why.
Third (major) problem: I can't work out how to add year as a grouping variable to get two separate smoothed lines (mostly because geom_smooth isn't there at all).
My guess is that the source of all these problems lies in the way I extracted the month-day format and ended up with a character variable.
Could anyone help me fix it? Any hints appreciated.
Looks like I found a solution to work with:
ggplot(pmWaw, aes(x=month.day, y=pm25, group = year)) +
geom_point(alpha=0.5) +
geom_smooth(aes(color=factor(year)))
This solves issues 2 and 3: geom_smooth shows up and I can distinguish the years. Probably not the best solution, but it might be a good place to start.
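The remaining issue 1 (the character x axis) can be handled by parking every observation in a single dummy year so that ggplot gets a real Date scale. This is a sketch of my own, not from the original answer; the dummy year 2000 is an arbitrary choice (it is a leap year, so "02-29" parses too), and the toy data stands in for pmWaw:

```r
library(ggplot2)

# toy stand-in for pmWaw with a character month.day column
pmWaw <- data.frame(
  month.day = c("01-01", "01-02", "06-15", "12-30"),
  pm25      = c(100, 63, 20, 36),
  year      = c(2014, 2014, 2015, 2015)
)

# map every month.day onto the same dummy year to get a Date axis
pmWaw$md_date <- as.Date(paste0("2000-", pmWaw$month.day))

ggplot(pmWaw, aes(x = md_date, y = pm25, color = factor(year))) +
  geom_point(alpha = 0.5) +
  scale_x_date(date_labels = "%b")
```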
I have a huge dataset similar to the following reproducible sample data.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
I want to aggregate this data to weekly level to get the output similar to the following:
Interval value
1 Week 2, June 2012 *aggregate value for day 10 to day 14 of June 2012*
2 Week 3, June 2012 *aggregate value for day 15 to day 21 of June 2012*
3 Week 4, June 2012 *aggregate value for day 22 to day 28 of June 2012*
4 Week 5, June 2012 *aggregate value for day 29 to day 30 of June 2012*
5 Week 1, July 2012 *aggregate value for day 1 to day 7 of July 2012*
6 Week 2, July 2012 *aggregate value for day 8 to day 10 of July 2012*
How do I achieve this easily without writing a long code?
If you mean the sum of ‘value’ by week, I think the easiest way to do it is to convert the data into an xts object, as GSee suggested:
data <- as.xts(data$value, order.by = as.Date(data$Interval))
weekly <- apply.weekly(data,sum)
[,1]
2012-06-10 552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902
I leave the formatting of the output as an exercise for you :-)
If you were to use week from lubridate, you would only get five weeks to pass to by. Assume dat is your data,
> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24 552 4146 4188 3759 4529 4850
# 25 490 2498 4256 3396 4438 5156
# 26 564 2578 4206 3355 4346 4866
# 27 698 993 4868 3366 5122 5770
# 28 671 1086 3200 3200 5314 5726
This shows a summary for the 24th through 28th week of the year. Similarly, we can get the means with aggregate with
> aggregate(value~week(Interval), data = dat, mean)
# week(Interval) value
# 1 24 3758.667
# 2 25 3396.286
# 3 26 3355.000
# 4 27 3366.429
# 5 28 3199.500
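For reference, a swapped-in alternative (my own addition, not from the original answers) is lubridate::floor_date, which labels each week by its starting date rather than a week number; by default weeks start on Sunday:

```r
library(lubridate)

# toy stand-in for dat
dat <- data.frame(
  Interval = as.Date(c("2012-06-10", "2012-06-11", "2012-06-18")),
  value    = c(552, 4850, 5156)
)

# collapse each date to the Sunday that starts its week, then sum
dat$week_start <- floor_date(dat$Interval, unit = "week")
res <- aggregate(value ~ week_start, data = dat, FUN = sum)
res
```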
I just came across this old question because it was used as a dupe target.
Unfortunately, all the upvoted answers (except the one by konvas and a now-deleted one) present solutions for aggregating the data by week of the year, while the OP asked to aggregate by week of the month.
The definition of week of the year and week of the month is ambiguous as discussed here, here, and here.
However, the OP has indicated that he wants to count days 1 to 7 of each month as week 1 of the month, days 8 to 14 as week 2, etc. Note that under this scheme week 5 is a stub of only 2 or 3 days for most months (February has no week 5 at all, except a single day in leap years).
Having prepared the ground, here is a data.table solution for this kind of aggregation:
library(data.table)
DT[, .(value = sum(value)),
by = .(Interval = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012 5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
We can verify that we have picked the correct intervals by
DT[, .(value = sum(value),
date_range = toString(range(Interval))),
by = .(Week = sprintf("Week %i, %s",
(mday(Interval) -1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Week value date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012 5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10
which is in line with OP's specification.
Data
library(data.table)
DT <- fread(
"rn Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176", drop = 1L)
DT[, Interval := as.Date(Interval)]
If you are using a data frame, you can easily do this with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum will get the aggregate by week.
library(tidyquant)
df
#> # A tibble: 31 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-11 4850
#> 3 2012-06-12 4642
#> 4 2012-06-13 4132
#> 5 2012-06-14 4190
#> 6 2012-06-15 4186
#> 7 2012-06-16 1139
#> 8 2012-06-17 490
#> 9 2012-06-18 5156
#> 10 2012-06-19 4430
#> # ... with 21 more rows
df %>%
tq_transmute(select = value,
mutate_fun = apply.weekly,
FUN = sum)
#> # A tibble: 6 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902
When you say "aggregate" the values, do you mean take their sum? Let's say your data frame is d; assuming d$Interval is of class Date, you can try
# if d$Interval is not of class Date d$Interval <- as.Date(d$Interval)
formatdate <- function(date)
paste0("Week ", (as.numeric(format(date, "%d")) - 1) %/% 7 + 1,
", ", format(date, "%b %Y"))
# change "mean" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), mean)
# Group.1 x
# 1 Week 1, Jul 2012 3725.667
# 2 Week 2, Jul 2012 3199.500
# 3 Week 2, Jun 2012 3544.000
# 4 Week 3, Jun 2012 3434.000
# 5 Week 4, Jun 2012 3333.143
# 6 Week 5, Jun 2012 3158.667
I need to calculate the max value contained between the beginning of the day and the moment when the min value happened. This is a toy example of my dataset for one day and one dendro:
TIMESTAMP year DOY ring dendro diameter
1 2013-05-02 00:00:00 2013 122 1 1 3405
2 2013-05-02 00:15:00 2013 122 1 1 3317
3 2013-05-02 00:30:00 2013 122 1 1 3217
4 2013-05-02 00:45:00 2013 122 1 1 3026
5 2013-05-02 01:00:00 2013 122 1 1 4438
6 2013-05-03 00:00:00 2013 123 1 1 3444
7 2013-05-03 00:15:00 2013 123 1 1 3410
8 2013-05-03 00:30:00 2013 123 1 1 3168
9 2013-05-03 00:45:00 2013 123 1 1 3373
10 2013-05-02 00:00:00 2013 122 2 4 5590
11 2013-05-02 00:15:00 2013 122 2 4 5602
12 2013-05-02 00:30:00 2013 122 2 4 5515
13 2013-05-02 00:45:00 2013 122 2 4 4509
14 2013-05-02 01:00:00 2013 122 2 4 5566
15 2013-05-02 01:15:00 2013 122 2 4 6529
First, I calculated the MIN diameter for each day (DOY= day of the year) in each dendro (contained in one ring), also getting the time at what that min value happened:
library(plyr)
dailymin <- ddply(datamelt, .(year, DOY, ring, dendro),function(x)x[which.min(x$diameter), ])
Now, my problem is that I want to calculate the MAX diameter for each day. However, sometimes the max value occurs after the min value, and I am only interested in the max that occurred BEFORE the min; I am not interested in the overall daily max if it happened after the min. Therefore, for each DAY I need the max value within the time interval from the beginning of the day (00:00:00) to the min diameter. As I did with the min, I also need to know at what time that max value happened. This is what I want from the previous df:
year DOY ring dendro timeMin min timeMax max
1 2013 122 1 1 2013-05-02 00:45:00 3026 2013-05-02 00:00:00 3405
2 2013 123 1 1 2013-05-03 00:30:00 3168 2013-05-03 00:00:00 3444
3 2013 122 2 4 2013-05-02 00:45:00 4509 2013-05-02 00:15:00 5602
As you can see, the min value is the actual min value. However, the max value I want is not the max value of the day, it is the max value that happened between the beginning of the day and the min value.
My first attempt was unsuccessful: it returns the max value of the whole day, even if it falls outside the desired time interval:
dailymax <- ddply(datamelt, .(year, DOY, ring, dendro),
function(x)x[which.max(x$diameter[1:which.min(datamelt$diameter)]), ])
Any ideas?
In a data.table, you could write:
DT[,{
istar <- which.min(diameter)
list(
dmin=diameter[istar],
prevmax=max(diameter[1:istar])
)},by='year,DOY,ring,dendro']
# year DOY ring dendro dmin prevmax
# 1: 2013 242 6 8 470 477.2
I assume that a similar function can be written with your **ply functions.
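Since the question used plyr, here is my own translation of the same logic into ddply, assuming the datamelt columns from the question; the toy data below stands in for one group of datamelt:

```r
library(plyr)

# toy stand-in for datamelt (one day, one ring, one dendro)
datamelt <- data.frame(
  TIMESTAMP = c("00:00:00", "00:15:00", "00:30:00", "00:45:00", "01:00:00"),
  year = 2013, DOY = 122, ring = 1, dendro = 1,
  diameter = c(3405, 3317, 3217, 3026, 4438)
)

daily <- ddply(datamelt, .(year, DOY, ring, dendro), function(x) {
  istar  <- which.min(x$diameter)           # position of the daily min
  istar2 <- which.max(x$diameter[1:istar])  # max before (or at) the min
  data.frame(timeMin = x$TIMESTAMP[istar],  min = x$diameter[istar],
             timeMax = x$TIMESTAMP[istar2], max = x$diameter[istar2])
})
```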
EDIT1: where DT comes from...
require(data.table)
DT <- as.data.table(read.table(header = TRUE, text = '
date TIMESTAMP year DOY ring dendro diameter
1928419 2013-08-30 00:00:00 2013 242 6 8 471.5
1928420 2013-08-30 01:30:00 2013 242 6 8 477.2
1928421 2013-08-30 03:00:00 2013 242 6 8 474.7
1928422 2013-08-30 04:30:00 2013 242 6 8 470.0
1928423 2013-08-30 06:00:00 2013 242 6 8 475.6
1928424 2013-08-30 08:30:00 2013 242 6 8 478.7'))
Your "TIMESTAMP" has a space in it, so I'm reading it as two columns, with the first called "date". Paste them together if you like. Next time, you can look into making a "reproducible example", as described here: How to make a great R reproducible example?
EDIT2: For the time of the max and min:
DT[,{
istar <- which.min(diameter)
istar2 <- which.max(diameter[1:istar])
list(
dmin = diameter[istar],
tmin = TIMESTAMP[istar],
dmax = diameter[istar2],
tmax = TIMESTAMP[istar2]
)},by='year,DOY,ring,dendro']
# year DOY ring dendro dmin tmin dmax tmax
# 1: 2013 242 6 8 470 04:30:00 477.2 01:30:00
As mentioned in EDIT1, I don't have both pieces of your TIMESTAMP variable in a single column because you did not provide them that way. To add more columns, just add new expressions in the list() above. The idea behind the code is that the {} expression is a code block where you can work with the variables in the chunk of data associated with each year,DOY,ring,dendro combination and return a list of new columns.
I want to generate a row (with zero ammount) for each missing month (up to the current one) in the following data frame. Can you please give me a hand with this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...
You can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
and if you need months up to the current one, use this:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
note that it is sufficient to call simply seq instead of seq.Date if the parameters are Date class.
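As a side note, the same gap-filling can be written with tidyr::complete. This is my own addition, not from the original answer, assuming the df from above (toy data standing in here):

```r
library(tidyr)

# toy stand-in for df
df <- data.frame(
  trans_date = as.Date(c("2004-12-01", "2005-04-01")),
  ammount    = c(2968.91, 500.62)
)

# one row per month, filling the missing months with 0
all_months <- seq.Date(min(df$trans_date), max(df$trans_date), by = "month")
df2 <- complete(df, trans_date = all_months, fill = list(ammount = 0))
```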
If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)