I need to aggregate my Kg data
Data Kg
1 2013-03-01 271
2 2013-03-06 374
3 2013-03-07 51
4 2013-03-12 210
5 2013-03-13 698
6 2013-03-15 328
by week or by month. I found an answer here on Stack Overflow, but I really don't understand it. Can someone show me how to do this? Thanks.
The answer you mention suggests using the xts package.
library(xts)
## create your zoo object from your data
## to read from a file instead, replace the text argument with read.zoo(yourfile, header = TRUE)
x.zoo <- read.zoo(text = ' Data Kg
1 2013-03-01 271
2 2013-03-06 374
3 2013-03-07 51
4 2013-03-12 210
5 2013-03-13 698
6 2013-03-15 328', header = TRUE)
### then aggregate
apply.weekly(x.zoo, mean) ## per week
apply.monthly(x.zoo, mean) ## per month
see ?apply.weekly and ?apply.monthly:
Essentially a wrapper to the xts functions endpoints and period.apply, mainly as a convenience.
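A minimal sketch of roughly what apply.weekly() is doing under the hood (assuming x.zoo from above; the actual internals may differ slightly):
x.xts <- as.xts(x.zoo)                       # the xts helpers want an xts object
ep <- endpoints(x.xts, on = "weeks")         # indices of the last observation in each week
period.apply(x.xts, INDEX = ep, FUN = mean)  # apply mean over each weekly block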
Or, you could use tapply to apply a function by groups of weeks. Here I am using the lubridate package to extract the week number from a date.
# fake data
df <- structure(list(Datechar = c("2013-03-01", "2013-03-06", "2013-03-07",
"2013-03-12", "2013-03-13", "2013-03-15"), Kg = c(271L, 374L,
51L, 210L, 698L, 328L)), .Names = c("Datechar", "Kg"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
# convert character to date
df$Date <- as.Date(df$Datechar)
# calculate mean kg for each week
library(lubridate)
tapply(df$Kg, week(df$Date), mean)
tapply(df$Kg, month(df$Date), mean)
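If you prefer a data frame result over the named vector that tapply returns, the same grouping can be done with aggregate (a minimal sketch reusing df and lubridate from above):
aggregate(Kg ~ week(Date), data = df, FUN = mean)   # mean Kg per week number
aggregate(Kg ~ month(Date), data = df, FUN = mean)  # mean Kg per month number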
Related
I have daily data for 7 years. I want to group this into weekly data (based on the actual date) and sum the frequency.
Date Frequency
1 2014-01-01 179
2 2014-01-02 82
3 2014-01-03 89
4 2014-01-04 109
5 2014-01-05 90
6 2014-01-06 66
7 2014-01-07 75
8 2014-01-08 106
9 2014-01-09 89
10 2014-01-10 82
What is the best way to achieve that? Thank you
These solutions all use base R and differ only in the definition and labelling of weeks.
1) cut the dates into weeks and then aggregate over those. Weeks start on Monday but you can add start.on.monday=FALSE to cut to start them on Sunday if you prefer.
Week <- as.Date(cut(DF$Date, "week"))
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2013-12-30 549
## 2 2014-01-06 418
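As noted above, to start the weeks on Sunday instead, pass start.on.monday = FALSE to cut (a minimal variation of the same code):
Week <- as.Date(cut(DF$Date, "week", start.on.monday = FALSE))
aggregate(Frequency ~ Week, DF, sum)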
2) If you prefer to define a week as 7 days starting with DF$Date[1] and label them according to the first date in that week then use this. (Add 6 to Week if you prefer the last date in the week.)
weekno <- as.numeric(DF$Date - DF$Date[1]) %/% 7
Week <- DF$Date[1] + 7 * weekno
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2014-01-01 690
## 2 2014-01-08 277
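Per the parenthetical above, labelling each 7-day block by its last date is just a shift of 6 days:
Week <- DF$Date[1] + 7 * weekno + 6
aggregate(Frequency ~ Week, DF, sum)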
3) or if you prefer to label it with the first date existing in DF in that week then use this. This and the last Week definition give the same result if there are no missing dates as is the case here. (If you want the last existing date in the week rather than the first then replace match with findInterval.)
weekno <- as.numeric(DF$Date - DF$Date[1]) %/% 7
Week <- DF$Date[match(weekno, weekno)]
aggregate(Frequency ~ Week, DF, sum)
## Week Frequency
## 1 2014-01-01 690
## 2 2014-01-08 277
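And the findInterval variant mentioned above, labelling each week by the last date existing in DF in that week:
weekno <- as.numeric(DF$Date - DF$Date[1]) %/% 7
Week <- DF$Date[findInterval(weekno, weekno)]
aggregate(Frequency ~ Week, DF, sum)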
Note
The input in reproducible form is assumed to be:
Lines <- "Date Frequency
1 2014-01-01 179
2 2014-01-02 82
3 2014-01-03 89
4 2014-01-04 109
5 2014-01-05 90
6 2014-01-06 66
7 2014-01-07 75
8 2014-01-08 106
9 2014-01-09 89
10 2014-01-10 82"
DF <- read.table(text = Lines)
DF$Date <- as.Date(DF$Date)
I would use library(lubridate).
df <- read.table(header = TRUE,text = "date Frequency
2014-01-01 179
2014-01-02 82
2014-01-03 89
2014-01-04 109
2014-01-05 90
2014-01-06 66
2014-01-07 75
2014-01-08 106
2014-01-09 89
2014-01-10 82")
You can use base R or library(dplyr):
base R:
to make sure that the date is really a Date (ymd() and week() come from lubridate):
df$date <- ymd(df$date)
df$week <- week(df$date)
or, more briefly:
df$week <- week(ymd(df$date))
or dplyr:
library(dplyr)
df %>%
mutate(week = week(ymd(date))) %>%
group_by(week) %>%
summarise(Frequency = sum(Frequency))
Out:
   week Frequency
1     1       690
2     2       277
Barring a good reason not to, use ISO weeks so that your aggregation intervals are all the same length.
data.table makes short work of this:
library(data.table)
setDT(myDF) # convert to data.table
myDF[ , .(weekly_freq = sum(Frequency)), by = isoweek(Date)]
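With the sample data above this should give roughly the following (isoweek() starts weeks on Monday, so ISO week 1 here covers 2014-01-01 to 2014-01-05):
   isoweek weekly_freq
1:       1         549
2:       2         418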
Maybe you can try the base R code with aggregate + format, i.e.,
dfout <- aggregate(Frequency ~ yearweek, within(df, yearweek <- format(Date, "%Y,%W")), sum)
such that
> dfout
yearweek Frequency
1 2014,00 549
2 2014,01 418
DATA
df <- structure(list(Date = structure(c(16071, 16072, 16073, 16074,
16075, 16076, 16077, 16078, 16079, 16080), class = "Date"), Frequency = c(179L,
82L, 89L, 109L, 90L, 66L, 75L, 106L, 89L, 82L)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
The slider package from RStudio addresses this problem directly, including specification of the start of the weekly periods. Suppose the weekly periods were to start on a Monday, so that the beginning of the first week would be Monday, 2013-12-30. Then the slider solution would be
library(slider)
slide_period_dfr(.x = DF, .i=as.Date(DF$Date),
.period = "week",
.f = ~data.frame(week_ending = tail(.x$Date,1),
week_freq = sum(.x$Frequency)),
.origin = as.Date("2013-12-30"))
with the result
week_ending week_freq
1 2014-01-05 549
2 2014-01-10 418
I am new to R and currently working on some rainfall data. I have two data frames named df1 and df2.
df1
Date Duration_sum
5/28/2014 110
5/31/2014 20
5/31/2014 20
6/1/2014 10
6/1/2014 110
6/3/2014 140
6/4/2014 40
6/5/2014 60
6/12/2014 10
6/14/2014 100
df2
Date PercentRemoval
6/2/2014 25.8
6/5/2014 78.58
6/6/2014 15.6
6/13/2014 70.06
I want to look up the dates from df2 in df1. For example, if the 1st date from df2 is available in df1, I want to subset rows in df1 within the range of that specific date and 3 days prior to that. If that date is not available, then just look for the previous 3 days.
If the data for the previous 3 days are not all available, it should extract as many days as are available, up to a maximum of 3 days prior to the specific date in df2. If none of the dates are available in df1, that df2 date is ignored and we move on to the next date in df2. Also, for example, the 3 days prior to 6/6/2014 are available in df1, but we have already considered those days for 6/5/2014, so 6/6/2014 is ignored.
The resulted data frame should look something like this:
df3
col_1 Date Duration_sum
5/31/2014 20
5/31/2014 20
6/1/2014 10
6/2/2014 6/1/2014 110
6/3/2014 140
6/4/2014 40
6/5/2014 6/5/2014 60
6/13/2014 6/12/2014 10
I have used this code:
df3 <- df1[as.Date(df1$Date, "%m/%d/%Y") %in% as.Date(df2$Date, "%m/%d/%Y"), ]
This code gives me the results for the specific dates but not for the previous 3 days. I would really appreciate it if someone could help me with this code or suggest another approach. Thanks in advance.
This may be one way to do the task. If I am reading your question correctly, you want to remove any df2 date that comes less than 3 days after the previous one. In this way you avoid the overlapping issue you mentioned in your question: the 6th of June, 2014 is successfully removed. Once you have filtered the dates in df2, you subset df1 for each remaining date in the lapply() part. The output is a list, so you assign names to each data frame in the list and finally bind all the data frames together.
library(dplyr)
mutate(df1, Date = as.Date(Date, format = "%m/%d/%Y")) -> df1
mutate(df2, Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  filter(!(Date - lag(Date, default = as.Date("1900-01-01")) < 3)) -> df2  # drop any df2 date less than 3 days after the previous one (recent dplyr requires a Date default here)
lapply(df2$Date, function(x){
filter(df1, between(Date, x - 3, x)) -> foo  # rows of df1 within [x - 3 days, x]
foo
}) -> temp
names(temp) <- as.character(df2$Date)
bind_rows(temp, .id = "df2.date")
# df2.date Date Duration_sum
#1 2014-06-02 2014-05-31 20
#2 2014-06-02 2014-05-31 20
#3 2014-06-02 2014-06-01 10
#4 2014-06-02 2014-06-01 110
#5 2014-06-05 2014-06-03 140
#6 2014-06-05 2014-06-04 40
#7 2014-06-05 2014-06-05 60
#8 2014-06-13 2014-06-12 10
DATA
df1 <- structure(list(Date = c("5/28/2014", "5/31/2014", "5/31/2014",
"6/1/2014", "6/1/2014", "6/3/2014", "6/4/2014", "6/5/2014", "6/12/2014",
"6/14/2014"), Duration_sum = c(110L, 20L, 20L, 10L, 110L, 140L,
40L, 60L, 10L, 100L)), .Names = c("Date", "Duration_sum"), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(Date = c("6/2/2014", "6/5/2014", "6/6/2014", "6/13/2014"
), PercentRemoval = c(25.8, 78.58, 15.6, 70.06)), .Names = c("Date",
"PercentRemoval"), class = "data.frame", row.names = c(NA, -4L
))
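For comparison only (not part of the answer above), here is a rough base R sketch of the same logic; column names follow the question and dates are assumed to be in m/d/Y format:
d1 <- as.Date(df1$Date, "%m/%d/%Y")
d2 <- as.Date(df2$Date, "%m/%d/%Y")
d2 <- d2[c(TRUE, diff(d2) >= 3)]                     # drop df2 dates less than 3 days after the previous one
res <- lapply(d2, function(x) df1[d1 >= x - 3 & d1 <= x, ])
names(res) <- format(d2)
do.call(rbind, res)                                  # row names indicate which df2 date each block belongs to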
I just started with R and finished some tutorials. However, I am trying to get into time series analysis and am having big trouble with it. I made a data frame that looks like this:
Date Time T1
1 2014-05-22 15:15:00 21.6
2 2014-05-22 15:20:00 21.2
3 2014-05-22 15:25:00 21.3
4 2014-05-22 15:30:00 21.5
5 2014-05-22 15:35:00 21.1
6 2014-05-22 15:40:00 21.5
Since I didn't want to work with half days, I removed the first and last day from the data frame. Since R recognized neither the date nor the time as such, but as "factor", I used the lubridate library to convert them. Now it looks like this:
Date Time T1
1 2014-05-23 0S 14.2
2 2014-05-23 5M 0S 14.1
3 2014-05-23 10M 0S 14.6
4 2014-05-23 15M 0S 14.3
5 2014-05-23 20M 0S 14.4
6 2014-05-23 25M 0S 14.5
Now the trouble really starts. Using the ts function changes the date to 16944 and the time to 0. How do I set up a data frame with the correct start date and frequency? A new set of data comes in every 5 minutes, so the frequency should be 288. I also tried to set the start date as a vector. Since the 22nd of May was the 142nd day of the year, I tried this:
ts_df <- ts(df, start=c(2014, 142/365), frequency=288)
No error there, but when I call start(ts_df) and end(ts_df) I get:
[1] 2013.998
[1] 2058.994
Can anyone give me a hint on how to work with this kind of data?
"ts" class is typically not a good fit for that type of data. Assuming DF is the data frame shown reproducibly in the Note at the end of this answer we convert it to a "zoo" class object and then perform some manipulations. The related xts package could also be used.
library(zoo)
z <- read.zoo(DF, index = 1:2, tz = "")
window(z, start = "2014-05-22 15:25:00")
head(z, 3) # first 3
head(z, -3) # all but last 3
tail(z, 3) # last 3
tail(z, -3) # all but first 3
z[2:4] # 2nd, 3rd and 4th element of z
coredata(z) # numeric vector of data values
time(z) # vector of datetimes
fortify.zoo(z) # data frame whose 2 cols are (1) datetimes and (2) data values
aggregate(z, as.Date, mean) # aggregate to daily data by averaging the values
ym <- aggregate(z, as.yearmon, mean) # aggregate to monthly data by averaging the values
frequency(ym) <- 12 # only needed since ym only has length 1
as.ts(ym) # year/month series can be reasonably converted to ts
plot(z)
library(ggplot2)
autoplot(z)
read.zoo could also have been used to read the data in from a file.
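For example, if the same three columns lived in a text file, something like the following should work (the file name is just a placeholder):
z <- read.zoo("mydata.txt", header = TRUE, index = 1:2, tz = "")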
Note: DF used above in reproducible form:
DF <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2014-05-22",
class = "factor"),
Time = structure(1:6, .Label = c("15:15:00", "15:20:00",
"15:25:00", "15:30:00", "15:35:00", "15:40:00"), class = "factor"),
T1 = c(21.6, 21.2, 21.3, 21.5, 21.1, 21.5)), .Names = c("Date",
"Time", "T1"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
I wanted to plot a graph of n (y-axis) versus date (x-axis) in R, but due to the format of the dates in my data, the dates are not in the correct ascending order. How can I solve this? I appreciate the help.
library(XLConnect)  # readWorksheetFromFile() and friends come from XLConnect
hybrid <- readWorksheetFromFile(excel.file, sheet="ResultSet", header=TRUE)
wb <- loadWorkbook(excel.file)
setMissingValue(wb,value=c("NA"))
hybrid1 <- readWorksheet(wb, sheet="ResultSet", header=TRUE)
I used dplyr. Suppose each Pub.Number has a unique code and I replaced it with one; then I count the number of occurrences for each date.
hybrid <- mutate(hybrid1, n=sum(Publication.Number=1))
p1 <- select(hybrid1, Publication.Date, n)
pt <- count(p1, Publication.Date, wt=n)
The output look like this:
pt
Source: local data frame [627 x 2]
Publication.Date n
(chr) (dbl)
1 01.01.2013 1
2 01.01.2014 8
3 01.01.2015 10
4 01.02.2012 3
5 01.03.2012 16
6 01.04.2015 2
7 01.05.2012 1
8 01.05.2013 7
9 01.05.2014 23
10 01.06.2011 1
.. ... ...
Then I plotted it, but R recognizes Publication.Date as character:
qplot(x=Publication.Date, y=n, data=pt, geom="point")
x <- hybrid1[,2]
class(x)
[1] "character"
The graph I've plotted is a mess because of the wrong order of the dates. I tried using the as.Date function, but it seems the command never completes (I'm using R version 3.2.2):
> pt[,1] <- as.Date(pt[,1], format='%d.%m.%Y’)
+
First convert 'Publication.Date' to Date format, then order:
using your data:
data <- read.table(pipe('pbpaste'), sep = '', header = TRUE, stringsAsFactors = FALSE)  # reads the pasted data from the clipboard (macOS)
data <- data[,-1]
names(data) <- c('Pub.Date', 'n')
Pub.Date n
1 01.01.2014 8
2 01.01.2015 10
3 01.02.2012 3
4 01.03.2012 16
5 01.04.2015 2
6 01.05.2012 1
7 01.05.2013 7
8 01.05.2014 23
9 01.06.2011 1
convert 'Pub.Date' to date format:
data[,1] <- as.Date(data[,1], format = '%d.%m.%Y')
and order:
data[order(data$"Pub.Date",data$n), ]
Pub.Date n
9 2011-06-01 1
3 2012-02-01 3
4 2012-03-01 16
6 2012-05-01 1
7 2013-05-01 7
1 2014-01-01 8
8 2014-05-01 23
2 2015-01-01 10
5 2015-04-01 2
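Once Pub.Date is a real Date, ggplot2 orders the x-axis chronologically on its own; a minimal sketch with the data frame above (qplot as used in the question):
library(ggplot2)
qplot(x = Pub.Date, y = n, data = data, geom = "point")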
In the usual course of data input with R, values like "01.01.2013" will become factor variables. Since they are not in one of the two standard Date formats (YYYY/MM/DD or YYYY-MM-DD), they cannot be input directly as "Date"s with "colClasses" unless you build an "as.DT" method. You will need to make sure they are character vectors, either by using stringsAsFactors=FALSE in a read function or by coercing to character with as.character after they are input. That header you have displayed makes me think this data has been manipulated somehow, perhaps with functions in the dplyr package?
res <- structure(list(Publication.Date = structure(1:10, .Label = c("01.01.2013",
"01.01.2014", "01.01.2015", "01.02.2012", "01.03.2012", "01.04.2015",
"01.05.2012", "01.05.2013", "01.05.2014", "01.06.2011"), class = "factor"),
n = c(1L, 8L, 10L, 3L, 16L, 2L, 1L, 7L, 23L, 1L)), .Names = c("Publication.Date",
"n"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"))
> res
Publication.Date n
1 01.01.2013 1
2 01.01.2014 8
3 01.01.2015 10
4 01.02.2012 3
5 01.03.2012 16
6 01.04.2015 2
7 01.05.2012 1
8 01.05.2013 7
9 01.05.2014 23
10 01.06.2011 1
> res$Publication.Date <- as.Date(as.character(res$Publication.Date), format="%d.%m.%Y")
Then you can plot (qplot needs ggplot2 loaded):
library(ggplot2); png(); qplot(x=Publication.Date, y=n, data=res, geom="point"); dev.off()