Aggregating monthly sums and then getting the mean of all monthly sums - R

The list below shows the sum of the data for each month in a time series, produced with the following snippet:
aggregate(data, by=list(Year=format(DateTime, "%Y"), Month=format(DateTime, "%m")), sum, na.rm=TRUE)
Year Month x
1 1981 01 62426.43
2 1982 01 70328.87
3 1983 01 67516.34
4 1984 01 64454.00
5 1985 01 78801.46
6 1986 01 73865.18
7 1987 01 64224.96
8 1988 01 72362.39
9 1981 02 74835.16
10 1982 02 75275.58
11 1983 02 67457.39
12 1984 02 64981.99
13 1985 02 56490.10
14 1986 02 62759.89
15 1987 02 65144.44
16 1988 02 67704.67
This part is easy... but I am tripping up on trying to get an average of all the monthly sums for each month (i.e. one average of the sums for each month across the years).
If I do the following:
aggregate(data, by=list(Month=format(DateTime, "%m")), sum, na.rm=TRUE)
I just get the total sum for each month across all the years, which is not what I want. Can I achieve the desired result in one aggregate statement, or do I need more code? Any help would be appreciated.

You could also have done it with a single call to aggregate:
aggregate(data,
          by = list(Year = format(DateTime, "%Y"), Month = format(DateTime, "%m")),
          FUN = function(x){ sum(x, na.rm=TRUE) / sum(!is.na(x)) }
)

You can do it with 2 aggregate statements:
aggregate(x ~ Month,
          aggregate(data, by=list(Year=format(DateTime, "%Y"), Month=format(DateTime, "%m")), sum, na.rm=TRUE),
          mean)
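For reference, here is a minimal self-contained sketch of the same two-step idea with made-up daily data (the object names data and DateTime mirror the question; the example values are invented purely for illustration):
set.seed(1)
DateTime <- seq(as.Date("1981-01-01"), as.Date("1988-12-31"), by = "day")
data <- data.frame(x = runif(length(DateTime), min = 0, max = 3000))

# step 1: sum per year-month
monthly_sums <- aggregate(data,
                          by = list(Year = format(DateTime, "%Y"),
                                    Month = format(DateTime, "%m")),
                          sum, na.rm = TRUE)

# step 2: average the monthly sums for each month across the years
aggregate(x ~ Month, monthly_sums, mean)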

Related

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station for approximately 50 years. The data are arranged into three columns, namely "date", "month", and "discharge". (Sample data shown below.)
Date <- as.Date(c('1938-10-01','1954-10-27','1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
Months <- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
Dis <- c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
Sampledata <- data.frame("Date"=Date, "Months"=Months, "Disch"=Dis)
print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately across all the years. For example: calculate ranks in ascending order for the month of January over 50 years, with the same rank assigned to duplicate discharge values. Desired output shown below:
Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages, first convert columns 2 and 3 to numeric, then use ave and rank with the indicated ties method, and finally order the result.
Note that the output shown in the question does not correspond to the input (e.g. there are three Mar rows in the output but only two such rows in the input), so the result below corresponds to the input but will not be identical to the output shown.
# convert Disch to numeric and Months to the month number
Sampledata2 <- transform(Sampledata,
    Disch = as.numeric(as.character(Disch)),
    Months = as.numeric(format(Date, "%m")))

# rank within each month; ties get the minimum rank
Rank <- function(x) rank(x, ties = "min")
Sampledata3 <- transform(Sampledata2,
    Rank = ave(Disch, Months, FUN = Rank))

# order by month and then date
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
An option would be to group by 'Month' and use one of the ranking functions (dense_rank(), row_number(), or min_rank(), depending on the need) to rank the 'Discharge' column (df1 below stands for the input data, with the discharge values already numeric):
library(dplyr)
df1 %>%
  group_by(Month) %>%
  mutate(Rank = dense_rank(Discharge))
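Applied to the sample data from the question, a hedged sketch (it performs the numeric conversion inline and uses min_rank() so that duplicate discharge values share the lowest rank, as in the desired output) would be:
library(dplyr)
Sampledata %>%
  mutate(Disch = as.numeric(as.character(Disch))) %>%   # Disch was entered as character
  group_by(Months) %>%
  mutate(Rank = min_rank(Disch)) %>%
  arrange(Months, Date)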

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the 20-year overall average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
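If you also want the month names attached to the result, one possible follow-up (a sketch reusing the my_xts object built above; month.abb is R's built-in vector of month abbreviations) is:
monthly_means <- aggregate(my_xts$rand1, by = month(index(my_xts)), FUN = mean)
setNames(as.numeric(monthly_means), month.abb)   # label months 1..12 as Jan..Dec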

R Studio: look up a value in a table (both directions, vertical & horizontal), then use it as a variable in a loop

I am dealing with a dataset ("IndexTable") that has 3 million+ observations. Please see the following for the first 6 observations:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") in which the values will be looked up (part of the table shown):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" that matches the Year and Month, then multiply the observation's amount by that value to get the adjusted amount.
Thanks in advance J
Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
index_long <- index %>%
  gather(Month, multiplier, Jan:Dec) %>%
  select(-average)
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
  mutate(adjusted_amount = amount * multiplier)
First I gather the month columns into one column, with the values going into a multiplier column.
I drop the average column because it doesn't need to be joined to the other table. Then, using a left join, only values with a matching year-month combination are joined to the IndexTable.
Finally, I use the multiplier to create the new column adjusted_amount.
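A base R alternative for the same lookup, using matrix indexing, might look roughly like this (a sketch that assumes the column names shown above; cbind() builds one (row, column) pair per observation):
# month columns of 'index' as a numeric matrix
idx_mat <- as.matrix(index[, month.abb])
# locate the row (year) and column (month) of the multiplier for every observation
row_idx <- match(IndexTable$Year, index$year)
col_idx <- match(IndexTable$Month, colnames(idx_mat))
IndexTable$adjusted_amount <- IndexTable$amount * idx_mat[cbind(row_idx, col_idx)]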

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame that has hourly observational climate data over multiple years; I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or, in this example, by month) to find out if df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation. I have real NAs, as it is observational climate data.
I have tried dplyr piping with the %>% function to filter by a new column "Month", and have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something into a longer script, inside a loop that will go through all my stations and all the years at each station and produce a wind rose whenever this criterion is met for that year/station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
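To turn these NA percentages into the 75%-valid test from the question, one possible follow-up (a sketch that just stores the sapply() result under an assumed name, pct_na) is:
pct_na <- sapply(out, function(x) (sum(is.na(x$WS)) / nrow(x)) * 100)
# months where at least 75% of the WS values are valid (i.e. at most 25% NA)
names(pct_na)[pct_na <= 25]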
You could also use data.table.
library(data.table)
setDT(df)
# 'monthyear' is the column created above with format(df$dateTime, format = "%m %Y")
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
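The equivalent 75% check in data.table form might look like this (a sketch; pct_na is just an assumed column name):
df[, .(pct_na = sum(is.na(WS)) / .N * 100), by = monthyear][pct_na <= 25]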
Here is a method using dplyr. It compares the number of valid observations against the number of hours in each month, so it works even if some hourly records are missing entirely.
library(lubridate)  # for the days_in_month function
library(dplyr)
df2 <- df %>%
  mutate(Month = format(dateTime, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(No.Obs = sum(!is.na(WS)),
            Max.Obs = 24 * days_in_month(as.Date(paste0(first(Month), "-01")))) %>%
  mutate(Obs.Rate = No.Obs / Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
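To keep only the months that meet the 75% criterion, a possible final step (a sketch reusing df2 from above) is:
df2 %>% filter(Obs.Rate >= 0.75)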

Get average value by month

I've used the zoo aggregate function to get the monthly average from daily data using:
monthlyMeanTemp <- aggregate(merged.precip.flow.and.T[,'E'], as.yearmon, mean, na.rm = TRUE) # 'E' is the temperature column
Here is the head and tail of the result:
Jan 1979 Feb 1979 Mar 1979 Apr 1979 May 1979 Jun 1979
-14.05354839 -11.83078929 -7.32150645 -0.03214333 6.16986774 14.00944000
…
Apr 1997 May 1997 Jun 1997 Jul 1997 Aug 1997 Sep 1997
1.438547 7.421910 12.764450 15.086206 17.376026 10.125013
Is it possible to get the mean by month (i.e., the mean of all the January values, the mean of all the February values, etc.) without resorting to padding missing months with NA, forming an n x 12 matrix (where n is the number of years), and then using the colMeans function?
...just found the answer. From the hydroTSM package:
monthlyfunction(merged.precip.flow.and.T[,'E'], FUN=mean, na.rm=TRUE)
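For completeness, a base zoo alternative that avoids an extra package might look like this (a sketch; it assumes monthlyMeanTemp is the yearmon-indexed zoo series created above):
tapply(coredata(monthlyMeanTemp),
       format(as.Date(index(monthlyMeanTemp)), "%m"),
       mean, na.rm = TRUE)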
