I’ve used zoo's aggregate function to get the monthly average from daily data using:
monthlyMeanTemp <- aggregate(merged.precip.flow.and.T[,'E'], as.yearmon, mean, na.rm = TRUE) # ‘E’ is the column of temperature
Here is the head and tail of the result:
Jan 1979 Feb 1979 Mar 1979 Apr 1979 May 1979 Jun 1979
-14.05354839 -11.83078929 -7.32150645 -0.03214333 6.16986774 14.00944000
…
Apr 1997 May 1997 Jun 1997 Jul 1997 Aug 1997 Sep 1997
1.438547 7.421910 12.764450 15.086206 17.376026 10.125013
Is it possible to get the mean by month (i.e., the mean of all the January values, the mean of all the February values, etc.) without resorting to padding missing months with NA, forming an n x 12 matrix (where n is the number of years), and then using the colMeans function?
...just found the answer: From the hydroTSM package: monthlyfunction(merged.precip.flow.and.T[,'E'], FUN=mean, na.rm=TRUE)
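For anyone who would rather stay within zoo: a minimal sketch, assuming a daily zoo series with a Date index. The series z below is made-up toy data (each value is simply its month number), not the original temperature column. Passing a function as the by argument makes aggregate group on whatever that function returns for each index value, here the calendar month with the year dropped.

```r
library(zoo)

# Toy daily series (hypothetical data): each value equals its month number
dates <- seq(as.Date("1979-01-01"), as.Date("1980-12-31"), by = "day")
z <- zoo(as.numeric(format(dates, "%m")), dates)

# Group on calendar month (1-12) regardless of year, then average
mean_by_month <- aggregate(z,
                           function(d) as.integer(format(d, "%m")),
                           mean, na.rm = TRUE)
mean_by_month
```

Note that this pools every daily value that falls in a given calendar month, so years with more non-missing days in that month carry more weight.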
I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the 20-year overall average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
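If you want the rows labelled Jan to Dec as in the question rather than 1 to 12, the month numbers can be mapped through the built-in month.abb vector. This is only a sketch in base R on made-up data (dates and values below are hypothetical), using tapply instead of aggregate so it runs without xts or lubridate.

```r
# Toy monthly data (hypothetical): 5 full years, values 1..60
dates  <- seq(as.Date("2000-01-01"), as.Date("2004-12-01"), by = "month")
values <- seq_along(dates)

# Average over all years for each calendar month
month_num <- as.integer(format(dates, "%m"))
m <- tapply(values, month_num, mean)

# Replace the numeric labels with Jan, Feb, ..., Dec
means <- setNames(as.vector(m), month.abb[as.integer(names(m))])
means
```

With these toy values the January entry is the mean of 1, 13, 25, 37 and 49.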
I have a wide data frame in R and I am trying to rename the column names so that I can reshape it to a long format.
Currently, the data is structured like this:
long lat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V477
I'd like to rename the columns so that they are:
long lat Jan_1979 Feb_1979 Mar_1979 Apr_1979 ... Sept_2018
I'm not sure how to go about doing this. Any help would be appreciated.
There are multiple ways you could do this.
One way in base R is to use seq to create monthly dates and then format them the way you need. For example, you could create the first 10 labels of the sequence starting from 1979-01-01 with
format(seq(as.Date('1979-01-01'), length.out = 10, by = "1 month"), "%b_%Y")
#[1] "Jan_1979" "Feb_1979" "Mar_1979" "Apr_1979" "May_1979" "Jun_1979" "Jul_1979"
#[8] "Aug_1979" "Sep_1979" "Oct_1979"
For your case, this should work
names(df)[3:479] <- format(seq(as.Date('1979-01-01'),
length.out = 477, by = "1 month"), "%b_%Y")
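To avoid hard-coding 477 and 479, the number of labels can be taken from the data frame itself. A small sketch, assuming the first two columns are always long and lat and every remaining column is one consecutive month starting in January 1979; the toy df below is made up for illustration.

```r
# Toy wide data frame (hypothetical): 2 coordinate columns + 3 monthly columns
df <- data.frame(long = c(10, 20), lat = c(50, 60),
                 V1 = 1:2, V2 = 3:4, V3 = 5:6)

# One label per non-coordinate column
n_months <- ncol(df) - 2
names(df)[-(1:2)] <- format(seq(as.Date("1979-01-01"),
                                length.out = n_months, by = "1 month"),
                            "%b_%Y")
names(df)
```

One thing to keep in mind is that "%b" gives month abbreviations in the current locale, so on a non-English system the labels will not be Jan, Feb, and so on. The built-in month.abb vector is always English.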
We can use expand.grid to get all month year combinations:
name_combn <- expand.grid(month.abb, 1979:2018)[1:477,]
names(df) <- c('long', 'lat', paste(name_combn$Var1, name_combn$Var2, sep = "_"))
Output:
> head(name_combn, 20)
Var1 Var2
1 Jan 1979
2 Feb 1979
3 Mar 1979
4 Apr 1979
5 May 1979
6 Jun 1979
7 Jul 1979
8 Aug 1979
9 Sep 1979
10 Oct 1979
11 Nov 1979
12 Dec 1979
13 Jan 1980
14 Feb 1980
15 Mar 1980
16 Apr 1980
17 May 1980
18 Jun 1980
19 Jul 1980
20 Aug 1980
I am dealing with a dataset ("IndexTable") that has 3 million+ observations. The first 6 observations are shown below:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") from which the value will be searched (part of the table):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" which matches the Year and Month, then multiply its amount by that value to get the adjusted amount.
Thanks in advance.
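Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
# reshape the index table to long format: one row per year/month
index_long <- index %>%
  gather(Month, multiplier, Jan:Dec) %>%
  select(-average)
# look up the multiplier for each observation and apply it
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
  mutate(adjusted_amount = amount * multiplier)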
First I gather the Month columns into one column with the value column multiplier.
I drop the average column, because it doesn't need to be joined to the other table. Then, by using a left join, only the values with a matching year/month combination are joined to the IndexTable.
Then finally I used the multiplier to create the new column adjusted_amount
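With millions of rows, another option is to skip the reshape and join and index the table directly as a matrix. This is a sketch in base R, not the method above; the toy index and IndexTable below are made-up stand-ins with the same column names as in the question. It assumes every Year in IndexTable appears in index$year and every Month is one of the abbreviations in month.abb; otherwise the lookup stops with a subscript error.

```r
# Toy index table (hypothetical values): rows are years, columns Jan..Dec
vals  <- matrix(seq_len(36), nrow = 3, byrow = TRUE,
                dimnames = list(NULL, month.abb))
index <- data.frame(year = 1985:1987, average = rowMeans(vals), vals)

# Toy observations (hypothetical)
IndexTable <- data.frame(amount = c(10, 20, 30),
                         Year   = c(1987, 1985, 1986),
                         Month  = c("Jan", "Mar", "Jul"))

# Matrix with years as row names and months as column names
lookup <- as.matrix(index[, month.abb])
rownames(lookup) <- as.character(index$year)

# A two-column character matrix picks one (row, column) cell per observation
key <- cbind(as.character(IndexTable$Year), as.character(IndexTable$Month))
IndexTable$multiplier      <- lookup[key]
IndexTable$adjusted_amount <- IndexTable$amount * IndexTable$multiplier
IndexTable
```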
The list below is the sum of the data for each month in a time series, produced with the following snippet:
aggregate(data, by=list(Year=format(DateTime, "%Y"), Month=format(DateTime, "%m")), sum, na.rm=TRUE)
Year Month x
1 1981 01 62426.43
2 1982 01 70328.87
3 1983 01 67516.34
4 1984 01 64454.00
5 1985 01 78801.46
6 1986 01 73865.18
7 1987 01 64224.96
8 1988 01 72362.39
9 1981 02 74835.16
10 1982 02 75275.58
11 1983 02 67457.39
12 1984 02 64981.99
13 1985 02 56490.10
14 1986 02 62759.89
15 1987 02 65144.44
16 1988 02 67704.67
This part is easy, but I am tripping up on getting an average of all the monthly sums for each month (i.e. one average of the sums for each calendar month).
If I do the following:
aggregate(data, by=list(Month=format(DateTime, "%m")), sum, na.rm=TRUE)
I just get a sum over all years for each month in the time series, which is not what I want. Can I achieve the desired result in one aggregate statement, or do I need more code? Any help would be appreciated.
You could also do it with a single call to aggregate; note that this computes the mean of the non-missing values within each year/month group:
aggregate(data,
by=list(Year=format(DateTime, "%Y"), Month=format(DateTime, "%m")),
FUN= function(x){ sum(x, na.rm=TRUE)/sum(!is.na(x))}
)
You can do it with 2 aggregate statements:
aggregate(x~Month, aggregate(data, by=list(Year=format(DateTime, "%Y"), Month=format(DateTime, "%m")), sum, na.rm=TRUE), mean)
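The nested call is easier to follow when split into its two steps. A small sketch on made-up data: DateTime and data below are toy stand-ins for the objects in the question, with every daily value set to 1 so that each monthly sum is just the number of days in that month.

```r
# Toy daily data (hypothetical): two non-leap years, every value is 1
DateTime <- seq(as.Date("1981-01-01"), as.Date("1982-12-31"), by = "day")
data     <- rep(1, length(DateTime))

# Step 1: one sum per year/month combination
monthly_sums <- aggregate(data,
                          by = list(Year  = format(DateTime, "%Y"),
                                    Month = format(DateTime, "%m")),
                          FUN = sum, na.rm = TRUE)

# Step 2: average those sums across years for each calendar month
mean_of_sums <- aggregate(x ~ Month, data = monthly_sums, FUN = mean)
mean_of_sums
```

The column is called x because aggregate converts the unnamed vector to a one-column data frame before grouping.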
I have a data set that includes cases by year and month. Some months are missing, and I'd like to create rows with a case count of zero for those months.
Here is an example, and my current brute force approach. Thanks for any pointers. Obviously, I'm new at this.
# fake data
library(plyr)
rm(FakeData)
FakeData <- data.frame(DischargeYear=c(rep(2010, 7), rep(2011,7)),
DischargeMonth=c(1:7, 3:9),
Cases=trunc(rnorm(14, mean=100, sd=20)))
# FakeData is missing data for some year/months
FakeData
# Brute force attempt to add rows with 0 and then total
for(i in 1:12){
for(j in 1:length(unique(FakeData$DischargeYear))){
FakeData <- rbind(FakeData, data.frame(
DischargeYear=unique(FakeData$DischargeYear)[j],
DischargeMonth=i,
Cases=0))
}
}
FakeData <- ddply(FakeData, c("DischargeYear","DischargeMonth"), summarise, Cases=sum(Cases))
# FakeData now has every year/month represented
FakeData
Using your FakeData data frame, try this:
# Create all combinations of months and years
allMonths <- expand.grid(DischargeMonth=1:12, DischargeYear=2010:2011)
# Keep all month-year combinations (all.x=TRUE) and add in 'Cases' from FakeData
allData <- merge(allMonths, FakeData, all.x=TRUE)
# 'allData' contains 'NA' for missing values. Set them to 0.
allData[is.na(allData)] <- 0
# Print results
allData
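A slightly more defensive variant of the same idea, as a sketch: it takes the years from the data instead of hard-coding 2010:2011, and it only zero-fills the Cases column, so NAs in any other column would be left alone. The FakeData below uses fixed made-up counts instead of rnorm so the result is reproducible.

```r
# Toy data with fixed counts (hypothetical values)
FakeData <- data.frame(DischargeYear  = c(rep(2010, 7), rep(2011, 7)),
                       DischargeMonth = c(1:7, 3:9),
                       Cases          = 101:114)

# Every month for every year that occurs in the data
allMonths <- expand.grid(DischargeMonth = 1:12,
                         DischargeYear  = sort(unique(FakeData$DischargeYear)))

# Keep all year/month rows, then zero-fill only the Cases column
allData <- merge(allMonths, FakeData, all.x = TRUE)
allData$Cases[is.na(allData$Cases)] <- 0

# merge sorts on the key columns in the order they are shared; reorder by date
allData <- allData[order(allData$DischargeYear, allData$DischargeMonth), ]
allData
```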
Another solution would be to use cast from the reshape package.
require(reshape)
cast(FakeData, DischargeYear + DischargeMonth ~ ., add.missing = TRUE, fill = 0)
Note that it only adds 0 for the missing combinations in the data, months 8, 9 for year 2010 and months 1 and 2 for year 2011. To ensure that you have all months 1:12, you can change the definition of DischargeMonth to be a factor with levels 1:12 using
FakeData = transform(FakeData,
DischargeMonth = factor(DischargeMonth, levels = 1:12))
Here is a zoo solution. Note that zoo FAQ #13 discusses forming the grid, g. Also we convert the year and month to a "yearmon" class variable which is represented as a year plus fractional month (0 = Jan, 1/12 = Feb, 2/12 = Mar, etc.)
library(zoo)
# create zoo object with yearmon index
DF <- FakeData
z <- zoo(DF[,3], yearmon(DF[,1] + (DF[,2]-1)/12))
# create grid g. Merge zero width zoo object based on it. Fill NAs with 0s.
g <- seq(start(z), end(z), 1/12)
z0 <- na.fill(merge(z, zoo(, g)), fill = 0)
which gives
> z0
Jan 2010 Feb 2010 Mar 2010 Apr 2010 May 2010 Jun 2010
149 113 110 99 110 96
Jul 2010 Aug 2010 Sep 2010 Oct 2010 Nov 2010 Dec 2010
108 0 0 0 0 0
Jan 2011 Feb 2011 Mar 2011 Apr 2011 May 2011 Jun 2011
0 0 91 72 119 130
Jul 2011 Aug 2011 Sep 2011
93 74 112
or converting to "ts" class:
> as.ts(z0)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2010 149 113 110 99 110 96 108 0 0 0 0 0
2011 0 0 91 72 119 130 93 74 112
Note that if z is a zoo object then coredata(z) is its data and time(z) are its index values.
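Those two accessors are enough to get back to a data frame shaped like the original FakeData. A minimal sketch, using a small made-up zoo series in place of the z0 computed above:

```r
library(zoo)

# Toy filled series (hypothetical values) with a yearmon index
z0 <- zoo(c(5, 0, 7), as.yearmon(2010 + (0:2) / 12))

# time() gives the yearmon index, coredata() the values
out <- data.frame(DischargeYear  = as.integer(format(time(z0), "%Y")),
                  DischargeMonth = as.integer(format(time(z0), "%m")),
                  Cases          = coredata(z0))
out
```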