In the zoo package there is a function called rollmean, which enables you to compute moving averages. rollmean(x, 3) takes the previous, current and next value (i.e. 4, 6 and 2 for the second row of the table below); this is shown in the second column.
 x  rollmean  ma3
 4
 6  4.0
 2  4.3
 5  3.0       4.0
 2  6.3       4.3
12  6.0       3.0
 4  6.0       6.3
 2            6.0
I would like to get the same job done, but by averaging the previous 3 values, starting in the fourth row; this is displayed in the third column (ma3). Can anybody tell me the name of the function that will help accomplish this?
You can use rollmean, but set align='right'. Or you could use rollmeanr, which has align='right' as the default.
ma3 <- rollmeanr(x[,1],3,fill=NA)
...but you would still need to lag the result. Another solution is to use rollapply with a list for the width argument:
ma3 <- rollapplyr(x[,1],list(-(3:1)),mean,fill=NA)
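As a concrete check, both approaches reproduce the ma3 column when applied to the series from the table (a sketch; the vector x below is transcribed from the first column, whereas the lines above assume a multi-column object, hence x[,1]):
library(zoo)
x <- c(4, 6, 2, 5, 2, 12, 4, 2)
# right-aligned 3-value mean, shifted down one step so each row
# only sees values strictly before it
ma3 <- c(NA, head(rollmeanr(x, 3, fill = NA), -1))
# the same thing in one call: a window made of the three previous offsets
ma3b <- rollapplyr(x, list(-(3:1)), mean, fill = NA)
# both give NA NA NA 4.00 4.33 3.00 6.33 6.00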
I struggled to find a simple moving-average function with enough flexibility to do what I needed, so I finally wrote a couple of functions extending the one based on the filter function that rinni gives in a comment above (which by itself won't work here, because it includes the current observation in the 3-period average).
Moving average function that includes the current observation:
# stats::filter, spelled out so that dplyr::filter cannot mask it
mav <- function(x, n) stats::filter(x, rep(1/n, n), sides = 1)
Moving average function that does not include the current observation:
mavback <- function(x, n) {
  a <- mav(x, 1)      # the series itself
  b <- mav(x, n + 1)  # mean of the current and previous n observations
  # (n+1)*b is the sum of those n+1 values; subtracting the current
  # observation a and dividing by n leaves the mean of the previous n
  (1/n) * ((n + 1) * b - a)
}
Backward-looking moving average function, not including the current observation, based on h2 readings starting h1 periods back:
mavback1 <- function(x, h1, h2) {
  a <- mavback(x, h1)       # mean of the previous h1 readings
  b <- mavback(x, h1 - h2)  # mean of the previous h1 - h2 readings
  # h1*a - (h1-h2)*b is the sum of the h2 readings at lags h1-h2+1
  # through h1; dividing by h2 gives their mean
  (1/h2) * (h1 * a - (h1 - h2) * b)
}
A simpler implementation of w_i_l_l's mavback function, based on his mav function:
mavback <- function(x, n) stats::filter(x, c(0, rep(1/n, n)), sides = 1)  # zero weight on the current observation
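A quick sanity check of both functions on the series from the table above:
x <- c(4, 6, 2, 5, 2, 12, 4, 2)
mav(x, 3)      # NA NA 4.00 4.33 3.00 6.33 6.00 6.00 (includes the current value)
mavback(x, 3)  # NA NA NA 4.00 4.33 3.00 6.33 6.00   (previous 3 values only)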
Related
I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient and return a vector that contains dat in the unaffected years and dat times the coefficient in the selected years; this vector is then to replace dat.
library(dplyr)  # tibble() and mutate() are used below

year <- rep(1:5, times = c(2, 2, 2, 2, 2))
dat <- 1:10
df <- tibble(year, dat)
adjust <- c(rep(0, 4), rep(1 + 0.1 * 1:3, c(2, 2, 2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I run this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire year vector rather than the year of the current row; the subtraction is then vectorized, and [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse's working now. My question is: is there any way to do what I was trying to do in an R- and dplyr-idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
You need to use [ instead of [[ for vector indexing. Also, year - 2 produces non-positive indices for years 1 and 2, which will cause further problems. Since adjust here is already aligned with the rows of df, you can use replace with a mask that indicates the rows to be modified:
df %>%
  mutate(dat = {
    mask <- year > 2
    replace(year, mask, dat[mask] * adjust[mask])
  })
# A tibble: 10 x 2
#     year   dat
#    <int> <dbl>
#  1     1   1.0
#  2     1   1.0
#  3     2   2.0
#  4     2   2.0
#  5     3   5.5
#  6     3   6.6
#  7     4   8.4
#  8     4   9.6
#  9     5  11.7
# 10     5  13.0
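As the question itself suggests, another idiomatic option is a per-year lookup vector; a sketch (coef_by_year is a hypothetical name, holding the non-zero coefficients from adjust above):
coef_by_year <- c(`3` = 1.1, `4` = 1.2, `5` = 1.3)
df %>%
  mutate(dat = ifelse(year > 2, dat * coef_by_year[as.character(year)], dat))
This version keeps dat, rather than year, in the unaffected years, which is what the question describes.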
Is there any difference between what these two lines of code do:
mv_avg[i-2] <- (sum(file1$rtn[i-2]:file1$rtn[i+2])/5)
and
mv_avg[i-2] <- mean(file1$rtn[i-2]:file1$rtn[i+2])
I'm trying to calculate the moving average of the first 5 elements in my dataset. I was running a for loop, and the two lines give different outputs. Sorry for not providing the data and the rest of the code for you to execute and see (I can't, due to some issues). I just want to know whether they both do the same thing or whether there's a subtle difference between the two.
It's not an issue with mean or sum. The example below illustrates what's happening with your code:
x = seq(0.5,5,0.5)
i = 8
# Your code
x[i-2]:x[i+2]
[1] 3 4 5
# Index this way to get the five values for the moving average
x[(i-2):(i+2)]
[1] 3.0 3.5 4.0 4.5 5.0
x[i-2] is 3 and x[i+2] is 5, so x[i-2]:x[i+2] is equivalent to 3:5, which returns only the three values 3, 4, 5. You're seeing different results with mean and sum because your code is not operating on 5 values, so dividing the sum by 5 does not give the average: sum(c(3,4,5))/5 != mean(c(3,4,5)).
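With the indexing fixed, the loop from the question would look something like this (a sketch; file1$rtn and mv_avg are the objects named in the question):
n <- length(file1$rtn)
mv_avg <- numeric(n - 4)
for (i in 3:(n - 2)) {
  # a centered 5-value window: two before, current, two after
  mv_avg[i - 2] <- mean(file1$rtn[(i - 2):(i + 2)])
}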
@G.Grothendieck mentioned rollmean. Here's an example:
library(zoo)
rollmean(x, k=5, align="center")
[1] 1.5 2.0 2.5 3.0 3.5 4.0
My dataset looks like this:
d = data.frame(year = rep(2000:2002, each = 40),
               month = rep(c(rep(1:12, 3), 5, 6, 7, 8), 3),
               species = rep(c(rep(letters[1:12], 3), "a", "b", "g", "l"), 3),
               species_group = NA,
               kg = round(rnorm(120, 15, 6), digits = 2))
d$species_group = ifelse(d$species %in% letters[1:5], "A", "B")
I would like to have, per year and per species group (so excluding the levels of month and species), the mean weight and the number of species included. This works fine with ddply. However, I would also like to include a measure of the "quality" of my data: that is, whether the number of species per month is balanced, or whether, for example, more species are included during the summer months. I therefore thought I might simply calculate the yearly standard deviation of the number of unique species per month.
I tried doing this with tapply inside ddply as follows:
s=ddply(d,c("year","species_group"),function(x) cbind(n_species=length(unique(x$species)),
quality=tapply(x,x$month,sd(length(unique(x$species)))),
kg=sum(x$kg,na.rm=T)))
but this gives me an error
Error in match.fun(FUN) : 'sd(length(unique(x$species)))' is not a function, character or symbol
What I would like to obtain is something like this:
output=data.frame(year=rep(2000:2002,each=2),species_group=rep(c("A","B"),3),n_species=rep(c(7,9),3),quality=round(rnorm(6,2,0.3),digits=2),kg=round(rnorm(6,15,6),digits=2))
I cannot first run ddply by month, year and species group, because then I would no longer know the number of unique species per year. I suppose I could calculate n_species and quality separately and combine them afterwards, but that would be a cumbersome approach.
How can I make my function work, or how can I do this more properly?
ANSWER:
The easiest solution came from shadow, who noted my mistake in the use of tapply. Furthermore, a standard error is more appropriate here than a standard deviation, giving the following formula:
s = ddply(d, c("year", "species_group"), function(x) {
  n_per_month = tapply(x$species, x$month, function(y) length(unique(y)))
  cbind(n_species = length(unique(x$species)),
        quality = sd(n_per_month) / sqrt(length(n_per_month)),
        kg = sum(x$kg, na.rm = TRUE))
})
It's not clear how you define your quality criterion, so here is how I would do it. First, I define the quality criterion in a separate function. Note that your function should return a single value, not a vector (in your solution you are using tapply, which returns a vector).
## returns the mean of the per-month sd variation
get_quality <- function(species, month) {
  mean(tapply(species, month, FUN = function(s) sd(as.integer(s))),
       na.rm = TRUE)
}
Then I use it within ddply. To simplify the code, I also create a function to be applied to each group:
library(plyr)

ff <- function(x) {
  cbind(n_species = length(unique(x$species)),
        quality = get_quality(x$species, x$month),
        kg = sum(x$kg, na.rm = TRUE))
}

s = ddply(d, .(year, species_group), ff)
year species_group n_species quality kg
1 2000 A 5 0.4000000 259.68
2 2000 B 7 0.2857143 318.24
3 2001 A 5 0.4000000 285.07
4 2001 B 7 0.2857143 351.54
5 2002 A 5 0.4000000 272.46
6 2002 B 7 0.2857143 331.45
I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites are values for single species forest stands and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl            n    mean      sd
Vest02DenmDesp   3    58.16    6.16
Vest02DenmDesp   5    54.45    7.85
Vest02DenmDesp   3    51.34    1.71
Vest02DenmDesp   3    59.57    5.11
Vest02DenmDesp   5    62.89   10.26
Vest02DenmDesp   3    77.33    2.14
Mato10GermDesp   4    41.89   12.6
Mato10GermDesp   4    11.92    1.8
Wawa07ChinDesp  18     0.097   0.004
Chen12ChinDesp   3    41.93    1.12
Hans11SwedDesp   2  1406.2   679.46
Hans11SwedDesp   2  1156.2   464.07
Hans11SwedDesp   2  4945.3   364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Here 2 is the number of groups, which will therefore change depending on the site. I know this formula is used for t-tests comparing two groups; in this case I'm not planning to compare these groups. My professor suggested I use this formula to get a weighted sd. I didn't find an R function that incorporates this formula in the way I need it, so I tried to build my own. I am, however, new to R and not very good at writing functions and loops, so I hope for your help.
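(For reference, a direct R transcription of the two-group formula above could look like this; pooled_sd2 is just an illustrative name.)
pooled_sd2 <- function(n1, s1, n2, s2) {
  sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
}
pooled_sd2(3, 6.16, 5, 7.85)  # first two Vest02DenmDesp rows: ~7.33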
This is what I got so far:
sd=function (data) {
nc1=data[z,"nc"]
sc1=data[z, "sc"]
nc2=data[z+1, "nc"]
sc2=data[z+1, "sc"]
sd1=(nc1-1)*sc1^2 + (nc2-1)*sc2^2
sd2=sd1/(nc1+nc2-length(nc1))
sqrt(sd2)
}
splitdf=split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
for (i in 1:length(splitdf[[i]])) {
a = (splitdf[[i]])
b =sd(a)
}
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra
What you're trying to do would benefit from a more general formula, which will make it easier. If you didn't need to break the data into pieces by the keybl variable, you'd be done:
dd <- df  # df is not a good name for a data.frame variable, since df has a meaning in statistics
dd$df <- dd$n - 1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note: df is only pre-calculated here because it's needed more than once;
# the sum of squares, variance, etc. are each used once
An important general principle in R is to use vector math as much as possible. In this trivial case it won't matter much, but read on to see how to do this on large data.frame objects where compute speed is more important.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for standard deviation variable even in a data.frame just because it's a bad habit to have... it's already a function and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
Now just use convenience functions for splitting and calculating the necessary sums. Note that only one function is executed in each implicit loop here (*apply, aggregate, etc. are all implicit loops that execute a function many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum)  # two different built-in split-apply methods; aggregate could have been used for both
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and you can also easily get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
keybl ss df s2 s
1 Chen12ChinDesp 2.508800e+00 2 1.254400e+00 1.120000
2 Hans11SwedDesp 8.099454e+05 3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02 6 8.100000e+01 9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01 7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05 0.004000
This looks much less concise than other methods (like 42-'s answer), but if you unroll those in terms of how many R commands are actually executed, this is much more concise. For a short problem like this either way is fine, but I wanted to show the method that uses the most vector math. It also highlights why those convenient implicit-loop functions exist: expressiveness. If you used for loops to accomplish the same thing, the temptation would be stronger to put everything inside the loop, which can be a bad idea in R.
The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n-1)*sd^2 ] / (sum(n-1) - N_groups) )
lapply( split(df, df$keybl),
        function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) ) / (sum(dd$n-1) - nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919
$Hans11SwedDesp
[1] Inf
$Mato10GermDesp
[1] 11.0227
$Vest02DenmDesp
[1] 9.003795
$Wawa07ChinDesp
[1] 0.004123106
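Note that this uses a different denominator from the aggregate-based answer above (sum(n-1) - N_groups rather than sum(n-1)), which is why the two sets of results differ, and why Hans11SwedDesp comes out as Inf here (three rows, each with n = 2, give sum(n-1) - 3 = 0). A quick check on the Vest02DenmDesp group, where sum((n-1)*sd^2) is about 810.68 (the ss column in the previous answer's output):
sqrt(810.68 / 16)        # 7.12 -> the aggregate-based answer, sum(n-1) = 16
sqrt(810.68 / (16 - 6))  # 9.00 -> this answer, 6 rows in the group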