Computing a weighted rolling average in R [duplicate]

This question already has answers here:
Adaptive moving average - top performance in R
(3 answers)
Closed 8 years ago.
Say I have two columns in a dataframe/data.table: one is the level and the other is the volume. I want to compute a rolling average of the level, weighted by volume, so volume acts as the weight (normalized to 1) over some rolling window.
Base R has a weighted.mean() function which does a similar calculation for two static vectors. I tried using sapply to pass a list/vector of arguments to it and create a rolling series, but to no avail.
Which "apply" mechanism should I use with weighted.mean() to get the desired result, or would I have to loop/write my own function?
P.S. In the end I settled on writing a simple custom function, which uses the great RcppRoll package. I found RcppRoll to be wicked fast, much faster than other rolling methods, which is important to me, as my data is several million rows.
The code for the function looks like this (I've added some NAs at the beginning, since RcppRoll returns data without them):
require(RcppRoll)
my.rollmean.weighted <- function(vec1, vec2, width) {
  # rolling weighted mean: sum(x * w) / sum(w) over each window,
  # left-padded with NAs so the result matches the input length
  return(c(rep(NA, width - 1),
           roll_sum(vec1 * vec2, width) / roll_sum(vec2, width)))
}
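For illustration, a minimal usage sketch with made-up data (the level and volume names here are hypothetical):
library(RcppRoll)

set.seed(42)
level  <- cumsum(rnorm(20))      # hypothetical level series
volume <- runif(20, 1e3, 1e6)    # hypothetical volumes acting as weights

# 5-period volume-weighted rolling average; the first 4 entries are NA
my.rollmean.weighted(level, volume, width = 5)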

I think this might work. It employs the technique demonstrated in the rollapply documentation for rolling regression. The key is by.column = FALSE, which passes a matrix of all the columns to the function on a rolling basis.
require(zoo)

df <- data.frame(
  price  = cumprod(1 + runif(1000, -0.03, 0.03)) * 25,
  volume = runif(1000, 1e6, 2e6)
)

rollapply(
  df,
  width = 50,
  function(z) {
    # uncomment if you want to see the structure
    # print(head(z, 3))
    weighted.mean(z[, "price"], z[, "volume"])
  },
  by.column = FALSE,
  align = "right"
)
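As a quick sanity check (a sketch, assuming the df above): the last rolling value should match a direct computation on the final 50 rows.
# right-aligned width-50 window, so the last result covers the last 50 rows
weighted.mean(tail(df$price, 50), tail(df$volume, 50))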
Let me know if it doesn't work or is not clear.

Here is a code snippet that might help. It uses the rollmean() function from the zoo package with a window of two (you pick the interval). I assume you would first compute the weighted variable with weighted.mean():
library(zoo) # for the rollmean() function
movavg <- rollmean(df$weightedVariable, k = 2, align = "right")
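If instead you want the volume-weighted rolling mean directly, here is a hedged sketch (assuming price and volume columns as in the df from the answer above) using two rolling sums, which is algebraically the same as the RcppRoll approach:
library(zoo)

k <- 2  # window width; pick your interval
# weighted rolling mean = rolling sum(price * volume) / rolling sum(volume)
movavg.w <- rollsum(df$price * df$volume, k = k, align = "right") /
            rollsum(df$volume,            k = k, align = "right")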

Related

How can I insert a data frame in a function and then group by groups with tapply

I am new to programming in R and I have made a function that returns some basic statistics from a list or vector that we pass in. The problem comes when I want to pass in a data frame.
The data frame I want to pass in has 2 columns: the first refers to a group (1 or 2) and the second refers to widths of skulls in cm (numerical values). I would like to take the mean of both groups separately so that I can later compare them (1 and 2), along with the mode, median, quartiles... (everything I have inside the function).
It occurred to me to reuse the function I had made for lists and vectors and then, to do the grouping, use the tapply function, but it gives me an error in the console, this one:
Error in tapply(archivo, archivo$`Época histórica`, descriptive_statistics) :
arguments must have same length
Here is the function and the tapply call that I used:
descriptive_statistics = function(x) {
  # note: modes() is a user-defined helper, not a base R function
  result <- list(
    mean(x), exp(mean(log(x))), median(x), modes(x),
    (range(x)[2] - range(x)[1]), var(x), sqrt(var(x)), sqrt(var(x)) / mean(x)
  )
  names(result) <- c('Arithmetic mean', 'Geometric mean', 'Median', 'Mode',
                     'Range', 'Variance', 'Standard deviation',
                     "Pearson's coefficient of variation")
  result
}
tapply(archivo, archivo$`Época histórica`, descriptive_statistics)
How could I improve my function so that it accepts data frames? Or what could I change in the tapply call to make it work? Can someone give me a hand with this? I am also open to other ideas; I have tried aggregate and the summary function and such, but they do not give me the statistics I want, such as Pearson's coefficient.
Thank you very much in advance, greetings.
Pass a column of the data frame to the function instead of the complete data frame. You haven't shared your data, so it is difficult to give a specific answer, but let's assume the other column is called col1. In that case you can do:
tapply(archivo$col1, archivo$`Época histórica`, descriptive_statistics)
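For example, on a hypothetical stand-in for your data (group values 1 and 2, widths in cm; this assumes your modes() helper is defined, otherwise drop it from the list first):
set.seed(1)
# hypothetical data mirroring the description in the question
archivo <- data.frame(
  `Época histórica` = rep(c(1, 2), each = 10),
  col1 = c(rnorm(10, mean = 14), rnorm(10, mean = 15)),
  check.names = FALSE
)

# one list of descriptive statistics per group
tapply(archivo$col1, archivo$`Época histórica`, descriptive_statistics)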

looping through some but not all columns? [duplicate]

This question already has answers here:
Standardize data columns in R
(16 answers)
Scale only certain columns R [closed]
(1 answer)
Closed 3 years ago.
Hey awesome community,
I am trying to learn how to use loops to loop through aspects of a dataset. I'm using the sns data set provided free for machine learning and trying to run a k-means cluster analysis. The first thing I need to do is center and scale the variables. I want to do this using a loop, and I need to select all but the first four variables in the data set. Here's what I tried, and I'm not sure why it doesn't work:
for (i in names(sns.nona[, -c(1:4)])) {
  scale(i, center = TRUE, scale = TRUE)
}
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I get the above error, which must mean it's not selecting the actual column of the data set, just the name. I guess I should expect that, but how do I make it reference the data?
edit: I also tried:
for (i in names(sns.nona)[-c(1:4)]) {
  scale(sns.nona[, i], center = TRUE, scale = TRUE)
}
This did not return an error, but it does not appear to be centering the data. I should get some negative values where the original value was 0, as I'd be subtracting the column mean from it...
A way to do this that avoids writing a loop:
scale(data[-1:-4])
Also, if you want to modify the selected columns in place without creating a new data frame:
data[-1:-4] <- lapply(data[-1:-4], scale)
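A quick demonstration on simulated data (all names here are hypothetical):
# four non-numeric id-style columns followed by numeric ones
data <- data.frame(id1 = letters[1:5], id2 = letters[6:10],
                   id3 = letters[11:15], id4 = letters[16:20],
                   x = 1:5, y = rnorm(5))

data[-1:-4] <- lapply(data[-1:-4], scale)
sapply(data[-1:-4], mean)  # ~0 for each centered column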
You could use the tidyverse family of packages, which is what I use for pretty much everything I do in R.
It's never too early to start using them imo.
require(tidyverse)
# convert sns.nona to a tibble (a robust data format we can do cool stuff to);
# as_tibble() is the current spelling, as.tibble() is deprecated
sns.nona <- as_tibble(sns.nona)
# do cool stuff: mutate_at("columns to change", "function to apply to columns")
sns.nona <- sns.nona %>%
  mutate_at(5:ncol(sns.nona), function(x) scale(x, center = TRUE, scale = TRUE))
NB: don't be alarmed by the %>%. Basically, x %>% function(y, z) is equivalent to function(x, y, z).
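One caveat in case it surprises you later: scale() returns a one-column matrix, so if you prefer plain numeric columns you can wrap it in as.numeric() (a variant of the same call):
sns.nona <- sns.nona %>%
  mutate_at(5:ncol(sns.nona),
            function(x) as.numeric(scale(x, center = TRUE, scale = TRUE)))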
You might need to assign the result back after applying scale
for (i in names(df)[-(1:4)]) {
  df[, i] <- scale(df[, i], center = TRUE, scale = TRUE)
}
Or with lapply you could do
df[-(1:4)] <- lapply(df[-(1:4)], scale, center = TRUE, scale = TRUE)
And with dplyr we can use mutate_at:
library(dplyr)
df %>% mutate_at(-(1:4), scale, center = TRUE, scale = TRUE)
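After assigning the result back with any of these, a quick check that the centering actually happened (a sketch, assuming the columns after the fourth are numeric):
sapply(df[-(1:4)], mean)  # should be ~0 for every scaled column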

Split-apply-combine with aggregate: can the applied function accept multiple arguments that are specified variables of the original data?

Some context: On my quest to improve my R code, I'm trying to replace my for-loops with R's apply-class functions whenever I can.
The question: Are R's apply functions such as sapply, tapply, aggregate, etc. useful for applying functions that are more complicated in the sense that they take as arguments specified variables of the original data?
Simple examples of what works and what does not:
I have a dataframe with one time variable date.time and two numeric variables val.one and value.two:
Generate the data:
library(lubridate)  # for ymd_hms()
df <- data.frame(
  date.time = seq(ymd_hms("2000-01-01 00:00:00"),
                  ymd_hms("2000-01-03 00:00:00"), length.out = 100),
  value.one = c(1:100),
  value.two = c(1:100) + 10
)
I would like to apply a function to every 10-hour cut of the dataframe that has as its two arguments the two numeric variables of the dataframe. For example, if I want to compute the mean of each of the two values for each 10-hour cut, the solution is the following:
A function that computes the mean of value.one and value.two for each time period of 10 hours:
work_on_subsets <- function(data, time.step = "10 hours"){
  aggregate(data[, -1], list(cut(df$date.time, breaks = time.step)),
            function(x) mean(x))
}
However, if I want to work with the two data values separately to run another function, say compute the sum of the two averages, I run into trouble. The function work_on_subsets_2 gives me the following error: Error in x$value.one : $ operator is invalid for atomic vectors
A function that computes the sum of the means of value.one and value.two for each 10 hour time period:
work_on_subsets_2 <- function(data, time.step = "10 hours"){
aggregate(data, list(cut(df$date.time, breaks = time.step)), function(x) mean(x$value.one) + mean(x$value.two)}
In the limit, I would like to be able to do something like this:
A function that runs another_function on value.one and value.two for each time period of 10 hours :
another_function <- function(a,b) {
# do something with a and b
}
work_on_subsets_3 <- function(data, time.step = "10 hours"){
  aggregate(data, list(cut(df$date.time, breaks = time.step)),
            another_function(x$value.one, x$value.two))
}
Am I using the wrong tools for this job? I already have a working solution using for-loops, but I'm trying to get a grip on the split-apply-combine strategy. If so, are there any viable alternatives to for-loops?
Hi, there are a few basic things you are doing wrong here. You are creating a function which takes data as its data.frame argument, but you are still referencing df from the global environment. You're also missing at least one bracket. And I don't quite know why you have two layers of functions embedded.
My solution departs from your method but hopefully will help. I'd recommend the plyr package when you want to split data frames and apply functions, as I find it much more intuitive. Combining it with dplyr also helps, in my opinion. Note: always load plyr before dplyr, or you run into dependency issues.
If I understand your question correctly, the below should work, and you could create different functions to apply:
library(plyr)
library(dplyr)

# create the function you want to apply
MeanFun <- function(data) mean(data[["value.one"]]) + mean(data[["value.two"]])

# add a grouping variable to your data frame. You can chain this with pipes (%>%)
# if you don't want to create a new data.frame, but for testing purposes this
# more clearly shows what's happening. Note the fixed "10 hours" here: inside
# your own function you would pass time.step instead.
df1 <- df %>% mutate(Breaks = cut(date.time, breaks = "10 hours"))

# use plyr's ddply to split the data frame on the "Breaks" column and apply the function
out <- ddply(df1, "Breaks", MeanFun)
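For comparison, a dplyr-only sketch of the same computation (assuming the df from the question and the same fixed "10 hours" break):
library(dplyr)

df %>%
  group_by(Breaks = cut(date.time, breaks = "10 hours")) %>%
  # dplyr:: prefix avoids plyr's summarise masking it if plyr is attached
  dplyr::summarise(SumOfMeans = mean(value.one) + mean(value.two))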

R xts: apply over a rolling window

I wish to execute a function FUN over a rolling window of 1 year. My xts does not have the same number of points per year. How can I do that in an efficient way?
P.S. usually to execute a FUN over a fixed number of datapoints (for instance 100) I use:
as.xts(rollapply(data = zoo(indicator), FUN = FUN, width = 100, align = "right"))
but obviously this doesn't work if there are not always the same number of points per year.
I'll try to answer my own question. One way to do it is:
First, NA-pad the time series so that there is one data point per day (or whatever unit is relevant for your case).
(Optional, depending on your FUN.) Then use na.locf to carry the last observation forward and fill the holes.
Finally, use the usual rollapply as shown in the question, over the fixed number of data points that corresponds to 1 year. A sketch of these steps follows below.
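A minimal sketch of those steps on made-up data (the series name and dates are hypothetical):
library(xts)
library(zoo)

# hypothetical irregular series: 800 random days over five years
set.seed(1)
days <- sort(sample(seq(as.Date("2010-01-01"), as.Date("2014-12-31"),
                        by = "day"), 800))
indicator <- xts(rnorm(800), order.by = days)

# 1. NA-pad to one observation per day
daily <- merge(indicator,
               xts(, seq(start(indicator), end(indicator), by = "day")))
# 2. carry the last observation forward to fill the holes
daily <- na.locf(daily)
# 3. roll over a fixed window of 365 points = 1 year
out <- rollapply(daily, width = 365, FUN = mean, align = "right")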
You can use the apply.yearly(x, FUN, ...) function from the xts library.
library(xts)
# as.Date.numeric requires an origin; "2000-01-01" is arbitrary
dat <- xts(rnorm(1000), order.by = as.Date(1:1000, origin = "2000-01-01"))
plot(dat)
apply.yearly(dat, mean)

Using ddply() to Get Frequency of Certain IDs, by Appearance in Multiple Rows (in R)

Goal
If the following description is hard to follow, please see the example "before" and "after" for a straightforward illustration.
I have bartering data, with unique trade ids, and two sides of the trade. Side1 and Side2 are baskets, lists of item ids that represent both sides of the barter transaction.
I'd like to count the frequency with which each ITEM appears across TRADES. E.g., if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared within each trade).
Further, I'd like to do this with the plyr ddply function.
(If you're interested in my motivation: I'm working with many hundreds of thousands of transactions and am already using ddply to calculate several other summary statistics. I'd like to add this to the ddply call I'm already using, rather than calculate it afterwards and merge it into the ddply output... sorry if that was difficult to follow.)
In terms of pseudo code I'm working off of:
merge each row of Side1 and Side2
by row, get unique() appearances of each item id
apply table() function
transpose and relabel output from table
Example of the structure of my data, and the output I desire.
Data Example (before):
df <- data.frame(TradeID = c("01", "02", "03", "04"))
df$Side1 <- list(c("001", "001", "002"),
                 c("002", "002", "003"),
                 c("001", "004"),
                 c("001", "002", "003", "004"))
df$Side2 <- list(c("001"), c("007"), c("009"), c())
Desired Output (after):
df.ItemRelFreq_byTradeID <- data.frame(
  ItemID = c("001", "002", "003", "004", "007", "009"),
  RelFreq_byTrade = c(3, 3, 2, 2, 1, 1)
)
One method to do this without ddply
I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me.
temp <- table(unlist(sapply(mapply(c, df$Side1, df$Side2), unique)))
# as.numeric() keeps this a plain two-column data frame; passing the
# table itself would expand into Var1/Freq columns
df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp),
                                       RelFreq_byTrade = as.numeric(temp))
Thanks for any help you can offer!
Curtis
I believe this will do what you're asking for. It uses ddply. Twice!
res <- ddply(df, .(TradeID), function(df)
  data.frame(ItemID = c(df$Side1[[1]], df$Side2[[1]]), TradeID = df$TradeID))
ddply(res, .(ItemID), summarise, RelFreq_byTrade = length(unique(TradeID)))
Note that the ItemIDs are slightly out of order.
