Let's say I have a dataframe (df) in R:
df <- data.frame(x = rnorm(5, mean = 5), u = rnorm(5, mean = 5), y = rnorm(5, mean = 5), z = rnorm(5, mean = 5))
print(df)
I want to get the mean absolute difference (MAD) between the first column (x) and the other columns.
With this function, I can find the MAD between the first column and another (the second for example):
mad <- function(dat){
abs(mean(dat[,1] - dat[,2], na.rm = TRUE))
}
mad(dat = df)
But I want to generalize the function to apply across all of the columns. Changing the function to something like this:
mad <- function(dat) {
abs(mean(dat[,1] - dat[,2:4], na.rm = TRUE))
}
mad(dat = df)
does not work: it returns NA with this warning: "argument is not numeric or logical: returning NA"
I was thinking of using apply() across the dataframe, as that seems to be the general advice that I've found on here. But I don't understand how to keep the first column constant and subtract the other columns from the first.
We can define the function with two arguments:
mad <- function(x, y) abs(mean(x - y, na.rm = TRUE))
and use sapply/lapply to loop over every column except the first, calling mad with the first column and each looped column in turn:
sapply(df[-1], function(x) mad(df[,1], x))
# u y z
#0.003399429 0.991685267 0.710553411
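Here is another option without defining a mad function. Note that it takes the mean of the absolute differences, which matches the stated goal but can give a different number from abs(mean(x - y)) above: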
sapply(abs(df[-1] - df[["x"]]), mean, na.rm = TRUE)
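As a rough illustration of why the two answers above are not interchangeable, here is a small sketch on made-up data (the values and the names abs_mean_diff and mean_abs_diff are assumptions for the example, not from the question). When positive and negative differences cancel, the absolute value of the mean difference shrinks towards zero while the mean of the absolute differences does not.

```r
# Made-up data: x is the reference column
df <- data.frame(x = c(1, 2, 3), u = c(2, 1, 3), y = c(4, 5, 6))

# Absolute value of the mean difference (what mad() above computes)
abs_mean_diff <- sapply(df[-1], function(col) abs(mean(df[["x"]] - col, na.rm = TRUE)))

# Mean of the absolute differences (what the second option computes)
mean_abs_diff <- sapply(abs(df[-1] - df[["x"]]), mean, na.rm = TRUE)

print(abs_mean_diff)  # u: differences -1, 1, 0 cancel out
print(mean_abs_diff)  # u: absolute differences 1, 1, 0 do not cancel
```

For column y every difference has the same sign, so both versions agree there; they only diverge for column u.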
Reproducible example:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success. It seems that no matter which way I spin it, there is no way to get around specifying the dataframe you want to apply to at some point.
This works for one of the dataframes:
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
Obviously this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried this:
# subset (may not be necessary)
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
# there doesn't seem to be a way to do it without having to
# specify the particular dataframe in the rollmean function
lapply(list_of_df, function(x) rollmean(list_of_df, 2))
# for loop? also no success
for (i in list_of_df) {roll_mean1 <- rollmean(Sub1, 2)
Exp
}
Stating the obvious but I'm very new to coding in general and would appreciate some pointers.
It has occurred to me that even if it did work, the column that has been averaged would be one value longer than the rest of the dataframe; how would I get around that?
The question says at one point that rollmean should be applied only to Y, and at another that roll_mean <- rollmean(list_of_df$A, 2) works, but that call processes all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
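2) If you only want to apply it to the Y column (fill = NA pads the result with NA so that Y keeps its original length, which addresses the length mismatch raised in the question):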
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}
Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
n is calculated, but the mean is not, and I can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group',
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move on to this more complex version (for which I was using this example as guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block. The call to dplyr::select() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
1) Remember that, just like the ~n() you already have, the call will have to start with a ~.
2) Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
3) Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case my first attempt would probably be interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as described in vignette("nse").
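The character-versus-name distinction can also be seen without dplyr or lazyeval. The following is only a base-R sketch of the same idea, using substitute() in place of lazyeval::interp; the small temp data frame and the names call_chr, call_sym, res_chr and res_sym are made up for the illustration.

```r
# Made-up data standing in for the question's temp data frame
temp <- data.frame(eg1 = c(1, 2, 3, NA), eg2 = c(10, 20, 30, 40))
to.sum <- "eg1"

# Substituting the string itself builds mean("eg1", na.rm = TRUE)
call_chr <- substitute(mean(x, na.rm = TRUE), list(x = to.sum))

# Substituting a symbol builds mean(eg1, na.rm = TRUE)
call_sym <- substitute(mean(x, na.rm = TRUE), list(x = as.name(to.sum)))

print(call_chr)
print(call_sym)

# Evaluate both calls with the data frame as the lookup environment
res_chr <- suppressWarnings(eval(call_chr, temp))  # mean of a string: NA
res_sym <- eval(call_sym, temp)                    # mean of the eg1 column

print(res_chr)
print(res_sym)
```

The first call reproduces the NA seen in the question, because mean() is handed a character string rather than the column.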
Here is my data:
LoDFs <- list(first = mtcars[, c(1:3)], second = mtcars[, c(4:6)])
row.names(LoDFs[[1]]) <- NULL
row.names(LoDFs[[2]]) <- NULL
Here is my function:
RollapplyMultipleFuncsAndWins <- function(df.val, df.name, window.size, funs, ..., GroupByWindowSize = TRUE){
library(zoo) # REQUIRED FOR rollapply
by.rows <- 1
combinations <- expand.grid(window.size, funs)
combinations <- cbind(combinations, rep(names(funs), each = length(window.size)))
colnames(combinations) <- c("window.size", "func.call", "func.name")
combinations$window.size <- sprintf(paste0("%0", max(nchar(combinations$window.size)), "d"),
combinations$window.size)
LoMs <- apply(combinations, by.rows, function(x) {
rollapply(
df.val,
width = as.numeric(x[["window.size"]]),
by = as.numeric(x[["window.size"]]),
FUN = x[["func.call"]],
align = "left")})
# COLUMN NAMING CONVENTION: column_name.function_name
LoMs <- lapply(seq_along(LoMs), function(x) {
colnames(LoMs[[x]]) <- paste(colnames(LoMs[[x]]),
combinations$func.name[x],
sep=".");
LoMs[[x]] })
# MULTIPLE FUNCTIONS WITH SAME WINDOW SIZE IN ONE DATASETS
# LIST ELEMENTS NAMING CONVENTION: dataset_name.window_size
if (GroupByWindowSize){
df.win.grps <- lapply(unique(combinations$window.size), function(x) { grep(x, combinations$window.size) })
LoMs <- lapply(df.win.grps, function(x){ do.call(cbind, LoMs[x]) })
names(LoMs) <- paste(rep(df.name, each=length(df.win.grps)),
unique(combinations$window.size),
sep=".")
}
# MULTIPLE FUNCTIONS WITH SAME WINDOW SIZE IN MULTIPLE DATASETS
# LIST ELEMENTS NAMING CONVENTION: dataset_name.function_name.window_size
else {
names(LoMs) <- paste(rep(df.name, each=nrow(combinations)),
combinations$func.name,
combinations$window.size,
sep=".")
}
return(LoMs)
}
The purpose of this function is to apply multiple functions with multiple rolling/moving window sizes over one dataset. It takes the window sizes and the functions as inputs and creates all possible combinations of those values. For example, when you pass c(2, 3, 10) as window.size and c(median = median, mean = mean) as funs, it creates the following combinations (meaning that median and mean will each be called with a rolling/moving window of size 2, 3 and 10 on the specified dataset):
window.size func.call func.name
1 02 function (x, na.rm = FALSE) , UseMethod("median") median
2 03 function (x, na.rm = FALSE) , UseMethod("median") median
3 10 function (x, na.rm = FALSE) , UseMethod("median") median
4 02 function (x, ...) , UseMethod("mean") mean
5 03 function (x, ...) , UseMethod("mean") mean
6 10 function (x, ...) , UseMethod("mean") mean
The function then returns a list of matrices. If GroupByWindowSize is TRUE, each matrix holds the results for one window size, including the results from all functions; if GroupByWindowSize is FALSE, each matrix holds the results for one window size and one function. You can try e.g. the following to better understand what I mean:
res_one_def <- RollapplyMultipleFuncsAndWins(LoDFs[[1]], names(LoDFs)[1], c(2, 3), c(median = median, mean = mean))
res_one_non_def <- RollapplyMultipleFuncsAndWins(LoDFs[[1]], names(LoDFs)[1], c(2, 3), c(median = median, mean = mean), GroupByWindowSize=FALSE)
The problem arises when I want a single window size but multiple functions, e.g.:
res_one_def <- RollapplyMultipleFuncsAndWins(LoDFs[[1]], names(LoDFs)[1], c(1), c(median = median, mean = mean))
I've figured out that the problem is in the LoMs <- apply(combinations, by.rows, function(x) { .... line. Instead of a list of matrices (as it returned previously), it now returns a single matrix, and I do not know why (combinations has the same structure as before, just with fewer rows):
window.size func.call func.name
1 1 function (x, na.rm = FALSE) , UseMethod("median") median
2 1 function (x, ...) , UseMethod("mean") mean
Questions:
1) Why do I get the problem described above?
2) If you check the code you can see that I'm building combinations as expand.grid(window.size, funs). What if I also want to handle expand.grid(funs, window.size) (note the reordered arguments)? Will the if (GroupByWindowSize){ branch still work correctly in that case (let's pretend that combinations is passed as an argument to the function, so I want to be able to handle various layouts)?
3) Is it possible to define the naming convention for list elements at the beginning of the function and easily switch it from dataset_name.window_size to e.g. dataset_name.function_name.window_size in both if-else branches? As you can see, the names(LoMs) ... code in the two branches is very different; I'm curious whether it can be unified somehow.
4) How can I make this code more robust and more generic in general? Is my approach correct, or is there a better way? Any ideas are welcome.
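On the first question, one likely explanation (offered as a sketch, not a full diagnosis of the function above) is the way base apply() simplifies its result: when every call returns a value of the same length, apply() binds the results into a matrix, and it only falls back to a list when the lengths differ. With a single window size every rollapply() result has the same dimensions, so the list disappears. The toy matrix m and the names same_len, diff_len and always_list below are made up for the illustration.

```r
# Toy input: 2 rows, so apply(..., 1, ...) makes two calls
m <- matrix(1:6, nrow = 2)

# Every call returns a vector of length 3, so apply() simplifies to a matrix
same_len <- apply(m, 1, function(r) r * 2)

# The calls return vectors of length 1 and 2, so apply() returns a list
diff_len <- apply(m, 1, function(r) seq_len(r[1]))

# Looping over row indices with lapply() returns a list in both situations
always_list <- lapply(seq_len(nrow(m)), function(i) m[i, ] * 2)

print(class(same_len))
print(class(diff_len))
print(class(always_list))
```

If a list is wanted regardless of the result lengths, iterating over row indices with lapply() as in always_list is one option. I believe R 4.1.0 and later also accept apply(..., simplify = FALSE), but the sketch avoids it so that it runs on older versions too.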
I want to test the correlations between moving averages of varying lengths and a dependent variable. I've written a for loop that gets the job done but obviously for loops are not the ideal solution. I was wondering if someone could give me some pointers on how to replace the functionality of this for loop with apply as a more elegant solution? I've provided code and test data.
library(zoo)
# a function that calculates the correlation between moving averages for different lengths of window
# the inputs are "independent": the variable over which to apply the moving function
# "dependent": the output column, "startLength": the shortest window length, "endLength": the longest window length
# "functionType": the function to apply (mean, sd, etc.)
MovingAverageCorrelation <- function(independent, dependent, startLength, endLength, functionType) {
# declare a matrix for the different rolling functions and a correlation vector
avgMat <- matrix(nrow = length(dependent), ncol = (endLength - startLength + 1))
corVector <- rep(NA, ncol(avgMat))
# run the rollapply function over the data and calculate the corresponding correlations
for (i in startLength:endLength) {
# column index relative to startLength, so that startLength need not be 1
j <- i - startLength + 1
avgMat[, j] <- rollapply(independent, width = i, FUN = functionType,
na.rm = TRUE, fill = NA, align = "right")
corVector[j] <- cor(avgMat[, j], dependent, use = "complete.obs")
}
return(corVector)
}
# set test data
set.seed(100)
indVector <- runif(1000)
depVector <- runif(1000)
# run the function over the data
cor <- MovingAverageCorrelation(indVector, depVector, 1, 100, "mean")
Thanks!
Try sapply:
sapply(1:100, function(i) cor(rollapplyr(indVector, i, mean, na.rm = TRUE, fill = NA),
depVector, use = "complete.obs"))
If there are no NAs in your inputs this would work and is substantially faster:
sapply(1:100, function(i) cor(rollmeanr(indVector, i, fill = NA), depVector, use = "comp"))
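To make explicit what the right-aligned rolling mean in these one-liners computes, here is a rough base-R sketch of the same pattern. roll_mean_right is a hypothetical helper written for this illustration (it is not a zoo function), it assumes the input contains no NAs, and vapply is swapped in for sapply only to pin the return type.

```r
# Hypothetical helper: right-aligned rolling mean, padded with NA at the start
# so that the result has the same length as the input (assumes no NAs in x)
roll_mean_right <- function(x, width) {
  n <- length(x)
  if (width > n) return(rep(NA_real_, n))
  cs <- cumsum(c(0, x))
  # window sums are differences of cumulative sums
  means <- (cs[(width + 1):(n + 1)] - cs[1:(n - width + 1)]) / width
  c(rep(NA_real_, width - 1), means)
}

# Same test data as in the question
set.seed(100)
indVector <- runif(1000)
depVector <- runif(1000)

# One correlation per window width; the NA padding is dropped by complete.obs
cors <- vapply(1:100, function(w) {
  cor(roll_mean_right(indVector, w), depVector, use = "complete.obs")
}, numeric(1))

print(roll_mean_right(c(1, 2, 3, 4), 2))
print(head(cors))
```

With a width of 1 the rolling mean is just the input itself, so the first element of cors should equal cor(indVector, depVector) up to rounding, which is a quick sanity check on any version of this loop.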