I have the following code, in which I have three lists and am trying to calculate the MSE.
seg_mean <- list(c(3.5,2.5),c(6.5,5.5),c(9.2,8.8),c(5.3,4.7))
mean_tot <- list(c(3,3),c(6,6),c(9,9),c(5,5))
seg_len <- list(20,18,17,15)
MSE <- mapply('-', seg_mean, mean_tot, SIMPLIFY = FALSE) #element wise difference between seg_mean and mean_tot
MSE <- lapply(MSE, function(x) x^2) #element wise squaring
MSE <- mapply('/', MSE, seg_len, SIMPLIFY = FALSE) #dividing each list element by seg_len
This works. My question is: can I simplify these lines of code by using %>% (the magrittr pipe, re-exported by the dplyr package) in R?
I am not sure how it will work but I am hoping it would look like
MSE <- mapply('-', seg_mean, mean_tot, SIMPLIFY = FALSE) %>% lapply(function(x) x^2) %>% mapply('/', seg_len, SIMPLIFY = FALSE)
With |> (base R's native pipe), you can do:
mapply('-', seg_mean, mean_tot, SIMPLIFY = FALSE) |>
lapply(function(x) x^2) |>
mapply(FUN = '/', seg_len, SIMPLIFY = FALSE)
It also works with %>%. The difference is in the second mapply, where you need to name the FUN argument. Otherwise the piped value is inserted as the first argument, and the call is interpreted as mapply(., '/', seg_len, ...) instead of mapply('/', ., seg_len, ...).
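As an aside, the three steps can also be collapsed into a single pass without any pipe. This is only a sketch of an alternative, not part of the answer above: base R's Map() walks the three lists in parallel, and the argument names seg, tot and len are made up for illustration.

```r
seg_mean <- list(c(3.5, 2.5), c(6.5, 5.5), c(9.2, 8.8), c(5.3, 4.7))
mean_tot <- list(c(3, 3), c(6, 6), c(9, 9), c(5, 5))
seg_len  <- list(20, 18, 17, 15)

# One pass over the three lists: squared difference divided by segment length.
# Map() is mapply() with SIMPLIFY = FALSE, so the result is again a list.
MSE <- Map(function(seg, tot, len) (seg - tot)^2 / len,
           seg_mean, mean_tot, seg_len)

# Reference result from the original three-step version, for comparison.
ref <- mapply('-', seg_mean, mean_tot, SIMPLIFY = FALSE)
ref <- lapply(ref, function(x) x^2)
ref <- mapply('/', ref, seg_len, SIMPLIFY = FALSE)
```

Because the function takes all three inputs at once, there is no question about which argument slot the piped value lands in.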
Seems no one is able to tackle this, so I re-ask:
I would like to use an external function from sjstats package within my own function.
The external function takes 4 arguments as like: https://strengejacke.github.io/sjstats/reference/weighted_sd.html
weighted_mannwhitney(data, x, grp, weights, ...)
A simple call to that external function can be done like this.
weighted_mannwhitney(c12hour ~ c161sex + weight, efc)
I have built a simple function below, which currently only works when provided with data as input - adding arguments produces errors.
I would like to be able to pass some arguments (data, x, grp, weights) on to the internally called external function, so I can change those arguments for different datasets.
The Question
My attempts at generalizing the function are not successful, can anyone show me how I can do this? I show how I would like to call the generalized function below.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(data, ...) {
# Weighted MannWhitney Test p-value
res <- data %>%
summarise(
across(
c(c12hour,e17age),
~sjstats:::weighted_mannwhitney.formula(.x ~ c161sex + weight, data)$p.value[[1]]
)
) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()
return(res)
}
Call to own working function
own_funk(efc)
How I would like to call the function
# NB: The x can take on many variables
own_funk(data = efc, x = c(c12hour,e17age), grp = c161sex, weights = weight)
Somewhat similar to @snaut's solution, I'd use map to loop over your x columns/formulas. The main difference is that we probably don't want to hard-code the variables within the function, but rather pass them as arguments to the function.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
library(survey)
# Own working function
own_funk <- function(data, x, grp, weights) {
# Weighted MannWhitney Test p-value
formula <- paste0(x, " ~ ", grp, " + ", weights)
res <- map2(.x = formula,
.y = x,
.f = ~ data |>
summarize(!!sym(.y) := sjstats:::weighted_mannwhitney.formula(as.formula(.x), data = data)$p.value[[1]]) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()) |>
bind_rows()
return(res)
}
own_funk(data = efc, x = c("c12hour", "e17age"), grp = "c161sex", weights = "weight")
Which gives:
Outcome P-value
1 c12hour 0.006806572
2 e17age 0.187765467
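The key step in this function is turning the column names into formulas. A minimal base R sketch of just that step, so it can be inspected on its own (the variable names formula_strings and formulas are only for illustration, and nothing here calls sjstats):

```r
x       <- c("c12hour", "e17age")
grp     <- "c161sex"
weights <- "weight"

# paste0() is vectorised over x, so this yields one formula string per outcome.
formula_strings <- paste0(x, " ~ ", grp, " + ", weights)

# as.formula() converts each string into a formula object.
formulas <- lapply(formula_strings, as.formula)
names(formulas) <- x

print(formula_strings)
```

Printing the strings before converting them is a cheap way to check that each outcome ends up on the left-hand side as intended.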
sjstats:::weighted_mannwhitney.formula is clearly meant for interactive use and not for programming. sjstats:::weighted_mannwhitney.default seems to be better suited for programming, but it still does something with unquoting variable names that I did not quite figure out.
What I do most of the time when I have to use functions created to be used in an interactive way for programming is to create a temporary column with a fixed name and then rename the columns I want to iterate over to that column.
This is what I have done in my solution. I also don't use summarise and across but instead use map_dfr to iterate over the column names and output rows of a tibble.
I did not quite understand what you want to achieve with pivot_longer and pivot_wider, but I'm sure you can format the output of my solution to your needs.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(mydata, ...) {
# Weighted MannWhitney Test p-value
map_dfr(
c("c12hour", "e17age"),
function(column){
tmp_data <- mydata %>%
rename(tmp_col = all_of(column))
tibble(
variable = column,
`P-value`=sjstats:::weighted_mannwhitney.formula(tmp_col ~ c161sex + weight, data=tmp_data)$p.value[[1]]
)
}
)
}
own_funk(efc)
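The temporary-column trick itself does not depend on sjstats or the tidyverse. Here is a minimal base R sketch of the same idea, assuming the built-in mtcars data and stats::wilcox.test as a stand-in for the weighted test (the helper name p_for_column is hypothetical):

```r
# Copy the data, rename the column of interest to a fixed name, and call the
# formula interface with that fixed name.
p_for_column <- function(data, column) {
  tmp <- data
  names(tmp)[names(tmp) == column] <- "tmp_col"
  # exact = FALSE avoids the ties warning for these data
  wilcox.test(tmp_col ~ am, data = tmp, exact = FALSE)$p.value
}

columns <- c("mpg", "hp")
res <- data.frame(
  variable = columns,
  p_value  = vapply(columns, p_for_column, numeric(1), data = mtcars),
  row.names = NULL
)
res
```

The formula stays a literal in the source, so the called function never has to deal with a variable name stored in a string.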
I want to discretize a variable using R, preferably SparkR, so that the desired results would be like the following.
library(arules)
mtcars %>% mutate(bins = discretize(x = mpg, method = "interval", breaks = 4))
I checked the document but could see the non-R solutions only at https://spark.apache.org/docs/2.2.0/ml-features.html#bucketizer.
Please advise.
In general SparkR provides a very limited subset of ML functions (full support is planned for Spark 3.0 as a separate R package, see SPARK-24359 SPIP: ML Pipelines in R), though simple discretization like this can be performed using CASE ... WHEN ... statements.
First compute the breaks:
df <- createDataFrame(mtcars)
min_max <- df %>%
select(min(df$mpg), max(df$mpg)) %>%
collect() %>%
unlist()
n <- 4
breaks <- seq(min_max[[1]], min_max[[2]], length.out = n + 1) # n buckets need n + 1 break points
Then generate expression:
bucket <- purrr::map2(
breaks[-(n + 1)], breaks[-1],
function(x, y) between(column("mpg"), c(x, y))) %>%
purrr::reduce2(
., seq(length(.)),
function(acc, x, y) otherwise(when(x, y), acc),
.init = lit(NA))
df %>% withColumn("bucket", bucket)
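To sanity-check the bucket edges locally before running anything on Spark, the same equal-width scheme can be sketched in plain R with cut(). This is only a local illustration on the in-memory mtcars data and does not use SparkR. Note that n equal-width bins need n + 1 break points.

```r
n   <- 4
mpg <- mtcars$mpg

# n + 1 evenly spaced break points between the minimum and the maximum.
breaks <- seq(min(mpg), max(mpg), length.out = n + 1)

# include.lowest = TRUE keeps the minimum value inside the first bin.
bins <- cut(mpg, breaks = breaks, include.lowest = TRUE)

table(bins)
```

If the counts per bin look as expected here, the same break points can be reused when building the Spark column expression.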
Trying to get my head around Non-Standard Evaluation as used by dplyr but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.
Simplified version of my function...
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = mean(~to.sum, na.rm = TRUE))
return(results)
}
And running it with some dummy data...
set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
[1] 1.881721
mean(temp$eg2)
[1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
n mean
1 100 NA
N is calculated, but the mean is not, can't figure out why.
Ultimately I'd like my function to be more general, along the lines of...
my_summarise <- function(df = temp,
group.by = 'group'
to.sum = c('eg1', 'eg2'),
...){
results <- list()
## Select columns
df <- dplyr::select_(df, .dots = c(group.by, to.sum))
## Summarise overall
results$all <- summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
## Summarise by specified group
results$by.group <- group_by_(df, ~to.group) %>%
summarise_each(df,
funs(n = ~n(),
mean = mean(~to.sum, na.rm = TRUE)))
return(results)
}
...but before I move on to this more complex version (I was using this example for guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block; the call to dplyr::select() works OK.
Appreciate any advice as to where I'm going wrong.
Thanks in advance
The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:
my_summarise <- function(df = temp,
to.sum = 'eg1',
...){
## Summarise
results <- summarise_(df,
n = ~n(),
mean = lazyeval::interp(~mean(x, na.rm = TRUE),
x = as.name(to.sum)))
return(results)
}
Here is what I do when I struggle to get things working:
Remember that, just like the ~n() you already have, the call will have to start with a ~.
Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
In this case I would probably often write interp(~mean(x, na.rm = TRUE), x = to.sum). But running that will give us ~mean("eg1", na.rm = TRUE) which is treating eg1 as a character instead of a variable name. So we use as.name, as is taught to us in vignette("nse").
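The string-versus-name distinction in the last step can be seen without lazyeval at all. The following is a base R sketch using bquote(), which is a different tool from interp() but shows the same effect. The small temp data frame here is made up for illustration.

```r
temp   <- data.frame(eg1 = c(1, 2, NA, 4), eg2 = c(10, 20, 30, 40))
to.sum <- "eg1"

# Splicing in the string gives a call on the literal text "eg1".
call_chr <- bquote(mean(.(to.sum), na.rm = TRUE))
print(call_chr)  # mean("eg1", na.rm = TRUE)

# Splicing in a name gives a call on the column eg1.
call_sym <- bquote(mean(.(as.name(to.sum)), na.rm = TRUE))
print(call_sym)  # mean(eg1, na.rm = TRUE)

# Evaluating the second call with the data frame as the environment
# looks up eg1 as a column.
res_sym <- eval(call_sym, temp)
```

Taking the mean of a character string is what produces the NA in the question. The call built from a name evaluates against the actual column.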
I have a data frame md:
md <- data.frame(x = c(3,5,4,5,3,5), y = c(5,5,5,4,4,1), z = c(1,3,4,3,5,5),
device1 = c("c","a","a","b","c","c"), device2 = c("B","A","A","A","B","B"))
md[2,3] <- NA
md[4,1] <- NA
md
I want to calculate means by device1 / device2 combinations using dplyr:
library(dplyr)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean))
However, I am getting some NAs. I want the NAs to be ignored (na.rm = TRUE) - I tried, but the function doesn't want to accept this argument.
Both these lines result in error:
md %>% group_by(device1, device2) %>% summarise_each(funs(mean), na.rm = TRUE)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean, na.rm = TRUE))
The other answers showed you the syntax for passing mean(., na.rm = TRUE) into summarize/_each.
Personally, I deal with this so often and it's so annoying that I just define the following convenience set of NA-aware basic functions (e.g. in my .Rprofile), such that you can apply them with dplyr with summarize(mean_) and no pesky arg-passing; also keeps the source-code cleaner and more readable, which is another strong plus:
mean_ <- function(...) mean(..., na.rm=T)
median_ <- function(...) median(..., na.rm=T)
sum_ <- function(...) sum(..., na.rm=T)
sd_ <- function(v) sqrt(sum_((v-mean_(v))^2) / sum(!is.na(v))) # population SD over the non-missing values only
cor_ <- function(...) cor(..., use='pairwise.complete.obs')
max_ <- function(...) max(..., na.rm=T)
min_ <- function(...) min(..., na.rm=T)
pmax_ <- function(...) pmax(..., na.rm=T)
pmin_ <- function(...) pmin(..., na.rm=T)
table_ <- function(...) table(..., useNA='ifany')
mode_ <- function(...) {
tab <- table(...)
names(tab[tab==max(tab)]) # table() drops NA values by default, so NA is never returned as the mode
}
clamp_ <- function(..., minval=0, maxval=70) pmax(minval, pmin(maxval,...))
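A minimal usage sketch for one of these helpers, assuming the md data frame from the question and using base R's aggregate() rather than dplyr, so it runs without extra packages:

```r
mean_ <- function(...) mean(..., na.rm = TRUE)

md <- data.frame(x = c(3, 5, 4, 5, 3, 5), y = c(5, 5, 5, 4, 4, 1), z = c(1, 3, 4, 3, 5, 5),
                 device1 = c("c", "a", "a", "b", "c", "c"),
                 device2 = c("B", "A", "A", "A", "B", "B"))
md[2, 3] <- NA
md[4, 1] <- NA

# The helper is passed by name, with no extra na.rm argument to thread through.
out <- aggregate(md[c("x", "y", "z")],
                 by = md[c("device1", "device2")],
                 FUN = mean_)
out
```

A group whose values are all missing (here x for device1 == "b") comes back as NaN, because nothing is left to average once the NAs are dropped.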
Really you want to be able to flick one global switch once and for all, like na.action/na.pass/na.omit/na.fail to tell functions as default behavior what to do, and not throw errors or be inconsistent, as they currently do, across different packages.
There used to be a CRAN package called Defaults for setting per-function defaults, but it has not been maintained since 2014 (pre-R 3.x). For more about it, see "Setting Function Defaults R on a Project Specific Basis".
try:
library(dplyr)
md %>% group_by(device1, device2) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
Simple as that:
funs(mean(., na.rm = TRUE))
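summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the equivalent call with across(), assuming dplyr 1.0.0 or later and the md data frame from the question:

```r
library(dplyr)

md <- data.frame(x = c(3, 5, 4, 5, 3, 5), y = c(5, 5, 5, 4, 4, 1), z = c(1, 3, 4, 3, 5, 5),
                 device1 = c("c", "a", "a", "b", "c", "c"),
                 device2 = c("B", "A", "A", "A", "B", "B"))
md[2, 3] <- NA
md[4, 1] <- NA

# across(everything(), ...) applies the lambda to every non-grouping column;
# .x plays the role that . played inside funs().
res <- md %>%
  group_by(device1, device2) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
res
```

The lambda form keeps na.rm = TRUE next to the function it belongs to, which is the same idea as funs(mean(., na.rm = TRUE)).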