Extract object from list using dplyr - r

This is related to questions 1 and 2.
I have a list of objects (in my case they are also lists, as far as I can tell), as returned by running:
gof_stats <- models %>% map(gof_stats)
Where models is a list of models created by fitdistrplus and gof_stats is a function that computes goodness of fit stats for each model.
Now if I want to extract a specific stat from that list I could do something like:
gof_stats[[1]]$cvm
to get the Cramér-von Mises statistic. I can achieve the same over the whole list (as per the linked questions) like so:
cvms <- sapply(gof_stats, "[[", "cvm")
Is there a way to do the same using dplyr/purrr syntax?
BONUS: How would you handle the case where some of the elements in the models list are NULL?

If you prefer map to sapply for this, you can do
library(purrr)
map(gof_stats, ~ .x[["cvm"]])
If you just like pipes you could do
gof_stats %>% sapply("[[", "cvm")
Your question is about lists, not data frames, so dplyr doesn't really apply. You may want to look up ?magrittr::multiply_by to see a list of other aliases from the package that defines %>% as you seem to like piping. For example, magrittr::extract2 is an alias of [[ that can be easily used in the middle of a piping chain.
As for your bonus, I would pre-filter the list to remove NULL elements before attempting to extract things.
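For the bonus, here is a minimal sketch of two ways to cope with NULL elements. The gof_stats list below is a made-up stand-in (the values are illustrative, not real fitdistrplus output). compact() drops the NULLs, while the .default argument of map_dbl() keeps one slot per model and fills in NA.

```r
library(purrr)

# Hypothetical stand-in for the real list; the second model failed and is NULL
gof_stats <- list(
  list(cvm = 0.10, ad = 0.50),
  NULL,
  list(cvm = 0.30, ad = 0.90)
)

# Option 1: drop NULL elements first, then extract "cvm" from what is left
cvms_dropped <- gof_stats %>% compact() %>% map_dbl("cvm")

# Option 2: keep one result per model, with NA where the element is NULL
cvms_padded <- map_dbl(gof_stats, "cvm", .default = NA_real_)
```

Option 2 is probably the safer choice when the results have to be matched back to the original models by position.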

A solution composed completely of tidyverse functions is:
gof_stats %>%
map(~ .x %>% pluck("cvm"))
Legend:
map is the tidyverse function for lists, replacing the apply family
.x is the object in each iteration
~ is purrr's shorthand for an anonymous function; ~ .x %>% pluck("cvm") is equivalent to function(.x) .x %>% pluck("cvm")
pluck extracts a list element by index or string
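One difference from the sapply version is worth spelling out: map() always returns a list, whereas sapply() simplifies to a vector when it can. A small sketch, using an invented two-element gof_stats list, of how map_dbl() gives the vector result:

```r
library(purrr)

# Hypothetical goodness-of-fit results for two models
gof_stats <- list(
  list(cvm = 0.10, ks = 0.05),
  list(cvm = 0.30, ks = 0.08)
)

# Passing a string to map() extracts the element with that name
as_list <- map(gof_stats, "cvm")        # a list of length 2

# map_dbl() does the same but returns a numeric vector, like sapply()
as_vector <- map_dbl(gof_stats, "cvm")
```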

Related

R: How to apply a similar mutate() to multiple data frames with purrr without creating a list?

I have seen similar posts, but none that address this question specifically.
So, I have three data frames that are similar, and I'm applying a similar mutate() to each of them.
xk <- xk %>%
  mutate(country="Kosovo",
         date=ym(date)) %>%
  relocate(country)
al <- al %>%
  mutate(country="Albania",
         date=ym(date)) %>%
  relocate(country)
mne <- mne %>%
  mutate(country="Montenegro",
         date=my(date)) %>%
  relocate(country)
Can I use one of the functions contained in the purrr package to do that with as few lines of code as possible?
There is a way that works with or without purrr:
Convert your data.frames to data.tables.
Use mapply (or purrr's equivalent) to apply the same operation to all tables.
You don't need the output of mapply, because data.table's := modifies each table by reference, without assignment to a new variable.
library(data.table)
xk <- as.data.table(xk)
al <- as.data.table(al)
mne <- as.data.table(mne)
mapply(function(x,y) x[,country:=y], x=list(xk,al,mne), y=c("Kosovo","Albania","Montenegro"))
print(xk)
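For comparison, a purrr-based sketch that also handles the date parsing from the question. It does go through a list, and the three small data frames, the argument names country_name and parse_fun, and the result name cleaned are all invented for illustration. pmap() walks over the data frames, country labels, and lubridate parsers in parallel.

```r
library(dplyr)
library(purrr)
library(lubridate)

# Hypothetical inputs: two tables store year-month, one stores month-year
xk  <- data.frame(date = c("2020-01", "2020-02"), value = 1:2)
al  <- data.frame(date = c("2020-01", "2020-02"), value = 3:4)
mne <- data.frame(date = c("01-2020", "02-2020"), value = 5:6)

cleaned <- pmap(
  list(
    df           = list(xk = xk, al = al, mne = mne),
    country_name = c("Kosovo", "Albania", "Montenegro"),
    parse_fun    = list(ym, ym, my)
  ),
  function(df, country_name, parse_fun) {
    df %>%
      mutate(country = country_name,
             date = parse_fun(date)) %>%
      relocate(country)
  }
)

# Optional: write the results back over the original variables
# list2env(cleaned, envir = globalenv())
```

Keeping the results in the named list cleaned is usually easier to work with afterwards; the commented list2env() call is only needed if the separate variables must be overwritten.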

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions that I have been completely overlooking, instead of manually creating lists. Anyone have any suggestions?
This works, using the iris dataset as example data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
          Sepal.Length                     Sepal.Width
statistic 96.93744                         63.57115
parameter 2                                2
p.value   8.918734e-22                     1.569282e-14
method    "Kruskal-Wallis rank sum test"   "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
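If the function really has to be called through its formula/data interface, one option is to build the formula from the variable name. This is a sketch on iris rather than the asker's data; reformulate() is base R and turns strings into a formula.

```r
vars <- c("Sepal.Length", "Sepal.Width")

# simplify = FALSE keeps each full test object in a named list
krusk <- sapply(vars, function(v) {
  f <- reformulate("Species", response = v)  # e.g. Sepal.Length ~ Species
  kruskal.test(f, data = iris)
}, simplify = FALSE)

# Individual results are then available by variable name
krusk$Sepal.Length$p.value
```

A side benefit is that data.name in each result then shows the real variable name instead of iris[[x]].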
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that just expect a single vector as input, as opposed to functions that have a data argument, in which you use variable as opposed to data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point with regard to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument in the function. In this example you are passing a formula, for which the data argument is necessary, whereas the default method just requires the x and g arguments (with which you could probably keep using your original pipelines).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has different dispatch methods that will work for you.
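As a quick illustration of the two methods, here is a sketch on iris (not the asker's data) showing that the default and formula interfaces run the same test:

```r
# Default method: pass the vector and the grouping factor directly
by_default <- kruskal.test(iris$Sepal.Length, iris$Species)

# Formula method: name the variables and let `data` supply them
by_formula <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Both dispatch paths give the same test statistic
all.equal(unname(by_default$statistic), unname(by_formula$statistic))
```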
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- list(age = data$age, gender = data$gender, PCLR = data$PCLR) # a list of the vectors themselves, not strings
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that the variables are looked up by name inside a given data.frame
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], id = data[[id]], na.rm=TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the packages dplyr, tidyr, and purrr. I'm sure there are a few things in them that will make your life easier.
For example, you mentioned having to manually make a list before calling sapply. With dplyr you can often avoid this if there is a condition you can select columns on.
data %>%
  group_by(group) %>% #groups data
  summarise_if(is.numeric, mean, na.rm = TRUE) #applies mean to every column that is a numeric vector
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
  group_by(group) %>% #grouping to retain column in the next select statement
  select_if(is.numeric) %>% # selecting all numeric columns
  pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `value`
  group_by(name) %>% #regroup on name variable (old numeric columns)
  summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
I've only mentioned purrr because you can replace almost all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type.
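As a small sketch of that substitution, here are the sapply calls from earlier rewritten with map_dbl(), which guarantees a numeric vector. It uses iris in place of the asker's data.

```r
library(purrr)

vars <- c("Sepal.Length", "Sepal.Width")

# sapply() and map_dbl() give the same named numeric vector here
means_sapply <- sapply(iris[vars], mean)
means_map    <- map_dbl(iris[vars], mean)

# The ~ shorthand stands in for an anonymous function; .x is the current column
p_values <- map_dbl(iris[vars], ~ kruskal.test(.x, iris$Species)$p.value)
```

Unlike sapply(), map_dbl() raises an error instead of silently returning a list if some result is not a single number.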
I hope this helps and wish you luck on your coding adventures.

returning a list from a user function using group_by in R

I have a data.frame, I would like to group the data by one of the columns and then apply a function, which operates on the remaining columns of the data. The function returns a list of mixed objects.
If I was just returning one value from the group I know that I could use something like:
df %>% group_by(Column_1) %>% summarise(my_function)
I also know that I could perform operations on a list using lapply, which will happily return a list. I'm just not sure how to combine these two pieces of knowledge to achieve my desired result.
Example code added; userFunction and the data are representative, but should give a good enough idea of what I want.
userFunction <- function(carData){
  return(list(
    a = carData$am * carData$carb,
    b = plot(carData$disp ~ carData$carb),
    c = mean(carData$drat)
  ))
}
mtcars %>%
  group_by(cyl) %>%
  summarise(userFunction)
I'd like to get back a list whose length is the number of levels in the column I group_by. Each element of the list should contain a, b and c.
This seems to work as I wanted:
this <- by(mtcars, mtcars$am, userFunction)
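Two other ways to get one list element per group, sketched here with grouping by cyl as in the question. group_summary is a hypothetical variant of userFunction that leaves out the plot() call so the example stays side-effect free.

```r
library(dplyr)

# Hypothetical variant of userFunction without the plot() call
group_summary <- function(carData) {
  list(
    a = carData$am * carData$carb,
    c = mean(carData$drat)
  )
}

# Base R: split into one data frame per cyl value, then apply the function
by_cyl <- lapply(split(mtcars, mtcars$cyl), group_summary)

# dplyr: group_map() calls the function once per group and returns a list
by_cyl_dplyr <- mtcars %>%
  group_by(cyl) %>%
  group_map(~ group_summary(.x))
```

The base R version names the elements after the group values ("4", "6", "8"), while group_map() returns an unnamed list in group order.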

Using dplyr's select where variable names are quoted [duplicate]

Often I'll want to select a subset of variables where the subset is the result of a function. In this simple case, I first get all the variable names which pertain to width characteristics
library(dplyr)
library(magrittr)
data(iris)
width.vars <- iris %>%
names %>%
extract(grep(".Width", .))
Which returns:
> width.vars
[1] "Sepal.Width" "Petal.Width"
It would be useful to be able to use these returns as a way to select columns. While I'm aware that contains() and its siblings exist, there are plenty of more complicated subsets I would like to perform, and this example is kept trivial for illustration.
If I was to attempt to use this function as a way to select columns, the following happens:
iris %>%
select(Species,
width.vars)
Error: All select() inputs must resolve to integer column positions.
The following do not:
* width.vars
How can I use dplyr::select with a vector of variable names stored as strings?
Within dplyr, most verbs have an alternate version ending in '_' that accepts strings as input; in this case, select_. These are typically what you have to use when you are using dplyr programmatically. Note that the underscore versions have been deprecated since dplyr 0.7.0.
iris %>% select_(.dots=c("Species",width.vars))
First of all, you can do the selection in dplyr with
iris %>% select(Species, contains(".Width"))
No need to create the vector of names separately. But if you did have a list of columns as string names, you could do
width.vars <- c("Sepal.Width", "Petal.Width")
iris %>% select(Species, one_of(width.vars))
See the ?select help page for all the available options.
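In current dplyr/tidyselect, one_of() has been superseded by all_of() and any_of(). A minimal sketch of the same selection with all_of(), which errors if any named column is missing; the result name selected is just for illustration.

```r
library(dplyr)

# Build the vector of column names as strings (base R equivalent of the question's pipe)
width.vars <- grep("\\.Width$", names(iris), value = TRUE)

# all_of() tells select() to treat the character vector as column names
selected <- iris %>% select(Species, all_of(width.vars))

names(selected)
```

any_of() has the same form but silently skips names that do not exist in the data.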

running lmer with a by/group by statement?

I'm trying to find a quick way to run an lmer model separately for each level of a grouping variable (in SAS one can use the by= statement). I have tried using dplyr for this with some code I found:
t1<- mod1 %>% group_by(c) %>% do(function(df){lmer(m1.formula,data=df)})
but that doesn't seem to work.
Anyone know how to do this using dplyr or another method?
library("lme4")
data(Orthodont,package="nlme")
There are two fundamental issues you might want to consider here:
statistical: as commented above, it's a bit odd to think about running mixed models separately on each stratum (grouping variable) within a data set. Usually the entire point of mixed models is to fit a model to the combined data set, although I can certainly imagine exceptions (below, I fit separate mixed models by sex). You might be looking for something like the lmList function (both nlme and lme4 have versions), which runs (generalized) linear models (not mixed models) on each stratum separately. This makes more sense, especially as an exploratory technique.
computational: doing specifically what you asked for in the dplyr framework is a bit difficult, because the basic dplyr paradigm assumes that you are operating on data frames (or data tables), possibly grouped, and returning data frames. That means that the bits returned by each operation must be data frames (not e.g. merMod model objects). (@docendodismus points out that you can do it by specifying do(model = ...) in the code below, but I think the structure of the resulting object is a bit weird and would encourage you to rethink your question, as illustrated below.)
In base R, you can just do this:
lapply(split(Orthodont,Orthodont$Sex),
lmer,formula=distance~age+(1|Subject))
or
by(Orthodont,Orthodont$Sex,
lmer,formula=distance~age+(1|Subject))
Digression: If you want to fit linear (unmixed) models to each subject, you can use
## first remove 'groupedData' attributes from the data, which don't
## work with lme4's version of lmList
Orthodont <- data.frame(Orthodont)
lmList(distance~age|Subject,Orthodont)
## note: not including Sex, which doesn't vary within subjects
Return to main thread: In the plyr (ancestor of dplyr) framework you can fit separate mixed models by sex slightly more compactly:
library("plyr")
dlply(Orthodont,.(Sex),
lmer,formula=distance~age+(1|Subject))
detach("package:plyr")
If you want to do it in dplyr, you seem to need do() (I thought I could do without it, but I haven't found a way), and you need to create a function that returns a summary as a data frame.
library("dplyr")
Orthodont %>% group_by(Sex) %>%
do(lmer(.,formula=distance~age+(1|Subject)))
produces
## Error: Results are not data frames at positions: 1, 2
You could do this:
lmer_sum <- function(x,...) {
  ## fit the model, then return fixed effects and RE variance as a one-row data frame
  m <- lmer(x,...)
  data.frame(rbind(c(fixef(m),unlist(VarCorr(m)))),
             check.names=FALSE)
}
(unlist(VarCorr(m)) gives the RE variance of the single scalar random effect; the whole data.frame(rbind(...)) thing is needed to convert a numeric vector into a one-row data frame; check.names=FALSE preserves the column name (Intercept))
Orthodont %>% group_by(Sex) %>%
do(lmer_sum(.,formula=distance~age+(1|Subject)))
which gives reasonable results.
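The same kind of summary can also be sketched without dplyr, by keeping the fitted models in a list and collecting the fixed effects afterwards. The names fits and coefs are illustrative; the model formula is the one used above.

```r
library(lme4)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)  # drop the 'groupedData' attributes

# One mixed model per level of Sex, kept as a named list of merMod objects
fits <- lapply(split(Orthodont, Orthodont$Sex), function(d) {
  lmer(distance ~ age + (1 | Subject), data = d)
})

# Stack the fixed-effect estimates into a matrix with one row per Sex level
coefs <- do.call(rbind, lapply(fits, fixef))
coefs
```

Keeping the full model objects in fits means that anything else, such as VarCorr() or summary(), can still be pulled out later.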
The problem is that you're calling do() wrongly - it doesn't work with anonymous functions like that. Arguments in do() are evaluated in the context of the data, so when you say function(df), do will try to use the df column of the data. It doesn't have that column, so it fails (with a cryptic message).
You can refer to the entire data frame in the grouping with ., and you don't need the anonymous function. You just call the (nested) functions directly with the . variable, naming the result so that do() stores the fitted models in a list-column (an unnamed do() expression must return a data frame):
t1 <- mod1 %>% group_by(c) %>% do(model = lmer(m1.formula, data = .))
Untested because you didn't provide a reproducible example.

Resources