Running lmer with a by/group by statement?

I'm trying to find a quick way to run an lmer model separately for each level of a grouping variable (in SAS one can use the by= statement). I have tried using dplyr for this with some code I found:
t1<- mod1 %>% group_by(c) %>% do(function(df){lmer(m1.formula,data=df)})
but that doesn't seem to work.
Anyone know how to do this using dplyr or another method?

library("lme4")
data(Orthodont,package="nlme")
There are two fundamental issues you might want to consider here:
statistical: as commented above, it's a bit odd to think about running mixed models separately on each stratum (grouping variable) within a data set. Usually the entire point of mixed models is to fit a model to the combined data set, although I can certainly imagine exceptions (below, I fit separate mixed models by sex). You might be looking for something like the lmList function (both nlme and lme4 have versions), which runs (generalized) linear models (not mixed models) on each stratum separately. This makes more sense, especially as an exploratory technique.
computational: doing specifically what you asked for in the dplyr framework is a bit difficult, because the basic dplyr paradigm assumes that you are operating on data frames (or data tables), possibly grouped, and returning data frames. That means that the bits returned by each operation must be data frames (not e.g. merMod model objects). (#docendodismus points out that you can do it by specifying do(model = ...) in the code below, but I think the structure of the resulting object is a bit weird and would encourage you to rethink your question, as illustrated below)
In base R, you can just do this:
lapply(split(Orthodont,Orthodont$Sex),
lmer,formula=distance~age+(1|Subject))
or
by(Orthodont,Orthodont$Sex,
lmer,formula=distance~age+(1|Subject))
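Once the per-group fits are stored in a list, summaries can be pulled out with the usual apply functions. The following is a minimal sketch, assuming lme4 is installed; the names fits and coefs are only illustrative and not part of the original answer.

```r
library(lme4)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)  # drop the 'groupedData' attributes

# Fit one random-intercept model per sex and keep the fits in a named list
fits <- lapply(split(Orthodont, Orthodont$Sex),
               function(d) lmer(distance ~ age + (1 | Subject), data = d))

# Collect the fixed effects: one column per sex, one row per coefficient
coefs <- sapply(fits, fixef)
coefs
```

Keeping the full model objects in a list means any other extractor (VarCorr, summary, predict) can be applied the same way later.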
Digression: If you want to fit linear (unmixed) models to each subject, you can use
## first remove 'groupedData' attributes from the data, which don't
## work with lme4's version of lmList
Orthodont <- data.frame(Orthodont)
lmList(distance~age|Subject,Orthodont)
## note: not including Sex, which doesn't vary within subjects
Return to main thread: In the plyr (ancestor of dplyr) framework you can fit separate mixed models by sex slightly more compactly:
library("plyr")
dlply(Orthodont,.(Sex),
lmer,formula=distance~age+(1|Subject))
detach("package:plyr")
If you want to do it in dplyr, you seem to need do() (I thought I could do without it, but I haven't found a way), and you need to create a function that returns a summary as a data frame.
library("dplyr")
Orthodont %>% group_by(Sex) %>%
do(lmer(.,formula=distance~age+(1|Subject)))
produces
## Error: Results are not data frames at positions: 1, 2
You could do this:
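lmer_sum <- function(x,...) {
  ## fit the model to one group's data; the formula arrives via ...
  m <- lmer(data=x,...)
  ## one-row data frame: fixed effects plus the random-effect variance
  data.frame(rbind(c(fixef(m),unlist(VarCorr(m)))),
             check.names=FALSE)
}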
(unlist(VarCorr(m)) gives the RE variance of the single scalar random effect; the whole data.frame(rbind(...)) thing is needed to convert a numeric vector into a one-row data frame; check.names=FALSE preserves the column name (Intercept))
Orthodont %>% group_by(Sex) %>%
do(lmer_sum(.,formula=distance~age+(1|Subject)))
which gives reasonable results.

The problem is that you're calling do() wrongly - it doesn't work with anonymous functions like that. Arguments in do() are evaluated in the context of the data, so when you say function(df), do will try to use the df column of the data. It doesn't have that column, so it fails (with a cryptic message).
You can refer to the entire data frame in the grouping with ., and you don't need the anonymous function. You just call the (nested) functions directly with the . variable:
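## name the result (model = ...) so that do() stores each fit in a list column;
## an unnamed result would have to be a data frame
t1 <- mod1 %>% group_by(c) %>% do(model = lmer(m1.formula, data = .))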
Untested because you didn't provide a reproducible example.
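For a reproducible version, here is a sketch of the do(model = ...) form on the Orthodont data used in the other answer. It assumes dplyr and lme4 are installed; do() is marked as superseded in recent dplyr releases but, as far as I know, is still available. The name fits is illustrative.

```r
library(dplyr)
library(lme4)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)  # drop the 'groupedData' attributes

# Naming the result stores each fitted model in a list column called 'model'
fits <- Orthodont %>%
  group_by(Sex) %>%
  do(model = lmer(distance ~ age + (1 | Subject), data = .))

# One row per group; the model objects are reached through the list column
fixef(fits$model[[1]])
```

The result is a data frame with one row per group, so the models have to be extracted from the list column before they can be summarised.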

Related

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions that I have been completely overlooking, instead of manually creating lists. Does anyone have any suggestions?
This can work, using the iris dataset as data, similar to the great suggestion from #deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
          Sepal.Length                     Sepal.Width
statistic 96.93744                         63.57115
parameter 2                                2
p.value   8.918734e-22                     1.569282e-14
method    "Kruskal-Wallis rank sum test"   "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
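If you would rather keep the formula-plus-data interface from the question, the formula can be built from the variable name. This is a sketch using base R's reformulate() on iris; it is an alternative to the answers here, not taken from them.

```r
vars <- c("Sepal.Length", "Sepal.Width")

# Build e.g. Sepal.Length ~ Species from the string, then pass data = iris
krusk <- lapply(vars, function(v) {
  kruskal.test(reformulate("Species", response = v), data = iris)
})
names(krusk) <- vars

# The test labels now carry the real variable names
krusk[["Sepal.Length"]]$data.name
```

A side benefit is that the data.name component of each result reads "Sepal.Length by Species" rather than "iris[[x]] by ...".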
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that expect a single vector as input, as opposed to functions that have a data argument, in which you use variable rather than data$variable.
I have a few sidebars before I give some advice
Sidebar 1 - S3 Methods
While this may be beside the point of the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument. In this example you are passing a formula, for which the data argument is needed, whereas the default method just requires the x and g arguments (so you could probably keep your original pipeline).
So, if you're used to doing something one way, check the documentation to see whether the function has a different dispatch method that will work for you.
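As a small illustration of the default method, here is a sketch on iris that passes plain vectors (x and g) and collects just the p-values. The name p_values is illustrative.

```r
vars <- c("Sepal.Length", "Sepal.Width")

# kruskal.test.default takes the data vector and the grouping vector directly
p_values <- vapply(
  vars,
  function(v) kruskal.test(iris[[v]], iris$Species)$p.value,
  numeric(1)
)
p_values
```

vapply() is used instead of sapply() so the result is guaranteed to be a named numeric vector.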
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
## a list of the vectors themselves; strings such as "data$age" would not be evaluated
vars <- list(age = data$age, gender = data$gender, PCLR = data$PCLR)
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that the variable names are looked up within a given data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], data[[id]], na.rm=TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before calling sapply. With the dplyr package you can possibly avoid this if there is a condition you can select columns on.
data %>%
group_by(group) %>% #groups data
summarise_if(is.numeric, mean, na.rm = TRUE) #applies mean to every column that is a numeric vector
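In dplyr 1.0.0 and later, summarise_if() is superseded by across(). The following sketch shows the same idea on iris, with Species standing in for the group column (an assumption, since the original data are not available).

```r
library(dplyr)

# Mean of every numeric column within each group
means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
means
```

This returns one row per group and one column per numeric variable, without any hand-written list of names.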
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit (pivot_longer comes from tidyr).
data %>%
group_by(group) %>% #grouping to retain column in the next select statement
select_if(is.numeric) %>% # selecting all numeric columns
pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `value`
group_by(name) %>% #regroup on name variable (old numeric columns)
summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
I've only mentioned purrr because you can almost drop and replace all apply style functions with their map variants. purrr is very consistent across its function variants with a lot of options to control the output type.
I hope this helps and wish you luck on your coding adventures.

how to apply lm() to datasets split by factors

In a housing dataset, there are three variables, which are bsqft (the building size of the house), county(a factor variable with 9 levels) and price. I would like to fit an individual regression line using bsqft and price for each separate county. Instead of calling lm() function repeatedly, I prefer using apply function in r but have no idea to create it. Could anyone help me with that? Thanks a lot.
You can use dplyr and broom to do regressions by group and summarise the information back into a dataframe
library(dplyr)
library(broom)
your_dataset %>%
group_by(county) %>%
do(tidy(lm(price ~ bsqft, data=.)))
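Since the question asks for an apply-style solution, here is a base R sketch of the same idea. The housing data are not available, so mtcars stands in: cyl plays the role of county, and mpg ~ wt the role of price ~ bsqft.

```r
# Split the data by the factor, then fit one lm() per piece
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ wt, data = d))

# One row per group, one column per coefficient
coefs <- t(sapply(fits, coef))
coefs
```

With the real data this would become split(your_dataset, your_dataset$county) and lm(price ~ bsqft, data = d).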

Correlations between vectors in two groups (defined by: group_by)

I want to make a correlation between two vectors in two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n=5
bmi<-rnorm(n=n,mean=25)
glucose<-rnorm(n=n,mean=5)
insulin<-rnorm(n=n,mean=10)
id<-rep(paste0("id",1:n),3)
myData<-data.frame(id=id,measurement=c(rep("BMI",n),rep("glucose",n),rep("insulin",n)),values=c(bmi,glucose,insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, for example:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))
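One possible approach, offered as a sketch rather than a definitive answer, is to reshape the long data so that each measurement becomes its own column, after which cor() can see both vectors. It assumes tidyr (1.0.0 or later) is available alongside dplyr, and that every id has exactly one value per measurement; set.seed() is added only to make the dummy data reproducible.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # only so the dummy data are reproducible
n <- 5
myData <- data.frame(
  id = rep(paste0("id", 1:n), 3),
  measurement = rep(c("BMI", "glucose", "insulin"), each = n),
  values = c(rnorm(n, mean = 25), rnorm(n, mean = 5), rnorm(n, mean = 10))
)

# One row per id, one column per measurement
wide <- myData %>%
  pivot_wider(names_from = measurement, values_from = values)

# Both vectors are now ordinary columns, paired by id
result <- wide %>% summarise(r = cor(glucose, insulin))
result
```

Pairing by id in the reshape step matters: the correlation is only meaningful if the glucose and insulin values on each row belong to the same subject.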

Extract object from list using dplyr

This is related to the questions 1 and 2
I have a list of objects (in my case they are also lists AFAII), as returned by running:
gof_stats <- models %>% map(gof_stats)
Where models is a list of models created by fitdistrplus and gof_stats is a function that computes goodness of fit stats for each model.
Now if I want to extract a specific stat from that list I could do something like:
gof_stats[[1]]$cvm
to get the Cramer von Mises stat. I can achieve the same over the whole list (as per the linked questions) like so:
cvms <- sapply(gof_stats, "[[", "cvm")
Is there a way to do the same using dplyr/purrr syntax?
BONUS: How would you handle the case where some of the elements in the models list are NULL?
If you prefer map to sapply for this, you can do
library(purrr)
map(gof_stats, ~ .x[["cvm"]])
If you just like pipes you could do
gof_stats %>% sapply("[[", "cvm")
Your question is about lists, not data frames, so dplyr doesn't really apply. You may want to look up ?magrittr::multiply_by to see a list of other aliases from the package that defines %>% as you seem to like piping. For example, magrittr::extract2 is an alias of [[ that can be easily used in the middle of a piping chain.
As for your bonus, I would pre-filter the list to remove NULL elements before attempting to extract things.
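A sketch of that pre-filtering step with purrr, using a small made-up list in place of the real gof_stats (the values are hypothetical):

```r
library(purrr)

# Hypothetical stand-in for the real list; element 'b' is NULL
gof_stats <- list(
  a = list(cvm = 0.1, ad = 0.5),
  b = NULL,
  c = list(cvm = 0.3, ad = 0.9)
)

# Option 1: drop NULL elements first, then extract by name
cvms <- gof_stats %>% compact() %>% map_dbl("cvm")

# Option 2: keep every position and fill the gaps with NA
cvms_na <- map_dbl(gof_stats, "cvm", .default = NA_real_)
```

The first option shortens the result, while the second keeps it aligned with the original models list, which is usually easier to join back to other per-model information.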
A solution composed completely of tidyverse functions is:
gof_stats %>%
map(~ .x %>% pluck("cvm"))
Legend:
map is the tidyverse function for lists, replacing the apply family
.x is the object in each iteration
~ is the purrr short syntax for the anonymous function function(x) { }
pluck extracts a list element by index or string

Removing rows based upon frequency of factor/categorical value in a column

I have a dataset that I will be performing cross validation training upon. However, due to this splitting of the data, I sometimes encounter errors because the factor level found in the test set was not found in the training set ---- because this factor might occur a very limited number of times.
I would like a way to easily filter out these rows prior to doing any cross validation to avoid errors...
For example, how would I make sure that rows whose factor level has 9 or fewer observations are removed?
mtcars$carb = factor(mtcars$carb)
table(mtcars$carb)
Using library dplyr, you can try something like this:
library(dplyr)
mtcars %>% group_by(carb) %>% filter(n() > 9)
Alternatively, you can re-factor the variable in the training set and remove from the test set any rows whose levels do not appear in the training data.
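Note that filtering rows does not remove the now-empty levels from the factor itself. Here is a base R sketch that does both; the names counts, keep and filtered are illustrative.

```r
mtcars$carb <- factor(mtcars$carb)

# Count the observations per level and keep the levels with more than 9
counts <- table(mtcars$carb)
keep <- names(counts)[counts > 9]

# Subset the rows, then drop the unused factor levels
filtered <- droplevels(mtcars[mtcars$carb %in% keep, ])
table(filtered$carb)
```

In the dplyr version the same effect can be had by piping the filtered result into droplevels().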
