How to apply lm() to datasets split by factors

In a housing dataset, there are three variables: bsqft (the building size of the house), county (a factor variable with 9 levels), and price. I would like to fit a separate regression of price on bsqft for each county. Instead of calling lm() repeatedly, I would prefer to use an apply-style function in R, but I have no idea how to set it up. Could anyone help me with that? Thanks a lot.

You can use dplyr and broom to run the regression within each group and collect the coefficients back into a single data frame:
library(dplyr)
library(broom)
your_dataset %>%
  group_by(county) %>%
  do(tidy(lm(price ~ bsqft, data = .)))
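Since the question asks for an apply-style approach, here is a minimal base R sketch of the same idea using split() and lapply(). The housing data frame below is a simulated stand-in (an assumption for illustration); only the column names bsqft, county and price come from the question.

```r
# Hypothetical stand-in for the housing data: column names follow the
# question, the values themselves are simulated.
set.seed(1)
housing <- data.frame(
  county = factor(rep(c("A", "B", "C"), each = 20)),
  bsqft  = runif(60, min = 800, max = 3000)
)
housing$price <- 50000 + 120 * housing$bsqft + rnorm(60, sd = 20000)

# split() cuts the data frame into one piece per county;
# lapply() fits lm() on each piece and returns a named list of models.
fits <- lapply(split(housing, housing$county),
               function(d) lm(price ~ bsqft, data = d))

# Collect the intercept and slope of every county into one matrix.
coefs <- t(sapply(fits, coef))
coefs
```

Each element of fits is an ordinary lm object, so summary(fits[["A"]]) or predict() can be used on an individual county as usual.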

Related

Benford’s Law by group in R

I am attempting to implement Benford’s Law using the benford.analysis package in R across all vendors’ invoices. Over the entire dataset, the data conforms. I’m trying to find a way to group by vendor to determine whether any individual vendor is displaying fraud indicators by not conforming. Is there a way to break out non-conforming vendors by group?

Here is a way to use group_by and group_map to create benford.analysis plots for each group. In this example, the iris data is grouped by Species and the analysis is run on the Sepal.Length variable.
In group_map(), .x is the subset of the data for the current group, and .y is a one-row tibble holding that group's key.
library(dplyr)
library(benford.analysis)
iris %>%
  group_by(Species) %>%
  group_map(.f = ~ plot(benford(.x$Sepal.Length)))
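To rank groups by how far they deviate, rather than inspecting one plot per group, a numeric score per group can help. The following is a package-free sketch in base R, not the benford.analysis computation: it looks at the first digit only and uses the mean absolute deviation from the expected Benford proportions. The helper names first_digit and benford_mad are made up for this example, and iris is used only because the answer above uses it.

```r
# Expected first-digit proportions under Benford's Law: log10(1 + 1/d)
benford_expected <- log10(1 + 1 / (1:9))

# Leading significant digit of each finite, non-zero number,
# read off the scientific-notation string (e.g. "5.1...e+00" -> 5)
first_digit <- function(x) {
  x <- abs(x[is.finite(x) & x != 0])
  as.integer(substr(sprintf("%.15e", x), 1, 1))
}

# Mean absolute deviation between observed and expected digit proportions
benford_mad <- function(x) {
  digits <- first_digit(x)
  observed <- tabulate(digits, nbins = 9) / length(digits)
  mean(abs(observed - benford_expected))
}

# One deviation score per group; larger values indicate a poorer fit
mad_by_group <- sapply(split(iris$Sepal.Length, iris$Species), benford_mad)
sort(mad_by_group, decreasing = TRUE)
```

With invoice data, the same call would split the amount column by the vendor column; which score counts as non-conforming is a judgment call that this sketch does not settle.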

Taking the mean of a multitude of variables grouped by a set of categorical variables

I have a data frame with 500 columns and 50 rows. One column is a categorical variable with 3 categories, and the rest are continuous variables. How do I group the data frame by the categorical variable and take the mean of every continuous column within each category? I also want to remove all NAs and create a new data frame from the result.
Best,
Henry

When posting to SO, please make sure to include a reproducible example of your data (dput is helpful for this). As it is, I can only guess at the structure of your data.
I like doing general grouping/summarising operations with dplyr. Using iris as an example, you might be able to do something like this:
library(dplyr)
library(tidyr)
data(iris)
iris %>%
  drop_na() %>%
  group_by(Species) %>%
  summarise_all(mean)
summarise_all automatically uses all non-grouping columns and takes the function you want to apply. Note that drop_na() removes every row that contains an NA in any column.
With dplyr 1.0.0 or later, you could also do something like
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))
since summarise_all has been superseded by across().
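If dropping whole rows is too aggressive for a 500-column data frame, one alternative is to ignore NAs column by column inside the summary. This is a sketch assuming dplyr 1.0.0 or later; iris_na is a hypothetical copy of iris with one value blanked out to show the effect.

```r
library(dplyr)

# iris has no missing values, so blank one out for the demonstration
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

# na.rm = TRUE drops NAs per column, so a row with a single NA
# still contributes to the means of its other columns
means_by_species <- iris_na %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

means_by_species
```

The result is a new data frame with one row per category and one column per numeric variable.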

Regression model over factor levels using dplyr : getting repeated errors

I'm trying to run a logistic regression model over several factor levels in my data frame, and I'm getting the same results replicated for each factor level instead of each level's own model parameters. The same thing happens when I run this code on the diamonds dataset:
diamonds$E <- if_else(diamonds$color == 'E', 1, 0) # make 'E' binary
fitted_models <- diamonds %>%
  group_by(clarity) %>% # group by clarity
  do(model = glm(E ~ price, # regress E on price
                 data = diamonds,
                 family = binomial(link = 'logit')))
fitted_models %>%
  tidy(model) %>%
  View # use broom package to look
I'm stuck as to why I'm having this particular issue.

The issue is in your glm call. Remove data=diamonds and replace it with data = . (a single dot):
fitted_models <- diamonds %>%
  group_by(clarity) %>% # group by clarity
  do(model = glm(E ~ price, # regress E on price
                 data = .,
                 family = binomial(link = 'logit')))
fitted_models %>%
  tidy(model)
Whenever you use do(), you need to refer to the grouped data frame as . (the dot). As your code currently reads, every group fits the model to the original, ungrouped diamonds data frame rather than to the subset passed to do() by the pipe, which is why the results are identical. Likewise, inside do() you cannot refer to the column E by its bare name; you need .$E. An alternative solution would be glm(.$E ~ .$price).
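For readers on current package versions: dplyr now marks do() as superseded, and as far as I know recent broom releases no longer tidy the rowwise result of do(model = ...) directly. A sketch of the same per-group fit using group_modify() (assuming dplyr 0.8.1 or later, with broom and ggplot2 installed) could look like this:

```r
library(dplyr)
library(broom)

data(diamonds, package = "ggplot2")

coefs_by_clarity <- diamonds %>%
  mutate(E = if_else(color == "E", 1, 0)) %>%   # make 'E' binary
  group_by(clarity) %>%
  # .x is the subset of rows for the current clarity level
  group_modify(~ tidy(glm(E ~ price, data = .x,
                          family = binomial(link = "logit"))))

coefs_by_clarity
```

Because each model is fitted to .x, the coefficient table has two rows (intercept and price) per clarity level, and the estimates differ between levels.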

Correlations between vectors in two groups (defined by: group_by)

I want to compute the correlation between two vectors that belong to two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n=5
bmi<-rnorm(n=n,mean=25)
glucose<-rnorm(n=n,mean=5)
insulin<-rnorm(n=n,mean=10)
id<-rep(paste0("id",1:n),3)
myData<-data.frame(id=id,measurement=c(rep("BMI",n),rep("glucose",n),rep("insulin",n)),values=c(bmi,glucose,insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, for example:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))
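One possible approach, sketched under the assumption that tidyr 1.0.0 or later is available: reshape the long data so that each measurement becomes its own column, matched by id, and then call cor() on the two columns. The dummy data is rebuilt here (with a seed added) so the example is self-contained.

```r
library(dplyr)
library(tidyr)

# Dummy data in the same long layout as in the question
set.seed(42)
n <- 5
myData <- data.frame(
  id          = rep(paste0("id", 1:n), 3),
  measurement = rep(c("BMI", "glucose", "insulin"), each = n),
  values      = c(rnorm(n, mean = 25), rnorm(n, mean = 5), rnorm(n, mean = 10))
)

# One row per id, one column per measurement
wide <- myData %>%
  pivot_wider(names_from = measurement, values_from = values)

# Correlation between two of the former groups
result <- wide %>%
  summarise(r = cor(glucose, insulin))
result
```

Pivoting by id guarantees that the glucose and insulin values being correlated come from the same subject, which plain subsetting of the values column would only do if the rows happened to be in the same order.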

running lmer with a by/group by statement?

I'm trying to find a quick way to run an lmer model separately for each level of a grouping variable (in SAS one can use the by= statement). I have tried using dplyr for this with some code I found:
t1<- mod1 %>% group_by(c) %>% do(function(df){lmer(m1.formula,data=df)})
but that doesn't seem to work.
Anyone know how to do this using dplyr or another method?

library("lme4")
data(Orthodont,package="nlme")
There are two fundamental issues you might want to consider here:
statistical: as commented above, it's a bit odd to think about running mixed models separately on each stratum (level of a grouping variable) within a data set. Usually the entire point of mixed models is to fit one model to the combined data set, although I can certainly imagine exceptions (below, I fit separate mixed models by sex). You might be looking for something like the lmList function (both nlme and lme4 have versions), which fits (generalized) linear models (not mixed models) to each stratum separately. This makes more sense, especially as an exploratory technique.
computational: doing specifically what you asked for in the dplyr framework is a bit difficult, because the basic dplyr paradigm assumes that you are operating on data frames (or data tables), possibly grouped, and returning data frames. That means that the pieces returned by each operation must be data frames (not, e.g., merMod model objects). (@docendodismus points out that you can do it by specifying do(model = ...) in the code below, but I think the structure of the resulting object is a bit weird, and I would encourage you to rethink your question, as illustrated below.)
In base R, you can just do this:
lapply(split(Orthodont,Orthodont$Sex),
lmer,formula=distance~age+(1|Subject))
or
by(Orthodont,Orthodont$Sex,
lmer,formula=distance~age+(1|Subject))
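As a small follow-on sketch (assuming lme4 and nlme are installed), the list returned by the lapply() call can be kept and summarised, for example by pulling the fixed effects of each fit into one table. The names fits and fixed are just illustrative.

```r
library(lme4)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)  # drop the 'groupedData' attributes

# One mixed model per sex, stored in a named list
fits <- lapply(split(Orthodont, Orthodont$Sex),
               lmer, formula = distance ~ age + (1 | Subject))

# Fixed-effect estimates, one row per sex
fixed <- t(sapply(fits, fixef))
fixed
```

Each list element is a regular merMod object, so summary(), VarCorr() and similar functions work on fits[["Male"]] or fits[["Female"]] individually.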
Digression: If you want to fit linear (unmixed) models to each subject, you can use
## first remove 'groupedData' attributes from the data, which don't
## work with lme4's version of lmList
Orthodont <- data.frame(Orthodont)
lmList(distance~age|Subject,Orthodont)
## note: not including Sex, which doesn't vary within subjects
Return to main thread: In the plyr (ancestor of dplyr) framework you can fit separate mixed models by sex slightly more compactly:
library("plyr")
dlply(Orthodont,.(Sex),
lmer,formula=distance~age+(1|Subject))
detach("package:plyr")
If you want to do it in dplyr, you seem to need do() (I thought I could do without it, but I haven't found a way), and you need to create a function that returns a summary as a data frame.
library("dplyr")
Orthodont %>% group_by(Sex) %>%
do(lmer(.,formula=distance~age+(1|Subject)))
produces
## Error: Results are not data frames at positions: 1, 2
You could do this:
lmer_sum <- function(x, ...) {
  ## fit the model to the data frame for one group
  m <- lmer(data = x, ...)
  ## fixed effects plus random-effect variance, as a one-row data frame
  data.frame(rbind(c(fixef(m), unlist(VarCorr(m)))),
             check.names = FALSE)
}
(unlist(VarCorr(m)) gives the random-effect variance of the single scalar random effect; the whole data.frame(rbind(...)) construction is needed to convert a numeric vector into a one-row data frame; check.names=FALSE preserves the column name (Intercept).)
Orthodont %>% group_by(Sex) %>%
do(lmer_sum(.,formula=distance~age+(1|Subject)))
which gives reasonable results.

The problem is that you're calling do() incorrectly: it doesn't work with anonymous functions like that. Arguments to do() are evaluated in the context of the data, so when you write function(df), do() tries to use a df column of the data. There is no such column, so it fails (with a cryptic message).
You can refer to the entire data frame for the current group with ., and you don't need the anonymous function. Just call the (nested) functions directly with the . variable. Because lmer() returns a model object rather than a data frame, name the result so that do() stores it in a list-column:
t1 <- mod1 %>% group_by(c) %>% do(model = lmer(m1.formula, data = .))
Untested because you didn't provide a reproducible example.
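To make the do(model = ...) form mentioned in the answers above concrete, here is a sketch on the Orthodont data (assuming dplyr, lme4 and nlme are installed). Note that dplyr now marks do() as superseded, so treat this as an illustration of the resulting structure rather than a recommended pattern.

```r
library(dplyr)
library(lme4)
data(Orthodont, package = "nlme")
Orthodont <- data.frame(Orthodont)  # drop the 'groupedData' attributes

# Naming the result stores each fitted model in a list-column,
# which avoids the "Results are not data frames" error
models <- Orthodont %>%
  group_by(Sex) %>%
  do(model = lmer(distance ~ age + (1 | Subject), data = .))

models                      # one row per Sex, models in the 'model' column
fixef(models$model[[1]])    # fixed effects of the first group's fit
```

The individual fits are reached through models$model[[i]], which is the slightly awkward structure the first answer refers to.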
