Is it possible to perform regression inside aggregate function? [duplicate] - r

This question already has an answer here:
Can't get aggregate() work for regression by group
(1 answer)
Closed 4 years ago.
For example:
FP <- data.frame(A = 1:9, B = 11:19, C = 21:29, D = 31:39 ..... N = 145:153, Date: Jan 1 to Jan 9)
(I know the syntax above is wrong; it is just for illustration.)
There are n columns, say 14, plus an additional date column.
I need to run a simple linear regression of each of B, C, D, E, ..., N (dependent variables) on A (independent variable), separately and grouped by the date column. How can I make the aggregate function work for this? Or is there another function that would come in handy?

When fitting and saving models, you might want to work with lists:
FP <- data.frame(A = 1:9, B = 11:19, C = 21:29, D = rep(1:3,3))
lapply(split(FP, FP$D), function(x) lm(B + C ~ A, data = x))
#$`1`
#
#Call:
#lm(formula = B + C ~ A, data = x)
#Coefficients:
#(Intercept) A
# 30 2
#
#$`2`
#Call:
#lm(formula = B + C ~ A, data = x)
#Coefficients:
#(Intercept) A
# 30 2
#$`3`
#Call:
#lm(formula = B + C ~ A, data = x)
#Coefficients:
#(Intercept) A
# 30 2
First you split your data.frame by D and then run your regressions on those splits.
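Note that B + C ~ A models the sum of B and C, whereas the question asks for each dependent variable to be regressed on A separately. A minimal sketch of one way to do that, assuming the same toy FP data as above: with cbind() on the left-hand side, lm() fits one regression per response column within each group.

```r
FP <- data.frame(A = 1:9, B = 11:19, C = 21:29, D = rep(1:3, 3))

# cbind() on the left-hand side gives a matrix response,
# so lm() fits one regression per response column.
fits <- lapply(split(FP, FP$D), function(x) lm(cbind(B, C) ~ A, data = x))

# One coefficient matrix per group: rows are terms, columns are responses
coefs <- lapply(fits, coef)
coefs[["1"]]
```

Each element of coefs is a small matrix, so the slope of C on A in group 2 is coefs[["2"]]["A", "C"].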

Related

When setting up a mixed effects model in R, how is the model applied to all variables specified by the loop?

I am fitting a mixed effects model in R, and I want to apply it to every numeric variable in my dataset with a single piece of code. I wrote the code below but got an error. What can I do?
My data (df):
# 200 x 20
week weight height .......
<fct> <dbl> <dbl>
1 week1 50.0 160
2 week1 62.5 172
3 week2 49.6 155
4 week3 80.0 165
5 week2 56.8 163
6 week3 72.3 180
.
.
.
.
The mixed effects model is set up for a single variable as follows:
mixed.model <- lmer( weight ~ 1 + (1|week), data = df)
# ranova() comes from the lmerTest package
a=ranova(mixed.model)
a$`Pr(>Chisq)`
The code I wrote to apply it to multiple variables:
for (i in 2:(dim(df)[2])){
mixed.model <- lmer( i ~ 1 + (1|week), data = df)
}
The error I got:
Error in model.frame.default(data = df, drop.unused.levels = TRUE, : variable lengths differ (found for 'week')
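In your loop, i is a column index (a single number), so the formula i ~ 1 + (1|week) uses that number as the response, which is why the variable lengths differ. I would probably make a formula first and then pass the formula to the model function (thanks to Ben Bolker for the tip on reformulate):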
for (i in names(df)[-1]){
form <- reformulate("1 + (1|week)", response=i)
mixed.model <- lmer(form, data = df)
}
EDIT
In response to the comment about always getting the first variable, here's what I get when I run the loop, each time printing the formula:
df <- data.frame(
week = sample(1:3, 1000, replace=TRUE),
X1 = rnorm(1000),
X2 = rnorm(1000),
X3 = rnorm(1000)
)
library(lme4)
for (i in names(df)[-1]){
form <- reformulate("1 + (1|week)", response=i)
print(form)
# mixed.model <- lmer(form, data = df)
}
# X1 ~ 1 + (1 | week)
# X2 ~ 1 + (1 | week)
# X3 ~ 1 + (1 | week)
As you can see, there is a different formula for each iteration.
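The loop above overwrites mixed.model on every pass, so only the last fit survives. A minimal sketch of keeping every model in a named list, assuming simulated data shaped like the example (the names responses, forms and models are made up for illustration). The fitting step only runs if lme4 is installed.

```r
# Simulated stand-in for the real data: one grouping factor, three numeric columns
set.seed(1)
df <- data.frame(
  week = factor(sample(1:3, 300, replace = TRUE)),
  X1 = rnorm(300),
  X2 = rnorm(300),
  X3 = rnorm(300)
)

# Pick the numeric columns as responses (week is a factor, so it is skipped)
responses <- names(df)[vapply(df, is.numeric, logical(1))]

# Build one formula per response
forms <- lapply(responses, function(v) reformulate("1 + (1 | week)", response = v))
names(forms) <- responses

# Fit one model per formula; models stays NULL if lme4 is not available
models <- NULL
if (requireNamespace("lme4", quietly = TRUE)) {
  models <- lapply(forms, function(f) lme4::lmer(f, data = df))
}
```

Each fit can then be looked up by variable name, for example models$X2.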

R: subsetting within a function

Suppose I have a data frame in the environment, mydata, with three columns, A, B, C.
mydata = data.frame(A=c(1,2,3),
B=c(4,5,6),
C=c(7,8,9))
I can create a linear model with
lm(C ~ A, data=mydata)
I want a function to generalize this, to regress B or C on A, given just the name of the column, i.e.,
f = function(x){
lm(x ~ A, data=mydata)
}
f(B)
f(C)
or
g = function(x){
lm(mydata$x ~ mydata$A)
}
g(B)
g(C)
These solutions don't work. I know there is something wrong with the evaluation, and I have tried permutations of quo() and enquo() and !!, but no success.
This is a simplified example, but the idea is, when I have dozens of similar models to build, each fairly complicated, with only one variable changing, I want to do so without repeating the entire formula each time.
If we want to pass an unquoted column name, one option is {{}} from the tidyverse. With select, the function accepts both a string and an unquoted name:
library(dplyr)
printcol2 <- function(data, x) {
data %>%
select({{x}})
}
printcol2(mydata, A)
# A
#1 1
#2 2
#3 3
printcol2(mydata, 'A')
# A
#1 1
#2 2
#3 3
If the OP wants to pass an unquoted column name through to lm:
f1 <- function(x){
rsp <- deparse(substitute(x))
fmla <- reformulate("A", response = rsp)
out <- lm(fmla, data=mydata)
out$call <- as.symbol(paste0("lm(", deparse(fmla), ", data = mydata)"))
out
}
f1(B)
#Call:
#lm(B ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 3 1
f1(C)
#Call:
#lm(C ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 6 1
Maybe you are looking for deparse(substitute(.)). It turns the unquoted argument into a string, which you can then paste into a formula:
f = function(x, data = mydata){
y <- deparse(substitute(x))
fmla <- paste(y, 'Species', sep = '~')
lm(as.formula(fmla), data = data)
}
mydata <- iris
f(Sepal.Length)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 5.006 0.930 1.582
f(Petal.Width)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 0.246 1.080 1.780
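If the column names are already available as strings, non-standard evaluation can be skipped altogether. A minimal sketch, assuming the three-column mydata from the question; fit_on_A is a hypothetical helper name, not something from the answers above.

```r
mydata <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))

# fit_on_A (hypothetical helper): regress the named response column on A.
# `response` is a column name given as a string.
fit_on_A <- function(response, data) {
  fmla <- reformulate("A", response = response)
  lm(fmla, data = data)
}

# One model per response, stored in a named list
models <- lapply(c(B = "B", C = "C"), fit_on_A, data = mydata)
coef(models$B)
```

Passing strings makes it easy to loop over many response columns with lapply, at the cost of typing quotes in the call.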
I think generally, you might be looking for:
printcol <- function(x){
print(x)
}
printcol(mydata$A)
This doesn't involve any fancy evaluation; you just specify the column you want in the function call.
This gives us:
[1] 1 2 3
Note that you're only printing the vector A, and not actually subsetting column A from mydata.

Calculate line coefficients for each row of a dataset with NA values in R

I have a dataset with about 75 rows and 25 columns, each row shows one student and the columns show a score between 1 and 5.
   S1 S2 ..... S24
x1  0  2 .....   2
x2  1  3 .....  NA
x3 NA  4 .....   4
x4  4 NA .....   2
x5  4  3 .....   2
I want the intercept and slope of the fitted line for each row, ignoring the NA values in that row, and I want to add them to the original dataset. I am using the R code below, but it still seems to include the NA values.
df = read.csv('exc.csv')
Slope = function(x) {
TempDF = data.frame(x, survey=1:ncol(df))
lm(x ~ survey, data=TempDF,na.rm=TRUE)$coefficients[2]
}
Intercept = function(x) {
TempDF = data.frame(x, survey=1:ncol(df))
lm(x ~ survey, data=TempDF,na.rm=TRUE)$coefficients[1]
}
TData = as.data.frame(t(df))
dataset$Intercept = sapply(TData, Intercept)
dataset$slope = sapply(TData, Slope)
The regression by itself uses only complete pairs of non-NA values (lm has no na.rm argument; its default na.action is na.omit). So anything with NA values will not affect the slope or intercept in your case:
set.seed(100)
y = rnorm(100)
x = rnorm(100)
y[1:10] = NA
x[91:100] = NA
df = data.frame(x,y)
lm(y ~x,data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
0.02871 -0.15929
If we keep only the pairs where neither x nor y is NA, we get the same fit:
df = df[!is.na(df$x) & !is.na(df$y),]
lm(y ~x,data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
0.02871 -0.15929
If you need to drop the NAs explicitly, for example to also record how many points went into each fit, here's how you do it:
#simulate your data
df = data.frame(matrix(sample(1:5,25*5,replace=TRUE),ncol=25))
colnames(df) = paste("S",1:25,sep="")
#make some NAs
df[cbind(c(1,3,5),c(2,3,4))] <- NA
# fit once, take both intercept and slope
Coef = function(x) {
TempDF = data.frame(x, survey=1:ncol(df))
TempDF = TempDF[!is.na(x),]
c(lm(x ~ survey, data=TempDF)$coefficients,n=nrow(TempDF))
}
TData = as.data.frame(t(df))
dataset = data.frame(t(sapply(TData, Coef)))
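To attach the coefficients to the original rows, as the question asks, the transposition step can be replaced by apply() over rows. A minimal sketch on made-up scores (3 students, 5 surveys); the names scores, row_fit and result are assumptions for illustration, and the NA handling relies on lm's default na.action rather than manual filtering.

```r
# Hypothetical scores: one row per student, one column per survey
scores <- data.frame(
  S1 = c(1, NA, 5),
  S2 = c(2, 2, 4),
  S3 = c(3, 3, NA),
  S4 = c(NA, 4, 2),
  S5 = c(5, 5, 1)
)

# Fit score ~ survey index for one row; NA scores are dropped by lm()
row_fit <- function(y) {
  d <- data.frame(y = y, survey = seq_along(y))
  fit <- lm(y ~ survey, data = d)
  c(intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]),
    n = nobs(fit))  # number of non-NA points actually used
}

# apply() returns one column per row of scores, so transpose back
fits <- t(apply(as.matrix(scores), 1, row_fit))
result <- cbind(scores, fits)
result
```

The n column makes it easy to spot rows where too few scores remain for the slope to be meaningful.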

R regressions in a loop [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I am running a linear regression on some variables in a data frame. I'd like to be able to subset the linear regressions by a categorical variable, run the linear regression for each categorical variable, and then store the t-stats in a data frame. I'd like to do this without a loop if possible.
Here's a sample of what I'm trying to do:
a<- c("a","a","a","a","a",
"b","b","b","b","b",
"c","c","c","c","c")
b<- c(0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3)
c<- c(0.2,0.1,0.3,0.2,0.4,
0.2,0.5,0.2,0.1,0.2,
0.4,0.2,0.4,0.6,0.8)
cbind(a,b,c)
I can begin by running the following linear regression and pulling the t-statistic out very easily:
summary(lm(b~c))$coefficients[2,3]
However, I'd like to be able to run the regression for when column a is a, b, or c. I'd like to then store the t-stats in a table that looks like this:
variable t-stat
a 0.9
b 2.4
c 1.1
Hope that makes sense. Please let me know if you have any suggestions!
Here is a solution using dplyr and tidy() from the broom package. tidy() converts various statistical model outputs (e.g. lm, glm, anova, etc.) into a tidy data frame.
library(broom)
library(dplyr)
data <- tibble(a, b, c)  # data_frame() is deprecated in favour of tibble()
data %>%
group_by(a) %>%
do(tidy(lm(b ~ c, data = .))) %>%
select(variable = a, t_stat = statistic) %>%
slice(2)
# variable t_stat
# 1 a 1.6124515
# 2 b -0.1369306
# 3 c 0.8000000
Or extracting both, the t-statistic for the intercept and the slope term:
data %>%
group_by(a) %>%
do(tidy(lm(b ~ c, data = .))) %>%
select(variable = a, term, t_stat = statistic)
# variable term t_stat
# 1 a (Intercept) 1.2366939
# 2 a c 1.6124515
# 3 b (Intercept) 2.6325081
# 4 b c -0.1369306
# 5 c (Intercept) 1.4572335
# 6 c c 0.8000000
You can use the lmList function from the nlme package to apply lm to subsets of data:
# the data
df <- data.frame(a, b, c)
library(nlme)
res <- lmList(b ~ c | a, df, pool = FALSE)
coef(summary(res))
The output:
, , (Intercept)
Estimate Std. Error t value Pr(>|t|)
a 0.1000000 0.08086075 1.236694 0.30418942
b 0.2304348 0.08753431 2.632508 0.07815663
c 0.1461538 0.10029542 1.457233 0.24110393
, , c
Estimate Std. Error t value Pr(>|t|)
a 0.50000000 0.3100868 1.6124515 0.2052590
b -0.04347826 0.3175203 -0.1369306 0.8997586
c 0.15384615 0.1923077 0.8000000 0.4821990
If you want the t values only, you can use this command:
coef(summary(res))[, "t value", -1]
# a b c
# 1.6124515 -0.1369306 0.8000000
Here's a vote for the plyr package and ddply().
library(plyr)
dF <- data.frame(a, b, c)
plyrFunc <- function(x){
mod <- lm(b~c, data = x)
return(summary(mod)$coefficients[2,3])
}
tStats <- ddply(dF, .(a), plyrFunc)
tStats
a V1
1 a 1.6124515
2 b -0.1369306
3 c 0.8000000
Use split to subset the data and loop over the subsets with sapply:
dat <- data.frame(b,c)
dat_split <- split(x = dat, f = a)
res <- sapply(dat_split, function(x){
summary(lm(b~c, data = x))$coefficients[2,3]
})
Reshape the result to your needs:
data.frame(variable = names(res), "t-stat" = res)
variable t.stat
a a 1.6124515
b b -0.1369306
c c 0.8000000
You could do this:
a<- c("a","a","a","a","a",
"b","b","b","b","b",
"c","c","c","c","c")
b<- c(0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3)
c<- c(0.2,0.1,0.3,0.2,0.4,
0.2,0.5,0.2,0.1,0.2,
0.4,0.2,0.4,0.6,0.8)
df <- data.frame(a,b,c)
t.stats <- t(data.frame(lapply(c('a','b','c'),
function(x) summary(lm(b~c,data=df[df$a==x,]))$coefficients[2,3])))
colnames(t.stats) <- 't-stat'
rownames(t.stats) <- c('a','b','c')
Output:
> t.stats
t-stat
a 1.6124515
b -0.1369306
c 0.8000000
Unless I am mistaken the values you give in your output are not the correct ones.
Or, if you want a data.frame with the group names in a separate column:
t.stats <- data.frame(t.stats)
t.stats$variable <- rownames(t.stats)
> t.stats[,c(2,1)]
variable t.stat
a a 1.6124515
b b -0.1369306
c c 0.8000000
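If more than the t-statistic is needed per group, the same split-and-apply idea extends to returning a small data frame per group and stacking them. A minimal base R sketch using the question's data; slope_row and t_table are hypothetical names chosen for illustration.

```r
a <- rep(c("a", "b", "c"), each = 5)
b <- rep(c(0.1, 0.2, 0.3, 0.2, 0.3), times = 3)
c <- c(0.2, 0.1, 0.3, 0.2, 0.4,
       0.2, 0.5, 0.2, 0.1, 0.2,
       0.4, 0.2, 0.4, 0.6, 0.8)
df <- data.frame(a, b, c)

# slope_row (hypothetical helper): one-row summary of the slope term for one group
slope_row <- function(d) {
  cf <- summary(lm(b ~ c, data = d))$coefficients
  data.frame(variable = d$a[1],
             estimate = cf["c", "Estimate"],
             t_stat = cf["c", "t value"])
}

# Split by group, summarise each piece, then stack the rows
t_table <- do.call(rbind, lapply(split(df, df$a), slope_row))
rownames(t_table) <- NULL
t_table
```

Indexing the coefficient matrix by name ("c", "t value") rather than by position [2,3] keeps the code readable if the model formula changes later.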

