Using the coefficients from regression models that use group_by function - r

I run the following code to generate the regression models:
library(dplyr)
fitted_models <- df %>%
  group_by(sic, fyear) %>%
  do(model = lm(TACCdTA ~ Inverse_TA + DeL_RevRec + PPEdTA, data = .))
Then, to get the coefficients for each sic and fyear, I run the following code:
library(broom)
fitted_models %>% tidy(model)
This gives me coefficients for each sic for each fyear.
Now my question: under each sic for each fyear there are many observations (e.g., 1000 observations). How can I calculate the fitted value for each observation under each sic and fyear, using the coefficients that the models above generated?
Another small question: for my first block of code, in which I fit the models, how can I ensure that a model is fitted only for those sic-fyear combinations that have at least 10 observations?
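A minimal sketch of one way to do both at once, assuming df and the variables above: broom::augment() returns one row per observation used in the fit, with the fitted value in the .fitted column, and filter(n() >= 10) drops the small sic-fyear cells before fitting.
library(dplyr)
library(broom)
fitted_rows <- df %>%
  group_by(sic, fyear) %>%
  filter(n() >= 10) %>%   # keep only sic-fyear cells with at least 10 observations
  do(augment(lm(TACCdTA ~ Inverse_TA + DeL_RevRec + PPEdTA, data = .)))
# fitted_rows has one row per original observation; .fitted holds the fitted value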

Related

R random forest aggregate vs individual prediction

Please consider this minimal reproducible example of a random forest regression estimate:
library(randomForest)
# fix missing data
airquality <- na.roughfix(airquality)
set.seed(123)
# fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ ., data = airquality)
# define a new observation
new <- data.frame(Solar.R = 250, Wind = 8, Temp = 70, Month = 5, Day = 5)
set.seed(123)
# use predict.all on the new observation
rf_predict <- predict(rf_fit, newdata = new, predict.all = TRUE)
rf_predict$aggregate
library(tidyverse)
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(V1:V500))
predict_mean
I was expecting rf_predict$aggregate and predict_mean to give the same value.
Where and why am I wrong about this assumption?
My final objective is to get a confidence interval for the predicted value.
I believe your code needs to include a c_across() call for the calculation to be performed correctly:
The ?c_across documentation tells us:
c_across() is designed to work with rowwise() to make it easy to
perform row-wise aggregations.
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(c_across(V1:V500)))
> predict_mean
[1] 30.5
An answer to a previous question points out that mean() can't handle a data.frame, and in your code the data being provided to mean() is a row-wise data frame with class rowwise_df. c_across() allows the data in the rows to be presented to mean() as vectors (I think).
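As for the stated final objective, here is a rough sketch (not part of the fix above, and an approximation rather than a formal confidence interval): the spread of the per-tree predictions in rf_predict$individual can be summarised with quantiles.
# rough 95% interval for the single new observation from the 500 per-tree predictions
quantile(rf_predict$individual[1, ], probs = c(0.025, 0.975))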

Obtain P-Value of Fixed Value in Anova Table of many Linear Regressions with Broom Package

In the multiple linear regression lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3) there is one fixed variable (Trial) and many x variables. I am analysing each x variable against FE_FCE2 with Trial as a fixed effect. I then use the broom package to run the many regressions and collect the results in one table. I have obtained the regression results, but I cannot get the data from the ANOVA tables into the broom output with the map function.
Is it possible? And if yes, how?
I have used the following code to obtain the regression results:
DF_FCE3 %>%
  select(-FE_FCE2, -Trial) %>% # exclude outcome, leave only predictors
  map(~ lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3)) %>%
  map(summary) %>%
  map_df(glance) %>%
  round(3) -> rsme
However, I would like to obtain the p-value of Trial (4.26e-08) from the ANOVA table, to see if Trial had a significant influence on the x variable.
$x1
Analysis of Variance Table

Response: FE_FCE2
          Df  Sum Sq  Mean Sq F value   Pr(>F)
Trial      3 0.84601 0.282002 15.0653 4.26e-08 ***
.x         1 0.00716 0.007161  0.3826   0.5377
Residuals 95 1.77827 0.018719
---
Is it possible to use the broom package with the map function to obtain a table which contains all the p-values from these ANOVA tables?
Like this (using mpg)?
This returns a data frame with one column per predictor, excluding the outcome and the fixed term (hwy and cyl in this example, FE_FCE2 and Trial in your case), and a single row containing the p-values.
library(tidyverse)

mpg %>%
  select(-hwy, -cyl) %>% # exclude outcome, leave only predictors
  map(~ lm(hwy ~ cyl + .x, data = mpg)) %>%
  map(anova) %>%
  map(broom::tidy) %>%
  map_df(~ .$p.value[1])
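A possible variant (not from the original answer): returning a long, two-column table with one row per predictor can be easier to sort or plot; map_dfr() with .id keeps the predictor name.
library(tidyverse)
mpg %>%
  select(-hwy, -cyl) %>%
  map(~ lm(hwy ~ cyl + .x, data = mpg)) %>%
  map(anova) %>%
  map(broom::tidy) %>%
  map_dfr(~ tibble(p.value = .$p.value[1]), .id = "predictor")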

Fitting a quadratic curve for each data set that has different lengths

I would like to fit a quadratic to (Time, SkinTemp) for each id in the following data.frame df. Each id has a different number of Time, SkinTemp entries, so I'm stuck with predict.
df <- data.frame(Time = seq(65),
                 SkinTemp = rnorm(65, 37, 0.5),
                 id = rep(1:10, c(5, 4, 10, 6, 7, 8, 9, 8, 4, 4)))
So far I have:
# Fit the model y = x^2 + x + C for each id
library(dplyr)
fitted_models <- df %>% group_by(id) %>% do(model = lm(SkinTemp ~ Time + I(Time^2), data = .))
So far so good. Here's where I'm stuck: how do I pass the original Time data into the predict function below?
# Predict data points for each quadratic
predQ <- sapply(unique(df$id), function(x) predict(fitted_models$model[[x]]))
Use fitted:
lapply(fitted_models$model, fitted)
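A small follow-up sketch, assuming df is sorted by id (which it is as constructed above) so the groups come back in the same order they were fitted: the fitted values can then be bound straight back onto the original rows.
df$SkinTemp_fit <- unlist(lapply(fitted_models$model, fitted))
head(df)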

PLM package: Balanced data shown as unbalanced in regression

The dataset I am using here is unbalanced, but I balanced it manually by removing the multiple observations for the same ID (this is a characteristic of my data, as a single household later splits into different ones). T is 2 here.
dataset %>% group_by(ID) %>% summarise(N = n()) %>% filter(N > 2 | N < 2)
Then I removed these rogue observations, so now the panel is balanced. I converted it to panel data afterwards:
dataset <- plm.data(dataset, 30462)
And when I run is.pbalanced, it returns TRUE. But the problem appears when I run the regression
plm(DEP ~ VAR1 + VAR2, data = dataset, model = "within")
and the summary shows this:
Unbalanced Panel: n=20236, T=1-2, N=34920
I don't understand what I am missing here. Any suggestions will be greatly appreciated.
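One thing worth checking (an assumption on my part, since no answer is quoted here): plm() silently drops rows with missing values in any model variable, so the estimation sample can be unbalanced even when the full pdata.frame is balanced. A quick diagnostic sketch, assuming the column names used above:
library(plm)
# keep only the complete cases that actually enter the regression
used <- dataset[complete.cases(dataset[, c("DEP", "VAR1", "VAR2")]), ]
is.pbalanced(used)      # FALSE here would explain the "Unbalanced Panel" line
table(table(used$ID))   # how many IDs contribute 1 vs 2 periods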

extract log rank (score) test result with p-value for Coxph Model

I have 100 replicates of a coxph model fitted in a loop. I am trying to extract the log-rank (score) test result with its p-value for each replicate into a data frame or list. I am using the following, but it gives me only the log-rank score, not the p-value. Any help will be very much appreciated.
I can share the dataset, but am not sure how to attach it here.
Thanks,
Krina
library(survival)
Repl_List <- unique(dat3$Repl)
doLogRank <- function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- with(dum, coxph(Surv(TIME_day, STATUS) ~ Treatment, ties = "breslow"))
  LogRank <- with(reg, reg$score)
}
LogRank <- t(as.data.frame(lapply(Repl_List, doLogRank)))
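A minimal sketch of one way to also get the p-value, keeping the loop above and assuming the same dat3 columns: summary() on a coxph fit exposes the score (log-rank) test as sctest, a vector holding the statistic, its degrees of freedom, and the p-value.
doScoreTest <- function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- coxph(Surv(TIME_day, STATUS) ~ Treatment, data = dum, ties = "breslow")
  summary(reg)$sctest  # named vector: test statistic, df, p-value
}
ScoreTests <- do.call(rbind, lapply(Repl_List, doScoreTest))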
Here is a mock example that I took from the help page of the coxph function. I just replicated the dataset 100 times to create your scenario. I highly recommend starting to use the tidyverse packages for this kind of work; broom is a great addition along with dplyr and tidyr.
library(survival)
library(tidyverse)
library(broom)

test <- data.frame(time = c(4, 3, 1, 1, 2, 2, 3),
                   status = c(1, 1, 1, 0, 1, 1, 0),
                   x = c(0, 2, 1, 1, 1, 0, 0),
                   sex = c(0, 0, 0, 0, 1, 1, 1))
Below I am replicating the dataset 100 times using the replicate function.
r <- replicate(test, n = 100, simplify = FALSE) %>%
  bind_rows() %>%
  mutate(rep = rep(seq(1, 100, 1), each = 7))
I set up the Cox model as a small function that I can then pass to each replicate of the data frame.
cxph_mod <- function(df) {
  coxph(Surv(time, status) ~ x + strata(sex), df)
}
Below is the step-by-step process of fitting the model and extracting the values.
tidyr::nest the data frame
purrr::map the model into each nest
nest is a function in library(tidyr)
map is a function similar to lapply in library(purrr)
nested <- r %>%
  group_by(rep) %>%
  nest() %>%
  mutate(model = data %>% map(cxph_mod))
Look into the first rep to see the coxph output. You will see the model object stored in the cells of the data frame, allowing easier access.
nested %>% filter(rep == 1)
With each model object in place, now use broom to get the parameter estimates from each model into the nested dataset.
nested <- nested %>%
  mutate(
    ests = model %>% map(broom::tidy)
  )
tidyr::unnest to view the estimates from fitting each resampled dataset.
ests <- unnest(nested, ests, .drop = TRUE) %>% dplyr::select(rep, estimate:conf.high)
In this case, since I am repeating the same dataset 100 times, the p-value will be the same, but in your case you will have 100 different datasets and hence 100 different p-values.
ggplot(data = ests, aes(y = p.value, x = rep)) + geom_point()
Vijay
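Since the original question asks for the log-rank (score) test p-value rather than the coefficient p-values, a small extension of the nested workflow above may help (a sketch; the statistic.sc and p.value.sc column names are taken from broom::glance() output for coxph objects):
logrank <- nested %>%
  mutate(gl = model %>% map(broom::glance)) %>%
  unnest(gl, .drop = TRUE) %>%
  dplyr::select(rep, statistic.sc, p.value.sc)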
