What does se.fit represent exactly? How can we compute it manually?

Given the following R code, why do we get an array of standard errors instead of a single standard error?
library(dplyr)
library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
  filter(gender == "male") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(father, childHeight) %>%
  rename(son = childHeight)
Now consider this:
fit <- galton_heights %>% lm(son ~ father, data = .)
Y_hat <- predict(fit, se.fit = TRUE)
The predicted values could be extracted by this code:
Y_hat$fit
which gives us the predicted heights of the sons for the fathers in galton_heights.
However, I do not understand why we get an array of standard errors when we run this code:
Y_hat$se.fit
What do these values refer to, and how could they be calculated manually?
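A minimal sketch of one way to look at this, assuming the usual ordinary least squares setting: predict() returns one standard error per row because each fitted value estimates the mean response at that row's predictor value, and that estimate becomes less certain as the predictor moves away from its sample mean. For a row with design vector x_i, the value is sigma_hat * sqrt(x_i' (X'X)^{-1} x_i). The built-in cars data stand in for galton_heights here so that the block runs without HistData; the names se_manual and se_builtin are illustrative only.

```r
# Stand-in data (assumption): the built-in `cars` data replace galton_heights
fit <- lm(dist ~ speed, data = cars)

X         <- model.matrix(fit)     # design matrix: intercept and speed
sigma_hat <- summary(fit)$sigma    # residual standard error
XtX_inv   <- solve(crossprod(X))   # (X'X)^{-1}

# Standard error of each fitted value: sigma_hat * sqrt(x_i' (X'X)^{-1} x_i)
se_manual <- sigma_hat * sqrt(rowSums((X %*% XtX_inv) * X))

# Compare with what predict() reports
se_builtin <- predict(fit, se.fit = TRUE)$se.fit
all.equal(unname(se_manual), unname(se_builtin))
```

The two vectors should agree up to floating-point error, and the smallest standard error belongs to the row whose predictor is closest to the sample mean.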

Related

Correlation between p-value and linear regression slope in general linear model

I input a counts table (read counts mapping to 600 sequences in a reference FASTA) into a package that fits a linear regression with lm() in R and outputs a trend-line slope and p-value for each of the 600 sequences. I made a volcano plot of my results (b.value = regression slope) and found an unexpected correlation between the regression slope and the p-value. Can anyone explain why this might be the case? I am going through the code for the package, but I want to know whether this could be an issue with the input data rather than the model.
The code that runs the linear regression on the counts table dat$count, with the response variable given by condition, is:
library(dplyr)
library(tidyr)   # gather()
library(tibble)  # rownames_to_column()
do.lm <- function(dat) {
  count <- dat$count
  col_data <- dat$col_data
  tidy.count <- count %>%
    cbind(name = rownames(count)) %>%
    gather(sample, count, -name) %>%
    left_join(col_data %>% rownames_to_column("sample"), by = "sample")
  # TIP: values in a column must be atomic, can't have a vector
  res <- tidy.count %>%
    group_by(name) %>%
    summarise(count = list(count),
              condition = list(condition)) %>%
    group_by(name) %>%
    mutate(lm = list(summary(lm(unlist(condition) ~ unlist(count)))),
           baseMean = mean(unlist(count))) %>%
    mutate(p.value = tryCatch({lm[[1]]$coefficients[2, 4]}, error = function(e) NA),
           b.value = tryCatch({lm[[1]]$coefficients[2, 3]}, error = function(e) NA)) %>%
    select(name, baseMean, b.value, p.value)
  res$padj <- p.adjust(res$p.value, method = "fdr")
  res
}
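One thing that may be worth checking before suspecting the input data: the coefficient matrix returned by summary() on an lm fit has the columns Estimate, Std. Error, t value and Pr(>|t|), so column 3 is the t statistic rather than the slope. The sketch below uses the built-in cars data as a stand-in for one sequence's counts. It shows the column layout, and that the p-value is a deterministic function of the t value, which by itself would produce a tight relationship in a volcano plot.

```r
# Stand-in data (assumption): one regression on the built-in `cars` data
fit_summary <- summary(lm(dist ~ speed, data = cars))
coefs <- fit_summary$coefficients
colnames(coefs)

slope   <- coefs[2, "Estimate"]    # column 1: the regression slope
t_value <- coefs[2, "t value"]     # column 3: slope divided by its standard error
p_value <- coefs[2, "Pr(>|t|)"]    # column 4: two-sided p-value

# The p-value can be recomputed from the t statistic and the residual df
p_from_t <- 2 * pt(abs(t_value), df = fit_summary$df[2], lower.tail = FALSE)
all.equal(p_value, p_from_t)
```

If the slope itself is wanted for the volcano plot, indexing by column name, as done here, avoids picking the wrong column by position.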

Paired t test with multiple time points

I have a dataset with six time points and I would like to run multiple paired-sample t-tests to compare the scores.
The data has been transformed into long format.
I want to produce something like this table:
I tried the following code, but it does not work.
stat.test <- anxiety_score %>%
  group_by(group) %>%
  pairwise_t_test(
    anxiety_score ~ time, paired = TRUE,
    p.adjust.method = "bonferroni"
  ) %>%
  select(-df, -statistic, -p) # Remove details
stat.test
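Since the question does not show the error, the following is only a sketch of the same kind of comparison in base R, using pairwise.t.test() from the stats package on simulated data. The data frame scores and its columns id, time and score are hypothetical. Paired tests match observations by position within each time point, so the rows are sorted by time and then by id first. For several groups, the same call could be run on each group's subset.

```r
set.seed(1)
n_subjects <- 20

# Hypothetical long-format data: one row per subject and time point
scores <- expand.grid(id = seq_len(n_subjects), time = paste0("t", 1:6))
scores$time  <- factor(scores$time, levels = paste0("t", 1:6))
scores$score <- rnorm(nrow(scores), mean = 10 + as.integer(scores$time))

# Sort so that subjects line up in the same order within every time point
scores <- scores[order(scores$time, scores$id), ]

res <- pairwise.t.test(scores$score, scores$time,
                       paired = TRUE, p.adjust.method = "bonferroni")
res$p.value   # lower-triangular matrix of adjusted p-values
```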

How to incorporate tidy models PCA into the workflow of a model and make predictions

I am trying to incorporate tidy models PCA into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach,
diamonds <- diamonds %>%
select(-clarity, -cut, - color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <-vfold_cv(diamonds_train)
diamonds_recipe <-
# The basic formula and all the data (outcome ~ predictors)
recipe(price ~ ., data = diamonds_train) %>%
step_log(all_outcomes(),skip = T) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
linear_reg() %>%
set_engine("glmnet") %>%
set_mode("regression")
pca_workflow <- workflow() %>%
add_recipe(diamonds_recipe) %>%
add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% #option A workflow full dataset
last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried a different approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>% #option C no work flow at all
lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans),ans, truths)
ggplot(data = tib) +
geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence, and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit, and for it to work you have to provide the data for the fitting process (here diamonds_train). The benefit is that you don't have to prep your recipe, as the workflow takes care of this itself.
So, reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)
pca_fit <- fit(pca_workflow, data = diamonds_train)
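What the workflow does with the recipe can be sketched in base R, which may help with the question of how new data in its original form gets predicted: the centring, scaling and PCA rotation are estimated on the training rows only, and the same transformation is then applied to new rows before the model sees them. This is an illustration with prcomp() and lm() on the built-in mtcars data, not the tidymodels implementation; the split, the predictor set and the choice of two components are arbitrary assumptions.

```r
set.seed(42)
train_idx  <- sample(nrow(mtcars), 24)
train      <- mtcars[train_idx, ]
test       <- mtcars[-train_idx, ]
predictors <- c("disp", "hp", "wt", "qsec")

# Estimate centring, scaling and rotation on the training predictors only
pca <- prcomp(train[, predictors], center = TRUE, scale. = TRUE)

# Fit the model on the first two principal component scores
train_scores <- data.frame(mpg = train$mpg, pca$x[, 1:2])
pca_lm <- lm(mpg ~ PC1 + PC2, data = train_scores)

# New data arrive in the original form; project them with the training PCA
test_scores <- as.data.frame(predict(pca, newdata = test[, predictors])[, 1:2])
test_pred   <- predict(pca_lm, newdata = test_scores)
head(test_pred)
```

With a fitted workflow, the corresponding step is to pass the untransformed rows directly, for example predict(pca_fit, new_data = diamonds_test), assuming the model fits with the chosen engine.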
As for cross-validation, one has to use fit_resamples and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
I've now done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. My guess is that, among the many moving parts, this case has somehow been missed. I've opened a possible issue on the workflows package GitHub site. Answers there often come quickly, so one of the developers will likely reply soon.
As for cross-validation, if you instead fit using any of the other engines described in ?linear_reg, it could be done as follows:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
If the metrics are of interest, they can be collected as you did, using collect_metrics:
pca_folds_fit %>% collect_metrics()
If you are interested in the predictions, you have to tell the model to save them during the fitting process and then use collect_predictions:
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))
collect_predictions(pca_folds_fit)
Note, however, that the output is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR chapters 10 and 11.

Object '.' not found while piping with dplyr

I am trying to produce a survival curve using the survival package. The MWE code is as follows:
df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
  ggsurvplot(fit = .)
I get the error Error in eval(fit$call$data) : object '.' not found
When I try to break this down further by:
survfit <- df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .)
ggsurvplot(fit = survfit)
I get an identical error. Is anyone able to figure out how to pipe from my dataframe all the way through a survival curve? The reason I would like to do this is to streamline the filtering of my dataframe in order to produce a multitude of different survival curves without having to create many subsetted dataframes.
Apparently, ggsurvplot expects an object of class "survfit" as its first argument but also needs the data set as an argument.
The example below is based on the first example in the help page for survfit.formula {survival}.
library(dplyr)
library(survival)
library(survminer)
aml %>%
  survfit(Surv(time, status) ~ x, data = .) %>%
  ggsurvplot(data = aml)
In the question's case this would become
df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
  ggsurvplot(data = filter(df, fac <= "Limit"))
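A small sketch of why the error mentions '.', using the same aml data and assuming the magrittr pipe is available: the survfit object stores the call it was created with, and when the data were supplied through the pipe placeholder, the stored data argument is the symbol . rather than a data frame. Outside the pipe that symbol no longer exists, which matches the message from eval(fit$call$data) and is why passing data explicitly helps.

```r
library(survival)
library(magrittr)

fit <- aml %>% survfit(Surv(time, status) ~ x, data = .)

# The stored call still refers to the pipe placeholder, not to `aml`
fit$call
deparse(fit$call$data)
```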

run multiple model and save model comparison results in dataframe in r

I want to run lm models, save the model comparison results, and extract the p-values. I would like to save all the information in a data frame.
Using diamonds dataset as an example:
diamonds %>%
  group_by(cut) %>%
  do(model1 = lm(price ~ carat, data = .),
     model2 = lm(price ~ carat + depth, data = .)) %>%
  mutate(anova = anova(model2, model1)) %>%
  mutate(pval = anova$'Pr(>F)'[2])
I got the error message below:
Error in mutate_impl(.data, dots) :
Column `anova` must be length 1 (the group size), not 6
My questions are:
Why did I get the error message, and how can I save the anova result in the data frame?
How can I make the whole process work if lm or anova fails on some subsets? Something like try..catch.
My real data is more complicated than this; I just use diamonds and a linear model to illustrate the idea.
Thanks a lot.
This is a really good application of the tidyr::nest() function in conjunction with purrr and broom. What you do is:
- Group the data frame
- Apply a model with mutate(mod = map(data, model))
- Summarize the model using broom::tidy()
- Extract the relevant statistics.
For more on this, here's a great talk by Hadley on the subject: https://www.youtube.com/watch?v=rz3_FDVt9eg
In your case I think you can do something like this:
library(tidyverse)
library(broom)
diamonds %>%
  group_by(cut) %>%
  nest() %>%
  mutate(
    model1 = map(data, ~lm(price ~ carat, data = .)),
    model2 = map(data, ~lm(price ~ carat + depth, data = .))
  ) %>%
  mutate(anova = map2(model1, model2, ~anova(.x, .y))) %>%
  mutate(tidy_anova = map(anova, broom::tidy)) %>%
  mutate(p_val = map_dbl(tidy_anova, ~.$p.value[2])) %>%
  select(p_val)
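The code above does not cover the second part of the question, what to do when lm or anova fails on some subsets. One possible approach, sketched here in base R on the built-in mtcars data (grouped by cyl as a stand-in for cut), is to wrap the per-group work in tryCatch() and return NA on error. The helper name compare_models and the deliberately empty group "broken" are hypothetical.

```r
# Hypothetical helper: p-value of the nested-model comparison, NA if anything fails
compare_models <- function(d) {
  tryCatch({
    model1 <- lm(mpg ~ wt, data = d)
    model2 <- lm(mpg ~ wt + hp, data = d)
    anova(model1, model2)[["Pr(>F)"]][2]
  }, error = function(e) NA_real_)
}

groups <- split(mtcars, mtcars$cyl)
# An empty subset on which lm() fails, to show the error handling
groups$broken <- data.frame(mpg = numeric(0), wt = numeric(0), hp = numeric(0))

p_values <- vapply(groups, compare_models, numeric(1))
data.frame(group = names(p_values), p_val = unname(p_values))
```

In the purrr pipeline, wrapping the model function with purrr::possibly() plays a similar role.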
