How to convert fitdistrplus::fitdist summary into tidy format? - r

I have the following code:
x <- c(
0.367141764080875, 0.250037975705769, 0.167204185003365, 0.299794433447383,
0.366885973041269, 0.300453205296379, 0.333686861081341, 0.33301168850398,
0.400142004893329, 0.399433677388411, 0.366077304765104, 0.166402979455671,
0.466624230750293, 0.433499934139897, 0.300017278751768, 0.333673696762895,
0.29973685692478
)
fn <- fitdistrplus::fitdist(x,"norm")
summary(fn)
#> Fitting of the distribution ' norm ' by maximum likelihood
#> Parameters :
#> estimate Std. Error
#> mean 0.32846024 0.01918923
#> sd 0.07911922 0.01355908
#> Loglikelihood: 19.00364 AIC: -34.00727 BIC: -32.34084
#> Correlation matrix:
#> mean sd
#> mean 1 0
#> sd 0 1
Basically, it takes a vector and tries to fit the distribution
using the fitdistrplus package.
I tried looking at the broom package, but it doesn't have
a function that covers this case.

When you call broom::tidy(fn) you receive an error that says:
Error: No tidy method for objects of class fitdist
This is because broom's tidy() only supports a finite set of object classes; see methods(tidy) for the complete list. (Read more about S3 methods in R.)
So the function doesn't work for a fitdist object, but it does work for a fitdistr object from MASS (which is better known).
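A quick way to check which classes are covered (a minimal sketch; it simply lists the tidiers broom ships):
library(broom)
# you will find tidy.fitdistr* (for MASS::fitdistr) in the list,
# but no tidy.fitdist
methods(tidy)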
We can then assign to fn that class, and then use broom:
class(fn) <- c("fitdist", "fitdistr")
# notice that I've kept the original class and added the other;
# you shouldn't overwrite classes, i.e. don't do this: class(fn) <- "fitdistr"
broom::tidy(fn)
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
Note that you only get the parameters this way. If you wish to see more and organize everything as "tidy", you should tell us more about your expected output.
broom::tidy() only gets you this far; if you want more, I'd start by defining your own method function for objects of class fitdist, using the tidy.fitdistr method as a reference and adapting it.
Example of how I'd adapt from the original broom::tidy() code, using the S3 method for the class fitdist.
Define your own method (similar to how you define your own function):
# necessary libraries
library(dplyr)
library(broom)
# method definition:
tidy.fitdist <- function(x, ...) { # notice the use of .fitdist
  # you decide what you want to keep from summary(fn);
  # inspect names(fn) / str(fn) to see what you can harvest
  e1 <- tibble(
    term = names(x$estimate),
    estimate = unname(x$estimate),
    std.error = unname(x$sd)
  )
  e2 <- tibble(
    term = c("loglik", "aic", "bic"),
    value = c(unname(x$loglik), unname(x$aic), unname(x$bic))
  )
  e3 <- x$cor # I prefer this to: as_tibble(x$cor)
  list(e1, e2, e3) # you can name each element for a nicer result
  # example: list(params = e1, scores = e2, corrMatr = e3)
}
This is how you can call this new method now:
tidy(fn) # to be more clear this is calling your tidy.fitdist(fn) under the hood.
# [[1]]
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
#
# [[2]]
# # A tibble: 3 x 2
# term value
# <chr> <dbl>
# 1 loglik 19.0
# 2 aic -34.0
# 3 bic -32.3
#
# [[3]]
# mean sd
# mean 1 0
# sd 0 1
Notice that the class is:
class(fn)
[1] "fitdist"
So now you don't actually need to assign the fitdistr (from MASS) class as before.

Not sure exactly what you need, but you can try:
tidy_fn <- rbind(fn$estimate,fn$sd)
https://stats.stackexchange.com/questions/23539/use-fitdist-parameters-in-variables
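If you would rather have a small data frame than that two-row matrix, here is a minimal sketch along the same lines (reusing the fn object fitted above):
tidy_fn <- data.frame(
  term      = names(fn$estimate),
  estimate  = unname(fn$estimate),
  std.error = unname(fn$sd)
)
tidy_fn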

Related

Compute standard deviation with a manually set mean in R

I know how to compute the sd using summarize:
ans <- temp %>% group_by(permno) %>% summarise(std = sd(ret))
But how do I compute the standard deviation given I know the mean = 0?
In other words, I know the true mean and want to use that instead of using the sample mean while computing the sd.
One way would be to manually code the sd function, but I need it to work for each group, so I'm stuck.
It is always best to provide reproducible data. Here is an example with the iris data set:
data(iris)
GM <- mean(iris$Sepal.Length) # "Population mean"
ans <- iris %>% group_by(Species) %>% summarise(std = sqrt(sum((Sepal.Length - GM)^2)/length(Sepal.Length)))
ans
# A tibble: 3 × 2
# Species std
# <fct> <dbl>
# 1 setosa 0.907
# 2 versicolor 0.519
# 3 virginica 0.975
As compared with computing the sd with each group mean:
ans <- iris %>% group_by(Species) %>% summarise(std=sd((Sepal.Length)))
ans
# A tibble: 3 × 2
# Species std
# <fct> <dbl>
# 1 setosa 0.352
# 2 versicolor 0.516
# 3 virginica 0.636
Note that sd uses 'n - 1' in the denominator, but since you indicated that your mean was a population mean we use n.
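A quick check of the two denominators on a single group (a minimal sketch, reusing the iris data from above):
x <- iris$Sepal.Length[iris$Species == "setosa"]
sd(x)^2                           # sample variance, divides by n - 1
sum((x - mean(x))^2) / length(x)  # population variance, divides by n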
I came up with this solution:
sd_fn <- function(x, mean_pop) {
sd_f <- sqrt((sum((x-mean_pop)^2))/(length(x)))
sd_f
}
x <- c(1,2,3,-1,-1.5,-2.8)
mean_pop <- 0
sd_fn(x, mean_pop)
I simply created a function where the arguments are a numeric vector and the population mean that you already know. Enter the data vector and the population mean, and the function will give you the desired standard deviation.
If you want to calculate the sd from a true (known) mean, you can take the mean of the squared differences between the sample vector and the true mean to get the variance, and then use sqrt to get the standard deviation. Keep in mind that base R's var and sd functions apply Bessel's correction automatically; you can read more at https://www.r-bloggers.com/2018/11/how-to-de-bias-standard-deviation-estimates/
#Sample Size
n=1000
#sample Random Vec
universe = rnorm(n,0,3)
# sample mean
p = mean(universe)
p
# true mean
p0 = 0
# calculate "manually" using sample mean
variance <- mean((universe - p)^2)
variance
standard_deviation <- sqrt(variance)
standard_deviation
# calculate "manually" usingtrue mean
variance_true <- mean((universe - p0)^2)
variance_true
standard_deviation_true <- sqrt(variance_true)
standard_deviation_true
# calculate using built in R functions
var_r<-var(universe)
var_r
r_sd<-sd(universe)
r_sd
# base R's var() and sd() apply Bessel's correction automatically:
all.equal(variance * n/(n-1), var_r)      # Bessel's correction multiplies by n/(n-1)
all.equal(sqrt(variance * n/(n-1)), r_sd)

Repeated cv in a mlr3 ensemble model

I have a beautiful mlr3 ensemble model (combined glmnet and glm) for binary prediction, see details here
library("mlr3verse")
library("dplyr")
# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes
# add an additional predictor "superdoc" which is not entered in the glmnet but in the final glm
set.seed(2323)
data %>%
rowwise() %>%
mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) %>%
ungroup -> data
# make a rather small train set
set.seed(23)
test.data <- sample_n(data,70,replace=FALSE)
# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")
# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
selector = selector_invert(selector_name("superdoc")))
# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
selector = selector_name("superdoc"))
# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
select_superdoc,
drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")
# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")
# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn
# define the task
train.task <- TaskClassif$new("test.data", test.data, target = "diabetes")
# make ensemble learner
elearner = as_learner(ensemble)
ensemble$plot(html = FALSE)
If I train it with different set.seed, I get different coefficients.
I think this is mainly caused by the rather small number of training observations that enter the glmnet model, and it could be mitigated by repeated cross-validation.
# Train the Learner:
# seed 1
elearner = as_learner(ensemble)
set.seed(22521136)
elearner$train(train.task) -> seed1
# seed 2
elearner = as_learner(ensemble)
set.seed(12354)
elearner$train(train.task) -> seed2
# different coefficients of the glment model
coef(seed1$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.238598277
#> age .
#> glucose 0.023462376
#> insulin -0.001007037
#> mass 0.055587740
#> pedigree 0.322911217
#> pregnant 0.137419564
#> pressure .
#> triceps .
coef(seed2$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.876802620
#> age .
#> glucose 0.025601712
#> insulin -0.001500856
#> mass 0.063029550
#> pedigree 0.464369417
#> pregnant 0.155971123
#> pressure .
#> triceps .
# different coefficients of the final regression model
seed1$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> -9.438452 23.710923 8.726956 NA
seed2$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> 0.3698143 23.5362542 -5.5514365 NA
Question:
Where and how could repeated cross-validation be added to my mlr3 ensemble model to mitigate these varying results? Any help is very much appreciated.
Thanks to missuse's comment, his marvellous tutorial (Tuning a stacked learner) and mb706's comments I think I could solve my question.
Replace "classif.cv_glmnet" with "classif.glmnet"
# Add tuning
resampling = rsmp("repeated_cv")
resampling$param_set$values = list(repeats = 10, folds=5)
ps_ens = ParamSet$new(
list(
ParamDbl$new("glmnet.alpha", 0, 1),
ParamDbl$new("glmnet.s", 0, 1)))
auto1 = AutoTuner$new(
learner = elearner,
resampling = resampling,
measure = msr("classif.auc"),
search_space = ps_ens,
terminator = trm("evals", n_evals = 5), # to limit running time
tuner = tnr("random_search")
)
Train with different set.seed and get same coefficients
# Train with different set.seed
#first
set.seed(22521136)
at1= auto1
at1$train(train.task) -> seed1
# second
set.seed(12354)
at2= auto1
at2$train(train.task) -> seed2
# Compare coefficients of the learners
# classif.log_reg
seed1$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
seed2$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
#classif.glmnet
coef(at1$learner$model$glmnet$model, alpha=at1$tuning_result$glmnet.alpha,s=at1$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030
coef(at2$learner$model$glmnet$model, alpha=at2$tuning_result$glmnet.alpha,s=at2$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030

contrast of contrast with emmeans (second differences)

I am using emmeans to conduct a contrast of a contrast (i.e., testing for an interaction effect through 1st/2nd differences).
It involves 3 steps:
1. estimate means using “emmeans”
2. estimate if there is a difference in means (1st difference) using “pairs”
3. estimate if there is a difference in the difference (2nd difference) using ????
While I can execute steps 1 and 2 (see reprex below with fictitious data), I'm stuck on step 3. Tips?
(the contrast of a contrast shown in the vignette here is for alternative functional forms, which is somewhat different than what I want to test)
suppressPackageStartupMessages({
library(emmeans)})
# create ex. data set. 1 row per respondent (dataset shows 2 resp).
cedata.1 <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2),
QES = c(1,1,2,2,3,3,1,1,2,2,3,3), # Choice set
Alt = c(1,2,1,2,1,2,1,2,1,2,1,2), # Alt 1 or Alt 2 in choice set
Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # Dep variable. if Chosen (1) or not (0)
LOC = c(0,0,1,1,0,1,0,1,1,0,0,1), # Indep variable per Choice set, binary categorical
SIZE = c(1,1,1,0,0,1,0,0,1,1,0,1), # Indep variable per Choice set, binary categorical
gender = c(1,1,1,1,1,1,0,0,0,0,0,0) # Indep variable per indvidual, binary categorical
)
# estimate model
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC * SIZE)
# calculate 1st diff (and p value)
pairs(comp1.loc.size, simple = "SIZE") # gives result I want
#> LOC = 0:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 -1.39 1.73 Inf -0.800 0.4235
#>
#> LOC = 1:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 0.00 1.73 Inf 0.000 1.0000
#>
#> Results are given on the log odds ratio (not the response) scale.
# calculate 2nd diff (and p value)
# ** the following gives the relevant values for doing the 2nd diff comparison (i.e., -1.39 and 0.00)...but how to make the statistical comparison?
pairs(comp1.loc.size, simple = "SIZE")
#> LOC = 0:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 -1.39 1.73 Inf -0.800 0.4235
#>
#> LOC = 1:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 0.00 1.73 Inf 0.000 1.0000
#>
#> Results are given on the log odds ratio (not the response) scale.
The second difference is simply the contrast of those two contrasts, which you can get by calling pairs() again on the first set of contrasts:
pairs(pairs(comp1.loc.size, simple = "SIZE"), by = NULL)
Another solution:
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC | SIZE)
# second difference:
pairs(pairs(emmeans::regrid(comp1.loc.size)), by = NULL)
PS: This solution is almost a copy of the solution here: Testing contrast of contrast (first/second difference) in outcome

How can I calculate the standard error of the poisson.test in R?

I have such a dataset,
ID Freq.x Freq.y
1 1 8
2 5 3
...
I calculated the ratio between the two rate parameters of Freq.x and Freq.y using R's poisson.test function, but I want to calculate the standard error. How can I do that?
You don't have any reproducible data in your question, so let's make some:
set.seed(69)
x <- rpois(100, lambda = 7)
y <- rpois(100, lambda = 8)
You can get the standard error of the estimated rate for each of these two variables (sqrt(lambda_hat / n)) like this:
se_x <- sqrt(mean(x) / length(x))
se_y <- sqrt(mean(y) / length(y))
se_x
#> [1] 0.2638181
se_y
#> [1] 0.2840775
and you can compare the two to determine if the underlying rate is significantly different like this:
poisson.test(c(sum(x), sum(y)))
#>
#> Comparison of Poisson rates
#>
#> data: c(sum(x), sum(y)) time base: 1
#> count1 = 696, expected count1 = 751.5, p-value = 0.004533
#> alternative hypothesis: true rate ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.7781748 0.9556714
#> sample estimates:
#> rate ratio
#> 0.8624535
It's not clear what you mean by the standard error of the poisson.test though.
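If what you are after is a standard error for the rate ratio itself, one common large-sample approach (my assumption about the goal; poisson.test does not report this directly) is the delta method on the log scale:
log_rr    <- log(sum(x) / sum(y))           # log rate ratio (equal observation time)
se_log_rr <- sqrt(1 / sum(x) + 1 / sum(y))  # approximate SE of the log rate ratio
exp(log_rr + c(-1.96, 1.96) * se_log_rr)    # ~95% CI, comparable to poisson.test's interval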

R loop for linear regression lm(y~x) and save model output as a dataset

I would like to make a regression loop lm(y~x) with a dataset that has one y and several x variables, run the regression for each x, and then store the results (estimate, p-values) in a data.frame() so I don't have to copy them manually (especially as my real data set is much bigger).
I think this should not be too difficult, but I struggle a lot to make it work and appreciate your help:
Here is my sample data set:
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
#yname <- ("fit")
#xnames <- c("Xbeta ","Xgamma", "Xalpha", "Xdelta")
The simple regression with the first independent variable Xbeta would look like this: lm(fit ~ Xbeta, data = sample_data), and I would like to run the regression for each variable starting with an "X" and then store the results (estimate, p-value).
I have found a code that allows me to select variables that start with "X" and then use it for the model, but the code gives me an error from mutate() onwards (indicated by #).
library(tidyverse)
library(tsibble)
sample_data %>%
gather(stock, return, starts_with("X")) %>%
group_nest(stock)
# %>%
# mutate(model = map(data,
# ~lm(formula = "fit~ return",
# data = .x))
# ),
# resid = map(model, residuals)
# ) %>%
# unnest(c(data,resid)) %>%
# summarise(sd_residual = sd(resid))
For storing the regression results I have also found the following approach using the R package broom: r for loop for regression lm(y~x)
sample_data%>%
group_by(y,x)%>% # get combinations of y and x to regress
do(tidy(lm(fRS_relative~xvalue, data=.)))
But I always get an error for group_by() and do()
I really appreciate your help!
One option would be to use lapply to perform a regression with each of the independent variables. Use tidy from broom library to store the results into a tidy format.
xnames <- grep("^X", names(sample_data), value = TRUE) # the predictors to loop over
lapply(seq_along(xnames),
       function(i) broom::tidy(lm(as.formula(paste0('fit ~ ', xnames[i])), data = sample_data))) -> test
and then combine all the results into a single dataframe:
do.call('rbind', test)
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 1.05 0.133 7.89 0.0000995
# 2 Xbeta -0.156 0.0958 -1.62 0.148
# 3 (Intercept) 0.968 0.147 6.57 0.000313
# 4 Xgamma 0.0712 0.107 0.662 0.529
# 5 (Intercept) 1.09 0.131 8.34 0.0000697
# 6 Xalpha -0.0999 0.0508 -1.96 0.0902
# 7 (Intercept) 0.998 0.175 5.72 0.000723
# 8 Xdelta -0.0114 0.0909 -0.125 0.904
Step one
Your data is messy, let us tidy it up.
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
tidyframe = data.frame(fit = sample_data$fit,
X = c(sample_data$Xbeta,sample_data$Xgamma,sample_data$Xalpha,sample_data$Xdelta),
type = c(rep("beta",9),rep("gamma",9),rep("alpha",9),rep("delta",9)))
Created on 2020-07-13 by the reprex package (v0.3.0)
Step two
Iterate over each type, and get the P-value, using this nifty function
# From https://stackoverflow.com/a/5587781/3212698
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
Then do some clever piping
tidyframe %>% group_by(type) %>%
summarise(type = type, p = lmp(lm(formula = fit ~ X))) %>%
unique()
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> # A tibble: 4 x 2
#> # Groups: type [4]
#> type p
#> <fct> <dbl>
#> 1 alpha 0.0902
#> 2 beta 0.148
#> 3 delta 0.904
#> 4 gamma 0.529
Created on 2020-07-13 by the reprex package (v0.3.0)
