Compute standard deviation with a manually set mean in R

I know how to compute the sd using summarize:
ans <- temp %>% group_by(permno) %>% summarise(std = sd(ret))
But how do I compute the standard deviation given I know the mean = 0?
In other words, I know the true mean and want to use that instead of using the sample mean while computing the sd.
One way would be to manually code the sd function, but I need it to work for each group, so I'm stuck.

It is always best to provide reproducible data. Here is an example with the iris data set:
library(dplyr)
data(iris)
GM <- mean(iris$Sepal.Length) # "Population mean"
ans <- iris %>% group_by(Species) %>% summarise(std = sqrt(sum((Sepal.Length - GM)^2) / length(Sepal.Length)))
ans
# A tibble: 3 × 2
# Species std
# <fct> <dbl>
# 1 setosa 0.907
# 2 versicolor 0.519
# 3 virginica 0.975
As compared with computing the sd with each group mean:
ans <- iris %>% group_by(Species) %>% summarise(std = sd(Sepal.Length))
ans
# A tibble: 3 × 2
# Species std
# <fct> <dbl>
# 1 setosa 0.352
# 2 versicolor 0.516
# 3 virginica 0.636
Note that sd uses n - 1 in the denominator, but since you indicated that your mean is a population mean, we use n.

I came up with this solution:
sd_fn <- function(x, mean_pop) {
  sd_f <- sqrt(sum((x - mean_pop)^2) / length(x))
  sd_f
}
x <- c(1,2,3,-1,-1.5,-2.8)
mean_pop <- 0
sd_fn(x, mean_pop)
I simply created a function whose arguments are a numeric vector and the population mean that you already know. Enter the data vector and the population mean, and the function will give you the desired standard deviation.
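To make it work per group, as the question asks, here is a minimal sketch with dplyr, using hypothetical data in the question's temp/permno/ret shape:
library(dplyr)
# hypothetical data mirroring the question's structure
temp <- data.frame(permno = rep(1:2, each = 3),
                   ret    = c(1, 2, 3, -1, -1.5, -2.8))
temp %>%
  group_by(permno) %>%
  summarise(std = sd_fn(ret, mean_pop = 0))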

If you want to calculate the sd from a true mean, you can apply the mean function to the squared differences between the sample vector and the true mean to get the variance, then use sqrt to get the standard deviation. Keep in mind that base R's var and sd functions apply Bessel's correction automatically; you can read about it at https://www.r-bloggers.com/2018/11/how-to-de-bias-standard-deviation-estimates/
# sample size
n <- 1000
# sample random vector
universe <- rnorm(n, 0, 3)
# sample mean
p <- mean(universe)
p
# true mean
p0 <- 0
# calculate "manually" using the sample mean
variance <- mean((universe - p)^2)
variance
standard_deviation <- sqrt(variance)
standard_deviation
# calculate "manually" using the true mean
variance_true <- mean((universe - p0)^2)
variance_true
standard_deviation_true <- sqrt(variance_true)
standard_deviation_true
# calculate using built-in R functions
var_r <- var(universe)
var_r
r_sd <- sd(universe)
r_sd
# they apply Bessel's correction automatically; compare with all.equal()
# rather than == to allow for floating-point rounding
all.equal(variance * n / (n - 1), var_r) # Bessel's correction via * n/(n-1)
all.equal(r_sd, sqrt(variance * n / (n - 1)))

Related

How to write the following formulas in R?

I need to write the following formulas in R. The STAT formula replicates the test statistic of the oneway.test function:
$$STAT = \frac{n \, S_{\bar{x}}^2}{\frac{1}{m}\sum_{i=1}^{m} s_i^2},$$
where the sample variance of the group means is
$$S_{\bar{x}}^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(\bar{x}_i - \bar{x}\right)^2$$
and $\bar{x}$ is the mean of the sample means. The variables are: m - number of samples, n - sample size, vector sample_means - mean of each sample, and vector sample_vars - sample variance of each sample.
I'm trying to work with the following code, but it doesn't give the correct results when I compare it to aov:
my_anova <- function(m, n, sample_means, sample_vars) {
  overall_mean <- mean(sample_means)
  sample_vars <- sum((sample_means - overall_mean)^2)/(m-1)
  STAT <- (n*sample_vars)/(sum(sample_vars/m))
  PVAL <- pf(STAT, m - 1, m*(n - 1), lower.tail = FALSE)
}
Not very sure where you obtained the formulas above, but from what I can gather, you want to obtain the F statistic and p value for a one-way ANOVA. n should be the degrees of freedom, not the sample size. The standard one-way ANOVA table looks like this (N = total observations, m = number of groups):
Source   Sum of Squares   DoF     Mean Square     F
Factor   SSF              m - 1   SSF / (m - 1)   MSF / MSE
Error    SSE              N - m   SSE / (N - m)
So the bottom line is: SSF should always be the sum of squared residuals between your predicted means and the overall mean, whereas SSE is the sum of squared residuals between your predicted means and the actual values. Then you divide each by the corresponding degrees of freedom. It should be like below:
my_aov <- function(sample_values, sample_means, n){
  overall_mean = mean(sample_values)
  SSF = sum((sample_means - overall_mean)^2)
  SSE = sum((sample_values - sample_means)^2)
  DoF = c(n, length(sample_values) - 1 - n)
  Mean_Square = c(SSF/DoF[1], SSE/DoF[2])
  FSTAT = c(Mean_Square[1]/Mean_Square[2], NA)
  PVAL <- pf(FSTAT, DoF[1], DoF[2], lower.tail = FALSE)
  cbind(Sum_of_Squares = c(SSF, SSE), DoF, Mean_Square, FSTAT, PVAL)
}
Using an example:
values = iris$Sepal.Length
Species_values = tapply(iris$Sepal.Length,iris$Species,mean)
predicted_values = Species_values[as.character(iris$Species)]
# since there are 3 groups, degree of freedom is 3-1
n = length(unique(iris$Species)) - 1
my_aov(values,predicted_values,n)
Sum_of_Squares DoF Mean_Square FSTAT PVAL
[1,] 63.21213 2 31.6060667 119.2645 1.669669e-31
[2,] 38.95620 147 0.2650082 NA NA
Compare with:
summary(aov(Sepal.Length ~ Species,data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
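For completeness, here is a hedged sketch of the original my_anova with its main bugs fixed: the between-group variance was overwriting sample_vars (so the denominator never saw the within-group variances), and the final assignment meant the function returned only PVAL, invisibly. This assumes equal group sizes n, as in oneway.test with var.equal = TRUE:
my_anova <- function(m, n, sample_means, sample_vars) {
  overall_mean <- mean(sample_means)
  # give the between-group variance its own name instead of clobbering sample_vars
  between_var <- sum((sample_means - overall_mean)^2) / (m - 1)
  STAT <- (n * between_var) / mean(sample_vars)
  PVAL <- pf(STAT, m - 1, m * (n - 1), lower.tail = FALSE)
  c(STAT = STAT, PVAL = PVAL)
}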

contrast of contrast with emmeans (second differences)

I am using emmeans to conduct a contrast of a contrast (i.e., testing for an interaction effect through 1st/2nd differences).
It involves 3 steps:
estimate means using emmeans
estimate if there is a difference in means (1st difference) using pairs
estimate if there is a difference in the differences (2nd difference) using ????
While I can execute steps 1 and 2 (see the reprex below with fictitious data), I'm stuck on step 3. Tips?
(The contrast of a contrast shown in the vignette here is for alternative functional forms, which is somewhat different from what I want to test.)
suppressPackageStartupMessages({
library(emmeans)})
# create example data set: 6 rows per respondent (data set shows 2 respondents)
cedata.1 <- data.frame(
  id     = c(1,1,1,1,1,1,2,2,2,2,2,2),
  QES    = c(1,1,2,2,3,3,1,1,2,2,3,3), # choice set
  Alt    = c(1,2,1,2,1,2,1,2,1,2,1,2), # alt 1 or alt 2 in choice set
  Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # dep variable: chosen (1) or not (0)
  LOC    = c(0,0,1,1,0,1,0,1,1,0,0,1), # indep variable per choice set, binary categorical
  SIZE   = c(1,1,1,0,0,1,0,0,1,1,0,1), # indep variable per choice set, binary categorical
  gender = c(1,1,1,1,1,1,0,0,0,0,0,0)  # indep variable per individual, binary categorical
)
# estimate model
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC * SIZE)
# calculate 1st diff (and p value)
pairs(comp1.loc.size, simple = "SIZE") # gives result I want
#> LOC = 0:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 -1.39 1.73 Inf -0.800 0.4235
#>
#> LOC = 1:
#> contrast estimate SE df z.ratio p.value
#> 0 - 1 0.00 1.73 Inf 0.000 1.0000
#>
#> Results are given on the log odds ratio (not the response) scale.
# calculate 2nd diff (and p value)
# ** the following gives the relevant values for doing the 2nd diff comparison (i.e., -1.39 and 0.00)...but how to make the statistical comparison?
pairs(comp1.loc.size, simple = "SIZE")
#> (same output as shown above)
Taking pairwise contrasts of the first-difference contrasts, with by = NULL so the comparison crosses the LOC groups, gives the second difference and its test:
pairs(pairs(comp1.loc.size, simple = "SIZE"), by = NULL)
Another solution:
# estimate means (i.e., values used to calc 1st diff).
comp1.loc.size <- emmeans(glm.model, ~ LOC | SIZE)
# second difference:
pairs(pairs(emmeans::regrid(comp1.loc.size)), by = NULL)
PS: This solution is almost a copy of the solution here: Testing contrast of contrast (first/second difference) in outcome
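As a further option, emmeans can collapse steps 2 and 3 into one call through interaction contrasts (the interaction argument is documented in the emmeans interactions vignette); a minimal sketch, assuming the LOC * SIZE grid from step 1:
# interaction contrast = contrast of contrasts in a single call
contrast(emmeans(glm.model, ~ LOC * SIZE), interaction = "pairwise")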

How can I calculate the standard error of the poisson.test in R?

I have a dataset like this:
ID Freq.x Freq.y
1 1 8
2 5 3
...
I calculated the ratio between the two rate parameters of Freq.x and Freq.y using R's poisson.test function, but I want to calculate the standard error. How can I do that?
You don't have any reproducible data in your question, so let's make some:
set.seed(69)
x <- rpois(100, lambda = 7)
y <- rpois(100, lambda = 8)
You can get the standard error of the estimated rate for each of these two variables like this (the variance of a Poisson mean estimate is lambda/n, so the SE is sqrt(mean/n)):
se_x <- sqrt(mean(x) / length(x))
se_y <- sqrt(mean(y) / length(y))
se_x
#> [1] 0.2638181
se_y
#> [1] 0.2840775
and you can compare the two to determine whether the underlying rates are significantly different, like this:
poisson.test(c(sum(x), sum(y)))
#>
#> Comparison of Poisson rates
#>
#> data: c(sum(x), sum(y)) time base: 1
#> count1 = 696, expected count1 = 751.5, p-value = 0.004533
#> alternative hypothesis: true rate ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.7781748 0.9556714
#> sample estimates:
#> rate ratio
#> 0.8624535
It's not clear what you mean by the standard error of the poisson.test though.
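If what you're after is the uncertainty of the rate ratio itself, poisson.test doesn't report a standard error for it, but a common large-sample (delta-method) approximation works on the log scale. A sketch (my addition, not part of poisson.test's output):
# approximate SE of the log rate ratio from the two total counts
se_log_rr <- sqrt(1 / sum(x) + 1 / sum(y))
rr <- mean(x) / mean(y)                    # point estimate of the rate ratio
exp(log(rr) + c(-1.96, 1.96) * se_log_rr)  # Wald-style 95% CI on the ratio scale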

How to convert fitdistrplus::fitdist summary into tidy format?

I have the following code:
x <- c(
0.367141764080875, 0.250037975705769, 0.167204185003365, 0.299794433447383,
0.366885973041269, 0.300453205296379, 0.333686861081341, 0.33301168850398,
0.400142004893329, 0.399433677388411, 0.366077304765104, 0.166402979455671,
0.466624230750293, 0.433499934139897, 0.300017278751768, 0.333673696762895,
0.29973685692478
)
fn <- fitdistrplus::fitdist(x,"norm")
summary(fn)
#> Fitting of the distribution ' norm ' by maximum likelihood
#> Parameters :
#> estimate Std. Error
#> mean 0.32846024 0.01918923
#> sd 0.07911922 0.01355908
#> Loglikelihood: 19.00364 AIC: -34.00727 BIC: -32.34084
#> Correlation matrix:
#> mean sd
#> mean 1 0
#> sd 0 1
Basically, it takes a vector and tries to fit the distribution using the fitdistrplus package.
I tried looking at the broom package, but it doesn't have
a function that covers that.
When you call broom::tidy(fn) you receive an error that says:
Error: No tidy method for objects of class fitdist
This is because broom's tidy() only supports a finite number of object classes; see methods(tidy) for the complete list (and read up on S3 methods in R for the background).
So the function doesn't work for a fitdist object, but it does work for a fitdistr object from MASS (which is better known).
We can therefore add that class to fn and then use broom:
class(fn) <- c("fitdist", "fitdistr")
# notice that I've kept the original class and added the other;
# you shouldn't overwrite classes, i.e. don't do this: class(fn) <- "fitdistr"
broom::tidy(fn)
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
Note that you can only see the parameters. If you wish to see more and organize everything as "tidy", you should tell us more about your expected output.
broom::tidy() only gets you this far. If you want more, I'd start by defining my own method function for the fitdist class, using the tidy.fitdistr method as a reference and adapting it.
Example of how I'd adapt from the original broom::tidy() code, using the S3 method for the class fitdist.
Define your own method (similar to how you define your own function):
# necessary libraries
library(dplyr)
library(broom)
# method definition:
tidy.fitdist <- function(x, ...) { # notice the use of .fitdist
  # you decide what you want to keep from summary(fn);
  # inspect fn$... (e.g. with str(fn)) to see what you can harvest
  e1 <- tibble(
    term = names(x$estimate),
    estimate = unname(x$estimate),
    std.error = unname(x$sd)
  )
  e2 <- tibble(
    term = c("loglik", "aic", "bic"),
    value = c(unname(x$loglik), unname(x$aic), unname(x$bic))
  )
  e3 <- x$cor # I prefer this to: as_tibble(x$cor)
  list(e1, e2, e3) # you can name each element for a nicer result
  # example: list(params = e1, scores = e2, corrMatr = e3)
}
This is how you can call this new method now:
tidy(fn) # to be more clear this is calling your tidy.fitdist(fn) under the hood.
# [[1]]
# # A tibble: 2 x 3
# term estimate std.error
# <chr> <dbl> <dbl>
# 1 mean 0.328 0.0192
# 2 sd 0.0791 0.0136
#
# [[2]]
# # A tibble: 3 x 2
# term value
# <chr> <dbl>
# 1 loglik 19.0
# 2 aic -34.0
# 3 bic -32.3
#
# [[3]]
# mean sd
# mean 1 0
# sd 0 1
Notice that the class is:
class(fn)
[1] "fitdist"
So now you don't actually need to assign the fitdistr (from MASS) class as before.
Not sure exactly what you need, but you can try:
tidy_fn <- rbind(fn$estimate,fn$sd)
https://stats.stackexchange.com/questions/23539/use-fitdist-parameters-in-variables
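If you want those values in the same tidy shape as the tidy() output above, here is a small sketch using tibble (a package broom already depends on; the column names are my choice, not an API):
tidy_fn <- tibble::tibble(
  term = names(fn$estimate),
  estimate = unname(fn$estimate),
  std.error = unname(fn$sd)
)
tidy_fn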

Estimating the parameters of a distribution from generated random numbers

I'm new to R and am using it to solve a statistical problem. Currently I'm working on estimating the parameters of a distribution using 200 random numbers (RN) that I generate with R. I generate 200 RN 100 times, so there will be 100 sets of 200 RN, and I will run the estimation on each set. That also means there will be 100 sets of estimation results.
So here is the code I use to generate the RN:
# generate random numbers U ~ (0, 1)
rep <- 100 # total replications
unif <- matrix(0, 200, rep)
for (k in 1:rep) {
  unif[, k] <- runif(200, min = 0, max = 1)
}
# Based on the 100 sets of generated random numbers that follow U ~ (0, 1), I will generate 100 sets of random numbers that follow the estimated distribution:
# Define parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
X <- matrix(0, 200, rep)
for (k in 1:rep) {
  X[, k] <- a*(log(1 - ((log(1 - ((unif[, k])^(1/c))))/(a*d))))^(1/b)
}
# Sorting the generated RN from the smallest to the largest
X_sort <- matrix(0, 200, rep)
for (k in 1:rep) {
  X_sort[, k] <- sort(X[, k])
}
Up to here I've managed to generate the 100 sets of RN to be estimated. However, the problem I face now is how to estimate all 100 sets; I can only estimate one. Here is the code I use to estimate the parameters with the maxLik package (the estimation method is BHHH):
xi <- X_sort[, 1]
log_likelihood <- function(theta, xi) {
  p1 <- theta[1] # 1st parameter
  p2 <- theta[2] # 2nd parameter
  p3 <- theta[3] # 3rd parameter
  p4 <- theta[4] # 4th parameter
  logL <- log((p4*p2*p3*((xi/p1)^(p2-1))*(exp(((xi/p1)^(p2))+(p4*p1*(1-(exp((xi/p1)^(p2)))))))*((1-(exp((p4*p1*(1-(exp((xi/p1)^(p2))))))))^(p3-1))))
  return(logL)
}
library(maxLik)
# Initial parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
m <- maxLik(log_likelihood, start = c(a, b, c, d), xi = xi, method = "bhhh")
summary(m)
Here is the result:
--------------------------------------------
Maximum Likelihood estimation
BHHH maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -874.0024
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 4.790e+01 1.846e+00 25.953 < 2e-16 ***
[2,] 3.015e+00 1.252e-01 24.091 < 2e-16 ***
[3,] 1.717e-01 2.964e-02 5.793 6.91e-09 ***
[4,] 7.751e-05 6.909e-05 1.122 0.262
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
To estimate the other 99 sets of RN, I have to manually change xi = X_sort[,k] for k = 1, 2, ..., 100, so for the second set it becomes X_sort[,2], and so on up to the hundredth. This is not efficient because it takes a long time to replace them one by one. Is there a way to modify this code so that estimating the other sets doesn't take so long?
First, I'd suggest rewriting your code in a more compact way.
1. Generating random numbers. There is no need to generate 100 vectors, each of length 200, when we can generate one vector of length 100*200 and then write it column-wise into a matrix:
rep <- 100
n <- 200
unif <- matrix(runif(rep*n, min = 0, max = 1), n, rep)
2. Calculating a function of a matrix. In R, vectorized functions can be applied to matrices elementwise, so in your case it becomes:
X <- a*(log(1-((log(1-((unif)^(1/c))))/(a*d))))^(1/b)
3. Column-wise matrix sorting. We can sort each column of the matrix using the apply function. The argument 2 means we work column-wise (1 stands for rows):
X_sort <- apply(X, 2, sort)
4. Performing the estimations. Again, we can use apply here:
estimations <- apply(X_sort, 2, function(x)
  maxLik(log_likelihood, start = c(a, b, c, d), xi = x, method = "bhhh"))
Then to print all the summaries you can do the following:
lapply(estimations, summary)
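And if you then want the 100 sets of estimates side by side rather than buried in a list, here is a small follow-up sketch (coef() has a method for maxLik objects):
# one column per replication: a 4 x 100 matrix of parameter estimates
estimates_matrix <- sapply(estimations, coef)
rowMeans(estimates_matrix) # average of each parameter across replications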
