First time here. I'm taking an introductory statistics class that uses R. I have a dataset of 54 observations, and I have to determine the 90% upper-bounded confidence interval for the variance of the content. From that, I have to deduce whether it is possible to claim, with confidence of at least 90%, that the variance of the content is less than 0.001.
From my understanding, I have to use:
library(TeachingDemos)  # sigma.test() is provided by the TeachingDemos package
sigma.test(data, alternative = "less", conf.level = 0.90)$conf.int
From running it, I'm getting a value of 0. That doesn't sound normal at all. I'm not sure if I'm following the correct procedures.
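For reference, this is the textbook chi-squared bound I think sigma.test() should be reproducing (assuming data is my numeric vector of 54 observations; please correct me if this is wrong):

n     <- length(data)
s2    <- var(data)
alpha <- 1 - 0.90
upper <- (n - 1) * s2 / qchisq(alpha, df = n - 1)  # one-sided 90% upper bound for the variance
c(0, upper)                                        # the returned interval should look like (0, upper)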
I would appreciate any feedback to help my understanding of this more.
Cheers.
So I am analyzing a multi-site trial of rice breeding lines at 4 environments. The simplified data is here:
https://drive.google.com/file/d/1jilVXX8JMkZCDVtIRmrwzB55kgR2GtYB/view?usp=sharing
And I am running a simple model with an unstructured variance in sommer. I have done it in lme4 and nlme, but let's just say I want to stick with sommer. The model is:
m3 <- mmer(RDM ~ ENV,
           random = ~ vsr(usr(ENV), GEN),
           rcov   = ~ units,
           data   = d)
Pretty simple, no? However, very quickly I get the error:
System is singular (V). Stopping the job. Try a bigger number of tolparinv.
So, OK, I try a bigger tolparinv (as I can't make the model any simpler), but the smallest value that makes the function work is 1000. So, my question is: what are the implications of this?
Moreover, let's say that it is OK to run the model like that. What then happens is that many of my variance components are negative, which doesn't make much sense.
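In case it helps, this is the full call I end up running (the tolparinv spelling follows the error message, and summary(m3)$varcomp is how I'm reading off the variance components; please correct me if your sommer version names things differently):

library(sommer)
m3 <- mmer(RDM ~ ENV,
           random = ~ vsr(usr(ENV), GEN),
           rcov   = ~ units,
           data   = d,
           tolparinv = 1000)   # smallest value that lets the fit finish
summary(m3)$varcomp            # several of these come out negative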
Could somebody please shed some light on this? So, the concrete questions are:
Why does singularity arise so quickly?
What happens if I increase tolparinv so much?
Is that why my variance is negative?
And most importantly: is this fixable? How?
Thank you!!!
I want to calculate the 95% coverage rate of a simple OLS estimator.
The (for me) tricky addition is that the independent variable has 91 values that I have to test against each other in order to see which value leads to the best estimate.
For each value of the independent variable I want to draw 1000 samples.
I tried looking up the theory and also searched multiple platforms such as Stack Overflow, but I didn't manage to find an appropriate answer.
My biggest question is how to calculate a coverage rate for a 95% confidence interval.
I would deeply appreciate it if you could provide me with some possibilities or insights.
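To make the question concrete, this is roughly what I have pieced together so far for a single value of the independent variable (all numbers and names here are made up for illustration); I am not sure it is the right way to compute the coverage rate:

set.seed(1)
n_sims <- 1000      # samples drawn for this value of the independent variable
n      <- 100       # observations per sample
beta1  <- 0.5       # "true" slope used to generate the data

covered <- replicate(n_sims, {
  x   <- rnorm(n)
  y   <- 1 + beta1 * x + rnorm(n)
  fit <- lm(y ~ x)
  ci  <- confint(fit, "x", level = 0.95)
  ci[1] <= beta1 && beta1 <= ci[2]   # does the interval capture the true slope?
})

mean(covered)   # coverage rate: should be close to 0.95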
In Google Analytics, I am able to get a list of all the terms users search for on the site. For a large site over the course of several weeks, this could be upwards of 10,000 terms. I want to create a report that categorizes the types of terms that users searched for, but going through 10,000 terms and categorizing them by hand would be difficult in a reasonable timeframe. So my instinct was to sample and report on that sample.
I want to make sure I am using the right formula to generate a margin of error for the sample and that I am properly reporting it.
What I want to do is pull a random sample of the terms used, then put those terms into a spreadsheet of some kind and code them by hand into categories (products, personnel, jobs). In the end, I'll have categories, each accounting for some percentage of the sampled terms.
For a 95% confidence, I was going to use:
Margin of error = (1.96 * 0.5) / sqrt((population_total_count - 1) * sample_search_total_count / (population_total_count - sample_search_total_count))
population_total_count would be the total count of searches in the population (the full list) and sample_search_total_count would be the number of searches in a random sample I pull.
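Spelled out in R, with made-up counts for illustration (the sample size of 1,000 is just an example), that is:

N <- 10000   # population_total_count: total searches in the full list
n <- 1000    # sample_search_total_count: searches in my random sample

margin_of_error <- (1.96 * 0.5) / sqrt((N - 1) * n / (N - n))
margin_of_error   # roughly 0.03, i.e. about plus or minus 3%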
If 25% of my sample was "products", and I had a margin of error of 3%, I would report that as "We expect 25% of searches were for products, plus or minus 3%, at 95% confidence." I would use the same "plus or minus 3% at 95% confidence" for any of the other categories in the same survey.
Am I using the right formula and discussing this correctly? Am I correct in using the same +/- Margin of Error for each of the categories?
From the "1.96", I can tell you're assuming your data follow normal distributions, which isn’t necessary (and will be too crude an approximation for small datasets).
You should instead use one of the following three approaches:
A Dirichlet-multinomial model, if the data can be modelled as being generated all from one similar process (i.e. you assume all users' search behaviour is similar), or you are happy to treat them as such.
A mixture of Dirichlet distributions, if you know, or suspect, that there are two or several types of data (e.g. a group of children and a group of adults who are entering the search terms, and you don’t know who is whom).
A confidence interval for multinomial proportions, if you are in a hurry and seek an off-the-shelf frequentist technique. An example tool is the MultinomCI function in R; see, for example, Confidence Intervals for Multinomial Proportions in DescTools (a short sketch follows below).
Reference for the above three methods: The Datatrie Advisor. Good luck!
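A minimal sketch of that last option, with made-up category counts standing in for your hand-coded sample:

library(DescTools)

# Hypothetical counts from a hand-coded random sample of 1,000 search terms.
counts <- c(products = 250, personnel = 120, jobs = 80, other = 550)

# Simultaneous 95% confidence intervals for all category proportions.
MultinomCI(counts, conf.level = 0.95)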
This question may seem weird, so let me explain it with an example.
We train a classification model to determine whether an image contains a person or not.
After the model is trained, we use a new image for prediction.
The prediction shows that there is a 94% probability that the image contains a person.
Thus, could I say that the confidence level is 94% that the image contains a person?
Your third item is not properly interpreted. The model returns a normalized score of 0.94 for the category "person". Although this score correlates relatively well with our cognitive notions of "probability" and "confidence", do not confuse it with either of those. It's a convenient metric with some overall useful properties, but it is not an accurate prediction to two digits of precision.
Granted, there may well be models for which the model's prediction is an accurate figure. For instance, the RealOdds models you'll find on 538 are built and tested to that standard. However, that is a directed effort of more than a decade; your everyday deep learning model is not held to the same standard ... unless you work to tune it to that, making the accuracy of that number a part of your training (incorporate it into the error function).
You can run a simple (although voluminous) experiment: collect all of the predictions and bin them, say in ranges of 0.1 for each of 10 bins. Now, if this "prediction" is indeed a probability, then your 0.6-0.7 bin should correctly identify a person roughly 65% of the time. Check that against ground truth: did that bin get about 65% correct and 35% wrong? Is the discrepancy within expected ranges? Do this for each of the 10 bins and run your favorite applicable statistical measures on the result.
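For instance, something along these lines in R (the names and inputs are placeholders: scores are the model's outputs in [0, 1], truth is 1 when the image really contains a person and 0 otherwise):

calibration_check <- function(scores, truth, n_bins = 10) {
  bins <- cut(scores, breaks = seq(0, 1, length.out = n_bins + 1),
              include.lowest = TRUE)
  data.frame(
    bin        = levels(bins),
    mean_score = tapply(scores, bins, mean),   # what the model "claims"
    frac_true  = tapply(truth,  bins, mean),   # what actually happened
    n          = as.vector(table(bins))
  )
}

If the scores were honest probabilities, mean_score and frac_true would agree within sampling error in every bin.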
I expect that this will convince you that the inference score is neither a prediction nor a confidence score. However, I'm also hoping it will give you some ideas for future work.
I'm familiar with G*Power as a tool for power analyses, but I have yet to find a resource on the internet describing how to compute a power analysis for logistic regression in R. The pwr package doesn't list logistic regression as an option.
You will very likely need to "roll your own".
Specify your hypothesized relationship between predictors and outcome.
Specify what values of your predictors you are likely to observe in your study. Will they be correlated?
Specify the effect size you would like to detect, e.g., odds ratios corresponding to two specific settings of your predictors.
Specify a power level, e.g., power = 0.80 (i.e., beta = 0.20).
For different sample sizes n:
Simulate predictors as specified
Simulate outcomes
Run your analysis
Record whether you detect a statistically significant effect
Do these steps many times, on the order of 1000 or more times. Count how often you did detect an effect. If you detected an effect more than (e.g.) 80% of the time, you are overpowered - reduce n and start over. If you detected an effect less than 80%, you are underpowered - increase n and start over. Rinse & repeat until you have a good n.
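To make the loop concrete, here is a minimal sketch for a single binary predictor with a hypothesized odds ratio of 1.5 (every number here is an illustrative assumption, not a recommendation):

power_for_n <- function(n, odds_ratio = 1.5, p_baseline = 0.3,
                        n_sims = 1000, alpha = 0.05) {
  hits <- replicate(n_sims, {
    x   <- rbinom(n, 1, 0.5)                            # simulate the predictor
    eta <- qlogis(p_baseline) + log(odds_ratio) * x     # linear predictor
    y   <- rbinom(n, 1, plogis(eta))                    # simulate outcomes
    fit <- glm(y ~ x, family = binomial)
    coef(summary(fit))["x", "Pr(>|z|)"] < alpha         # significant effect?
  })
  mean(hits)                                            # estimated power at this n
}

power_for_n(300)   # try different n until the estimated power is roughly 0.80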
And then think some more about whether all your assumptions really make sense. Vary them a bit. Is the resulting value of n sensitive to your assumptions?
Yes, this will be quite a bit of work. But it will be worth it. On the one hand, it will keep you from running an over- or underpowered study. On the other hand, as I wrote, this will force you to think deeply about your assumptions, and this is the path to enlightenment. (Which is a painful path to travel. Sorry.)
If you don't get any better answers specifically helping you to do this in R, you may want to look to CrossValidated for more help. Good luck!
This question and its answers on Cross Validated discuss power for logistic regression and include R code, as well as additional discussion and links to more information.