I have a set of data and I'd like to know whether this data set has a logistic distribution.
When I made a histogram of my data set (see the histogram on http://imageshack.us/photo/my-images/593/histogram.png/) it seems to have a logistic distribution, but to be sure I'd like to test for a logistic distribution in R. So my question is: Is there a way to test your data for a logistic distribution and how do you do this?
Additional information: The data set consists of 8544 items. The data are horizontal distances in km between 2 geographical points.
Thanks for your attention
Sander
In R you can use the ks.test or chisq.test functions (and probably others) to test against a hypothesized distribution. Note that these tests (and others) are all rule-out tests: a non-significant result does not guarantee that the data come from the given distribution, just that you cannot rule it out. Also note that with a sample size of 8544 these tests are likely to be way overpowered, meaning that they will have the power to detect slight, meaningless differences, and you are likely to reject the null hypothesis even though the fit is "close enough". Also, the fact that you decided on a distribution by looking at the data first could bias the results.
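A minimal sketch of the ks.test route, assuming the distances sit in a numeric vector. The vector here is simulated (the real data are not available), and estimating location and scale by the method of moments is an assumption of this sketch, not something prescribed above. Because the parameters are estimated from the same data, the reported p-value is only approximate.

```r
# Simulated stand-in for the 8544 distances (hypothetical values, not the real data)
set.seed(42)
distances <- rlogis(8544, location = 50, scale = 10)

# Method-of-moments estimates: a logistic distribution has variance scale^2 * pi^2 / 3
loc_hat   <- mean(distances)
scale_hat <- sd(distances) * sqrt(3) / pi

# Kolmogorov-Smirnov test against the fitted logistic CDF
ks_result <- ks.test(distances, "plogis", location = loc_hat, scale = scale_hat)
print(ks_result)
```

With real data, replace the simulated vector with your own and treat a small p-value with the caution described above: at this sample size, tiny departures from the logistic shape are enough to reject.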
Another approach that may give you a better feel for whether a logistic distribution is "close enough", rather than an exact fit, is to use the vis.test function in the TeachingDemos package (be sure to read the paper referenced in the help page to understand the test and what assumptions you are making).
Most important is understanding the science that leads to the data: does a logistic distribution make sense scientifically? What other distributions could be reasonable? Also understand what question(s) you are trying to answer with the data and how the choice of distribution affects those answers (e.g. the CLT will let you use the normal distribution to answer some questions, but not others, even though the data come from a logistic or something similar).
Unfortunately, I had convergence (and singularity) issues when calculating my GLMM analysis models in R. When I tried it in SPSS, I got no such warning message and the results are only slightly different. Does it mean I can interpret the results from SPSS without worries? Or do I have to test for singularity/convergence issues to be sure?
You have two questions. I will answer both.
First Question
Does it mean I can interpret the results from SPSS without worries?
You do not want to do this. The reason is that mixed models have a very specific parameterization. Here is a screenshot of common lme4 syntax from the original article about lme4 from the author:
With this come assumptions about what your model is saying. If, for example, you are running a model with random intercepts only, you are assuming that the slopes do not vary by any measure. If you include correlated random slopes and random intercepts, you are then assuming that there is a relationship between the slopes and intercepts that may be either positive or negative. If you present these results as-is without knowing why the software produced this summary, you may fail to explain your data accurately.
The reason, as highlighted in one of the comments, is that SPSS runs off defaults whereas R requires you to specify the model explicitly. I'm not surprised that the model failed to converge in R but not in SPSS, given that SPSS assumes no correlation between random slopes and intercepts. This kind of model is more likely to converge than a correlated model, because a correlated model has to estimate an additional covariance parameter, which makes convergence harder. However, without knowing how you modeled your data, it is impossible to know what the differences actually are. If you edit your question to show your model, it could be answered more directly, but just know that SPSS and R do not fit these models the same way.
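To make the difference between these random-effects structures concrete, here is a small sketch of the three lme4 formulas discussed. The sleepstudy data set (shipped with lme4) and its variable names are stand-ins chosen for illustration and have nothing to do with the asker's model. The fitting step runs only if lme4 is installed.

```r
# Random intercepts only: slopes are assumed not to vary between subjects
f_intercept_only <- Reaction ~ Days + (1 | Subject)

# Correlated random intercepts and slopes: their correlation is estimated
f_correlated <- Reaction ~ Days + (1 + Days | Subject)

# Uncorrelated random intercepts and slopes: the correlation is fixed at zero
f_uncorrelated <- Reaction ~ Days + (1 + Days || Subject)

# Fit all three only if lme4 is installed (sleepstudy ships with lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  fits <- lapply(
    list(f_intercept_only, f_correlated, f_uncorrelated),
    function(f) lme4::lmer(f, data = lme4::sleepstudy)
  )
  # The variance components show which random-effect terms each model estimates
  print(lapply(fits, lme4::VarCorr))
}
```

Comparing the printed variance components side by side shows which terms each specification estimates, which is the information you need before comparing output across packages.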
Second Question
Or do I have to test for singularity/convergence issues to be sure?
SPSS and R both have singularity checks by default (check this page as an example). If your model fails to converge, you should drop it and use an alternative model (usually one with a simpler random-effects structure or a better-suited optimizer).
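A hedged sketch of what checking and simplifying can look like in lme4. The sleepstudy data and the choice of the "bobyqa" optimizer are illustrative assumptions, not a recommendation for the asker's specific GLMM, and the code only runs the fits if lme4 is installed.

```r
has_lme4 <- requireNamespace("lme4", quietly = TRUE)

if (has_lme4) {
  # Full model with correlated random intercepts and slopes
  full <- lme4::lmer(Reaction ~ Days + (1 + Days | Subject),
                     data = lme4::sleepstudy)

  # TRUE if some variance component is estimated at (or very near) its boundary
  singular_full <- lme4::isSingular(full)

  # Option 1: simplify the random-effects structure
  reduced <- lme4::lmer(Reaction ~ Days + (1 | Subject),
                        data = lme4::sleepstudy)

  # Option 2: keep the structure but try a different optimizer
  refit <- lme4::lmer(Reaction ~ Days + (1 + Days | Subject),
                      data = lme4::sleepstudy,
                      control = lme4::lmerControl(optimizer = "bobyqa"))

  print(singular_full)
}
```

For a GLMM the same pattern applies with lme4::glmer and lme4::glmerControl.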
I'm working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset where at least one desired variable will be normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
Carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out a residual analysis (normal Q-Q plot, Tukey-Anscombe plot, leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each variable is (you might, for example, have bimodally distributed data if there are differences between groups). See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
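The residual checks described above can be sketched as follows. The built-in mtcars data and the model mpg ~ wt + hp are stand-ins for illustration, and the 4/n cut-off for Cook's distance is just one common rule of thumb, not something taken from this answer.

```r
# Built-in mtcars data stands in for the real dataset (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Formal check of residual normality (complements, not replaces, a Q-Q plot)
normality <- shapiro.test(res)

# Influence measure: flag observations with a large Cook's distance
cooks   <- cooks.distance(fit)
flagged <- names(cooks)[cooks > 4 / nrow(mtcars)]

# Diagnostic plots, drawn only in an interactive session
if (interactive()) {
  qqnorm(res); qqline(res)   # normal Q-Q plot of the residuals
  plot(fitted(fit), res)     # Tukey-Anscombe plot (residuals vs fitted values)
  plot(fit, which = 5)       # residuals vs leverage
}

print(normality)
print(flagged)
```

Flagged observations are candidates to study, not to delete automatically; any removal still has to be justified and reported.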
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and check for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is to use pairs.panels from the psych library.
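# Read the tab-delimited data (the file has no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Scatterplot matrices with histograms and pairwise correlations, in two batches
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])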
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how should I interpret them?
I think maybe you are overthinking this one! Take a step back and think about what matters... the error. You have forecasted values and you have observed values. The difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. Additionally, you can find a lot of information in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
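As a rough sketch of those basic error measures, the helper below computes them from observed and predicted values in base R. The function name fit_metrics and the toy numbers are made up for illustration; in practice the predictions would come from calling predict() on your fitted model with held-out data.

```r
# Basic error measures from observed and predicted values
fit_metrics <- function(observed, predicted) {
  err <- observed - predicted
  c(
    MSE  = mean(err^2),                 # mean squared error
    RMSE = sqrt(mean(err^2)),           # same units as the response
    MAE  = mean(abs(err)),              # mean absolute error
    MPE  = 100 * mean(err / observed)   # mean percentage error; undefined if any observed value is 0
  )
}

# Toy values (hypothetical, for illustration only)
observed  <- c(3, 5, 2, 7)
predicted <- c(2.5, 5, 4, 8)

metrics <- fit_metrics(observed, predicted)
print(metrics)
```

Lower MSE, RMSE and MAE mean a closer fit; MPE keeps the sign of the errors, so it indicates whether the model tends to over- or under-predict.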
-Carmine
gbm_documentation
I'm familiar with G*Power as a tool for power analyses, but have yet to find a resource on the internet describing how to compute a power analysis for logistic regression in R. The pwr package doesn't list logistic regression as an option.
You will very likely need to "roll your own".
Specify your hypothesized relationship between predictors and outcome.
Specify what values of your predictors you are likely to observe in your study. Will they be correlated?
Specify the effect size you would like to detect, e.g., odds ratios corresponding to two specific settings of your predictors.
Specify a power level, e.g., 0.80 (that is, 1 - beta = 0.80, so beta = 0.20).
For different sample sizes n:
Simulate predictors as specified
Simulate outcomes
Run your analysis
Record whether you detect a statistically significant effect
Do these steps many times, on the order of 1000 or more times. Count how often you detected an effect. If you detected an effect more than (e.g.) 80% of the time, you are overpowered - reduce n and start over. If you detected an effect less than 80% of the time, you are underpowered - increase n and start over. Rinse & repeat until you have a good n.
And then think some more about whether all your assumptions really make sense. Vary them a bit. Is the resulting value of n sensitive to your assumptions?
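The steps above can be sketched in base R as follows. Everything specific here is an assumption made for illustration: a single standard-normal predictor, an odds ratio of 2 per unit of the predictor, a 5% significance level, and the function name simulate_power.

```r
# Estimate power for a one-predictor logistic regression by simulation
simulate_power <- function(n, beta0 = 0, beta1 = log(2),
                           n_sims = 1000, alpha = 0.05) {
  significant <- logical(n_sims)
  for (i in seq_len(n_sims)) {
    x <- rnorm(n)                           # simulate the predictor
    p <- plogis(beta0 + beta1 * x)          # hypothesized relationship
    y <- rbinom(n, size = 1, prob = p)      # simulate the outcome
    fit <- glm(y ~ x, family = binomial)    # run the analysis
    p_value <- coef(summary(fit))["x", "Pr(>|z|)"]
    significant[i] <- p_value < alpha       # record whether the effect was detected
  }
  mean(significant)                         # proportion of detections = estimated power
}

# Estimated power for a few candidate sample sizes
set.seed(1)
sample_sizes <- c(50, 100, 200)
power_by_n <- sapply(sample_sizes, simulate_power, n_sims = 200)
print(data.frame(n = sample_sizes, power = power_by_n))
```

From here, pick the smallest n whose estimated power reaches your target, then rerun with more simulations and with varied assumptions (effect size, predictor distribution) to see how sensitive that n is.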
Yes, this will be quite a bit of work. But it will be worth it. On the one hand, it will keep you from running an over- or underpowered study. On the other hand, as I wrote, this will force you to think deeply about your assumptions, and this is the path to enlightenment. (Which is a painful path to travel. Sorry.)
If you don't get any better answers specifically helping you to do this in R, you may want to look to CrossValidated for more help. Good luck!
This question and answers on Crossvalidated discuss power for logistic regression and include R code as well as additional discussion and links for more information.
I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist2[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is of type multi-modal;
Hist2's distribution is of type uni-modal with single prominent peak.
My questions are
Is there any way that I could determine the type of distribution programmatically?
How to quantify whether these two histograms are similar/dissimilar?
Thanks
Raj,
I posted a C function in your other question ( automatically compare two series -Dissimilarity test ) that will compute divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a gaussian distribution and use something like the R-squared value to determine if the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try doing a cross-correlation of the two histograms and using its peak positive value as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
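A rough sketch of the Gaussian-fit idea in R, using the two histograms from the question. Fitting by matching the histogram's mean and standard deviation (rather than by least squares) and treating the bin index as the x-value are simplifying assumptions of this sketch, and gaussian_r2 is a made-up helper name.

```r
# R-squared of a moment-matched Gaussian curve against histogram counts
gaussian_r2 <- function(counts) {
  bins  <- seq_along(counts)                       # bin index used as the x-value
  total <- sum(counts)
  mu    <- sum(bins * counts) / total              # histogram mean
  sigma <- sqrt(sum(counts * (bins - mu)^2) / total)  # histogram standard deviation
  fitted <- total * dnorm(bins, mean = mu, sd = sigma)  # expected counts (bin width 1)
  1 - sum((counts - fitted)^2) / sum((counts - mean(counts))^2)
}

hist1 <- c(1, 4, 3, 5, 2, 5, 4, 6, 3, 2)
hist2 <- c(1, 4, 3, 15, 12, 15, 4, 6, 3, 2)

r2_hist1 <- gaussian_r2(hist1)
r2_hist2 <- gaussian_r2(hist2)
print(c(hist1 = r2_hist1, hist2 = r2_hist2))
```

A value close to 1 suggests a single bell-shaped peak describes the histogram well; a low or negative value suggests it does not. Where to put the threshold is a judgment call.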
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimate of their similarity.
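A minimal sketch of the cross-correlation idea in R, using the two histograms from the question and the ccf function from the base stats package. Limiting the search to three bins of shift in either direction is an arbitrary choice made for illustration.

```r
hist1 <- c(1, 4, 3, 5, 2, 5, 4, 6, 3, 2)
hist2 <- c(1, 4, 3, 15, 12, 15, 4, 6, 3, 2)

# Cross-correlation at shifts of up to 3 bins in either direction
xc <- ccf(hist1, hist2, lag.max = 3, plot = FALSE)

peak_similarity <- max(xc$acf)                 # largest correlation over all shifts
best_lag        <- xc$lag[which.max(xc$acf)]   # shift at which it occurs

# Correlation with no shift, for comparison
zero_lag <- cor(hist1, hist2)

print(c(peak = peak_similarity, lag = best_lag, zero_lag = zero_lag))
```

Values near 1 mean the histograms have a similar shape (possibly after shifting); values near 0 mean little linear relationship between the bucket counts.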
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.