I'm trying to create an ordered probit model in R. My independent variable is categorical and my dependent variable is ordinal. I'm using the polr command and it runs without error. When I run the command, I get the log odds for the different variables, which I have converted into odds ratios using exp(). As far as I understand it, these odds ratios tell me the chance of my dependent variable going up one category every time my independent variable "goes up" one category. Is that correct? I'm somewhat confused because, in the case of a categorical independent variable, it's not really an increase, since the levels are just categories.
My second question concerns the interpretation of the polr output. All I get are the odds ratios. How would you recommend I obtain additional information on the suitability of the ordered probit? Thanks!
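For reference, a minimal sketch of the workflow described above, using MASS::polr with hypothetical variable names (mydata, outcome as an ordered factor, group as a categorical predictor):

library(MASS)

# Hypothetical data: 'outcome' is an ordered factor, 'group' is a categorical predictor.
# polr()'s default link is logistic (a proportional-odds model); exp(coef) gives odds
# ratios only for that link, not for method = "probit".
fit <- polr(outcome ~ group, data = mydata, Hess = TRUE)

summary(fit)       # coefficients (log odds) and intercepts (cut-points)
exp(coef(fit))     # odds ratios for each non-reference level of 'group'
exp(confint(fit))  # profile-likelihood confidence intervals on the odds-ratio scale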
I created a random-forest regression model in R for time-series data with three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance; I am not trying to find the variables that have the biggest effect. Instead, I am trying to see, if I pick input variable X_1 and increase (or decrease) its value, how that would change the prediction output.
Does it even make sense to do this? Is it even possible with a random-forest model? Rereading my question a few times made me dubious, but any insight/recommendation would be greatly appreciated.
I would guess that what this question is actually about is exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots between the variables to get a feeling for the shape of those relationships. Depending on the variables, [linear] regression could tell you how an increase in variable x1 would affect variable x2.
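As a rough sketch of that kind of EDA in R (assuming a data frame df whose columns are the three predictors and the response; all names here are placeholders):

# Pairwise [linear] correlations between all variables
cor(df, use = "pairwise.complete.obs")

# Scatter plots for every pair of variables
pairs(df)

# A simple linear regression: how does y change, on average, per unit increase in x1?
summary(lm(y ~ x1, data = df))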
I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speak against using a particular method (such as logistic regression) on your data, you might want to study the nature of the "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out a residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed (normality of the model residuals, not of each individual variable, is the actual assumption in linear regression; of course you might have e.g. bimodally distributed data if there are differences between groups). See whether there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the model without outliers (a sketch of these diagnostics in R follows after this list).
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method which makes no assumption of normality.
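A minimal sketch of those residual diagnostics in R, assuming a fitted linear model called fit (the formula, variable names, and data frame are placeholders):

# Fit the model of interest (placeholder formula and data)
fit <- lm(y ~ x1 + x2, data = mydata)

# Standard diagnostic plots: residuals vs fitted (Tukey-Anscombe),
# Q-Q normal plot of residuals, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Observations with unusually large influence (possible outliers worth inspecting)
which(cooks.distance(fit) > 4 / nrow(mydata))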
Before modelling anything or removing any data, I would always plot the data by treatment/outcome group and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good starting point for an overview of all variables, their distributions and pairwise correlations (as long as you don't have more than around 20 variables) is the pairs.panels function from the psych library:
# Read the posted data (tab-delimited, no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Histograms on the diagonal, scatter plots below, correlations above;
# split into two calls because 23 variables are too many for one panel
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
I am trying to build a model that can predict SalePrice using independent variables that denote various house features. I used a multiple regression model and found that some predictor variables, as well as the response variable, needed to be transformed.
My final model is as follows:
[Model output]
How do I interpret this result? Can I conclude that a one-unit increase in Years Since Remodel corresponds to a -2.905e-03 change in the log of Sale Price? How do I make this interpretation easier to understand? Thank you.
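One common way to make a coefficient on a log-transformed response more readable is to convert it to an approximate percent change in the original units; a small sketch, assuming the response was transformed with log() and using the coefficient quoted above:

b <- -2.905e-03      # coefficient for Years Since Remodel on the log(SalePrice) scale
100 * (exp(b) - 1)   # roughly -0.29% change in SalePrice per additional year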
I have a database of around 36 predictor variables which I am using to predict a target variable. The target is a categorical variable with three different classes, whereas the predictor variables include both numeric and categorical variables.
However, the data are subject to severe multicollinearity. I am trying to build a parsimonious logistic regression model, so I need to reduce the number of variables. Judging by the VIF values, the results become counter-intuitive as soon as I reduce the number of variables. On the other hand, I am not sure that PCR (principal components regression) can solve the problem, as I need inferences about the sensitivity to each variable.
What is the better option to deal with such problem?
Which R packages can I use?
Will factor analysis solve the problem?
Or can we infer everything from PCR?
You first have to run an ANOVA/Kruskal-Wallis test to check which variables are well suited to your problem. With 36 variables I don't think you will need PCA, as it would make your model lose some explainability.
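As a rough illustration of that screening step (assuming a data frame df with a three-class factor column target and numeric predictors; all names are placeholders), a Kruskal-Wallis test for each numeric predictor might look like:

# Kruskal-Wallis test of each numeric predictor against the 3-class target
num_vars <- names(df)[sapply(df, is.numeric)]
pvals <- sapply(num_vars, function(v) kruskal.test(df[[v]] ~ df$target)$p.value)
sort(pvals)   # smaller p-values suggest the predictor separates the classes better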
Remember that PCA reduces dimensionality and explains only part of the variance in the data. Factor analysis will group the variables into factors, in case you want to run a separate logistic regression for each factor of grouped variables.
If you want to build a parsimonious logistic regression, you can apply some regularization to increase its generalization properties, instead of reducing the number of variables.
You can use the following R packages: caret (logistic regression), ROCR (AUC), ggplot2 (plots), DMwR (outliers), mice (missing values).
Also, if you want to apply regularization, a standard choice is the L2-penalized logistic loss, J(θ) = −(1/m) Σ [ y log hθ(x) + (1 − y) log(1 − hθ(x)) ] + (λ/2m) Σ θj².
In this case, you can implement the regularization from scratch, without a library, to adjust the steepness of the sigmoid so that you can correctly classify your classes:
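A minimal from-scratch sketch of that idea in R (gradient descent on the L2-penalized logistic loss, binary case for simplicity; the data frame, column names, and hyper-parameter values are all placeholders, and this is only an illustration rather than a replacement for a tested package):

sigmoid <- function(z) 1 / (1 + exp(-z))

# X: numeric predictor matrix with an intercept column, y: 0/1 outcome vector
reg_logistic <- function(X, y, lambda = 0.1, lr = 0.01, iters = 5000) {
  theta <- rep(0, ncol(X))
  m <- nrow(X)
  for (i in seq_len(iters)) {
    p <- sigmoid(X %*% theta)
    grad <- t(X) %*% (p - y) / m
    grad[-1] <- grad[-1] + (lambda / m) * theta[-1]  # do not penalize the intercept
    theta <- theta - lr * grad
  }
  drop(theta)
}

# Hypothetical usage: binary outcome 'y' and two numeric predictors in 'df'
# X <- cbind(1, as.matrix(df[, c("x1", "x2")]))
# theta_hat <- reg_logistic(X, df$y, lambda = 1)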
I’m trying to do an ANCOVA here ...
I want to analyze the effect of EROSION FORCE and ZONATION on all the species (listed with small letters) in each POOL.STEP (ranging from 1-12/1-4), while controlling for the effect of FISH.
I’m not sure if I’m doing it right. What is the command for ANCOVA?
So far I have used lm(EROSIONFORCE ~ ZONATION + FISH, data = d), which yields:
[model output]
So what I see here is that both erosion force percentage (intercept?) and sublittoral zonation are significant in some way, but I'm still not sure whether I've done an ANCOVA correctly here, or whether this is just an ANOVA.
In general, ANCOVA (analysis of covariance) is simply a special case of the general linear model with one categorical predictor (factor) and one continuous predictor (the "covariate"), so lm() is the right function to use.
However ... the bottom line is that you have a moderately challenging statistical problem here, and I would strongly recommend that you try to get local help (if you're working within a research group, can you consult with others in your group about appropriate methods?). I would suggest following up either on CrossValidated or on the r-sig-ecology@r-project.org mailing list.
by putting EROSIONFORCE on the left side of the formula, you're specifying that you want to use EROSIONFORCE as a response (dependent) variable, i.e. your model is estimating how erosion force varies across zones and for different fish numbers - nothing about species response
if you want to analyze the response of a single species to erosion and zone, controlling for fish numbers, you need something like
lm(`Acmaeidae s...` ~ EROSIONFORCE+ZONATION+FISH, data=your_data)
the lm() suggestion above would handle each species independently, i.e. you'd have to do a separate analysis for each species. If you also want to do it separately for each POOL.STEP, you're going to have to do a lot of separate analyses. There are various ways of automating this in R; the most idiomatic is probably to melt your data (see reshape2::melt or tidyr::gather) into long format and then use lmList from lme4 (see the sketch after this list).
since you have count data with low means, i.e. lots of zeros (and a few big values), you should probably consider a Poisson or negative binomial model, and possibly even a zero-inflated/hurdle model (i.e. analyze presence-absence and size of positive responses separately)
if you really want to analyze the joint distribution of all species (i.e. a multivariate response, which is the M in MANOVA), you're going to have to work quite a bit harder ... there are a variety of joint species distribution models by people like Pierre Legendre, David Warton and others ... I'd suggest you try starting with the mvabund package, but you might need to do some reading first
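A rough sketch of the melt-then-fit-per-species approach mentioned above, assuming a wide data frame d with one column per species count plus the EROSIONFORCE, ZONATION and FISH columns (the column selection and names are placeholders):

library(reshape2)
library(lme4)

# Reshape from wide (one column per species) to long format:
# one row per (observation, species) pair, with the count in 'count'
long <- melt(d,
             id.vars = c("EROSIONFORCE", "ZONATION", "FISH"),
             variable.name = "species", value.name = "count")

# Fit the same model separately for each species
fits <- lmList(count ~ EROSIONFORCE + ZONATION + FISH | species, data = long)
summary(fits)

# For count data with many zeros, a Poisson GLM per species may be more appropriate, e.g.
# glm(count ~ EROSIONFORCE + ZONATION + FISH, family = poisson,
#     data = subset(long, species == "..."))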