Checking for multicollinearity and adjusting the data - r

I would appreciate help interpreting the following pairwise scatterplots of the predictor variables to check for multicollinearity, and then adjusting the data so that the regression can be refitted without this problem.
Background: I am working on a task where I have to carry out multiple linear regression. In this task I have three explanatory variables, tar, nico and weight, and want to predict CO, so CO is the response (dependent) variable. The data come from 25 American cigarette brands, where tar, nico and weight are each brand's tar content, nicotine content and weight per cigarette, and CO is how much carbon monoxide a cigarette emits.
Question: For the task I now plot all the explanatory variables pairwise against each other to look for multicollinearity, and I must also find an observation that is questionable to include in the regression. I have done this, see the picture above. But how should I interpret this image?
My thoughts: I have understood that multicollinearity would not exist if all the panels in this plot looked different, but I can clearly see that this is not the case here. For example, three of the four plots in the "tar" row look similar, and the same shape also appears in one plot in the "nico" row and two plots in the "weight" row. Does this mean that the three predictor variables are multicollinear? Or that some data in "tar" are collinear with other data in "tar"? Once I have figured out where this collinearity (possibly) arises, I need to adjust the data and run a new multiple linear regression on the reduced data set from which the questionable observation has been removed. I think this is done by setting the value of the dubious observation to NA, but first I have to find it.
Finally: How should I interpret the image, and how do I then adjust the data to get rid of any collinearity?
Any thoughts and tips on this are welcomed! Thanks in advance!
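A minimal sketch of how such a check might be done in R; the data frame name cig and the row number that gets dropped are assumptions for illustration, and car::vif() gives variance inflation factors for the fitted model:
library(car)                                 # for vif()
pairs(cig[, c("tar", "nico", "weight")])     # pairwise scatterplots of the predictors
cor(cig[, c("tar", "nico", "weight")])       # correlations; values near 1 indicate collinearity
fit <- lm(CO ~ tar + nico + weight, data = cig)
vif(fit)                                     # VIFs well above 5-10 signal multicollinearity
# Refit without the questionable observation (row 3 here is only a placeholder):
fit.reduced <- lm(CO ~ tar + nico + weight, data = cig[-3, ])
summary(fit.reduced)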

Related

Finding how a variable affects the output of a time-series random-forest regression model

I created a random-forest regression model for time-series data in R that has three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance, but I am not trying to find the variables that have the biggest effect; instead I am trying to see, if I pick input variable X_1 and increase (or decrease) its value, how that would change the prediction output.
Does it even make sense to do this? Is it even possible with a random-forest model? Rereading my question a few times made me dubious, but any insight/recommendation would be greatly appreciated.
I would guess that what this question is actually about is called exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots between the variables to get a feeling for the relationships. Depending on the variables, a [linear] regression could tell how an increase in variable x1 would affect variable x2.
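A rough sketch of that EDA in R, assuming the three predictors and the output sit in a data frame df with hypothetical columns x1, x2, x3 and y:
cor(df)       # pairwise [linear] correlations between all variables
pairs(df)     # scatter plots for every pair of variables
# A linear regression gives a first, rough idea of how a unit increase in x1
# is associated with a change in y, holding x2 and x3 fixed:
fit <- lm(y ~ x1 + x2 + x3, data = df)
coef(fit)["x1"]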

Extracting a normally distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset where at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speak against using a particular method (such as logistic regression) on your data, you might want to study the nature of the "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression), and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed (the normal distribution of the model residuals is the actual assumption in linear regression, not that each variable is normally distributed; of course you might have e.g. bimodally distributed data if there are differences between groups). See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers (a small sketch follows after this list).
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or model method which has no assumption of normality.
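A small sketch of those residual checks; the model formula, the data frame name mydata and the dropped row numbers are placeholders:
fit <- lm(y ~ x1 + x2, data = mydata)
par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted (Tukey-Anscombe), Q-Q normal, scale-location, leverage
# Refit after removing observations judged to be outliers (rows 5 and 17 are placeholders);
# always report which observations were removed and why:
fit.clean <- lm(y ~ x1 + x2, data = mydata[-c(5, 17), ])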
Before modelling anything or removing any data, I would always plot the data by treatment/outcome groups and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
dat <- read.delim("~/Downloads/dput.txt", header = F)   # read the posted data
library(psych)                                          # for pairs.panels()
psych::pairs.panels(dat[, 1:12])                        # first half of the variables
psych::pairs.panels(dat[, 13:23])                       # second half of the variables
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
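For example, a sketch of tuning a few of those arguments (see ?pairs.panels for the full list):
psych::pairs.panels(dat[, 1:12],
                    method   = "spearman",  # rank correlation instead of Pearson
                    density  = TRUE,        # overlay density curves on the histograms
                    ellipses = FALSE)       # drop the correlation ellipses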

Create a new dataframe to do piecewise linear regression on percentages after doing serial crosstabs in R

I am working with R. I need to identify the predictors of higher Active trial start percentage over time (StartDateMonthsYrs). I will do linear regression with Percent.Active as the dependent variable.
My original dataframe is attached, and my obtained active trial start percentage over time (named Percent.Active) is presented here.
So, I need to assess whether federally sponsored trials, industry-sponsored trials or other-sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing many crosstabs for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the obtained percentages in the second sheet and running the analysis based on that.
My code for linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point + xyz, data = data.percentage); summary(q.lm0)
I'm a little bit confused. You write 'associated'. If you really want to look for association, then yes, a crosstab might be possible, and sufficient, as association is not the same as causation (which is further derived from correlation, if there is a theory behind it). If you are looking for correlation, and insights over time, doing a regression with the lm() function is not very useful.
If you want a regression-type analysis, there are packages in R like the plm package, which can deal with panel data, and you clearly have panel data (time points, trial labels of interest, and repeated time points for these labels). Look at this post for information about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm writing you this because your Percent.Active variable is only a binary outcome of 0/1, and I'm not sure if this is on purpose. However, even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.
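A rough sketch of what a plm call could look like here; the index columns and the model type are assumptions, and xyz is the same placeholder as in your lm call:
library(plm)
q.plm0 <- plm(Percent.Active ~ Time.point + xyz,
              data  = data.percentage,
              index = c("Sponsor.Type", "Time.point"),  # panel unit and time identifiers (assumed names)
              model = "within")                         # fixed-effects ("within") estimator
summary(q.plm0)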

Did I just do an ANCOVA or MANOVA?

I’m trying to do an ANCOVA here ...
I want to analyze the effect of EROSION FORCE and ZONATION on all the species (listed with small letters) in each POOL.STEP (ranging from 1-12/1-4), while controlling for the effect of FISH.
I’m not sure if I’m doing it right. What is the command for ANCOVA?
So far I used lm(EROSIONFORCE~ZONATION+FISH,data=d), which yields:
So what I see here is that both the erosion force percentage (the intercept?) and sublittoral zonation are significant in some way, but I'm still not sure whether I've done an ANCOVA correctly here or whether this is just an ANOVA.
In general, ANCOVA (analysis of covariance) is simply a special case of the general linear model with one categorical predictor (factor) and one continuous predictor (the "covariate"), so lm() is the right function to use.
However ... the bottom line is that you have a moderately challenging statistical problem here, and I would strongly recommend that you try to get local help (if you're working within a research group, can you consult with others in your group about appropriate methods?). I would suggest following up either on CrossValidated or r-sig-ecology@r-project.org.
by putting EROSIONFORCE on the left side of the formula, you're specifying that you want to use EROSIONFORCE as a response (dependent) variable, i.e. your model is estimating how erosion force varies across zones and for different fish numbers - nothing about species response
if you want to analyze the response of a single species to erosion and zone, controlling for fish numbers, you need something like
lm(`Acmaeidae s...` ~ EROSIONFORCE + ZONATION + FISH, data = your_data)
the lm() suggestion above would handle each species independently, i.e. you'd have to do a separate analysis for each species. If you also want to do it separately for each POOL.STEP, you're going to have to do a lot of separate analyses. There are various ways of automating this in R; the most idiomatic is probably to melt your data (see reshape2::melt or tidyr::gather) into long format and then use lmList from lme4 (a sketch follows after this list).
since you have count data with low means, i.e. lots of zeros (and a few big values), you should probably consider a Poisson or negative binomial model, and possibly even a zero-inflated/hurdle model (i.e. analyze presence-absence and size of positive responses separately)
if you really want to analyze the joint distribution of all species (i.e. a multivariate response, which is the M in MANOVA), you're going to have to work quite a bit harder ... there are a variety of joint species distribution models by people like Pierre Legendre, David Warton and others ... I'd suggest you start with the mvabund package, but you might need to do some reading first
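A sketch of the melt-then-lmList idea mentioned above; the data frame d and the species column layout are placeholders:
library(reshape2)   # for melt()
library(lme4)       # for lmList()
# reshape: one row per (observation, species) pair
long <- melt(d,
             id.vars       = c("EROSIONFORCE", "ZONATION", "FISH", "POOL.STEP"),
             variable.name = "species",
             value.name    = "count")
# one linear model per species, fitted in a single call
fits <- lmList(count ~ EROSIONFORCE + ZONATION + FISH | species, data = long)
summary(fits)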

R - Linear Regression - Control for a variable

I have a computer science background and I am trying to teach myself data science by solving problems available on the internet.
I have a smallish data set which has 3 variables - race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race & gender.
I have divided the data into two parts, one for each gender, and now I am trying to create two regression models. Is this possible in R? Can someone provide example syntax?
You don't specify how your data are stored or how the variable race is recorded (is it a factor?)
[If you're just fitting income against race for males, say, and you had the male income and race in income.m and race.m and if the second was a factor in R, then lm(income.m~race.m) will fit the line for males (use summary on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.
This has several advantages over analyzing the lines separately, though that can also be done.
If gender is either a factor or a numeric variable recorded as (0/1), and race is a factor and you have the data in a data frame (called, for example, incdata), then you'd fit both lines at once with:
lm(income~race*gender, data=incdata)
which is R shorthand for
lm(income~race+gender+race:gender, data=incdata)
where race:gender is an interaction term.
If you further assume that the effect of race is the same for both sexes, then the smaller model:
lm(income~race+gender, data=incdata)
would be used instead. This would often be the model people would fit if asked to 'control for gender', though many would consider the interaction model I mentioned before instead.
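A small sketch of fitting both versions and comparing them; incdata and the column names are the same assumed ones as above:
fit.main <- lm(income ~ race + gender, data = incdata)   # the 'control for gender' model
fit.int  <- lm(income ~ race * gender, data = incdata)   # adds the race:gender interaction
summary(fit.main)
anova(fit.main, fit.int)   # F-test of whether the interaction improves the fit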
I'd strongly advise working on more simple regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.
If you haven't already fitted a regression in R, I'd start with a smaller data set that only has a single predictor just to get used to the basic mechanics.
R comes with many data sets already built in. See, for example, library(help=datasets) which has about 80 data sets; some of the packages that come with R have more (MASS has over 80, for example). Many R packages on CRAN are packed with data sets, many suitable for regression.
For example, the cars data set (see ?cars in R) records the stopping distance of cars, given their speed. You don't need to read the data in, it's already there.
A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:
lm(dist~speed, cars)
Again, you use summary to examine it. e.g. (I suggest you type these one at a time):
carsfit <- lm(dist~speed, cars)   # fit the simple linear regression
summary(carsfit)                  # coefficients, R-squared, etc.
plot(dist~speed, cars)            # scatter plot of the data
abline(carsfit, col=2)            # add the fitted line in red
The examples in the help on the cars data set (?cars) gives several other models and plots. You might try those one at a time also.
The car package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.
It is very simple.
fit1 <- lm(income~gender+race,data=Dataframe1)
summary(fit1)
I would not recommend using two dataframes unless you are using more advanced statistical methods that require them. Just use your gender variable as a predictor.
Also, check this site out: http://www.statmethods.net/stats/regression.html
You could indeed do so, Abhi, but I believe your question is very broad.
(1) You could predict income from race and gender. This can be done in various ways, but the most common would perhaps be "regression analysis". I suggest you do some searches on the internet on that topic. Which kind of regression to use, and how to perform it, depends on the situation. You would probably find out yourself after reading about regression.
(2) R can do that. But I suggest you do some reading about regression before you get into R.
(3) If I were to analyze if race and gender can predict income I would simply do a linear regression where income would be the dependent variable and race and sex would be independent (predictors). This can be done by the "lm" function in R.
Or did I misunderstand something here?
Regards
You need to do some reading on linear/multiple regression techniques. I am not sure why you divide the data into two groups based on gender. Randomly split the data into a training and a test set, so that you can fit the model on the training set and validate it on the test set.
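A minimal sketch of such a random split; the data frame name dat and the 70/30 ratio are assumptions:
set.seed(1)                                           # make the split reproducible
train.idx <- sample(nrow(dat), size = 0.7 * nrow(dat))
train <- dat[train.idx, ]                             # fit the model on this part
test  <- dat[-train.idx, ]                            # validate predictions on this part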

Resources