Finding how a variable affects the output of a time-series random-forest regression model - r

I created a random-forest regression model for time-series data in R that has three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance, but I am not trying to find the variables that have the biggest effect. Instead, I want to see how the prediction output would change if I pick an input variable X_1 and increase (or decrease) its value.
Does it even make sense to do this, and is it even possible with a random-forest model? Rereading my question a few times made me dubious, but any insight or recommendation would be greatly appreciated.

I would guess that what this question is actually about is called exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots of the variables against each other to get a feeling for the relationships. Depending on the variables, [linear] regression could tell how an increase in variable x1 would affect variable x2.
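The three steps suggested above (correlations, scatter plots, a linear regression) could look roughly like the following. This is only a sketch: the built-in mtcars data stands in for the asker's time series, and the column names mpg, wt, hp and disp are placeholders for one output and three predictors.

```r
# Stand-in data: one output (mpg) and three predictors (wt, hp, disp).
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

# 1. Pairwise correlations: strength of the linear relationships.
cor_mat <- cor(dat)
print(round(cor_mat, 2))

# 2. Scatter-plot matrix: a visual check of each pairwise relationship.
pairs(dat)

# 3. Linear regression: each slope estimates the change in the output
#    for a one-unit increase in that predictor, holding the others fixed.
fit <- lm(mpg ~ wt + hp + disp, data = dat)
print(coef(fit))
```

The slopes are only meaningful to the extent that the relationships are roughly linear, which is what the scatter plots are meant to check.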

Related

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset where at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
Carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out a residual analysis (normal Q-Q plot, Tukey-Anscombe plot, leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each variable is. Of course, you might have e.g. bimodally distributed data if there are differences between groups. See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
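The residual checks listed above can be sketched in base R as follows. The built-in mtcars data and the model mpg ~ wt + hp are stand-ins, not the asker's data, and the 4/n cut-off for Cook's distance is a common rule of thumb rather than something stated in this answer.

```r
# Stand-in linear model on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Normal Q-Q plot of the residuals.
qqnorm(res)
qqline(res)

# Tukey-Anscombe plot: residuals against fitted values.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Formal normality test on the residuals (not on the raw variables).
print(shapiro.test(res))

# Flag potentially influential observations for closer study.
# The 4 / n threshold is an assumed rule of thumb.
cooks <- cooks.distance(fit)
flagged <- which(cooks > 4 / nrow(mtcars))
print(flagged)
```

Flagged observations are candidates to study, not to delete automatically; any removal still has to be reported and justified.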
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups and inspect it for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all the data, their distributions and pairwise correlations (as long as you don't have more than around 20 variables) is to use pairs.panels from the psych library.
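# Read the posted data; header = FALSE because the file has no header row
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Histograms, scatter plots and pairwise correlations, in two batches of columns
pairs.panels(dat[, 1:12])
pairs.panels(dat[, 13:23])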
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

Meaning of ICC in regression

I'm stuck on this question and cannot find a logical explanation.
I'm given the following regression output -
The question is this: a one-way analysis of variance model was mistakenly fitted to explain the variable “level of violence” using the random factor “grade”. The grade factor in this study is a fixed factor. The partial results in the output are based on a balanced experiment.
Does it make sense in this case to calculate the ICC? Is it at all possible to calculate it manually from this output data only?
I know that the ICC describes the relationship between the observations within the groups. So I thought maybe to describe the connection within the classes, and between the different classes. But how can the ICC be reached by manual calculation from the data in the output?
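As a rough illustration of the manual calculation being asked about: in a balanced one-way layout, the ICC is usually computed from the between-group and within-group mean squares of the ANOVA table. The output referred to in the question is not reproduced here, so the sketch below uses simulated data; the names grade and violence, the 5 groups and the 8 observations per group are all made up.

```r
# Simulated balanced data: k groups ("grades") with n observations each.
set.seed(1)
k <- 5
n <- 8
grade <- factor(rep(seq_len(k), each = n))
group_effect <- rnorm(k)
violence <- group_effect[as.integer(grade)] + rnorm(k * n)

# One-way ANOVA table: mean squares between and within groups.
tab <- anova(lm(violence ~ grade))
msb <- tab["grade", "Mean Sq"]
msw <- tab["Residuals", "Mean Sq"]

# Method-of-moments estimate of the between-group variance component,
# then ICC = between-group variance / total variance.
sigma2_between <- (msb - msw) / n
icc <- sigma2_between / (sigma2_between + msw)
print(icc)
```

So the two mean squares and the group size are enough for the arithmetic. Whether the resulting number is meaningful when the factor is not actually random is the separate question raised above.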

Dimension Reduction for Clustering in R (PCA and other methods)

Let me preface this:
I have looked extensively into this matter and I've found several intriguing possibilities to look into (such as this and this). I've also looked into principal component analysis and I've seen some sources that claim it's a poor method for dimension reduction. However, I feel as though it may be a good method, but I am unsure how to implement it. All the sources I've found on this matter give a good explanation, but they rarely provide any advice on how to actually go about applying one of these methods (i.e. how one can actually apply a method in R).
So, my question is: is there a clear-cut way to go about dimension reduction in R? My dataset contains both numeric and categorical variables (with multiple levels) and is quite large (~40k observations, 18 variables (but 37 if I transform categorical variables into dummies)).
A few points:
If we want to use PCA, then I would have to somehow convert my categorical variables into numeric. Would it be okay to simply use a dummy variable approach for this?
For any sort of dimension reduction for unsupervised learning, how do I treat ordinal variables? Does the concept of ordinal variables even make sense in unsupervised learning?
My real issue with PCA is that when I perform it and have my principal components, I have no idea what to actually do with them. To my knowledge, each principal component is a combination of the variables, and as such I'm not really sure how this helps us pick and choose which are the best variables.
I don't think this is an R question. This is more like a statistics question.
PCA doesn't work for categorical variables. PCA relies on decomposing the covariance matrix, which doesn't work for categorical variables.
Ordinal variables make a lot of sense in supervised and unsupervised learning. What exactly are you looking for? You should only apply PCA to ordinal variables if they are not skewed and have many levels.
PCA only gives you a new transformation in terms of principal components, and their eigenvalues. It has nothing to do with dimension reduction. I repeat, it has nothing to do with dimension reduction. You reduce your data set only if you select a subset of the principal components. PCA is useful for regression, data visualisation, exploratory analysis etc.
A common way is to apply optimal scaling to transform your categorical variables for PCA:
Read this:
http://www.sicotests.com/psyarticle.asp?id=159
You may also want to consider correspondence analysis for categorical variables and multiple factor analysis for both categorical and continuous.
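The point above about reducing the data only by selecting a subset of the principal components can be sketched with base R's prcomp. The built-in, all-numeric mtcars data is a stand-in, and the 90% variance cut-off is an arbitrary choice for illustration; categorical columns would first need one of the treatments mentioned above.

```r
# PCA on standardized numeric data (mtcars is a stand-in data set).
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest number of components reaching an (arbitrary) 90% threshold.
n_keep <- which(cumsum(var_explained) >= 0.90)[1]

# The reduced data set: scores on the first n_keep components only.
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

# Loadings show how each original variable contributes to each component.
print(round(pca$rotation[, seq_len(n_keep), drop = FALSE], 2))
```

The columns of reduced could then be passed to a clustering routine in place of the original variables.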

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 predictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.
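The manual AIC comparison described in the question can be automated with base R's step function. Note that this is stepwise selection, not the LASSO mentioned above, and it is subject to the same caveats about exogeneity and spurious in-sample fit. The mtcars data and the response mpg are stand-ins for the asker's data.

```r
# Smallest and largest candidate models (mtcars is a stand-in data set).
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ ., data = mtcars)

# Stepwise search in both directions, adding or dropping one term at a
# time and keeping a change only when it lowers the AIC.
selected <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "both",
  trace = 0  # suppress the step-by-step log
)

print(formula(selected))
print(extractAIC(selected))
```

The selected formula is only a starting point; the theoretical considerations in the answer above still decide which predictors belong in the model.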

How to tell if it's better to standardize your data matrix first when doing principal component analysis in R?

I'm trying to do principal component analysis in R. There are two ways of doing it, I believe.
One is doing principal component analysis right away; the other is standardizing the matrix first using s = scale(m) and then applying principal component analysis.
How do I tell what result is better? What values, in particular, should I look at? I already managed to find the eigenvalues and eigenvectors, the proportion of variance for each eigenvector using both methods.
I noticed that the proportion of variance for the first principal component was larger without standardizing. Is there a meaning to it? Isn't this always the case?
Lastly, if I am supposed to predict a variable, e.g. weight, should I drop that variable from my data matrix when I do a principal component analysis?
Are your variables measured on a common scale? If yes, then don't scale. If no, then it's probably a good idea to scale.
If you are trying to predict the value of another variable, PCA is probably not the correct tool. Maybe you should look at a regression model instead.
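To see what scaling changes in practice, the two approaches from the question can be run side by side. This is a sketch on a stand-in matrix built from four mtcars columns that are measured on very different scales; the helper prop_var is defined here only for illustration.

```r
# Stand-in matrix whose columns are on very different scales.
m <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])

# PCA on the raw matrix versus on the standardized matrix s = scale(m).
pca_raw <- prcomp(m, scale. = FALSE)
s <- scale(m)
pca_std <- prcomp(s)

# Illustrative helper: proportion of variance per principal component.
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)

print(apply(m, 2, var))           # raw variances differ by orders of magnitude
print(round(prop_var(pca_raw), 3))
print(round(prop_var(pca_std), 3))
print(round(pca_raw$rotation[, 1], 3))  # which variables drive the raw PC1
```

Without scaling, the first component tends to be driven by whichever variables have the largest raw variance, so a larger first proportion of variance there reflects the units of measurement rather than a better result.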
