How to retrieve coefficients from partial dependence plot - r

I am using a random forest model. I want to compare its results with OLS, mainly to see whether and how the contributions of individual variables differ between the two. So I used partial dependence plots to see the effect of each variable. However, this still does not give me a clear coefficient.
Is there another way to extract coefficients, or is it possible to extract a coefficient from the PDP?
I tried different ways to find the underlying data of the PDP. I also tried creating simulated data in which only one variable differs, to see how the predictions changed.
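One way to get at the numbers behind a partial dependence plot is to compute the curve by hand: fix the variable of interest at each value on a grid, predict for every row, and average. The sketch below is only an illustration under assumptions: `partial_dependence` is a hypothetical helper, `mtcars` is a stand-in dataset, and an `lm` fit stands in for the forest so that the example runs in base R. Any fitted model whose `predict()` method accepts `newdata` and returns a numeric vector (a regression `randomForest` object should qualify) could be passed instead. Note that a random forest has no coefficients as such; the slope of a straight line through the curve is at best a rough summary of an average effect.

```r
# Manual partial dependence: average prediction over the data while one
# predictor is held fixed at each value of a grid.
# `partial_dependence` is a hypothetical helper, not a package function.
partial_dependence <- function(model, data, var, n_grid = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = n_grid)
  yhat <- vapply(grid, function(v) {
    newdata <- data
    newdata[[var]] <- v                     # overwrite the predictor in every row
    mean(predict(model, newdata = newdata)) # average prediction at this value
  }, numeric(1))
  data.frame(x = grid, yhat = yhat)
}

# Stand-in model: OLS on the built-in mtcars data.
fit_ols <- lm(mpg ~ wt + hp, data = mtcars)
pd_wt <- partial_dependence(fit_ols, mtcars, "wt")

# Slope of a straight line through the curve: a rough, coefficient-like summary.
pd_slope <- unname(coef(lm(yhat ~ x, data = pd_wt))["x"])
pd_slope
```

For the OLS stand-in the curve is exactly linear, so `pd_slope` reproduces the `wt` coefficient. For a forest the curve is generally nonlinear, and the shape of `pd_wt` itself is the more informative result.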

Related

Changing coefficients in logistic regression

I will try to explain my problem as best as I can. I am trying to externally validate a prediction model made by one of my colleagues. For the external validation, I have collected data from a set of new patients.
I want to test the accuracy of the prediction model on this new dataset. Online I have found a way to do so, using the coef.orig function to extract the coefficients of the original prediction model (see the picture I added). Here comes the problem: it is impossible for me to repeat the steps my colleague took to obtain the original prediction model. He used multiple imputation and bootstrapping for the development and internal validation, making it very complex to repeat his steps. What I do have is the computed intercept and coefficients from the original model. Step 1 from the picture I added could therefore be skipped.
My question is: how can I add these coefficients to the regression model without using the 'coef()' function?
Steps to externally validate the prediction model:
The coefficients I need to use:
I thought that the offset function might be of use; however, it does not allow me to set the intercept and all the coefficients for the variables at the same time.
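If only the intercept and coefficients are available, one option is to skip model fitting entirely and compute the linear predictor by hand, then pass it through the inverse logit. The sketch below assumes a logistic model; the variable names (`age`, `marker`), the coefficient values and the simulated data are all hypothetical placeholders, not the values from the question. The two `glm` calls at the end show a common way to check calibration with the coefficients held fixed: `offset()` is applied to the whole linear predictor, so it does not need to set each coefficient separately.

```r
set.seed(1)

# Hypothetical validation data (stand-in for the new patients).
new_patients <- data.frame(age = rnorm(200, 60, 10), marker = rnorm(200))

# Hypothetical intercept and coefficients taken from the original model.
beta <- c("(Intercept)" = -3, age = 0.04, marker = 0.8)

# Linear predictor computed by hand: design matrix times fixed coefficients.
X <- model.matrix(~ age + marker, data = new_patients)
new_patients$lp   <- as.vector(X %*% beta[colnames(X)])
new_patients$pred <- plogis(new_patients$lp)   # predicted probabilities

# Simulated outcome so the sketch runs; use the observed outcome in practice.
new_patients$outcome <- rbinom(200, 1, new_patients$pred)

# Calibration-in-the-large: intercept estimated with the linear predictor fixed.
cal_intercept <- glm(outcome ~ offset(lp), family = binomial, data = new_patients)
# Calibration slope: coefficient of the linear predictor (ideal value is 1).
cal_slope <- glm(outcome ~ lp, family = binomial, data = new_patients)

coef(cal_intercept)
coef(cal_slope)
```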

Finding how a variable affects the output of a time-series random forest regression model

I created a random forest regression model for time-series data in R that has three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance. I am not trying to find the variables that have the biggest effect; instead, I want to see how the prediction output changes if I pick input variable X_1 and increase (or decrease) its value.
Does it even make sense to do this? Is it even possible with a random forest model? Rereading my question a few times made me doubtful, but any insight or recommendation would be greatly appreciated.
I would guess that what this question is actually about is called exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots between the variables to get a feeling for the relationships. Depending on the variables, [linear] regression could tell how an increase in variable x1 would affect variable x2.
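The "what happens if I increase X_1" idea from the question can also be tried directly on a fitted model: shift one predictor, keep the others as observed, and compare predictions. This is a minimal sketch under assumptions: `mtcars` and the `lm` fit are stand-ins so the example runs in base R, and `delta` is a hypothetical step size. A fitted forest with a `predict()` method that accepts `newdata` could be used in place of `fit`.

```r
# Stand-in model on built-in data; replace with the fitted forest and its data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

delta <- 0.5                       # hypothetical change in the predictor
shifted <- mtcars
shifted$wt <- shifted$wt + delta   # increase wt in every row, keep the rest

# Row-by-row change in the prediction caused by the shift.
effect <- predict(fit, newdata = shifted) - predict(fit, newdata = mtcars)
summary(effect)
```

With the linear stand-in every row changes by the same amount (`delta` times the `wt` coefficient). With a random forest the change will typically differ from row to row, and shifting values beyond the range seen in training is not informative, because tree-based predictions do not extrapolate.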

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables is normally distributed. Is it possible to extract a subset of the data in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
Carry out the desired regression analysis (e.g. logistic regression) and, as is always required, a residual analysis (normal Q-Q plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each variable is (you might, for example, have bimodally distributed data if there are differences between groups). See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
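The residual checks mentioned in the first point can be sketched in base R as follows. This is an illustration only: `mtcars` and the linear model are stand-ins for the real data and model, and the `4 / n` threshold for Cook's distance is a common rule of thumb rather than a strict cutoff.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in linear model
res <- residuals(fit)

# Normal Q-Q plot of the residuals.
qqnorm(res)
qqline(res)

# Residuals against fitted values (Tukey-Anscombe plot).
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Flag influential observations for closer inspection (not automatic removal).
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / nrow(mtcars)]
flagged
```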
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start for getting an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is to use pairs.panels from the psych library.
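# Read the data (tab-delimited, no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Scatter plot matrices with histograms and pairwise correlations
pairs.panels(dat[, 1:12])
pairs.panels(dat[, 13:23])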
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

How do I find out which observations of my dataset have been used for my mlm in R (nlme)?

I have longitudinal data and specified 3 multilevel models for different outcomes with nlme in R.
'model <- lme(...)'
They all are based on the same dataset.
Now,
'summary(model)'
shows me that the observations used for my final three models vary.
Probably, this is due to missing data that is different for every outcome (predictors stayed pretty much the same).
Is there a way to see which observations of my dataset were included in each model? Note that lme does not give me an S4 object, but medMer. Therefore,
'model@frame'
unfortunately does not work.
My aim is to give precise sample characteristics for each model. Therefore, I somehow need to address the observations included in each of them.
Thank you for any thoughts on this!
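One model-independent way to recover the rows used in each fit is to rebuild the complete-case index from the variables that appear in the model. The sketch below assumes the default behaviour of dropping incomplete rows (`na.action = na.omit`); the data, the variable names and the helper `used_rows` are hypothetical, and `lm` stands in for `lme` so that the example runs in base R. For an `lme` fit, the formula passed to the helper should also list the grouping variable from the random part. I believe a fitted `lme` object also records the dropped rows in `model$na.action`, but treat that as an assumption to verify.

```r
set.seed(1)
# Hypothetical longitudinal data with different missingness per outcome.
dat <- data.frame(
  id   = rep(1:10, each = 4),
  time = rep(0:3, times = 10),
  y1   = rnorm(40),
  y2   = rnorm(40)
)
dat$y1[c(3, 17)]    <- NA
dat$y2[c(5, 6, 30)] <- NA

# Hypothetical helper: TRUE for rows with no missing value in any model variable.
used_rows <- function(formula, data) {
  complete.cases(data[, all.vars(formula), drop = FALSE])
}

in_m1 <- used_rows(y1 ~ time + id, dat)   # id = grouping variable
in_m2 <- used_rows(y2 ~ time + id, dat)

# Check against a fitted model: the counts should match.
fit1 <- lm(y1 ~ time, data = dat)
sum(in_m1) == nobs(fit1)

# Sample characteristics for the observations in model 1.
summary(dat[in_m1, ])
```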

Individual or grouped data in glm

I'm fitting GLMs with Poisson- and gamma-distributed data in R. In the literature I've seen models fitted to individual data or to aggregate data (grouped by the factors in the model). This website explains that the model coefficients should be the same for grouped and individual data:
However in my case they aren't. They are very similar though. Is one of these ways the correct way and the other wrong? If so, which one?
I found out that I had aggregated the data over more variables than there were in the model. When I fixed this, the estimates were the same, just as described on the aforementioned website.
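The equivalence can be checked on simulated data. This is a minimal sketch for the Poisson case with a log link: the data and variable names are made up, counts are summed within each level of the model factor, and the number of individual records enters the grouped model as an offset.

```r
set.seed(42)
# Simulated individual-level counts with one factor predictor.
ind <- data.frame(group = factor(sample(c("a", "b", "c"), 300, replace = TRUE)))
ind$y <- rpois(300, lambda = c(a = 2, b = 4, c = 8)[as.character(ind$group)])

fit_ind <- glm(y ~ group, family = poisson, data = ind)

# Aggregate over exactly the factors that are in the model.
grp <- aggregate(y ~ group, data = ind, FUN = sum)
grp$n <- as.vector(table(ind$group)[as.character(grp$group)])

# Grouped model: summed counts with log(group size) as offset.
fit_grp <- glm(y ~ group + offset(log(n)), family = poisson, data = grp)

cbind(individual = coef(fit_ind), grouped = coef(fit_grp))
```

Aggregating over additional variables that are not in the model changes the grouped data set and, as described above, can lead to slightly different estimates.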
