I would like to use the delete-d cross-validation technique available in the R package bestglm. I have a binomial response variable (species presence/absence) and 11 predictor variables that are either continuous or ordinal (I treat the ordinal ones as continuous in the analysis). I have about 7000 data points, depending on the species. I would like to allow interactions between one variable and the other ten, and I would also like to include quadratic terms.
Is this possible? From what I gather from the R help and the vignette for this package, it is not, but maybe I am missing something.
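One possible workaround, sketched below and untested: bestglm() does not parse interaction formulas, but it accepts an Xy data frame whose columns are arbitrary derived predictors, so interaction and quadratic columns can be precomputed with model.matrix(). The variable names (y, x1..x11) and the CVArgs settings are assumptions; check ?bestglm for the exact delete-d arguments. Note also that with roughly 23 derived columns, bestglm's exhaustive subset search may become computationally infeasible, which may be the real obstacle here.

library(bestglm)

# Hypothetical data: y is the 0/1 response, x1 is the variable allowed to
# interact with the other ten (x2..x11). bestglm() wants an "Xy" data frame
# with the response in the last column, so build the derived columns first
# and drop the intercept column from the design matrix.
X  <- model.matrix(~ x1 * (x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11)
                   + I(x1^2) + I(x2^2),   # add quadratic terms as needed
                   data = mydata)[, -1]
Xy <- data.frame(X, y = mydata$y)

# Delete-d cross-validation (Method = "d"); K and REP are placeholders, and
# the exact argument names should be verified against ?bestglm.
fit <- bestglm(Xy, family = binomial, IC = "CV",
               CVArgs = list(Method = "d", K = 10, REP = 100))
fit$BestModel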
This is for RStudio: survival analysis with the "survival" package's data, running a Cox PH analysis.
I have a basic survival dataset (nafld1) that includes age and bmi. From the existing bmi variable I created a new categorical variable with levels A, B, or C based on the quantiles of BMI. When I include both in the coxph formula and get results, is that okay, or are they somehow getting "double counted" in the regression? The formula looks something like
~ sex + age + bmi + bmi_g1 + bmi_g2 + bmi_g3
Conceptually it's bothering me because the groups are based on bmi and are not technically a completely separate covariate (the way, say, treatment A or B would be), but rather a recoding of the same underlying data.
Any help would be appreciated, and please let me know if you need further detail.
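Roughly, here is a reproducible sketch of what I mean (the nafld1 column names futime, status, and male are from memory and worth double-checking with names(nafld1)):

library(survival)

d <- survival::nafld1
# Tertile groups based on the quantiles of the continuous bmi
d$bmi_g <- cut(d$bmi,
               breaks = quantile(d$bmi, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
               labels = c("A", "B", "C"), include.lowest = TRUE)

# This is the kind of model I mean: bmi together with the groups
# derived from it. Is this double counting?
fit <- coxph(Surv(futime, status) ~ male + age + bmi + bmi_g, data = d)
summary(fit)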
I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears, and 4: 90+ days in arrears (4 levels).
As independent variable I have several numeric variables: Loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummy variables, but those examples were all for independent variables.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not accept a categorical variable for the dependent part; it has to be continuous. And considering that your AccountStatus variable has only four levels, it is not feasible to treat it as continuous. Before starting any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance (a short sketch follows below). Alternatively, you can recode AccountStatus as dichotomous and use ordinary logistic regression.
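In R that could look something like this (DebtToIncome and InterestRate are assumed names for your other predictor columns):

library(nnet)

# Multinomial logistic regression: 4 unordered outcome levels
fit_multi <- multinom(factor(AccountStatus) ~ OriginalLoanToValue +
                        DebtToIncome + InterestRate, data = mydata)
summary(fit_multi)

# Or collapse to "in arrears yes/no" and use ordinary logistic regression
mydata$InArrears <- as.integer(mydata$AccountStatus != 1)
fit_bin <- glm(InArrears ~ OriginalLoanToValue + DebtToIncome + InterestRate,
               family = binomial, data = mydata)
summary(fit_bin)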
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R as such. If you want to learn more about which statistical technique is appropriate for which combination of measurement levels of dependent and independent variables, I can wholeheartedly recommend this book.
Expanding a little bit on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, the measurement level is ordinal (i.e. ordered, but with unknown/undefined distances between the levels), so you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it first (e.g. in the Wikipedia article linked above, the vignettes of the ordinal package, the UCLA stats consulting page on ordinal regression, or related questions on CrossValidated).
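A minimal sketch of the polr() route, again with DebtToIncome and InterestRate as assumed column names; polr() requires the response to be an ordered factor:

library(MASS)

# Proportional-odds ordinal regression on the ordered outcome
mydata$AccountStatus <- factor(mydata$AccountStatus, levels = 1:4, ordered = TRUE)
fit_ord <- polr(AccountStatus ~ OriginalLoanToValue + DebtToIncome + InterestRate,
                data = mydata, Hess = TRUE)  # Hess = TRUE for standard errors
summary(fit_ord)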
If you can give the levels a numeric value, then you might have a solution. You have to recode the levels to numbers and then convert the variable into a numeric one. Here is how:
library(plyr)
# Map the factor levels to numeric codes (kept as strings by revalue) ...
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
  c("(1) Very Suitable" = "3", "(2) Suitable" = "2",
    "(3) Somewhat Suitable" = "1", "(4) Not Suitable At All" = "-1"))
# ... then convert factor -> character -> numeric to get actual numbers
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This recodes the levels while converting the variable to a numeric one. The results I get are consistent with the original values contained in the dataset when the variables are factors. You can use this solution to map the levels to whatever values you like while converting them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or run regressions, which you cannot do directly with factor variables.
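For instance, with the converted column from above ("age" here is a hypothetical covariate):

hist(my.data2$islamic_leviathan_score_1)              # works now that it is numeric
lm(islamic_leviathan_score_1 ~ age, data = my.data2)  # 'age' is hypothetical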
Hope this helps!
I have a question regarding regression analysis in R; it would be nice if you could help. My dependent variable is a count: the number of citations received by three groups of papers (female-authored, male-authored, and mixed female-male-authored papers). I also have several covariates, which can be categorical or continuous, and the three-level authorship group is my factor of interest.
Given the kind of data I have (overdispersed, with an excess of zeros), a hurdle regression model seems to be the best choice. I can run the model in R using the three libraries MASS, pscl, and AER. The only problem I have is how to get the coefficients for all three groups of papers, so that I will be able to compare them.
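One way to do this with pscl::hurdle(), sketched below; "citations", "group", "covar", and "papers" are assumed names. With a factor in the formula you get coefficients for two groups relative to the reference level, and you can relevel to change the baseline:

library(pscl)

# Hurdle model: negative binomial count part for the overdispersion,
# binomial hurdle part for the excess zeros
fit <- hurdle(citations ~ group + covar, data = papers, dist = "negbin")
summary(fit)   # two group coefficients, relative to the reference level

# To compare against a different baseline, relevel the factor and refit
papers$group <- relevel(papers$group, ref = "male")
fit2 <- hurdle(citations ~ group + covar, data = papers, dist = "negbin")
summary(fit2)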
I have a dataset with a binary dependent variable and a number of predictors, including participant. I am trying to examine the idiosyncratic effects of different predictors for different participants. In order to do that, I'm trying to look at the effect of interactions between participant id and the other predictors on the dependent variable. I'm using randomForest in R. I can fit the forest successfully, and can produce partial dependence plots for individual variables. What I need, however, are partial dependence plots for pairs of variables - participant + others. Is this possible?
For reference, my code:
data_sample <- data_raw[sample(1:nrow(data_raw), 500, replace = FALSE), ]
test_rf <- randomForest(perceptually.rhotic ~ vowel + speaker + modified_clip_start +
                          function_word + year_of_birth + gender +
                          fathers_job_type + prepausal,
                        data = data_sample, ntree = 500, mtry = 3)
partialPlot(test_rf, pred.data = data_sample, x.var = "speaker")
# what I'd like, but partialPlot() accepts only a single x.var:
# partialPlot(test_rf, pred.data = data_sample, x.var = c("speaker", "vowel"))
Thanks very much in advance for any advice anyone can offer!
The plotmo R package will plot partial dependencies for all variables and pairs of variables (bivariate dependencies) for "any" model. For example:
library(randomForest)
data(trees)
mod <- randomForest(Volume~., data=trees)
library(plotmo)
plotmo(mod, pmethod="partdep") # plot partial dependencies
which gives a grid of partial dependence plots: one per predictor (Girth, Height) and a bivariate partial dependence plot for the Girth-Height pair. (Figure not shown here.)
You can specify exactly which variables and variable pairs get plotted using plotmo's all1, all2, degree1, and degree2 arguments. Additional examples are in the vignette for the plotmo package.
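Applied to the model in the question, something like the following should show only the bivariate plots involving speaker (the degree1/degree2 argument forms are as I recall them from the plotmo vignette; check ?plotmo):

library(plotmo)
# degree1 = 0 suppresses the single-variable plots; degree2 takes a
# regex matching variable names, so this keeps only pairs with "speaker"
plotmo(test_rf, pmethod = "partdep", degree1 = 0, degree2 = "speaker")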