I have plotted the conditional density distribution of my variables using cdplot (R). My independent and dependent variables are not independent of each other. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate to run this kind of analysis.
Also, I'd like to know how to report these results in an elegant way that makes academic and statistical sense.
This is a run using the rms package's `lrm` function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms) # also loads Hmisc
# first get data in the form you described
dat[] <- lapply(dat, ordered) # makes both columns ordered factor variables
?lrm
#read help page ... Also look at the supporting book and citations on that page
lrm( y ~ x, data=dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
                      Model Likelihood      Discrimination         Rank Discrim.
                         Ratio Test             Indexes               Indexes
Obs               22   LR chi2      51.66   R2           0.920     C       0.869
max |deriv|   0.0004   d.f.            10   g           20.742     Dxy     0.738
                       Pr(> chi2) <0.0001   gr  1019053402.761     gamma   0.916
                                            gp           0.500     tau-a   0.658
                                            Brier        0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
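One detail in the frequency table above is worth checking: 3 and 3.00 show up as separate response levels, because the columns were read from text as factors. The following is a minimal sketch, assuming those two values are meant to be the same, that reads the columns as numbers first and only then converts them to ordered factors. The four-row text sample and the names txt, as_factor, as_numeric and y_ord are made up for illustration.

```r
# Small made-up sample with the same "3" vs "3.00" spelling issue
txt <- "y x
3.00 0.0
2.75 0.1
3 0.3
0 1.0"

# Reading as factor keeps the text spellings, so "3" and "3.00" are distinct levels
as_factor <- read.table(text = txt, header = TRUE, colClasses = "factor")
n_levels_factor <- nlevels(as_factor$y)

# Reading as numeric first merges them; then convert to an ordered factor
as_numeric <- read.table(text = txt, header = TRUE)
y_ord <- ordered(as_numeric$y)
n_levels_numeric <- nlevels(y_ord)

print(c(factor = n_levels_factor, numeric_then_ordered = n_levels_numeric))
```

If the distinction between 3 and 3.00 is intentional, keep the original reading step instead.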
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
bestglm(Xy = winedata,
family = binomial, # binomial family for logistic
IC = "AIC", # Information criteria
method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error -
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual says too. If that's the case, then why can't I find the test error on it using 10-fold CV with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.
The data was processed as -
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
bestglm rearranges your data and renames the response variable to y. The fitted model therefore refers to a column y, but winedata has no such column, so when you pass the model back into cv.glm together with winedata, things break from there.
It's always good to check the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You could substitute things in the call, but that gets messy. Fitting is not costly, so refit the selected model on winedata and pass that fit to cv.glm:
best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")
best.cv.err<- cv.glm(winedata,fit,cost1, K=10)
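The refit-then-cross-validate pattern can be tried on simulated data. This is only a sketch: the data frame toy and its columns x1 and x2 are invented for illustration, and it assumes the boot package is installed. The point it shows is that the glm object and the data frame passed to cv.glm use the same column names.

```r
library(boot)  # provides cv.glm

set.seed(1)
n <- 200
# Hypothetical toy data: one informative predictor, one noise predictor
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$good <- rbinom(n, 1, plogis(1.5 * toy$x1))

# Same misclassification cost as in the question
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)

# The formula refers to columns that exist in toy, so cv.glm can refit it per fold
fit <- glm(good ~ x1, data = toy, family = binomial)
cv <- cv.glm(toy, fit, cost1, K = 10)

# delta[1] is the raw 10-fold estimate, delta[2] the bias-adjusted one
print(cv$delta)
```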
I have questions about multivariable Cox regression analysis that includes non-binary categorical variables.
My data consists of several variables. Some of them are binary (like sex, age over 70, etc.),
whereas the rest are not (for example, ECOG).
I tried both the analyse_multivariate function and the coxph function, but it seems that I can only get overall hazard ratios regarding non-categorical variables. I'd like to know both the overall hazard ratio for the variable and the individual hazard ratios for its subcategories (like hazard ratios for ECOG 0, ECOG 1, ECOG 2, and for ECOG overall).
What I tried in the process is like this:
(1)
ECOG = as.factor(df$ECOG)
analyse_multivariate(data=df,
time_status = vars(df$OS, df$survival_status==1),
covariates = vars(df$age70, df$sex, ECOG),
reference_level_dict = c(ECOG==0))
and the result is like this:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 1.07 0.82 1.41 0.93 0.71 1.22
ECOG:4 ECOG 4 1.13 0.16 8.19 0.89 0.12 6.43
df$sex df$sex <continuous> 1.87 0.96 3.66 0.53 0.27 1.04
ECOG:1 ECOG 1 2.14 1.63 2.81 0.47 0.36 0.61
ECOG:3 ECOG 3 12.12 7.83 18.76 0.08 0.05 0.13
ECOG:2 ECOG 2 13.72 4.92 38.26 0.07 0.03 0.2
(2)
analyse_multivariate(data=df,
time_status = vars(df$OS, df$survival_status==1),
covariates = vars(df$age70, df$sex, df$ECOG),
reference_level_dict = c(ECOG==0))
and the result is:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 0.89 0.68 1.16 1.13 0.86 1.47
df$sex df$sex <continuous> 1.87 0.96 3.65 0.53 0.27 1.04
df$ECOG df$ECOG <continuous> 1.9 1.69 2.15 0.53 0.47 0.59
Does it make sense to take the overall p-value for ECOG from (2), consider ECOG a significant variable if that p-value is <0.05, and combine it with the individual hazard ratios for each ECOG status from (1)?
For example, to generate a table like the following:
p-value 0.01
ECOG 1 Reference
ECOG 2 13.72 (4.92-38.26)
ECOG 3 12.12 (7.83-18.76)
ECOG 4 1.13 (0.16-8.19)
I believe there are better solutions but couldn't find one.
Any comments would be appreciated!
Thank you in advance.
The short answer is no. In (2), ECOG is treated as a continuous covariate, meaning you expect the log hazard to have a linear relationship with ECOG, whereas in (1) every level (1 to 4) is allowed to have a different effect on survival. To test the variable ECOG collectively, you can do an anova:
library(survivalAnalysis)
library(magrittr) # provides %>% in case it is not already attached
data = survival::lung
data$ECOG = factor(data$ph.ecog)
data$sex = factor(data$sex)
fit1 = data %>%
analyse_multivariate(vars(time, status),
covariates = vars(age, sex, ECOG, wt.loss))
anova(fit1$coxph)
Analysis of Deviance Table
Cox model: response is Surv(time, status)
Terms added sequentially (first to last)
loglik Chisq Df Pr(>|Chi|)
NULL -675.02
age -672.36 5.3325 1 0.020931 *
sex -667.82 9.0851 1 0.002577 **
ECOG -660.26 15.1127 3 0.001723 **
wt.loss -659.31 1.9036 1 0.167680
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
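The same idea can be sketched with the survival package alone. This is an illustration under assumptions, not the method used above: it compares a model with and without ECOG by a likelihood ratio test, which gives one overall p-value for ECOG, and takes the per-level hazard ratios from the full model. The names d, fit_full, fit_reduced, lrt and hr are chosen here for the example.

```r
library(survival)

# Built-in lung data; drop rows with missing ECOG so both models use the same rows
d <- subset(survival::lung, !is.na(ph.ecog))
d$ECOG <- factor(d$ph.ecog)

fit_full    <- coxph(Surv(time, status) ~ age + sex + ECOG, data = d)
fit_reduced <- coxph(Surv(time, status) ~ age + sex, data = d)

# Overall test for ECOG: likelihood ratio test between the nested models
lrt <- anova(fit_reduced, fit_full)
print(lrt)

# Per-level hazard ratios relative to the reference level, with 95% confidence intervals
hr <- cbind(HR = exp(coef(fit_full)), exp(confint(fit_full)))
print(round(hr, 2))
```

A table can then report the single overall p-value next to the ECOG row and the level-specific hazard ratios underneath, all from the same model.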
I am implementing a multinomial logit model using the mlogit package in R. The data include three different "choices" and three variables (A, B, C) that contain information for the independent variable. I have transformed the data into a wide format using the mlogit.data function, which makes it look like this:
Observation Choice VariableA VariableB VariableC
1 1 1.27 0.2 0.81
1 0 1.27 0.2 0.81
1 -1 1.27 0.2 0.81
2 1 0.20 0.45 0.70
2 0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
The thing is that I want the independent variable to be choice-specific, so it should be constructed like Variable D below:
Observation Choice VariableA VariableB VariableC VariableD
1 1 1.27 0.2 0.81 1.27
1 0 1.27 0.2 0.81 0.2
1 -1 1.27 0.2 0.81 0.81
2 1 0.20 0.45 0.70 0.20
2 0 0.20 0.45 0.70 0.45
2 -1 0.20 0.45 0.70 0.70
Variable D was constructed using the following code:
choice_map <- data.frame(choice = c(1, 0, -1), var = grep('Variable[A-C]', names(df)))
df$VariableD <- df[cbind(seq_len(nrow(df)), with(choice_map, var[match(df$Choice, choice)]))]
However, when I try to run the multinomial logit model,
mlog <- mlogit(Choice ~ 1 | VariableD, data=df, reflevel = "0")
the error message "row names supplied are of the wrong length" is returned. When I use any of the other variables A-C separately the regression is run without any problems, so my questions are therefore: why can't Variable D be used and how can this problem be solved?
Thanks!
I got this error when I entered my original dataframe into the model, and not the wide dataframe created by mlogit.data.
So make sure to create your "wide" dataframe first and enter this into your mlogit function.
(source: Andy Field, Discovering statistics using R, page 348)
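For reference, the Variable D construction from the question can be reproduced in base R on the six sample rows, which helps to check the column before it goes into mlogit. This is a sketch on a hand-typed copy of the sample data; col_idx is a helper name introduced here, and the data frame is converted with as.matrix before indexing, which assumes all columns are numeric.

```r
# Hand-typed copy of the sample rows from the question
df <- data.frame(
  Observation = rep(1:2, each = 3),
  Choice      = rep(c(1, 0, -1), times = 2),
  VariableA   = rep(c(1.27, 0.20), each = 3),
  VariableB   = rep(c(0.20, 0.45), each = 3),
  VariableC   = rep(c(0.81, 0.70), each = 3)
)

# Map each choice value to the column holding its choice-specific value
choice_map <- data.frame(choice = c(1, 0, -1),
                         var = grep("Variable[A-C]", names(df)))
col_idx <- choice_map$var[match(df$Choice, choice_map$choice)]

# Pick one cell per row via (row, column) matrix indexing
df$VariableD <- as.matrix(df)[cbind(seq_len(nrow(df)), col_idx)]
print(df)
```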
I want to integrate a one-dimensional vector in R. How should I do that?
Let's say I have:
d=hist(p, breaks=100, plot=FALSE)$density
where p is a sample like:
p=rnorm(1e5)
How can I calculate an integral over d?
If we assume that the values in d correspond to the y values of a function, then we can calculate the integral by using a discrete approximation. We can, for example, use the trapezium rule or Simpson's rule for this purpose. We then also need to supply the stepsize that corresponds to the discrete interval on the x-axis in order to "approximate the area under the curve".
Discrete integration functions defined below:
p=rnorm(1e5)
d=hist(p,breaks=100,plot=FALSE)$density
discreteIntegrationTrapeziumRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
stepsize*(0.5*integrand[1]+sum(integrand[2:(l-1)])+0.5*integrand[l]) # last term must come from integrand, not v, when lower > 1
}
discreteIntegrationSimpsonRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
a = seq(from=2,to=l-1,by=2);
b = seq(from=3,to=l-1,by=2)
(stepsize/3)*(integrand[1]+4*sum(integrand[a])+2*sum(integrand[b])+integrand[l])
}
As an example, let's approximate the complete area under the curve while assuming discrete x steps of size 1 and then do the same for the second half of d while we assume x-steps of size 0.2.
> plot(1:length(d),d) # stepsize one on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultTrapeziumRule
[1] 9.9999
> resultSimpsonRule
[1] 10.00247
> plot(seq(from=-10,to=(-10+(length(d)*0.2)-0.2),by=0.2),d) # stepsize 0.2 on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultTrapeziumRule
[1] 1.15478
> resultSimpsonRule
[1] 1.11678
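In general, Simpson's rule offers better approximations of the integral. Note that the composite Simpson's rule assumes an odd number of points (an even number of intervals). The more y-values you have (and the smaller the x-axis stepsize), the better your approximations will become.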
Small EDIT for clarity:
In this particular case the stepsize should obviously be 0.1. The complete area under the density curve is then (approximately) equal to 1, as expected.
> d=hist(p,breaks=100,plot=FALSE)$density
> hist(p,breaks=100,plot=FALSE)$mids # stepsize = 0.1
[1] -4.75 -4.65 -4.55 -4.45 -4.35 -4.25 -4.15 -4.05 -3.95 -3.85 -3.75 -3.65 -3.55 -3.45 -3.35 -3.25 -3.15 -3.05 -2.95 -2.85 -2.75 -2.65 -2.55
[24] -2.45 -2.35 -2.25 -2.15 -2.05 -1.95 -1.85 -1.75 -1.65 -1.55 -1.45 -1.35 -1.25 -1.15 -1.05 -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25
[47] -0.15 -0.05 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65 1.75 1.85 1.95 2.05
[70] 2.15 2.25 2.35 2.45 2.55 2.65 2.75 2.85 2.95 3.05 3.15 3.25 3.35 3.45 3.55 3.65 3.75 3.85 3.95 4.05 4.15
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,stepsize=0.1)
> resultTrapeziumRule
[1] 0.999985
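For a histogram specifically, the bin widths are already stored in the object, so the stepsize does not have to be guessed. The sketch below, using only base R, sums density times bin width and compares that with a trapezium rule over the bin midpoints. The names h, area_bins and area_trap are chosen here for the example, and the seed is arbitrary.

```r
set.seed(1)
p <- rnorm(1e5)
h <- hist(p, breaks = 100, plot = FALSE)

# Area of the histogram bars: density times bin width, summed over bins
area_bins <- sum(h$density * diff(h$breaks))

# Trapezium rule over the bin midpoints, using the actual spacing of the mids
x <- h$mids
y <- h$density
area_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

print(c(bins = area_bins, trapezium = area_trap))
```

Both values should be close to 1, since the densities of a histogram are scaled so that the bars cover unit area.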
I am doing a factor analysis via the psych package, which generates quite a lot of output: fa8<-fa(corMat, nfactors=8, ...)
The output includes some matrices and some text information. However, I have not found a good way of saving the matrices from the output to a file. So far, I have been able to dump the complete output via sink("foo.txt"); fa8; sink(). Neither write(fa8) nor write.csv(fa8) works, because the class of the output is a vector; it does not seem to contain the matrix data itself.
Any suggestions on how I can get the fa matrix itself for further analysis and save it to a file?
update #1:
An example output of fa(corMat, nfactors=2, ...) would be
Factor Analysis using method = pa
Call: fa(r = corMat, nfactors = 2, rotate = "oblimin", fm = "pa")
Standardized loadings based upon correlation matrix
PA1 PA2 h2 u2
BIO 0.86 0.02 0.75 0.255
GEO 0.78 0.05 0.63 0.369
CHEM 0.87 -0.05 0.75 0.253
ALG -0.04 0.81 0.65 0.354
CALC 0.01 0.96 0.92 0.081
STAT 0.13 0.50 0.29 0.709
PA1 PA2
SS loadings 2.14 1.84
Proportion Var 0.36 0.31
Cumulative Var 0.36 0.66
With factor correlations of
PA1 PA2
PA1 1.00 0.21
PA2 0.21 1.00
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 15 and the objective function was 2.87
The degrees of freedom for the model are 4 and the objective function was 0.01
The root mean square of the residuals is 0.01
The df corrected root mean square of the residuals is 0.02
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2
Correlation of scores with factors 0.94 0.96
Multiple R square of scores with factors 0.88 0.93
Minimum correlation of possible factor scores 0.77 0.86
Source: http://rtutorialseries.blogspot.de/2011/10/r-tutorial-series-exploratory-factor.html
The question is: How do I get the standardized loadings matrix in the output for further analysis?
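One possible approach, sketched here with base R's factanal as a stand-in because it needs no extra packages: the fitted object is a list, and its loadings element can be turned into a plain matrix with unclass and then written to a file. The object returned by psych's fa also has a loadings element, so the same pattern (for example unclass(fa8$loadings)) should carry over, though that is not run here. The simulated data and the names sim, fit, L and out are assumptions made for the example.

```r
set.seed(42)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)

# Simulated scores with a two-factor structure (illustrative data only)
sim <- data.frame(
  BIO  = 0.8 * f1 + rnorm(n, sd = 0.60),
  GEO  = 0.7 * f1 + rnorm(n, sd = 0.70),
  CHEM = 0.8 * f1 + rnorm(n, sd = 0.60),
  ALG  = 0.8 * f2 + rnorm(n, sd = 0.60),
  CALC = 0.9 * f2 + rnorm(n, sd = 0.45),
  STAT = 0.5 * f2 + rnorm(n, sd = 0.85)
)

fit <- factanal(sim, factors = 2)

# The loadings element has class "loadings"; unclass() gives a plain numeric matrix
L <- unclass(fit$loadings)

# Save to CSV and read back to confirm the round trip
out <- tempfile(fileext = ".csv")
write.csv(L, out)
back <- as.matrix(read.csv(out, row.names = 1))
print(round(L, 2))
```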