Cox Regression Hazard Ratio in Percentiles - r

I computed a Cox proportional hazards regression in R.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I have the hazard ratio (HR, or exp(coef)) for all these covariates, but I'm really only interested in the effect of the continuous predictor X. The HR for X is 1.20. X is scaled to the sample measurements, such that X has a mean of 0 and an SD of 1. That is, an individual whose X is 1 SD above the mean has 1.20 times the hazard of mortality (the event) of someone with an average value of X (I believe).
I would like to be able to say these results in something that's a bit less awkward, and actually this article does exactly what I would like to. It says:
"In a Cox proportional hazards model adjusting for age, sex and
education, a higher level of total daily physical activity was
associated with a decreased risk of death (hazard ratio=0.71;
95%CI:0.63, 0.79). Thus, an individual with high total daily physical
activity (90th percentile) had about ¼ the risk of death as compared
to an individual with low total daily physical activity (10th
percentile)."
Assuming only the HR (i.e. 1.20) is needed, how does one compute this comparison statement? If you need any other information, please ask me for it.

If x1 is your 90th-percentile value of X and x2 is your 10th-percentile value, and p, q, r and s are your Cox regression coefficients (so exp(s) is the HR of 1.20 you mentioned), you need to find exp(p*A + q*B + r*C + s*x1) / exp(p*A + q*B + r*C + s*x2), where A, B, and C can be the average values of those covariates. Because the A, B, and C terms cancel, this reduces to exp(s*(x1 - x2)) = HR^(x1 - x2). This ratio gives you the comparison statement.
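As a minimal sketch in R (assuming X is roughly normal, so its 90th and 10th percentiles after standardization are about +/-1.28; with your real data you would use the sample quantiles of df$X instead):

hr  <- 1.20          # exp(coef) for X from the fitted Cox model
x90 <- qnorm(0.90)   #  1.2816: 90th percentile of a standardized normal X
x10 <- qnorm(0.10)   # -1.2816: 10th percentile
hr^(x90 - x10)       # ~1.60: hazard at the 90th vs. the 10th percentile

# Using the observed distribution instead of the normality assumption:
# hr^(quantile(df$X, 0.90) - quantile(df$X, 0.10))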
This question is actually for stats.stackexchange.com though.

Related

Formula for computing the Gini Coefficient in fastgini

I use the fastgini package for Stata (https://ideas.repec.org/c/boc/bocode/s456814.html).
I am familiar with the classical formula for the Gini coefficient reported for example in Karagiannis & Kovacevic (2000) (http://onlinelibrary.wiley.com/doi/10.1111/1468-0084.00163/abstract)
Formula I:

G = \frac{1}{2 N^2 \mu} \sum_{i=1}^{N} \sum_{j=1}^{N} \lvert y_i - y_j \rvert

Here G is the Gini coefficient, µ the mean value of the distribution, N the sample size and y_i the income of the i-th sample unit. Hence, the Gini coefficient takes all available income pairs in the data and totals their absolute differences. This total is then normalized by dividing it by twice the squared population size times mean income.
The Gini coefficient ranges between 0 and 1, where 0 means perfect equality (all individuals earn the same) and 1 refers to maximum inequality (1 person earns all the income in the country).
However the fastgini package refers to a different formula (http://fmwww.bc.edu/repec/bocode/f/fastgini.html):
Formula II:
fastgini uses formula:
G = 1 - \frac{2 \sum_{i=1}^{N} W_i \left( \sum_{j=1}^{i} W_j X_j - \frac{1}{2} W_i X_i \right)}{\left( \sum_{i=1}^{N} W_i X_i \right) \left( \sum_{i=1}^{N} W_i \right)}
where observations are sorted in ascending order of X.
Here W seems to be the weight, which I don't use, so it should be 1 (?). I'm not sure whether formula I and formula II are the same: formula II has no absolute differences, and the result is subtracted from 1. I have tried to transform the equations, but I don't get any further.
Could someone give me a hint whether both ways of computing (formula I + formula II) are equivalent?
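For unit weights the two do agree numerically (formula II is the sorted, or "rank", form of the Gini). A quick check in R, as a sketch on simulated incomes rather than fastgini itself:

set.seed(1)
y <- rlnorm(1000)   # simulated incomes

# Formula I: total of all absolute pairwise differences over 2*N^2*mu
gini1 <- sum(abs(outer(y, y, "-"))) / (2 * length(y)^2 * mean(y))

# Formula II with unit weights, observations sorted in ascending order
x <- sort(y)
n <- length(x)
gini2 <- 1 - 2 * sum(cumsum(x) - x / 2) / (n * sum(x))

all.equal(gini1, gini2)   # TRUE: the two formulas coincide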

LDA interpretation

I am using the HMeasure package to include LDA in my analysis of credit risk. I have 11,000 observations, and I've chosen age and income to develop the analysis. I don't know exactly how to interpret the R results of LDA, so I don't know whether I have chosen the best variables for credit risk.
I show you below the code.
lda(default ~ ETA, data = train)
Prior probabilities of groups:
       0         1
0.4717286 0.5282714
Group means:
      ETA
0 34.80251
1 37.81549
Coefficients of linear discriminants:
         LD1
ETA 0.1833161
lda(default ~ ETA + Stipendio, train)
Call:
lda(default ~ ETA + Stipendio, data = train)
Prior probabilities of groups:
       0         1
0.4717286 0.5282714
Group means:
      ETA Stipendio
0 34.80251  1535.531
1 37.81549  1675.841
Coefficients of linear discriminants:
                 LD1
ETA       0.148374799
Stipendio 0.001445174
fit <- lda(default ~ ETA, data = train)   # store the fitted model
ldaP <- predict(fit, newdata = test)      # predict.lda expects newdata=, not data=
Where ETA = AGE and STIPENDIO = INCOME
Thanks a lot!
LDA uses means and variances of each class in order to create a linear boundary (or separation) between them. This boundary is delimited by the coefficients.
You have two different models, one which depends on the variable ETA and one which depends on ETA and Stipendio.
The first thing you can see are the Prior probabilities of groups. These probabilities are the ones that already exist in your training data. I.e. 47.17% of your training data corresponds to credit risk evaluated as 0 and 52.82% of your training data corresponds to credit risk evaluated as 1. (I assume that 0 means "non-risky" and 1 means "risky"). These probabilities are the same in both models.
The second thing that you can see are the Group means, which are the average of each predictor within each class. These values could suggest that the variable ETA might have a slightly greater influence on risky credits (37.8154) than on non-risky credits (34.8025). This situation also happens with the variable Stipendio, in your second model.
The calculated coefficient for ETA in the first model is 0.1833161. This means that the boundary between the two classes is specified by the following formula:
y = 0.1833161 * ETA
This can be represented as a line (with x representing the variable ETA). Credit risks of 0 or 1 will be predicted depending on which side of the line an observation falls.
Your second model contains two predictor variables, ETA and Stipendio, so the boundary between classes is delimited by this formula:
y = 0.148374799 * ETA + 0.001445174 * Stipendio
This formula represents a plane (with x1 representing ETA and x2 representing Stipendio). As in the previous model, this plane separates risky credits from non-risky ones.
In this second model, the ETA coefficient is much greater than the Stipendio coefficient, suggesting that the former variable has a greater influence on credit riskiness than the latter. Bear in mind, though, that discriminant coefficients depend on the scale of each variable; Stipendio is measured in much larger units than ETA, so raw coefficients are only directly comparable after standardizing the predictors.
I hope this helps.
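As a minimal sketch of turning these coefficients into predictions (assuming the train and test data frames from the question, with columns default, ETA and Stipendio):

library(MASS)

fit  <- lda(default ~ ETA + Stipendio, data = train)  # two-predictor model
pred <- predict(fit, newdata = test)

head(pred$x)       # LD1 scores: the linear combination of ETA and Stipendio
table(pred$class)  # predicted classes (0/1), i.e. which side of the boundary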

Mixed model with large sample size

I am currently doing a mixed linear model (using the lme function in R), and I have some problems.
My dataset is about damage caused by brown bears in Slovenia. Slovenia was divided into 1x1 km grid cells, and for each cell I have the number of damage events per year (for 12 consecutive years). This frequency of damages will be my Y variable in the model, and I will test different environmental variables to explain the occurrence of damage (e.g. distance to the forest edge, forest cover, etc.).
I put year as a random factor (verified with a likelihood ratio test).
My sample size is large (250,000 cell values) and consists mainly of zeros (only 4,000 cases were positive, ranging from 1 to 17 damages in one cell in a year).
Here is my problem. Following the methods of Zuur (2009), I am trying to find the optimal fixed structure for my model. My first model has all the variables, plus some interactions (see below). I'm using a logit link.
f1 <- formula(dam ~ masting + dens*pop_size_index + saturation + exposition +
              settlements + orchards + crops + meadows + mixed_for + dist_for_out +
              dist_for_out_a + dist_for_in + dist_for_in_a + for_edge + prop_broadleaves +
              prop_broadleaves_a + dist_road + dist_village + feed_stat + sup_food +
              masting*prop_broadleaves)
M1.lme <- lme(f1, random = ~ 1 | year, method = "REML", data = d)
But, looking at the likelihood ratio tests, I cannot remove ANY variable: all are significant. However, the model is still very bad (too many variables in it, and the residuals do not look good), and I definitely cannot stop there.
So how can I find a better model (i.e. get rid of the non-significant variables)?
Is this due to my large sample size?
Could this zero inflation possibly be a problem?
I could not find another way of improving my model that would take this into account.
Any suggestions?
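One possible direction, sketched with the glmmTMB package (the Poisson family, constant zero-inflation term and abbreviated covariate list are illustrative assumptions, not a recommendation for your data):

library(glmmTMB)

# Zero-inflated Poisson with a random intercept for year; extend the fixed
# part with the rest of f1 as needed
M1.zip <- glmmTMB(dam ~ masting + dens * pop_size_index + (1 | year),
                  ziformula = ~ 1,   # constant zero-inflation probability
                  family = poisson,
                  data = d)
summary(M1.zip)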

Convert odds ratio of unit change to whole range

I am trying to fit a logistic regression in R and then calculate an odds ratio. I have two groups of people: the first is more strongly exposed to a pollutant than the second and develops a certain disease more often.
I just use a set of toy data here. It's easy to generate a model and estimate the significance of influence of the pollutant exposure on developing the disease:
df <- data.frame(disease = as.factor(c(rep(1, 100), rep(0, 500))),
                 exposure = c(rnorm(100, mean = 200, sd = 50),
                              rnorm(500, mean = 100, sd = 20)))
model <- glm(formula = disease ~ exposure, data = df,
             family = binomial(link = "logit"))
model.summary <- summary(model)
OR <- exp(cbind(OddsRatio = coef(model), confint(model)))
In R, odds ratios are based on a one-unit change of the independent variable; e.g. changing the pollutant concentration by 1 mg/ml yields an odds ratio of around 1.1 in the example.
My question is now: how can I recalculate an odds ratio based on a change of several units (across the whole range of pollutant exposure)?
My first guess was that the OR of the new range is the OR of a one-unit change raised to the power of the range size in units.
range <- max(df$exposure)-min(df$exposure)
ORRange <- (OR["exposure",1])^range
In the toy data, the range is about 300. And 1.1 ^ 300 is about 2x10^13, which is quite a lot.
Is this calculation correct, or must it be multiplied (1.1 x 300)?
And what is the mathematical basis to prove the calculation?
Your power calculation is in fact correct, and it is equivalent to the usual approach: multiply the coefficient on the logit scale (which is what R reports) by the number of units of change, then apply the exp function. The mathematical basis is that exp(b*k) = exp(b)^k, so the one-unit OR raised to the power k gives the same result, whereas multiplying the OR by k (1.1 x 300) does not. Here is an example of calculating the odds ratio for 1, 2, and 3 units of change:
unit.change <- c(1, 2, 3)
exp(coef(model)["exposure"] * unit.change)   # ORs for 1-, 2- and 3-unit increases

Find the average of a variable compared to fixed variable

So I have a short table of demographic data from a survey. Age, income, race, etc.
My HW question is as follows:
I would like you to determine, from your Tulsa data, whether age and income are
significantly related. If so, what is the expected income of a 50-year-old person?
I have the first part; I just need to know how to find the expected income of a 50-year-old person according to my data.
Step one: measure the correlation with cor(x, y), and test its significance with cor.test(x, y).
Step two: fit a linear regression (assuming the simplest scenario) with lm(y ~ x).
The result contains the intercept and slope, from which you can calculate the value of the function y = f(x) for any value of x, such as 50 in your question.
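A minimal sketch, assuming your survey sits in a data frame tulsa with columns age and income (hypothetical names):

# Step one: is the relationship significant?
cor.test(tulsa$age, tulsa$income)

# Step two: simple linear regression and the expected income at age 50
fit <- lm(income ~ age, data = tulsa)
predict(fit, newdata = data.frame(age = 50))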
