I am trying to perform multiple logistic regression with some of the variables that came out as statistically significant for a disease condition in univariate analysis. We took the cut-off for that as p < 0.2 since our sample size was ~300. I made a new data frame for these variables:
regression1df <- data.frame(dgfcriteria, recipientage, ESRD_dx,bmirange,graftnumber, dsa_class_1, organ_tx, transfuse01m, transfuse1yr, readmission1yr, citrange1, switrange, anastamosisrange, donorage, donorgender, donorcriteria, donorionotrope, intubaterange, kdpirange, kdrirange, eptsrange, proteinuria, terminalurea, na.rm=TRUE)
I'm using these variables to predict the disease condition, which is DGF (dgfcriteria == 1); non-disease is no DGF (dgfcriteria == 0).
Here is the structure of the data.
When I tried to run the entire list of variables with the glm code I got:
predictors1 <- glm(dgfcriteria ~.,
data = predictors1df,
family = "binomial" )
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
But when I run it with only some of the variables in the data frame, I do get output.
predictors1 <- glm(dgfcriteria ~ recipientage + ESRD_dx + bmirange + graftnumber + dsa_class_1 + organ_tx + transfuse01m + transfuse1yr + readmission1yr +citrange1 +switrange + anastamosisrange+ donorage+ donorgender + donorcriteria + donorionotrope,
data = predictors1df,
family = "binomial" )
This output looks really strange, though, with a lot of NAs.
Where have I gone wrong?
Looking at your data structure, you've got a lot of missing values. Quite a few of your variables appear to have only 2 or 3 non-missing values in the first 10 rows. When you run a regression on data with missing values, the default is to drop every row that has any missing value.
Apparently the missingness in your variables overlaps badly, so that once all rows with missing values are dropped (see na.omit(your_data) for what is left over), some factors have only one level left and are therefore no longer usable in the regression. Of course, when you use only some of the variables, fewer rows are dropped and you may be in a better situation.
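For example (a minimal sketch, with your_data standing in for the data frame you pass to glm()), you can see where the rows go and which factors collapse like this:
# how many missing values each variable has
colSums(is.na(your_data))

# what survives listwise deletion -- this is what glm() actually fits on
complete_rows <- na.omit(your_data)
nrow(complete_rows)

# factors reduced to a single level after dropping incomplete rows;
# these are the ones that trigger the "contrasts" error
sapply(complete_rows,
       function(x) if (is.factor(x)) nlevels(droplevels(x)) else NA)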
So, you'll have to decide what to do with your missing values. This should depend on your goals and your understanding of the reasons for missingness. Common possibilities include omission, imputation, creating new "missing" levels, and taking level of missingness into account in your variable selection.
I want to plot how the estimated survival from a Cox model depends on the value of a covariate of interest, while the rest of the variables are fixed at their average values (for continuous variables) or at their lowest values (for dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates held fixed. Among these covariates I have two factor vectors. I created the new data set and then passed it to survfit() via the newdata argument.
When I passed the data frame to survfit(), I obtained the following error message: Error in relevel.default(occupation) : 'relevel' only for factors. Where is the source of the problem? If it is related to the factor vectors, how can I solve it? Below is an example of the code. Unfortunately, I cannot share the data or find a data set that produces the same error message:
I have transformed the factor variables into integer vectors in the Cox model and in the new data set. It did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor. So do post at least str(data) as an edit to the body of your question. You should also realize that you can give a single value to a column in a data.frame call and have it recycled to the correct length, so all the means could be entered as single items rather than rep-ing them.
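As a rough sketch of what that could look like, assuming status_plot, occupation and gender are factors in data whose levels are literally named "1", "2", and so on, and that exp_national, age and EDUCATION are numeric (if any of those is actually a factor, pick a reference level with factor() instead of taking a mean). Supplying occupation as a real factor is what stops relevel(occupation, 5) from failing when survfit() re-evaluates the model formula on newdata:
data_rank <- with(data, data.frame(
  status_plot  = factor(c(1, 2, 3), levels = levels(status_plot)),
  exp_national = mean(exp_national, na.rm = TRUE),  # length 1, recycled to 3 rows
  occupation   = factor(5, levels = levels(occupation)),
  age          = mean(age, na.rm = TRUE),
  gender       = factor(1, levels = levels(gender)),
  EDUCATION    = mean(EDUCATION, na.rm = TRUE)
))
surv.fin <- survfit(fit, newdata = data_rank)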
I want to perform in R the next linear model:
\begin{equation}
lPC_t = \beta_0 + \beta_1PIBtvh_{t+1} + \beta_2txDes_t + \beta_3Spread_{t+4} + u_t
\end{equation}
The name of my data frame is Dados_R. I need to impose a restriction on the data because I want to estimate over just the observations between 19 and 45. The problem is that once I create the lead variables I cannot restrict their range, or at least I don't know how to, without changing the original data frame by hand, which is not convenient because I want to fit more models with different leads.
So my question is: how can I change the range of the variables that I created (leadPIBtvh0 and leadSpread0) so that I can fit the linear model using just the observations between 19 and 45?
The code that I wrote:
attach(Dados_R)
leadPIBtvh0=lag(PIBtvh,1)
leadSpread0=lag(Spread,4)
data=Dados_R[19:45,]
detach(Dados_R)
attach(data)
lPC=log(PC/(1-PC))
lm_lPC=lm(lPC~leadPIBtvh0+txDes+leadSpread0)
This code gives me the error (which I understand):
Error in model.frame.default(formula = lPC ~ leadPIBtvh0 + txDes + leadSpread0, :
  variable lengths differ (found for 'leadPIBtvh0')
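One way to avoid the length mismatch (a sketch only, assuming Dados_R contains PIBtvh, Spread, txDes and PC, and avoiding attach()) is to build the leads as columns of the data frame before subsetting, so that everything gets trimmed to rows 19 to 45 together. Note that stats::lag() only shifts the time base of a ts object, so the leads are built by hand here:
Dados_R$leadPIBtvh0 <- c(Dados_R$PIBtvh[-1], NA)              # PIBtvh at t+1
Dados_R$leadSpread0 <- c(Dados_R$Spread[-(1:4)], rep(NA, 4))  # Spread at t+4
Dados_R$lPC <- log(Dados_R$PC / (1 - Dados_R$PC))

dados_sub <- Dados_R[19:45, ]
lm_lPC <- lm(lPC ~ leadPIBtvh0 + txDes + leadSpread0, data = dados_sub)
Alternatively, keep the full data frame and pass subset = 19:45 to lm().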
I used the ApacheData data set (83784 rows) to build a linear regression model:
fit <- lm(tomorrow_apache ~ as.factor(state_today)
          + as.numeric(daily_creat)
          + as.numeric(last1yr_min_hosp_icu_MDRD)
          + as.numeric(bun)
          + as.numeric(urin)
          + as.numeric(category6)
          + as.numeric(category7)
          + as.numeric(other_fluid)
          + as.factor(daily)
          + as.factor(age)
          + as.numeric(apache3)
          + as.factor(mv)
          + as.factor(icu_loc)
          + as.factor(liver_tr_before_admit)
          + as.numeric(min_GCS)
          + as.numeric(min_PH)
          + as.numeric(previous_day_creat)
          + as.numeric(previous_day_bun),
          data = ApacheData)
And I want to use this model to predict a new input so I give each predictor variable a value:
predict(fit, data=data.frame(state_today=1, daily_creat=2.3, last1yr_min_hosp_icu_MDRD=3, bun=10, urin=0.01, category6=10, category7=20, other_fluid=0, daily=2 , age=25, apache3=12, mv=1, icu_loc=1, liver_tr_before_admit=0, min_GCS=20, min_PH=3, previous_day_creat=2.1, previous_day_bun=14))
I expect a single value as the prediction for this new input, but I get many, many predictions! I don't know why this is happening. What am I doing wrong?
Thanks a lot for your time!
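One thing to check (an educated guess, since only part of your code is shown): predict() for lm objects has no argument called data, so data = data.frame(...) falls into ... and is silently ignored, and you get the fitted values for all 83784 rows instead. The new observation has to be passed via newdata, roughly like this sketch (it assumes the supplied levels, e.g. state_today = 1 and age = 25, actually occur in ApacheData):
new_obs <- data.frame(state_today = 1, daily_creat = 2.3,
                      last1yr_min_hosp_icu_MDRD = 3, bun = 10, urin = 0.01,
                      category6 = 10, category7 = 20, other_fluid = 0,
                      daily = 2, age = 25, apache3 = 12, mv = 1, icu_loc = 1,
                      liver_tr_before_admit = 0, min_GCS = 20, min_PH = 3,
                      previous_day_creat = 2.1, previous_day_bun = 14)
predict(fit, newdata = new_obs)  # returns a single predicted value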
You may also want to try the excellent effects package in R (?effects). It's very useful for graphing the predicted values from your model by setting the inputs on the right-hand side of the equation to particular values. I can't reproduce the example you've given in your question, but to give you an idea of how to quickly extract predictions in R and then plot them (since this is vital to understanding what they mean), here's a toy example using the Prestige data set that is available once the package is loaded:
install.packages("effects") # installs the "effects" package in R
library(effects) # loads the "effects" package
data(Prestige) # loads in-built dataset
m <- lm(prestige ~ income + education + type, data=Prestige)
# this last step creates predicted values of the outcome based on a range of values
# on the "income" variable and holding the other inputs constant at their mean values
eff <- effect("income", m, default.levels=10)
plot(eff) # plots the predicted values
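If you want the numbers behind the plot rather than the picture, the effect object can usually be converted to a data frame (exact column names may vary between effects versions):
eff_df <- as.data.frame(eff)  # typically: income, fit, se, lower, upper
head(eff_df)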
I am trying to run a Canonical Correspondence Analysis on diet composition data (prey.counts) with respect to a suite of environmental variables (envvar). Every row and every column sums to greater than 0, but I keep getting this error message:
diet <- cca(prey.counts, envvar$SL + envvar$Month + envvar$water.temp +
envvar$salinity + envvar$DO)
Error in if (any(rowSums(X) <= 0)) stop("All row sums must be >0 in the community data matrix") :
missing value where TRUE/FALSE needed
I have double- and triple-checked the prey.counts data frame for NAs or empty columns/rows, and none of them sum to zero or contain missing values. R, RStudio, and all packages are fully up to date. Any help would be appreciated!
The problem is how you are calling the function: you seem to be mixing the default and formula interfaces (and abusing the formula notation while you are at it).
Does this help:
diet <- cca(prey.counts ~ SL + Month + water.temp + salinity + DO, data = envvar)
Alternatively, if the named variables are the only ones in envvar, you could do either of
diet <- cca(prey.counts ~ ., data = envvar)
or
diet <- cca(prey.counts, envvar)
with the latter using the less flexible but simple default method for cca().
So I have a data set called x. The contents are simple enough that I'll just outline them here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data=x,
family = binomial(link="logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: how do I create a fifth column on the right side of this data set x that will contain the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert glm$fitted as a column, but I'd like to write code that predicts it from whatever is in the race, sex, and gender columns, for more generalized use.
Right now I have the following code, which I hope will create a predicted column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, xnew calls on glm5, but your model, as far as I can see, is called glm (by the way, using glm as the name of your output is probably not a good idea). Secondly, make sure the variable race.f is actually in the data set you want to predict from. My guess is that R can't find that variable, hence the error.
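A sketch of how the pieces might fit together (assuming x contains Report, race.f, sex.f and gender.f, and renaming the fitted model so it does not mask the glm() function):
fit1 <- glm(Report ~ race.f + sex.f + gender.f, data = x,
            family = binomial(link = "logit"))

# xnew must contain race.f, sex.f and gender.f with the same factor levels
# as in x; here we simply predict back onto x itself
xnew <- x
pred <- predict(fit1, newdata = xnew, type = "link", se.fit = TRUE)

xnew$PredictedProb <- plogis(pred$fit)
xnew$LL <- plogis(pred$fit - 1.96 * pred$se.fit)
xnew$UL <- plogis(pred$fit + 1.96 * pred$se.fit)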