How to treat negative values in lm(x~y) function in R?

When running my script I get the following error message: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases, and I'm guessing that is due to some negative values?
The script is looping through a list of csv files, and for a small selection of them the code works. But when I run it over all of them I get the error message. I checked the data and there are some negative NDVI values (about 2% of the whole data), which are always -99999. I also have some soil moisture values which are 0.
I found the suggestion to add na.action=na.exclude to the lm() call:
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf, na.action=na.exclude)
But the same error still occurs. Do you have any other solution for this, besides deleting the negative values from the data? Ideally I would either exclude these values from the linear regression (lm), or ignore the whole csv file if there are negative values in it.

Missing values in R should be coded as NA. You could use replace:
replace(dat, dat == -99999, NA)
# X1 X2 X3
# 1 1.37 1.30 -0.31
# 2 NA 2.29 -1.78
# 3 0.36 -1.39 -0.17
# 4 0.63 -0.28 1.21
# 5 0.40 NA 1.90
# 6 -0.11 0.64 -0.43
# 7 1.51 -0.28 -0.26
# 8 -0.09 -2.66 -1.76
# 9 2.02 -2.44 NA
# 10 -0.06 1.32 -0.64
which works directly inside the lm() call, without changing the data:
lm(X1 ~ X2 + X3, replace(dat, dat == -99999, NA))$coefficients
# (Intercept) X2 X3
# 0.61499466 0.06062925 0.25979370
If there is more than one missing-value code, you could do e.g.:
replace(dat, array(unlist(dat) %in% c(-99999, -88888), dim(dat)), NA)
Data:
set.seed(42)
dat <- data.frame(matrix(round(rnorm(30), 2), 10, 3))
dat[2, 1] <- -99999
dat[5, 2] <- -99999
dat[9, 3] <- -99999
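If you would rather skip an entire csv file whenever it contains the sentinel value (the other option mentioned in the question), a minimal sketch of the loop could look like this; the folder name, the read.csv call, and the models list are assumptions, only the column names and the -99999 code come from the question:
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # assumed folder
models <- list()
for (f in files) {
  BeforeConf <- read.csv(f)
  if (any(BeforeConf == -99999, na.rm = TRUE)) next  # ignore the whole file
  models[[f]] <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)
}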

Related

How to get a scientific p-value using the cmprsk package?

Hi stackoverflow community,
I am new to R, and today I spent several hours trying to figure out how to get a p-value in scientific notation (e.g. 3e-1) from a competing risk analysis using the cmprsk package.
I used:
sumary_J1<-crr(ftime, fstatus, cov1, failcode=2)
summary(sumary_J1)
And got
Call:
crr(ftime = ftime, fstatus = fstatus, cov1 = cov1, failcode = 2)
coef exp(coef) se(coef) z p-value
group1 0.373 1.45 0.02684 13.90 0.00
age 0.122 1.13 0.00384 31.65 0.00
sex 0.604 1.83 0.04371 13.83 0.00
bmi 0.012 1.01 0.00611 1.96 0.05
exp(coef) exp(-coef) 2.5% 97.5%
group1 1.45 0.689 1.38 1.53
age 1.13 0.886 1.12 1.14
sex 1.83 0.546 1.68 1.99
bmi 1.01 0.988 1.00 1.02
Num. cases = 470690 (1900 cases omitted due to missing values)
Pseudo Log-likelihood = -28721
Pseudo likelihood ratio test = 2229 on 4 df,
I can see the p-value column, but I only get two decimal places. I would like to see as many decimal places as possible, or to print those p-values in a format like 3.0e-3.
I tried all of the following, but nothing worked so far:
summary(sumary_J1, digits=max(options()$digits - 5,10))
print.crr(sumary_J1, digits = 20)
print.crr(sumary_J1, digits = 3, scipen = -2)
print.crr(sumary_J1, format = "e", digits = 3)
Maybe someone is able to help me! Thanks!
Best,
Carolin
Passing e.g. digits=2 as an argument to summary() limits the number of digits shown to the right of the decimal point; the digits parameter does affect how results are displayed for summary.crr:
summary(z, digits=3) # using first example in `?cmprsk::crr`
#----------------------
#Competing Risks Regression
Call:
crr(ftime = ftime, fstatus = fstatus, cov1 = cov)
coef exp(coef) se(coef) z p-value
x1 0.2668 1.306 0.421 0.633 0.526
x2 -0.0557 0.946 0.381 -0.146 0.884
x3 0.2805 1.324 0.381 0.736 0.462
exp(coef) exp(-coef) 2.5% 97.5%
x1 1.306 0.766 0.572 2.98
x2 0.946 1.057 0.448 2.00
x3 1.324 0.755 0.627 2.79
Num. cases = 200
Pseudo Log-likelihood = -320
Pseudo likelihood ratio test = 1.02 on 3 df,
You can use formatC to control format:
formatC( summary(z, digits=5)$coef , format="e")
#------------>
coef exp(coef) se(coef) z p-value
x1 "2.6676e-01" "1.3057e+00" "4.2115e-01" "6.3340e-01" "5.2647e-01"
x2 "-5.5684e-02" "9.4584e-01" "3.8124e-01" "-1.4606e-01" "8.8387e-01"
x3 "2.8049e-01" "1.3238e+00" "3.8098e-01" "7.3622e-01" "4.6159e-01"
You also might search on [r] very small p-value
Here's the first of over 100 hits on that topic which, despite not getting much attention, still has very useful information and coding examples: Reading a very small p-value in R
By looking at the function that prints the output of crr() (cmprsk::print.crr) you can see what is done to create the p-values displayed in the summary. The code below is taken from that function.
x <- sumary_J1
v <- sqrt(diag(x$var))
signif(v, 4) # Gives you the standard errors of the coefficients (not yet p-values).
v <- 2 * (1 - pnorm(abs(x$coef)/v))
signif(v, 4) # Gives you the two-sided p-values.
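Combining this with formatC from the earlier answer, a minimal sketch (assuming sumary_J1 is the crr() fit from the question) to print the two-sided p-values in scientific notation:
se <- sqrt(diag(sumary_J1$var))                 # standard errors of the coefficients
p <- 2 * (1 - pnorm(abs(sumary_J1$coef) / se))  # two-sided p-values
formatC(p, format = "e", digits = 3)            # scientific notation, 3 digits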

Why can't I use cv.glm on the output of bestglm?

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
  bestglm(Xy = winedata,
          family = binomial,     # binomial family for logistic
          IC = "AIC",            # Information criteria
          method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error -
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual also says. If that's the case, then why can't I compute the test error on it using 10-fold CV with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality; the packages used are boot (for cv.glm) and bestglm.
The data was processed as follows:
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
The bestglm fit rearranges your data and names your response variable y; hence, if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that.
It's always good to check what is the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You could substitute things in the call, etc., but it's too much of a mess. Fitting is not costly, so refit the best model on winedata and pass that fit to cv.glm:
best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")
best.cv.err<- cv.glm(winedata,fit,cost1, K=10)
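For reference, the cross-validated prediction error then sits in the delta component of the object returned by cv.glm (raw estimate first, bias-adjusted estimate second):
best.cv.err$delta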

apply model coefficients on new data

I have two matrices, sub and macro_data. They contain the estimated coefficients of a model and the macro data, respectively:
> sub
coeff varname
1 -1.50 gdp
2 0.005 inflation
3 -2.4 constant
> macro_data
gdp inflation
1 18.0 -0.17
2 15.8 -0.14
3 17.7 -0.15
I would like to apply the following formula: -1.5*gdp+0.005*inflation-2.4 in order to get the scores.
I have tried
for (i in 1:1){
sub$coeff[i]*macro_data[,1]+sub$coeff[i+1]*macro_data[,sub$coeff[i+1]]+sub$coeff[i+2]
}
Actually it works, but this is not the best solution because I would like something more general. Any idea?
You can do a matrix multiplication:
cbind(macro_data, 1) %*% sub[, "coeff", drop=FALSE]
If your coefficients come from estimating a model, then normally the corresponding predict() method takes a newdata= argument to calculate estimates for new data.
For your example data the matrix multiplication above won't work directly, because you have data frames. This will do:
sub <- read.table(header=TRUE, text=
"coeff varname
-1.50 gdp
0.005 inflation
-2.4 constant ")
macro_data <- read.table(header=TRUE, text=
"gdp inflation
1 18.0 -0.17
2 15.8 -0.14
3 17.7 -0.15")
m <- cbind(macro_data, constant=1)
C <- sub$coeff
names(C) <- sub$varname
m$gdp*C["gdp"] + m$inflation*C["inflation"] + m$constant*C["constant"]
The last line can be shortened to:
as.matrix(m) %*% C[names(m)]
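If you want something fully general for any number of predictors, a small helper along these lines could work; this is only a sketch, and the function name score as well as the assumption that the coefficient table always has a constant row are mine, not from the question:
score <- function(coefs, newdata) {
  vars <- setdiff(as.character(coefs$varname), "constant")            # predictor names
  X <- cbind(as.matrix(newdata[, vars, drop = FALSE]), constant = 1)  # add intercept column
  drop(X %*% coefs$coeff[match(colnames(X), coefs$varname)])          # align coefficients and multiply
}
score(sub, macro_data)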

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column becomes the new column 1, and I remove the column that was most correlated from the set, so the data is reduced by one column. I then repeat the process until the resulting data frame has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. I am therefore wondering if there is an efficient or quicker way to get the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though that could be improved too, as #coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.table and taking advantage of its "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA,nrow=r,ncol=n)))
temp <- copy(df) # Create a temporary data frame in which I calculate the correlations
set(result, j=1L, value=temp[[1]]) # The first column is the same
for (icol in as.integer(2:n)) {
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1]) # Determine which are correlated most
set(x=result, i=NULL, j=as.integer(icol), value=(temp[[1]] + temp[[mch]]))# Aggregate and place result in results datatable
set(x=temp, i=NULL, j=1L, value=result[[icol]])# Set result as new 1st column
set(x=temp, i=NULL, j=as.integer(mch), value=NULL) # Remove column
}
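If you need a plain data frame again afterwards, data.table's setDF converts the result back by reference:
setDF(result)  # turn the data.table back into a data.frame in place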
Try
for (i in 2:n) {
maxcor <- names(which.max(sapply(temp[,-1, drop=F], function(x) cor(temp[, 1], x) )))
result[,i] <- temp[,1] + temp[,maxcor]
temp[,1] <- result[,i] # Set result as new 1st column
temp[,maxcor] <- NULL # Remove column
}
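This snippet assumes temp and result already exist from your original code; under that assumption, a minimal setup mirroring the data.table version above would be:
result <- data.frame(matrix(NA, nrow = r, ncol = n))  # container for the aggregated columns
result[, 1] <- df[, 1]                                # the first column stays as-is
temp <- df                                            # working copy used for the correlations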
The error was caused because, in the last iteration, subsetting temp yields a single vector: standard R behavior is to drop from data frame to vector in such cases, which makes sapply iterate over individual values instead of over columns (the drop=F above prevents this).
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.

Model multiple imputation with interaction terms

According to the documentation of the mice package, if we want to impute data when we're interested in interaction terms we need to use passive imputation, which is done the following way.
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I((bmi-25)*(chl-200))"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
It is said that
Imputations created in this way preserve the interaction of bmi with chl
Here, a new variable called bmi.chl is created in the original dataset. The meth step tells how this variable needs to be imputed from the existing ones. The pred step says we don't want to predict bmi and chl from bmi.chl. But now, if we want to fit a model, how do we proceed? Is the product defined by "~I((bmi-25)*(chl-200))" just a way to control for the imputed values of the main effects, i.e. bmi and chl?
If the model we want to fit is glm(hyp~chl*bmi, family="binomial"), what is the correct way to specify this model from the imputed data? fit1 or fit2?
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
Or do we have to use somehow the imputed values of the new variable created, i.e. bmi.chl?
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
With passive imputation, it does not matter if you use the passively imputed variable, or if you re-calculate the product term in your call to glm.
The reason that fit1 and fit2 yield different results in your example is that you are not just doing passive imputation for the product term.
Instead, you are transforming the two variables before multiplying (i.e., you calculate bmi-25 and chl-200). As a result, the passively imputed variable bmi.chl does not represent the product term bmi*chl but rather (bmi-25)*(chl-200).
If you just calculate the product term, then fit1 and fit2 yield the same results like they should:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I(bmi*chl)"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
pred[c("hyp"), "bmi.chl"] <- 1
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
# > round(summary(pool(fit1)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# chl:bmi 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 NA 0.43 0.33
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
# > round(summary(pool(fit2)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# bmi.chl 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 25 0.43 0.33
This is not surprising because the ~I(bmi*chl) in mice and the bmi*chl in glm do the exact same thing. They merely calculate the product of the two variables.
Remark:
Note that I added a line saying that bmi.chl should be used as a predictor when imputing hyp. Without this step, passive imputation has no purpose because the imputation model would neglect the product term, thus being incongruent with the analysis model.
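As a quick sanity check of that point, you can compare the passively imputed product with a recomputed one in a completed dataset (a minimal sketch, assuming imp is the imputation object created above):
chk <- complete(imp, 1)                                # first completed dataset
all.equal(as.numeric(chk$bmi.chl), chk$bmi * chk$chl)  # should be TRUE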
