I am trying to calculate DFFITS for a GLM whose responses follow a Beta distribution, fitted with the betareg R package. But I think this package doesn't support influence.measures(), because calling dffits() fails, as shown below.
Code
require(betareg)
data("ReadingSkills")
y <- ReadingSkills$accuracy
n <- length(y)
bfit <- betareg(accuracy ~ dyslexia + iq, data = ReadingSkills)
DFFITS <- dffits(bfit, infl = influence(bfit, do.coef = FALSE))
DFFITS
It yields:
Error in if (model$rank == 0) { : argument is of length zero
I am a newbie in R and don't know how to resolve this problem. Please help me solve it, or give me some tips (with R code) on how to calculate DFFITS manually.
Regards
dffits() is not implemented for "betareg" objects, but you could try to calculate the values manually.
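The quantity to reproduce is the standard leave-one-out definition (a textbook formula, nothing betareg-specific): DFFITS_i = (yhat_i - yhat_i(i)) / (s_(i) * sqrt(h_ii)), where yhat_i(i) is the prediction for observation i from the model refit without that observation, s_(i) is the residual standard error of that refit, and h_ii is the leverage of observation i. The function below computes this numerator and denominator.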
According to this Stack Overflow Q/A we could write this function:
dffits1 <- function(x1, bres.type = "response") {
  stopifnot(inherits(x1, c("lm", "betareg")))
  sapply(1:length(x1$fitted.values), function(i) {
    x2 <- update(x1, data = x1$model[-i, ])  # refit, leaving observation i out
    h <- hatvalues(x1)                       # leverages from the full fit
    num_dffits <- suppressWarnings(predict(x1, x1$model[i, ]) -
                                     predict(x2, x1$model[i, ]))
    residx <- if (inherits(x1, "betareg")) {
      betareg:::residuals.betareg(x2, type = bres.type)
    } else {
      x2$residuals
    }
    denom_dffits <- sqrt(c(crossprod(residx)) / x2$df.residual * h[i])
    return(num_dffits / denom_dffits)
  })
}
It works well for lm:
fit <- lm(mpg ~ hp, mtcars)
dffits1(fit)
stopifnot(all.equal(dffits1(fit), dffits(fit)))
Now let's try betareg:
library(betareg)
data("ReadingSkills")
bfit <- betareg(accuracy ~ dyslexia + iq, data=ReadingSkills)
dffits1(bfit)
# 1 2 3 4 5 6 7
# -0.07590185 -0.21862047 -0.03620530 0.07349169 -0.11344968 -0.39255172 -0.25739032
# 8 9 10 11 12 13 14
# 0.33722706 0.16606198 0.10427684 0.11949807 0.09932991 0.11545263 0.09889406
# 15 16 17 18 19 20 21
# 0.21732090 0.11545263 -0.34296030 0.09850239 -0.36810187 0.09824013 0.01513643
# 22 23 24 25 26 27 28
# 0.18635669 -0.31192106 -0.39038732 0.09862045 -0.10859676 0.04362528 -0.28811277
# 29 30 31 32 33 34 35
# 0.07951977 0.02734462 -0.08419156 -0.38471945 -0.43879762 0.28583882 -0.12650591
# 36 37 38 39 40 41 42
# -0.12072976 -0.01701615 0.38653773 -0.06440176 0.15768684 0.05629040 0.12134228
# 43 44
# 0.13347935 0.19670715
That looks reasonable.
Notes:
Even if this works in code, you should check whether it meets your statistical requirements!
I've wrapped the two predict() calls in dffits1 in suppressWarnings(). predict(bfit, ReadingSkills) somehow drops the contrasts, whereas predict(bfit) does not (they should practically be the same). However, the results are identical, as all.equal(predict(bfit, ReadingSkills), predict(bfit)) confirms, so ignoring the warnings should be safe.
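If you then want to flag influential observations, one common heuristic (a rule of thumb from the linear-model literature, not part of betareg) is |DFFITS| > 2*sqrt(k/n):
# Rule-of-thumb cutoff, with k estimated parameters and n observations
k <- length(coef(bfit))   # includes the precision parameter (phi)
n <- nrow(ReadingSkills)
which(abs(dffits1(bfit)) > 2 * sqrt(k / n))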
Related
I am trying to run a regression loop based on code from a previous answer (How to Loop/Repeat a Linear Regression in R), but I keep getting an error. My outcomes (dependent variables) are 940 metabolites, and my predictors (independent variables) are "bmi", "Age", "sex", "lpa2c", and "smoking", where BMI and Age are continuous. BMI is the main exposure; the others are covariates I am controlling for.
So I'm testing the effect of BMI on 940 metabolites.
Also, I would like to know how I can extract the coefficient, p-value, standard error, and confidence interval for BMI only, and only when it is significant.
This is the code I have used:
y <- c(1653:2592) # response columns
x1 <- c("bmi", "Age", "sex", "lpa2c", "smoking") # predictors
for (i in x1){
  model <- lm(paste("y ~", i[[1]]), data = QBB_clean)
  print(summary(model))
}
And this is the error:
Error in model.frame.default(formula = paste("y ~", i[[1]]), data = QBB_clean, :
variable lengths differ (found for 'bmi').
y1 y2 y3 y4 bmi age sex lpa2c smoking
1 0.2875775201 0.59998896 0.238726027 0.784575267 24 18 1 0.470681834 1
2 0.7883051354 0.33282354 0.962358936 0.009429905 12 20 0 0.365845473 1
3 0.4089769218 0.48861303 0.601365726 0.779065883 18 15 0 0.121272054 0
4 0.8830174040 0.95447383 0.515029727 0.729390652 16 21 0 0.046993681 0
5 0.9404672843 0.48290240 0.402573342 0.630131853 18 28 1 0.262796304 1
6 0.0455564994 0.89035022 0.880246541 0.480910830 13 13 0 0.968641168 1
7 0.5281054880 0.91443819 0.364091865 0.156636851 11 12 0 0.488495482 1
8 0.8924190444 0.60873498 0.288239281 0.008215520 21 23 0 0.477822030 0
9 0.5514350145 0.41068978 0.170645235 0.452458394 18 17 1 0.748792881 0
10 0.4566147353 0.14709469 0.172171746 0.492293329 20 15 1 0.667640231 1
If you want to loop over responses you will want something like this:
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking")
results <- list()
for (v in respvars) {
  form <- reformulate(predvars, response = v)
  results[[v]] <- lm(form, data = QBB_clean)
}
You can then print the results with something like lapply(results, summary), extract coefficients, etc. (Although I have a little trouble seeing how it's going to be useful to just print the results of 940 regressions ... are you really going to inspect them all?)
If you want coefficients etc. for BMI, I think this should work (not tested):
t(sapply(results, function(m) coef(summary(m))["bmi",]))
Or for confidence intervals:
t(sapply(results, function(m) confint(m)["bmi",]))
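Putting those together for your "only when significant" requirement, here is an untested sketch (assuming the coefficient is literally named bmi in your fitted models) that collects the estimate, standard error, p-value, and 95% confidence interval for BMI across all responses, then keeps only the significant rows:
# Gather BMI estimate, SE, t, p, and 95% CI for every response
bmi_stats <- t(sapply(results, function(m) {
  c(coef(summary(m))["bmi", ], confint(m)["bmi", ])
}))
# Keep only rows where BMI is significant at the 5% level
bmi_sig <- bmi_stats[bmi_stats[, "Pr(>|t|)"] < 0.05, , drop = FALSE]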
I recently installed the sommer package, but the 2D Spline and nna examples provided would not run for me. I updated my version of R to see if that would resolve the issue, but it did not. These are the examples I have been trying to run:
2D Spline:
data(CPdata)
head(CPpheno)
CPgeno[1:4,1:4]
#### create the variance-covariance matrix
A <- A.mat(CPgeno) # additive relationship matrix
#### look at the data and fit the model
head(CPpheno)
mix1 <- mmer2(Yield~1,
random=~g(id)
+ Rowf + Colf
+ spl2D(Row,Col),
rcov=~units,
G=list(id=A), silent=TRUE,
data=CPpheno)
summary(mix1)
###Warning message: In data(CPdata) : data set ‘CPdata’ not found
Nearest-Neighbor Adjustment
I found the yates.oats dataset in agridat through a web search; without loading agridat, the warning states that yates.oats is not found.
library(agridat)
# NOT RUN {
data(yates.oats)
head(yates.oats)
newyates <- nna(yates.oats, trait="Y")
head(newyates)
plot(newyates$Y, newyates$nnx)
cor(newyates$Y, newyates$nnx)
#### now fit the models and compare #####
m3 <- mmer2(fixed=Y ~ V+N+V:N, random = ~ B + B:MP,
data = yates.oats)
yates.oats$res <- residuals(m3)
m4 <- mmer2(fixed=Y ~ V+N+V:N + nnx, random = ~ B + B:MP,
data = newyates)
newyates$res <- residuals(m4)
# library(lattice)
# wireframe(res~row*col,yates.oats)
# wireframe(res~row*col,newyates)
# }
Error Message:
library(agridat)
data(yates.oats)
head(yates.oats)
row col yield nitro gen block grain straw
1 16 3 80 0 GoldenRain B1 20.00 28.00
2 12 4 60 0 GoldenRain B2 15.00 25.00
3 3 3 89 0 GoldenRain B3 22.25 40.50
4 14 1 117 0 GoldenRain B4 29.25 28.75
5 8 2 64 0 GoldenRain B5 16.00 32.00
6 5 2 70 0 GoldenRain B6 17.50 27.25
newyates <- nna(yates.oats, trait="Y")
Error in nna(yates.oats, trait = "Y") : could not find function "nna"
Please use the documentation for your current version. If you have the newest version installed, you may notice that the mmer2() function doesn't exist anymore. If you check the recent documentation for the spl2D() function, the example is this:
data(DT_cpdata)
DT <- DT_cpdata
GT <- GT_cpdata
MP <- MP_cpdata
A <- A.mat(GT)
mix <- mmer(Yield~1,
            random=~vs(id, Gu=A) +
              vs(Rowf) +
              vs(Colf) +
              spl2Db(Row,Col),
            rcov=~units,
            data=DT)
summary(mix)$varcomp
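As a quick sanity check (plain base R, nothing sommer-specific), you can confirm which version you actually have installed and run the example shipped with it:
packageVersion("sommer")               # compare against the docs you are reading
example("spl2Db", package = "sommer")  # run the example from your installed version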
That example should work well. The problem seems to be that you are using a new version of sommer with code written for an old one. All the best.
Eduardo
Assume I have fit a regression model with multiple predictor variables in R, as in the following toy example:
n <- 20
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)
m <- lm(z ~ x + y + I(y^2))
Now I have new data, consisting of x and y values, and I want to predict the corresponding z values:
x.new <- rnorm(5)
y.new <- rnorm(5)
Question: How should I best call predict to apply the fitted model to the new data?
Here are a few things I tried, which do not work:
Attempt 1. Trying to use the x.new and y.new as the columns of a new data frame:
> predict(m, data=data.frame(x=x.new, y=y.new))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
This gave 20 values instead of 5, so cannot be right.
Attempt 2: Maybe predict got confused because the y^2 values were not supplied? Let's try using model.frame to provide the data in the correct form.
> predict(m, model.frame(~ x.new + y.new + I(y.new^2)))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
Warning message:
'newdata' had 5 rows but variables found have 20 rows
Again, this results in 20 values (plus a warning), so cannot be right.
The parameter is newdata (not data) when telling predict what to predict for. Since predict.lm() has no data argument, your data = data.frame(...) was silently absorbed by ... and ignored, which is why you got back the 20 fitted values for the original data.
predict(m, newdata = data.frame(x = x.new, y = y.new))
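Note that you don't need to supply the I(y^2) column yourself: predict() reconstructs it from the model formula using the y column in newdata. And since these are standard predict.lm arguments, you can also request prediction intervals, e.g.:
predict(m, newdata = data.frame(x = x.new, y = y.new),
        interval = "prediction")  # fit, lwr, upr for each of the 5 new rows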
I have the R iris dataset, which I am using for a PNN (probabilistic neural network). The 3 species have been recoded as levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn; if anyone is good in R, perhaps you can explain how it works.
Q2. The output for the test set (the prediction) is shown below. I don't understand it, because the values are supposed to be close to 0, 1, or 2.
library(pnn)      # learn(), smooth(), guess()
library(foreach)  # %dopar% used in pred_pnn below

data <- read.csv("c:/iris-recoded.csv", header = TRUE)
size <- nrow(data)
length <- ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions, ]
testing <- data[-positions, 1:(length - 1)]  # drop the class column
result <- data[-positions, ]
result$actual <- result[, length]
result$predict <- -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1],
               row.names = NULL)
  }
}
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation for ?guess would lead us to believe that it does something similar to what predict does for other models; in other words, it predicts which class an observation belongs to. Everything else in there is bookkeeping. Why do you get only the probabilities? Because the person who wrote the function made it that way, by extracting guess(x)$probabilities[1] and returning only that, i.e. just the probability of the first class rather than a class label. If you look at the raw output, you would also find the predicted class tucked away in the $category list element.
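For example, here is a minimal variant of pred_pnn (untested, same assumptions as the question's code: pnn and foreach loaded, nn1 and testing defined as above) that returns the predicted category alongside the full probability vector:
pred_pnn2 <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  foreach(i = xlst, .combine = rbind) %do% {  # %do% avoids needing a parallel backend
    g <- guess(nn, as.matrix(i))
    data.frame(category = g$category,         # the class label you expected
               t(g$probabilities),            # one probability per class
               row.names = NULL)
  }
}
print(pred_pnn2(testing, nn1))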
So I have a time series of MODIS NDVI values (vegetation values from 0 to 1, for the non-geographic geeks), and I'm trying to approximate the derivative using a for loop.
This is a sample of my data:
> m2001
date value valnorm
1 1 0.4834 0.03460912
2 17 0.4844 0.03664495
3 33 0.5006 0.06962541
4 49 0.4796 0.02687296
5 65 0.5128 0.09446254
6 81 0.4915 0.05109935
7 97 0.4664 0.00000000
8 113 0.5345 0.13864007
9 129 0.8771 0.83611564
10 145 0.9529 0.99043160
11 161 0.9250 0.93363192
12 177 0.9450 0.97434853
13 193 0.9491 0.98269544
14 209 0.9434 0.97109121
15 225 0.9576 1.00000000
16 241 0.8992 0.88110749
17 257 0.9115 0.90614821
18 273 0.8361 0.75264658
19 289 0.5725 0.21600163
20 305 0.5188 0.10667752
21 321 0.5467 0.16347720
22 337 0.5484 0.16693811
23 353 0.5427 0.15533388
Column 1 is the Julian day of the pixel value.
Column 2 is the raw NDVI value.
Column 3 is the NDVI stretched from 0 to 1 (a normalization technique, since NDVI rarely actually reaches 1 or 0).
I'm still very new to programming and R, but I think I've managed to piece together a tenuous grasp of it. What I'm trying to do is create a new column whose values give me some idea of the local slope at each data point.
The function I've come up with is this:
deriv <- function(x1=1:23, x2=1){
  for (i in x1){
    i1 <- c(x1[i-1], x1[i], x1[i+1])
    i2 <- c(x2[i-1], x2[i], x2[i+1])
    deriv.func <- lm(i2~i1, na.action=NULL)
  } return(deriv.func$coef[[2]])
}
What happens when I run it is this:
> deriv <- function(x1=1:23, x2=1){
+ for (i in x1){
+ i1 <- c(x1[i-1], x1[i], x1[i+1])
+ i2 <- c(x2[i-1], x2[i], x2[i+1])
+ deriv.func <- lm(i2~i1, na.action=NULL)
+ } return(deriv.func$coef[[2]])
Error: unexpected symbol in:
"deriv.func <- lm(i2~i1, na.action=NULL)
} return"
> }
Error: unexpected '}' in "}"
>
I'm not sure what I'm doing wrong, as I can get it to parse when I fill in a value for i:
> i=6
> x1=m2001$date
> x2=m2001$valnorm
> i1 <- c(x1[i-1], x1[i], x1[i+1])
> i2 <- c(x2[i-1], x2[i], x2[i+1])
> i1
[1] 33 49 65
> i2
[1] 0.06962541 0.02687296 0.09446254
> lm(i2 ~ i1)
Call:
lm(formula = i2 ~ i1)
Coefficients:
(Intercept) i1
0.0256218 0.0007762
> func <- lm(i2 ~ i1)
> func$coef[[2]]
[1] 0.0007761604
Any ideas? Thanks a ton.
Try putting return on a new line. R needs a newline (or a semicolon) between two statements, so '} return(...)' on a single line is a parse error:
}
return(deriv.func$coef[[2]])
}
Well, after looking (a lot) more into the for loop, I got it to do what I want.
deriv <- function(x1 = 1:23, x2 = 1){
  n <- length(x1)
  deriv.func <- numeric(length = n)  # pre-allocate the output vector
  for (i in 1:n) {
    i1 <- c(x1[i-1], x1[i], x1[i+1])
    i2 <- c(x2[i-1], x2[i], x2[i+1])
    derivate <- lm(i2 ~ i1)
    deriv.func[i] <- derivate$coef[[2]]
  }
  return(deriv.func)
}
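Applied to the data above (assuming m2001 as printed earlier), this gives one local slope per observation:
slopes <- deriv(m2001$date, m2001$valnorm)
round(slopes, 6)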
Thanks for the help, and the tip in the right direction, @dbaseman!
Ideas that made a difference:
- making sure I had space allocated for the output vector: deriv.func <- numeric(length = n);
- making sure the intermediate variables didn't overwrite the output.