I am trying run a linear model in R that does not specify an intercept. The reason is to eventually calculate the sums of squares reduced when an intercept is added. However, I am receiving different results when specifying this model using built-in factor contrasts versus explicitly stating the contrast values (i.e., -.5 and .5).
More specifically, using contrasts() results in a model with 2 terms (no intercept) while explicitly stating the contrast values via a column vector results in the correct model (no intercept and 1 term specifying the contrast).
group <- rep(c("c", "t"), each = 5)
group_cont <- rep(c(-.5, .5), each = 5)
var1 <- runif(10)
var2 <- runif(10)
test_data <- data.frame(
group = factor(group),
group_cont = group_cont,
y = var1,
x = var2
)
contrasts(test_data$group) <- cbind(grp = c(-.5, .5))
summary(lm(y ~ 1 + group, data = test_data)) # full model
summary(lm(y ~ 0 + group, data = test_data)) # weird results
summary(lm(y ~ 0 + group_cont, data = test_data)) # expected
Is there a way to specify a linear model without an intercept, but still use contrasts() to specify the contrast?
lm() asks for a data frame and column names as inputs. When you use contrasts(), you are assigning an attribute to the column in your data frame, which you can call directly using the the contrast function or attr. However, you are not changing the data type itself. Using you example above:
> str(test_data)
'data.frame': 10 obs. of 4 variables:
$ group : Factor w/ 2 levels "c","t": 1 1 1 1 1 2 2 2 2 2 #### still a factor ####
..- attr(*, "contrasts")= num [1:2, 1] -0.5 0.5 #### NOTE The contrast attribute ####
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr "c" "t"
.. .. ..$ : chr "grp"
$ group_cont: num -0.5 -0.5 -0.5 -0.5 -0.5 0.5 0.5 0.5 0.5 0.5
$ y : num 0.161 0.518 0.417 0.335 0.301 ...
$ x : num 0.34 0.729 0.766 0.629 0.191 ...
> attr(test_data$group, "contrasts")
grp
c -0.5
t 0.5
So a attr was added but the type is still a factor. So lm treats it like a factor, providing you a coefficient for each level. Moreover, providing contrast or calling the attr inside lm will throw an error. Depending on what you want to the end to look like, you may need to explore a different package like contrast. There is also a contrast argument in lm but I am not 100% sure this is what you are really looking for. See ?lm for more on that.
Related
I am trying to model the performance of a portfolio consisting of a basket of ETFs. To do this, I am using a T copula. For now, I have specified the marginals (i.e. the performance of the individual ETFs) as being normal, however, I want to use a Student t-distribution instead of a normal distribution.
I have looked into the fit.st() method from the QRM package, but I am unsure how to combine this with the copula package.
I know how to implement normally distributed margins:
mv.NE <- mvdc(normalCopula(0.75), c("norm"),
list(list(mean = 0, sd =2)))
How can I do the same thing, but with a t-distribution?
All that you need to do is to use tCopula instead of the normalCopula. You need to set up the parameter and degree of freedom of t-copula. And you need to specify the margins as well.
Hence, here we replace the normalCopula with tCopula and df=5 is the degree of freedom. Both margins are normal (as you want).
mv.NE <- mvdc(tCopula(0.75, df=5), c("norm", "norm"),
+ list(list(mean = 0, sd =2), list(list(mean = 0, sd =2))))
The result is:
Multivariate Distribution Copula based ("mvdc")
# copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 5.00
# margins:
[1] "norm" "norm"
with 2 (not identical) margins; with parameters (# paramMargins)
List of 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
$ :List of 1
..$ mean:List of 2
.. ..$ mean: num 0
.. ..$ sd : num 2
For t-margins, use this:
mv.NE <- mvdc(tCopula(0.75), c("t","t"),list(t=5,t=5))
Multivariate Distribution Copula based ("mvdc")
# copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 4.00
# margins:
[1] "t" "t"
with 2 (not identical) margins; with parameters (# paramMargins)
List of 2
$ t: Named num 5
..- attr(*, "names")= chr "df"
$ t: Named num 5
..- attr(*, "names")= chr "df"
I have a dataframe in R.
This is part of the head of the DF:
Sujet T E O P meanTR
1 1 0 0 0 0.97
1 1 0 0 0 1.44
2 0 1 0 1 0.94
Sujet : from 1 to 12
T , E , O , P : 1 or 0
meanTR : Numeric
I want to get the anova table, so I tried this:
model_all <- aov(meanTR ~ E*O*P+ Error(Sujet/E*O*P), data = df)
After that, I want to extract the residuals of my models to plot them
So I tried this :
res <- residuals(model_all) returns NULL
So I found people online suggesting this solution:
model_all.pr <- proj(model_all)
res <- model_all.pr[[3]][, "Residuals"]
But this returns subscript out of bound
res <- model_all.pr[[3]]["Residuals"]
But this returns NA
I don't know what I'm doing wrong. I'm really confused
Any help would be appreciated.
The main goal is to be able to run this:
plot(res)
qqnorm(res)
With aov(), you'll get a top-level $residuals attribute for some fits but not others.
For example with a simple model like the following, you can access residuals directly (use str() to see the structure of an object, include what attributes can be accessed):
fit1 <- aov(Sepal.Length ~ Sepal.Width, data=iris)
str(fit1$residuals)
## Named num [1:150] -0.644 -0.956 -1.111 -1.234 -0.722 ...
## - attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
But in the more complex model specification you're using (i.e. with an explicit/custom error term), there are separate residual values in each of the top-level attributes:
fit2 <- aov(Sepal.Length ~ Sepal.Width + Error(Species/Sepal.Length), data=iris)
fit2$residuals # NULL
names(fit2)
## [1] "(Intercept)" "Species" "Sepal.Length:Species" "Within"
fit2$Species$residuals
## 2 3
## -1.136219 5.179749
str(fit2$Within$residuals)
## Named num [1:144] -1.83e-15 -2.49e-15 -1.90e-15 -2.55e-15 -2.89e-15 ...
## - attr(*, "names")= chr [1:144] "7" "8" "9" "10" ...
## ...
## ...
I haven't thought about the stats behind enough to say why this is, but I'm sure it is reasonable.
Hope that helps!
From this question, I was wondering if it's possible to extract the Quadratic discriminant analysis (QDA's) scores and reuse them after like PCA scores.
## follow example from ?lda
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
Sp = rep(c("s","c","v"), rep(50,3)))
set.seed(1) ## remove this line if you want it to be pseudo random
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
Using the QDA here
z <- qda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
## get the whole prediction object
pred <- predict(z)
## show first few sample scores on LDs
Here, you can see that it's not working.
head(pred$x)
# NULL
plot(LD2 ~ LD1, data = pred$x)
# Error in eval(expr, envir, enclos) : object 'LD2' not found
NOTE: Too long/formatted for a comment. NOT AN ANSWER
You may want to try the rrcov package:
library(rrcov)
z <- QdaCov(Sp ~ ., Iris[train,], prior = c(1,1,1)/3)
pred <- predict(z)
str(pred)
## Formal class 'PredictQda' [package "rrcov"] with 4 slots
## ..# classification: Factor w/ 3 levels "c","s","v": 2 2 2 1 3 2 2 1 3 2 ...
## ..# posterior : num [1:41, 1:3] 5.84e-45 5.28e-50 1.16e-25 1.00 1.48e-03 ...
## ..# x : num [1:41, 1:3] -97.15 -109.44 -54.03 2.9 -3.37 ...
## ..# ct : 'table' int [1:3, 1:3] 13 0 1 0 16 0 0 0 11
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ Actual : chr [1:3] "c" "s" "v"
## .. .. ..$ Predicted: chr [1:3] "c" "s" "v"
It also has robust PCA methods that may be useful.
Unfortunately, not every model in R conforms to the same object structure/API and this won't be a linear model, so it is unlikely to conform to linear model fit structure APIs.
There's an example of how to visualize the qda results here — http://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/
And, you can do:
library(klaR)
partimat(Sp ~ ., data=Iris, method="qda", subset=train)
for a partition plot of the qda results.
I have a very big data set (ds). One of its columns is Popularity, of type factor ('High' / ' Low').
I split the data to 70% and 30% in order to create a training set (ds_tr) and a test set (ds_te).
I have created the following model using a Logistic regression:
mdl <- glm(formula = popularity ~ . -url , family= "binomial", data = ds_tr )
then I created a predict object (will do it again for ds_te)
y_hat = predict(mdl, data = ds_tr - url , type = 'response')
I want to find the precision value which corresponds to a cutoff threshold of 0.5 and find the recall value which corresponds to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is a table of many values
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..# x.name : chr "Recall"
..# y.name : chr "Precision"
..# alpha.name : chr "Cutoff"
..# x.values :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..# y.values :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..# alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Acces the slots of performance object (through the combination of # + list)
We create a dataset with all possible values:
probab.cuts <- data.frame(cut=perf#alpha.values[[1]], prec=perf#y.values[[1]], rec=perf#x.values[[1]])
You can view all associated values
probab.cuts
If you want to select the requested values, it is trivial to do:
tail(probab.cuts[probab.cuts$cut > 0.5,], 1)
Manual check
tab <- table(ds_tr$popularity, y_hat > 0.5)
tab[4]/(tab[4]+tab[2]) # recall
tab[4]/(tab[4]+tab[3]) # precision
I am trying to figure out how to plot the profile likelihood curve of a GLM
parameter with 95% pLCI's on the same plot. The example I have been trying
with is below. The plots I am getting are not the likelihood curves that I
was expecting. The y-axis of the plots is tau and I would like that axis
to be the likelihood so that I have a curve that maxes at the parameter
estimate. I am not sure where I find those likelihood values? I may just
be misinterpreting the theory behind this. Thanks for any help you can give.
Max
clotting <- data.frame(
u = c(5,10,15,20,30,40,60,80,100),
lot1 = c(118,58,42,35,27,25,21,19,18),
lot2 = c(69,35,26,21,18,16,13,12,12))
glm2<-glm(lot2 ~ log(u), data=clotting, family=Gamma)
prof<-profile(glm2)
plot(prof)
Regenerate your example:
clotting <- data.frame(
u = c(5,10,15,20,30,40,60,80,100),
lot1 = c(118,58,42,35,27,25,21,19,18),
lot2 = c(69,35,26,21,18,16,13,12,12))
glm2 <- glm(lot2 ~ log(u), data=clotting, family=Gamma)
The profile.glm function actually lives in the MASS package:
library(MASS)
prof<-profile(glm2)
In order to figure out what profile.glm and plot.profile are doing, see ?profile.glm and ?plot.profile. However, in order to dig into the profile object it may also be useful to examine the code of MASS:::profile.glm and MASS:::plot.profile ... basically, what these tell you is that profile is returning the signed square root of the difference between the deviance and the minimum deviance, scaled by the dispersion parameter. The reason that this is done is so that the profile for a perfectly quadratic profile will appear as a straight line (it's much easier to detect deviations from a straight line than from a parabola by eye).
The other thing that may be useful to know is how the profile is stored. Basically, it's a list of data frames (one for each parameter profiled), except that the individual data frames are a little bit weird (containing one vector component and one matrix component).
> str(prof)
List of 2
$ (Intercept):'data.frame': 12 obs. of 3 variables:
..$ tau : num [1:12] -3.557 -2.836 -2.12 -1.409 -0.702 ...
..$ par.vals: num [1:12, 1:2] -0.0286 -0.0276 -0.0267 -0.0258 -0.0248 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "(Intercept)" "log(u)"
..$ dev : num [1:12] 0.00622 0.00753 0.00883 0.01012 0.0114 ...
$ log(u) :'data.frame': 12 obs. of 2 variables:
..$ tau : num [1:12] -3.516 -2.811 -2.106 -1.403 -0.701 ...
..$ par.vals: num [1:12, 1:2] -0.0195 -0.0204 -0.0213 -0.0222 -0.023 ...
.. ..- attr(*, "dimnames")=List of 2
It also contains attributes summary and original.fit that you can use to recover the dispersion and minimum deviance:
disp <- attr(prof,"summary")$dispersion
mindev <- attr(prof,"original.fit")$deviance
Now reverse the transformation for parameter 1:
dev1 <- prof[[1]]$tau^2
dev2 <- dev1*disp+mindev
Plot:
plot(prof[[1]][,1],dev2,type="b")
(This is the plot of the deviance. You can multiply by 0.5 to get the negative log-likelihood, or -0.5 to get the log-likelihood ...)
edit: some more general functions to transform the profile into a useful format for lattice/ggplot plotting ...
tmpf <- function(x,n) {
data.frame(par=n,tau=x$tau,
deviance=x$tau^2*disp+mindev,
x$par.vals,check.names=FALSE)
}
pp <- do.call(rbind,mapply(tmpf,prof,names(prof),SIMPLIFY=FALSE))
library(reshape2)
pp2 <- melt(pp,id.var=1:3)
pp3 <- subset(pp2,par==variable,select=-variable)
Now plot it with lattice:
library(lattice)
xyplot(deviance~value|par,type="b",data=pp3,
scales=list(x=list(relation="free")))
Or with ggplot2:
library(ggplot2)
ggplot(pp3,aes(value,deviance))+geom_line()+geom_point()+
facet_wrap(~par,scale="free_x")
FYI, for fun, I took the above and whipped it together into a single function using purrr::imap_dfr as I couldn't find a package that implements the above.
get_profile_glm <- function(aglm){
prof <- MASS:::profile.glm(aglm)
disp <- attr(prof,"summary")$dispersion
purrr::imap_dfr(prof, .f = ~data.frame(par = .y,
deviance=.x$z^2*disp+aglm$deviance,
values = as.data.frame(.x$par.vals)[[.y]],
stringsAsFactors = FALSE))
}
Works great!
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
ggplot(get_profile_glm(aglm), aes(x = values, y = deviance)) +
geom_point() +
geom_line() +
facet_wrap(~par, scale = "free_x")