I am getting the error above when trying to use the cv.lm function. Please see my code:
sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
sample1<-sample[2:2000,3:131]
samplex<-sample[2:50,3:131]
y<-as.numeric(sample1[1,])
y<-as.numeric(sample1[2:50,2])
x1<-as.numeric(sample1[2:50,3])
x2<-as.numeric(sample1[2:50,4])
x11<-x1[!is.na(y)]
x12<-x2[!is.na(y)]
y<-y[!is.na(y)]
fit1 <- lm(y ~ x11 + x12, data=sample)
fit1
x3<-as.numeric(sample1[2:50,5])
x4<-as.numeric(sample1[2:50,6])
x13<-x3[!is.na(y)]
x14<-x4[!is.na(y)]
fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
anova(fit1,fit2)
install.packages("DAAG")
library("DAAG")
cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation
Any insight will be appreciated.
Example of data
ID peak height LCA001 LCA002 LCA003
N001786 32391.111 0.397 0.229 -0.281
N005356 32341.473 0.397 -0.655 -1.301
N002416 32215.474 -0.703 -0.214 -0.901
GS239 31949.777 0.354 0.118 0.272
N016343 31698.853 0.226 0.04 -0.006
N003255 31604.978 0.024 NA -0.534
N004358 31356.597 -0.252 -0.022 -0.407
N000122 31168.09 -0.487 -0.533 -0.134
GS10564 31106.103 -0.156 -0.141 -1.17
GS17987 31043.876 NA 0.253 0.553
N003674 30876.207 0.109 0.093 0.07
Please see the example of the data above
First, you are using lm(...) incorrectly, or at least in a very unconventional way. The purpose of the data=sample argument is to let the formula refer to columns of sample by name. Referring to free-standing vectors in the formula is generally very bad practice.
So try this:
## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)
This will give columns 2:6 the appropriate names and then use those in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but included for clarity.
Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:
cv.lm(df=samplex, fit1, m=10)
is equivalent to:
cv.lm(df=samplex,y~x11+x12,m=10)
Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.
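To make the role of the data frame concrete, here is a minimal base-R sketch of k-fold cross-validation, the kind of procedure cv.lm(...) performs. The data are simulated and the names dat, k, folds and fold_mse are made up for illustration; this is not DAAG's implementation. The point is that the model is refit on each training fold from the formula, so every variable in the formula has to be a column of the data frame.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical column names)
dat <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(60, sd = 0.5)

k <- 5
# Randomly assign each row to one of k folds of (nearly) equal size
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))

fold_mse <- sapply(seq_len(k), function(i) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  # The formula is resolved against `train`, so y, x1 and x2 must be its columns
  fit <- lm(y ~ x1 + x2, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

cv_mse <- mean(fold_mse)  # average held-out mean squared error
cv_mse
```

If x1 and x2 lived only as free-standing vectors, the subsetting into train and test would not affect them, which is why the fitted model and the cross-validation data frame have to agree on column names.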
I am trying to do a Difference-In-Differences Regression with Fixed Effects. The regression is meant to estimate the impact of participating in a televised Sports Event on the Social Media Follower Count of the participating Teams, compared to other Teams that did not participate.
My Data looks like this:
[Data][1]
The dependent variable is the Rate_Percent, which is the growth rate of Facebook-Likes, which is calculated as follows
Dataset_FB <- Dataset_FB %>% group_by(ID) %>%
mutate(Diff_Growth = FBLikes - lag(FBLikes),
Rate_Percent = Diff_Growth / lag(FBLikes) * 100)
Teilnahme is a Dummy Variable to tell the Participants from the non-Participants, and Hauptrunde is a Dummy Variable to indicate the time frame of the treatment (0 before the treatment, 1 after the treatment). I am trying to include the ID, Uhrzeit and Spieltag as fixed effects to control for Club- and Time- differences.
My regression looks like this:
reg <- lm (Rate_Percent ~ Teilnahme + Hauptrunde + Teilnahme*Hauptrunde + factor(ID) + factor(Uhrzeit) + factor(Spieltag), data=Dataset_FB)
Now, my questions are as follows:
1. The summary looks far from correct, but I can't find my mistakes. What did I do wrong?
2. Is this the correct way to use fixed effects?
3. I know "Coefficients: (6 not defined because of singularities)" indicates a strong correlation between my independent variables. But since they are dummies, are they not always correlated?
The summary looks like this:
lm(formula = Rate_Percent ~ Teilnahme + Hauptrunde + Teilnahme *
Hauptrunde + factor(ID) + factor(Uhrzeit) + factor(Spieltag),
data = Dataset_FB)
Residuals:
Min 1Q Median 3Q Max
-0.2834 -0.0343 -0.0111 0.0092 4.9302
Coefficients: (6 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0266970 0.0125098 2.134 0.03288 *
Teilnahme 0.0020571 0.1662742 0.012 0.99013
Hauptrunde -0.0158433 0.0060631 -2.613 0.00900 **
factor(ID)8 -0.0344717 0.0171467 -2.010 0.04443 *
factor(ID)25 -0.0155100 0.1662745 -0.093 0.92568
factor(ID)56 0.0122209 0.0171467 0.713 0.47604
factor(ID)69 -0.0093248 0.1662745 -0.056 0.95528
factor(ID)90 -0.0037743 0.0171467 -0.220 0.82578
factor(ID)93 0.0948638 0.0171467 5.532 3.29e-08 ***
factor(ID)103 0.0117689 0.0171467 0.686 0.49251
factor(ID)115 0.0479442 0.0171467 2.796 0.00519 **
factor(ID)166 -0.0129542 0.0171467 -0.755 0.44998
factor(ID)364 -0.0112018 0.0171467 -0.653 0.51359
factor(ID)373 -0.0111296 0.0171467 -0.649 0.51631
factor(ID)490 -0.0231408 0.0171467 -1.350 0.17720
factor(ID)752 -0.0064241 0.0171467 -0.375 0.70793
factor(ID)907 0.1333400 0.0171467 7.776 8.75e-15 ***
factor(ID)951 0.0087327 0.0171467 0.509 0.61057
factor(ID)996 -0.0105943 0.0171467 -0.618 0.53669
factor(ID)1238 0.0076285 0.0171467 0.445 0.65641
factor(ID)1315 0.0304732 0.1662745 0.183 0.85459
factor(ID)1316 0.1290605 0.0171467 7.527 5.98e-14 ***
factor(ID)1400 0.0038137 0.0171467 0.222 0.82400
factor(ID)1401 -0.0135700 0.0171467 -0.791 0.42874
factor(ID)1712 -0.0001285 0.0171467 -0.007 0.99402
factor(ID)3417 0.0053766 0.0171467 0.314 0.75386
factor(ID)5646 0.0052521 0.0171467 0.306 0.75939
factor(ID)6273 -0.0134096 0.0171467 -0.782 0.43422
factor(ID)7679 -0.0104365 0.0171467 -0.609 0.54277
factor(ID)9029 NA NA NA NA
factor(ID)10213 -0.0441121 0.0171467 -2.573 0.01012 *
factor(ID)26957 -0.0287541 0.0171700 -1.675 0.09405 .
factor(ID)29988 0.1015109 0.1662745 0.611 0.54155
factor(ID)40373 0.0203831 0.0171467 1.189 0.23459
factor(Uhrzeit)1530 0.0206731 0.1653880 0.125 0.90053
factor(Uhrzeit)1830 NA NA NA NA
factor(Uhrzeit)2045 NA NA NA NA
factor(Spieltag)NA NA NA NA NA
factor(Spieltag)Sa NA NA NA NA
factor(Spieltag)So NA NA NA NA
Teilnahme:Hauptrunde 0.0053874 0.0085752 0.628 0.52987
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1649 on 5885 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.07278, Adjusted R-squared: 0.06742
F-statistic: 13.59 on 34 and 5885 DF, p-value: < 2.2e-16
[1]: https://i.stack.imgur.com/ZBAqL.png
1. The output is correct and you did nothing wrong per se, but there are more elegant ways to run the fixed effects regression.
2. Yes, although the fixed effects will not be consistently estimated in the model.
3. Here, the singularities mean that you have observations in ID, Uhrzeit and Spieltag where you have only one unique observation, so the model cannot estimate a coefficient for these.
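One common way such undefined coefficients arise in a model with unit dummies, which may or may not be the whole story for these data, is a regressor that is constant within each unit: it is then an exact linear combination of the unit dummies. The sketch below uses simulated data with made-up names (d, id, treated) to show lm reporting an NA coefficient in that situation.

```r
set.seed(7)

# Four units observed five times each (simulated, hypothetical names)
d <- data.frame(id   = factor(rep(1:4, each = 5)),
                time = rep(1:5, times = 4))

# `treated` never changes within a unit, so it equals the sum of the
# dummies for units 3 and 4 and is perfectly collinear with factor(id)
d$treated <- as.numeric(d$id %in% c("3", "4"))
d$y <- rnorm(nrow(d)) + d$treated

fit <- lm(y ~ treated + id, data = d)
coef(fit)               # one coefficient is reported as NA
sum(is.na(coef(fit)))   # number of coefficients lm could not define
fit$rank                # rank of the design matrix, below its column count
```

lm keeps the first set of linearly independent columns and drops the rest, so which coefficient becomes NA depends on the order of terms in the formula.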
I would suggest having a look into two packages:
plm, which is the standard for panel data models. I am not 100% sure if your data is a real panel (and whether you are actually estimating a diff-in-diff specification).
You would have something like:
library(plm)
pdata <- pdata.frame(Dataset_FB, index = c("ID", "Uhrzeit"))
plm(formula = Rate_Percent ~ Teilnahme + Hauptrunde + Teilnahme * Hauptrunde + factor(Spieltag),
    data = pdata, model = "within", effect = "twoways")
felm (from the lfe package) is a great and easy-to-use alternative, where you specify the factor variables after a | in the formula.
est <- felm(Rate_Percent ~ Teilnahme + Hauptrunde + Teilnahme *
Hauptrunde | ID + Uhrzeit + Spieltag, data = Dataset_FB)
As an explanation: these fixed-effects packages first transform your data (with a within-transformation), basically taking out the averages of the groups. Thanks to this, we don't actually need to estimate the coefficients for the fixed effects, as done in your code. So these other solutions are slightly neater and produce easier-to-read output, but numerically, there shouldn't be any difference.
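The equivalence described above can be checked in base R on a small simulated panel. This is only a sketch with made-up names (d, id, alpha, x_w, y_w) and a single set of unit effects; it is not how plm or felm are implemented internally. It compares the slope from the dummy-variable regression with the slope from a regression on group-demeaned data.

```r
set.seed(42)

n_id <- 6   # number of units (hypothetical)
n_t  <- 8   # observations per unit (hypothetical)

d <- data.frame(id = rep(seq_len(n_id), each = n_t))
d$alpha <- rep(rnorm(n_id, sd = 3), each = n_t)   # unit-specific effects
d$x <- rnorm(nrow(d)) + 0.5 * d$alpha             # regressor correlated with them
d$y <- d$alpha + 1.5 * d$x + rnorm(nrow(d), sd = 0.3)

# (a) Dummy-variable approach: estimate one coefficient per unit
b_dummy <- coef(lm(y ~ x + factor(id), data = d))["x"]

# (b) Within transformation: subtract each unit's mean, then regress
d$x_w <- d$x - ave(d$x, d$id)
d$y_w <- d$y - ave(d$y, d$id)
b_within <- coef(lm(y_w ~ x_w - 1, data = d))["x_w"]

all.equal(unname(b_dummy), unname(b_within))
```

The point estimates agree; the standard errors from (b) would still need a degrees-of-freedom correction for the absorbed unit means, which dedicated packages handle for you.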
I'm performing a latent class analysis using Mplus, and trying to get the output into R via the MplusAutomation package (since I'm doing this many times, I want to avoid copying by hand). I'd like to grab the "Results in Probability Scale" subsection in the "Model Results" section of the Mplus output, but I'm unable to find it in the R object MplusAutomation creates from the .out file. That object contains a "parameters" data frame which includes other information from the "Model Results" section, so is it a matter of "Results in Probability Scale" being a simple transformation of the other model results data, that I could do myself in R? If not, is there some other way of recreating the results of this section from what info I do have in R? Or is the information I'm looking for stored somewhere else in the output?
The "Results in Probability Scale"-section does not seem to be parsed by MplusAutomation.
However, you can convert the threshold parameters yourself to probability scale using the formula prob = 1 / (1 + exp(est)).
For example, the code below should reproduce the results in probability scale from this UCLA example:
library(dplyr)
library(tidyr)
library(MplusAutomation)
# Fetch & write output from UCLA LCA-example to temp file
lca_ex_out = tempfile(fileext = '.out')
fileConn <- file(lca_ex_out)
writeLines(readLines('https://stats.idre.ucla.edu/stat/mplus/dae/lca1.out'), fileConn)
close(fileConn)
lca_ex_result = readModels(lca_ex_out) # extract results from temp file
# select threshold parameters, convert to probability & lay out in table
lca_ex_result$parameters$unstandardized %>%
filter(paramHeader == 'Thresholds') %>%
mutate(est_probscale = 1 / (1 + exp(est))) %>%
select(param, LatentClass, est_probscale) %>%
spread(LatentClass, est_probscale)
Output:
param 1 2 3
1 ITEM1$1 0.908 0.312 0.923
2 ITEM2$1 0.337 0.164 0.546
3 ITEM3$1 0.067 0.036 0.426
4 ITEM4$1 0.065 0.056 0.418
5 ITEM5$1 0.219 0.044 0.765
6 ITEM6$1 0.320 0.183 0.471
7 ITEM7$1 0.113 0.098 0.512
8 ITEM8$1 0.140 0.110 0.619
9 ITEM9$1 0.325 0.188 0.349
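The conversion used above is the logistic function applied to the negated threshold, so it can also be written with base R's plogis(). The sketch below uses made-up threshold values (not taken from any Mplus output) to show that the two forms agree.

```r
# Hypothetical threshold estimates on the logit scale
est <- c(-2.3, 0.8, 2.6)

# Formula from the answer
p_manual <- 1 / (1 + exp(est))

# Same quantity via the logistic distribution function in base R
p_plogis <- plogis(-est)

all.equal(p_manual, p_plogis)
round(p_manual, 3)
```

A large negative threshold maps to a probability near 1 and a large positive threshold to a probability near 0.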
I am using the eRm package to estimate a Rasch model. The RM() function returns a Rasch model that I can summarize using the summary() function. However, when I try to store the results, R creates an empty object.
library(eRm)
my_data <- matrix(sample(0:1, 100, replace = TRUE), nrow = 10)
my_model <- RM(X = my_data)
summary(my_model)
my_summary <- summary(my_model)
Why does this operation not work in this case but does work when storing the summary of a linear model? Is there another way to store the summary of the eRm model?
As @Imo surmised, it looks like summary.eRm just prints to the console, rather than returning an object. You can inspect the code for summary.eRm by running getAnywhere(summary.eRm). summary is a "generic" function, meaning that what it does depends on what "method" is called when the function is invoked.
For an lm model object, when you type summary(my_model), the summary.lm function is dispatched. But when you type summary(my_model) and my_model is an eRm object, the summary.eRm method is dispatched. summary.lm returns an object, but summary.eRm just prints to the console. Run methods(summary) to see the various summary functions that get dispatched for different types of objects.
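The difference between a method that prints and a method that returns can be reproduced with two toy S3 classes. Everything here (the class names printer_demo and returner_demo and their summary methods) is invented for illustration and has nothing to do with eRm's internals.

```r
# A summary method that only prints, similar in spirit to what is described above
summary.printer_demo <- function(object, ...) {
  cat("value:", object$value, "\n")
  invisible(NULL)   # nothing useful is returned
}

# A summary method that builds and returns an object, as summary.lm does
summary.returner_demo <- function(object, ...) {
  structure(list(value = object$value), class = "summary.returner_demo")
}

a <- structure(list(value = 1), class = "printer_demo")
b <- structure(list(value = 2), class = "returner_demo")

sa <- summary(a)   # text appears on the console, but sa is NULL
sb <- summary(b)   # nothing is printed, sb holds the result

is.null(sa)
sb$value
```

Both calls go through the same generic summary(); only the dispatched method decides whether there is anything to store.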
A workaround would be to create your own summary object (or a function to create such an object), using the model object itself. You can inspect the components of the model object with str(my_model). You can also look at the code for summary.eRm to see where it is getting each of the components that it prints to the console.
Here's a simple example, lifting code from summary.eRm to create a summary function:
RMsmry = function(obj) {
cols = c("Estimate", "Std. Error", "lower CI", "upper CI")
# Create difficulty summary
ci = confint(obj, "eta")
tbl1 = as.data.frame(cbind(round(obj$etapar, 3),
round(obj$se.eta, 3), round(ci, 3)))
names(tbl1) = cols
# Create easiness summary
ci <- confint(obj, "beta")
tbl2 = as.data.frame(cbind(round(obj$betapar, 3),
round(obj$se.beta, 3), round(ci, 3)))
names(tbl2) = cols
return(list(Difficulty=tbl1, Easiness=tbl2))
}
my_summary = RMsmry(my_model)
my_summary
$Difficulty
Estimate Std. Error lower CI upper CI
I2 -1.191 0.658 -2.480 0.098
I3 -1.191 0.658 -2.480 0.098
I4 0.078 0.627 -1.150 1.306
I5 -0.750 0.623 -1.971 0.471
I6 0.078 0.627 -1.150 1.306
I7 1.079 0.748 -0.386 2.544
I8 -0.339 0.614 -1.543 0.865
I9 0.078 0.627 -1.150 1.306
I10 1.079 0.748 -0.386 2.544
$Easiness
Estimate Std. Error lower CI upper CI
beta I1 -1.079 0.748 -2.544 0.386
beta I2 1.191 0.658 -0.098 2.480
beta I3 1.191 0.658 -0.098 2.480
beta I4 -0.078 0.627 -1.306 1.150
beta I5 0.750 0.623 -0.471 1.971
beta I6 -0.078 0.627 -1.306 1.150
beta I7 -1.079 0.748 -2.544 0.386
beta I8 0.339 0.614 -0.865 1.543
beta I9 -0.078 0.627 -1.306 1.150
beta I10 -1.079 0.748 -2.544 0.386
Below, I compare the results from an R-function with my own code. The algorithm simply consists of maximising a function of many parameters (here, 19). My code defines the function and uses nlm for optimisation. Fortunately, both return the same result. However, the R-function is amazingly quick. I therefore suspect I can do better than using nlm (or a similar optimisation routine in R). Any idea?
Here is some survival data that can be fitted with a Cox model. To do so, one needs to maximise the partial log-likelihood (3rd equation in the wikipedia link).
In R, this can be done with coxph() (part of the survival package):
> library(survival)
> fmla <- as.formula(paste("Surv(time, event) ~ ",
+ paste(names(data)[-(1:3)], collapse=" +")))
> mod <- coxph(formula=fmla, data=data)
> round(mod$coef, 3)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
-0.246 -0.760 0.089 -0.033 -0.138 -0.051 -0.484 -0.537 -0.620 -0.446 -0.204 -0.112 -0.089 -0.451 0.043
x16 x17 x18 x19
0.106 -0.015 -0.245 -0.653
This can be checked by explicitly writing the partial log-likelihood and by using some numerical optimisation routine. Here is some crude code which does this job.
The code has been edited based on the comments I received
> #------ minus partial log-lik ------
> Mpll <- function(beta, data)
+ #!!!data must be ordered by increasing time!!!
+ #--> data <- data[order(data$time), ]
+ {
+ #preparation
+ N <- nrow(data)
+ linpred <- as.matrix(data[, -(1:3)]) %*% beta
+
+ #pll
+ pll <- sum(sapply(X=which(data$event == 1), FUN=function(j)
+ linpred[j] - log(sum(exp(linpred[j:N])))))
+
+ #output
+ return(- pll)
+ }
> #-----------------------------------
>
> data <- data[order(data$time), ]
> round(nlm(f=Mpll, p=rep(0, 19), data=data)$estimate, 3)
[1] -0.246 -0.760 0.089 -0.033 -0.138 -0.051 -0.484 -0.537 -0.620 -0.446 -0.204 -0.112 -0.089 -0.451
[15] 0.043 0.106 -0.015 -0.245 -0.653
OK, it works... but it is much much slower!
Does anyone have an idea on what is done within coxph() to make it so fast?
Here is a vectorized version of your code.
Mpll2 <- function(beta, data) {
X <- as.matrix(data[, -(1:3)])
a <- X %*% beta
b <- log(rev(cumsum(rev(exp(a)))))
-sum((a - b)[data$event==1])
}
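The key step in the vectorized version is the reversed cumulative sum, which yields for every row j the sum of exp(linpred) over rows j to N in a single pass. The following sketch checks that identity on a made-up vector (a, N, loop_version and vec_version are illustrative names); it assumes, as the original code does, that rows are sorted by increasing time.

```r
set.seed(3)

a <- rnorm(10)   # stand-in for the linear predictor, already ordered by time
N <- length(a)

# Loop form, as in the original function: one sum per row
loop_version <- sapply(seq_len(N), function(j) log(sum(exp(a[j:N]))))

# Vectorized form: reverse, accumulate, reverse back
vec_version <- log(rev(cumsum(rev(exp(a)))))

all.equal(loop_version, vec_version)
```

The loop does on the order of N^2 additions per likelihood evaluation, while the cumulative sum does on the order of N, which is where most of the speed-up comes from.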
And here is a simple test of the run times.
data <- data[order(data$time), ] # No reason to order every time
# Yours
system.time(round(nlm(f=Mpll, p=rep(0, 19), data=data)$estimate, 3))
# user system elapsed
# 2.77 0.01 2.79
# Vectorized
system.time(round(nlm(f=Mpll2, p=rep(0, 19), data=data)$estimate, 3))
# user system elapsed
# 0.28 0.00 0.28
# Optimized C code
fmla <- as.formula(paste("Surv(time, event) ~ ",
paste(names(data)[-(1:3)], collapse=" +")))
system.time(round(coxph(formula=fmla, data=data)$coef,3))
# user system elapsed
# 0.02 0.00 0.03
So, about an order of magnitude difference between each type. C is very fast, and you are never going to approach those speeds in R. But C is harder to write.
Using R, I am running a logistic model and need to include an interaction term in the following fashion, where A is categorical, and B, continuous.
Y ~ A + B + normalized(B):A
My problem is that when I do so, the reference category is not the same as in
Y ~ A + B + A:B
which makes comparison of the models difficult. I am sure there is a way to force the reference category to be the same all the time, but can't seem to find a straightforward answer.
To illustrate, my data looks like this:
income ndvi sga
30,000$ - 49,999$ -0,141177617 0
30,000$ - 49,999$ -0,170513257 0
>80,000$ -0,054939323 1
>80,000$ -0,14724104 0
>80,000$ -0,207678157 0
missing -0,229890869 1
50,000$ - 79,999$ 0,245063253 0
50,000$ - 79,999$ 0,127565529 0
15,000$ - 29,999$ -0,145778357 0
15,000$ - 29,999$ -0,170944338 0
30,000$ - 49,999$ -0,121060635 0
30,000$ - 49,999$ -0,245407291 0
missing -0,156427532 0
>80,000$ 0,033541238 0
And the outputs are reproduced below. The first set of results is from the model Y ~ A*B, and the second from Y ~ A + B + A:normalized(B).
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.72175 0.29806 -9.132 <2e-16 ***
ndvi 2.78106 2.16531 1.284 0.1990
income15,000$ - 29,999$ -0.53539 0.46211 -1.159 0.2466
income30,000$ - 49,999$ -0.68254 0.39479 -1.729 0.0838 .
income50,000$ - 79,999$ -0.13429 0.33097 -0.406 0.6849
income>80,000$ -0.56692 0.35144 -1.613 0.1067
incomemissing -0.85257 0.47230 -1.805 0.0711 .
ndvi:income15,000$ - 29,999$ -2.27703 3.25433 -0.700 0.4841
ndvi:income30,000$ - 49,999$ -3.76892 2.86099 -1.317 0.1877
ndvi:income50,000$ - 79,999$ -0.07278 2.46483 -0.030 0.9764
ndvi:income>80,000$ -3.32489 2.62000 -1.269 0.2044
ndvi:incomemissing -3.98098 3.35447 -1.187 0.2353
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.07421 0.30680 -10.020 <2e-16 ***
ndvi -1.19992 2.56201 -0.468 0.640
income15,000$ - 29,999$ -0.33379 0.29920 -1.116 0.265
income30,000$ - 49,999$ -0.34885 0.26666 -1.308 0.191
income50,000$ - 79,999$ -0.12784 0.25124 -0.509 0.611
income>80,000$ -0.27255 0.27288 -0.999 0.318
incomemissing -0.50010 0.31299 -1.598 0.110
income<15,000$:normalize(ndvi) 0.40515 0.34139 1.187 0.235
income15,000$ - 29,999$:normalize(ndvi) 0.17341 0.35933 0.483 0.629
income30,000$ - 49,999$:normalize(ndvi) 0.02158 0.32280 0.067 0.947
income50,000$ - 79,999$:normalize(ndvi) 0.39774 0.28697 1.386 0.166
income>80,000$:normalize(ndvi) 0.06677 0.30087 0.222 0.824
incomemissing:normalize(ndvi) NA NA NA NA
So in the first model, the category "income<15,000" is the reference category, whereas in the second, something different happens, which I'm not at all clear about yet.
Let's say that we would like to perform a regression on this equation.
We tried to implement it using model.matrix, but there is some automation problem, illustrated in the results below. Is there a better way to implement it? To be more specific, let's say that X_1 is a continuous variable, while X_2 is a dummy.
Basically the interpretation of the interaction term would be the same, except that the main term X_2 would be evaluated when X_1 is at its mean. (see Early draft of this Paper)
Here are some data to illustrate my point (it's not a glm, but we can apply the same method to a glm):
library(car)
str(Prestige)
# some data cleaning
Prestige <- Prestige[!is.na(Prestige$type),]
# interaction the usual way.
lm1 <- lm(income ~ education+ type + education:type, data = Prestige); summary(lm1)
# interacting with demeaned education
Prestige$education_ <- Prestige$education-mean(Prestige$education)
When using the regular formula method, things do not turn out the way we want, as formula does not set any level as the reference:
lm2 <- lm(income ~ education+ type + education_:type, data = Prestige); summary(lm2)
# Using model.matrix to shape the interaction
cusInt <- model.matrix(~-1+education_:type,data=Prestige)[,-1];colnames(cusInt)
lm3 <- lm(income ~ education+ type + cusInt, data = Prestige); summary(lm3)
compareCoefs(lm1,lm3,lm2)
The results are here:
                           Est. 1  SE 1  Est. 2  SE 2  Est. 3  SE 3
(Intercept)                 -1865  3682   -1865  3682    4280  8392
education                     866   436     866   436     297   770
typeprof                    -3068  7192    -542  1950    -542  1950
typewc                       3646  9274   -2498  1377   -2498  1377
education:typeprof            234   617
education:typewc             -569   885
cusInteducation_:typeprof                   234   617
cusInteducation_:typewc                    -569   885
typebc:education_                                         569   885
typeprof:education_                                       803   885
typewc:education_
So basically, when using model.matrix we have to intervene to set the reference level. Besides, a cusInt prefix appears in front of the variable names, so formatting the results when one has a lot of tables to compare is quite tedious.
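The interpretation stated earlier (same interaction coefficients, with the factor's main effects evaluated at the mean of the continuous variable) can be checked on simulated data. This is a sketch under assumptions, not a general solution: the data and the names d, g, xc, xc_gb and xc_gc are invented, and the interaction columns are still built by hand with model.matrix. Adding them to the data frame under explicit names is one possible way to avoid the prefix in the coefficient labels.

```r
set.seed(11)
n <- 90

# Simulated data: continuous x, three-level factor g (hypothetical names)
d <- data.frame(x = rnorm(n, mean = 10, sd = 2),
                g = factor(rep(c("a", "b", "c"), each = n / 3)))
shift <- c(a = 0, b = 2,   c = -1)[as.character(d$g)]
slope <- c(a = 0, b = 0.3, c = -0.4)[as.character(d$g)]
d$y  <- unname(1 + 0.5 * d$x + shift + slope * d$x + rnorm(n))
d$xc <- d$x - mean(d$x)   # demeaned copy of x

# Usual interaction
m1 <- lm(y ~ x + g + x:g, data = d)

# Demeaned interaction: build the columns, drop the reference level "a",
# and store them in the data frame under plain names
int <- model.matrix(~ -1 + xc:g, data = d)[, -1]
colnames(int) <- c("xc_gb", "xc_gc")
d <- cbind(d, int)
m2 <- lm(y ~ x + g + xc_gb + xc_gc, data = d)

# Interaction slopes are unchanged ...
all.equal(unname(coef(m2)[c("xc_gb", "xc_gc")]),
          unname(coef(m1)[c("x:gb", "x:gc")]))
# ... and the main effects of g are now evaluated at mean(x)
all.equal(unname(coef(m2)["gb"]),
          unname(coef(m1)["gb"] + coef(m1)["x:gb"] * mean(d$x)))
```

Both models span the same column space, so fitted values are identical; only the parameterization of the factor's main effects changes.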