Variable lengths differ in R using lm
I get a nasty surprise when running lm in R:
variable lengths differ (found for 'returnsandp')
I run the following model:
# regress apple price return on s&p price return
attach(NewSetSlide.ex)
resultr = lm(returnapple ~ returnsandp)
summary(resultr)
It cannot get any simpler than that, but for some reason I get the error above.
I checked that the lengths of returnapple and returnsandp are exactly the same. So what on earth is going on, please?
The data.frame in question:
NewSetSlide.ex <- structure(list(returnapple = c(0.1412251, 0.07665801, 0.02560235,
0.09638143, 0.06384145, 0.05163189, -0.1076969, 0.05121892, 0.06428114,
0.09939652, 0.07271771, 0.06923432, 0.02873109, 0.0721757, -0.0121841,
0.07196034, 0.1012038, -0.06786657, 0.06142434, 0.09644931, -0.02754909,
0.005786519, 0.04099078, -0.03320592, -0.03292676, -0.06908485,
-0.01878077, 0.08340874, -0.01004186, -0.1064195, -0.07524236,
-0.006677446, 0.133327, -0.139921, 0.06528701, -0.036831, 0.09006266,
0.01813659, 0.07127628, 0.004334296, -0.02659846, 0.05333548,
0.04774654, 0.1288835, 0.05323629, -0.00006978558, 0.0634182,
-0.0533224, 0.03270362, 0.1026693, -0.05655361, 0.09680779, 0.01662336,
-0.01170586, -0.01063646, 0.0638476, -0.0542103, -0.01501973,
0.1307637, -0.005598485, 0.02798327, 0.1962269, 0.006725292,
0), returnsandp = c(0.1159772758, 0.007614392, 0.1104467964,
0.0359706698, 0.0152313579, 0.0331342721, 0.0189951476, 0.0330947526,
0.0749868297, -0.0124064592, 0.0323295771, -0.0303030364, 0.0113188732,
0.0101582303, -0.0151743475, 0.0174258083, -0.0088341409, -0.0092159647,
-0.0388593467, 0.0134979946, 0.0054655738, -0.05935645, 0.0174692125,
-0.0164511628, 0.1063320628, -0.0034796438, -0.0000602649, -0.0151122528,
0.0223743915, 0.0740851449, 0.0086287811, -0.0028700134, -0.0045942764,
0.0540510532, 0.0121340172, -0.0048475787, -0.0119945162, -0.034724078,
0.0425088143, 0.0650615875, 0.0450610926, 0.0023665278, 0.0714892769,
0.052793919, -0.0141481377, 0.0502292875, 0.0141095206, -0.0586828306,
0.071192607, -0.0854386059, 0.05472933, 0.0214771911, -0.0282882713,
0.1317668962, 0.0369236189, 0.0263898652, -0.0114502121, 0.0060341972,
0.0479144906, 0.0482236974, 0.0349588397, -0.0241661652, -0.2176304161,
-0.0853488645)), class = "data.frame", row.names = c(NA, -64L))
Based on @Dave2e's comment:
It is better to pass the data = NewSetSlide.ex argument in the lm() call to avoid naming conflicts, and you should avoid using attach() altogether. Please see below (the NewSetSlide.ex data frame is taken from the question above):
resultr <- lm(returnapple ~ returnsandp, data = NewSetSlide.ex)
summary(resultr)
Output:
Call:
lm(formula = returnapple ~ returnsandp, data = NewSetSlide.ex)
Residuals:
         Min           1Q       Median           3Q          Max
-0.166599181 -0.041838291  0.003778841  0.044034591  0.166774667
Coefficients:
                Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  0.028595156  0.008677931  3.29516 0.0016294 **
returnsandp -0.035466006  0.160976847 -0.22032 0.8263478
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0672294 on 62 degrees of freedom
Multiple R-squared: 0.0007822871, Adjusted R-squared: -0.01533413
F-statistic: 0.04853977 on 1 and 62 DF, p-value: 0.8263478
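The question does not show the workspace, so the following is only a sketch of one likely cause, under the assumption that a leftover object named like one of the columns exists in the global environment. The toy data frame df and the vector x are hypothetical stand-ins for NewSetSlide.ex and returnsandp. Attached data frames sit behind the global environment on the search path, so the leftover object masks the column:

```r
# Hypothetical toy data standing in for NewSetSlide.ex
df <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 5.1), x = c(1, 2, 3, 4, 5))

# Assumed leftover workspace object: same name as a column, different length
x <- c(10, 20, 30)

attach(df)
# y is only found in the attached df, but x is found first in the
# environment where it was just defined, so the lengths are 5 and 3
msg <- tryCatch(
  {
    lm(y ~ x)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
detach(df)
print(msg)

# With data =, both variables are looked up in df before anything else
fit_ok <- lm(y ~ x, data = df)
print(length(fitted(fit_ok)))
```

Calling ls() in the session from the question would show whether such an object exists. Passing data = sidesteps the problem either way.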
Related
How to get adjusted dependent variable
I am working on adjusting urine sodium by urine creatinine and age in order to use the adjusted variable in further analysis. How do I create a new variable with the adjusted version of the data? Do I divide NA24 by creatinine and age? Do I multiply them? Please help.
I ran a linear model as follows, but I am not sure what to do with the information:
Call:
lm(formula = PRENA24 ~ PRECR24mmol * PREALD, data = c1.3)
Residuals:
     Min       1Q   Median       3Q      Max
-228.439  -43.024   -5.215   37.790  274.414
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        66.84482   29.60684   2.258  0.02423 *
PRECR24mmol         7.00565    2.10989   3.320  0.00094 ***
PREALD             -0.66555    0.60912  -1.093  0.27488
PRECR24mmol:PREALD  0.06335    0.04392   1.442  0.14962
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 65.94 on 798 degrees of freedom
Multiple R-squared: 0.2963, Adjusted R-squared: 0.2937
F-statistic: 112 on 3 and 798 DF, p-value: < 2.2e-16
I need to adjust the PRENA24 value and I want to make a new column with these values (i.e. PRENA24.ADJ). I know the following is incorrect, but I am not sure what else to do with the information from the linear model. The post lab data is separated by treatment type as well.
c1 <- c1.3 %>% mutate(PRENA24.ADJ = (PRENA24-66.84482+(7.00565*PRECR24mmol)+(-0.66555*PREALD)))
c2 <- c1 %>% mutate(NA24.ADJ = (NA24-24.59443+(10.54905*CR24mmol)+(0.58894*ALD)))
Changing significance notation in R
R has certain significance codes to determine statistical significance. In the example below, a dot . indicates significance at the 10% level (see sample output below). Dots can be very hard to see, especially when I copy-paste to Excel and display it in Times New Roman. I'd like to change it such that:
* = significant at 10%
** = significant at 5%
*** = significant at 1%
Is there a way I can do this?
> y = c(1,2,3,4,5,6,7,8)
> x = c(1,3,2,4,5,6,8,7)
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-1.0714 -0.3333  0.0000  0.2738  1.1191
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2143     0.6286   0.341  0.74480
x             0.9524     0.1245   7.651  0.00026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8067 on 6 degrees of freedom
Multiple R-squared: 0.907, Adjusted R-squared: 0.8915
F-statistic: 58.54 on 1 and 6 DF, p-value: 0.0002604
You can create your own formatting function with
mystarformat <- function(x) symnum(x, corr = FALSE, na = FALSE,
    cutpoints = c(0, 0.01, 0.05, 0.1, 1),
    symbols = c("***", "**", "*", " "))
And you can write your own coefficient formatter
show_coef <- function(mm) {
    mycoef <- data.frame(coef(summary(mm)), check.names = FALSE)
    mycoef$signif = mystarformat(mycoef$`Pr(>|t|)`)
    mycoef$`Pr(>|t|)` = format.pval(mycoef$`Pr(>|t|)`)
    mycoef
}
And then with your model, you can run it with
mm <- lm(y~x)
show_coef(mm)
#              Estimate Std. Error   t value  Pr(>|t|) signif
# (Intercept) 0.2142857  0.6285895 0.3408993 0.7447995
# x           0.9523810  0.1244793 7.6509206 0.0002604    ***
One should be aware that the stargazer package reports significance levels on a different scale than other statistical software such as Stata. In R (stargazer) you get
# (* p<0.1; ** p<0.05; *** p<0.01)
whereas in Stata you get
# (* p<0.05, ** p<0.01, *** p<0.001)
This means that a result marked with one * in stargazer output may not count as significant for a Stata user.
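To make the difference between the two scales concrete, here is a small sketch using base R's symnum() only. The p-values and the helper name star_labels are made up for illustration. The cutpoints are the two conventions quoted above:

```r
# Hypothetical p-values chosen to fall into different bands
p <- c(0.0005, 0.007, 0.03, 0.07, 0.5)

# Illustrative helper (not part of any package): map p-values to stars
star_labels <- function(p, cutpoints) {
  as.character(symnum(p, corr = FALSE, na = FALSE,
                      cutpoints = cutpoints,
                      symbols = c("***", "**", "*", " ")))
}

comparison <- data.frame(
  p = p,
  stargazer_scale = star_labels(p, c(0, 0.01, 0.05, 0.1, 1)),
  stata_scale = star_labels(p, c(0, 0.001, 0.01, 0.05, 1))
)
print(comparison)
```

A p-value of 0.07 gets one star on the first scale and none on the second. As far as I know, stargazer() also accepts a star.cutoffs argument (for example star.cutoffs = c(0.05, 0.01, 0.001)) to switch to the stricter convention, but check the package documentation before relying on it.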
Sorry for the late response, but I found a great solution to this. Just do the following:
install.packages("stargazer")
library(stargazer)
stargazer(your_regression, type = "text")
This displays everything in a beautiful way with your desired format.
Note: If you leave type = "text" out, then you'll get the LaTeX code.
How to extract p-values from sumurca object?
I'd like to extract the p-values from the summary output of ur.za in package urca.
library(urca)
data(nporg)
gnp <- na.omit(nporg[, "gnp.r"])
za.gnp <- ur.za(gnp, model="both", lag=2)
summary(za.gnp)
> summary(za.gnp)
################################
# Zivot-Andrews Unit Root Test #
################################
Call:
lm(formula = testmat)
Residuals:
    Min      1Q  Median      3Q     Max
-39.753  -9.413   2.138   9.934  22.977
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.49068   10.25301   2.096  0.04096 *
y.l1          0.77341    0.05896  13.118  < 2e-16 ***
trend         1.19804    0.66346   1.806  0.07675 .
y.dl1         0.39699    0.12608   3.149  0.00272 **
y.dl2         0.10503    0.13401   0.784  0.43676
du          -25.44710    9.20734  -2.764  0.00788 **
dt            2.11456    0.84179   2.512  0.01515 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.72 on 52 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9948, Adjusted R-squared: 0.9942
F-statistic: 1651 on 6 and 52 DF, p-value: < 2.2e-16
Teststatistic: -3.8431
Critical values: 0.01= -5.57 0.05= -5.08 0.1= -4.82
Potential break point at position: 21
All methods I found for lm summary objects don't seem to work here. And I've spent quite some time searching through str(summary(za.gnp)) to no avail. Any hints on where to look?
Objects of class ur.za are S4 objects, which behave differently from S3 objects like those produced by lm. One difference is the concept of the slot, accessed via the @ operator. summary(za.gnp) has a pval slot, but its value is NULL.
summary(za.gnp)@pval
NULL
However, it also has a testreg slot which contains an lm object with the test results that you can use to obtain the p-values in the usual way:
coef(summary(summary(za.gnp)@testreg))[,"Pr(>|t|)"]
 (Intercept)         y.l1        trend        y.dl1        y.dl2           du
4.096351e-02 4.007914e-18 7.674887e-02 2.716223e-03 4.367588e-01 7.884201e-03
          dt
1.514797e-02
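For readers without urca installed, the same slot mechanics can be tried on a toy S4 class. This is only a sketch: the class ToyTest and its contents are invented for illustration and merely mimic the testreg and pval slots described above. slotNames() is the general way to discover which slots an S4 object has:

```r
library(methods)

# Hypothetical S4 class that stores an lm fit in a slot, like the urca summary
setClass("ToyTest", slots = c(testreg = "ANY", pval = "ANY"))

obj <- new("ToyTest",
           testreg = lm(mpg ~ wt, data = mtcars),
           pval = NULL)

print(isS4(obj))       # S4 objects use @, not $, to reach their slots
print(slotNames(obj))  # lists the available slots
print(obj@pval)        # the slot exists but holds NULL

# The lm object inside the slot supports the usual S3 tools
pvals <- coef(summary(obj@testreg))[, "Pr(>|t|)"]
print(pvals)
```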
Modify an R function to add extra output?
I would like the linear model regression command "lm()" to also report information about the confidence interval. What file should I modify to get this? At worst I would need to recompile something, but I hope I could compile only a single file. What should I do? Another option would be to create a script that gets launched at startup and overwrites the regular behaviour of lm. How?
What you can use is something called a function operator. A function operator takes a function as input, adds a bit of functionality and returns a function. For example, to create a version of lm that always reports the summary:
tweak_lm = function(modify_function) {
    function(...) {
        result = lm(...)
        print(modify_function(result))
        result
    }
}
summarized_lm = tweak_lm(summary)
lm_res = summarized_lm(mpg ~ wt, mtcars)
Call:
lm(formula = ..1, data = ..2)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
> lm_res
Call:
lm(formula = ..1, data = ..2)
Coefficients:
(Intercept)           wt
     37.285       -5.344
Using this approach enables you to create other variants of this:
coef_lm = tweak_lm(coef)
lm_res = coef_lm(mpg ~ wt, mtcars)
(Intercept)          wt
  37.285126   -5.344472
It is not completely clear what you need, but you can use this approach.
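Since the question asks specifically about confidence intervals, the same function operator can be combined with confint() from the stats package. This is a sketch along the lines of the answer above. The name confint_lm is made up here, and tweak_lm is repeated only so that the snippet runs on its own:

```r
# Same function operator as in the answer, repeated for self-containment
tweak_lm <- function(modify_function) {
  function(...) {
    result <- lm(...)
    print(modify_function(result))
    result
  }
}

# Illustrative variant: fit the model and print 95% confidence intervals
confint_lm <- tweak_lm(confint)
lm_res <- confint_lm(mpg ~ wt, data = mtcars)

# The returned value is still an ordinary lm object
ci <- confint(lm_res)
print(class(lm_res))
```

This leaves the original lm() untouched, so nothing has to be recompiled or overwritten at startup. Other code that calls lm() keeps its usual behaviour.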
R's capture.output() behaves differently when file=NULL vs. file=[file name]
When using capture.output(..., file = NULL) followed by a specification of what line you want captured, then only that line is captured:
capture.output(summary(lm(speed ~ dist, data = cars)), file = NULL)[5]
[1] "Residuals:"
But when a file name is specified, it will capture the entire object:
capture.output(summary(lm(speed ~ dist, data = cars)), file = "Results.txt")[5]
NULL
The content of Results.txt:
Call:
lm(formula = speed ~ dist, data = cars)
Residuals:
    Min      1Q  Median      3Q     Max
-7.5293 -2.1550  0.3615  2.4377  6.4179
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
How can I make R and/or capture.output only write the line I want to a file (in this toy example line no. 5)?
I'm afraid you can't do so within capture.output(), but you can simply write the part of capture.output()'s output that you want to a file using, for example, cat():
cat(capture.output(summary(lm(speed ~ dist, data = cars)))[5], file="Results.txt")
The side effect of writing a file happens before the extraction "[" operation takes place when there is a file argument. So you need to write the value after it gets returned to the console/global environment:
cat(capture.output(summary(lm(speed ~ dist, data = cars)), file = NULL)[5],
    file = "test.txt")
It would be pretty easy to wrap this into a function if you will be needing it repeatedly.
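Such a wrapper might look like the sketch below. The function name write_captured_lines and its arguments are invented for illustration. It uses writeLines() instead of cat() so that each selected line ends with a newline, and it writes to a temporary file rather than to Results.txt:

```r
# Illustrative helper: capture printed output, keep selected lines, write them
write_captured_lines <- function(expr, lines, file) {
  out <- capture.output(expr)     # captures what printing expr would show
  writeLines(out[lines], con = file)
  invisible(out[lines])           # return the written lines without printing
}

tmp <- tempfile(fileext = ".txt")
written <- write_captured_lines(summary(lm(speed ~ dist, data = cars)), 5, tmp)
print(readLines(tmp))
```

Because lines is used as an ordinary index, a vector such as 5:7 writes several lines in one call.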