R Linear Regression Data in Single Column
I have the following data as an example:
InputName InputValue Output
===================================
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
I want to do a linear regression of InputValue vs. Output, treating the different InputName values as separate predictors.
If I want to use lm(Output ~ Oxide + CD) in R, it assumes a separate column for each predictor. In the example above that would mean making a separate column each for Oxide and CD. I can do that by reshaping to wide format (e.g. with dcast from the reshape2 package), but that might introduce NAs in the data.
However, is there a way to directly tell the lm function that the input predictors are grouped according to the column InputName, with the values given in the column InputValue?
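For reference, here is a minimal sketch of the reshape the question describes, using base R's reshape() instead of cast (the per-group row id is my own addition, since the rows carry no natural key); it shows exactly where the NAs come from:

dat <- data.frame(
  InputName  = c(rep("Oxide", 4), rep("CD", 3)),
  InputValue = c(35, 35.2, 34.6, 35.9, 0.5, 0.48, 0.56),
  Output     = c(0.4, 0.42, 0.38, 0.46, 0.42, 0.4, 0.429)
)
# Number the rows within each group so reshape() has an id to pivot on.
dat$id <- ave(seq_along(dat$InputName), dat$InputName, FUN = seq_along)
reshape(dat, idvar = "id", timevar = "InputName", direction = "wide")
# Oxide has 4 rows but CD only 3, so the 4th CD cell is padded with NA.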
It seems to me you are describing a form of dummy variable coding. You don't need to do this manually in R, since any factor column in your data will automatically be dummy coded for you when it enters a model formula.
Recreate your data:
dat <- read.table(text="
InputName InputValue Output
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
", header=TRUE)
Now build the model you described, but drop the intercept to make things a little bit more explicit:
fit <- lm(Output ~ InputValue + InputName - 1, dat)
summary(fit)
Call:
lm(formula = Output ~ InputValue + InputName - 1, data = dat)
Residuals:
1 2 3 4 5 6 7
-0.003885 0.003412 0.001519 -0.001046 0.004513 -0.014216 0.009703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
InputValue 0.063512 0.009864 6.439 0.00299 **
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.009311 on 4 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9995
F-statistic: 4662 on 3 and 4 DF, p-value: 1.533e-07
Notice how all of your factor levels for InputName appear in the output: with the intercept dropped, each level gets its own intercept estimate, while the InputValue slope is shared across levels.
Concisely, the information you need is in these two lines:
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
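If you'd rather pull those rows out programmatically than read them off the printout, the summary's coefficient table can be indexed by name (a small convenience sketch):

coef(summary(fit))[c("InputNameCD", "InputNameOxide"), ]
# Returns the Estimate / Std. Error / t value / Pr(>|t|) rows for the two levels.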
Here are two ways of doing this: split the data and run the regressions separately, or use interaction terms to specify that the different levels of InputName should have separate slopes:
Split
lapply(split(dat,dat$InputName),lm,formula=Output~InputValue)
$CD
Call:
FUN(formula = ..1, data = X[[1L]])
Coefficients:
(Intercept) InputValue
0.2554 0.3135
$Oxide
Call:
FUN(formula = ..1, data = X[[2L]])
Coefficients:
(Intercept) InputValue
-1.78468 0.06254
Interaction
lm(Output~InputName + InputName:InputValue - 1,dat)
Call:
lm(formula = Output ~ InputName + InputName:InputValue - 1, data = dat)
Coefficients:
InputNameCD InputNameOxide InputNameCD:InputValue InputNameOxide:InputValue
0.25542 -1.78468 0.31346 0.06254
For comparison purposes I've also removed the intercept. Note that the estimated coefficients are the same in each case.
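A quick sketch (not in the original answer) to confirm that equivalence programmatically:

# Intercept and slope per group from the split fits...
sapply(split(dat, dat$InputName), function(d) coef(lm(Output ~ InputValue, d)))
# ...line up with the corresponding terms of the interaction fit:
coef(lm(Output ~ InputName + InputName:InputValue - 1, dat))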
Related
Visreg Plotting log-log as log-level
I want to run a regression of money spent on links clicked using a data set where I notice link clicks level off after a certain amount of money spent. I want to use a log transformation to better fit this leveling-off data. My data set looks like this:

link.clicks
[1] 34 60 54 49 63 100
MoneySpent
[1] 10.97 21.81 20.64 21.42 48.03 127.30

I want to predict the % change in link.clicks from a $1 increase in MoneySpent. My regression model is:

regClicksLogLevel <- lm(log(link.clicks) ~ MoneySpent, data = TwtrData)
summary(regClicksLogLevel)
visreg(regClicksLogLevel)

However, the graph visreg generates looks like this:
[1]: https://i.stack.imgur.com/eZqVG.png

When I change my regression to:

regClicksLogLog <- lm(log(link.clicks) ~ log(MoneySpent), data = TwtrData)
summary(regClicksLogLog)
visreg(regClicksLogLog)

I actually get the fitted line I'm looking for:
[2]: https://i.stack.imgur.com/MexwC.png

I'm confused because I'm not trying to predict a % change in link.clicks from a % change in MoneySpent. I'm trying to predict a % change in link.clicks from a $1 unit change in MoneySpent. Why can't I generate the 2nd graph using my first regression, regClicksLogLevel?
I guess that's what you are looking for:

library(tidyverse)
TwtrData = tibble(
  link.clicks = c(34,60,54,49,63,100),
  MoneySpent = c(10.97,21.81,20.64,21.42,48.03,127.30)
) %>%
  mutate(
    perc.link.clicks = lag(link.clicks, default = 0)/link.clicks,
    perc.MoneySpent = lag(MoneySpent, default = 0)/MoneySpent
  )

regClicksLogLevel <- lm(perc.link.clicks ~ perc.MoneySpent, data = TwtrData)
summary(regClicksLogLevel)

Output:

Call:
lm(formula = perc.link.clicks ~ perc.MoneySpent, data = TwtrData)

Residuals:
         1          2          3          4          5          6
-0.1422261 -0.0766939 -0.0839233 -0.0002346  0.1912170  0.1118608

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.1422     0.1082   1.315  0.25890
perc.MoneySpent   0.9963     0.1631   6.109  0.00363 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1434 on 4 degrees of freedom
Multiple R-squared: 0.9032, Adjusted R-squared: 0.879
F-statistic: 37.32 on 1 and 4 DF, p-value: 0.003635

And here is the graph:

TwtrData %>%
  ggplot(aes(perc.MoneySpent, perc.link.clicks)) +
  geom_line() +
  geom_smooth(method='lm', formula= y~x) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(labels = scales::percent)
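Coming back to the original question: the log-level model is linear in MoneySpent on the log(link.clicks) scale, which is exactly why visreg draws a straight line for it. If you back-transform its predictions to the raw click scale, you get an exponential curve in MoneySpent rather than a leveling-off one, which is part of why the log-log fit matched the data's shape better. A sketch, assuming the question's regClicksLogLevel <- lm(log(link.clicks) ~ MoneySpent, data = TwtrData):

grid <- data.frame(MoneySpent = seq(min(TwtrData$MoneySpent),
                                    max(TwtrData$MoneySpent), length.out = 100))
grid$clicks.hat <- exp(predict(regClicksLogLevel, newdata = grid))
plot(link.clicks ~ MoneySpent, data = TwtrData)
lines(grid$MoneySpent, grid$clicks.hat)  # exponential on the raw scale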
SLR of transformed data in R
For Y = % of population with income below poverty level and X = per capita income of population, I have constructed a Box-Cox plot and found that lambda = 0.02020:

bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda

Now I want to fit a simple linear regression using the transformed data, so I've entered this code:

transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform

But all I get is the error message 'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.

## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)

p <- powerTransform(prestige ~ income + education + type, data=Prestige, family="bcPower")
summary(p)
# bcPower Transformation to Normality
#    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1    1.3052           1       0.9408       1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
#  (log transformation)
#                            LRT df       pval
# LR test, lambda = (0) 41.67724  1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
#                            LRT df    pval
# LR test, lambda = (1) 2.623915  1 0.10526

mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
#     type, data = Prestige)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -44.843 -13.102   0.287  15.073  62.889
#
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01  1.639e+01  -2.279   0.0250 *
# income       3.363e-03  6.928e-04   4.854 4.87e-06 ***
# education    1.205e+01  2.009e+00   5.999 3.78e-08 ***
# typeprof     2.027e+01  1.213e+01   1.672   0.0979 .
# typewc      -1.078e+01  7.884e+00  -1.368   0.1746
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
#   (4 observations deleted due to missingness)
# Multiple R-squared:  0.8492, Adjusted R-squared:  0.8427
# F-statistic:   131 on 4 and 93 DF,  p-value: < 2.2e-16
Powers (more often represented by ^ than ** in R, FWIW) have a special meaning inside formulas: they represent interactions among variables rather than mathematical operations. So if you did want to power-transform both sides of your equation you would use the I() or "as-is" operator:

I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)

However, I think you should do what @DaveArmstrong suggested anyway:

- it's only the response variable that gets transformed
- the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results)
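Putting both corrections together, a minimal sketch of the fit the question seems to be after (assuming the tidy.CDI data and the lambda found above; only the response is transformed, using the full Box-Cox form):

lambda <- 0.02020
fit <- lm(I((Percent_below_poverty_level^lambda - 1) / lambda) ~ Per_capita_income,
          data = tidy.CDI)
summary(fit)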
General Linear Model interpretation of parameter estimates in R
I have a data set that looks like:

"","OBSERV","DIOX","logDIOX","OXYGEN","LOAD","PRSEK","PLANT","TIME","LAB"
"1",1011,984.06650389,6.89169348002254,"L","H","L","RENO_N","1","KK"
"2",1022,1790.7973641,7.49041625445373,"H","H","L","RENO_N","1","USA"
"3",1031,661.95870145,6.4952031694744,"L","H","H","RENO_N","1","USA"
"4",1042,978.06853583,6.88557974511529,"H","H","H","RENO_N","1","KK"
"5",1051,270.92290942,5.60183431332639,"N","N","N","RENO_N","1","USA"
"6",1062,402.98269729,5.99889362626069,"N","N","N","RENO_N","1","USA"
"7",1071,321.71945701,5.77367991426247,"H","L","L","RENO_N","1","KK"
"8",1082,223.15260359,5.40785585845064,"L","L","L","RENO_N","1","USA"
"9",1091,246.65350151,5.507984523849,"H","L","H","RENO_N","1","USA"
"10",1102,188.48323034,5.23900903921703,"L","L","H","RENO_N","1","KK"
"11",1141,267.34994025,5.58855843790491,"N","N","N","RENO_N","1","KK"
"12",1152,452.10355987,6.11391126834609,"N","N","N","RENO_N","1","KK"
"13",2011,2569.6672555,7.85153169693888,"N","N","N","KARA","1","USA"
"14",2021,604.79620572,6.40489155123453,"N","N","N","KARA","1","KK"
"15",2031,2610.4804449,7.86728956188212,"L","H",NA,"KARA","1","KK"
"16",2032,3789.7097503,8.24004471210954,"L","H",NA,"KARA","1","USA"
"17",2052,338.97054188,5.82591320649553,"L","L","L","KARA","1","KK"
"18",2061,391.09027375,5.96893841249289,"H","L","H","KARA","1","USA"
"19",2092,410.04420258,6.01626496505788,"N","N","N","KARA","1","USA"
"20",2102,313.51882368,5.74785940190679,"N","N","N","KARA","1","KK"
"21",2112,1242.5931417,7.12495571830002,"H","H","H","KARA","1","KK"
"22",2122,1751.4827969,7.46821802066524,"H","H","L","KARA","1","USA"
"23",3011,60.48026048,4.10231703874031,"N","N","N","RENO_S","1","KK"
"24",3012,257.27729731,5.55015448107691,"N","N","N","RENO_S","1","USA"
"25",3021,46.74282552,3.84466077914493,"N","N","N","RENO_S","1","KK"
"26",3022,73.605375516,4.29871805996994,"N","N","N","RENO_S","1","KK"
"27",3031,108.25433812,4.68448344109116,"H","H","L","RENO_S","1","KK"
"28",3032,124.40704234,4.82355878915293,"H","H","L","RENO_S","1","USA"
"29",3042,123.66859296,4.81760535031397,"L","H","L","RENO_S","1","KK"
"30",3051,170.05332632,5.13611207209694,"N","N","N","RENO_S","1","USA"
"31",3052,95.868704018,4.56297958887925,"N","N","N","RENO_S","1","KK"
"32",3061,202.69261215,5.31169060558111,"N","N","N","RENO_S","1","USA"
"33",3062,70.686307069,4.25825187761015,"N","N","N","RENO_S","1","USA"
"34",3071,52.034715526,3.95191110210073,"L","H","H","RENO_S","1","KK"
"35",3072,93.33525462,4.53619789950355,"L","H","H","RENO_S","1","USA"
"36",3081,121.47464906,4.79970559129829,"H","H","H","RENO_S","1","USA"
"37",3082,94.833869239,4.55212661590867,"H","H","H","RENO_S","1","KK"
"38",3091,68.624596439,4.22865101914209,"H","L","L","RENO_S","1","USA"
"39",3092,64.837097371,4.17187792984139,"H","L","L","RENO_S","1","KK"
"40",3101,32.351569811,3.47666254561192,"L","L","L","RENO_S","1","KK"
"41",3102,29.285124102,3.37707967726539,"L","L","L","RENO_S","1","USA"
"42",3111,31.36974463,3.44584388158928,"L","L","H","RENO_S","1","USA"
"43",3112,28.127853881,3.33676032670116,"L","L","H","RENO_S","1","KK"
"44",3121,91.825330102,4.51988818660262,"H","L","H","RENO_S","1","KK"
"45",3122,136.4559307,4.91600171048243,"H","L","H","RENO_S","1","USA"
"46",4011,126.11889968,4.83722511024933,"H","L","H","RENO_N","2","KK"
"47",4022,76.520259821,4.33755554003153,"L","L","L","RENO_N","2","KK"
"48",4032,93.551979795,4.53851721545715,"L","L","H","RENO_N","2","USA"
"49",4041,207.09703422,5.33318744777751,"H","L","L","RENO_N","2","USA"
"50",4052,383.44185307,5.94918798759058,"N","N","N","RENO_N","2","USA"
"51",4061,156.79345897,5.05492939129363,"N","N","N","RENO_N","2","USA"
"52",4071,322.72413197,5.77679787769979,"L","H","L","RENO_N","2","USA"
"53",4082,554.05710342,6.31726775620079,"H","H","H","RENO_N","2","USA"
"54",4091,122.55552697,4.80856420867156,"N","N","N","RENO_N","2","KK"
"55",4102,112.70050456,4.72473389805434,"N","N","N","RENO_N","2","KK"
"56",4111,94.245481423,4.54590288271731,"L","H","H","RENO_N","2","KK"
"57",4122,323.16498582,5.77816298482521,"H","H","L","RENO_N","2","KK"

I define a linear model in R using lm as:

lm1 <- lm(logDIOX ~ 1 + OXYGEN + LOAD + PLANT + TIME + LAB, data=data)

and I want to interpret the estimated coefficients. However, when I extract the coefficients I get multiple NAs (I'm assuming it's due to linear dependencies among the variables). How can I then interpret the coefficients? I only have one intercept that somehow represents one of the levels of each of the included factors in the model. Is it possible to get an estimate for each factor level?

> summary(lm1)

Call:
lm(formula = logDIOX ~ OXYGEN + LOAD + PLANT + TIME + LAB, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90821 -0.32102 -0.08993  0.27311  0.97758

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2983     0.2110  34.596  < 2e-16 ***
OXYGENL      -0.4086     0.1669  -2.449 0.017953 *
OXYGENN      -0.7567     0.1802  -4.199 0.000113 ***
LOADL        -1.0645     0.1675  -6.357 6.58e-08 ***
LOADN             NA         NA      NA       NA
PLANTRENO_N  -0.6636     0.2174  -3.052 0.003664 **
PLANTRENO_S  -2.3452     0.1929 -12.158  < 2e-16 ***
TIME2        -0.9160     0.2065  -4.436 5.18e-05 ***
LABUSA        0.3829     0.1344   2.849 0.006392 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5058 on 49 degrees of freedom
Multiple R-squared: 0.8391, Adjusted R-squared: 0.8161
F-statistic: 36.5 on 7 and 49 DF, p-value: < 2.2e-16
For the NA part of your question you can have a look at linear regression "NA" estimate just for last coefficient: one of your variables can be described as a linear combination of the rest.

For the factors and their levels, the way R works is to report the intercept for the first factor level and, for each remaining level, the difference from that intercept. I think it will be clearer with a one-factor regression:

lm1 <- lm(logDIOX ~ 1 + OXYGEN, data=df)

> summary(lm1)

Call:
lm(formula = logDIOX ~ 1 + OXYGEN, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7803 -0.7833 -0.2027  0.6597  3.1229

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.5359     0.2726  20.305   <2e-16 ***
OXYGENL      -0.4188     0.3909  -1.071    0.289
OXYGENN      -0.1896     0.3807  -0.498    0.621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.188 on 54 degrees of freedom
Multiple R-squared: 0.02085, Adjusted R-squared: -0.01542
F-statistic: 0.5749 on 2 and 54 DF, p-value: 0.5662

What this result says is that for OXYGEN="H" the intercept is 5.5359, for OXYGEN="L" the intercept is 5.5359-0.4188=5.1171, and for OXYGEN="N" the intercept is 5.5359-0.1896=5.3463. Hope this helps.

UPDATE: Following your comment, I generalize to your model.

When OXYGEN="H", LOAD="H", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983
When OXYGEN="L", LOAD="H", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983-0.4086 = 6.8897
When OXYGEN="L", LOAD="L", PLANT="KARA", TIME=1, LAB="KK", then: logDIOX = 7.2983-0.4086-1.0645 = 5.8252
etc.
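If you want R to tell you exactly which linear combination caused the NA, alias() will report it (a quick diagnostic, assuming the lm1 from the question):

alias(lm1)
# The "Complete" block lists LOADN as an exact combination of the other
# columns: in this data LOAD is "N" precisely when OXYGEN is "N", so the
# LOADN dummy duplicates OXYGENN and lm drops it.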
R lm interaction terms with categorical and squared continuous variables
I am trying to get an lm fit for my data. The problem I am having is that I want to fit a linear model (1st order polynomial) when the factor is "true" and a second order polynomial when the factor is "false". How can I get that done using only one lm?

a=c(1,2,3,4,5,6,7,8,9,10)
b=factor(c("true","false","true","false","true","false","true","false","true","false"))
c=c(10,8,20,15,30,21,40,25,50,31)
DumbData<-data.frame(cbind(a,c))
DumbData<-cbind(DumbData,b=b)

I have tried:

Lm2<-lm(c~a + b + b*I(a^2), data=DumbData)
summary(Lm2)

which results in:

Call:
lm(formula = c ~ a + b + b * I(a^2), data = DumbData)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.74483    1.12047  -0.665 0.535640
a             4.44433    0.39619  11.218 9.83e-05 ***
btrue         6.78670    0.78299   8.668 0.000338 ***
I(a^2)       -0.13457    0.03324  -4.049 0.009840 **
btrue:I(a^2)  0.18719    0.01620  11.558 8.51e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7537 on 5 degrees of freedom
Multiple R-squared: 0.9982, Adjusted R-squared: 0.9967
F-statistic: 688 on 4 and 5 DF, p-value: 4.896e-07

Here I have I(a^2) for both fits, and I want one fit with a 1st order polynomial and another with a second order polynomial. If one tries:

Lm2<-lm(c~a + b + I(b*I(a^2)), data=DumbData)

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(b, I(a^2)) : * not meaningful for factors

How can I get the proper interaction terms here?

Thanks Andrie, there are still some things I am missing here. In this example the variable b is a logical one; if it is a factor with two levels it does not work, so I guess I have to convert the factor variable to a logical one. The other thing I am missing is the ! (not) in the condition I(!b*a^2); without the ! I get:

Call:
lm(formula = c ~ a + I(b * a^2), data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2692     1.8425   3.945 0.005565 **
a             2.3222     0.3258   7.128 0.000189 ***
I(b * a^2)    0.3005     0.0355   8.465 6.34e-05 ***

I cannot relate the formulas with and without the ! condition, which is a bit strange to me.
Try something along the following lines:

dat <- data.frame(
  a=c(1,2,3,4,5,6,7,8,9,10),
  b=c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE),
  c=c(10,8,20,15,30,21,40,25,50,31)
)
fit <- lm(c ~ a + I(!b * a^2), dat)
summary(fit)

This results in:

Call:
lm(formula = c ~ a + I(!b * a^2), data = dat)

Residuals:
   Min     1Q Median     3Q    Max
 -4.60  -2.65   0.50   2.65   4.40

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      10.5000     2.6950   3.896 0.005928 **
a                 3.9000     0.4209   9.266 3.53e-05 ***
I(!b * a^2)TRUE -13.9000     2.4178  -5.749 0.000699 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.764 on 7 degrees of freedom
Multiple R-squared: 0.9367, Adjusted R-squared: 0.9186
F-statistic: 51.75 on 2 and 7 DF, p-value: 6.398e-05

Note: I made use of the logical values TRUE and FALSE. These will coerce to 1 and 0, respectively. I used the negation !b inside the formula.
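One parsing subtlety worth flagging here (my note, not part of either answer): unary ! binds more loosely than * and ^ in R, so !b * a^2 is actually parsed as !(b * a^2), a logical, which is why the coefficient above prints as I(!b * a^2)TRUE. Compare:

b <- c(TRUE, FALSE); a <- c(2, 3)
!b * a^2    # !(b * a^2) -> FALSE TRUE: TRUE exactly where b is FALSE (given a != 0)
(!b) * a^2  # 0 9: a^2 kept only where b is FALSE

The two encodings therefore fit different models: the logical version adds a constant shift for the b == FALSE rows, while (!b) * a^2 (used in the next answer) adds an actual quadratic term for those rows.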
Ummm...

Lm2<-lm(c~a + b + b*I(a^2), data=DumbData)

You say that "The problem I am having is that I want to fit a linear model (1st order polynomial) when the factor is "true" and a second order polynomial when the factor is "false". How can I get that done using only one lm?"

From that I infer that you don't want b to be directly in the model? In addition, a^2 should be included only if b is false. So that would be:

lm(c~ a + I((!b) * a^2))

If b is true (that is, !b equals FALSE) then a^2 is multiplied by zero (FALSE) and omitted from the equation. The only problem is that you have defined b as a factor instead of a logical. That can be cured:

# b=factor(c("true","false","true","false","true","false","true","false","true","false"))
# could use TRUE and FALSE instead of "true" and "false"
# alternatively, after defining b as above, do
# b <- b=="true"  -- that converts b to logical (i.e. boolean TRUE and FALSE values)

OK, to be exact, you defined b as "character" but it was converted to "factor" when adding it to the data frame ("DumbData").

Another minor point about the way you defined the data frame:

a=c(1,2,3,4,5,6,7,8,9,10)
b=factor(c("true","false","true","false","true","false","true","false","true","false"))
c=c(10,8,20,15,30,21,40,25,50,31)
DumbData<-data.frame(cbind(a,c))
DumbData<-cbind(DumbData,b=b)

Here, cbind is unnecessary. You could have it all on one line:

Dumbdata<- data.frame(a,b,c) # shorter and cleaner!!

In addition, to convert b to logical use:

Dumbdata<- data.frame(a,b=b=="true",c)

Note: you need to say b=b=="true". It seems redundant, but the LHS (b) gives the name of the variable in the data frame, whereas the RHS (b=="true") is an expression that evaluates to a logical (boolean) value.
Why do column names get concatenated into the row output of a linear model summary?
I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.

An example

Suppose you had some data for 300 movie audience members from three different cities:

Chicago
Milwaukee
Dayton

And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale. Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)

Here's what that might look like in R:

> score <- rnorm(n = 300, mean = -50, sd = 10)
> city <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
      score      city
1 -64.57515   Chicago
2 -50.51050 Milwaukee
3 -56.51409    Dayton
4 -45.55133   Chicago
5 -47.88686 Milwaukee
6 -51.22812    Dayton

Great. So let's run a quick linear model, assign it to lm1, and get its summary output:

> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)

Call:
lm(formula = score ~ city, data = spider.man.3.sucked)

Residuals:
     Min       1Q   Median       3Q      Max
-29.8515  -6.1090  -0.4745   6.0340  26.2616

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -51.3621     0.9630 -53.337   <2e-16 ***
cityDayton      1.1892     1.3619   0.873    0.383
cityMilwaukee   0.8288     1.3619   0.609    0.543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693, Adjusted R-squared: -0.004023
F-statistic: 0.4009 on 2 and 297 DF, p-value: 0.6701

What's bugging me

The part I want to highlight is this:

cityDayton      1.1892     1.3619   0.873    0.383
cityMilwaukee   0.8288     1.3619   0.609    0.543

It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:

Dayton      1.1892     1.3619   0.873    0.383
Milwaukee   0.8288     1.3619   0.609    0.543

Two questions in one

So,

What's controlling the format of the output for linear model summary rows, and
Can/should I change it?
The extractor function for that component of a summary object is coef. Does this provide the means to control your output acceptably?

summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
#           Estimate Std. Error t value Pr(>|t|)
# Dayton      0.8133      1.485  0.5478   0.5842
# Milwaukee   0.3891      1.485  0.2621   0.7934

(No random seed was set, so I cannot match your values.)
For 1), it appears to happen inside model.matrix.default(), and inside internal compiled R code at that. It might be difficult to change this easily; the obvious route would be to write your own model.matrix.default() that calls the original and updates the names afterwards. But this isn't tested or tried.
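You can confirm that this is where the names are manufactured by looking at the design matrix directly (a quick check, using the lm1 fitted above):

colnames(model.matrix(lm1))
# [1] "(Intercept)"   "cityDayton"    "cityMilwaukee"
# The factor name and level are already fused here, before summary() ever runs.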
Here is a hack:

# RUN REGRESSION
require(reshape2)  # the tips dataset ships with reshape2
lm1 = lm(tip ~ total_bill + sex + day, data = tips)

# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
  mydf = mod$model
  # PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
  vars = names(mod$model)[-1]
  eachlen = sapply(mydf[,vars,drop=F], function(x)
    ifelse(is.numeric(x), 1, length(unique(x)) - 1))
  vars = rep(vars, eachlen)
  # REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
  coefs = names(mod$coefficients)[-1]  # take names from the model passed in
  coefs2 = stringr::str_replace(coefs, vars, "")
  names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)
  return(mod)
}

summary(remove_factors(lm1))

This gives:

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.95588    0.27579    3.47  0.00063 ***
total_bill   0.10489    0.00758   13.84  < 2e-16 ***
Male        -0.03844    0.14215   -0.27  0.78706
Sat         -0.08088    0.26226   -0.31  0.75806
Sun          0.08282    0.26741    0.31  0.75706
Thur        -0.02063    0.26975   -0.08  0.93910

However, it is not always advisable to do this, as you can see by running the same hack on a different regression. It is not clear what the Yes variable in the last row stands for; R by default writes it as smokerYes to make its meaning clear. So use with caution.

lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.05182    0.29315    3.59  0.00040 ***
total_bill   0.10569    0.00763   13.86  < 2e-16 ***
Male        -0.03769    0.14217   -0.27  0.79114
Sat         -0.12636    0.26648   -0.47  0.63582
Sun          0.00407    0.27959    0.01  0.98841
Thur        -0.09283    0.27994   -0.33  0.74048
Yes         -0.13935    0.14422   -0.97  0.33489