Working through this R linear modelling tutorial, I'm finding that the format of the model output is annoyingly different from that shown in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?
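The "sex1" label shows up when the contrasts option has been changed from R's default treatment (dummy) coding to sum coding: with sum contrasts the intercept is the mean of the group means (here 177.167) and "sex1" is the first level's deviation from it (226.333 − 177.167 = 49.167, matching the output above). A sketch of how to check and restore the default, which produces the "sexmale" label from the tutorial:

```r
# See which coding scheme is active; the default is
# c("contr.treatment", "contr.poly")
getOption("contrasts")

# Restore treatment (dummy) coding, then refit
options(contrasts = c("contr.treatment", "contr.poly"))

pitch <- c(233, 204, 242, 130, 112, 142)
sex   <- c(rep("female", 3), rep("male", 3))
my.df <- data.frame(sex, pitch)
xmdl  <- lm(pitch ~ sex, my.df)
names(coef(xmdl))  # "(Intercept)" "sexmale"
```

With treatment coding the intercept becomes the female group mean and "sexmale" the male−female difference, which is the parameterisation the tutorial shows.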
Related
I have four data series: CO2 emissions, GDP per capita, GDP per capita squared, and number of tourist arrivals. I am trying to run a model of the impact of tourist arrivals on CO2 emissions in order to derive the tourism-induced Environmental Kuznets Curve. Below are the code and summary results.
Without log
Yt<-Data$`CO2 emissions`
X1t<-Data$`GDP per capita`
X2t<-Data$`GDP per caita square`
X3t<-Data$`Number of Tourist arrival`
model<-lm(Yt~X1t+X2t+X3t)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.238e-02 7.395e-03 1.674 0.100187
X1t -2.581e-05 6.710e-05 -0.385 0.702139
X2t 1.728e-07 4.572e-08 3.780 0.000413 ***
X3t 1.928e-07 3.501e-08 5.507 1.2e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02252 on 51 degrees of freedom
Multiple R-squared: 0.9475, Adjusted R-squared: 0.9444
F-statistic: 306.5 on 3 and 51 DF, p-value: < 2.2e-16
With log
LYt<-(log(Yt))
LX1t<-(log(X1t))
LX2t<-(log(X2t))
LX3t<-(log(X3t))
model1<-lm(LYt~LX1t+LX2t+LX3t)
summary(model1)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.38623 0.46040 -20.387 < 2e-16 ***
LX1t 0.83679 0.09834 8.509 2.01e-11 ***
LX2t NA NA NA NA
LX3t 0.17802 0.06888 2.585 0.0126 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2863 on 52 degrees of freedom
Multiple R-squared: 0.9074, Adjusted R-squared: 0.9038
F-statistic: 254.7 on 2 and 52 DF, p-value: < 2.2e-16
It is pretty evident that GDP per capita and GDP per capita squared are collinear. However, why do the regression coefficients show missing values (NA) only in the log-transformed model?
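The difference between the two models comes down to exact versus approximate collinearity. On the raw scale, x and x² are strongly correlated but not an exact linear combination of each other (squaring is nonlinear), so lm() can still estimate both. After logging, log(x²) = 2·log(x) exactly, so LX2t is a perfect linear function of LX1t and lm() aliases it, reporting NA. A minimal sketch with made-up data:

```r
set.seed(42)
x  <- runif(50, 1, 100)   # stand-in for GDP per capita
x2 <- x^2                 # its square

# Raw scale: highly correlated, but not an exact linear combination
cor(x, x2)                       # high, yet below 1

# Log scale: exact linear dependence, log(x^2) == 2 * log(x)
all.equal(log(x2), 2 * log(x))   # TRUE

# lm() detects the exact dependence and drops the aliased column
fit <- lm(rnorm(50) ~ log(x) + log(x2))
coef(fit)                        # coefficient for log(x2) is NA
```

This is the same "(1 not defined because of singularities)" message seen in the summary above: the QR decomposition inside lm() pivots the exactly dependent column out of the fit.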
I am a beginner in R and statistics in general. I am trying to build a regression model with four variables (one of them nominal, with 3 categories). I think I managed to build a model with my raw data. But when I standardized my data set, lm gave me the "essentially perfect fit: summary may be unreliable" warning.
This is the summary of the model with raw data;
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0713 5.9131 -0.350 0.727
Vdownload 8.6046 0.5286 16.279 < 2e-16 ***
DownloadDegisim 2.8854 0.6822 4.229 4.25e-05 ***
Vupload -4.2877 0.5418 -7.914 7.32e-13 ***
Saglayici2 -8.2084 0.6043 -13.583 < 2e-16 ***
Saglayici3 -9.8869 0.5944 -16.634 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.885 on 138 degrees of freedom
Multiple R-squared: 0.8993, Adjusted R-squared: 0.8956
F-statistic: 246.5 on 5 and 138 DF, p-value: < 2.2e-16
I wrote this code to standardize my data:
memnuniyet_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vdownload_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vupload_scaled <-scale(Vupload, center = TRUE, scale = TRUE)
DownloadD_scaled <- scale(DownloadDegisim, center = TRUE, scale = TRUE)
result<-lm(memnuniyet_scaled~Vdownload_scaled+DownloadD_scaled+Vupload_scaled+Saglayıcı)
summary(result)
And this is the summary of my standardized data
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.079e-17 5.493e-17 9.250e-01 0.357
Vdownload_scaled 1.000e+00 6.667e-17 1.500e+16 <2e-16 ***
DownloadD_scaled -4.591e-17 8.189e-17 -5.610e-01 0.576
Vupload_scaled 9.476e-18 6.337e-17 1.500e-01 0.881
Saglayici2 -6.523e-17 7.854e-17 -8.300e-01 0.408
Saglayici3 -8.669e-17 7.725e-17 -1.122e+00 0.264
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.75e-16 on 138 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.034e+32 on 5 and 138 DF, p-value: < 2.2e-16
I do know that the R-squared value should not have changed with standardization, and I have no idea what I did wrong.
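The output itself points to the likely cause: Vdownload_scaled has a coefficient of exactly 1 and every other estimate is numerically zero (on the order of 1e-17), which is what you get when the response *is* one of the predictors. In the code above, memnuniyet_scaled is assigned scale(Vdownload, ...) rather than the scaled memnuniyet variable, almost certainly a copy-paste slip. With the correct response, standardization leaves R-squared unchanged, as this sketch with made-up data shows:

```r
# Standardizing both sides of a regression changes the coefficients
# but not R-squared.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

r2_raw    <- summary(lm(y ~ x))$r.squared
r2_scaled <- summary(lm(scale(y) ~ scale(x)))$r.squared

all.equal(r2_raw, r2_scaled)  # TRUE
```

Fixing the first line to scale the actual satisfaction variable should reproduce the raw model's R-squared of 0.8993.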
I am trying to see if there is a relationship between the number of bat calls and the time of pup rearing season. The Pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes p-values for the Pre and Post categories. I created a sample data set below; with the sample data I just get an error, while with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> Calls Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I'm just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't work out the same, but maybe that error message will help in understanding the problem.
For R to interpret a dummy variable correctly, you have to indicate that Pup is a qualitative (categorical) variable by using factor. Calls must also be numeric rather than character, which is what causes the "malformed factor" error in your sample:
> Calls <- as.numeric(Calls)
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the dummy variable, you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, then by default R includes a dummy variable for each level of that variable except one. You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)
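A quick way to see what the intercept-free fit reports: without an intercept, each coefficient is simply that group's mean of Calls, which you can confirm directly (Calls made numeric and Pup a factor, as above):

```r
Calls <- c(55, 60, 180, 160, 110, 50)
Pup   <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q     <- data.frame(Calls, Pup)

coef(lm(Calls ~ Pup - 1, data = q))
# compare with the group means:
tapply(q$Calls, q$Pup, mean)  # Middle 85, Post 170, Pre 52.5
```

These match the estimates in the no-intercept summary above (85.00, 170.00, 52.50), so each t-test is now testing whether a group mean differs from zero rather than from the baseline level.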
This is more of a statistics question than a coding one. Is it possible to test whether the slope of a data set is significant or not?
So I have the following plot:
I managed to determine the slope of this blue geom_smooth line, which is basically the mean of all the pink lines. Is it possible to test if the slope of that blue line is significant based on the dataset?
I'm not directly interested in the code but just want to check if it's possible to determine a possible significance in the data set.
The following shows the p-value of the slope, 0.0544, in the output:
summary(lm(demand ~ Time, BOD))
giving:
Call:
lm(formula = demand ~ Time, data = BOD)
Residuals:
1 2 3 4 5 6
-1.9429 -1.6643 5.3143 0.5929 -1.5286 -0.7714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.5214 2.6589 3.205 0.0328 *
Time 1.7214 0.6387 2.695 0.0544 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.085 on 4 degrees of freedom
Multiple R-squared: 0.6449, Adjusted R-squared: 0.5562
F-statistic: 7.265 on 1 and 4 DF, p-value: 0.05435
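An equivalent way to read the same result: a 95% confidence interval for the slope that includes 0 corresponds to p > 0.05. confint() on the same fit (BOD is a built-in R dataset) shows the interval just crossing zero, matching the p-value of 0.0544:

```r
fit <- lm(demand ~ Time, BOD)
confint(fit)["Time", ]  # roughly (-0.05, 3.49); barely includes 0
```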
Ciao Everyone,
I would like to create a dummy variable in R. So I have a list of Italian regions, and a variable called mafia. The mafia variable is coded 1 in the regions with high levels of mafia infiltration and 0 in the regions with lower levels of mafia penetration.
Now, I would like to work only with the regions that have high levels of mafia (mafia = 1).
If I understand your question correctly, the typical way of adding dummy variables (also called fixed effects) is to use the function factor. Here is an example that creates random data and then uses factor in a linear regression:
set.seed(1)
require(data.table)
A = data.table(region = LETTERS[1:3], y = runif(100), x = runif(100), mafia = sample(c(0,1),100,rep = T))
> head(A)
region var mafia
1: A 0.67371223 1
2: B 0.09485786 0
3: C 0.49259612 1
4: A 0.46155184 1
5: B 0.37521653 1
6: C 0.99109922 1
formula = y ~ x + factor(mafia)
reg <- lm(formula, data = A)
> summary(reg)
Call:
lm(formula = formula, data = A)
Residuals:
Min 1Q Median 3Q Max
-0.46965 -0.24828 -0.03362 0.28780 0.51183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46196 0.07093 6.513 3.28e-09 ***
x 0.06735 0.10521 0.640 0.524
factor(mafia)1 -0.01830 0.06415 -0.285 0.776
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3189 on 97 degrees of freedom
Multiple R-squared: 0.005498, Adjusted R-squared: -0.01501
F-statistic: 0.2681 on 2 and 97 DF, p-value: 0.7654
If you wish to run the regression only on the observations coded 1 in the "mafia" column, that is much easier:
# Note that A is a data.table
A.mafia = A[ mafia == 1 ]
formula = y ~ x
reg <- lm(formula, data = A.mafia)
summary(reg)
Output:
Call:
lm(formula = formula, data = A.mafia)
Residuals:
Min 1Q Median 3Q Max
-0.47163 -0.26063 -0.05724 0.30166 0.52062
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.43334 0.07926 5.467 1.53e-06 ***
x 0.09017 0.14474 0.623 0.536
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3197 on 49 degrees of freedom
Multiple R-squared: 0.007857, Adjusted R-squared: -0.01239
F-statistic: 0.388 on 1 and 49 DF, p-value: 0.5362
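As an aside, you don't have to build a separate table at all: lm() has its own subset argument that filters rows for you and gives the same fit. A self-contained sketch using plain base R (the data construction here mirrors the example above but uses data.frame, so data.table is not required):

```r
set.seed(1)
A <- data.frame(
  region = rep(LETTERS[1:3], length.out = 100),
  y = runif(100),
  x = runif(100),
  mafia = sample(c(0, 1), 100, replace = TRUE)
)

# Equivalent to first subsetting to A[A$mafia == 1, ]
reg2 <- lm(y ~ x, data = A, subset = mafia == 1)
coef(reg2)
```

Whether you pre-filter or use subset is a matter of taste; subset keeps the filtering visible in the model call itself.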