How do I separate data in R by category - r

Say I have the data frame (image below) and I want to split it into two groups based on region, one for BC and the other for NZ. How do I achieve this in R?
(screenshot of the example data frame)
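A direct way to do this is split(). A minimal sketch, assuming your data frame is named df and the region column is named Region (the real names are only visible in the screenshot, so adjust accordingly):
# Hypothetical names: df and Region stand in for the ones in the screenshot
regions <- split(df, df$Region)
bc <- regions[["BC"]]   # rows where Region == "BC"
nz <- regions[["NZ"]]   # rows where Region == "NZ"
Each element of regions is itself a data frame.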

Here is an example with the mtcars data where we use the transmission variable am to plot different groups in a scatterplot with the ggplot2 package.
We will create a scatterplot with the displacement variable on the x axis and miles per gallon on the y axis. Since cars with larger engine displacement typically consume more gasoline than those with smaller displacement, we expect to see a negative relationship (mpg is higher with low values of displacement) in the chart.
First, we convert am to a factor variable so the legend prints two categories instead of a continuum between 0 and 1. Then we use ggplot() and geom_point() to set the point color based on the value of am.
library(ggplot2)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
ggplot(mtcars, aes(disp, mpg, group = am)) +
  geom_point(aes(color = am))
...and the output:
Separating charts by group with facets
We can use ggplot2 directly to generate separate charts by a grouping variable. In ggplot2 this is known as a faceted chart. We use facet_wrap() with a formula to split the data by values of am as follows.
ggplot(mtcars, aes(disp, mpg, group = am)) +
  geom_point() +
  facet_wrap(~am, ncol = 2)
...and the output:
Adding regression line and confidence intervals
Given the comments in the original question, we can add a regression line to the plot with the geom_smooth() function, which defaults to loess smoothing for small data sets such as this one.
ggplot(mtcars, aes(disp, mpg, group = am)) +
  geom_point() +
  facet_wrap(~am, ncol = 2) +
  geom_smooth(span = 1)
...and the output:
To use a simple linear regression instead of loess smoothing, we use the method = argument in geom_smooth() and set it to "lm".
ggplot(mtcars, aes(disp, mpg, group = am)) +
  geom_point() +
  facet_wrap(~am, ncol = 2) +
  geom_smooth(method = "lm")
...and the output:
Generate regression models by group
Here we split the data frame by values of am, and use lapply() to generate regression models for each group.
carsList <- split(mtcars, mtcars$am)
lapply(carsList, function(x) {
  summary(lm(mpg ~ disp, data = x))
})
...and the output:
$automatic

Call:
lm(formula = mpg ~ disp, data = x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7341 -1.6546 -0.8855  1.6032  5.0764 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.157064   1.592922   15.79 1.36e-11 ***
disp        -0.027584   0.005146   -5.36 5.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.405 on 17 degrees of freedom
Multiple R-squared:  0.6283,  Adjusted R-squared:  0.6064 
F-statistic: 28.73 on 1 and 17 DF,  p-value: 5.194e-05


$manual

Call:
lm(formula = mpg ~ disp, data = x)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6056 -2.4200 -0.0956  3.1484  5.2315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.86614    1.95033  16.852 3.33e-09 ***
disp        -0.05904    0.01174  -5.031 0.000383 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.545 on 11 degrees of freedom
Multiple R-squared:  0.6971,  Adjusted R-squared:  0.6695 
F-statistic: 25.31 on 1 and 11 DF,  p-value: 0.0003834
NOTE: since this is an example illustrating the code necessary to generate a regression analysis with a split variable, we won't go into the details about whether the data here conforms to modeling assumptions for Ordinary Least Squares regression.
Modeling the groups in one regression model
As I noted in the comments, we can account for the differences between automatic and manual transmissions in one regression model if we specify the am effect as well as an interaction effect between am and disp. (In R's formula notation, am * disp expands to am + disp + am:disp, so the formula below is equivalent to mpg ~ am * disp.)
summary(lm(mpg ~ disp + am + am * disp, data = mtcars))
We can demonstrate that this model generates the same predictions as the split model for manual transmissions by generating predictions from each model as follows.
data <- data.frame(am = c(1, 1, 0),
                   disp = c(157, 248, 300))
data$am <- factor(data$am, labels = c("automatic", "manual"))
mod1 <- lm(mpg ~ disp + am + am * disp, data = mtcars)
predict(mod1, data)
mod2 <- lm(mpg ~ disp, data = mtcars[mtcars$am == "manual", ])
predict(mod2, data[data$am == "manual", ])
...and the output:
> data <- data.frame(am = c(1, 1, 0),
+                    disp = c(157, 248, 300))
> data$am <- factor(data$am, labels = c("automatic", "manual"))
> mod1 <- lm(mpg ~ disp + am + am * disp, data = mtcars)
> predict(mod1, data)
       1        2        3 
23.59711 18.22461 16.88199 
> mod2 <- lm(mpg ~ disp, data = mtcars[mtcars$am == "manual", ])
> predict(mod2, data[data$am == "manual", ])
       1        2 
23.59711 18.22461 
We subset the data prior to predict() for the split model in order to generate predictions only for observations with manual transmissions. Since the predictions match, this demonstrates that building separate models by transmission type is no different from fitting a fully specified model that includes both the categorical am effect and the am * disp interaction; a quick programmatic check follows.
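As a sketch of that check (reusing mod1, mod2, and data from above):
# The two prediction vectors should agree up to floating-point tolerance
manual_rows <- data[data$am == "manual", ]
all.equal(unname(predict(mod1, manual_rows)),
          unname(predict(mod2, manual_rows)))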

Related

Fitting an exponential curve through a scatterplot

I am starting to use R and have a bit of a problem.
I have a dataset called ADC_dark containing 20 points at which leaf temperature and respiration were measured.
I expect an exponential relationship, where an increase in leaf temperature results in increased respiration.
I then plotted an exponential curve through this graph:
ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = log(y) ~ x) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
This is not looking very good. The fitted relationship is log(y) = -2.70206 + 0.11743*x, i.e. y = e^(-2.70206) * e^(0.11743*x):
Call:
lm(formula = log(ADC_dark$abs_A) ~ ADC_dark$Tleaf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0185 -0.1059  0.1148  0.2698  0.6825 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -2.70206    0.51255  -5.272 5.18e-05 ***
ADC_dark$Tleaf  0.11743    0.02161   5.435 3.66e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5468 on 18 degrees of freedom
Multiple R-squared:  0.6213,  Adjusted R-squared:  0.6003 
F-statistic: 29.54 on 1 and 18 DF,  p-value: 3.659e-05
When I use the same data in Excel I get this:
As you can see, the intercepts of these suggested exponential relationships differ.
Just looking at the pictures, I would say that Excel is doing a better job.
How can I 'train' R to fit a better curve through my data, or am I misinterpreting something?
The problem is that when you fit inside ggplot2's stat_smooth() using formula = log(y) ~ x, the scales of your data points and the fitted line end up different: you are plotting y and log(y) on the same y axis, and since y > log(y) for any positive y, the fitted line is shifted below your data points.
You have several options, such as tweaking the axes and scales, or simply using glm() (a generalized linear model) with a log link instead of lm(). In that case the scales are preserved and no additional tweaking is needed.
library(ggplot2)
library(magrittr)  # for the %>% pipe

set.seed(123)
Tleaf <- 1:20
ADC_dark <- data.frame(Tleaf = Tleaf,
                       abs_A = exp(0.11 * Tleaf - 2.7 + rnorm(20) / 10))
ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = gaussian(link = "log"))) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
Output:
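The "tweak the scales" option mentioned above can also work. As a sketch, fit with lm() on the log scale and back-transform the predictions by hand:
# Keep the lm() fit on the log scale, then exponentiate its predictions
# so the curve and the raw data share the same y scale.
fit <- lm(log(abs_A) ~ Tleaf, data = ADC_dark)
ADC_dark$pred <- exp(predict(fit))

ggplot(ADC_dark, aes(Tleaf, abs_A)) +
  geom_point() +
  geom_line(aes(y = pred), colour = "blue")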

Save the result from "summary(lm)" to use in Power BI

Is it possible to save a summary(lm) object in a format usable in Power BI?
Let's say the following:
library(ggplot2)  # the mpg data ships with ggplot2
data <- mpg
lm <- lm(hwy ~ displ, data = mpg)  # note: this name shadows the lm() function
summary(lm)
Output:
Call:
lm(formula = hwy ~ displ, data = mpg)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1039 -2.1646 -0.2242  2.0589 15.0105 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  35.6977     0.7204   49.55   <2e-16 ***
displ        -3.5306     0.1945  -18.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.836 on 232 degrees of freedom
Multiple R-squared:  0.5868,  Adjusted R-squared:  0.585 
F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16
I would like to save this information as a ggplot2 object, or as a picture in general, that I can display in Power BI, so that we can use it as a template for fast regressions inside Power BI. This is because Power BI can only display R code that results in a "plot", not text.
I have tried:
textplot(capture.output(summary(lm)))
But I first got this error:
>install.packages('textplot')
Warning in install.packages :
package ‘textplot’ is not available (for R version 3.5.3)
And unfortunately Power BI doesn't support textplot().
EDIT (clarification): I'm not looking to plot a regression line or plane. I'm looking for a way to save the text output of "summary(lm)" as a plot object that I can display in Power BI.
Try something like this:
fit <- lm(hwy ~ displ, data = mpg)
txt <- capture.output(print(summary(fit)))
plot(NULL, xlim = c(-1, 1), ylim = c(-1, 1),
     xaxt = "n", yaxt = "n", bty = "n", xlab = "", ylab = "")
text(x = 0, y = 0, paste(txt, collapse = "\n"))
You might need to look at using stringr::str_pad to make the text prettier, but this should get you something that works; a sketch of that padding idea follows.
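As a sketch (untested in Power BI itself): pad every line to a common width and draw with a monospace font so the columns of the summary() output stay aligned.
library(stringr)
# Right-pad each line to equal width; family = "mono" preserves alignment
txt_pad <- str_pad(txt, width = max(nchar(txt)), side = "right")
plot.new()
text(x = 0.5, y = 0.5, labels = paste(txt_pad, collapse = "\n"), family = "mono")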
This is how you do it in ggplot2:
ggplot() + xlim(c(-1, 1)) + ylim(c(-1, 1)) +
  geom_text(aes(x = 0, y = 0, label = paste(txt, collapse = "\n"))) +
  theme_void()
This example should be useful for you. To apply it here:
plot(hwy ~ displ, data = mpg)
abline(lm(hwy ~ displ, data = mpg))
For the ggplot2 solution, this works:
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method='lm', formula=y~x)

R - Regression Analysis for a Logarithmic Fit

I am performing regression analysis to find the best-fit model for the diamonds dataset in ggplot2. I use price (the response variable) vs carat, and I have fit linear, quadratic, and cubic regressions. None of these lines fit well. I noticed that the logarithmic trendline in Excel fits best, but I couldn't figure out how to code the logarithmic fit in R. Can anyone help?
# Compare price vs carat
model <- lm(price ~ carat, data = diamonds)
# Model 2 uses a polynomial (quadratic) term
model2 <- lm(price ~ carat + I(carat^2), data = diamonds)
# Model 3 adds a cubic term
model3 <- lm(price ~ carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the log fit in R to get the same result as Excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!
The result you report from Excel, y = 0.4299*ln(x) - 2.5495, does not contain any polynomial or cubic terms, so what are you trying to do? price is very skewed, and as with, say, income, it is common practice to take its log. Doing so also yields the R² you are referring to, but very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2844 -0.2449  0.0335  0.2578  1.5642 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.215021   0.003348    1856   <2e-16 ***
carat       1.969757   0.003608     546   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared:  0.8468,  Adjusted R-squared:  0.8468 
F-statistic: 2.981e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
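If the goal is literally Excel's "logarithmic" trendline, which has the form y = a*ln(x) + b (the log is taken of the predictor, not the response), a sketch of that fit would be:
library(ggplot2)  # for the diamonds data
# Note: the coefficients will not match the 0.4299 / -2.5495 reported above
# unless the data were scaled the same way as in Excel.
m2 <- lm(price ~ log(carat), data = diamonds)
summary(m2)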

Custom regression equation in R

I have a set of data in R and I want to run a regression to test for correlation using custom coefficients.
Example:
x = lm(a ~ b + c + d, data=data, weights=weights)
That gives me coefficients for b, c, and d, but I just want to give b, c, and d my own coefficients and find, for example, the r^2. How would I do so?
Let's assume your predetermined coefficients are a four-element numeric vector named vec (the intercept first, then the slopes for b, c, and d) and that none of b, c, and d are factors or character vectors:
# edit ... add a sum() function
(x <- lm(a ~ 1, data = data,
         offset = apply(data[c("b", "c", "d")], 1,
                        function(x) { sum(c(1, x) * vec) })))
This should produce a model that has the specified estimates. You will probably need to do this:
summary(x)
As always... if you want tested code, then provide a dataset for testing. With the mtcars dataframe:
m1 <- lm(mpg ~ carb + wt, data = mtcars)
vec <- coef(m1)
(x <- lm(mpg ~ 1, data = mtcars,
         offset = apply(mtcars[c("carb", "wt")], 1,
                        function(x) { sum(c(1, x) * vec) })))
Call:
lm(formula = mpg ~ 1, data = mtcars, offset = apply(mtcars[c("carb", 
    "wt")], 1, function(x) {
    sum(c(1, x) * vec)
}))

Coefficients:
(Intercept)  
  -7.85e-17  
So the offset model (with the coefficients supplied via the offset) is essentially an exact fit to the m1 model: the estimated intercept is numerically zero.
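As a quick sketch of that claim (reusing m1 and x from above), the fitted values of the two models should agree to floating-point tolerance:
# Fitted values of the offset model are the ~0 intercept plus the offset,
# so they should match m1's fitted values up to rounding error.
all.equal(unname(fitted(m1)), unname(fitted(x)))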
@BondedDust's method will be more efficient in the long run, but just for illustration, here's a simple example of how to create your own function to calculate R-squared for any regression coefficients you choose. We'll use the mtcars data set, which is built into R.
Assume a regression model that predicts "mpg" using the independent variables "carb" and "wt". a, b, and c are the three regression parameters that we need to provide to the function.
# Function to calculate R-squared
R2 <- function(a, b, c) {
  # Calculate the residual sum of squares from the regression model
  SSresid <- sum(((a + b * mtcars$carb + c * mtcars$wt) - mtcars$mpg)^2)
  # Calculate the total sum of squares
  SStot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
  # Calculate and return the R-squared for the regression model
  return(1 - SSresid / SStot)
}
Now let's run the function. First, let's check that it matches the R-squared calculated by lm: we'll create a regression model in R, feed that model's coefficients into our function, and see whether the result matches the lm output:
# Create regression model
m1 = lm(mpg ~ carb + wt, data=mtcars)
summary(m1)
Call:
lm(formula = mpg ~ carb + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5206 -2.1223 -0.0467  1.4551  5.9736 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.7300     1.7602  21.435  < 2e-16 ***
carb         -0.8215     0.3492  -2.353   0.0256 *  
wt           -4.7646     0.5765  -8.265 4.12e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.839 on 29 degrees of freedom
Multiple R-squared:  0.7924,  Adjusted R-squared:  0.7781 
F-statistic: 55.36 on 2 and 29 DF,  p-value: 1.255e-10
From the summary, we can see that the R-squared is 0.7924. Let's see what we get from the function we just created. All we need to do is feed our function the three regression coefficients listed in the summary above. We can hard-code those numbers, or we can extract the coefficients from the model object m1 (which is what I've done below):
R2(coef(m1)[1], coef(m1)[2], coef(m1)[3])
[1] 0.7924425
Now let's calculate the R-squared for other choices of the regression coefficients:
a = 37; b = -1; c = -3.5
R2(a, b, c)
[1] 0.5277607
a = 37; b = -2; c = -5
R2(a, b, c)
[1] 0.0256494
To check lots of values of a parameter at once, you can, for example, use sapply. The code below will return the R-squared for values of c ranging from -7 to -3 in increments of 0.1 (with the other two parameters set to the values returned by lm):
sapply(seq(-7,-3,0.1), function(x) R2(coef(m1)[1],coef(m1)[2],x))
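To visualize that sweep (a sketch building on the R2 function and m1 above):
# Plot R-squared as c varies; the curve peaks near the least-squares estimate
c_vals <- seq(-7, -3, 0.1)
r2_vals <- sapply(c_vals, function(x) R2(coef(m1)[1], coef(m1)[2], x))
plot(c_vals, r2_vals, type = "l", xlab = "c (coefficient on wt)", ylab = "R-squared")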

slope and intercept issue in scatterplot ggplot2

I was drawing a (linear) regression line using the mtcars dataset (mpg ~ cyl). I ran a simple linear model of mpg on cyl and executed a summary.
The intercept from the linear model summary does not match the graphical representation.
I am having a hard time understanding what is going on. If I use R's base plotting function to draw the scatterplot, I get the same result as ggplot2. I changed the y-axis scale limits (0, 40) without any success.
Here is my code
data(mtcars)
library(ggplot2)
p <- ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_point(shape = 1)  # create graph
p + geom_smooth(method = lm, se = FALSE)  # add line
lm.car <- lm(mpg ~ cyl, data = mtcars)  # create linear model
summary(lm.car)  # summary
Here is the linear model output:
> summary(lm.car)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,  Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
Based on the following suggestions, I used:
data(mtcars)
library(ggplot2)
p <- ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_point(shape = 1) +
  xlim(0, 10)  # create graph
p + geom_smooth(method = lm, se = FALSE)  # add line
Here is the output:
When you do a regression analysis, the intercept shows the predicted value of y when x is 0. In the case of cyl and mpg, the intercept of 37.8846 means that predicted mpg is 37.8846 when cyl is 0.
On the ggplot2 plot, the regression line only covers cyl values from 4 to 8, because those are the only values present in the data.
If you calculate the predicted value of mpg for a cyl value of 4, you get 26.38142. That's the value you see at the left end of the line on the plot.
predict(lm.car, data.frame(cyl = 4))
       1 
26.38142 
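If you want to see the fitted line actually reach the intercept at x = 0, extending the axis alone is not enough, because geom_smooth() only draws over the range of the data. A sketch (fullrange = TRUE extrapolates the fit across the whole panel):
library(ggplot2)
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  xlim(0, 10)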
