I want to run a regression of money spent on links clicked using a data set where I notice link clicks level off after a certain amount of money spent. I want to use a log transformation to better fit this leveling-off data.
My data set looks like this:
link.clicks
[1] 34 60 54 49 63 100
MoneySpent
[1] 10.97 21.81 20.64 21.42 48.03 127.30
I want to predict the % change in link.clicks from a $1 increase in MoneySpent. My regression model is:
regClicksLogLevel <- lm(log(link.clicks) ~ (MoneySpent), data = TwtrData)
summary(regClicksLogLevel)
visreg(regClicksLogLevel)
However, the graph that visreg generates looks like this:
[1]: https://i.stack.imgur.com/eZqVG.png
When I change my regression to:
regClicksLogLog <- lm(log(link.clicks) ~ log(MoneySpent), data = TwtrData)
summary(regClicksLogLog)
visreg(regClicksLogLog)
I actually get the fitted line I'm looking for:
[2]: https://i.stack.imgur.com/MexwC.png
I'm confused because I'm not trying to predict a % change in link.clicks from a % change in MoneySpent.
I'm trying to predict a % change in link.clicks from a $ unit change in MoneySpent.
Why can't I generate the 2nd graph using my first regression, regClicksLogLevel?
I guess that's what you are looking for
library(tidyverse)
TwtrData = tibble(
link.clicks = c(34,60,54,49,63,100),
MoneySpent = c(10.97,21.81,20.64,21.42,48.03,127.30)
) %>% mutate(
perc.link.clicks = lag(link.clicks, default = 0)/link.clicks,
perc.MoneySpent = lag(MoneySpent, default = 0)/MoneySpent
)
regClicksLogLevel <- lm(perc.link.clicks ~ perc.MoneySpent, data = TwtrData)
summary(regClicksLogLevel)
output
Call:
lm(formula = perc.link.clicks ~ perc.MoneySpent, data = TwtrData)
Residuals:
1 2 3 4 5 6
-0.1422261 -0.0766939 -0.0839233 -0.0002346 0.1912170 0.1118608
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1422 0.1082 1.315 0.25890
perc.MoneySpent 0.9963 0.1631 6.109 0.00363 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1434 on 4 degrees of freedom
Multiple R-squared: 0.9032, Adjusted R-squared: 0.879
F-statistic: 37.32 on 1 and 4 DF, p-value: 0.003635
And here is the graph
TwtrData %>% ggplot(aes(perc.MoneySpent, perc.link.clicks))+
geom_line()+
geom_smooth(method='lm',formula= y~x)+
scale_y_continuous(labels = scales::percent)+
scale_x_continuous(labels = scales::percent)
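As a side note on the original log-level model, here is a minimal base-R sketch, assuming the usual reading of a log(y) ~ x fit, of how the slope converts into a % change per $1 and what the fitted curve looks like on the original scale. The names fit_loglevel, pct_per_dollar and grid are made up for illustration.

```r
# Data copied from the question
link.clicks <- c(34, 60, 54, 49, 63, 100)
MoneySpent  <- c(10.97, 21.81, 20.64, 21.42, 48.03, 127.30)
TwtrData    <- data.frame(link.clicks, MoneySpent)

# Log-level model: log(clicks) is linear in dollars
fit_loglevel <- lm(log(link.clicks) ~ MoneySpent, data = TwtrData)

# % change in clicks for a $1 increase: 100 * (exp(b) - 1)
b <- unname(coef(fit_loglevel)["MoneySpent"])
pct_per_dollar <- 100 * (exp(b) - 1)
print(pct_per_dollar)

# Back-transform predictions to the click scale on an evenly spaced grid
grid <- data.frame(MoneySpent = seq(10, 130, by = 10))
grid$pred_clicks <- exp(predict(fit_loglevel, newdata = grid))
print(grid)
```

On the click scale this model is exp(a + b * MoneySpent), which bends upward whenever b > 0, so it cannot reproduce a leveling-off shape. Putting log(MoneySpent) on the right-hand side is what lets the fitted curve flatten out, which is probably why only the second regression gives the graph you expected.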
For Y = % of population with income below poverty level and X = per capita income of population, I have constructed a box-cox plot and found that the lambda = 0.02020:
bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda
Now I want to fit a simple linear regression using the transformed data, so I've entered this code
transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform
But all I get is the error message
'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.
## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)
p <- powerTransform(prestige ~ income + education + type ,
data=Prestige,
family="bcPower")
summary(p)
# bcPower Transformation to Normality
# Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1 1.3052 1 0.9408 1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
# (log transformation)
# LRT df pval
# LR test, lambda = (0) 41.67724 1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
# LRT df pval
# LR test, lambda = (1) 2.623915 1 0.10526
mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
# type, data = Prestige)
#
# Residuals:
# Min 1Q Median 3Q Max
# -44.843 -13.102 0.287 15.073 62.889
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01 1.639e+01 -2.279 0.0250 *
# income 3.363e-03 6.928e-04 4.854 4.87e-06 ***
# education 1.205e+01 2.009e+00 5.999 3.78e-08 ***
# typeprof 2.027e+01 1.213e+01 1.672 0.0979 .
# typewc -1.078e+01 7.884e+00 -1.368 0.1746
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
# (4 observations deleted due to missingness)
# Multiple R-squared: 0.8492, Adjusted R-squared: 0.8427
# F-statistic: 131 on 4 and 93 DF, p-value: < 2.2e-16
Powers (more often represented by ^ than ** in R, FWIW) have a special meaning inside formulas [they represent interactions among variables rather than mathematical operations]. So if you did want to power-transform both sides of your equation you would use the I() or "as-is" operator:
I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)
However, I think you should do what @DaveArmstrong suggested anyway:
it's only the response variable that gets transformed
the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results)
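To illustrate the last two points, here is a small sketch that transforms only the response, once with the full Box-Cox formula and once with a plain power wrapped in I(). It uses the built-in mtcars data as a stand-in (tidy.CDI is not available here), and box_cox() is a hypothetical helper, not a function from MASS or car.

```r
lambda <- 0.02020  # value reported in the question

# Hypothetical helper: Box-Cox transform, with log(y) as the lambda = 0 case
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Only the response is transformed; function calls are allowed in formulas
fit_bc  <- lm(box_cox(mpg, lambda) ~ wt, data = mtcars)

# Plain power needs I() so that ^ is treated as arithmetic
fit_pow <- lm(I(mpg^lambda) ~ wt, data = mtcars)

# Same R-squared; the slopes differ only by the factor 1 / lambda
print(c(box_cox = summary(fit_bc)$r.squared,
        power   = summary(fit_pow)$r.squared))
print(c(box_cox = unname(coef(fit_bc)["wt"]),
        power_rescaled = unname(coef(fit_pow)["wt"]) / lambda))
```

Since the two transformations differ only by a shift and a scale, the fit quality is identical. With a lambda this close to 0, a plain log(y) should also give a very similar fit.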
I am using NBA shot data and am attempting to create shot prediction models using different regression techniques. However, when I try to fit a logistic regression model, I run into the following warning message:
Warning message: glm.fit: algorithm did not converge
Also, the predictions do not seem to work at all: they are unchanged from the original Y variable (make or miss). My code is below. I got the data from here: Shot Data.
nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)
nba_shots_clean <- data.frame("game_id" = nba_shots$GAME_ID, "location" =
nba_shots$LOCATION, "shot_number" = nba_shots$SHOT_NUMBER,
"closest_defender" = nba_shots$CLOSEST_DEFENDER,
"defender_distance" = nba_shots$CLOSE_DEF_DIST, "points" = nba_shots$PTS,
"player_name" = nba_shots$player_name, "dribbles" = nba_shots$DRIBBLES,
"shot_clock" = nba_shots$SHOT_CLOCK, "quarter" = nba_shots$PERIOD,
"touch_time" = nba_shots$TOUCH_TIME, "game_result" = nba_shots$W
, "FGM" = nba_shots$FGM)
mean(nba_shots_clean$shot_clock) # NA
# this gave NA return which means that there are NAs in this column that we
# need to clean up
# if the shot clock was NA I assume that this means it was the end of a
# quarter and the shot clock was off.
# For now I'm going to just set all of these NAs equal to zero, so all zeros
# mean it is the end of a quarter
# checking the amount of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # this tells me there are 5567 shots taken when the shot
# clock was turned off at the end of a quarter
# setting these NAs equal to zero
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked
# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
This gives me the output of the following, which tells me the prediction didn't do anything, as it's the same as before.
FALSE TRUE
0 21428 0
1 0 17977
I would really appreciate any guidance.
The confusion matrix of your model (model prediction vs. nbaTest$FGM) tells you that your model has 100% accuracy!
This is due to the points variable in your dataset, which is perfectly associated with the dependent variable (this perfect separation is also what keeps glm.fit from converging):
table(nba_shots_clean$points, nba_shots_clean$FGM)
0 1
0 87278 0
2 0 58692
3 0 15133
Try to delete points from your model:
# create a test and train set
set.seed(1234)
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
summary(nbaLogitModel)
No warning messages now and the estimated model is:
Call:
glm(formula = FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, family = "binomial",
data = nbaTrain, na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8995 -1.1072 -0.9743 1.2284 1.6799
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.427688 0.025446 -16.808 < 2e-16 ***
locationH 0.037920 0.012091 3.136 0.00171 **
shot_number 0.007972 0.001722 4.630 0.000003656291 ***
defender_distance -0.006990 0.002242 -3.117 0.00182 **
dribbles 0.010582 0.004859 2.178 0.02941 *
shot_clock 0.032759 0.001083 30.244 < 2e-16 ***
quarter -0.043100 0.007045 -6.118 0.000000000946 ***
touch_time -0.038006 0.005700 -6.668 0.000000000026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 153850 on 111532 degrees of freedom
Residual deviance: 152529 on 111525 degrees of freedom
AIC: 152545
Number of Fisher Scoring iterations: 4
The confusion matrix is:
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
FALSE TRUE
0 21554 5335
1 16726 5955
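For anyone who wants to reproduce the effect without the shot log file, here is a small simulated sketch. Everything in it (sim, made, distance, the accuracy() helper) is invented for illustration, and the split is done row-wise with base sample(). As far as I can tell, caTools::sample.split() expects the outcome vector (e.g. nba_shots_clean$FGM) rather than the whole data frame, so a row-wise split like this is a safer pattern.

```r
set.seed(42)
n <- 2000

# Simulated shots: 'points' is determined by the outcome, 'distance' is noise
made     <- rbinom(n, 1, 0.45)
points   <- ifelse(made == 1, sample(c(2, 3), n, replace = TRUE), 0)
distance <- runif(n, 0, 25)
sim      <- data.frame(made, points, distance)

# Row-wise 75/25 train/test split
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Leaky model: collect any glm warnings instead of printing them
leaky_warnings <- character(0)
fit_leaky <- withCallingHandlers(
  glm(made ~ points + distance, data = train, family = binomial),
  warning = function(w) {
    leaky_warnings <<- c(leaky_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(leaky_warnings)

# Same model without the leaking predictor
fit_clean <- glm(made ~ distance, data = train, family = binomial)

# Share of test rows classified correctly at a 0.5 cutoff
accuracy <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata, type = "response") > 0.5
  mean(pred == (newdata$made == 1))
}
acc_leaky <- accuracy(fit_leaky, test)
acc_clean <- accuracy(fit_clean, test)
print(c(leaky = acc_leaky, clean = acc_clean))
```

With a predictor that encodes the outcome, glm() will usually complain about convergence or about fitted probabilities of 0 or 1, and the test accuracy comes out perfect. Dropping that predictor gives a model that converges normally and has a much more modest accuracy.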
R uses significance codes to flag statistical significance. In the sample output below, for example, a dot (.) indicates significance at the 10% level.
Dots can be very hard to see, especially when I copy-paste to Excel and display it in Times New Roman.
I'd like to change it such that:
* = significant at 10%
** = significant at 5%
*** = significant at 1%
Is there a way I can do this?
> y = c(1,2,3,4,5,6,7,8)
> x = c(1,3,2,4,5,6,8,7)
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.0714 -0.3333 0.0000 0.2738 1.1191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2143 0.6286 0.341 0.74480
x 0.9524 0.1245 7.651 0.00026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8067 on 6 degrees of freedom
Multiple R-squared: 0.907, Adjusted R-squared: 0.8915
F-statistic: 58.54 on 1 and 6 DF, p-value: 0.0002604
You can create your own formatting function with
mystarformat <- function(x) symnum(x, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", " "))
And you can write your own coefficient formatter
show_coef <- function(mm) {
mycoef<-data.frame(coef(summary(mm)), check.names=F)
mycoef$signif = mystarformat(mycoef$`Pr(>|t|)`)
mycoef$`Pr(>|t|)` = format.pval(mycoef$`Pr(>|t|)`)
mycoef
}
And then with your model, you can run it with
mm <- lm(y~x)
show_coef(mm)
# Estimate Std. Error t value Pr(>|t|) signif
# (Intercept) 0.2142857 0.6285895 0.3408993 0.7447995
# x 0.9523810 0.1244793 7.6509206 0.0002604 ***
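If you just want the star labels as a plain character column (handy for pasting into Excel), a similar result can be sketched with cut() from base R. The name star_labels is made up here, and the cutoffs are the ones requested in the question.

```r
# Map p-values to stars: *** <= 0.01, ** <= 0.05, * <= 0.10, otherwise empty
star_labels <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.01, 0.05, 0.10, 1),
                   labels = c("***", "**", "*", ""),
                   include.lowest = TRUE))
}

y <- c(1, 2, 3, 4, 5, 6, 7, 8)
x <- c(1, 3, 2, 4, 5, 6, 8, 7)

# Coefficient table as a data frame, with a plain-text stars column added
tab <- as.data.frame(coef(summary(lm(y ~ x))))
tab$stars <- star_labels(tab[["Pr(>|t|)"]])
print(tab)
```

Note that cut() closes the intervals on the right by default, so a p-value of exactly 0.05 is labelled ** here.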
Be aware that the stargazer package reports significance levels on a different scale than other statistical software such as Stata.
In R (stargazer) you get (* p<0.1; ** p<0.05; *** p<0.01), whereas in Stata you get (* p<0.05, ** p<0.01, *** p<0.001).
This means that a result marked with one * in stargazer output may not count as significant for a Stata user.
Sorry for the late response, but I found a great solution to this.
Just do the following:
install.packages("stargazer")
library(stargazer)
stargazer(your_regression, type = "text")
This displays everything in a beautiful way with your desired format.
Note: If you leave type = "text" out, then you'll get the LaTeX code.
I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot find its equation or its R². Is that possible?
Is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
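On the equation and R² of the predicted-vs-measured plot itself, here is a rough base-R sketch. It relies on the fact that a PLS fit using all of its components coincides with ordinary least squares, so lm() stands in for plsr() here. This is an illustration on mtcars, not the pls package's own method, and if I read the pls documentation correctly, line = TRUE in the plot draws a target line (predicted = measured) rather than a fitted regression line.

```r
train <- mtcars[1:28, ]

# OLS stand-in for a PLS model that uses every component
ols <- lm(mpg ~ cyl + disp + hp + drat, data = train)

measured  <- train$mpg
predicted <- unname(fitted(ols))

# Regression line through the predicted-vs-measured cloud
line_fit <- lm(predicted ~ measured)
print(coef(line_fit))   # intercept and slope of that line

# R-squared of the plot: squared correlation of predicted and measured
r2_plot <- cor(predicted, measured)^2
print(c(plot = r2_plot, model = summary(ols)$r.squared))
```

For a least-squares fit with an intercept, the R² of this plot equals the model's training R², and the slope of the predicted-on-measured line equals that same R². This also speaks to the second question: with all components kept, the first part is indeed the same as a multiple linear regression.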
I have the following data as an example:
InputName InputValue Output
===================================
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
I want to do a linear regression of InputValue vs. Output treating different InputName as independent predictors.
If I want to use lm(Output ~ Oxide + CD) in R, it assumes a separate column for each predictor. In the example above that would mean making a separate column for Oxide and CD. I can do that using cast function from plyr package which might introduce NAs in the data.
However, is there a way to directly tell the lm function that the input predictors are grouped according to the column InputName, and that the values are given in the column InputValue?
It seems to me you are describing a form of dummy variable coding. This is not necessary in R at all, since any factor column in your data will automatically be dummy coded for you.
Recreate your data:
dat <- read.table(text="
InputName InputValue Output
Oxide 35 0.4
Oxide 35.2 0.42
Oxide 34.6 0.38
Oxide 35.9 0.46
CD 0.5 0.42
CD 0.48 0.4
CD 0.56 0.429
", header=TRUE)
Now build the model you described, but drop the intercept to make things a little bit more explicit:
fit <- lm(Output ~ InputValue + InputName - 1, dat)
summary(fit)
Call:
lm(formula = Output ~ InputValue + InputName - 1, data = dat)
Residuals:
1 2 3 4 5 6 7
-0.003885 0.003412 0.001519 -0.001046 0.004513 -0.014216 0.009703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
InputValue 0.063512 0.009864 6.439 0.00299 **
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.009311 on 4 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9995
F-statistic: 4662 on 3 and 4 DF, p-value: 1.533e-07
Notice how all of your factor levels for InputName appear in the output, giving you a separate estimate of the effect of each level.
Concisely, the information you need are in these two lines:
InputNameCD 0.383731 0.007385 51.962 8.21e-07 ***
InputNameOxide -1.819018 0.346998 -5.242 0.00633 **
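To see the dummy coding that lm() builds behind the scenes, you can inspect the design matrix directly. This is a minimal sketch that re-creates the example data with data.frame() instead of read.table().

```r
dat <- data.frame(
  InputName  = factor(c("Oxide", "Oxide", "Oxide", "Oxide", "CD", "CD", "CD")),
  InputValue = c(35, 35.2, 34.6, 35.9, 0.5, 0.48, 0.56),
  Output     = c(0.4, 0.42, 0.38, 0.46, 0.42, 0.4, 0.429)
)

# Design matrix for Output ~ InputValue + InputName - 1
X <- model.matrix(~ InputValue + InputName - 1, data = dat)
print(X)
```

Each row has a 1 in exactly one of the InputNameCD / InputNameOxide columns, so this model estimates a separate intercept per level but a single shared slope for InputValue.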
Here are two ways of doing this: split the data and run the regressions separately, or use interaction terms to specify that the different levels of InputName should have separate slopes:
Split
lapply(split(dat,dat$InputName),lm,formula=Output~InputValue)
$CD
Call:
FUN(formula = ..1, data = X[[1L]])
Coefficients:
(Intercept) InputValue
0.2554 0.3135
$Oxide
Call:
FUN(formula = ..1, data = X[[2L]])
Coefficients:
(Intercept) InputValue
-1.78468 0.06254
Interaction
lm(Output~InputName + InputName:InputValue - 1,dat)
Call:
lm(formula = Output ~ InputName + InputName:InputValue - 1, data = dat)
Coefficients:
InputNameCD InputNameOxide InputNameCD:InputValue InputNameOxide:InputValue
0.25542 -1.78468 0.31346 0.06254
For comparison purposes I've also removed the intercept. Note that the estimated coefficients are the same in each case.
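The equivalence of the two approaches can be checked programmatically. A short sketch on the same example data, with illustrative object names (fit_int, fits_split, split_coefs):

```r
dat <- data.frame(
  InputName  = factor(c("Oxide", "Oxide", "Oxide", "Oxide", "CD", "CD", "CD")),
  InputValue = c(35, 35.2, 34.6, 35.9, 0.5, 0.48, 0.56),
  Output     = c(0.4, 0.42, 0.38, 0.46, 0.42, 0.4, 0.429)
)

# One model with a separate intercept and slope per level of InputName
fit_int <- lm(Output ~ InputName + InputName:InputValue - 1, data = dat)

# Separate regressions, one per level
fits_split <- lapply(split(dat, dat$InputName),
                     function(d) lm(Output ~ InputValue, data = d))

# 2 x 2 matrix: rows are (Intercept) and InputValue, columns are CD and Oxide
split_coefs <- sapply(fits_split, coef)
print(split_coefs)
print(coef(fit_int))
```

The point estimates agree, but the single interaction model pools the residual variance across both groups, so its standard errors will generally differ from those of the separate fits.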