How to deal with a quadratic model that has too many fitted values? - r

I'm trying to fit a quadratic regression model to a dataset and then plot the fitted curve on a scatterplot. The dataset records the number of episodes and the screentime for characters in a TV show.
I plotted a scatterplot with episodes on the x axis and screentime on the y axis, and this worked fine.
Then I created the model as follows:
#ordering
gottemp <- got[order(got$episodes),]
#plotting
plot(screentime~episodes, data = gottemp, xlab ="Number of episodes", ylab = "Screentime (minutes)", col=c("blue","red")[gender], pch=c(1,2)[gender])
legend("topleft",pch = c(1,2),col=c("blue","red"),c("female","male"))
title("Plot of Screentimes vs Number of Episodes")
#creating model and plotting line
model <- lm(screentime~episodes+I(episodes^2), data = got)
lines(fitted(model))
This gives me a model with the correct coefficients; however, the line that is plotted is not what I would expect. When I view the model I see that there are 113 fitted values, which I think is because some characters have the same number of episodes, so to fix this I think there should be only one fitted value for each number of episodes.

Something like
nd <- data.frame(episodes=seq(min(got$episodes), max(got$episodes), length.out=51))
nd$screentime <- predict(model, newdata=nd)
with(nd, lines(episodes, screentime))
should do what you want. There's probably a duplicate around somewhere ...
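A self-contained version of the idea above may help show why it works. `lines(fitted(model))` plots the fitted values against their index (1 to 113) rather than against `episodes`, so the curve lands in the wrong place; predicting on an evenly spaced grid of `episodes` values avoids that. The sketch below uses simulated data as a stand-in for `got` (the values are hypothetical, not the real dataset).

```r
# Hypothetical stand-in for the `got` data frame from the question
set.seed(1)
got <- data.frame(episodes = sample(1:60, 113, replace = TRUE))
got$screentime <- 2 + 1.5 * got$episodes + 0.05 * got$episodes^2 +
  rnorm(113, sd = 10)

model <- lm(screentime ~ episodes + I(episodes^2), data = got)

plot(screentime ~ episodes, data = got,
     xlab = "Number of episodes", ylab = "Screentime (minutes)")

# Predict on a regular grid of x values, then draw x against predicted y
nd <- data.frame(episodes = seq(min(got$episodes), max(got$episodes),
                                length.out = 51))
nd$screentime <- predict(model, newdata = nd)
lines(nd$episodes, nd$screentime)
```

The grid size of 51 is arbitrary; any number large enough to make the curve look smooth should do.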

Related

Segmented Regression with two zero constraints at beginning and end boundaries

I am having trouble with this segmented regression as it requires two constraints and so far I have only treated single constraints.
Here is an example of some data I am trying to fit:
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
yields <- c(-0.131, 0.533, -0.397, -0.429, -0.593, -0.778, -0.92, -0.987, -1.113, -1.314, -0.808, -1.534, -1.377, -1.459, -1.818, -1.686, -1.73, -1.221, -1.595, -1.568, -1.883, -1.53, -1.64, -1.396, -1.679, -1.782, -1.033, -0.539, -1.207, -1.437, -1.521, -0.691, -0.879, -0.974, -1.816, -1.854, -1.752, -1.61, -0.602, -1.364, -1.303, -1.186, -1.336)
maturities <- c(2.824657534246575, 2.9013698630136986, 3.106849315068493, 3.1534246575342464, 3.235616438356164, 3.358904109589041, 3.610958904109589, 3.654794520547945, 3.778082191780822, 3.824657534246575, 3.9013698630136986, 3.9863013698630136, 4.153424657534247, 4.273972602739726, 4.32054794520548, 4.654794520547945, 4.778082191780822, 4.986301369863014, 5.153424657534247, 5.32054794520548, 5.443835616438356, 5.572602739726028, 5.654794520547945, 5.824425480949174, 5.941911819746988, 6.275245153080321, 6.4063926940639275, 6.655026573845348, 6.863013698630137, 7.191780821917808, 7.32054794520548, 7.572602739726028, 7.693150684931507, 7.901369863013699, 7.986301369863014, 8.32054794520548, 8.654794520547945, 8.986301369863014, 9.068493150684931, 9.32054794520548, 9.654794520547945, 9.903660453626769, 10.155026573845348)
off_2 <- 2.693277939965566
off_10 <- 10.655026573845348
bond_data = data.frame(yield_change = yields, maturity = maturities)
I am trying to fit a segmented model (with a formula of "yield_change~maturity") that has the following constraints:
At maturity = 2 I want the yield_change to be zero
At maturity = 10, I want the yield_change to be zero
I want breakpoints (fixed in x) at the 3, 5 and 7-year maturity values.
The off_2 and off_10 variables are the offsets I must use (to set the yields to zero at the 2 and 10-year mark)
As I mentioned before, my previous regressions only required one initial constraint. Having one offset value to use, I did the following:
I subtracted the offset value from the maturity vector (for example, I had maturity = c(10.8, 10.9, 11, 14, 16, 18, etc.), subtracted the offset, which is always lower than the first value of the vector (10.4 for example), and then fitted an lm with an origin constraint).
From there I could use the segmented package and fit as many breakpoints as I wanted.
That was possible because the segmented() function requires an lm object as input.
However, in this case I cannot use the previous approach, because I have two offsets: subtracting either off_2 or off_10 from all the values would fix one point at zero but not the other.
What I have tried doing is the following:
Separate the dataset into maturities below 5 and maturities over 5, and essentially apply a segmented model to each of these (with only one breakpoint, at 3 or 7).
The issue is that I need the 5-year point to yield the same value in the two models.
I have done this:
bond_data_sub5 <- bond_data[bond_data$maturity < 5,]
bond_data_over5 <- bond_data[bond_data$maturity > 5,]
bond_data_sub5["maturity"] <- bond_data_sub5$maturity - off_2
#2 to 5 year model
model_sub5 <- lm(yield_change~maturity+0, data = bond_data_sub5)
plot(bond_data_sub5$maturity,bond_data_sub5$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12))
abline(model_sub5)
Which gives me the following graph:
The fact that the maturities have an offset of off_2 is not a problem as when I input my predictions to the function I will create, I will then subtract them by off_2.
The worrying thing is that the 5-year prediction is not at all close to where the actual 5 year should be. Looking at the scatter plot of all maturities we can see this:
five_yr_yield <- predict(model_sub5,data.frame(maturity = 5 - off_2))
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12), ylim = c(-3,0.5))
points(5,five_yr_yield, pch=16, col = "red")
Gives:
The issue with this method is that if I set the model_sub5 5-year prediction as the beginning constraint of model_over5, I will have the exact same problem I am trying to resolve: two constraints in one lm, but this time the constraints are (5, five_yr_yield) and (10, 0).
Isn't there a way to fit a lm with no slope and zero as an intercept from (2,0) to (10,0) and then apply the segmented function with breakpoints at 3,5 and 7?
If that isn't possible how would I make the logic I am trying to apply work? Or is there another way of doing this?
If anyone has any suggestions I would greatly appreciate them!
Thank you very much!
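One possible way to get both zero constraints at once is sketched below. It does not use the segmented package: because the breakpoints are fixed, a continuous piecewise-linear curve is fully described by its values at the knots, so the end values can be pinned to zero and only the three interior knot values estimated with a no-intercept lm. The sketch assumes the curve should be zero at off_2 and off_10 on the maturity scale (that reading of the offsets is an assumption), and it uses simulated data as a hypothetical stand-in for bond_data.

```r
off_2  <- 2.693277939965566   # offsets taken from the question
off_10 <- 10.655026573845348
nodes  <- c(off_2, 3, 5, 7, off_10)   # end points plus fixed breakpoints

# Hypothetical stand-in data (not the yields from the question)
set.seed(1)
maturity     <- sort(runif(43, 2.8, 10.2))
yield_change <- -1.5 * sin(pi * (maturity - off_2) / (off_10 - off_2)) +
  rnorm(43, sd = 0.2)

# "Hat" basis: 1 at node k, 0 at every other node, linear in between
hat <- function(x, k) {
  e <- numeric(length(nodes))
  e[k] <- 1
  approx(nodes, e, xout = x)$y
}

# Only the interior nodes get a coefficient, so both ends stay at zero
H <- sapply(2:4, function(k) hat(maturity, k))
colnames(H) <- c("at3", "at5", "at7")
fit <- lm(yield_change ~ 0 + H)

# Fitted curve: linear interpolation through (node, fitted value) pairs
knot_values  <- c(0, unname(coef(fit)), 0)
fitted_curve <- function(x) approx(nodes, knot_values, xout = x)$y

plot(maturity, yield_change, pch = 16, col = "blue", xlim = c(0, 12))
lines(nodes, knot_values, col = "red")
```

Each coefficient is the fitted yield change at 3, 5 and 7 years, so the 5-year value is shared by the segments on either side of it by construction.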

Writing linear and exponential equations for fitted SMA model

I have fitted a standard major axis model to my data, and I need to add an equation to the plot, but I can't figure out what it should be.
My data looks like this:
> head(d1)
x y
1 5.379431 10.263158
2 5.863559 5.287081
3 10.367855 4.186603
4 11.459073 5.669856
5 14.477543 6.387560
6 16.713999 4.377990
My model gives the following:
library(smatr)
m1 <- sma(y~x, data = d1, log="xy")
#Coefficients:
# elevation slope
#estimate -0.23978063 0.8576100
#lower limit -0.54266508 0.6786154
#upper limit 0.06310381 1.0838170
#H0 : variables uncorrelated
#R-squared : 0.3228417
#P-value : 1.3667e-05
So based on this I have plotted the data and added an equation for the line that looks like this:
[Figure: plot of the data and the SMA fit]
My problem is that the regression equation y=-0.240+0.858x doesn't make sense with the trendline. I have also been asked to provide an exponential equation (e.g. y = ab^x) and I have no idea how to convert it.
Any help would be much appreciated!
Following the comments, the working formula is: y=e^(0.240+ 0.858*log10(x)).
Because log10(x) = ln(x)/ln(10), this is in the y=a*b^X format with X = ln(x), where:
a=e^0.240
b=e^(0.858/ln(10))
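As a cross-check, the coefficients can also be back-transformed directly. This sketch assumes that log="xy" in sma() applies a base-10 log to both variables, so that the fitted line is log10(y) = elevation + slope*log10(x); under that assumption the curve on the original scale is a power law, y = 10^elevation * x^slope, which uses the reported elevation of -0.240 and may therefore differ from the formula quoted above. The x range used for drawing is hypothetical.

```r
# Coefficients reported by sma() in the question
elevation <- -0.23978063
slope     <- 0.8576100

# Assumption: the fit is log10(y) = elevation + slope * log10(x)
a <- 10^elevation
sma_curve <- function(x) a * x^slope

# Hypothetical x range, used only to draw the curve
x_grid <- seq(5, 50, length.out = 100)
plot(x_grid, sma_curve(x_grid), type = "l", log = "xy",
     xlab = "x", ylab = "y")
```

On log-log axes this curve is a straight line with the reported slope, which is a quick way to see whether the equation matches the plotted trendline.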

R: How to read Nomograms to predict the desired variable

I am using Rstudio. I have created nomograms using function nomogram from package rms using following code (copied from the example code of the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram but there is no line connecting each scale (called isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have the isopleth for prediction (example from wikipedia). But here, how do I predict the variable value?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that predicting is supposed to be done without the isopleth, from the scale of points, but how? Can someone please elaborate, using this example, on how I can read the nomogram to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that nomogram function does not provide isopleth and has points scale instead.
From the documentation, the nomogram is used to manually obtain a prediction:
In the top part of the plot (above Total Points), you draw a vertical line from each of your patient's variable values (for example age=40, cholesterol=220 (and sex=male), blood.pressure=172) to the Points scale.
Then you sum the three values you read on the Points scale (40+60+3=103) to obtain the Total Points.
Finally, you draw a vertical line from the Total Points scale (103) to read the Risk of Death (0.55).
These are regression nomograms, and work in a different way to classic nomograms. A classic nomogram will perform a full calculation. For these nomograms you drop a line from each predictor to the scale at the bottom and add your results.
The only way to have a classic 'isopleth' nomogram working on a regression model would be (1) to have just two predictors or (2) to have a complex multi-step nomogram.
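The arithmetic behind the points scale can be illustrated in base R. This is a sketch of the general idea only, not the rms implementation: it uses a plain glm with two linear predictors on simulated data (all names and values here are hypothetical), rescales each predictor's contribution so that the widest one spans 0 to 100 points, sums the points, and maps the total back to a probability.

```r
# Hypothetical simulated data; not the data from the question
set.seed(1)
n   <- 500
age <- rnorm(n, 50, 10)
bp  <- rnorm(n, 120, 15)
y   <- rbinom(n, 1, plogis(-1 + 0.045 * (age - 50) + 0.01 * (bp - 120)))
fit <- glm(y ~ age + bp, family = binomial)
b   <- coef(fit)

# Lowest contribution each predictor makes over its observed range
min_age <- min(b[["age"]] * range(age))
min_bp  <- min(b[["bp"]]  * range(bp))

# Points per unit of linear predictor: the widest predictor spans 0..100
span        <- c(diff(range(b[["age"]] * range(age))),
                 diff(range(b[["bp"]]  * range(bp))))
pts_per_lp  <- 100 / max(span)

# "Read" the points for one patient and total them
patient <- data.frame(age = 40, bp = 150)
points  <- c(age = (b[["age"]] * patient$age - min_age) * pts_per_lp,
             bp  = (b[["bp"]]  * patient$bp  - min_bp)  * pts_per_lp)
total   <- sum(points)

# Map total points back to the linear predictor, then to a probability
baseline <- b[["(Intercept)"]] + min_age + min_bp
risk     <- plogis(baseline + total / pts_per_lp)
```

The value of `risk` agrees with `predict(fit, patient, type = "response")`, which is the sense in which adding up points on the nomogram reproduces the model's prediction.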

How to draw my function to plot with data in R

I have data about the response time of a web site as a function of the number of users hitting it at the same time.
For example:
10 users hitting at the same time have an (average) response time of 300 ms
20 users -> 450 ms, etc.
I imported the data into R and plotted the two columns (users, response time).
I also used the loess function to draw a smooth line through those points on the plot.
Here's the code that I have written:
users <- seq(5,250, by=5)
responseTime <- c(179.5,234.0,258.5,382.5,486.0,679.0,594.0,703.5,998.0,758.0,797.0,812.0,804.5,890.5,1148.5,1182.5,1298.0,1422.0,1413.5,1209.5,1488.0,1632.0,1715.0,1632.5,2046.5,1860.5,2910.0,2836.0,2851.5,3781.0,2725.0,3036.0,2862.0,3266.0,3175.0,3599.0,3563.0,3375.0,3110.0,2958.0,3407.0,3035.5,3040.0,3378.0,3493.0,3455.5,3268.0,3635.0,3453.0,3851.5)
data1 <- data.frame(users,responseTime)
data1
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
loess_fit <- loess(responseTime ~ users, data1)
lines(data1$users, predict(loess_fit), col = "green")
Here's my plot's image:
My questions are:
How can I draw my own nonlinear function on the same plot, to compare it with the other lines?
Example: response_time (f(x)) = 30*users^2.
Also, how can I make predictions from the loess line and from my function and show them on the plot? For example, if I have data up to 250 users, predict up to 500 users.
If you know the equation of the line that you want to draw, then just define a variable for your prediction:
predictedResponseTime <- 30 * users ^ 2
lines(users, predictedResponseTime)
If the problem is that you want to fit a line, then you need to call a modelling function.
Since loess is a non-parametric model, it isn't appropriate to use it to make predictions outside of the range of your data.
In this case, a simple (ordinary least squares) linear regression using lm provides a reasonable fit.
model <- lm(responseTime ~ users)
prediction <- data.frame(users = 1:500)
prediction$responseTime <- predict(model, prediction)
with(prediction, lines(users, responseTime))
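The caveat about loess can be checked directly. By default, predict() on a loess fit returns NA for inputs outside the range of the data; fitting with loess.control(surface = "direct") makes extrapolation possible, although the extrapolated values should be treated with caution. The sketch below uses simulated response times as a hypothetical stand-in for the real measurements, and widens xlim so that predictions up to 500 users are visible.

```r
users <- seq(5, 250, by = 5)
set.seed(42)
# Hypothetical response times, not the measurements from the question
responseTime <- 15 * users + rnorm(length(users), sd = 200)

new_users <- data.frame(users = c(100, 300, 500))

# Default loess: values beyond 250 users come back as NA
fit_default <- loess(responseTime ~ users)
p_default   <- predict(fit_default, new_users)

# surface = "direct" evaluates the local fit exactly, so it can extrapolate
fit_direct <- loess(responseTime ~ users,
                    control = loess.control(surface = "direct"))
p_direct   <- predict(fit_direct, new_users)

# Draw both the extrapolated loess curve and a straight-line fit to 500 users
grid_users <- data.frame(users = seq(5, 500, by = 5))
p_loess    <- predict(fit_direct, grid_users)
p_lm       <- predict(lm(responseTime ~ users), grid_users)
plot(users, responseTime, xlim = c(0, 500),
     ylim = range(responseTime, p_loess, p_lm),
     xlab = "Users", ylab = "Response Time (ms)")
lines(grid_users$users, p_loess, col = "green")
lines(grid_users$users, p_lm, col = "blue")
```

Without the wider xlim, anything drawn beyond 250 users would fall outside the plotting region and not be shown.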
Another way to plot your curve, if you know the underlying function, is the curve function.
In your example of f(x)=30x^2:
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
lines(data1$users, predict(loess_fit), col = "green")
curve(30*x^2,col="red", add=TRUE) #Don't forget the add parameter.

Plot "regression line" from multiple regression in R

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention just in case, the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.
To plot the individual terms in a linear or generalised linear model (ie, fit with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms="pretestScore")
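For a scatter-like display with the fitted line, termplot can also overlay partial residuals. The sketch below uses the built-in mtcars data as a stand-in (the model and variable names are only illustrative, not the asker's), and selects the term by position.

```r
# Illustrative multiple regression on a built-in dataset
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Plot the first term (wt): the line is its fitted contribution with the other
# predictors accounted for; points are partial residuals, with a standard
# error band around the line
termplot(fit, terms = 1, partial.resid = TRUE, se = TRUE)
```

The slope of the line drawn for a linear term is that term's coefficient from the full model, which is what distinguishes it from the simple-regression abline.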
You need to create a vector of x-values in the domain of your plot and predict their corresponding y-values from your model. To do this, you need to inject this vector into a dataframe comprised of variables that match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether or not the x-values you are predicting are actually legal given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax=10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# Build a data frame for prediction in which every other predictor is held at
# its mean and only the predictor we want to plot varies over the plot domain.
N <- 1e4
others <- c("pretestScore", "probDataRelated", "practiceAccuracy", "practiceNumTrials")
means <- colMeans(D[, others])
xv <- seq(xmin, xmax, length.out = N)
# as.list(means) becomes one constant column per predictor, recycled to N rows
dummyDF <- data.frame(probCategorySame = xv, as.list(means))
# Getting and plotting predictions over our dummy data.
yv <- predict(lmMultiple, newdata = dummyDF)
lines(xv, yv)
Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.
