I have the following example data and model:
# Simulate 10 ids, each observed at x = 0, 1, ..., 100
x <- rep(seq(0, 100, by = 1), 10)
y <- 15 + 2 * rnorm(1010, 10, 4) * x + rnorm(1010, 20, 100)
id <- rep(1:10, each = 101)
dtfr <- data.frame(x = x, y = y, id = id)
library(nlme)
# Random intercept and random slope for each id
model.mx <- lme(y ~ x, random = ~ 1 + x | id, data = dtfr, na.action = na.omit)
summary(model.mx)
# Population-level (level = 0) predictions
pd <- predict(model.mx, newdata = data.frame(x = 0:100), level = 0)
with(dtfr, plot(x, y))
lines(0:100, pd, col = "darkred", lwd = 7)
How can I extract the modelled intercept and slope of each individual ID and plot the individual trajectories of each ID?
Not sure what you want to do because all your coefficients are almost identical:
> coef(model.mx)
(Intercept) x
1 54.88302 19.18001
2 54.88298 19.18000
3 54.88299 19.18000
4 54.88299 19.18000
5 54.88302 19.18001
6 54.88300 19.18000
7 54.88301 19.18000
8 54.88300 19.18000
9 54.88299 19.18000
10 54.88300 19.18000
Maybe your real data gives more varied results. If so, I would use abline inside a mapply call:
with (dtfr, plot(x, y))
mapply(abline,a=coef(model.mx)[,1],b=coef(model.mx)[,2], col=1:10)
Since all coefficients are almost the same, the lines are plotted on top of each other, so in the resulting plot you only see the last one.
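To see distinct trajectories, the ids need to differ in the first place. The following is a minimal sketch under that assumption: the data frame sim and the per-id vectors int_id and slope_id are hypothetical and not part of the original example. It fits the same kind of model and draws one line per id from coef().

```r
library(nlme)

set.seed(42)
n_id <- 10
x  <- rep(0:100, times = n_id)
id <- rep(seq_len(n_id), each = 101)

# Assumption: one intercept and one slope per id, so trajectories truly differ
int_id   <- rnorm(n_id, mean = 15, sd = 50)
slope_id <- rnorm(n_id, mean = 20, sd = 8)
y <- int_id[id] + slope_id[id] * x + rnorm(length(x), mean = 0, sd = 100)
sim <- data.frame(x = x, y = y, id = id)

# Random intercept and slope per id, as in the question
fit <- lme(y ~ x, random = ~ 1 + x | id, data = sim)

# One row per id: fixed effect plus that id's random effect
cf <- coef(fit)

plot(y ~ x, data = sim, col = "grey70", pch = 16, cex = 0.4)
for (i in seq_len(nrow(cf))) {
  abline(a = cf[i, "(Intercept)"], b = cf[i, "x"], col = i, lwd = 2)
}
```

In the original simulation the slope noise is drawn per observation rather than per id, which is why coef() returns nearly identical rows there.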
I'm trying to calculate point slopes from a series of x,y data. Because some of the x data repeats (...8, 12, 12, 16...), there will be a division-by-zero issue when using slope = (y2-y1)/(x2-x1).
My solution is to fit a polynomial regression to the data, then plug in a new set of x values (xx) that increase monotonically between the limits of x. This eliminates the problem of equal x data points. As a result, (x) and (xx) have the same limits, but (xx) is always longer.
The problem I am having is that the fitted values for xx are limited to the length of x. When I use the polynomial equation with an (xx) of length 20, the fitted yy gives values for the first 10 points and then NA for the next 10. What is wrong here?
x <- c(1,2,2,5,8,12,12,16,17,20)
y <- c(2,4,5,6,8,11,12,15,16,20)
df <- data.frame(x,y)
my_mod <- lm(y ~ poly(x,2,raw=TRUE), data=df) # This creates the polynomial equation
xx <- x[1]:x[length(x)] # Creates monotonically increasing x using boundaries of original x
yy <- fitted(my_mod)[order(xx)]
plot(x,y)
lines(xx,yy)
If you look at
fitted(my_mod)
It outputs:
# 1 2 3 4 5 6 7 8 9 10
#3.241032 3.846112 3.846112 5.831986 8.073808 11.461047 11.461047 15.303305 16.334967 19.600584
The names of the output correspond to the positions in x, not the values of x, so fitted(my_mod)[order(xx)] doesn't quite make sense: fitted() only ever returns one value per row of the original data.
You want to use predict here:
yy <- predict(my_mod, newdata = data.frame(x = xx))
plot(xx, yy)
# 1 2 3 4 5 6 7 8 9 10
# 3.241032 3.846112 4.479631 5.141589 5.831986 6.550821 7.298095 8.073808 8.877959 9.710550
# 11 12 13 14 15 16 17 18 19 20
# 10.571579 11.461047 12.378953 13.325299 14.300083 15.303305 16.334967 17.395067 18.483606 19.600584
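Since the original goal was point slopes, one option is to differentiate the fitted quadratic directly instead of differencing predicted values. This is only a sketch: the finer grid (by = 0.5) and the names b and slope are assumptions for illustration.

```r
x <- c(1, 2, 2, 5, 8, 12, 12, 16, 17, 20)
y <- c(2, 4, 5, 6, 8, 11, 12, 15, 16, 20)
df <- data.frame(x, y)
my_mod <- lm(y ~ poly(x, 2, raw = TRUE), data = df)

# Any grid works with predict(); it need not match the length of x
xx <- seq(min(x), max(x), by = 0.5)
yy <- predict(my_mod, newdata = data.frame(x = xx))

# With raw = TRUE the coefficients are b0, b1, b2 of b0 + b1*x + b2*x^2,
# so the slope at each point is the derivative b1 + 2*b2*x
b <- unname(coef(my_mod))
slope <- b[2] + 2 * b[3] * xx

plot(x, y)
lines(xx, yy)
```

Note that this only works this simply because raw = TRUE was used; with orthogonal polynomials the coefficients do not map to powers of x in this way.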
I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient way to do it in R. Take, for example, this fake dataframe and regression model. I attempt to assign predictions in the source dataframe, but because of one missing Y value I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(testy)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y<-B0+ B1*df$x, or by pulling the coefficients out of the model, df$y<-(summary(model)$coefficients[1, 1]) + (summary(model)$coefficients[2, 1]*(df$x)); however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
R has built-in functionality for this, although it is not obvious: the na.action argument, set to na.exclude (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself as the first argument (here, model), with or without new data. Without new data, it only predicts on the training data, which excludes your NA row, so you need this workaround to fit the result into the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specify some new data. Since the new x has one more row than the training x, the missing row is filled with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)
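The two answers give different results for the row with the missing y, which a side-by-side sketch makes visible. The column names fit_in_sample and fit_all are made up for illustration, and the data uses y = 100 * x so the expected prediction is easy to check.

```r
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

# na.exclude drops row 5 for fitting but remembers where it was
fit <- lm(y ~ x, data = df, na.action = na.exclude)

# In-sample predictions: padded back to 10 rows, NA where y was missing
df$fit_in_sample <- predict(fit)

# Predictions from newdata: only x is needed, so row 5 gets a value
df$fit_all <- predict(fit, newdata = df)

df[4:6, ]
```

Which one is appropriate depends on whether the row with missing y should receive a prediction or stay NA.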
I ran a multilevel regression and now have a coefficient matrix consisting of value + standard error for each group (=a factor variable) in the regression, e.g. my matrix (for intercept + one variable called Beta1) looks like this:
Group Intercept Beta1 Intercept.se Beta1.se
11 0.044357458 0.4381340 0.08358735 0.1572632
12 -0.007072542 0.1242737 0.09317142 0.1643544
21 0.021075871 0.3727055 0.12050036 0.2459456
22 0.023895981 0.6786013 0.11207848 0.3188887
31 -0.115713481 0.3547718 0.09760681 0.1454787
32 -0.004081244 -0.1954594 0.09993201 0.1953406
What I would like to achieve is to draw a diagram showing possible regression lines for each group. I came up with the following code which produces 15 lines for each group (coef.mtx is the matrix mentioned above):
for (i in 1:6) { # we have 6 groups
x = coef.mtx[i,]
lines[(i*6-5):(i*6),] =
list(Group = replicate(15, x["Group"]),
int = replicate(15, rnorm(1,x["Intercept"],x["Intercept.se"])),
slo = replicate(15, rnorm(1, x["Beta1"], x["Beta1.se"])))
}
This produces a dataframe like this:
Group int slo
1 11 0.09484568 0.3005997
2 11 0.12364749 0.5758899
3 11 -0.02942938 0.4821841
4 11 0.17226587 0.2413752
5 11 0.02923023 0.4251419
6 11 0.14650632 0.4541752
7 12 0.06784996 0.0356669
8 12 -0.02832304 0.2214471
...
And then I can draw those lines with ggplot like this:
ggplot(myData, aes(x=Beta1, y=Outcome)) +
geom_jitter() +
facet_wrap(~ Group) +
geom_abline(aes(intercept=int, slope=slo), data=lines)
The final result (plot not shown) has one panel per group, with the sampled lines drawn over the jittered points.
Is there a better way to transform the coefficient matrix instead of using this loop? I was unable to think of a better way... Alternatively: how would you visualize possible regression lines (and not just the point-estimate)?
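One possible way to avoid the loop is to repeat each group's row index and let rnorm() vectorize over the means and standard errors. This is a sketch, not a definitive answer: it assumes the coefficients are held in a data frame rather than a matrix, the names n_draws, idx and lines_df are hypothetical, and, like the loop, it samples intercept and slope independently and ignores any correlation between them.

```r
# Coefficient table from the question, here stored as a data frame (assumption)
coef.mtx <- data.frame(
  Group        = c(11, 12, 21, 22, 31, 32),
  Intercept    = c(0.044357458, -0.007072542, 0.021075871,
                   0.023895981, -0.115713481, -0.004081244),
  Beta1        = c(0.4381340, 0.1242737, 0.3727055,
                   0.6786013, 0.3547718, -0.1954594),
  Intercept.se = c(0.08358735, 0.09317142, 0.12050036,
                   0.11207848, 0.09760681, 0.09993201),
  Beta1.se     = c(0.1572632, 0.1643544, 0.2459456,
                   0.3188887, 0.1454787, 0.1953406)
)

n_draws <- 15
set.seed(1)

# Row index of coef.mtx repeated n_draws times per group
idx <- rep(seq_len(nrow(coef.mtx)), each = n_draws)

# rnorm() recycles over vectors of means and sds, so no loop is needed
lines_df <- data.frame(
  Group = coef.mtx$Group[idx],
  int   = rnorm(length(idx), coef.mtx$Intercept[idx], coef.mtx$Intercept.se[idx]),
  slo   = rnorm(length(idx), coef.mtx$Beta1[idx], coef.mtx$Beta1.se[idx])
)
```

The result can be passed to geom_abline(..., data = lines_df) in the same way; naming it lines_df also avoids masking the base graphics function lines().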
This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get a centile curve from the following data and code:
age = sample(5:15, 500, replace=TRUE)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
library(gamlss)
h = lms(yvar, age , data=mydata, n.cyc=30)
centiles(h,xvar=mydata$age, cent=c(90), points=FALSE)
How can I now get the value of yvar on the curve for each x value (5:15), representing the 90th percentile of the data after smoothing?
I read the help pages and found fitted(h) and fv(h), which give fitted values for the entire data set. But how do I get the value of the 90th centile curve at each age level? Thanks for your help.
Edit: The following figure shows what I need:
I tried the following, but it is not correct, since the values are wrong:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
These values are very different from the 90th quantile computed directly from the data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269
See if this makes sense. The 90th percentile of a normal distribution with mean 'smn' and sd 'ssd' is qnorm(.9, smn, ssd). So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. And you may want to set the ylim so that the full range can be appreciated.)
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just a mean estimate with a fixed variance (the ~1) across the range of the xvar, and there are no complications from higher-order parameters as in a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)
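The reasoning behind qnorm(.9, mean, sd) can be checked on simulated data without gamlss. This is only an illustration under the assumption of a plain Normal model; mu, sigma and z are made-up names, and the values 10 and 20 are arbitrary.

```r
set.seed(1)
mu    <- 10   # assumed fitted mean at some age
sigma <- 20   # assumed constant standard deviation

# For a Normal model, the 90th centile is mean + sd * (standard normal quantile)
q_theory <- qnorm(0.9, mean = mu, sd = sigma)

# Empirical check on a large simulated sample
z <- rnorm(1e5, mean = mu, sd = sigma)
q_sample <- quantile(z, 0.9)

c(theory = q_theory, sample = unname(q_sample))
```

This also shows why taking the 90th quantile of fitted(h) within each age cannot work: the fitted values are the estimated means, so their quantile says nothing about the spread around the mean.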
I am working with a data frame I created and want to expand it to include a lowess fit. I have been able to add a lowess curve to the plot of my data along with an lm fit but I cannot figure out how to add the lowess values to my data frame.
Please forgive the ugliness of my code as I do everything by brute force (i.e. suggestions for simplification/efficiency are appreciated.) I apologize as I am not allowed to post images. I have a scatterplot with an lm fit yielding an R^2 of .7897 and a lowess curve which well replicates the qqplot of the lm fit. When I get a reputation of "10" I will post it for edification/ease of visualization:
##Read in Data
OPM.df <- read.csv("On Peak Mod TMAX.csv", header = TRUE)
## Data frame 2008-2012
OPM5.df <- OPM.df[4606:6140,]
##Verify headings
OPM5.df[1,]
SummerOPM5.df <- OPM5.df[month(OPM5.df$Date) >= 6 & month(OPM5.df$Date) <= 9, ]
###Fit Linear Regression to Data
fitsummerX <- lm(SummerOPM5.df$MaxLoad~SummerOPM5.df$TMAX)
summary(fitsummerX)
##Plot data
windows()
plot(SummerOPM5.df$TMAX, SummerOPM5.df$MaxLoad, main="Linear Regression Adjusted R-squared: 0.7897")
## Add fit lines
abline(fitsummerX, col="red") # regression line (y~x)
lines(lowess(SummerOPM5.df$TMAX,SummerOPM5.df$MaxLoad), col="blue") # lowess line (x,y)
## plyr augmentation of df
SummerOPM5.df <- ddply(SummerOPM5.df, .(Date, MaxLoad, TMAX, OnPeakTotal), transform,
Lowess = (lowess(TMAX,MaxLoad)$y))
##Verify headings and values
SummerOPM5.df [1:5,]
This gives the following:
Date MaxLoad TMAX OnPeakTotal Lowess
1 2008-06-02 2880 214.0 43307 2880
2 2008-06-03 2860 197.0 43166 2860
3 2008-06-04 2787 172.5 42088 2787
4 2008-06-05 2902 216.5 43333 2902
5 2008-06-06 3078 275.0 45325 3078
The Lowess values are the same as the MaxLoad values.
If I run the lowess separately:
Lowess = (lowess(SummerOPM5.df$TMAX,SummerOPM5.df$MaxLoad))
Lowess$y[1:5]
I get this:
[1] 2522.221 2569.523 2603.625 2622.795 2622.795
Here the vector is in rank order.
I am unsure whether it otherwise preserves the sequence, so I am unsure how to match these values to the corresponding rows of my data frame.
Suggestions?
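One way to line the smoothed values up with the rows, sketched on made-up data: lowess() returns its result sorted by x, so the same ordering can be used to put the values back. The data frame d and its simulated columns are assumptions standing in for SummerOPM5.df.

```r
set.seed(1)
# Hypothetical stand-in for the real data
d <- data.frame(TMAX = runif(50, min = 150, max = 300))
d$MaxLoad <- 2000 + 4 * d$TMAX + rnorm(50, mean = 0, sd = 100)

# lowess() returns x sorted in increasing order, with y in that same order
lw <- lowess(d$TMAX, d$MaxLoad)

# order() gives the row positions that produce that sorted sequence,
# so assigning through it restores the original row order
ord <- order(d$TMAX)
d$Lowess <- NA_real_
d$Lowess[ord] <- lw$y

head(d)
```

The ddply call in the question groups by Date, MaxLoad, TMAX and OnPeakTotal, so each group presumably contains a single row, and a lowess fit through one point just returns that point. That would explain why Lowess equals MaxLoad there.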