R survey confidence interval plots [closed]

Similar to barplot and dotchart (from the survey package), barNest (from the plotrix package) was meant to produce plots for svyby objects on the fly, but it also plotted confidence intervals. However, barNest.svymean no longer works on survey data. An alternative would be to plot confidence intervals on top of the survey plotting function dotchart:
library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
#just one variable
a<-svyby(~api99, ~stype, dclus1, svymean)
#several variables
b<-svyby(~api99+api00, ~stype, dclus1, svymean)
dotchart(b)
although I'm not sure how you'd do that. If anyone works this out, it would be really good to automate it (by writing code that applies to svyby objects of different sizes) and maybe even incorporate it into dotchart.svystat {survey}. It would make graphical comparison among groups much easier! The standard errors can be extracted from b or obtained with SE(b).

Right, so you're trying to use an object class (svyby) in a function (barNest) that doesn't know how to handle that class, because the survey package and the plotrix package don't play together too nicely. Luckily the dotchart method for svyby objects isn't much code, so you might just want to modify it:
# run your code above, then review the dotchart method for svyby objects:
getS3method( 'dotchart' , 'svyby' )
...and from that you can see it's really not much beyond calling the original dotchart function (that is, not using the svyby object, just a regular collection of statistics) after converting the data contained in your b object to a matrix. Now all you have left to do is add the confidence interval lines.
The confidence interval limits are easily obtained (more easily than from SE(b)) by running
confint( b )
Can you extract those statistics to build your own barNest or plotCI call?
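For example, here is a minimal sketch (not the survey package's own plotting method) that feeds the point estimates from coef(b) and the limits from confint(b) into plotrix::plotCI; the axis handling is only illustrative:
library(plotrix)
est <- coef(b)          # point estimates for every group/statistic
ci  <- confint(b)       # matching lower and upper limits
plotCI(x = seq_along(est), y = est,
       li = ci[, 1], ui = ci[, 2],
       xaxt = "n", xlab = "", ylab = "survey mean")
axis(1, at = seq_along(est), labels = names(est), las = 2)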
If it's important to put confidence intervals on a dotchart, the major hurdle is hitting the y coordinates correctly. Dig around in the dotchart default method:
getS3method( 'dotchart' , 'default' )
...and you can see how the y coordinates are calculated. Whittled down to just the essentials, I think you can use this:
# calculate the distinct groups within the `svyby` object
groups <- as.numeric( as.factor( attr( b , 'row.names' ) ) )
# calculate the distinct statistics within the `svyby` object
nstats <- attr( b , 'svyby' )$nstats
# calculate the total number of confidence intervals you need to add
n <- length( groups ) * nstats
# calculate the offset sizes
offset <- cumsum(c(0, diff(groups) != 0))
# find the exact y coordinates for each dot in the dotchart
# and leave two spaces between each group
y <- 1L:n + sort( rep( 2 * offset , nstats ) )
# find the confidence interval positions
ci.pos <- rep( groups , each = nstats ) + c( 0 , length( groups ) )
# extract the confidence intervals
x <- confint( b )[ ci.pos , ]
# add the y coordinates to a new line data object
ld <- data.frame( x )
# loop through each dot in the dotchart..
for ( i in seq_len( nrow( ld ) ) ){
    # add the CI lines to the current plot
    lines( ld[ i , 1:2 ] , rep( y[i] , 2 ) )
}
But that's obviously clunky, since the confidence intervals are allowed to run way off the screen. Ignoring the svyby class and even the whole survey package for a second, find us an implementation of dotchart that formats confidence intervals nicely, and we may be able to help you more. I don't think the survey package is the root of your problem :)

Adding a new dotchart plot (with min and max) to Anthony's last bit (from ld <- data.frame(x)) solves the problem he outlined.
ld <- data.frame( x )
dotchart( b, xlim = c( min( ld ), max( ld ) ) )  # <- added
for ( i in seq_len( nrow( ld ) ) ){
    lines( ld[ i , 1:2 ] , rep( y[i] , 2 ) )
}
However, I agree with Anthony: the plot doesn't look great. Many thanks to Anthony for sharing his knowledge and programming skills. The confidence intervals also look asymmetrical (which might be correct), particularly for M api00. Has anyone compared this with other software? Should confint specify a df (degrees of freedom)?
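As for the df question, a hedged sketch (assuming your version of the survey package's confint method for svystat objects accepts a df argument): t-based intervals can be requested with the design degrees of freedom from degf() and compared with the default normal-based ones:
m <- svymean(~api99 + api00, dclus1)
confint(m)                      # default normal-based intervals
confint(m, df = degf(dclus1))   # t-based intervals using the design df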

Related

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have been, but I could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from, e.g., the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or from a logistic probability function (if I want a sigmoid, increasing probability from 1 through 10)?
The only "solution" I can come up with is to first draw a large number of values (say 1e6) from the chosen distribution, e.g. the sigmoid one, and then rescale their min and max to 1 and 10, but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it's very "correct") would go like this:
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
    (((vector - min(vector)) * (max - min)) / (max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there better solutions around?
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can specify an increasing function (e.g. the logit() defined below, which is really the logistic sigmoid) and pass the vector you want to sample from, v, as its input. You can then use the function's output as the vector of probability weights p. See the code below.
logit <- function(x) {
    return(exp(x)/(exp(x)+1))
}
v <- seq(1, 10, 1)
p <- logit(v)
sample(v, 1, prob = p, replace = TRUE)
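If you instead want smaller values to be much more likely (as in the exponential example in the question), the same mechanism works with decreasing weights; a sketch using dexp() for the weights, which sample() normalises internally:
p_exp <- dexp(v, rate = 0.5)                          # decreasing weights over 1:10
sample(v, 1, prob = p_exp)                            # one draw, 1 far likelier than 10
table(sample(v, 1e4, replace = TRUE, prob = p_exp))   # rough check of the shape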

'Non-conformable arguments' in R code [closed]

I previously wrote an R function, "LeastSquaresDegreeN.R", that computes a least-squares polynomial of arbitrary order to fit whatever data I put into it. The code works, since I can reproduce results I got previously. However, when I try to put new data into it I get a "non-conformable arguments" error:
"Error in Conj(t(Q))%*%t(b) : non-conformable arguments"
An extremely simple example of data that should work:
t <- seq(1,100,1)
fifthDegree <- t^5
LeastSquaresDegreeN(t,fifthDegree,5)
This should output and plot a polynomial f(t) = t^5 (up to rounding errors).
However, I get the "non-conformable arguments" error even if I explicitly make these vectors:
t <- as.vector(t)
fifthDegree <- as.vector(fifthDegree)
LeastSquaresDegreeN(t,fifthDegree,5)
I've tried putting in the transpose of these vectors too - but nothing works.
Surely the solution is really simple. Help!? Thank you!
Here's the function:
LeastSquaresDegreeN <- function(t, b, deg)
{
    # Usage: t is the independent variable vector, b is the function data,
    # i.e., b = f(t); deg is the desired polynomial order.
    # deg <- deg + 1 is a little adjustment to make the R loops index correctly.
    deg <- deg + 1
    t <- t(t)
    dataSize <- length(b)
    # mat.or.vec is a built-in R function that creates a zero matrix
    # or zero vector of arbitrary size
    A <- mat.or.vec(dataSize, deg)
    # Given basis phi(z) = 1 + z + z^2 + z^3 + ...
    # Define matrix A
    for (i in 0:deg-1) {
        A[1:dataSize, i+1] = t^i
    }
    # Compute QR decomposition of A. Pull Q and R out of QRdecomp
    QRdecomp <- qr(A)
    Q <- qr.Q(QRdecomp, complete=TRUE)
    R <- qr.R(QRdecomp, complete=TRUE)
    # Perform Q^* b^T (conjugate transpose of Q)
    c <- Conj(t(Q)) %*% t(b)
    # Find x. R isn't square - so we have to use qr.solve
    x <- qr.solve(R, c)
    # Create xPlot (which is general enough to plot any degree polynomial output)
    xPlot = x[1,1]
    for (i in 1:deg-1){
        xPlot = xPlot + x[i+1,1]*t^i
    }
    # Now plot it. Least squares "l" plot first, then the points in red.
    plot(t, xPlot, type='l', xlab="independent variable t",
         ylab="function values f(t)",
         main="Data Plotted with Nth Degree Least Squares Polynomial", col="blue")
    points(t, b, col="red")
} # End
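For what it's worth, a minimal sketch of the dimension mismatch that the error message describes: when b is a plain numeric vector, t(b) is a 1 x n row matrix, so the n x n matrix Conj(t(Q)) cannot be multiplied by it, while b on its own is treated as an n x 1 column:
Q <- diag(3)
b <- c(1, 2, 3)
dim(t(b))               # 1 3 -- a row matrix
# Conj(t(Q)) %*% t(b)   # fails: non-conformable arguments
Conj(t(Q)) %*% b        # works: b is used as a 3 x 1 column vector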

Plot "regression line" from multiple regression in R

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention it just in case: the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.
To plot the individual terms in a linear or generalised linear model (i.e., one fitted with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms = "pretestScore")
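termplot can also draw standard-error bands and overlay partial residuals via its se and partial.resid arguments, for example:
termplot(lmMultiple, terms = "pretestScore", se = TRUE, partial.resid = TRUE)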
You need to create a vector of x-values in the domain of your plot and predict their corresponding y-values from your model. To do this, you need to inject this vector into a dataframe comprised of variables that match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether or not the x-values you are predicting are actually legal given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax=10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# create a dummy dataframe where all variables = their mean value for each record
# except the variable we want to plot, which will vary incrementally over the
# domain of the plot. We need this object to get the predicted values we
# want to plot.
N <- 1e4
means <- colMeans(D, na.rm = TRUE)
# build a data frame (not a matrix) of the means, repeated N times,
# so that the $<- assignment below works
dummyDF <- as.data.frame(as.list(means))[rep(1, N), ]
xv <- seq(xmin, xmax, length.out = N)
# vary the predictor that actually appears in the model
dummyDF$probCategorySame <- xv
# Getting and plotting predictions over our dummy data.
yv <- predict(lmMultiple, newdata = dummyDF)
lines(xv, yv)
Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.

grouped loess scores

I've got this data.frame: http://sprunge.us/TMGS, and I'd like to calculate the loess of Intermediate.MAP.Score ~ x so that I get one curve for the whole dataset, but every group (split by name) should have the same weight as every other group. I'm not sure what happens if I just call loess over the whole data.frame. Do I need to call it once per group and then combine the fits? If yes, how do I do that?
If you want to average over all of the values in 'loess.fits' produced in my earlier answer (below), you will get one answer. If you want to just get a loess fit on the entire dataset (which would not meet your "equal weighting" spec, at least as I interpret that phrase), you will get another answer.
This would produce averaged 'yhat' values at the 51 equally spaced values of 'x' in the range [0, 1]. Because of missing values it may not be exactly "equally weighted", but only at the extremes; the estimates are dense elsewhere:
apply( as.data.frame(loess.fits), 1, mean, na.rm=TRUE)
Earlier answer:
I would have titled the question "loess scores split by group":
plot(dat$x, dat$Intermediate.MAP.Score, col=as.numeric(factor(dat$name)) )
If you proceed with loess(Intermediate.MAP.Score ~ x, data=dat) you will get an overall average with no distinction among groups. And loess doesn't accept factor or character arguments in its formula, so you need to split by 'name' and calculate the fits separately. The other gotcha to avoid is plotting on the default limits, which will be driven by the varying data ranges:
loess.fits <- lapply(split(dat, dat$name), function(xdf) {
    list(yhat = predict(
        loess(Intermediate.MAP.Score ~ x,
              data = xdf[complete.cases(xdf[, c("Intermediate.MAP.Score", "x")]), ]),
        newdata = data.frame(x = seq(0, 1, by = 0.02))))
})
plot(dat$x, dat$Intermediate.MAP.Score,
     col = as.numeric(factor(dat$name)),
     ylim = c(0.2, 1))
# overlay each group's predictions so the plots can be compared
lapply(loess.fits, function(xdf) {
    par(new = TRUE)
    plot(x = seq(0, 1, by = 0.02), y = xdf$yhat,
         ylab = "", xlab = "",
         ylim = c(0.2, 1), axes = FALSE)
})
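To get back to the equal-weight goal, a sketch that averages the per-group predictions in loess.fits (the same averaging as above) and overlays the result on the plot just drawn:
xgrid <- seq(0, 1, by = 0.02)
avg.yhat <- apply(as.data.frame(loess.fits), 1, mean, na.rm = TRUE)
lines(xgrid, avg.yhat, lwd = 2)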

Root mean square deviation on binned GAM results using R

Background
A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:
cor( rank(x), rank(y) )
Also in R, a naïve calculation of a fitted generalized additive model (GAM):
data.frame( x, fitted( gam( y ~ s(x) ) ) )
Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.
Problem
The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
Possible Solution
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
Questions
Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?
Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?
Attempted Solution 1
Call the PL/R function using the observed amounts and the model (GAM) amounts:
correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
    o double precision[], m double precision[])
RETURNS double precision AS
$BODY$
    sqrt( mean( o - m ) ^ 2 )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 100;
The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:
omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )
Then correct the RMSE calculation as:
sqrt( mean( (omean - mmean)^2 ) )
How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for an arbitrary length vector in an appropriate number of bins (5, for example, might not be ideal for only 67 measurements)?
I don't think hist is suitable here, is it?
Attempted Solution 2
The following code will solve the problem, however it drops data points from the end of the list (to make the list divisible by 5). The solution isn't ideal as the number "5" is rather magical.
while( length(o) %% 5 != 0 ) {
    o <- o[-length(o)]
}
omean <- apply( matrix(o, 5), 2, mean )
What other options are available?
Thanks in advance.
You say that:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
You could calculate the correlation between the fitted values and the measured values:
cor(y,fitted(gam(y ~ s(x))))
I don't see why you want to bin your data, but you could do it as follows:
mean.binned <- function(y, n = 5){
    apply(matrix(c(y, rep(NA, (n - (length(y) %% n)) %% n)), n),
          2,
          function(x) mean(x, na.rm = TRUE))
}
It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
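A hedged usage sketch, assuming o and m are the observed and modelled vectors from the question: bin both series with mean.binned and take the RMSE of the binned means (the squaring goes inside the mean):
omean <- mean.binned(o, 5)
mmean <- mean.binned(m, 5)
sqrt(mean((omean - mmean)^2))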
You also say that:
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
I don't understand what you mean by this. The correlation is a factor in determining the mean squared error - for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
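To make that point concrete, a small sketch (again assuming o and m are the observed and modelled vectors): the mean squared error splits exactly into a bias term, the two spreads, and a covariance (hence correlation) term, which is the kind of relationship the Murphy reference formalises:
mse <- mean((o - m)^2)
decomp <- (mean(o) - mean(m))^2 +
          mean((o - mean(o))^2) + mean((m - mean(m))^2) -
          2 * mean((o - mean(o)) * (m - mean(m)))
all.equal(mse, decomp)   # TRUE up to floating-point error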
