obtaining components of plot from rms calibrate - r

Are the values for the plot available from rms calibrate? From ?calibrate:
library("rms")
set.seed(1)
n <- 200
d.time <- rexp(n)
x1 <- runif(n)
x2 <- factor(sample(c('a', 'b', 'c'), n, TRUE))
f <- cph(Surv(d.time) ~ pol(x1,2) * x2, x=TRUE, y=TRUE, surv=TRUE,time.inc=1.5)
cal <- calibrate(f, u=1.5, cmethod='KM', m=50, B=20)
plot(cal)
cal
I would like to obtain the values that make up this plot. The y-values can be seen on the screen, but it is not clear to me how to obtain the x and y values. Thanks.

Related

How do I add a conditional column based on an inequality with matrix multiplication (in R)?

I have a data frame of values and want to add a column based on an inequality condition that involves matrix multiplication.
The data frame looks like this
# Set possible values for variables
b0 <- b1 <- b2 <- seq(0, 2, by=0.1)
# Create data frame of all the different combos of these variables
df <- setNames(expand.grid(b0, b1, b2), c("b0", "b1", "b2"))
There are a lot of precursor objects I have to define before adding this column:
##### Set n
n = 100
#### Generate (x1i, x2i)
# Install and load the 'MASS' package
#install.packages("MASS")
library("MASS")
# Input univariate parameters
rho <- 0.5
mu1 <- 0; s1 <- 1
mu2 <- 0; s2 <- 1
# Generate parameters for bivariate normal distribution
mu <- c(mu1, mu2)
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), nrow=2, ncol=2)
# Generate draws from bivariate normal distribution
bvn <- mvrnorm(n, mu=mu, Sigma=sigma ) # from MASS package
x1 <- bvn[, 1]
x2 <- bvn[, 2]
##### Generate error
error <- rnorm(n)
##### Generate dependent variable
y <- 0.5 + x1 + x2 + error
##### Create the model
lm <- lm(y ~ x1 + x2)
# Setup parameters
n <- 100
K <- 3
c <- qf(.95, K, n - K)
# Define necessary objects
sigma_hat_sq <- 1
b0_hat <- summary(lm)$coefficients[1, 1]
b1_hat <- summary(lm)$coefficients[2, 1]
b2_hat <- summary(lm)$coefficients[3, 1]
x <- cbind(1, x1, x2)
I am trying to add this conditional column like this:
# Add a column to the data frame that says whether the condition holds
df <- transform(df, ueq = (
(1/(K*sigma_hat_sq))*
t(matrix(c(b0_hat-b0, b1_hat-b1, b2_hat-b2)))%*%
t(x)%*%x%*%
matrix(c(b0_hat-b0, b1_hat-b1, b2_hat-b2))
<= c
))
...but doing so generates the error message
Error in t(matrix(c(b0_hat - b0, b1_hat - b1, b2_hat - b2))) %*% t(x) :
non-conformable arguments
Mathematically, the condition is [1/(K*sigmahat^2)](Bhat-B)'X'X(Bhat-B) <= c, where, for each triple (b0,b1,b2), (Bhat-B) is a 3x1 matrix with elements {B0hat-b0, B1hat-b1, B2hat-b2}. I'm just not sure how to write this condition in R.
Any help would be greatly appreciated!
To work with only one row of df at a time (and get a separate answer for each 1 x 3 matrix), you need a loop.
A simple way to do this in R is mapply.
df <- transform(df, ueq = mapply(\(b0, b1, b2)
(1/(K*sigma_hat_sq)) *
t(c(b0_hat-b0, b1_hat-b1, b2_hat-b2)) %*%
t(x) %*% x %*%
c(b0_hat-b0, b1_hat-b1, b2_hat-b2)
<= c,
b0 = b0, b1 = b1, b2 = b2
))
This leads to 91 TRUE rows.
sum(df$ueq)
[1] 91
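If the grid gets large, the same condition can also be evaluated without a row-by-row loop. The following is only a sketch under assumptions: it regenerates its own data with base R instead of MASS::mvrnorm (so the count of TRUE rows will not match the 91 above), and the names D, XtX, q and crit are made up for illustration. The idea is that the quadratic form d'X'Xd for every row d of a matrix D equals rowSums((D %*% X'X) * D).

```r
set.seed(1)
n <- 100
K <- 3

# Stand-in data: two correlated predictors (rho = 0.5) built with base R only
x1 <- rnorm(n)
x2 <- 0.5 * x1 + sqrt(0.75) * rnorm(n)
y  <- 0.5 + x1 + x2 + rnorm(n)
X  <- cbind(1, x1, x2)

b_hat        <- coef(lm(y ~ x1 + x2))
crit         <- qf(0.95, K, n - K)   # named crit to avoid masking c()
sigma_hat_sq <- 1

grid <- seq(0, 2, by = 0.1)
df   <- setNames(expand.grid(grid, grid, grid), c("b0", "b1", "b2"))

# Row i of D holds b_hat - (b0, b1, b2) for row i of df
D   <- sweep(-as.matrix(df[, c("b0", "b1", "b2")]), 2, b_hat, "+")
XtX <- crossprod(X)

# All quadratic forms d' X'X d at once, one per row of D
q <- rowSums((D %*% XtX) * D) / (K * sigma_hat_sq)
df$ueq <- q <= crit
```

This trades the explicit loop for one matrix product, which tends to matter once the grid has many more than the 9261 rows used here.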

How to estimate SAR spatial model without row-normalizing the matrix?

I am trying to estimate a SAR spatial model without row-normalizing the matrix. For some reason, when I do not row-normalize, the command does not return the correct estimates. Am I missing something in the command options?
Here is an example of what I mean.
If I run the following code, simulating data with a row-normalized matrix, lagsarlm returns the correct estimates:
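library(sna)        # rgraph()
library(spdep)      # mat2listw(), nb2listw(), nb2mat()
library(spatialreg) # lagsarlm() (older spdep releases exported it directly)
set.seed(20100817)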
rho <- .5
B <- c(2, 5)
e <- as.matrix(rnorm(100, sd=2))
X0 <- matrix(1, ncol=1, nrow=100) # create Intercept
X1 <- matrix(runif(100, min = -10, max = 10), nrow=100) # generate covariate
Xbe <- X0*B[1]+X1*B[2]+e
I <- diag(100)
W <- rgraph(100, m=1, tprob=0.1, mode="graph", diag=FALSE) #assume I need to start with a matrix of relationships
spatialList <- mat2listw(W)
nb7rt <- spatialList$neighbours
listw <- nb2listw(nb7rt)
W <- nb2mat(nb7rt)
y <- solve(I - rho*W) %*% Xbe
model <- lagsarlm(y ~ X1, listw=listw)
summary(model)
However, if I try to do exactly the same but without row-normalizing, the results are incorrect:
set.seed(20100817)
rho <- .5
B <- c(2, 5)
e <- as.matrix(rnorm(100, sd=2))
X0 <- matrix(1, ncol=1, nrow=100) # create Intercept
X1 <- matrix(runif(100, min = -10, max = 10), nrow=100) # generate covariate
Xbe <- X0*B[1]+X1*B[2]+e
I <- diag(100)
W <- rgraph(100, m=1, tprob=0.1, mode="graph", diag=FALSE) #assume I need to start with a matrix of relationships
spatialList <- mat2listw(W, style ="B")
nb7rt <- spatialList$neighbours
listw <- nb2listw(nb7rt, style="B")
W <- nb2mat(nb7rt, style="B")
y <- solve(I - rho*W) %*% Xbe
model <- lagsarlm(y ~ X1, listw=listw)
summary(model)
The base for this code can be found here https://stat.ethz.ch/pipermail/r-sig-geo/2010-August/009023.html.
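One thing that may be worth checking (this is an assumption about the cause, not something confirmed in the thread) is whether rho = 0.5 is even admissible for a binary weights matrix. In SAR models the autoregressive parameter is usually restricted to the interval (1/lambda_min, 1/lambda_max), where the lambdas are the eigenvalues of W. For a row-normalized W the upper bound is 1, but for a 0/1 matrix with about ten neighbours per unit it is far smaller. The sketch below uses a base-R stand-in for rgraph(); the names A, W_row and rho_range_* are illustrative.

```r
set.seed(20100817)
n <- 100

# Stand-in for rgraph(): symmetric 0/1 adjacency, tie probability 0.1, empty diagonal
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.1)
A <- A + t(A)

# Binary (style "B") weights: bounds on rho from the eigenvalues of A
ev_binary        <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
rho_range_binary <- c(1 / min(ev_binary), 1 / max(ev_binary))

# Row-normalized (style "W") weights: each row divided by its row sum
W_row         <- A / pmax(rowSums(A), 1)
ev_row        <- Re(eigen(W_row, only.values = TRUE)$values)
rho_range_row <- c(1 / min(ev_row), 1 / max(ev_row))

print(rho_range_binary)  # upper bound well below 0.5
print(rho_range_row)     # upper bound equal to 1
```

If the simulated rho falls outside the admissible interval for the binary matrix, no estimator restricted to that interval can recover it, so simulating with a smaller rho (inside rho_range_binary) would be a fairer comparison.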

Stochastic gradient descent from gradient descent implementation in R

I have a working implementation of multivariable linear regression using gradient descent in R. I'd like to see if I can use what I have to run a stochastic gradient descent. I'm not sure if this is really inefficient or not. For example, for each value of α I want to perform 500 SGD iterations and be able to specify the number of randomly picked samples in each iteration. It would be nice to do this so I could see how the number of samples influences the results. I'm having trouble, though, with the mini-batching, and I want to be able to easily plot the results.
This is what I have so far:
# Read and process the datasets
# download the files from GitHub
download.file("https://raw.githubusercontent.com/dbouquin/IS_605/master/sgd_ex_data/ex3x.dat", "ex3x.dat", method="curl")
x <- read.table('ex3x.dat')
# we can standardize the x values using scale()
x <- scale(x)
download.file("https://raw.githubusercontent.com/dbouquin/IS_605/master/sgd_ex_data/ex3y.dat", "ex3y.dat", method="curl")
y <- read.table('ex3y.dat')
# combine the datasets
data3 <- cbind(x,y)
colnames(data3) <- c("area_sqft", "bedrooms","price")
str(data3)
head(data3)
################ Regular Gradient Descent
# http://www.r-bloggers.com/linear-regression-by-gradient-descent/
# vector populated with 1s for the intercept coefficient
x1 <- rep(1, length(data3$area_sqft))
# appends to dfs
# create x-matrix of independent variables
x <- as.matrix(cbind(x1,x))
# create y-matrix of dependent variables
y <- as.matrix(y)
L <- length(y)
# cost gradient function: independent variables and values of thetas
cost <- function(x,y,theta){
gradient <- (1/L)* (t(x) %*% ((x%*%t(theta)) - y))
return(t(gradient))
}
# GD simultaneous update algorithm
# https://www.coursera.org/learn/machine-learning/lecture/8SpIM/gradient-descent
GD <- function(x, alpha){
theta <- matrix(c(0,0,0), nrow=1)
for (i in 1:500) {
theta <- theta - alpha*cost(x,y,theta)
theta_r <- rbind(theta_r,theta)
}
return(theta_r)
}
# gradient descent α = (0.001, 0.01, 0.1, 1.0) - defined for 500 iterations
alphas <- c(0.001,0.01,0.1,1.0)
# Plot price, area in square feet, and the number of bedrooms
# create empty vector theta_r
theta_r<-c()
for(i in 1:length(alphas)) {
result <- GD(x, alphas[i])
# red = price
# blue = sq ft
# green = bedrooms
plot(result[,1],ylim=c(min(result),max(result)),col="#CC6666",ylab="Value",lwd=0.35,
xlab=paste("alpha=", alphas[i]),xaxt="n") #suppress auto x-axis title
lines(result[,2],type="b",col="#0072B2",lwd=0.35)
lines(result[,3],type="b",col="#66CC99",lwd=0.35)
}
Is it more practical to find a way to use sgd()? I can't seem to figure out how to get the level of control I'm looking for with the sgd package.
Sticking with what you have now:
## all of this is the same
download.file("https://raw.githubusercontent.com/dbouquin/IS_605/master/sgd_ex_data/ex3x.dat", "ex3x.dat", method="curl")
x <- read.table('ex3x.dat')
x <- scale(x)
download.file("https://raw.githubusercontent.com/dbouquin/IS_605/master/sgd_ex_data/ex3y.dat", "ex3y.dat", method="curl")
y <- read.table('ex3y.dat')
data3 <- cbind(x,y)
colnames(data3) <- c("area_sqft", "bedrooms","price")
x1 <- rep(1, length(data3$area_sqft))
x <- as.matrix(cbind(x1,x))
y <- as.matrix(y)
L <- length(y)
cost <- function(x,y,theta){
gradient <- (1/L)* (t(x) %*% ((x%*%t(theta)) - y))
return(t(gradient))
}
I added y as an argument to your GD function and created a wrapper function, myGoD, that subsets the data first and then calls yours:
GD <- function(x, y, alpha){
theta <- matrix(c(0,0,0), nrow=1)
theta_r <- NULL
for (i in 1:500) {
theta <- theta - alpha*cost(x,y,theta)
theta_r <- rbind(theta_r,theta)
}
return(theta_r)
}
myGoD <- function(x, y, alpha, n = nrow(x)) {
idx <- sample(nrow(x), n)
y <- y[idx, , drop = FALSE]
x <- x[idx, , drop = FALSE]
GD(x, y, alpha)
}
Check to make sure it works and try with different values of n:
all.equal(GD(x, y, 0.001), myGoD(x, y, 0.001))
# [1] TRUE
set.seed(1)
head(myGoD(x, y, 0.001, n = 20), 2)
# x1 V1 V2
# V1 147.5978 82.54083 29.26000
# V1 295.1282 165.00924 58.48424
set.seed(1)
head(myGoD(x, y, 0.001, n = 40), 2)
# x1 V1 V2
# V1 290.6041 95.30257 59.66994
# V1 580.9537 190.49142 119.23446
Here is how you can use it:
alphas <- c(0.001,0.01,0.1,1.0)
ns <- c(47, 40, 30, 20, 10)
par(mfrow = n2mfrow(length(alphas)))
for(i in 1:length(alphas)) {
# result <- myGoD(x, y, alphas[i]) ## original
result <- myGoD(x, y, alphas[i], ns[i])
# red = price
# blue = sq ft
# green = bedrooms
plot(result[,1],ylim=c(min(result),max(result)),col="#CC6666",ylab="Value",lwd=0.35,
xlab=paste("alpha=", alphas[i]),xaxt="n") #suppress auto x-axis title
lines(result[,2],type="b",col="#0072B2",lwd=0.35)
lines(result[,3],type="b",col="#66CC99",lwd=0.35)
}
You don't need the wrapper function--you can just change your GD slightly. It is always good practice to explicitly pass arguments to your functions rather than relying on scoping. Before you were assuming that y would be pulled from your global environment; here y must be given or you will get an error. This will avoid many headaches and mistakes down the road.
GD <- function(x, y, alpha, n = nrow(x)){
idx <- sample(nrow(x), n)
y <- y[idx, , drop = FALSE]
x <- x[idx, , drop = FALSE]
theta <- matrix(c(0,0,0), nrow=1)
theta_r <- NULL
for (i in 1:500) {
theta <- theta - alpha*cost(x,y,theta)
theta_r <- rbind(theta_r,theta)
}
return(theta_r)
}
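Note that the functions above draw the subsample once and then run all 500 updates on that same subset, and cost() still divides by the global L rather than by the subset size. What is usually called mini-batch SGD draws a fresh batch on every update. A minimal sketch of that variant follows; it uses synthetic data so that it runs without the download, and sgd_linear, batch_size and n_iter are hypothetical names, not part of the original code.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the housing data: intercept + 2 standardized predictors
X <- cbind(1, scale(matrix(rnorm(n * 2), ncol = 2)))
beta_true <- c(3, 2, -1)
y <- X %*% beta_true + rnorm(n, sd = 0.1)   # n x 1 matrix

sgd_linear <- function(X, y, alpha, batch_size = nrow(X), n_iter = 500) {
  theta <- rep(0, ncol(X))
  path  <- matrix(NA_real_, nrow = n_iter, ncol = ncol(X))
  for (i in seq_len(n_iter)) {
    idx <- sample.int(nrow(X), batch_size)      # new random batch every iteration
    Xb  <- X[idx, , drop = FALSE]
    yb  <- y[idx, , drop = FALSE]
    # least-squares gradient, averaged over the batch (not the full data)
    grad  <- crossprod(Xb, Xb %*% theta - yb) / batch_size
    theta <- theta - alpha * as.vector(grad)
    path[i, ] <- theta                          # record the trajectory for plotting
  }
  path
}

path  <- sgd_linear(X, y, alpha = 0.1, batch_size = 20)
final <- path[nrow(path), ]
print(round(final, 2))
```

Calling sgd_linear() with several batch_size values and passing each returned path to matplot(path, type = "l") gives one trajectory plot per batch size, which is one way to see how the number of samples influences the results.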

How to plot nicely-spaced data labels?

Labeling data points in a plot can get unwieldy:
Randomly sampling a few labels may disappoint:
What would be a nice way to pick a small set of nicely-spaced data labels? That is, to randomly pick representatives whose labels are not overlapping.
# demo data
set.seed(123)
N <- 50
x <- runif(N)
y <- x + rnorm(N, 0, x)
data <- data.frame(x, y, labels=state.name)
# plot with labels
plot(x,y)
text(x,y,data$labels) # labels is a column of data, not a global variable
# plot a few labels
frame()
few_labels <- data[sample(N, 10), ]
plot(x,y)
with(few_labels, text(x,y,labels))
One way to do this is through clustering. Here is a solution with stats::hclust: we agglomerate the data points into clusters and then pick one random observation from each cluster.
few_labels <- function(df, coord=1:ncol(df),grp=5){
require(dplyr)
df$cl <- cutree(hclust(dist(df[,coord])),grp)
few_labels <- df %>% group_by(cl) %>%
do(sample_n(.,1))
return(few_labels)
}
# demo data
set.seed(123)
N <- 50
x <- runif(N)
y <- x + rnorm(N, 0, x)
data <- data.frame(x, y, labels=state.name)
# plot a few labels
frame()
few_labels <- few_labels(data,coord=1:2,grp=12)
plot(x,y)
with(few_labels, text(x,y,labels))
For all labels:
xlims=c(-1,2)
plot(x,y,xlim=xlims)
#text(x,y,data$labels,pos = 2,cex=0.7)
library(plotrix)
spread.labels(x,y,data$labels,cex=0.7,ony=NA)
Another way is to pick a point at random, discard all points close to it, and repeat until no points are left:
radius <- .1 # of a ball containing the largest label
d <- as.matrix(dist(data[, c("x","y")], upper=TRUE, diag=TRUE))
remaining <- 1:N
spaced <- numeric()
i <- 1
while(length(remaining)>0) {
p <- ifelse(length(remaining)>1, sample(remaining, 1), remaining)
spaced <- c(spaced, p) # ...
remaining <- setdiff(remaining, which(d[p, ] < 2*radius))
i <- i + 1
}
frame()
plot(x,y)
spaced_labels <- data[spaced, ]
with(spaced_labels, text(x,y,labels))
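The greedy procedure above can be wrapped in a small function, which also makes it easy to verify that the selected points really are at least 2*radius apart. This is a sketch with a hypothetical helper name, pick_spaced(); it indexes into remaining instead of calling sample() on it, which avoids the length-1 special case handled by ifelse() above.

```r
# Hypothetical helper: greedy selection of points that are >= 2 * radius apart
pick_spaced <- function(x, y, radius) {
  d <- as.matrix(dist(cbind(x, y)))
  remaining <- seq_along(x)
  chosen <- integer(0)
  while (length(remaining) > 0) {
    # index into remaining; sample(remaining, 1) would misbehave when length is 1
    p <- remaining[sample.int(length(remaining), 1)]
    chosen <- c(chosen, p)
    # drop p itself and every point closer than 2 * radius to it
    remaining <- setdiff(remaining, which(d[p, ] < 2 * radius))
  }
  chosen
}

# Same demo data as above
set.seed(123)
N <- 50
x <- runif(N)
y <- x + rnorm(N, 0, x)

spaced <- pick_spaced(x, y, radius = 0.1)

# Smallest distance between any two selected points
d_sel   <- as.matrix(dist(cbind(x, y)[spaced, , drop = FALSE]))
min_gap <- if (length(spaced) > 1) min(d_sel[upper.tri(d_sel)]) else Inf
```

Because the starting point is random, different seeds give different (and differently sized) label sets; running it a few times and keeping the largest set is a cheap improvement.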

Plot 3D plane (true regression surface)

I'm trying to simulate some data (x1 and x2 - my explanatory variables), calculate y using a specified function + random noise and plot the resulting observations AND the true regression surface. Here's what I have so far:
set.seed(1)
library(rgl)
# Simulate some data
x1 <- runif(50)
x2 <- runif(50)
y <- sin(x1)*x2+x1*x2 + rnorm(50, sd=0.3)
# 3D scatterplot of observations
plot3d(x1,x2,y, type="p", col="red", xlab="X1", ylab="X2", zlab="Y", size=5, lwd=15)
Now I'm not sure how I can add the "true" regression plane. I'm basically looking for something like curve() where I can plug in my (true) model formula.
Thanks!
If you wanted a plane, you could use planes3d.
Since your model is not linear, it is not a plane: you can use surface3d instead.
my_surface <- function(f, n=10, ...) {
ranges <- rgl:::.getRanges()
x <- seq(ranges$xlim[1], ranges$xlim[2], length=n)
y <- seq(ranges$ylim[1], ranges$ylim[2], length=n)
z <- outer(x,y,f)
surface3d(x, y, z, ...)
}
library(rgl)
f <- function(x1, x2)
sin(x1) * x2 + x1 * x2
n <- 200
x1 <- 4*runif(n)
x2 <- 4*runif(n)
y <- f(x1, x2) + rnorm(n, sd=0.3)
plot3d(x1,x2,y, type="p", col="red", xlab="X1", ylab="X2", zlab="Y", size=5, lwd=15)
my_surface(f, alpha=.2 )
Apologies: (I didn't read the question very carefully and now see that I rushed into estimation when you wanted to plot the Truth.)
Here's an approach to estimation followed by surface plotting using loess:
mod2 <- loess(y~x1+x2)
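# build a full 20 x 20 grid; expand.grid varies x1 fastest, which matches
# the column-major fill of matrix() below
grd <- expand.grid(x1=seq(range(x1)[1],range(x1)[2],len=20),
x2=seq(range(x2)[1],range(x2)[2],len=20))
# as.vector() because predict.loess can return an array for expand.grid input
grd$pred <- as.vector(predict(mod2, newdata=grd))
# keep the grid axes under new names so the original x1 and x2 are not overwritten
gx1 <- unique(grd$x1)
gx2 <- unique(grd$x2)
surface3d(gx1, gx2, z=matrix(grd$pred,length(gx1),length(gx2)) )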
IRTFM's somewhat imperfect answers above led me to a thread on the R-help mailing list: https://stat.ethz.ch/pipermail/r-help/2013-December/364037.html
I extracted the relevant bits of code and turned them into a function like so:
require(rgl)
pred.surf.3d <- function(df, x.nm,y.nm,z.nm, ...){
x <- df[,x.nm]; y <- df[,y.nm]; z<-df[,z.nm]
# I() is needed so that ^ means arithmetic squaring inside the formula
fit <- lm(z ~ x + y + x:y + I(x^2) + I(y^2))
xnew <- seq(range(x)[1],range(x)[2],len=20)
ynew <- seq(range(y)[1],range(y)[2],len=20)
df <- expand.grid(x=xnew, y=ynew)
df$z <- predict(fit, newdata=df)
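# reshape predictions to a length(xnew) x length(ynew) matrix (x varies fastest in expand.grid)
surface3d(xnew, ynew, z=matrix(df$z, length(xnew), length(ynew)), ...)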
}
I may end up bundling this into my CRAN utility package at some point.
In the meantime, I hope you find it useful! (Run it on IRTFM's first code chunk like so:)
pred.surf.3d(data.frame(x1,x2,y),'x1','x2','y')
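A detail that is easy to miss when fitting a quadratic surface with lm(): inside a model formula, ^ is a formula operator, so x^2 collapses to x and a true squared term has to be wrapped in I(). The base-R sketch below (synthetic data, illustrative names fit_plain, fit_quad and zhat) shows the difference in the fitted coefficients and how the predictions can be reshaped into the matrix that a surface-plotting function expects.

```r
set.seed(1)
x <- runif(100)
y <- runif(100)
z <- 1 + x^2 - y^2 + x * y + rnorm(100, sd = 0.05)

# Without I(): x^2 and y^2 reduce to x and y, so no squared terms are fitted
fit_plain <- lm(z ~ x + y + x*y + x^2 + y^2)
# With I(): the squared terms are real regressors
fit_quad  <- lm(z ~ x + y + x:y + I(x^2) + I(y^2))

print(names(coef(fit_plain)))
print(names(coef(fit_quad)))

# Predict on a regular grid and reshape: expand.grid varies x fastest,
# so filling a length(gx) x length(gy) matrix column by column lines up
gx   <- seq(0, 1, length.out = 20)
gy   <- seq(0, 1, length.out = 20)
grid <- expand.grid(x = gx, y = gy)
zhat <- matrix(predict(fit_quad, newdata = grid),
               nrow = length(gx), ncol = length(gy))
```

The matrix zhat can then be passed as the z argument to a surface function such as persp(gx, gy, zhat) in base graphics.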

Resources