Getting vectors out of ggplot2 - r

I am trying to show that there is a weird "bump" in some data I am analysing (it is to do with market share). My code is here:
qplot(Share, Rate, data = Dataset3, geom=c("point", "smooth"))
(I appreciate that this is not very useful code without the dataset).
Is there any way that I can get the numeric vector used to generate the smoothed line out of R? I just need that layer so that I can try to fit a model to the smoothed data.
Any help gratefully received.

Yes, there is. For datasets with fewer than 1,000 observations, ggplot2 uses the function loess as the default smoother in geom_smooth. This means you can call loess directly to compute the smoothed values yourself.
Here is an example, adapted from ?loess :
qplot(speed, dist, data=cars, geom="smooth")
Use loess to fit the smoother, then predict over a grid of x values:
cars.lo <- loess(dist ~ speed, cars)
pc <- predict(cars.lo, data.frame(speed = seq(4, 25, 1)), se = TRUE)
The estimates are now in pc$fit and the standard errors in pc$se.fit. The following bit of code extracts the fitted values into a data.frame and then plots them using ggplot:
pc_df <- data.frame(
  x = 4:25,
  fit = pc$fit)
ggplot(pc_df, aes(x=x, y=fit)) + geom_line()
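If you also want the ribbon around the smoothed line, the standard errors returned by predict can be turned into a pointwise interval. This is only a minimal sketch in base R, assuming a 95% level and a t-based interval; the names grid and band are my own and not part of the original answer.

```r
# Fit the same loess smoother as above on the built-in cars data
cars.lo <- loess(dist ~ speed, data = cars)

# Grid of speeds at which to evaluate the smoother (assumed range 4..25)
grid <- data.frame(speed = seq(4, 25, 1))
pc <- predict(cars.lo, grid, se = TRUE)

# Pointwise 95% interval from the fit and its standard error
t_crit <- qt(0.975, pc$df)
band <- data.frame(
  speed = grid$speed,
  fit   = pc$fit,
  lower = pc$fit - t_crit * pc$se.fit,
  upper = pc$fit + t_crit * pc$se.fit
)
head(band)
```

The fit column is the numeric vector behind the smoothed line; lower and upper can be dropped if only the line is needed.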

Related

Asymptotic regression function not correlating with raw data

I'm trying to model raw data by an asymptotic function with the equation $$f(x) = a + (b-a)(1-\exp(-c x))$$ using R. To do so I used the following code:
rawData <- import("path/StackTestData.tsv")
# executing regression
X <- rawData$x
Y <- rawData$y
model <- drm(Y ~ X, fct = DRC.asymReg())
# creating the regression function
f_0_ <- model$coefficients[1] #value for y if x=0
steepness <- model$coefficients[2]
plateau <- model$coefficients[3]
eq <- function(x){f_0_+(plateau-f_0_)*(1-exp(-steepness*x))}
# plotting the regression function together with the raw data
ggplot(rawData, aes(x = x, y = y)) +
  geom_line(col = "red") +
  stat_function(fun = eq, col = "blue") +
  ylim(10, 12.5)
In some cases, I got a proper regression function. However, with the attached data I don't get one. The regression function is not showing any correlation with the raw data whatsoever, as shown in the figure below. Can you perhaps offer a better solution for performing the asymptotic regression or do you know where the error lies?
Best, Max
R 4.1.2 was used with RStudio 1.4.1106. The package ggpubr was loaded for ggplot, and the packages aomisc and drc were loaded for DRC.asymReg().
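For comparison, the same asymptotic curve can be fitted in base R with nls and the self-starting model SSasymp, which is parameterised as Asym + (R0 - Asym) * exp(-exp(lrc) * x). In the notation of the equation above, a corresponds to R0, b to Asym and c to exp(lrc). The sketch below uses simulated data, not the data from the question, so the true values 10, 12 and 0.6 are assumptions made purely for illustration.

```r
# Simulated data following f(x) = a + (b - a) * (1 - exp(-c * x)) plus noise
set.seed(1)
x <- seq(0, 10, by = 0.5)
a <- 10; b <- 12; c_rate <- 0.6   # assumed "true" values for this sketch
y <- a + (b - a) * (1 - exp(-c_rate * x)) + rnorm(length(x), sd = 0.05)

# SSasymp supplies its own starting values, so none are given here
fit <- nls(y ~ SSasymp(x, Asym, R0, lrc))
est <- coef(fit)

# Translate back to the parameters of the equation in the question
c(a = est[["R0"]], b = est[["Asym"]], c = exp(est[["lrc"]]))
```

If the estimates from a fit like this disagree with those from drm on the same data, that would point to the fitting step rather than the plotting step as the place to look.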

Plotting Cumulative Events from Adjusted Survival Curve in R

I am attempting to create an adjusted survival curve (from a Cox model) and would like to display this information as cumulative events.
I have attempted this:
library(survival)
data("ovarian")
library(survminer)
model<-coxph(Surv(futime, fustat) ~ age + strata(rx), data=ovarian)
gplot<-ggadjustedcurves(model) ## Expected plot of adjusted survival curve
Because the "fun=" still has not been implemented in ggadjustedcurves I took the advice of a user on this page and extracted the elements into plotdata and created a new column as shown below.
library(magrittr) # provides %<>%
library(dplyr)    # provides mutate()
plotdata <- gplot$data
plotdata %<>%
  mutate(new = 1 - surv) ## 1 - survival probability
I am new to the R environment and ggplot, so how can I plot the new adjusted survival curve using the newly created column while keeping the theme of the original plot (contained in gplot)?
Thanks!
Edit:
My current solution is as follows.
library(rms)
model<-coxph(Surv(futime, fustat) ~ age+ strata(rx), data=ovarian)
survfit(model, conf.type = "plain", conf.int = 1)
plot(survfit(model), conf.int = T,col = c(1,2), fun='event')
This gives the survival curve I wanted; however, I am not sure whether the confidence bars are really the standard errors (+/-1). I supplied 1 to the conf.int argument and believe this creates the standard errors, since conf.type is specified as plain.
How can I further customize this plot as the base graph looks rather bland! How do I get a display as close as possible to the survminer curves?
You can use the adjustedCurves package instead, which allows both plotting confidence intervals and naturally includes an option to display cumulative incidence functions. First, install it using:
devtools::install_github("https://github.com/RobinDenz1/adjustedCurves")
Now you can use:
library(adjustedCurves)
library(survival)
library(riskRegression)
# needs to be a factor
ovarian$rx <- factor(ovarian$rx)
# needs to include x=TRUE
model <- coxph(Surv(futime, fustat) ~ age + strata(rx), data=ovarian, x=TRUE)
adj <- adjustedsurv(data = ovarian,
                    event = "fustat",
                    ev_time = "futime",
                    variable = "rx",
                    method = "direct",
                    outcome_model = model,
                    conf_int = TRUE)
plot(adj, cif=TRUE, conf_int=TRUE)
Which produces:
I would probably not use this method here, though. Simulation studies have shown that the Cox-regression based method performs badly in small samples. You might want to take a look at method="iptw" or method="aiptw" inside the adjustedCurves package instead.
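If all you need is the "1 minus survival" transformation in a data frame that you can plot yourself, it can also be built from survfit using only the survival package. This is a rough sketch: the curves here are evaluated at the mean covariate values within each stratum, which is not the same averaging that ggadjustedcurves or adjustedCurves performs, and the column name cum_event is my own.

```r
library(survival)

# Same stratified Cox model as in the question
model <- coxph(Surv(futime, fustat) ~ age + strata(rx), data = ovarian)
sf <- survfit(model)

# One row per time point and stratum; cum_event is 1 - survival probability
plotdata <- data.frame(
  time      = sf$time,
  strata    = rep(names(sf$strata), times = sf$strata),
  surv      = sf$surv,
  cum_event = 1 - sf$surv
)
head(plotdata)
```

A data frame like this can be passed to ggplot with geom_step, mapping time to x, cum_event to y and strata to colour, and any theme can then be added on top.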

Draw fitted Exgaussian density curve in ggplot2

I have a set of estimated parameters for an Ex-gaussian curve (i.e. mu, sigma, tau).
Currently I'm creating a visualization of that distribution by simulating data based on those parameters and plotting them in ggplot.
I would rather create a visualization that is effectively a smooth fitted ex-gaussian curve - i.e. an estimated curve for data that presents with the parameters I've estimated. The goal is to not have curves with the same parameters appear differently.
Here is the current simulation approach I'm utilizing:
library(retimes)
library(ggplot2)
g <- rexgauss(1000,mu=1,sigma = 1,tau =1)
g <- as.data.frame(g); colnames(g) <- "obs"
ggplot(g) + geom_density(aes(x = obs), size=1, alpha=.4)
You can use stat_function from ggplot2. It takes a function in fun, and parameters to pass to that function in args. It works well for situations like this where you want to compare a simulation to a calculated distribution, because the x values you supply to aes will be the ones automatically used in showing the function, without you having to do any work to match them up or calculate the range of x values in your simulation.
Here's an example with retimes::rexgauss. I also simplified your data frame creation, and put the parameters in a vector so you can use them in both the simulation and the calculated function.
My laptop is too slow to do all 1000 observations, so yours is probably smoother and closer to the calculated distribution than mine.
library(ggplot2)
exgauss_params <- c(mu = 1, sigma = 1, tau = 1)
# pass the parameters by name; an unnamed vector would be matched to mu alone
exgauss_sim <- data.frame(
  obs = do.call(retimes::rexgauss, c(list(n = 100), as.list(exgauss_params)))
)
ggplot(exgauss_sim, aes(x = obs)) +
  geom_density(aes(color = "simulated")) +
  stat_function(aes(color = "calculated"),
                fun = retimes::dexgauss, args = as.list(exgauss_params))
Created on 2018-05-18 by the reprex package (v0.2.0).
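If you would rather not depend on a package for the density, the ex-Gaussian density has a closed form: f(x) = (1/tau) * exp(mu/tau + sigma^2/(2*tau^2) - x/tau) * pnorm((x - mu - sigma^2/tau) / sigma). Below is a small base R sketch of it. The function name dexgauss_sketch is hypothetical, and it is computed on the log scale only to avoid overflow far out in the tails.

```r
# Closed-form ex-Gaussian density (sum of a normal and an exponential)
dexgauss_sketch <- function(x, mu = 0, sigma = 1, tau = 1) {
  z <- (x - mu - sigma^2 / tau) / sigma
  log_dens <- -log(tau) + mu / tau + sigma^2 / (2 * tau^2) - x / tau +
    pnorm(z, log.p = TRUE)
  exp(log_dens)
}

# Sanity checks: the density integrates to 1 and has mean mu + tau
area <- integrate(dexgauss_sketch, -20, 60, mu = 1, sigma = 1, tau = 1)$value
mean_x <- integrate(function(x) x * dexgauss_sketch(x, 1, 1, 1), -20, 60)$value
c(area = area, mean = mean_x)
```

A function like this can be handed to stat_function in the same way as retimes::dexgauss, so two curves with the same parameters always look identical.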

How to directly plot ROC of h2o model object in R

My apologies if I'm missing something obvious. I've been thoroughly enjoying working with h2o in the last few days using R interface. I would like to evaluate my model, say a random forest, by plotting an ROC. The documentation seems to suggest that there is a straightforward way to do that:
Interpreting a DRF Model
By default, the following output displays:
Model parameters (hidden)
A graph of the scoring history (number of trees vs. training MSE)
A graph of the ROC curve (TPR vs. FPR)
A graph of the variable importances
...
I've also seen that in python you can apply roc function here. But I can't seem to be able to find the way to do the same in R interface. Currently I'm extracting predictions from the model using h2o.cross_validation_holdout_predictions and then use pROC package from R to plot the ROC. But I would like to be able to do it directly from the H2O model object, or, perhaps, a H2OModelMetrics object.
Many thanks!
A naive solution is to use the generic plot() function on an H2OMetrics object:
logit_fit <- h2o.glm(colnames(training)[-1], 'y',
                     training_frame = training.hex,
                     validation_frame = validation.hex,
                     family = 'binomial')
plot(h2o.performance(logit_fit, valid = TRUE), type = 'roc')
This will give us a plot:
But it is hard to customize, especially to change the line type, since the type parameter is already taken by 'roc'. I also have not found a way to plot several models' ROC curves together on one plot. I have come up with a method that extracts the true positive rate and false positive rate from the H2OMetrics object and uses ggplot2 to draw the ROC curves on one plot. Here is the example code (it uses a lot of tidyverse syntax):
library(tidyverse) # purrr, tibble and ggplot2 are all used below
# for example, I have 4 H2OModels
list(logit_fit, dt_fit, rf_fit, xgb_fit) %>%
  # map a function over each element of the list
  map(function(x) x %>% h2o.performance(valid = TRUE) %>%
    # follow these 'paths' into the object
    .@metrics %>% .$thresholds_and_metric_scores %>%
    # extract the true positive rate and false positive rate
    .[c('tpr', 'fpr')] %>%
    # add (0,0) and (1,1) as the start and end points of the ROC curve
    add_row(tpr = 0, fpr = 0, .before = 1) %>%
    add_row(tpr = 1, fpr = 1)) %>%
  # add a column with the model name for grouping in ggplot2
  map2(c('Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting'),
       function(x, y) x %>% add_column(model = y)) %>%
  # reduce the four data.frames to one
  reduce(rbind) %>%
  # plot fpr against tpr, mapping model to colour as the grouping
  ggplot(aes(fpr, tpr, col = model)) +
  geom_line() +
  geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1), linetype = 2, col = 'grey') +
  xlab('False Positive Rate') +
  ylab('True Positive Rate') +
  ggtitle('ROC Curve for Four Models')
Then the ROC curve is:
You can get the ROC curve by passing the model performance metrics to H2O's plot function.
Shortened code snippet, which assumes you have created a model (call it glm) and split your dataset into training and validation sets:
perf <- h2o.performance(glm, newdata = validation)
h2o.plot(perf)
Full code snippet below:
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
glm = h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), training_frame = prostate.hex, family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE)
perf <- h2o.performance(glm, newdata = prostate.hex)
h2o.plot(perf)
and this will produce the following:
There is not currently a function in the H2O R or Python client to plot the ROC curve directly. The roc method in Python returns the data necessary to plot the ROC curve, but does not plot the curve itself. ROC curve plotting directly from R and Python seems like a useful thing to add, so I've created a JIRA ticket for it here: https://0xdata.atlassian.net/browse/PUBDEV-4449
The reference to the ROC curve in the docs refers to the H2O Flow GUI, which will automatically plot a ROC curve for any binary classification model in your H2O cluster. All the other items in that list are in fact available directly in R and Python, however.
If you train a model in R, you can visit the Flow interface (e.g. localhost:54321) and click on a binomial model to see its ROC curves (training, validation and cross-validated versions). It will look like this:
Building off @Lauren's example, after you run model.performance you can extract all the information needed for ggplot from perf@metrics$thresholds_and_metric_scores. This code produces the ROC curve, but you can also add precision and recall to the selected variables to plot the PR curve.
Here is some example code using the same model as above.
library(h2o)
library(dplyr)
library(ggplot2)
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.importFile(
path = prostatePath,
destination_frame = "prostate.hex"
)
glm <- h2o.glm(
y = "CAPSULE",
x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = prostate.hex,
family = "binomial",
nfolds = 0,
alpha = 0.5,
lambda_search = FALSE
)
# Model performance
perf <- h2o.performance(glm, newdata = prostate.hex)
# Extract info for ROC curve
curve_dat <- data.frame(perf@metrics$thresholds_and_metric_scores) %>%
  select(c(tpr, fpr))
# Plot ROC curve
ggplot(curve_dat, aes(x = fpr, y = tpr)) +
  geom_point() +
  geom_line() +
  geom_segment(
    aes(x = 0, y = 0, xend = 1, yend = 1),
    linetype = "dotted",
    color = "grey50"
  ) +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("ROC Curve") +
  theme_bw()
Which produces this plot:
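To make clear what the tpr and fpr columns used in these answers contain, here is a small base R sketch that computes the same kind of table by hand from predicted scores and true labels. It does not use the H2O API at all; the data are simulated and the names labels, scores, roc and auc are my own.

```r
# Simulated binary labels and scores (higher score = more likely positive)
set.seed(42)
labels <- rbinom(200, 1, 0.4)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, -1)))

# For each threshold, classify score >= threshold as positive
thresholds <- sort(unique(scores), decreasing = TRUE)
roc <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(scores[labels == 1] >= t)),
  fpr = sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
)

# Prepend the (0,0) start point; the lowest threshold already gives (1,1)
roc <- rbind(data.frame(threshold = Inf, tpr = 0, fpr = 0), roc)

# Area under the curve by the trapezoid rule
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
auc
```

A data frame like roc can go straight into ggplot with geom_line, exactly as the extracted H2O table does above.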

Predict Future values using polynomial regression in R

I was trying to predict future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph is plotted for future values, the resulting y values are completely different from what was expected. Instead of a wave pattern, I get a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
The following is the sample R code that was used to get the results:
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram,dfram$y,type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique to be used for predicting the unknown future y values OR should I be using other techniques like forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
I read the phase and period of the signal off your graph. There are better ways of calculating them automatically.
# Phase and period
fase = 1
per = 10
In the linear model function I've put the triangular signal equations.
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
+ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2))))
,data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x = c(dfram$x, newdata$x),
     y = c(p_olddata, p_newdata),
     col = c(rep("blue", length(p_olddata)), rep("green", length(p_newdata))),
     xlab = "x",
     ylab = "y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there is no noise in the data. A real dataset will usually contain noise, so the fit will not look as nice as this.
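The same idea can be written more compactly with a single triangle-wave helper, which also makes it easy to compare against the degree-5 polynomial from the question. This is only a sketch: the helper tri is hypothetical, and the period of 10 and the peak positions at x = 5, 15, ... are read off the example data, not estimated.

```r
dfram <- data.frame(x = 1:16,
                    y = c(1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 4))
new_x <- data.frame(x = 17:22)

# Hypothetical helper: distance from the nearest peak of a wave with period `per`
tri <- function(x, per = 10) abs(x %% per - per / 2)

# A periodic predictor repeats outside the fitted range ...
tri_fit  <- lm(y ~ tri(x), data = dfram)
tri_pred <- predict(tri_fit, new_x)

# ... whereas a polynomial has no reason to keep oscillating there
poly_fit  <- lm(y ~ poly(x, 5), data = dfram)
poly_pred <- predict(poly_fit, new_x)

data.frame(x = new_x$x, triangular = round(tri_pred, 2), polynomial = round(poly_pred, 2))
```

For noisy real data, where the period is not known in advance, a dedicated forecasting method would probably be a better fit than either model.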
