How to create bins in a reliability diagram in R

I created a logistic (logit) model with a binomial response variable using
model <- glm(response ~ predictor1 + predictor2 + ..., family = binomial)
and then I used the predict function to create a new data frame
outcome <- data.frame(predict(model, newdata = IndependentDataSet, type = "response"),
                      as.numeric(as.character(IndependentDataSet$ResponseVariable)))
names(outcome) <- c("Pr", "Obs")
I can use one of the following functions
plot(verify(outcome$Obs, outcome$Pr), CI = TRUE)
attribute(verify(outcome$Obs, outcome$Pr))
to create a plot that looks like this
or
reliability.plot(verify(outcome$Obs, outcome$Pr))
from
library(verification)
to create a reliability diagram. I am wondering how I can separate the bins based on specific values. For example, the model I am evaluating is built around a climatology of 19% (0.19), and I want bins at (1/3)*climatology and at climatology, with subsequent bin edges increasing by (2/3)*climatology. How can I do this?
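If your installed version of the verification package supports it, custom bin edges can be passed through verify()'s thresholds argument; a minimal sketch (the argument names follow the package documentation, so double-check ?verify):
library(verification)
clim <- 0.19
# edges at clim/3 and clim, then steps of (2/3)*clim, capped at 1
thresholds <- unique(c(0, clim/3, seq(clim, 1, by = 2 * clim / 3), 1))
v <- verify(outcome$Obs, outcome$Pr,
            frcst.type = "prob", obs.type = "binary",
            thresholds = thresholds)
plot(v, CI = TRUE)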
Additionally, I have seen the bins represented as circles whose size is proportional to the percentage of the data falling in each bin. Does anyone know how to make a more aesthetically pleasing reliability diagram in R? Any recommendations are welcome.
This is how I would like my diagrams to appear

The easiest approach could be to use
trace("attribute.default", edit = TRUE)
or whichever other function draws the diagram.
This way you access the source code and can edit the binning directly. These changes affect only the current R session.
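For the proportional circles mentioned in the question, one option is to rebuild the diagram in ggplot2 instead of editing verification's internals. A sketch, assuming the outcome data frame from the question and the illustrative climatology-based bin edges:
library(ggplot2)
library(dplyr)
clim <- 0.19
thresholds <- unique(c(0, clim/3, seq(clim, 1, by = 2 * clim / 3), 1))
binned <- outcome %>%
  mutate(bin = cut(Pr, breaks = thresholds, include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(forecast = mean(Pr), observed = mean(Obs), n = n())
ggplot(binned, aes(forecast, observed)) +
  geom_abline(linetype = "dashed") +   # the perfect-reliability diagonal
  geom_point(aes(size = n)) +          # circle size reflects forecasts per bin
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "forecast probability", y = "observed frequency", size = "n per bin")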

Related

Single linkage hierarchical clustering - boxplots on height of the branches to detect outliers

Before k-means clustering for consumer segmentation, I want to identify and delete outliers from my sample. I tried hierarchical clustering with the single linkage algorithm. The problem is that I have a sample with more than 800 cases, and in my plot (single linkage dendrogram) the numbers are written across each other and are therefore not readable, so it is impossible for me to clearly identify the outliers just by looking at the graph.
Here they say you can create boxplots based on the branch distance to identify outliers in a more objective way. I thought that would also be a great way to make the row numbers of the outliers in my dataset readable, but I am struggling with creating the boxplots.
https://link.springer.com/article/10.1186/s12859-017-1645-5/figures/3
Does anyone know, how to write the code to get the boxplots based on the height of the branches?
This is the code I use for clustering (the resulting plot is attached):
dr_dist <- dist(dr_ma_cluster[, 148:154])
hc_dr <- hclust(dr_dist, method = "single")  # single linkage
plot(hc_dr, labels = row.names(dr_ma_cluster))
This is my failed attempt at the boxplot, as I don't know how to address the branch heights:
> boxplot(hc_dr)
Error in x[floor(d)] + x[ceiling(d)] :
non-numeric argument for binary operator
> boxplot(hc_dr[,c(148:154)])
Error in hc_dr[, c(148:154)] : Incorrect number of dimensions
And here is another way to plot the tree (plus an automated outlier detection approach), but it makes readability even worse for large datasets:
Delete outliers automatically of a calculated agglomerative hierarchical clustering data
Thanks for any help!!
boxplot(hc_dr$height), as suggested by StupidWolf, was the simple thing I was looking for.
Unfortunately, I did not manage to label the outlier dots with the row names from the original data frame. Row names from the branch height table were useless, as they are assigned in ascending order.
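One way to recover those labels, as a sketch (assuming the hc_dr object from above): in an hclust object the heights line up with the rows of $merge, and negative entries in $merge are row numbers of original observations, so the observations involved in unusually late (high) merges can be looked up directly.
h <- hc_dr$height
cutoff <- boxplot.stats(h)$stats[5]    # upper whisker of the height boxplot
boxplot(h, main = "single-linkage merge heights")
late <- which(h > cutoff)              # merges drawn as outlier dots
obs <- -hc_dr$merge[late, ]            # negative merge entries = original observations
obs <- obs[obs > 0]
row.names(dr_ma_cluster)[obs]          # candidate outlier labels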
hang = 0.0001 gave a better look to the dendrogram, but the labels were still unreadable because they still overlapped.
If anyone has a similar problem, check R Shiny, zoomable dendrogram program; the code given there in the answer was super easy to adapt, resulting in a zoomable dendrogram, which makes it easy to identify the relevant cases (the outliers). For details, search for dendextend as proposed by csgroen.
Both together, the boxplot and this nice tool served to identify the row names of the outliers after single linkage clustering, in order to delete them before k-means clustering.

How to make plots from distributed data from R

I'm working with Spark using the R API, and I have a grasp of how data is processed by Spark, either when only Spark native functions are used, in which case it is transparent to the user, or when spark_apply() is used, where a better understanding of how the partitions are handled is required.
My doubt is regarding plots where no aggregation is done. For example, it is my understanding that if a group-by is used before a plot, not all the data will be used. But if I need to make, say, a scatter plot with 100 million dots, where is that data stored at that point? Is it still distributed between all the nodes, or is it on one node only? If the latter, will the cluster freeze because of this?
I know you write that no aggregation is (or should be?) done, but I'd wager that is precisely what you need and want to do. The point of distributed computing is largely that partial results are computed, well, distributed at each node. For very big data sets, each node (often) sees only a subset of the data.
Regarding the plotting: a scatter plot with more than even a few thousand points (not to mention 100 million) will contain a significant amount of overplotting. Either you 'fix' that by making the points transparent, you do a density estimate, or you do some binning of the data (e.g. a hexbin plot or a heatmap). The binning can be done distributed across the nodes, and the binned results returned from each node can then be aggregated into a final result by the master node and plotted.
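A sketch of that idea with sparklyr, assuming a Spark table sdf with numeric columns x and y (the names and the 0.5 bin width are illustrative); the counting happens on the cluster and only the aggregated bins are collected:
library(sparklyr)
library(dplyr)
library(ggplot2)
binned <- sdf %>%
  mutate(xb = floor(x / 0.5) * 0.5,   # snap each point to a 0.5-wide bin
         yb = floor(y / 0.5) * 0.5) %>%
  count(xb, yb) %>%
  collect()                           # only the bin counts leave the cluster
ggplot(binned, aes(xb, yb, fill = n)) + geom_tile()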
Even if you somehow had a node making a scatter plot of 100 million points, what is your output format? Vector graphics (e.g. pdf/svg) would create a huge file. Raster graphics (e.g. jpg, png) will effectively aggregate on your behalf when the plot is rasterized, so you might as well control that yourself with bins the size of pixels.

Making a 'flip-book' type animation using density plots from R

I'm new to R, but I have worked out how to graph the distribution of my students' grades for a given term using a density plot, and I have made some ridgeline plots to show how the distribution evolves throughout the academic year.
I'm thinking it might be fun (and make the graphs easier to interpret) if I could make a kind of flip-book animation that went from one term's grades to the next, relatively quickly, to see how the distribution changes. At its simplest, I could just pop these distribution plots into PowerPoint and scroll through the pages, but I'm wondering what I need to put into R's ggplot command to keep the axes/scaling consistent from one chart to the next.
At the moment, I'm just making a simple chart using this command, where ht102 is the data from the 2nd term of Year 10 and A8 is a vector containing all the (numeric) grades. I then do the same thing with another set of grades called ht103, and so on...
ggplot(ht102, aes(x = A8)) +
  geom_density(alpha = 0.3)
What would you recommend to keep the scaling consistent, and do you have any thoughts on a better way to animate this than popping the charts into PowerPoint?
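A sketch of one approach, assuming the per-term data frames share the same columns and can be stacked with a term label (the axis limits are illustrative): fixing the limits with coord_cartesian() keeps every frame on the same axes, and gganimate flips between the terms without leaving R.
library(ggplot2)
library(gganimate)
grades <- rbind(
  transform(ht102, term = "Y10 T2"),
  transform(ht103, term = "Y10 T3")
)
p <- ggplot(grades, aes(x = A8)) +
  geom_density(alpha = 0.3) +
  coord_cartesian(xlim = c(0, 9), ylim = c(0, 1)) +  # same axes on every frame
  transition_states(term) +
  labs(title = "Term: {closest_state}")
animate(p)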

Forest plot from cox object

Please be tolerant :) I am a dummy user of R, and I am using the code and sample data to learn how to make the forest plot shown in this previous post:
Optimal/efficient plotting of survival/regression analysis results
I was wondering: is it possible to set a user-defined x-axis scale with the code shown there? Up to now the x-axis scale is defined automatically.
Thank you for any tips.
I'm unimpressed with the precision of the documentation, since one might assume that the limits argument takes values on the relative risk scale rather than on the log-transformed scale; one gets a ridiculous result if that is done. That quibble notwithstanding, it's relatively easy to use that parameter to create an expanded plot:
install.packages('devtools')  # then use it to get the current package
# install and load the package referenced at the top of that answer
print(forest_model(lung_cox, limits = log(c(0.5, 50))))
Trying for a lower limit of 0 on the relative risk scale is not sensible, since it would imply a value of -Inf on the log-transformed scale. Trying a low value, say log(0.001), confuses the pretty-printing of the scale in my tests.

R: update plot [xy]lims with new points() or lines() additions?

Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable is infinite). I want to plot about 10 of these simulations on a line chart, where the x-axis has the iteration number and the y-axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each simulation having 10,000 iterations) and build the main plot based on its range. But often one of the later simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this:
1. Store each simulation, then pick the one with the largest range and build the base graph off that. Not elegant, and I'd have to store a lot of data in memory, but it would probably be laptop-friendly. [EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that would support far more iterations, such that memory becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in.]
2. Find a seed that appears to build a nice-looking version of it and set the ylim manually, which would make the demonstration reproducible.
Naturally, I'm holding out for something more elegant than my workarounds. I hope this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics; if someone has a solution, I'd love to see it. However, graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2:
library(ggplot2)
Make some data:
foo <- data.frame(data = rnorm(100), numb = seq_len(100))
Make an initial ggplot object and plot it:
p <- ggplot(foo, aes(numb, data)) + geom_line()
p
Make some more data and add it to the plot:
foo2 <- data.frame(data = rnorm(200), numb = seq_len(200))
p <- p + geom_line(data = foo2, aes(numb, data), colour = "red")
Plot the new object:
p
I think (1) is the best option. I actually don't think it's inelegant; it would be more computationally intensive to redraw every time you hit a point greater than xlim or ylim.
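For reference, a sketch of workaround (1), assuming Cauchy draws (a standard example of a cumulative mean that fails to converge): run all the simulations first, then set the limits from the full range.
set.seed(42)
n_sims <- 10; n_iter <- 10000
sims <- replicate(n_sims, cumsum(rcauchy(n_iter)) / seq_len(n_iter))
plot(NULL, xlim = c(1, n_iter), ylim = range(sims),
     xlab = "iteration", ylab = "cumulative mean")
for (i in seq_len(n_sims)) lines(sims[, i], col = i)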
Also, I saw in Peter Hoff's book on Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy:
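I don't have the book's code at hand, but a sketch of the idea, reusing the sims matrix from the sketch above: plotting a multivariate ts object with a single panel draws every series at once, so the limits are taken from all the data.
plot(ts(sims), plot.type = "single", col = seq_len(ncol(sims)),
     xlab = "iteration", ylab = "cumulative mean")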
