Boxplot outlier labeling in R - r

I want to draw boxplots in R and add names to outliers. So far I found this solution.
The function there provides all the functionality I need, but it scrambles the labels. In the following example, it marks the outlier as "u" instead of "o":
library(plyr)
library(TeachingDemos)
source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
set.seed(1500)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20,T)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)
Do you know of any solution? The ggplot2 library is super nice, but provides no such functionality (as far as I know). My alternative is to use the text() function and extract the outlier information from the boxplot object. However, labels placed this way may overlap.
Thanks a lot :-)

I took a look at this with debug(boxplot.with.outlier.label), and ... it turns out there's a bug in the function.
The error occurs on line 125, where the data.frame DATA is constructed from x,y and label_name.
Previously x and y have been reordered, while lab_y hasn't been. When the supplied value of x (your x1) isn't itself already in order, you'll get the kind of jumbling you experienced.
As an immediate fix, you can pre-order the x values like this (or do something more elegant):
df <- data.frame(y, x1, lab_y, stringsAsFactors=FALSE)
df <- df[order(df$x1), ]
# Needed since lab_y is not searched for in data (though it probably should be)
lab_y <- df$lab_y
boxplot.with.outlier.label(y~x1, lab_y, data=df)
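To see why the labels get jumbled, here is a toy illustration of the mismatch (my own made-up vectors, not the function's actual code): the data are reordered by group but the labels are not, so each label ends up attached to the wrong value. Reordering the labels with the same index restores the pairing.

```r
# Hypothetical toy data; label "p" belongs to y = 5, "q" to y = 1, and so on
y     <- c(5, 1, 9, 3)
x1    <- c("b", "a", "b", "a")
lab_y <- c("p", "q", "r", "s")

ord <- order(x1)  # the reordering applied to x and y

# Bug pattern: x and y are reordered, the labels are left in their original order
bad <- data.frame(x = x1[ord], y = y[ord], label = lab_y,
                  stringsAsFactors = FALSE)

# Fix: apply the same ordering to the labels
good <- data.frame(x = x1[ord], y = y[ord], label = lab_y[ord],
                   stringsAsFactors = FALSE)

bad   # y = 1 is paired with "p", which is wrong
good  # y = 1 is paired with "q", as in the input
```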

The intelligent point label placement is a separate issue discussed here or here. There's no ultimate and ideal solution so you just have to pick one there.
So you would overplot the normal boxplot with labels, as follows:
set.seed(1501)
y <- c(4, 0, 7, -5, rnorm(16))
x1 <- c("a", "a", "b", "b", sample(letters[1:2], 16, T))
lab_y <- sample(letters, 20)
bx <- boxplot(y~x1)
out_lab <- character(0)
for (i in seq_along(bx$out)) {  # seq_along() is safe when there are 0 or 1 outliers
  out_lab[i] <- lab_y[which(y == bx$out[i])[1]]
}
identify(bx$group, bx$out, labels = out_lab, cex = 0.7)
Then, while identify() is running, click where you want each label to be positioned, as described here. When finished, press "STOP".
Note that each outlier can match more than one label. In my solution, I simply picked the first.
PS: I feel ashamed of the for loop, but I don't know how to vectorize it; feel free to post an improvement.
EDIT: Inspired by Federico's link, I now see it can be done much more easily, with just these two commands:
boxplot(y~x1)
identify(as.integer(as.factor(x1)), y, labels = lab_y, cex = 0.7)
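For what it's worth, the for loop can probably be replaced by a single match() call. This is a minimal sketch with small made-up data (the values and labels below are hypothetical). It matches on group and value together, so equal values in different groups are not confused, and like the loop it keeps only the first match:

```r
# Hypothetical data: one clear outlier per group
y     <- c(1, 2, 3, 4, 5, 20, 2, 3, 4, 5, 6, -15)
x1    <- rep(c("a", "b"), each = 6)
lab_y <- letters[1:12]

bx <- boxplot(y ~ x1, plot = FALSE)  # compute the boxplot stats without drawing

# Build "group value" keys for every point and for every outlier
key_all <- paste(as.integer(as.factor(x1)), y)
key_out <- paste(bx$group, bx$out)

# match() returns the position of the first matching point for each outlier
out_lab <- lab_y[match(key_out, key_all)]
out_lab
```

Once the plot has been drawn, text(bx$group, bx$out, out_lab, pos = 4) would place these labels without any clicking, at the cost of possible overlaps.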

Related

Can I re-scale the x/y axis aspect ratio in R with rayshader?

I have some data from lab equipment that can be represented as a matrix by a contour plot/heatmap.
I would like to try illustrating this data in R with the rayshader package.
My problem is that the data is far from square in shape: the matrix is 33 rows by 48003 columns. When I plot this with rayshader, I get a thin line:
library(dplyr)
library(rayshader)
set.seed(1742)
df <- matrix(rnorm(10000), nrow = 10)
rownames(df) <- 1:10
colnames(df) <- seq(0.01, 10, 0.01)
df %>%
sphere_shade(texture = "desert") %>%
plot_map()
Is there a way to make rayshader plot this as a square by manipulating the x/y aspect ratios? Or to plot them on an equivalent scale (one dimension collects data much faster than the other)? I can't find anything in the docs.
In this example, I tried naming the rows and columns so they were both collected over 10 minutes, but it didn't change the result.
The end result should look similar to:
library(plotly)
set.seed(1742)
plot_ly(z = ~matrix(rnorm(10000), nrow = 10)) %>%
add_surface()
Many thanks.
The solution for rayshader::plot_3d() is to use scale = c(x, y, z), which alters the x/y/z aspect ratios. This was hidden, but it didn't take much sleuthing to find the answer: it is a setting in rgl::par3d(), which is called by plot_3d().
However, I couldn't get plot_map() to work. When I tried adding the argument asp = 1, which is used by rgl::par3d(), it threw errors.

Retrieve facet labels from a ggplot or a gtable/gTree/grob/gDesc object

I have data I'm plotting using ggplot's facet_grid:
My data:
species <- c("species1","species2")
conditions <- c("cond1","cond2","cond3")
batches <- 1:6
df <- expand.grid(species=species,condition=conditions,batch=batches)
set.seed(1)
df$y <- rnorm(nrow(df))
df$replicate <- 1
df$col.fill <- paste(df$species,df$condition,df$batch,sep=".")
My plot:
integerBreaks <- function(n = 5, ...)
{
library(scales)
breaker <- pretty_breaks(n, ...)
function(x){
breaks <- breaker(x)
breaks[breaks == floor(breaks)]
}
}
library(ggplot2)
p <- ggplot(df,aes(x=replicate,y=y,color=col.fill))+
geom_point(size=3)+facet_grid(~col.fill,scales="free_x")+
scale_x_continuous(breaks=integerBreaks())+
theme_minimal()+theme(legend.position="none",axis.title=element_text(size=8))
which gives:
Obviously the labels are long and come out pretty messed up in the figure, so I was wondering if there's a way to edit these labels in the ggplot object (p) or the gtable/gTree/grob/gDesc object (ggplotGrob(p)).
I am aware that one way of getting better labels is to use the labeller function when the ggplot object is created but in my case I'm specifically looking for a way to edit the facet labels after the ggplot object has been created.
As I mentioned in the comments, the facet names are nested quite deeply within the gtable that ggplotGrob() gives you. However, editing them is still possible, and since the OP explicitly wants to edit them after the plot object has been created, you can do this with:
library(grid)
gg <- ggplotGrob(p)
edited_grobs <- mapply(FUN = function(x, y) {
x[["grobs"]][[1]][["children"]][[2]][["children"]][[1]][["label"]] <- y
return(x)
},
gg$grobs[which(grepl("strip-t",gg$layout$name))],
unique(gsub("cond","c", df$condition)),
SIMPLIFY = FALSE)
gg$grobs[which(grepl("strip-t",gg$layout$name))] <- edited_grobs
grid.draw(gg)
Note that this extracts all the strips using gg$grobs[which(grepl("strip-t",gg$layout$name))] and passes them to the mapply to be reset with the gsub(...) that OP specified in their comment.
In general, if you want to access just one of the text labels, there is a very similar structure which I made use of in my mapply:
num_to_access <- 1
gg$grobs[which(grepl("strip-t",gg$layout$name))][[num_to_access]][["grobs"]][[1]][["children"]][[2]][["children"]][[1]]$label
So to access the 4th label, for example, all you would need to do is change num_to_access to 4. Hope this helps!
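Since the exact nesting can differ between ggplot2 versions, a less brittle way to read the strip text might be to walk each strip recursively and collect whatever text grobs it contains. This is a rough sketch only: collect_labels is a helper name I made up, and the small data frame is purely for illustration.

```r
library(ggplot2)
library(grid)

# Hypothetical helper: recursively collect the labels of all text grobs
# found inside a grob, gTree or gtable
collect_labels <- function(g) {
  if (inherits(g, "gtable")) {
    return(unlist(lapply(g$grobs, collect_labels)))
  }
  labs <- if (inherits(g, "text")) g$label else NULL
  if (!is.null(g$children)) {
    labs <- c(labs, unlist(lapply(g$children, collect_labels)))
  }
  labs
}

# Small illustrative plot with two facets
toy <- data.frame(x = 1:4, y = 1:4, f = c("alpha", "beta", "alpha", "beta"))
p_toy <- ggplot(toy, aes(x, y)) + geom_point() + facet_grid(~f)

gg_toy <- ggplotGrob(p_toy)
strips <- gg_toy$grobs[grepl("strip-t", gg_toy$layout$name)]
strip_labels <- unlist(lapply(strips, collect_labels))
strip_labels
```

This only reads the labels; writing them back still needs the explicit path shown above, because the walker does not keep track of where each text grob lives.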

Generate two categorical variables with a chosen degree of association in R

I'd like to use R to generate two categorical variables (such as eye color and hair color, for instance) where I can specify the degree to which these two variables are associated. It doesn't really matter to me which levels of eye color would be associated with which levels of hair color, but just being able to specify an overall association, such as by specifying the odds ratio, is a requirement. Also, I know there are ways to do this for two normally distributed continuous variables using, for example, the mvtnorm package, so I could take that route and then choose cut points to make the variables categorical after the fact, but I don't want to do it that way if I can avoid it. Any help would be greatly appreciated!
Edit: apologies for not being clearer from the start, but what I'm really asking I suppose is whether or not there's a function anybody knows of in some R package that will do this in one or two lines.
If you can specify the odds ratios (and you also need to specify the baseline odds), you just convert them to probabilities and use runif().
Edit (I misunderstood the question): Take a look at the bindata package.
If you like, here is a function I wrote that you can use to generate such data without the package. It is rather clunky; it's intended to be self-explanatory rather than elegant or fast.
odds.to.probs <- function(odds){
probs <- odds / (odds+1)
return(probs)
}
get.correlated.binary.data <- function(N, odds.x.eq.0, odds.y.eq.0.x.eq.0,
odds.ratio){
odds.y.eq.0.x.eq.1 <- odds.y.eq.0.x.eq.0*odds.ratio
prob.x.eq.0 <- odds.to.probs(odds.x.eq.0)
prob.y.eq.0.x.eq.0 <- odds.to.probs(odds.y.eq.0.x.eq.0)
prob.y.eq.0.x.eq.1 <- odds.to.probs(odds.y.eq.0.x.eq.1)
x <- ifelse(runif(N)<=prob.x.eq.0, 0, 1)
y <- rep(NA, N)
y <- ifelse(x==0, ifelse(runif(sum(x))<=prob.y.eq.0.x.eq.0, 0, 1), y)
y <- ifelse(x==1, ifelse(runif( (N-sum(x)) )<=prob.y.eq.0.x.eq.1, 0, 1), y)
dat <- data.frame(x=x, y=y)
return(dat)
}
> set.seed(9)
> dat <- get.correlated.binary.data(30, 3, 1.5, -.03)
> table(dat)
   y
x    0  1
  0 10 13
  1  0  7
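A more compact variant, offered as a sketch rather than a drop-in replacement. The function above sizes its runif() calls with sum(x) and N-sum(x) and then leans on ifelse() recycling, and the example call passes a negative odds ratio, which is not a valid odds ratio. The version below draws one uniform per row and insists that all odds are positive. The names sim_binary_pair and odds_to_prob are my own, and the odds ratio is defined as above, as odds(y = 0 | x = 1) divided by odds(y = 0 | x = 0).

```r
odds_to_prob <- function(odds) odds / (1 + odds)

sim_binary_pair <- function(n, odds_x0, odds_y0_given_x0, odds_ratio) {
  # Odds and odds ratios must be strictly positive
  stopifnot(odds_x0 > 0, odds_y0_given_x0 > 0, odds_ratio > 0)

  p_x0    <- odds_to_prob(odds_x0)
  p_y0_x0 <- odds_to_prob(odds_y0_given_x0)
  p_y0_x1 <- odds_to_prob(odds_y0_given_x0 * odds_ratio)

  x <- ifelse(runif(n) <= p_x0, 0, 1)

  # Row-specific probability that y = 0, then one uniform draw per row
  p_y0 <- ifelse(x == 0, p_y0_x0, p_y0_x1)
  y <- ifelse(runif(n) <= p_y0, 0, 1)

  data.frame(x = x, y = y)
}

set.seed(9)
dat2 <- sim_binary_pair(10000, odds_x0 = 3, odds_y0_given_x0 = 1.5, odds_ratio = 4)
tab <- table(dat2)

# Sample odds ratio; it should land near the requested value of 4
or_hat <- (tab["1", "0"] / tab["1", "1"]) / (tab["0", "0"] / tab["0", "1"])
or_hat
```

For variables with more than two levels the same idea carries over if you specify the conditional probabilities of y for each level of x, but that is beyond this sketch.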

contour plot of a custom function in R

I'm working with some custom functions and I need to draw contours for them based on multiple values for the parameters.
Here is an example function:
I need to draw such a contour plot:
Any idea?
Thanks.
First you construct a function, fourvar, that takes those four parameters as arguments. In this case you could have done it with 3 variables, one of which was lambda_2 over lambda_1. Alpha1 is fixed at 2 so alpha_1/alpha_2 will vary over 0-10.
fourvar <- function(a1,a2,l1,l2){
a1* integrate( function(x) {(1-x)^(a1-1)*(1-x^(l2/l1) )^a2} , 0 , 1)$value }
The trick is to realize that the integrate function returns a list and you only want the 'value' part of that list so it can be Vectorize()-ed.
Second you construct a matrix using that function:
mat <- outer( seq(.01, 10, length=100),
seq(.01, 10, length=100),
Vectorize( function(x,y) fourvar(a1=2, x/2, l1=2, l2=y/2) ) )
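To see why the Vectorize() wrapper is needed: outer() calls its function once with whole vectors of arguments, whereas integrate() solves one scalar problem at a time. A toy sketch (the area function is a made-up stand-in, not the fourvar above):

```r
# Stand-in integrand: the integral of x^(a + b) over [0, 1] is 1 / (a + b + 1)
area <- function(a, b) integrate(function(x) x^(a + b), 0, 1)$value

# Vectorize() wraps area() in mapply(), so each (a, b) pair gets its own
# integrate() call and outer() receives a vector of the expected length
grid <- outer(1:3, 1:2, Vectorize(area))
grid

# Closed-form values for comparison
exact <- outer(1:3, 1:2, function(a, b) 1 / (a + b + 1))
all.equal(grid, exact, tolerance = 1e-6)
```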
Then the task of creating the plot with labels in those positions can only be done easily with lattice::contourplot. After doing a reasonable amount of searching it does appear that the solution to geom_contour labeling is still a work in progress in ggplot2. The only labeling strategy I found is in an external package. However, the 'directlabels' package's function directlabel does not seem to have sufficient control to spread the labels out correctly in this case. In other examples that I have seen, it does spread the labels around the plot area. I suppose I could look at the code, but since it depends on the 'proto'-package, it will probably be weirdly encapsulated so I haven't looked.
require(ggplot2)
require(reshape2)
mmat <- melt(mat)
str(mmat) # to see the names in the melted matrix
g <- ggplot(mmat, aes(x=Var1, y=Var2, z=value) )
g <- g+stat_contour(aes(col = ..level..), breaks=seq(.1, .9, .1) )
g <- g + scale_colour_continuous(low = "#000000", high = "#000000") # make black
install.packages("directlabels", repos="http://r-forge.r-project.org", type="source")
require(directlabels)
direct.label(g)
Note that these are the index positions from the matrix rather than the ratios of parameters, but that should be pretty easy to fix.
This, on the other hand, is how easily one can construct it in lattice (and I think it looks "cleaner"):
require(lattice)
contourplot(mat, at=seq(.1,.9,.1))
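If you want the axes to show the actual grid values rather than matrix index positions, one option is the formula interface of contourplot() with a long-format data frame. A sketch under assumptions: the surface below is a made-up smooth bump standing in for the real function, and the column names x, y and z are my own.

```r
library(lattice)

xs <- seq(0.01, 10, length.out = 25)
ys <- seq(0.01, 10, length.out = 25)

# One row per grid point, so x and y carry their real values
grid_df <- expand.grid(x = xs, y = ys)

# Stand-in surface with values between 0 and 1 (not the function from the question)
grid_df$z <- with(grid_df, exp(-((x - 5)^2 + (y - 5)^2) / 8))

p <- contourplot(z ~ x * y, data = grid_df,
                 at = seq(0.1, 0.9, 0.1), labels = TRUE)
print(p)  # lattice objects are drawn when printed
```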
I think the question is still relevant, and there have been some developments in contour plot labeling in the metR package. Adding to the previous example gives nice contour labels with ggplot2 as well:
require(metR)
g + geom_text_contour(rotate = TRUE, nudge_x = 3, nudge_y = 5)

Labeling outliers on boxplot in R

I would like to plot each column of a matrix as a boxplot and then label the outliers in each boxplot as the row name they belong to in the matrix. To use an example:
vv=matrix(c(1,2,3,4,8,15,30),nrow=7,ncol=4,byrow=F)
rownames(vv)=c("one","two","three","four","five","six","seven")
boxplot(vv)
I would like to label the outlier in each plot (in this case 30) as the row name it belongs to, so in this case 30 belongs to row 7. Is there an easy way to do this? I have seen similar questions to this asked but none seemed to have worked the way I want it to.
There is a simple way. Note that the B in Boxplot in the following lines is a capital letter.
library(car)
Boxplot(y ~ x, id.method="y")
Alternatively, you could use the Boxplot function from the {car} package, which labels outliers for you.
See the following link: https://CRAN.R-project.org/package=car
In the example given it's a bit boring because the outliers are all in the same row, but here is the code:
bxpdat <- boxplot(vv)
text(bxpdat$group, # the x locations
bxpdat$out, # the y values
rownames(vv)[which(vv == bxpdat$out, arr.ind=TRUE)[, 1]], # the labels
pos = 4)
This picks the rownames that have values equal to the "out" list (i.e., the outliers) in the result of boxplot. boxplot() calls boxplot.stats and returns its values. Take a look at:
str(bxpdat)
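For the matrix from the question, the two components used above look like this. A small sketch using plot = FALSE so that nothing is drawn:

```r
vv <- matrix(c(1, 2, 3, 4, 8, 15, 30), nrow = 7, ncol = 4)
rownames(vv) <- c("one", "two", "three", "four", "five", "six", "seven")

bx <- boxplot(vv, plot = FALSE)

bx$out    # the outlying values, one entry per outlying point
bx$group  # the box (column) each of those values belongs to

# The same outlier rule applied to a single column
boxplot.stats(vv[, 1])$out
```

Here every column holds the same numbers, so out is 30 four times and group runs from 1 to 4.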
@DWin's solution works very well for a single boxplot, but will fail for anything with duplicate values, like the dataset I have created:
#Create data
set.seed(1)
basenums <- c(1,2,3,4,8,15,30)
vv=matrix(c(basenums, sample(basenums), 1-basenums,
c(0, 29, 30, 31, 32, 33, 60)),nrow=7,ncol=4,byrow=F)
dimnames(vv)=list(c("one","two","three","four","five","six","seven"), 1:4)
On this dataset, @DWin's solution gives:
Which is false, because in the 4th example, it is not possible for the minimum and maximum to be in the same row.
This solution is monstrous (and I hope can be simplified), but effective.
#Reshape data
vv_dat <- as.data.frame(vv)
vv_dat$row <- row.names(vv_dat)
library(reshape2)
new_vv <- melt(vv_dat, id.vars="row")
#Get boxplot data
bxpdat <- as.data.frame(boxplot(value~variable, data=new_vv)[c("out", "group")])
#Get matches with boxplot data
text_guide <- do.call(rbind, apply(bxpdat, 1,
function(x) new_vv[new_vv$value==x[1]&new_vv$variable==x[2], ]))
#Add labels
with(text_guide, text(x=as.numeric(variable)+0.2, y=value, labels=row))
Or you can simply run the code from this blog post:
source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r") # Load the function
set.seed(6484)
y <- rnorm(20)
x1 <- sample(letters[1:2], 20,T)
lab_y <- sample(letters, 20)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x1, lab_y)
(which handles multiple outliers which are close to one another)
@sebastian-c
This is a slight modification of DWin's solution that seems to work more generally:
bx1 <- boxplot(vv, las=2, cex.axis=.8)
if (length(bx1$out) != 0) {
  ## get the row of each outlier (first match within its own column)
  out.rows <- sapply(seq_along(bx1$out), function(i) which(vv[, bx1$group[i]] == bx1$out[i])[1])
  text(bx1$group, bx1$out,
       rownames(vv)[out.rows],
       pos=4)
}
