How to not graph the extreme outliers in a boxplot? - r

I have an R script that uses a csv file as its source data to create sixteen separate boxplots. Each of the sixteen boxplots has a different y-axis scale, which makes it difficult to apply a general ylim statement to the script. I tried using the coord_cartesian function with the ylim argument as well as the scale_y_continuous function, but again, that was too general to apply across sixteen boxplots with varying y-axis scales (I do not want to normalize the scales across all sixteen boxplots, only adjust the plots with 'extreme' outliers).
Below is the snippet of code I used to create the sixteen boxplots. 'SE_Data' is the csv source file I noted above. I should also mention that the sixteen boxplots are exported as a single pdf file (I don't know if this level of detail is needed or not).
library(ggplot2)
# Enter csv input file:
SE_Data<-read.csv("SE_DATA.csv",header=T)
# Enter output file name:
pdf(file="SE_Box_Plots.pdf", onefile=TRUE)
x=c("A","B","C","D","E","F","G","H")
SE_Data$ACO_Desc <- factor(SE_Data$ACO_Desc , x) #Ensures x-axis is ordered from A through H
#Creates sixteen individual boxplots
for (i in 5:ncol(SE_Data)) {
p<-ggplot(SE_Data, aes(x=Group_Desc, y=SE_Data[,i])) + geom_boxplot() +
ylab(gsub("\\_", " ", colnames(SE_Data)[i])) +
xlab("") +
theme(axis.text.x=element_text(angle = 0))
print(p)
}
dev.off()
dev.list()
I wasn't sure if I would need to create an IF ELSE statement to solve this problem; however, as someone who is still fairly new to R, this appears to be well above my skill level. Below, I included two of the sixteen boxplots to illustrate how their y-axis scales differ from each other.
Box Plot 1:
Box Plot 2:
As you can see from the two boxplots, they have very different y-axis scales. In my opinion 'boxplot 2' looks fine; however, 'boxplot 1' contains extreme outliers. I would like to develop a piece of code that removes these extreme values in order to reduce the amount of 'dead space' on the boxplot, thus lowering the scale of the y-axis and making it more appealing to the eye.
It's important to stress that I still want outliers to be included in my boxplots, however, I want to remove only the extreme outliers. If you need any more information from my end please be sure to let me know.

I can't reproduce your graph without the data but including
geom_boxplot( outlier.shape=NA )
should hide the outliers. You can manually adjust the y scale with
scale_y_continuous(limits=c(-5, 1)) # or whatever values you want to use.
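Since one fixed pair of limits cannot suit all sixteen plots, one option is to compute the limits separately for each column. The sketch below is not from the answer above: extreme_limits is a hypothetical helper, and it assumes that 'extreme' means further than 3 x IQR from the quartiles (the default boxplot whiskers use 1.5 x IQR). It pairs the limits with coord_cartesian, which only zooms the view, whereas limits passed to scale_y_continuous drop out-of-range values before the box statistics are computed.

```r
# Hypothetical helper (not part of ggplot2): y-axis limits that keep mild
# outliers visible but cut off points beyond k * IQR from the quartiles.
extreme_limits <- function(y, group, k = 3) {
  fences <- vapply(split(y, group), function(v) {
    q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    c(q[1] - k * (q[2] - q[1]), q[2] + k * (q[2] - q[1]))
  }, numeric(2))
  # Widest fence over all groups, but never wider than the data itself
  c(max(min(fences[1, ], na.rm = TRUE), min(y, na.rm = TRUE)),
    min(max(fences[2, ], na.rm = TRUE), max(y, na.rm = TRUE)))
}

# Toy data standing in for one column of SE_Data
toy <- data.frame(Group_Desc = "A", value = c(1:10, 1000))
lims <- extreme_limits(toy$value, toy$Group_Desc)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Group_Desc, y = value)) +
    geom_boxplot() +
    coord_cartesian(ylim = lims)  # zooms only; the value 1000 is off-canvas
}
```

Inside the loop from the question, this would amount to calling extreme_limits(SE_Data[, i], SE_Data$Group_Desc) and adding coord_cartesian(ylim = ...) to each plot. Because the limits are clamped to the data range, plots without extreme values should keep their full scale.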

Related

Can one use ggMarginal on a plot combining points and density lines?

I have been trying to add marginal graphs to my current plot, which displays some data with density lines and some data with points. However, ggMarginal seems to only pick up the data belonging to the first layer, or the first subset that is passed to geom_point. Does anyone have an idea how to still achieve my goal using ggMarginal?
I have come across workarounds with cowplot; however, in my case that would require a lot of additional work, as I produce loads of figures of varying size (each of which would need specific adjustments for perfect alignment).
Thanks for any ideas!
Code to reproduce current Output:
library(ggplot2)
library(ggExtra)
data=iris
PC_Data=prcomp(data[,1:4])
data2Plot=as.data.frame(cbind(PC_Data$x,Species=data$Species))
data2Plot$Species=as.factor(data2Plot$Species)
p<-ggplot(data2Plot,aes(x=PC1,y=PC2,color=Species,fill=Species))+
stat_density2d(data=subset(data2Plot, Species != "3"),geom="polygon",size=0.2,alpha=0.1) +
geom_point(data=subset(data2Plot, Species == "3"),size=1)+theme(legend.position = "none")
ggMarginal(p,type="density",groupColour = T,groupFill = T)
Current Output
Wanted Output:

How to prevent geom_text_repel from labeling points on scatter plot with default number ordering list?

My dataset looks like this:
I'm trying to create a simple scatter plot with data labels that are names (first and last name).
I used geom_text_repel in ggrepel to create data labels, but the labels on the plot are just numbers in the order of the data points in my dataset.
For example, if you look at the first datapoint, instead of the label being "Stephen Curry" it is "1"
I have no idea why this is happening and I can't find anyone else who even has my problem, let alone a solution.
Code:
library(ggplot2)
library(ggrepel)
ggplot(gravity,
aes(TS., USG., label = rownames(gravity))) +
geom_point(aes(TS., USG.), color='black') +
geom_text_repel(aes(TS., USG., label = rownames(gravity)))
The image above shows the plot created by the code. As you can see, the labels are just the ordering numbers instead of the names. I don't see why this is happening, considering those ordering numbers are not part of the dataset I imported.
Thanks in advance
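A likely cause, offered as a guess because the data itself is not shown: a data frame read from a file gets the default row names "1", "2", ..., so rownames(gravity) returns exactly those numbers. Mapping label to the column that holds the names should give the expected labels. In the sketch below the column name Player and all values are made up for illustration.

```r
# Made-up stand-in for the real data; the column name Player is an assumption
gravity <- data.frame(
  Player = c("Stephen Curry", "Player B", "Player C"),
  TS.    = c(0.65, 0.58, 0.61),
  USG.   = c(30, 22, 27)
)

# Default row names are just the row numbers, which is what got plotted
rownames(gravity)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggrepel", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(gravity, aes(TS., USG., label = Player)) +
    geom_point(color = "black") +
    ggrepel::geom_text_repel()  # inherits x, y and label from ggplot()
}
```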

How to set the height between grid rows in a line graph in ggplot2 (R)?

I'm trying to plot a line graph using the ggplot2 library in R. The plot itself is fine, but I need to reduce the spacing (height) between the horizontal grid lines, because there is a big separation between the lines.
This is my R script:
library(ggplot2)
library(reshape2)
data <- read.csv('/Users/keepo/Desktop/G.Con/Int18/input-int18.csv')
chart_data <- melt(data, id='NRO')
names(chart_data) <- c('NRO', 'leyenda', 'DTF')
ggplot() +
geom_line(data = chart_data, aes(x = NRO, y = DTF, color = leyenda), size = 1)+
xlab("iteraciones") +
ylab("valores")
and this is my actual graphs:
...the first line is very distant from the second. How can I reduce the height?
regards.
The lines are far apart because the values of the variable plotted on the y-axis are far apart. If you need them closer together, you fundamentally have 3 options:
change the scale (e.g. convert the plot to a log scale), although this can make it harder for people to interpret the numbers. This can also change the behavior of each line, not just change the space between the lines. I'm guessing this isn't what you will want, ultimately.
normalize the data. If the actual value of the variable on the y-axis isn't important, just standardize the data (separately for each value of leyenda).
As stated above, you can graph each line separately. The main drawback here is that you need 3 graphs where 1 might do.
Not recommended:
I know that some graphs have a "squiggle" to change scales or skip space. Generally, this is considered poor practice (and I doubt it's an option in ggplot2) because it masks the true separation between the data points. If you really do want a gap, I would look at this post: "axis.break and ggplot2 or gap.plot? plot may be too complexe"
In a nutshell, the answer here depends on what your numbers mean. What is the story you are trying to tell? Is the important feature of your plots the change between them (in which case, normalizing might be your best option), or the actual numbers themselves (in which case, the space is relevant).
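The normalizing option can be sketched roughly as follows. The toy values stand in for the csv from the question and are made up; only the column names NRO, leyenda and DTF come from the original script. Each series is standardized separately, so the lines end up on a common scale at the cost of losing the original units.

```r
# Made-up stand-in for the melted chart_data from the question
chart_data <- data.frame(
  NRO     = rep(1:5, times = 2),
  leyenda = rep(c("serie_a", "serie_b"), each = 5),
  DTF     = c(800, 810, 805, 820, 815, 3000, 3100, 3050, 3150, 3080)
)

# Standardize DTF within each value of leyenda: (value - mean) / sd
chart_data$DTF_std <- ave(chart_data$DTF, chart_data$leyenda,
                          FUN = function(v) (v - mean(v)) / sd(v))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(chart_data, aes(x = NRO, y = DTF_std, color = leyenda)) +
    geom_line()
}
```

For the first option, adding scale_y_log10() to the original plot would be the usual way to switch to a log scale.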
You could use an axis transformation that maps your data to the screen in a non-linear fashion:
fun_trans <- function(x){
d <- data.frame(x=c(800, 2500, 3100), y=c(800,1950, 3100))
model1 <- lm(y~poly(x,2), data=d)
model2 <- lm(x~poly(y,2), data=d)
scales::trans_new("fun",
function(x) as.vector(predict(model1,data.frame(x=x))),
function(x) as.vector(predict(model2,data.frame(y=x))))
}
last_plot() + scale_y_continuous(trans = "fun")

How to plot heatmap with multiple categories in a single cell with ggplot2?

How can I plot a heatmap with multiple categories in a single cell with ggplot2? A heatmap of categorical variables can be produced with this code:
# data
library(ggplot2)
library(reshape2)
datf <- data.frame(indv = factor(paste("ID", 1:20), levels = rev(paste("ID", 1:20))),
matrix(sample(LETTERS[1:7], 400, TRUE), ncol = 20))
# converting data to long form for ggplot2 use
datf1 <- melt(datf, id.var = 'indv')
ggplot(datf1, aes(variable, indv)) +
geom_tile(aes(fill = value), colour = "white") +
scale_fill_manual(values = rainbow(7))
The code came from here:
http://rgraphgallery.blogspot.com/2013/04/rg54-heatmap-plot-of-categorical.html
But what about multiple categories in a single cell like this? Is it possible to use triangle or other shape as a cell?
http://postimg.org/image/4dudrv0nz/
Copied from Biostar, as Alex Reynolds suggested.
For those interested, this appears to be Figure 2 from "Exome sequencing identifies mutation in CNOT3 and ribosomal genes RPL5 and RPL10 in T-cell acute lymphoblastic leukemia".
I wanted to create a similar plot with ggplot and geom_tile for a bigger collection of genes (a few hundred) but finally decided to use geom_point instead to provide additional information per cell (tile). It also looks to me a lot like this plot was generated in Excel or some other spreadsheet software (maybe along these lines: https://www.youtube.com/watch?v=0s5OiRMMzuY). The colors in the cells (tiles) do not match those in the legend (suggesting that they were added separately and not automatically), and there appears to be an erroneous cell (the diagonal separating the colors runs from upper left to lower right, unlike the black diagonals, which run from lower left to upper right).
Hence, my concluding two cents: doing this automatically is probably very time-consuming and, in my opinion, only makes sense if you want to do it repeatedly, e.g., on data that is subject to change or on multiple datasets, and/or if you have a larger collection of genes.
Otherwise, following the instructions in the YouTube video for a rather small number of cells is likely to be more efficient. Or use geom_point (similar to "Adding points to a geom_tile layer in ggplot2" or "Marking specific tiles in geom_tile() / geom_raster()") to represent information about an additional category (variable).
In any case, should anyone have other suggestions on how to automatically create such a figure, I am more than happy to hear about that.
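As a rough illustration of the geom_point suggestion, the sketch below draws the first category as the tile fill and a second category as a point shape on top of it. The second column and its values "p" and "q" are hypothetical, and the data is random toy data in the same spirit as the question's example.

```r
set.seed(1)  # reproducible toy data
cells <- expand.grid(indv = paste("ID", 1:20), variable = paste0("X", 1:20))
cells$value <- sample(LETTERS[1:7], nrow(cells), replace = TRUE)
# Hypothetical second category; NA means nothing extra is shown in that cell
cells$second <- sample(c("p", "q", NA), nrow(cells), replace = TRUE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cells, aes(variable, indv)) +
    geom_tile(aes(fill = value), colour = "white") +
    # Overlay a point only where the second category is present
    geom_point(data = subset(cells, !is.na(second)), aes(shape = second)) +
    scale_fill_manual(values = rainbow(7))
}
```

This does not reproduce the split (triangular) cells of the published figure; it only encodes the extra variable in a different way.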

trying to plot ranges of dates

I have 19 tags which were deployed and reported at different times throughout the summer and fall. Currently I am trying to create a plot to display the times of deployment and reporting so that I can visualize where there is overlap in data collection. I have tried several different plotting functions including plot(), boxplot(), and ggplot(). I have gotten close to what I want with boxplot() but would like the box to extend from the start to the end date and eliminate the whiskers entirely. Is there a way to do this or should I use a different function or package? Here is my code, it probably isn't the most efficient since I'm somewhat new to R.
note: tnumber are just the tag numbers I used. The dates were all taken from different data sets.
dep.dates=boxplot(t62104[,8],t40636[,8],t84337[,8],t84353[,8],t62103[,8],
t110289[,8],t62102[,8],t62105[,8],t62101[,8],t84360[,8],
t117641[,8],t40643[,8],t110291[,8],t84338[,8],t110290[,8],
t84363[,8],t117639[,8],t117640[,8],t117638[,8],horizontal=T,
main='Tag deployment and pop-up dates',xlab='Month',
ylab='Tag number',names=c('62104','40636','84337','84353',
'62103','110289','62102','62105','62101','84360','117641',
'40643','110291','84338','110290','84363','117639','117640',
'117638'),las=1)
Something like this will work if all you care about is ranges.
require(ggplot2)
require(SpatioTemporal)
require(data.table)
data(mesa.data.raw)
# For each monitor (column), take the first and last date with a non-missing observation
out <- as.data.table(t(apply(mesa.data.raw$obs, 2, function(.v){
names(.v)[range(which(!is.na(.v)))]
})), keep.rownames=TRUE)
setnames(out, "rn", "monitors")
ggplot(out, aes(x=monitors, y=V1, ymin=V1, ymax=V2)) + geom_crossbar() + coord_flip()
ggplot(out, aes(x=monitors, ymin=V1, ymax=V2)) + geom_linerange() + coord_flip()
The first ggplot call creates horizontal bars, but I can't figure out how to get rid of the center line, so I just put it at the start.
The second plot creates horizontal lines, which I think looks better anyway.
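Applied to the tags from the question, the same idea might look like the sketch below. The dates are made up, and the list tag_dates is a hypothetical stand-in for the date columns t62104[,8], t40636[,8] and so on; the only step that matters is reducing each tag to its first and last date before plotting.

```r
# Made-up dates for three tags; in the question these would come from
# t62104[, 8], t40636[, 8], and so on
tag_dates <- list(
  "62104" = as.Date(c("2013-06-01", "2013-07-15", "2013-09-30")),
  "40636" = as.Date(c("2013-07-10", "2013-08-01")),
  "84337" = as.Date(c("2013-08-20", "2013-11-05", "2013-10-01"))
)

# One row per tag with its first and last date (do.call(c, ...) keeps Date class)
ranges <- data.frame(
  tag   = names(tag_dates),
  start = unname(do.call(c, lapply(tag_dates, min))),
  end   = unname(do.call(c, lapply(tag_dates, max)))
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(ranges, aes(x = tag, ymin = start, ymax = end)) +
    geom_linerange() +
    coord_flip() +
    labs(x = "Tag number", y = "Month",
         title = "Tag deployment and pop-up dates")
}
```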
