Boxplots with missing values in R - ggplot

I am trying to make boxplots for a matrix (athTp) with 6 variables (columns) and many missing values:
ggplot(athTp)+geom_boxplot()
But I am probably doing something wrong...
I also tried making a separate boxplot per column and then arranging them in a grid, but the final plot was very small (at the desired dimensions) and lost many details.
q1 <- ggplot(athTp,aes(x="V1", y=athTp[,1]))+ geom_boxplot()
# ...and likewise for the other 5 columns (q2 to q6)
grid.arrange(q1,q2,q3,q4,q5,q6, ncol=6)
ggsave("plot.pdf",plot = qq, width = 8, height = 8, units = "cm")
Do you have any ideas?
Thanks in advance!

# ok so your data has 6 columns like this
set.seed(666)
dat <- data.frame(matrix(runif(60,1,20),ncol=6))
names(dat) <- letters[1:6]
head(dat)
# so let's get in long format like ggplot likes
library(reshape2)
longdat <- melt(dat)
head(longdat)
# and try your plot call again specifying that we want a box plot per column
# which is now indicated by the "variable" column
# [remember you should specify the x and y axes with `aes()`]
library(ggplot2)
ggplot(longdat, aes(x=variable, y=value)) + geom_boxplot(aes(colour = variable))
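The example above has no missing values, so here is a minimal sketch of the same approach on a matrix that does contain NAs. The matrix `m` is a made-up stand-in for athTp (its shape and column names are assumptions), and the output file name is arbitrary. `melt()` is called with `na.rm = TRUE` so the missing cells are dropped before plotting.

```r
library(reshape2)
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for athTp: a 10 x 6 numeric matrix
m <- matrix(runif(60, 1, 20), ncol = 6,
            dimnames = list(NULL, paste0("V", 1:6)))
# Blank out 15 distinct cells to mimic the missing values
m[sample(length(m), 15)] <- NA

# Long format: one row per non-missing cell, "variable" holds the column name
long <- melt(as.data.frame(m), measure.vars = colnames(m), na.rm = TRUE)

# One plot with six boxes instead of six plots squeezed into a grid
p <- ggplot(long, aes(x = variable, y = value)) + geom_boxplot()

# Save at an explicit size; 16 x 8 cm is only an example
out_file <- file.path(tempdir(), "boxplots.pdf")
ggsave(out_file, plot = p, width = 16, height = 8, units = "cm")
```

Because all six boxes share a single panel, the saved figure uses the full width rather than one sixth of it per plot.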

Related

Adding legend to ggplot curves plotted on the same axis [duplicate]

I have a graph that I'm trying to add a legend to but I can't find any answers.
Here's what the graph looks like
I made a data frame containing my x-axis as a column and several other columns containing y values that I plotted against x (fixed) to get these curves. I want a legend to appear on the side saying column 1, ..., column 11, with each entry matching the color of its curve.
How do I do this? I feel like I'm missing something obvious.
Here's what my code looks like (sorry for the picture; I keep getting errors that my code is not formatted correctly even though I'm using the code button):
interval is just 2:100, and aaaa etc. is a vector the same length as interval.
As Peter says, you will need to convert your data into "long" format. Here is an example using reshape2::melt:
library(reshape2)
library(ggplot2)
n <- 20
df <- data.frame(x = seq(n))
tmp <- as.data.frame(do.call("cbind", lapply(seq(5), FUN = function(x){rnorm(n)})))
names(tmp) <- paste0("aaaa", letters[1:5])
df <- cbind(df, tmp)
head(df)
df2 <- melt(df, id.vars = "x")
head(df2)
ggplot(data = df2) + aes(x = x, y = value, color = variable) +
geom_point() +
geom_line()
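reshape2 is retired and its maintainers recommend tidyr instead, so the same reshaping can also be sketched with `tidyr::pivot_longer()`. The column names below are invented for illustration. As in the answer above, the legend appears because `color` is mapped to a column inside `aes()`.

```r
library(tidyr)
library(ggplot2)

set.seed(1)
n <- 20
# Wide data: one x column plus several y columns (names are made up)
df <- data.frame(x = seq_len(n),
                 aaaaa = rnorm(n),
                 aaaab = rnorm(n),
                 aaaac = rnorm(n))

# Long data: every column except x is stacked into variable/value pairs
df_long <- pivot_longer(df, cols = -x,
                        names_to = "variable", values_to = "value")

# Mapping color to "variable" produces one legend entry per original column
p <- ggplot(df_long, aes(x = x, y = value, color = variable)) +
  geom_point() +
  geom_line()
```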

Plotting ordered factors on x-axis in ggplot2

I have the following data.
pos <- c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
block <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2)
set <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
fsize <- c(4,5,6,1,2,1,2,2,3,4,5,1,7,11,2,1,2,3,5,3,5,6,1,2)
dat <- data.frame(pos,block,set,fsize)
dat <- dat[order(block,set,-fsize),]
dat$pos <- as.factor(dat$pos)
ggplot(dat, aes(x = pos, y = fsize)) + geom_bar(stat="identity") +
facet_wrap(~block+set)
Each position pos is associated with a size fsize. There are 6 positions within each block/set. I want to arrange the positions within each block/set in order of decreasing female size (fsize).
So, for example, the first block/set with rearranged positions would be 3,2,1,5,4,6, and the order would be different for the others. However, when I plot it, the x-axis gets automatically reordered to 1-6 even when I convert the pos column to a factor. Any suggestions on how to rectify this?
Here is a solution, but in order to plot in the desired order, I needed to create a new variable with unique names. The variable is a combination of the set and pos columns.
dat <- data.frame(pos,block,set,fsize)
dat <- dat[order(block,set,-fsize),]
#make a key variable in the overall desired order
key<-paste(dat$set, dat$pos, sep=",")
#make a new ordered factor variable in the proper order
dat$order <- factor(key, levels= key, ordered =TRUE)
ggplot(dat, aes(x = order, y = fsize)) + geom_bar(stat="identity") +
facet_wrap(~block+set, scales="free_x") + labs(x="Set,Pos")
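To see why the key variable is needed, here is a small sketch of how factor levels behave. The vector is the desired order for the first block/set from the question. ggplot2 orders a discrete axis by the factor's levels, not by row order, and `factor()` sorts the levels unless told otherwise.

```r
# Desired order of positions for the first block/set
pos <- c(3, 2, 1, 5, 4, 6)

# Default: levels are sorted, so the axis would read 1..6 regardless of row order
f_default <- factor(pos)
levels(f_default)   # "1" "2" "3" "4" "5" "6"

# Explicit levels: the axis follows the order given here
f_ordered <- factor(pos, levels = unique(pos))
levels(f_ordered)   # "3" "2" "1" "5" "4" "6"
```

A single factor has only one set of levels, so pos alone cannot carry a different order in every facet. Combining set and pos into a unique key gives each facet its own levels, and `scales="free_x"` then shows only the levels present in that facet.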

ggplotly plots negative bars in positive direction

I am using a geom_bar plot in ggplotly, and it renders negative bars positive. Any ideas why this might be the case, and in particular how to solve this?
library(ggplot2)
library(plotly)
dat1 <- data.frame(
sex = factor(c("Female","Female","Male","Male")),
time = factor(c("Lunch","Dinner","Lunch","Dinner"), levels=c("Lunch","Dinner")),
total_bill = c(-13.53, 16.81, 16.24, 17.42)
)
# Bar graph, time on x-axis, color fill grouped by sex -- use position_dodge()
g <- ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) +
geom_bar(stat="identity", position=position_dodge())
ggplotly(g)
Why would the first bar point in the positive direction when its value is negative?
The versions I am using are the latest:
plotly_3.4.13
ggplot2_2.1.0
If you write your plotly object to another variable you can modify the plotly properties including the 'data' it uses to render the plot.
For your specific example append this to your code:
#create plotly object to manipulate
gly<-ggplotly(g)
#confirm existing data structure/values
gly$x$data[[1]]
# note that $y has the values 13.53, 16.81, i.e. the absolute values of the first group
#assign to original data
gly$x$data[[1]]$y <- dat1$total_bill[grep("Female",dat1$sex)]
#could do for second group too if needed
gly$x$data[[2]]$y <- dat1$total_bill[grep("Male",dat1$sex)]
#to see ggplotly object with changes
gly
I have come up with a general solution that works in cases where facet wrap is being used. Here is an example of the problem with toy data:
set.seed(45)
df <- data.frame( group=rep(1:4,5), TitleX=rep(1:5,4), TitleY=sample(-5:5,20, replace = TRUE))
h <- ggplot(df) + geom_bar(aes(TitleX,TitleY),stat = 'identity') + facet_wrap(~group)
h
When we use ggplotly we see what OP saw, which is that the negatives have disappeared:
gly <- ggplotly(h)
gly
I wrote a function that looks, in each facet's data, for entries whose hover text reports the y value as 0 (which seems to accompany the issue I am addressing) and flips the sign of the plotted y for those entries:
library(dplyr)  # provides %>% and mutate()
fix_bar_ly <- function(element,yname){
  # pull the plotted y values and their hover text out of one trace
  tmp <- as.data.frame(element[c("y","text")])
  tmp <- tmp %>% mutate(
    # where the text says "<yname>: 0" but y is non-zero, negate y
    y=ifelse(grepl(paste0(yname,": 0$"),text),
             ifelse(y!=0,-y,y),
             y)
  )
  element$y <- tmp$y
  element
}
Now I apply this function to the data for each facet:
data.list <- gly$x$data
m <- lapply(data.list,function(x){fix_bar_ly(x,"TitleY")})
gly$x$data <- m
gly
For some reason the spaces between the bars have disappeared ... but at least the values are negative in the appropriate places.

R stacked area chart - ignore NA and retain full x-axis

I have a decadal time series from 1700 to 1900 (21 time slices), and for each decade I have 7 categories that represent a quantity; see here
As you can see, only 5 of the decades actually have data.
I can plot a nice little stacked area chart in R, with the help of this very nice example, which retains only the 5 time slices that have data.
My problem is that I want an x-axis that retains all 21 time slices but still plots a stacked area chart using only the 5 time slices. The idea is that the stacked areas are still plotted against the correct year but simply connect to the next available point, up to 10 ticks further along the x-axis, ignoring the missing data in between. I can achieve something similar in Excel, but I don't like it.
My reasoning is that I want to plot lines on top of the stacked area that are much more complete, for example from 1700 to 1850, or 1800 to 1900, for visual comparison.
This post suggests how to connect dots in a line chart when you want to ignore NAs, but it doesn't work for me in this instance.
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
df
Thanks a lot.
If you are happy to convert Year to a factor, you can use code along these lines:
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
It will generate the chart below:
I wasn't sure whether you want all of the x values shown on the axis. Assuming you do, I reshaped your data. It is probably wiser not to convert Year to a factor, as in the code below:
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
# Leave it as int.
# df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
would generate a much more meaningful chart:
If you do decide to use years as factors, you could group them and use one category for a run of missing years so that the x-axis is more readable. I would say it's a matter of presentation to a great extent.
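A variant of the numeric-Year approach is sketched below. It drops the NA rows explicitly, so the areas connect straight across the gaps, and it pins the axis to the full 1700-1900 range. The `set.seed()` call and the object names are additions for illustration, and this is only one possible way to get the layout the question describes.

```r
library(reshape2)
library(ggplot2)

set.seed(42)  # added so the random example is reproducible
years <- seq(1700, 1900, by = 10)
wide <- data.frame(Year = years, replicate(7, sample(1:21)))
# Same rows set to NA as in the question
wide[c(2:15, 17, 19, 21), 2:8] <- NA

# Long format, keeping only the decades that actually have data
long <- melt(wide, id.vars = "Year", na.rm = TRUE)

# Year stays numeric, so the areas sit at their true x positions;
# limits and breaks keep all 21 decades on the axis
p <- ggplot(long, aes(x = Year, y = value, fill = variable)) +
  geom_area(position = "stack") +
  scale_x_continuous(limits = c(1700, 1900), breaks = years)
```

Lines covering a longer period can then be added on top as extra layers with their own data, for example `p + geom_line(data = other_df, aes(Year, value), inherit.aes = FALSE)`, where `other_df` is a hypothetical data frame holding the more complete series.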

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for the two series
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is that there are two different x-axes on the plot, one for Data1 and another for Data2. I want a single x-axis spanning the date range determined by the dates in Data1 and Data2.
My question is:
How do I dynamically create an x-axis for both series (i.e. select the minimum and maximum date from the data frames 'Data1' and 'Data2')?
The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the time series do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has its own x-axis, i.e. one for each sub-dataset. Which approach is most valid depends on the specific situation.
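If you would rather stay with base graphics, the shared limits can be computed from both data frames with `range()`. The sketch below reuses the question's first two datasets. The names `x_range` and `y_max` are introduced here, and the second series is added with `lines()` instead of `par(new=T)` so that only one set of axes is drawn.

```r
# The first two data frames from the question, with dates parsed on creation
Data1 <- data.frame(
  Date = as.Date(c("4/24/1995", "6/23/1995", "2/12/1996", "4/14/1997", "9/13/1998"),
                 format = "%m/%d/%Y"),
  Area_2D = c(20, 11, 5, 25, 50))
Data2 <- data.frame(
  Date = as.Date(c("6/23/1995", "4/14/1996", "11/3/1997", "11/6/1997", "4/15/1998"),
                 format = "%m/%d/%Y"),
  Area_2D = c(13, 15, 18, 25, 19))

# Limits derived from both series rather than hard-coded
x_range <- range(c(Data1$Date, Data2$Date))
y_max <- max(Data1$Area_2D, Data2$Area_2D)

# Draw the first series with the shared limits, then add the second to the same axes
plot(Data1$Date, Data1$Area_2D, type = "o", pch = 16, col = "blue",
     xlim = x_range, ylim = c(0, y_max), xlab = "Date", ylab = "Area")
lines(Data2$Date, Data2$Area_2D, type = "o", pch = 16, col = "red")
```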
