Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix holds the probability of success, and the samples matrix holds the corresponding number of samples observed in each case. I'd like to make a bar graph that groups each column together, with the height of each bar determined by the success matrix. The color of each bar should be scaled from 1 to MAX according to the number of observations (e.g., small samples would be more red, whereas large samples would be more green).
Any ideas?

Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
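As a quick check of the claim that both matrices are traversed in the same order, here is a minimal base-R sketch (not part of the original answer) that builds the same long table without reshape2. It relies on the column-major order that as.vector uses on a matrix; the small dimensions and the name long_base are made up for illustration.

```r
# Small stand-ins for the question's matrices (dimensions chosen for illustration)
set.seed(42)
m <- 2
n <- 3
MAX <- 100
success <- matrix(runif(m * n), m, n)
samples <- floor(MAX * matrix(runif(m * n), m)) + 1

# row() and col() give the row/column index of every cell; as.vector() walks
# each matrix column by column, so all four vectors line up cell for cell
long_base <- data.frame(
  group   = as.vector(row(success)),
  x       = as.vector(col(success)),
  success = as.vector(success),
  count   = as.vector(samples)
)
print(long_base)
```

Because every column is read off in the same cell order, row k of long_base always pairs a success value with the count from the same matrix position.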

Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that the actual values will probably differ because your sample uses random numbers. In future, consider using set.seed(n) to generate reproducible random samples.
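A minimal sketch of what the seed buys you: resetting the generator to the same seed before each draw gives identical numbers. The seed value 1 and the variable names are arbitrary choices for illustration.

```r
# Resetting the seed before each draw makes the two draws identical
set.seed(1)
first <- runif(5)
set.seed(1)
second <- runif(5)
identical(first, second)  # TRUE
```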
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.
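To see what the threshold step above does to the fill values, here is a small worked example on made-up counts (the count vector is hypothetical; only the ifelse expression comes from the answer).

```r
threshold <- 25
count <- c(3, 25, 26, 80, 100)   # made-up counts for illustration

# Counts above the threshold are all mapped to the maximum, so they share the
# top colour; counts at or below the threshold keep their own value
fill <- ifelse(count > threshold, max(count), count)
fill
# 3 25 100 100 100
```

One thing to be aware of with this approach: the gradient still spans the full range up to max(count), so the values at or below the threshold only use the lower part of the palette. If that turns out to be too compressed, capping with pmin(count, threshold) instead might be worth trying, though that is a variation and not what the answer does.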

Related

Plotting a facet grid in R using ggplot2 with only one variable

I have a data frame, called mouse.data, with 3 columns: Eigenvalues, DualEigenvalues and Experiment. This question does not concern the DualEigenvalues data, so that can be forgotten.
We ran 5 experiments and used the data from each experiment to calculate 14 eigenvalues. So the first 14 rows of this data frame are the 14 eigenvalues of the first experiment, with the experiment entry having value 1, the second 14 rows are the 14 eigenvalues of the second experiment with the experiment entry having value 2 etc.
I am then plotting the eigenvalues of each pairwise experiment against each other, here is an example of this code:
eigen.1 <- mouse.data$Eigenvalues[mouse.data$Experiment == 1]
eigen.2 <- mouse.data$Eigenvalues[mouse.data$Experiment == 2]
p.data <- data.frame(x = eigen.1, y = eigen.2)
ggplot(p.data, aes(x,y)) + geom_abline(slope = 1, colour = "red") + geom_point()
This gives me a graph like this one:
This is precisely what I want this graph to look like.
What I would like to do, but can't work out, is to plot a facet_grid so that the plot in the ith row and jth column plots the eigenvalues from the ith experiment on the y-axis and the eigenvalues from the jth experiment on the x-axis.
This is the closest I have got so far, I hope this makes it clearer what I mean.
This is tricky without a reproducible example of your data, but it sounds like we can roughly approximate the structure of your data frame like this:
library(ggplot2)
set.seed(1)
Eigen <- as.vector(sapply(runif(5, .5, 1.5),
function(x) sort(rgamma(14, 2, 0.02*x))))
mouse.data <- data.frame(Experiment = rep(seq(5), each = 14), Eigenvalue = Eigen)
head(mouse.data)
#> Experiment Eigenvalue
#> 1 1 39.61451
#> 2 1 44.48163
#> 3 1 54.57964
#> 4 1 75.06725
#> 5 1 75.50014
#> 6 1 94.41255
The key to getting the plot to work is to reshape your data into a long-format data frame that contains each combination of experiments. One way to do this is to split the data frame by Experiment, then use simple indexing of the resultant list (using rep) to get every ordered pair of data frames, including each experiment paired with itself. Each pair is stuck together column-wise, then the resultant 25 data frames are all joined row-wise into the plotting data frame.
experiments <- split(mouse.data, mouse.data$Experiment)
experiments <- mapply(cbind,
experiments[rep(1:5, 5)],
experiments[rep(1:5, each = 5)],
SIMPLIFY = FALSE)
p.data <- do.call(rbind, lapply(experiments, setNames,
nm = c("Experiment1", "x",
"Experiment2", "y")))
Once we have done this, we can use your plot code, with the addition of a facet_grid call:
ggplot(p.data, aes(x,y)) +
geom_abline(slope = 1, colour = "red") +
geom_point() +
facet_grid(Experiment2 ~ Experiment1) # rows: experiment on the y-axis, columns: experiment on the x-axis
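The rep-based indexing used to build the pairs can be checked on its own. This is a small sketch with hypothetical names (pairs, grid) showing that the two index vectors enumerate all 25 ordered pairs, and that expand.grid would give the same pairs in the same order.

```r
# rep(1:5, 5) cycles fastest, rep(1:5, each = 5) changes every five steps,
# so together they enumerate all 5 x 5 ordered pairs of experiments
pairs <- data.frame(first = rep(1:5, 5), second = rep(1:5, each = 5))
nrow(unique(pairs))   # 25
head(pairs, 6)

# expand.grid() varies its first argument fastest, so it yields the same pairs
grid <- expand.grid(first = 1:5, second = 1:5)
all(pairs$first == grid$first & pairs$second == grid$second)  # TRUE
```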

Arithmetic operations and labelling in ggplot or R

I have a file that looks like this
2 3 LOGIC:A
2 5 LOGIC:A
3 4 LOGIC:Z
I plotted column 1 on the x axis against column 2 on the y axis, with column 3 acting as a legend:
ggplot(Data, aes(V1, V2, col = V3)) + geom_point()
However, is it possible in ggplot itself to subtract column 1 from column 2 and label the 10 rows with the highest absolute difference, using the column 3 values as labels on those scatter points? I don't want to label the entire dataset, just the top 10 highest deltas.
You can try this (if your original data frame is Data):
library(dplyr)
library(ggplot2)
Data$sub <- abs(Data$V2 - Data$V1)
Data2<- Data %>%
top_n(10,sub)
ggplot()+ geom_text(data=Data2,aes(V1,V2-0.1,label=V3))+
geom_point(data=Data,aes(V1,V2))
With the dplyr library you can filter the top values of a data frame.
You can change the 0.1 offset to a value that suits your plot better.
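If you would rather not depend on dplyr, the same selection can be sketched in base R. The data below is made up to mimic the question's three-column layout, and top10 is a hypothetical name.

```r
# Made-up data in the question's layout: two numeric columns and a label
set.seed(1)
Data <- data.frame(V1 = sample(1:50, 30, replace = TRUE),
                   V2 = sample(1:50, 30, replace = TRUE),
                   V3 = paste0("LOGIC:", sample(LETTERS, 30, replace = TRUE)))

# Absolute difference between the two columns
Data$sub <- abs(Data$V2 - Data$V1)

# Sort by decreasing difference and keep the first ten rows.
# Unlike top_n(), this returns exactly ten rows even when values tie.
top10 <- Data[order(Data$sub, decreasing = TRUE), ][1:10, ]
top10
```

The resulting top10 data frame can be passed to geom_text in place of Data2.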

Plot selected rows with multiple columns with ggplot

I have a dataset with 34 columns and 600+ rows.
I successfully managed to reshape it for my data to be predicted for 5 columns (5 years) using reshape2
Dataset_name <- melt(data=XYZ, id.vars=c("A", "B", "C",.... {so on minus 5 columns}))
Now I have the reshaped data and have plotted the graph, but since it has 600+ points in each column, I can't make sense of it.
Is it possible for me to plot the top Row 1 to Row 50 in one graph and in another Row 51 to Row 100 and so on?
Also, I want to connect the dots to see whether they varied over the years.
Thanks.
You can assign group numbers to the rows (first 50 designated as 1, second 50 as 2, and so on) and use that variable in facet_wrap. Each facet would thus hold 50 data points. Here's an example on the iris dataset, which ships with R.
library(ggplot2)
nrow(iris) # 150, let's do 50 obs. per facet
iris <- iris[sample(1:nrow(iris)), ] # shuffle the dataset
iris$desig <- rep(c("set1", "set2", "set3"), each = 50)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
theme_bw() +
geom_point() +
facet_wrap(~ desig)
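The rep(..., each = 50) labelling above fits iris because it has exactly 150 rows. For a row count that is not a multiple of 50, a chunk index can be computed directly. This is a sketch with a made-up row count; n_rows, chunk_size and chunk are hypothetical names.

```r
n_rows <- 130          # made-up row count that is not a multiple of 50
chunk_size <- 50

# Rows 1-50 get chunk 1, rows 51-100 get chunk 2, and so on;
# the last chunk simply holds whatever rows remain
chunk <- ceiling(seq_len(n_rows) / chunk_size)
table(chunk)
```

The chunk vector can be stored as a column and used in facet_wrap the same way as desig.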

"Heatbars" for visualizing consecutive missing data days?

I am trying to visualize large chunks of consecutive missing data side-by-side on ranges of 3, 5 and 10 years sampled daily, ideally using ggplot2 since I already have some aesthetics functions done.
I imagined this would come from a barplot or maybe some heatmap variation, but I am not too sure how to use them with time-series data.
I chose a black/white list of bars because I think it makes it easier to see (1) where large chunks of missing data lie and (2) whether they occur at different moments in time (which matters when choosing which stations to use, etc.), while (3) staying relatively easy to read with many bars, which would not be true of the more conventional line plots for time series.
This is a draft of what I had in mind.
Here is some example data for 5 stations (in practice this could be up to over 80):
#Data from 5 different stations sampled daily.
df <- cbind(seq(as.Date(("2010/01/01")),by="day",length.out=365*5),data.frame(matrix(rnorm(365*5*5),365*5,5)))
colnames(df) <- c("timestamp","st1","st2","st3","st4","st5")
#Add varying ranges of missing consecutive amount of days to observe result on visualization.
df[1:50,"st1"] <- NA # 50
df[51:200,"st2"] <- NA # 150
df[1:400,"st3"] <- NA # 400
df[501:1300,"st5"] <- NA # 800
Here's a rough stab at it. Alter the scales and theme elements to your liking.
library(ggplot2)
library(scales)
library(reshape2)
melt(df, id.vars = "timestamp") -> k
k$value <- ifelse(is.na(k$value), "NA", "Not NA")
ggplot(data = k) +
geom_point(aes(x = timestamp, y = variable, fill = value, colour = value), shape =22) +
scale_x_date() +
theme_bw()
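Alongside the plot, it may help to get the gaps as numbers. The sketch below uses base R's rle on a made-up series to find where each run of consecutive NAs starts and how long it is; x, gap_lengths and gap_starts are hypothetical names, and the same steps could be applied to each station column.

```r
# Made-up daily series with two gaps: 3 missing days, then 5 missing days
x <- c(1.2, NA, NA, NA, 0.7, 0.4, NA, NA, NA, NA, NA, 2.1)

# rle() collapses the TRUE/FALSE missingness flags into runs
runs <- rle(is.na(x))
gap_lengths <- runs$lengths[runs$values]   # lengths of the NA runs only
gap_starts  <- (cumsum(runs$lengths) - runs$lengths + 1)[runs$values]

gap_lengths  # 3 5
gap_starts   # 2 7
```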

Plotting a filled line chart with 4 variables against a 5th variable ggplot2

I am trying to create a position="fill" plot that shows an allocation on the y axis (always summing to 100) against another variable on the x axis. Variables 1-4 are integers and variable 5 is a continuous numeric. All five variables are on the same row.
Y axis: variable 1 + variable 2 + variable 3 + variable 4 = 100
X axis: variable 5
Is there a way to do this without melting my data table?
Sample code, caution: runs a bit slow due to how I set up variables 1-4...
library(combinat)
combinations <- combn(100, 4)
permutations <- combinations[, colSums(combinations) == 100]
rm(combinations)
data <- t(rbind(permutations,
replicate(ncol(permutations), cumprod(1+rnorm(20, 0.05, 0.30))[20])
))
One way to generate a reproducible example would be
set.seed(1)
data_ex <- data.frame(t(rmultinom(1000,prob=rep(0.25,4),size=100)),
v5=runif(1000,0.8,1))
and then
library(ggplot2)
library(reshape2)
ggplot(melt(data_ex,id.var="v5")) +
geom_area(aes(x=v5,y=value,fill=variable))
draws the plot.
If you really want to do things the hard way you can avoid using melt, but melt is much (much much) easier!
cumvals <- t(apply(data_ex[,1:4],1,cumsum))
data2 <- data.frame(cumvals,v5=data_ex$v5)
ggplot(data2,aes(x=v5)) +
## these must go in reverse order
geom_area(aes(y=X4),fill="green")+
geom_area(aes(y=X3),fill="purple")+
geom_area(aes(y=X2),fill="red")+
geom_area(aes(y=X1),fill="blue")
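The reason the layers must be drawn in reverse order is that each column of cumvals is a running total, so the last column is the tallest area and has to go down first for the shorter ones to be painted over it. A small check on made-up data (alloc is a hypothetical name; the rmultinom call mirrors the answer's example at a smaller size):

```r
set.seed(1)
# Four allocations per row that always sum to 100, as in the answer's example
alloc <- t(rmultinom(5, size = 100, prob = rep(0.25, 4)))

# Row-wise running totals: column k holds the top edge of layer k
cumvals <- t(apply(alloc, 1, cumsum))

all(rowSums(alloc) == 100)     # TRUE: each row is a full allocation
all(cumvals[, 4] == 100)       # TRUE: the last layer always reaches 100
```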
