Error message while plotting density functions in ggplot - r

I have a data frame with 750 observations and 250 columns, and I would like to plot two density plots on top of each other. In one case a particular factor is present, in the other it isn't (commercial activities against non-commercial activities).
I created a subset of the data
CommercialActivityData <- subset(MbadSurvey, Q2== 1)
NonCommercialActivityData <- subset(MbadSurvey, Q2== 2)
I then tried to plot this as follows
p1 <- ggplot(CommercialActivityData, aes(x = water_use_PP)) + geom_density()
p1
However, when I do, I get the following error message
Error: Aesthetics must be either length 1 or the same as the data (51): x
I have 51 data values where there is commercial, and 699 where there isn't.

EDIT: new code!!
I don't have access to your data set so I have simulated your data:
# Creating the data frame
MbadSurvey <- data.frame("water_use_PP"=runif(1000,1,100),
"Q2"=as.factor(round(runif(1000,1,2),0)))
# Loading the package
library(ggplot2)
# Creating one density curve per level of Q2
p1 <- ggplot(MbadSurvey, aes(x = water_use_PP,colour = Q2)) + geom_density()
p1
NOTE: The variable Q2 must be a factor!
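If Q2 is stored as the numeric codes 1 and 2, as the subset() calls in the question suggest, it can be converted to a factor before plotting. The following is only a sketch on simulated data; the labels "Commercial" and "Non-commercial" are assumed from the question's description of the two codes.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for MbadSurvey, with Q2 stored as numeric codes 1 and 2
MbadSurvey <- data.frame(
  water_use_PP = runif(750, 1, 100),
  Q2 = sample(c(1, 2), 750, replace = TRUE)
)

# Convert the codes to a labelled factor (labels assumed from the question)
MbadSurvey$Q2 <- factor(MbadSurvey$Q2, levels = c(1, 2),
                        labels = c("Commercial", "Non-commercial"))

# One density curve per factor level, drawn on the same axes
p <- ggplot(MbadSurvey, aes(x = water_use_PP, colour = Q2)) + geom_density()
p
```

With a labelled factor the legend shows the group names instead of 1 and 2.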

Related

Boxplots by base R and ggplot2 do not match

I have a simple dataset. When I generate boxplot for the data by base R and ggplot separately, they do not match. In fact the base R boxplot is consistent with the summary function.
library(tidyverse)
library(ggplotify)
library(patchwork)
df <- read.csv("test_boxplot_data.csv")
summary(df)
p1 <- as.ggplot(~boxplot(df$y, outline=FALSE))
p2 <- ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) + ylim(0,100)
p1 + p2 + plot_layout(ncol = 2)
Generated plot kept here.
Any clue what is happening? It is also surprising that ggplot throws warning that "Removed 845 rows containing non-finite values (stat_boxplot)" but there is no NA in the data.
The clue is the warning "Removed 845 rows containing non-finite values (stat_boxplot)". It just so happens that the data contain 845 points > 100, and these points are dropped before the box plot statistics are computed.
From the first line of help for ylim():
"This is a shortcut for supplying the limits argument to the individual scales. By default, any values outside the limits specified are replaced with NA. Be warned that this will remove data outside the limits and this can produce unintended results. For changing x or y axis limits without dropping data observations, see coord_cartesian()."
This should provide the desired graph:
ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim=c(0,100))
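The effect can be reproduced without ggplot2. The sketch below uses made-up data (not the asker's file) and mimics what the scale limits do: values outside the limits become NA, so the five-number summary is computed on fewer points.

```r
# Made-up data: 200 values inside [0, 100] plus 100 values above 100
y <- c(seq(1, 100, length.out = 200), seq(101, 500, length.out = 100))

# What ylim(0, 100) effectively does: out-of-range values are replaced by NA
y_limited <- ifelse(y >= 0 & y <= 100, y, NA)

# Number of observations lost to the limits
n_dropped <- sum(is.na(y_limited))

# Five-number summaries (lower whisker, hinges, median, upper whisker)
stats_full <- boxplot.stats(y)$stats
stats_limited <- boxplot.stats(y_limited)$stats

print(n_dropped)
print(rbind(full = stats_full, limited = stats_limited))
```

The two rows differ, which is why the ylim() plot does not match summary() or the base R boxplot, whereas coord_cartesian() only zooms and keeps the statistics of the full data.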

R - ggplot2 - Get histogram of difference between two groups

Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
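As a small illustration of that last point, hist() can be called with plot = FALSE to return the computed object without drawing anything. The data here are simulated.

```r
set.seed(1)
x <- rnorm(100)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, breaks = 20, plot = FALSE)

names(h)    # components of the histogram object
h$breaks    # bin boundaries
h$counts    # number of observations per bin
```

There is always one more break than there are counts, which is why reusing the breaks of the full histogram guarantees aligned bins for both groups.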
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_bin (the stat behind geom_histogram). From that you can compute the differences in each bin and then create a new plot using geom_rect.
Set up and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that because the bars are stacked, the y-values are the y-coordinates of the first plot rather than the counts, so the count column is used here.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
Make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
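A different route, not taken from either answer above, is to bin by hand with cut() and table() and plot the differences with geom_col(). This is only a sketch on simulated data; the column names follow the question, and the bin width of 0.5 is an arbitrary assumption.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's data (column names follow the question)
dat <- data.frame(
  Variable1 = c(rnorm(500, mean = 4), rnorm(500, mean = 6)),
  BinaryVariable = rep(c("T", "N"), each = 500)
)

# One shared set of breaks so both groups are counted in identical bins
breaks <- seq(floor(min(dat$Variable1)), ceiling(max(dat$Variable1)), by = 0.5)
bins <- cut(dat$Variable1, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are the two groups
counts <- table(bins, dat$BinaryVariable)

# Difference T minus N in each bin, positioned at the bin midpoint
diff_df <- data.frame(
  mid = head(breaks, -1) + 0.25,
  diff = as.vector(counts[, "T"] - counts[, "N"])
)

p <- ggplot(diff_df, aes(x = mid, y = diff)) + geom_col(width = 0.5)
p
```

Because the bins are computed outside ggplot2, the differences are ordinary numbers in a data frame and can be inspected or reused before plotting.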

In R: Get multiple barplots from a single output

I have a data frame (100 x 4). The first column is a set of "bins" 0-100, the remaining columns are the counts for each variable of events within each bin (0 to the maximum number of events).
What I'm trying to do is to plot each of the three columns of data (2:4) alongside each other. Because the counts in each of the bins are close to identical across the data sets, the bars overlap in the histograms/barplots I've created, despite my use of beside=TRUE and position="dodge".
I've set the first column as both numeric and character, but the results are identical- the bars are overlayed on top of each other. (semi-transparent density plots don't work because I want counts not the distribution densities).
The attached code, based on both R and other documentation produced the attached chart.
barplot(BinCntDF$preT,main=NewMain_Trigger, plot=TRUE,
xlab="sample frequency interval counts (0-100 msec bins)",
names.arg=BinCntDF$dT, las=0,
ylab="bin counts", axes=TRUE, xlim=c(0,100),
ylim=c(0,1000), col="red")
geom_bar(position="dodge")
barplot(BinCntDF$postT, beside=TRUE, add=TRUE)
geom_bar()
The goal is to be able to compare the two (or more) data sets side by side on the same axes, without either overlapping the other(s).
I think you have confused barplot with ggplot2. geom_bar comes from the ggplot2 package and isn't compatible with barplot, which is part of base R graphics.
Compare ?barplot and ?geom_bar and you will see that they belong to different packages. To achieve what you're after I have used ggplot2 together with reshape2.
Step 1
Based on your description, I have assumed that your data looks roughly like this:
df <- data.frame(x = 1:10,
c1 = sample(0:100, replace=TRUE, size=10),
c2 = sample(0:50, replace=TRUE, size=10),
c3 = sample(0:70, replace=TRUE, size=10))
To plot it using ggplot2 you first have to transform the data to a long format instead of a wide format. You can do this using melt function from reshape2.
library(reshape2)
a <- melt(df, id=c("x"))
The output would look something like this
> head(a)
x variable value
1 1 c1 62
2 2 c1 47
3 3 c1 20
4 4 c1 64
5 5 c1 4
6 6 c1 52
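The same wide-to-long reshaping can also be sketched with tidyr's pivot_longer(), which is a substitute for the melt() call used in this answer rather than part of it. The data frame below is simulated in the same shape as in Step 1.

```r
library(tidyr)

set.seed(1)
# Wide data in the same shape as the answer's example
df <- data.frame(x = 1:10,
                 c1 = sample(0:100, replace = TRUE, size = 10),
                 c2 = sample(0:50, replace = TRUE, size = 10),
                 c3 = sample(0:70, replace = TRUE, size = 10))

# Gather c1..c3 into name/value pairs, one row per (x, variable) combination
long <- pivot_longer(df, cols = c(c1, c2, c3),
                     names_to = "variable", values_to = "value")
head(long)
```

The result has the same three columns as the melt() output, although the rows come out ordered by x rather than by variable.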
Step 2
There are plenty of tutorials online on what ggplot2 does and what its arguments mean. I would recommend you Google, or search through the many threads on SO, to understand them.
ggplot(a, aes(x=x, y=value, group=variable, fill=variable)) +
geom_bar(stat='identity', position='dodge')
Which gives you the output:
In a nutshell:
group groups the variables of interest
stat=identity ensures that no additional aggregations are made on your data
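As a brief sketch of the second point, geom_col() is ggplot2's shorthand for a bar layer that uses the values as given, so the two plots below should be drawn the same way. The small long-format data frame is made up for illustration.

```r
library(ggplot2)

# Made-up long-format data: 5 bins, 3 variables
long <- data.frame(
  x = rep(1:5, times = 3),
  variable = rep(c("c1", "c2", "c3"), each = 5),
  value = c(10, 12, 9, 14, 11, 11, 12, 10, 13, 11, 9, 13, 9, 14, 12)
)

# Bar heights taken directly from `value`, bars placed side by side
p_bar <- ggplot(long, aes(x = x, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge")

# Equivalent shorthand: geom_col() uses the identity stat by default
p_col <- ggplot(long, aes(x = x, y = value, fill = variable)) +
  geom_col(position = "dodge")
```

Mapping fill to a discrete variable already splits the data into groups, so the explicit group aesthetic is left out here.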
With that many bins (100) and groups (3) the plot will look messy, but try this:
library(ggplot2)
library(reshape2)
set.seed(123)
myDF <- data.frame(bins=1:100, x=sample(1:100, replace=TRUE), y=sample(1:100, replace=TRUE), z=sample(1:100, replace=TRUE))
myDF.m <- melt(myDF, id.vars='bins')
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
You could also try plotting w/ facets:
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity') + facet_wrap(~ variable)

Colouring a PCA plot by clusters in R

I have some biological data that looks like this, with 2 different types of clusters (A and B):
Cluster_ID A1 A2 A3 B1 B2 B3
5 chr5:100947454..100947489,+ 3.31322 7.52365 3.67255 21.15730 8.732710 17.42640
12 chr5:101227760..101227782,+ 1.48223 3.76182 5.11534 15.71680 4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040 6.85094 15.551400 7.18341
I clean up the data:
CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)
#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]
#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)
The data size is reduced from 6981 clusters to 599 after this.
I then go on to apply PCA:
#Get data dimensions
dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)
I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.
This is what I have tried, to erroneous results:
qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))
The errors appear as such:
> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")
Error: Aesthetics must either be length one, or the same length as the data
Problems:PCA.CAGE$x[4:6, ]
> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the data
Problems:CAGE_data, CAGE_data
> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
Error: ggplot2 doesn't know how to deal with data of class
Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data set.
pca <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)
So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principal components in the columns. Next we run k-means clustering based on the first 2 PCs, and extract the cluster numbers into clust. Finally we use ggbiplot(...) to make the plot.
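If ggbiplot is not available, the same scores can be plotted with plain ggplot2. This is a sketch of that substitution, not the answer's own code; it draws only the points coloured by cluster, without the variable arrows a biplot adds.

```r
library(ggplot2)

set.seed(123)  # kmeans uses random starting centres
pca <- prcomp(iris[, 1:4], scale. = TRUE)
clust <- kmeans(pca$x[, 1:2], centers = 3)$cluster

# Put the first two PCs and the cluster labels into one data frame
scores <- data.frame(pca$x[, 1:2], cluster = factor(clust))

# One point per row of iris, coloured by its k-means cluster
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = cluster)) + geom_point()
p
```

Converting the cluster numbers to a factor gives a discrete colour scale, one colour per cluster.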

Stack different plots in a facet manner

To practise ggplot and to improve my skills in writing R functions, I decided to build a series of functions that produce survival plots with all kinds of extras. I managed to build a good working function for the basic survival plot, and now I am getting to the extras. One thing I would like to add is an option that stacks an area plot of the number at risk at a given time point on top of the survival plot. I would like it to look just like the facet_grid option of ggplot, but I did not manage to do it with this function. I do not want the two plots merely bound together, as we can do with grid.arrange, but rather to share the same x-axis.
The following code produces the two (simplified) plots that I would like to stack on top of each other. I tried to do this with facet_grid, but I don't think the solution lies in this
library(survival)
library(ggplot2)
data(lung)
s <- survfit(Surv(time, status) ~ 1, data = lung)
dat <- data.frame(time = c(0, s$time),
surv = c(1, s$surv),
nr = c(s$n, s$n.risk))
pl1 <- ggplot(dat, aes(time, surv)) + geom_step()
pl2 <- ggplot(dat, aes(time, nr)) + geom_area()
First, melt your data to long format.
library(reshape2)
dat.long<-melt(dat,id.vars="time")
head(dat.long)
time variable value
1 0 surv 1.0000000
2 5 surv 0.9956140
3 11 surv 0.9824561
4 12 surv 0.9780702
5 13 surv 0.9692982
6 15 surv 0.9649123
Then use subset() so that geom_step() receives only the surv rows and geom_area() only the nr rows. With facet_grid(), each layer lands in its own facet, because variable is used both for subsetting and for faceting. scales="free_y" gives each facet its own y-axis range.
ggplot()+geom_step(data=subset(dat.long,variable=="surv"),aes(time,value))+
geom_area(data=subset(dat.long,variable=="nr"),aes(time,value))+
facet_grid(variable~.,scales="free_y")
