Scatter plot with ggplot - r

I want to do a scatter (xy) plot of variables in a melted data frame as shown below.
df
class var mean
0 x 4.25
0 y 6.25
1 x 2.00
1 y 11.00
I have tried this, but it plots 4 points. How can I plot x against y?
library(ggplot2)
ggplot(df, aes(x=mean, y=mean, group=var, colour=class)) +
geom_point( size=5, shape=21, fill="white")

As Heroka pointed out, you need the data in wide format, with one column for x and one for y. If the data was read in like this, you can use the following to convert it.
## you don't need this since you already have df
text = "class var mean
0 x 4.25
0 y 6.25
1 x 2.00
1 y 11.00"
df = read.delim(textConnection(text),header=TRUE,strip.white=TRUE,
stringsAsFactors = FALSE, sep = " ");df
## use this library to switch from long-wide
library(reshape2)
df2 = dcast(df, class ~ var, value.var = "mean")
library(ggplot2)
ggplot(df2, aes(x=x, y=y, colour=class)) +
geom_point( size=5, shape=21, fill="white")
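If reshape2 is not available, base R's reshape() can do the same long-to-wide conversion. The sketch below types the example data in by hand (an assumption about how df was created) and renames the resulting mean.x and mean.y columns to x and y, so the result has the same columns as df2 above. Converting class to a factor is my own addition; it should give a discrete colour legend instead of a continuous one.

```r
# Example data in long ("melted") form, typed in by hand
df <- data.frame(
  class = c(0, 0, 1, 1),
  var   = c("x", "y", "x", "y"),
  mean  = c(4.25, 6.25, 2.00, 11.00)
)

# One row per class, one column per level of var
df2 <- reshape(df, idvar = "class", timevar = "var", direction = "wide")

# reshape() names the new columns mean.x and mean.y; strip the prefix
names(df2) <- sub("^mean\\.", "", names(df2))

# Treat class as categorical so that colour = class gives a discrete legend
df2$class <- factor(df2$class)

df2
```

The resulting df2 can be passed to the ggplot() call above unchanged.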

Related

Grouped bins with multiple y axis

I have a data frame with five columns and five rows. the data frame looks like this:
df <- data.frame(
day=c("m","t","w","t","f"),
V1=c(5,10,20,15,20),
V2=c(0.1,0.2,0.6,0.5,0.8),
V3=c(120,100,110,120,100),
V4=c(1,10,6,8,8)
)
I want to do some plots so I used the ggplot and in particular the geom_bar:
ggplot(df, aes(x = day, y = V1, group = 1)) + ylim(0,20)+ geom_bar(stat = "identity")
ggplot(df, aes(x = day, y = V2, group = 1)) + ylim(0,1)+ geom_bar(stat = "identity")
ggplot(df, aes(x = day, y = V3, group = 1)) + ylim(50,200)+ geom_bar(stat = "identity")
ggplot(df, aes(x = day, y = V4, group = 1)) + ylim(0,15)+ geom_bar(stat = "identity")
My question is: how can I make a grouped bar plot with geom_bar that has multiple y axes? I want the day on the x axis, and for each day I want four bars (V1, V2, V3, V4), each with its own range and color. Is that possible?
EDIT
I want the y axis to look like this:
require(reshape)
data.m <- melt(df, id.vars='day')
ggplot(data.m, aes(day, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
facet_grid(variable ~ .)
You can also change the y-axis limits if you like (here's an example).
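The linked example is not reproduced here. One option that may be what was meant is scales = "free_y" in facet_grid(), which lets each row of panels choose its own y range. A minimal sketch, building the long data with base R instead of melt() and skipping the plot if ggplot2 is not installed:

```r
df <- data.frame(
  day = c("m", "t", "w", "t", "f"),
  V1 = c(5, 10, 20, 15, 20),
  V2 = c(0.1, 0.2, 0.6, 0.5, 0.8),
  V3 = c(120, 100, 110, 120, 100),
  V4 = c(1, 10, 6, 8, 8)
)

# Long format: one row per (day, variable) pair, like melt(df, id.vars = "day")
vars <- c("V1", "V2", "V3", "V4")
data.m <- data.frame(
  day      = rep(df$day, times = length(vars)),
  variable = rep(vars, each = nrow(df)),
  value    = unlist(df[vars], use.names = FALSE)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.m, aes(day, value, fill = variable)) +
    geom_bar(position = "dodge", stat = "identity") +
    # scales = "free_y": every row of panels gets its own y axis range
    facet_grid(variable ~ ., scales = "free_y")
  # print(p) draws the plot
}
```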
Alternately you may have meant grouped like this:
require(reshape)
data.m <- melt(df, id.vars='day')
ggplot(data.m, aes(day, value)) +
geom_bar(aes(fill = variable), position = "dodge", stat="identity")
For the latter examples if you want 2 Y axes then you just create the plot twice (once with a left y axis and once with a right y axis) then use this function:
library(gtable) # gtable_add_grob(), gtable_add_cols()
library(grid) # unit()
double_axis_graph <- function(graf1,graf2){
gtable1 <- ggplot_gtable(ggplot_build(graf1))
gtable2 <- ggplot_gtable(ggplot_build(graf2))
par <- c(subset(gtable1[['layout']], name=='panel', select=t:r))
graf <- gtable_add_grob(gtable1, gtable2[['grobs']][[which(gtable2[['layout']][['name']]=='panel')]],
par['t'],par['l'],par['b'],par['r'])
ia <- which(gtable2[['layout']][['name']]=='axis-l')
ga <- gtable2[['grobs']][[ia]]
ax <- ga[['children']][[2]]
ax[['widths']] <- rev(ax[['widths']])
ax[['grobs']] <- rev(ax[['grobs']])
ax[['grobs']][[1]][['x']] <- ax[['grobs']][[1]][['x']] - unit(1,'npc') + unit(0.15,'cm')
graf <- gtable_add_cols(graf, gtable2[['widths']][gtable2[['layout']][ia, ][['l']]], length(graf[['widths']])-1)
graf <- gtable_add_grob(graf, ax, par['t'], length(graf[['widths']])-1, par['b'])
return(graf)
}
I believe there's also a package or convenience function that does the same thing.
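Recent versions of ggplot2 include sec_axis(), which adds a second y axis that is a fixed transformation of the first, so the gtable manipulation above may not be needed. A sketch under the assumption that V1 should be drawn as bars against the left axis and V2 as a line against the right axis. The scaling factor k and the numeric position column pos are my own choices, not something from the question.

```r
df <- data.frame(
  day = c("m", "t", "w", "t", "f"),
  V1  = c(5, 10, 20, 15, 20),
  V2  = c(0.1, 0.2, 0.6, 0.5, 0.8)
)
df$pos <- seq_len(nrow(df))   # numeric x position, because the label "t" occurs twice

# Factor that stretches V2 onto the range of V1
k <- max(df$V1) / max(df$V2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = pos)) +
    geom_bar(aes(y = V1), stat = "identity", fill = "grey70") +
    geom_line(aes(y = V2 * k), colour = "red") +
    scale_x_continuous(breaks = df$pos, labels = df$day) +
    # The right-hand axis undoes the scaling, so it reads in V2 units
    scale_y_continuous(name = "V1", sec.axis = sec_axis(~ . / k, name = "V2"))
  # print(p) draws the plot
}
```

The second axis is only a relabelling of the first one, so this works for two series at most and only when one linear rescaling is acceptable.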
First I reshaped as described in the documentation in the link below the question.
In general ggplot does not support multiple y-axis. I think it is a philosophical thing. But maybe faceting will work for you.
df <- read.table(text = "day V1 V2 V3 V4
m 5 0.1 120 1
t 10 0.2 100 10
w 20 0.6 110 6
t 15 0.5 120 8
f 20 0.8 100 8", header = TRUE)
library(reshape2)
df <- melt(df, id.vars = 'day')
ggplot(df, aes(x = variable, y = value, fill = variable)) + geom_bar(stat = "identity") + facet_grid(.~day)
If I understand correctly you want to include facets in your plot. You have to use reshape2 to get the data in the right format. Here's an example with your data:
df <- data.frame(
day=c("m","t","w","t","f"),
V1=c(5,10,20,15,20),
V2=c(0.1,0.2,0.6,0.5,0.8),
V3=c(120,100,110,120,100),
V4=c(1,10,6,8,8)
)
library(reshape2)
df <- melt(df, "day")
Then plot the melted data and add facet_grid:
ggplot(df, aes(x=day, y=value)) + geom_bar(stat="identity", aes(fill=variable)) +
facet_grid(variable ~ .)
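Two details of the example data may affect all of the plots above: the label "t" is used for two different days, so those rows share one x position, and a character x axis is sorted alphabetically. A possible fix, assuming the five rows stand for Monday to Friday (the longer labels are my own), is to use unique labels stored as a factor with explicit levels:

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")   # assumed meaning of m, t, w, t, f

df <- data.frame(
  day = factor(days, levels = days),   # the levels fix the order on the x axis
  V1 = c(5, 10, 20, 15, 20),
  V2 = c(0.1, 0.2, 0.6, 0.5, 0.8),
  V3 = c(120, 100, 110, 120, 100),
  V4 = c(1, 10, 6, 8, 8)
)

levels(df$day)                     # "Mon" "Tue" "Wed" "Thu" "Fri"
sort(c("m", "t", "w", "t", "f"))   # alphabetical order used for plain strings
```

Melting this data frame keeps day as a factor, so the facetted and dodged plots should show five distinct days in calendar order.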

R, ggplot: Change linetype within a series

I am using ggplot's geom_smooth to plot turnover data for a customer group from the previous year against the current year (by calendar week). As the last week is not complete, I would like to use a dashed linetype for the last week. However, I can't figure out how to do that. I can change the linetype for the entire plot or for an entire series, but not within a series (depending on the value of x).
To keep it simple, let's just use the following example:
set.seed(42)
frame <- data.frame(series = rep(c('a','b'),50),x = 1:100, y = runif(100))
ggplot(frame,aes(x = x,y = y, group = series, color=series)) +
geom_smooth(size=1.5, se=FALSE)
How would I have to change this to get dashed lines for x >= 75?
The goal would be something like this:
Thx very much for any help!
Edit, 2016-03-05
Of course I fail when trying to use this method on the original plot. The problem lies with the ribbon, which is calculated using stat_summary and a predefined function. I tried to use stat_summary on the original data (mdf) and geom_line on the smooth_data. Even when I comment out everything else, I still get "Error: Continuous value supplied to discrete scale". I believe the problem comes from the fact that the original x value (Kalenderwoche) was discrete, whereas the new, smoothed x is continuous. Do I have to somehow transform one into the other? What else could I do?
Here is what I tried (condensed to the essential lines):
quartiles <- function(x) {
x <- na.omit(x) # remove NA values
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
geom_line(data=smooth_data, aes(x=x, y=y, group=group, colour=group, fill=group))
mdf looks like this:
str(mdf)
'data.frame': 280086 obs. of 5 variables:
$ konto_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ Kalenderwoche: Factor w/ 14 levels "2015-48","2015-49",..: 4 12 1 3 7 13 10 6 5 9 ...
$ variable : Factor w/ 2 levels "Umsatz","Umsatz Vorjahr": 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 428.3 97.8 76 793.1 ...
There are many accounts (konto_id), and for each account and calendar week (Kalenderwoche), there is a current turnover value (Umsatz) and a turnover value from last year (Umsatz Vorjahr). I can provide a smaller version of the data.frame and the entire code, if required.
Thx very much for any help!
P.S. I am a total novice in R, so my code probably looks rather stupid to pros, sorry for that :(
Edit, 2016-03-06
I have uploaded a subset of the data (mdf):
mdf
The full code of the original graph is the following (looking somewhat weird with so little data, but that's not the point ;)
library(dtw)
library(reshape2)
library(ggplot2)
library(RODBC)
library(Cairo)
# custom breaks for X axis
breaks.custom <- unique(mdf$Kalenderwoche)[c(TRUE,rep(FALSE,0))]
# function called by stat_summary
quartiles <- function(x) {
x <- na.omit(x)
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
# Positions for guidelines and labels
horizontal.center <- (length(unique(mdf$Kalenderwoche))+1)/2
kw.horizontal.center <- as.vector(sort(unique(mdf$Kalenderwoche))[c(horizontal.center-0.5,horizontal.center+0.5)])
vpos.P75.label <- max(quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]],0.75)
,quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]],0.75))+10
# use the higher P75 value of the two weeks around the center
vpos.mean.label <- min(mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
vpos.median.label <- min(median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
hpos.vline <- which(as.vector(sort(unique(mdf$Kalenderwoche))=="2016-03"))
# custom colour palette (2 colors)
cbPaletteLine <- c("#DA2626", "#2626DA")
cbPaletteFill <- c("#F0A8A8", "#7C7CE9")
# ggplot
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)+
# se=FALSE suppresses the standard error band of the fit; the spread of the data is drawn instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
scale_x_discrete(breaks=breaks.custom)+
scale_colour_manual(values=cbPaletteLine)+
scale_fill_manual(values=cbPaletteFill)+
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Edit, 2016-03-06
The final plot now looks like this (thx, Jason!!)
I am not sure how to smooth all of the data and still use different line types for subsets within geom_smooth itself. My idea is to pull out the data that ggplot computed for the smooth and redraw it with geom_line. This is how I did it:
set.seed(42)
frame <- data.frame(series=rep(c('a','b'), 50),
x = 1:100, y = runif(100))
library(ggplot2)
g <- ggplot(frame, aes(x=x, y=y, color=series)) + geom_smooth(se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(smooth_data[smooth_data$x <= 76, ], aes(x=x, y=y, color=as.factor(group), group=group)) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 74, ], linetype="dashed", size=1.5) +
scale_color_discrete("Series", breaks=c("1", "2"), labels=c("a", "b"))
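An alternative to extracting the curve with ggplot_build() is to fit the smoother yourself and split the predictions at the cutoff. The sketch below uses loess() with its default settings, which I assume is close to what geom_smooth() picks for a data set of this size. The cutoff value and the helper name smooth_one are illustrative, and the plot is skipped if ggplot2 is not installed.

```r
set.seed(42)
frame <- data.frame(series = rep(c("a", "b"), 50),
                    x = 1:100, y = runif(100))
cutoff <- 75   # x value where the line switches to dashed

# Fit a loess curve to one series and predict it on a regular grid
smooth_one <- function(d) {
  fit  <- loess(y ~ x, data = d)
  grid <- seq(min(d$x), max(d$x), by = 0.5)
  data.frame(series = d$series[1], x = grid,
             y = predict(fit, newdata = data.frame(x = grid)))
}
smoothed <- do.call(rbind, lapply(split(frame, frame$series), smooth_one))

# Both pieces contain x == cutoff, so the two line segments join up
solid  <- smoothed[smoothed$x <= cutoff, ]
dashed <- smoothed[smoothed$x >= cutoff, ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mapping = aes(x = x, y = y, colour = series)) +
    geom_line(data = solid) +
    geom_line(data = dashed, linetype = "dashed")
  # print(p) draws the plot
}
```

Because the series column is kept, the legend shows a and b directly and no relabelling of group numbers is needed.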
You're right. The problem is that you add a layer with a continuous x to a plot whose original layer has a discrete x. One way to deal with this is to create a lookup table, which is easy in this case because x is a sequence from 1 to 14. We can then map x back to the discrete levels by indexing. In your code, it should work if you add:
level <- levels(mdf$Kalenderwoche)
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25) +
geom_line(data=smooth_data, aes(x=level[x], y=y, group=group, colour=as.factor(group), fill=NA))
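The opposite direction can also work: instead of turning the smoothed x values back into week labels, the factor can be converted to its integer positions and plotted on a continuous axis that is labelled with the levels. A small sketch with made-up data; the weeks vector, the values and the column name week_pos are hypothetical, only the column name Kalenderwoche is taken from the question.

```r
weeks <- c("2015-48", "2015-49", "2015-50", "2015-51")   # hypothetical subset
toy <- data.frame(
  Kalenderwoche = factor(rep(weeks, each = 3), levels = weeks),
  value = c(10, 12, 11, 14, 13, 15, 9, 8, 10, 16, 18, 17)
)

# Integer position of each level: 1 for the first week, 2 for the second, ...
toy$week_pos <- as.integer(toy$Kalenderwoche)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = week_pos, y = value)) +
    geom_point() +
    # Continuous axis, but labelled with the original week names
    scale_x_continuous(breaks = seq_along(weeks), labels = weeks)
  # print(p) draws the plot
}
```

With every layer on the same continuous x, smoothed lines with non-integer x values can be added without the discrete-scale error.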
Here is my attempt for the question:
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable)) +
geom_smooth(size=1.5, method="auto", se=FALSE) +
# se=FALSE suppresses the standard error band of the fit; the spread of the data is drawn instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)
smooth_data <- ggplot_build(g)$data[[1]]
ribbon_data <- ggplot_build(g)$data[[2]]
# Use them as lookup table
level <- levels(mdf$Kalenderwoche)
clevel <- levels(mdf$variable)
ggplot(smooth_data[smooth_data$x <= 13, ], aes(x=level[x], y=y, group=group, color=as.factor(clevel[group]))) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 13, ], linetype="dashed", size=1.5) +
geom_ribbon(data=ribbon_data,
aes(x=x, ymin=ymin, ymax=ymax, fill=as.factor(clevel[group]), color=NA), alpha=0.25) +
scale_x_discrete(breaks=breaks.custom) +
scale_colour_manual(values=cbPaletteLine) +
scale_fill_manual(values=cbPaletteFill) +
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Note that the legend keys are drawn with a border in this version.

Vary scale of geom_point size by facet

I'm using ggplot with facet_wrap to generate 3 side-by-side plots with linear models. In addition, I have another dimension (let's call it "z") I'd like to visualize by varying the size of the points on the plots.
Currently, the plots I generate keep the size of the points on the same scale across all 3 facets. I would instead like to scale the point sizes by facet - that way, one can quickly tell which point contains the highest "z" value for each facet.
Is there any way to do this without creating 3 separate plots? I've included a sample of my data and the code I used below:
x <- c(0.03,1.32,2.61,3.90,5.20,6.48,7.77,0.75,2.04,3.33,4.62,5.91,7.20,8.49,0.41,1.70,3.00,4.28,5.57,6.86,8.15)
y <- c(650,526,382,110,72,209,60,559,296,76,48,64,20,22,50,102,176,21,20,25,5)
z <- c(391174,244856,836435,46282,40351,27118,17411,26232,59162,9737,1917,20575,1484,450,12071,13689,133326,1662,711,728,412)
facet <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","C","C","C","C","C","C","C")
df <- data.frame(x,y,z,facet)
ggplot(df, aes(x=x, y=y)) +
geom_point(aes(size=z)) +
geom_smooth(method="lm") +
facet_wrap(~facet)
The method below replaces z with its z-score within its facet:
require(dplyr)
require(ggplot2)
require(magrittr)
require(scales)
x <- c(0.03,1.32,2.61,3.90,5.20,6.48,7.77,0.75,2.04,3.33,4.62,5.91,7.20,8.49,0.41,1.70,3.00,4.28,5.57,6.86,8.15)
y <- c(650,526,382,110,72,209,60,559,296,76,48,64,20,22,50,102,176,21,20,25,5)
z <- c(391174,244856,836435,46282,40351,27118,17411,26232,59162,9737,1917,20575,1484,450,12071,13689,133326,1662,711,728,412)
facet <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B","C","C","C","C","C","C","C")
df <- data.frame(x,y,z,facet)
df %<>%
group_by(facet) %>%
mutate(z = as.numeric(scale(z))) # z-score within group; as.numeric() drops the matrix that scale() returns
ggplot(df, aes(x=x, y=y, group = facet)) +
geom_point(aes(size=z)) +
geom_smooth(method="lm") +
facet_wrap(~facet )
Try to rescale size for each facet to take values in (0,1]:
df %>%
group_by(facet) %>%
mutate(newz = z/max(z)) %>%
ggplot(., aes(x=x, y=y)) +
geom_point(aes(size=newz)) +
geom_smooth(method="lm") +
facet_wrap(~facet)
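The same per-facet rescaling can be done in base R with ave(), which applies a function within each group and returns a vector of the original length. A minimal sketch using only the first three z values of each facet from the question:

```r
df <- data.frame(
  facet = rep(c("A", "B", "C"), each = 3),
  z = c(391174, 244856, 836435, 26232, 59162, 9737, 12071, 13689, 133326)
)

# Divide every z by the largest z in its own facet
df$newz <- ave(df$z, df$facet, FUN = function(v) v / max(v))

# The largest point in every facet now has newz equal to 1
tapply(df$newz, df$facet, max)
```

The newz column can then be mapped to size in geom_point() exactly as in the dplyr version above.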
I would just take the mean of df$z within each df$facet and size each point by its difference from that mean:
AverageFacet <- df %>% group_by(facet) %>% summarize(meanwithinfacet= mean(z, na.rm=TRUE))
df <- merge(df, AverageFacet)
df$pointsize<- df$z - df$meanwithinfacet
Now each point size depends on the mean of the facets
> head(df,10)
facet x y z meanwithinfacet pointsize
1 A 0.03 650 391174 229089.57 162084.429
2 A 1.32 526 244856 229089.57 15766.429
3 A 2.61 382 836435 229089.57 607345.429
4 A 3.90 110 46282 229089.57 -182807.571
5 A 5.20 72 40351 229089.57 -188738.571
6 A 6.48 209 27118 229089.57 -201971.571
7 A 7.77 60 17411 229089.57 -211678.571
8 B 0.75 559 26232 17079.57 9152.429
9 B 2.04 296 59162 17079.57 42082.429
and plot
ggplot(df, aes(x=x, y=y)) +
geom_point(aes(size=pointsize)) +
geom_smooth(method="lm") +
facet_wrap(~facet)
Looks like this, not sure about the legend though.
Instead of the absolute difference from the mean, you could also use the number of standard deviations that a given z lies from the mean:
AverageFacet <- df %>% group_by(facet) %>% summarize(meanwithinfacet= mean(z, na.rm=TRUE), sdwithinfacet= sd(z, na.rm=TRUE))
df <- merge(df, AverageFacet)
df$absoluteDiff<- df$z - df$meanwithinfacet
df$SDfromMean <- df$absoluteDiff / df$sdwithinfacet
ggplot(df, aes(x=x, y=y)) +
geom_point(aes(size=SDfromMean)) +
geom_smooth(method="lm") +
facet_wrap(~facet)

Grouped Barplot with three measures

I am trying to reproduce the following graph in R which is generated via Excel.
The CSV file has following content:
> myd
type am tae tac
1 kA 81.73212 73.07151 26.92849
2 kI 78.87324 84.50704 15.49296
3 kL 82.52184 69.91262 30.08738
4 kS 82.67147 69.31411 30.68589
5 sA 81.67176 73.31295 26.68705
6 sI 79.54135 81.83460 18.16540
7 sL 81.58485 73.66061 26.33939
8 sS 82.09616 71.61536 28.38464
The following R code creates am on the y axis, but I also want to add tae and tac.
ggplot(myd, aes(x=type, y=am)) + geom_bar(stat="identity")
Do you know how I can add this in R to have it like in the Excel diagram?
require(reshape2)
myd_molten <- melt(data = myd, id.vars = "type")
require(ggplot2)
ggplot(myd_molten, aes(x = type, y = value, fill=variable)) +
geom_bar(stat="identity", position=position_dodge())+
coord_flip()
Try
library(tidyr)
library(dplyr)
library(ggplot2)
gather(myd, Var, Val, am:tac) %>%
ggplot(., aes(x=type, y=Val))+
geom_bar(aes(fill=Var), position='dodge', stat='identity')+
coord_flip()
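If neither reshape2 nor tidyr is installed, the long format can also be built with base R's stack(). A sketch using the first two rows of myd with rounded values; the column names Val and Var mirror the gather() call above.

```r
myd <- data.frame(
  type = c("kA", "kI"),
  am   = c(81.73, 78.87),
  tae  = c(73.07, 84.51),
  tac  = c(26.93, 15.49)
)

# stack() puts the three measure columns underneath each other
long <- stack(myd[c("am", "tae", "tac")])
names(long) <- c("Val", "Var")

# Repeat the type labels once per stacked column
long$type <- rep(myd$type, times = 3)

long
```

The result has one row per type and measure, so it can be used with aes(x = type, y = Val) and fill = Var as in the plot above.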

ggplot not adding legend. What am I missing? very new to R

I'm plotting three samples with ggplot but it's not adding a legend for the samples. It's not producing any error message, so I'm not sure where I'm going wrong. I'd really appreciate some guidance.
I've tried to declare color for each sample for the legend manually but there is still no legend on the plot.
df<-data.frame(samples$V1, samples$V2, samples$V3, samples$V4, samples$V5, samples$V6, samples$V7)
CG_methplot <- ggplot(df, aes(x=samples$V1,))+
scale_x_continuous(breaks=number_ticks(10))+
xlab("bins")+
ylab("mean CG methylation")+
geom_point(aes(y=samples$V2), size=3, colour='#009933')+
geom_point(aes(y=samples$V3), size=3, colour='#FF0000')+
geom_point(aes(y=samples$V4), size=3, colour='#0033FF')+
scale_color_manual(values=c("samples1"="009933", "sample2"="FF0000", "sample3" ="0033FF"))
CG_methplot
As requested, sample data.
head(df)
samples.V1 samples.V2 samples.V3 samples.V4 samples.V5 samples.V6 samples.V7
1 1 0.033636 0.027857 0.028830 0.029836 0.024457 0.024930
2 2 0.032094 0.029620 0.028005 0.028294 0.026220 0.024105
3 3 0.032011 0.027212 0.029728 0.028211 0.023812 0.025828
4 4 0.030857 0.029833 0.028907 0.027057 0.026433 0.025007
5 5 0.028480 0.028080 0.028553 0.024680 0.024680 0.024653
6 6 0.029445 0.027099 0.029346 0.025645 0.023699 0.025446
library(reshape2)
library(ggplot2)
# the id column is called samples.V1 in the df shown above
melted <- melt(df, id.vars = "samples.V1")
p <- ggplot(melted, aes(x = samples.V1, y = value, colour = variable))
p + geom_point()
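For reference, the legend is missing in the question because colour is given as a fixed value outside aes(); ggplot2 only builds a legend for aesthetics that are mapped inside aes(). If you prefer to keep the wide data, one option is to map colour to a constant label in each layer and define the colours in scale_colour_manual(). A sketch with a few rounded values from head(df), shortened column names and hypothetical sample labels. Note that the hex colours need the leading #.

```r
df <- data.frame(
  V1 = 1:3,
  V2 = c(0.0336, 0.0321, 0.0320),
  V3 = c(0.0279, 0.0296, 0.0272),
  V4 = c(0.0288, 0.0280, 0.0297)
)

# Legend labels (names) and their colours (values)
sample_colours <- c(sample1 = "#009933", sample2 = "#FF0000", sample3 = "#0033FF")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = V1)) +
    # colour is mapped inside aes() to a label, not set to a hex code
    geom_point(aes(y = V2, colour = "sample1"), size = 3) +
    geom_point(aes(y = V3, colour = "sample2"), size = 3) +
    geom_point(aes(y = V4, colour = "sample3"), size = 3) +
    scale_colour_manual(name = "sample", values = sample_colours) +
    labs(x = "bins", y = "mean CG methylation")
  # print(p) draws the plot
}
```

Melting the data as shown above is usually less work once there are more than a handful of columns.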
