I am trying to create a bar plot with the ggplot2 library. My data is stored in semicolon-separated format (read with read.csv2).
# Library
library(ggplot2)
library(tidyverse) # function "%>%"
# 1. Read data (semicolon separated)
data = read.csv2(text = "Age;Frequency
0 - 10;1
11 - 20;5
21 - 30;20
31 - 40;13
41 - 49;1")
# 2. Print table
df <- as.data.frame(data)
df
# 3. Plot bar chart
ggplot(df, aes(x = Age)) +
geom_bar() +
theme_classic()
The code runs fine, but every bar in the resulting graph has the same height, as if all values were at the maximum.
You need to specify your y axis as well:
ggplot(df, aes(x = Age, y = Frequency)) +
geom_bar(stat = "identity") +
theme_classic()
The default stat of geom_bar (stat = "count") plots the number of rows for each x value, which is 1 for every Age value here (check table(df$Age)). You can use geom_bar with stat = 'identity':
library(ggplot2)
ggplot(df, aes(Age, Frequency)) +
geom_bar(stat = 'identity') +
theme_classic()
Or use geom_col:
ggplot(df, aes(Age, Frequency)) +
geom_col() +
theme_classic()
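To see why every bar had the same height, it can help to compare what the two stats work with. The following is a minimal base-R sketch using the data from the question; the names age_counts and bar_heights are only illustrative.

```r
# Same toy data as in the question (semicolon separated, hence read.csv2)
df <- read.csv2(text = "Age;Frequency
0 - 10;1
11 - 20;5
21 - 30;20
31 - 40;13
41 - 49;1")

# What stat = "count" (the geom_bar default) computes: rows per Age value
age_counts <- table(df$Age)
print(age_counts)

# What stat = "identity" / geom_col uses: the Frequency column itself
bar_heights <- setNames(df$Frequency, df$Age)
print(bar_heights)
```

Every entry of age_counts is 1 because each age band occurs in exactly one row, which is why the original plot showed bars of equal height.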
I have a question about how to stratify a plot into multiple box plots per group. This is my sample code:
library(ggplot2)
mtcars$vs <- as.character(as.numeric(mtcars$vs))
y6 <- ggplot(mtcars, aes(x=vs,y=hp)) +
geom_boxplot(aes(group = vs),outlier.shape=NA, size=1, width = 0.6, fatten = 1) +
geom_jitter(aes(x=vs, y=hp, pch = factor(cyl)), position=position_jitter(width=.1, height=0), size = 2) +
scale_shape_manual(name ="X", values = c(1,2,3)) +
coord_cartesian(ylim=c(0, 350))
This is the graph I obtain. I would like to stratify the box plots at each X-axis position by the legend variable, for a total of 6 box plots (3 per X-axis value; 3 for "1" and 3 for "2"). Is there a way to do this? I have attached an image of it below:
Thank you for your thoughts!
Here is the code for you:
library(ggplot2)
ggplot(mtcars, aes(x=vs,y=hp,fill = factor(cyl))) +
geom_boxplot(aes(fill = factor(cyl)),outlier.shape=NA, size=1, width = 0.6, fatten = 1) +
coord_cartesian(ylim=c(0, 350))
I have used the fill = argument in ggplot() to split the data by the column cyl.
If you look closer at the mtcars data and your plot, you do not actually have 3 unique values of cyl for vs = 1, just two (cyl 4 and 6). Therefore you get a total of 5 boxes.
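A quick way to check how many boxes to expect is to count the non-empty vs/cyl combinations. This is a small base-R sketch; combos and n_boxes are illustrative names.

```r
# Cross-tabulate engine shape (vs) against cylinder count (cyl)
combos <- table(vs = mtcars$vs, cyl = mtcars$cyl)
print(combos)

# Each non-empty cell becomes one box when fill = factor(cyl) is used
n_boxes <- sum(combos > 0)
print(n_boxes)
```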
Is this what you are asking?
ggplot(mtcars, aes(vs, hp)) +
geom_boxplot() +
facet_wrap(~cyl) +
theme_bw()
There are no cars with vs == 1 when cyl == 8, and only one car with vs == 0 when cyl == 4:
table(mtcars$cyl, mtcars$vs)
#      0  1
#  4   1 10
#  6   3  4
#  8  14  0
If you are a fan of colouring the plots, you can do it with the fill parameter.
ggplot(mtcars, aes(vs, hp, fill=as.factor(cyl))) +
geom_boxplot() +
facet_wrap(~cyl) +
theme_bw()
I am using ggplot geom_smooth to plot turnover data of a customer group from the previous year against the current year (based on calendar weeks). As the last week is not complete, I would like to use a dashed linetype for the last week. However, I can't figure out how to do that. I can either change the linetype for the entire plot or for an entire series, but not within a series (depending on the value of x).
To keep it simple, let's just use the following example:
set.seed(42)
frame <- data.frame(series = rep(c('a','b'),50),x = 1:100, y = runif(100))
ggplot(frame,aes(x = x,y = y, group = series, color=series)) +
geom_smooth(size=1.5, se=FALSE)
How would I have to change this to get dashed lines for x >= 75?
The goal would be something like this:
Thx very much for any help!
Edit, 2016-03-05
Of course I fail when trying to use this method on the original plot. The problem lies with the ribbon, which is calculated using stat_summary and a predefined function. I tried to use stat_summary on the original data (mdf) and geom_line on the smooth_data. Even when I comment out everything else, I still get "Error: Continuous value supplied to discrete scale". I believe the problem comes from the fact that the original x value (Kalenderwoche) was discrete, whereas the new, smoothed x is continuous. Do I have to somehow transform one into the other? What else could I do?
Here is what I tried (condensed to the essential lines):
quartiles <- function(x) {
x <- na.omit(x) # remove NA values
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
geom_line(data=smooth_data, aes(x=x, y=y, group=group, colour=group, fill=group))
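For reference, stat_summary(fun.data = ...) expects a function that returns a data frame with the columns y, ymin and ymax. Below is a minimal sketch of what a condensed copy of the quartiles helper returns for a toy vector; the input values are made up for illustration.

```r
# Condensed copy of the helper: median as y and ymin, upper quartile as ymax
quartiles <- function(x) {
  x <- na.omit(x)            # drop missing values
  med <- median(x)
  q3 <- quantile(x, 0.75)
  data.frame(y = med, ymin = med, ymax = q3)
}

toy <- c(10, 20, 30, 40, NA)  # made-up values
res <- quartiles(toy)
print(res)
```

With ymin set to the median, the ribbon spans from the median up to the 75th percentile rather than the full interquartile range.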
mdf looks like this:
str(mdf)
'data.frame': 280086 obs. of 5 variables:
$ konto_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ Kalenderwoche: Factor w/ 14 levels "2015-48","2015-49",..: 4 12 1 3 7 13 10 6 5 9 ...
$ variable : Factor w/ 2 levels "Umsatz","Umsatz Vorjahr": 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 428.3 97.8 76 793.1 ...
There are many accounts (konto_id), and for each account and calendar week (Kalenderwoche), there is a current turnover value (Umsatz) and a turnover value from last year (Umsatz Vorjahr). I can provide a smaller version of the data.frame and the entire code, if required.
Thx very much for any help!
P.S. I am a total novice in R, so my code probably looks rather stupid to pros, sorry for that :(
Edit, 2016-03-06
I have uploaded a subset of the data (mdf):
mdf
The full code of the original graph is the following (looking somewhat weird with so little data, but that's not the point ;)
library(dtw)
library(reshape2)
library(ggplot2)
library(RODBC)
library(Cairo)
# custom breaks for X axis
breaks.custom <- unique(mdf$Kalenderwoche)[c(TRUE,rep(FALSE,0))]
# function called by stat_summary
quartiles <- function(x) {
x <- na.omit(x)
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
# Positions for guidelines and labels
horizontal.center <- (length(unique(mdf$Kalenderwoche))+1)/2
kw.horizontal.center <- as.vector(sort(unique(mdf$Kalenderwoche))[c(horizontal.center-0.5,horizontal.center+0.5)])
vpos.P75.label <- max(quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]],0.75)
,quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]],0.75))+10
# use the higher P75 value of the two weeks around the center
vpos.mean.label <- min(mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
vpos.median.label <- min(median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
hpos.vline <- which(as.vector(sort(unique(mdf$Kalenderwoche))=="2016-03"))
# custom colour palette (2 colors)
cbPaletteLine <- c("#DA2626", "#2626DA")
cbPaletteFill <- c("#F0A8A8", "#7C7CE9")
# ggplot
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)+
# se=FALSE to suppress drawing of the SE of the fit. The SE of the data shall be used instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
scale_x_discrete(breaks=breaks.custom)+
scale_colour_manual(values=cbPaletteLine)+
scale_fill_manual(values=cbPaletteFill)+
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Edit, 2016-03-06
The final plot now looks like this (thx, Jason!!)
I am not sure how to smooth all the data and use different line types for subsets with geom_smooth alone. My idea is to pull out the data that ggplot used to construct the plot and use geom_line to reproduce it. This is how I did it:
set.seed(42)
frame <- data.frame(series=rep(c('a','b'), 50),
x = 1:100, y = runif(100))
library(ggplot2)
g <- ggplot(frame, aes(x=x, y=y, color=series)) + geom_smooth(se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(smooth_data[smooth_data$x <= 76, ], aes(x=x, y=y, color=as.factor(group), group=group)) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 74, ], linetype="dashed", size=1.5) +
scale_color_discrete("Series", breaks=c("1", "2"), labels=c("a", "b"))
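An alternative that avoids ggplot_build is to fit the smoother yourself and split the fitted curve at the cutoff. The sketch below assumes a plain loess fit with default settings, which may not reproduce the geom_smooth curve exactly; smooth_one, cutoff, solid and dashed are illustrative names.

```r
set.seed(42)
frame <- data.frame(series = rep(c("a", "b"), 50), x = 1:100, y = runif(100))

cutoff <- 75  # x value from which the line should be dashed

# Fit a loess curve for one series and evaluate it on a grid that
# contains the cutoff, so the solid and dashed pieces share a point
smooth_one <- function(d) {
  fit <- loess(y ~ x, data = d)
  grid <- sort(unique(c(seq(min(d$x), max(d$x), length.out = 80), cutoff)))
  data.frame(series = d$series[1], x = grid,
             y = predict(fit, newdata = data.frame(x = grid)))
}
smooth_data <- do.call(rbind, lapply(split(frame, frame$series), smooth_one))

solid  <- smooth_data[smooth_data$x <= cutoff, ]
dashed <- smooth_data[smooth_data$x >= cutoff, ]

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(solid, aes(x = x, y = y, colour = series)) +
    geom_line() +
    geom_line(data = dashed, linetype = "dashed")
  print(p)
}
```

Because both pieces include the point at the cutoff, the solid and dashed segments meet without a gap, which plays the same role as the overlapping x ranges in the code above.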
You're right. The problem is that you add a layer with a continuous x to a plot whose x is discrete. One way to deal with this is to create a lookup table, which is easy in this case because x is a sequence from 1 to 14. We can recover the discrete x by indexing into the factor levels. In your code, it should work if you add:
level <- levels(mdf$Kalenderwoche)
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25) +
geom_line(data=smooth_data, aes(x=level[x], y=y, group=group, colour=as.factor(group), fill=NA))
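The lookup idea can be tried out on its own. The following is a minimal base-R sketch with a made-up four-level factor (weeks, lev and pos are illustrative names); it also shows that indexing with non-integer positions truncates, so rounding first may be what you want when the positions come from a smoother.

```r
# Made-up factor of calendar weeks; levels are sorted alphabetically
weeks <- factor(c("2015-48", "2015-49", "2015-50", "2015-51"))
lev <- levels(weeks)

# Discrete value -> position on the axis
print(as.integer(weeks))

# Position -> discrete value (the lookup used in the answer)
pos <- c(1, 2.6, 4)
print(lev[pos])         # non-integer indices are truncated: 2.6 -> 2
print(lev[round(pos)])  # rounding picks the nearest level: 2.6 -> 3
```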
Here is my attempt for the question:
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable)) +
geom_smooth(size=1.5, method="auto", se=FALSE) +
# se=FALSE to suppress drawing of the SE of the fit. The SE of the data shall be used instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)
smooth_data <- ggplot_build(g)$data[[1]]
ribbon_data <- ggplot_build(g)$data[[2]]
# Use them as lookup table
level <- levels(mdf$Kalenderwoche)
clevel <- levels(mdf$variable)
ggplot(smooth_data[smooth_data$x <= 13, ], aes(x=level[x], y=y, group=group, color=as.factor(clevel[group]))) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 13, ], linetype="dashed", size=1.5) +
geom_ribbon(data=ribbon_data,
aes(x=x, ymin=ymin, ymax=ymax, fill=as.factor(clevel[group]), color=NA), alpha=0.25) +
scale_x_discrete(breaks=breaks.custom) +
scale_colour_manual(values=cbPaletteLine) +
scale_fill_manual(values=cbPaletteFill) +
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Note that the legend keys have a border line.
I have already plotted a bar graph, and now I'd like to add a curve going through the top point of each bar so that the trend of change can be shown more clearly.
The data frame is in a format like:
v1 v2
a 10
b 6
c 7
...
Here is the code I use to plot the bars:
ggplot(date_count, aes(V1, V2)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
xlab("date") +
ylab("Number of activity")
I have tried +geom_line() and geom_smooth() but both failed. Do you have any idea? Thanks in advance.
It is assumed you mean tops of bars rather than bottoms since the bottoms are all zero. We make the X axis continuous rather than discrete and in order to be able to see the added lines we make the bars white.
# input data in reproducible form
Lines <- "V1 V2
a 10
b 6
c 7"
date_count <- read.table(text = Lines, header = TRUE)
library(ggplot2)
n <- nrow(date_count)
ggplot(date_count, aes(x = 1:n, y = V2)) +
geom_bar(stat = "identity", fill = "white") +
theme(axis.text.x = element_text(angle=45, hjust = 1, vjust = 1)) +
xlab("date") +
ylab("Number of activity") +
scale_x_continuous(breaks = 1:n, labels = date_count$V1) +
geom_line() +
geom_smooth(lty = 2)
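If you would rather keep the discrete X axis, a commonly used alternative is to put all bars into a single group so that geom_line can connect them. This is a minimal sketch with the same toy data; it is a different approach from the continuous-axis code above, not a part of it.

```r
date_count <- data.frame(V1 = c("a", "b", "c"), V2 = c(10, 6, 7))

p <- NULL
# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(date_count, aes(x = V1, y = V2)) +
    geom_col(fill = "white", colour = "grey40") +  # bars, as with stat = "identity"
    geom_line(aes(group = 1)) +                    # one group, so points get connected
    geom_point()
  print(p)
}
```

Without group = 1, each level of a discrete x forms its own group of a single point, so there is nothing for geom_line to connect.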
I'm a little confused by your "bottom point". I'm assuming that you mean the minimal point of each group.
It would be easier to reproduce with a larger sample of data. Hence, I'm using mtcars.
I interpret the "bottom" as the minimal points, which are:
aggregate(mpg ~ cyl , mtcars, function(x)min(x))
cyl mpg
1 4 21.4
2 6 17.8
3 8 10.4
You can generate the plot in the following way:
data(mtcars)
ggplot(mtcars, aes(x=cyl,y=mpg))+
geom_bar(stat="identity")+
# fun replaces the deprecated fun.y argument in ggplot2 >= 3.3.0
stat_summary(fun=min ,geom="line",color="red")+
stat_summary(fun=sum ,geom="line",color="blue")
The red line is plotted using stat_summary at the minimum value of each group - as you wrote bottom. The blue line is the top (sum) of each group.
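The values that the two lines pass through can be checked without plotting. A small base-R sketch (group_min and group_sum are illustrative names):

```r
# Per-group minimum (red line) and sum (blue line, top of the stacked bars)
group_min <- tapply(mtcars$mpg, mtcars$cyl, min)
group_sum <- tapply(mtcars$mpg, mtcars$cyl, sum)
print(rbind(min = group_min, sum = group_sum))
```

The sum marks the top of each bar because geom_bar(stat = "identity") stacks the mpg values of all rows that share the same cyl.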
I am trying to make a plot in ggplot2 in R with the following code:
feature
[1] abs_deg_sum_1 NumAfterEdits_1 N_1 NumAfterEdits_3
[5] TimeSinceLastEdit_2 wt_product_1 NumAfterEdits_2 dwdt_1
52 Levels: abs_deg_diff_1 abs_deg_diff_2 abs_deg_diff_3 abs_deg_diff_4 ... Z_4
relative_importance
[1] 61.048212 17.235435 1.891542 1.409848 1.356924 1.264824 1.220593 1.184612
library(ggplot2)
df = data.frame(feature, relative_importance)
c <- ggplot(df, aes(x = feature, y = relative_importance, fill = feature)) + geom_bar(stat = "identity")
c + coord_flip()
positions <- c("abs_deg_sum_1", "NumAfterEdits_1", "N_1", "NumAfterEdits_3","TimeSinceLastEdit_2", "wt_product_1", "NumAfterEdits_2",
"dwdt_1")
c <- c + scale_x_discrete(limits = positions)
c + coord_flip()
Since the first value in relative_importance is really large compared to all other values, the plot doesn't show much about the other values. I get the following plot:
How can I change my code to capture more information in my plot? Especially about the smaller values
Here are several options, though I prefer the first or second (or maybe the third if you really want to go with a bar plot):
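# Fake data (group must be a factor for the as.numeric() and levels() calls below;
# since R 4.0 data.frame() no longer converts strings to factors by default)
dat = data.frame(group=LETTERS[1:5], values=c(1.5,0.6,12.6,2.1,85), stringsAsFactors=TRUE)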
# Value labels instead of bars, plus we add a horizontal segment to provide
# better visual guidance as to the relative values. This also requires
# some factor gymnastics to be able to get both the segments and the
# correct x-axis labels. I've left in the legend, but it's not necessary
# and can be removed if you wish.
ggplot(dat, aes(as.numeric(group), values, colour=group)) +
geom_segment(aes(x=as.numeric(group)-0.35, xend=as.numeric(group)+0.35,
yend=values), alpha=0.75) +
geom_text(aes(label=values), fontface="bold", show.legend=FALSE) +
scale_x_continuous(breaks=1:5, labels=levels(dat$group))
#scale_y_log10(limits=c(0.1,100), breaks=c(0.1, 0.3,1,3,10,30,100)) # For a log scale, if desired
#coord_flip() # Flip to horizontal orientation, if desired
# Value labels instead of bars
ggplot(dat, aes(group, values, colour=group)) +
geom_text(aes(label=values), fontface="bold")
# Bar plot with value labels added
ggplot(dat, aes(group, values, fill=group)) +
geom_bar(stat="identity") +
geom_text(aes(label=values, y=0.5*values), size=5, colour="black")
# Value labels instead of bars; log scale
ggplot(dat, aes(group, values, colour=group)) +
geom_text(aes(label=values)) +
scale_y_log10(limits=c(0.1,100), breaks=c(0.1,0.3,1,3,10,30,100)) +
coord_flip()
# Bar plot with log scale. Note that bar baseline is 1 instead of
# zero for a log scale, so this doesn't work so well.
ggplot(dat, aes(group, values, fill=group)) +
geom_bar(stat="identity") +
scale_y_log10(limits=c(0.1,100), breaks=c(0.1,0.3,1,3,10,30,100)) +
coord_flip()
# Points instead of bars; log scale
ggplot(dat, aes(group, values, fill=group)) +
geom_point(pch=21, size=4) +
scale_y_log10(limits=c(0.1,100), breaks=c(0.1,0.3,1,3,10,30,100)) +
coord_flip()
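The remark about the bar baseline can be checked numerically: on a log10 scale the baseline sits at log10(1) = 0, so any value below 1 ends up on the other side of it. A small base-R sketch with the same fake data (the added column names are illustrative):

```r
dat <- data.frame(group = LETTERS[1:5], values = c(1.5, 0.6, 12.6, 2.1, 85))

# Position of each value on the log10 scale; the bar baseline is at 0
dat$log10_value <- log10(dat$values)

# Values below 1 have a negative log10 and fall below the baseline
dat$below_baseline <- dat$values < 1
print(dat)
```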
If the logarithmic axis doesn't work for you and if you have some flexibility in the plot format, you could divide the features into two groups based on the value of relative_importance and show each in its own panel with appropriate y-scales. Code including adjustment of bar widths would look like:
library(ggplot2)
# assign rows to Large or Small group
cut_off_for_small_values <- 3
small_value_title <- "Expanded_Scale_for_Smaller_Values"
df <- data.frame(feature, relative_importance,
importance_grp = ifelse(relative_importance > cut_off_for_small_values,
"All", small_value_title))
# calculate relative bar widths
width_adj <- .8*nrow(df[df$importance_grp==small_value_title,])/nrow(df)
# plot data
c <- ggplot(df, aes(x = feature, y = relative_importance, fill = feature))
c <- c + geom_bar(data=transform(df, importance_grp="All"),
stat = "identity")
c <- c + geom_bar(data=df[df$importance_grp==small_value_title,],
stat = "identity", width=width_adj)
c <- c + geom_text(aes(x = feature, y = relative_importance,
label = format(relative_importance, digits=3), vjust=-.5))
c <- c + theme(axis.text.x = element_text(angle=90))
c <- c + facet_wrap( ~ importance_grp, scales="free" )
which gives plot
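The grouping and the width adjustment can be checked in isolation. The sketch below uses the relative_importance values printed in the question; is_small and n_small are illustrative names.

```r
# Values as printed in the question
relative_importance <- c(61.048212, 17.235435, 1.891542, 1.409848,
                         1.356924, 1.264824, 1.220593, 1.184612)
cut_off_for_small_values <- 3

# Rows at or below the cutoff go into the expanded-scale panel
is_small <- relative_importance <= cut_off_for_small_values
n_small <- sum(is_small)

# Bars in the small panel are narrowed in proportion to the share of rows shown there
width_adj <- 0.8 * n_small / length(relative_importance)
print(c(n_small = n_small, width_adj = width_adj))
```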