Plot including one categorical variable and two numeric variables - r

How can I show the values of AverageTime and AverageCost for each Type on a single graph? The two variables are on different scales, since one is an average time and the other is an average cost. I want Type on the x-axis and the values of AverageTime and AverageCost on the y-axis, so that I end up with two line plots in one graph.
Type<-c("a","b","c","d","e","f","g","h","i","j","k")
AverageTime<-c(12,14,66,123,14,33,44,55,55,6,66)
AverageCost<-c(100,10000,400,20000,500000,5000,700,800,400000,500,120000)
df<-data.frame(Type,AverageTime,AverageCost)

This could be done using facet_wrap and scales="free_y" like so:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>%
  mutate(AverageCost=as.numeric(AverageCost), AverageTime=as.numeric(AverageTime)) %>%
  gather(variable, value, -Type) %>%
  ggplot(aes(x=Type, y=value, colour=variable, group=variable)) +
  geom_line() +
  facet_wrap(~variable, scales="free_y")
That way you can compare the two lines even though they are on different scales.
HTH
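The tidyr documentation now lists gather() as superseded by pivot_longer(). A minimal sketch of the same reshape and faceted plot with pivot_longer() follows; the object names `long` and `p` are only illustrative, and the data frame is repeated so the block runs on its own.

```r
library(tidyr)
library(ggplot2)

# Data from the question, repeated so the example is self-contained
df <- data.frame(
  Type = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
  AverageTime = c(12, 14, 66, 123, 14, 33, 44, 55, 55, 6, 66),
  AverageCost = c(100, 10000, 400, 20000, 500000, 5000, 700, 800, 400000, 500, 120000)
)

# One row per Type/variable pair instead of one column per variable
long <- pivot_longer(
  df,
  cols = c(AverageTime, AverageCost),
  names_to = "variable",
  values_to = "value"
)

# Same plot as above: one panel per variable, each with its own y scale
p <- ggplot(long, aes(x = Type, y = value, colour = variable, group = variable)) +
  geom_line() +
  facet_wrap(~variable, scales = "free_y")
print(p)
```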

# install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
p <- ggplot(df, aes(AverageTime, AverageCost, colour=Type)) + geom_point()
p + geom_abline()

Showing both lines in the same plot is hard because they are on different scales. You also need to make sure AverageTime and AverageCost are numeric.
library(ggplot2)
library(reshape2)
library(dplyr)  # provides %>%, group_by() and summarise()
To plot both lines in one graph and average each variable within each type, you need to do some reshaping.
df_ag <- melt(df, id.vars=c("Type"))
df_ag_sb <- df_ag %>%
  group_by(Type, variable) %>%
  summarise(meanx = mean(as.numeric(value), na.rm=TRUE))
ggplot(df_ag_sb, aes(x=Type, y=as.numeric(meanx), color=variable, group=variable)) + geom_line()
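If both lines really have to share one panel, a secondary axis is one option. The sketch below is not from the answers above: it rescales AverageCost onto the AverageTime range and uses ggplot2's sec_axis() to label the right-hand axis in the original cost units. The scale factor `k` (ratio of the two maxima) is an assumption chosen for illustration; any other sensible factor would work.

```r
library(ggplot2)

# Data from the question, repeated so the example is self-contained
df <- data.frame(
  Type = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
  AverageTime = c(12, 14, 66, 123, 14, 33, 44, 55, 55, 6, 66),
  AverageCost = c(100, 10000, 400, 20000, 500000, 5000, 700, 800, 400000, 500, 120000)
)

# Assumed scale factor: maps the largest cost onto the largest time
k <- max(df$AverageCost) / max(df$AverageTime)

p <- ggplot(df, aes(x = Type, group = 1)) +
  geom_line(aes(y = AverageTime, colour = "AverageTime")) +
  # Cost is drawn on the time scale; the secondary axis undoes the division
  geom_line(aes(y = AverageCost / k, colour = "AverageCost")) +
  scale_y_continuous(
    name = "AverageTime",
    sec.axis = sec_axis(~ . * k, name = "AverageCost")
  ) +
  labs(colour = NULL)
print(p)
```

Dual axes can mislead because the apparent crossing points depend entirely on `k`, so the faceted version is often the safer choice.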

Related

Combine scale_x_upset with scale_y_break

I made an upset plot using the ggupset package and added a break to the y axis with scale_y_break from the ggbreak package.
However, when I add scale_y_break, the combination matrix under the bar plot disappears.
Is there a way to combine the combination matrix of the plot made without scale_y_break with the bar plot portion of a plot made with scale_y_break? I can't seem to be able to access the grobs of these plots or use any other workaround. If anyone could help, I would greatly appreciate it!
Example with scale_x_upset and scale_y_break:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
I would like to combine the barplot portion of the plot created with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
with the combination matrix portion of the plot made with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)
Thanks!

Plot multiple series where each series has a distinct line type and a specific part of each series is colored differently using ggplot2

I want to plot multiple series where each series has a distinct line type and a specific part of each series is colored differently, using ggplot2.
I prepared the data and plotted it as follows:
# Load melted data frame
df = read.table(text="time,group,variable,value
1,train,preds,-1.01327781066807
2,train,preds,-1.06923407042272
3,train,preds,-1.0738165129006
4,train,preds,-1.0570173408663
5,train,preds,-0.849528539296128
6,train,preds,-1.00956150966228
7,train,preds,-1.05129344633106
8,train,preds,-1.01137384052835
9,train,preds,-0.986500386274102
10,train,preds,-0.782545791298946
11,train,preds,-0.492011449844967
12,train,preds,0.0752350668715425
13,train,preds,0.718851922060212
14,train,preds,0.907488713099219
15,train,preds,0.809418859320128
16,train,preds,0.799428598786513
17,train,preds,0.89455950317809
18,train,preds,0.891727059592248
19,train,preds,0.839506291414727
20,train,preds,0.891986330803872
21,train,preds,0.868653513783531
22,train,preds,0.867573512960701
23,train,preds,0.790999769131768
24,test,preds,0.836612851268108
25,test,preds,0.835266880809444
26,test,preds,0.825396293221058
27,test,preds,0.82669719817616
1,train,actual,-1.06741896705375
2,train,actual,-1.07208489151112
3,train,actual,-1.04309035399799
4,train,actual,-1.11384867929676
5,train,actual,-1.10435803969419
6,train,actual,-1.06534456421351
7,train,actual,-1.04953633499216
8,train,actual,-1.05459775190554
9,train,actual,-0.981186588772681
10,train,actual,-0.96224883216766
11,train,actual,-0.892023497056106
12,train,actual,0.830642326040778
13,train,actual,0.834595424714826
14,train,actual,0.881344777367528
15,train,actual,0.915772459185225
16,train,actual,0.929638947563377
17,train,actual,0.994907176661985
18,train,actual,0.99423350946309
19,train,actual,0.989942263051002
20,train,actual,0.967976146034507
21,train,actual,0.787447328638445
22,train,actual,0.586847009899609
23,train,actual,0.84574152360878
24,test,actual,1.01305250589053
25,test,actual,1.06157202086132
26,test,actual,1.01496086957322
27,test,actual,0.999883908716498", sep=",", stringsAsFactors=F, header=T)
# plot
ggplot(data=df, aes(x=time, y=value)) + geom_line(aes(color=group, linetype=variable))
The result is:
There is a break between train and test part of the series. How can I get them connected? I tried geom_path but couldn't do it.
You need to add another data point to connect the lines. For example:
library(ggplot2)
library(dplyr)
# Take the first test record for each variable and relabel its
# group as "train"
additional_line <- df %>%
  filter(group == "test") %>%
  group_by(variable) %>%
  filter(time == min(time)) %>%
  ungroup() %>%
  mutate(group = "train")
# Then bind these rows to the original dataset
df_revised <- bind_rows(df, additional_line)
# Now plot the new data: the extra "train" point extends the train
# line up to the start of the test line.
ggplot(data=df_revised, aes(x=time, y=value)) +
  geom_line(aes(color=group, linetype=variable))
# You can draw the original data without the additional records if
# you are not using linetype
ggplot(data=df, aes(x=time, y=value)) +
  geom_line(aes(group = variable, color=group))
# With linetype in the plot, this configuration causes an error
ggplot(data=df, aes(x=time, y=value)) +
  geom_line(aes(group = variable, color=group, linetype = variable))
#> Error: geom_path: If you are using dotted or dashed lines, colour,
#> size and linetype must be constant over the line
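The bridging step can also be sketched in base R. The toy data frame below is made up purely for illustration (it is not the data from the question), and `first_test` and `toy_revised` are hypothetical names; the point is only to show that one relabelled row per variable is appended.

```r
# Hypothetical toy data: two series, each with two train and two test points
toy <- data.frame(
  time = rep(1:4, times = 2),
  group = rep(c("train", "train", "test", "test"), times = 2),
  variable = rep(c("preds", "actual"), each = 4),
  value = c(0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45),
  stringsAsFactors = FALSE
)

# First test record of each variable (ave() returns the per-variable minimum time)
test_rows <- toy[toy$group == "test", ]
is_first <- test_rows$time == ave(test_rows$time, test_rows$variable, FUN = min)
first_test <- test_rows[is_first, ]

# Relabel those rows as "train" so the train line reaches the first test point
first_test$group <- "train"
toy_revised <- rbind(toy, first_test)
print(toy_revised)
```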

Box plot with ggplot2 using data from read.table

I am plotting a box plot that shows the height of students. However, I am unsure what to use as x and y. I only have measurements, so one axis should be the height and the other the number of students with that height.
x=N, y=Height
My code:
# Library
library(ggplot2)
library(tidyverse)
# 1. Read data (comma separated)
data = read.table(text = "184,180,183,184,184,160,173",
sep=",",stringsAsFactors=F, na.strings="unknown")
# 2. Print table
print(data)
# 3. Plot box plot
data %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x=value, y=value)) +
geom_boxplot() +
theme_classic() +
xlab("Students") +
ylab("Height") +
ggtitle("Height of students")
I think the best plot to represent a single vector of data is a histogram. However, you can still use a boxplot by creating a dummy factor that groups your observations, i.e.
data %>%
pivot_longer(cols = everything()) %>%
mutate(type="student") %>%
ggplot(aes(x=type, y=value)) +
geom_boxplot() +
theme_classic() +
xlab("Students") +
ylab("Height") +
ggtitle("Height of students")
If you want a histogram (which I think suits your situation much better), you don't need the dummy factor and could do something like:
data %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x=value)) +
  geom_histogram() +
  theme_classic() +
  xlab("Height") +
  ylab("Number of students") +
  ggtitle("Height of students")
To use a boxplot correctly, you need one categorical variable and one continuous variable. Put the categorical variable (e.g. male, female, etc.) on the x-axis and the continuous variable (height in your case) on the y-axis.
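It may help to look at the numbers a box plot summarises for the seven heights in the question. The sketch below uses base R's fivenum() and boxplot.stats(); note that geom_boxplot() computes its hinges from quantiles, which can differ slightly from these for other data sets.

```r
# Heights from the question
heights <- c(184, 180, 183, 184, 184, 160, 173)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(heights)
print(five)

# Whisker ends and hinges after applying the 1.5 * IQR rule,
# plus any points that fall beyond the whiskers
bs <- boxplot.stats(heights)
print(bs$stats)
print(bs$out)
```

Here the median is 183 and the value 160 lies beyond the lower whisker, so it would be drawn as a separate outlier point rather than as part of the box.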

ggplot geom_bar leave blank spaces for 0 values by group

Below is a simple ggplot bar plot:
x<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
y<-c(1,2,3,4,5,3,3,3,3,4,5,5,6,7,6,5,4,3,2,3,4,5,3,2,1,1,1,1,1)
d<-data.frame(x,y)  # ggplot() needs a data frame; cbind() would return a matrix
ggplot(data=d,aes(x=x,fill=as.factor(y)))+
geom_bar(position = position_dodge())
The issue I'm having is that not every value of y is present in each grouping of x. For example, group 1 along the x-axis only contains values 1-5 of the y variable and has no values for 6 or 7. I would like the plot to leave blank spaces when there are no values for a given y in an x-grouping, so that it is easier to compare the x-groups.
A solution is to compute the frequencies manually and plot the graph based on that frequencies table.
library(ggplot2)
d1 <- data.frame(table(d))
d1$x <- factor(d1$x)
ggplot(d1, aes(x, Freq, fill = factor(y))) +
geom_bar(stat = "identity", position = position_dodge())
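The reason the manual table helps is that table() keeps every x/y combination, including those with a count of zero, so each x-group gets a slot for every y. A small base R sketch that lists the empty combinations (the names `counts` and `empty` are illustrative):

```r
x <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
y <- c(1,2,3,4,5,3,3,3,3,4,5,5,6,7,6,5,4,3,2,3,4,5,3,2,1,1,1,1,1)
d <- data.frame(x, y)

# Cross-tabulate x against y; every combination appears, zero counts included
counts <- as.data.frame(table(d))

# Combinations that would otherwise be dropped from the dodged bar plot
empty <- subset(counts, Freq == 0)
print(empty)
```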
Another way to do this with dplyr is to use tally() to count the frequencies, but you need to make sure your variables are set as factors first.
library(tidyverse)
# set factor levels
d2 <- d %>% data.frame() %>% mutate(x=factor(x, levels=c(1:3)),
                                    y=factor(y, levels=c(1:7)))
# count frequencies and send to ggplot2
d2 %>% group_by(x, y, .drop=FALSE) %>% tally() %>%
  ggplot(aes(x=x, y=n, fill=y, color=y)) +
  geom_bar(position = position_dodge2(),
           stat="identity")
Using color=y and fill=y in the aes statement helps to show exactly where on the plot the zero values are. You can now see that y=6 and y=7 are missing from x=1 and x=3, and that y=1 is missing from x=2.
I chose position_dodge2 purely out of personal preference.

ggplot2: Stack barcharts with group means

I have tried several things to make ggplot plot bar charts with means derived from factors in a data frame, but I wasn't successful.
If you consider:
df <- as.data.frame(matrix(rnorm(60*2, mean=3,sd=1), 60, 2))
df$factor <- c(rep(factor(1:3), each=20))
I want to achieve a stacked, relative barchart like this:
This chart was created by manually calculating group means in a separate data frame, melting it, and using geom_bar(stat="identity", position = "fill") and scale_y_continuous(labels = percent_format()). I haven't found a way to use stat_summary with stacked bar charts.
As a second step, I would like to attach error bars to the breaks of each column. I have six treatments and three species, so error bars should be OK.
For anything this complicated, I think it's loads easier to pre-calculate the numbers, then plot them. This is easily done with dplyr/tidyr (even the error bars):
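gather(df, 'cat', 'value', 1:2) %>%
  group_by(factor, cat) %>%
  summarise(mean=mean(value), se=sd(value)/sqrt(n())) %>%
  group_by(cat) %>%
  mutate(perc=mean/sum(mean), ymin=cumsum(perc) -se/sum(mean), ymax=cumsum(perc) + se/sum(mean)) %>%
  ggplot(aes(x=cat, y=perc, fill=factor(factor))) +
  # reverse=TRUE stacks the first factor level at the bottom so the bars match
  # cumsum(); ggplot2 >= 2.2.0 stacks in the opposite order by default
  geom_bar(stat='identity', position=position_stack(reverse=TRUE)) +
  geom_errorbar(aes(ymax=ymax, ymin=ymin))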
Of course this looks a bit strange because there are error bars around 100% in the stacked bars. I think you'd be much better off plotting the actual data points, plus means and error bars, and using faceting:
gather(df, 'cat', 'value', 1:2) %>%
  group_by(cat, factor) %>%
  summarise(mean=mean(value), se=sd(value)/sqrt(n())) %>%
  ggplot(aes(x=cat, y=mean, colour=factor(factor))) +
  geom_point(aes(y=value), position=position_jitter(width=.3, height=0), data=gather(df, 'cat', 'value', 1:2) ) +
  geom_point(shape=5, size = 3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1) +
  facet_grid(factor ~ .)
This way anyone can examine the data and see for themselves whether they are normally distributed.
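The pre-calculation step can also be checked without plotting. The sketch below is a base R stand-in for the dplyr pipeline, not the answer's own code: it computes the group means with aggregate() and converts them to within-column shares, which is what a relative stacked bar displays. The set.seed(1) call is an assumption added only for reproducibility, and `means` and `perc` are illustrative names.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible

# Same construction as in the question: two numeric columns, three groups of 20
df <- as.data.frame(matrix(rnorm(60 * 2, mean = 3, sd = 1), 60, 2))
df$factor <- rep(factor(1:3), each = 20)

# Mean of V1 and V2 within each factor level
means <- aggregate(cbind(V1, V2) ~ factor, data = df, FUN = mean)

# Share of each group mean within its column; each column sums to 1
m <- as.matrix(means[, c("V1", "V2")])
perc <- sweep(m, 2, colSums(m), "/")
rownames(perc) <- as.character(means$factor)
print(perc)
```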

Resources