Clustered bar chart R using 2 Numeric Variables/Metrics - r

I want to create a clustered Bar chart in R using 2 numeric variables, e.g:
Movie Genre (X-axis) and Gross$ + Budget$ should be Y-axis
It's a very straightforward chart to create in Excel. However, in R, I have put Genre in my X-axis and Gross$ in Y-axis.
My question is: where in my code do I put the second numeric variable, i.e. Budget$, so that it appears beside Gross$ in the chart?
Here is my Code:
ggplot(data = HW, aes(x = reorder(Genre, -Gross...US, sum),
                      y = Gross...US)) +
  geom_col()
P.S. In aes I have just put reorder to sort the categories.
Appreciate help!

Could you give us some data so we can recreate it?
I think you are looking for geom_bar() and one of its options, position="dodge", which tells ggplot to put the bars side by side. But without knowing your data and its structure I can't further help you.

Melting the dataset should help in this case. A dummy-data based example below:
Data
HW <- data.frame(Genre = letters[sample(1:6, 100, replace = T)],
Gross...US = rnorm(100, 1e6, sd=1e5),
Budget...US = rnorm(100, 1e5, sd=1e4))
Code
library(tidyverse)
library(reshape2)
HW %>%
  melt(id.vars = "Genre") %>%   # long format: one row per Genre/metric value
  ggplot(aes(Genre, value, fill = variable)) + geom_col(position = 'dodge')
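One thing to watch with the melted data: each Genre still has many rows per metric, so dodged bars for the same Genre and metric are drawn on top of each other rather than summed. A minimal sketch of one way to get per-genre totals (which is what the Excel chart and the `sum` in the question's reorder() suggest), assuming the tidyr and dplyr packages; the names HW_long, metric and genre_order are made up for this illustration:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
# Dummy data in the same shape as the answer above
HW <- data.frame(Genre = letters[sample(1:6, 100, replace = TRUE)],
                 Gross...US = rnorm(100, 1e6, sd = 1e5),
                 Budget...US = rnorm(100, 1e5, sd = 1e4))

# Long format with one row per Genre/metric pair holding the total
HW_long <- HW %>%
  pivot_longer(c(Gross...US, Budget...US),
               names_to = "metric", values_to = "value") %>%
  group_by(Genre, metric) %>%
  summarise(value = sum(value), .groups = "drop")

# Genres sorted by descending total gross, as reorder() did in the question
genre_order <- HW_long %>%
  filter(metric == "Gross...US") %>%
  arrange(desc(value)) %>%
  pull(Genre)

p <- ggplot(HW_long, aes(x = factor(Genre, levels = genre_order),
                         y = value, fill = metric)) +
  geom_col(position = "dodge") +
  labs(x = "Genre", y = "Total (US$)")
p
```

Summarising before plotting keeps exactly one bar per Genre and metric, so the dodge has nothing to overlap.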

Related

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had successfully reorganised the data, then the y-values would go up as the plot moves along the x-axis. So maybe I'm focusing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
Rather than trying to order the data before creating the plot, I can reorder it inside the plot call:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- This line controls the order in which points are drawn, to make a (slightly) more readable plot
ggplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create
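The reason sorting the rows has no visible effect is that ggplot lays out a discrete x axis by factor levels (alphabetical for character data), not by row order. A small sketch with made-up data; toy is a hypothetical stand-in for cor.data and only has the two columns needed here:

```r
library(ggplot2)

# Hypothetical stand-in for cor.data
toy <- data.frame(pic = c("a", "b", "c", "d"),
                  r.val = c(0.4, -0.2, 0.9, 0.1),
                  stringsAsFactors = FALSE)

# Sorting rows changes the data frame but not the axis order
toy_sorted <- toy[order(toy$r.val), ]

# reorder() returns a factor whose levels follow increasing r.val
pic_ordered <- reorder(toy$pic, toy$r.val)
levels(pic_ordered)

p <- ggplot(toy, aes(x = reorder(pic, r.val), y = r.val)) +
  geom_point() +
  labs(x = "pic")
p
```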

Multiple line plot using ggplot2

I am trying to emulate a ggplot of multiple lines which works as follows:
set.seed(45)
df <- data.frame(x=c(1,2,3,4,5,1,2,3,4,5,3,4,5), val=sample(1:100, 13),
variable=rep(paste0("category", 1:3), times=c(5,5,3)))
ggplot(data = df, aes(x=x, y=val)) + geom_line(aes(colour=variable))
I can get this simple example to work, however on a much larger data set I am following the same steps but it is not working.
ncurrencies = 6
dates = c(BTC$Date, BCH$Date, LTC$Date, ETH$Date, XRP$Date, XVG$Date)
opens = c(BTC$Open, BCH$Open, LTC$Open, ETH$Open, XRP$Open, XVG$Open)
categories = rep(paste0("categories", 1:ncurrencies),
times=c(nrow(BTC), nrow(BCH), nrow(LTC), nrow(ETH), nrow(XRP), nrow(XVG)))
df = data.frame(dates, opens, categories)
# Plot - Not correct.
ggplot(data=df, aes(x=dates, y=opens)) +
geom_line(aes(colour=categories))
As you can see, the different points are discretised and the y-axis is strange. I am guessing this is a rookie error but I have been going round in circles for a while. Can anyone see it?
P.S. I don't think I can upload the data here as it would be too much code. However, the dataframe is in the same format as the practice example and the categories match up correctly to the x and y data. Therefore I believe it is the way I am defining ggplot - I am relatively new to R.
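Thank you Markus and Jan, yes you are correct. df$opens was a factor, and converting it to numeric solved the problem. Going through as.character() first matters: as.numeric() applied directly to a factor returns its internal level codes rather than the values.
opens = as.numeric(unlist(lapply(list(BTC$Open, BCH$Open, LTC$Open, ETH$Open, XRP$Open, XVG$Open), as.character)))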
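A common pitfall with this kind of conversion is that as.numeric() on a factor yields the level codes, which plot as a smooth line but carry the wrong values. A tiny illustration with made-up prices (opens_factor is a hypothetical name, not part of the original data):

```r
# Made-up prices stored as a factor, as read.csv() could produce
opens_factor <- factor(c("7200.5", "310.2", "95.1"))

codes  <- as.numeric(opens_factor)                # level codes, not prices
values <- as.numeric(as.character(opens_factor))  # the actual numbers

codes
values
```

If the Open columns became factors because of non-numeric characters such as thousands separators, those would need to be stripped before the conversion, otherwise as.numeric() returns NA for them.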

Manually added legend not working in ggplot2?

Here's a facsimile of my data:
d1 <- data.frame(
e=rnorm(3000,10,10)
)
d2 <- data.frame(
e=rnorm(2000,30,30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2)
But when I try to manually add a legend, like so:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2) +
scale_fill_manual(name="Data", values = c("XXXXX" = "red","YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I often find this cheat sheet useful. Then, when we plot, we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
geom_density()
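As for why the manual legend did nothing: a legend is only drawn for aesthetics mapped inside aes(), and in the question fill was set as a fixed colour outside it, so scale_fill_manual() had nothing to act on. If the two data frames need to stay separate, a sketch of one alternative is to map fill to a constant label in each layer; the alpha value is an assumption added here so the overlap stays visible:

```r
library(ggplot2)

set.seed(1)
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

# fill is mapped (inside aes) to a label; the scale turns labels into colours
p <- ggplot() +
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  scale_fill_manual(name = "Data",
                    values = c("XXXXX" = "red", "YYYYY" = "blue"))
p
```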

Label selected percentage values inside stacked bar plot (ggplot2)

I want to put labels of the percentages on my stacked bar plot. However, I only want to label the largest 3 percentages for each bar. I went through a lot of helpful posts on SO (for example: 1, 2, 3), and here is what I've accomplished so far:
library(ggplot2)
groups<-factor(rep(c("1","2","3","4","5","6","Missing"),4))
site<-c(rep("Site1",7),rep("Site2",7),rep("Site3",7),rep("Site4",7))
counts<-c(7554,6982, 6296,16152,6416,2301,0,
20704,10385,22041,27596,4648, 1325,0,
17200, 11950,11836,12303, 2817,911,1,
2580,2620,2828,2839,507,152,2)
tapply(counts,site,sum)
tot<-c(rep(45701,7),rep(86699,7), rep(57018,7), rep(11528,7))
prop<-sprintf("%.1f%%", counts/tot*100)
data<-data.frame(groups,site,counts,prop)
ggplot(data, aes(x=site, y=counts,fill=groups)) + geom_bar()+
stat_bin(geom = "text",aes(y=counts,label = prop),vjust = 1) +
scale_y_continuous(labels = percent)
I wanted to insert my output image here but don't seem to have enough reputation...But the code above should be able to produce the plot.
So how can I only label the largest 3 percentages on each bar? Also, for the legend, is it possible for me to change the order of the categories? For example put "Missing" at the first. This is not a big issue here but for my real data set, the order of the categories in the legend really bothers me.
I'm new on this site, so if there's anything that's not clear about my question, please let me know and I will fix it. I appreciate any answer/comments! Thank you!
I did this in a sort of hacky manner. It isn't that elegant.
Anyways, I used the plyr package, since the split-apply-combine strategy seemed to be the way to go here.
I recreated your data frame with a variable perc that represents the percentage for each site. Then, for each site, I just kept the 3 largest values for prop and replaced the rest with "".
# I added some variables, and added stringsAsFactors=FALSE
data <- data.frame(groups, site, counts, tot, perc=counts/tot,
prop, stringsAsFactors=FALSE)
# Load plyr
library(plyr)
# Split on the site variable, and keep all the other variables (is there an
# option to keep all variables in the final result?)
data2 <- ddply(data, ~site, summarize,
groups=groups,
counts=counts,
perc=perc,
prop=ifelse(perc %in% sort(perc, decreasing=TRUE)[1:3], prop, ""))
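# I changed some of the plotting parameters; geom_col() and geom_text() are used
# because current ggplot2 no longer accepts geom_bar() with a y aesthetic or
# stat_bin(geom = "text") here
library(scales)  # provides percent
ggplot(data2, aes(x=site, y=perc, fill=groups)) + geom_col() +
geom_text(aes(label = prop), position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = percent)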
EDIT: Looks like your scales are wrong in your original plotting code. It gave me results with 7500000% on the y axis, which seemed a little off to me...
EDIT: I fixed up the code.
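For comparison, the same split-apply-combine idea can be sketched with dplyr, where group_by() plus mutate() keeps all columns without listing them. This is an alternative under stated assumptions, not the approach used above: the name labelled is made up, and ties are broken by first occurrence so that exactly three labels survive per site. The factor() call also addresses the legend-order part of the question by putting "Missing" first.

```r
library(dplyr)
library(ggplot2)

site   <- rep(paste0("Site", 1:4), each = 7)
groups <- rep(c("1", "2", "3", "4", "5", "6", "Missing"), 4)
counts <- c(7554, 6982, 6296, 16152, 6416, 2301, 0,
            20704, 10385, 22041, 27596, 4648, 1325, 0,
            17200, 11950, 11836, 12303, 2817, 911, 1,
            2580, 2620, 2828, 2839, 507, 152, 2)
data <- data.frame(site, groups, counts, stringsAsFactors = FALSE)

labelled <- data %>%
  group_by(site) %>%
  mutate(perc = counts / sum(counts),
         # keep a label only for the three largest shares within each site
         prop = ifelse(rank(-perc, ties.method = "first") <= 3,
                       sprintf("%.1f%%", perc * 100), "")) %>%
  ungroup()

# Legend (and stacking) order follows the factor levels
labelled$groups <- factor(labelled$groups,
                          levels = c("Missing", as.character(1:6)))

p <- ggplot(labelled, aes(x = site, y = perc, fill = groups)) +
  geom_col() +
  geom_text(aes(label = prop), position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent)
p
```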

ggplot2 geom_bar plot where ..count.. greater than X

How do I tell ggplot to plot bars only if the count is greater than X? I know this should be easy, but I couldn't figure it out. Something like:
ggplot(items,aes(x=itemname,y=..count..))+geom_bar(y>X)
If I understand your question correctly (you haven't provided example data), the easiest way is to generate the data frame you want to plot outside of ggplot. So
##Example data
items = data.frame(itemname = sample(LETTERS[1:5], 30, replace=TRUE))
##Use table to count elements
items_sum = as.data.frame(table(items))
Then plot
X = 4
# table() on a one-column data frame names the resulting column after it (itemname)
ggplot(items_sum[items_sum$Freq > X,], aes(x=itemname,y=Freq)) +
geom_bar(stat="identity")
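I may be mistaken here, but can't you simply pass the subsetting to geom_bar()? Older ggplot2 versions accepted subset=.(Freq>4) for this; as far as I know that argument was removed in ggplot2 2.0.0, and current versions instead let you give the layer's data argument a function that filters the plot data:
ggplot(items_sum, aes(x=itemname,y=Freq)) + geom_bar(data=function(d) subset(d, Freq>4), stat="identity")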
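The count-then-filter step can also be written as a single pipeline. A minimal sketch assuming the dplyr package; items_big is a made-up name, and the example data mirrors the answer above:

```r
library(dplyr)
library(ggplot2)

set.seed(1)
items <- data.frame(itemname = sample(LETTERS[1:5], 30, replace = TRUE))
X <- 4

# Count each item, then keep only those seen more than X times
items_big <- items %>%
  count(itemname, name = "Freq") %>%
  filter(Freq > X)

p <- ggplot(items_big, aes(x = itemname, y = Freq)) +
  geom_col()
p
```

geom_col() is equivalent to geom_bar(stat = "identity") and avoids repeating the stat argument.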
