make multiple separate stacked barplots from one data frame - r

I would like to create multiple grouped and stacked bar plots from several data frames and export the plots to a single PDF file.
I have several data frames with the same format but different values. For each data frame I would like to create multiple stacked and grouped bar plots. Ideally, the bar plots of the same group from the different data frames should be placed next to each other and share the same y-axis range (so that the data frames are easy to compare visually).
Here is an example of what my data looks like:
data1 <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'),
Year=c('2012','2013','2014','2015','2012','2013','2014','2015','2012','2013','2014','2015'),
Fruit=c(5,3,6,3,5,4,2,2,3,4,6,2),
Vegetables=c(3,6,1,4,8,9,43,2,1,5,0,1),
Rice=c(20,23,53,12,45,5,23,12,32,41,54,32))
data2 <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'),
Year=c('2012','2013','2014','2015','2012','2013','2014','2015','2012','2013','2014','2015'),
Fruit=c(2,4,5,2,3,9,4,7,5,7,4,7),
Vegetables=c(9,7,8,7,4,3,0,0,2,3,5,6),
Rice=c(23,12,32,41,54,32,20,23,53,12,45,5))
I started by formatting the tables like this:
data1 <- pivot_longer(data1, cols = 3:5, names_to = 'Type', values_to = 'value')
data2 <- pivot_longer(data2, cols = 3:5, names_to = 'Type', values_to = 'value')
My attempts to use ggplot to create the desired PDF have failed so far. I tried several different approaches but could not get close to the desired PDF. I found instructions on how to create several plots for one data frame, or grouped plots, or stacked plots, but never the combination of all three.
If possible the PDF I would like to get for this example should look like this:
In total 6 plots: left 3 plots data1, right 3 plots data2; Group A row1, Group B row2, Group C row3 (if possible same y axis length in one row/Group)
All bar plots: x-axis = years, y-axis = value / 1 stacked bar per year with colors matching Type (Fruit, Vegetables, Rice)
Group name per row
data source(data1, data2) per column
legend with Types (Fruit, Vegetable, Rice)
Q1. Is something like this possible, or would one have to create two PDFs (one for each data frame, here data1 and data2)?
Q2. Is it possible to write the code so that it automatically adjusts the number of plots to the data frames, adjusts the PDF page size, and creates a new page if necessary? (In reality I have 5 data frames and 13 groups; this may change over time.)
I know this is quite a difficult code to write. I have spent two working days on this already though, which is why I am now asking for help here. I will try again tomorrow and post any possible progress here.
Thank you very much for any suggestions

This code should produce the desired plot (or at least something really close). The two critical steps are: 1) joining all the data frames into a single one using bind_rows, and 2) using facet_grid to define the panel layout according to two variables (group and id).
library(tidyverse)
# Combine the data (data1 and data2 must still be in their original wide format here)
# The id column holds the position of the data frame each row comes from
df <- bind_rows(data1, data2, .id = "id")
df %>%
  # Change to long format; the column indices shift by 1 because of the new id column
  pivot_longer(cols = 4:6,
               names_to = 'Type',
               values_to = 'value') %>%
  # Transform value to x / 1 (this leaves the values unchanged)
  mutate_at(vars(value), function(x) x / 1) %>%
  # Do plot
  ggplot(aes(x = Year,
             y = value,
             fill = Type)) +
  # Columns (stacked by default)
  geom_col() +
  # Facets by two factors: group (rows) and data source id (columns)
  facet_grid(group ~ id)
# Save the last plot to PDF
ggsave("my_plot.pdf",
       device = "pdf",
       height = 15,
       width = 20,
       units = "cm",
       dpi = 300)
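Regarding Q2, the same idea can be extended to any number of data frames and groups. The following is only a sketch under stated assumptions: the data frames are kept in a named list (data_list, filled here with made-up values from a hypothetical make_data helper), and groups_per_page and the inch-based page size are arbitrary choices. It writes one page per chunk of groups through the base pdf() device, and scales = "free_y" lets each row (group) have its own y-axis range while still sharing it across the columns of that row.

```r
library(tidyverse)

# Hypothetical helper that builds one wide data frame with made-up values
make_data <- function(seed) {
  set.seed(seed)
  data.frame(group = rep(c("A", "B", "C"), each = 4),
             Year = rep(c("2012", "2013", "2014", "2015"), times = 3),
             Fruit = sample(0:9, 12, replace = TRUE),
             Vegetables = sample(0:9, 12, replace = TRUE),
             Rice = sample(5:50, 12, replace = TRUE))
}

# Assumption: all data frames live in one named list; the names label the columns
data_list <- list(data1 = make_data(1), data2 = make_data(2))

# Stack all data frames, then reshape to long format
long <- bind_rows(data_list, .id = "source") %>%
  pivot_longer(cols = c(Fruit, Vegetables, Rice),
               names_to = "Type", values_to = "value")

# Split the groups into chunks; each chunk becomes one PDF page
groups_per_page <- 2
groups <- sort(unique(long$group))
pages <- split(groups, ceiling(seq_along(groups) / groups_per_page))

# Page size (in inches) grows with the number of data frames and rows per page
out_file <- file.path(tempdir(), "stacked_bars.pdf")
pdf(out_file,
    width = 3 * length(data_list) + 2,
    height = 3 * groups_per_page + 1)
for (page_groups in pages) {
  p <- long %>%
    filter(group %in% page_groups) %>%
    ggplot(aes(x = Year, y = value, fill = Type)) +
    geom_col() +
    facet_grid(group ~ source, scales = "free_y")
  print(p)  # each print() starts a new page in the open PDF device
}
dev.off()
```

Adding a data frame to data_list or a new group to the data changes the number of columns and pages without further code changes.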

Related

Creating a Matrix in R from a dataset

I am trying to convert data provided to us in a CSV into a matrix. We have saved the data as an object (us_quarters). It's a simple dataset containing the name of a state, then the number of quarters produced at two separate mints for that state.
State DenverMint PhillyMint
Delaware 401424 373400
one row for each state.
I am trying to create a side by side barplot of this data, and first need to convert the data into a matrix to work with it. The issue I seem to be struggling with is the fact that the state itself is a column so when I try to convert I end up with a jumbled mess of character values and integer values stored in massive lists.
x <- matrix(us_quarters,ncol=3, byrow = TRUE)
colnames(x) <- c("State", "DenverMint", "Phillymint")
x
produces this result
State DenverMint Phillymint
[1,] character,50 integer,50 integer,50
Everything I am trying to do requires the data to be formatted in this matrix in order to work with it properly and I am at a total loss as to how to proceed. Any thoughts are much appreciated.
Could you use pivot_longer to group Denver and Philly mint?
library(tidyverse)
df <- tribble(~state, ~den_mint, ~philly_mint,
              'delaware', 401424, 373400,
              'newyork', 460858, 494023)
df %>%
  mutate(state = as.factor(state)) %>%
  # One row per state and mint
  pivot_longer(cols = c("den_mint", "philly_mint"), names_to = "mint", values_to = "count") %>%
  ggplot(aes(mint, count)) +
  geom_col() +
  facet_wrap(~state) +
  coord_flip()
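If a matrix is really needed for base barplot(), one possible route is to keep only the numeric columns and move the state names into the row names, so that no character values end up in the matrix. This is a sketch, assuming us_quarters is a data frame with the columns State, DenverMint and PhillyMint; the stand-in below reuses the two rows from the example above.

```r
# Stand-in for us_quarters, using the two rows from the example above
us_quarters <- data.frame(
  State = c("Delaware", "New York"),
  DenverMint = c(401424, 460858),
  PhillyMint = c(373400, 494023)
)

# Keep only the numeric columns; the state names become row names
mint_matrix <- as.matrix(us_quarters[, c("DenverMint", "PhillyMint")])
rownames(mint_matrix) <- us_quarters$State

# Transpose so each state is one cluster with one bar per mint
barplot(t(mint_matrix), beside = TRUE, legend.text = TRUE,
        ylab = "Quarters produced")
```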

How do I label my rows and columns in order to work better with them?

I have a dataset with the emissions of Canada:
and I would like to label the first row "years" and the second "emissions".
For example, if I don't do this, how can I name my variables in aes() in the ggplot() function:
ggplot(CAN_emissions, aes(___, ___))
To add a name to the first row, we can use rownames(CAN_emissions) <- "emissions" though this won't help much as the years data points are in the column titles, not in a row of their own.
Generally speaking, you'll struggle to plot the data while it is in a 'wide' format like this. A better solution is to convert all of the year columns into rows. The problem of row names will then disappear. We can do this like so:
library(tidyr)
library(dplyr)
library(magrittr)
library(ggplot2)
CAN_emissions <- CAN_emissions %>%
  pivot_longer(-country, names_to = "year", values_to = "emissions")
The data can then be plotted directly:
ggplot(CAN_emissions, aes(x = year, y = emissions)) + geom_point()
Data:
CAN_emissions <- tibble(
country = c("Canada"),
`1800` = 0.00568,
`1801` = 0.00561,
`1802` = 0.00555
)
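One detail worth noting: after pivot_longer the year column holds the former column names, so it is character and ggplot treats it as a discrete axis. A possible refinement, sketched here with the same three data points under the assumed names CAN_wide and CAN_long, is to convert the years while reshaping via the names_transform argument of pivot_longer.

```r
library(tidyr)
library(ggplot2)

# Same three data points as above, under an assumed name
CAN_wide <- tibble::tibble(
  country = "Canada",
  `1800` = 0.00568,
  `1801` = 0.00561,
  `1802` = 0.00555
)

# Convert the former column names to integers while reshaping
CAN_long <- pivot_longer(CAN_wide, -country,
                         names_to = "year", values_to = "emissions",
                         names_transform = list(year = as.integer))

# With a numeric year, the points can also be connected by a line
ggplot(CAN_long, aes(x = year, y = emissions)) +
  geom_line() +
  geom_point()
```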

Plotting only 1 hourly datapoint (1 per day) alongside hourly points (24 per day) in R Studio

I am a bit stuck with some code. Of course I would appreciate a piece of code which sorts my dilemma, but I am also grateful for hints of how to sort that out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from the website of an Italian gas TSO:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the 2 time series with 2 different y-axes because the ranges are very different. This is not really a problem as such, because with the melt function and ggplot I can achieve that. However, since there are NAs in one column, I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so I would ideally have hourly points on one chart and only 1 data point per day on the second chart (at said 16:00). I attached an unrelated example picture of the chart style I mean. In the attached chart, however, both series have equally many data points, and hence it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
# Use na.strings because NAs appear under several spellings in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
                        sheet = "Storico_G+1", startRow = 1,
                        colNames = TRUE,
                        na.strings = c("NA","N.D.","N.D"))
# Select and rename the unwieldy column names
storico.g1 <- data.frame(storico.xl) %>%
  select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# The date column is in ymd_h format
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
# Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x = date_hour, y = immesso)) + geom_line()
# To summarise per day, group by the calendar date of date_hour
# count lets you check that there are 24 points per day
# Feed the new columns into ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
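For the original goal (hourly values in one panel, only the 16:00 value of the incomplete column in another), one option is to filter on the hour and facet with free y scales. This is a sketch on synthetic data, not on the real file: demo stands in for storico.g1 with the same column names, and the values are made up.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Synthetic stand-in for storico.g1: 3 days of hourly data,
# with the SAS column filled only at 16:00 (values are made up)
hours <- seq(ymd_h("2017-01-01 00"), by = "hour", length.out = 72)
demo <- tibble(
  date_hour = hours,
  immesso = 200 + 20 * sin(seq_along(hours) / 4),
  sads = ifelse(hour(hours) == 16, 5, NA_real_)
)

# Keep one SAS point per day: the 16:00 reading
sas_16 <- demo %>% filter(hour(date_hour) == 16, !is.na(sads))

# Stack both series into one long table
plot_data <- bind_rows(
  demo %>% transmute(date_hour, series = "immesso (hourly)", value = immesso),
  sas_16 %>% transmute(date_hour, series = "SAS (16:00 only)", value = sads)
)

# One panel per series, each with its own y scale; points mark the daily SAS values
p <- ggplot(plot_data, aes(x = date_hour, y = value)) +
  geom_line() +
  geom_point(data = function(d) filter(d, series == "SAS (16:00 only)")) +
  facet_wrap(~ series, ncol = 1, scales = "free_y")
p
```

Faceting avoids a true secondary axis; both panels share the time axis, which keeps the daily points aligned with the hourly line.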

multiple series in Highcharter R stacked barchart

After going through the highcharter package documentation, visiting JBKunst's website, and looking into list.parse2(), I still cannot solve the problem. The problem is as follows: I am looking to chart multiple series from a data.frame in a stacked bar chart; there can be anywhere from 10 to 30 series. For now the series are defined as below, but clearly there has to be an easier way, for example passing a list or a melted data.frame to hc_series, similar to what can be done with ggplot2.
Below the code with dummy data
mydata <- data.frame(A=runif(1:10),
B=runif(1:10),
C=runif(1:10))
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "MyGraph") %>%
hc_yAxis(title = list(text = "Weights")) %>%
hc_plotOptions(column = list(
dataLabels = list(enabled = FALSE),
stacking = "normal",
enableMouseTracking = FALSE)
) %>%
hc_series(list(name="A",data=mydata$A),
list(name="B",data=mydata$B),
list(name="C",data=mydata$C))
Which produces this chart:
In my opinion, a good approach for adding multiple series is hc_add_series_list (of course, you could also use a for loop), which needs a list of series (a series is, for example, list(name="A",data=mydata$A)).
As you said, you need to melt/gather the data; you can use the tidyr package:
mynewdata <- gather(mydata)
Then you'll need to group the data by the key column to create the data for each key/series. Here you can use the dplyr package:
mynewdata2 <- mynewdata %>%
# we change the key to name to have the label in the legend
group_by(name = key) %>%
# the data in this case is simple, is just .$value column
do(data = .$value)
This data frame will contain two columns, and the second column will contain the ten values for each row.
Now you need this information in a list. So we need to parse it using list.parse3 instead of list.parse2, because it preserves names like name or data.
series <- list.parse3(mynewdata2)
Finally, replace:
hc_series(list(name="A",data=mydata$A),
list(name="B",data=mydata$B),
list(name="C",data=mydata$C))
with:
hc_add_series_list(series)
Hope this is clear.

How to use ggplot to group and show top X categories?

I am trying to use ggplot to plot production data by company and use the color of the point to designate the year. The following chart shows an example based on sample data:
However, my real data often has 50-60 different companies, which makes the company names on the y-axis tightly packed and not very aesthetically pleasing.
What is the easiest way to show data for only the top 5 companies (ranked by 2011 quantities) and then show the rest aggregated as "Other"?
Below is some sample data and the code I have used to create the sample chart:
# create some sample data
c=c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")
q=c(1,2,3,4,5,6,7,8,9,10)
y=c(2010)
df1=data.frame(Company=c, Quantity=q, Year=y)
q=c(3,4,7,8,5,14,7,13,2,1)
y=c(2011)
df2=data.frame(Company=c, Quantity=q, Year=y)
df=rbind(df1, df2)
# create plot
p=ggplot(data=df,aes(Quantity,Company))+
geom_point(aes(color=factor(Year)),size=4)
p
I started down the path of a brute-force approach but thought there is probably a simple and elegant way to do this that I should learn. Any assistance would be greatly appreciated.
What about this:
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
ggplot (data = subset (df, Company %in% companies [1 : 5]),
aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
BTW: in order for the code to be called elegant, spend a few more spaces, they aren't that expensive...
See if this is what you want. It takes your df data frame and some of the ideas already suggested by @cbeleites. The steps are:
1. Select the 2011 data and order the companies from highest to lowest Quantity.
2. Split df into two parts: dftop, which contains the data for the top 5, and dfother, which contains the aggregated data for the other companies (using ddply() from the plyr package).
3. Put the two data frames together to give dfnew.
4. Set the order in which the levels of Company are plotted: top to bottom is highest to lowest, then "Other". The order is partly given by companies, plus "Other".
5. Plot as before.
library(ggplot2)
library(plyr)
# Step 1
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
# Step 2
dftop = subset(df, Company %in% companies [1:5])
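# factor() first, so this also works when Company is a character column (the default since R 4.0)
dftop$Company = droplevels(factor(dftop$Company))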
dfother = ddply(subset(df, !(Company %in% companies [1:5])), .(Year), summarise, Quantity = sum(Quantity))
dfother$Company = "Other"
# Step 3
dfnew = rbind(dftop, dfother)
# Step 4
dfnew$Company = factor(dfnew$Company, levels = c("Other", rev(as.character(companies)[1:5])))
levels(dfnew$Company) # Check that the levels are in the correct order
# Step 5
p = ggplot (data = dfnew, aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
p
The code produces:
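As an alternative to the plyr steps above, the same "top 5 plus Other" grouping can be sketched with dplyr. This is one possible approach, not the method from either answer; it rebuilds the sample data from the question, and the names top5 and dfnew2 are chosen here for illustration. slice_max() requires dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

# Sample data from the question
companies_all <- c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")
df <- rbind(
  data.frame(Company = companies_all, Quantity = 1:10, Year = 2010),
  data.frame(Company = companies_all,
             Quantity = c(3,4,7,8,5,14,7,13,2,1), Year = 2011)
)

# Top 5 companies by 2011 quantity, highest first (ties cut off at 5 rows)
top5 <- df %>%
  filter(Year == 2011) %>%
  slice_max(Quantity, n = 5, with_ties = FALSE) %>%
  pull(Company) %>%
  as.character()

# Relabel everything else as "Other" and aggregate per label and year
dfnew2 <- df %>%
  mutate(Company = if_else(Company %in% top5, as.character(Company), "Other")) %>%
  group_by(Company, Year) %>%
  summarise(Quantity = sum(Quantity), .groups = "drop") %>%
  # Levels run bottom to top on the y-axis: "Other" first, best company last
  mutate(Company = factor(Company, levels = c("Other", rev(top5))))

ggplot(dfnew2, aes(Quantity, Company)) +
  geom_point(aes(color = factor(Year)), size = 4)
```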
