I want to create a waterfall chart with several groups where all the groups start at 0.
This is my code:
gdp <- data.frame("Country"=rep(c("China", "USA"), each=2),
"Type"=rep(c("GDP2013", "GDP2014"), 2),
"Cnt"= c(16220, 3560, 34030, -10570))
gdp <- gdp %>%
mutate(start=Cnt,
start=lag(start),
end=ifelse(Type=="GDP2013", Cnt, start+Cnt),
start=ifelse(Type=="GDP2013", 0, start),
amount=end-start,
id=rep(1:2, each=2))
gdp %>%
ggplot(aes(fill=Type)) +
geom_rect(stat="identity", aes(x=Country,
xmin=id-0.25,
xmax=id+0.25,
ymin=start,
ymax=end))
The two bar types should be ordered next to each other per group and USA GDP2014 should start at the height of USA GDP2013 but end 10570 lower.
I know that I could do this with facet_wrap, but I want no separation between groups (e.g. facets).
geom_rect takes a position parameter.
I believe position='dodge' does what you require if I understand your question correctly.
More info: https://ggplot2.tidyverse.org/reference/position_dodge.html
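If dodging does not give the layout you are after, another option is to compute the rectangle positions by hand. The following is only a sketch under assumptions: bar_width, the rects data frame and the left/right offsets are illustrative choices, not part of the original code, and the running total is computed per country with cumsum() instead of lag().

```r
library(dplyr)
library(ggplot2)

gdp <- data.frame(
  Country = rep(c("China", "USA"), each = 2),
  Type    = rep(c("GDP2013", "GDP2014"), 2),
  Cnt     = c(16220, 3560, 34030, -10570)
)

bar_width <- 0.4  # assumed bar width; adjust to taste

rects <- gdp %>%
  group_by(Country) %>%
  mutate(
    end   = cumsum(Cnt),  # running total within each country
    start = end - Cnt     # first bar starts at 0, later bars at the previous end
  ) %>%
  ungroup() %>%
  mutate(
    id   = as.integer(factor(Country)),                    # numeric x position per country
    xmin = id + ifelse(Type == "GDP2013", -bar_width, 0),  # GDP2013 sits left of centre
    xmax = xmin + bar_width
  )

# Continuous x axis relabelled with the country names
ggplot(rects, aes(fill = Type)) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = start, ymax = end)) +
  scale_x_continuous(breaks = unique(rects$id), labels = unique(rects$Country))
```

With this data, the USA GDP2014 rectangle runs from 34030 down to 23460, i.e. it starts at the height of USA GDP2013 and ends 10570 lower.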
I currently have two data frames. I wish to get multiple bar plots from both of them in one plot using ggplot. I want to get the average of the 'NEE' variable over different year ranges (1850-1950, 1951-2012, 2013-2100) from both data frames and plot them side by side, just like in this green bar plot visualization (https://ars.els-cdn.com/content/image/1-s2.0-S0048969716303424-fx1_lrg.jpg).
The header of the two data frames is as follows (this is only a portion). The header is the same for both data frames for the years 1850-1859:
How can I plot bars for, let's say, the years 1850-1852, 1854-1856 and 1857-1859 from both data frames in one plot? I know the bar plots will be the same in this case because both data frames are similar, but I would like to get the idea so that I can edit the code for my desired years.
(Note that I have 39125 obs with 9 variables)
This is what I have done so far (following a solution posted by a member of this website). I produced the geom_col plots for data1 and data2 successfully, each on its own. But how can I merge them and plot the geom_col bars for 1850-1852, 1854-1856 and 1857-1859 side by side from both data frames?
data1 %>%
# case_when lets us define yr_group based on Year:
mutate(yr_group = case_when(Year <= 1950 ~ "1850-1950",
Year <= 2012 ~ "1951-2012",
Year <= 2100 ~ "2013-2100",
TRUE ~ "Other range")) %>%
# For each location and year group, get the mean of all the columns:
group_by(Lon, Lat, yr_group) %>%
summarise_all(mean) %>%
# Plot the mean Total for each yr_group
ggplot(aes(yr_group, NEE)) +
geom_col(position = "dodge") +
theme_classic() + xlab("Year") + ylab(ln) +
labs(subtitle = "CCSM4 RCP2.6") +
geom_hline(yintercept = 0, color = "black", size = 1)
My preferred approach is usually to do the data summarization first and then send the output to ggplot. In this case, you might use dplyr from the tidyverse meta-package to add a variable relating to which time epoch a given year belongs to, and then collect the stats for that whole epoch.
For instance, just using your example data, we might group those years arbitrarily and find the averages for 1850-51, 1852-53, and 1854-55, and then display those next to each other:
library(tidyverse)
df %>%
# case_when lets us define yr_group based on Year:
mutate(yr_group = case_when(Year <= 1851 ~ "1850-51",
Year <= 1853 ~ "1852-53",
Year <= 1855 ~ "1854-55",
TRUE ~ "Other range")) %>%
# For each location and year group, get the mean of all the columns:
group_by(Lon, Lat, yr_group) %>%
summarise_all(mean) %>%
# Plot the mean Total for each yr_group
ggplot(aes(yr_group, Total)) + geom_col()
If you have multiple locations, you might use ggplot facets to display those separately, or use dodge within geom_col (equivalent to geom_bar(stat = "identity"), btw) to show the different locations next to each other.
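To put two data frames side by side in one plot, one possible approach is to stack them with an identifier column and map that column to fill. This is a sketch under assumptions: data1 and data2 below are made-up stand-ins with only Year and NEE columns, the source column name is arbitrary, and the Lon/Lat grouping from the question is left out for brevity.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-ins for the two data frames (same columns, made-up values)
data1 <- data.frame(Year = 1850:1859, NEE = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data2 <- data.frame(Year = 1850:1859, NEE = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20))

combined <- bind_rows(data1 = data1, data2 = data2, .id = "source") %>%
  # Years outside the three ranges (here 1853) get NA and are dropped below
  mutate(yr_group = case_when(Year <= 1852 ~ "1850-1852",
                              Year >= 1854 & Year <= 1856 ~ "1854-1856",
                              Year >= 1857 & Year <= 1859 ~ "1857-1859")) %>%
  filter(!is.na(yr_group)) %>%
  group_by(source, yr_group) %>%
  summarise(NEE = mean(NEE), .groups = "drop")

# One bar per data frame within each year group
ggplot(combined, aes(yr_group, NEE, fill = source)) +
  geom_col(position = "dodge")
```

Changing the year ranges only requires editing the case_when() conditions.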
I have a problem with my density histogram in ggplot2. I am working in RStudio, and I am trying to create a density histogram of income, dependent on a person's occupation. My problem is that when I use my code:
data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education",
"education_num","marital", "occupation", "relationship", "race","sex",
"capital_gain", "capital_loss", "hr_per_week","country", "income"),
fill=FALSE,strip.white=T)
ggplot(data=data, aes(x=income)) +
geom_histogram(stat='count',
aes(x= income, y=stat(count)/sum(stat(count)),
col=occupation, fill=occupation),
position='dodge')
I get a histogram in which each count is divided by the overall count of all values across all categories. What I would like instead is, for example, the number of people earning >50K whose occupation is 'craft-repair' divided by the overall number of people whose occupation is craft-repair, the same for <=50K within that occupation, and likewise for every other occupation.
My second question is: once I have a proper density histogram, how can I sort the bars in decreasing order?
This is a situation where it makes sense to re-aggregate your data first, before plotting. Aggregating within the ggplot call works fine for simple aggregations, but when you need to aggregate and then peel off a group for a second calculation, it doesn't work so well. Also note that because your x axis is discrete, a histogram is not appropriate here; we'll use geom_bar() instead.
First we aggregate by count, then calculate percent of total using occupation as the group.
d2 <- data %>% group_by(income, occupation) %>%
summarize(count= n()) %>%
group_by(occupation) %>%
mutate(percent = count/sum(count))
Then simply plot a bar chart using geom_bar and position = 'dodge' so the bars are side by side, rather than stacked.
d2 %>% ggplot(aes(income, percent, fill = occupation)) +
geom_bar(stat = 'identity', position='dodge')
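For the second question (sorting the bars), one option is to turn occupation into a factor whose levels follow the order you want. This is a sketch under assumptions: the d2 below is a toy stand-in with made-up percentages, and ordering by the >50K share is just one possible choice of sort key.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for d2 (hypothetical numbers, not the adult dataset)
d2 <- data.frame(
  income     = rep(c("<=50K", ">50K"), times = 3),
  occupation = rep(c("Craft-repair", "Sales", "Exec-managerial"), each = 2),
  percent    = c(0.77, 0.23, 0.73, 0.27, 0.52, 0.48),
  stringsAsFactors = FALSE
)

# Occupations ordered by their >50K share, largest first
occ_order <- d2 %>%
  filter(income == ">50K") %>%
  arrange(desc(percent)) %>%
  pull(occupation)

# The factor level order controls the order of the dodged bars and the legend
d2_sorted <- d2 %>%
  mutate(occupation = factor(occupation, levels = occ_order))

ggplot(d2_sorted, aes(income, percent, fill = occupation)) +
  geom_col(position = "dodge")
```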
SO!
I am trying to create a plot of monthly deviations from annual means for temperature data using a bar chart. I have data across many years and I want to show the seasonal behavior in temperatures between months. The bars should represent the deviation from the annual average, which is recalculated for each year. Here is an example that is similar to what I want, only it is for a single year:
My data is sensitive so I cannot share it yet, but I made a reproducible example using the txhousing dataset (it comes with ggplot2). The salesdiff column is the deviation between monthly sales (averaged across all cities) and the annual average for each year. Now the problem is plotting it.
library(ggplot2)
df <- aggregate(sales~month+year,txhousing,mean)
df2 <- aggregate(sales~year,txhousing,mean)
df2$sales2 <- df2$sales #RENAME sales
df2 <- df2[,-2] #REMOVE sales
df3<-merge(df,df2) #MERGE dataframes
df3$salesdiff <- df3$sales - df3$sales2 #FIND deviation between monthly and annual means
#plot deviations
ggplot(df3,aes(x=month,y=salesdiff)) +
geom_col()
My ggplot is not looking good at the moment.
Somehow it is stacking the columns for each month with all of the data across the years. Ideally the date would be along the x-axis spanning many years (I think the dataset is from 2000-2015...), and different colors depending on if salesdiff is higher or lower. You are all awesome, and I would welcome ANY advice!!!!
Probably the main issue here is that geom_col() will not take on different aesthetic properties unless you explicitly tell it to. One way to get what you want is to use two calls to geom_col() to create two different bar charts that will be combined together in two different layers. Also, you're going to need to create date information which can be easily passed to ggplot(); I use the lubridate package for this task.
Note that we combine the "month" and "year" columns here, and then use ymd() to obtain date values. I chose not to convert the double-valued "date" column in txhousing using something like date_decimal(), because sometimes it can confuse February and January (e.g. Feb 1 gets "rounded down" to Jan 31).
I decided to plot a subset of the txhousing dataset, which is a lot more convenient to display for teaching purposes.
Code:
library("tidyverse")
library("lubridate") # provides ymd(); ggplot2 is already attached by tidyverse
# subset txhousing to just years >= 2011, and calculate nested means and dates
housing_df <- filter(txhousing, year >= 2011) %>%
group_by(year, month) %>%
summarise(monthly_mean = mean(sales, na.rm = TRUE),
date = first(date)) %>%
mutate(yearmon = paste(year, month, sep = "-"),
date = ymd(yearmon, truncated = 1), # create date column
salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
higherlower = case_when(salesdiff >= 0 ~ "higher", # for fill aes later
salesdiff < 0 ~ "lower"))
ggplot(data = housing_df, aes(x = date, y = salesdiff, fill = as.factor(higherlower))) +
geom_col() +
scale_x_date(date_breaks = "6 months",
date_labels = "%b-%Y") +
scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
theme_bw()+
theme(legend.position = "none") # remove legend
Plot:
You can see the periodic behaviour here nicely; an increase in sales appears to occur every spring, with sales decreasing during the fall and winter months. Do keep in mind that you might want to reverse the colours I assigned if you want to use this code for temperature data! This was a fun one - good luck, and happy plotting!
Something like this should work?
Basically you need to create a binary variable that lets you change the color (fill) if salesdiff is positive or negative, called below factordiff.
Plus you needed a date variable for month and year combined.
library(ggplot2)
library(dplyr)
df3$factordiff <- ifelse(df3$salesdiff>0, 1, 0) # factor variable for colors
df3 <- df3 %>%
mutate(date = sprintf("%d-%02d", year, month)) # zero-padded "YYYY-MM" strings, e.g. "2001-01", so they sort chronologically
#plot deviations
ggplot(df3,aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col()
Of course, this results in a hard-to-read plot because you have lots of dates. You can subset the data and show only a restricted time span:
df3 %>%
filter(year >= 2014) %>% # keep data from 2014 onwards
ggplot(aes(x=date,y=salesdiff, fill = as.factor(factordiff))) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adds label rotation
I'm graphing a line plot in ggplot of numbers of migrants to a city over x years, based on country of origin. Each country is graphed as its own line, plotted on a graph against other countries, over a period of five years.
I want to order the legend by country from largest to smallest total sum of migrants over the x years, regardless of the total number of countries, instead of alphabetically as it is now.
I've tried using forcats commands such as fct_relevel, but haven't been able to find anything other than doing it manually, which can be time consuming for multiple graphs.
My data frame has variables year, country, and number_migrants, and each observation is a year-country pair.
library(tidyverse)
g <- ggplot(migrants, aes(x=year, y=number_migrants, col=country)) +
geom_line()
Current example:
You need fct_reorder
library(dplyr)
library(forcats)
library(ggplot2)
migrants %>%
mutate(
# order countries by total migrants over all years, largest first
country = fct_reorder(country, number_migrants, .fun = sum, .desc = TRUE)
) %>%
ggplot(aes(x=year, y=number_migrants, col=country)) +
geom_line()
I want to aggregate data by year interval inside a bar plot. Based on this answer, I wrote the following code:
years <- seq(as.Date('1970/01/01'), Sys.Date(), by="year")
set.seed(111)
effect <- sample(1:100,length(years),replace=T)
data <- data.frame(year=years, effect=effect)
ggplot(data, aes(year, effect)) + geom_bar(stat="identity", aes(group=cut(year, "5 years")))
However, only the tick marks are affected, but the data is not summed by interval. Can I get ggplot2 to sum the data without preprocessing the data, while keeping the tick marks and labels as they are?
EDIT: Sorry I wasn't clear. I'd like to keep the tick marks and labels as they are, i.e. tick marks positioned at the left hand edge of each bar (which now covers 5 years) and year only in the labels. This is based on the appearance of the linked answer above.
Slightly hacky way of doing what you want:
ggplot(data, aes(cut(year, "5 years"), effect)) +
geom_col() +
xlab("year")
What it actually does: it plots multiple columns (bars), each with height equal to effect, stacked on top of each other according to their 5-year interval identifier. In other words, the plot actually contains 48 bars of one colour, positioned on top of each other.
Try this:
library(tidyverse)
data %>%
mutate(index = ceiling(seq_along(year) / 5)) %>%
group_by(index) %>%
mutate(sum_effect = sum(effect)) %>%
distinct(sum_effect, .keep_all = TRUE) %>%
ggplot(aes(year, sum_effect)) +
geom_col()
Which returns:
I prefer transforming the dataset so that I don't have to do anything fancy with ggplot2.
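A variant of the same idea reuses cut() from the question to build the 5-year bins and then sums within each bin before plotting. This is a sketch under assumptions: the 20-year toy data, the bin column and the binned name are illustrative and not from the original posts.

```r
library(dplyr)
library(ggplot2)

# Toy data: 20 yearly observations with effect = 1..20 (made-up values)
years <- seq(as.Date("1970-01-01"), as.Date("1989-01-01"), by = "year")
data <- data.frame(year = years, effect = 1:20)

binned <- data %>%
  # cut() labels each bin by its start date; convert the label back to a Date
  mutate(bin = as.Date(cut(year, "5 years"))) %>%
  group_by(bin) %>%
  summarise(sum_effect = sum(effect))

# One bar per 5-year bin, labelled with the year only
ggplot(binned, aes(bin, sum_effect)) +
  geom_col() +
  scale_x_date(date_labels = "%Y") +
  xlab("year")
```

Note that geom_col() centres each bar on the bin's start date, so the tick marks sit in the middle of the bars and not at their left-hand edge.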