I want to aggregate data by year interval inside a bar plot. Based on this answer, I wrote the following code:
library(ggplot2)
years <- seq(as.Date('1970/01/01'), Sys.Date(), by="year")
set.seed(111)
effect <- sample(1:100, length(years), replace=TRUE)
data <- data.frame(year=years, effect=effect)
ggplot(data, aes(year, effect)) + geom_bar(stat="identity", aes(group=cut(year, "5 years")))
However, this only affects the tick marks; the data is not summed by interval. Can I get ggplot2 to sum the data without preprocessing it, while keeping the tick marks and labels as they are?
EDIT: Sorry I wasn't clear. I'd like to keep the tick marks and labels as they are, i.e. tick marks positioned at the left hand edge of each bar (which now covers 5 years) and year only in the labels. This is based on the appearance of the linked answer above.
Slightly hacky way of doing what you want:
ggplot(data, aes(cut(year, "5 years"), effect)) +
geom_col() +
xlab("year")
What it actually does: it plots one column (bar) per year, with height equal to effect, and stacks the columns on top of each other according to their 5-year interval identifier. In other words, the plot actually contains 48 bars of the same colour, positioned on top of each other.
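The stacking described above can be checked with layer_data(), which returns the data ggplot2 actually draws. This is only a sketch: the fixed end date is an assumption made so the result is reproducible (the question uses Sys.Date()), and the names bars, stack_tops and interval_sums are made up for illustration.

```r
library(ggplot2)

# Fixed end date (an assumption) instead of Sys.Date(), for reproducibility
years <- seq(as.Date("1970-01-01"), as.Date("2017-01-01"), by = "year")
set.seed(111)
data <- data.frame(year = years,
                   effect = sample(1:100, length(years), replace = TRUE))

p <- ggplot(data, aes(cut(year, "5 years"), effect)) +
  geom_col() +
  xlab("year")

# One row per drawn bar, i.e. one row per year, not one per interval
bars <- layer_data(p)

# Top of each stack compared with the sum of effect per 5-year interval
stack_tops <- tapply(bars$ymax, as.integer(bars$x), max)
interval_sums <- tapply(data$effect, cut(data$year, "5 years"), sum)
all.equal(as.numeric(stack_tops), as.numeric(interval_sums))
```

So the bar heights look like sums only because the individual bars are stacked; no aggregation takes place.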
Try this:
library(tidyverse)
data %>%
mutate(index = ceiling(seq_along(year) / 5)) %>%
group_by(index) %>%
mutate(sum_effect = sum(effect)) %>%
distinct(sum_effect, .keep_all = TRUE) %>%
ggplot(aes(year, sum_effect)) +
geom_col()
This returns one bar per group of five years, with its height equal to the summed effect.
I prefer transforming the dataset so that I don't have to do anything fancy with ggplot2.
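A variant of the same idea, sketched with summarise() and cut() instead of mutate() plus distinct(). The fixed end date and the names agg, interval and start are assumptions for illustration. Note that the bars end up centred on the start date of each interval rather than left-aligned.

```r
library(dplyr)
library(ggplot2)

# Fixed end date (an assumption) instead of Sys.Date(), for reproducibility
years <- seq(as.Date("1970-01-01"), as.Date("2017-01-01"), by = "year")
set.seed(111)
data <- data.frame(year = years,
                   effect = sample(1:100, length(years), replace = TRUE))

agg <- data %>%
  group_by(interval = cut(year, "5 years")) %>%     # 5-year interval labels
  summarise(sum_effect = sum(effect)) %>%           # one row per interval
  mutate(start = as.Date(as.character(interval)))   # first day of each interval

p <- ggplot(agg, aes(start, sum_effect)) +
  geom_col() +
  xlab("year")
```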
My problem seems quite basic, but I couldn't find any relevant answer. I want to create line plots with the date on the x axis. The y axis will be Covid statistics (deaths, hospitalizations, you name it). I want to create a separate plot for the different waves of the pandemic which means that my charts cover different times. My problem is that R fixes the plot to the same size and thus the lines for the shorter time period are skewed in comparison to those of the longer time period. Ideally, I would want 1 month on the x axis to be fixed to a certain number of px or mm. But I can't find out how. My best idea so far is to assign both plots a different total width, but that doesn't give me an optimal result either.
Here's a reproducible example with a built-in dataset to explain:
library(ggplot2)
library(dplyr)
economics_1967 <- economics %>%
filter(date<"1968-01-01")
economics_1968 <- economics %>%
filter(date<"1969-01-01"&date>"1967-12-31")
#data is only available for six months in 1967, but for 12 in 1968
exampleplot1 <- ggplot(economics_1967)+
geom_line(aes(date, unemploy))+
scale_x_date(date_breaks="1 month", date_labels="%b")
#possible: ggsave("exampleplot1.png", width=2, height=1)
exampleplot2 <- ggplot(economics_1968)+
geom_line(aes(date, unemploy))+
scale_x_date(date_breaks="1 month", date_labels="%b")
ggsave("exampleplot2.png", exampleplot2, width=4, height=1)
Thank you!
EDIT: Thanks for the suggestions! Facet wrap would be a good idea but in the end I decided to just plot the whole time in one case. The background is that I classified countries differently for their policies in different times, so that's why I wanted to have a clear break in the visualization, but I just put a vertical line in there.
facet_grid is one approach, if you don't mind showing the two charts together.
library(dplyr); library(ggplot2)
bind_rows(e1967 = economics_1967,
e1968 = economics_1968, .id="source") %>%
ggplot(aes(date, unemploy)) +
geom_line() +
scale_x_date(date_breaks="1 month", date_labels="%b") +
facet_grid(~source, scales = "free_x", space = "free_x")
I like @Jon Spring's solution a lot. I want to present it a tad differently, to show that faceting usually operates on a single dataset that already contains the variable used to facet.
econ_subset <-
economics %>%
dplyr::filter(dplyr::between(date, as.Date("1967-09-01"), as.Date("1968-12-31"))) %>%
dplyr::mutate(
year = lubridate::year(date) # Used below to facet
)
ggplot(econ_subset, aes(date, unemploy)) +
geom_line() +
scale_x_date(date_breaks="1 month", date_labels="%b") +
facet_grid(~year, scales = "free_x", space = "free_x")
(In Jon's solution, bind_rows() is used to stack the two separate datasets back together.)
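If the two charts do have to be saved as separate files, the asker's idea of giving each plot a different total width can be made systematic by deriving the width from the number of months covered. This is a rough sketch, not an exact solution: plot_width() is a hypothetical helper, and the 0.4 in per month and 0.8 in allowance for axis text and margins are made-up values that would need tuning, so the month spacing will only be approximately equal.

```r
library(ggplot2)  # provides the built-in economics dataset

economics_1967 <- subset(economics, date < as.Date("1968-01-01"))
economics_1968 <- subset(economics, date >= as.Date("1968-01-01") &
                                    date < as.Date("1969-01-01"))

# Hypothetical helper: total width in inches =
#   months covered * inches per month + fixed allowance for axis text and margins
plot_width <- function(dates, in_per_month = 0.4, margin_in = 0.8) {
  n_months <- length(unique(format(dates, "%Y-%m")))
  n_months * in_per_month + margin_in
}

w1967 <- plot_width(economics_1967$date)  # data covers 6 months
w1968 <- plot_width(economics_1968$date)  # data covers 12 months

# The widths can then be passed to ggsave(), e.g.
# ggsave("exampleplot1.png", exampleplot1, width = w1967, height = 2)
# ggsave("exampleplot2.png", exampleplot2, width = w1968, height = 2)
```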
I am trying to develop an animated plot showing how the rates of three-point attempts and assists have changed for NBA teams over time. The points in my plot transition correctly, but the vertical and horizontal mean lines I added stay constant at the overall averages rather than shifting year by year.
p<-ggplot(dataBREFPerPossTeams, aes(astPerPossTeam,fg3aPerPossTeam,col=ptsPerPossTeam))+
geom_point()+
scale_color_gradient(low='yellow',high='red')+
theme_classic()+
xlab("Assists Per 100 Possessions")+
ylab("Threes Attempted Per 100 Possessions")+labs(color="Points Per 100 Possessions")+
geom_hline(aes(yintercept = mean(fg3aPerPossTeam)), color='blue',linetype='dashed')+
geom_vline(aes(xintercept = mean(astPerPossTeam)), color='blue',linetype='dashed')
anim<-p+transition_time(as.integer(yearSeason))+labs(title='Year: {frame_time}')
animate(anim, nframes=300)
Ideally, the two dashed lines would shift as the years progress, however, right now they are staying constant. Any ideas on how to fix this?
I am using datasets::airquality since you have not shared your data. The idea here is that you need to have the values for your other geom (here it is mean) as a variable in your dataset, so gganimate can draw the connection between the values and frame (i.e. transition_time).
So what I did was group by the frame variable (here it is Month; it will be yearSeason for you) and then add columns holding the averages of the variables I wanted. Then, in the geoms, I used those added columns instead of computing the mean inside the geom. See below:
library(datasets) #datasets::airquality
library(ggplot2)
library(gganimate)
library(dplyr)
g <- airquality %>%
group_by(Month) %>%
mutate(mean_wind=mean(Wind),
mean_temp=mean(Temp)) %>%
ggplot()+
geom_point(aes(Wind,Temp, col= Solar.R))+
geom_hline(aes(yintercept = mean_temp), color='blue',linetype='dashed')+
geom_vline(aes(xintercept = mean_wind), color='green',linetype='dashed')+
scale_color_gradient(low='yellow',high='red')+
theme_classic()+
xlab("Wind")+
ylab("Temp")+labs(color="Solar.R")
animated_g <- g + transition_time(as.integer(Month))+labs(title='Month: {frame_time}')
animate(animated_g, nframes=18)
Created on 2019-06-09 by the reprex package (v0.3.0)
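An alternative layout of the same idea, sketched here as an assumption rather than taken from the answer: keep the per-frame means in a separate data frame with one row per Month and hand it to the line geoms via their data argument. The name frame_means is made up, and the animation step is left as a comment because it has not been run here.

```r
library(dplyr)
library(ggplot2)

# One row per frame (Month) holding the means for the reference lines
frame_means <- airquality %>%
  group_by(Month) %>%
  summarise(mean_wind = mean(Wind), mean_temp = mean(Temp))

g <- ggplot(airquality, aes(Wind, Temp)) +
  geom_point(aes(col = Solar.R)) +
  geom_hline(data = frame_means, aes(yintercept = mean_temp),
             color = "blue", linetype = "dashed") +
  geom_vline(data = frame_means, aes(xintercept = mean_wind),
             color = "green", linetype = "dashed") +
  scale_color_gradient(low = "yellow", high = "red") +
  theme_classic()

# With gganimate installed, the same transition as in the answer should apply,
# since Month is present in both data frames:
# g + gganimate::transition_time(as.integer(Month)) + labs(title = "Month: {frame_time}")
```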
How do I plot a bar plot so that every variable (treatment group) on the x-axis displays two bars, representing avgRDM and avgSDM? I would like the bars to be colored by avgRDM and avgSDM.
The data for the plot is in the following image:
Thank you
I'm a big fan of ggplot, so here is an option in that vein. It's easiest (and tidiest) to reshape the data from wide to long and then map the fill aesthetic to the key.
library(tidyverse)
df %>%
gather(key, val, -trt) %>%
ggplot(aes(trt, val, fill = key)) +
geom_col(position = "dodge2")
PS. For future posts, please share data in a reproducible way using e.g. dput; screenshots are never a good idea, as they require respondents to manually type out your sample data.
Sample data
df <- read.table(text =
"trt avgRDM avgSDM
F10 49.5 108.333
NH4Cl 12.583 50.25
NH4NO3 17.333 73.33
'F10 + ANU843' 6.0 7.333", header = T)
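In current versions of tidyr, gather() is superseded by pivot_longer(). The same reshape can be sketched as follows, assuming tidyr 1.0.0 or later; the name long is made up for illustration.

```r
library(tidyr)
library(ggplot2)

df <- read.table(text =
"trt avgRDM avgSDM
F10 49.5 108.333
NH4Cl 12.583 50.25
NH4NO3 17.333 73.33
'F10 + ANU843' 6.0 7.333", header = TRUE)

# Wide to long: one row per treatment and measurement
long <- pivot_longer(df, cols = -trt, names_to = "key", values_to = "val")

p <- ggplot(long, aes(trt, val, fill = key)) +
  geom_col(position = "dodge2")
```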
I have a problem with my density histogram in ggplot2. I am working in RStudio, and I am trying to create a density histogram of income, depending on a person's occupation. My problem is that when I use my code:
data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education",
"education_num","marital", "occupation", "relationship", "race","sex",
"capital_gain", "capital_loss", "hr_per_week","country", "income"),
fill=FALSE,strip.white=T)
ggplot(data=data, aes(x=income)) +
geom_histogram(stat='count',
aes(x= income, y=stat(count)/sum(stat(count)),
col=occupation, fill=occupation),
position='dodge')
In response I get a histogram in which each value is divided by the overall count of all values across all categories. What I would like instead is, for example, the number of people earning >50K whose occupation is Craft-repair divided by the overall number of people whose occupation is Craft-repair, the same for <=50K within that occupation, and so on for every other occupation.
And the second question is: once I have a proper density histogram, how can I sort the bars in decreasing order?
This is a situation where it makes sense to re-aggregate your data first, before plotting. Aggregating within the ggplot call works fine for simple aggregations, but when you need to aggregate and then peel off a group for a second calculation, it doesn't work so well. Also, note that because your x axis is discrete, we don't use a histogram here; instead we'll use geom_bar().
First we aggregate by count, then calculate percent of total using occupation as the group.
library(dplyr); library(ggplot2)
d2 <- data %>% group_by(income, occupation) %>%
summarize(count= n()) %>%
group_by(occupation) %>%
mutate(percent = count/sum(count))
Then simply plot a bar chart using geom_bar and position = 'dodge' so the bars are side by side, rather than stacked.
d2 %>% ggplot(aes(income, percent, fill = occupation)) +
geom_bar(stat = 'identity', position='dodge')
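For the second question (sorting the bars in decreasing order), one common option is to reorder the factor levels by the value being plotted. A minimal sketch on made-up numbers: d_toy and its values are purely illustrative and are not taken from the adult dataset.

```r
library(ggplot2)

# Made-up shares for illustration only (not taken from the adult dataset)
d_toy <- data.frame(
  occupation = c("A", "B", "C"),
  percent    = c(0.10, 0.45, 0.25)
)

# reorder() sorts the factor levels by percent; the minus sign gives decreasing order
d_toy$occupation <- reorder(d_toy$occupation, -d_toy$percent)

p <- ggplot(d_toy, aes(occupation, percent)) +
  geom_col()
```

With d2 from above, the same call would need a single value per occupation to sort by, for example the share earning >50K.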
I am trying to create a histogram/bar plot in R that shows, for each x value in my dataset, the count of rows with that value or higher. I am having trouble doing this, and I don't know whether I should use geom_histogram or geom_bar (I want to use ggplot2). To describe my problem further:
On the X axis I have "Percent_Origins," which is a column in my data frame. On my Y axis - for each of the Percent_Origin values I have occurring, I want the height of the bar to represent the count of rows with that percent value and higher. Right now, if I am to use a histogram, I have:
plot <- ggplot(dataframe, aes(x=Percent_Origins)) +
geom_histogram(aes(fill=Percent_Origins), binwidth= .05, colour="white")
What should I change the fill or general code to be to do what I want? That is, plot an accumulation of counts of each value and higher? Thanks!
I think that your best bet is to compute the cumulative counts first and then pass them to ggplot. There are several ways to do this, but a simple one (using dplyr) is to sort the data in descending order and assign a running count to each row. Then trim the data so that only the largest count for each value is kept, and plot it.
To demonstrate, I am using the built-in iris data.
iris %>%
arrange(desc(Sepal.Length)) %>%
mutate(counts = 1:n()) %>%
group_by(Sepal.Length) %>%
slice(n()) %>%
ggplot(aes(x = Sepal.Length, y = counts)) +
geom_step(direction = "vh")
This gives a step plot of the number of rows at or above each Sepal.Length value.
If you really want bars instead of a line, use geom_col instead. However, note that you either need to fill in gaps (to ensure the bars are evenly spaced across the range) or deal with breaks in the plot.
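As a cross-check, the same "this value or higher" counts can be sketched in base R and drawn as bars. The name at_or_above is made up for illustration, and the gaps between distinct values are not filled in here.

```r
library(ggplot2)

x <- iris$Sepal.Length
vals <- sort(unique(x))

# For each distinct value, count the rows with that value or higher
at_or_above <- data.frame(
  Sepal.Length = vals,
  counts = vapply(vals, function(v) sum(x >= v), integer(1))
)

p <- ggplot(at_or_above, aes(Sepal.Length, counts)) +
  geom_col()
```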