I have a problem with my density histogram in ggplot2. I am working in RStudio and trying to create a density histogram of income, depending on a person's occupation. My problem is that when I use my code:
data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education",
"education_num","marital", "occupation", "relationship", "race","sex",
"capital_gain", "capital_loss", "hr_per_week","country", "income"),
fill=FALSE,strip.white=T)
ggplot(data=data, aes(x=income)) +
geom_histogram(stat='count',
aes(x= income, y=stat(count)/sum(stat(count)),
col=occupation, fill=occupation),
position='dodge')
In response I get a histogram where each value is divided by the overall count of all values across all categories. What I would like is, for example, the number of people earning >50K whose occupation is 'craft-repair' divided by the overall number of people whose occupation is craft-repair, the same for <=50K within that occupation, and likewise for every other occupation.
And the second question: after producing a proper density histogram, how can I sort the bars in decreasing order?
This is a situation where it makes sense to re-aggregate your data before plotting. Aggregating within the ggplot call works fine for simple aggregations, but it doesn't work so well when you need to aggregate and then peel off a group for a second calculation. Also note that because your x axis is discrete, we don't use a histogram here; instead we'll use geom_bar().
First we aggregate by count, then calculate percent of total using occupation as the group.
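library(dplyr)
# Count rows per income/occupation pair, then take each count's share within its occupation
d2 <- data %>% group_by(income, occupation) %>%
summarize(count= n()) %>%
group_by(occupation) %>%
mutate(percent = count/sum(count))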
Then simply plot a bar chart using geom_bar and position = 'dodge' so the bars are side by side, rather than stacked.
d2 %>% ggplot(aes(income, percent, fill = occupation)) +
geom_bar(stat = 'identity', position='dodge')
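The second question (sorting the bars in decreasing order) can be approached by turning occupation into a factor whose levels follow the share of ">50K" earners. The following is only a minimal sketch on a small made-up table: the `toy` data frame and the `occ_order` helper are hypothetical names, not part of the original data. It also assumes every occupation has at least one ">50K" row; occupations without one would need to be appended to the level order separately.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the adult data (hypothetical values, for illustration only)
toy <- data.frame(
  occupation = c("Craft-repair", "Craft-repair", "Craft-repair",
                 "Sales", "Sales",
                 "Tech-support", "Tech-support", "Tech-support", "Tech-support"),
  income = c("<=50K", "<=50K", ">50K",
             "<=50K", ">50K",
             ">50K", ">50K", ">50K", "<=50K")
)

# Same aggregation as above: share of each income level within an occupation
d2 <- toy %>%
  count(occupation, income, name = "count") %>%
  group_by(occupation) %>%
  mutate(percent = count / sum(count)) %>%
  ungroup()

# Occupations ordered by their ">50K" share, highest first
occ_order <- d2 %>%
  filter(income == ">50K") %>%
  arrange(desc(percent)) %>%
  pull(occupation)

# Dodged bars are drawn in factor-level order, so reorder the levels
d2 <- d2 %>% mutate(occupation = factor(occupation, levels = occ_order))

p <- ggplot(d2, aes(income, percent, fill = occupation)) +
  geom_col(position = "dodge")
```

Printing `p` draws the chart; within the ">50K" group the bars then run from the highest share to the lowest.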
Related
I am currently working on an R Markdown document for our school, which should make documenting student performance easier for the teachers.
I would like to include a bar chart using ggplot2, which
orders students from best to worst based on their GPA, and
colors the 3 highest bars gold, silver and bronze respectively, and all the other bars blue.
Note that the code needs to work with an arbitrary number of students. What I tried is:
subjects_long %>%
group_by(Name) %>%
summarize(GPA = mean(grade)) %>%
ggplot(aes(x = reorder(Name, GPA), y = GPA, fill = Name)) +
geom_col() +
coord_flip() +
scale_y_continuous(breaks = seq(0, 5, by = 0.1)) +
scale_fill_manual(values = c("#d6af36","#d7d7d7","#a77044",
rep("blue", length(subjects$Name)-3)))
This ensures that the code runs and that there is an appropriate number of columns every time, regardless of which dataset (class data) I run it on. However, the bars colored gold/silver/bronze are not the ones with the highest values but the ones whose names come first alphabetically, regardless of how high their GPA is. Apparently, this is because scale_fill_manual assigns colors in the order of the factor levels, not by y-axis values.
Any help would be greatly appreciated!
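One possible way around the factor-level ordering is to compute each student's rank first and map fill to a medal category rather than to Name. This is a rough sketch, not a definitive solution: the grades below are invented, the `medal` and `rank` columns are hypothetical helpers, and it assumes a higher GPA is better (flip `desc()` if lower grades are better in your system).

```r
library(dplyr)
library(ggplot2)

# Hypothetical grades, two per student
subjects_long <- data.frame(
  Name  = rep(c("Anna", "Ben", "Cara", "Dan", "Eve"), each = 2),
  grade = c(3.0, 3.4, 4.8, 4.6, 2.0, 2.4, 4.0, 4.2, 3.6, 3.8)
)

gpa <- subjects_long %>%
  group_by(Name) %>%
  summarize(GPA = mean(grade)) %>%
  mutate(
    rank  = min_rank(desc(GPA)),          # 1 = highest GPA; ties share a rank
    medal = case_when(
      rank == 1 ~ "gold",
      rank == 2 ~ "silver",
      rank == 3 ~ "bronze",
      TRUE      ~ "other"
    )
  )

# A named colour vector ties each colour to a medal category,
# so the number of students no longer matters
p <- ggplot(gpa, aes(x = reorder(Name, GPA), y = GPA, fill = medal)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(
    values = c(gold = "#d6af36", silver = "#d7d7d7",
               bronze = "#a77044", other = "blue"),
    guide = "none"
  )
```

Because the colours are looked up by name, the same code should work for any class size, including classes with fewer than three students.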
I have been trying to get ggplot in R to plot 2 variables side by side in a bar plot against a categorical variable.
The data I have been using is the built-in mpg dataset, which ships with the ggplot2 package.
However, every time I run my code (listed below), I receive the error:
Error: Aesthetics must be either length 1 or the same as the data (234): y
My code is:
ggplot(mpg,aes(x=fl,y=c(cty,hwy)))+
geom_bar()
Can someone please help?
To summarise: I am using the mpg dataset in R and I'm trying to plot cty and hwy side by side in a barplot against their fuel type (fl).
One way to do this is to put the data into long format.
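I'm not really sure how meaningful this graph is: with position = "dodge" the rows for each fuel type are drawn on top of one another rather than added up, so each bar only reaches the largest highway or city miles-per-gallon value in its group. It might be more meaningful to calculate the average highway and city miles per gallon for the different fuel types.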
library(ggplot2)
library(tidyr)
mpg %>%
pivot_longer(c(cty,hwy)) %>%
ggplot(aes(x = fl, y=value, fill = name))+
geom_col(position = "dodge")
Created on 2021-04-10 by the reprex package (v2.0.0)
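The averaging idea mentioned above could look roughly like this. It is a sketch only; `avg_mpg` and `mean_value` are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Long format first, then one mean per fuel type and measure
avg_mpg <- mpg %>%
  pivot_longer(c(cty, hwy)) %>%
  group_by(fl, name) %>%
  summarize(mean_value = mean(value), .groups = "drop")

# One row per bar, so the dodged bars no longer overlap
p <- ggplot(avg_mpg, aes(x = fl, y = mean_value, fill = name)) +
  geom_col(position = "dodge")
```

Since each fuel type now contributes exactly one row per measure, the height of every bar is the plotted mean itself.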
By default, a bar chart (geom_bar()) makes the height of each bar proportional to the number of cases in each group, so it can't take both an x and a y aesthetic at the same time.
From your description, I think you want to plot numeric values against categorical data; in that case, use boxplots, e.g.
library(dplyr)
library(tidyr)
library(ggplot2)
mpg %>%
select(fl, cty, hwy) %>%
pivot_longer(-fl) %>%
ggplot(aes(x = fl, y = value, fill = name)) + geom_boxplot()
I have data saved in multiple datasets, each consisting of four variables. Imagine something like a data.table dt consisting of the variables Country, Male/Female, Birthyear, Weighted Average Income. I would like to create a graph where you see only one country's weighted average income by birthyear and split by male/female. I've used the facet_grid() function to get a grid of graphs for all countries as below.
ggplot() +
geom_line(data = dt,
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)
However, I've tried isolating the graphs for just one country, but the below code doesn't seem to work. How can I subset the data correctly?
ggplot() +
geom_line(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)
For your specific case, the problem is that you are not quoting Male/Female and Weighted Average Income: column names that contain spaces or a slash must be wrapped in backticks. Also, your data and basic aesthetics should probably be part of ggplot() and not geom_line(). Putting them in geom_line() isolates them to that single layer, so you would have to repeat them in every layer of your plot if you were to add, for example, geom_smooth().
So to fix your problem you could do
library(tidyverse)
# Backticks quote column names that are not syntactic R names
plot <- ggplot(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = `Weighted Average Income`,
col = `Weighted Average Income`
)) + # Could use !!sym("x") instead of `x` inside aes()
geom_line() +
facet_grid(Country ~ `Male/Female`)
plot
Now ggplot2 actually has a (lesser known) builtin functionality for changing your data, so if you wanted to compare this to the plot with all of your countries included you could do:
plot %+% dt # `%+%` is used to change the data used by one or more layers. See help("+.gg")
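To try the backtick quoting on something runnable, here is a small sketch with an invented table. The values are made up, a plain data.frame with base subsetting stands in for the data.table, and `germany` is a hypothetical helper name.

```r
library(ggplot2)

# Hypothetical data; check.names = FALSE keeps the non-syntactic column names
dt <- data.frame(
  Country       = rep(c("Germany", "France"), each = 4),
  `Male/Female` = rep(c("Male", "Female"), times = 4),
  Birthyear     = rep(c(1980, 1980, 1990, 1990), times = 2),
  `Weighted Average Income` = c(30, 25, 35, 31, 28, 24, 33, 29),
  check.names = FALSE
)

# Base R subsetting; with a data.table, dt[Country == "Germany"] does the same
germany <- dt[dt$Country == "Germany", ]

# Backticks let aes() and the facet formula refer to the awkward names
p <- ggplot(germany, aes(x = Birthyear, y = `Weighted Average Income`)) +
  geom_line() +
  facet_grid(Country ~ `Male/Female`)
```

Printing `p` should show one row of panels for Germany, with one panel per value of Male/Female.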
I am trying to create a histogram/bar plot in R to show the counts of each x value I have in the dataset and higher. I am having trouble doing this, and I don't know if I use geom_histogram or geom_bar (I want to use ggplot2). To describe my problem further:
On the X axis I have "Percent_Origins," which is a column in my data frame. On my Y axis - for each of the Percent_Origin values I have occurring, I want the height of the bar to represent the count of rows with that percent value and higher. Right now, if I am to use a histogram, I have:
plot <- ggplot(dataframe, aes(x=dataframe$Percent_Origins)) +
geom_histogram(aes(fill=Percent_Origins), binwidth= .05, colour="white")
What should I change the fill or general code to be to do what I want? That is, plot an accumulation of counts of each value and higher? Thanks!
I think that your best bet is going to be creating the cumulative distribution function first then passing it to ggplot. There are several ways to do this, but a simple one (using dplyr) is to sort the data (in descending order), then just assign a count for each. Trim the data so that only the largest count is still included, then plot it.
To demonstrate, I am using the builtin iris data.
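library(dplyr)
library(ggplot2)
iris %>%
arrange(desc(Sepal.Length)) %>% # largest values first
mutate(counts = 1:n()) %>% # running count of rows seen so far
group_by(Sepal.Length) %>%
slice(n()) %>% # keep the largest count for each value
ggplot(aes(x = Sepal.Length, y = counts)) +
geom_step(direction = "vh")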
gives:
If you really want bars instead of a line, use geom_col instead. However, note that you either need to fill in gaps (to ensure the bars are evenly spaced across the range) or deal with breaks in the plot.
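For the bar version, one alternative way to get the same "this value and higher" counts is to count each distinct value and take a cumulative sum from the top. This is only a sketch of that variant, not the code above; `cum_counts` is a name chosen here, and the gaps between observed values are left as they are.

```r
library(dplyr)
library(ggplot2)

cum_counts <- iris %>%
  count(Sepal.Length) %>%            # rows per distinct value
  arrange(desc(Sepal.Length)) %>%    # start from the largest value
  mutate(counts = cumsum(n))         # rows with this value or higher

p <- ggplot(cum_counts, aes(x = Sepal.Length, y = counts)) +
  geom_col()
```

The bar at the smallest Sepal.Length then equals the total number of rows, which is a quick sanity check on the accumulation.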
I want to aggregate data by year interval inside a bar plot. Based on this answer, I wrote the following code:
years <- seq(as.Date('1970/01/01'), Sys.Date(), by="year")
set.seed(111)
effect <- sample(1:100,length(years),replace=T)
data <- data.frame(year=years, effect=effect)
ggplot(data, aes(year, effect)) + geom_bar(stat="identity", aes(group=cut(year, "5 years")))
However, only the tick marks are affected, but the data is not summed by interval. Can I get ggplot2 to sum the data without preprocessing the data, while keeping the tick marks and labels as they are?
EDIT: Sorry I wasn't clear. I'd like to keep the tick marks and labels as they are, i.e. tick marks positioned at the left hand edge of each bar (which now covers 5 years) and year only in the labels. This is based on the appearance of the linked answer above.
Slightly hacky way of doing what you want:
ggplot(data, aes(cut(year, "5 years"), effect)) +
geom_col() +
xlab("year")
What it actually does: it plots multiple columns (bars) with height equal to effect, but stacked on top of each other according to their 5-year interval. In other words, the plot actually contains 48 bars of one colour, stacked on top of each other.
Try this:
library(tidyverse)
data %>%
# Number the rows in blocks of five consecutive years
mutate(index = ceiling(seq_along(year) / 5)) %>%
group_by(index) %>%
mutate(sum_effect = sum(effect)) %>%
distinct(sum_effect, .keep_all = TRUE) %>%
ggplot(aes(year, sum_effect)) +
geom_col()
Which returns:
I prefer transforming the dataset so that I don't have to do anything fancy with ggplot2
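A pre-aggregated variant that reuses cut() for the intervals might look like the sketch below. The fixed end date is an assumption made here so that the result is reproducible (the question used Sys.Date()), and `binned` and `interval` are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Fixed end date (assumption) instead of Sys.Date(), so there are exactly 50 years
years <- seq(as.Date("1970-01-01"), as.Date("2019-01-01"), by = "year")
set.seed(111)
data <- data.frame(year = years,
                   effect = sample(1:100, length(years), replace = TRUE))

# Sum effect within each 5-year interval before plotting
binned <- data %>%
  mutate(interval = cut(year, "5 years")) %>%
  group_by(interval) %>%
  summarize(sum_effect = sum(effect))

p <- ggplot(binned, aes(interval, sum_effect)) +
  geom_col() +
  xlab("year")
```

Here each bar is a single row of `binned`, so nothing is stacked behind the scenes and the summed values can be inspected directly.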