Plot number of people (observations) in a data set?

I want to use a barplot to display the number of male and female participants for a program in the month of July. However, I keep getting a percentage instead. I know there's something wrong with my code but I'm not sure what.
july_all%>%
filter(month == 0)%>%
group_by(sex)%>%
summarize(id=round(100*n()/nrow(july_all)))%>%
ggplot(aes(x =sex,y =id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")

You haven't provided any data, but I'm guessing the following is a reasonable approximation:
set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)
Now, running your code, we get:
july_all %>%
filter(month == 0) %>%
group_by(sex) %>%
summarize(id = round(100*n()/nrow(july_all))) %>%
ggplot(aes(x =sex, y = id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")
Which shows percentages. Why? Because that's what you have calculated with summarize(id = round(100*n()/nrow(july_all))) . Here, id is the percentage of each sex present in the data.
If you want raw counts, your code can be much simpler, since geom_bar plots counts by default, so you need only do:
july_all %>%
filter(month == 0) %>%
ggplot(aes(x = sex)) +
geom_bar() +
labs(y ="Number of participants")
And you'll see we have the raw counts plotted.
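
If you'd rather compute the counts yourself, for example to inspect the numbers before plotting, a minimal sketch is to tabulate with count() and draw the result with geom_col(). It reuses the simulated july_all from above (still an assumption about the real data), and the names counts and p are only illustrative.

```r
library(dplyr)
library(ggplot2)

# Simulated data, the same guess at the real data as above
set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)

# One row per sex, with the raw number of observations in column n
counts <- july_all %>%
  filter(month == 0) %>%
  count(sex)

# geom_col() draws the values in n as they are, without any further counting
p <- ggplot(counts, aes(x = sex, y = n)) +
  geom_col() +
  labs(y = "Number of participants")
p
```

Printing counts before plotting is a quick way to confirm that the y-axis will show counts rather than percentages.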

Related

Difference in n() count and geom_col graph likely resulted from group_by(), but why and how?

Sorry in advance, I'm an R newbie. I was working on Divvy Bike Share data (see here for details). Here is a subset of my df:
I wanted to visualize the total ridership count (how many times bikes are used) aggregated by day of the week. I tried two blocks of code, with the only difference being group_by() - the second one also groups by "month". I don't understand what caused the difference in y-axis values between the two graphs.
p1 <- df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1
p2 <- df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2
To check the tables generated before plotting, I ran the following blocks:
df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n())
df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n())
I think I understand that adding more variables to group_by makes the resulting table more finely categorized or "grouped". However, the total should always be the same, no? For example, if you add up all the casual & Sunday rows (separated into 12 months) in tibble 2, you get exactly the number in tibble 1 - 392107, the same number as shown in p1, not p2. This only added to my confusion.
So, in short, I have two questions:
Why do p1 and p2 differ? How can I avoid such errors in the future?
Where do the numbers in p2 come from?
Any advice would be greatly appreciated. Thank you!

You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:
library(dplyr)
library(ggplot2)
mpg %>%
count(drv, year, cyl) %>%
ggplot(aes(year, n, fill = drv)) +
geom_col(
position = "dodge",
color = "black",
alpha = .5
)
NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).
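
To confirm that nothing is lost in the table and only hidden in the plot, one option is to compare the counts per extra group with the totals. This is a small sketch on mpg, not on the Divvy data; by_cyl, totals, and comparison are illustrative names. For each drv and year, visible is the tallest overplotted bar and total is the sum across cyl.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Counts with and without the extra grouping variable
by_cyl <- mpg %>% count(drv, year, cyl)
totals <- mpg %>% count(drv, year)

# Per drv/year: the tallest bar (what the dodged plot shows) and the sum over cyl
comparison <- by_cyl %>%
  group_by(drv, year) %>%
  summarize(visible = max(n), total = sum(n), .groups = "drop") %>%
  left_join(totals, by = c("drv", "year"))

# total matches n from the coarser count; visible is never larger than total
comparison
```

If the plot should show totals, summarizing without the extra variable (as in p1) avoids the overplotting altogether.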

x-axis starting value for diverging plot

How can I change the "x-axis starting value" of the diverging bar chart below (extracted from here), so that the vertical axis is set at 25 instead of 0, and the bars are therefore drawn from 25 and not 0?
For instance, I want this chart:
To look like this:
EDIT
It is not the label I want to change, it is how the data is plotted. My apologies if I wasn't clear. See example below:
Another example to make it clear:

You can provide computed labels to an (x-)scale via scale_x_continuous(labels = function (x) x + 25).
If you also want to change the data, you’ll first need to offset the x-values by the equivalent amount (in the opposite direction):
Example:
df = tibble(Color = c('red', 'green', 'blue'), Divergence = c(5, 10, -5))
offset = 2
df %>%
mutate(Divergence = Divergence - offset) %>%
ggplot() +
aes(x = Divergence, y = Color) +
geom_col() +
scale_x_continuous(labels = function (x) x + offset)

I'm still not 100% clear on your intended outcome but you can "shift" your data by adding/subtracting 25 from each value, e.g.
Original plot:
library(tidyverse)
library(gapminder)
set.seed(123)
gapminder_subset <- gapminder %>%
pivot_longer(-c(country, continent, year)) %>%
filter(year == "1997" | year == "2007") %>%
select(-continent) %>%
filter(name == "gdpPercap") %>%
pivot_wider(names_from = year) %>%
select(-name) %>%
mutate(gdp_change = ((`2007` - `1997`) / `1997`) * 100) %>%
sample_n(15)
ggplot(data = gapminder_subset,
aes(x = country, y = gdp_change)) +
geom_bar(stat = "identity") +
coord_flip()
Subtract 25:
ggplot(data = gapminder_subset,
aes(x = country, y = gdp_change - 25)) +
geom_bar(stat = "identity") +
coord_flip()
If you combine that with my original relabelling I think that's the solution:
ggplot(data = gapminder_subset,
aes(x = country, y = gdp_change - 25)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_y_continuous(breaks = c(-25, 0, 25, 50),
labels = c(0, 25, 50, 75))

The answers that existed at the time that I'm writing this are suggesting to change the data or to change the label. Here, I'm proposing to change neither the data nor the labels, and instead just change where the starting position of a bar is.
First, for reproducibility, I took @jared_mamrot's approach for the data subset.
library(gapminder)
library(tidyverse)
set.seed(123)
gapminder_subset <- gapminder %>%
pivot_longer(-c(country, continent, year)) %>%
filter(year == "1997" | year == "2007") %>%
select(-continent) %>%
filter(name == "gdpPercap") %>%
pivot_wider(names_from = year) %>%
select(-name) %>%
mutate(gdp_change = ((`2007` - `1997`) / `1997`) * 100) %>%
sample_n(15)
Then, you can set xmin = after_scale(25). You'll get a warning that xmin doesn't exist, but it does exist after the bars are reparameterised to rectangles in the ggplot2 internals (which happens after the x-scale has seen the data to determine limits). This effectively changes the position where the bars start.
ggplot(gapminder_subset,
aes(gdp_change, country)) +
geom_col(aes(xmin = after_scale(25)))
#> Warning: Ignoring unknown aesthetics: xmin
Created on 2021-06-28 by the reprex package (v1.0.0)

Difference between geom_col() and geom_point() for same value

So, I'm trying to plot missing values here over time (longitudinal data).
I would prefer placing them in a geom_col() to fill up with colours of certain treatments afterwards. But for some weird reason, geom_col() gives me weird values, while geom_point() gives me the correct values using the same function. I'm trying to wrap my head around why this is happening. Take a look at the y-axis.
Disclaimer:
I know the missing values disappear on day 19-20. This is why I'm making the plot.
Sorry about the layout of the plot. It's not polished yet.
For the geom_point:
gaussian_transformed %>% group_by(factor(time)) %>% mutate(missing = sum(is.na(Rose_width))) %>% ggplot(aes(x = factor(time), y = missing)) + geom_point()
Picture: geom_point
For the geom_col:
gaussian_transformed %>% group_by(factor(time)) %>% mutate(missing = sum(is.na(Rose_width))) %>% ggplot(aes(x = factor(time), y = missing)) + geom_col()
Picture: geom_col

The problem is that you're using mutate, which keeps every row in each group instead of reducing each group to one row. You can't see it, but many points overlap in your geom_point plot.
One fix is to use either summarise or distinct.
Compare
library(tidyverse)
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_point()
The points look ugly because there is a lot of overplotting.
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
distinct(order, .keep_all = TRUE) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
Created on 2021-06-02 by the reprex package (v2.0.0)
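
To see where the inflated bar heights come from, it can help to compare row counts. The sketch below uses msleep again (with_mutate, with_summarise, and stacked are illustrative names). It assumes geom_col() with its default stacking, so the bar for each order ends up as the per-group value repeated once per row and added up.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# mutate() keeps every row and repeats the group value on each of them
with_mutate <- msleep %>%
  group_by(order) %>%
  mutate(missing = sum(is.na(sleep_cycle))) %>%
  ungroup()

# summarise() collapses each group to a single row
with_summarise <- msleep %>%
  group_by(order) %>%
  summarise(missing = sum(is.na(sleep_cycle)), rows = n())

nrow(with_mutate)     # same as nrow(msleep)
nrow(with_summarise)  # one row per order

# Height that stacked bars reach when built from the mutate() version
stacked <- with_mutate %>%
  group_by(order) %>%
  summarise(bar_height = sum(missing))

# bar_height equals missing * rows, not missing
left_join(stacked, with_summarise, by = "order")
```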

So after some digging:
What happened is that geom_col() sums up all the missing values while geom_point() does not, hence the large values for y. Why this happens, I do not know. However, the following worked fine for me:
gaussian_transformed$time <- as.factor(gaussian_transformed$time)
gaussian_transformed %>% group_by(time) %>% summarise(missing = sum(is.na(Rose_width))) -> gaussian_transformed
gaussian_transformed %>% ggplot(aes(x = time, y = missing)) + geom_col(fill = "blue", alpha = 0.5) + theme_minimal() + labs(title = "Missing values in Gaussian Outcome over the days", x = "Time (in days)", y = "Amount of missing values") + scale_y_continuous(breaks = seq(0, 10, 1))
With the plot: GaussianMissing

Creating a Line Graph to Display Total Pounds of each Type in a Specific Year

I am trying to create a line graph that shows how many pounds each milk type sold in 2017. It comes from this dataset https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/fluid_milk_sales.csv
This is what I have but I get a message asking if I need to adjust the group aesthetic. Not sure what I am doing wrong so I would love some assistance.
options(scipen = 999)
fluid_milk_sales %>%
filter(year == 2017) %>%
select(milk_type, pounds) %>%
ggplot(aes(x = milk_type, y = pounds)) +
geom_line()

You get that message because your x variable is categorical, so the points can't be joined into a line. I guess you need a bar plot (I flip the plot so that the types can be read; you can remove coord_flip() if you don't need that):
fluid_milk_sales %>%
filter(year == 2017) %>%
ggplot(aes(x = reorder(milk_type,pounds), y = pounds)) +
geom_col() + xlab("milk_type") + coord_flip()
Or, if you want something like a lollipop plot:
fluid_milk_sales %>%
filter(year == 2017) %>%
ggplot(aes(x = reorder(milk_type,pounds), y = pounds)) +
geom_point() +
geom_segment(aes(xend = milk_type, yend = 0)) +
coord_flip() + xlab("milk_type")
If you really want to force a line, which I think doesn't make sense (note I reorder with the negative to start with the highest):
fluid_milk_sales %>%
filter(year == 2017) %>%
ggplot(aes(x = reorder(milk_type,-pounds), y = pounds,group=1)) +
geom_line() + xlab("milk_type")

Order bars by difference between variables

My intention is to plot a bar chart with two variables visible, "HH_FIN_EX" and "ACT_IND_CON_EXP", but ordered by the variable diff in ascending order. diff itself should not be included in the chart.
library(eurostat)
library(tidyverse)
#getting the data
data1 <- get_eurostat("nama_10_gdp",time_format = "num")
#filtering
data_1_4 <- data1 %>%
filter(time=="2016",
na_item %in% c("B1GQ", "P31_S14_S15", "P41"),
geo %in% c("BE","BG","CZ","DK","DE","EE","IE","EL","ES","FR","HR","IT","CY","LV","LT","LU","HU","MT","NL","AT","PL","PT","RO","SI","SK","FI","SE","UK"),
unit=="CP_MEUR")%>% select(-unit, -time)
#transformations and calculations
data_1_4 <- data_1_4 %>%
spread(na_item, values)%>%
na.omit() %>%
mutate(HH_FIN_EX = P31_S14_S15/B1GQ, ACT_IND_CON_EXP=P41/B1GQ, diff=ACT_IND_CON_EXP-HH_FIN_EX) %>%
gather(na_item, values, 2:7)%>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP", "diff"))
#plotting
ggplot(data=data_1_4, aes(x=reorder(geo, values), y=values, fill=na_item))+
geom_bar(stat="identity", position=position_dodge(), colour="black")+
labs(title="", x="Countries", y="As percentage of GDP")
I appreciate any suggestions how to do this, as aes(x=reorder(geo, values[values=="diff"]) results in an error.

First of all, you shouldn't include diff (your result column) when using gather, it complicates things.
Change line gather(na_item, values, 2:7) to gather(na_item, values, 2:6).
You can use this code to calculate the difference and order the rows (using dplyr::arrange) in descending order:
plotData <- data_1_4 %>%
spread(na_item, values) %>%
na.omit() %>%
mutate(HH_FIN_EX = P31_S14_S15 / B1GQ,
ACT_IND_CON_EXP = P41 / B1GQ,
diff = ACT_IND_CON_EXP - HH_FIN_EX) %>%
gather(na_item, values, 2:6) %>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP")) %>%
arrange(desc(diff))
And plot it with:
ggplot(plotData, aes(geo, values, fill = na_item))+
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(x = "Countries",
y = "As percentage of GDP") +
scale_x_discrete(limits = plotData$geo)

You can explicitly figure out the order that you want -- this is stored in country_order below -- and force the factor geo to have its levels in that order. Then run ggplot after filtering out the diff variable. So replace your call to ggplot with the following:
country_order = (data_1_4 %>% filter(na_item == 'diff') %>% arrange(values))$geo
data_1_4$geo = factor(data_1_4$geo, country_order)
ggplot(data=filter(data_1_4, na_item != 'diff'), aes(x=geo, y=values, fill=na_item))+
geom_bar(stat="identity", position=position_dodge(), colour="black")+
labs(title="", x="Countries", y="As percentage of GDP")
Doing this, I get the plot below:
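
The same idea can be tried on a small made-up data frame without downloading anything from Eurostat. This is only a sketch: toy, its country codes, and its values are invented for illustration, and geo_order and toy_plot are hypothetical names. Only the column names follow the question.

```r
library(dplyr)
library(ggplot2)

# Invented long-format data: two plotted series plus a "diff" series per country
toy <- data.frame(
  geo = rep(c("AA", "BB", "CC"), each = 3),
  na_item = rep(c("HH_FIN_EX", "ACT_IND_CON_EXP", "diff"), times = 3),
  values = c(0.50, 0.70, 0.20,
             0.60, 0.65, 0.05,
             0.40, 0.55, 0.15),
  stringsAsFactors = FALSE
)

# Countries sorted by ascending diff
geo_order <- toy %>%
  filter(na_item == "diff") %>%
  arrange(values) %>%
  pull(geo)

# Fix the factor levels in that order, then drop diff before plotting
toy_plot <- toy %>%
  mutate(geo = factor(geo, levels = geo_order)) %>%
  filter(na_item != "diff")

p <- ggplot(toy_plot, aes(x = geo, y = values, fill = na_item)) +
  geom_col(position = "dodge", colour = "black") +
  labs(x = "Countries", y = "As percentage of GDP")
p
```

Because the order lives in the factor levels of geo, it carries over to any later plot of toy_plot without calling reorder() inside aes().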

Is this what you are looking for?
data_1_4 %>% mutate(Val = fct_reorder(geo, values, .desc = TRUE)) %>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP")) %>%
ggplot(aes(x=Val, y=values, fill=na_item)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
labs(title="", x="Countries", y="As percentage of GDP")
