ggplot: Y-axis as count of specific value - r

I am trying to create a bar chart of how many training courses my employees have completed. To do this, I have a data frame called iud, where each row is a distinct course they have begun taking:
name percent
<chr> <dbl>
1 Nardo 41.7
2 Nardo 0
3 Nardo 4.59
4 Nardo 100
...
I am trying to use ggplot to create a bar chart where the y-axis is a count of the number of rows where percent is equal to 100. (So for the data above, Nardo's bar would be at 1.) I am currently using this:
cpu <- ggplot(iud, aes(name)) +
geom_bar(data=subset(iud,percent=="100"), stat = "count") +
scale_y_continuous(breaks = seq(0,15,1))
The chart looks correct, but it does not include bars where the count of percent equals 0 (Employees who have begun, but not completed a course, are not included on the chart).
Is there a better way to do this that makes sure all employees are charted, including those whose y-axis value would be 0?

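I think it is easiest to pre-process the data and count the number of courses at 100% for each employee first. Employees with no completed course then get a count of 0 instead of being dropped.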
library(tidyverse)
df %>% group_by(name) %>%
summarise(n = sum(percent == 100)) %>%
ggplot(aes(x = name, y = n)) +
geom_col()
#data
library(readr)
df <- read_delim("name percent
Nardo 41.7
Nardo 0
Nardo 4.59
Nardo 100
Ardi 45", delim = " ")
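As a quick check of the idea above, here is a minimal sketch that only builds the summary table, using the same five sample rows. It assumes nothing beyond dplyr; the object name `completed` is made up for illustration.

```r
library(dplyr)

# Same sample rows as in the answer: Nardo finished one course, Ardi none.
iud <- data.frame(
  name    = c("Nardo", "Nardo", "Nardo", "Nardo", "Ardi"),
  percent = c(41.7, 0, 4.59, 100, 45)
)

# sum() over a logical vector counts the TRUE values, so an employee
# with no completed course still gets a row, with n = 0.
completed <- iud %>%
  group_by(name) %>%
  summarise(n = sum(percent == 100))

print(completed)
```

Because every name survives the summary, geom_col() has a row to draw for each employee, even when the bar height is zero.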

Related

boxplot in R is showing a vertical straight line

I have a data frame with multiple columns. I want to create two boxplots, one for each of the two professions "secretary" and "driver", but the result is not satisfying: the plot shows only vertical straight lines. This is my data and code:
profession ve.count.descrition euse.count.description Qualitative.result
secretary 0 1 -0.5
secretary 0 2 1
driver 1 1 -1
driver 0 2 0.3
data %>%
mutate(Qualitative.result = factor(Qualitative.result)) %>%
ggplot(aes(x = Profession , fill = Qualitative.result)) +
geom_boxplot()
You should not convert Qualitative.result to a factor; a boxplot needs it as a numeric y aesthetic. Maybe you want something like this:
library(tidyverse)
data %>%
ggplot(aes(x = Profession, y = Qualitative.result, fill = Profession)) +
geom_boxplot()

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to group by fractions/percentages. I have therefore tried to assign firms to groups based on their row positions after arranging Firm_size in descending order. The problem with doing so is that I don't take the year into consideration, which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%"
Further I make a new dataframe "testtest" where I calculate the different measures, before using the stargazer to present it. This works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of firms in each Group_Size, and thus returns the equivalent of doing this with the original data:
test %>% group_by(Year) %>% summarize(mean(Expenditures))
Input Data:
set.seed(123)
test = data.frame(
Firm = replicate(2000, sample(letters,1)),
Year = sample(2000:2020, 2000, replace=T),
Firm_size= ceiling(runif(2000,2000,5000)),
Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
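A possible variant of the grouping step, sketched here under assumptions: it computes the quantile breaks inside a grouped mutate(), so no list column or rowwise() is needed. Whether this avoids the "List object cannot be coerced to type double" error from the question's edit is only a guess, since the failing data is not shown. The small data frame `test_small` and the result name `sized` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: 20 firms per year with distinct sizes.
test_small <- data.frame(
  Year         = rep(c(2000, 2001), each = 20),
  Firm_size    = c((1:20) * 100, (1:20) * 100 + 50),
  Expenditures = seq(1000, 40000, by = 1000)
)

size_labels <- c("Lowest 10%", "10%-25%", "25%-50%",
                 "50%-75%", "75%-90%", "Highest 10%")

# Inside a grouped mutate(), Firm_size is the vector for one year,
# so quantile() yields that year's breaks directly.
sized <- test_small %>%
  group_by(Year) %>%
  mutate(gs = cut(
    Firm_size,
    breaks = c(-Inf, quantile(Firm_size, probs = c(.1, .25, .5, .75, .9)), Inf),
    labels = size_labels
  )) %>%
  ungroup()

print(count(sized, Year, gs))
```

Note that cut() stops with an error if two quantile breaks coincide within a year, which can happen with very few firms or many tied sizes; the toy data avoids that by construction.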

R - ggplot showing distribution of binary flag variable (0/1) over time as normalized bar chart (%)

I have a data set that looks something like this:
Date Remaining Volume ID
1990-01-01 0 1000 1
1990-01-01 1 2000 2
1990-01-01 1 5000 3
1990-02-01 0 200 4
1990-03-01 1 4000 5
1990-03-01 0 3000 6
I filter the data according to a series of conditional statements and assign the binary flag variable to the data.table. A value of 0 means that the particular row entry doesn't meet the defined requirements and will subsequently be excluded; 1-flagged rows remain in the data.table. The key is ID and is unique for each row.
I would like to show two relationships.
(1) A stacked normalized/percentage bar chart over the monthly time series to show the percentage of entries remaining/being excluded in the data set for each month,
e.g. Jan 1990 --> 2/3 values remaining --> 66.7% vs. 33.3% of entries remain vs. are excluded
(2) A stacked normalized/percentage bar chart showing the normalized percentage of volume remaining/being excluded by the filtering operation for each month,
e.g. Jan 1990 --> 2k + 5k out of 8k remaining --> 87.5% vs. 12.5% of volume remains vs. is excluded
I have tried various things so far, e.g. computing the number of occurrences of each flag value per month and the sum of the volume in the corresponding "bucket" (0/1), but all my attempts have failed.
# dt_1 is the original data.table
id.vec <- dt_1[ , id]
dt_2 <- dt_1
# dt_1 is filtered subsequently
id_remaining.vec <- dt_1[ , id]
dt_2 <- dt_2[id.vec %in% id_remaining.vec, REMAIN := 1]
dt_2 <- dt_2[id.vec %notin% id_remaining.vec, REMAIN := 0]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN := .N]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN_MON := .N]
# Tried the code below to no avail
ggplot(data = dt_2, aes(x = Date, y = REMAIN, color = REMAIN, fill = REMAIN)) +
geom_bar(position = "fill", stat = "identity")
Usually, I find ggplot grammar very intuitive, but I guess I am overlooking something here, or maybe the data set is not in the right format.
Any pointer or idea highly appreciated!
Here's how I'd do it with dplyr:
library(dplyr)
dt_2 %>%
mutate(Remaining = as.character(Remaining)) %>% # just to make the charts use scale_fill_discrete by default
group_by(Date, Remaining) %>%
summarize(entries = n(),
volume = sum(Volume)) %>%
mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
> dt_2_summary
# A tibble: 5 x 6
Date Remaining entries volume share_entries share_volume
<chr> <chr> <int> <int> <dbl> <dbl>
1 1990-01-01 0 1 1000 0.333 0.125
2 1990-01-01 1 2 7000 0.667 0.875
3 1990-02-01 0 1 200 1 1
4 1990-03-01 0 1 3000 0.5 0.429
5 1990-03-01 1 1 4000 0.5 0.571
Then to chart:
dt_2_summary %>%
ggplot(aes(Date, share_entries, fill = Remaining)) +
geom_col()
dt_2_summary %>%
ggplot(aes(Date, share_volume, fill = Remaining)) +
geom_col()
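As an alternative sketch, ggplot2 can do the normalization itself: geom_bar() counts rows per bar, its weight aesthetic sums a column instead, and position = "fill" rescales each month to 1. The data frame below retypes the six sample rows from the question; the plot object names `p_entries` and `p_volume` are made up for illustration.

```r
library(ggplot2)

# The six sample rows from the question.
dt_2 <- data.frame(
  Date      = c("1990-01-01", "1990-01-01", "1990-01-01",
                "1990-02-01", "1990-03-01", "1990-03-01"),
  Remaining = c(0, 1, 1, 0, 1, 0),
  Volume    = c(1000, 2000, 5000, 200, 4000, 3000)
)
# A character flag gives a discrete fill scale.
dt_2$Remaining <- as.character(dt_2$Remaining)

# (1) Share of entries per month: each row counts once.
p_entries <- ggplot(dt_2, aes(Date, fill = Remaining)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)

# (2) Share of volume per month: each row counts with weight Volume.
p_volume <- ggplot(dt_2, aes(Date, weight = Volume, fill = Remaining)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)

print(p_entries)
print(p_volume)
```

This skips the intermediate summary table, at the cost of not having the share columns available for labels or further checks.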
Just an appendix to Jon's great solution.
I had a large project with >25 libraries loaded and, while the proposed code seemingly worked, it only worked for share_entries and not for share_volume. The output of dt_2_summary was weird: the share_entries column was apparently computed relative to the total number of entries rather than within each group, and the share_volume column only showed NAs.
After hours of troubleshooting, I identified the culprit as the plyr package, which masked some dplyr functions. Thus, I had to specify which version of each function I wanted to use.
The code below did the trick for me.
library(plyr) # the culprit
library(dplyr)
dt_2 %>%
dplyr::mutate(Remaining = as.character(Remaining)) %>%
group_by(Date, Remaining) %>%
dplyr::summarize(entries = n(),
volume = sum(Volume)) %>%
dplyr::mutate(share_entries = entries / sum(entries),
share_volume = volume / sum(volume)) %>%
ungroup() -> dt_2_summary
Thanks again, Jon, for your wonderful solution!

Data wrangling for creating multiple bar graph

So, I have this tibble from which I am trying to make a multiple bar graph that shows how much was spent supporting (for) or opposing (against) each of these candidates.
However, I am completely lost on how to go about doing it, and I think I want to rearrange this tibble to make it simpler to create a graph. Any pointers would be very helpful.
A tibble: 5 x 5
type clinton sanders omalley fa_camp
<chr> <dbl> <dbl> <dbl> <chr>
1 24A 51937848 859337 0 against
2 24C 15106530 900 0 for
3 24E 29651626 5307952 374821 for
4 24F 5096083 304153 0 for
5 24N 10139 0 0 against
I am hoping to eventually achieve a result that looks like this:
The different colored bars would be for/against, and the y-axis would be the amount spent.
Before plotting, I would put the data into long format.
library(tidyverse)
library(scales)
df %>%
pivot_longer(cols = -c(type, fa_camp), names_to = "candidate", values_to = "amount_spent") %>%
ggplot(aes(x = candidate, y = amount_spent, group = fa_camp, fill = fa_camp)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = dollar)
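One caveat, offered as a hedged sketch: the sample tibble has several rows per candidate and for/against combination (three "for" rows, two "against"), and as far as I can tell position = "dodge" draws such rows over one another rather than adding them up. If the bars should show total spending, the amounts can be summed after pivoting. The data frame below retypes the tibble from the question; `totals` is a made-up name.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# The tibble from the question, retyped as a data frame.
df <- data.frame(
  type    = c("24A", "24C", "24E", "24F", "24N"),
  clinton = c(51937848, 15106530, 29651626, 5096083, 10139),
  sanders = c(859337, 900, 5307952, 304153, 0),
  omalley = c(0, 0, 374821, 0, 0),
  fa_camp = c("against", "for", "for", "for", "against")
)

# Long format, then one total per candidate and for/against.
totals <- df %>%
  pivot_longer(cols = -c(type, fa_camp),
               names_to = "candidate", values_to = "amount_spent") %>%
  group_by(candidate, fa_camp) %>%
  summarise(amount_spent = sum(amount_spent), .groups = "drop")

p <- ggplot(totals, aes(candidate, amount_spent, fill = fa_camp)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::dollar)

print(p)
```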

Why is geom_bar y-axis unproportional to actual numbers?

Sorry if this question already exists - was googling for a while now already and didn't find anything.
I am relatively new to R and learning while doing all of this.
I'm supposed to create some PDF via r markdown that analyses patient-data with specific main-diagnosis and secondary-diagnosis. For this I'm supposed to plot some numbers via ggplot (geom_bar and geom_boxplot).
So what I do so far is, I retrieve data-sets that include both codes via SQL and load them into data.table-objects afterwards. Afterwards I join them to get the data I need.
After this I add columns that contain sub-strings of those codes, and others that contain the counts of those sub-strings (so I can plot the occurrences of every code).
I now wanted, for example, to put a certain data.table into a geom_bar or geom_boxplot and make it visible. This actually works, but my y-axis has a weird scale that doesn't fit the numbers it should show. The proportions of the bars are also not accurate.
For example: one diagnosis appears 600 times and another one 1000 times. The y-axis shows steps of 0 - 500.000 - 1.000.000 - 1.500.000 - ....
The bar that should show 600 is super small and the bar for 1000 goes up to 1.500.000.
If I create a new variable beforehand, count what I need via count(), and plot that, it just works. The columns I use for the y-axis have the same data type (integer) in both variables.
So here is just how I create the data.table that I use for plotting
exazerbationsHdComorbiditiesNd <- allExazerbationsHd[allComorbiditiesNd, on="encounter_num", nomatch=0]
exazerbationsHdComorbiditiesNd <- exazerbationsHdComorbiditiesNd[, c("i.DurationGroup", "i.DurationInDays", "i.start_date", "i.end_date", "i.duration", "i.patient_num"):=NULL]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeCount := .N, by = concept_cd]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeClassCount := .N, by = IcdHdClass]
If I want to bar-plot now for example IcdHdClass by IcdHdCodeClassCount I do following:
ggplot(exazerbationsHdComorbiditiesNd, aes(exazerbationsHdComorbiditiesNd$IcdHdClass, exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount, label=exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It outputs said bar-plot with weird proportions.
If I do first:
plotTest <- count(exazerbationsHdComorbiditiesNd, exazerbationsHdComorbiditiesNd$IcdHdClass)
And then bar-plot it:
ggplot(plotTest, aes(plotTest$`exazerbationsHdComorbiditiesNd$IcdHdClass`, plotTest$n, label=plotTest$n)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It's all perfect and works.
I checked also data-types of the columns I needed:
sapply(exazerbationsHdComorbiditiesNd, class)
sapply(plotTest, class)
In both variables the columns I need are of the types character and integer.
Edit:
Unfortunately I can't post images, so here are just the links to them.
Here is a screenshot of the plot with wrong y-axis:
https://ibb.co/CbxX1n7
And here is a screenshot of the plot shown right:
https://ibb.co/Xb8gyx1
Here is some example-data that I copied out the data.table object:
Exampledata
Since you added the class counts as an additional column--rather than aggregating--what’s happening is that for each row in your data, the class counts get stacked on top of each other:
library(tidyverse)
set.seed(42)
df <- tibble(class = sample(letters[1:3], 10, replace = TRUE)) %>%
add_count(class, name = "count")
df # this is essentially what your data looks like
#> # A tibble: 10 x 2
#> class count
#> <chr> <int>
#> 1 a 5
#> 2 a 5
#> 3 a 5
#> 4 a 5
#> 5 b 3
#> 6 b 3
#> 7 b 3
#> 8 a 5
#> 9 c 2
#> 10 c 2
ggplot(df, aes(class, count)) + geom_bar(stat = "identity")
You could use position = "identity" so that the bars don’t get stacked:
ggplot(df, aes(class, count)) +
geom_bar(stat = "identity", position = "identity")
However, that creates a whole bunch of unnecessary layers in your plot that you can’t see. A better approach would be to drop the extra rows from your data before plotting:
df %>%
distinct(class, count)
#> # A tibble: 3 x 2
#> class count
#> <chr> <int>
#> 1 a 5
#> 2 b 3
#> 3 c 2
df %>%
distinct(class, count) %>%
ggplot(aes(class, count)) +
geom_bar(stat = "identity")
Created on 2019-09-05 by the reprex package (v0.3.0.9000)
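Since the question works with data.table, here is a rough equivalent of the distinct() step in that syntax. It is only a sketch: the class codes are invented placeholder values, and `plot_dt` is a made-up name. It also refers to columns by bare name inside aes() instead of using `$`, which is the usual ggplot2 convention.

```r
library(data.table)
library(ggplot2)

# Placeholder data: one row per encounter, with invented class codes.
dt <- data.table(IcdHdClass = c("J44", "J44", "J44", "J45", "J45"))

# As in the question: the count is added as a column, repeated on every row.
dt[, IcdHdCodeClassCount := .N, by = IcdHdClass]

# Keep one row per class so each bar is drawn once instead of stacked.
plot_dt <- unique(dt[, .(IcdHdClass, IcdHdCodeClassCount)])

p <- ggplot(plot_dt, aes(IcdHdClass, IcdHdCodeClassCount,
                         label = IcdHdCodeClassCount)) +
  geom_col() +
  geom_text(vjust = 0, size = 5)

print(p)
```

Aggregating directly with `dt[, .N, by = IcdHdClass]` would give the same one-row-per-class table without adding the count column first.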
