I am using ggforce to create a plot like this.
My goal is to facet this type of plot.
For background on how the chart was made, check out update 3 on this question. The only modification that I have made was adding a geom_segment between the x axis and the Y value positions.
The reason why I believe faceting this graph is either difficult, or even impossible, is because continuous value x coordinates are used to determine where the geom_arc_bar is positioned in space.
My only idea for getting this to work has been supplying each "characteristic" that I want to facet with a set of x coordinates (1,2,3). Initially, as I will demonstrate in my code, I worked with a set of highly curated data. Ideally, I would like to scale this to a dataset with many variables.
In the example graph that I have provided, the Y value is from table8, filtered for rows with "DFT". The area of the half-circles is proportional to the values of DDFS and FDFS from table9. Ideally, I would like to be able to create a function allowing for the easy creation of these graphs, with perhaps 3 parameters: the data for the y value and the data for each of the two half-circles.
Here is my data.
Here is the code that I have written thus far.
For making a single plot
#Filter desired Age and Measurement
table9 %>%
filter(Age == "6-11" & Measurement != 'DFS' ) %>%
select( SurveyYear, Total , Measurement ) %>%
arrange(SurveyYear) %>%
dplyr::rename(Percent = Total) -> table9
#Do the same for table 8.
table8 %>%
filter(Age == "6-11" & Measurement != "DS" & Measurement != "FS") %>%
select(SurveyYear, Total) %>%
dplyr::rename(Y = Total)-> table8
table8 <- table8 %>%
bind_rows(table8) %>%
arrange(Y) %>%
add_column(start = rep(c(-pi/2, pi/2), 3), x = c(1,1,2,2,3,3))
table8_9 <- bind_cols(table8,table9) %>%
select(-SurveyYear1)
#Create the plot
ggplot(table8_9) + geom_segment( aes(x=x, xend=x, y=0, yend=Y), size = 0.5, linetype="solid") +
geom_arc_bar(aes(x0 = x, y0 = Y, r0 = 0, r = sqrt((Percent*2)/pi)/20,
start = start, end = start + pi, fill = Measurement),
color = "black") + guides(fill = guide_legend(title = "Type", reverse = T)) +
guides(fill = guide_legend(title = "Measurement", reverse = F)) +
xlab("Survey Year") + ylab("Mean dfs") + coord_fixed() + theme_pubr() +
scale_y_continuous(expand = c(0, 0), limits = c(0, 5.5)) +
scale_x_continuous(breaks = 1:3, labels = paste0(c("1988-1994", "1999-2004", "2011-2014"))) +
scale_fill_discrete(labels = c("ds/dfs", "fs/dfs")) -> lolliPlot
lolliPlot
Attempt at many plots
#Filter for "DFS"
table8 <- table8 %>%
filter(Measurement=="DFS")
#Duplicate DF vertically, and add column specifying the start point for the arcs.
table8 <- table8 %>%
bind_rows(table8) %>%
add_column(start = rep(c(-pi/2, pi/2), length(.$SurveyYear)/2), x = rep(x = c(1,2,3),length(.$SurveyYear)/3)) %>%
arrange(Age, x)
#Bind two tables together, removing all of the characteristic columns from table 8.
table8_9 <- bind_cols(table8,table9) %>%
select(-Age1, -SurveyYear1, -Measurement) %>%
gather(key = Variable, value = Y, -x,-start,-Age, -SurveyYear, -Measurement1, -Total1, -Male1, -Female1, -'White, non-Hispanic1', -'Black, non-hispanic1', -'Mexican American1', -'Less than 100% FPG1', -'100-199% FPG1', -'Greater than 200% FPG1')
This is where I get stuck. I can't figure out a way to format the data so that I can facet the graph. If anybody has any ideas or advice, I would greatly appreciate it.
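One way to picture the idea described above (every facet gets its own copy of the x coordinates 1, 2, 3) is a long table with one row per half circle and a column naming the facet. The following is only a minimal sketch under that assumption: the data frame toy, its column Characteristic, and all of the numbers are made up for illustration and do not come from table8 or table9.

```r
# Toy data, made up for illustration: two characteristics, three survey
# years each, and two measurements (one per half circle).
toy <- expand.grid(
  Measurement = c("DDFS", "FDFS"),
  x = 1:3,                               # same x positions in every facet
  Characteristic = c("Male", "Female"),
  stringsAsFactors = FALSE
)
# Height of each lollipop, repeated for both half circles
toy$Y <- rep(c(3.2, 2.8, 2.1, 3.5, 3.0, 2.4), each = 2)
# Share shown by each half circle (each pair sums to 100)
toy$Percent <- c(60, 40, 45, 55, 30, 70, 65, 35, 50, 50, 25, 75)
# One half circle starts at -pi/2, the other at pi/2
toy$start <- ifelse(toy$Measurement == "DDFS", -pi / 2, pi / 2)

# Plot only if the plotting packages are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggforce", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy) +
    geom_segment(aes(x = x, xend = x, y = 0, yend = Y)) +
    ggforce::geom_arc_bar(
      aes(x0 = x, y0 = Y, r0 = 0, r = sqrt(Percent * 2 / pi) / 20,
          start = start, end = start + pi, fill = Measurement),
      color = "black"
    ) +
    coord_fixed() +
    facet_wrap(~ Characteristic)   # one panel per characteristic
  print(p)
}
```

Because x restarts at 1 for every value of Characteristic, each panel can reuse the same axis breaks and labels.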
Related
I am trying to plot the following data (df_input) in the format of a stacked bar graph where we can also see the change over time by line. Any idea how to do it?
df_input <- data.frame( Year= c(2010,2010,2010,2010,2020,2020,2020,2020), village= c("A","B","C","D","A","B","C","D"), share = c(40,30,20,10,30,30,25,15))
df_input_2 <- data.frame( Year= c(2010,2010,2010,2010,2015,2015,2015,2015,2020,2020,2020,2020), village= c("A","B","C","D","A","B","C","D","A","B","C","D"), share = c(40,30,20,10,30,30,25,15,20,10,30,40))
One option to achieve that would be via a geom_col and a geom_line. For the geom_line you have to group by the variable mapped on fill, set position to "stack" and adjust the start/end positions to account for the widths of the bars. Additionally you have to manually set the orientation for the geom_line to y:
library(ggplot2)
width <- .6 # Bar width
ggplot(df_input, aes(share, factor(Year), fill = village)) +
geom_col(width = width) +
geom_line(aes(x = share,
y = as.numeric(factor(Year)) + ifelse(Year == 2020, -width / 2, width / 2),
group = village), position = "stack", orientation = "y")
EDIT: With more than two years things get a bit trickier. In that case I would switch to geom_segment. Additionally we have to do some data wrangling to prepare the data for use with geom_segment:
library(ggplot2)
library(dplyr)
# Example data with three years
df_input_2 <- data.frame( Year= c(2010,2010,2010,2010,2015,2015,2015,2015,2020,2020,2020,2020), village= c("A","B","C","D","A","B","C","D","A","B","C","D"), share = c(40,30,20,10,30,30,25,15,20,10,30,40))
width = .6
# Data wrangling
df_input_2 <- df_input_2 %>%
group_by(Year) %>%
arrange(desc(village)) %>%
mutate(share_cum = cumsum(share)) %>%
group_by(village) %>%
arrange(Year) %>%
mutate(Year = factor(Year),
Year_lead = lead(Year), share_cum_lead = lead(share_cum))
ggplot(df_input_2, aes(share, factor(Year), fill = village)) +
geom_col(width = width) +
geom_segment(aes(x = share_cum, xend = share_cum_lead, y = as.numeric(Year) + width / 2, yend = as.numeric(Year_lead) - width / 2, group = village))
#> Warning: Removed 4 rows containing missing values (geom_segment).
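The warning is expected: lead() has nothing to return for each village's last year, so those rows have a missing segment end. A minimal sketch, using a made-up two-village data frame (toy_shares is a hypothetical name), of how the missing rows arise and how they could be dropped before plotting the segments:

```r
library(dplyr)

# Made-up data: two villages, three years
toy_shares <- data.frame(
  Year    = rep(c(2010, 2015, 2020), each = 2),
  village = rep(c("A", "B"), times = 3),
  share   = c(60, 40, 50, 50, 30, 70)
)

segs <- toy_shares %>%
  group_by(Year) %>%
  arrange(desc(village)) %>%
  mutate(share_cum = cumsum(share)) %>%      # stacked position within a year
  group_by(village) %>%
  arrange(Year) %>%
  mutate(Year_lead = lead(Year),             # NA for the last year
         share_cum_lead = lead(share_cum)) %>%
  ungroup()

# One incomplete row per village (its last year)
n_missing <- sum(is.na(segs$Year_lead))

# Keep only rows that have a following year to connect to
segs_complete <- filter(segs, !is.na(Year_lead))
```

Passing something like segs_complete to the data argument of geom_segment should avoid the warning while leaving the bars untouched.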
Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date means more cases that day), I want to display Number of positive cases/total (positive + negative) cases for that day as a percent on the y-axis, with collection dates along the x-axis. Then I want to break down by region. Goal is to look like this but in terms of daily positives/# of tests rather than just positives vs negatives. I also want to add a horizontal line on every graph at 20%.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close. But the percents are wrong because they are just taking the proportion of that day against counts of all days instead of per day.
Then I tried using tally() in the following command to try and count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure that I got what you want, but anyway, this can be helpful.
The data: Since you are new here, I have to let you know that using a simple and reproducible version of your data will make it easier for the rest of us to answer. To do this you can simulate a data frame or any other object, or use the dput function on it.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
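The question asks for the y axis as a percent. A minimal sketch of one way to do that, assuming the scales package is available; the small data frame toy is made up so the result can be checked by hand, and mean(TestResult == "Positive") is used as a shorter equivalent of positives divided by total tests:

```r
library(dplyr)

# Made-up data: day 1 has 1 positive out of 4 tests, day 2 has 2 out of 2
toy <- data.frame(
  CollectionDate = as.Date("2020-01-01") + c(0, 0, 0, 0, 1, 1),
  Region = "Region 1",
  TestResult = c("Positive", "Negative", "Negative", "Negative",
                 "Positive", "Positive")
)

# Share of positive tests per region and day
daily <- toy %>%
  group_by(Region, CollectionDate) %>%
  summarise(positive_pro = mean(TestResult == "Positive"), .groups = "drop")

# Plot only if the plotting packages are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(daily, aes(x = CollectionDate, y = positive_pro)) +
    geom_col() +
    geom_hline(yintercept = 0.2, linetype = "dashed") +  # 20% reference line
    scale_y_continuous(labels = scales::percent) +       # 0.25 shown as 25%
    facet_wrap(~ Region)
  print(p)
}
```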
I can get you halfway there (refer to the comments in the code for clarifications). This code is for the counts per day per region (plotted separately for each region). I think you can tweak things further to calculate the counts per day per county too; and whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
df_final %<>%
spread(key = TestResult, value = n) %>%
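#Compute the total first: mutate() evaluates its arguments in order,
#so Negative would already be overwritten when Positive is computed
mutate(Total = Negative + Positive,
Negative = Negative/Total,
Positive = Positive/Total)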
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:
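A note on computing both proportions in a single mutate() call: dplyr evaluates the expressions in order, so a later expression sees any column that an earlier one has already overwritten. A minimal sketch with made-up counts (the data frame counts is hypothetical) showing why it is safer to store the total first:

```r
library(dplyr)

# Made-up counts for two region-date combinations
counts <- data.frame(Negative = c(30, 10), Positive = c(10, 10))

# Pitfall: Negative is already a proportion when Positive is computed
wrong <- counts %>%
  mutate(Negative = Negative / (Negative + Positive),
         Positive = Positive / (Negative + Positive))

# Safer: keep the total in its own column and divide by that
right <- counts %>%
  mutate(Total    = Negative + Positive,
         Negative = Negative / Total,
         Positive = Positive / Total)

# Only the second version gives shares that add up to 1
rowSums(right[, c("Negative", "Positive")])
```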
I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse) # tibble, dplyr, forcats and ggplot2 are all used below
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
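The same idea (one hue per category, a different lightness per member) can also be applied directly to the columns from the question. This is only a sketch in base R: the hue values, the lightness range and the helper names (descs, base_hue, pal) are assumptions chosen for illustration, not part of the answer above.

```r
# One row per input description, with its broader type
descs <- data.frame(
  input_type = c("labor", "labor", "supply", "supply"),
  input_desc = c("plumber", "manager", "pipe", "truck"),
  stringsAsFactors = FALSE
)

# Assumed base hues: roughly green for labor, roughly blue for supply
base_hue <- c(labor = 120, supply = 240)

# Position of each description within its type, and the size of each type
idx <- ave(seq_along(descs$input_desc), descs$input_type, FUN = seq_along)
n   <- ave(seq_along(descs$input_desc), descs$input_type, FUN = length)

# Spread lightness between 35 and 80 within each type
light <- 35 + 45 * (idx - 1) / pmax(n - 1, 1)

# Named palette: names are the fill levels, values are hex colors
pal <- setNames(
  hcl(h = unname(base_hue[descs$input_type]), c = 50, l = light),
  descs$input_desc
)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  spec <- data.frame(
    year = rep(2006:2008, each = 4),
    input_desc = factor(rep(descs$input_desc, 3), levels = descs$input_desc),
    value = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4)
  )
  p <- ggplot(spec, aes(x = year, y = value, fill = input_desc)) +
    geom_area() +
    scale_fill_manual(values = pal)
  print(p)
}
```

Keeping the factor levels in the same order as descs keeps each type's members next to each other in both the stack and the legend.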
I have a data set similar to the one below where I have a lot of data for certain groups and then only single observations for other groups. I would like my single observations to show up as points but the other groups with multiple observations to show up as lines (no points). My code is below:
EDIT: I'm attempting to find a way to do this without using multiple datasets in the geom_* calls because of the issues it causes with the legend. There was an answer that has since been deleted that was able to handle the legend but didn't get rid of the points on the lines. I would potentially like a single legend with points only showing up if they are a single observation.
library(tidyverse)
dat <- tibble(x = runif(10, 0, 5),
y = runif(10, 0, 20),
group = c(rep("Group1", 4),
rep("Group2", 4),
"Single Point 1",
"Single Point 2")
)
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point() +
geom_line()
Created on 2019-04-02 by the reprex package (v0.2.1)
Only plot the data with 1 point in geom_point() and the data with >1 point in geom_line(). These can be precalculated in mutate().
dat = dat %>%
group_by(group) %>%
mutate(n = n() )
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) )
Having the legend match this is trickier. This is the sort of thing that the override.aes argument in guide_legend() can be useful for.
In your case I would separately calculate the number of observations in each group first, since that is what the line vs point is based on.
sumdat = dat %>%
group_by(group) %>%
summarise(n = n() )
The result is in the same order as the factor levels in the legend, which is why this works.
Now we need to remove lines and keep points whenever the group has only a single observation. 0 stands for a blank line and NA stands for no shape. I use an ifelse() statement for linetype and shape for override.aes, based on the number of observations per group.
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) ) +
guides(color = guide_legend(override.aes = list(linetype = ifelse(sumdat$n == 1, 0, 1),
shape = ifelse(sumdat$n == 1, 19, NA) ) ) )
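To see what actually gets passed to override.aes, the two vectors can be built and inspected on their own. A small base R sketch, assuming the default legend order (sorted group names); the names n_per_group, linetype_override and shape_override are made up for illustration:

```r
# Same grouping as in the example data
group <- c(rep("Group1", 4), rep("Group2", 4),
           "Single Point 1", "Single Point 2")

# Observations per group, in sorted order of the group names
counts <- table(group)
n_per_group <- setNames(as.integer(counts), names(counts))

# Blank line (0) for single observations, solid line (1) otherwise
linetype_override <- ifelse(n_per_group == 1, 0, 1)

# Filled circle (19) for single observations, no shape (NA) otherwise
shape_override <- ifelse(n_per_group == 1, 19, NA)

# One entry per legend key, named by group
linetype_override
shape_override
```

If the legend order is changed (for example by setting factor levels or scale breaks), the vectors would need to be reordered to match.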
I have a dataset:
https://docs.google.com/spreadsheets/d/1ZgyRQ2uTw-MjjkJgWCIiZ1vpnxKmF3o15a5awndttgo/edit?usp=sharing
that I'm trying to apply PCA to, in order to produce a graph like the one in this post:
https://stats.stackexchange.com/questions/61215/how-to-interpret-this-pca-biplot-coming-from-a-survey-of-what-areas-people-are-i
However, an error doesn't seem to go away:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
arguments imply differing number of rows: 0, 1006
Below is my code; I have trouble finding the source of the error and would appreciate help tracking it down. Any hints?
The goal is to produce a PCA graph grouped by levels of Happiness.in.life. I modified the original code to fit my dataset. Originally, the group was determined by Genders, which has 2 levels. What I'm attempting to do is build a graph based on the 5 levels of Happiness.in.life. However, it doesn't seem I can use the old code...
Thanks!
library(magrittr)
library(dplyr)
library(tidyr)
df <- happiness_reduced %>% dplyr::select(Happiness.in.life:Internet.usage, Happiness.in.life)
head(df)
vars_on_hap <- df %>% dplyr::select(-Happiness.in.life)
head(vars_on_hap)
group<-df$Happiness.in.life
fit <- prcomp(vars_on_hap)
pcData <- data.frame(fit$x)
vPCs <- fit$rotation[, c("PC1", "PC2")] %>% as.data.frame()
multiple <- min(
(max(pcData[,"PC1"]) - min(pcData[,"PC1"]))/(max(vPCs[,"PC1"])-
min(vPCs[,"PC1"])),
(max(pcData[,"PC2"]) - min(pcData[,"PC2"]))/(max(vPCs[,"PC2"])-
min(vPCs[,"PC2"]))
)
ggplot(pcData, aes(x=PC1, y=PC2)) +
geom_point(aes(colour=groups)) +
coord_equal() +
geom_text(data=vPCs,
aes(x = fit$rotation[, "PC1"]*multiple*0.82,
y = fit$rotation[,"PC2"]*multiple*0.82,
label=rownames(fit$rotation)),
size = 2, vjust=1, color="black") +
geom_segment(data=vPCs,
aes(x = 0,
y = 0,
xend = fit$rotation[,"PC1"]*multiple*0.8,
yend = fit$rotation[,"PC2"]*multiple*0.8),
arrow = arrow(length = unit(.2, 'cm')),
color = "grey30")
Here is an approach on how to plot the result of PCA in ggplot2:
library(tidyverse)
library(ggrepel)
It is often a good idea to scale the variables prior to PCA (though not always, for instance when they are all in the same units):
hapiness %>% #this is the data from google drive. In the future try not to post such links on SO because they tend to become unusable after some time has passed
select(-Happiness.in.life) %>%
prcomp(center = TRUE, scale. = TRUE) -> fit
Now we can proceed to plotting the fit:
fit$x %>% #coordinates of the points are in x element
as.data.frame()%>% #convert matrix to data frame
select(PC1, PC2) %>% #select the first two PC
bind_cols(hapiness = as.factor(hapiness$Happiness.in.life)) %>% #add the coloring variable
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = hapiness)) + #plot points and color
geom_segment(data = fit$rotation %>% #data we want plotted by geom_segment is in rotation element
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(), #get to row names so you can label after
aes(x = 0, y = 0, xend = PC1 * 7, yend = PC2* 7, group = rowname), #I scaled the rotation by 7 so it fits in the plot nicely
arrow = arrow(angle = 20, type = "closed", ends = "last",length = unit(0.2,"cm")),
color = "grey30") +
geom_text_repel(data = fit$rotation %>%
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(),
aes(x = PC1*7,
y = PC2*7,
label = rowname)) +
coord_equal(ratio = fit$sdev[2]^2 / fit$sdev[1]^2) + #I like setting the ratio to the ratio of eigen values
xlab(paste("PC1", round(fit$sdev[1]^2/ sum(fit$sdev^2) *100, 2), "%")) +
ylab(paste("PC2", round(fit$sdev[2]^2/ sum(fit$sdev^2) *100, 2), "%")) +
theme_bw()
Look at all them happy people on the left. They are hard to notice because of the colors used, so I suggest using the jco palette from the ggpubr library, i.e. scale_color_manual(values = get_palette('jco', 5)).
Quite a similar plot can be achieved with the ggord library:
library(ggord)
ggord(fit, grp_in = as.factor(hapiness$Happiness.in.life),
size = 1, ellipse = F, ext = 1.2, vec_ext = 5)
The major difference is that ggord uses equal scaling for the axes. Also, I scaled the rotation by 5 instead of 7 as in the first plot.
As you can see I do not like many intermediate data frames.
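The percentages in the axis labels of the first plot are the share of variance explained by each component, computed from the sdev element of the prcomp fit. A minimal sketch of that calculation on its own; the built-in iris data is only a stand-in for the happiness data, and the names var_explained and axis_labels are made up for illustration:

```r
# iris is a stand-in here; any all-numeric data frame works the same way
fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- fit$sdev^2

# Share of total variance explained by each component
var_explained <- eigenvalues / sum(eigenvalues)

# Labels in the same spirit as the xlab()/ylab() calls above
axis_labels <- paste0("PC", 1:2, " ",
                      round(100 * var_explained[1:2], 2), " %")
axis_labels
```

The ratio eigenvalues[2] / eigenvalues[1] is the value passed to coord_equal(ratio = ...) in the first plot.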