Consider this example:
library(dplyr)
library(sf)
library(tmap)
d <- tibble(one = c(1,1,2,1,1,1,1),
            two = c(1,1,2,1,1,1,1))
std <- st_as_sf(d, coords = c('one', 'two'))
std %>% tm_shape() + tm_bubbles(alpha = 0.3)
You can see that point (1, 1) is darker because it appears six times in the data: thanks to alpha blending, the overlapping points add up.
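(With standard alpha compositing, n overlapping marks of opacity a build up to an effective opacity of 1 - (1 - a)^n, so six stacked bubbles at alpha = 0.3 read far darker than a single one:)
# effective opacity of 1 vs. 6 stacked marks at alpha = 0.3
alpha <- 0.3
1 - (1 - alpha)^c(1, 6)
#> [1] 0.300000 0.882351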
My problem is that I cannot store the dataset as is. What I have is only an aggregated version, like
d_agg <- d %>% group_by(one, two) %>%
summarize(count = n()) %>%
ungroup()
# A tibble: 2 x 3
    one   two count
  <dbl> <dbl> <int>
1     1     1     6
2     2     2     1
How can I reproduce the same exact chart as before, using d_agg and the corresponding count variable?
Of course, re-creating the initial data frame above is not a feasible solution, because I have too many points (and some points are repeated too many times).
Just using
std_agg <- st_as_sf(d_agg, coords = c('one', 'two'))
std_agg %>% tm_shape() + tm_bubbles(col = 'count', alpha = 0.3)
does not work.
Unfortunately, alpha is not (yet) an aesthetic, so it is not possible to do alpha = "count".
My question back: do you really need alpha? Probably not, as long as you are not already using the color aesthetic. Your approach of using color to mimic alpha transparency was actually good; it just needs a little configuration:
std_agg %>% tm_shape() + tm_bubbles(col = 'count', style = "cont",
palette = "Greys", contrast = c(.3, .7), legend.col.show = FALSE)
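If you prefer discrete shading over a continuous ramp, the same idea should also work with fixed breaks (an untested sketch; the breaks here are arbitrary):
std_agg %>% tm_shape() +
  tm_bubbles(col = 'count', style = "fixed", breaks = c(0, 2, 4, 6),
             palette = "Greys", contrast = c(.3, .7),
             legend.col.show = FALSE)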
Here is how to recreate your data frame d using dplyr. This does not address your question about passing a numeric value to the alpha argument of tm_bubbles, but consider it a workaround.
std_agg <- d_agg %>%
slice(rep(row_number(), times = count)) %>%
st_as_sf(coords = c('one', 'two'))
std_agg %>%
tm_shape() +
tm_bubbles(alpha = 0.3)
In fact, this base R one-liner to expand the data frame is probably more intuitive.
d_agg[rep(1:nrow(d_agg), times = d_agg$count), ]
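For completeness, tidyr ships a verb for exactly this expansion, which may read more clearly than the slice(rep(...)) idiom:
library(tidyr)

# uncount() repeats each row `count` times and drops the count column
std_agg <- d_agg %>%
  uncount(count) %>%
  st_as_sf(coords = c('one', 'two'))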
I am attempting to use ggplot2 to create a weighted density plot showing the distribution of two groups that each account for a fraction of a certain distribution. The difficulty that I am encountering stems from the fact that although both groups have the same number of observations in the data, they have different weightings, and I would like for each group's area in the graph to reflect this difference in weightings.
My data look something like this.
library(dplyr)

var <- sort(rnorm(1000, mean = 5, sd = 2))
df <- tibble(id = c(rep(1, 1000), rep(2, 1000)),
             var = c(var, var),
             weight = c(rep(.1, 500), rep(.2, 500), rep(.9, 500), rep(.8, 500)))
Observe that group 1 is given low weightings (.1 or .2) while group 2 is given high weightings (.9 or .8). Also observe that, for any given value of var, the weightings add up to 1. In the real data, the shares accounted for by each group differ in a more complex manner across the distribution of var.
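A quick check confirms this (a sketch):
# at every value of var, the two groups' weights should sum to 1
df %>%
  group_by(var) %>%
  summarize(total_weight = sum(weight)) %>%
  summarize(all_one = all(dplyr::near(total_weight, 1))) # TRUE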
I have tried plotting this data as follows, and although using weight captures the way that the distributions vary within each group, it does not capture the way that the distribution varies between groups.
library(ggplot2)
df %>%
ggplot(aes(x = var, group = id, fill = factor(id), weight = weight)) +
geom_density(position = 'stack')
The resulting plot looks something like this.
It is clear that the groups do not account for roughly 15% and 85% of the area under the density curve as they should, but the issue is easier to see when we use position = 'fill'.
Each group seems to take up a similar area, apparently because the weighting is applied before the grouping is accounted for. I would like a solution where the area associated with group 1 is commensurate with its weight (i.e. much smaller than the area associated with group 2).
To clarify, it is the height associated with each group that should differ. In the above plot, the line of demarcation between group 1 and group 2 should be significantly higher, making the area taken up by group 1 significantly smaller.
Dealing with the relative density of the two groups is a bit ambiguous. Clearly, each group's density needs to have an integral of 1 for it to be a true density. The closest you can come is probably to have the integrals of the two curves sum to 1, which I think requires you to do the density calculation yourself and then plot with a stacked geom_area:
library(tidyverse)

df %>%
  nest(data = -id) %>%                        # one row per group, data nested
  summarize(id = factor(id),
            # each group's total weight
            weight = unlist(map(data, ~ sum(.x$weight))),
            # weighted density within each group
            dens = map(data, function(.x) {
              d <- density(.x$var, weights = .x$weight / sum(.x$weight))
              data.frame(x = d$x, y = d$y)
            })) %>%
  mutate(weight = weight / sum(weight)) %>%   # normalize group weights to sum to 1
  unnest(dens) %>%
  mutate(y = y * weight) %>%                  # scale each curve by its group's share
  ggplot(aes(x, y, fill = id)) +
  geom_area(position = 'stack', color = 'black') +
  labs(y = 'density', x = 'var')
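If you want to double-check that the stacked curves now integrate to the right shares, you can approximate each curve's integral with the trapezoidal rule (a sketch, assuming you assign the data frame built above to dens_df just before the ggplot() call):
# areas should come out near each group's weight share (~0.15 and ~0.85)
dens_df %>%
  group_by(id) %>%
  summarize(area = sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2))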
I am not completely sure if I understand you correctly, but maybe you can calculate the value beforehand based on the weight and then stack it like this:
library(ggplot2)
library(dplyr)
# Stacked
df %>%
mutate(weighted_var = var*weight) %>%
ggplot(aes(x = weighted_var, fill = factor(id), group = id)) +
geom_density(position = 'stack')
And check the groups with fill like this:
# Fill
df %>%
mutate(weighted_var = var*weight) %>%
ggplot(aes(x = weighted_var, fill = factor(id), group = id)) +
geom_density(position = 'fill')
I'd like to create a graph like the one below. It's kind of a combination of using geom_area and geom_point.
Let's say my data looks like this:
library(ggplot2)
library(gcookbook) # for the uspopage dataset
ggplot(uspopage, aes(x = Year, y = Thousands, fill = AgeGroup)) +
geom_area()
I obtain the following graph
Then, I'd like to add the exact number of points as the total for each category, which would be:
library(dplyr)
uspopage |>
group_by(AgeGroup) |>
summarize(total = sum(Thousands))
# A tibble: 8 × 2
AgeGroup total
<fct> <int>
1 <5 1534529
2 5-14 2993842
3 15-24 2836739
4 25-34 2635986
5 35-44 2331680
6 45-54 1883088
7 55-64 1417496
8 >64 1588163
Following some Twitter comments, my workaround is as follows:
1 - create the original plot with ggplot2
2 - grab the areas of the plot as a data.frame (ggplot_build)
3 - create polygons of the points given in 2, and make it a sensible sf object (downscale to a flatter earth)
4 - generate N random points inside each polygon (st_sample)
5 - grab these points and upscale back to the original scale
6 - ggplot2 once again, now with geom_point
7 - enjoy the wonders of ggplot2
library(gcookbook)
library(tidyverse)
library(sf)
set.seed(42)
# original data
d <- uspopage
# number of points for each group (I divide it by 1000)
d1 <- d |>
group_by(AgeGroup) |>
summarize(n_points = round(sum(Thousands) / 1e3)) |>
mutate(group = 1:n())
# original plot
g <- ggplot(data = d,
aes(x = Year,
y = Thousands,
fill = AgeGroup)) +
geom_area()
# get the geom data from ggplot
f <- ggplot_build(g)$data[[1]]
# polygons are created point by point, in order; so, for each group, bind the data frame back to itself: the first part traces the ymin line, the second the ymax line in reverse (forming one continuous outline around each area)
# list of groups
l_groups <- unique(f$group)
# function to invert and add back the data.frame
f_invert <- function(groups) {
k <- f[f$group == groups,]
k$y <- k$ymin
k1 <- k[nrow(k):1,]
k1$y <- k1$ymax
k2 <- rbind(k, k1)
return(k2)
}
# create a new data frame of the points in order
f1 <- do.call("rbind", lapply(l_groups, f_invert))
# for further use at the end of the script (to upscale back to the original ranges)
max_x <- max(f1$x)
max_y <- max(f1$y)
min_x <- min(f1$x)
min_y <- min(f1$y)
# normalizing: limiting sizes to a fairly small area on the globe (flat-earth wannabe / 1 x 1 degrees)
f1$x <- scales::rescale(f1$x)
f1$y <- scales::rescale(f1$y)
# create polygons
polygons <- f1 |>
group_by(group) |>
sf::st_as_sf(coords = c("x", "y"), crs = 4326) |>
summarise(geometry = sf::st_combine(geometry)) |>
sf::st_cast("POLYGON")
# cast N number of points randomly inside each geometry (N is calculated beforehand in d1)
points <- polygons %>%
st_sample(size = d1$n_points,
type = 'random',
exact = TRUE) %>%
# Give the points an ID
sf::st_sf('ID' = seq(length(.)), 'geometry' = .) %>%
# Get underlying polygon attributes (group is the relevant attribute that we want to keep)
sf::st_intersection(., polygons)
# rescale back to the original ranges
points <- points |>
mutate(x = unlist(map(geometry,1)),
y = unlist(map(geometry,2))) |>
mutate(x = (x * (max_x - min_x) + min_x),
y = (y * (max_y - min_y) + min_y))
# bring back the legends
points <- left_join(points, d1, by = c("group"))
# final plot
g1 <- ggplot() +
geom_point(data = points,
aes(x = x,
y = y,
color = AgeGroup),
size = 0.5) +
labs(x = element_blank(),
y = element_blank()) +
theme_bw()
g1
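If you would rather keep the original areas visible underneath the sampled points, a faint geom_area layer can be added back (a sketch using the objects already defined above):
# overlay variant: translucent areas with the sampled points on top
g2 <- ggplot() +
  geom_area(data = d,
            aes(x = Year, y = Thousands, fill = AgeGroup),
            alpha = 0.2) +
  geom_point(data = points,
             aes(x = x, y = y, color = AgeGroup),
             size = 0.5) +
  theme_bw()
g2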
Here's a version without any smoothing, just adding noise to where the dots would go naturally. One nice thing here is we can specify how many people are represented per dot.
library(tidyverse) # uncount(), dplyr, ggplot2
library(gcookbook) # uspopage

dots_per_thou <- 1
uspopage %>%
  # one dot per 1000 / dots_per_thou people
  uncount(round(dots_per_thou * Thousands / 1000)) %>%
  group_by(Year) %>%
  mutate(x_noise = runif(n(), 0, 1) - 0.5,
         x_pos = Year + x_noise,
         y_noise = runif(n(), 0, 1000 / dots_per_thou),
         # space the dots evenly up each year's column, plus jitter,
         # so the stack reaches the year's cumulative total
         y_pos = (row_number() - 1) * 1000 / dots_per_thou + y_noise) %>%
  ungroup() %>%
  ggplot(aes(x_pos, y_pos, color = AgeGroup)) +
  geom_point(size = 0.1) +
  ggthemes::scale_color_tableau()
You could come close-ish to that look with the ggbeeswarm package. It includes a few positions which "offset points within a category based on their density using quasirandom noise" (this is the description in the vipor package which underlies those positions).
The approach is just a hack and not entirely satisfying. The number of dots is not exact and is more "guessed", and the dots are too regular with position_beeswarm; I couldn't yet get it to run with the probably more appropriate position_quasirandom.
Also, it is computationally very intense and made my reprex crash, so the code below is simply copied from my script.
library(gcookbook)
library(ggplot2)
library(dplyr)
## ggbeeswarm needs to be in the development version
# devtools::install_github("eclarke/ggbeeswarm")
library(ggbeeswarm)
uncount_df <- uspopage %>%
group_by(Year) %>%
## inflate every group artificially so it adds up to the previous group
## and make the numbers much smaller so the computation is not crazy
mutate(cumul_sum = as.integer(cumsum(Thousands)/ 10^3)) %>%
## uncount
tidyr::uncount(cumul_sum)
## I am creating a list of layers
ls_layers <- lapply(split(uncount_df, uncount_df$AgeGroup), function(dat){
## x and y aesthetics are switched to avoid coord_flip
## side is an argument in the dev version
## the size is a bit of trial and error
geom_beeswarm(data = dat, aes( x = Year, y = "x", color = AgeGroup),
side = 1L,
size = .4)
})
## reversing the order, a trick to plot from small to large numbers
ls_layers <- ls_layers[length(ls_layers):1]
ggplot() +
## you can now simply add the list of layers to your ggplot object
ls_layers
Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date mean more cases that day), I want to display the number of positive cases divided by the total (positive + negative) cases for that day, as a percent, on the y-axis, with collection dates along the x-axis. Then I want to break it down by region. The goal is to look like this, but in terms of daily positives / number of tests rather than just positives vs. negatives. I also want to add a horizontal line at 20% on every graph.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close, but the percentages are wrong: they take each day's counts as a proportion of all days combined rather than within each day.
Then I tried using tally() in the following command to count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say I am not 100% sure I got what you want, but this may be helpful anyway.
The data: since you are new here, note that a simple and reproducible version of your data makes it easier for the rest of us to answer. To provide one you can simulate a data frame (or any other object), or use the dput() function on yours.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region 2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
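Since you mentioned wanting percentages on the y-axis, you can add a percent scale to either plot, e.g. for the regional one:
# regional plot with a percent-formatted y-axis
data %>%
  count(CollectionDate, TestResult, Region, name = "cases") %>%
  group_by(CollectionDate, Region) %>%
  summarise(positive_pro = sum(cases[TestResult == "Positive"]) / sum(cases)) %>%
  ggplot(aes(x = CollectionDate, y = positive_pro)) +
  geom_col() +
  geom_hline(yintercept = 0.2) +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~Region)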
I can get you halfway there (refer to the comments in the code for clarification). This code produces the counts per day per region, plotted separately for each region. I think you can tweak things further to calculate the counts per day per county too, and the whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
df_final %<>%
  spread(key = TestResult, value = n) %>%
  # compute the denominator first, so Positive is not divided by the
  # already-rescaled Negative column
  mutate(Total = Negative + Positive,
         Negative = Negative / Total,
         Positive = Positive / Total)
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:
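For the whole state, the same pipeline simply drops Region (a sketch; note it assumes an untallied copy of the data, here called df_raw, since df_final is overwritten above):
# state-wide positive share per day (df_raw = the raw, untallied data)
df_raw %>%
  group_by(CollectionDate, TestResult) %>%
  tally() %>%
  spread(key = TestResult, value = n) %>%
  mutate(Total = Negative + Positive,
         Positive = Positive / Total) %>%
  ggplot() +
  geom_col(aes(x = CollectionDate, y = Positive), alpha = 0.4) +
  geom_hline(yintercept = 0.2)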
I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific description of the input (e.g. plumbers or building supplies, respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data = subset(spec_trend_pie, input_type == "labor"),
          aes(y = value, x = year, group = input_type, fill = input_desc)) +
geom_area(data = subset(spec_trend_pie, input_type == "supply"),
          aes(y = value, x = year, group = input_type, fill = input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse) # dplyr, forcats, ggplot2

set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
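If you want to eyeball the generated palette before plotting, scales has a handy helper:
# preview the hcl() colors, one swatch per group
scales::show_col(area_colors$hex)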
I am working with categorical longitudinal data. My data has 3 simple variables:
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
where variable is basically time, and value is one of the 3 possible categories an id can take.
I am interested in producing a "parallel" longitudinal graph, similar to this with ggplot2
I am struggling a bit to get it right. What I have come up with so far is this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transition" (the id lines).
I wondered if you could help me to:
a) make the id lines visible, or thicker according to the number of ids going from one state to the other
b) re-size the points according to the number of ids in each state. I tried to do this with geom_point(aes(colour=value, size = nn), size=4) but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(ggplot2)
library(reshape2) # for melt()
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# reshape to long (person-period) format #
dt = dt %>% melt(id.vars = c('id'))
# number the time points within each id #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph #
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)
You were so close with geom_point(aes(colour=value, size = nn), size=4): the problem is that by setting size = 4 outside aes() after mapping it inside, you had ggplot overwrite the variable mapping with the constant 4. Assuming you want to use nn to scale line thickness as well, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lagged value for the line thickness, I would suggest adding it as a new column in dt0, as sketched below.
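Here is a sketch of that lag idea (assuming dt0 is the data frame built in the question):
# carry each id's previous count along its line
dt0 <- dt0 %>%
  ungroup() %>%
  arrange(id, variable) %>%
  group_by(id) %>%
  mutate(nn_lag = lag(nn, default = first(nn))) %>%
  ungroup()

# then map the lagged count to the line thickness:
# geom_line(aes(size = nn_lag), colour = "grey70")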