R - ggplot2 parallel categorical plot

I am working with categorical longitudinal data. My data has 3 simple variables, like this:
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
Where variable is basically time, and value is one of the 3 possible categories an id can take.
I am interested in producing a "parallel" longitudinal graph, similar to this, with ggplot2.
I am struggling a bit to get it right. What I have come up with so far is this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transition" (the id lines).
I wondered if you could help me with:
a) making the id lines visible, or making them "thicker" according to the number of ids going from one state to the other;
b) re-sizing the points according to the number of ids in each state. I tried to do this with geom_point(aes(colour=value, size = nn), size=4) but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(reshape2)   # for melt()
library(ggplot2)
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# transform to person-period (long) format #
dt = dt %>% melt(id.vars = c('id'))
# replace the time factor with a numeric index 1..5 per id #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph # 
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)

You were so close with geom_point(aes(colour=value, size = nn), size=4); the problem is that by setting size = 4 outside aes() after mapping size inside aes(), you made ggplot overwrite the variable mapping with the constant 4. Assuming you want to use nn to scale line thickness as well, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lag value for the line thickness, I would suggest adding that as a new column in dt0, along the lines of the sketch below.
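A minimal sketch of that idea (assuming the dt0 columns created above; dt0_lag and nn_lag are made-up names, and lag() comes from dplyr):
dt0_lag <- dt0 %>%
  group_by(id) %>%
  arrange(variable, .by_group = TRUE) %>%
  mutate(nn_lag = lag(nn, default = first(nn))) %>%  # size of the state each id is coming from
  ungroup()
dt0_lag %>% ggplot(aes(variable, value, group = id)) +
  geom_line(colour = "grey70", aes(size = nn_lag)) +
  geom_point(aes(colour = value, size = nn)) +
  scale_colour_brewer(palette = "Set1") + theme_minimal()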

Related

Create an area plot where each area is evenly filled with the actual number of points?

I'd like to create a graph like the one below. It's kind of a combination of using geom_area and geom_point.
Let's say my data looks like this:
library(gcookbook)
library(ggplot2)
ggplot(uspopage, aes(x = Year, y = Thousands, fill = AgeGroup)) +
geom_area()
I obtain the following graph
Then, I'd like to add the exact number of points as the total for each category, which would be:
library(dplyr)
uspopage |>
group_by(AgeGroup) |>
summarize(total = sum(Thousands))
# A tibble: 8 × 2
AgeGroup total
<fct> <int>
1 <5 1534529
2 5-14 2993842
3 15-24 2836739
4 25-34 2635986
5 35-44 2331680
6 45-54 1883088
7 55-64 1417496
8 >64 1588163
Following some Twitter comments, my workaround is as follows:
1 - create the original plot with ggplot2
2 - grab the areas of the plot as a data.frame (ggplot_build)
3 - create polygons of the points given in 2, and make it a sensible sf object (downscale to a flatter earth)
4 - generate N random points inside each polygon (st_sample)
5 - grab these points and upscale back to the original scale
6 - ggplot2 once again, now with geom_point
7 - enjoy the wonders of ggplot2
library(gcookbook)
library(tidyverse)
library(sf)
set.seed(42)
# original data
d <- uspopage
# number of points for each group (I divide it by 1000)
d1 <- d |>
group_by(AgeGroup) |>
summarize(n_points = round(sum(Thousands) / 1e3)) |>
mutate(group = 1:n())
# original plot
g <- ggplot(data = d,
aes(x = Year,
y = Thousands,
fill = AgeGroup)) +
geom_area()
# get the geom data from ggplot
f <- ggplot_build(g)$data[[1]]
# polygons are created point by point, in order. So, by group, append the data.frame
# back to itself: the first part traces the ymin line and the second the reversed
# ymax line (to make one continuous outline enclosing each area).
# list of groups
l_groups <- unique(f$group)
# function to invert and add back the data.frame
f_invert <- function(groups) {
k <- f[f$group == groups,]
k$y <- k$ymin
k1 <- k[nrow(k):1,]
k1$y <- k1$ymax
k2 <- rbind(k, k1)
return(k2)
}
# create a new data frame of the points in order
f1 <- do.call("rbind", lapply(l_groups, f_invert))
# for further use at the end of the script (to upscale back to the original ranges)
max_x <- max(f1$x)
max_y <- max(f1$y)
min_x <- min(f1$x)
min_y <- min(f1$y)
# normalizing: limiting sizes to a fairly small area on the globe (flat-earth wannabe / 1 x 1 degrees)
f1$x <- scales::rescale(f1$x)
f1$y <- scales::rescale(f1$y)
# create polygons
polygons <- f1 |>
group_by(group) |>
sf::st_as_sf(coords = c("x", "y"), crs = 4326) |>
summarise(geometry = sf::st_combine(geometry)) |>
sf::st_cast("POLYGON")
# cast N number of points randomly inside each geometry (N is calculated beforehand in d1)
points <- polygons %>%
st_sample(size = d1$n_points,
type = 'random',
exact = TRUE) %>%
# Give the points an ID
sf::st_sf('ID' = seq(length(.)), 'geometry' = .) %>%
# Get underlying polygon attributes (group is the relevant attribute that we want to keep)
sf::st_intersection(., polygons)
# rescale back to the original ranges
points <- points |>
mutate(x = unlist(map(geometry,1)),
y = unlist(map(geometry,2))) |>
mutate(x = (x * (max_x - min_x) + min_x),
y = (y * (max_y - min_y) + min_y))
# bring back the legends
points <- left_join(points, d1, by = c("group"))
# final plot
g1 <- ggplot() +
geom_point(data = points,
aes(x = x,
y = y,
color = AgeGroup),
size = 0.5) +
labs(x = element_blank(),
y = element_blank()) +
theme_bw()
g1
Here's a version without any smoothing, just adding noise to where the dots would go naturally. One nice thing here is we can specify how many people are represented per dot.
dots_per_thou <- 1
uspopage %>%
uncount(round(dots_per_thou * Thousands / 1000)) %>%
group_by(Year) %>%
mutate(x_noise = runif(n(), 0, 1) - 0.5,
x_pos = Year + x_noise,
y_noise = runif(n(), 0, 1000*dots_per_thou),
y_pos = cumsum(row_number() + y_noise)) %>%
ungroup() %>%
ggplot(aes(x_pos, y_pos, color = AgeGroup)) +
geom_point(size = 0.1) +
ggthemes::scale_color_tableau()
You could come close-ish to that look with the ggbeeswarm package. It includes a few positions which "offset points within a category based on their density using quasirandom noise" (this is the description in the vipor package which underlies those positions).
The approach is just a hack and certainly not entirely satisfying. The number of dots is not exact and more like "guessed", and they are too regular with position_beeswarm; I couldn't yet get it to run with the probably more appropriate position_quasirandom.
Also, it is computationally very intensive and made my reprex crash, so the code below is simply copied from my script.
library(gcookbook)
library(ggplot2)
library(dplyr)
## ggbeeswarm needs to be in the development version
# devtools::install_github("eclarke/ggbeeswarm")
library(ggbeeswarm)
uncount_df <- uspopage %>%
group_by(Year) %>%
## inflate every group artificially to add up to the previous group
## and make the numbers much smaller so the computation stays manageable
mutate(cumul_sum = as.integer(cumsum(Thousands)/ 10^3)) %>%
## uncount
tidyr::uncount(cumul_sum)
## I am creating a list of layers
ls_layers <- lapply(split(uncount_df, uncount_df$AgeGroup), function(dat){
## I switched the x and y aesthetics to avoid coord_flip
## side is an argument in the dev version
## the size is a bit of trial and error
geom_beeswarm(data = dat, aes( x = Year, y = "x", color = AgeGroup),
side = 1L,
size = .4)
})
## reversing the order, a trick to plot from small to large numbers
ls_layers <- ls_layers[length(ls_layers):1]
ggplot() +
## you can now simply add the list of layers to your ggplot object
ls_layers

How to draw stacked barplot on the summed data

For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
So what I did so far is plot a barplot of the totals, which are, in fact, the sums of each row. What I want to do now is present it as a stacked barplot of those sums.
For now, it just shows the total without any divisions indicating that, say, 1 car, 2 suvs, and 1 pickup add up to a total of 4.
Note: this is different from barplot(as.matrix(df)), because that just splits the data by car, suv, and pickup and disregards the totals.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(reshape2)
library(ggplot2)
library(dplyr)   # for the %>% pipe
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
long_df <- df %>% melt(id.vars = c("ID") ,value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R barplot (after reshaping to wide format with reshape2; we drop the Type column before as.matrix so the matrix stays numeric):
wide_df <- df %>%
  melt(id.vars = c("ID"), value.name = "Number", variable.name = "Type") %>%
  dcast(Type ~ ID, value.var = "Number")
barplot(as.matrix(wide_df[, -1]), legend.text = wide_df$Type)
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)

R - Plot the rolling mean of different time series in a lineplot with ggplot2

I want to plot the rolling mean of data of different time series with ggplot2. My data have the following structure:
library(dplyr)
library(ggplot2)
library(zoo)
library(tidyr)
df <- data.frame(episode=seq(1:1000),
t_0 = runif(1000),
t_1 = 1 + runif(1000),
t_2 = 2 + runif(1000))
df.tidy <- gather(df, "time", "value", -episode) %>%
separate("time", c("t", "time"), sep = "_") %>%
subset(select = -t)
> head(df.tidy)
# episode time value
#1 1 0 0.7466480
#2 2 0 0.7238865
#3 3 0 0.9024454
#4 4 0 0.7274303
#5 5 0 0.1932375
#6 6 0 0.1826925
Now, the code below creates a plot where the lines for time = 1 and time = 2 towards the beginning of the episodes do not represent the data because value is filled with NAs and the first numeric entry in value is for time = 0.
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = rollmean(value, 10, align = "right", fill = NA)))
How do I have to adapt my code such that the rolling-mean lines are representative of my data?
Your issue is that you are applying the moving average over the whole column, which makes data "leak" from one value of time to another.
You could group_by first to apply the rollmean to each time separately:
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(data = df.tidy %>%
group_by(time) %>%
mutate(value = rollmean(value, 10, align = "right", fill = NA)))
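An equivalent approach, if you prefer to precompute the smoothed values before plotting (a sketch; df.smooth and roll are made-up names, not from the original answer):
df.smooth <- df.tidy %>%
  group_by(time) %>%
  mutate(roll = rollmean(value, 10, align = "right", fill = NA)) %>%
  ungroup()
ggplot(df.smooth, aes(x = episode, y = value, col = time)) +
  geom_point(alpha = 0.2) +
  geom_line(aes(y = roll))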

How to set the alpha based on a column?

Consider this example:
library(dplyr)
library(sf)
library(tmap)
d <- data_frame(one = c(1,1,2,1,1,1,1),
two = c(1,1,2,1,1,1,1))
std <- st_as_sf(d, coords = c('one', 'two'))
std %>% tm_shape() + tm_bubbles(alpha = 0.3)
You can see that point (1, 1) is darker because it appears 6 times in the data. Therefore, thanks to the alpha blending, these points add up.
My problem is that I cannot store the dataset as is. What I have is only an aggregated version, like:
d_agg <- d %>% group_by(one, two) %>%
summarize(count = n()) %>%
ungroup()
# A tibble: 2 x 3
one two count
<dbl> <dbl> <int>
1 1 1 6
2 2 2 1
How can I reproduce the same exact chart as before, using d_agg and the corresponding count variable?
Of course, re-creating the initial dataframe above is not a feasible solution because I have too many points (and some points are repeated too many times).
Just using the following (where std_agg is the sf version of d_agg):
std_agg %>% tm_shape() + tm_bubbles(col = 'count', alpha = 0.3)
does not work
Unfortunately, alpha is not (yet) an aesthetic, so it is not possible to do alpha = "count".
My question: do you really need alpha? Probably not, if you do not use the colour aesthetic. In that case, your approach of using colour to mimic alpha transparency is actually good; it just needs a little configuration:
std_agg %>% tm_shape() + tm_bubbles(col = 'count', style = "cont",
palette = "Greys", contrast = c(.3, .7), legend.col.show = FALSE)
Here I show how to recreate your data frame d using dplyr. Although it does not address your question about how to pass a numeric value to the alpha argument of tm_bubbles, consider it a workaround.
std_agg <- d_agg %>%
slice(rep(row_number(), times = count)) %>%
st_as_sf(coords = c('one', 'two'))
std_agg %>%
tm_shape() +
tm_bubbles(alpha = 0.3)
In fact, this base R way of expanding the data frame is probably more intuitive:
d_agg[rep(1:nrow(d_agg), times = d_agg$count), ]
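For completeness, a sketch of plugging that base R expansion into the same sf/tmap pipeline (d_expanded is just a throwaway name):
d_expanded <- d_agg[rep(1:nrow(d_agg), times = d_agg$count), ]
st_as_sf(d_expanded, coords = c('one', 'two')) %>%
  tm_shape() +
  tm_bubbles(alpha = 0.3)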

ggplot2: comparing 2 groups through the fraction of their members

Let's say we have 10,000 users classified into 2 groups: lvl beginner and lvl pro.
Every user has a rank, going from 1 to 20.
The df:
# beginers
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- tbl_df(df) %>% group_by(lvl, rank) %>% mutate(count = n())
Problem 1:
I need a bar plot comparing each group side by side, but instead of giving me counts, I need percentages, so the bars from each group will have the same max height (100%).
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing each group, so we will have 2 lines, but instead of counts I need percentages, so the lines from each group will have the same max height (100%).
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2. A user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 having 100% of users,
rank 2 will have something less, rank 3 even less and so on.
I got all this done in Tableau, but I really dislike it and want to show myself that R can handle all of this.
Thank you!
Three problems, three solutions:
problem 1 - calculate percentage and use geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>% # calculate percentage
ggplot(., aes(x = rank, y = count_perc))+
geom_col(aes(fill = lvl), position = 'dodge')
problem 2 - pretty much the same as problem 1 except use geom_line instead of geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>%
ggplot(., aes(x = rank, y = count_perc))+
geom_line(aes(colour = lvl))
problem 3 - make use of arrange and cumsum
df %>%
group_by(lvl, rank) %>%
summarise(count = n()) %>% # count by level and rank
group_by(lvl) %>%
arrange(desc(rank)) %>% # sort descending
mutate(cumulative_count = cumsum(count)) %>% # use cumsum
mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
ggplot(., aes(x = rank, y = cumulative_count_perc))+
geom_line(aes(colour = lvl))
