I have data of sales by year and model, which is visualized via Sankey chart. Now I am struggling to handle 2 issue:
Firstly I need to set model B always on the bottom of chaty regardless its value over the years.
When I re-visualize ggplot via ggplotly hover does not show sales or years
code:
df <- data.frame (model = c("A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J"),
Year = c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020),
sales = c(450,678,456,344,984,456,234,244,655,789,234,567,234,567,232,900,1005,1900,450,345,567,235,456,345,144,333,555,777,111,444,222,223,445,776,331,788,980,1003,456,434,345,2222,3456,456,678,8911,4560,4567,4566,5555,6666,7777,8888,1233,1255,5677,3411,2344,6122,4533))
install.packages("remotes")
#remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)
library(tidyverse)
plot <- ggplot(df, aes(x = Year,
node = model,
fill = model,
value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16)
ggplotly(plot)
I'm absolutely certain that there is a better way, but it took me a while to get it working. I think this is what you were looking for.
I started with the ggplot and ggplotly objects that you have here. The primary purpose of this initial plot is to capture the colors. (I could have captured them a few different ways, but this was already done for me in your plot.)
Update ** I've modified the two elements you requested
library(ggsankey)
library(tidyverse)
library(plotly)
# df from the question is unchanged
# visualize the original
(plot <- ggplot(df,
aes(Year, node = model, fill = model, value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial",
color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16))
ggplotly(plot) -> plp
plp
#-------- colors --------
# collect the 10 colors
cols <- map_dfr(1:10, function(k){
nm <- plp$x$data[[k]]$name
filler <- plp$x$data[[k]]$fillcolor
c(nm = nm, filler = filler)
})
Then I divided the contents of the B model into 10 groups to ensure it was always the smallest bump. This allowed me to collect the stacked values for all of the other models, which is needed to push B to the bottom.
#-------------- splitting B -------------
df1 <- df %>% filter(model != "B") %>%
arrange(Year, sales)
df2 <- df %>% filter(model == "B") %>% # this gets used further down
arrange(Year)
# split B into 10 groups - keep on the bottom, then join the groups
# make the groups
ng <- vector(length = 10)
invisible(
map(1:10,
function(i) {
ng[i] <<- rep("B", i) %>% paste0(collapse = "")
})
)
# add values for these groups by year
df4 <- data.frame(Year = rep(unique(df$Year), each = 10),
model = rep(ng, length(unique(df$Year))),
sales = rep(df2$sales/10, each = 10))
df5 <- rbind(df1, df4)
Recreate the Sankey bump with 10 subsections of model B. Everything that follows works with this plot.
#-------------- plotly after dividing B -------------
(nplt <- ggplot(df5, aes(x = Year, node = model, fill = model, value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial",
color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16))
ggplotly(nplt) -> plt
plt
Create a Sankey bump with JUST B, to capture data that represents model B at the bottom. Use this data to substitute all of the traces that represent B in the object plt. The colors get fixed here, as well. (The original 10 colors from the first plot.) Lastly, the hoverinfo gets removed. That will get fixed next.
#-------------- get values for B at the bottom -------------
df %>% filter(model == "B") %>%
ggplot(aes(x = Year,
node = model,
fill = model,
value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16) -> bplt
ggplotly(bplt) -> bplotly
bplotly
#------- take divided B and remove all but one trace for B --------
# xx <- plt$x$data
# plt$x$data <- xx[c(1:2, 12:length(xx))] # keep only one B trace
#---------------- adjustments to plt's build --------------------
# change out data for the B trace, add the right colors
wh <- vector(length = 0)
invisible(
map(1:length(plt$x$data),
function(j) {
nm <- plt$x$data[[j]]$name
plt$x$data[[j]]$hoverinfo <<- "none"
plt$x$data[[j]]$fillcolor <<- unlist(cols[cols$nm == nm, "filler"],
use.names = F)
if(str_detect(nm, "^B$")){
plt$x$data[[j]]$x <<- bplotly$x$data[[1]]$x
plt$x$data[[j]]$y <<- bplotly$x$data[[1]]$y
}
if(str_detect(nm, "BB")) {
wh[length(wh) + 1] <<- j # list of unnecessary traces (extra B groups)
}
})
)
#----- take divided B and remove all but one trace for B ------
plt$x$data <- plt$x$data[-c(wh)] # <------ forget this line when updated last time
# visualize Sankey bump with B at the bottom
plt
The Plotly object is basically 10 globs of color, there is no separation between years in the background. So if you add a tooltip to this as it is, there can be only one...
To get the tooltips you're looking for, I created another trace (well, 10, actually—1 for each model). In order to get the right values (because the sales data isn't in the 50K range), I used the data in plt to create a new data frame.
#--------------- collect values for hovertext positions ----------
x <- plt$x$data[[1]]$x
inds <- which(x %in% 2015:2020, arr.ind = T)
yrs <- x[inds]
tellMe <- invisible(
map(1:length(plt$x$data),
function(m) {
y <- plt$x$data[[m]]$y
y[inds]
}) %>% setNames(sort(unique(df$model))) %>% # changed from LETTERS[1:10]
as.data.frame() %>%
mutate(yr = yrs %>% as.integer()) %>%
pivot_longer(names_to = "model", values_to = "sales",
cols = sort(unique(df$model))) %>%
distinct() %>%
group_by(yr, model) %>%
summarise(val = mean(sales)) %>%
left_join(df, by = c("yr" = "Year", "model" = "model")) %>%
as.data.frame() # drop groups
)
#-------------- create data trace for hovertext --------------
plot_ly(tellMe, x = ~yr, y = ~val, split = ~model,
customdata = ~sales, text = ~model,
line = list(width = .01, shape = "spline", smoothing = 1.3),
hovertemplate = "Year: %{x}<br>Model: %{text}<br>Sales: %{customdata}<extra></extra>",
type = "scatter", mode = "lines", showlegend = F) -> pp2
pp2
If you look at the plot here, it looks blank. That's because of how small the lines are. This is intentional. You don't want lines on your graph.
Fix the colors, so that the hoverlabel background colors match the legend colors.
# change colors to match sankey
pp2 <- plotly_build(pp2)
invisible(
map(1:10,
function(z) {
nm <- pp2$x$data[[z]]$name
# collect and assign the color
cr <- unlist(cols[cols$nm == nm, "filler"], use.names = F)
pp2$x$data[[z]]$line$color <<- cr
})
)
Using subplot here didn't work. Plotly gave me an error when I tried adding a trace, whether all at once or even one for each model. So I forced the traces together.
#-------------- consolidate the traces (subplot won't work) -----------
# collect data one more time!
dx <- plt$x$data
yx <- pp2$x$data
yx <- append(yx, dx) # put plt on top
# replace data
plt$x$data <- yx
# lines are small, increase the distance searched for matches
plt %>% layout(hoverdistance = 40)
The final product:
I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm, hover this seems to not work with qqplot, as simply half the points is colored in either color, regardless of group. Is there a way to fix this?
A simulated some data just to shown how to add color in ggplot
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerrical
# (else color skale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
I would like to use gganimate to:
Graph two separate curved lines with geom_path
Call a function that performs a calculation with the data from those lines and returns a single coordinate (x, y)
Plot that coordinate as a geom_point
Move the lines around, with the geom_point updating as the lines move
This is simple if the movement is such that the single (x, y) coordinate moves linearly (just calculate it at each stage ahead of time and then animate it, it will move linearly from each stage to the next), but if it's not I'm not sure what to do. If I call a function within aes(), which seems like the natural solution, it seems to calculate it once at the beginning and then not update it as the lines move.
Here is an example.
library(tidyverse)
library(gganimate)
# A function to find the x and y coordinate of the minimum y value of either set
min_of_both <- function(x1, y1, x2, y2) {
cm <- bind_rows(tibble(x = x1, y = y1),
tibble(x = x2, y = y2))
return(list(x = cm[which(cm$y == min(cm$y)),]$x,
y = min(cm$y)))
}
# Create two parabola paths, curve A which moves downwards from t = 1 to t = 2
curveA <- tibble(xA = -50:50/10, yA = 5+(-50:50/10)^2, t = 1) %>%
bind_rows(tibble(xA = -50:50/10, yA = -10 + (-50:50/10)^2, t = 2))
# And curve B which is static in both time 1 and 2
curveB <- tibble(xB = -50:50/10, yB = 1 + (-30:70/10)^2)
data <- curveB %>%
bind_rows(curveB) %>%
bind_cols(curveA)
# Plot Curve A
p <- ggplot(data, aes(x = xA, y = yA)) +
geom_path(color = 'red') +
# And Curve B
geom_path(aes(x=xB,y=yB), color = 'blue')+
# Then plot a single point that uses both curves as input
# Note I also get problems if trying to run the function through data= instead of mapping=
# or if I define two separate functions, one for x and one for y, so as to avoid $
geom_point(aes(
x = min_of_both(xA,yA,xB,yB)$x,
y = min_of_both(xA,yA,xB,yB)$y),
size = 3,
color = 'black') +
theme_minimal()+
transition_states(t)+
ease_aes('sine-in-out')
animate(p)
This results in (not sure if the animation will play on StackOverflow but the parabola does indeed move):
The black dot is intended to mark the lowest y-coordinate on either parabola at each moment, but instead it marks the lowest y-coordinate on either parabola at any point in the animation (at the end).
Any tips appreciated.
After a lot of head-scratching I think I've understood your point and have found one solution. The best way forward might be to manually tween the paths and calculate the min values using your function whilst grouping by .frame before plotting:
# Same curve setup, but labelling points for grouping later
curveA <- tibble(xA = -50:50/10,
yA = 5+(-50:50/10)^2,
point = 1:101,
t = 1) %>%
bind_rows(tibble(xA = -50:50/10,
yA = -10 + (-50:50/10)^2,
point = 1:101,
t = 2))
curveB <- tibble(xB = -50:50/10,
yB = 1 + (-30:70/10)^2,
point = 1:101,
t = 1)
A_frames <- curveA %>%
tween_along(ease = 'sine-in-out', 100, along = t, id = point) %>%
filter(.phase == "transition") %>%
select(xA, yA, point, .frame) %>%
arrange(.frame, point) # arrange by point needed to keep in order
B_frames <- curveB %>%
bind_rows(curveB %>% mutate(t = 2)) %>%
tween_along(ease = 'sine-in-out', 100, along = t, id = point) %>%
filter(.phase == "transition") %>%
select(xB, yB, point, .frame) %>%
arrange(.frame, point)
data <- A_frames %>%
left_join(B_frames, by = c(".frame", "point")) %>%
group_by(.frame) %>%
mutate(xmin = min_of_both(xA,yA,xB,yB)$x,
ymin = min_of_both(xA,yA,xB,yB)$y)
# Plot Curve A
p <- ggplot(data, aes(x = xA, y = yA)) +
geom_path(color = 'red') +
# And Curve B
geom_path(aes(x=xB,y=yB), color = 'blue')+
# Then plot a single point that uses both curves as input
# Note I also get problems if trying to run the function through data= instead of mapping=
# or if I define two separate functions, one for x and one for y, so as to avoid $
geom_point(aes(xmin, ymin),
size = 3,
color = 'black') +
theme_minimal()+
transition_states(.frame)+
ease_aes('sine-in-out')
animate(p, fps = 24)
I try to generate a heatmap based on values.
Here is my dataset which consists of three variables: Lat (latitude), Lon (longitude), and Value.
https://www.dropbox.com/s/s53xeplywz9jh15/sample_data.csv?dl=0
I have looked through the relevant posts and found this useful:
Generating spatial heat map via ggmap in R based on a value
I copied the code in that post and here my code looks like:
# import data and libaries
library(ggplot2)
library(ggmap)
Yunan<-read.csv("C:\\Program Files\\RStudio\\data\\pb_sp\\sample_data.csv", header = TRUE)
# call the map to see point distribution
Yunan_map<-get_map(location="yunan",zoom=6,maptype="terrain",scale=2)
ggmap(Yunan_map)+geom_point(data=Yunan,aes(x=Yunan$Lon,y=Yunan$Lat,fill="red",alpha=0.3,size=0.05,shape=21))+scale_shape_identity()
# 1. generate bins for x, y coordinates (unit=decimal degree)
xbreaks <- seq(floor(min(Yunan$Lat,na.rm=TRUE)), ceiling(max(Yunan$Lat,na.rm=TRUE)), by = 0.5)
ybreaks <- seq(floor(min(Yunan$Lon,na.rm=TRUE)), ceiling(max(Yunan$Lon,na.rm=TRUE)), by = 0.5)
# 2. allocate the data points into the bins
Yunan$latbin <- xbreaks[cut(Yunan$Lat, breaks = xbreaks, labels=F)]
Yunan$longbin <- ybreaks[cut(Yunan$Lon, breaks = ybreaks, labels=F)]
# 3. summarise the data for each bin (use the median)
datamat <- Yunan[, list(Value= median(Value)),
by = c("latbin", "longbin" )]
# 4. Merge the summarised data with all possible x, y coordinate combinations to get
# a value for every bin
datamat <- merge(setDT(expand.grid(latbin = xbreaks, longbin = ybreaks)), datamat,
by = c("latbin", "longbin"), all.x = TRUE, all.y = FALSE)
# 5. Fill up the empty bins 0 to smooth the contour plot
datamat[is.na(Value), ]$Value <- 0
# 6. Plot the contours
ggmap(Yunan_map,extent ="device") +
stat_contour(data = datamat, aes(x = longbin, y = latbin, z = Value,
fill = ..level.., alpha = ..level..), geom = 'polygon', binwidth = 30) +
scale_fill_gradient(name = "Value", low = "green", high = "red") +
guides(alpha = FALSE)
However, I encountered two problems
After executing the step 3 (summarise the data for each bin), I got this error message:
Error in [.data.frame(Yunan, , list(Value = median(Value)), by = c("latbin", :
unused argument (by = c("latbin", "longbin"))
I wish to change the colour scheme from gradient to discrete colours, something like this map:
Since the values in my dataset range from 17 to 21, I want to classify them in to different bins such as 17-17.5, 17.5-18, 18-18.5.... with corresponding colours.
Any suggestions that I can fix these problems. Thanks in advance.
Here is some workable example of data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overal fequency of z across x.
unlike a typical cdf, the function should not reach 1.0, but instead
sum(df1$z)/length(df1$z)
a ymax of 0.36 (721/2000).
using ggplot2 we can create a cdf of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But i want to extend this plot to show the cumulative percentage of z (as a function of 'x')
The end result should like like
EDIT
with some very poor data manipulation I am able to generate the something similiar to a cdf plot, but there must be a more beautiful and easy method using various packages and ggplot
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))