Adding layer of interpolated values to ggplot chart in R

I have created the following dataframe in R to generate a plot using ggplot
library(data.table)
library(ggplot2)
library(plotly)
df <- data.frame("X_Frequency" = c(5, 10, 55, 180, 300, 360, 1000, 2000)
, "X_Axis" = c(0.009185742, 0.207822221, 0.067542222, 0.002597778,
0.002597778, 0.001454756, 0.001454756 , 0.001454756))
Next I have generated a plot using ggplot
B <- ggplot(data = df,
mapping = aes(x = X_Frequency, y = X_Axis)) +
geom_line() + labs(x = "Frequency(Hz)", y="Axis")
B <- ggplotly(B, dynamicTicks = TRUE)###Hovering enabled
B <- layout(B, yaxis = list(type = "log"))##X Y log scales enabled
B <- layout(B, xaxis = list(type = "log"))
B
I have created the following dataframe df241 with interpolated values between various observations in df1. First we create the slopes
df$X_Slope2 <- 0### Initiate slope column
for(i in 2:nrow(df)){
df$X_Slope2[i] = (df$X_Axis[i] - df$X_Axis[i-1]) /
(df$X_Frequency[i] - df$X_Frequency[i - 1])
}
Next we assign the respective slopes to all values
df_new <- bind_cols(df %>%
select(X_Frequency, X_Axis, X_Slope2) %>%
complete(., expand(., X_Frequency = 5:2000))
Now we calculate the interpolated values of X-Frequency, X_Axis from the df_new using slopes
for(i in 1: nrow(df241)){
if(is.na(df241$X_Axis[i]) == T){
df241$X_Axis[i] = df241$X_Slope2[i] *
(df241$X_Frequency[i] - df241$X_Frequency[i-1]) +
df241$X_Axis[i-1] } else {
df241$X_Axis[i] = df241$X_Axis[i]}}
I want to place these interpolated values from df241 on the original chart B generated above. How can this be accomplished?
Note: I have tried generating a new plot based on the df_new dataframe, but the chart looks very different from the original chart B.

It might be simpler to use the approx function for your interpolation. I believe this gets a similar result as your interpolation steps.
library(dplyr) # for %>%, as_tibble() and rename()
# approx() returns a list with elements x and y
df_interp <- approx(df$X_Frequency, df$X_Axis, xout = 5:2000) %>%
as_tibble() %>%
rename(X_Frequency = x, X_Axis = y)
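As a quick sanity check of the claim that approx reproduces the slope-based steps, the sketch below interpolates a single frequency by hand and compares it with approx. The frequency 30 Hz and the names x0, manual and via_approx are chosen here purely for illustration; only base R is used.

```r
# Data from the question
df <- data.frame(
  X_Frequency = c(5, 10, 55, 180, 300, 360, 1000, 2000),
  X_Axis = c(0.009185742, 0.207822221, 0.067542222, 0.002597778,
             0.002597778, 0.001454756, 0.001454756, 0.001454756)
)

# Manual linear interpolation at one frequency (30 Hz lies between 10 and 55)
x0 <- 30
i <- findInterval(x0, df$X_Frequency)  # index of the observation to the left
slope <- (df$X_Axis[i + 1] - df$X_Axis[i]) /
  (df$X_Frequency[i + 1] - df$X_Frequency[i])
manual <- df$X_Axis[i] + slope * (x0 - df$X_Frequency[i])

# approx() performs the same piecewise-linear calculation
via_approx <- approx(df$X_Frequency, df$X_Axis, xout = x0)$y

all.equal(manual, via_approx)
```

Passing a vector such as 5:2000 as xout simply repeats this calculation for every requested frequency.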
A linear interpolation may look unexpected on a log-log scale. I was unable to run your code as provided (is df241 created somewhere?), so I'm not sure if this is what you encountered when you said the chart with the interpolated values appears very different.
B <- ggplot(data = df,
mapping = aes(x = X_Frequency, y = X_Axis)) +
geom_line() +
geom_point(data = df_interp, size = 0.1, color = "blue") +
labs(x = "Frequency(Hz)", y="Axis")
B <- ggplotly(B, dynamicTicks = TRUE)###Hovering enabled
B <- layout(B, yaxis = list(type = "log"))##X Y log scales enabled
B <- layout(B, xaxis = list(type = "log"))
B
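To see why linearly interpolated points can look unexpected on log-log axes, the following sketch takes two neighbouring observations from the question and measures the slope between consecutive interpolated points in log-log coordinates. The intermediate frequencies in xs are arbitrary values picked for illustration.

```r
# Two neighbouring observations from the question
x <- c(10, 55)
y <- c(0.207822221, 0.067542222)

# Linear interpolation at a few frequencies in between
xs <- c(10, 20, 30, 40, 55)
lin <- approx(x, y, xout = xs)$y

# Slope between consecutive points, measured on log-log axes
loglog_slope <- diff(log(lin)) / diff(log(xs))
loglog_slope
```

A segment that is straight on log-log axes would give the same slope everywhere. Here the slope changes from one pair of points to the next, so the linearly interpolated points trace a curve between the two observations rather than the straight segment that geom_line appears to draw once the axes are switched to log.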
Edit: interpolation on log scale
Alternatively, you could interpolate using log-transformed inputs, and then use exp to convert back onto the original scale:
df_interp <- approx(log(df$X_Frequency), log(df$X_Axis), xout = log(5:2000)) %>%
as_tibble() %>%
mutate(X_Frequency = exp(x),
X_Axis = exp(y))
Which would result in this:

Related

R: How to customize Sankey plot in ggplotly?

I have data of sales by year and model, which is visualized via a Sankey chart. Now I am struggling to handle two issues:
Firstly, I need to keep model B always at the bottom of the chart, regardless of its value over the years.
Secondly, when I re-visualize the ggplot via ggplotly, the hover does not show sales or years.
code:
df <- data.frame (model = c("A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J"),
Year = c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020),
sales = c(450,678,456,344,984,456,234,244,655,789,234,567,234,567,232,900,1005,1900,450,345,567,235,456,345,144,333,555,777,111,444,222,223,445,776,331,788,980,1003,456,434,345,2222,3456,456,678,8911,4560,4567,4566,5555,6666,7777,8888,1233,1255,5677,3411,2344,6122,4533))
install.packages("remotes")
#remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)
library(tidyverse)
plot <- ggplot(df, aes(x = Year,
node = model,
fill = model,
value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16)
ggplotly(plot)
I'm absolutely certain that there is a better way, but it took me a while to get it working. I think this is what you were looking for.
I started with the ggplot and ggplotly objects that you have here. The primary purpose of this initial plot is to capture the colors. (I could have captured them a few different ways, but this was already done for me in your plot.)
Update ** I've modified the two elements you requested
library(ggsankey)
library(tidyverse)
library(plotly)
# df from the question is unchanged
# visualize the original
(plot <- ggplot(df,
aes(Year, node = model, fill = model, value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial",
color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16))
ggplotly(plot) -> plp
plp
#-------- colors --------
# collect the 10 colors
cols <- map_dfr(1:10, function(k){
nm <- plp$x$data[[k]]$name
filler <- plp$x$data[[k]]$fillcolor
c(nm = nm, filler = filler)
})
Then I divided the contents of the B model into 10 groups to ensure it was always the smallest bump. This allowed me to collect the stacked values for all of the other models, which is needed to push B to the bottom.
#-------------- splitting B -------------
df1 <- df %>% filter(model != "B") %>%
arrange(Year, sales)
df2 <- df %>% filter(model == "B") %>% # this gets used further down
arrange(Year)
# split B into 10 groups - keep on the bottom, then join the groups
# make the groups
ng <- vector(length = 10)
invisible(
map(1:10,
function(i) {
ng[i] <<- rep("B", i) %>% paste0(collapse = "")
})
)
# add values for these groups by year
df4 <- data.frame(Year = rep(unique(df$Year), each = 10),
model = rep(ng, length(unique(df$Year))),
sales = rep(df2$sales/10, each = 10))
df5 <- rbind(df1, df4)
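The splitting step can also be written without <<-. The sketch below is one possible alternative, not the code used in this answer: it builds the same "B", "BB", ... labels with base R's strrep and spreads each year's sales evenly over the slices. The names ng_alt, b_sales and b_split are made up for this illustration, and only the first three years of B's sales from the question are used.

```r
# Group names "B", "BB", ..., one per slice (same labels as the loop above)
ng_alt <- strrep("B", 1:10)

# First three years of model B's sales from the question's data
b_sales <- data.frame(Year = 2015:2017, sales = c(678, 567, 235))

# Each year contributes 10 rows, each carrying a tenth of that year's sales
b_split <- data.frame(
  Year  = rep(b_sales$Year, each = length(ng_alt)),
  model = rep(ng_alt, times = nrow(b_sales)),
  sales = rep(b_sales$sales / length(ng_alt), each = length(ng_alt))
)

# The slices add back up to the original yearly totals
tapply(b_split$sales, b_split$Year, sum)
```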
Recreate the Sankey bump with 10 subsections of model B. Everything that follows works with this plot.
#-------------- plotly after dividing B -------------
(nplt <- ggplot(df5, aes(x = Year, node = model, fill = model, value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial",
color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16))
ggplotly(nplt) -> plt
plt
Create a Sankey bump with JUST B, to capture data that represents model B at the bottom. Use this data to substitute all of the traces that represent B in the object plt. The colors get fixed here, as well. (The original 10 colors from the first plot.) Lastly, the hoverinfo gets removed. That will get fixed next.
#-------------- get values for B at the bottom -------------
df %>% filter(model == "B") %>%
ggplot(aes(x = Year,
node = model,
fill = model,
value = sales)) +
geom_sankey_bump(space = 0, type = "alluvial", color = "transparent", smooth = 15) +
scale_fill_viridis_d(option = "A", alpha = .8) +
theme_sankey_bump(base_size = 16) -> bplt
ggplotly(bplt) -> bplotly
bplotly
#------- take divided B and remove all but one trace for B --------
# xx <- plt$x$data
# plt$x$data <- xx[c(1:2, 12:length(xx))] # keep only one B trace
#---------------- adjustments to plt's build --------------------
# change out data for the B trace, add the right colors
wh <- vector(length = 0)
invisible(
map(1:length(plt$x$data),
function(j) {
nm <- plt$x$data[[j]]$name
plt$x$data[[j]]$hoverinfo <<- "none"
plt$x$data[[j]]$fillcolor <<- unlist(cols[cols$nm == nm, "filler"],
use.names = F)
if(str_detect(nm, "^B$")){
plt$x$data[[j]]$x <<- bplotly$x$data[[1]]$x
plt$x$data[[j]]$y <<- bplotly$x$data[[1]]$y
}
if(str_detect(nm, "BB")) {
wh[length(wh) + 1] <<- j # list of unnecessary traces (extra B groups)
}
})
)
#----- take divided B and remove all but one trace for B ------
plt$x$data <- plt$x$data[-c(wh)] # <------ forgot this line in the last update
# visualize Sankey bump with B at the bottom
plt
The Plotly object is basically 10 globs of color, there is no separation between years in the background. So if you add a tooltip to this as it is, there can be only one...
To get the tooltips you're looking for, I created another trace (well, 10, actually—1 for each model). In order to get the right values (because the sales data isn't in the 50K range), I used the data in plt to create a new data frame.
#--------------- collect values for hovertext positions ----------
x <- plt$x$data[[1]]$x
inds <- which(x %in% 2015:2020, arr.ind = T)
yrs <- x[inds]
tellMe <- invisible(
map(1:length(plt$x$data),
function(m) {
y <- plt$x$data[[m]]$y
y[inds]
}) %>% setNames(sort(unique(df$model))) %>% # changed from LETTERS[1:10]
as.data.frame() %>%
mutate(yr = yrs %>% as.integer()) %>%
pivot_longer(names_to = "model", values_to = "sales",
cols = sort(unique(df$model))) %>%
distinct() %>%
group_by(yr, model) %>%
summarise(val = mean(sales)) %>%
left_join(df, by = c("yr" = "Year", "model" = "model")) %>%
as.data.frame() # drop groups
)
#-------------- create data trace for hovertext --------------
plot_ly(tellMe, x = ~yr, y = ~val, split = ~model,
customdata = ~sales, text = ~model,
line = list(width = .01, shape = "spline", smoothing = 1.3),
hovertemplate = "Year: %{x}<br>Model: %{text}<br>Sales: %{customdata}<extra></extra>",
type = "scatter", mode = "lines", showlegend = F) -> pp2
pp2
If you look at the plot here, it looks blank. That's because of how small the lines are. This is intentional. You don't want lines on your graph.
Fix the colors, so that the hoverlabel background colors match the legend colors.
# change colors to match sankey
pp2 <- plotly_build(pp2)
invisible(
map(1:10,
function(z) {
nm <- pp2$x$data[[z]]$name
# collect and assign the color
cr <- unlist(cols[cols$nm == nm, "filler"], use.names = F)
pp2$x$data[[z]]$line$color <<- cr
})
)
Using subplot here didn't work. Plotly gave me an error when I tried adding a trace, whether all at once or even one for each model. So I forced the traces together.
#-------------- consolidate the traces (subplot won't work) -----------
# collect data one more time!
dx <- plt$x$data
yx <- pp2$x$data
yx <- append(yx, dx) # put plt on top
# replace data
plt$x$data <- yx
# lines are small, increase the distance searched for matches
plt %>% layout(hoverdistance = 40)
The final product:

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of a beta distribution) from a dataset that includes two groups. To show which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half of the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
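The reason the col vector looks misaligned in base qqplot is that qqplot sorts both inputs before plotting, while the color vector stays in the original row order. The sketch below shows the same reordering idea as g[order(x)] using only base graphics. The data are simulated, and the shape parameters 2 and 5 are assumed to be known here instead of being estimated with beta.mle.

```r
set.seed(1)
# Simulated values and a pre-defined two-level group (illustrative only)
x <- rbeta(200, shape1 = 2, shape2 = 5)
g <- factor(sample(c("Group1", "Group2"), 200, replace = TRUE))

# Theoretical quantiles (already in increasing order)
theo <- qbeta(ppoints(length(x)), 2, 5)

# Sort the data and carry the group labels along with it
ord <- order(x)
plot(theo, x[ord], col = g[ord],
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = levels(g), col = seq_along(levels(g)), pch = 1)

# qqplot() would place the points at the same coordinates
qq <- qqplot(theo, x, plot.it = FALSE)
all.equal(qq$y, x[ord])
```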

Use gganimate to display calculation of tweened data

I would like to use gganimate to:
Graph two separate curved lines with geom_path
Call a function that performs a calculation with the data from those lines and returns a single coordinate (x, y)
Plot that coordinate as a geom_point
Move the lines around, with the geom_point updating as the lines move
This is simple if the movement is such that the single (x, y) coordinate moves linearly (just calculate it at each stage ahead of time and then animate it, it will move linearly from each stage to the next), but if it's not I'm not sure what to do. If I call a function within aes(), which seems like the natural solution, it seems to calculate it once at the beginning and then not update it as the lines move.
Here is an example.
library(tidyverse)
library(gganimate)
# A function to find the x and y coordinate of the minimum y value of either set
min_of_both <- function(x1, y1, x2, y2) {
cm <- bind_rows(tibble(x = x1, y = y1),
tibble(x = x2, y = y2))
return(list(x = cm[which(cm$y == min(cm$y)),]$x,
y = min(cm$y)))
}
# Create two parabola paths, curve A which moves downwards from t = 1 to t = 2
curveA <- tibble(xA = -50:50/10, yA = 5+(-50:50/10)^2, t = 1) %>%
bind_rows(tibble(xA = -50:50/10, yA = -10 + (-50:50/10)^2, t = 2))
# And curve B which is static in both time 1 and 2
curveB <- tibble(xB = -50:50/10, yB = 1 + (-30:70/10)^2)
data <- curveB %>%
bind_rows(curveB) %>%
bind_cols(curveA)
# Plot Curve A
p <- ggplot(data, aes(x = xA, y = yA)) +
geom_path(color = 'red') +
# And Curve B
geom_path(aes(x=xB,y=yB), color = 'blue')+
# Then plot a single point that uses both curves as input
# Note I also get problems if trying to run the function through data= instead of mapping=
# or if I define two separate functions, one for x and one for y, so as to avoid $
geom_point(aes(
x = min_of_both(xA,yA,xB,yB)$x,
y = min_of_both(xA,yA,xB,yB)$y),
size = 3,
color = 'black') +
theme_minimal()+
transition_states(t)+
ease_aes('sine-in-out')
animate(p)
This results in (not sure if the animation will play on StackOverflow but the parabola does indeed move):
The black dot is intended to mark the lowest y-coordinate on either parabola at each moment, but instead it marks the lowest y-coordinate on either parabola at any point in the animation (at the end).
Any tips appreciated.
After a lot of head-scratching I think I've understood your point and have found one solution. The best way forward might be to manually tween the paths and calculate the min values using your function whilst grouping by .frame before plotting:
# Same curve setup, but labelling points for grouping later
curveA <- tibble(xA = -50:50/10,
yA = 5+(-50:50/10)^2,
point = 1:101,
t = 1) %>%
bind_rows(tibble(xA = -50:50/10,
yA = -10 + (-50:50/10)^2,
point = 1:101,
t = 2))
curveB <- tibble(xB = -50:50/10,
yB = 1 + (-30:70/10)^2,
point = 1:101,
t = 1)
A_frames <- curveA %>%
tween_along(ease = 'sine-in-out', 100, along = t, id = point) %>%
filter(.phase == "transition") %>%
select(xA, yA, point, .frame) %>%
arrange(.frame, point) # arrange by point needed to keep in order
B_frames <- curveB %>%
bind_rows(curveB %>% mutate(t = 2)) %>%
tween_along(ease = 'sine-in-out', 100, along = t, id = point) %>%
filter(.phase == "transition") %>%
select(xB, yB, point, .frame) %>%
arrange(.frame, point)
data <- A_frames %>%
left_join(B_frames, by = c(".frame", "point")) %>%
group_by(.frame) %>%
mutate(xmin = min_of_both(xA,yA,xB,yB)$x,
ymin = min_of_both(xA,yA,xB,yB)$y)
# Plot Curve A
p <- ggplot(data, aes(x = xA, y = yA)) +
geom_path(color = 'red') +
# And Curve B
geom_path(aes(x=xB,y=yB), color = 'blue')+
# Then plot a single point that uses both curves as input
# Note I also get problems if trying to run the function through data= instead of mapping=
# or if I define two separate functions, one for x and one for y, so as to avoid $
geom_point(aes(xmin, ymin),
size = 3,
color = 'black') +
theme_minimal()+
transition_states(.frame)+
ease_aes('sine-in-out')
animate(p, fps = 24)
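The idea of computing the point once per frame can be illustrated without gganimate. The sketch below is a simplified stand-in for the tweening above: it assumes a plain linear blend between the two states of curve A (not the 'sine-in-out' easing) and uses which.min to pick the lowest point of both curves in each frame. The names frames and lowest are invented for this example.

```r
# Curves from the question: A moves from 5 + x^2 to -10 + x^2, B is static
x <- (-50:50) / 10
yA_start <- 5 + x^2
yA_end <- -10 + x^2
yB <- 1 + ((-30:70) / 10)^2

# Tween position of each frame, from 0 (state t = 1) to 1 (state t = 2)
frames <- (0:15) / 15

# For every frame, blend curve A linearly and find the lowest point overall
lowest <- do.call(rbind, lapply(frames, function(f) {
  yA <- (1 - f) * yA_start + f * yA_end
  xs <- c(x, x)
  ys <- c(yA, yB)
  i <- which.min(ys)
  data.frame(frame = f, x = xs[i], y = ys[i])
}))

lowest
```

In this setup the lowest point sits on curve B until curve A drops below it and then switches to the vertex of curve A, so its path is a jump and not a straight glide. That is why the point has to be recomputed for each frame.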

Creating a heatmap based on values in R

I am trying to generate a heatmap based on values.
Here is my dataset which consists of three variables: Lat (latitude), Lon (longitude), and Value.
https://www.dropbox.com/s/s53xeplywz9jh15/sample_data.csv?dl=0
I have looked through the relevant posts and found this useful:
Generating spatial heat map via ggmap in R based on a value
I copied the code from that post, and here is what my code looks like:
# import data and libraries
library(ggplot2)
library(ggmap)
Yunan<-read.csv("C:\\Program Files\\RStudio\\data\\pb_sp\\sample_data.csv", header = TRUE)
# call the map to see point distribution
Yunan_map<-get_map(location="yunan",zoom=6,maptype="terrain",scale=2)
ggmap(Yunan_map)+geom_point(data=Yunan,aes(x=Yunan$Lon,y=Yunan$Lat,fill="red",alpha=0.3,size=0.05,shape=21))+scale_shape_identity()
# 1. generate bins for x, y coordinates (unit=decimal degree)
xbreaks <- seq(floor(min(Yunan$Lat,na.rm=TRUE)), ceiling(max(Yunan$Lat,na.rm=TRUE)), by = 0.5)
ybreaks <- seq(floor(min(Yunan$Lon,na.rm=TRUE)), ceiling(max(Yunan$Lon,na.rm=TRUE)), by = 0.5)
# 2. allocate the data points into the bins
Yunan$latbin <- xbreaks[cut(Yunan$Lat, breaks = xbreaks, labels=F)]
Yunan$longbin <- ybreaks[cut(Yunan$Lon, breaks = ybreaks, labels=F)]
# 3. summarise the data for each bin (use the median)
datamat <- Yunan[, list(Value= median(Value)),
by = c("latbin", "longbin" )]
# 4. Merge the summarised data with all possible x, y coordinate combinations to get
# a value for every bin
datamat <- merge(setDT(expand.grid(latbin = xbreaks, longbin = ybreaks)), datamat,
by = c("latbin", "longbin"), all.x = TRUE, all.y = FALSE)
# 5. Fill up the empty bins 0 to smooth the contour plot
datamat[is.na(Value), ]$Value <- 0
# 6. Plot the contours
ggmap(Yunan_map,extent ="device") +
stat_contour(data = datamat, aes(x = longbin, y = latbin, z = Value,
fill = ..level.., alpha = ..level..), geom = 'polygon', binwidth = 30) +
scale_fill_gradient(name = "Value", low = "green", high = "red") +
guides(alpha = FALSE)
However, I encountered two problems
After executing the step 3 (summarise the data for each bin), I got this error message:
Error in [.data.frame(Yunan, , list(Value = median(Value)), by = c("latbin", :
unused argument (by = c("latbin", "longbin"))
I wish to change the colour scheme from gradient to discrete colours, something like this map:
Since the values in my dataset range from 17 to 21, I want to classify them in to different bins such as 17-17.5, 17.5-18, 18-18.5.... with corresponding colours.
Any suggestions on how I can fix these problems? Thanks in advance.
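The error message shows that `[.data.frame` was dispatched, and the by = argument inside [ ] is data.table syntax, so step 3 cannot work on a plain data.frame as read by read.csv. As a rough sketch of both requested steps, the code below computes a per-bin median with base R's aggregate and then classifies the values into 0.5-wide classes with cut. The data frame pts is made-up data standing in for sample_data.csv, and the column name Value_class is invented for this example.

```r
set.seed(42)
# Made-up points standing in for the Lat/Lon/Value columns of sample_data.csv
pts <- data.frame(
  Lat   = runif(200, 22, 28),
  Lon   = runif(200, 98, 105),
  Value = runif(200, 17, 21)
)

# Snap each point to the lower-left corner of its 0.5-degree cell
pts$latbin  <- floor(pts$Lat / 0.5) * 0.5
pts$longbin <- floor(pts$Lon / 0.5) * 0.5

# Median per cell; aggregate() works on a plain data.frame
datamat <- aggregate(Value ~ latbin + longbin, data = pts, FUN = median)

# Discrete classes 17-17.5, 17.5-18, ..., 20.5-21
datamat$Value_class <- cut(datamat$Value,
                           breaks = seq(17, 21, by = 0.5),
                           include.lowest = TRUE)

table(datamat$Value_class)
```

A factor column like Value_class could then be mapped to fill and paired with a discrete scale such as scale_fill_manual, for instance in a geom_tile layer. Whether that suits the contour-style map in the question is not something the sketch settles.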

created a nested cdf that doesn't reach 1

Here is some workable example of data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overall frequency of z across x.
Unlike a typical CDF, the function should not reach 1.0, but should instead top out at
sum(df1$z)/length(df1$z)
that is, a ymax of 0.36 (721/2000).
using ggplot2 we can create a cdf of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But I want to extend this plot to show the cumulative percentage of z (as a function of 'x')
The end result should look like
EDIT
With some very poor data manipulation I am able to generate something similar to a CDF plot, but there must be a more elegant and easier method using various packages and ggplot
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))
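For readers who prefer to avoid extra packages, the same cumulative share can be sketched in base R. This is an alternative to the dplyr pipeline above, not a replacement for it. It rebuilds df1 as in the question and checks that the curve ends at mean(df1$z). The names z_by_x and cum_perc_z are chosen here for illustration.

```r
# Rebuild the example data as in the question
set.seed(123)
x <- sort(round(rweibull(n = 2000, shape = 2, scale = 10)))
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Number of z == 1 cases at each distinct x, then the running share of all rows
z_by_x <- tapply(df1$z, df1$x, sum)
cum_perc_z <- cumsum(z_by_x) / nrow(df1)

# Step plot with a horizontal line at the overall share of z == 1
plot(as.numeric(names(cum_perc_z)), cum_perc_z, type = "s",
     xlab = "x", ylab = "cumulative share of z")
abline(h = mean(df1$z), lty = 2)

# The last value of the curve equals the overall frequency of z
all.equal(unname(cum_perc_z[length(cum_perc_z)]), mean(df1$z))
```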
