Extracting and summarizing data from interactive histogram selection in R - r

I want to create an interactive histogram using plotly (or other package if better suited) in R from data similar to this example set:
test<-data.frame(sex=c("m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m"),weight=runif(80,5,9))
I want to show two overlayed histograms of weight distribution per sex with some summary statistics such as standard deviation, mean, number of samples, all per sex as well as globally.
Also I want to be able to make a selection preferably using a range slider or selection box while updating these summary statistics to the selection. Then I want to be able to add a variable to the original dataset to indicate if a sample is part of the selection.
Thanks for any help! Even if it's just pointing to a relevant online resource, I'm struggling to find one that tackles a similar problem.

@DataZhukov this is a revised answer based on your larger data sample. Per your reply I removed the side-by-side layout (think age pyramid) and show how to use {plotly} for histograms.
While {plotly} supports interactivity (zoom, hover, selection), it is based on the concept of a "static" HTML page. This means that no R code runs once the page is rendered, so nothing is recalculated for the user viewing the page.
For simple stats/summaries you can look into {crosstalk} & SummaryWidget to enable (some) "dynamic" update (i.e. client side calculations).
For a full fledged dynamic select/filter/recalculate type of interactivity, {shiny} is the way to go. (But that is another ballgame.)
{plotly} allows you to place text annotations "freely", by specifying the add_text() layer.
I construct this from your data. You can also just define it by hand in form of vectors.
If you use data frames as your input data structure, note that {plotly} uses the tilde notation (~) for the variable.
test<-data.frame(sex=c("m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m","m","m","f","f","m","m","f","m","f","m"),weight=runif(80,5,9))
library(dplyr)
library(plotly)
# calculate mean, sd, etc. based on the given data
# note: you can also define this by hand with simple vectors
total_stats <- test %>%
  summarise(SAMPLE = n(), MEAN_WEIGHT = mean(weight), SD = sd(weight)) %>%
  mutate(sex = "m+f")
group_stats <- test %>%
  group_by(sex) %>%
  summarise(SAMPLE = n(), MEAN_WEIGHT = mean(weight), SD = sd(weight))
# one text label per row: overall ("m+f") first, then one per sex
my_stats <- bind_rows(total_stats, group_stats) %>%
  mutate(LABEL = paste0(sex, " sample size: ", SAMPLE,
                        " with mean ", round(MEAN_WEIGHT, 2),
                        " and SD ", round(SD, 2)))
# format your text, e.g. font face and size ---- format to your liking
tf <- list(
family = "sans serif",
size = 11
)
The {plotly} call below draws the histogram with the bars per sex side by side (rather than overlapping) and adds a text layer to it.
test %>%
plot_ly() %>%
# ------------ plot histogram ----------------------
add_histogram( x = ~weight, color = ~sex
,nbinsx = 20 # set the number of bins you want/need
) %>%
# ------------ add annotation layer ---------------
## I provide x, y positions as vector, you could add and place
## each label as its own layer, i.e. add_text() call
add_text(data = my_stats
,x = c(5.2, 6,6.3), y = c(6, 5, 4.5)
,text = ~LABEL
,name = "" # left empty as we do not need to name the layer
,textfont = tf
,textposition = "right"
, showlegend = FALSE
) %>%
layout(yaxis = list(title =""))
This yields:
Obviously, you can freely define the x,y positions of your text annotations.
The default behaviour puts the count bars side by side. If you want to force the "overlaying" behaviour, you can plot 2 histograms and force these 2 graphical layers to overlay. For the latter, you need to set barmode in the layout() layer. I set an alpha transparency as well, as you may have overlapping counts in your data sample. Text placement, etc. follows the principles shown above.
# split test data frame in a male and female df
males <- test %>% filter(sex == "m")
fems <- test %>% filter(sex == "f")
plot_ly(
alpha = 0.5 # set alpha to ensure visibility on overlapping counts
, nbinsx = 20 # set number of bins
) %>%
#------------ add a histogram layer per group -------------------
add_histogram(data = males, x = ~weight, name = "male") %>%
add_histogram(data = fems, x = ~weight, name = "female") %>%
#------------ tweak layout --------------------------------------
layout(
barmode = "overlay" # to change side-by-side default to overlay
)
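The question also asks to recompute the statistics for a selection and to flag the selected samples in the original data. That calculation does not depend on any widget, so here is a minimal base R sketch of it. The helper name summarise_selection() and the fixed range 6 to 8 are assumptions for illustration; in a {shiny} app the two limits would come from a slider or from a plot selection instead.

```r
# Hypothetical helper: flag rows whose weight lies in [lo, hi] and
# summarise the flagged rows overall and per sex.
summarise_selection <- function(df, lo, hi) {
  # TRUE for samples that are part of the selection
  df$selected <- df$weight >= lo & df$weight <= hi
  sel <- df[df$selected, , drop = FALSE]

  stats_for <- function(w, label) {
    data.frame(sex = label, SAMPLE = length(w),
               MEAN_WEIGHT = mean(w), SD = sd(w))
  }
  groups  <- split(sel$weight, sel$sex)
  per_sex <- unname(Map(stats_for, groups, names(groups)))
  stats   <- do.call(rbind, c(list(stats_for(sel$weight, "m+f")), per_sex))
  rownames(stats) <- NULL

  # data: original rows plus the flag; stats: one row overall, one per sex
  list(data = df, stats = stats)
}

set.seed(1)  # made-up data with the same columns as the question's `test`
test <- data.frame(sex = rep(c("m", "f"), times = c(50, 30)),
                   weight = runif(80, 5, 9))

res <- summarise_selection(test, lo = 6, hi = 8)
res$stats        # statistics for the selection only
head(res$data)   # original data with the added `selected` column
```

The returned stats table has the same columns as my_stats above, so the same label-building step can be reused on it.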

Related

Create interactive bar chart with shared data filtered by time range

I want to create an interactive bar chart that lets users filter observations based on a range of values, and then renders counts per class for the selected time period dynamically. Since the filtered data needs to be available for numerous such graphs, I thought a combination of crosstalk and plotly/ggplot might prove valuable.
I attached a reprex further below that uses shared data and filtering functionality from crosstalk to allow for the dynamic filtering part. When I knit the document, the bar chart renders nicely as long as the full range of values is selected (default).
However, the plotting region becomes empty for any other, i.e. user-adjusted, range.
What exactly am I missing here? I assume there must be a difference between full and filtered shared datasets that ggplotly() cannot handle properly. Is there maybe another approach that I could follow to achieve my goal?
Here's the content of my .Rmd file:
---
title: mpg class counts filtered by time period
output: html_document
---
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(crosstalk)
library(plotly)
# Wrap data frame in SharedData
sd = SharedData$new(mpg)
# Create a filter input
filter_slider("Year", "Year", sd, column = ~ year, step = 1, width = 250)
# Render graph
bscols(
ggplotly(
ggplot(aes(x = class), data = sd) +
geom_bar()
)
)
```
I think this is due to a limitation described in the official Crosstalk documentation: "Crosstalk currently only works for linked brushing and filtering of views that show individual data points, not aggregate or summary views (where “observations” is defined as a single row in a data frame). For example, histograms are not supported since each bar represents multiple data points; but scatter plot points each represent a single data point, so they are supported."
If you change it to a point plot, it seems to work.
---
title: mpg class counts filtered by time period
output: html_document
---
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(data.table)
library(crosstalk)
library(plotly)
# Wrap data frame in SharedData
sd = SharedData$new(mpg)
# Create a filter input
filter_slider("Year", "Year", sd, column = ~ year, step = 1, width = 250)
# Render graph
bscols(
ggplotly(
ggplot(aes(hwy, cty), data = sd) +
geom_point()
)
)
```
Would the following work for you? If you want to filter for dates, you might want to have a look at plotly::rangeslider.
library(tidyverse)
library(plotly)
df <- crosstalk::SharedData$new(mpg)$data() %>%
group_by(year, class) %>%
count() %>%
mutate(year = as.factor(year))
df %>%
plot_ly(x = ~class, y = ~n, color = ~year) %>%
add_bars() %>%
layout(barmode = "stack")

Plotting select PCA loadings in R

I have just performed a PCA analysis for a large data set with approximately 20,000 variables. To do so, I used the following code:
df_pca <- prcomp(df, center=FALSE, scale.=TRUE)
I am curious how my variables influenced PCA.1 (Dimension 1 of the PCA analysis) and PCA.2 (Dimension 2 of the PCA analysis).
I used the following code to look at how each variable influenced the dimensional analysis:
fviz_pca_var(df_pca, col.var = "black")
However, this creates a graph with all 20,000 of my variables and since there is so much information, it is unreadable.
Is there a way to select the variables that have most influenced PCA.1 and PCA.2 and graph only those?
Thank you in advance!
If you want to see which variables contribute most to a given dimension, you can do this:
library(factoextra)
fviz_contrib(df_pca,
choice = "var",
axes = 5,
top = 10, color = 'darkorange3', barfill = 'blue4',fill ='blue4')
With the axes argument you choose the dimension that you want to see; in this case it is dimension 5. For the question, set axes = 1 or axes = 2 (a vector such as axes = 1:2 shows the combined contribution).
If you want to see the scree plot that helps you choose the number of dimensions, you can use this:
fviz_screeplot(df_pca, ncp=14,linecolor = 'darkorange3', barfill = 'blue4',
barcolor ='blue4', xlab = "Dimensions",
ylab = '% variance',
main = 'Reduction of components')
get_eigenvalue(df_pca)
What you want to do is first get the actual table that correlates the synthetic variable w/ the real variables. Do that like this:
a <- df_pca$rotation
Then we can use dplyr to manipulate the data frame and extract what we want:
library(dplyr)
library(tibble)
a %>% as.data.frame %>% rownames_to_column %>%
select(rowname, PC1, PC2) %>% arrange(desc(PC1^2+PC2^2)) %>% head(10)
The above will sort the variables and show the 10 most important ones for PC1 and PC2 combined. You can run the same thing for PC1 only by changing to arrange(desc(abs(PC1))), or for PC2 by changing to arrange(desc(abs(PC2))), and see more or fewer than 10 variables by changing head(10).
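The same ranking can be tried without any extra packages. The sketch below uses base R only and runs on the built-in mtcars data, since the question's df is not available; the helper name top_loadings() is made up for this illustration.

```r
# Hypothetical helper: rank variables by the sum of their squared
# loadings on the chosen components and keep the n largest.
top_loadings <- function(pca, pcs = c("PC1", "PC2"), n = 10) {
  load  <- pca$rotation[, pcs, drop = FALSE]
  score <- rowSums(load^2)
  keep  <- order(score, decreasing = TRUE)[seq_len(min(n, nrow(load)))]
  data.frame(variable = rownames(load)[keep],
             load[keep, , drop = FALSE],
             score = score[keep],
             row.names = NULL)
}

# mtcars stands in for the question's data frame (assumption)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
top <- top_loadings(pca, n = 3)
top
```

The variable column can then be used to subset what gets plotted. If I remember the {factoextra} documentation correctly, fviz_pca_var() also accepts select.var = list(contrib = 10) to draw only the top contributing variables, which may be the shortest route.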

How to use for loop across various years and get multiple plots together?

https://www.kaggle.com/nowke9/ipldata ---- contains the data set.
I am fairly new to R programming. This is an exploratory study of the IPL data set (link above). After merging both files on "id" and "match_id", I am trying to plot the relationship between matches won by teams across different cities.
However, since the data covers 12 seasons, the output I am getting does not allow any clear conclusions. To plot the relationship for each year, I assume a for loop is required. Right now, the output for all 12 years is displayed in a single graph.
How can I rectify this and plot a separate graph for each year with a proper color scheme?
library(tidyverse)
matches_tbl <- read_csv("data/matches_updated.csv")
deliveries_tbl <- read_csv("data/deliveries_updated.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
combined_matches_deliveries_tbl %>%
group_by(city, winner)%>%
filter(season == 2008:2019, !result == "no result")%>%
count(match_id)%>%
ungroup()%>%
ggplot(aes(x = winner))+
geom_bar(aes(fill = city),alpha = 0.5, color = "black", position = "stack")+
coord_flip()+
theme_bw()
The output is as follows:-
There were 50 or more warnings (use warnings() to see the first 50)
[Winner of teams across cities for the years between 2008 and 2019][1]
The required output is :- 12 separate graphs in a single code with proper color scheming.
Many thanks in advance.
Here is an example using mtcars to split by a variable into separate plots. What I created is a scatter plot of vs and mpg by splitting the dataset by cyl. First create an empty list. Then I use lapply to loop through the values of cyl (4,6,8) and then filter the data by that value. After that I plot the scatter plot for the subset and save it to the empty list. Each segment of the list will represent a plot and you can pull them out as you see fit.
library(dplyr)
library(ggplot2)
gglist <- list()
gglist <- lapply(c(4,6,8), function(x){
ggplot(filter(mtcars, cyl == x))+
geom_point(aes(x=vs,y=mpg))
})
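A small variation on the same idea, as a sketch: splitting the data first with split() gives a named list, so each plot can be pulled out by the value of the grouping variable instead of by position. The choice of wt for the x axis and the plot titles are my own additions for illustration.

```r
library(ggplot2)

# One data frame per value of cyl, named "4", "6" and "8"
by_cyl <- split(mtcars, mtcars$cyl)

# Build one scatter plot per subset and label it with its group
plots <- lapply(names(by_cyl), function(k) {
  ggplot(by_cyl[[k]], aes(x = wt, y = mpg)) +
    geom_point() +
    ggtitle(paste("cyl =", k))
})
names(plots) <- names(by_cyl)

# Pull out a single plot by name, e.g. print(plots[["6"]])
```

For the IPL data the same pattern would apply with season as the splitting variable.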
Is this what you want?
combined_matches_deliveries_tbl %>%
group_by(city, winner,season)%>%
filter(season %in% 2008:2019, !result == "no result")%>%
count(match_id)%>%
ggplot(aes(x = winner))+
geom_bar(aes(fill = city),alpha = 0.5, color = "black", position = "stack")+
coord_flip()+ facet_wrap(season~.)+
theme_bw()
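One change in the code above is easy to miss: season == 2008:2019 became season %in% 2008:2019. With ==, R compares the two vectors element by element and recycles the shorter one, which is the likely source of the warnings in the question. A base R sketch of the difference, using a made-up season vector:

```r
# Made-up seasons: twelve values alternating between 2008 and 2019
season <- rep(c(2008, 2019), times = 6)

# Element-wise: position i of `season` is compared with position i of 2008:2019
eq <- season == 2008:2019

# Set membership: is each season one of the years 2008..2019?
inn <- season %in% 2008:2019

sum(eq)   # only the positions that happen to line up are TRUE
sum(inn)  # every row is kept
```

Here == keeps only 2 of the 12 rows (the first and the last, where the positions happen to match), while %in% keeps all 12.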

Joining list of nest/ggplot2 generated images to two columns with consistent proportions

In summary, I would like to split a list of plots created using nest and ggplot2 to two columns. The problem I have had in my approaches is that elements in different subplots end up having more or less inconsistent dimensions because subplots have different heights (different number of elements in each groups, possibly exclusion of x-axis labels etc.).
Following example code uses the main tidyverse packages. I first generate some dummy data using mtcars; cars are split to random groups and each car is assigned an in-group position.
dummy <- mtcars %>%
mutate(group = sample(1:10, n(), replace = TRUE)) %>%
filter(group < 6) %>%
group_by(group) %>%
mutate(position = 1:n())
The actual code produces a list of subplots (plots) and information about number of elements in each group (heights).
## install patchwork via:
## devtools::install_github("thomasp85/patchwork")
plots <- dummy %>%
nest(-group, .key = "data") %>%
mutate(plots = map(data, ~ggplot(data = .x, aes(x = position, y = hp)) +
geom_bar(stat="identity") +
coord_flip()),
heights = purrr::map(data, ~ nrow(.)) %>% unlist())
g.plot <- patchwork::wrap_plots(plots$plots, ncol = 1, heights = plots$heights)
The function wrap_plots is able to produce a nice one-panel image using the information about heights (included image, panel A). When there is a large number of images to plot, the one-column layout is not practical. I would, therefore, kindly ask for help on how to turn the output of the above code (panel A) into the hoped-for output (panel B). Panel C exaggerates the problematic effect I have had using different image concatenation approaches.
Current output and hoped output
You could try setting the panel size to fixed dimensions and then arranging the gtables together:
library(egg)
library(gridExtra)
lg <- purrr::map2(plots$plots, plots$heights,
function(p,h) gtable_frame(ggplotGrob(p),
height =unit(h/10,'npc'), #tweak
width =unit(0.7,'npc'))) #tweak
grid.arrange(gtable_rbind(lg[[1]],lg[[2]], egg::.dummy_gtable),
gtable_rbind(lg[[3]],lg[[4]], egg::.dummy_gtable), ncol=2)
(tested with set.seed(12); I don't know what sample() OP had)

How to make plot scales the same or turn them into log scales in ggplot

I am using this script to plot chemical elements using ggplot2 in R:
# Load the same data set but under a different name, because it is just for plotting elements as a well log:
Core31B1 <- read.csv('OilSandC31B1BatchResultsCr.csv', header = TRUE)
#
# Calculating the ratios of Ca.Ti, Ca.K, Ca.Fe:
C31B1$Ca.Ti.ratio <- (C31B1$Ca/C31B1$Ti)
C31B1$Ca.K.ratio <- (C31B1$Ca/C31B1$K)
C31B1$Ca.Fe.ratio <- (C31B1$Ca/C31B1$Fe)
C31B1$Fe.Ti.ratio <- (C31B1$Fe/C31B1$Ti)
#C31B1$Si.Al.ratio <- (C31B1$Si/C31B1$Al)
#
# Create a subset of ratios and depth
core31B1_ratio <- C31B1[-2:-18]
#
# Removing the totCount column:
Core31B1 <- Core31B1[-9]
#
# Melting the data set based on the depth values, to have only three columns: depth, element and count
C31B1_melted <- melt(Core31B1, id.vars="depth")
#ratio melted
C31B1_ra_melted <- melt(core31B1_ratio, id.vars="depth")
#
# Eliminating the NA data from the data set
C31B1_melted<-na.exclude(C31B1_melted)
# ratios
C31B1_ra_melted <-na.exclude(C31B1_ra_melted)
#
# Rename the columns:
colnames(C31B1_melted) <- c("depth","element","counts")
# ratios
colnames(C31B1_ra_melted) <- c("depth","ratio","percentage")
#
# Plotting the data in well log format using ggplot2:
Core31B1_Sp <- ggplot(C31B1_melted, aes(x=counts, y=depth)) +
theme_bw() +
geom_path(aes(linetype = element))+ geom_path(size = 0.6) +
labs(title='Core 31 Box 1 Bioturbated sediments') +
scale_y_reverse() +
facet_grid(. ~ element, scales='free_x') #rasterImage(Core31Image, 0, 1515.03, 150, 0, interpolate = FALSE)
#
# View the plot:
Core31B1_Sp
I got the following image (as you can see the plot has seven element plots, and each one has its scale. Please ignore the shadings and the image at the far left):
My question is: is there a way to make these scales the same, for example by using log scales? If yes, what should I change in my code to change the scales?
It is not clear what you mean by "the same", because forcing identical scales will not give you the same result as log transforming the values. Here is how to get the log transformation, which, when combined with not using free_x, will give you the plot I think you are asking for.
First, since you didn't provide any reproducible data (see here for more on how to ask good questions), here is some that gives at least some of the features that I think your data has. I am using tidyverse (specifically dplyr and tidyr) to do the construction:
forRatios <-
names(iris)[1:3] %>%
combn(2, paste, collapse = " / ")
toPlot <-
iris %>%
mutate_(.dots = forRatios) %>%
select(contains("/")) %>%
mutate(yLocation = 1:n()) %>%
gather(Comparison, Ratio, -yLocation) %>%
mutate(logRatio = log2(Ratio))
Note that the last line takes the log base 2 of the ratio. This allows ratios in each direction (above and below 1) to plot meaningfully. I think that step is what you need. You can accomplish something similar with myDF$logRatio <- log2(myDF$ratio) if you don't want to use dplyr.
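To see why the log helps, here is a tiny base R sketch with made-up ratios. On the raw scale, halving and doubling sit at very different distances from 1; on the log2 scale they are mirror images around 0.

```r
# Made-up ratios: two below 1, one equal to 1, two above 1
ratio <- c(0.25, 0.5, 1, 2, 4)

# log2 turns "half" and "double" into -1 and +1
log_ratio <- log2(ratio)
log_ratio  # -2 -1  0  1  2
```

If you would rather keep the original values and only change the axis, ggplot2 also provides scale_x_log10(), which log-transforms the axis instead of the data.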
Then, you can just plot toPlot:
ggplot(
toPlot
, aes(x = logRatio
, y = yLocation) ) +
geom_path() +
facet_wrap(~Comparison)
Gives:
