Shiny - Efficient way to use ggplot2 (boxplot) & a 'reactive' subset function

I have a dataset with more than 1,000K rows and 5 columns (material and price being the relevant columns).
I have written a 'reactive' Shiny app which uses ggplot2 to create a boxplot of the price of the various materials.
e.g. the user selects 4-5 materials from a list and then Shiny creates a boxplot of the price of each material:
Price spread of: Made of Cotton, Made of Paper, Made of Wood
It also creates a material-combination plot of the pricing spread of the combination of all the selected materials, e.g. a boxplot of
Price spread of: Made of Cotton & Paper & Wood
It is working relatively quickly for the sample dataset (~5000 rows) but I am worried about scaling it effectively.
The dataset is static, so I have looked at the following solutions:

1. Calculate the quartile ranges of the various materials (data <- summary(data)) and then use googleVis to create a candlestick chart. However, I run into problems when trying to calculate the material-combination plot: there are over 100 materials, so calculating all the possible combinations offline is not feasible.

2. Calculate the quartile ranges of the various materials (data <- summary(data)) and then create a matrix which stores the row number of the summary data (min, median, max, 1st & 3rd quartile) for each material. I can then use some rough calculations to establish the summary() data for the material-combination plot, and then plot using googleVis. However, I have little experience with this type of calculation using Shiny.
Can anyone suggest the most robust and scalable way to calculate & boxplot reactive subsets using Shiny?
I understand this is a question related to method, rather than code, but I am new to the capabilities of R and am still digesting the different class capabilities, and don't want to 'miss a trick' so to speak.
As always thanks!
Please see below for methods reviewed.
Quartile Clustering: A quartile based technique for Generating Meaningful Clusters
http://arxiv.org/ftp/arxiv/papers/1203/1203.4157.pdf
Conditionally subsetting and calculating a new variable in dataframe in shiny

If you really have a dataset with more than 1,000K (i.e. 1M) rows, it is probably in a flat file or in a database. You can always do some precalculations, store the result in a database table, and have the Shiny app query that table instead of loading everything into R every time someone opens your app.
I have built several Shiny apps for internal use, and the lesson I have learned is this: before you build your app, think carefully about how to minimize the calculations R has to do while still delivering the information to the app user. Some of our data is 10 billion+ rows and a Hive query takes more than an hour, so I ended up precalculating the results and putting a job on the crontab to update the result table every midnight.
I would lean towards your method 2, or storing the precalculations in a MySQL database (maybe with a Python script that updates the table once a day, if you need fresher results later).
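As a minimal sketch of that precalculation idea (the data frame raw_data and the columns material and price are hypothetical names): compute the five-number summary per material once, save it, and let the Shiny app plot the stats directly with geom_boxplot(stat = "identity"), so the app never touches the raw rows.

library(dplyr)
library(ggplot2)

# One-off precalculation (run offline, not inside the app):
box_stats <- raw_data %>%
  group_by(material) %>%
  summarise(ymin   = min(price),
            lower  = quantile(price, 0.25),
            middle = median(price),
            upper  = quantile(price, 0.75),
            ymax   = max(price))
saveRDS(box_stats, "box_stats.rds")   # or write to a database table

# Inside the app: plot the precomputed stats, no raw data needed.
ggplot(readRDS("box_stats.rds"),
       aes(x = material, ymin = ymin, lower = lower, middle = middle,
           upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")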

Related

Easily view/browse free-text datasets in R/RStudio - alternatives to View() in R/RStudio?

When working with survey data (10-30 columns, 100-10k rows, a mix of demographic columns like name, age, etc., and free-text responses up to nchar == 3000), View() isn't so useful, because it only displays the first 50 or so characters of lengthy strings (we can always widen the column, but this has practical limitations). AFAIK, increasing row height is not possible. So it is not easy to view free text inside RStudio unless it's in the console, which is not really designed for browsing through columns of long strings.
Is there any function like View() that displays data similarly but allows for resizing of row heights (to display >1 line of long strings), and perhaps some smarts to allow us to explore list columns in data.frames?
One idea is a function that takes a data.frame argument, writes it as a temp file, and starts a shiny app that displays the data. But something in native R (or built into RStudio) would probably be better than an ad hoc shiny app.
Note: I do know how to achieve this in R Markdown using kableExtra and similar packages that make nice Bootstrap tables. However, the goal is to reduce friction between coding in the RStudio script pane and exploring the data, and while moving code into an Rmd has potential, it creates extra friction.
DT::datatable() provides the ability to view raw data in tables using all the features of the JavaScript DataTables library, either in RStudio's Viewer tab or separately in the browser of your choice. You can further fine-tune the display of your data to fit your needs using any of the features described here: https://rstudio.github.io/DT/
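A minimal sketch (survey_df and the free_text column are hypothetical names): long strings can be allowed to wrap, so rows grow in height instead of truncating the text.

library(DT)

# Show the data with per-column filter boxes; opens in RStudio's
# Viewer tab or in the browser.
tbl <- datatable(survey_df, filter = "top", options = list(pageLength = 10))

# Let the hypothetical long-text column wrap instead of truncating,
# which effectively gives multi-line row heights.
formatStyle(tbl, "free_text", whiteSpace = "normal")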

Is shiny a good solution to display a computationally intensive fixed big dataset?

Here is my problem:
I have a big dataset in R, an object of ~500 MB, that I plot with ggplot2.
There are 20 million numeric values to plot along an integer axis, each associated with a 5-level factor used for the colour aesthetic.
I would like to set up a web app where users can visualize this dataset, using different filters that rely on the factor to display either all the data at once or, for example, the subset corresponding to one level of the factor.
The problem is that rendering the plot takes several minutes (~10 minutes).
Solution 1: the best one for the user would be a Shiny UI. But is there a way, through ggplot2 or Shiny tricks, to have the plot somehow pre-built so it can be displayed quickly?
Solution 2: without Shiny, I would prepare different plots of the dataset in advance and build a UI that lets users browse the resulting pictures. If I do that, I will have to restrict the possible ways of displaying the data.
Looking forward to advice and discussion.
Ideally, you shouldn't need to plot anything this big. If you're getting the data from a database, write a sequence of queries that aggregate the data on the DB side and pull only a small amount of data into Shiny for output. Needing to plot everything raw suggests a design problem.
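A rough sketch of that approach (the table points and the columns grp and value are made up, and con is assumed to be an existing DBI connection): the aggregation happens in SQL, and only the small result crosses into R.

library(DBI)

# Let the database do the heavy lifting; R receives one row per group
# instead of 20 million raw points.
agg <- dbGetQuery(con, "
  SELECT grp, COUNT(*) AS n, AVG(value) AS mean_value
  FROM points
  GROUP BY grp
")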
That being said, the author of the highcharter package has worked on implementing the boost.js module, which helps with plotting millions of points: https://rpubs.com/jbkunst/highcharter-boost.
Also have a look at the bigvis package, which supports 'Exploratory data analysis for large datasets (10-100 million observations)' and was built by Hadley Wickham: https://github.com/hadley/bigvis. There is a nice presentation about the package at this meetup.
Think about the following procedure:
With ggplot2 you can produce an R object:
plot_2_save <- ggplot()
The object can be saved with:
saveRDS(plot_2_save, "file.rds")
and in the Shiny server.R you can load this object back:
plot_from_data <- readRDS("path/.../file.rds")
I used this setup for a kind of text classification with a really (really) huge SVM model, implemented as an application on Shiny Server.
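A minimal sketch of the server side under that setup (the file path and the output ID big_plot are placeholders):

library(shiny)

server <- function(input, output, session) {
  # Load the prebuilt ggplot object once per session; the expensive
  # data preparation happened offline when the object was created.
  plot_from_data <- readRDS("file.rds")

  output$big_plot <- renderPlot({
    plot_from_data   # ggplot objects are drawn when printed
  })
}

Note that the object still has to be drawn on each display; what this saves is the data wrangling and plot construction, not the rendering of the points themselves.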

R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a data frame against every other column, as in plot(data.frame).
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (because there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like a chart where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is: is there an easy way to plot each column of a data frame against every other column, with more flexible chart types than the standard plot(data.frame)?
Try the ggpairs() function from the GGally library. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)
data(tips, package = "reshape")
ggpairs(tips)
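If the goal is specifically the age distribution by issue year, a small hand-rolled ggplot2 sketch may be more direct (the column names issue.age and iss.year are taken from the question; portfolio is a hypothetical data frame):

library(ggplot2)

# One density curve per issue year: a rightward drift of the curves
# shows the average issue age increasing over time.
ggplot(portfolio, aes(x = issue.age, colour = factor(iss.year))) +
  geom_density()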

Tableau to R connection - script_real returning rounded fraction numbers

I'm pretty new to Tableau but have a lot of experience with R. Every time I use SCRIPT_REAL to call an R function based on Tableau aggregates, I get back a number that looks like the closest fraction approximation. For example, if raw R gives me 0.741312, Tableau will spit out 0.777778, and so on. Does anyone have experience with this issue?
I'm pretty sure this is an aggregation issue.
From the Tableau and R Integration post by Jonathan Drummey on their community site:
Using Every Row of Data - Disaggregated Data

For accurate results from the R functions, sometimes those R functions need to be called with every row in the underlying data. There are two solutions to this:

1. Disaggregate the measures using Analysis -> Aggregate Measures -> Off. This doesn't actually cause the measures to stop their aggregations; instead it tells Tableau to return every row in the data without aggregating by the dimensions on the view (which gives the wanted effect). Using this with R scripts can get the desired results, but can cause problems for views where we want R to work on the non-aggregated data and then display the data with some level of aggregation.

2. The second solution deals with this situation: add a dimension such as a unique Row ID to the view, and set the Compute Using (addressing) of the R script to be along that dimension. If we're doing some sort of aggregation with R, then we might need to reduce the number of values returned by filtering them out with something like:

IF FIRST()==0 THEN SCRIPT_REAL('insert R script here') END

If we need to then perform additional aggregations on that data, we can do so with table calculations with the appropriate Compute Usings that take into account the increased level of detail in the view.
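For instance, with a unique Row ID dimension on the view and the table calculation's Compute Using set along it, a calculated field like the following (the field name [Price] is hypothetical) passes every underlying row's value to R, so R computes on the raw data rather than on Tableau's pre-aggregated values, which is what produced the fraction-like results:

IF FIRST() == 0 THEN
    SCRIPT_REAL("mean(.arg1)", ATTR([Price]))
END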

Creating boxplots in R using lattice for already processed data

I am trying to create a boxplot in R of an extremely large data set. The file containing the data is 2.5 GB and crashes R if I try to import it. Fortunately, some other piece of (Python) software can generate the mean and variance without a problem, which is all I really want to plot (for now).
Every tutorial I've found so far requires you to input the full data set and let R compute the statistics itself, but I was wondering how to pass the mean, median, min, max, etc. to bwplot just for plotting. The reason I prefer R and lattice is that they integrate well with the software suite the code might end up in. If I used MATLAB or some other software, that would be a problem because it would be yet another requirement for our current users.
Boxplots do not plot mean or variance. You actually need the full ranked data to plot a proper boxplot, because the quantities involved are the median, the quartiles, and the actual values of the closest data points within 1.5 times the IQR, plus all data points outside that range (outliers). This is typically not a good idea for a large data set (because by definition you have millions of outliers).
That said, you can generate the essential summaries any way you want and use bxp to plot them - see ?bxp in R. Just make sure you clarify what quantities you are plotting if they are not the above.
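A minimal sketch with base R's bxp() (the numbers and group names are made up): build the five-number summary matrix by hand and plot it without ever loading the raw data.

# Rows of `stats`: lower whisker, Q1, median, Q3, upper whisker;
# one column per box.
stats <- matrix(c(1.2, 2.0, 2.8, 3.5, 4.9,
                  0.8, 1.5, 2.1, 3.0, 4.2), ncol = 2)
z <- list(stats = stats,
          n     = c(1e6, 8e5),    # group sizes (used for notches/varwidth)
          conf  = NULL,
          out   = numeric(0),     # no outlier points supplied
          group = numeric(0),
          names = c("A", "B"))
bxp(z)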
