I have a question about data size limits in R Shiny. I am working on updating a previous project. The original data is around 5MB, and the program resamples the data to obtain an estimate of future values. I am now updating the program to make it more general and trying to import 300MB of data, but the Shiny app crashes. I have used R to handle larger data before, but I am not sure whether Shiny has any limit on data size. Does anyone have any idea about this? Thanks.
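One possible cause, assuming the data reaches the app through fileInput(): Shiny caps uploads with the shiny.maxRequestSize option, which defaults to 5MB. A minimal sketch of raising that cap to 300MB follows; the variable name upload_limit_bytes is only illustrative, and whether the upload cap is what actually breaks this app is an assumption.

```r
# Assumption: the data is uploaded through fileInput(), whose uploads are
# capped by the shiny.maxRequestSize option (5 MB by default).
# Raise the cap to 300 MB; place this near the top of server.R or app.R.
upload_limit_bytes <- 300 * 1024^2
options(shiny.maxRequestSize = upload_limit_bytes)
```

If the app reads the file from disk on the server instead of through an upload, this option has no effect and the limit is simply the memory available to the R process.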
I have a mathematical simulation written in Scala (random numbers, small calculations, lots of iteration over collections, and a lot of output data). Currently I write some CSV files as output, then load them into R and plot the information. But CSV is probably not the best option for sharing big data. My problem is that I don't know how to improve my current approach.
Shall I use a database? Which one? MariaDB?
Shall I calculate the data to be plotted in Scala while my simulation is running? Without calculating the plotting data, my program needs 20s for 500000 simulation steps. With the calculations it needs more than 3min, although I could use threads for them. Or shall I give R the raw data and do the calculations on it in R?
Shall I use Hadoop and Spark? Together with a database?
I am quite confused and hope you have some best practices for me.
In my project I developed algorithms with R and RStudio to process large images from microscopy analysis. I work on a large matrix of pixel values (around 3GB), extract biological information from specific regions of these images (in particular, particle detection at specific positions in the matrix), and then perform statistical analysis.
To be able to hand the whole process to other people without making them run multiple scripts and functions, I tried to create a Shiny app, but I struggle to import the big .txt images.
Currently I use as.matrix(read.table()) on the file provided by the Shiny app user through the file input box.
Does anyone have an idea for importing these matrices more easily and more quickly? Even in RStudio I wait ~20 minutes for each image import.
Thanks a lot in advance for the answers,
Best,
Romain
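For a purely numeric text file, base R's scan() is usually faster than read.table(), because it skips the data frame construction and per-column type detection. The sketch below is a minimal illustration under stated assumptions: the file is whitespace-separated with no header, and the small temporary file stands in for a real image export. In a Shiny app the path would come from the datapath column of the fileInput() value.

```r
# Hypothetical stand-in for a microscopy export: a small whitespace-separated
# text file of pixel values (real files are assumed to share this layout).
pixel_file <- tempfile(fileext = ".txt")
original <- matrix(as.numeric(1:12), nrow = 3, ncol = 4)
write.table(original, pixel_file, row.names = FALSE, col.names = FALSE)

# Count the columns from the first line only, then read all values in one pass.
n_cols <- length(scan(pixel_file, what = numeric(), nlines = 1, quiet = TRUE))
pixels <- matrix(scan(pixel_file, what = numeric(), quiet = TRUE),
                 ncol = n_cols, byrow = TRUE)
```

scan() returns the values row by row, so byrow = TRUE is needed to restore the original orientation.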
Here is my problem:
I have a big dataset in R, an object of ~500MB, that I plot with ggplot2.
There are 20 million numeric values to plot along an integer axis, associated with a 5-level factor used for the color aesthetic.
I would like to set up a web app where users can visualize this dataset, using filters based on the factor to display either all the data at once or, for example, a subset corresponding to one level of the factor.
The problem is that rendering the plot takes several minutes (~10 minutes).
Solution 1: The best one for the user would be a Shiny UI. But is there a way to have the plot somehow prebuilt, using ggplot2 or Shiny tricks, so that it can be displayed quickly?
Solution 2: Without Shiny, I would render different plots of the dataset in advance and then build a UI to let users view the different pictures. If I do that, I will have to restrict the possible ways of displaying the data.
Looking forward to advice and discussion.
Ideally, you shouldn't need to plot anything this big really. If you're getting the data from a database then just write a sequence of queries that will aggregate the data on the DB side and drag very little data to output in shiny. Seems to be a bad design on your part.
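The same aggregation idea can be sketched in R itself when the data is already in memory rather than in a database (that swap is an assumption, as are the column names and the choice of 100 bins). The sketch reduces the points to one mean per bin and factor level before anything is plotted:

```r
set.seed(1)
# Hypothetical stand-in for the large dataset: integer position, numeric value,
# and a 5-level factor used for colour.
n <- 10000
dat <- data.frame(
  position = seq_len(n),
  value    = rnorm(n),
  group    = factor(sample(letters[1:5], n, replace = TRUE))
)

# Bin the position axis into 100 equal-width intervals, then keep one mean
# per bin and group. The result has at most 100 * 5 rows.
dat$bin <- cut(dat$position, breaks = 100, labels = FALSE)
binned <- aggregate(value ~ bin + group, data = dat, FUN = mean)
```

Plotting binned instead of dat keeps the number of drawn points independent of the size of the raw data; filtering by factor level then only subsets a few hundred rows.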
That being said, the author of the highcharter package has worked on implementing the boost.js module to help with plotting millions of points: https://rpubs.com/jbkunst/highcharter-boost
Also have a look at the bigvis package, which allows 'Exploratory data analysis for large datasets (10-100 million observations)' and was built by Hadley Wickham (https://github.com/hadley/bigvis). There is a nice presentation about the package at this meetup
Consider the following procedure:
With ggplot2 you can produce an R object:
plot_2_save <- ggplot()
An object can be saved with
saveRDS(object, "file.rds")
and in the shiny server.R you can load this data
plot_from_data <- readRDS("path/.../file.rds")
I used this setup for a text classification task with a really (really) huge SVM model, deployed as an application on shiny-server.
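A minimal round-trip sketch of that procedure follows. It uses a plain list as a hypothetical stand-in for the ggplot object (any R object is saved and restored the same way) and a temporary directory instead of a real application path:

```r
# Hypothetical stand-in for the ggplot object; any R object round-trips the same way.
plot_2_save <- list(title = "prebuilt plot",
                    data  = data.frame(x = 1:3, y = c(2, 4, 6)))

# Done once, offline: write the object to disk.
rds_path <- file.path(tempdir(), "file.rds")
saveRDS(plot_2_save, rds_path)

# Done in server.R, outside the server function, so the file is read once
# per R process rather than once per user session.
plot_from_data <- readRDS(rds_path)
```

Note that for a ggplot object, reading it back restores the plot specification and its data; the drawing itself still happens when the object is printed.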
I am stuck with a huge dataset that I need to import into R and then process (with randomForest). Basically, I have a CSV file of about 100K rows and 6K columns. Importing it directly takes a long time and produces many warnings about memory allocation (limit reached for 8061mb). After the warnings, I do get the data frame in R, but I am not sure whether I can rely on it. Even if I use that data frame, I am pretty sure that running randomForest on it will be a huge problem. Hence, mine is a two-part question:
How can I efficiently import such a large CSV file without any warnings/errors?
Once it is imported into R, how should I proceed with running the randomForest function on it?
Should I use a package that improves computing efficiency? Any help is welcome, thanks.
Your limit for loading files in R appears to be about 8GB; try increasing it if your machine has more memory.
If that does not work, one option is to submit the work to Hadoop MapReduce or Spark from R (see http://www.r-bloggers.com/mapreduce-with-r-on-hadoop-and-amazon-emr/ and https://spark.apache.org/docs/1.5.1/sparkr.html). However, Random Forest is not yet supported either way.
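Independently of the memory limit, the import itself can often be made cheaper in base R by fixing the column classes up front, so that read.csv() does not have to guess a type for every column. The sketch below is a minimal illustration on a small temporary file; the file, the column names, and the choice of 10 sample rows are all assumptions:

```r
# Hypothetical stand-in for the wide CSV: a small file with a similar layout.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:50, x1 = runif(50), x2 = runif(50)),
          csv_path, row.names = FALSE)

# Infer the column classes from the first few rows only.
sample_rows <- read.csv(csv_path, nrows = 10)
col_classes <- vapply(sample_rows, function(col) class(col)[1], character(1))

# Read the full file with the classes fixed, so read.csv skips type guessing.
full_data <- read.csv(csv_path, colClasses = col_classes)
```

This assumes the sampled rows are representative: a column that looks like integers in the first rows but contains decimals or text further down will make the full read fail, in which case that class should be set by hand.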
Being an R user, I'm now trying to learn SPSS syntax.
I used to add the command rm(list=ls()) at the beginning of an R script to ensure that the R workspace is empty before I start my work.
Is there a similar command for SPSS? Thanks.
The closest functional equivalent in SPSS would be
dataset close all.
This simply closes all open datasets except for the active dataset (and strips it of its name). If you open another dataset, the previous one will close automatically.
Since the way SPSS uses memory is fundamentally different from how R uses it, there really isn't a close equivalent between rm and the SPSS memory management mechanisms. SPSS does not keep datasets in memory in most cases, which is why it can process files of unlimited size. When you close an SPSS dataset, all its associated metadata, which is held in memory, is removed.
DATASET CLOSE ALL
closes all open datasets, but there can still be an unnamed dataset remaining. To really remove everything, you would write
dataset close all.
new file.
because a dataset cannot remain open if another one is opened unless it has a dataset name.
You might also be interested to know that you can run R code from within SPSS via
BEGIN PROGRAM R.
END PROGRAM.
SPSS provides APIs for reading the active SPSS data, creating SPSS pivot tables, creating new SPSS datasets, etc. You can even use the SPSS Custom Dialog Builder to create a dialog box interface for your R program. In addition, there is a mechanism for building SPSS extension commands that are actually implemented in R or Python. All this apparatus is free once you have the basic SPSS Statistics. So it is easy to use SPSS to provide a nice user interface and nice output for an R program.
You can download the R Essentials and a good number of R extensions for SPSS from the SPSS Community website at www.ibm.com/developerworks/spssdevcentral. All free, but registration is required.
p.s. rm(list=ls()) is useful in some situations: it is often used with R code within SPSS, because the state of the R workspace is retained between R programs within the same SPSS session.
Regards,
Jon Peck
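For reference, the R side of the comparison can be sketched as follows. To avoid wiping the session it is run from, the sketch applies the same call to a scratch environment (the names workspace, model, and tmp are hypothetical); inside a BEGIN PROGRAM R block one would presumably call rm(list = ls()) directly.

```r
# A scratch environment stands in for the workspace, so the demo does not
# delete anything in the session it is run from.
workspace <- new.env()
assign("model", 1:10, envir = workspace)
assign("tmp", "scratch", envir = workspace)

# Same idea as rm(list = ls()), applied to that environment:
# ls() lists the object names, rm(list = ...) removes them.
rm(list = ls(envir = workspace), envir = workspace)
remaining <- ls(envir = workspace)
```

After the call, remaining is an empty character vector, which is the state the question wants at the start of a script.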