Read a reproducible sample of data from multiple CSVs - r

I'm working with several large CSV files, large enough that I can't efficiently load them into memory.
Instead, I would like to read a sample of data from each file. There have been other posts on this topic (such as Load a small random sample from a large csv file into R data frame), but my requirements are a little different: I would like to read in the same rows from each file.
Using read.csv() with skip and nrows=1 would be very slow and tedious.
Does anyone have a suggestion for how to efficiently load the same N rows from several CSVs without reading them all into memory?
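One possible approach, sketched below under a few assumptions: every file has a header line and the same number of data rows, awk is available on the system, and the file paths, row count, and sample size are placeholders you would replace with your own. The idea is to draw one fixed set of row numbers with a seed, then let awk hand fread only those lines, so the full files never have to be read into R.

library(data.table)

files    <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # placeholder path
n_total  <- 1e6    # assumed number of data rows per file (count them with wc -l if unknown)
n_sample <- 5e4

set.seed(42)                                   # same seed -> same rows from every file
rows <- sort(sample(n_total, n_sample)) + 1L   # +1 so line 1 (the header) is never sampled

# write the chosen line numbers to a temp file and let awk keep only those lines
idx <- tempfile()
writeLines(as.character(c(1L, rows)), idx)     # line 1 = the header

samples <- lapply(files, function(f) {
  fread(cmd = sprintf("awk 'NR==FNR {keep[$1]; next} FNR in keep' %s %s", idx, f))
})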

Related

What are my options when dealing with very large tibbles?

I am doing some pre-processing on data from multiple sources (multiple large CSVs, above 500 MB), applying some transformations and ending up with a final tibble dataset which has all the data that I need in a tidy format. At the end of that pre-processing, I save that final tibble as an .RData file that I import later for my subsequent statistical analysis.
The problem is that the tibble dataset is very big (it takes 5 GB of memory in the R workspace) and it is very slow to save and to load. I haven't timed it precisely, but it takes over 15 minutes to save that object, even with compress = FALSE.
Question: Do I have any (ideally easy) options to speed all this up? I already checked, and the data types in the tibble are all as they should be (character is character, numeric is dbl, etc.)
Thanks
read_csv and the other readr functions aren't the fastest, but they make things really easy. Per the comments on your question, data.table::fread is a great option for speeding up the import of data into data frames. It is roughly 7x faster than read_csv. Those data frames can then easily be converted to tibbles using dplyr::as_tibble. You may not even need to convert the data frames to tibbles before processing, as most tidyverse functions will accept a data frame input and give you a tibble output.
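A rough sketch of that fread route, with a placeholder file name and a placeholder column called value:

library(data.table)
library(dplyr)

dt  <- fread("big_source.csv")        # much faster than read_csv on large files
tbl <- as_tibble(dt)                  # optional: only if you really want a tibble

# most tidyverse verbs accept the data frame directly and return a tibble
result <- dt %>%
  filter(!is.na(value)) %>%
  mutate(value = as.numeric(value))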

Load dataframe too large for system memory from jld2 file

I have a file "myfile.jld2" which contains a DataFrame, say mydata. Is it possible to load mydata in some way although it doesn't fit into system RAM?
I only want to load it to split it up into smaller pieces and dump those to disk.

Is there a way to read in a sample of an Rda file?

I have a very large dataset in an Rda file that I want to use for a Shiny app, but since it's so large I'm thinking of just taking a sample of the file and reading that in. Is there any way to do that?
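An .Rda file can only be loaded in full, so one common workaround is to load it once on a machine with enough memory, sample, and save the smaller object for the Shiny app to read. A minimal sketch, with placeholder object and file names:

load("big_data.Rda")                            # assume this creates an object big_df
set.seed(123)                                   # reproducible sample
small_df <- big_df[sample(nrow(big_df), 1e5), ]
saveRDS(small_df, "small_sample.rds")           # the app then reads only this file
# in the app: small_df <- readRDS("small_sample.rds")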

Convert a Spark dataframe to a R dataframe [duplicate]

This question already has an answer here:
Requirements for converting Spark dataframe to Pandas/R dataframe
(1 answer)
Closed 4 years ago.
I use R on Zeppelin at work to develop machine learning models. I extract the data from Hive tables using %sparkr, sql(Constring, 'select * from table'), and by default it generates a Spark data frame with 94 million records.
However, I cannot perform all my R data munging tasks on this Spark df, so I try to convert it to an R data frame using collect() and as.data.frame(), but I run into node memory and time-out issues.
I was wondering if the Stack Overflow community is aware of any other way to convert a Spark df to an R df that avoids these time-out issues?
Did you try to cache your spark dataframe first? If you cache your data first, it may help speed up the collect as the data is already in RAM...that could get rid of the timeout problem. At the same time, this would only increase your RAM requirements. I too have seen those timeout issues when you are trying to serialize or deserialize certain data types, or just large amounts of data between R and Spark. Serialization and deserialization for large data sets is far from a "bullet proof" operation with R and Spark. Moreover, 94M records may just be too much for your driver node to handle in the first place, especially if there is a lot of dimensionality to your dataset.
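Roughly what that caching suggestion looks like in SparkR on Zeppelin; the table name is a placeholder, and older SparkR versions need the SQL context passed as the first argument to sql():

# %sparkr
df <- sql("SELECT * FROM my_table")   # placeholder table name
cache(df)                             # mark the Spark DataFrame for caching
invisible(count(df))                  # force an action so the cache is actually populated
rdf <- collect(df)                    # then pull it into a local R data frame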
One workaround I've used, but am not proud of is to use spark to write out the dataframe as a CSV and then have R read that CSV file back in on the next line of the script. Oddly enough, in a few of the cases I did this, the write a file and read the file method actually ended up being faster than a simple collect operation. A lot faster.
Word of advice: make sure to watch out for partitioning when writing out CSV files with Spark. You'll get a bunch of CSV files and will have to do some sort of tmp <- lapply(list_of_csv_files_from_spark, function(x){read.csv(x)}) operation to read in each CSV file individually, and then maybe df <- do.call("rbind", tmp). It would probably be best to use fread to read in the CSVs in place of read.csv as well.
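A sketch of that read-them-all-back pattern with fread; the export directory is a placeholder:

library(data.table)

csv_parts <- list.files("spark_csv_export", pattern = "\\.csv$", full.names = TRUE)
tmp <- lapply(csv_parts, fread)   # fread each partition file
df  <- rbindlist(tmp)             # typically faster than do.call("rbind", tmp)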
Perhaps the better question is, what other data munging tasks are you unable to do in Spark that you need R for?
Good luck. I hope this was helpful. -nate

Extracting data from a Matlab file in R

This is the first time I've dealt with Matlab files in R.
The rationale for saving the information in a .mat file was the length (the dataset contains 226,518 rows); we were worried that Excel (and then a CSV) would not take them.
I can upload the original file if necessary.
So I have my Matlab file, and when I open it in Matlab everything looks fine.
There are various arrays, and the one I want is called "allPoints".
I can open it and see that it contains values around 0.something.
What I want to do is to extract the same data in R.
library(R.matlab)
df <- readMat("170314_Col_HD_R20_339-381um_DNNhalf_PPP1-EN_CellWallThickness.mat")
str(df)
And here I get stuck. How do I pull "allPoints" out of it? $ does not seem to work.
I will have multiple files that need to be put together into one single dataframe in R, so the plan is to mutate each extracted df, adding a new column for the sample, and then rbind them together.
Could anybody help?
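A minimal sketch of that extract-and-combine plan, assuming readMat() returns a list with an element named allPoints (check the names shown by str()); the file names are placeholders:

library(R.matlab)
library(dplyr)

files <- c("sample_A.mat", "sample_B.mat")    # placeholder file names

read_points <- function(f) {
  m <- readMat(f)
  pts <- m[["allPoints"]]                     # index the returned list by name
  data.frame(allPoints = as.numeric(pts)) %>% # flatten the array into one column
    mutate(sample = tools::file_path_sans_ext(basename(f)))
}

df <- do.call(rbind, lapply(files, read_points))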
