Since I have more than 10 GB of data, I am looking for a way to apply a row filter before reading the dataset into R, so that less than 10 GB is read and the read takes less time.
Currently I am using fread, and even though it has a column selection argument, select = c("a","b"), I was not able to find a row filter criterion (e.g. Col A > 2).
Is there a workaround for this in fread?
If not, is there any other way?
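One possible workaround, sketched under the assumption that a command-line tool such as awk is available and that the filter column is the first field of a comma-separated file (the file name bigfile.csv is a placeholder), is fread's cmd argument, which lets a shell command drop rows before R ever sees them:

library(data.table)
# awk keeps the header (NR==1) plus rows whose 1st field is > 2,
# so fread only parses the pre-filtered rows.
dt <- fread(cmd = "awk -F',' 'NR==1 || $1 > 2' bigfile.csv", select = c("a", "b"))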
I'm trying to get the distinct values from a SparkDataFrame using the statement below.
distVals <- collect(distinct(select(dataframeName, 'Column_name')))
This statement takes around 30-40 minutes to execute. Is there a better way to do this?
Also, there is not much time difference between collecting the full DataFrame and collecting only the distinct values. So why is it recommended not to collect the entire dataset? Is it only because of the data size?
Since I have to retrieve several kinds of filtered data, I'm looking for a faster way to collect the results.
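One thing to try, sketched on the assumption that the slow part is re-reading the source for every query: cache the SparkDataFrame once so that distinct() and the later filters reuse the in-memory copy (the column name and filter value below are placeholders):

library(SparkR)
# Cache once; subsequent actions on dataframeName reuse the cached data.
cache(dataframeName)
distVals <- collect(distinct(select(dataframeName, "Column_name")))
# Later filtered collects hit the cache instead of re-scanning the source.
subsetA <- collect(filter(dataframeName, dataframeName$Column_name == "some_value"))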
I have a LARGE dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say column1 == A. How do I accomplish this in R using read.csv?
Thank you
You can't filter rows using read.csv. You might try sqldf::read.csv.sql as outlined in answers to this question.
But I think most people would process the file using another tool first. For example, csvkit allows filtering by rows.
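A sketch of the sqldf route, assuming a comma-separated file with a header row (the file name data.csv is a placeholder):

library(sqldf)
# read.csv.sql stages the file in a temporary SQLite database and runs the
# SQL there, so only the matching rows are returned to R.
subset_df <- read.csv.sql("data.csv",
                          sql = "select * from file where column1 = 'A'")

The csvkit alternative would be to filter outside R first (e.g. with csvgrep) and then read.csv the smaller output file.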
This may be a novice question. How do I filter rows whose string values match values in another column, using dplyr, when working with a database?
For example, I want to do something like this:
install.packages("nycflights13")
library(nycflights13)
library(dplyr)
head(flights)
Let's say I want to filter for rows whose origin values are contained in the dest values. I tried:
filter(flights_sqlite, origin %in% (unique(select(flights_sqlite,dest))))
However, that operation is not allowed. I do not want to convert this to a data frame, as the database I am working with is large and would eat up all available RAM.
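A sketch of one way to keep everything in the database, assuming flights_sqlite is a dplyr/dbplyr tbl backed by the flights table: a semi join against the distinct dest values is translated to SQL, so nothing is materialised in R until you explicitly collect it.

library(dplyr)
# Keep flights whose origin appears among the distinct dest values;
# the join runs in the database, not in R.
result <- semi_join(flights_sqlite,
                    distinct(select(flights_sqlite, dest)),
                    by = c("origin" = "dest"))
# collect(result) would pull only the (smaller) result set into R if needed.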
My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify gene names (3 or 4 at a time), and based on the user input, the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info about the gene name, etc.) and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row, then read the 35,000 columns of data. I then do this again for the 2nd gene and 3rd gene (up to 4 genes max) and process the numerical results.
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then extracting the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (a one-time processing step) if that will speed up reading in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
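A sketch of how that could plug into the existing skip-based reads (the column counts are the guesses above; adjust sep and header to match the file):

# Jump past the first 10,000 rows and read one gene row with
# pre-declared column types.
gene_row <- read.table(filename, skip = 10000, nrows = 1,
                       colClasses = c(rep("character", 2),
                                      rep("numeric", 34998)))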
This would be more efficient if you used a database interface. There are several available for R (e.g. via the RODBC package), but a particularly well-integrated-with-R option is the sqldf package, which uses SQLite by default. You would then be able to use the database's indexing capability to look up the correct rows and read all the columns in one operation.
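A minimal sketch of the one-time conversion plus indexed lookups, written with DBI/RSQLite (which sqldf uses underneath); the file name, the tab separator, and the assumption that column V1 holds the gene name are all placeholders:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "genes.sqlite")

# One-time load: read the full table once, store it, and index it by gene name.
genes <- read.table("genes.txt", sep = "\t",
                    colClasses = c(rep("character", 2), rep("numeric", 34998)))
dbWriteTable(con, "genes", genes)
dbExecute(con, "CREATE INDEX idx_gene ON genes (V1)")

# Subsequent runs: fetch only the requested genes via the index.
wanted <- dbGetQuery(con, "SELECT * FROM genes WHERE V1 IN ('gene1', 'gene2')")
dbDisconnect(con)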