Efficient alphanumeric searching in SparkR - r

I have a Spark data frame with 10 million rows, where each row contains an alphanumeric string representing a user ID, for example:
602d38c9-7077-4ea1-bc8d-af5c965b4e85
My objective is to check whether another ID, such as aaad38c9-7087-4ef1-bc8d-af5c965b4e85, is present in that list of 10 million.
I want to do this efficiently rather than scanning all 10 million records every time a search happens. For example, can I sort my records alphabetically and ask SparkR to search only within records that begin with "a", instead of the whole universe, to speed up the search and make it computationally efficient?
A solution using SparkR is preferred; failing that, any Spark solution would be helpful.

You can use rlike, which does a regex search within a DataFrame column:
df.filter($"foo".rlike("regex"))
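For reference, here is a minimal SparkR sketch of the same kind of lookup (df, its id column, and target are illustrative assumptions, not from the original answer):
library(SparkR)
target <- "aaad38c9-7087-4ef1-bc8d-af5c965b4e85"
matches <- filter(df, df$id == target)   # exact equality, no regex engine involved
found <- count(matches) > 0              # TRUE if the id exists
An equality filter avoids the regex engine, but Spark can only skip data if it is physically organized for that, e.g. partitioned by an id-prefix column.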
Alternatively, you can index the Spark DataFrame into Solr, which can look up your string within a few milliseconds:
https://github.com/lucidworks/spark-solr

Related

FREAD row criteria in R

Since I have more than 10 GB of data, I am looking for a way to apply a row-filter criterion before reading the dataset into R, so that less than 10 GB is read and the time required is reduced.
Currently I am using fread; although it has a column-selection argument, select = c("a", "b"), I was not able to find a row-filter criterion (e.g. Col A > 2).
Is there a workaround for this in fread?
If not, is there any other way?
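One commonly used workaround (a sketch, not from the original post; the file name and column position are illustrative) is to let a shell command pre-filter the rows and pass it to fread via its cmd argument:
library(data.table)
# keep the header line plus rows whose first column is greater than 2
dt <- fread(cmd = "awk -F, 'NR == 1 || $1 > 2' big_file.csv")
Only the filtered rows ever reach R, so both memory use and read time drop.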

SparkR get distinct values from Dataframe fast

I'm trying to get the distinct values from a SparkDataFrame using the statement below.
distVals <- collect(distinct(select(dataframeName, 'Column_name')))
Executing this statement takes around 30-40 minutes. Is there a better way to perform it?
Also, there is not much difference in time between collecting the full data frame and collecting only the distinct values. So why is it suggested not to collect the entire dataset? Is it only because of the data size?
Since I have to get different kinds of filtered data, I'm looking for a way to collect the results faster.
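One common mitigation when the same SparkDataFrame is queried repeatedly (a sketch, not from the original question) is to cache it, so that repeated distinct/filter calls can reuse the in-memory copy instead of rescanning the source:
cache(dataframeName)
distVals <- collect(distinct(select(dataframeName, "Column_name")))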

Reading subset of large data

I have a LARGE dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say column1 == A. How do I accomplish this in R using read.csv?
Thank you
You can't filter rows using read.csv. You might try sqldf::read.csv.sql as outlined in answers to this question.
But I think most people would process the file using another tool first. For example, csvkit allows filtering by rows.
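A sketch of the sqldf::read.csv.sql route mentioned above (the file name and column name are illustrative; inside the query the input is referred to as the table file):
library(sqldf)
subset_df <- read.csv.sql("big_data.csv",
                          sql = "select * from file where column1 = 'A'")
The filtering happens in SQLite before the result is handed back to R, so only the matching rows occupy memory.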

Filter tbl_sqlite column by string comparing with another column - dplyr

This may be a novice question. How do I filter rows by matching string values in one column against values from another column using dplyr when working with a database?
e.g. I want to do something like this:
install.packages("nycflights13")
library(nycflights13)
head(flights)
Let's say I want to filter for rows whose origin value is contained in the dest values. I tried
filter(flights_sqlite, origin %in% (unique(select(flights_sqlite, dest))))
However, that operation is not allowed. I do not want to convert this to a data frame, because the database I am working with is large and pulling it in would eat up any available RAM.
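One way to express this so that the work stays inside the database (a sketch, not from the original question; it assumes flights_sqlite is a dplyr tbl pointing at the flights table) is a semi join between the two columns:
library(dplyr)
dests <- distinct(select(flights_sqlite, dest))
result <- semi_join(flights_sqlite, dests, by = c("origin" = "dest"))
Because both inputs are database tables, dplyr translates the semi join to SQL, and nothing is pulled into R until the result is collected.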

Optimizing File reading in R

My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the name of a gene (3 or 4 at a time), and based on the user input the app goes to the appropriate row and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info about the gene name, etc.) and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row, then read the 35,000 columns of data. I then do this again for the 2nd and 3rd gene (up to 4 genes max) and then process the numerical results.
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will accelerate reading operations in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
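Combined with the skip-based read described in the question, that might look like this (the file name and row offset are illustrative):
gene_row <- read.table("genes.txt",
                       skip = 10000, nrows = 1,
                       colClasses = c(rep("character", 2),
                                      rep("numeric", 34998)))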
This would be more efficient if you used a database interface. There are several available via the RODBC package, but a particularly well-integrated-with-R option is the sqldf package, which by default uses SQLite. You would then be able to use the indexing capability of the database to look up the correct rows and read all the columns in one operation.
