I got "Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server" for some operations in RStudio (Watson Studio) when I tried to do data manipulation on Spark data frames.
Background:
The data is stored on IBM Cloud Object Storage (COS). Eventually it will be several 10 GB files, but currently I'm testing on only the first 10 GB subset.
The workflow, as I envision it, is: in RStudio (Watson Studio), connect to Spark (free plan) using sparklyr, read the file as a Spark data frame via sparklyr::spark_read_csv(), then apply feature transformations (e.g., split one column into two, compute the difference between two columns, remove unwanted columns, filter out unwanted rows, etc.). After preprocessing, save the cleaned data back to COS via sparklyr::spark_write_csv().
To work with Spark I added two Spark services to the project (it seems any Spark service under the account can be used by RStudio; RStudio is not limited to the project?). I may need to use R notebooks for data exploration (to show plots in a nice way), which is why I created the project. In previous testing I found that R notebooks and RStudio cannot use the same Spark service at the same time, so I created two Spark services: the first for R notebooks (call it spark-1) and the second for RStudio (call it spark-2).
As I personally prefer sparklyr (pre-installed only in RStudio) over SparkR (pre-installed only in R notebooks), I have spent almost the whole week developing and testing code in RStudio using spark-2.
I'm not very familiar with Spark, and it currently behaves in ways I don't really understand. It would be very helpful if anyone could offer suggestions on any of these issues:
1) failure to load data (occasionally)
It worked quite stably until yesterday, when I started to encounter issues loading data with exactly the same code. The error message reveals nothing; R simply fails to fetch the data (Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server). What I have observed several times is that after getting this error, if I rerun the data-import code (just one line), the data loads successfully.
Q1 screenshot
2) failure to apply (possibly) large amount of transformations (always, regardless of data size)
To check whether the data is transformed correctly, I print out the first several rows of the variables of interest after each transformation step (most steps are independent, i.e., their order doesn't matter). I have read a little about how sparklyr translates operations: essentially, sparklyr doesn't actually apply the transformations to the data until you preview or print some of the transformed data. After a set of transformations, if I run a few more and then print the first several rows, I get an error (the same unhelpful error as in Q1). I'm sure the code is correct, because if I run those additional steps right after loading the data, I can print and preview the first several rows.
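To illustrate what I mean by lazy translation, here is a minimal sketch (the connection settings, file path, and column names col_a/col_b are placeholders, not my actual setup):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")                        # placeholder connection
df <- spark_read_csv(sc, name = "data", path = "data.csv")   # placeholder path

# Nothing executes on the cluster here: sparklyr only builds up a SQL plan
df2 <- df %>%
  mutate(diff = col_a - col_b) %>%   # col_a / col_b are made-up column names
  filter(diff > 0) %>%
  select(-col_b)

# Execution is triggered only by actions such as these:
head(df2)                            # preview a few rows
df3 <- compute(df2, "df2_cached")    # materialize the intermediate result in Spark
collect(df2)                         # pull the result down into R
```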
3) failure to collect data (always for the first subset)
By collecting data I mean pulling the data frame down to the local machine, here RStudio in Watson Studio. After applying the same set of transformations, I am able to collect the cleaned version of a sample data set (originally 1,000 rows x 158 cols, about 1,000 rows x 90 cols after preprocessing), but it fails on the first 10 GB subset file (originally 25,000,000 rows x 158 cols, at most 50,000 rows x 90 cols after preprocessing). By my estimate the result should not exceed 200 MB, which means it should fit into either Spark's RAM (1210 MB) or RStudio's RAM. But it just fails (again with that unhelpful error).
4) failure to save out data (always, regardless of data size)
The same error happens every time I try to write the data back to COS. I suppose this has something to do with the transformations; perhaps something goes wrong when Spark receives too many transformation requests?
5) failure to initialize Spark (some kind of pattern found)
Starting this afternoon, I cannot initialize spark-2, which I have been using for about a week; I get the same unhelpful error message. However, I am able to connect to spark-1.
I checked the spark instance information on IBM Cloud:
spark-2
spark-1
It's weird that spark-2 shows 67 active tasks, since my previous operations all ended in error messages. Also, I'm not sure why "Input" is so large in both Spark instances.
Does anyone know what happened here, and why?
Thank you!
I am trying to run a clustering exercise in R. The algorithm I used is apcluster(). The script I used is:
s1 <- negDistMat(df, r=2, method="euclidean")
apcluster <- apcluster(s1)
My data set has around 0.1 million (100,000) rows. When I ran the script, I got the following error:
Error in simpleDist(x[, sapply(x, is.numeric)], sel, method = method, :
negative length vectors are not allowed
When I searched online, I found that the "negative length vectors" error occurs when the memory limit of my RAM is exceeded. My question is whether there is any workaround to run apcluster() on my 100,000-row data set with the available RAM, or whether I am missing something I need to take care of when running apcluster in R.
I have a machine with 8 GB of RAM.
The standard version of affinity propagation implemented in apcluster() will never run successfully on data of that size. For one thing, the similarity matrix (s1 in your code sample) would have 100K x 100K = 10G entries, which at 8 bytes each is roughly 80 GB, far beyond your 8 GB of RAM. For another, computation times would be excessive. I suggest you use apclusterL() (leveraged affinity propagation) instead.
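A rough sketch of how that could look (untested here; df stands for your data, and the sample size of 1000 is just a placeholder you would need to tune):

```r
library(apcluster)

# Leveraged AP avoids building the full 100K x 100K similarity matrix by
# computing similarities only against a random subset of the data points.
sel <- sort(sample(nrow(df), 1000))   # random sample of rows; size is a guess
res <- apclusterL(s = negDistMat(r = 2, method = "euclidean"),
                  x = as.matrix(df),
                  sel = sel)
```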
I am attempting to download data consisting of approximately 1 million jpg files for which I have individual URLs and desired file names. The images have a mean file size of approximately 120 KB and range from 1 KB to 1 MB. I would like to use R to download the images.
I've tried a few things and eventually figured out a way that has let me download all million images in under three hours. My current strategy works, but it is a somewhat absurd solution that I would prefer not to use ever again, and I'm baffled as to why it even works. I would like to understand what's going on and to find a more elegant and efficient way of achieving the same result.
I started out with mapply and download.file(), but this only managed a rate of 2 images per second. Next, I parallelized the process with the parallel package. This was very effective and improved the rate to 9 images per second. I assumed that would be the most I could achieve, but I noticed that the resources being used by my modest laptop were nowhere near capacity. I checked to make sure there wasn't a significant disk or network access bottleneck, and sure enough, neither was experiencing much more than ~10% of its capacity.
So I split up the URL information and opened a new R console window, where I ran a second instance of the same script on a different segment of the data, achieving 18 images per second. I then continued to open more and more instances, giving each a unique section of the full URL list. It was not until I had 12 open that there was any hint of slowing down. Each instance gave a nearly linear increase in downloads per second, and with some memory management I approached my maximum download speed of 13 MB/s.
I have attached a graph showing the approximate total images being downloaded per second as a function of the number of instances running.
Also attached is a screenshot of my resource monitor while 10 simultaneous instances of R were running.
I find this result very surprising and I don't quite understand why this should be possible. What's making each individual script run so slowly? If the computer can run 12 instances of this code with little to no diminishing returns, what prevents it from just running 12 times as fast? Is there a way to achieve the same thing without having to open up entirely new R environments?
Here is the code I am asking about specifically. Unfortunately I cannot disclose the original URLs, but the script is nearly identical to what I am using. I have replaced my data with a few CC images from Wikimedia. For better replication, please replace "images" with your own large URL list if you have access to one.
library(parallel)
library(data.table)

images <- data.table(
  file = c(
    "Otter.jpg",
    "Ocean_Ferret.jpg",
    "Aquatic_Cat.jpg",
    "Amphibious_Snake_Dog.jpg"
  ),
  url = c(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Otter_and_Bamboo_Wall_%2822222758789%29.jpg/640px-Otter_and_Bamboo_Wall_%2822222758789%29.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Otter_Looking_Back_%2817939094316%29.jpg/640px-Otter_Looking_Back_%2817939094316%29.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Otter1_%2814995327039%29.jpg/563px-Otter1_%2814995327039%29.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Otter_Profile_%2817962452452%29.jpg/640px-Otter_Profile_%2817962452452%29.jpg"
  ) # Full URLs are redundant and unnecessary, but I kept them in case there was
    # some performance advantage over building the strings inside download.file()
)
# Download with mapply (just for benchmarking, not actually used in the script)
system.time(
  mapply(
    function(x, y) download.file(x, y, mode = 'wb', quiet = TRUE),
    x = images$url,
    y = images$file,
    SIMPLIFY = "vector",
    USE.NAMES = FALSE
  )
)
# Parallel download with clusterMap (this is what each instance runs;
# I give each instance a different portion of the images data table)
cl <- makeCluster(detectCores())
system.time(
  clusterMap(
    cl,
    download.file,
    url = images$url,
    destfile = images$file,
    quiet = TRUE,
    mode = 'wb',
    .scheduling = 'dynamic',
    SIMPLIFY = 'vector',
    USE.NAMES = FALSE
  )
)
stopCluster(cl)  # release the workers when done
In summary, the questions I am asking are:
1) Why is my solution behaving this way? More specifically, why is one script not fully utilizing my computer's resources?
2) What is a better way to achieve the following with R: downloading 120 GB composed of one million jpeg images directly via their URLs in under 3 hours?
Thank you in advance.
cl <- makeCluster(detectCores())
This line says to make a backend cluster with a number of nodes equal to your cores. That would probably be 2, 4 or 8, depending on how beefy a machine you have.
Since, as you noticed, the downloading process isn't CPU-bound, there's nothing stopping you from making the cluster as big as you want. Replace that line with something like
cl <- makeCluster(50)
and you'll have 50 R sessions downloading in parallel. Increase the number until you hit your bandwidth or memory limit.
I have a simple analysis to do: I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? Yet I have been unable to get results for a whole week, and I have looked through most of the solutions here.
My laptop has 4 GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information, or perhaps part of the data, I can try to put it up, but the problem can be explained without actually seeing the data. I simply need a correlation matrix of size 411k x 411k, which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The simplest way: one row against all, writing the result out using append = TRUE. (Runs forever.)
bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (Unable to allocate memory for the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10,000 consecutive rows form a set) and computing the correlation of each set against the others (same logic as bigcorPar), but I am unable to find a way to store them all together to generate the final 411k x 411k matrix.
I am attempting this now: bigcorPar.r on 10,000 rows against all 411k (so the 10,000 are divided into blocks), saving the results in separate csv files.
I am also attempting to run every 1,000 vs 411k on one node of my server; today is my third day and I am still on row 71.
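The split-into-sets idea above can be sketched in base R; the sizes here are shrunk for illustration, and each block of the final matrix is written to its own csv in the working directory so the whole matrix never has to sit in RAM:

```r
# Blockwise correlation: compute the full v x v correlation matrix of the
# columns of x in blocks, writing each block to a separate csv file.
set.seed(1)
x <- matrix(rnorm(100 * 40), nrow = 100, ncol = 40)  # 100 obs, 40 variables
bs <- 10                                             # block size (e.g. 10000 for real data)
starts <- seq(1, ncol(x), by = bs)

for (i in starts) {
  for (j in starts) {
    bi <- i:min(i + bs - 1, ncol(x))
    bj <- j:min(j + bs - 1, ncol(x))
    cm <- cor(x[, bi], x[, bj])      # one bs x bs block of the big matrix
    write.csv(cm, sprintf("cor_block_%d_%d.csv", i, j), row.names = FALSE)
  }
}
```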
I am not an R pro so I could attempt only this much. Either my codes run forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd first like to warn you that in binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.35 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
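That back-of-the-envelope figure is easy to verify in R:

```r
# Storage needed for a dense 411,000 x 411,000 matrix of 8-byte doubles
n <- 411000
bytes <- n^2 * 8
bytes / 1e12   # terabytes
```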
So I've got a data file (semicolon-separated) with a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
Constraints
Needs to use my machine, so no EC2 instances
As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
As you can see below, the data contains mixed types, which I need to operate on later
Data
The data is 3.5GB, with about 8.5 million rows and 17 columns
A couple thousand rows (~2k) are malformed, with only one column instead of 17
These are entirely unimportant and can be dropped
I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open connections separately to hold the cursor position
file.in  <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')

# Copy the header, stitching together only the columns we want
line <- readLines(file.in, n = 1)
line.split <- strsplit(line, ';')
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)

## Use a loop to read in the rest of the lines
line <- readLines(file.in, n = 1)
while (length(line)) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {   # skip malformed one-column rows
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n = 1)
}
close(file.in)
close(file.out)
Failings by Approach:
sqldf
This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
MapReduce
To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
bigmemory
This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors were dropped when put into a big.matrix. If I need to design large data sets in the future, I'd consider using only numbers to keep this option alive.
scan
scan seemed to have similar type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
My try with readLines. This piece of code creates a csv with the selected years.
file_in  <- file("in.csv", "r")
file_out <- file("out.csv", "a")
x <- readLines(file_in, n = 1)
writeLines(x, file_out)  # copy headers

B <- 300000  # depends on how large one pack is
x <- readLines(file_in, n = B)
while (length(x)) {
  # keep rows whose third field is 2009 or 2010 (optional space after ';')
  ind <- grep("^[^;]*;[^;]*; ?20(09|10)", x)
  if (length(ind)) writeLines(x[ind], file_out)
  x <- readLines(file_in, n = B)
}
close(file_in)
close(file_out)
I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:
mapReduce (pure R)
RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files
Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
Is there a similar way to read in files a piece at a time in R?
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
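A minimal sketch of that read-a-block / operate / write pattern with readLines (the file names and column layout here are stand-ins for the real data; the snippet creates its own tiny demo input):

```r
# Create a tiny demo input (stand-in for the real 3.5 GB file)
writeLines(c("County;State;Year;GDP",
             "Ada;NC;2009;80.1",
             "Ada;NC;2013;9.9",
             "Ada;NC;2010;82.5"), "in.csv")

# Stream the file in blocks of lines, keeping only rows for 2009/2010
con_in  <- file("in.csv", "r")
con_out <- file("out.csv", "w")
writeLines(readLines(con_in, n = 1), con_out)   # copy the header

repeat {
  block <- readLines(con_in, n = 100000)        # one block at a time
  if (length(block) == 0) break
  fields <- strsplit(block, ";", fixed = TRUE)
  keep <- vapply(fields,
                 function(f) length(f) >= 3 && f[3] %in% c("2009", "2010"),
                 logical(1))                    # also drops malformed rows
  if (any(keep)) writeLines(block[keep], con_out)
}
close(con_in)
close(con_out)
```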
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R you should at least research the sqldf package which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.
The ff package is a transparent way to deal with huge files.
You can look at the package website and/or a presentation about it.
I hope this helps
What about using readr and the read_*_chunked family?
So in your case:
testfile.csv
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5
Actual code
require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)
This applies f to each chunk, remembering the column names and combining the filtered results at the end. See ?callback, which is the source of this example.
This results in:
# A tibble: 2 × 8
County State Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment` GDP
* <chr> <chr> <int> <int> <chr> <chr> <chr> <dbl>
1 Ada County NC 2009 4 FIRE Financial Banks 801
2 Ada County NC 2010 1 FIRE Financial Banks 825
You can even increase chunk_size, but in this example there are only 4 lines. (Note that read_csv2 assumes a decimal comma, which is why 80.1 is parsed as 801 above.)
You could import data to SQLite database and then use RSQLite to select subsets.
Have you considered bigmemory?
Check out this and this.
Perhaps you can migrate to MySQL or PostgreSQL to free yourself from the limitations of MS Access.
It is quite easy to connect R to these systems with a DBI-based database connector (available on CRAN).
scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see whether it's relevant? If the input file is ordered by date, you can store an index telling you what your skip and nlines should be, which would speed up the process in the future.
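A minimal sketch of that skip/nlines idea (the file name, column layout, and chunk size are placeholders; the snippet creates its own tiny demo input):

```r
# Demo input standing in for the real semicolon-separated file
writeLines(c("County;State;Year;GDP",
             "Ada;NC;2009;80.1",
             "Ada;NC;2013;9.9"), "in.csv")

chunk_size <- 1000
skip <- 1                        # skip the header
kept <- list()
repeat {
  # Read the next chunk of raw lines; sep = "\n" keeps each line whole
  chunk <- scan("in.csv", what = character(), sep = "\n",
                skip = skip, nlines = chunk_size, quiet = TRUE)
  if (length(chunk) == 0) break
  fields <- strsplit(chunk, ";", fixed = TRUE)
  years  <- vapply(fields, function(f) f[3], character(1))
  kept[[length(kept) + 1]] <- chunk[years %in% "2009"]
  skip <- skip + chunk_size      # advance the index for the next chunk
}
result <- unlist(kept)
```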
These days, 3.5 GB just isn't really that big; I can get access to a machine with 244 GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data-type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on the free tier, upload the data, and read the first 10k lines into R to check that it worked; then you can fire up a big-memory instance like r3.8xlarge and read it all in! Just my 2c.
Nowadays (2017), I would suggest going for Spark and SparkR.
The syntax can be written in a simple, rather dplyr-like way.
It fits quite well with small memory (small in the 2017 sense of the word).
However, it may be an intimidating experience to get started...
I would go for a DB and then make some queries to extract the samples you need via DBI
Please avoid importing a 3.5 GB csv file into SQLite. Or at least double-check that your huge DB fits within SQLite's limits: http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed, but be prepared to wait many hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could be a good solution also for instantiating a server running R and MySQL.
My two humble pennies' worth.