R read.table() extremely slow

I have a simple piece of R code which looks like this:
for (B in 1:length(Files)) {
  InputDaten[, B] <- read.table(Files[B], header = FALSE, dec = ".", skip = 12, sep = ",", colClasses = c("numeric"))
}
So I read 1.39 GB of files into memory and would like to process them. However, reading takes about an hour. When I watch the occupied memory, it increases only about every 10 minutes; only during the last two minutes does the memory grow linearly with time. Why might that be? Can I make it faster?
Edit 1
This is how I initialised InputDaten:
InputDaten <- data.frame(c(1:15360), 444)
I have now tried fread; the result looks the same. Here is a screenshot of the memory usage after I started fread: the memory usage does not increase at all for a while (fread started approximately in the middle of the time frame).
http://pic-hoster.net/upload/57790/Unbenannt.png
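For comparison, here is a minimal sketch of an alternative pattern, not taken from the question: read every file once with data.table::fread, collect the results in a list, and build the data frame in a single step instead of assigning into a growing data frame inside the loop. Files is the vector from the question; the assumption that each file contributes one numeric column of 15360 values is mine, based on the initialisation above.
library(data.table)

# Read every file once; the fread arguments mirror the read.table call above.
# Assumption: each file holds a single numeric column of 15360 values.
cols <- lapply(Files, function(f)
  fread(f, header = FALSE, dec = ".", skip = 12, sep = ",",
        colClasses = "numeric")[[1L]])

# Build the full data frame in one step instead of growing it column by column.
InputDaten <- as.data.frame(do.call(cbind, cols))
names(InputDaten) <- basename(Files)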

Related

hvplot taking hours to render image

I'm working with Gaia astrometric data from Data Release 3 and saw hvplot/datashader recommended as the go-to for visualizing large data, thanks to very fast render times and interactivity. In every example I'm seeing, it takes a few seconds, on the slow end, to render an image from hundreds of millions of data points. However, when I try to use the same code on my data, it takes hours for any image to render at all.
For context, I'm running this code on a very large research computing cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be an issue here. Additionally, I've converted the data I need into a series of Parquet files that are read into a Dask dataframe with glob. My code is as follows:
...
import dask.dataframe as dd
import hvplot.dask
import glob
import colorcet as cc  # needed for cc.fire below

df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')
df = df[['col1', 'col2']]
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)
...
Does anybody have any idea what could be the issue here? Any help would be appreciated.
Edit: I've got the rendering time below an hour now by consolidating the data into a smaller number of larger files (3386 -> 175).
Hard to debug without access to the data, but one quick optimization you can implement is to avoid loading all the data and to select only the specific columns of interest:
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1', 'col2'])
Unless it's crucial, I'd also avoid the .astype call. It shouldn't be a bottleneck, but the gains from float32 may not be relevant if memory isn't a constraint.

The fastest way to read several huge .txt files IN A LOOP into R

This topic (Quickly reading very large tables as dataframes) investigates the same problem, but not in a loop. I have 1000 different .txt files, each one 200 MB with 1 million rows. What is the fastest way to read them in a loop?
I have tried the approaches below; the reported computation times are for a case of 10 files.
for (i in 1:10) {
  x <- read.delim()
  # do something
}
# Time: 89 sec
for (i in 1:10) {
  x <- read.table()
  # do something
}
# Time: 90 sec
for (i in 1:10) {
  x <- fread()
  # do something
}
# Time: 108 sec (!) (to my knowledge fread is supposed to be the fastest, but in this loop it is not)
foreach(i = 1:10) %dopar% {
  x <- read.delim()
  # do something
}
# Time: 83 sec
foreach(i = 1:10) %dopar% {
  x <- fread()
  # do something
}
# Time: 95 sec
I was told that the disk.frame package is the fastest; I could not try it yet. I need your thoughts, please. Can lapply be used to speed up the process?
Maybe lapply() could help, as you suggested:
myFiles <- list.files(pattern = "txt$")
myList <- lapply(myFiles, function(x) fread(x))
I am also surprised that fread takes longer than read.table for you. When I had large files, fread really helped to read them in faster.
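If all files share the same columns and the goal is a single big table, the list can then be combined in one step with data.table::rbindlist, which is typically much faster than calling rbind repeatedly. A minimal sketch, assuming the files really do have identical columns:
library(data.table)

myFiles <- list.files(pattern = "txt$")
myList  <- lapply(myFiles, fread)   # read each file into a data.table
bigDT   <- rbindlist(myList)        # bind all of them row-wise in one call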
I'm adding this as an answer to get some more space than in the comments.
Working fast with 'big data'
200 GB of text files is reasonably big data, which requires either significant effort to speed up the processing or a significant wait time. There's no easy way around it ;)
- you need to get your data into memory to start any work
  - it is fastest to read your files one by one (NOT in parallel) when reading from a single hard drive
  - measure how much time it takes to load the data without parsing (a small R timing sketch follows this list)
  - your load time for multiple similar files will be just a multiple of the single-file time; you can't get any magic improvements here
  - to improve the load time you can compress the input files - it pays off only if you'll be using the same data source multiple times (after compression, fewer bytes must cross the hard drive -> memory boundary, which is slow)
  - when choosing how to compress the data, aim for load(compressed) + decompress time to be smaller than load(uncompressed)
- you need to parse the raw data
  - measure how much time it takes to parse the data
  - if you cannot separate the parsing, measure how much time it takes to load and parse the data; the parse time is then the difference from the previously measured load time
  - parsing can be parallelized, but it only makes sense if it is a substantial part of the load time
- you need to do your thing
  - this usually can be done in parallel
- you need to save the results
  - unless the results are as huge as the input, you don't care
  - if they're huge, you need to serialize your IO again, that is, save the results one by one, not in parallel
  - again, compression helps, if you choose an algorithm and settings where compression time + write time is smaller than the write time of the uncompressed data
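As a rough illustration (not part of the original answer), the load vs. parse split above can also be estimated from within R. A minimal sketch, using the same hypothetical mydata.txt as the shell examples below: compare a raw byte read with a full fread parse and take the difference.
library(data.table)

f <- "mydata.txt"   # hypothetical file, as in the shell examples below

# Raw load: read the bytes without any parsing (mind the disk cache on repeated runs).
load.time  <- system.time(raw.bytes <- readBin(f, what = "raw", n = file.size(f)))["elapsed"]

# Load + parse: read the same file into a data.table.
total.time <- system.time(dt <- fread(f))["elapsed"]

parse.time <- total.time - load.time   # parsing cost is roughly the difference
cat("load:", load.time, "s  parse (approx.):", parse.time, "s\n")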
To get raw load times, bash is your friend. Using pipe viewer or the built-in time, you can easily check the time it takes to read through a file by doing:
pv mydata.txt > /dev/null
# alternatively
time cat mydata.txt > /dev/null
Be aware that your disk cache will kick in when you repeatedly measure a single file.
As for compression: if you're stuck with R, gzip is the only reasonable option. If you'll do some pre-processing in bash, lz4 is the tool of choice, because it's really fast at decent compression ratios.
gzip -3 mydata.txt
pv mydata.txt.gz | zcat > /dev/null
Here we're getting to the pre-processing. It pays off to use UNIX tools, which tend to be really fast, to pre-process the data before loading it into R. You can filter columns with cut and filter rows with mawk (which is often much faster than gawk).
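To tie this back to R: recent versions of data.table::fread can run such a shell pipeline for you through the cmd argument, so the cut/mawk pre-filtering happens before the data ever reaches R. A minimal sketch, again using the hypothetical comma-separated mydata.txt:
library(data.table)

# Let cut drop unneeded columns in the shell, then parse only the reduced stream.
dt <- fread(cmd = "cut -d',' -f1,3 mydata.txt")
# mawk could be appended to the pipe in the same way to filter rows before parsing.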

R accumulating memory in each iteration with large input files

I am reading around 20,000 text files in a for loop for sentiment analysis. Each file is around 20-40 MB in size. In each iteration I extract some sentiment counts (just two numbers) from the input text and store them in a data frame. The issue is that R's memory usage grows cumulatively with each iteration: after 10,000 files I see around 13 GB allocated to R in my task manager. I tried gc() and rm() to delete objects after each iteration, but it still does not help. Since I reuse the same objects in every iteration, I would expect R to release the memory used in previous iterations, but it does not.
for (i in 1:20000) {
  filename <- paste0("file_", i, ".txt")
  text <- readLines(filename)
  # Doing sentiment analysis based on a dictionary-based approach
  # Storing sentiment counts in a data frame
  # Removing used objects
  rm(filename, text)
  gc()
}
You could try to check which objects are taking up memory that you do not use anymore:
print(sapply(ls(), function(x) pryr::object_size(get(x))/1024/1024))
(EDIT: just saw the comment with this almost identical advice)
This line will give you the size in megabytes of every object present in the environment (in RAM).
Alternatively, if nothing stands out, you can call gc() several times instead of once, like this:
rm(filename, text)
for (i in 1:3) gc()
It is usually more effective.
If nothing works, that could mean the memory is fragmented: RAM is free but unusable because it is scattered between data you still use.
The solution could be to run your script in chunks of files, say 1000 at a time.
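A minimal sketch of that chunked approach (the file-name pattern and the process_chunk helper are hypothetical placeholders for the sentiment-counting code above; the point is only the structure of splitting the work and forcing a cleanup between chunks):
files  <- paste0("file_", 1:20000, ".txt")
chunks <- split(files, ceiling(seq_along(files) / 1000))  # groups of 1000 files

results <- list()
for (k in seq_along(chunks)) {
  # process_chunk() stands in for reading one chunk of files and returning
  # a small data frame of sentiment counts for that chunk.
  results[[k]] <- process_chunk(chunks[[k]])
  gc()  # give R a chance to release memory between chunks
}
final <- do.call(rbind, results)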

Store an increasing matrix into HDD and not in memory

I'm facing a pretty expected problem while iteratively running the code below, which creates all possible combinations for a specified sequence and then stores them in the final.grid variable. The thing is that there is not only one sequence but hundreds of thousands of them, and each one can have plenty of combinations.
for (...) {
  combs <- get.all.combs(sequence)
  final.grid <- rbind(final.grid, combs)
}
Anyway, I tried to run my code on a Windows PC with 4 GB of RAM, and after 4 hours (not even half of the combinations calculated) R returned this error:
Error: cannot allocate vector of size 4.0 Gb
What I thought of as a solution is to write final.grid to a file after each iteration, free the allocated memory and continue. The truth is that I have no experience with such implementations in R, and I don't know which solution to choose or whether some of them will do better and more efficiently. Keep in mind that my final grid will probably need some GBs.
Somewhere on Stack Exchange I read about the ff package, but there was not enough discussion on the subject (at least I didn't find it), so I preferred to ask here for your opinions.
Thanks
I cannot understand your question very well, because the piece of code you posted is not enough to figure out your problem.
But you can try saving your results as .RData or .nc files, depending on the nature of your data. However, it would be better if you were more explicit about your problem, for instance by showing the code behind the get.all.combs function or the sequence data.
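As an illustration of the "write after each iteration" idea from the question (not the answerer's code): each batch of combinations can be appended to a single CSV on disk instead of being rbind-ed into a growing object. A rough sketch, assuming get.all.combs() returns a data frame and that sequences (a hypothetical name) holds the list of sequences:
out.file <- "final_grid.csv"

for (s in sequences) {
  combs <- get.all.combs(s)
  # Append this batch to disk; write the header only for the first batch.
  write.table(combs, out.file, sep = ",",
              append = file.exists(out.file),
              col.names = !file.exists(out.file),
              row.names = FALSE)
  rm(combs); gc()   # free the memory before the next sequence
}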
One thing you can try is the memory.limit() function, to see if you can allocate enough memory for your work. This may not help if your Windows OS is 32-bit.
If you have large data objects that you don't need for some parts of your program, you could first save them, then remove them with rm(), and load them again when you need them.
The link below has more info that could be useful to you.
Increasing (or decreasing) the memory available to R processes
EDIT:
You can use the object.size function to see the memory requirement of the objects you have. If they are too big, try loading them only when you need them.
It is possible that one of the functions you use tries to allocate more memory than you have. Try to find out where exactly the program crashes.

R using waaay more memory than expected

I have an Rscript being called from a Java program. The purpose of the script is to automatically generate a bunch of graphs in ggplot and then splat them onto a PDF. It has grown somewhat large, with maybe 30 graphs, each of which is called from its own script.
The input is a tab-delimited file of 5-20 MB, but the R session sometimes goes up to 12 GB of RAM usage (on a Mac, 10.6.8, by the way, but this will be run on all platforms).
I have read about how to look at the memory size of objects, and nothing is ever over 25 MB; even if R deep-copied everything for every function and every filter step, it shouldn't get close to this level.
I have also tried gc() to no avail. If I do gcinfo(TRUE) and then gc(), it tells me that it is using something like 38 MB of RAM. But the Activity Monitor goes up to 12 GB and things slow down, presumably due to paging on the HD.
I tried calling it via a bash script in which I did ulimit -v 800000, but no good.
What else can I do?
In the process of making assignments, R will always make temporary copies, sometimes more than one or even two. Each temporary assignment requires contiguous memory for the full size of the allocated object. So the usual advice is to plan to have at least three times that amount of contiguous memory available. This means you also need to be concerned about how many other non-R programs are competing for system resources, as well as being aware of how your memory is being used by R. You could try restarting your computer, running only R, and seeing if that helps.
An input file of 20 MB might expand quite a bit (8 bytes per double, and perhaps more per character element in your vectors), depending on the structure of the file. The PDF file object will also take quite a bit of space if you are plotting each point within a large file.
My experience is not the same as that of others who have commented: I do issue gc() before memory-intensive operations. You should offer code and describe what you mean by "no good". Are you getting errors, or observing the use of virtual memory ... or what?
I apologize for not posting a more comprehensive description with code; it was fairly long, as was the input. But the responses I got here were still quite helpful. Here is how I mostly fixed my problem.
I had a variable number of columns which, with some outliers, got very numerous. But I didn't need the extreme outliers, so I just excluded them and cut off those extra columns. This alone decreased the memory usage greatly. I hadn't looked at the virtual memory usage before, but sometimes it was as high as 200 GB, lol. This brought it down to at most 2 GB.
Each graph was created in its own function. So I rearranged the code such that every graph was first generated, then printed to the PDF, then removed with rm(graphname).
Further, I had many loops in which I was creating new columns in data frames. Instead of doing this, I just created vectors not attached to data frames in these calculations. This actually had the benefit of greatly simplifying some of the code.
After no longer adding columns to the existing data frames and instead making column vectors, memory use dropped to 400 MB. While this is still more than I would expect it to use, it is well within my restrictions. My users are all in my company, so I have some control over which computers it gets run on.
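For illustration, here is a minimal sketch of the generate-then-print-then-remove pattern described above (the tiny graph functions and data are placeholders; only the structure is the point):
library(ggplot2)

# Placeholder graph builders standing in for the ~30 real graph functions.
graph_fns <- list(
  function(d) ggplot(d, aes(x, y)) + geom_point(),
  function(d) ggplot(d, aes(x, y)) + geom_line()
)
d <- data.frame(x = 1:10, y = rnorm(10))

pdf("report.pdf")              # open one PDF device for all graphs
for (make_graph in graph_fns) {
  g <- make_graph(d)           # build one ggplot object at a time
  print(g)                     # render it onto the current PDF page
  rm(g); gc()                  # drop the object before building the next one
}
dev.off()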
