reading a huge csv file using cudf - cudf

I am trying to read a huge csv file CUDF but gets memory issues.
import cudf
cudf.set_allocator("managed")
cudf.__version__
user_wine_rate_df = cudf.read_csv('myfile.csv',
sep = "\t",
parse_dates = ['created_at'])
'0.17.0a+382.gbd321d1e93'
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)
If I remove cudf.set_allocator("managed")
I get
MemoryError: std::bad_alloc: CUDA error at: /opt/conda/envs/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
I am using CUDF through rapidsai/rapidsai:cuda11.0-runtime-ubuntu16.04-py3.8
I wonder whar could be the reason of hitting memory, while I can read this big file with pandas
**Update
I installed dask_cudf
and used dask_cudf.read_csv('myfile.csv') - but still get the
parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered

If the file you are reading is larger than the memory available then you will observe an OOM(Out Of Memory) error as cuDF runs on a sigle GPU.
In order to read files which are very large I would recommend using dask_cudf.

Check out this blog by Nick Becker on reading larger than GPU memory files. It should get you on your way.

Related

DBD::SQLite::db commit failed: disk I/O error

I have a system writing data to an sqlite file. I had everything operational under CentOS 8. After upgrading the system to Rocky Linux 9 I see this error when running a commit command: DBD::SQLite::db commit failed: disk I/O error
I have checked file permissions, disk space, SMART readings, everything disk related that I can think of but without success.
Has anyone encountered this error before? What could I try to fix it?
The problem turned out to be a missing Perl module (LWP::https) that was causing DBD::SQLite not to get the data it wanted. Apparently, DBD::SQLite says Disk I/O error for that case.

SQLite3 Disk IO Error

I'm experiencing a disk I/O error during a query on a SQL database. I have several large (95gb) db with the same schema, and am trying to run the same query on all of them. Two run fine, returning ~28,000,000 results; one hits a 'Disk I/O error', both when run via SQLalchemy and command line.
If I limit the query to return ~10,000,000 results, I don't get the error- but I need the full output as these are photons from a large physics simulation.
The full error message in SQLalchemy is:
Traceback (most recent call last):
File "/Users/swm1n12/anaconda/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1120, in fetchall
l = self.process_rows(self._fetchall_impl())
File "/Users/swm1n12/anaconda/lib/python3.5/site-packages/sqlalchemy/engine/result.py", line 1071, in _fetchall_impl
return self.cursor.fetchall()
sqlite3.OperationalError: disk I/O error
And for sqlite3 on the command line:
Error: disk I/O error
Can anyone tell me how I'd be able to figure out what's going wrong?
(I'm aware that for DB of this size, SQLite is not a great choice- they turned out to be bigger than I expected and ~academic publishing pressures~ means retooling SQL version and rerunning the code is not practical)
In the sqlite3 command-line shell, you can use .log stderr to see the OS errors.
But "disk I/O error" means that there is an error on your disk. Throw it away, and restore the database from the backup.

(R error) Error: cons memory exhausted (limit reached?)

I am working with big data and I have a 70GB JSON file.
I am using jsonlite library to load in the file into memory.
I have tried AWS EC2 x1.16large machine (976 GB RAM) to perform this load but R breaks with the error:
Error: cons memory exhausted (limit reached?)
after loading in 1,116,500 records.
Thinking that I do not have enough RAM, I tried to load in the same JSON on a bigger EC2 machine with 1.95TB of RAM.
The process still broke after loading 1,116,500 records. I am using R version 3.1.1 and I am executing it using --vanilla option. All other settings are default.
here is the code:
library(jsonlite)
data <- jsonlite::stream_in(file('one.json'))
Any ideas?
There is a handler argument to stream_in that allows to handle big data. So you could write the parsed data to a file or filter the unneeded data.

Git-svn out of memory

I'm trying to clone a reasonably big svn repository with git-svn and at a certain point I get a error message:
Failure loading plugin: APR: Can't create a character converter from 'UTF-8' to native encoding: Cannot allocate memory at /usr/libexec/git-core/git-svn line 5061
And sometimes a
Cannot allocate memory: zlib (compress2): out of memory: Compression of svndiff data failed at /usr/libexec/git-core/git-svn line 5061
error message. I still have ~3GB RAM free. What should I do so git-svn can utilize it?
(I'm doing this on RedHat Enterprise Linux 6.5 if that makes any difference)
From:
This error message is about the memory git is trying to allocate --
it's more than what is free. This is most likely caused by a large
file having been checked into SVN. Unfortunately, there's no easy way
to fix it (apart from buying more memory) -- you would have to remove
the large file and the commit adding it from SVN.
However try following:
Increase swap memory
Increase ulimit

R: Cannot allocate memory greater than x MB

I have a main function in R which calls other files to run my program. I call the main file through a bat file(.exe). When I run it line-by-line it runs without a memory error, but when I call the bat file to run it, it halts and gives me the following error:
Cannot allocate memory greater than 51 MB.
How can I avoid this?
Memory limitations in R such as this are a recurring nightmare for a lot of us.
Very often the problem is a limit imposed by your OS limits (which can usually be changed on a Bash or PowerShell command line), architecture (32 v. 64 bit), or the availability of contiguous free RAM, irregardless of overall available memory.
It's hard to say why something would not cause a memory issue when run line by line, but would hit the memory limit when run as a .bat.
What version of R are you running? Do you have both installed? Is 32-bit being called by Rscript when you run your .bat file whereas you run a 64-bit version line by line? You can check the version of R that's being run with R.Version().
You can test this by running the command memory.limit() in both your R IDE/terminal and in your .bat file (be sure to print or save the result as an object in your .bat file). You might also do well to try setting memory.limit() in your .bat file, as it may just have a smaller default, perhaps due to differences in your R Profile that's invoked in your IDE or terminal versus the .bat file.
If architecture isn't the cause of your memory error, then you have several more troubleshooting steps to try:
Check memory usage in both environments (in R directly and via your .bat process) using this:
sort( sapply(ls(),function(x){object.size(get(x))}))
Run the garbage collector explicitly in your scripts, that's the gc() command
Check all object sizes to make sure there are no unexpected results in your .bat process: sort( sapply(ls(),function(x){format(object.size(get(x)), units = "Mb")}))
Try memory profiling:
Rprof(tf <- "rprof.log", memory.profiling=TRUE)
Rprof(NULL)
summaryRprof(tf)
While this is a RAM issue, for good measure you might want to check that the compute power available is both sufficient and not varying between these two ways of running your code: parallel::detectCores()
Examine your performance with Prof. Hadley Wikham's lineprof tool (warning: requires devtools and doesn't work on lines of code which call the C programming language)
References While I'm pulling these snippets out of my own code, most of them originally came from other, related StackOverflow posts, such as:
Reaching memory allocation in R
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
R memory limit warning vs "unable to allocate..."
How to compute the size of the allocated memory for a general type
R : Any other solution to "cannot allocate vector size n mb" in R?
Yes you should be using 64bit R, if you can.
See this question, and this from the R docs.

Resources