I am working with big data and I have a 70 GB JSON file.
I am using the jsonlite library to load the file into memory.
I tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load, but R fails with the error:
Error: cons memory exhausted (limit reached?)
after loading 1,116,500 records.
Thinking that I did not have enough RAM, I tried to load the same JSON on a bigger EC2 machine with 1.95 TB of RAM.
The process still broke after loading 1,116,500 records. I am using R version 3.1.1 and executing it with the --vanilla option. All other settings are defaults.
Here is the code:
library(jsonlite)
data <- jsonlite::stream_in(file('one.json'))
Any ideas?
There is a handler argument to stream_in that lets you process the data in batches instead of accumulating everything in memory. You could, for example, write each parsed batch to a file or filter out the data you do not need, as in the sketch below.
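A minimal sketch, assuming the records have id and value fields (hypothetical column names) and that you only want to keep those two columns:

library(jsonlite)

out <- file("filtered.json", open = "w")
# The handler is called once per page of parsed records (a data frame),
# so memory use stays bounded by pagesize rather than by the file size.
stream_in(file("one.json"), handler = function(df) {
  keep <- df[, c("id", "value")]         # hypothetical columns to retain
  stream_out(keep, out, verbose = FALSE) # append the filtered batch to disk
}, pagesize = 5000)
close(out)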
Related
I am trying to read a huge CSV file with cuDF, but I get memory issues.
import cudf
cudf.set_allocator("managed")
cudf.__version__
# '0.17.0a+382.gbd321d1e93'
user_wine_rate_df = cudf.read_csv('myfile.csv',
                                  sep = "\t",
                                  parse_dates = ['created_at'])
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)
If I remove cudf.set_allocator("managed"), I get:
MemoryError: std::bad_alloc: CUDA error at: /opt/conda/envs/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
I am using cuDF through the rapidsai/rapidsai:cuda11.0-runtime-ubuntu16.04-py3.8 container.
I wonder what could be the reason for hitting memory limits, while I can read this big file with pandas.
Update:
I installed dask_cudf and used dask_cudf.read_csv('myfile.csv'), but I still get the
parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
error.
If the file you are reading is larger than the available GPU memory, then you will observe an OOM (out-of-memory) error, as cuDF runs on a single GPU.
To read very large files, I would recommend using dask_cudf.
Check out this blog by Nick Becker on reading files larger than GPU memory. It should get you on your way.
The object in the S3 bucket is 5.3 GB in size. To read the object into data, I used get_object("link to bucket path"), but this leads to memory issues.
So I installed Spark 2.3.0 through RStudio and am trying to load this object directly into Spark, but I do not know the command to do so.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
If I converted the object into a readable data type in R (such as a data.frame/tbl), I would use copy_to to transfer the data from R into Spark, as below:
# Copy data to Spark
spark_tbl <- copy_to(sc, data)
I was wondering how I can read the object directly into Spark instead?
Relevant links:
https://github.com/cloudyr/aws.s3/issues/170
Sparklyr connection to S3 bucket throwing up error
Any guidance would be sincerely appreciated.
Solution.
I was trying to read a 5.3 GB CSV file from an S3 bucket. Since R reads the entire object into memory in a single process, it was giving memory issues (IO exceptions).
The solution is to load sparklyr in R (library(sparklyr)) so that Spark does the reading and all the cores on the machine are utilized.
get_object("link to bucket path") can then be replaced by
spark_read_csv(sc, name, "link to bucket path"), where sc is the Spark connection and name is the name to register the data under in Spark. Since Spark reads the file without pulling it all into R's memory at once, there are no memory issues (see the sketch after the function list below).
Also, depending on the file format, you can use a different read (or write) function:
spark_load_table, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
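As a minimal sketch of the spark_read_csv approach above, assuming the file lives at s3a://my-bucket/myfile.csv (hypothetical path), that AWS credentials are already configured, and that the hadoop-aws package version matches your Hadoop build:

library(sparklyr)

# Make the S3A filesystem available to the local Spark installation.
config <- spark_config()
config$sparklyr.defaultPackages <- c("org.apache.hadoop:hadoop-aws:2.7.3")

sc <- spark_connect(master = "local", config = config)

# Spark reads the CSV directly from S3; nothing is pulled into R's memory.
spark_tbl <- spark_read_csv(sc,
                            name = "s3_data",                    # hypothetical table name
                            path = "s3a://my-bucket/myfile.csv", # hypothetical S3 path
                            header = TRUE,
                            infer_schema = TRUE)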
I am new to R; kindly help me with the error below.
I am calling R code through a batch file (e.g., c:\batchfile\x.bat) on a machine that has dynamic memory, i.e., memory and cores increase based on load.
With this approach everything executes without error. The R code uses the RODBCext,
koRpus, akmeans, lsa, stringr, topicmodels, RWeka, lda, snowfall, tm, openNLP, reshape2, plyr, and RODBC packages.
But when I call the x.bat file on the remote server from PowerShell (e.g., Invoke-Command -Computername 'Servername' {Start-Process 'C:\batchfile\x.bat' -wait}), I get the errors below:
LoadLibrary failure: The paging file is too small for this operation to complete.
just-in-time debugging errors
This application has requested the Runtime to terminate it in an unusual way.
Thanks in advance
I'm trying to clone a reasonably big SVN repository with git-svn, and at a certain point I get an error message:
Failure loading plugin: APR: Can't create a character converter from 'UTF-8' to native encoding: Cannot allocate memory at /usr/libexec/git-core/git-svn line 5061
And sometimes a
Cannot allocate memory: zlib (compress2): out of memory: Compression of svndiff data failed at /usr/libexec/git-core/git-svn line 5061
error message. I still have ~3GB RAM free. What should I do so git-svn can utilize it?
(I'm doing this on RedHat Enterprise Linux 6.5 if that makes any difference)
From:
This error message is about the memory git is trying to allocate --
it's more than what is free. This is most likely caused by a large
file having been checked into SVN. Unfortunately, there's no easy way
to fix it (apart from buying more memory) -- you would have to remove
the large file and the commit adding it from SVN.
However, try the following:
Increase swap space
Increase the ulimit
I have a main function in R that calls other files to run my program. I call the main script through a .bat file. When I run the code line by line it runs without a memory error, but when I run it via the .bat file, it halts and gives me the following error:
Cannot allocate memory greater than 51 MB.
How can I avoid this?
Memory limitations in R such as this are a recurring nightmare for a lot of us.
Very often the problem is a limit imposed by your OS (which can usually be changed on a Bash or PowerShell command line), your architecture (32- vs. 64-bit), or the availability of contiguous free RAM, regardless of the overall available memory.
It's hard to say why something would not cause a memory issue when run line by line, but would hit the memory limit when run as a .bat.
What version of R are you running? Do you have both 32-bit and 64-bit builds installed? Is the 32-bit build being called by Rscript when you run your .bat file, whereas you run a 64-bit build line by line? You can check the version of R that's being run with R.Version().
You can test this by running memory.limit() in both your R IDE/terminal and in the script your .bat file calls (be sure to print or save the result in the .bat run). You might also try setting memory.limit() in that script, as it may just have a smaller default, perhaps due to a different R profile being invoked in your IDE or terminal versus the .bat file.
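A minimal sketch of those checks, to drop into the script your .bat file runs (note that memory.limit() only works in R on Windows, and the 8000 MB figure is purely illustrative):

# Log which build of R is running and the current memory cap.
cat("Architecture:", R.Version()$arch, "\n")
cat("Memory limit (MB):", memory.limit(), "\n")

# Optionally raise the cap (Windows, 64-bit R); the value in MB is an assumption.
memory.limit(size = 8000)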
If architecture isn't the cause of your memory error, then you have several more troubleshooting steps to try:
Check memory usage in both environments (in R directly and via your .bat process) using this:
sort(sapply(ls(), function(x) object.size(get(x))))
Run the garbage collector explicitly in your scripts; that's the gc() command.
Check all object sizes to make sure there are no unexpected results in your .bat process: sort(sapply(ls(), function(x) format(object.size(get(x)), units = "Mb")))
Try memory profiling:
Rprof(tf <- "rprof.log", memory.profiling=TRUE)
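# ... run the code you want to profile here ...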
Rprof(NULL)
summaryRprof(tf)
While this is a RAM issue, for good measure you might want to check that the compute power available is both sufficient and not varying between these two ways of running your code: parallel::detectCores()
Examine your performance line by line with Hadley Wickham's lineprof tool (warning: it requires devtools to install and does not work on lines of code that call into C); a sketch follows below.
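A minimal sketch of using lineprof, assuming your entry point is a function main() defined in main.R (both names are hypothetical); the code must be sourced from a file so lineprof can map memory use back to source lines:

# install.packages("devtools"); devtools::install_github("hadley/lineprof")
library(lineprof)

source("main.R")          # hypothetical script defining main()
prof <- lineprof(main())  # records time and memory allocated/released per source line
prof                      # print the per-line summary
shine(prof)               # or explore it interactively in the browser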
References: While I'm pulling these snippets out of my own code, most of them originally came from other, related StackOverflow posts, such as:
Reaching memory allocation in R
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
R memory limit warning vs "unable to allocate..."
How to compute the size of the allocated memory for a general type
R : Any other solution to "cannot allocate vector size n mb" in R?
Yes, you should be using 64-bit R if you can.
See this question, and this from the R docs.