Vroom using too much RAM

I was working with some data and used vroom to open it, and it worked like a charm. After restarting R, something changed and it never worked again. Loading exactly the same data, if I run:
df = vroom(directory)
It cannot guess the delimiter. If I run:
df = vroom(directory, delim = ',')
It gets stuck indexing the file and I have to close RStudio. If I run:
df = vroom(directory, delim = ',', progress = FALSE)
I get Error: std::bad_alloc, even though I have 8 GB of free RAM for an 80 MB file. A smaller file reads with no problem, so I'm guessing the problem IS that vroom() uses too much RAM for wide files. Is there any way to optimize this? My vroom version is 1.6.0; my R is 4.2.1.
Thank you!
PS: the data comes from this link.
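Not from the thread, but a general workaround for wide files is to read only the columns you actually need (vroom's col_select argument does this). The same idea works in base R via colClasses; here is a self-contained sketch in which the file and column names are hypothetical stand-ins for the real data:

```r
# Sketch: read only a column subset with base utils::read.csv,
# mirroring what vroom(col_select = ...) does. The CSV below is a
# made-up stand-in for the real 80 MB file.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9), path, row.names = FALSE)

hdr <- names(read.csv(path, nrows = 1))        # peek at the header row only
keep <- c("a", "c")                            # hypothetical columns of interest
classes <- ifelse(hdr %in% keep, NA, "NULL")   # "NULL" tells read.csv to drop a column
df <- read.csv(path, colClasses = classes)
names(df)                                      # "a" "c"
```

Dropping unused columns at parse time keeps the width of the materialized data frame down, which is usually what matters for memory on wide files.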

Related

Registering a temporary table using sparklyr in Databricks

My colleague is using pyspark in Databricks, and the usual step is to run an import with data = spark.read.format('delta').parquet('parquet_table').select('column1', 'column2'), followed by this caching step, which is really fast:
data.cache()
data.registerTempTable("data")
As an R user I am looking for this registerTempTable equivalent in sparklyr.
I would usually do
data = sparklyr::spark_read_parquet(sc = sc, path = "parquet_table", memory = FALSE) %>% dplyr::select(column1, column2)
If I opt for memory = TRUE or tbl_cache(sc, "data"), it keeps running and never stops. The time difference is stark: my colleague's registerTempTable takes seconds, whereas my sparklyr version keeps running indefinitely. Is there a better function in sparklyr that can do this registerTempTable step faster?
You can try using cache and createOrReplaceTempView:
library(SparkR)
df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
cache(df)
createOrReplaceTempView(df,"df_temp")
The above works for SparkR.
For sparklyr, your spark_read_parquet() call with memory = FALSE is a pretty similar procedure to what SparkR does above.
You shouldn't need to cache the data just to create a temporary table, so I would stick with memory = FALSE; sparklyr::sdf_register() is the direct sparklyr equivalent of registerTempTable.
See this question.

read_tsv stalls: is this an encoding issue?

I apologize in advance for the lack of specificity, but I can't provide a reproducible example in this case. I'm trying to read a tab-separated data file with readr's read_tsv. The data is from a confidential source, so I can't share it, even just the problematic part. read_tsv stalls at around 20% of reading progress, and unless I kill R quickly, RAM usage blows up to the point that my computer freezes (I'm on Ubuntu 18.04). Specifically, I'm running:
read_tsv(file = path_to_file,
skip = 10e6,
n_max = 1e5)
I'm skipping lines and setting n_max to vaguely isolate where the problem is and run faster tests. I also tried setting read_tsv's locale to locale(encoding = 'latin1') without success. I tried inspecting this problematic part by reading it with readr's read_lines:
read_lines(file = path_to_file,
skip = 10e6,
n_max = 1e5)
There's no reading problem there: I get a vector of character strings. I ran validUTF8 on all of them and they all seem valid. I just have no idea what kind of problem could cause read_tsv to stall. Any ideas?
I solved the problem. It seems to have come from inappropriate handling of quoting characters by read_tsv's default quote option. Using quote = "" instead made it work smoothly.
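To illustrate the failure mode with a constructed example (using base read.delim, which takes the same quote argument as read_tsv): a field that begins with an unmatched quote makes the parser scan ahead for a closing quote, swallowing separators and newlines along the way, which can look like a stall on a large file.

```r
# Made-up TSV with a field that starts with an unmatched quote.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("col1\tcol2",
             "\"unterminated\tfield",
             "ok\tfine"), tsv)

# Default quoting scans for a closing quote and swallows the rest of
# the file into one field (with an "EOF within quoted string" warning):
bad <- suppressWarnings(read.delim(tsv))
# Disabling quoting parses each tab-separated line as-is:
good <- read.delim(tsv, quote = "")
nrow(good)                                     # 2
```

On a multi-gigabyte file, that scan-ahead over unquoted content is also where the runaway memory use comes from.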

Julia problem reading relatively large csv file

Using Julia 1.0.0
I have a csv file with 75 columns and about 700,000 rows. Up until yesterday my code was reading it in seconds and converting it to a DataFrame.
RawDat = CSV.read("filename.csv", header=true, rows_for_type_detect=500,
missingstring="", categorical=false, types=dictm)
A couple of days ago I installed JLD.jl, which triggered several package updates. I probably hadn't updated my packages, including CSV and DataFrames, for a couple of months. Since the update, I can no longer read the same CSV file: the code hangs for more than 20 minutes with nothing happening.
I tried CSV.File, since CSV.read seems to have been deprecated. This reads the file, but I still cannot convert it to a DataFrame.
RawDat = CSV.File("filename.csv", header=1, missingstring="",
categorical=false, types=dictm)
works but then if I try
RawDat1 = DataFrame(RawDat)
it hangs and nothing happens. Similarly, if I try
RawDat = CSV.File("filename.csv", header=1, missingstring="",
categorical=false, types=dictm) |> DataFrame
the file is not read.
Can someone help me understand why this is happening and how I can read this csv file into a DataFrame? I have a lot of downstream code that uses DataFrame features to process this file.
EDIT
I believe I figured it out, and I'm posting in case others have a similar issue. I was able to convert the file to a DataFrame one column at a time. It was pretty quick, but overall I think this should happen automatically, without extra lines of code. This is what worked:
datcols = Tables.columns(RawDat)
MyDF = DataFrame()
kcs = keys(datcols)
for ci in kcs
    MyDF[!, ci] = datcols[ci]   # `df[!, col]` is the current column-assignment syntax
end

fread issue with a file unzipped by the archive package in R

I am having issues while trying to use fread, after I unzip a file using the archive package in R. The data I am using can be downloaded from https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data
The code is as follows:
library(dplyr)
library(devtools)
library(archive)
library(data.table)
setwd("C:/jc/2017/13.Lafavorita")
hol<-archive("./holidays_events.csv.7z")
holcsv<-fread(hol$path, header = T, sep = ",")
This code gives the error message:
File 'holidays_events.csv' does not exist. Include one or more spaces to consider the input a system command.
Yet if I try:
holcsv1<-read.csv(archive_read(hol),header = T,sep = ",")
It works perfectly. I need to use fread because the other files I need to open are too big for read.csv. I'm puzzled because my code was working fine a few days ago. I could unzip the files manually, but that's not the point. I've tried to solve this for hours but can't find anything useful in the documentation. I found this: https://github.com/yihui/knitr/blob/master/man/knit.Rd#L104-L107 , but I cannot understand it.
Turns out the answer is rather simple, but I found it by luck. After calling archive() you need to pass the result to archive_extract(). In my case, I added:
hol1 <- archive_extract(hol)
and changed the last line to:
holcsv <- fread(hol1$path, header = T, sep = ",")

Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

I'm trying to run this function within a function, based loosely on this. However, since Xpdf can convert PDFs to PNGs, I skipped the ImageMagick conversion step, as well as the faulty logic in the function(i) step: pdftopng requires a root name ("ocrbook-000001.png" in this case) and throws an error when looking for a PNG named after the original PDF file.
My issue is now with getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(paste0(z))
  })
})
The issue was that the DPI was set too high for Tesseract to handle. Changing the pdftopng DPI parameter from 600 to 150 corrected the problem; there appears to be a maximum DPI above which Tesseract can't process the image.
I have also corrected my code from a static naming convention to a more dynamic one that mimics the files' original names.
dest <- "C:\\users\\YOURNAME\\desktop"
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
  shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i, ".pdf", " ", i)))
})
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
  shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
  file.remove(paste0(y, ".ppm"))
})
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
  shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
  file.remove(paste0(z, ".tif"))
})
Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a batch of faxes (about 3,000 individual PDF files, most between 1 and 15 pages) to text. I used an apply function to store the text of each fax as a separate entry in a list (length = number of faxes ≈ 3,000). Then I converted the list to a vector, combined it with a vector of file names into a data frame, and finally wrote the data frame to a CSV file. (See below for the code I used.)
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird was that I re-ran the code multiple times and the error always occurred on a different fax. It also seemed to occur more often when I was doing something else RAM- or CPU-intensive (opening Microsoft Teams, etc.). I tried changing the DPI as suggested in the first answer, and that didn't seem to work.
It was also noticeable that while this code was running, I was regularly using close to 100% of RAM and 50% of CPU (based on Windows Task Manager).
When I ran this process (on a similar batch of about 3,000 faxes) on a Linux machine with significantly more RAM and CPU, I never encountered the problem.
basic_string::_M_construct null not valid appears to be a C++ error. I'm not familiar with C++, but it sounds like a catch-all error that may indicate something that should have been created wasn't.
Based on all that, I think R runs out of memory, and in response the memory available to some of the underlying Tesseract processes gets throttled. There is then not enough memory to convert a PDF to a PNG and extract the text, which is what throws these errors. A text blob that was expected never gets created, producing the final C++ error: basic_string::_M_construct null not valid. It's possible that lowering the DPI gave your process enough memory to complete, but the fundamental problem may have been the memory, not the DPI.
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here are some ideas for anyone running the tesseract package in R who hits similar problems:
Switch from RStudio to Rgui: this alone solved my problem. I completed the whole 3,000-fax job without errors in Rgui, which also used 100-400 MB of RAM instead of the 1,000+ MB RStudio used, and about 25% of CPU instead of 50%. Putting R on the path and running it from the console, or running R in the background, might reduce memory use even further.
Close memory-intensive processes while the code runs: Microsoft Teams, videoconferencing, streaming, Docker on Windows, and the Windows Subsystem for Linux are all huge memory hogs.
Lower the DPI: as suggested in the first answer, this would also probably reduce memory use.
Break the process up: running my job in batches of about 500 might also have reduced the working memory R needed before writing to file.
These are all quick fixes that can be done from R without learning C++ or upgrading hardware. A more durable solution would probably require tuning Tesseract's parameters, implementing the process in C++, changing memory-allocation settings for R and the operating system, or buying more RAM.
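The batching idea in the last bullet can be sketched in base R (the file names, batch size, and output directory below are hypothetical); each batch is written to disk before the next starts, so the full result never has to sit in memory at once:

```r
# Split a hypothetical list of fax paths into batches of 500.
paths   <- sprintf("fax_%04d.pdf", 1:1200)            # stand-in file list
batches <- split(paths, ceiling(seq_along(paths) / 500))
length(batches)                                       # 3 batches: 500, 500, 200

for (i in seq_along(batches)) {
  # texts <- lapply(batches[[i]], ocr2)               # the real OCR step
  texts <- lapply(batches[[i]], function(p) paste("text of", p))  # placeholder
  out <- data.frame(file_name = batches[[i]], file_text = unlist(texts))
  write.csv(out,
            file.path(tempdir(), sprintf("faxes_batch_%02d.csv", i)),
            row.names = FALSE)
}
```

Because each iteration's `texts` and `out` are overwritten on the next pass, peak memory is bounded by one batch rather than the whole corpus.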
Example Code
# Load libraries
library(tesseract)
dir.create("finished_data")

# Define functions
ocr2 <- function(pdf_path){
  # tell tesseract which language to guess
  eng <- tesseract("eng")
  # convert to png first
  # pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
  # tell tesseract to convert the pdf at pdf_path
  separated_pages <- tesseract::ocr(pdf_path, engine = eng)
  # combine all the pages into one string
  combined_pages <- paste(separated_pages, collapse = "**new page**")
  # I delete png files as I go to avoid overfilling the hard drive
  # because the work computer has no hard drive space :'(
  png_file_paths <- list.files(pattern = "png$")
  file.remove(png_file_paths)
  combined_pages
}

# find pdf paths
fax_file_paths <- list.files(path = "./raw_data",
                             pattern = "pdf$",
                             recursive = TRUE)

# convert all the pdfs to text using the OCR function
faxes <- lapply(paste0("./raw_data/", fax_file_paths), ocr2)
fax_table <- data.frame(file_name = fax_file_paths, file_text = unlist(faxes))
write.csv(fax_table,
          file = paste0("./finished_data/faxes_", format(Sys.Date(), "%b-%d-%Y"), "_test.csv"),
          row.names = FALSE)
