How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql? - r

I am a relative beginner to R trying to load and explore a large (7GB) CSV file.
It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).
It's too large to read straight into R and my searching has made me think the sqldf package could be useful. But when I try and read the file in with this code ...
library(sqldf)
library(here)
read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")
I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 10 did not have 196 elements
Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.
Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).

Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. To analyze the data within it, you may need to change your data access to something else, such as:
DBMS, whether monolithic (duckdb or RSQLite, lower cost-of-entry) or full DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (monolithic or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis.
Arrow parquet file. These are directly supported by dplyr functions and in a lazy fashion, meaning that when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load data; you call your dplyr mutate/filter/select/summarize pipe (some limitations), and then you finally call ... %>% collect(), only then it loads the resulting data into memory. Similar to SQL above in that you work on subsets at a time, but if you're already familiar with dplyr, it is much much closer than learning SQL from scratch.
There are ways to get a large CSV file into each of these.
Arrow/Parquet: How to convert a csv file to parquet (python, arrow/drill); a quick search in your favorite search engine should turn up other possibilities. Regardless of the language you want to do your analysis in (R), don't constrain yourself to solutions using that language.
SQL: DuckDB (https://duckdb.org/docs/data/csv.html), SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/), and other DBMSes tend to have a "bulk" command for importing raw CSV.
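As a rough sketch of the DBMS route described above, DuckDB can scan a tab-separated file in place, so that only the query result is pulled into R. The small temporary file and its column names (code, product_name, countries) are made up for illustration and stand in for the real export; the duckdb and DBI packages are assumed to be installed.

```r
library(DBI)
library(duckdb)

# Hypothetical small tab-separated file standing in for the real 7GB export
tsv_path <- tempfile(fileext = ".csv")
writeLines(c("code\tproduct_name\tcountries",
             "1\tOat milk\tFrance",
             "2\tRye bread\tGermany",
             "3\tBrie\tFrance"), tsv_path)
tsv_path <- normalizePath(tsv_path, winslash = "/")

con <- dbConnect(duckdb::duckdb())  # in-memory database

# DuckDB reads the file itself; only the aggregated result is returned to R
query <- sprintf(
  "SELECT countries, COUNT(*) AS n
   FROM read_csv_auto('%s', delim = '\t', header = true)
   GROUP BY countries
   ORDER BY countries", tsv_path)
res <- dbGetQuery(con, query)
print(res)

dbDisconnect(con, shutdown = TRUE)
```

On the real file you would point the query at the downloaded path instead of the temporary one, and select or filter only the columns and rows you need.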

Related

Partially read really large csv.gz in R using vroom

I have a csv.gz file that (from what I've been told) before compression was 70GB in size. My machine has 50GB of RAM, so anyway I will never be able to open it as a whole in R.
I can load for example the first 10m rows as follows:
library(vroom)
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7)
For what I have to do, it is fine to load 10m rows at a time, do my operations, and continue with the next 10m rows. I could do this in a loop.
I was therefore trying the skip argument.
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7, skip = 10^7)
This results in an error:
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
* Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
I increased this with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072*1000), however, the error persists.
Is there a solution to this?
Edit: I found out that random access to a gzip compressed csv (csv.gz) is not possible. We have to start from top. Probably the easiest is to decompress and save, then skip should work.
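To illustrate the point in the edit: a gzip stream can only be read front to back, but a single open connection can still be consumed chunk by chunk without decompressing to disk first. The following is a minimal base-R sketch, not a vroom solution; the tiny temporary file and its columns (id, value) are invented for the example, and the line-based splitting assumes no quoted field contains an embedded newline.

```r
# Hypothetical small gzipped CSV standing in for HUGE.csv.gz
path <- tempfile(fileext = ".csv.gz")
gz <- gzfile(path, "w")
writeLines(c("id,value", paste(1:10, (1:10) * 2, sep = ",")), gz)
close(gz)

con <- gzfile(path, open = "r")   # one sequential pass, no seeking
header <- strsplit(readLines(con, n = 1), ",")[[1]]

chunk_size <- 4                   # on real data something like 1e7
total <- 0
n_chunks <- 0
repeat {
  lines <- readLines(con, n = chunk_size)  # continues where the last read stopped
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$value)        # per-chunk work goes here
  n_chunks <- n_chunks + 1
}
close(con)
```

Because the connection stays open, each readLines call picks up after the previous chunk, so nothing is skipped or re-read.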
I haven't been able to figure out a vroom solution for very large more-than-RAM (gzipped) csv files. However, the following approach has worked well for me and I'd be grateful to know about approaches with better querying speed while also saving disk space.
Use the split sub-command in xsv from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of, say, 10^5 lines and save them in a folder.
Read all chunks using data.table::fread one-by-one (to avoid low-memory error) using a for loop and save all of them into a folder as compressed parquet files using arrow package which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet files partitioned by the fields by which you need to frequently filter.
Now you can use arrow::open_dataset and query that multi-file parquet folder using dplyr commands. It takes minimum disk space and gives the fastest results in my experience.
I use data.table::fread with explicit definition of column classes of each field for fastest and most reliable parsing of csv files. readr::read_csv has also been accurate but slower. However, auto-assignment of column classes by read_csv, as well as the ways in which you can custom-define column classes with read_csv, is actually the best. That means less human time but more machine time, so it may be faster overall depending on the scenario. Other csv parsers have thrown errors on the kind of csv files that I work with, which wastes time.
You may now delete the folder containing chunked csv files to save space, unless you want to experiment loop-reading them with other csv parsers.
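The last querying step above might look roughly like this. It is only a sketch: mtcars split into two files stands in for the parquet chunks, the folder is a temporary one, and the arrow and dplyr packages are assumed to be installed.

```r
library(arrow)
library(dplyr)

# Hypothetical parquet folder: two small chunks standing in for the real ones
pq_dir <- tempfile("pq_chunks_")
dir.create(pq_dir)
chunks <- split(mtcars, rep(1:2, length.out = nrow(mtcars)))
for (i in seq_along(chunks)) {
  write_parquet(chunks[[i]], file.path(pq_dir, paste0(i, ".parquet")))
}

# open_dataset() only records the files and schema; no data is loaded yet
ds <- open_dataset(pq_dir)

# The filter and column selection run inside Arrow; collect() loads the result
res <- ds %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, hp) %>%
  collect()
```

Only the rows and columns that survive the filter and select reach R's memory, which is what keeps this workable for larger-than-RAM folders.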
Other previously successful approaches: loop-read all csv chunks as mentioned above and save them into:
a folder using disk.frame package. Then that folder may be queried using dplyr or data.table commands explained in the documentation. It has facility to save in compressed fst files which saves space, though not as much as parquet files.
a table in DuckDB database which allows querying with SQL or dplyr commands. Using database-tables approach won't save you disk space. But DuckDB also allows querying partitioned/un-partitioned parquet files (which saves disk space) with SQL commands.
EDIT: - Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within R environment (no need to use any external tool like xsv) and all chunks will be written in parquet format in a folder ready for querying.
library(readr)
library(arrow)
fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"
# Callback: write each chunk to its own parquet file, named after its starting row position
f <- function(x, pos) {
  write_parquet(x,
                file.path(pqFolder, paste0(pos, ".parquet")),
                compression = "gzip",
                compression_level = 9)
}
read_csv_chunked(
  fyl,
  col_types = list(Column1 = "f", Column2 = "c", Column3 = "T", ...), # all column specifications
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 10^6)
If, instead of parquet, you want to use -
disk.frame, the callback function may be used to create chunked compressed fst files for dplyr or data.table style querying.
DuckDB, the callback function may be used to append the chunks into a database table for SQL or dplyr style querying.
By judiciously choosing the chunk_size parameter of readr::read_csv_chunked command, the computer should never run out of RAM while running queries.
PS: I use gzip compression for parquet files since they can then be previewed with ParquetViewer from https://github.com/mukunku/ParquetViewer. Otherwise, zstd (not currently supported by ParquetViewer) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped and expands to about 83 GB, whereas my home laptop has only 16 GB. Turns out that the read_csv_chunked method I mentioned in earlier EDIT fails to complete. It always stops working after some time and does not create all parquet chunks. Using my previous method of splitting the csv file with xsv and then looping over them creating parquet chunks worked. To be fair, I must mention it took multiple attempts this way too and I had programmed a check to create only additional parquet chunks when running the program on successive attempts.
EDIT 3:
vroom does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203
EDIT 4:
Additional tip: The chunked parquet files created by the above mentioned method may be very conveniently queried using SQL with DuckDB method mentioned at
https://duckdb.org/docs/data/parquet
and
https://duckdb.org/2021/06/25/querying-parquet.html
DuckDB method is significant because R Arrow method currently suffers from a very serious limitation which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr verbs mutate(), transmute(), select(), rename(), relocate(), filter(), and arrange(). Aggregation is not yet supported, so before you call summarise() or other verbs with aggregate functions, use collect() to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect() on a very big dataset, the RAM usage spikes and the system crashes. Whereas, using SQL statements to do the same aggregation job on the same big-dataset with DuckDB does not cause RAM usage spikes and does not cause system crash. So until Arrow fixes itself for aggregation queries for big-data, SQL from DuckDB provides a nice solution to querying big datasets in chunked parquet format.
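For what it's worth, an aggregation over a folder of parquet chunks with DuckDB could be sketched as follows. The two chunk files are generated from mtcars purely for illustration (written by DuckDB itself, so arrow is not needed here), the folder is a temporary one, and the duckdb and DBI packages are assumed to be installed.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())  # in-memory database

# Hypothetical parquet folder: two chunks written from mtcars
pq_dir <- tempfile("pq_chunks_")
dir.create(pq_dir)
pq_dir <- normalizePath(pq_dir, winslash = "/")
dbWriteTable(con, "cars", mtcars)
for (am in 0:1) {
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM cars WHERE am = %d) TO '%s/am_%d.parquet' (FORMAT PARQUET)",
    am, pq_dir, am))
}

# The aggregation runs inside DuckDB over all chunk files;
# only the small summary table is returned to R
res <- dbGetQuery(con, sprintf(
  "SELECT cyl, AVG(mpg) AS mean_mpg, COUNT(*) AS n
   FROM read_parquet('%s/*.parquet')
   GROUP BY cyl
   ORDER BY cyl", pq_dir))
print(res)

dbDisconnect(con, shutdown = TRUE)
```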

Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving osm data (Open Street Map). In order to manipulate geographic objects, I have to convert the data (an osm xml file) into an object. The osmar package lets me do this, but it fails to parse the raw xml data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10GB), knowing that I have 64G of memory ?
Thanks a lot !
This is the solution I came up with, even though it is not 100% satisfying.
Transform the .osm file by removing every newline (but the last) in your shell
Run the exact same code as before, skipping the paste that is not needed anymore (since you just did the equivalent in shell)
Profit :)
Obviously, I'm not very happy with it because modifying the data file in shell is more a trick than an actual solution :(

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (and this was the vignette active on 6/16/2014 when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines, intended to make filesizes smaller. They're not as good as gz or similar (not nearly as good), but they're supported by SAS transparently while writing SAS programs. Obviously they change the file format significantly, hence the lack of implementation yet.
If you have SAS, you need to write these to an uncompressed dataset.
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above and change have and want to names that are correct (have should be the filename without extension of the file, in most cases; want can be changed to anything logical with A-Z or underscore only, and 32 or fewer characters).
If you don't have SAS, you'll have to ask your data provider to make the data available uncompressed, or in a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from and there might be a way to help you identify an uncompressed source.
This admittedly is not a pure R solution, but in many situations (e.g. if you aren't on a pc and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better using this than needing to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd
InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"
with SAS7BDAT(InFileName) as f:
    df = f.to_data_frame()
df.to_csv(path_or_buf = OutFileName, sep = "\t", encoding = 'utf-8', index = False)
The haven package can read compressed SAS-files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
But only SAS-files which are compressed using compress=char, but not compress=binary.
So haven will be able to read this SAS-file:
data output.compressed_data_char (compress=char);
set inputdata;
run;
But not this SAS-file:
data output.compressed_data_binary (compress=binary);
set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package to read SAS data sets (compressed or uncompressed). You can use the rxImport function of this package. Below is an example.
Importing library
library(RevoScaleR)
Reading data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7bdat files. Three simple lines:
library(haven)
data <- read_sas("yourfile.sas7bdat")
and then transform it to csv
write.csv(data,"data.csv")

Reading excel with R

I am trying to decide whether to read Excel files directly from R or to convert them to csv first. I have researched the various possibilities for reading Excel. I also found out that reading Excel might have its cons, like conversion of date and numeric column data types.
XLConnect - dependent on java
read.xlsx - slow for large data sets
read.xlsx2 - fast but need to use the colClasses argument to specify desired column classes
ODBC - may have conversion issues
gdata - dependent on perl
I am looking for a solution that will be fast enough for at least a million rows with minimum data conversion issues. Any suggestions?
EDIT
So finally I have decided to convert to csv and then read the csv file, but now I have to figure out the best way to read a large csv file (with at least 1 million rows).
I found out about the read.csv.ffdf function (from the ff package) but that does not let me set my own colClass. Specifically this
setAs("character","myDate", function(from){ classFun(from) } )
colClasses =c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z<-read.csv.ffdf(file=pathCsv, colClasses=colClassesffdf)
This does not work and I get the following error:
Error in ff(initdata = initdata, length = length, levels = levels,
ordered = ordered, : vmode 'list' not implemented
I am also aware of the RSQLite and ODBC functionality but do not wish to use it. Is there a solution to the above error or any other way around this?
Since this question, Hadley Wickham has released the R package readxl which wraps C and C++ libraries to read both .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but does introduce a manual step in your analysis, and relies on whichever version of Excel you happen to use giving consistent CSV output.
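As a minimal sketch of the readxl route, the snippet below reads one sheet with explicit column types, which addresses the date/numeric conversion worry from the question. It uses the example workbook that ships with readxl (via readxl_example) rather than the asker's file, so the sheet name and column types are specific to that demo workbook.

```r
library(readxl)

# Example workbook bundled with readxl; replace with your own path
path <- readxl_example("datasets.xlsx")

# Declaring col_types up front avoids relying on type guessing
df <- read_excel(
  path,
  sheet = "iris",
  col_types = c("numeric", "numeric", "numeric", "numeric", "text")
)
str(df)
```

For a date column you would use "date" in col_types at the matching position.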
All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
If you want speed and large data, then you might consider converting your excel file(s) to a database format, then connect R to the database.
A quick Google search showed several links for converting Excel files to SQLite databases; you could then use the RSQLite or sqldf package to read into R.
Or use the ODBC package if you convert to one of the databases that work with ODBC. The field-conversion problems should be fewer if you do the conversion to the database correctly.

How to write multiple tables, dataframes, regression results etc - to one excel file?

I am looking for an easy way to get objects into MS Excel.
(I am using the preinstalled "Puromycin"-dataset for the examples)
I would like to place the contents of these objects to a single excel file:
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
By "contents" I mean what is shown in the console when I press enter. I don't know what to call it.
I tried to do this:
sink("datafilewhichexcelhopefullyunderstands.csv")
Puromycin
summary(Puromycin$rate)
summary(Puromycin$conc)
table(Puromycin$state)
lm( conc ~ rate , data=Puromycin)
sink()
This gives me a file with the CSV extension; however, when I open the file in Notepad,
there is no comma separation. That means that I can't get Excel to open it properly. By properly
I mean that each number is in its own cell.
Others have suggested this for a similar problem
https://stackoverflow.com/a/13007555/1831980
But as a novice I feel that the solution is too complex, and I am hoping for a simpler method.
What I am doing now is this:
write.table(Puromycin, file="clipboard" , sep=";" , row.names=FALSE )
write.table(summary(Puromycin$conc), file="clipboard" , sep=";" , row.names=FALSE )
... etc...
But this requires a lot of copying and pasting, which I hope to eliminate.
Any help would be appreciated.
write.table and its friends are intended to write out columns of data separated by whatever separator is specified. Your clipboard contains several data types because you are using summary which always gives a unique output.
For writing the data values out, you can use write.csv on a data frame and then open with Excel. For example, Puromycin is already a data frame (which you can see with str(Puromycin)) so you can just write it out directly:
write.csv(file = "some file.csv", x = Puromycin)
Which will go into the current working directory (which can be determined with getwd()).
To write out/save the results of the regression model is a bit more of a challenge. You could definitely use sink as you did, but specify an extension of .txt on your file so a text editor can open it. There are fancier methods (sweave, knitr) which you might want to look into in the long run, as they can write really nice reports automatically.
In the meantime, get to know str(any R object) as it will be your friend. You can see all the objects in your workspace with ls().
This will only be helpful if you are prepared to use Excel's Data/Text to Columns functions:
capture.output( sapply( c(Puromycin,
summary(Puromycin$rate),
summary(Puromycin$conc),
table(Puromycin$state),
lm( conc ~ rate , data=Puromycin) ), FUN=print), file="datafilewhichexcelhopefullyunderstands.csv", append=TRUE)
The problem is that Excel will not read whitespace as a cell separator unless you specifically tell it to. You can (and I have often done so) use the fixed field input features offered by the Text-to-Columns dialog interface.
Your simplest option may be to use the RExcel tool, it transfers information between R and Excel. However it is not free software.
The XLConnect package is another option, it can be used to write information directly to an Excel file.
The tricky part is the lm call. lm does not return a simple vector, matrix, or data frame (all of which are easy to convert to csv or send directly) and there is not a clear way to convert the various parts of a list to cells in a spreadsheet. What would be better is to use extractor functions to pull the important parts from the return of lm or the summary of the lm object and send those to Excel using the other tools.
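A rough base-R sketch of that extractor idea: pull the coefficient table out of the model summary, turn the other results into data frames, and append each block to one delimited file. The helper write_block and the output path are made up for this example; the ";" separator follows the asker's own write.table calls and may need to be "," depending on your Excel locale.

```r
fit <- lm(conc ~ rate, data = Puromycin)

# Extractor functions return plain matrices/tables that convert to data frames
coefs        <- as.data.frame(coef(summary(fit)))   # estimates, std. errors, t, p
state_counts <- as.data.frame(table(Puromycin$state))
rate_summary <- as.data.frame(t(unclass(summary(Puromycin$rate))))

out <- tempfile(fileext = ".csv")  # hypothetical output file

# Hypothetical helper: write a title line, then the table, then a blank line
write_block <- function(x, title) {
  cat(title, "\n", file = out, append = TRUE, sep = "")
  # append = TRUE with column names triggers a harmless warning, so silence it
  suppressWarnings(
    write.table(x, out, sep = ";", append = TRUE, col.names = NA)
  )
  cat("\n", file = out, append = TRUE)
}

write_block(Puromycin,    "Puromycin data")
write_block(rate_summary, "Summary of rate")
write_block(state_counts, "Counts by state")
write_block(coefs,        "lm(conc ~ rate) coefficients")
```

Each block lands in its own group of rows in the same file, so Excel opens everything on one sheet with one value per cell.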
If you can tell us more about why you want the numbers in Excel and what you plan to do with them after, then we may be able to offer better help (you may be able to completely skip excel).
If the main goal is to share output with others then you should really look at the knitr package (or other related packages). This will not create Excel files, but can be used (along with the pandoc program and possibly other tools) to create a report file in a format easy to share with others not familiar with R. You could put everything into a .pdf file or a .docx file (the latter can be read by MS Word and would have tables which can be edited using Word). There is not a simple way to get edits back into R, but with track changes you can easily see what changes have been made and hand-edit your R script/template accordingly.
