Reading and writing binary content in R

I have to download binary files stored in a PostgreSQL database as bytea and then work with them in R. I use the DBI library to download the content:
data <- dbGetQuery(connection, "select binary_content from some_table limit 1")
Next I have to work with this content. The problem is that even after reviewing SO threads (e.g. this one), the PostgreSQL documentation, the R documentation for several functions (writeBin, charToRaw, as.raw etc.), multiple web pages and intensive Googling, I am unable to find any hints on how this can be done in R. What I want to do is (1) download the content, (2) save it locally as individual files, and (3) work with the files. No matter what I do, R always saves the content as one long gibberish character string.
Unfortunately I am unable to provide a reproducible example, since I cannot share the data I am using.
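The three steps (download, convert, write) can be sketched as below. This is a hedged sketch, assuming the driver returns bytea as a hex-encoded string of the form "\x48656c6c6f" (RPostgreSQL does this; RPostgres returns raw vectors directly, in which case the conversion step can be skipped). The hex_to_raw helper is hypothetical, not part of any package:

```r
# Hypothetical helper: turn a PostgreSQL hex-format bytea string into a raw vector.
hex_to_raw <- function(x) {
  hex   <- sub("^\\\\x", "", x)  # drop the leading "\x" marker
  pairs <- substring(hex, seq(1, nchar(hex), 2), seq(2, nchar(hex), 2))
  as.raw(strtoi(pairs, base = 16L))
}

# Self-contained demo with a literal hex string (spells "Hello"):
bin <- hex_to_raw("\\x48656c6c6f")
writeBin(bin, "output_file.bin")  # (2) save locally as an individual file

# With the query from the question (requires an open `connection`):
# data <- dbGetQuery(connection, "select binary_content from some_table limit 1")
# bin  <- hex_to_raw(data$binary_content[1])
# writeBin(bin, "some_file.bin")
```

writeBin writes the raw bytes untouched, so the file on disk is the original binary content rather than a character rendering of it.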

Related

How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql?

I am a relative beginner to R trying to load and explore a large (7GB) CSV file.
It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).
It's too large to read straight into R and my searching has made me think the sqldf package could be useful. But when I try and read the file in with this code ...
library(sqldf)
library(here)
read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")
I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 10 did not have 196 elements
Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.
Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).
Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. For analyzing the data within it, I suggest you may need to change your data-access method to something else, such as:
A DBMS, whether embedded (duckdb or RSQLite, lower cost of entry) or a full client-server DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (embedded or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis.
An Arrow parquet file. These are directly supported by dplyr functions, and in a lazy fashion: when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load data; you then call your dplyr mutate/filter/select/summarize pipe (with some limitations), and only when you finally call ... %>% collect() is the resulting data loaded into memory. Similar to the SQL option above in that you work on subsets at a time, but if you're already familiar with dplyr, it is much, much closer than learning SQL from scratch.
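The lazy pipeline described above can be sketched as follows (the file path and the column names are made-up assumptions for illustration):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/my.parquet")  # returns immediately, loads no data

result <- ds %>%
  filter(countries == "France") %>%       # hypothetical column
  select(code, product_name) %>%          # hypothetical columns
  collect()                               # only now is data read into memory
```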
There are ways to get a large CSV file into each of these:
Arrow/Parquet: How to convert a csv file to parquet (python, arrow/drill); a quick search in your favorite search engine should provide other possibilities. Regardless of the language you want to do your analysis in ("R"), don't constrain yourself to solutions using that language.
SQL: DuckDB (https://duckdb.org/docs/data/csv.html), SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/), and other DBMSes tend to have a "bulk" command for importing raw CSV.
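As a hedged sketch of the DuckDB bulk-import route (the database file name and table name are assumptions; read_csv_auto infers the schema from the file):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "food.duckdb")

# Bulk-load the CSV inside the database, never pulling it all into R:
dbExecute(con, "
  CREATE TABLE products AS
  SELECT * FROM read_csv_auto('en.openfoodfacts.org.products.csv', delim = '\t')
")

# Query only the subset you need into R:
head_rows <- dbGetQuery(con, "SELECT * FROM products LIMIT 10")
dbDisconnect(con, shutdown = TRUE)
```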

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save the table as a single file at the path path/to/table.csv. I would do this as follows:
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions. When you save the dataframe to CSV, you will get one file per partition. Before calling the spark_write_csv method, you need to bring all the data into a single partition to get a single file.
You can use a method called coalesce to achieve this; in sparklyr it is exposed as sdf_coalesce:
df <- sdf_coalesce(df, partitions = 1)
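If you already have the folder of part files and just want one CSV, a hypothetical sketch of stitching them back together in plain R (the folder path matches the question; assumes every part file shares the same header):

```r
# Collect the part files Spark wrote into the d1.csv folder:
parts <- list.files("C:/d1.csv", pattern = "^part-.*\\.csv$", full.names = TRUE)

# Read each part and bind them into one data frame, then write a single CSV:
combined <- do.call(rbind, lapply(parts, read.csv))
write.csv(combined, "C:/d1_single.csv", row.names = FALSE)
```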

Is there a way to accelerate formatted table writing from R to excel?

I have a dataframe with 174,603 rows and 178 columns, which I'm exporting to Excel using openxlsx::saveWorkbook (using this package to obtain the aforementioned cell formatting, with colors, header styles and so on). But the process is extremely slow (depending on the amount of memory used by the machine, it can take from 7 to 17 minutes!!) and I need a way to reduce this significantly (it doesn't need to be seconds, but anything below 5 min would be OK).
I've already searched other questions, but they all seem to focus either on importing into R (I have no problem with this) or on writing non-formatted files from R (using write.csv and other options of the like).
Apparently I can't use the xlsx package because of the settings on my computer (industrial computer; check the comments on this question).
Any suggestions regarding packages or other functionalities inside this package to make this run faster would be highly appreciated.
This question has been around for some time, but I had the same problem as you and came up with a solution worth mentioning.
There is a package called writexl that implements exporting a data frame to Excel using the C library libxlsxwriter. You can export to Excel with the following code:
library(writexl)
writexl::write_xlsx(df, "Excel.xlsx", format_headers = TRUE)
The parameter format_headers only applies centered and bold titles, but I edited the C code of its source on GitHub (the writexl library made by rOpenSci).
You can download it or clone it. Inside src folder you can edit write_xlsx.c file.
For example, in the part where the header format is created:
//how to format headers (bold + center)
lxw_format * title = workbook_add_format(workbook);
format_set_bold(title);
format_set_align(title, LXW_ALIGN_CENTER);
you can add these lines to add a background color to the header:
format_set_pattern (title, LXW_PATTERN_SOLID);
format_set_bg_color(title, 0x8DC4E4);
There is a lot more formatting you can do; search in the libxlsxwriter library documentation.
When you have finished editing that file, and given you have the source code in a folder called writexl, you can build and install the edited package with:
shell("R CMD build writexl")
install.packages("writexl_1.2.tar.gz", repos = NULL)
Exporting again using the first chunk of code will generate the Excel file with formats, and faster than any other library I know of.
Hope this helps.
Have you tried:
write.table(GroupsAlldata, file = 'Groupsalldata.txt')
in order to obtain it in txt format?
Then in Excel you can simply use 'Text to Columns' to put your data into a table.
Good luck!

Working with excel file in R

I am still suffering every time I deal with Excel files in R.
What is the best way to do the following?
1- Import an Excel file into R as a "whole workbook" and be able to do analysis on any sheet in the workbook? If you are thinking of XLConnect, please bear in mind the "out of memory" problem with Java. I have files over 30MB, and dealing with the Java memory problem every time consumes more time (running -Xmx does not work for me).
2- Not miss any data from any Excel sheet? Saving the file as csv reports that some sheets are "out of range", i.e. beyond 65,536 rows and 256 columns. It also cannot deal with some formulas.
3- Not have to import each sheet separately? Importing sheets into SPSS, STATA or EViews, saving into their extension, and then working with the output file in R works fine most of the time. However, this method has two major problems: one is that you have to have the software installed on the machine, and the other is that it imports only one sheet at a time. If I have over 30 sheets, this becomes very time consuming.
This might be an ongoing question that has been answered many, many times; however, each answer solves a part of the problem, not the whole issue. It is like putting out fires rather than strategically solving the problem.
I am on Mac OS 10.10 with R 3.1.1
I have tried a few packages to open an Excel file; openxlsx is definitely the best route. It is way faster and more stable than the others. The function is openxlsx::read.xlsx. My advice is to use it to read the whole sheet once and then play with the data within R, rather than reading part of the sheet several times. I used it a lot to open large Excel files (8000+ columns, 1000+ rows) and it always worked well. I use the xlsx package to write to Excel, but it had numerous memory issues when reading (that's why I moved to openxlsx).
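A minimal sketch of treating the file as a whole workbook with openxlsx, reading every sheet into a named list so you can analyse any of them (the file name is an assumption):

```r
library(openxlsx)

path     <- "my_workbook.xlsx"
sheets   <- getSheetNames(path)
workbook <- lapply(sheets, function(s) read.xlsx(path, sheet = s))
names(workbook) <- sheets

# Work on any sheet by name, e.g.:
# summary(workbook[["Sheet1"]])
```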
Add-in
On a side note, if you want to use R with Excel, you sometimes need to execute VBA code from R. I found the procedure quite difficult to achieve. I fully documented the proper way of doing it in a previous Stack Overflow question: Apply VBA from R.
Consider using the xlsx package. It has methods for dealing with Excel files and worksheets. Your question is quite broad, but I think this can serve as an example:
library(xlsx)
wb <- loadWorkbook('r_test.xlsx')
sheets <- getSheets(wb)
sheet <- sheets[[1]]
df <- readColumns(sheet,
startColumn = 1, endColumn = 3,
startRow = 1, endRow = 6)
df
## id name x_value
##1 1 A 10
##2 2 B 15
##3 3 C 20
##4 4 D 13
##5 5 E 17
As for the memory issue I think you should check the ff package:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
Another option (but it may be overkill) would be to load the data to a real database and deal with database connections. If you are dealing with really big datasets, a database may be the best approach.
Some options would be:
The RSQLite package
If you can load your data to an SQLite database, you can use this package to connect directly to that database and handle the data directly. That would "split" the workload between R and the database engine. SQLite is quite easy to use and (almost) "config free", and each SQLite database is stored in a single file.
The RMySQL package
Even better than the above option: MySQL is great for storing large datasets. However, you'll need to install and configure a MySQL server on your computer.
Remember: If you work with R and a database, delegate as much heavy workload to the database (e.g. data filtering, aggregation, etc), and use R to get the final results.
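The delegate-to-the-database idea can be sketched with RSQLite as below (the table and column names are made up for illustration):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # or a file path for persistence
dbWriteTable(con, "measurements",
             data.frame(name = c("A", "B", "A"), x_value = c(10, 15, 20)))

# Let the database do the aggregation; R only receives the final result:
res <- dbGetQuery(con, "SELECT name, AVG(x_value) AS mean_x
                        FROM measurements GROUP BY name")
dbDisconnect(con)
```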

Reading excel with R

I am trying to decide whether to read Excel files directly into R or to convert them to csv first. I have researched the various possibilities for reading Excel, and found that reading Excel directly might have its cons, like conversion of date and numeric column data types, etc.
XLConnect - dependent on java
read.xlsx - slow for large data sets
read.xlsx2 - fast but needs the colClasses argument to specify the desired column classes
ODBC - may have conversion issues
gdata - dependent on perl
I am looking for a solution that will be fast enough for at least a million rows with minimal data conversion issues. Any suggestions?
EDIT
So finally I have decided to convert to csv and then read the csv file, but now I have to figure out the best way to read a large csv file (with at least 1 million rows).
I found out about the read.csv.ffdf function (from the ff package), but that does not let me set my own colClasses. Specifically this:
setAs("character","myDate", function(from){ classFun(from) } )
colClasses =c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z <- read.csv.ffdf(file = pathCsv, colClasses = colClasses)
This does not work and I get the following error:
Error in ff(initdata = initdata, length = length, levels = levels,
ordered = ordered, : vmode 'list' not implemented
I am also aware of the RSQLite and ODBC functionality but do not wish to use them. Is there a solution to the above error, or any other way around this?
Since this question, Hadley Wickham has released the R package readxl which wraps C and C++ libraries to read both .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but does introduce a manual step in your analysis, and relies on whichever version of Excel you happen to use giving consistent CSV output.
All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
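A minimal sketch with readxl (the file name and column types are assumptions; fixing col_types explicitly avoids the guessing that causes date/numeric conversion surprises):

```r
library(readxl)

df <- read_excel("my_data.xlsx",
                 sheet = 1,
                 col_types = c("numeric", "text", "date", "numeric"))
```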
If you want speed and large data, then you might consider converting your excel file(s) to a database format, then connect R to the database.
A quick Google search showed several links for converting Excel files to SQLite databases; then you could use the RSQLite or sqldf package to read the data into R.
Or use the ODBC package if you convert to one of the databases that works with ODBC. Field-conversion problems should be fewer if you do the conversion to the database correctly.
