Reading excel with R - r

I am trying to contemplate whether to read excel files directly from R or should I convert them to csv first. I have researched about the various possibilities of reading excel. I also found out that reading excel might have its cons like conversion of date and numeric column data types etc.
XLConnect - dependent on java
read.xslx - slow for large data sets
read.xslx2 - fast but need to use colClasses command to specify desired column classes
ODBC - may have conversion issues
gdata - dependent on perl
I am looking for a solution that will be fast enough for atleast a million rows with minimum data conversion issues . Any suggestions??
EDIT
So finally i have decided to convert to csv and then read the csv file but now I have to figure out the best way to read a large csv file(with atleast 1 million rows)
I found out about the read.csv.ffdf package but that does not let me set my own colClass. Specifically this
setAs("character","myDate", function(from){ classFun(from) } )
colClasses =c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z<-read.csv.ffdf(file=pathCsv, colClasses=colClassesffdf)
This does not work and i get the following error :-
Error in ff(initdata = initdata, length = length, levels = levels,
ordered = ordered, : vmode 'list' not implemented
I am also aware of the RSQlite and ODBC functionality but do not wish to use it . Is there a solution to the above error or any other way around this?

Since this question, Hadley Wickham has released the R package readxl which wraps C and C++ libraries to read both .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but does introduce a manual step in your analysis, and relies on whichever version of Excel you happen to use giving consistent CSV output.

All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.

If you want speed and large data, then you might consider converting your excel file(s) to a database format, then connect R to the database.
A quick Google search showed several links for converting Excel files to SQLite databases, then you could use the RSQlite or sqldf package to read into R.
Or use the ODBC package if you convert to one of the databases that work with ODBC. The conversion of fields problems should be less if you are do the conversion to database correctly.

Related

How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql?

I am a relative beginner to R trying to load and explore a large (7GB) CSV file.
It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).
It's too large to read straight into R and my searching has made me think the sqldf package could be useful. But when I try and read the file in with this code ...
library(sqldf)
library(here)
read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")
I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 10 did not have 196 elements
Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.
Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).
Based on the error message, it seems unlikely that you can read the CSV file en toto into memory, even once. I suggest for analyzing the data within it, you may need to change your data-access to something else, such as:
DBMS, whether monolithic (duckdb or RSQLite, lower cost-of-entry) or full DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (monolithic or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis.
Arrow parquet file. These are directly supported by dplyr functions and in a lazy fashion, meaning that when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load data; you call your dplyr mutate/filter/select/summarize pipe (some limitations), and then you finally call ... %>% collect(), only then it loads the resulting data into memory. Similar to SQL above in that you work on subsets at a time, but if you're already familiar with dplyr, it is much much closer than learning SQL from scratch.
There are ways to get a large CSV file into each of this.
Arrow/Parquet: How to convert a csv file to parquet (python,
arrow/drill), a quick search in your favorite search-engine should provide other possibilities; regardless of the language you want to do your analysis in ("R"), don't constrain yourself to solutions using that language.
SQL: DuckDB (https://duckdb.org/docs/data/csv.html), SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/), and other DBMSes tend to have a "bulk" command for importing raw CSV.

Working with excel file in R

I am still suffering every time I deal with excel file in R.
What is the best way to do the following?
1- Import excel in R as a "whole workbook" and be able to do analysis in any sheet in the workbook? if you think about using XLConnect, please bear in mind the "out of memory" problem with Java. I have over 30MB files and dealing with Java memory problem every time consume more time. (running -Xmx does not work for me).
2- Do not miss any data from any excel sheet? saving file into csv says that some sheets are "out of range" which is 65,536 rows and 256 columns. Also it can not deal with some formulas.
3- Do not have to import each sheet separately? Importing sheets to SPSS, STATA or Eviews and save it into their extension and then work with the output file in R works fine most of the time. However, this method has two major problems; one is that you have to have the software downloaded on the machine and the other is that it imports only one sheet at time. If I have over 30 sheets, it will become very time consuming.
This might be an ongoing question that has been answered many many times, however, each answer solving a part of the problem not the whole issue. It is like putting the fire not strategically solving the problem.
I am on Mac OS 10.10 with R 3.1.1
I have tried a few package to open an excel openxlsx is definitely the best route. It is way faster and more stable than the other ones. The function is : openxlsx::read.xlsx. My advice is to use it to read the whole sheet and then play with the data within R, rather than reading several times part of the sheet. I used it a lot to open large excel files (8000 col plus) for 1000 lines plus, and it always worked well. I use the package xlsx to write in excel, but it had numerous memory issues to read (that's why I moved to openxlsx)
-Add In
On a side note, if you want to use R with excel you sometimes need to execute a VBA code from R. I found the procedure to be quite difficult to achieve. I fully documented the proper way of doing it in a previous question in stack : Apply VBA from R
Consider using the xlsx package. It has methods for dealing with excel files and worksheets. Your question is quite broad, but I think this can be an example:
library(xlsx)
wb <- loadWorkbook('r_test.xlsx')
sheets <- getSheets(wb)
sheet <- sheets[[1]]
df <- readColumns(sheet,
startColumn = 1, endColumn = 3,
startRow = 1, endRow = 6)
df
## id name x_value
##1 1 A 10
##2 2 B 15
##3 3 C 20
##4 4 D 13
##5 5 E 17
As for the memory issue I think you should check the ff package:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
Another option (but it may be overkill) would be to load the data to a real database and deal with database connections. If you are dealing with really big datasets, a database may be the best approach.
Some options would be:
The RSQLite package
If you can load your data to an SQLite database, you can use this package to connect directly to that database and handle the data directly. That would "split" the workload between R and the database engine. SQLite is quite easy to use and (almost) "config free", and each SQLite database is stored in a single file.
The RMySQL package
Even better than the above option; MySQL is great for storing large datasets. However you'll need to install and configure a MySQL server in your computer.
Remember: If you work with R and a database, delegate as much heavy workload to the database (e.g. data filtering, aggregation, etc), and use R to get the final results.

Large csv file fails to fully read in to R data.frame

I am trying to load a fairly large csv file into R. It has about 50 columns and 2million row.
My code is pretty basic, and I have used it to open files before but none this large.
mydata <- read.csv('file.csv', header = FALSE, sep=",", stringsAsFactors = FALSE)
The result is that it reads in the data but stops after 1080000 rows or so. This is roughly where excel stops as well. Is their way to get R to read the whole file in? Why is it stopping around half way.
Update: (11/30/14)
After speaking with the provider of the data it was discovered that they may have been some corruption issue with the file. A new file was provided which also is smaller and loads into R easily.
As, "read.csv()" read up to 1080000 rows, "fread" from library(data.table) should read it with ease. If not, there exists two other options, either try with library(h20) or with "fread" you can use select option to read required columns (or read in two halves, do some cleaning and can merge them back).
You can try using read.table and include the parameter colClasses to specify the type of the individual columns.
With your current code, R will read all data first as strings and then check for each column if it is convertible e. g. to a numeric type, which needs more memory than reading right away as numeric. colClasses will also allow you to ignore columns you might not need.

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (and this was the vignette active on 6/16/2014 when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines, intended to make filesizes smaller. They're not as good as gz or similar (not nearly as good), but they're supported by SAS transparently while writing SAS programs. Obviously they change the file format significantly, hence the lack of implementation yet.
If you have SAS, you need to write these to an uncompressed dataset.
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above and change have and want to names that are correct (have should be the filename without extension of the file, in most cases; want can be changed to anything logical with A-Z or underscore only, and 32 or fewer characters).
If you don't have SAS, you'll have to ask your data provided to make the data available uncompressed, or as a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from and there might be a way to help you identify an uncompressed source.
This admittedly is not a pure R solution, but in many situations (e.g. if you aren't on a pc and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better using this than needing to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd
InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"
with SAS7BDAT(InFileName) as f:
df = f.to_data_frame()
df.to_csv(path_or_buf = OutFileName, sep = "\t", encoding = 'utf-8', index = False)
The haven package can read compressed SAS-files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
But only SAS-files which are compressed using compress=char, but not compress=binary.
So haven will be able to read this SAS-file:
data output.compressed_data_char (compress=char);
set inputdata;
run;
But not this SAS-file:
data output.compressed_data_binary (compress=binary);
set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package to read SAS data sets (compressed or uncompressed).You can use rxImport function of this package. Below is the example
Importing library
library(RevoScaleR)
Reading data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7dbat files, three simple lines:
library(haven)
data <- read_sas("yourfile.sas7dbat")
and then transform it to csv
write.csv(data,"data.csv")

Importing Excel files into R, xlsx or xls

Please can someone help me on the best way to import an excel 2007 (.xlsx) file into R. I have tried several methods and none seems to work. I have upgraded to 2.13.1, windows XP, xlsx 0.3.0, I don't know why the error keeps coming up. I tried:
AB<-read.xlsx("C:/AB_DNA_Tag_Numbers.xlsx","DNA_Tag_Numbers")
OR
AB<-read.xlsx("C:/AB_DNA_Tag_Numbers.xlsx",1)
but I get the error:
Error in .jnew("java/io/FileInputStream", file) :
java.io.FileNotFoundException: C:\AB_DNA_Tag_Numbers.xlsx (The system cannot find the file specified)
Thank you.
For a solution that is free of fiddly external dependencies*, there is now readxl:
The readxl package makes it easy to get data out of Excel and into R.
Compared to many of the existing packages (e.g. gdata, xlsx,
xlsReadWrite) readxl has no external dependencies so it's easy to
install and use on all operating systems. It is designed to work with
tabular data stored in a single sheet.
Readxl supports both the legacy .xls format and the modern xml-based
.xlsx format. .xls support is made possible the with libxls C library,
which abstracts away many of the complexities of the underlying binary
format. To parse .xlsx, we use the RapidXML C++ library.
It can be installed like so:
install.packages("readxl") # CRAN version
or
devtools::install_github("hadley/readxl") # development version
Usage
library(readxl)
# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
* not strictly true, it requires the Rcpp package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. But they don't require any fiddling with paths, etc., so that's an advantage over Java and Perl dependencies.
Update There is now the rexcel package. This promises to get Excel formatting, functions and many other kinds of information from the Excel file and into R.
You may also want to try the XLConnect package. I've had better luck with it than xlsx (plus it can read .xls files too).
library(XLConnect)
theData <- readWorksheet(loadWorkbook("C:/AB_DNA_Tag_Numbers.xlsx"),sheet=1)
also, if you are having trouble with your file not being found, try selecting it with file.choose().
I would definitely try the read.xls function in the gdata package, which is considerably more mature than the xlsx package. It may require Perl ...
Update
As the Answer below is now somewhat outdated, I'd just draw attention to the readxl package. If the Excel sheet is well formatted/lain out then I would now use readxl to read from the workbook. If sheets are poorly formatted/lain out then I would still export to CSV and then handle the problems in R either via read.csv() or plain old readLines().
Original
My preferred way is to save individual Excel sheets in comma separated value (CSV) files. On Windows, these files are associated with Excel so you don't loose the double-click-open-in-Excel "feature".
CSV files can be read into R using read.csv(), or, if you are in a location or using a computer set up with some European settings (where , is used as the decimal place), using read.csv2().
These functions have sensible defaults that makes reading appropriately formatted files simple. Just keep any labels for samples or variables in the first row or column.
Added benefits of storing files in CSV are that as the files are plain text they can be passed around very easily and you can be confident they will open anywhere; one doesn't need Excel to look at or edit the data.
Example 2012:
library("xlsx")
FirstTable <- read.xlsx("MyExcelFile.xlsx", 1 , stringsAsFactors=F)
SecondTable <- read.xlsx("MyExcelFile.xlsx", 2 , stringsAsFactors=F)
I would try 'xlsx' package for it is easy to handle and seems mature enough
worked fine for me and did not need any additionals like Perl or whatever
Example 2015:
library("readxl")
FirstTable <- read_excel("MyExcelFile.xlsx", 1)
SecondTable <- read_excel("MyExcelFile.xlsx", 2)
nowadays I use readxl and have made good experience with it.
no extra stuff needed
good performance
This new package looks nice http://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf
It doesn't require rJava and is using 'Rcpp' for speed.
If you are running into the same problem and R is giving you an error -- could not find function ".jnew" -- Just install the library rJava. Or if you have it already just run the line library(rJava). That should be the problem.
Also, it should be clear to everybody that csv and txt files are easier to work with, but life is not easy and sometimes you just have to open an xlsx.
For me the openxlx package worked in the easiest way.
install.packages("openxlsx")
library(openxlsx)
rawData<-read.xlsx("your.xlsx");
I recently discovered Schaun Wheeler's function for importing excel files into R after realising that the xlxs package hadn't been updated for R 3.1.0.
https://gist.github.com/schaunwheeler/5825002
The file name needs to have the ".xlsx" extension and the file can't be open when you run the function.
This function is really useful for accessing other peoples work. The main advantages over using the read.csv function are when
Importing multiple excel files
Importing large files
Files that are updated regularly
Using the read.csv function requires manual opening and saving of each Excel document which is time consuming and very boring. Using Schaun's function to automate the workflow is therefore a massive help.
Big props to Schaun for this solution.
What's your operating system? What version of R are you running: 32-bit or 64-bit? What version of Java do you have installed?
I had a similar error when I first started using the read.xlsx() function and discovered that my issue (which may or may not be related to yours; at a minimum, this response should be viewed as "try this, too") was related to the incompatability of .xlsx pacakge with 64-bit Java. I'm fairly certain that the .xlsx package requires 32-bit Java.
Use 32-bit R and make sure that 32-bit Java is installed. This may address your issue.
You have checked that R is actually able to find the file, e.g. file.exists("C:/AB_DNA_Tag_Numbers.xlsx") ? – Ben Bolker Aug 14 '11 at 23:05
Above comment should've solved your problem:
require("xlsx")
read.xlsx("filepath/filename.xlsx",1)
should work fine after that.
I have tried very hard on all the answers above. However, they did not actually help because I used a mac. The rio library has this import function which can basically import any type of data file into Rstudio, even those file using languages other than English!
Try codes below:
library(rio)
AB <- import("C:/AB_DNA_Tag_Numbers.xlsx")
AB <- AB[,1]
Hope this help.
For more detailed reference: https://cran.r-project.org/web/packages/rio/vignettes/rio.html
You may be able to keep multiple tabs and more formatting information if you export to an OpenDocument Spreadsheet file (ods) or an older Excel format and import it with the ODS reader or the Excel reader you mentioned above.
As stated by many here, I am writing the same thing but with an additional point!
At first we need to make sure that our R Studio has these two packages installed:
"readxl"
"XLConnect"
In order to load a package in R you can use the below function:
install.packages("readxl/XLConnect")
library(XLConnect)
search()
search will display the list of current packages being available in your R Studio.
Now another catch, even though you might have these two packages but still you may encounter problem while reading "xlsx" file and the error could be like "error: more columns than column name"
To solve this issue you can simply resave your excel sheet "xlsx" in to
"CSV (Comma delimited)"
and your life will be super easy....
Have fun!!
The installation of xlsx package require rJava and xlsxjars. Indirectly they require the specific (32 or 64 bit) java runtime environment on the system.
Pro of read.xlsx: In the same package there are read.xlsx and write.xlsx
Con: Very low speed
As suggested, the easy way is to save in .csv format from excel.
Simple benchmark on a 5800x15 dataset (median)
read.xlsx: >10000ms
read_xlsx: 70ms
read.csv: 15ms

Resources