Read Excel Tables, not simple named ranges - r

To avoid "duplicate" close request: I know how to read Excel named ranges; examples are given in the code below. This is about "real" tables in Excel.
Excel 2007 and later have the useful concept of tables: you can convert ranges to tables and avoid hassles when sorting and rearranging. When you create a table in an Excel range, it gets a default name (Tabelle1 in the German version, TableName in the following example), but you can additionally name the range of the table (TableAsRangeName); as the icons in the Excel range name editor indicate, these two seem to be treated differently.
I have not been able to read these tables (in the strict sense) from R. The only known workarounds are going through an intermediate CSV file or converting the table to a normal named range, which has nasty irreversible side effects when you use column names in cell references: these are converted to A1 notation.
The example below shows the problem. Your mileage may vary with different combinations of 32/64 bit ODBC drivers and 32/64 bit Java.
# Read Excel Tables (not simply named ranges)
# Test Computer: 64 Bit Windows 7, R 32 bit
# My ODBC drivers are 32 bit
# Test file has three ranges
# NonTable Simple named range
# TableName Name of table
# TableAsRangeName Named Range covering the above table
sampleFile = "ExcelTables.xlsx"
if (!file.exists(sampleFile)){
  # Get the sample file manually if it is not in the working directory
  stop("Sample file not found: ", sampleFile)
}
library(RODBC)
channel = odbcConnectExcel2007(sampleFile)
sqlQuery(channel, "SELECT * from NonTable") # Ok
sqlQuery(channel, "SELECT * from TableName") # Could not find range
sqlQuery(channel, "SELECT * from TableAsRangeName") # Could not find range
# gdata has read.xls, but seems not to support named regions
library(xlsx) # getRanges/readRange/getSheets come from the xlsx package
wb = loadWorkbook(sampleFile)
getRanges(wb) # This one fails already with "TableName" does not exist
ws = getSheets(wb)[[1]]
readRange("NonTable",ws) # Invalid range address
readRange("TableName",ws) # Invalid range address
readRange("TableAsRangeName",ws) # Invalid range address
# my machine requires 64 bit for this one; depends on your Java installation
sampleFile = "ExcelTables.xlsx"
library(XLConnect) # requires Java
readNamedRegionFromFile(sampleFile,"NonTable") # OK
readNamedRegionFromFile(sampleFile,"TableName") # "TableName" does not exist
readNamedRegionFromFile(sampleFile,"TableAsRangeName") # NullPointerException
wb <- loadWorkbook(sampleFile)
readNamedRegion(wb,"NonTable") # Ok
readNamedRegion(wb,"TableName") # does not exist
readNamedRegion(wb,"TableAsRangeName") # Null Pointer

I've added some initial support for Excel tables in XLConnect; the latest changes are on GitHub.
Here is a small sample:
sampleFile = "ExcelTables.xlsx"
wb = loadWorkbook(sampleFile)
readTable(wb, sheet = "ExcelTable", table = "TableName")
Note that Excel tables are associated with a sheet. So as far as I can see, it's possible to have multiple tables with the same name on different sheets. For this reason readTable has a sheet argument.

You are correct that the table definitions are stored in XML.
library(XML)       # xmlParse, xpathApply, xmlAttrs
library(XLConnect) # readWorksheetFromFile
sampleFile = "ExcelTables.xlsx"
unzip(sampleFile, exdir = 'test')
tData <- xmlParse('test/xl/tables/table1.xml')
tables <- xpathApply(tData, "//*[local-name() = 'table']", xmlAttrs)
id name displayName ref totalsRowShown
"1" "TableName" "TableName" "G1:I4" "0"
readWorksheetFromFile(sampleFile, sheet = "ExcelTable", region = tables[[1]]['ref'], header = TRUE)
Name Age AgeGroup
1 Anton 44 4
2 Bertha 33 3
3 Cäsar 21 2
Depending on your situation you could search in the XML files for appropriate quantities.
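The ref attribute found above is an A1-style address. If a reader function wants numeric bounds instead (XLConnect's readWorksheet, for instance, accepts startRow, startCol, endRow and endCol), the conversion can be sketched in base R. The helper names col_to_num and parse_ref are made up for this illustration, and the sketch assumes a plain two-corner reference without sheet prefixes or $ signs.

```r
# Convert column letters ("G", "AA", ...) to a 1-based column number.
col_to_num <- function(letters_part) {
  chars <- utf8ToInt(toupper(letters_part)) - utf8ToInt("A") + 1
  sum(chars * 26^(rev(seq_along(chars)) - 1))
}

# Split a reference such as "G1:I4" into numeric row/column bounds.
parse_ref <- function(ref) {
  corners <- strsplit(ref, ":", fixed = TRUE)[[1]]
  cols <- sapply(gsub("[0-9]", "", corners), col_to_num, USE.NAMES = FALSE)
  rows <- as.integer(gsub("[A-Za-z]", "", corners))
  list(startRow = rows[1], startCol = cols[1],
       endRow = rows[2], endCol = cols[2])
}

bounds <- parse_ref("G1:I4")
str(bounds)
```

The resulting list could then be passed on to whichever reader you use; whether the header row is part of the range depends on how the table was defined.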

Later addition:
readxl::read_excel can read "real" tables and is probably the least troublesome solution when you want to read data frames/tibbles.
After @Jamzy's comment:
I tried again and could not read named ranges. False positive then or false negative now???


Read a 20GB file in chunks without exceeding my RAM - R

I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is that I'm limited to 16 GB of RAM. I tried using readr and processing the data in chunks with the function read_csv_chunked, and read_csv with the skip parameter, but both exceeded my RAM limit.
Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read in ASCII data in chunks. It can be used directly or if you are using dplyr the chunked package uses it providing an interface for use with dplyr.
The readr package has read_csv_chunked and related functions.
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns that it will be small enough to just read it in and process in one go.
vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in the select= argument which may make it small enough to read it in in one go.
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
read.csv.sql in the sqldf (also see github page) package can read a file larger than R can handle into a temporary external SQLite database which it creates for you and removes afterwards and reads the result of the SQL statement given into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments which will depend on your file.
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to only read the first 3 columns.
Another approach is to pre-process the file using cut, sed or awk (available natively in UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix outside of R to remove all but the first three columns and then see if that makes it small enough to read in one go.
Also check out the High Performance Computing task view.
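As a minimal sketch of the colClasses approach from the list above, the following uses a tiny made-up CSV (the column names and values are purely illustrative) in place of the large file. It first reads a single row to count the columns, then skips everything after the third column.

```r
# Build a small 5-column CSV to stand in for the large file (illustrative data).
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"),
                   col3 = c(0.5, 1.5, 2.5, 3.5), col4 = 4:1, col5 = 11:14)
write.csv(demo, tmp, row.names = FALSE)

# Read one data row only, just to learn the number of columns.
nc <- ncol(read.csv(tmp, nrows = 1))

# NA lets read.csv guess the class; "NULL" skips the column entirely.
first3 <- read.csv(tmp, colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))
str(first3)
```

Skipped columns are never stored in the result, which is where the memory saving would come from on a real file; the file is still scanned in full.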
We can try something like this, first a small example csv:
X = data.frame(id=1:1e5,matrix(runif(1e6),ncol=10))
write.csv(X,"test.csv",quote=FALSE,row.names=FALSE)
You can use the nrows argument of read.csv: instead of providing a file name, you provide a connection, and you store the results inside a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:
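for(i in 2:200){
  # read.csv raises an error once the connection is exhausted, so stop there
  chunk = tryCatch(read.csv(con, nrows=1000,col.names=COLS,header=FALSE),
                   error = function(e) NULL)
  if (is.null(chunk)) break
  data[[i]] = chunk[,1:3]
}
close(con)
Finally, we combine all of those into one data.frame (unused list slots stay NULL and are dropped by rbind):
data = do.call(rbind,data)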
[1] TRUE
You can see that I specified a much larger list than required. This shows that if you don't know how long the file is, specifying something larger still works, which is a bit simpler than writing a while loop.
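So we wrap it into a function (called read_chunkcsv here; the original name is lost), specifying the file, the number of rows to read at one go, the number of times, and the column names (or positions) to subset:
read_chunkcsv = function(file, rows_to_read, ntimes, col_subset){
  data = vector("list",ntimes)
  con = file(file,"r")
  data[[1]] = read.csv(con, nrows=rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][,col_subset]
  for(i in 2:ntimes){
    chunk = tryCatch(read.csv(con, nrows=rows_to_read, col.names=COLS, header=FALSE), error = function(e) NULL)
    if (is.null(chunk)) break # connection exhausted
    data[[i]] = chunk[,col_subset]
  }
  close(con)
  do.call(rbind,data)
}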

How to write contents of data frame back to range?

I need to perform the following sequence:
Open Excel Workbook
Read specific worksheet into R dataframe
Read from a database updating dataframe
Write dataframe back to worksheet
I have steps 1-3 working OK using the BERT tool (the R scripting interface).
For step 2 I use range.to.data.frame from BERT.
Any pointers on how to perform step 4? There is no obvious counterpart for writing a data frame back to a range.
I tried range$put_Value(df), but it returns no error and does not update Excel.
I can update a single cell from R using put_Value, which I cannot find documented.
# manipulate status data using R BERT tool
wb <- EXCEL$Application$get_ActiveWorkbook()
wbname = wb$get_FullName()
ws <- EXCEL$Application$get_ActiveSheet()
topleft = ws$get_Range( "a1" )
rng = topleft$get_CurrentRegion()
#rngbody = rng$get_Offset(1,0)
ssot = rng$get_Value()
ssotdf = range.to.data.frame( ssot, headers=T )
# emulate data update on 2 columns
ssotdf$ServerStatus = "Disposed"
ssotdf$ServerID = -1
# try to write df back
retcode = rng$put_Value(ssotdf)
This answer doesn't use R Excel BERT.
Try the openxlsx library. You probably can do all the steps using that library. For the step 4, after installing openxlsx library, the following code will write a file:
openxlsx::write.xlsx(ssotdf, 'Dataframe.xlsx',asTable = T)
I think your problem is that you are not changing the size of the range, so you are not going to see your new columns. Try creating a new range that has two extra columns before you insert the data.
I just had the same question and was able to resolve it by converting the data.frame to a matrix in the call to put_Value. I figured this out after playing with the old version in excel-functions.r. Try something like:
retcode = rng$put_Value(as.matrix(ssotdf))
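To see what the matrix conversion actually hands over, here is a base R sketch with a made-up two-row stand-in for ssotdf (the values are illustrative, and nothing here touches Excel or BERT). It shows two things worth knowing before writing back: a data frame with any text column becomes an all-character matrix, and the column names are not part of the matrix body.

```r
# Illustrative stand-in for the data frame read from the sheet.
ssotdf <- data.frame(ServerID = c(101, 102),
                     ServerStatus = c("Active", "Active"))
ssotdf$ServerStatus <- "Disposed"
ssotdf$ServerID <- -1

m <- as.matrix(ssotdf)
dim(m)   # rows x columns of the block that would be written
mode(m)  # "character", because one column is text

# as.matrix keeps the column names only as dimnames; bind them on top
# if the target range includes the header row.
with_header <- rbind(colnames(ssotdf), m)
dim(with_header)
```

Whether the header row has to be included depends on the range you write to (CurrentRegion in the question includes it); that part is an assumption to check against your sheet.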
You may have already solved your problem but, if not, the following stripped down R function does what I think you need:
testDF <- function(rng1,rng2){
app <- EXCEL$Application
ref1 <- app$get_Range( rng1 ) # get source range reference
data <- ref1$get_Value() # get source range data
ref2 <- app$get_Range( rng2 ) # get destination range reference
ref2$put_Value( data ) # put data in destination range
}
I simulated a dataframe by setting values in range "D4:F6" of the spreadsheet to:
col1 col2 col3
1 2 txt1
7 3 txt2
then ran
testDF("D4:F6","H10:J12")
in the BERT console. The dataframe then appears in range "H10:J12".

Subset large .csv file at reading in R

I have a very large .csv file (~4GB) which I'd like to read, then subset.
The problem comes at reading (memory allocation error). Because the file is so large, reading it crashes, so what I'd like is a way to subset the file before or while reading it, so that only the rows for one city (Cambridge) are loaded.
id City Value
1 London 17
2 Coventry 21
3 Cambridge 14
I've already tried the usual approaches:
f <- read.csv(f, stringsAsFactors=FALSE, header=T, nrows=100)
f.colclass <- sapply(f,class)
f <- read.csv(f,sep = ",",nrows = 3000000, stringsAsFactors=FALSE,
header=T, colClasses=f.colclass)
which seem to work for up to 1-2M rows, but not for the whole file.
I've also tried subsetting at the reading itself using pipe:
f<- read.table(file = f,sep = ",",colClasses=f.colclass,stringsAsFactors = F,pipe('grep "Cambridge" f ') )
and this also seems to crash.
I thought packages sqldf or data.table would have something, but no success yet !!
Thanks in advance, p.
I think this was alluded to already but just in case it wasn't completely clear: the sqldf package creates a temporary SQLite DB on your machine based on the csv file and allows you to write SQL queries that subset the data before the results are saved to a data.frame.
library(sqldf)
query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)
#or rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
sql = "select sum(Value) from file where City=='Cambridge' " )
One solution to this type of error is to convert your CSV file to an Excel file first.
Then you can map the Excel file into a MySQL table using Toad for MySQL, which is easy; just check the data types of the variables.
Then, using the RODBC package, you can access such a large dataset.
I work with datasets larger than 20 GB this way.
Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (Here's an example from one of the multitude of similar questions/answers).
The only difference is, unlike for most of the answers that load the whole file, you would read it chunk by chunk and only keep the subset you need at each iteration
# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")
# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go
i = 0 # set i to 0 as it'll increase as we loop through the chunks
# loop through the chunks and select rows that contain cambridge
repeat {
  if (i == 0) {
    # things to do only on the first read-through
    # read in the column names only on the first go
    grab_header = TRUE
    # load the chunk
    tmp_chunk = read.csv(con, nrows = chunk_size, header = grab_header)
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"
    # initiate container for desired data
    df = tmp_chunk[cond,] # save desired subset in initial container
    cols = colnames(df) # save column names to re-use on next chunks
  } else {
    # things to do on all subsequent non-first chunks
    grab_header = FALSE
    # read.csv raises an error once the connection is exhausted; map that to NULL
    tmp_chunk = tryCatch(
      read.csv(con, nrows = chunk_size, header = grab_header, col.names = cols),
      error = function(e) NULL)
    # set stopping criteria for the loop
    # when nothing more can be read, exit loop
    if (is.null(tmp_chunk) || nrow(tmp_chunk) == 0) break
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"
    # append to existing dataframe
    df = rbind(df, tmp_chunk[cond,])
  }
  # add 1 to i to avoid the things needed to do on the first read-in
  i = i + 1
}
close(con) # close connection
# check out the results
head(df)
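A related variant, closer to the grep idea from the question, filters raw text lines before any parsing happens. This is only a sketch in base R: the function name filter_csv_lines is made up here, the demo file reuses the three sample rows from the question, and it assumes that no field contains embedded newlines and that a plain substring match on the whole line is good enough (a city name appearing in another column would also match).

```r
# Read a CSV as text in chunks, keep only lines containing `pattern`,
# then parse the surviving lines in one go.
filter_csv_lines <- function(path, pattern, chunk_lines = 10000) {
  con <- file(path, "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break  # readLines returns character(0) at end of file
    kept <- c(kept, lines[grepl(pattern, lines, fixed = TRUE)])
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}

# Small demo file built from the sample rows in the question.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,City,Value", "1,London,17", "2,Coventry,21", "3,Cambridge,14"), tmp)
res <- filter_csv_lines(tmp, "Cambridge", chunk_lines = 2)
res
```

Because readLines simply returns an empty vector at the end of the file, the stopping condition needs no error handling, unlike read.csv on an exhausted connection.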

Getting an SPSS data file into R

At my company, we are thinking of gradually phasing out SPSS in favor of R. During the transition, though, the data will still arrive in SPSS data file format (.sav).
I'm having issues importing this SPSS data file into R. When I import an SPSS file into R, I want to retain both the values and the value labels of the variables. The read.spss() function from the foreign package gives me the option to retain either the values OR the value labels of a variable, but not both.
AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.
Alternatively, I came across spss.system.file() function from memisc package which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not necessarily available to me always.
Here's a sample data file.
I'd appreciate any help resolving this issue.
I do not know how to read in SPSS metadata; I usually read .csv files and add metadata back, or write a small one-off PERL script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and seems useful.
There is a solution to read SPSS data file in R by ODBC driver.
1) There is a IBM SPSS Statistics Data File Driver. I could not find the download link. I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.
2) Create a DSN for the SPSS data driver.
3) Using RODBC package you can read in R any SPSS data file. It will be possible to get value labels for each variable as separate tables. Then it is possible to use the labels in R in any way as you wish.
Here is a working example on Windows (I do not have SPSS on my computer now) that reads your example data file into R. I have not tested this on Linux. It probably works there too, because an SPSS data driver is also available for Linux.
library(RODBC)
# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
# List of tables
Tables <- sqlTables(con)
# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
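# Function to query a table by name
# (the helper name query.table is a reconstruction; the original name was lost)
query.table <- function(table) {
  sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}
# Retrieve all tables
Data <- lapply(table.names, query.table)
# See the data
lapply(Data, head)
# Close connection
odbcClose(con)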
For example, we can see that value labels are defined for two variables:
VAR00002 VAR00002_label
1 1 Male
2 2 Female
VAR00003 VAR00003_label
1 2 Student
2 3 Employed
3 4 Unemployed
Additional information
Here is a function that reads SPSS data once the connection to the SPSS data file has been made. It lets you specify the list of variables to select. If value.labels=T, the selected variables that have value labels in the SPSS data file are converted to R factors with the labels attached.
I have to say I am not satisfied with the performance of this solution. It works well for small data files, but the RAM limit is reached quite often for large SPSS data files (even when only a subset of variables is selected).
get.spss <- function(channel, variables = NULL, value.labels = F) {
  VarNames <- sqlQuery(channel = channel,
    query = "SELECT VarName FROM [Variables]", as.is = T)$VarName
  if (is.null(variables)) variables <- VarNames else {
    if (any(!variables %in% VarNames)) stop("Wrong variable names")
  }
  if (value.labels) {
    # variables that have a value label table
    ValueLabelTableName <- sqlQuery(channel = channel,
      query = "SELECT VarName FROM [Variables]
      WHERE ValueLabelTableName is not null", as.is = T)$VarName
    ValueLabelTableName <- intersect(variables, ValueLabelTableName)
  }
  variables <- paste(variables, collapse = ", ")
  data <- sqlQuery(channel = channel,
    query = paste("SELECT", variables, "FROM [Cases]"), as.is = T)
  if (value.labels) {
    # convert labelled variables to factors
    for (var in ValueLabelTableName) {
      VL <- sqlQuery(channel = channel,
        query = paste0("SELECT * FROM [VLVAR", var,"]"), as.is = T)
      data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
    }
  }
  return(data)
}
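The factor() call at the end is the step that keeps both the values and the value labels. A minimal base R illustration, using the VAR00002 label table shown earlier and a made-up vector of codes (the codes are purely illustrative):

```r
# Value label table as returned for VAR00002 above.
VL <- data.frame(VAR00002 = c(1, 2),
                 VAR00002_label = c("Male", "Female"),
                 stringsAsFactors = FALSE)

# Illustrative raw codes as they might come back from the [Cases] table.
codes <- c(2, 1, 1, 2, NA)

# Codes become the factor levels, labels become what is displayed.
sex <- factor(codes, levels = VL[, 1], labels = VL[, 2])

table(sex, useNA = "ifany")
as.integer(sex)  # position of each value's level; equals the code here because the codes are 1 and 2
```

Note that as.integer() on a factor returns level positions, so for variables whose codes do not start at 1 (VAR00003 above uses 2, 3, 4) the original codes would have to be kept in a separate column or looked up through the label table.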
My work is going through the same transition.
read.spss() returns the variable labels as an attribute of the object you create with it. So in the example below I have a data frame called rvm which was created by read.spss(). It has 3,500 variables with short names a1, a2 etc. but long labels for each variable in SPSS. I can access the variable labels by
attributes(rvm)$variable.labels
which returns a list of the full names of all 3,500 variables, up to
x23 "Other Expenditure Uncapped Daily Expenditure In Region"
x24 "Accommodation Expenditure In Region"
x25 "Food/Meals/Drink Expenditure In Region"
x26 "Local Transport Expenditure In Region"
x27 "Sightseeing/Attractions Expenditure In Region"
x28 "Event/Conference Expenditure In Region"
x29 "Gambling/Casino Expenditure In Region"
x30 "Gifts/Souvenirs Expenditure In Region"
x31 "Other Shopping Expenditure In Region"
x0 "Accommodation Daily Expenditure In Region"
What to do with these is another matter, but at least I have them, and if I want I can put them in some other object for safekeeping, searching with grep, etc.
Since you have SPSS available, I recommend installing the "Essentials for R" plugin (free of charge, but you need to register, also see the installation instructions) which allows you to run R within SPSS. The plugin includes an R package with functions that transfer the active SPSS data frame to R (and back) - including labeled factor levels, dates, German umlauts - details that are otherwise notoriously difficult. In my experience, it is more reliable than R's own foreign package.
Once you have everything set up, open the data in SPSS, and run something like the following code in the syntax window:
begin program r.
myDf <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE) # any further arguments were cut off in the source
save(myDf, file="d:/path/to/your/myDf.Rdata")
end program.
Nowadays, the package haven provides the functionality to achieve what you want (and much more).
The function read_sav() can import *.sav and *.zsav files and returns a tibble. The value labels are automatically stored in the labels attribute of the corresponding variables within that tibble (the variable label itself is kept in the label attribute). The class labelled preserves the original semantics and allows us to associate arbitrary labels with numeric or character vectors. If needed, we can use the function as_factor() to coerce labelled objects, i.e. objects of the class labelled, and even all labelled vectors within data.frames or tibbles (at once) to factors.

How to allocate/append a large column of Date objects to a data-frame

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL cannot natively represent dates as an R Date object, and that matters for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, for now this is how I want to do it - I might use native SQL eventually but please don't quibble that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
Then, as a workaround, copy tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not run out of memory, with a garbage collection after each chunk:
for (i in 0:7) {
  from <- floor(i*N/8)
  to <- floor((i+1)*N/8) -1
  if (i==7)
    to <- N
  print(c("Copying tr.sql$visit_date",from,to," ..."))
  tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
  gc() # garbage-collect after each chunk
}
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here is the dataset and code, I am having multiple separate problems with this, explanation below:
'training.csv' looks like...
and code:
# Read in as SQL (for memory-efficiency)...
tr.sql <- read.csv.sql('training.csv')
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The second line that tries to declare data.frame(visit_date = Date(N)) fails, I don't know the correct qualified name with namespace for Date object (tried chron::Date , Dates::Date? don't work)
Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
Try this ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large in relation to available memory that it's unlikely you are going to be able to do much analysis with it anyway.)
# create test data file
Lines <-
cat(Lines, file = "trainingtest.csv")
# read it back
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.
Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
           a
1 1974-01-10
2 1974-01-15
3 1974-01-20
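Putting the pieces of this thread together: the error message in the question suggests that visit_date arrives as text holding day counts ("14700", "14705", ...). Under that assumption, a Date column can be preallocated and then filled chunk by chunk in base R. The five values and the two-chunk split below are illustrative only.

```r
N <- 5
# Illustrative stand-in for tr.sql$visit_date: day counts stored as text.
raw_dates <- c("14700", "14705", "14710", "14715", "14720")

# Preallocate N missing dates directly as a Date vector.
visit_date <- structure(rep(NA_real_, N), class = "Date")

# Fill in chunks, converting text -> number -> Date one chunk at a time,
# so only a chunk-sized temporary exists during the conversion.
for (idx in list(1:3, 4:5)) {
  visit_date[idx] <- as.Date(as.numeric(raw_dates[idx]), origin = "1970-01-01")
}

tr <- data.frame(visit_date = visit_date)
str(tr)
```

On the real data the chunks would be index ranges over 1:N, as in the for-loop in the question; the point of the sketch is only that the column is a Date from the start and never needs its class patched afterwards.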
