So I have a bunch of excel files I want to loop through and read specific, discontinuous columns into a data frame. Using the readxl works for the basic stuff like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip=3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely) but the problem is my excel files have about 114 columns and I only want specific ones. Also I do not want to allow R to guess the col_types because it messes some of them up (eg for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric, and crashes). So my question is: How do I specify specific, discontinuous columns to read? The range argument uses the cell_ranger package which does not allow for reading discontinuous columns. So any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices. Parameter colTypes deals with column types. This way it works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
Related
I'm importing and appending hundreds of Excel spreadsheets into R using map_dfr in combination with a user-defined function:
Function to import specific columns in each worksheet:
fctn <- function(path){map_dfc(.x = c(1,2,3,7,10,11,12,13), ~ read.xlsx(path,
sheet=1,
startRow = 7,
colNames = FALSE,
cols = .x))}
Code to pull all the files in the "path" and append them, where file.list is the list of paths and files to import:
all.files <- map_dfr(file.list, ~ fctn(path=.x))
My problem is, some of these sheets have missing values in some of the columns, but not others, and R doesn't like that. I encounter this error, for instance:
"Error: can't recycle '..1' (size 8) to match '..2' (size 6)", which happens because column 2 is missing information in two cells.
Is there any way to make R accept missing values in cells?
Is there a way to import a named Excel-table into R as a data.frame?
I typically have several named Excel-tables on a single worksheet, that I want to import as data.frames, without relying on static row - and column references for the location of the Excel-tables.
I have tried to set namedRegion which is an available argument for several Excel-import functions, but that does not seem to work for named Excel-tables. I am currently using the openxlxs package, which has a function getTables() that creates a variable with Excel-table names from a single worksheet, but not the data in the tables.
To get your named table is a little bit of work.
First you need to load the workbook.
library(openxlsx)
wb <- loadWorkbook("name_excel_file.xlsx")
Next you need to extract the name of your named table.
# get the name and the range
tables <- getTables(wb = wb,
sheet = 1)
If you have multiple named tables they are all in tables. My named table is called Table1.
Next you to extract the column numbers and row numbers, which you will later use to extract the named table from the Excel file.
# get the range
table_range <- names(tables[tables == "Table1"])
table_range_refs <- strsplit(table_range, ":")[[1]]
# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)
# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)
Now you re-read the Excel file with the cols and rows parameter.
# finally read it
my_df <- read.xlsx(xlsxFile = "name_excel_file.xlsx",
sheet = 1,
cols = table_range_col_num[1]:table_range_col_num[2],
rows = table_range_row_num[1]:table_range_row_num[2])
You end up with a data frame with only the content of your named table.
I used this a while ago. I found this code somewhere, but I don't know anymore from where.
This link is might be useful for you
https://stackoverflow.com/a/17709204/10235327
1. Install XLConnect package
2. Save a path to your file in a variable
3. Load workbook
4. Save your data to df
To get table names you can use function
getTables(wb,sheet=1,simplify=T)
Where:
wb - your workbook
sheet - sheet name or might be the number as well
simplify = TRUE (default) the result is simplified to a vector
https://rdrr.io/cran/XLConnect/man/getTables-methods.html
Here's the code (not mine, copied from the topic above, just a bit modified)
require(XLConnect)
sampleFile = "C:/Users/your.name/Documents/test.xlsx"
wb = loadWorkbook(sampleFile)
myTable <- getTables(wb,sheet=1)
df<-readTable(wb, sheet = 1, table = myTable)
You can check next packages:
library(xlsx)
Data <- read.xlsx('YourFile.xlsx',sheet=1)
library(readxl)
Data <- read_excel('YourFile.xlsx',sheet=1)
Both options allow you to define specific regions to load the data into R.
I use read.xlsx from package openxlsx. For example:
library(openxlsx)
fileA <- paste0(some.directory,'excel.file.xlsx')
A <- read.xlsx(fileA, startRow = 3)
hope it helps
So I have a bunch of excel files I want to loop through and read specific, discontinuous columns into a data frame. Using the readxl works for the basic stuff like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip=3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely) but the problem is my excel files have about 114 columns and I only want specific ones. Also I do not want to allow R to guess the col_types because it messes some of them up (eg for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric, and crashes). So my question is: How do I specify specific, discontinuous columns to read? The range argument uses the cell_ranger package which does not allow for reading discontinuous columns. So any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices. Parameter colTypes deals with column types. This way it works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
I want to export a data frame to a xls file. I found the library XLConnect, which is easy to use, but I would not mind trying something else.
The problem I have is in the output (xls file). Since my data frame contains strings like 'A_B', the output contains the character ' before each integer. This is a problem for me because I need OpenOffice to detect column with integers to be able to apply formulas later on.
Here is a small example without any problems, meaning each entry in the xls file is an integer.
library(XLConnect)
test = as.data.frame(matrix(c(0,0),nrow=1),stringsAsFactors = FALSE)
wb =loadWorkbook('file.xls',create=TRUE)
createSheet(wb, name = "test")
writeWorksheet(wb,test,'test',header=FALSE)
saveWorkbook(wb)
The problem occurs in the following example.
library(XLConnect)
test = as.data.frame(matrix(c(0,'A_B'),nrow=1),stringsAsFactors = FALSE)
wb =loadWorkbook('file.xls',create=TRUE)
createSheet(wb, name = "test")
writeWorksheet(wb,test,'test',header=FALSE)
saveWorkbook(wb)
In this last example, the first entry in the xls file is '0, which is not detected as an integer in OpenOffice.
Is there a way to remove the character ' from the output data ?
Thank you
In XLConnect, the column types of a data.frame determine the resulting cell types. E.g. character columns result in string/text cells while numeric columns result in numeric cells.
In your example, the expression c(0, 'A_B') results in a character vector (R automatically coerces your 0 to character) and as such the resulting cells written will be string/text cells (see str(test) to investigate your data.frame). As such, you should be fine by simply constructing your data.frame differently, e.g. by doing the following:
test = data.frame(NumericColumn = 0, TextColumn = "A_B", stringsAsFactors = FALSE)
str(test)
Looking at the output of str(test) you should now see that "NumericColumn" is a numeric column and "TextColumn" is a character column. When writing that data.frame with XLConnect, you will see that the resulting cells will have the appropriate cell type.
I am trying to write binary data to a csv file for further reading this file with 'read.csv2', 'read.table' or 'fread' to get a dataframe. The script is as follows:
library(iotools)
library(data.table)
#make a dataframe
n<-data.frame(x=1:100000,y=rnorm(1:100000),z=rnorm(1:100000),w=c("1dfsfsfsf"))
#file name variable
file_output<-"test.csv"
#check the existence of the file -> if true -> to remove it
if (file.exists(file_output)) file.remove(file_output)
#create a file
file(file_output, ifelse(FALSE, "ab", "wb"))
#to make a file object
zz <- file(file_output, "wb")
#to make a binary vector with column names
rnames<-as.output(rbind(colnames(n),""),sep=";",nsep="\t")
#to make a binary vector with dataframe
r = as.output(n, sep = ";",nsep="\t")
#write column names to the file
writeBin(rnames, zz)
#write data to the file
writeBin(r, zz)
#close file object
close(zz)
#test readings
check<-read.table(file_output,header = TRUE,sep=";",dec=".",stringsAsFactors = FALSE
,blank.lines.skip=T)
str(check)
class(check)
check<-fread(file_output,dec=".",data.table = FALSE,stringsAsFactors = FALSE)
str(check)
class(check)
check<-read.csv2(file_output,dec=".")
str(check)
class(check)
The output from the file is attached:
My questions are:
how to remove the blank line from the file without downloading to R?
It has been made on purpose to paste a binary vector of colnames as a dataframe. Otherwise colnames were written as one-column vector. Maybe it is possible to remove a blank line before 'writeBin()'?
How make the file to be written all numeric values as numeric but not as a character?
I use the binary data transfer on purpose because it is much faster then 'write.csv2'. For instance, if you apply
system.time(write.table.raw(n,"test.csv",sep=";",col.names=TRUE))
the time elapsed will be ~4 times as less as 'write.table' used.
I could not comment on your question because of my reputation but I hope it helps you.
Two things come in my mind
Using the fill in read.table in which, if TRUE then in that case the rows have unequal length, blank fields are implicitly added. (do ??read.table)
You have mentioned blank.lines.skip=TRUE. If TRUE blank lines in the input are ignored.