I want to export a data frame to a xls file. I found the library XLConnect, which is easy to use, but I would not mind trying something else.
The problem I have is in the output (xls file). Since my data frame contains strings like 'A_B', the output contains the character ' before each integer. This is a problem for me because I need OpenOffice to detect column with integers to be able to apply formulas later on.
Here is a small example without any problems, meaning each entry in the xls file is an integer.
library(XLConnect)
test = as.data.frame(matrix(c(0,0),nrow=1),stringsAsFactors = FALSE)
wb =loadWorkbook('file.xls',create=TRUE)
createSheet(wb, name = "test")
writeWorksheet(wb,test,'test',header=FALSE)
saveWorkbook(wb)
The problem occurs in the following example.
library(XLConnect)
test = as.data.frame(matrix(c(0,'A_B'),nrow=1),stringsAsFactors = FALSE)
wb =loadWorkbook('file.xls',create=TRUE)
createSheet(wb, name = "test")
writeWorksheet(wb,test,'test',header=FALSE)
saveWorkbook(wb)
In this last example, the first entry in the xls file is '0, which is not detected as an integer in OpenOffice.
Is there a way to remove the character ' from the output data ?
Thank you
In XLConnect, the column types of a data.frame determine the resulting cell types. E.g. character columns result in string/text cells while numeric columns result in numeric cells.
In your example, the expression c(0, 'A_B') results in a character vector (R automatically coerces your 0 to character) and as such the resulting cells written will be string/text cells (see str(test) to investigate your data.frame). As such, you should be fine by simply constructing your data.frame differently, e.g. by doing the following:
test = data.frame(NumericColumn = 0, TextColumn = "A_B", stringsAsFactors = FALSE)
str(test)
Looking at the output of str(test) you should now see that "NumericColumn" is a numeric column and "TextColumn" is a character column. When writing that data.frame with XLConnect, you will see that the resulting cells will have the appropriate cell type.
Related
So I have a bunch of excel files I want to loop through and read specific, discontinuous columns into a data frame. Using the readxl works for the basic stuff like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip=3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely) but the problem is my excel files have about 114 columns and I only want specific ones. Also I do not want to allow R to guess the col_types because it messes some of them up (eg for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric, and crashes). So my question is: How do I specify specific, discontinuous columns to read? The range argument uses the cell_ranger package which does not allow for reading discontinuous columns. So any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices. Parameter colTypes deals with column types. This way it works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
So I have a bunch of excel files I want to loop through and read specific, discontinuous columns into a data frame. Using the readxl works for the basic stuff like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip=3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely) but the problem is my excel files have about 114 columns and I only want specific ones. Also I do not want to allow R to guess the col_types because it messes some of them up (eg for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric, and crashes). So my question is: How do I specify specific, discontinuous columns to read? The range argument uses the cell_ranger package which does not allow for reading discontinuous columns. So any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices. Parameter colTypes deals with column types. This way it works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
My application is reading the xls and xlsx files using the read_excel function of the readxl package.
The sequence and the exact number of columns are not known earlier while reading the xls or xlsx file. There are 15 predefined columns out of which 10 columns are mandatory and remaining 5 columns are optional. So the file will always have minimum 10 columns and at maximum 15 columns.
I need to specify the the col-types to the mandatory 10 columns. The only way I can think of is using the column names to specify the col_types as I know for fact that the file has all 10 columns which are mandatory but they are in the random sequence.
I tried looking out for the way of doing so but failed to do so.
Can anyone help me find a way to assign the col_types by column names?
I solve the problem by below workaround. It is not the best way to solve this problem though. I have read the excel file twice which will take a hit on performance if the file has very large volume of data.
First read: Building column data type vector- Reading the file for retrieving the columns information(like column names, number of columns and it's types) and building the column_data_types vector which will have the datatype for every column in the file.
#reading .xlsx file
site_data_columns <- read_excel(paste(File$datapath, ".xlsx", sep = ""))
site_data_column_names <- colnames(site_data_columns)
for(i in 1 : length(site_data_column_names)){
#where date is a column name
if(site_data_column_names[i] == "date"){
column_data_types[i] <- "date"
#where result is a column name
} else if (site_data_column_names[i] == "result") {
column_data_types[i] <- "numeric"
} else{
column_data_types[i] <- "text"
}
}
Second read: Reading the file content- reading the excel file by supplying col_types parameter with the vector column_data_types which has the column data types.
#reading .xlsx file
site_data <- read_excel(paste(File$datapath, ".xlsx", sep = ""), col_types = column_data_types)
I am trying to write binary data to a csv file for further reading this file with 'read.csv2', 'read.table' or 'fread' to get a dataframe. The script is as follows:
library(iotools)
library(data.table)
#make a dataframe
n<-data.frame(x=1:100000,y=rnorm(1:100000),z=rnorm(1:100000),w=c("1dfsfsfsf"))
#file name variable
file_output<-"test.csv"
#check the existence of the file -> if true -> to remove it
if (file.exists(file_output)) file.remove(file_output)
#create a file
file(file_output, ifelse(FALSE, "ab", "wb"))
#to make a file object
zz <- file(file_output, "wb")
#to make a binary vector with column names
rnames<-as.output(rbind(colnames(n),""),sep=";",nsep="\t")
#to make a binary vector with dataframe
r = as.output(n, sep = ";",nsep="\t")
#write column names to the file
writeBin(rnames, zz)
#write data to the file
writeBin(r, zz)
#close file object
close(zz)
#test readings
check<-read.table(file_output,header = TRUE,sep=";",dec=".",stringsAsFactors = FALSE
,blank.lines.skip=T)
str(check)
class(check)
check<-fread(file_output,dec=".",data.table = FALSE,stringsAsFactors = FALSE)
str(check)
class(check)
check<-read.csv2(file_output,dec=".")
str(check)
class(check)
The output from the file is attached:
My questions are:
how to remove the blank line from the file without downloading to R?
It has been made on purpose to paste a binary vector of colnames as a dataframe. Otherwise colnames were written as one-column vector. Maybe it is possible to remove a blank line before 'writeBin()'?
How make the file to be written all numeric values as numeric but not as a character?
I use the binary data transfer on purpose because it is much faster then 'write.csv2'. For instance, if you apply
system.time(write.table.raw(n,"test.csv",sep=";",col.names=TRUE))
the time elapsed will be ~4 times as less as 'write.table' used.
I could not comment on your question because of my reputation but I hope it helps you.
Two things come in my mind
Using the fill in read.table in which, if TRUE then in that case the rows have unequal length, blank fields are implicitly added. (do ??read.table)
You have mentioned blank.lines.skip=TRUE. If TRUE blank lines in the input are ignored.
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) as well as a global option offer you to say stringsAsFactors=FALSE which should fix this.
By default, read.csv checks the first few rows of your data to see whether to treat each variable as numeric. If it finds non-numeric values, it assumes the variable is character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=F)
Then read the file as follows:
my.tab <- read.table( "filename.csv", as.is=T )
When importing csv data files the import command should reflect both the data seperation between each column (;) and the float-number seperator for your numeric values (for numerical variable = 2,5 this would be ",").
The command for importing a csv, therefore, has to be a bit more comprehensive with more commands:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric") # specific columns to numeric
Note that if a variable can't be converted to numeric then it will be converted to factor as default which makes it more difficult to convert to number. Therefore, it can be advisable just to read all variables in as 'character' colClasses = "character" and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it is being caused due to the fact that my csv file was using a comma separator (,) in all numeric columns (Ex: 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
for me the solution was to include skip = 0
(number of rows to skip at the top of the file. Can be set >0)
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)