I am trying to write binary data to a csv file so that I can later read the file back with 'read.csv2', 'read.table' or 'fread' to get a data frame. The script is as follows:
library(iotools)
library(data.table)
#make a dataframe
n <- data.frame(x = 1:100000, y = rnorm(100000), z = rnorm(100000), w = c("1dfsfsfsf"))
#file name variable
file_output<-"test.csv"
#check the existence of the file -> if true -> to remove it
if (file.exists(file_output)) file.remove(file_output)
#to make a file object
zz <- file(file_output, "wb")
#to make a binary vector with column names
rnames<-as.output(rbind(colnames(n),""),sep=";",nsep="\t")
#to make a binary vector with dataframe
r = as.output(n, sep = ";",nsep="\t")
#write column names to the file
writeBin(rnames, zz)
#write data to the file
writeBin(r, zz)
#close file object
close(zz)
#test readings
check <- read.table(file_output, header = TRUE, sep = ";", dec = ".",
                    stringsAsFactors = FALSE, blank.lines.skip = TRUE)
str(check)
class(check)
check<-fread(file_output,dec=".",data.table = FALSE,stringsAsFactors = FALSE)
str(check)
class(check)
check<-read.csv2(file_output,dec=".")
str(check)
class(check)
The written file contains a blank line between the header row and the data.
My questions are:
how to remove the blank line from the file without loading the file back into R?
The blank line was introduced on purpose, to paste the binary vector of column names as a data frame; otherwise the column names were written as a one-column vector. Maybe it is possible to remove the blank line before 'writeBin()'?
how to make the file store all numeric values as numeric rather than as character?
I use the binary data transfer on purpose because it is much faster than 'write.csv2'. For instance, if you run
system.time(write.table.raw(n, "test.csv", sep = ";", col.names = TRUE))
the elapsed time is roughly a quarter of what 'write.table' takes.
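One way to avoid the blank line in the first place (a sketch, untested, assuming the data rows are still written with as.output()/writeBin() exactly as above) is to build the header as a single raw text line yourself:
zz <- file(file_output, "wb")
# header as one newline-terminated text line - no blank row is written
writeBin(charToRaw(paste0(paste(colnames(n), collapse = ";"), "\n")), zz)
# data rows as before
writeBin(as.output(n, sep = ";", nsep = "\t"), zz)
close(zz)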
I could not comment on your question because of my reputation, but I hope this helps you.
Two things come to mind:
Use the fill argument of read.table: if TRUE, and the rows have unequal length, blank fields are implicitly added (see ?read.table).
You have already mentioned blank.lines.skip=TRUE: if TRUE, blank lines in the input are ignored.
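A minimal sketch combining both suggestions (using only arguments documented in ?read.table):
check <- read.table(file_output, header = TRUE, sep = ";", dec = ".",
                    fill = TRUE, blank.lines.skip = TRUE,
                    stringsAsFactors = FALSE)
str(check)  # once the blank line is ignored, the numeric columns should come in as numeric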
Related
So I have a bunch of Excel files I want to loop through and read specific, discontinuous columns into a data frame. Using readxl works for the basic stuff, like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip=3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely), but the problem is my Excel files have about 114 columns and I only want specific ones. Also, I do not want to let R guess the col_types, because it messes some of them up (e.g. for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric and crashes). So my question is: how do I specify specific, discontinuous columns to read? The range argument uses the cellranger package, which does not allow reading discontinuous columns. Any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices. Parameter colTypes deals with column types. This way it works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
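The colTypes parameter mentioned above can be passed alongside keep; a sketch (untested - check ?readWorksheet for how colTypes maps onto the kept columns):
readWorksheet(workbook, sheet = "Sheet0", keep = c(1, 2, 5),
              colTypes = c("numeric", "character", "character"))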
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
My application is reading the xls and xlsx files using the read_excel function of the readxl package.
The sequence and the exact number of columns are not known in advance when reading the xls or xlsx file. There are 15 predefined columns, of which 10 are mandatory and the remaining 5 optional. So the file will always have at least 10 columns and at most 15.
I need to specify the col_types for the 10 mandatory columns. The only way I can think of is using the column names to specify the col_types, since I know for a fact that the file has all 10 mandatory columns, just in a random order.
I tried to find a way of doing this but failed.
Can anyone help me find a way to assign the col_types by column name?
I solved the problem with the workaround below. It is not the best way to solve this problem, though: I read the Excel file twice, which will hurt performance if the file contains a very large volume of data.
First read: build the column data type vector. Read the file to retrieve the column information (column names, number of columns and their types) and build the column_data_types vector, which holds the data type for every column in the file.
#reading .xlsx file
site_data_columns <- read_excel(paste(File$datapath, ".xlsx", sep = ""))
site_data_column_names <- colnames(site_data_columns)
column_data_types <- character(length(site_data_column_names))  #initialise before indexing into it
for (i in seq_along(site_data_column_names)) {
  #where date is a column name
  if (site_data_column_names[i] == "date") {
    column_data_types[i] <- "date"
  #where result is a column name
  } else if (site_data_column_names[i] == "result") {
    column_data_types[i] <- "numeric"
  } else {
    column_data_types[i] <- "text"
  }
}
Second read: read the file content. Read the Excel file again, supplying the col_types parameter with the column_data_types vector built above.
#reading .xlsx file
site_data <- read_excel(paste(File$datapath, ".xlsx", sep = ""), col_types = column_data_types)
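A possible refinement (an untested sketch, assuming readxl's n_max argument): reading zero data rows on the first pass makes the double read much cheaper, since only the header row is parsed.
#first pass: header only, no data rows
path <- paste(File$datapath, ".xlsx", sep = "")
header_only <- read_excel(path, n_max = 0)
cols <- colnames(header_only)
#build the type vector from the column names
column_data_types <- ifelse(cols == "date", "date",
                            ifelse(cols == "result", "numeric", "text"))
#second pass: full read with the types applied
site_data <- read_excel(path, col_types = column_data_types)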
I have a folder with several hundred csv files. I want to use lapply to calculate the mean of one column within each csv file and save that value into a new csv file that would have two columns: column 1 would be the name of the original file, and column 2 would be the mean value for the chosen field from the original file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have (at least) two problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
##list the directory from which all files are to be read
directory <- "C:/mydir/"
##read all csv file names from the directory
x <- list.files(directory, pattern = "csv")
xpath <- paste0(directory, x)
##create an empty dataframe
df <- NULL
##for loop to read each file and collect the metric and file name
for (i in seq_along(xpath)) {
  file <- read.csv(xpath[i], header = TRUE, sep = ",")
  d <- data.frame(mean = mean(file[, 1]), filename = x[i])
  df <- rbind(df, d)
}
###write all output to csv
write.csv(df, file = "C:/mydir/final.csv", row.names = FALSE)
The output CSV looks like this:
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
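For reference, a minimal sketch of the lapply-style version the question asked for (assuming the first column of every csv is the numeric one):
filenames <- list.files(pattern = "\\.csv$")
#mean of the first column of each file
means <- vapply(filenames, function(f) mean(read.csv(f)[[1]], na.rm = TRUE), numeric(1))
result <- data.frame(filename = filenames, mean = means, row.names = NULL)
write.csv(result, file = "Expected_Value.csv", row.names = FALSE)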
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files that I had were originally in one file. I split them into multiple files by location. At the time, I thought this was necessary to calculate the mean for each type. Clearly, that was a mistake. I went back to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv file, then pasted the line from the console window into the source window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to, but factor, so when I try to plot the variables I only get histograms, not the usual scatter plot. When checking the data it seems to be in order; it's just that I'm unable to use it, since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you set stringsAsFactors=FALSE, which should fix this.
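For example:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", stringsAsFactors = FALSE)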
By default, read.csv checks the first few rows of your data to see whether to treat each variable as numeric. If it finds non-numeric values, it assumes the variable is character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
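(For reference, the method the R FAQ gives for converting a factor f to numeric is:)
as.numeric(levels(f))[as.integer(f)]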
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=F)
Then read the file as follows:
my.tab <- read.table("filename.csv", header = TRUE, sep = ",", as.is = TRUE)
When importing csv data files, the import command should reflect both the column separator (;) and the decimal separator for numeric values (for a numeric value of 2,5 this would be ",").
The command for importing a csv therefore has to be a bit more comprehensive, with more arguments:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric")) # specific columns to numeric
Note that if a variable can't be converted to numeric, it will be converted to factor by default, which makes it more difficult to convert to a number. Therefore, it can be advisable to read all variables in as character (colClasses = "character") and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it was caused by the fact that my csv file used a comma as the thousands separator in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
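One workaround without editing the file (a sketch, assuming the affected columns are PTS and MP as in the question): read the columns as character and strip the thousands separators before converting.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
#remove the thousands separators, then convert
stuckey$PTS <- as.numeric(gsub(",", "", stuckey$PTS))
stuckey$MP <- as.numeric(gsub(",", "", stuckey$MP))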
This only worked right for me when including strip.white = TRUE in the read.csv command.
For me the solution was to include skip = 0 (the number of rows to skip at the top of the file; can be set > 0):
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 0)