R readr function for previewing csv

I am looking for a function or workaround in readr or base R to "preview" the column types that read_csv will guess, before actually importing the data.
I am working with several files of about 60 MB, each with 51 columns and 160k rows, so this would make it much easier to build the col_types specification for read_csv.
My apologies if this sounds like an obvious question. I found no answers in the forum to this specific issue and have only recently started using dplyr. Thanks.

I went into the readr source and did some surgery on the read_csv code so that it runs only up to the point where the spec is guessed.
getReaderSpec <- function(file, col_names = TRUE, col_types = NULL,
                          locale = default_locale(), na = c("", "NA"),
                          quoted_na = TRUE, quote = "\"", comment = "",
                          trim_ws = TRUE, skip = 0, n_max = Inf,
                          guess_max = min(1000, n_max),
                          progress = show_progress(),
                          skip_empty_rows = TRUE) {
  tokenizer <- readr:::tokenizer_csv(na = na, quoted_na = quoted_na,
                                     quote = quote, comment = comment,
                                     trim_ws = trim_ws,
                                     skip_empty_rows = skip_empty_rows)
  name <- readr:::source_name(file)
  file <- readr:::standardise_path(file)
  if (readr:::is.connection(file)) {
    data <- readr:::datasource_connection(file, skip, skip_empty_rows, comment)
    if (readr:::empty_file(data[[1]])) {
      return(tibble::tibble())
    }
  } else {
    if (!isTRUE(grepl("\n", file)[[1]]) && readr:::empty_file(file)) {
      return(tibble::tibble())
    }
    if (is.character(file) && identical(locale$encoding, "UTF-8")) {
      data <- enc2utf8(file)
    } else {
      data <- file
    }
  }
  # Guess the column spec exactly as read_csv() would, print it,
  # and return it invisibly so it can be reused as col_types
  spec <- readr:::col_spec_standardise(data, skip = skip,
                                       skip_empty_rows = skip_empty_rows,
                                       comment = comment,
                                       guess_max = guess_max,
                                       col_names = col_names,
                                       col_types = col_types,
                                       tokenizer = tokenizer,
                                       locale = locale)
  readr:::show_cols_spec(spec)
  invisible(spec)
}
myspec <- getReaderSpec("someexample.csv")
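For what it's worth, recent readr versions (1.0.0 and later, if I remember the version right) already export a helper that does this without any surgery: spec_csv() guesses and returns the column specification without importing the full data. A minimal sketch:
library(readr)
# Guess the spec the same way read_csv() would, using only the
# first guess_max rows of the file
myspec <- spec_csv("someexample.csv", guess_max = 1000)
myspec
The result can be edited and then passed straight back to read_csv() via its col_types argument.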

Related

Why R reads CSV files differently

I am using
myCounts<-read.csv("myCounts.csv", header = TRUE, row.names = 1, sep = ",")
and
Book4 <- read_delim("Book4.csv", delim = ";",
                    escape_double = FALSE, trim_ws = TRUE)
to read two csv files. But read.csv and read_delim are parsing them differently.
Could you please explain how to read in the Book4 data with the same structure as the myCounts data?
I tried the following, and it works:
df<-read.delim("~/Documents/sample.csv" ,sep = ";",row.names = 1)
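If you prefer to stay in readr, here is a minimal sketch of the same fix, assuming (as row.names = 1 does for myCounts) that the first column of Book4.csv holds the row names. Tibbles do not carry row names, hence the conversion:
library(readr)
library(tibble)
Book4 <- read_delim("Book4.csv", delim = ";",
                    escape_double = FALSE, trim_ws = TRUE)
# Convert to a data.frame and promote the first column to row
# names, mirroring the row.names = 1 behaviour of read.csv
Book4 <- column_to_rownames(as.data.frame(Book4), var = names(Book4)[1])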

How to "fread" a list of files with a different number of columns?

Usually I have to read lists of files which all have the same format and the same number of columns.
My function looks like :
fun.read <- function(files) {
  read <- function(filename) {
    DT <- data.table::fread(filename, header = FALSE, sep = ";",
                            select = 1:7, col.names = c(...))
  }
  lst <- lapply(files, read)
}
It works fine.
But now I have to do the same when my files don't all have the same number of columns.
The way I do it is, for example, something like:
fun.read <- function(files) {
  read <- function(filename) {
    if (max(count.fields(filename, sep = ";")) == 7) {
      DT <- data.table::fread(filename, header = FALSE, sep = ";",
                              select = 1:7, col.names = c(...))
    } else if (max(count.fields(filename, sep = ";")) == 8) {
      DT <- data.table::fread(filename, header = FALSE, sep = ";",
                              select = 1:8, col.names = c(...))
    }
  }
  lst <- lapply(files, read)
}
It seems to work fine too, but I'm wondering whether there is a more efficient or elegant way to do this.
I looked towards the fill = TRUE option, without success...
Many thanks!
In case it could be helpful to someone, here is how I lightened my script: by including the if inside col.names():
x <- max(count.fields(filename, sep = ";"))
DT <- data.table::fread(filename, header = FALSE, sep = ";",
                        col.names = c("...", "...", "...", if (x > 3) "..."))
Thanks.
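To drop the if chain entirely, one could derive the width from count.fields and subset a single name vector. A sketch, assuming you know the names for the widest possible layout; read_flex and all_names are illustrative names, not from the question:
read_flex <- function(filename, all_names) {
  n <- max(count.fields(filename, sep = ";"))  # widest row in this file
  data.table::fread(filename, header = FALSE, sep = ";",
                    select = seq_len(n),
                    col.names = all_names[seq_len(n)])
}
lst <- lapply(files, read_flex, all_names = paste0("col", 1:8))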

R asks for a list which seems to be a list according to is.list (=TRUE)

I am using the RAM package.
The function I use is very simple; it computes diversity indices and appends the outputs as columns to my metadata:
outname <-OTU.diversity(data=OTUtables, meta=metatables)
(Arguments: data, a list of OTU tables; meta, the metadata to append the outputs to.)
I am looping it but I get this error:
please provide otu tables as list; see ?RAM.input.formatting
So I go to that help menu and read this:
one data set:
data=list(data=otu)
multiple data sets:
data=list(data1=otu1, data2=otu2, data3=otu3)
here is my code:
i <- 1
for (i in 1:nrow(metadataMasterTax)) {
  temp <- read.table(paste(metadataMasterTax$DataAnFilePath[i], metadataMasterTax$meta[i], sep = ""),
                     sep = "\t", header = TRUE, dec = ".", comment.char = "",
                     quote = "", stringsAsFactors = TRUE, as.is = TRUE)
  temp2 <- temp
  temp2$row.names <- NULL  # to drop the row numbers generated in the margin
  trans <- read.table(paste(metadataMasterTax$taxPath[i], metadataMasterTax$taxName[i], sep = ""),
                      sep = "\t", header = TRUE, dec = ".", comment.char = "",
                      quote = "", stringsAsFactors = TRUE, as.is = TRUE,
                      check.names = FALSE)
  trans2 <- trans
  trans2$row.names <- NULL  # to drop the row numbers generated in the margin
  data = list(data = trans2[i])
  temp2[i] <- OTU.diversity(data = trans2[i], meta = temp2[i])
  # Error in OTU.diversity(trans2, temp2) :
  #   please provide otu tables as list; see ?RAM.input.formatting
  # is.list(trans2)
  # [1] TRUE
  # is.list(data)
  # [1] TRUE
  temp$taxonomy <- temp2$taxonomy
  write.table(temp,
              file = paste(pathDataAn, "diversityDir/", metadataMasterTax$ShortName[i], ".meta.div.tsv", sep = ""),
              append = FALSE,
              sep = "\t",
              row.names = FALSE)
}
Can anyone help me, please? Thanks a lot.
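Part of the confusion is that a data.frame is itself a list (of its columns), so is.list() returning TRUE does not mean the data is in the format RAM expects, namely a list whose elements are OTU tables. A minimal illustration with toy data, not from the question:
df <- data.frame(a = 1:3)
is.list(df)          # TRUE: every data.frame is a list of its columns
# What RAM asks for is a list containing the table(s):
otus <- list(data = df)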
Because the main problem appears to be getting the OTU.diversity function to work, I focus on this issue. The code snippet below runs OTU.diversity without any problems, using the Google sheets data provided by OP.
library(gsheet)
library(RAM)
for (i in 1:2) {
  # Meta data
  temp <- as.data.frame(gsheet2tbl("https://drive.google.com/open?id=1hF47MbYZ1MG6RzGW-fF6tbMT3z4AxbGN5sAOxL4E8xM"))
  temp$row.names <- NULL
  # OTU
  trans <- as.data.frame(gsheet2tbl("https://drive.google.com/open?id=1gOaEjDcs58T8v1GA-OKhnUsyRDU8Jxt2lQZuPWo6XWU"))
  trans$row.names <- NULL
  rownames(temp) <- colnames(trans)[-ncol(trans)]
  temp2 <- OTU.diversity(data = list(data = trans), meta = temp)
  write.table(temp2,
              file = paste0("file", i, ".meta.div.tsv"),  # replace
              append = FALSE,
              sep = "\t",
              row.names = FALSE)
}
Replace for (i in 1:2) with for(i in 1:nrow(metadataMasterTax)), as.data.frame(gsheet2tbl(...)) with read.table(...), and the file argument in write.table with the appropriate string.

Error in reading a CSV file with read.table()

I am encountering an issue while loading a CSV data set in R. The data set can be taken from
https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53
I imported the data using read.csv as below and the dataset was imported correctly.
EmpSal <- read.csv('E:/Data/EmpSalaries.csv')
I tried reading the data using read.table, and there were a lot of anomalies in the resulting dataset.
EmpSal1 <- read.table('E:/Data/EmpSalaries.csv', sep = ',', header = TRUE, fill = TRUE)
The above code started reading the data from the 7th row, and although the dataset actually contains ~14k rows, only 5k rows were imported. In a few cases, 15-20 rows were combined into a single row, with the entire row's data appearing in a single column.
I can work with the dataset using read.csv, but I am curious to know why it didn't work with read.table.
read.csv is defined as:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
          fill = TRUE, comment.char = "", ...)
    read.table(file = file, header = header, sep = sep, quote = quote,
               dec = dec, fill = fill, comment.char = comment.char, ...)
You need to add quote="\"": read.table's default quote includes both double and single quotes ("\"'"), so apostrophes in the data open quoted strings that never close, whereas read.csv honours only double quotes.
EmpSal <- read.csv('Baltimore_City_Employee_Salaries_FY2015.csv')
EmpSal1 <- read.table('Baltimore_City_Employee_Salaries_FY2015.csv', sep=',', header = TRUE, fill = TRUE, quote="\"")
identical(EmpSal, EmpSal1)
# TRUE
As you mentioned, your data is imported successfully using read.csv() without specifying the quote argument.
The default value of the quote argument for read.csv is "\"", while for read.table it is "\"'" (double and single quotes).
Compare the two default signatures:
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#", allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
There are many single quotation marks (apostrophes) in this particular data, and that is the reason why read.table isn't working for you.
Try the following and it will work:
r <- read.table('/home/workspace/Downloads/Baltimore_City_Employee_Salaries_FY2015.csv',
                sep = ",", quote = "\"", header = TRUE, fill = TRUE)
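A tiny reproduction with made-up data: a lone apostrophe opens a quoted string that read.table never sees closed, while read.csv, which only honours double quotes, is unaffected.
txt <- "name,notes\nA,'tis fine\nB,ok"
read.csv(text = txt)                              # 2 rows, apostrophe kept as text
read.table(text = txt, sep = ",", header = TRUE)  # quote never closes: 1 mangled row
read.table(text = txt, sep = ",", header = TRUE, quote = "\"")  # parses like read.csv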

Simplify R code to import big data as character

I am currently using the code below very often to import a big dataset into R, forcing it to treat every column as character in order to avoid truncation of rows. The code seems to work well, but I was wondering whether any of you knows how it could be simplified or improved, so it doesn't get so repetitive each time I need to do it.
library(readr)
library(stringr)
dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
dataset <- read_delim(dataset.path, delim = '\t', col_names = TRUE, col_types = data.coltypes)
As @Roland has suggested, you should write a function. Here is one possibility:
foo <- function() {
  require(readr)
  dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
  data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
  data.coltypes <- paste(rep("c", ncol(data.columns)), collapse = "")
  dataset <- read_delim(dataset.path, delim = '\t', col_names = TRUE,
                        col_types = data.coltypes)
}
You can then just call foo() whenever you need to read a dataset in using this method.
Your two-liner:
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
can be collapsed into a single line using base R's paste instead of str_c from the stringr package.
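Note also that readr versions from 1.0.0 onward (if I remember the version right) let you skip the preliminary n_max = 0 header read entirely by giving a default column type:
library(readr)
dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
# .default makes every column fall back to character, so there is
# no need to count the columns first
dataset <- read_delim(dataset.path, delim = '\t', col_names = TRUE,
                      col_types = cols(.default = "c"))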
