Warning message in R when using colClasses when reading csv files

I am using lapply to read a list of files. The files have multiple rows and columns, and I am interested in the value in the first row of the first column. The code I am using is:
lapply(file_list, read.csv, sep = ',', header = F, col.names = F, nrow = 1, colClasses = c('character', 'NULL', 'NULL'))
The first row has three columns but I am only reading the first one. From other posts on Stack Overflow I found that the way to do this is to use colClasses = c('character', 'NULL', 'NULL'). While this approach works, I would like to understand the underlying issue that causes the following warning message and, ideally, prevent it from popping up:
"In read.table(file = file, header = header, sep = sep, quote = quote, :
cols = 1 != length(data) = 3"

It's there to let you know that you're only keeping one of the three columns of data, because it doesn't know how to handle a colClasses entry of "NULL". Note that your NULL is in quotation marks: it is the character string "NULL", not the NULL object.
An example:
write.csv(data.frame(fi = letters[1:3],
                     fy = rnorm(3, 500, 1),
                     fo = rnorm(3, 50, 2)),
          file = "a.csv", row.names = FALSE)
write.csv(data.frame(fib = letters[2:4],
                     fyb = rnorm(3, 5, 1),
                     fob = rnorm(3, 50, 2)),
          file = "b.csv", row.names = FALSE)
file_list <- list("a.csv", "b.csv")
lapply(file_list, read.csv, sep = ',', header = F, col.names = F, nrow = 1, colClasses = c('character', 'NULL', 'NULL'))
Which results in:
[[1]]
FALSE.
1 fi
[[2]]
FALSE.
1 fib
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
cols = 1 != length(data) = 3
Which is the same as if you used:
lapply(file_list, read.csv, sep = ',', header = F, col.names = F,
       nrow = 1, colClasses = c('character', 'asdasd', 'asdasd'))
But the warning goes away (and you get the rest of the row as a result) if you do:
lapply(file_list, read.csv, sep = ',', header = F, col.names = F,
       nrow = 1, colClasses = c('character', NULL, NULL))
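Worth spelling out why the unquoted version behaves differently: c() silently drops NULL elements, so that colClasses vector collapses to a single "character", which read.csv then recycles across all three columns.
c('character', NULL, NULL)
[1] "character"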
You can see where a function's errors and warnings come from by looking at its source code: enter, for example, read.table with nothing following it (no parentheses) to print the function body, then search for the text of your particular warning within it.
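A quick way to do that search from the console (a sketch; deparse just turns the function body into a character vector of source lines):
src <- deparse(utils::read.table)
# searching for "cols" locates the warning() call that produced the message above
grep("cols", src, value = TRUE)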

Related

R - Loop through directory throws error but I do not know where (try and catch)

I have a loop which is supposed to read all files that match the provided regex.
However, some files apparently don't have the correct number of columns in every row, so the loop crashes.
I now want to find out which files cause these errors. There are hundreds of files, but only a few cause the error.
Coming from Java, I would wrap the read in a try-catch, print the names of the offending files, then inspect and erase/change them. I can't work out how to do that in R, though:
# PATH WITH ALL FILES
files <- list.files(path = "/Users/Test/Trackingpoint",
                    pattern = "Trackingpoint.*\\.csv\\.gz",
                    full.names = TRUE, recursive = FALSE)
Trackingpoint_Tables <-
  tryCatch({
    lapply(files, function(x) {
      a <- read.table(gzfile(x), sep = "\t", header = TRUE)
    })
  }, warning = function(w) {
    print(w)
  }, error = function(e) {
    print(e)
  })
As you can see, what I have in w and e is the condition object, not the file. How can I print the file's name, and for that matter any other information about the file?
I want my code to ignore the errors and just proceed, but to tell me where each error occurs (i.e. in which file).
Right now, it only says:
<simpleError in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, flush = flush, encoding = encoding, skipNul = skipNul): line 24610 did not have 44 elements>
A simple change from read.table to read.csv together with fill = TRUE was sufficient.
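For anyone who still wants the file names, here is a minimal sketch of the per-file tryCatch idea (the message format and the NULL placeholder are my own choices, not from the original code):
Trackingpoint_Tables <- lapply(files, function(x) {
  tryCatch(
    read.csv(gzfile(x), sep = "\t", header = TRUE, fill = TRUE),
    error = function(e) {
      # x is still in scope here, so we can report which file failed
      message("Failed on: ", x, " (", conditionMessage(e), ")")
      NULL  # failed files become NULL entries instead of stopping the loop
    })
})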

Skip different number of rows in import

I'm importing a lot of datasets. All of them have some empty lines at the top (before the header); however, it's not always the same number of rows I need to skip.
Right now I'm using:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE, skip = 9)
But sometimes I only need to skip 3 lines, for example.
Can I somehow set up a rule that skips a line whenever my column B (in Excel) contains one of the following words at the start of the line:
Datastatistik
Overførte records
FI-CA
Oprettet
Column A is always empty, but I delete it in code after the import.
This is an example of my data (I have hidden personal numbers):
My first variable header is called "Bilagsnummer" or "Bilagsnr.".
I don't know if it's possible to set up a rule that says something like: the first occurrence of this word is my header? Really, I'm just brainstorming here, because I have no idea how to automate this data import.
---EDIT---
I looked at the post @Bram linked to, and it solved part of my problem.
I changed some of it.
This is the code I used:
temp <- readLines("file.xls")
skipline <- which(grepl("\tDatastatistik", temp) |
                  grepl("\tOverførte", temp) |
                  grepl("FI-CA", temp) |
                  grepl("Oprettet", temp) |
                  temp == "")
So the skipline integer vector I made contains the lines that need to be skipped. These are correctly identified by the grepl calls (since the wording at the end of the line changes from time to time).
Now I still have a problem, though.
When I use skip = skipline in my read.delim, it only works for the first row.
I get the warning message:
In if (skip > 0L) readLines(file, skip) :
the condition has length > 1 and only the first element will be used
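That warning is R telling you that skip expects a single number, not a vector of row positions. One workaround (a sketch reusing the temp and skipline objects from the edit above) is to drop those lines yourself and parse what remains:
temp_clean <- temp[-skipline]
df2 <- read.delim(text = paste(temp_clean, collapse = "\n"))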
I may have found a solution, but not the optimal one. Let's see.
Import your df with the empty lines:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE)
Find the number of empty rows at the beginning:
NonNAindex <- which(!is.na(df2[, 2]))
lastEmpty <- min(NonNAindex) - 1
Re-import your document using that info:
df2 <- read_delim("filename.xls", "\t",
                  escape_double = FALSE,
                  guess_max = 10000,
                  locale = locale(encoding = "ISO-8859-1"),
                  na = "empty", trim_ws = TRUE, skip = lastEmpty)

Conflict between comment character and headers to import DF with read.table

How could I import a file:
starting with an undefined number of comment lines
followed by a line with headers, some of them containing the comment character which is used to identify the comment lines above?
For example, with a file like this:
# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8
Then:
myDF = read.table(myfile, sep=',', header=T)
Error in read.table(myfile, sep = ",", header = T) : more columns
than column names
The obvious problem is that # is used as the comment character to announce comment lines, but it also appears in the headers (which, admittedly, is bad practice, but I have no control over this).
The number of comment lines being unknown a priori, I can't even use the skip argument. Also, I don't know the column names (not even their number) before importing, so I'd really need to read them from the file.
Any solution beyond manually manipulating the file?
It may be easy enough to count the number of lines that start with a comment, and then skip them.
csvfile <- "# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8"
# return a logical for whether the line starts with a comment.
# remove everything from the first FALSE and afterward
# take the sum of what's left
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))
# skip the lines that start with the comment character
Data <- read.csv(textConnection(csvfile),
                 skip = start_comment,
                 stringsAsFactors = FALSE)
Note that this will work with read.csv because read.csv has comment.char = "" by default. If you must use read.table, or must have comment.char = "#", you may need a couple more steps.
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))
# Get the headers by themselves.
Head <- read.table(textConnection(csvfile),
                   skip = start_comment,
                   header = FALSE,
                   sep = ",",
                   comment.char = "",
                   nrows = 1)
Data <- read.table(textConnection(csvfile),
                   sep = ",",
                   header = FALSE,
                   skip = start_comment + 1,
                   stringsAsFactors = FALSE)
# apply column names to Data
names(Data) <- unlist(Head)

R - How to use data.table's fread and fwrite when data contains newlines?

I have problems saving a data.table object to a file when it has text columns that can contain newlines or other special characters.
For example:
x <- data.frame(
  x = c(1, 2, 3),
  y = c("Lorem ipsum dixit", 'the "brown"\nfox', "a|b"),
  stringsAsFactors = FALSE)
I write it to file with:
data.table::fwrite(
  x,
  "testfile.tsv",
  sep = "\t",
  na = "NULL",
  quote = TRUE,
  eol = "\n",
  append = FALSE,
  col.names = TRUE,
  row.names = FALSE,
  qmethod = "escape"
)
The output seems OK to me:
"x" "y"
1 "Lorem ipsum dixit"
2 "the \"brown\"
fox"
3 "a|b"
However, when I read the dataset back from the file with
data.table::fread("testfile.tsv")
I get error:
Error in data.table::fread("x.gitignore.tsv") :
Expected sep (' ') but new line,
EOF (or other non printing character) ends field 0
when detecting types from point 0: fox"
I have tried to read it back, explicitly stating the quoting character, with:
data.table::fread("testfile.tsv",
                  sep = "\t", header = TRUE, na.strings = "NULL",
                  quote = "\"")
but I still get the same error.
I also tried to use read.delim but it skips the first row of data for some reason:
read.delim("testfile.tsv", header = T, sep = "\t", na.strings = "NULL")
So how can I write and read back such data frames using data.table and fwrite?
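One thing worth trying, offered as an assumption rather than a documented guarantee: fread parses RFC-4180-style doubled quotes ("") rather than backslash escapes, so writing with qmethod = "double" (fwrite's default) instead of "escape" should let reasonably recent data.table versions read the embedded newline back:
library(data.table)
x <- data.frame(
  x = c(1, 2, 3),
  y = c("Lorem ipsum dixit", 'the "brown"\nfox', "a|b"),
  stringsAsFactors = FALSE)
# double embedded quotes ("" instead of \"), the style fread expects
fwrite(x, "testfile.tsv", sep = "\t", quote = TRUE, qmethod = "double")
y <- fread("testfile.tsv", sep = "\t")  # quoted fields may span lines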

R load csv files from folder

I am loading a bunch of csv files simultaneously from a local directory using the following code:
myfiles = do.call(rbind, lapply(files, function(x)
  read.table(x, stringsAsFactors = FALSE, header = F, fill = T,
             sep = ",", quote = NULL)))
and getting an error message:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I'm afraid that quotes cause this: inspecting the number of columns in each of the 4 files, I see that file number 3 contains 10 columns (incorrect) and the rest only 9 (correct). Looking into the corrupted file, it is definitely quotes causing an extra column split.
Any help appreciated.
Found the answer: the quote parameter should be set to quote = "\"".
myfiles = do.call(rbind, lapply(files, function(x)
  read.table(x, stringsAsFactors = FALSE, header = F, fill = T,
             sep = ",", quote = "\"")))
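A quick sanity check along the same lines (a sketch; it reads every file twice, which is fine for a handful of files) is to confirm that all files parse to the same number of columns before rbind-ing:
sapply(files, function(x)
  ncol(read.table(x, stringsAsFactors = FALSE, header = F, fill = T,
                  sep = ",", quote = "\"")))
# any outlier in the printed counts points at the corrupted file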
