Say I have a file, test.csv, that looks like this:
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, I get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
In my case, the problem was only that the first ? rows of my file had a missing ID value.
So I was able to solve it by setting autostart far enough into the file that a nonmissing value turned up:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to automatically identify sep and sep2, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row on which to base the column names.
If indeed there are no nonmissing values for the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor.
I think you could use the skip/select/drop arguments of fread for this purpose.
fread("myfile.csv", sep = ",", header = FALSE, skip = "A")           # start reading at the first line containing "A"
fread("myfile.csv", sep = ",", header = FALSE, select = c(2,3,4,5))  # read only columns 2-5, i.e. everything except the 1st
fread("myfile.csv", sep = ",", header = FALSE, drop = 1)             # drop the first column
I've tried making that csv file and running the code. It seems to work now - same for other people? I thought it might be an issue with not having a newline at the end (hence the warning from read.csv), but fread copes fine whether there's a newline at the end or not.
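If you want to reproduce this yourself, a quick sketch (cat() writes no trailing newline, which is exactly what triggers the read.csv warning):
# write the one-line file without a trailing newline
cat(",a,b,c,d,e", file = "test.csv")
read.csv("test.csv", header = FALSE)           # works, with the incomplete-final-line warning
data.table::fread("test.csv", header = FALSE)  # works in current data.table versions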
Related
I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
library(magrittr)  # for the %>% pipe

read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes

df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
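For example, a sketch of the second option with read.csv (the column name "flux" here is hypothetical; substitute whichever columns fail for you):
col.classes <- sapply(read.csv(full_path_astro_data, nrows = 1000,
                               stringsAsFactors = FALSE), class)
col.classes["flux"] <- "numeric"  # hypothetical problem column; override the integer guess
df_astro_data <- read.csv(full_path_astro_data, colClasses = col.classes,
                          stringsAsFactors = FALSE)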
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))
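readr also accepts a compact string specification, one letter per column, if you prefer; for instance, assuming a hypothetical file whose columns are double, double, character:
read_csv(YourPathName, col_types = "ddc")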
I want to import csv files into R, with the first non-empty line supplying the names of the data frame columns. I know that you can supply the skip argument to specify how many lines to skip before reading data. However, the row number of the first non-empty line can change between files.
How do I work out how many lines are empty, and dynamically skip them for each file?
As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:
,,,
w,x,y,z
a,b,5,c
a,b,5,c
a,b,5,c
a,b,4,c
a,b,4,c
a,b,4,c
which means there are rows of commas at the start.
read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
After writing the above, the poster explained that the blank lines are not actually blank but contain commas with nothing between them. In that case use fread from the data.table package, which will handle that. The skip= argument can be set to any character string found in the header:
library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)
The last line can be omitted if a data.table is ok as the returned value.
Depending on your file size, this may not be the best solution, but it will do the job.
The strategy here is to read the file as lines rather than as delimited fields, count the characters on each line, and store the counts in temp.
A while loop then searches for the first line with a nonzero character length, after which the file is read from that point and stored as data_filename.
flist <- list.files()
for (onefile in flist) {
  temp <- nchar(readLines(onefile))  # character count of each line
  i <- 1
  while (temp[i] == 0) {             # advance past the empty lines
    i <- i + 1
  }
  temp <- read.table(onefile, sep = ",", skip = (i - 1))
  assign(paste0("data_", onefile), temp)
}
If the file contains headers, you can start i from 2 (or simply pass header = TRUE to read.table).
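A more compact variant of the same idea, replacing the while loop with which() (a sketch under the same assumptions):
first_nonempty <- which(nchar(readLines(onefile)) > 0)[1]  # index of the first non-empty line
temp <- read.table(onefile, sep = ",", skip = first_nonempty - 1)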
If the first few lines are truly empty, then read.csv will skip them automatically and start at the first non-empty line. If they have commas but no values, then you can use:
df = read.csv(file = 'd.csv')
df = read.csv(file = 'd.csv',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
It's not efficient if you have large files (since you have to import twice), but it works.
If you want to import a tab-delimited file with the same problem (variable blank lines) then use:
df = read.table(file = 'd.txt',sep='\t')
df = read.table(file = 'd.txt',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
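To avoid importing a large file twice, you could instead peek at the first lines with readLines and compute the skip directly (a sketch, assuming the blank rows appear within the first 100 lines):
lines <- readLines('d.csv', n = 100)                  # peek at the start of the file
skip_n <- which(sub(",.*", "", lines) != "")[1] - 1   # first line with a non-empty first field
df <- read.csv('d.csv', skip = skip_n)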
I have a CSV file, and when I try to read.csv() it, I get this warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions.
This is the Dropbox link for the data: https://www.dropbox.com/s/h0fp0hmnjaca9ff/PING%20CONCOURS%20DONNES.csv
As explained by Hendrik Pon, the message indicates that the last line of the file doesn't end with an end-of-line (EOL) character (a linefeed (\n) or a carriage return + linefeed (\r\n)).
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return/enter
Save the file
So here is your file, read in without the warning:
df=read.table("C:\\Users\\Administrator\\Desktop\\tp.csv",header=F,sep=";")
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Date 20/12/2013 09:04 20/12/2013 09:08 20/12/2013 09:12 20/12/2013 09:16 20/12/2013 09:20 20/12/2013 09:24 20/12/2013 09:28 20/12/2013 09:32 20/12/2013 09:36
2 1 1,3631 1,3632 1,3634 1,3633 1,363 1,3632 1,3632 1,3632 1,3629
3 2 0,83407 0,83408 0,83415 0,83416 0,83404 0,83386 0,83407 0,83438 0,83472
4 3 142,35 142,38 142,41 142,4 142,41 142,42 142,39 142,42 142,4
5 4 1,2263 1,22635 1,22628 1,22618 1,22614 1,22609 1,22624 1,22643 1,2265
But I think you should not read it in this way, because you will then have to reshape the data frame. Thanks.
I faced the same problem while creating a data matrix in Notepad.
So I went to the last row of the data matrix and pressed Enter. Now I have an n-line data matrix and a new blank line with the cursor at the start of line n+1.
Problem solved.
This is not really a CSV file: each line is a column. You can parse it manually, e.g.:
file <- '~/Downloads/PING CONCOURS DONNES.csv'
lines <- readLines(file)
columns <- strsplit(lines, ';')      # split each line on the ";" separator
headers <- sapply(columns, '[[', 1)  # the first field of each line is the column name
data <- lapply(columns, '[', -1)     # the remaining fields are the values
df <- do.call(cbind, data)           # bind the columns into a character matrix
colnames(df) <- headers
print(head(df))
Note that you can ignore the warning, which is due to the missing final end-of-line.
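If you then want proper numeric columns (the file uses a comma as the decimal separator), a possible follow-up sketch:
df <- as.data.frame(df, stringsAsFactors = FALSE)
df[] <- lapply(df, type.convert, as.is = TRUE, dec = ",")  # turn "1,3631" and friends into numerics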
I had the same problem with .xls files.
My solution was to save the file as a tab-delimited .txt. You can then manually change the .txt extension to .xls and open the data frame with read.delim.
It is a rather crude way to work around the issue, though.
Having a "proper" CSV file depends on the software that was used to generate it in the first place.
Consider Google Sheets. The warning will be issued every time the CSV file -- downloaded via utils::download.file -- contains fewer than five lines. This is likely related to the following (from utils::read.table):
The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer.
In my short experience, if the data in the CSV file is rectangular, then the warning can be ignored.
Now consider LibreOffice Calc. There won't be any warnings, irrespective of the number of lines in the CSV file.
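A minimal sketch to see this behaviour for yourself (cat() writes no trailing newline here):
cat("a,b,c\n1,2,3", file = "small.csv")  # two lines, rectangular, no final newline
read.csv("small.csv")                    # warns, but the resulting data frame is correct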
I had a similar issue that didn't get resolved by the "enter method". After the mentioned error, I noticed the row count of the data frame was lower than that of the CSV. I found that some non-alphanumeric values were hindering the import to R.
I followed Aurezio's comment (https://stackoverflow.com/a/29150226) to remove non-alphanumeric characters (keeping spaces).
Here is the snippet:
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long
    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 32, 48 To 57, 65 To 90  ' keep spaces, digits, and letters
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function
I then used CleanCode as a function for the final result
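Presumably you would then call it from a worksheet cell, e.g. =CleanCode(A1), fill the formula down, and paste the cleaned values back before exporting to CSV.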
Another option: sending an extra linefeed from R (instead of opening the file)
From Getting Data from Excel to R
cat("\n", file = file.choose(), append = TRUE)
Or you can simply open the Excel file and save it as a .csv file, and voilà, the warning is gone.
I know that in R I can read in a csv file using read.csv. I also know that by setting header = TRUE I can indicate to R that there is a header with variable names on the first row.
However, I am trying to read in a csv that places a timestamp on the first row and the header / variable names on the second. I can obviously manually strip off the first line before loading it into R, but it’s a pain to do this each time. Is there an elegant solution to this in R?
Use the skip argument to read.csv
read.csv(.... , skip=1)
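For instance, assuming a hypothetical file mydata.csv with the timestamp on line 1 and the header on line 2:
df <- read.csv("mydata.csv", skip = 1, header = TRUE)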
For the subjective "elegant", you may want to look at fread from "data.table" which generally does a good job of figuring out where the data actually start.
An example:
Create a fake CSV file in our workspace
The first line has "something" and the actual data starts on the second line with the headers "V1", "V2", and "V3".
x <- tempfile()
cat("something",
"V1,V2,V3",
"1,2,3", "4,5,6", "7,8,9", sep = "\n", file = x)
Load "data.table" and try fread
Seems to work out of the box! Obviously replace x with the name of your actual CSV file.
library(data.table)
fread(x)
# V1 V2 V3
# 1: 1 2 3
# 2: 4 5 6
# 3: 7 8 9
I am trying to import a tab separated list into R.
It is 81704 rows long, but read.table creates a data frame with only 31376 rows. Here is my code:
population <- read.table('population.txt', header=TRUE,sep='\t',na.strings = 'NA',blank.lines.skip = FALSE)
There are no # characters commenting anything out.
Here are the first few lines:
[1] "NAME\tSTATENAME\tPOP_2009" "Alabama\tAlabama\t4708708" "Abbeville city\tAlabama\t2934" "Adamsville city\tAlabama\t4782"
[5] "Addison town\tAlabama\t711"
When I read the file raw, readLines gives the right number of lines.
Any ideas are much appreciated!
Difficult to diagnose without seeing the input file, but the usual suspects are quotes and comment characters (even if you think there are none of the latter). You can try:
quote = "", comment.char = ""
as arguments to read.table() and see if that helps.
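Applied to your original call, that would be:
population <- read.table('population.txt', header = TRUE, sep = '\t',
                         na.strings = 'NA', blank.lines.skip = FALSE,
                         quote = "", comment.char = "")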
Check with count.fields what's in the file:
n <- count.fields('population.txt', sep='\t', blank.lines.skip=FALSE)
Then you could check
length(n) # should be 81705 (it counts the header, so rows + 1); if yes, then:
table(n) # show you what's wrong
Then readLines your file and check the rows with the wrong number of fields, e.g. x <- readLines('population.txt'); head(x[n != 3]) (the data shown has three tab-separated fields).