I have a CSV file, and when I try to read it with read.csv(), I get this warning:
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions.
This is the Dropbox link for the data: https://www.dropbox.com/s/h0fp0hmnjaca9ff/PING%20CONCOURS%20DONNES.csv
As explained by Hendrik Pon, the message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (a linefeed (\n) or a carriage return + linefeed (\r\n)).
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return/enter
Save the file
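If you would rather check from R whether that final EOL is really missing, here is a minimal sketch (the path is the one used below):
# Read the raw bytes and test whether the file ends with a linefeed (0x0a).
path <- "C:\\Users\\Administrator\\Desktop\\tp.csv"
bytes <- readBin(path, what = "raw", n = file.size(path))
tail(bytes, 1) == as.raw(0x0a)  # FALSE means the final EOL is missing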
So here is your file read without the warning:
df <- read.table("C:\\Users\\Administrator\\Desktop\\tp.csv", header = FALSE, sep = ";")
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Date 20/12/2013 09:04 20/12/2013 09:08 20/12/2013 09:12 20/12/2013 09:16 20/12/2013 09:20 20/12/2013 09:24 20/12/2013 09:28 20/12/2013 09:32 20/12/2013 09:36
2 1 1,3631 1,3632 1,3634 1,3633 1,363 1,3632 1,3632 1,3632 1,3629
3 2 0,83407 0,83408 0,83415 0,83416 0,83404 0,83386 0,83407 0,83438 0,83472
4 3 142,35 142,38 142,41 142,4 142,41 142,42 142,39 142,42 142,4
5 4 1,2263 1,22635 1,22628 1,22618 1,22614 1,22609 1,22624 1,22643 1,2265
But I think you should not read it in this way, because you will have to reshape the data frame again afterwards. Thanks.
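For completeness, a minimal reshape sketch (my own suggestion, not part of the answer above): transpose df so each original row becomes a proper column, then convert the comma decimals.
# Drop the label column, flip rows and columns, and reuse the labels as names.
wide <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)
names(wide) <- df[[1]]
# The numeric rows use decimal commas; convert them to numeric.
wide[-1] <- lapply(wide[-1], function(x) as.numeric(sub(",", ".", x, fixed = TRUE)))
str(wide)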
I faced the same problem while creating a data matrix in Notepad.
So I went to the last row of the data matrix and pressed Enter. Now I have an n-line data matrix and a new blank line with the cursor at the start of line n+1.
Problem solved.
This is not really a CSV file -- each line is a column -- so you can parse it manually, e.g.:
file <- '~/Downloads/PING CONCOURS DONNES.csv'
lines <- readLines(file)               # one element per line
columns <- strsplit(lines, ';')        # split each line on the semicolons
headers <- sapply(columns, '[[', 1)    # the first field of each line is the name
data <- lapply(columns, '[', -1)       # the remaining fields are the values
df <- do.call(cbind, data)             # bind the columns side by side
colnames(df) <- headers
print(head(df))
Note that you can ignore the warning; it is only due to the missing end-of-line at the end of the file.
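If the warning bothers you in scripts, it can also be silenced explicitly (a sketch using the file variable defined above):
# readLines() raises the same "incomplete final line" warning; suppress it.
lines <- suppressWarnings(readLines(file))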
I had the same problem with .xls files.
My solution is to save the file as a tab-delimited .txt. Then you can also manually change the .txt extension to .xls and open the data frame with read.delim.
This is a crude way to work around the issue, though.
Having a "proper" CSV file depends on the software that was used to generate it in the first place.
Consider Google Sheets. The warning will be issued every time the CSV file -- downloaded via utils::download.file -- contains fewer than five lines. This is likely related to the following, from the utils::read.table documentation:
The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer.
In my short experience, if the data in the CSV file is rectangular, then the warning can be ignored.
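A quick way to see this in action (a self-contained sketch, independent of any spreadsheet app):
# A rectangular three-line CSV whose final line lacks an EOL: read.table must
# inspect every line to count columns, so the incomplete last line is noticed.
tmp <- tempfile(fileext = ".csv")
cat("a,b\n1,2\n3,4", file = tmp)  # note: no trailing newline
read.csv(tmp)                     # warns, but the data are read correctly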
Now consider LibreOffice Calc. There won't be any warnings, irrespective of the number of lines in the CSV file.
I had a similar issue that didn't get resolved by the "enter method". After the mentioned error, I noticed the row count of the data frame was lower than that of the CSV; some non-alphanumeric values were hindering the import into R.
I followed Aurezio's comment on this answer (https://stackoverflow.com/a/29150226) to remove non-alphanumeric values (I included the space character).
Here is the snippet:
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long
    ' Keep only spaces (32), digits (48-57), and upper-case letters (65-90)
    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 32, 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function
I then used CleanCode as a function for the final result
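If you would rather stay in R, an equivalent cleanup can be sketched with gsub (clean_code is a hypothetical helper mirroring the VBA function, keeping only spaces, digits, and upper-case letters):
clean_code <- function(x) gsub("[^ 0-9A-Z]", "", toupper(x))
clean_code("ab-12 c#d")  # returns "AB12 CD"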
Another option: sending an extra linefeed from R (instead of opening the file)
From Getting Data from Excel to R
cat("\n", file = file.choose(), append = TRUE)
Or you can simply open the Excel file and save it as a .csv file, and voilà, the warning is gone.
Related
I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate introduced some changes in its database, and now the exported CSV file contains one last empty column, which breaks my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one first useless row, which is automatically removed in my script with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco <- read.csv("data/jcr ecology 2020.csv",
                   na = "n/a", skip = 1, header = TRUE)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that the empty column doesn't have a header. If they had only put the extra comma at the end of the header line as well, this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example:
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep = ",",
                 skip = 2, fill = TRUE, header = TRUE, row.names = NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the row names in the data.frame and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file name; on my disk I had to set the working directory first.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
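The explicit connection can be avoided by handing the cleaned lines straight to read.csv via its text argument (an equivalent sketch):
df1 <- read.csv(text = txt)  # same result, no textConnection()/close() needed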
I have this VCF-format file, and I want to read it into R. However, the file contains some redundant lines that I want to skip. I want to get something like the result below, where the table starts at the line matching #CHROM.
This is what I have tried:
chromo1 <- try(scan(myfile.vcf, what = character(), n = 5000, sep = "\n",
                    skip = 0, fill = TRUE, na.strings = "", quote = "\"")) # find the start of the vcf file
skip.lines <- grep("^#CHROM", chromo1)
column.labels <- read.delim(myfile.vcf, header = FALSE, nrows = 1, skip = (skip.lines - 1),
                            sep = "\t", fill = TRUE, stringsAsFactors = FALSE,
                            na.strings = "", quote = "\"")
num.vars <- dim(column.labels)[2]
myfile.vcf
#not wanted line
#unnecessary line
#junk line
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
result
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
Maybe this could work for you:
# read the vcf file twice: first for the column names, second for the data
tmp_vcf <- readLines("test.vcf")
tmp_vcf_data <- read.table("test.vcf", stringsAsFactors = FALSE)
# filter for the column names
tmp_vcf <- tmp_vcf[-(grep("#CHROM", tmp_vcf) + 1):-(length(tmp_vcf))]
vcf_names <- unlist(strsplit(tmp_vcf[length(tmp_vcf)], "\t"))
names(tmp_vcf_data) <- vcf_names
P.S.: if you have several VCF files, you should use the lapply function; a sketch follows.
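A minimal sketch of that (read_vcf is just an illustrative name):
# Wrap the steps above in a function and lapply over the file names.
read_vcf <- function(f) {
  txt <- readLines(f)
  dat <- read.table(f, stringsAsFactors = FALSE)  # '#' lines are skipped as comments
  hdr <- grep("^#CHROM", txt, value = TRUE)
  names(dat) <- unlist(strsplit(hdr, "\t"))
  dat
}
vcf_list <- lapply(c("test1.vcf", "test2.vcf"), read_vcf)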
Best,
Robert
data.table::fread reads it as intended, see example:
library(data.table)
#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")
#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")
We can also use the vcfR package; see the manuals in the link.
I don't know how fread reads the VCF correctly in the answer above, but you can use 'skip' to define the line where the data start (or, if an integer, the number of rows to skip).
library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')
I was trying to read the Toxic Release Inventory (TRI) CSV files, which I downloaded from here, using the command tri2016 <- fread("TRI_2016_US.csv"), but it gives me a warning that line 1 was discarded because it has too few or too many items to be column names or data.
However, tri2016_1 <- read.csv("TRI_2016_US.csv") reads it without any errors and with correct column names! Using tri2016_1 <- fread("TRI_2016_US.csv", header=TRUE) still generates the warning and still ignores the header.
The TRI files have 108 columns, and the header row contains special characters. The columns are listed in a PDF file (Appendix A on p. 7).
Is there any way to get fread to read these csv files along with the header?
Or should I just stick with tri2016 <- as.data.table(read.csv("TRI_2016_US.csv")) and not worry about it?
The header line seems to have a trailing comma (one field more than in the other rows) -- tested with TRI_2016_US.csv, which gives 111 columns.
If you remove that, the problem should be solved; one way to do it from R is sketched below.
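One way to do the removal from R itself (a sketch, using the file name from the question):
# Strip the extra trailing comma from the header line, then fread as usual.
lines <- readLines("TRI_2016_US.csv")
lines[1] <- sub(",$", "", lines[1])
writeLines(lines, "TRI_2016_US_fixed.csv")
tri2016 <- data.table::fread("TRI_2016_US_fixed.csv")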
Try the readr package.
library(readr)
tri2016_1 <- readr::read_csv("TRI_2016_US.csv")
You'll get warnings saying
Warning messages:
1: Missing column names filled in: 'X112' [112]
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
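The filled-in empty column can then simply be dropped (a sketch; the name X112 comes from the warning above):
tri2016_1$X112 <- NULL  # remove the empty column readr filled in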
I am trying to read a CSV file line by line and select only the 2nd and 3rd cells from the left and the 3rd cell from the right. For example, if there are 17 cells in a line, I take the 15th cell. Then I want to combine those 3 cells, separated by commas, and write that line to a new CSV file.
For now, I am using a for loop to access each line and split it by commas. Then I select the cells I want, combine them into a string, and collect the results. Once the for loop finishes, I write out the file with writeLines(). However, this takes a long time because there are 2.8 million rows, and it uses a lot of memory. Is there any way to make it more efficient? Or can I write the output file line by line inside the for loop?
library(readr)  # for read_lines()
FileLinebyLine <- read_lines("testfile.csv")
pt <- proc.time()
NewFile <- ""
RowList <- list()
for (i in seq_along(FileLinebyLine))
{
  a <- strsplit(FileLinebyLine[i], ",")
  # keep the 2nd and 3rd cells from the left and the 3rd cell from the right
  RowList[i] <- paste(a[[1]][2], a[[1]][3], a[[1]][length(a[[1]]) - 2], sep = ",")
}
NewFile <- paste(unlist(RowList), collapse = "\n")  # collapse, not sep
proc.time() - pt
outputfile <- file("output.txt")
writeLines(NewFile, outputfile)
close(outputfile)
I have also tried using write_lines() in the for loop, but it always gives me the error:
Error in isOpen(path) : invalid connection
Can anyone help? I'd appreciate it!
Yes, you can read and write line by line, although I don't know how fast it will be. Here's an example that reads a file line by line, takes the 4th item of every line, and writes it to a new file one line at a time:
con = file("temp.csv", "r")
while(length(x <- readLines(con, n = 1)) > 0) {
write(strsplit(x,",")[[1]][4], file="out.csv", append=T)
}
close(con)
temp.csv
a,b,c,d,e,f,g,h
x,y,z,a,b,c,d,e
1,2,3,4,5,6,7,8
q,w,e,r,t,y,u,i
out.csv
d
a
4
r
Hope that helps.
Edit: You can also add library(compiler); enableJIT(3) to speed up your loops a little.
Say I have a test.csv that looks like this:
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, I get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
As for me, my problem was only that the first ? rows of my file had a missing ID value.
So I was able to solve the problem by specifying autostart to be sufficiently far into the file that a non-missing value popped up:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to automatically identify sep and sep2, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row on which to base the names of the columns.
If indeed there are no non-missing values in the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor (see the sketch below).
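That find-and-replace can also be done from R itself (a minimal sketch, assuming the file from the question):
# Delete the empty first field, i.e. the leading comma, from every line.
txt <- readLines("test.csv")
writeLines(sub("^,", "", txt), "test_fixed.csv")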
I think you could use the skip/select/drop arguments of fread for this purpose:
fread("myfile.csv", sep = ",", header = FALSE, skip = "A")              # start reading at the first line containing "A"
fread("myfile.csv", sep = ",", header = FALSE, select = c(2, 3, 4, 5))  # read only columns 2 to 5
fread("myfile.csv", sep = ",", header = FALSE, drop = "A")              # drop the column named "A"
I've tried making that csv file and running the code. It seems to work now -- same for other people? I thought it might be an issue with not having a newline at the end (hence the warning from read.csv), but fread copes fine whether there's a newline at the end or not; a quick check follows.
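For reference, a minimal way to test both cases (a sketch):
library(data.table)
writeLines(",a,b,c,d,e", "test.csv")  # file with a trailing newline
fread("test.csv", header = FALSE)
cat(",a,b,c,d,e", file = "test.csv")  # same file without a trailing newline
fread("test.csv", header = FALSE)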