r - read.csv - skip rows with a different number of columns

The first 5 rows of my csv file contain information about the file, which I do not need.
These information rows have only 2 columns, while the header and the data rows (from row 6 onwards) have 8. This appears to be the cause of the issue.
I have tried using the skip argument of read.csv to skip these lines, and the same with read.table:
df = read.csv("myfile.csv", skip=5)
df = read.table("myfile.csv", skip=5)
but this still gives me the same error message, which is:
Error in read.table("myfile.csv", :empty beginning of file
In addition: Warning messages:
1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
...
5: In readLines(file, skip) : line 5 appears to contain an embedded nul
How can I read this .csv into R without the embedded nuls in the first 5 rows causing this issue?

You could try:
read.csv(text=readLines('myfile.csv')[-(1:5)])
This will initially store each line in its own vector element, then drop the first five and treat the rest as a csv.

You can get rid of the warning messages by using the 'skipNul' parameter of readLines:
read.csv(text=readLines('myfile.csv', skipNul=TRUE)[-(1:5)])
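Putting the two answers together, here is a minimal sketch of the whole approach. The demo file, its contents and the variable names are made up for illustration (the real file has 8 columns). It only assumes that the junk lines contain NUL bytes and come before the header.

```r
# Build a small demo file (hypothetical content): 5 info lines that each
# contain a NUL byte, followed by a header line and two data rows.
path <- tempfile(fileext = ".csv")
info <- c(charToRaw("info,"), as.raw(0), charToRaw("x\n"))
body <- charToRaw("a,b,c\n1,2,3\n4,5,6\n")
writeBin(c(rep(info, 5), body), path)

# skipNul = TRUE drops the NUL bytes instead of warning about them.
lines <- readLines(path, skipNul = TRUE)

# Drop the 5 info lines and parse the remainder as csv text.
df <- read.csv(text = lines[-(1:5)])
print(df)
```

Because the lines are filtered before parsing, read.csv only ever sees rows with a consistent number of columns.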

Related

Missing CRLF in final records causing a problem in reading in data sets

I have two data sets containing rows of data where the last row is missing a CRLF. I currently have to add it to the files in order to read them in. Is there a way to read them in without modifying these files?
One of the final records looks like this:
surface NewYork Ave. 1259 1290 no final carriage return
at end of record
Warning message:
In readLines(file, n = thisblock) : incomplete final line found on 'roadways.dat'
Thanks. MM
The only way I managed to reproduce your problem was with a file in a Windows Unicode encoding ("UCS-2LE"). Here are a couple of ways to approach it; test whether they produce the desired output. In most cases this is only a warning, which you can suppress using the available arguments.
# set warn = FALSE (assuming it is just a warning with no effect)
data <- readLines(con <- file("your_file", encoding = "UCS-2LE"), warn = FALSE, n=-1)
# Or see if other alternative encoding can solve your problem
A <- readLines(con <- file("your_file", encoding = "UTF-8"), n=-1)
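As a rough check that the missing final line break only triggers a warning and does not lose data, here is a small sketch. The file is a temporary stand-in for roadways.dat, and the first record is invented for illustration.

```r
# Write a two-record file whose last record has no trailing newline
# (hypothetical stand-in for roadways.dat).
path <- tempfile(fileext = ".dat")
cat("surface Broadway 1200 1250\nsurface NewYork Ave. 1259 1290", file = path)

# warn = FALSE suppresses the "incomplete final line" warning;
# the final record is still returned.
lines <- readLines(path, warn = FALSE)
print(lines)
```

If the last element of lines matches the last record in the file, the warning can be silenced without editing the file.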

Reading csv file with long header line containing special characters

I was trying to read the Toxic Release Inventory (TRI) csv files, which I downloaded from here, using the command tri2016 <- fread("TRI_2016_US.csv"), but it gives me a warning about discarding line 1 because it has too few or too many items to be column names or data.
However, tri2016_1 <- read.csv("TRI_2016_US.csv") reads it without any errors and with correct column names! Using tri2016_1 <- fread("TRI_2016_US.csv", header=TRUE) still generates the warning and still ignores the header.
The TRI files have 108 columns and the header row contains special characters. The columns are listed in the PDF file (Appendix A on pg 7).
Is there any way to get fread to read these csv files along with the header?
Or should I just stick with tri2016 <- as.data.table(read.csv("TRI_2016_US.csv")) and not worry about it?
The header line seems to have a trailing comma (one more than in the other rows) - tested with TRI_2016_US.csv - 111 columns.
If you remove that, the problem should be solved.
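Removing the trailing comma does not have to be done by hand. The sketch below uses base R on a made-up three-column file rather than the real TRI file; the file contents and names are assumptions for illustration only.

```r
# Demo file (hypothetical): the header line ends with an extra comma.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,value,", "1,a,10", "2,b,20"), path)

# Strip a single trailing comma (and any trailing whitespace) from the header only.
lines <- readLines(path)
lines[1] <- sub(",\\s*$", "", lines[1])

# Write the cleaned copy to a new file so any reader can be pointed at it.
cleaned <- tempfile(fileext = ".csv")
writeLines(lines, cleaned)

df <- read.csv(cleaned)
print(names(df))
```

Writing a cleaned copy leaves the downloaded file untouched, and the cleaned path can then be handed to whichever reader you prefer.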
Try the readr package.
library(readr)
tri2016_1 <- readr::read_csv("TRI_2016_US.csv")
You'll get a warning saying
Warning messages:
1: Missing column names filled in: 'X112' [112]
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)

Importing data from an Excel file online

I am trying to download an Excel file online and read only the lines that contain the word "ORD".
fileUrl <-("http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls")
x <- getURLContent(fileUrl)
out <- read.table(fileUrl,x )
I am using getURLContent but receive warnings at an early stage of the process:
Warning messages:
1: In read.table(fileUrl, x) : line 1 appears to contain embedded nulls
2: In read.table(fileUrl, x) : line 2 appears to contain embedded nulls
3: In read.table(fileUrl, x) : line 3 appears to contain embedded nulls
4: In read.table(fileUrl, x) : line 4 appears to contain embedded nulls
5: In read.table(fileUrl, x) : line 5 appears to contain embedded nulls
6: In if (!header) rlabp <- FALSE :
the condition has length > 1 and only the first element will be used
7: In if (header) { :
the condition has length > 1 and only the first element will be used
8: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input
The table "out" comes out almost unreadable. Does anyone know how to read only the specific lines rather than importing the whole file at the risk of getting the error lines?
One of the answers to this SO question recommends using the gdata library to download the Excel file from the web and then using read.xls() to read it into a data frame. Something like this:
library(gdata)
download.file("http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls", destfile="file.xls")
out <- read.xls("file.xls", header=TRUE, pattern="Some Pattern")
The pattern flag tells read.xls() to ignore everything until the first line in which Some Pattern appears. You can change the value to something which allows you to skip the preliminary material before the actual data you want in your data frame.
I just found a solution, thank you Tim for putting me in the right direction:
library(gdata)
DownloadURL <- "http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls"
out <- read.xls(DownloadURL, pattern="ORD", perl = "C:\\Perl64\\bin\\perl.exe")
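Note that the pattern argument described above skips everything before the first matching line; it does not by itself keep only matching rows. If only rows containing "ORD" are wanted, one option is to filter the text lines before parsing. The sketch below uses invented sample rows and column names, not the real spreadsheet, and assumes the sheet has already been converted to csv-like text.

```r
# Hypothetical csv-like lines standing in for the converted spreadsheet.
lines <- c("Company,Stock code,Type,Shares",
           "Alpha Ltd,00001,ORD,1000",
           "Beta Ltd,00002,PREF,500",
           "Gamma Ltd,00003,ORD,250")

header <- lines[1]
rows <- lines[-1]

# Keep only the data rows that contain the literal text "ORD".
ord_rows <- rows[grepl("ORD", rows, fixed = TRUE)]

# Parse header plus the kept rows; character columns preserve leading zeros.
out <- read.csv(text = c(header, ord_rows), colClasses = "character")
print(out)
```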

Not being able to properly read csv with R

I run the following:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv")
These are my warning messages:
Warning messages:
1: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 1 appears to contain embedded nulls
2: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 2 appears to contain embedded nulls
....
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
The resulting data frame is just a column filled with NAs.
A problem might be that all my data is in one column separated by commas, like this (first row is the header, second row is example data):
stat_breed,tran_disp,train_key,t_type,trainer_Id,horsedata_Id
QH,"Tavares Joe","214801","E",0,0
What can I do to accurately read my data?
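One possible cause of "embedded nul" warnings followed by a column of NAs is a file saved in a 16-bit Unicode encoding, as in the UCS-2LE case mentioned in the thread above. Whether that applies to this file is an assumption, not something the question confirms. The sketch below writes the two sample lines as UTF-16LE to a temporary file and reads them back by declaring the encoding.

```r
# Write the sample lines to a temporary file encoded as UTF-16LE
# (assumption: the real file uses a similar encoding).
path <- tempfile(fileext = ".csv")
txt <- 'stat_breed,tran_disp,train_key,t_type,trainer_Id,horsedata_Id\nQH,"Tavares Joe","214801","E",0,0\n'
raw_utf16 <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(raw_utf16, path)

# Declaring the encoding lets read.csv decode the bytes before parsing.
data <- read.csv(path, fileEncoding = "UTF-16LE")
print(data)
```

If the real file uses a different encoding, the fileEncoding value would need to change accordingly.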

fread unable to read .csv files with first column empty

Say I have the first test.csv that looks like this
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, I get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
As for me, my problem was only that the first ? rows of my file had a missing ID value.
So I was able to solve the problem by specifying autostart to be sufficiently far into the file that a nonmissing value popped up:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to automatically identify sep and sep2, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row in which to base the names of the columns.
If indeed there are no nonmissing values for the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor.
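Deleting the empty first field can also be scripted instead of done in a text editor. This is a base R sketch on a made-up file, not the approach referenced above, and it only strips the leading comma when every line starts with one.

```r
# Demo file (hypothetical): every line starts with an empty first field.
path <- tempfile(fileext = ".csv")
writeLines(c(",a,b,c,d,e", ",1,2,3,4,5"), path)

lines <- readLines(path)

# Only strip the leading comma if the first field is empty on every line.
if (all(startsWith(lines, ","))) {
  lines <- sub("^,", "", lines)
}

# Save a cleaned copy that any csv reader can be pointed at.
cleaned <- tempfile(fileext = ".csv")
writeLines(lines, cleaned)

df <- read.csv(cleaned)
print(df)
```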
I think you could use the skip/select/drop arguments of the fread function for this purpose.
fread("myfile.csv",sep=",",header=FALSE,skip="A") # start reading at the first line containing "A"
fread("myfile.csv",sep=",",header=FALSE,select=c(2,3,4,5)) # read only columns 2 to 5
fread("myfile.csv",sep=",",header=FALSE,drop=1) # drop the first column
I've tried making that csv file and running the code. It seems to work now - same for other people? I thought it might be an issue with not having a new line at the end (hence the warning from read.csv), but fread copes fine whether there's a new line at the end or not.
