Reading a csv file with a long header line containing special characters - r

I was trying to read the Toxic Release Inventory (TRI) CSV files, which I downloaded from here, using the command tri2016 <- fread("TRI_2016_US.csv"), but it gives me a warning about discarding line 1 because it "has too few or too many items to be column names or data".
However, tri2016_1 <- read.csv("TRI_2016_US.csv") reads it without any errors and with the correct column names! Using tri2016_1 <- fread("TRI_2016_US.csv", header=TRUE) still generates the warning and still ignores the header.
The TRI files have 108 columns, and the header row contains special characters. The columns are listed in a PDF file (Appendix A on p. 7).
Is there any way to get fread to read these CSV files along with the header?
Or should I just stick with tri2016 <- as.data.table(read.csv("TRI_2016_US.csv")) and not worry about it?

The header line seems to have a trailing comma (one field more than the other rows); tested with TRI_2016_US.csv, which has 111 columns.
If you remove that trailing comma, the problem should be solved.
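A minimal sketch of that repair done from within R, assuming the only defect is the single trailing comma on line 1 (quoted header fields containing commas would need more care):
library(data.table)
# read the raw lines, strip the stray trailing comma from the header line,
# then hand the repaired text back to fread
txt <- readLines("TRI_2016_US.csv")
txt[1] <- sub(",$", "", txt[1])
tri2016 <- fread(text = txt)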

Try the readr package.
library(readr)
tri2016_1 <- readr::read_csv("TRI_2016_US.csv")
You'll get warnings saying:
Warning messages:
1: Missing column names filled in: 'X112' [112]
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
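Despite the warnings, the read succeeds; if you go this route, you can drop the spurious filled-in column afterwards (a small follow-up, using the X112 name from the warning above):
tri2016_1$X112 <- NULL  # remove the empty column created by the trailing comma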

Error in coercing R data.frame to a nz.data.frame

One of the columns in my R data frame has a "," (comma) in it, and because of it, when I try to convert the data frame to a Netezza data frame, it throws the error below:
Error in nzQuery(sqlCommandUpload) : HY008 51 Operation canceled
01000 1 Unable to write nzlog/bad files
01000 1 Unable to write nzlog/bad files
HY000 46 ERROR: External Table : count of bad input rows reached maxerrors limit
How can I achieve this without making any changes to the data?
With a data frame whose values contain no commas, everything works fine; I get the error as soon as a value contains a comma, as reproduced below:
library(nzr)
library(forecast)
library(reshape2)
library(doBy)
nzDisconnect()
nzConnectDSN('DSNInfo', force=FALSE, verbose=TRUE)
# read file
test2 <- read.csv("test_df.csv", stringsAsFactors = FALSE)
# convert to nz data frame: no error
nzdf.test2 <- as.nz.data.frame(test2)
# copy
testd <- test2
# replace one of the values with a string containing a ","
testd$Category[1] <- "Bed, Bath & Towels"
# converting to nz now gives the error
nzdf.testd <- as.nz.data.frame(testd)
# remove the ","
testd$Category <- gsub(",", "", testd$Category)
# converting to nz data frame now gives no error
nzdf.testd <- as.nz.data.frame(testd)
Did you check if you have nulls (NAs) in your data? I faced the same problem, but when I checked the Netezza-R documentation I found that you cannot write nulls into a Netezza table from another system; it mentions using the setOutputNull function in such cases.
So a workaround is to replace nulls with the string "NULL" in your R data frame. Mind you, this makes the numeric columns become varchar, but fortunately "NULL" becomes a real null in your Netezza table automatically. The only extra effort is that you have to convert those columns back to numeric later.
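A minimal sketch of that workaround (the replacement logic here is my assumption, not from the Netezza docs):
# replace NAs with the literal string "NULL" before uploading;
# note this coerces every column to character
to_upload <- testd
to_upload[] <- lapply(to_upload, function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "NULL"
  x
})
nzdf.testd <- as.nz.data.frame(to_upload)  # "NULL" arrives as a real null
# later, convert the affected columns back to numeric, e.g. (hypothetical column):
# to_upload$amount <- as.numeric(to_upload$amount)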
Hope this helps

Difficulties with understanding read.csv code

I'm improving my R skills by rebuilding some of the amazing stuff they do on r-bloggers. Right now I'm trying to reproduce this:
http://wiekvoet.blogspot.nl/2015/06/deaths-in-netherlands-by-cause-and-age.html. The relevant dataset for this exercise can be found here:
http://statline.cbs.nl/Statweb/publication/?VW=D&DM=SLNL&PA=7052_95&D1=0-1%2c7%2c30-31%2c34%2c38%2c42%2c49%2c56%2c62-63%2c66%2c69-71%2c75%2c79%2c92&D2=0&D3=0&D4=0%2c10%2c20%2c30%2c40%2c50%2c60%2c63-64&HD=150710-0924&HDR=G1%2cG2%2cG3&STB=T
Diving into the code (to be found at the bottom of the first link), I'm running into this piece:
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
Could anybody help me separate out the steps that are taken here?
Here is an explanation of what each line in the call to read.csv() is doing in your example. Note that the value of the last parameter, text, is complicated and depends on the script from the link you gave above. At a high level, the script first reads in all lines from the file "Overledenen__doodsoo_170615161506.csv" which contain the string "Centraal", then uses only the third through final lines of that filtered set, with one additional processing step applied to them as well.
r1 <- read.csv(
  # columns are separated by semicolons
  sep=';',
  # the first row is data (i.e. it is NOT a header)
  header=FALSE,
  # names for the six columns
  col.names=c('Causes','Causes2','Age','year','aantal','count'),
  # treat a hyphen as NA
  na.strings='-',
  # read from the third line to the final line of the original input
  # Overledenen__doodsoo_170615161506.csv, after some filtering
  # has been applied
  text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
read.csv reads the csv text, splitting columns on the separator ";", so an input like a;b;c is separated into: first column = a, second = b, third = c.
header=FALSE specifies that no header row is present in the original input.
col.names assigns the listed names to the columns in R.
na.strings='-' treats any '-' in the input as NA (not the other way around).
text=txtlines[3:length(txtlines)] reads the lines from position 3 to the end as the input text, instead of reading from a file.
%>% select(., -aantal, -Causes2) then drops the aantal and Causes2 columns from the resulting data frame.
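To see these pieces in isolation, here is a minimal self-contained sketch; the txtlines content is made up for illustration:
library(dplyr)
# hypothetical stand-in for the filtered lines in the blog's script
txtlines <- c("junk line 1", "junk line 2",
              "Accidents;x;0-20;2010;7;34",
              "Cancer;x;65+;2010;12;-")
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(-aantal, -Causes2)
r1  # the '-' in count has become NA; aantal and Causes2 are gone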

Read data into R deleting or skipping lines containing characters

I'm sure this is simple, but I'm not coming across an answer. I would like to import a data frame into R without processing the lines in a text editor first. Essentially, I want R to do it on read in. So all lines containing
FRAME 1 of ***
OR
ATOM-WISE TOTAL CONTACT ENERGY
will be skipped, deleted or ignored.
And all that will be left is:
Chain Resnum Atom number Energy(kcal/mol)
ATOM C 500 1519 -2.1286
ATOM C 500 1520 -1.1334
ATOM C 500 1521 -0.8180
ATOM C 500 1522 -0.7727
Is there a simple solution to this? I'm not sure which scan() or read.table() arguments would work.
EDIT
I was able to use readLines and gsub to read in the file and remove the unwanted lines. I omitted the empty "" strings left over from the deleted words, and now I am trying to convert the character vector to a regular (numeric) data frame. When I use data.frame(x) or as.data.frame(x), I am left with a data frame of 100K rows and only one variable; there should be at least 5 variables.
readLines gives you a vector with one character string per line of the file, so you have to split these strings into the elements you want before you convert to a data frame. If you have nice space-separated values, try:
m = matrix(unlist(strsplit(data, " +")), ncol=5, byrow=TRUE)
# where 'data' is the name of the vector of strings
df = data.frame(m, stringsAsFactors=FALSE)
Then for each column with numeric data, use as.numeric() on the column to convert.
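For the skipping step itself, a sketch that combines readLines with grep; the file name is a hypothetical stand-in:
lines <- readLines("energies.txt")
# keep only the data rows, which all start with "ATOM"
atoms <- grep("^ATOM", lines, value = TRUE)
df <- read.table(text = atoms,
                 col.names = c("Type", "Chain", "Resnum", "AtomNumber", "Energy"))
str(df)  # Resnum, AtomNumber and Energy are parsed as numeric automatically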

r - read.csv - skip rows with different number of columns

There are 5 rows at the top of my csv file which serve as information about the file and which I do not need.
These information rows have only 2 columns, while the header and the data rows (from row 6 onwards) have 8. This appears to be the cause of the issue.
I have tried using the skip argument of read.csv to skip these lines, and the same with read.table:
df = read.csv("myfile.csv", skip=5)
df = read.table("myfile.csv", skip=5)
but this still gives me the same error message, which is:
Error in read.table("myfile.csv", : empty beginning of file
In addition: Warning messages:
1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
...
5: In readLines(file, skip) : line 5 appears to contain an embedded nul
How can I get this .csv read into R without the embedded nuls in the first 5 rows causing this issue?
You could try:
read.csv(text=readLines('myfile.csv')[-(1:5)])
This will initially store each line in its own vector element, then drop the first five and treat the rest as a csv.
You can get rid of the warning messages with the skipNul parameter:
read.csv(text=readLines('myfile.csv', skipNul=TRUE)[-(1:5)])

In read.table(): incomplete final line found by readTableHeader

I have a CSV file, and when I try to read.csv() it, I get the warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions.
This is the Dropbox link for the data: https://www.dropbox.com/s/h0fp0hmnjaca9ff/PING%20CONCOURS%20DONNES.csv
As explained by Hendrik Pon, the message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (linefeed (\n) or carriage return + linefeed (\r\n)).
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return/enter
Save the file
So here is your file read without the warning:
df=read.table("C:\\Users\\Administrator\\Desktop\\tp.csv",header=F,sep=";")
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Date 20/12/2013 09:04 20/12/2013 09:08 20/12/2013 09:12 20/12/2013 09:16 20/12/2013 09:20 20/12/2013 09:24 20/12/2013 09:28 20/12/2013 09:32 20/12/2013 09:36
2 1 1,3631 1,3632 1,3634 1,3633 1,363 1,3632 1,3632 1,3632 1,3629
3 2 0,83407 0,83408 0,83415 0,83416 0,83404 0,83386 0,83407 0,83438 0,83472
4 3 142,35 142,38 142,41 142,4 142,41 142,42 142,39 142,42 142,4
5 4 1,2263 1,22635 1,22628 1,22618 1,22614 1,22609 1,22624 1,22643 1,2265
But I think you should not read it in this way, because you then have to reshape the data frame again. Thanks.
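For reference, a sketch of that reshaping, assuming df as read above: transpose so the dates become rows, then fix the comma decimal separators:
wide <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)
names(wide) <- df[[1]]   # "Date", "1", "2", "3", "4"
wide[-1] <- lapply(wide[-1], function(x) as.numeric(sub(",", ".", x)))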
I faced the same problem while creating a data matrix in Notepad.
So I went to the last row of the data matrix and pressed Enter. Now I had an n-line data matrix and a new blank line with the cursor at the start of line n+1.
Problem solved.
This is not a CSV file: each line is a column. You can parse it manually, e.g.:
file <- '~/Downloads/PING CONCOURS DONNES.csv'
lines <- readLines(file)
columns <- strsplit(lines, ';')
headers <- sapply(columns, '[[', 1)
data <- lapply(columns, '[', -1)
df <- do.call(cbind, data)
colnames(df) <- headers
print(head(df))
Note that you can ignore the warning, since it is only due to the missing end-of-line at the end of the file.
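An optional follow-up sketch: turn the resulting character matrix into a data frame and convert the comma-decimal columns to numeric:
df <- as.data.frame(df, stringsAsFactors = FALSE)
df[-1] <- lapply(df[-1], function(x) as.numeric(sub(",", ".", x)))
head(df)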
I had the same problem with .xls files.
My solution is to save the file as a tab-delimited .txt. You can then also manually change the .txt extension to .xls, and open the data frame with read.delim.
This is a very crude way to overcome the issue, anyway.
Having a "proper" CSV file depends on the software that was used to generate it in the first place.
Consider Google Sheets. The warning will be issued every time the CSV file, downloaded via utils::download.file, contains fewer than five lines. This is likely related to the following, from the utils::read.table documentation:
The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer.
In my short experience, if the data in the CSV file is rectangular, then the warning can be ignored.
Now consider LibreOffice Calc. There won't be any warnings, irrespective of the number of lines in the CSV file.
I had a similar issue which didn't get resolved by the "press Enter" method. After the mentioned error, I noticed the row count of the data frame was lower than that of the CSV; some non-alphanumeric values were hindering the import into R.
I followed Aurezio's comment on this answer (https://stackoverflow.com/a/29150226) to remove non-alphanumeric characters (I included the space character).
Here is the snippet:
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long
    ' keep only spaces (32), digits (48-57) and upper-case letters (65-90)
    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 32, 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function
I then applied CleanCode to each value to produce the final result.
Another option: sending an extra linefeed from R (instead of opening the file)
From Getting Data from Excel to R
cat("\n", file = file.choose(), append = TRUE)
Or you can simply open that Excel file and save it as a .csv file, and voilà, the warning is gone.
