Something strange happens when I use read.table to read this data:
data <- read.table('/home/tiger/nasdaqlisted.txt', header = TRUE, sep = '|')
dim(data)
[1] 750 6
In fact, there are 2454 lines in the file. What's wrong?
http://freeuploadfiles.com/bb3cwypih2d2
I think the issue comes from the fact that some of the names contain the quote character ' (in names such as Angie's List, Inc.).
The default quote argument in read.table is "\"'", so it needs to be changed for your data to be read correctly:
read.table("path/to/file", header=TRUE, sep="|", quote="")
As per @mrdwab's suggestion, read.delim, whose default quote argument is "\"", will work without needing any change:
read.delim("path/to/file", header=TRUE, sep="|")
Related
I have a comma-delimited CSV document with predefined headers and a few rows. I just want to change the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X prefixes in the header columns. If I set check.names = TRUE, the column headers get an X prefix.
Now I am trying to write a new csv with pipe-delimiter.
write.table(myData, file = "C:/test_pipe.CSV", row.names = FALSE, na = "", col.names = TRUE, sep = "|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
The import seems fine, but unfortunately the X prefixes in the column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original CSV test.csv was created with Excel, of course without X prefixes in the column headers.
Thanks in advance
You have to keep using check.names = FALSE the second time as well, when you read the file back in.
Otherwise your header will be modified, because apparently it contains variable names that would not be considered syntactically valid column names for a data.frame: special characters are replaced by dots (.), and names starting with a number are prefixed with X.
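Concretely (paths as in the question; the sample headers below are made up to show the renaming rule):
# Keep the original headers intact when reading the pipe-delimited copy back in
mydata.test <- read.csv(file = "C:/test_pipe.CSV", header = TRUE, sep = "|",
                        check.names = FALSE)
# make.names() is what check.names = TRUE applies to each header name:
# special characters become dots, leading digits get an X prefix
make.names(c("2009 Population", "Net Change (%)"))
# [1] "X2009.Population" "Net.Change...."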
I am trying to use read.table on a tab-delimited file with the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only intended special character is the tab, because it is used to separate columns. Is there any way to ask read.table to treat every other character as not special?
If you have 30 million rows, I would use fread rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep = "auto", encoding = "UTF-8")
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
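If the culprit really is a stray control character, one workaround (a sketch, not guaranteed for every file) is to strip it out before parsing:
# Read the raw lines, drop any ASCII SUB (0x1A) characters, then parse the
# cleaned text; warn = FALSE silences the incomplete-final-line warning
raw_lines <- readLines("input.txt", warn = FALSE)
clean <- gsub("\x1a", "", raw_lines, fixed = TRUE)
df <- read.table(text = clean, header = FALSE, sep = "\t",
                 quote = "", comment.char = "")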
I would like to understand why this difference exists between read.table and fread, and maybe you'll know a workaround to make fread work. I have two lines of code that perform the same goal: to read a file. fread is faster and more efficient than read.table, but read.table produces no errors on the same data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
fread returns the classic error, which I have explored:
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, read.table returned what I think is a similar error when fill=TRUE was not included, and it would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I am thinking that the errors might be similar in some way. According to the documentation, fill allows the following: "If TRUE then in case the rows have unequal length, blank fields are implicitly added."
Is there a workaround similar to fill=TRUE that might address the fread problem?
ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
This answer highlights how data.table can now fill using fread.
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
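With a recent enough data.table, the original call should then work once fill is switched on (a sketch, using the same A.txt as above):
library(data.table)
# fill = TRUE pads short rows with blanks, much like read.table's fill = TRUE
table <- fread("A.txt", header = FALSE, sep = "|", fill = TRUE)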
I am trying to read in a tab-separated file (available online, using ftp) using read.table() in R.
The issue seems to be that, for the third column, the character string in some of the rows contains characters such as the apostrophe ' as well as the percent sign %, for example Capital gains 15.0% and Bob's earnings:
810| 13 | Capital gains 15.0% |
170| -20 | Bob’s earnings|
100| 80 | Income|
To handle the apostrophe character, I was using the following syntax
df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T)
Yet the syntax shown above does not handle the lines where the problematic third column contains a string with the percent character, so the entire remainder of the table gets jammed into one field.
I also tried to ignore the third column altogether by using colClasses, with df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T, colClasses=c("numeric", "numeric", "NULL")), but the function still falls over on the lines where the percent character is present in that third column.
Any ideas on how to address the issue?
Instead of the read.table wrapper, use the scan function directly. You can control everything, and it is faster, especially if you use the what option to tell R what you are scanning.
dat <- scan(file = "data.tsv",  # substitute your url here
            what = list(numeric(), numeric(), character()),  # column types
            skip = 0,   # lines to skip
            sep = "|")  # separator
Instead of file, use your url.
In this case scan expects two numeric columns followed by one character column.
Type
?scan
to get the complete list of options.
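Because what is a list, scan returns a list with one component per column; here is a short follow-up sketch (the column names are made up) to assemble it into a data frame:
# Combine scan's per-column list into a data frame
df <- data.frame(amount = dat[[1]], change = dat[[2]], label = dat[[3]],
                 stringsAsFactors = FALSE)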
I am trying to import a tab-separated list into R.
It is 81704 rows long. However, read.table creates only 31376. Here is my code:
population <- read.table('population.txt', header=TRUE,sep='\t',na.strings = 'NA',blank.lines.skip = FALSE)
There are no # characters commenting anything out.
Here are the first few lines:
[1] "NAME\tSTATENAME\tPOP_2009" "Alabama\tAlabama\t4708708" "Abbeville city\tAlabama\t2934" "Adamsville city\tAlabama\t4782"
[5] "Addison town\tAlabama\t711"
When I read the file raw, readLines gives the right number of lines.
Any ideas are much appreciated!
Difficult to diagnose without seeing the input file, but the usual suspects are quotes and comment characters (even if you think there are none of the latter). You can try:
quote = "", comment.char = ""
as arguments to read.table() and see if that helps.
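For instance, a sketch combining the question's call with both defaults disabled:
population <- read.table('population.txt', header = TRUE, sep = '\t',
                         quote = "", comment.char = "",
                         na.strings = 'NA', blank.lines.skip = FALSE)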
Check what is in the file with count.fields:
n <- count.fields('population.txt', sep='\t', blank.lines.skip=FALSE)
Then you could check
length(n) # should be 81705 (the header is counted too, so rows + 1); if yes, then:
table(n)  # shows you what's wrong
Then readLines your file and check the rows with the wrong number of fields, e.g. x <- readLines('population.txt'); head(x[n != 3]) (the file has three tab-separated fields per line).