I am trying to read a tab-delimited file. I am using read.table for this purpose, but I am not able to read the file.
table <- read.table("/Users/Desktop/R-test/HumanHT-12_V4_0_R2_15002873_B.txt",
                    header = FALSE, sep = "\t",
                    comment.char = "#", check.names = FALSE)
When I run the code I get this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 2 elements
What am I doing wrong while reading the table?
I am not very familiar with R, so any help would be really useful.
I am very familiar with this type of file: it is GEO platform data for microarray analysis.
As baptiste proposed above, the best way is to skip the first few lines with skip=9. You may also replace read.table(..., sep="\t") with just read.delim(...). Then you will have your table with suitable column names; please note that the column names should be on the first line after the skipped ones.
Then, if you are really interested in the first 9 lines, you may read them with readLines(...) and keep them together with your table like this:
foo <- read.delim("/Users/Desktop/R-test/HumanHT-12_V4_0_R2_15002873_B.txt", skip = 9)
bar <- readLines("/Users/Desktop/R-test/HumanHT-12_V4_0_R2_15002873_B.txt", n = 9)
baz <- list(foo, bar)  # the table and its 9 metadata lines, kept together
I am trying to read data from the PISA 2012 study (http://pisa2012.acer.edu.au/downloads.php) into R using the read.table function. This is the code I tried:
pisa <- read.table("pisa2012.txt", sep = "")
Unfortunately, I keep getting the following error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 2 did not have 184 elements
I have tried to set
header = T
but then I get the following error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 184 elements
Lastly, this is what the .txt file looks like ...
http://postimg.org/image/4u9lqtxqd/
Thanks for your help!
You can see from the first line that you'll need some sort of control file to delimit the individual variables. From working with PISA in other environments, I know the first three columns correspond to the ISO 3-letter country code (e.g., ALB). What follows are numbers and letters that need to be separated in a meaningful way. You could use the codebook for this (https://pisa2012.acer.edu.au/downloads/M_stu_codebook.pdf), but doing that for every single variable is a real bear. Why not download the SPSS or SAS version and import that? Not a 'slick' solution, but without a control file you'd have a lot of manual work to do.
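For illustration only, a sketch of what a fixed-width read could look like with base R's read.fwf; the widths below are invented (apart from the 3-character country code) and the column names are placeholders, so the real layout must come from the codebook:
# Hypothetical layout: 3-char country code plus two invented fields
pisa <- read.fwf("pisa2012.txt",
                 widths = c(3, 7, 5),
                 col.names = c("CNT", "VAR1", "VAR2"))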
I just read the files using the readr package. What you will need: the readr package, the TXT file, the SAScii package, and the associated SAS control file.
So, let's say you want to read the student file. Then you will need the following files: INT_STU12_DEC03.txt and INT_STU12_DEC03.sas.
##################### READING STUDENT DATA ###################
library(readr)
library(SAScii)
## Load the dictionary from the SAS control file
dic_student <- parse.SAScii(sas_ri = 'INT_STU12_DEC03.sas')
## Use the dictionary's column widths to read the fixed-width file
student <- read_fwf(file = 'INT_STU12_DEC03.txt',
                    col_positions = fwf_widths(dic_student$width),
                    progress = TRUE)
colnames(student) <- dic_student$varname
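A quick sanity check on the result (using the objects created above):
dim(student)               # rows and columns actually read
head(dic_student$varname)  # first few variable names from the dictionary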
OBS 1: As I'm using Linux, I needed to delete the first lines from the SAS file and change its encoding to UTF-8.
OBS 2: The lines deleted were:
libname M_DEC03 "C:\XXX";
filename STU "C:\XXX\INT_STU12_DEC03.txt";
options nofmterr;
OBS 3: The dataset takes about 1 GB, so you will need enough RAM.
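Since the file is that large, a hedged tip: read_fwf's n_max argument lets you preview a few rows and check the layout before committing the memory for the full read (same files and dictionary as above):
student_head <- read_fwf(file = 'INT_STU12_DEC03.txt',
                         col_positions = fwf_widths(dic_student$width),
                         n_max = 100)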
I am trying to read a CSV file into R. I tried:
data <- read.csv(file="train.csv")
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
But this reads in only a small percentage of the total observations. Then I tried turning off quoting:
data <- read.csv(file="train.csv",quote = "",sep = ",",header = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Since the data is text, it seems there is some issue with the delimiter.
It is difficult to share the entire data set as it is huge. I tried going to the line where the error occurs, but there seems to be no non-printable character there. I also tried other readers like fread(), but to no avail.
I have encountered this before; it can be very tricky. Try a specialized CSV reader:
library(readr)
data <- read_csv(file="train.csv")
This should do it.
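If the row count still looks short, readr keeps a record of any parsing trouble it hit; a quick check on the data object from above:
problems(data)  # rows and columns readr could not parse cleanly, if any
nrow(data)      # confirm how many observations actually came in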
I would like to understand why this difference exists between read.table and fread, and maybe find the workaround that makes fread work. I have two lines of code that perform the same goal: to read a file. fread performs faster and more efficiently than read.table, but read.table produces no errors on the same data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
fread returns the classic error, which I explored:
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, read.table returned what I think is a similar error when fill=TRUE was not included and would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I am thinking that the errors might be similar in some way. According to the documentation, fill allows the following: if TRUE, then in case the rows have unequal length, blank fields are implicitly added.
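As a tiny self-contained illustration of what fill=TRUE does (made-up inline data, not the file from the question):
txt <- "a|b|c\n1|2|3\n4|5\n"
read.table(text = txt, sep = "|", header = TRUE, fill = TRUE)
#   a b  c
# 1 1 2  3
# 2 4 5 NA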
Is there a workaround similar to fill=TRUE that might address the fread problem?
ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; they would not be filled into separate columns the way read.csv can do.
This answer highlights how data.table can now fill using fread.
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
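With that fill support in place, a hedged version of the original call from the question would be:
library(data.table)
table <- fread("A.txt", header = FALSE, sep = "|", fill = TRUE)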
I have a large file consisting of two columns which I want to read in. While doing read.table I encounter this:
> x <- read.table('lindsay.csv')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 493016 did not have 2 elements
Unfortunately that line is only 2% of the way into the file, so finding all the lines that have a bit of corruption is really hard. Is there a way to use read.table and automatically skip the lines that do not have 2 elements?
For starters, use read.csv(), or sep="," with read.table(), if you are in fact working with a comma-delimited file.
x <- read.csv('lindsay.csv')
OR
x <- read.table('lindsay.csv', sep=",")
If that still doesn't work, you really should find out what is special about those lines and preprocess the text to fix them. That may mean removing them, correcting mistakes, or something else I haven't imagined.
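If you do want to locate the offending lines yourself, base R's count.fields is handy; a small sketch, assuming the file really is comma-delimited:
n <- count.fields("lindsay.csv", sep = ",")
which(n != 2)  # line numbers whose field count is not exactly 2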
I am trying to read a large tab-delimited file into R.
First I tried this:
data <- read.table("data.csv", sep="\t")
But it reads some of the numeric variables in as factors. So I tried to read in the data, specifying what type I want each variable to be, like this:
data <- read.table("data.csv", sep="\t", colClasses=c("character","numeric","numeric","character","boolean","numeric"))
But when I try this it gives me an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"4"'
I think it might be that there are quotes around some of the numeric values in the original raw file, but I'm not sure.
Without seeing your data, you have one of a few problems: you don't have tabs separating all the data; there are embedded tabs in single observations; or a litany of others.
One way to sort this out is to set options(stringsAsFactors=FALSE) and then use your first line.
Check out str(data) and try to figure out which rows are the culprits. The reason some of the numeric values are reading as factors is because there is something in that column that R is interpreting as a character and so it coerces the whole column to character. It usually takes some digging but the problem is almost surely with your input file.
This is a common data munging issue, good luck!
Here is the quote-stripping idea in miniature:
x <- paste("'", floor(runif(10, 0, 10)), "'", sep = "")  # simulate numbers wrapped in quotes
x
[1] "'7'" "'3'" "'0'" "'3'" "'9'" "'1'" "'4'" "'8'" "'5'" "'8'"
as.numeric(gsub("'", "", x))  # strip the quotes, then convert back to numeric
[1] 7 3 0 3 9 1 4 8 5 8
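The same idea applied to a whole column after reading everything in as character; the column name V2 here is a hypothetical placeholder:
data <- read.table("data.csv", sep = "\t", stringsAsFactors = FALSE)
data$V2 <- as.numeric(gsub("[\"']", "", data$V2))  # V2 stands in for the quoted numeric column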