Reading Tab Delimited Data into R - r

I am trying to read a large tab-delimited file into R.
First I tried this:
data <- read.table("data.csv", sep="\t")
But it is reading some of the numeric variables in as factors.
So I tried to read in the data, specifying the type I want each variable to be, like this:
data <- read.table("data.csv", sep="\t", colClasses=c("character","numeric","numeric","character","boolean","numeric"))
But when I try this it gives me an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"4"'
I think it might be that there are quotes around some of the numeric values in the original raw file, but I'm not sure.

Without seeing your data it's hard to say, but you likely have one of a few problems: the fields aren't all separated by tabs, there are embedded tabs within single observations, or a litany of other formatting issues.
The way to sort this out is to set options(stringsAsFactors=FALSE) and then use your first line again.
Check out str(data) and try to figure out which rows are the culprits. The reason some of the numeric variables are being read as factors is that something in those columns is being interpreted as a character, so R coerces the whole column to character (and then, by default, to factor). It usually takes some digging, but the problem is almost surely in your input file.
This is a common data munging issue, good luck!
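For instance, a minimal sketch of that workflow, reusing the file name from the question:
options(stringsAsFactors = FALSE)   # keep character columns as character rather than factor
data <- read.table("data.csv", sep = "\t")
str(data)                           # look for columns that unexpectedly came in as character
Once str() shows you which columns are polluted, the fix is usually a one-liner. For example, stray quote characters around numbers can be stripped and the values converted back to numeric: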

x <- paste("'",floor(runif(10,0,10)),"'",sep="")
x
[1] "'7'" "'3'" "'0'" "'3'" "'9'" "'1'" "'4'" "'8'" "'5'" "'8'"
as.numeric(gsub("'", "",x))
[1] 7 3 0 3 9 1 4 8 5 8
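Applied to the data frame from the question, the same idea looks like this (the column name V2 is just a placeholder for whichever column str() showed as character; the pattern also strips double quotes):
data$V2 <- as.numeric(gsub("[\"']", "", data$V2))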

Related

R error handling using read.table and defined colClasses having corrupt lines in CSV file

I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong with the formatting, e.g. a number appears as 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.
I want to read the .csv file with read.table and define all colClasses to the formats that I expect the file to have (except, of course, for the corrupt lines). This is a minimal example:
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
inputText <- "2015-01-01;123;-0.01\n
2015-01-02;421;-0.022015-01-03;433;-0.04\n
2015-01-04;321;-0.03\n
2015-01-05;230;-0.05\n
2015-01-06;313;0.-02"
con <- textConnection(inputText, "r")
mydata <- read.table(con, sep=";", fill = T, colClasses = colClasses)
At the first corrupt line, read.table stops with this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '-0.022015-01-03'
With this error message I have no idea in which line of the input the error occurred. Hence my only option is to copy the offending value -0.022015-01-03 and search for it in the file. But this is really annoying if you have to do it for a lot of lines and always have to re-execute read.table until it detects the next corrupt line.
So my questions are:
Is there a way to get read.table to tell me the line where the error occurred (and maybe save it for further processing)?
Is there a way to get read.table to simply skip lines with improper formatting, rather than stopping at an error?
Did anyone figure out a way to display these lines for manual correction during the read process? I mean, for example, displaying the whole corrupt line in plain CSV format for manual correction (maybe including the line before and after), and then continuing the read-in process with the corrected lines included.
What I have tried so far is to read everything with colClasses="character" to avoid format checking in the first place, then do the format checking while converting every column to the right format, then which() all lines where the format could not be converted or the result is NA, and just delete them.
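A sketch of that convert-and-filter step (assuming the all-character read has already produced a data frame raw with the columns date, parA and parB; the variable names here are just for illustration):
num_parA <- suppressWarnings(as.numeric(raw$parA))   # NA wherever conversion fails
num_parB <- suppressWarnings(as.numeric(raw$parB))
ok <- !is.na(num_parA) & !is.na(num_parB)            # rows whose numbers converted cleanly
which(!ok)                                           # row numbers of the corrupt lines
clean <- data.frame(date = raw$date[ok], parA = num_parA[ok], parB = num_parB[ok],
                    stringsAsFactors = FALSE)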
I have a solution, but it is very slow.
With ideas I got from some of the comments, the thing I tried next is to read the input line by line with readLines and pass the result to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and re-submission. Here is my code:
con <- textConnection(inputText, "r")
mydata <- data.frame()
while (length(text <- readLines(con, n = 1)) > 0) {
  correction <- TRUE
  while (correction) {
    err <- tryCatch(part <- read.table(text = text, sep = ";", fill = TRUE,
                                       col.names = colNames,
                                       colClasses = colClasses),
                    error = function(e) e)
    if (inherits(err, "error")) {
      # try to correct this line
      message(err, "\n")
      text <- edit(text)
    } else {
      correction <- FALSE
    }
  }
  mydata <- rbind(mydata, part)
}
If the user makes the corrections correctly, this returns:
> mydata
date parA parB
1 2015-01-01 123 -0.01
2 2015-01-02 421 -0.02
3 2015-01-03 433 -0.04
4 2015-01-04 321 -0.03
5 2015-01-05 230 -0.05
6 2015-01-06 313 -0.02
The input text had 5 lines, since one linefeed was missing. The corrected output has 6 lines and the 0.-02 is corrected to -0.02.
What I would still change in this solution is to present all the corrupt lines together for correction after everything has been read in. That way the user can run the script and, once it has finished, make all the corrections at once. But for a minimal example this should be enough.
The really bad thing about this solution is that it is really slow! Too slow to handle big datasets. Hence I would still like another solution that uses more standard methods or perhaps a dedicated package.
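One way to at least collect all of the offending line numbers in a single pass, so the corrections can be presented together afterwards (a rough sketch reusing inputText, colNames and colClasses from above; it still parses line by line, so it does not solve the speed problem):
all_lines <- readLines(textConnection(inputText, "r"))
all_lines <- all_lines[all_lines != ""]               # drop blank lines
parses_ok <- vapply(all_lines, function(line) {
  !inherits(tryCatch(read.table(text = line, sep = ";",
                                col.names = colNames, colClasses = colClasses),
                     error = function(e) e),
            "error")
}, logical(1), USE.NAMES = FALSE)
which(!parses_ok)                                      # indices of the corrupt lines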

Reading PISA data into R - read.table error

I am trying to read data from the PISA 2012 study (http://pisa2012.acer.edu.au/downloads.php) into R using the read.table function. This is the code I tried:
pisa <- read.table("pisa2012.txt", sep = "")
Unfortunately, I keep getting the following error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
: line 2 did not have 184 elements
I have tried to set
header = T
but then I get the following error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
:line 1 did not have 184 elements
Lastly, this is what the .txt file looks like ...
http://postimg.org/image/4u9lqtxqd/
Thanks for your help!
You can see from the first line that you'll need some sort of control file to delimit the individual variables. From working with PISA in other environments, I know the first three columns correspond to the ISO 3-letter country code (e.g., ALB). What follows are numbers and letters that need to be split apart in a meaningful way. You could use the codebook for this (https://pisa2012.acer.edu.au/downloads/M_stu_codebook.pdf), but that is a real bear for every single variable. Why not download the SPSS or SAS version and import that? Not a 'slick' solution, but without a control file you'd have a lot of manual work to do.
I just read the files using the readr package. What you will need: the readr package, the TXT file, the SAScii package and the associated SAS control file.
So, let's say you want to read the student files. Then you will need the following files: INT_STU12_DEC03.txt and INT_STU12_DEC03.sas.
##################### READING STUDENT DATA ###################
library(SAScii)   # provides parse.SAScii()
library(readr)    # provides read_fwf() and fwf_widths()
## Loading the dictionary
dic_student <- parse.SAScii(sas_ri = 'INT_STU12_SAS.sas')
## Creating the positions for read_fwf
student <- read_fwf(file = 'INT_STU12_DEC03.txt',
                    col_positions = fwf_widths(dic_student$width),
                    progress = TRUE)
colnames(student) <- dic_student$varname
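As a small variation (not part of the original answer, just a suggestion): fwf_widths() can take the column names as its second argument, which avoids the separate colnames() step:
student <- read_fwf(file = 'INT_STU12_DEC03.txt',
                    col_positions = fwf_widths(dic_student$width, dic_student$varname),
                    progress = TRUE)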
OBS 1: As I'm using Linux, I needed to delete the first few lines from the SAS file and change the encoding to UTF-8.
OBS 2: The lines deleted were:
libname M_DEC03 "C:\XXX";
filename STU "C:\XXX\INT_STU12_DEC03.txt";
options nofmterr;
OBS 3: The dataset takes about 1 GB, so you will need enough RAM.

read txt file with some data missing

I do realize similar questions have already been posed, but given that none of the answers provided solves my problem, frustration is beginning to set in. The problem is the following: I have 27 identically shaped time series (date, Open, High, Low, Last) in txt format, and I want to import each one into R in such a way that the first line read is the first one with all 5 values present. The example below shows that while the data in the text file starts on 1984-01-03, I would like the file to be read from 1990-11-05 onwards (since Open is missing for earlier dates), with the first column of dates saved as row names and the other 4 columns saved as numeric, each with the obvious column name.
Open High Low Last
1984-01-03 1001.40 997.50 997.50
1984-01-04 999.50 993.30 998.60
1990-11-05 2038.00 2050.20 2038.00 2050.10
1990-11-06 2055.00 2071.00 2052.20 2069.80
Given that this is a common problem, I have tried the following code:
ftse <- read.table("FTSE.txt", sep="", quote="", dec=".", as.is=TRUE,
blank.lines.skip=TRUE, strip.white=TRUE,na.strings=c("","NA"),
row.names=1, col.names=c("Open","High","Low","Last"))
I have tried all sorts of combinations, also specifying colClasses, header=TRUE and other arguments (with fill=TRUE the data is actually read, but that is exactly what I don't want), yet I always get the following error (possibly with a different line number in the message):
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1731 did not have 4 elements
Line 1731 is the one corresponding to the date 1984-01-03. I am kindly asking for help, since I cannot afford to lose any more time on such issues, so please suggest how I can fix this. Thank you in advance.
I don't know what the general solution might be but a combination of readLines and read.fwf might work in your case:
ftse.lines <- readLines("FTSE.txt")
ftse.lines <- ftse.lines[ftse.lines != ""] # skip empty lines
ftse <- read.fwf(textConnection(ftse.lines), widths=c(11,8,8,8,8), skip=1, row.names=1)
names(ftse) <- c("Open", "Hi", "Lo", "Last")
You may need to modify some parts but it works with your example.
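If you also want to keep only the rows where all four price columns are present (i.e. start at 1990-11-05, as asked in the question), one possible extra step is:
ftse <- ftse[complete.cases(ftse), ]   # drop rows where Open (or any other column) is NA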
The following (using just read.fwf) also works:
ftse <- read.fwf("FTSE.txt", widths=c(11,8,8,8,8), col.names=c("blah", "Open", "Hi", "Lo", "Last"), skip=1)
Then convert the first column to row names if that's really needed.
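A sketch of that last step, using the placeholder name "blah" given to the date column in the call above:
rownames(ftse) <- trimws(ftse$blah)   # use the dates as row names (trimws drops padding spaces)
ftse$blah <- NULL                     # drop the now-redundant column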

Large two-column file for read.table in R: how to automatically ditch lines with the "line X did not have 2 elements" error?

I have a large file consisting of two columns which I want to read in. While doing read.table I encounter this:
> x <- read.table('lindsay.csv')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 493016 did not have 2 elements
Unfortunately that line is only 2% into the file, so finding all those lines which have a bit of corruption is really hard. Is there a way to use read.table and automatically skip the lines that do not have 2 elements?
For starters, use read.csv(), or use sep="," with read.table(), if you are in fact working with a comma-separated values file.
x <- read.csv('lindsay.csv')
OR
x <- read.table('lindsay.csv', sep=",")
If that still doesn't work, you really should find out what is special about those lines and preprocess the text to fix them. This may mean either removing them, or correcting mistakes, or something else I haven't imagined.
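If the offending lines are simply missing a field (or have an extra one), one way to locate and drop them is count.fields(); this is only a sketch and assumes a plain comma-separated file with no multi-line quoted fields:
n_fields <- count.fields("lindsay.csv", sep = ",", blank.lines.skip = FALSE)  # fields per line
which(n_fields != 2)                                    # line numbers worth inspecting or fixing
good <- readLines("lindsay.csv")[which(n_fields == 2)]  # keep only the well-formed lines
x <- read.csv(text = good, header = FALSE)              # adjust header if your file has one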

trouble reading a tab delimited file in R

I am trying to read a tab delimited file that looks like this:
I am using read.table for this purpose, but I am not able to read the file.
table<- read.table("/Users/Desktop/R-test/HumanHT-12_V4_0_R2_15002873_B.txt",
header =FALSE, sep = "\t",
comment.char="#", check.names=FALSE)
When I run the code I get this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 2 elements
What am I doing wrong when reading the table?
I am not very familiar with R, so any help would be really useful.
I am very familiar with this type of file: it is GEO platform data for microarray analysis.
As baptiste proposed above, the best way is to skip the first few lines with skip=9. You may also replace read.table(..., sep="\t") with just read.delim(...), which uses tab as the separator by default. Then you will have your table with suitable column names; note that, after skipping, the column names should be on the first line read.
If you are also interested in the first 9 lines, you can read them with readLines(...) and keep them alongside your table like this:
foo = read.delim(...)
bar = readLines(...)
baz = list(foo, bar)
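A concrete version of that sketch (the file path is the one from the question and skip = 9 comes from the answer; adjust both to your copy of the platform file):
platform_file <- "/Users/Desktop/R-test/HumanHT-12_V4_0_R2_15002873_B.txt"
foo <- read.delim(platform_file, skip = 9, check.names = FALSE)  # the annotation table, with column names
bar <- readLines(platform_file, n = 9)                           # the 9 header/comment lines
baz <- list(annotation = foo, header = bar)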
