read txt file with some data missing - r

I do realize similar questions have already been posed, but since none of the answers provided solves my problem, frustration is beginning to set in. The problem is the following: I have 27 identically shaped time series (date, Open, High, Low, Last) in txt format, and I want to import each .txt file into R so that reading starts at the first line on which all 5 fields are present. In the example below the data in the text file starts on 1984-01-03, but I would like the file to be read from 1990-11-05 onwards (since Open is missing for the earlier dates), with the first column of dates saved as row names and the other 4 columns saved as numeric, each with the obvious column name.
Open High Low Last
1984-01-03 1001.40 997.50 997.50
1984-01-04 999.50 993.30 998.60
1990-11-05 2038.00 2050.20 2038.00 2050.10
1990-11-06 2055.00 2071.00 2052.20 2069.80
Given that this is a common problem, I have tried the following code:
ftse <- read.table("FTSE.txt", sep="", quote="", dec=".", as.is=TRUE,
blank.lines.skip=TRUE, strip.white=TRUE,na.strings=c("","NA"),
row.names=1, col.names=c("Open","High","Low","Last"))
I have tried all sorts of combinations, also specifying colClasses, header=TRUE and other arguments (with fill=TRUE the data is actually read, but that is exactly what I don't want), yet I always get the following error (sometimes with a different line number in the message):
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1731 did not have 4 elements
Line 1731 is the one corresponding to the date 1984-01-03. I am kindly asking for help, since I cannot afford to lose any more time on such issues, so please suggest how I can fix this. Thank you in advance.

I don't know what the general solution might be, but a combination of readLines and read.fwf might work in your case:
ftse.lines <- readLines("FTSE.txt")
ftse.lines <- ftse.lines[ftse.lines != ""] # skip empty lines
ftse <- read.fwf(textConnection(ftse.lines), widths=c(11,8,8,8,8), skip=1, row.names=1)
names(ftse) <- c("Open", "Hi", "Lo", "Last")
You may need to modify some parts but it works with your example.
The following (using just read.fwf) also works:
ftse <- read.fwf("FTSE.txt", widths=c(11,8,8,8,8), col.names=c("blah", "Open", "Hi", "Lo", "Last"), skip=1)
Then try to convert the first column to row names if that's really needed.
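Another option worth sketching (an untested sketch, assuming the early rows are missing only the Open value and the rest of the layout is as shown above): accept fill=TRUE despite the padding, then drop the incomplete rows afterwards with complete.cases, which gives the "start at the first complete line" behaviour the question asks for.
ftse <- read.table("FTSE.txt", skip=1, fill=TRUE, na.strings=c("","NA"),
                   col.names=c("Date","Open","High","Low","Last"),
                   colClasses=c("character", rep("numeric", 4)))
ftse <- ftse[complete.cases(ftse), ]   # keep only rows where all five fields are present
rownames(ftse) <- ftse$Date            # dates as row names, as requested
ftse$Date <- NULL
The fill=TRUE padding that the question wants to avoid only affects rows that get dropped anyway, so the surviving rows match the desired output.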

Related

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should first use read.csv to read just a few rows, get the column classes from that, and then read the whole file with those classes. I tried to do that:
library(magrittr)  # provides the %>% pipe used below

read.csv(full_path_astro_data,
         header=TRUE,
         sep=",",
         comment.char="",
         nrow=100,
         stringsAsFactors=FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header=TRUE,
                          sep=",",
                          colClasses=col.classes,
                          comment.char="",
                          nrow=47000,
                          stringsAsFactors=FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This is probably because some numeric columns contain only whole numbers (for example many zeros) in the first rows, so the class is guessed as integer. I tried increasing the number of rows in the first read.csv call, but that did not help. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x=="integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double())
)
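As a further sketch (not part of the original answer): read_csv also has a guess_max argument that widens the sample used for type guessing, which is often enough when the early rows are unrepresentative. The path variable is the one from the question; the value 100000 is an arbitrary choice.
library(readr)
# guess column types from many more rows before committing to them
df_astro_data <- read_csv(full_path_astro_data, guess_max = 100000)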

R error handling using read.table and defined colClasses having corrupt lines in CSV file

I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong in the formatting, for example a number written as 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.
I want to read the .csv file with read.table and define all colClasses to the format that I expect the file to have (except of course for the corrupt lines). This is a minimal example:
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
inputText <- "2015-01-01;123;-0.01\n
2015-01-02;421;-0.022015-01-03;433;-0.04\n
2015-01-04;321;-0.03\n
2015-01-05;230;-0.05\n
2015-01-06;313;0.-02"
con <- textConnection(inputText, "r")
mydata <- read.table(con, sep=";", fill = T, colClasses = colClasses)
At the first corrupt line, read.table stops with the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '-0.022015-01-03'
With this error message I have no idea in which line of the input the error occurred. Hence my only option is to copy the string -0.022015-01-03 and search for it in the file. This is really annoying if you have to do it for many lines and always have to re-execute read.table until it hits the next corrupt line.
So my question is:
Is there a way to get read.table to tell me the line where the error occurred (and maybe save it for further processing)?
Is there a way to get read.table to simply skip improperly formatted lines instead of stopping at an error?
Did anyone figure out a way to display such lines for manual correction during the read process? I mean displaying the whole corrupt line in plain csv format for manual correction (maybe including the line before and after), and then continuing the read-in process including the corrected lines.
What I have tried so far is to read everything with colClasses="character" to avoid format checking in the first place, then do the format checking while converting every column to the right type, then which() all lines where the conversion failed or returned NA, and just delete them.
I have a solution, but it is very slow.
With ideas I got from some of the comments, the next thing I tried was to read the input line by line with readLines and pipe each line to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and re-submission. Here is my code:
con <- textConnection(inputText, "r")
mydata <- data.frame()
while(length(text <- readLines(con, n=1)) > 0){
  correction = T
  while(correction) {
    err <- tryCatch(part <- read.table(text=text, sep=";", fill = T,
                                       col.names = colNames,
                                       colClasses = colClasses),
                    error=function(e) e)
    if(inherits(err, "error")){
      # try to correct this line
      message(err, "\n")
      text <- edit(text)
    }else{
      correction = F
    }
  }
  mydata <- rbind(mydata, part)
}
If the user makes the corrections correctly, this returns:
> mydata
date parA parB
1 2015-01-01 123 -0.01
2 2015-01-02 421 -0.02
3 2015-01-03 433 -0.04
4 2015-01-04 321 -0.03
5 2015-01-05 230 -0.05
6 2015-01-06 313 -0.02
The input text had 5 data lines, since one line break was missing. The corrected output has 6 rows, and the 0.-02 has been corrected to -0.02.
What I would still change in this solution is to present all corrupt lines together for correction after everything has been read in. That way the user can run the script and make all corrections at once after it finishes. But for a minimal example this should be enough.
The really bad thing about this solution is that it is very slow, too slow to handle big datasets. Hence I would still like a solution that uses more standard methods or perhaps a dedicated package.
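One faster variant worth sketching (a rough, untested sketch that reuses inputText, colNames and colClasses from above and assumes a crude regex is enough to spot a well-formed line): pre-screen all lines with count.fields plus a format check, hand only the corrupt ones to a single edit() session, and then parse everything in one read.table call instead of a row-by-row rbind.
lines <- readLines(textConnection(inputText))
lines <- lines[lines != ""]                                  # drop blank lines
nf <- count.fields(textConnection(lines), sep = ";", comment.char = "")
ok <- !is.na(nf) & nf == length(colNames) &
      grepl("^[^;]+;[0-9]+;-?[0-9]+\\.[0-9]+$", lines)       # crude well-formedness check (assumption)
bad <- lines[!ok]
if (length(bad) > 0) bad <- edit(bad)                        # correct all corrupt lines at once
mydata <- read.table(text = c(lines[ok], bad), sep = ";",
                     col.names = colNames, colClasses = colClasses)
The corrected rows end up after the well-formed ones, so re-sort by the date column if the original order matters.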

R workarounds: fread versus read.table

I would like to understand why this difference exists between read.table and fread, and ideally find a workaround that makes fread work. I have two lines of code that pursue the same goal: reading a file. fread is faster and more efficient than read.table, but read.table produces no errors on this data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
fread returns the classic error, which I have looked into:
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, read.table returned what I think is a similar error when fill=TRUE was not included and would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I am thinking that the errors might be similar in some way. According to the documentation, fill=TRUE means that "in case the rows have unequal length, blank fields are implicitly added."
Is there a workaround similar to fill=TRUE that might address the fread problem?
ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE : Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
This answer highlights how data.table can now fill using fread.
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
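To make the linked answer concrete, a quick sketch of applying that fill option to the file from this question (assuming a data.table version recent enough for fread to have fill):
library(data.table)
# pads short rows with blanks/NA, much like fill=TRUE in read.table
table <- fread("A.txt", header = FALSE, sep = "|", fill = TRUE)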

Large two column file for read.table in R, how to automatically ditch lines with the "line X did not have 2 elements"?

I have a large file consisting of two columns which I want to read in. While doing read.table I encounter this:
> x <- read.table('lindsay.csv')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 493016 did not have 2 elements
Unfortunately that line is only 2% of the way into the file, so finding all the lines with a bit of corruption by hand is really hard. Is there a way to make read.table automatically skip the lines that do not have 2 elements?
For starters, use read.csv() or pass sep="," to read.table() if you are in fact working with a comma-separated values file.
x <- read.csv('lindsay.csv')
OR
x <- read.table('lindsay.csv', sep=",")
If that still doesn't work, you really should find out what is special about those lines and preprocess the text to fix them. This may mean either removing them, or correcting mistakes, or something else I haven't imagined.
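If preprocessing is acceptable, here is a rough sketch (assuming the file really is comma-separated) that keeps only the lines with exactly 2 fields before parsing:
raw <- readLines("lindsay.csv")
nf <- count.fields(textConnection(raw), sep = ",",           # fields per line; NA means an unbalanced quote
                   blank.lines.skip = FALSE, comment.char = "")
x <- read.table(text = raw[!is.na(nf) & nf == 2], sep = ",")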

Reading Tab Delimited Data in to R

I am trying to read a large tab delimited file in to R.
First I tried this:
data <- read.table("data.csv", sep="\t")
But it reads some of the numeric variables in as factors.
So I tried to read the data in while specifying the type I want for each variable, like this:
data <- read.table("data.csv", sep="\t", colClasses=c("character","numeric","numeric","character","logical","numeric"))
But when I try this it gives me an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"4"'
I think it might be that there are quotes around some of the numeric values in the original raw file, but I'm not sure.
Without seeing your data, you likely have one of a few problems: not all fields are separated by tabs; there are embedded tabs within single observations; or a litany of other possibilities.
The way you can sort this out is to set options(stringsAsFactors=FALSE) then use your first line.
Check out str(data) and try to figure out which rows are the culprits. The reason some of the numeric values are reading as factors is because there is something in that column that R is interpreting as a character and so it coerces the whole column to character. It usually takes some digging but the problem is almost surely with your input file.
This is a common data munging issue, good luck!
x <- paste("'",floor(runif(10,0,10)),"'",sep="")
x
[1] "'7'" "'3'" "'0'" "'3'" "'9'" "'1'" "'4'" "'8'" "'5'" "'8'"
as.numeric(gsub("'", "",x))
[1] 7 3 0 3 9 1 4 8 5 8
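Applied to the file itself, a small sketch (the column name V2 is only an illustration of one of the quoted numeric columns): read everything as character first, strip the stray quotes, then convert.
data <- read.table("data.csv", sep = "\t", colClasses = "character")
data$V2 <- as.numeric(gsub("['\"]", "", data$V2))   # drop embedded quotes, then convert to numeric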
