skip and autostart in fread - R

I am using the following code to read a file with the data.table library:
fread(myfile, header=FALSE, sep=",", skip=100, colClasses=c("character","numeric","NULL","numeric"))
but I get the following error:
The supplied 'sep' was not found on line 80. To read the file as a single character column set sep='\n'.
It says it did not find sep on line 80. However, I set skip=100, so it should not pay attention to the first 100 lines.
UPDATE:
I tried with skip=101 and it worked, but then it skips the first line, where the data starts.
I am using version 1.9.2 of the data.table package and R version 3.0.2 (64-bit) on Windows 7.

Given the version you're using, I can make a guess about what's happening in this case.
Try setting autostart=101.
Note the first paragraph of Details in ?fread:
Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. Setting skip>0 overrides this feature by setting autostart=skip+1 and turning off the search upwards step.
The skip argument has:
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
and the autostart argument has:
Any line number within the region of machine readable delimited text, by default 30. If the file is shorter or this line is empty (e.g. short files with trailing blank lines) then the last non empty line (with a non empty line above that) is used. This line and the lines above it are used to auto detect sep, sep2 and the number of fields. It's extremely unlikely that autostart should ever need to be changed, we hope.
In your case perhaps the human readable header is much larger than 30 rows, which is why I guess setting autostart=101 might work. No need to use skip.
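For concreteness, here is a minimal sketch of the call; myfile and the colClasses come from your question, and autostart=101 assumes the banner really does span about 100 lines:
library(data.table)
# autostart only needs to point at any line inside the data block;
# fread searches upwards from there to find the first data row itself
DT <- fread(myfile, header=FALSE, sep=",", autostart=101,
            colClasses=c("character","numeric","NULL","numeric"))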
One motivation is for convenience when a file contains multiple tables. By setting autostart to any row inside the table that you want to pluck out of the file, it'll find the first data row and header row for you automatically, and then read just that table. You don't have to worry about getting the exact line number at the start of data like you do with skip. fread can only read one table currently. It could feasibly return a list of tables from a single file, but that's getting a bit complicated and nobody has asked for that.

Related

R Dataframe from a Text file with a 2-Byte Separator

Could you help with converting a big text file into a dataframe?
A sample of the text:
X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y""1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"19"II19II"K2LE1-109-1091-19"II2008-11-13II"109"II1091II1II2010-10-19II37
The separator is "II".
I have used:
df_BSt7<-readLines("Komponente_K2LE1.txt")
df_BST7<-str_replace_all(df_BSt7,"II",",")
df_BST7<-read.table(df_BST7,sep = ",")
head(df_BST7)
but I am always getting an error:
could not allocate memory (206 Mb) in C function 'R_AllocStringBuffer'
and when I call head() I get:
'"X1","ID_Sitze.x","Produktionsdatum.x","Herstellernummer.x","Werksnummer.x","Fehlerhaft.x","Fehlerhaft_Datum.x","Fehlerhaft_Fahrleistung.x","ID_Sitze.y","Produktionsdatum.y","Herstellernummer.y","Werksnummer.y","Fehlerhaft.y","Fehlerhaft_Datum.y","Fehlerhaft_Fahrleistung.y""1",1,"K2LE1-109-1091-2",2008-11-12,"109",1091,1,2010-10-18,37080,NA,NA,NA,NA,NA,NA,NA"2",2,"K2LE1-109-1091-1",2008-11-12,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"3",3,"K2LE1-109-1091-12",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"4",4,"K2LE1-109-1091-5",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"5",5,"K2LE1-109-1091-40",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"6",6,"K2LE1-109-1091-15",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"7",7,"K2LE1-109-1091-31",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"8",8,"K2LE1-109-1091-6",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"9",9,"K2LE1-109-1091-8",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"10",10,"K2LE1-109-109 [... abgeschnitten]
So, there are several possible problems; some might be specific to your example.
Clean example data
First, let's take a look at your example data. In what you provide, there are no newlines, everything is on a single line. Is that the case in the original "Komponente_K2LE1.txt" file? If yes, we might need some more work to find where to add newlines (see below).
The first column name, X1, only has a quote on the right: X1"II"ID_Sitze. It can't work without the quote on the left.
The saved dataframe has 16 columns, I expect, because there is a row number at the beginning of each row which is not in the header. So we can add one more column header to get 16 of them:
"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"
Then there is a small problem with line 19, which is truncated; I assume that comes from your copy/paste and is not a problem in the full file, so let's set it aside for now. I then have this text:
raw_lines <- '"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y"
"1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA
"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA'
Now you are replacing "II" with "," and reading the result with read.table(), which is perfectly correct, except that read.table() assumes you're giving it a file name and throws an error when it can't open that connection (that file). To make it work you need the text argument:
library(stringr)
df_BST7 <- str_replace_all(raw_lines, "II", ",")
df_BST7 <- read.table(text = df_BST7, sep = ",")
So now that does run on my computer.
As a side note, since you're already using the tidyverse, you could just as well use this equivalent code instead:
library(readr)
library(stringr)
df_BST7 <- str_replace_all(raw_lines, "II", ",")
df_BST7 <- read_csv(df_BST7)
which could help with the memory issue later (see below).
The error message
Now, the error message you get suggests a memory problem. I see two possibilities: either the table is so big it can't fit in your computer's memory, or your whole input table is on a single line, which makes one very long line that won't fit in memory.
Whole table too big
I don't think it's the problem here, but just in case, check how big the file is on disk, how much memory is free on your computer, and whether you could free up enough memory by just closing a few programs. Possibly you could save your modified text to disk and delete it from R's memory with rm(df_BSt7), then load it directly from disk into df_BST7. Since the raw text fits in memory, that should work. If memory is still a challenge, you can replace read_csv() with read_csv_chunked() and process one chunk at a time, as sketched below.
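Here is a hedged sketch of that disk round-trip; the file name cleaned.csv is illustrative, not from your post:
library(readr)
write_lines(df_BST7, "cleaned.csv")   # save the cleaned text to disk
rm(df_BSt7, df_BST7)                  # free the in-memory copies
df_BST7 <- read_csv("cleaned.csv")    # reload straight from disk
# or, chunk by chunk; the identity callback is a placeholder for whatever
# per-chunk filtering or aggregation you actually need
df_BST7 <- read_csv_chunked("cleaned.csv",
                            callback = DataFrameCallback$new(function(chunk, pos) chunk),
                            chunk_size = 1e5)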
All on one line
I think this is the most likely. Again, there are two possibilities.
Missing carriage return
Line breaks can actually be encoded in two ways: Unix-like systems (macOS and GNU/Linux) use the newline character (\n), whereas Windows uses a carriage return followed by a newline (\r\n). I'm not sure how this could create problems inside R, but if your file was generated on a Unix-like system and you're trying to read it on Windows, that's one explanation. The goal would then become replacing \n with \r\n.
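A minimal sketch of that replacement, assuming the whole file fits in one string (the output file name is illustrative):
raw_text <- readChar("Komponente_K2LE1.txt", file.size("Komponente_K2LE1.txt"))
# turn every bare \n into \r\n; the lookbehind avoids doubling existing \r\n
fixed <- gsub("(?<!\r)\n", "\r\n", raw_text, perl = TRUE)
writeChar(fixed, "Komponente_K2LE1_fixed.txt", eos = NULL)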
No line breaks at all
If there is absolutely no line break, neither \r nor \n, then we need to guess where they should go. On a Unix system you could try awk or sed, but there are also ways to do it in R. The following code should work, except that the last column will need some cleaning up afterwards:
library(stringr)
library(purrr)
# strip any stray carriage returns, then split the whole text into fields
raw_lines2 <- str_remove_all(raw_lines, "\r")
all_fields <- raw_lines2 %>%
  str_split("II") %>%
  unlist()
# each row contributes 15 clean fields; the 16th is fused with the next row
nb_lines <- (length(all_fields) - 1) / 15
reconstruct_lines <- map_chr(0:(nb_lines - 1),
                             ~ paste(all_fields[(2 + 15 * .):(16 + 15 * .)], collapse = ",")) %>%
  paste(collapse = "\n")
cat(reconstruct_lines)
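From there the rebuilt text can be fed straight to read.csv(); as said above, the fused boundary field lands in the last column and still needs manual cleaning:
df_rebuilt <- read.csv(text = reconstruct_lines, header = FALSE)
head(df_rebuilt)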

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found that read_csv seems to work fine. I haven't tried it on the original data yet, though.
Although the replies didn't help solve the problem, because my question was not correct, Shan's reply fits the original question I posted best, so I accepted his answer.
Update 2020-5-12
I think my original question was not correct. As mentioned in the comment, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe there is an incorrect line break due to inappropriate encoding or something, causing some of the columns to be displaced. If I open the data with Notepad++, the instance at row 11583 in Excel is at row 11596.
Original question
I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and wrote the code read.csv('listing.csv'). The first column, id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see one of the values in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than comma (,). First go into Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than comma is the pipe symbol (|). In R, when you read the csv, specify the separator as '|'.
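For example, a hedged sketch assuming the re-export was saved as listings_pipe.csv (an illustrative name):
# read.csv is read.table with csv defaults, so the separator can be overridden
listings <- read.csv("listings_pipe.csv", sep = "|", stringsAsFactors = FALSE)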
You could try this?
listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",", "", listings$name)  # removes the commas in the name column
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).

fread() error and strange behaviour when reading csv

I used fread() from the data.table library to try to read a 540MB csv file. It returned an error message saying:
' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.
I have no idea what caused the error and want to track down whether it's a bug or just some data formatting issue that I can tweak fread() to handle.
I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message says). I then re-output that row and the rows immediately preceding and following it, written out using write.csv() as testout.csv.
I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.
The four lines in testout.csv are below (I start a new line for each entry below for readability).
"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail.
.",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131
When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was:
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)
Any idea what may have caused the unexpected results, and the error in the first place? And any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.
UPDATE: Now fixed in v1.9.3 on GitHub:
fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting. See:
fread and a quoted multi-line column value
Windows users are reporting success with the latest version from GitHub.

fread - skip lines starting with certain character - "#"

I am using the fread function in R for reading files to data.tables objects.
However, when reading the file I'd like to skip lines that start with #. Is that possible?
I could not find any mention of it in the documentation.
fread can read from a piped command that filters out such lines, like this:
fread("grep -v '^#' filename")
Not currently, but it's on the list to do.
Are the # lines at the top forming a header which is more than 30 lines long?
If so, that's come up before and the solution is :
fread("filename", autostart=60)
where 60 is chosen to be inside the block of data to be read.
From ?fread:
Once the separator is found on line autostart, the number of columns
is determined. Then the file is searched backwards from autostart
until a row is found that doesn't have that number of columns. Thus,
the first data row is found and any human readable banners are
automatically skipped. This feature can be particularly useful for
loading a set of files which may not all have consistently sized
banners. Setting skip>0 overrides this feature by setting
autostart=skip+1 and turning off the search upwards step.
The default autostart=30 might just need bumping up a bit in your case.
Or maybe skip=n or skip="string" helps:
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).

Read lines by number from a large file

I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out those lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection AND open it before read.table:
con <- file('filename')
open(con)
read.table(con, skip = 5, nrow = 1)   # 6th line
read.table(con, skip = 20, nrow = 1)  # 27th line: skip counts from the current position
...
close(con)
You may also try scan; it is faster and gives more control.
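A minimal sketch of the same trick with scan(), reading each wanted line as a raw string; the line numbers are illustrative:
con <- file("filename")
open(con)
# as with read.table above, skip counts from the current position
line6  <- scan(con, what = character(), skip = 5,  nlines = 1, sep = "\n")
line27 <- scan(con, what = character(), skip = 20, nlines = 1, sep = "\n")
close(con)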
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to repeatedly call read.csv (reading in a new row or group of contiguous rows with each call) and then rbind the results together, as sketched below.
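A hedged sketch, assuming two blocks of 10 contiguous rows starting after rows 99 and 4999 of a headerless big.csv (all names illustrative):
blocks <- lapply(c(99, 4999), function(s)
  read.csv("big.csv", header = FALSE, skip = s, nrows = 10))
wanted <- do.call(rbind, blocks)
Each call re-reads the file from the top, which is why the connection trick above is faster when you need many scattered lines.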
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
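A sketch under that fixed-width assumption; line_length must include the line terminator, and the file name is illustrative (note the Windows caveat quoted below):
line_length <- 81                     # 80 characters + '\n'
con <- file("fixed_width.txt", "rb")  # binary mode so seek() is byte-exact
for (N in c(10, 1000, 250000)) {
  seek(con, where = (N - 1) * line_length)
  print(readLines(con, n = 1))
}
close(con)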
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many
errors in the Windows implementation of file positioning that
users are advised to use it only at their own risk, and asked not
to waste the R developers' time with bug reports on Windows'
deficiencies.
You can also use fseek from the standard C library in C, but I don't know whether the above warning also applies!
Before I was able to get an R solution, I did it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect { (rand * NUM_SEQS).to_i }
File.open("./data/uniprot_2011_02.tab") do |f|
  while line = f.gets
    print line if linenumbers.include? f.lineno
  end
end
It runs fast (as fast as my storage can read the file).
I compiled a solution based on the discussion here:
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This will only show you the number of lines but will read in nothing. If you really want to skip the blank lines, just set the last argument to TRUE.
