I use the following (pseudo) code to import huge CSV files in R when only a rather small portion of the data in the file has to be treated/imported and all other rows should simply be ignored. I don't put the data in memory but in a SQLite database created by the program.
library(sqldf)

suppressWarnings(read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
    where
      . . .
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  field.types = list( . . . ),
  dbname = sqlite_db_path,
  drv = "SQLite"
))
If I understand the documentation of read.csv.sql correctly, the WHERE clause in the CREATE TABLE statement above guarantees that only rows satisfying its condition will be inserted into the created table mytable. However, last night during a test I noticed a somewhat strange behaviour in my program. I had to import a big CSV file (more than 100 000 000 rows), but only a set of 10 000 rows was the target and all other rows had to be ignored. The above program did the job, and at the end the created table mytable had only 10 000 rows as expected, which proves that the condition in the WHERE clause was taken into account. However, when I checked the size of the created SQLite database, it was abnormally huge (more than 25 GB). That cannot possibly be the size of 10 000 rows.
Therefore, I'm confused. What happened during the import process? Why did the SQLite database file become so huge despite the fact that only 10 000 rows were inserted into mytable and everything else was ignored? Is this the normal behaviour, or am I not using read.csv.sql correctly?
Edit: this is an elaboration on the suggestions @G. Grothendieck made in the comment section above.
--
I do not have experience with read.csv.sql, but you might want to take a look at importing the output from a shell-command. This way, you preprocess the csv, and only import the result.
For example, data.table::fread(cmd = 'findstr "^[^#]" test.csv', sep = "\n", header = FALSE) imports all lines from the csv test.csv that do not start with a #-character.
So it imports the output of the (Windows) command findstr "^[^#]" test.csv into a data.table, using all the power of data.table::fread() and its arguments. This will also work on *nix environments, but you'll have to use the appropriate shell commands (like grep).
Note that an approach like this might work very well (and fast!), but it ties your code to OS-specific shell commands, so it will not run unchanged across operating systems.
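On a *nix system the equivalent call might look like this (a sketch; the grep pattern mirrors the findstr one above):

library(data.table)
# keep only lines that do not start with a '#' character
dt <- fread(cmd = 'grep "^[^#]" test.csv', sep = "\n", header = FALSE)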
Thanks to @G. Grothendieck for his clarification:
It first uses dbWriteTable to write the file to the database. Then it operates on that. You could consider using grep/findstr/xsv to extract out the rows of interest, piping them directly into R: read.table(pipe("grep ..."), sep = ",") or, on Windows, findstr or the xsv or another csv utility instead of grep.
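Applied to the question above, that suggestion might look roughly like this (a sketch; '^TARGET' is a hypothetical grep pattern standing in for whatever identifies your 10 000 rows, and the separator is taken from the question):

# filter the rows of interest before they ever reach R or SQLite
small_df <- read.table(
  pipe(paste("grep '^TARGET'", input_file_path)),
  sep = "|", header = FALSE
)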
I am looking for a way to concatenate a string or a number (at least 3 digits) into the name of a save file.
For instance, in Python I can use '%s' % [format(str), format(number)] and add it to the csv file name with a random generator.
How do I generate a random number or string and put it into a file-name format in R?
This is my save file, and I want to add a random string or number at the end of the file name:
file = paste(path, 'group1N[ADD FORMAT HERE].csv',sep = '')
file = paste(path, 'group1N.csv', sep = '') to become -->
file = paste(path, 'group1N212.csv', sep = '') or file = paste(path, 'group1Nkut.csv', sep = '')
after using a random generator of strings or numbers and appending the result to the .csv file name each time it is saved, as a randomly generated suffix.
You could use the built-in tempfile() function:
tempfile(pattern="group1N", tmpdir=".", fileext=".csv")
[1] "./group1N189d494eaaf2ea.csv"
(if you don't specify tmpdir the results go to a session-specific temporary directory).
This won't write over existing files; given that there are 14 hex digits in the random component, I think the "very likely to be unique" in the description is an understatement ... (i.e. at a rough guess the probability of collision might be something like 16^(-14) ...)
The names are very likely to be unique among calls to ‘tempfile’
in an R session and across simultaneous R sessions (unless
‘tmpdir’ is specified). The filenames are guaranteed not to be
currently in use.
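If you would rather control the suffix yourself (say, always exactly three digits, or a short random string), a sketch along these lines should also work; path and the group1N prefix are taken from the question:

# three random digits, zero-padded (e.g. "007" or "212")
suffix <- sprintf("%03d", sample(0:999, 1))
file <- paste0(path, "group1N", suffix, ".csv")

# or a short random lowercase string (e.g. "kut")
suffix <- paste(sample(letters, 3, replace = TRUE), collapse = "")
file <- paste0(path, "group1N", suffix, ".csv")

Unlike tempfile(), this does not check whether a file with that name already exists, so collisions are possible if you generate many names.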
Can you help with converting a big text file?
Sample of the text:
X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y""1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"19"II19II"K2LE1-109-1091-19"II2008-11-13II"109"II1091II1II2010-10-19II37
with separator "II" to a Dataframe.
i have used :
df_BSt7<-readLines("Komponente_K2LE1.txt")
df_BST7<-str_replace_all(df_BSt7,"II",",")
df_BST7<-read.table(df_BST7,sep = ",")
head(df_BST7)
but I am always getting an error:
could not allocate memory (206 Mb) in C function 'R_AllocStringBuffer'
and when I call head() I am getting:
'"X1","ID_Sitze.x","Produktionsdatum.x","Herstellernummer.x","Werksnummer.x","Fehlerhaft.x","Fehlerhaft_Datum.x","Fehlerhaft_Fahrleistung.x","ID_Sitze.y","Produktionsdatum.y","Herstellernummer.y","Werksnummer.y","Fehlerhaft.y","Fehlerhaft_Datum.y","Fehlerhaft_Fahrleistung.y""1",1,"K2LE1-109-1091-2",2008-11-12,"109",1091,1,2010-10-18,37080,NA,NA,NA,NA,NA,NA,NA"2",2,"K2LE1-109-1091-1",2008-11-12,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"3",3,"K2LE1-109-1091-12",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"4",4,"K2LE1-109-1091-5",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"5",5,"K2LE1-109-1091-40",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"6",6,"K2LE1-109-1091-15",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"7",7,"K2LE1-109-1091-31",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"8",8,"K2LE1-109-1091-6",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"9",9,"K2LE1-109-1091-8",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"10",10,"K2LE1-109-109 [... abgeschnitten]
So, there are several possible problems, some of which might be specific to your example.
Clean example data
First, let's take a look at your example data. In what you provide, there are no newlines; everything is on a single line. Is that the case in the original "Komponente_K2LE1.txt" file? If yes, we will need some more work to find where to add newlines (see below).
The first column name, X1, only has a quote on the right. It can't work without the quote on the left, so it should start with "X1"II"ID_Sitze.x".
The saved data frame has 16 columns, presumably because there is a row number at the beginning of each row which is not in the header. So we can add an additional column header to get 16 of them:
"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"
Then we have a small problem with line 19, which is truncated. I assume this comes from your copy/paste and is not a problem in the full file, so let's ignore it for now. This leaves me with the following text:
raw_lines <- '"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y"
"1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA
"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA'
Now you are replacing "II" with "," and reading the result with read.table(), which is perfectly correct, except that read.table() assumes you're giving it a file name and throws an error because it can't open that connection (that file). To make it work you need to pass the string via the text argument:
df_BST7<-str_replace_all(raw_lines,'II',",")
df_BST7 <- read.table(text = df_BST7,sep = ",")
So now that does run on my computer.
Side note: since you're already using the tidyverse, you could just as well use this equivalent code instead:
df_BST7 <- str_replace_all(raw_lines,'II',",")
df_BST7 <- read_csv(df_BST7)
which will come in handy with the chunked reading mentioned below.
The error message
Now the error message you get suggests a memory problem. I see two possibilities: either the table is so big it can't fit in your computer's memory, or your whole input table really is on a single line, which makes one very long line that won't fit in memory.
Whole table too big
I don't think it's the problem here, but just in case, check how big the file on disk is, how much memory is free on your computer, and whether you could free up enough memory by closing a few other programs. You could also save your modified text to disk, delete it from R's memory with rm(df_BSt7), and then load it directly from disk into df_BST7; since the raw text fits in memory, that should work. If memory is a challenge, you can replace read_csv() with read_csv_chunked() and process one chunk at a time, as sketched below.
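A minimal sketch of that chunked approach, assuming the cleaned comma-separated text has been written to disk and that you only want, say, the rows where Fehlerhaft.x equals 1 (the file name and the filter are assumptions):

library(readr)
# keep only the rows we care about from each chunk; pos is the chunk's starting row
keep_faulty <- function(chunk, pos) chunk[chunk$Fehlerhaft.x == 1, ]
df_small <- read_csv_chunked(
  "Komponente_K2LE1_clean.csv",
  callback = DataFrameCallback$new(keep_faulty),
  chunk_size = 100000
)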
All on one line
I think this is the most likely. Again, there are two possibilities.
Missing carriage return
Line breaks can actually be encoded in two ways: Unix-like systems (macOS and GNU/Linux) use a single newline character (\n), whereas Windows uses a carriage return followed by a newline (\r\n). I'm not sure how this could create problems inside R, but if your file was generated on a Unix-like system and you're trying to read it on Windows, that's a possible explanation. The goal would then be to replace \n with \r\n.
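If that turned out to be the issue, one way to convert the file (a sketch, assuming the whole file fits in memory as a single string) could be:

# read the whole file as one string, normalise line endings, write it back out
txt <- readChar("Komponente_K2LE1.txt", file.size("Komponente_K2LE1.txt"))
txt <- gsub("\r?\n", "\r\n", txt)
writeChar(txt, "Komponente_K2LE1_crlf.txt", eos = NULL)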
No line breaks at all
If there is absolutely no line break, neither \r nor \n, then we need to guess where the breaks should go. On a Unix system you could try awk or sed, but there are also ways to do it in R. The following code should work, except that the last column will need some cleaning up afterwards:
raw_lines2 <- str_remove_all(raw_lines2, "\r")
all_fields <- raw_lines2 %>%
str_split("II") %>%
unlist()
nb_lines <- (length(all_fields) - 1)/15
reconstruct_lines <- map_chr(0:(nb_lines-1), ~ paste(all_fields[(2+15*.):(16+15*.)], collapse = ",")) %>%
paste(collapse = "\n")
cat(reconstruct_lines)
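The reconstructed text can then be read just like before (a sketch; the header row and the last column will still need fixing):

df_BST7 <- read.table(text = reconstruct_lines, sep = ",")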
[This is kind of multiple bug-reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R; so apologies if I'm not following best practices in my code below. I'm trying.]
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
First I want to report that using rbindlist to append large data.tables causes R to segfault (ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from within R from CRAN). The machine has 128 GiB of RAM; so, I shouldn't be running out of memory given the size of the data.
My code:
append.tables <- function(files) {
moves.by.year <- lapply(files, fread)
move <- rbindlist(moves.by.year)
rm(moves.by.year)
move[,week_end := as.Date(as.character(week_end), format="%Y%m%d")]
return(move)
}
Crash message:
append.tables crashes with this:
> system.time(move <- append.tables(files))
*** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'
Traceback:
1: rbindlist(moves.by.year)
2: append.tables(files)
3: system.time(move <- append.tables(files))
There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.
2. Could fread accept multiple file names?
In any case, I think a better approach here would be to allow fread to take files as a vector of file names:
files <- c("my", "files", "to be", "appended")
dt <- fread(files)
Presumably you could be much more memory efficient under the hood, since you wouldn't have to keep all of these intermediate objects around at the same time, as appears to be necessary for a user of R.
3. colClasses gives an error message
My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:
dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
Yes, in the case of dates, a simple:
dt[,date := as.Date(as.character(date), format="%Y%m%d")]
works.
However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
Now, I can get around this with some system() calls to append the files and pipe them through some sed magic (simplified here) (where tfile is another temporary file):
if (has_header) {
tfile2 <- tempfile()
system(paste("echo fakeline >>", tfile2))
system(paste("head -q -n1", files[[1]], ">>", tfile2))
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=wait)
unlink(tfile2)
} else {
system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}
but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)
4. fread thinks named pipes are empty files
I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe (system(paste("mkfifo", tfile))), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999
In any case, this is a bit of a hack.
Simplified Code if I had my wish list
Ideally, I would be able to do something like this:
setClass("Int_Price")
setAs("character", "Int_Price",
function (from) {
return(as.integer(gsub("\\.", "", from)))
}
)
dt <- fread(files, colClasses=list(price="Int_Price"))
And then I'd have a nice long data.table with properly coerced data.
Update: The rbindlist bug has been fixed in commit 1100 v1.8.11. From NEWS:
o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.
As mentioned in comments, you're supposed to ask separate questions separately. But since they're such good points and they link together into the wish at the end, OK, I'll answer them in one go.
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
Please run again, changing the line:
moves.by.year <- lapply(files, fread)
to
moves.by.year <- lapply(files, fread, verbose=TRUE)
and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
2. Could fread accept multiple file names?
Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.
I'll file this as a feature request and post the link here.
3. colClasses gives an error message
When colClasses is a list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier than colClasses in read.csv, which only accepts a vector, to save repeating "character" over and over. I could have sworn this was better documented in ?fread, but it seems not. I'll take a look at that.
So, instead of
fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
the correct syntax is
fread(tfile, colClasses=list(myDate="date"))
Given what you go on to say in the question, if I understand correctly, you actually want:
fread(tfile, colClasses=list(character="date")) # just fread accepts list
or
fread(tfile, colClasses=c("date"="character")) # both read.csv and fread
Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I've still to implement that coercion automatically. You mentioned precision of numeric, so just a reminder that integer64 can be read directly by fread too.
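For the price column from the wish list, the follow-up manipulation might look like this sketch (the column name price is taken from the question; bit64 is only needed if the values overflow 32-bit integers):

# read the column as character, then strip the decimal point before coercing
dt <- fread(tfile, colClasses = c(price = "character"))
dt[, price := as.integer(gsub(".", "", price, fixed = TRUE))]
# or, for very large values: dt[, price := bit64::as.integer64(gsub(".", "", price, fixed = TRUE))]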
4. fread thinks named pipes are empty files
Hopefully this goes away now, assuming the previous point is resolved? fread works by memory-mapping its input. It can accept non-files such as http addresses and connections (tbc), and what it does first, for convenience, is write the complete input to ramdisk so it can map the input from there. The reason fread is fast goes hand in hand with seeing the entire input first.
I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out these lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection AND open it before read.table:
con<-file('filename')
open(con)
read.table(con,skip=5,nrow=1) #6-th line
read.table(con,skip=20,nrow=1) #27-th line
...
close(con)
You may also try scan; it is faster and gives more control.
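A sketch that generalizes the open-connection trick to an arbitrary sorted vector of line numbers (the file name and line numbers are placeholders):

wanted <- c(6, 27, 344)                  # must be sorted increasingly
con <- file('filename')
open(con)
skips <- diff(c(0, wanted)) - 1          # lines to skip before each wanted line
rows <- lapply(skips, function(s) read.table(con, skip = s, nrow = 1))
close(con)
result <- do.call(rbind, rows)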
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument of read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to call read.csv repeatedly (reading in a new row or group of contiguous rows with each call) and then rbind the results together; a sketch follows.
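For example, a sketch of that repeated skip/nrows approach (file name and line numbers are placeholders); note that every call rescans the file from the top, so the open-connection trick above is faster:

wanted <- c(10, 5000, 123456)            # hypothetical line numbers
pieces <- lapply(wanted, function(n)
  read.csv("bigfile.csv", skip = n - 1, nrows = 1, header = FALSE))
result <- do.call(rbind, pieces)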
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many
errors in the Windows implementation of file positioning that
users are advised to use it only at their own risk, and asked not
to waste the R developers' time with bug reports on Windows'
deficiencies.
You can also use 'seek' from the standard C library in C, but I don't know if the above warning also applies!
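In R, the fixed-record-length idea might look like this sketch (the record length and file name are assumptions; each record is line_length bytes including the newline):

line_length <- 81                        # assumed: 80 characters plus "\n" per line
wanted <- c(10, 5000, 123456)            # hypothetical line numbers
con <- file("fixedwidth.txt", "rb")
lines <- vapply(wanted, function(n) {
  seek(con, where = (n - 1) * line_length)
  sub("\n$", "", readChar(con, line_length))   # read one record, drop trailing newline
}, character(1))
close(con)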
Before I was able to get an R solution/answer, I did it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect{(rand * NUM_SEQS).to_i}
File.open("./data/uniprot_2011_02.tab") do |f|
while line = f.gets
print line if linenumbers.include? f.lineno
end
end
It runs fast (as fast as my storage can read the file).
I compiled a solution based on the discussions here.
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This will only show you the number of lines but will read in nothing. If you really want to skip the blank lines, just set the last argument to TRUE.