How to read unquoted extra \r with data.table::fread

The data I have to process has unquoted text with some additional \r characters. The files are big (500MB), numerous (>600), and changing the export is not an option. The data might look like:
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
How can this be handled with data.table's fread?
Is there a better R read CSV function for this, that's similarly performant?
Repro
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) :
Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0:
bloo,a
Advanced repro
The simple repro might be too trivial to give a sense of scale...
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit

Further to @dirk-eddelbuettel's and @nrussell's suggestions, a way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
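The same pre-processing can be pushed into the read step itself, assuming a data.table version recent enough to have fread()'s cmd argument; a minimal sketch (on Windows, substitute the full Rtools path to tr.exe as above):
library(data.table)
# Pipe the file through tr and read the cleaned stream directly
fread(cmd = "tr -d '\\r' < sample.csv")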

We can try with gsub
fread(gsub("\r\n|\r", "", csv))
# A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
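For a real file rather than an in-memory string, the same idea needs the text read in first; a minimal sketch in base R, assuming only the stray \r (and not the \r\n line endings) should be dropped so that fread() still sees line breaks:
txt <- readChar("sample.csv", file.info("sample.csv")$size, useBytes = TRUE)
# remove only carriage returns that are not part of a \r\n line ending
fread(gsub("\r(?!\n)", "", txt, perl = TRUE))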

You can also do this with tidyverse packages, if you'd like.
> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
A B C
<chr> <chr> <chr>
1 blah a 1
2 bloo a b
3 blee c d

If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.
In the following code, the file is processed by chunks:
library(data.table)
input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)
chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"
buffer <- ""
repeat {
  new_chars <- readChar(in_conn, chunk_size)
  buffer <- paste0(buffer, new_chars)
  while (grepl("[\r\n]$", buffer, perl = TRUE)) {
    next_char <- readChar(in_conn, 1)
    buffer <- paste0(buffer, next_char)
    if (!length(next_char))
      break
  }
  chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
  buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
  cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
  writeChar(cleaned, out_conn, eos = NULL)
  if (!length(new_chars))
    break
}
writeChar('\n', out_conn, eos = NULL)
close(in_conn)
close(out_conn)
result <- fread(output_csv)
Process:
If a chunk ends with a \r or \n, another character is added until it doesn't.
Quotes are put around values containing a \r which isn't adjacent to a
\n.
The cleaned chunk is added to the end of another file.
Rinse and repeat.
This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow. Larger values for chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
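A minimal sketch of that suggestion, so the connections are closed even if the loop fails part-way through (the cleaning step is elided here):
in_conn  <- file("sample.csv", "r")
out_conn <- file("out.csv", "w+")
tryCatch({
  repeat {
    new_chars <- readChar(in_conn, 1e6)
    if (!length(new_chars)) break
    # ... clean `new_chars` as in the loop above ...
    writeChar(new_chars, out_conn, eos = NULL)
  }
}, finally = {
  close(in_conn)  # runs whether or not an error occurred
  close(out_conn)
})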

Related

How to quietly change the System Locale in R [duplicate]

I'm looking to suppress the output of one command (in this case, the apply function).
Is it possible to do this without using sink()? I've found the described solution below, but would like to do this in one line if possible.
How to suppress output
It isn't clear why you want to do this without sink, but you can wrap any commands in the invisible() function and it will suppress the output. For instance:
1:10 # prints output
invisible(1:10) # hides it
Otherwise, you can always combine things into one line with a semicolon and parentheses:
{ sink("/dev/null"); ....; sink(); }
Use the capture.output() function. It works very much like a one-off sink() and, unlike invisible(), it can suppress more than just print messages. Set the file argument to /dev/null on UNIX or NUL on Windows. For example, considering Dirk's note:
> invisible(cat("Hi\n"))
Hi
> capture.output( cat("Hi\n"), file='NUL')
>
The following function should do what you want exactly:
hush = function(code){
  sink("NUL") # use /dev/null in UNIX
  tmp = code
  sink()
  return(tmp)
}
For example with the function here:
foo = function(){
  print("BAR!")
  return(42)
}
running
x = hush(foo())
will assign 42 to x but will not print "BAR!" to STDOUT.
Note that on a UNIX OS you will need to replace "NUL" with "/dev/null".
R only automatically prints the output of unassigned expressions, so just assign the result of the apply to a variable, and it won't get printed.
You can use capture.output() as below. This allows you to use the data later:
log <- capture.output({
  test <- CensReg.SMN(cc=cc, x=x, y=y, nu=NULL, type="Normal")
})
test$betas
In case anyone's arriving here looking for a solution applicable to RMarkdown, this will suppress all output:
```{r error=FALSE, warning=FALSE, message=FALSE}
invisible({capture.output({
# Your code goes here
2 * 2
# etc
# etc
})})
```
The code will run, but the output will not be printed to the HTML document
invisible(cat("Dataset: ", dataset, fill = TRUE))
invisible(cat(" Width: " ,width, fill = TRUE))
invisible(cat(" Bin1: " ,bin1interval, fill = TRUE))
invisible(cat(" Bin2: " ,bin2interval, fill = TRUE))
invisible(cat(" Bin3: " ,bin3interval, fill = TRUE))
produces output without NULL at the end of the line or on the next line
Dataset: 17 19 26 29 31 32 34 45 47 51 52 59 60 62 63
Width: 15.33333
Bin1: 17 32.33333
Bin2: 32.33333 47.66667
Bin3: 47.66667 63
Making Hadley's comment into an answer: use of the apply family without printing is possible with the plyr package.
x <- 1:2
lapply(x, function(x) x + 1)
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
plyr::l_ply(x, function(x) x + 1)
Here is a version that is robust to errors in the code to be shushed:
quietly <- function(x) {
  sink("/dev/null") # on Windows (?) instead use sink("NUL")
  tryCatch(suppressMessages(x), finally = sink())
}
This is based directly on the accepted answer, for which thanks.
But it avoids leaving output silenced if an error occurs in the quieted code.
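Hypothetical usage: the value is still returned and errors still propagate, but nothing is printed (on Windows swap in "NUL" as noted in the comment).
x <- quietly({ print("noisy"); 42 })
x
#> [1] 42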

Slow wordcloud in R

Trying to create a word cloud from a 300MB .csv file with text, but it's taking hours on a decent laptop with 16GB of RAM. Not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? Profiling with RStudio here
If not, that should be your first step.
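If you have not profiled before, a minimal base-R sketch looks like the following (assuming a small subset of your CSV saved as, say, sample_small.csv -- a hypothetical file name):
library(tm)
Rprof("profile.out")                       # start profiling
df   <- read.csv("sample_small.csv", stringsAsFactors = FALSE)
corp <- Corpus(VectorSource(df))
corp <- tm_map(corp, removeWords, stopwords("english"))
dtm  <- TermDocumentMatrix(corp)
Rprof(NULL)                                # stop profiling
head(summaryRprof("profile.out")$by.self)  # which calls used the most time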
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do contribute some additional processing time, but I feel that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr)  # for the %>% pipe used below
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(., "[[:cntrl:]]", " ") %>%
    stri_replace_all_regex(., "[[:punct:]]", " ") %>%
    stri_replace_all_regex(., "[[:space:]]+", " ") %>%                ## Replace multiple spaces with a single space
    stri_replace_all_regex(., "^[[:space:]]+|[[:space:]]+$", "") %>%  ## Remove leading and trailing spaces
    stri_replace_all_regex(., "\\b" %s+% StopWords %s+% "\\b", "", vectorize_all = FALSE) %>%  ## Remove stopwords
    stri_replace_all_regex(., "\\b" %s+% Words %s+% "\\b", NewWords, vectorize_all = FALSE)    ## Apply replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
                      New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
  DT[,CleanedText := textCleaner(Text, GlobalStopWords, GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to realize some big time savings with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
  cl <- parallel::makeCluster(NCores)
  clusterEvalQ(cl, library(magrittr))
  clusterEvalQ(cl, library(stringi))
  clusterExport(cl, list("textCleaner",
                         "GlobalStopWords",
                         "GlobalOldWords",
                         "GlobalNewWords"))
  Text <- as.character(unlist(parallel::parLapply(cl, Text,
                        fun = function(x) textCleaner(x,
                                                      GlobalStopWords,
                                                      GlobalOldWords,
                                                      GlobalNewWords))))
  parallel::stopCluster(cl)
  return(Text)
}
## Run it Parallel
system.time({
  DT[,CleanedText := parallelCleaner(Text = Text,
                                     NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mention below that Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. It turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source -- it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
  ng1 <- ngram::ngram(DT[["CleanedText"]], n = 1)
  ng2 <- ngram::ngram(DT[["CleanedText"]], n = 2)
  ng3 <- ngram::ngram(DT[["CleanedText"]], n = 3)
  pt1 <- setDT(ngram::get.phrasetable(ng1))
  pt1[,Ngrams := 1L]
  pt2 <- setDT(ngram::get.phrasetable(ng2))
  pt2[,Ngrams := 2L]
  pt3 <- setDT(ngram::get.phrasetable(ng3))
  pt3[,Ngrams := 3L]
  pt <- rbindlist(list(pt1, pt2, pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
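If the end goal is still a word cloud, the unigram phrasetable can feed wordcloud() directly; a quick sketch (get.phrasetable() returns ngrams, freq, and prop columns, as shown above):
library(wordcloud)
d <- pt[Ngrams == 1][order(-freq)][1:500]
wordcloud(trimws(d$ngrams), d$freq, random.order = FALSE, rot.per = 0.3)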
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on github here) by Zheyuan Yu -- a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6 -- that takes pre-processed text files off disk and generates a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder, ngrams-master, in my working directory that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple of tweaks to get the output format how I wanted it; if you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to Github. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made -- live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
  DT[,CleanedText := textCleaner(Text, GlobalStopWords, GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]], GramSubset[["Frequency"]], color = GramSubset[["Frequency"]],
          rot.per = 0.3, font.main = 1, col.main = "cornsilk3", cex.main = 1.5)

Different spacing while printing to log

I am printing the importance matrix of xgboost into a log using the write command (write works well with a file connection directed to stderr). Here is the command I am using:
importance_matrix <- xgb.importance(names, model=bst)
write("The top 30 variables are:",stderr())
write(paste0("Feature",'\t','\t','Gain','\t','Cover','\t','Frequency'),stderr())
write(t(as.matrix(importance_matrix[1:30,])),sep="\t",ncolumns = length(names(importance_matrix)),stderr())
Output comes in format:
Feature Gain Cover Frequency
pctTillDate 0.560359696 0.1314074664 0.024278250
colr_per 0.183149483 0.0962457545 0.049618673
date 0.050528297 0.1143752021 0.066395735
GREG_D 0.025648433 0.0381476142 0.018070143
LNGTD_I 0.020346020 0.0485235001 0.101322109
LATTD_I 0.019241497 0.0421892270 0.093867103
which makes it look a bit clumsy (much clumsier in the log than it appears here on SO). So in order to make it look better, I want to change the last line, t(as.matrix(importance_matrix[1:30,])),sep="\t", such that the first sep is two tabs ('\t','\t') and the rest a single tab ('\t'), instead of the current uniform spacing. It sounds simple, but searching doesn't give any ideas. Any suggestions?
Consider padding the column names and the first character column of the matrix with whitespace, aligning each to the largest character width of the first column:
write.table(importance_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710
new_matrix <- importance_matrix
# FIRST COLUMN LARGEST CHAR LENGTH
charmax <- max(nchar(new_matrix[,1]))
# PAD COLUMN HEADERS
colnames(new_matrix) <- lapply(1:ncol(new_matrix), function(i)
  paste0(colnames(new_matrix)[i],
         paste(rep(" ", charmax - nchar(colnames(new_matrix)[i])), collapse=""))
)
# PAD FIRST COLUMN
new_matrix[,1] <- sapply(1:nrow(new_matrix), function(i)
  paste0(new_matrix[i,1],
         paste(rep(" ", charmax - nchar(new_matrix[i,1])), collapse=""))
)
write.table(new_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710
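Since the original code sends everything to the log via stderr(), the padded table can go the same way -- write.table() accepts a connection as its file argument:
write.table(new_matrix, file = stderr(), sep = "\t", row.names = FALSE, quote = FALSE)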

R split text on empty line

I have a very long file that looks like this :
"Ach! Hans, Run!"
2RRGG
Enchantment
At the beginning of your upkeep, you may say "Ach! Hans, run! It's the . . ." and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn.
UNH-R
A Display of My Dark Power
Scheme
When you set this scheme in motion, until your next turn, whenever a player taps a land for mana, that player adds one mana to his or her mana pool of any type that land produced.
ARC-C
AErathi Berserker
2RRR
Creature -- Human Berserker
2/4
Rampage 3 (Whenever this creature becomes blocked, it gets +3/+3 until end of turn for each creature blocking it beyond the first.)
LE-U
AEther Adept
1UU
Creature -- Human Wizard
2/2
When AEther Adept enters the battlefield, return target creature to its owner's hand.
M11-C, M12-C, DDM-C
...
I'd like to load this file into a data.frame or vector "oracle", split on each empty line (actually a space and a newline), so that
oracle[1]
gives output like
"Ach! Hans, Run!" 2RRGG Enchantment At the beginning of your upkeep, you may say "Ach! Hans, run! It's the . . ." and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn. UNH-R
I've tried code like
oracle <- read.table(file = "All Sets.txt", quote = "", sep="\n")
as well as scan(), but
oracle[1]
gives very long, undesired output.
Thanks!
Try this, based on your edited question:
oracle <- readLines("BenYoung2.txt")
nvec <- length(oracle)
breaks <- which(! nzchar(oracle))
nbreaks <- length(breaks)
if (breaks[nbreaks] < nvec) {
  breaks <- c(breaks, nvec + 1L)
  nbreaks <- nbreaks + 1L
}
if (nbreaks > 0L) {
  oracle <- mapply(function(a,b) paste(oracle[a:b], collapse = " "),
                   c(1L, 1L + breaks[-nbreaks]),
                   breaks - 1L)
}
oracle[1]
# [1] "\"Ach! Hans, Run!\" 2RRGG Enchantment At the beginning of your upkeep, you may say \"Ach! Hans, run! It's the . . .\" and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn. UNH-R"
Edit: though this works fine if you always have truly empty lines as breaks, you can use this line instead to handle lines that contain only white-space:
breaks <- which(grepl("^[[:space:]]*$", oracle))
This gives the same results when the lines are truly empty.
I think it's easiest to build a new variable that says which group the line belongs in, then group by that and call paste. In base R:
lines <- readLines(textConnection(txt))
i <- cumsum(lines == '')
by(lines, i, paste, collapse='\n')
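A variant of the same grouping idea that returns a plain character vector instead of a by object (a sketch reusing lines and i from above):
oracle <- vapply(split(lines, i),
                 function(x) paste(x[nzchar(x)], collapse = " "),
                 character(1))
oracle[1]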
The most straightforward way to do this is to first split on line breaks (i.e. \n), then throw away the empty lines.
text = "line1
line2

line3
"
split1 = unlist(strsplit(text, "\n"))
filter = split1[split1 != ""]
# [1] "line1" "line2" "line3"

Fast reading (by chunk?) and processing of a file with dummy lines at regular interval in R

I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info).
For example:
library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append=T)
  z = matrix(runif(nx*ny), nrow = ny) # random numbers with dim(nx, ny)
  write.fwf(z, myfile, append=T, rownames=F, colnames=F) # write in fixed width format
}
With nx=5 and ny=2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given the output is regular, I thought readr would be a good idea (?).
The only way I can think of, is to do it manually by chunks in order to eliminate the useless info lines:
library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
  z = as.vector(t(z))
  ifirst = (i-1)*ny*nx + 1 # appropriate index
  ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in DF for future analysis:
x = rep(rep(seq(1:nx), ny), niter)
y = rep(rep(seq(1:ny), each=nx), niter)
myDF = data.frame(x=x, y=y, z=ztot)
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there no reading function accepting a vector with precise locations as skip argument, rather than a single number of initial rows?
PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...
EDIT
The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:
bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines]             # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # then process output from readLines
library(data.table)
ztot <- fread(asOneChar)
ztot <- c(t(ztot))
Discussion on how to process the results from readLines can be found here.
Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:
tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
Lines can be removed on the basis of a pattern or of indices for example.
This was suggested by @Roland here.
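With a reasonably recent data.table, the intermediate file can be skipped by letting fread() run the filter itself through its cmd argument (same awk pattern as above); a sketch:
library(data.table)
ztot <- fread(cmd = paste("awk '!/is the current iteration/'", myfile))
ztot <- c(t(ztot))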
Not sure if I understood your problem correctly. Running your script created a file with 1310 lines, with "This is iteration 1", "This is iteration 2", and so on printed at the following lines:
Line 1: This is iteration 1
Line 132: This is iteration 2
Line 263: This is iteration 3
Line 394: This is iteration 4
Line 525: This is iteration 5
Line 656: This is iteration 6
Line 787: This is iteration 7
Line 918: This is iteration 8
Line 1049: This is iteration 9
Line 1180: This is iteration 10
Now there is data between these lines that you want to read, while skipping these 10 strings.
You can do this by tricking read.table: set comment.char = "T", which will make read.table think all lines starting with the letter "T" are comments and skip them.
data<-read.table("bigFile.txt",comment.char = "T")
This will give you a data.frame of 1300 observations of 150 variables.
> dim(data)
[1] 1300 150
For non-consistent strings, read your data with read.table with the fill=TRUE flag. This will not break your input process.
data<-read.table("bigFile.txt",fill=TRUE)
Your data looks like this
> head(data)
V1 V2 V3 V4 V5 V6 V7
1: 1.0000000 is the current iteration NA NA
2: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
3: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
4: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
5: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
6: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
Now you can see how the strings are distributed across the columns, so you can simply subset your data set with pattern matching, dropping the rows whose columns match these strings. For example:
library(data.table)
data<-as.data.table(data)
cleaned_data<-data[!(V3 %like% "the"),]
> head(cleaned_data)
V1 V2 V3 V4 V5 V6 V7
1: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
2: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
3: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
4: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
5: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
6: 0.3943193 0.508373900 0.2131134905 0.92474343 0.432134031 0.4585807 0.9811607
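One caveat worth adding: with fill=TRUE the text rows can force columns to be read as character (or factor), so after dropping those rows you may need to convert back to numeric; a sketch:
cleaned_data <- cleaned_data[, lapply(.SD, function(x) as.numeric(as.character(x)))]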
