How to simultaneously read and write a file line by line in R?

I would like to remove all lines from a file that start with a certain pattern, using R. Since the file can be huge, it is better not to first read the whole file, then remove all matching lines, and afterwards write the whole file back. I am thus wondering whether I can have both a read and a write connection (open all the time, one at a time?) to the same file. The following shows the idea (but 'hangs' and thus fails).
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()
## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
while(TRUE) {
    rcon <- file(fnm, "r") # read connection
    line <- readLines(rcon, n = 1) # read one line
    close(rcon)
    if(length(line) == 0) { # end of file
        break
    } else {
        if(!grepl(pat, line)) {
            wcon <- file(fnm, "w")
            writeLines(line, con = wcon)
            close(wcon)
        }
    }
}
Note:
1) See here for an answer for the case where one writes to a new file. One could then delete the old file and rename the new one to the old one, but that does not seem very elegant :-). (A minimal sketch of that approach appears after note 2 below.)
2) Update: The following MWE produces
Hello
world
-
world
See:
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()
## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
con <- file(fnm, "r+") # read and write connection
while(TRUE) {
    line <- readLines(con, n = 1L) # read one line
    if(length(line) == 0) break # end of file
    if(!grepl(pat, line))
        writeLines(line, con = con)
}
close(con)
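For completeness, here is a minimal sketch (not from the original post) of the write-to-a-new-file-then-rename approach mentioned in note 1; it streams line by line, so the whole file never has to be held in memory:
## filter 'fnm' into a temporary file in the same directory, then swap it in
tmp  <- paste0(fnm, ".tmp")
rcon <- file(fnm, "r")
wcon <- file(tmp, "w")
while (length(line <- readLines(rcon, n = 1)) > 0) {
    if (!grepl(pat, line)) writeLines(line, wcon) # keep only non-matching lines
}
close(rcon)
close(wcon)
file.rename(tmp, fnm) # replace the original with the filtered copy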

I think you just need open = 'r+'. From ?file:
Modes
"r+", "r+b" -- Open for reading and writing.
I don't have your sample file, so I'll instead use the following minimal example: take a file with a-z on 26 lines and replace them one by one with A-Z:
tmp = tempfile()
writeLines(letters, tmp)
f = file(tmp, 'r+')
while (TRUE) {
    l = readLines(f, n = 1L)
    if (!length(l)) break
    writeLines(LETTERS[match(l, letters)], f)
}
close(f)
readLines(tmp) afterwards confirms this worked.

I understand you want to use R, but just in case you're not aware, there are some really simple scripting tools that excel at this type of task. E.g. gawk is designed for pretty much exactly this type of operation and is simple enough to learn that you could write a script for this within minutes, even without any prior knowledge.
Here's a one-liner to do this in gawk (note that the in-place editing used here is a gawk extension, so a plain awk will not do):
gawk -i inplace '!/^pat/ {print}' foo.txt
Of course, it is trivial to do this from within R using
system(paste0("gawk -i inplace '!/^", pat, "/ {print}' ", fnm))
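An alternative, not from the original answer, is to build the call with system2() so the pattern does not have to be stitched into one shell string by hand (this assumes gawk 4.1+ is on the PATH, since -i inplace is a gawk extension):
prog <- paste0("!/^", pat, "/") # awk program: print only the non-matching lines
system2("gawk", args = c("-i", "inplace", shQuote(prog), fnm)) # shQuote protects the spaces in 'pat'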

Related

Writing to file in R one line after the other

I have the following piece of code to write to a file from R, one line at a time:
for (i in c(1:10)){
    writeLines(as.character(i), file("output.csv"))
}
It just writes 10, presumably over-writing the previous lines. How do I make R append the new line to the existing output? append = TRUE does not work.
append = TRUE does work when using the function cat (instead of writeLines), but only if you give cat a file name, not a file object: whether a file is appended to or overwritten is a property of the file object itself, i.e. it needs to be specified when the file is opened.
Thus both of these work:
f = file('filename', open = 'a') # open in “a”ppend mode
for (i in 1 : 10) writeLines(as.character(i), f)
close(f)
for (i in 1 : 10) cat(i, '\n', file = 'filename', sep = '', append = TRUE)
Calling file manually is almost never necessary in R.
… but as the other answer shows, you can (and should!) avoid the loop anyway.
You won't need a loop. Use the newline escape character \n as the separator instead.
vec <- c(1:10)
writeLines(as.character(vec), file("output.csv"), sep="\n")
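As a side note (not part of the original answers), writeLines also accepts a file path directly, and "\n" is already the default separator, so the connection object can be dropped entirely:
vec <- 1:10
writeLines(as.character(vec), "output.csv") # writeLines opens and closes the file itself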

R: Read single file from within a tar.gz directory

Consider a tar.gz file of a directory which contains a lot of individual files.
From within R I can easily extract the names of the individual files with this command:
fileList <- untar("my_tar_dir.tar.gz", list = TRUE)
Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to disk)?
It is possible, but I don't know of any clean implementation (one may exist). Below is some very basic R code that should work in many cases (e.g. file names, with their full path inside the archive, should be less than 100 characters). In a way, it's just re-implementing "untar" in an extremely crude way, but such that it points to the desired file inside the gzipped archive.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ <- function(archname){
    # open tgz archive
    tf <- gzfile(archname, open='rb')
    on.exit(close(tf))
    fnames <- list()
    offset <- 0
    nfile <- 0
    while (TRUE) {
        # go to beginning of entry
        # never use "seek" to re-locate in a gzipped file!
        if (seek(tf) != offset) readBin(tf, what="raw", n= offset - seek(tf))
        # read file name
        fName <- rawToChar(readBin(tf, what="raw", n=100))
        if (nchar(fName)==0) break
        nfile <- nfile + 1
        fnames <- c(fnames, fName)
        attr(fnames[[nfile]], "offset") <- offset+512
        # read size, first skip 24 bytes (file permissions etc)
        # again, we only use readBin, not seek()
        readBin(tf, what="raw", n=24)
        # file size is encoded as a length 12 octal string,
        # with the last character being '\0' (so 11 actual characters)
        sz <- readChar(tf, nchars=11)
        # convert the octal string to a number of bytes
        sz <- sum(as.numeric(strsplit(sz,'')[[1]])*8^(10:0))
        attr(fnames[[nfile]], "size") <- sz
        # cat(sprintf('entry %s, %i bytes\n', fName, sz))
        # go to the next entry
        # don't forget the entry header (= 512 bytes)
        offset <- offset + 512*(ceiling(sz/512) + 1)
    }
    # return a named list of character strings with attributes
    names(fnames) <- fnames
    return(fnames)
}
This will give you the exact position and length of all files in the tar.gz archive.
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
    # this function returns a raw vector
    # containing the desired file
    fp <- ParseTGZ(archfile)
    offset <- attributes(fp[[filename]])$offset
    fsize <- attributes(fp[[filename]])$size
    gzf <- gzfile(archfile, open="rb")
    on.exit(close(gzf))
    # jump to the byte position, don't use seek()
    # may be a bad idea on really large archives...
    readBin(gzf, what="raw", n=offset)
    # now read the data into a raw vector
    result <- readBin(gzf, what="raw", n=fsize)
    result
}
Now, finally:
ff <- rawConnection(extractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
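A brief usage sketch, not from the original answer (the archive and inner file names are made up), showing how the in-memory connection can then be consumed:
ff <- rawConnection(extractTGZ("myarchive.tar.gz", "mydir/myfile.csv")) # hypothetical names
dat <- read.csv(ff) # regular readers treat the raw connection like a file
close(ff)           # a rawConnection must be closed explicitly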
One can read a csv file from within an archive using library(archive) as follows. This should be a lot more elegant than the currently accepted answer; the package also supports all major archive formats ('tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz') and it works on all platforms:
library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())
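As far as I know, archive_read also accepts the file's name inside the archive instead of its index, and archive() lists the contents; a hedged sketch, with a made-up inner path:
library(archive)
library(readr)
archive("my_tar_dir.tar.gz") # list the entries (path, size, date) of the archive
read_csv(archive_read("my_tar_dir.tar.gz", file = "my_tar_dir/data.csv"), # hypothetical path
         col_types = cols())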

Extract the line with the same content from two files in R

I would like to use the readLines function to read the text file 69C_t.txt line by line.
Also, I would like to write a simple for loop with a condition to extract the identical lines from the two files.
t_file <- "69C_t.txt"
conn <- file(t_file, open = "r")
t <- readLines(conn)
close(conn)
b_file <- "69C_b.txt"
conn <- file(b_file, open = "r")
b <- readLines(conn)
for (i in 1:length(t)){
    for (j in 1:length(b)){
        if (t[i] == b[j])
            write(t[i], file = "overlap.txt")
    }
}
close(conn)
However, it seems to only print out the first line.
Can someone please check?
A faster approach, instead of the loop, would be:
writeLines(t[t %in% b],"overlap.txt")
How about adding append = TRUE to the write call:
write(t[i], file = "overlap.txt", append = TRUE)
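Putting the two answers together, a minimal end-to-end version without the explicit double loop might look like this (assuming both files fit comfortably in memory):
t <- readLines("69C_t.txt") # lines of the first file
b <- readLines("69C_b.txt") # lines of the second file
writeLines(t[t %in% b], "overlap.txt") # keep only the lines of t that also occur in b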

Can I read a file consecutively without rewinding to the beginning in R?

Hello experts,
I am trying to read in a large file in consecutive blocks of 10000 lines, because the file is too large to read in at once. The "skip" argument of read.csv comes in handy to accomplish this task (see below). However, I noticed that the program starts slowing down towards the end of the file (i.e. for large values of i). I suspect this is because each call to read.csv(file, skip=nskip, nrows=block) always starts reading the file from the beginning until the required starting line "skip" is reached. This becomes increasingly time-consuming as i increases.
Question: Is there a way to continue reading a file starting from the last location that was reached in the previous block?
numberOfBlocksInFile <- 800
block <- 10000
for (i in 1:(numberOfBlocksInFile - 1))
{
    print(i)
    nskip <- i*block
    out <- read.csv(file, skip = nskip, nrows = block)
    colnames(out) <- names
    .....
    print("keep going")
}
many thanks (:-
One way is to use readLines with a file connection. For example, you could do something like this:
temp.fpath <- tempfile()                  # create a temp file for this demo
d <- data.frame(a=letters[1:10], b=1:10)  # sample data, 10 rows; we'll read 5 at a time
write.csv(d, temp.fpath, row.names=FALSE) # write the sample data
f.cnxn <- file(temp.fpath, 'r')           # open a new connection
fields <- readLines(f.cnxn, n=1)          # read the header, which we'll reuse for each block
block.size <- 5
repeat { # keep reading and printing 5-row chunks until you reach the end of the connection
    block.text <- readLines(f.cnxn, n=block.size)  # read chunk
    if (length(block.text) == 0)                   # if there's nothing left, leave the loop
        break
    block <- read.csv(text=c(fields, block.text))  # parse the chunk, reattaching the header
    print(block)
}
close(f.cnxn)
file.remove(temp.fpath)
Another option is to use fread from the data.table package.
N <- 1e6 ## about 1 second to read 1e6 rows / 10 cols
skip <- N
DT <- fread("test.csv", nrows = N)
repeat {
    if (nrow(DT) < N) break
    DT <- fread("test.csv", nrows = N, skip = skip)
    ## here use DT for your processing
    skip <- skip + N
}
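Not part of the original answers, but if the repeated skip ends up re-scanning the file, a possible hybrid (a sketch, assuming the data.table package and a csv with a header row) keeps one connection open as in the first answer and hands each block of lines to fread via its text argument:
library(data.table)
con <- file("test.csv", "r")
header <- readLines(con, n = 1)        # keep the header so every chunk parses identically
N <- 1e6                               # lines per block
repeat {
    lines <- readLines(con, n = N)     # read the next block of raw lines
    if (length(lines) == 0) break      # end of file reached
    DT <- fread(text = c(header, lines)) # parse only this block
    ## here use DT for your processing
}
close(con)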

Troubleshooting R mapper script on Amazon Elastic MapReduce - Results not as expected

I am trying to use Amazon Elastic MapReduce to run a series of simulations of several million cases. This is an Rscript streaming job with no reducer; I am using the Identity Reducer in my EMR call, --reducer org.apache.hadoop.mapred.lib.IdentityReducer.
The script works fine when tested and run locally from the command line on a Linux box, passing one line of input manually (echo "1,2443,2442,1,5" | ./mapper.R), and I get the one line of results that I am expecting. However, when I tested my simulation using about 10,000 cases (lines) from the input file on EMR, I only got output for a dozen or so of the 10k input lines. I've tried several times and I cannot figure out why. The Hadoop job runs fine without any errors. It seems like input lines are being skipped, or perhaps something is happening with the Identity reducer. The results are correct for the cases where there is output.
My input file is a csv with the following data format, a series of five integers separated by commas:
1,2443,2442,1,5
2,2743,4712,99,8
3,2443,861,282,3177
etc...
Here is my R script, mapper.R:
#! /usr/bin/env Rscript
# Define functions
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
# function to read in the relevant data from the needed data files
get.data <- function(casename) {
    list <- lapply(casename, function(x) {
        read.csv(file = paste("./inputdata/", x, ".csv", sep = ""),
                 header = TRUE,
                 stringsAsFactors = FALSE)})
    return(data.frame(list))
}
con <- file("stdin")
line <- readLines(con, n = 1, warn = FALSE)
line <- trimWhiteSpace(line)
values <- unlist(strsplit(line, ","))
lv <- length(values)
cases <- as.numeric(values[2:lv])
simid <- paste("sim", values[1], ":", sep = "")
l <- length(cases) # for indexing
## create a vector for the case names
names.vector <- paste("case", cases, sep = ".")
## read in metadata and the necessary data columns using the get.data function
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)
d <- cbind(metadata[,1:3], get.data(names.vector))
## Calculations that use the data frame d and produce a string called 'output'
## in the form of "id: value1 value2 value3 ..." to be used at a
## later time for aggregation.
cat(output, "\n")
close(con)
The (generalized) EMR call for this simulation is:
ruby elastic-mapreduce --create --stream --input s3n://bucket/project/input.txt --output s3n://bucket/project/output --mapper s3n://bucket/project/mapper.R --reducer org.apache.hadoop.mapred.lib.IdentityReducer --cache-archive s3n://bucket/project/inputdata.tar.gz#inputdata --name Simulation --num-instances 2
If anyone has any insights as to why I might be experiencing these issues, I am open to suggestions, as well as any changes/optimization to the R script.
My other option is to turn the script into a function and run a parallelized apply using R multicore packages, but I haven't tried it yet. I'd like to get this working on EMR. I used JD Long's and Pete Skomoroch's R/EMR examples as a basis for creating the script.
Nothing obvious jumps out. However, can you run the job using a simple input file of only 10 lines? Make sure these 10 lines are scenarios which did not run in your big test case. Try this to eliminate the possibility that your inputs are causing the R script to not produce an answer.
Debugging EMR jobs is a skill of its own.
EDIT:
This is a total fishing expedition, but fire up an EMR interactive Pig session using the AWS GUI. "Interactive Pig" sessions stay up and running so you can ssh into them. You could also do this from the command line tools, but it's a little easier from the GUI since, hopefully, you only need to do this once. Then ssh into the cluster, transfer over your test-case infile, your cache files and your mapper, and then run this:
cat infile.txt | ./yourMapper.R > outfile.txt
This is just to test if your mapper can parse the infile in the EMR environment with no Hadoop bits in the way.
EDIT 2:
I'm leaving the above text there for posterity, but the real issue is that your script never goes back to stdin to pick up more data. Thus you get one run for each mapper and then it ends. If you run the above one-liner you will only get one result, not a result for each line in infile.txt. If you had run the cat test even on your local machine, the error should have popped out!
So let's look at Pete's word count in R example:
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
## **** could work with a single readLines or in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## **** can be done as cat(paste(words, "\t1\n", sep=""), sep="")
    for (w in words)
        cat(w, "\t1\n", sep="")
}
close(con)
The piece your script is missing is this bit:
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    # do your dance
    # do your dance quick
    # come on everybody tell me what's the word
    # word up
}
You should, naturally, replace the lyrics of Cameo's Word Up! with your actual logic.
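To make that concrete, here is a hedged sketch of how the question's mapper.R could be restructured around such a loop (the per-line simulation that produces the string 'output' is left as a comment, exactly as in the question):
#! /usr/bin/env Rscript
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- gsub("(^ +)|( +$)", "", line)          # trim whitespace
    values <- unlist(strsplit(line, ","))
    cases <- as.numeric(values[2:length(values)])
    simid <- paste("sim", values[1], ":", sep = "")
    ## ... read the per-case data and run the simulation here,
    ## producing a string called 'output' as in the original script ...
    cat(output, "\n")
}
close(con)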
Keep in mind that proper debugging music makes the process less painful:
http://www.youtube.com/watch?v=MZjAantupsA
