monitor changes to a file in R (similar to tail "follow")

I'm wondering if there's a way to monitor the contents of a file from within R, similar to the behavior of tail -f in a Linux terminal.
Specifically, I want a function that you could pass a file path to, and it would:
print the last n lines of the file to the console
hold the console
continue printing any new lines, as they are added
There are outstanding questions like "what if previously printed lines in the file get modified?" and honestly I'm not sure how tail -f handles that, but I'm interested in streaming a log file to the console, so it's kind of beside the point for my current usage.
I was looking around in the ?readLines and ?file docs and I feel like I'm getting close, but I can't quite figure it out. Plus, I can't imagine I'm the first one to want to do this, so maybe there's an established best practice (or even an existing function). Any help is greatly appreciated.
Thanks!

I made progress on this using the processx package. I created an R script which I named fswatch.R:
library(processx)
monitor <- function(fpath = "test.csv", wait_monitor = 1000 * 60 * 2){
  system(paste0("touch ", fpath))

  print_last <- function(fpath){
    con <- file(fpath, "r", blocking = FALSE)
    lines <- readLines(con)
    print(lines[length(lines)])
    close(con)
  }

  if(file.exists(fpath)){
    print_last(fpath)
  }

  p <- process$new("fswatch", fpath, stdin = "|", stdout = "|", stderr = "|")

  while(
    # TRUE
    p$is_alive() &
    file.exists(fpath)
  ){
    p$poll_io(wait_monitor)
    p$read_output()
    print_last(fpath)
    # call poll_io twice otherwise endless loop :shrug:
    p$poll_io(wait_monitor)
    p$read_output()
  }

  p$kill()
}
monitor()
Then I ran the script as a "job" in RStudio.
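As a side note, the same script can also be launched as a background job from the console rather than through the Jobs menu, assuming the rstudioapi package is available; a minimal sketch:
# run fswatch.R as an RStudio background job
rstudioapi::jobRunScript("fswatch.R", name = "file monitor")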
Every time I wrote to test.csv the job printed the last line. I stopped monitoring by deleting the log file:
log_path <- "test.csv"
write.table('1', log_path, sep = ",", col.names = FALSE,
            append = TRUE, row.names = FALSE)
write.table("2", log_path, sep = ",", col.names = FALSE,
            append = TRUE, row.names = FALSE)
unlink(log_path)
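A dependency-free alternative to the processx/fswatch approach is to simply poll the file with readLines and Sys.sleep. This is only a rough sketch (it re-reads the whole file on every poll, so it is best suited to small log files), but it covers the tail -f behavior described in the question:
tail_follow <- function(fpath, n = 10, poll = 1){
  # print the last n lines that already exist
  lines <- readLines(fpath, warn = FALSE)
  writeLines(tail(lines, n))
  n_seen <- length(lines)
  # keep printing newly appended lines until the file is deleted
  while(file.exists(fpath)){
    lines <- readLines(fpath, warn = FALSE)
    if(length(lines) > n_seen){
      writeLines(lines[(n_seen + 1):length(lines)])
      n_seen <- length(lines)
    }
    Sys.sleep(poll)
  }
}
tail_follow("test.csv")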

Related

Writing to file in R one line after the other

I have the following piece of code to write to a file from R, one line at a time.
for (i in c(1:10)){
  writeLines(as.character(i), file("output.csv"))
}
It just writes 10, presumably over-writing the previous lines. How do I make R append the new line to the existing output? append = TRUE does not work.
append = TRUE does work when using the function cat (instead of writeLines), but only if you give cat a file name, not a file object: whether a file is appended to or overwritten is a property of the file object itself, i.e. it needs to be specified when the file is opened.
Thus both of these work:
f <- file('filename', open = 'a') # open in "a"ppend mode
for (i in 1 : 10) writeLines(as.character(i), f)
close(f)

for (i in 1 : 10) cat(i, '\n', file = 'filename', sep = '', append = TRUE)
Calling file manually is almost never necessary in R.
… but as the other answer shows, you can (and should!) avoid the loop anyway.
You won't need a loop. Use the newline escape character \n as separator instead.
vec <- c(1:10)
writeLines(as.character(vec), file("output.csv"), sep="\n")
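If the goal is specifically to append to an existing file without a loop (combining the two points above), opening the connection once in append mode works; a small sketch:
f <- file("output.csv", open = "a")  # append mode, so existing lines are kept
writeLines(as.character(1:10), f)    # one call, no loop; a newline follows each value
close(f)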

Catch an error produced by pbsapply

I have this code, where I copy and decompress (using gunzip) a bunch of files into a directory on my hard disk using pbsapply:
library(pbapply)
library(parallel)
library(R.utils)
unpack <- function(x, exdir, remove, overwrite, skip){
  copy <- paste(exdir, tail(unlist(strsplit(x, "/")), 1), sep = "")
  file.copy(from = x, to = copy)
  x <- copy
  gunzip(as.character(x), remove = remove, overwrite = overwrite, skip = skip)
}
files <- as.matrix(dir(path.to.files, pattern = ".tar.gz"))
expath <- "C:/temp/
cl <- makeCluster(detectCores()-1)
clusterExport(cl, "unpack")
clusterExport(cl, "files")
clusterExport(cl, "expath")
pbsapply(cl = cl, t(files), FUN = function(x){
unpack(x, exdir = expath, overwrite = FALSE, skip = TRUE, remove = TRUE)
})
I use gunzip because I want to keep the .tar files rather than extract them.
In principle the code works just fine. However, at random points, I get the error:
Error in checkForRemoteErrors: one node produced an error: No write permission for directory: C:/temp
I'm sure I have write permission.
Since this happens at random points, it's not reproducible.
My question now is, can I catch the error and just skip the file and continue processing?
Any help is appreciated.
Author of R.utils here: This could be because of a race condition where each worker asserts that C:/temp/ exists and that it has write permission to that folder. If a worker finds that C:/temp/ does not exist, it tries to create it. Now, if multiple workers try to create it at the same time, you might have a race condition.
Try to make sure that C:\temp\ really exists before launching the parallel code, e.g. dir.create(expath). Let me know if this makes a difference.
Also, in order to try to reproduce this, how big is detectCores() and roughly how many tar.gz files do you have?
BTW, the line
copy <- paste(exdir, tail(unlist(strsplit(x, "/")), 1), sep = "")
looks complicated. AFAIU, tail(unlist(strsplit(x, "/")), 1) can be replaced by basename(x), e.g. with C:/a/b/c.tar.gz you're getting c.tar.gz. Also, instead of using paste() to build your paths, use file.path(). In other words, do something like:
copy <- file.path(exdir, basename(x))
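To the original question about skipping a file that errors out and continuing, one common pattern (not something from the answer above, just a sketch) is to wrap the body of unpack in tryCatch so a failing file returns NA instead of stopping the whole pbsapply call; it assumes R.utils is loaded on the workers, e.g. via clusterEvalQ(cl, library(R.utils)):
safe_unpack <- function(x, exdir, remove, overwrite, skip){
  tryCatch({
    copy <- file.path(exdir, basename(x))
    file.copy(from = x, to = copy)
    gunzip(as.character(copy), remove = remove, overwrite = overwrite, skip = skip)
  }, error = function(e){
    # report the problem and skip this file instead of aborting the whole run
    message("skipping ", x, ": ", conditionMessage(e))
    NA
  })
}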

Why can't I append with R's write function

Pardons if this has already been answered, I didn't see anything quite like this. I want to create a running log but I can't get the write function to append. Here is a sample:
fName <- "D:/Temp/foo.txt"
fCn <- file(fName)
write('test1', fCn, append = TRUE)
write('test2', fCn, append = TRUE)
close(fCn)
When I open the resulting file I only see the last line. I have also tried opening and closing the file like so:
fCn <- file(fName)
write('test1', fCn, append = TRUE)
close(fCn)
fCn <- file(fName)
write('test2', fCn, append = TRUE)
close(fCn)
Seems like it should be easy. Where am I going wrong? TIA
Open the connection in append mode:
> fCn <- file(fName,open="a")
Full example:
> fName="out1.txt"
> fCn <- file(fName,open="a")
> write('test1', fCn, append = TRUE)
> write('test2', fCn, append = TRUE)
> close(fCn)
Results in both strings written to the file.
Alternatively you can just write to the file name (not a connection object) with append=TRUE:
> write('test1', "out2.txt", append = TRUE)
> write('test2', "out2.txt", append = TRUE)
also results in a two-line output file, created from scratch.
You can use sink for this purpose. It is often easier to write exactly what you see in the R console to a text file, so you can be sure about the output.
sink("C:/Users/mahdisoltanim/Desktop/a.txt", append= TRUE)
cat("\n")
cat("test1")
cat("\n")
cat("test2")
sink()

Keep rows separate with write.table R

I'm trying to produce some files that have slightly unusual field separators.
require(data.table)
dset <- data.table(MPAN = c(rep("AAAA",1000), rep("BBBB",1000), rep("CCCC",1000)),
                   INT01 = runif(3000,0,1), INT02 = runif(3000,0,1), INT03 = runif(3000,0,1))
write.table(dset, "C:/testing_write_table.csv",
            sep = "|", row.names = FALSE, col.names = FALSE, na = "", quote = FALSE, eol = "")
I'm finding, however, that the rows are not being kept separate in the output file, e.g.
AAAA|0.238683722680435|0.782154920976609|0.0570344978477806AAAA|0.9250325632......
Would you know how to ensure the text file retains distinct rows?
Cheers
You are using the wrong eol argument. The end-of-line argument needs to be a line break:
This worked for me:
require(data.table)
dset <- data.table(MPAN = c(rep("AAAA",1000), rep("BBBB",1000), rep("CCCC",1000)),
                   INT01 = runif(3000,0,1), INT02 = runif(3000,0,1), INT03 = runif(3000,0,1))
write.table(dset, "C:/testing_write_table.csv", # save as .txt if you want to open it with Notepad as well as Excel
            sep = "|", row.names = FALSE, col.names = FALSE, na = "", quote = FALSE, eol = "\n")
Using the line-break character '\n' as the end-of-line argument creates separate lines for me.
Turns out this was a Unix vs. Windows line-ending issue. So something of a red herring, but perhaps worth recording in case anyone else runs into this at first perplexing issue.
It turns out that Windows Notepad sometimes struggles to render files generated on Unix properly; a quick test to see if this is the issue is to open the file in Windows WordPad instead, and you may find that it renders properly.
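Relatedly, if the file must render correctly in older versions of Windows Notepad, one option (an extension of the eol fix above, not part of the original answer) is to write Windows-style line endings explicitly:
write.table(dset, "C:/testing_write_table.csv",
            sep = "|", row.names = FALSE, col.names = FALSE,
            na = "", quote = FALSE, eol = "\r\n")  # CRLF line endings for Windows tools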

Troubleshooting R mapper script on Amazon Elastic MapReduce - Results not as expected

I am trying to use Amazon Elastic MapReduce to run a series of simulations of several million cases. This is an Rscript streaming job with no real reduce step; I am using the Identity Reducer in my EMR call: --reducer org.apache.hadoop.mapred.lib.IdentityReducer.
The script file works fine when tested and run locally from the command line on a Linux box when passing one line of string manually echo "1,2443,2442,1,5" | ./mapper.R and I get the one line of results that I am expecting. However, when I tested my simulation using about 10,000 cases (lines) from the input file on EMR, I only got output for a dozen lines or so out of 10k input lines. I've tried several times and I cannot figure out why. The Hadoop job runs fine without any errors. It seems like input lines are being skipped, or perhaps something is happening with the Identity reducer. The results are correct for the cases where there is output.
My input file is a csv with the following data format, a series of five integers separated by commas:
1,2443,2442,1,5
2,2743,4712,99,8
3,2443,861,282,3177
etc...
Here is my R script for mapper.R
#! /usr/bin/env Rscript
# Define Functions
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
# function to read in the relevant data from needed data files
get.data <- function(casename) {
  list <- lapply(casename, function(x) {
    read.csv(file = paste("./inputdata/", x, ".csv", sep = ""),
             header = TRUE,
             stringsAsFactors = FALSE)})
  return(data.frame(list))
}
con <- file("stdin")
line <- readLines(con, n = 1, warn = FALSE)
line <- trimWhiteSpace(line)
values <- unlist(strsplit(line, ","))
lv <- length(values)
cases <- as.numeric(values[2:lv])
simid <- paste("sim", values[1], ":", sep = "")
l <- length(cases) # for indexing
## create a vector for the case names
names.vector <- paste("case", cases, sep = ".")
## read in metadata and necessary data columns using get.data function
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)
d <- cbind(metadata[, 1:3], get.data(names.vector))
## Calculations that use df d and produce a string called 'output'
## in the form of "id: value1 value2 value3 ..." to be used at a
## later time for aggregation.
cat(output, "\n")
close(con)
The (generalized) EMR call for this simulation is:
ruby elastic-mapreduce --create --stream --input s3n://bucket/project/input.txt --output s3n://bucket/project/output --mapper s3n://bucket/project/mapper.R --reducer org.apache.hadoop.mapred.lib.IdentityReducer --cache-archive s3n://bucket/project/inputdata.tar.gz#inputdata --name Simulation --num-instances 2
If anyone has any insights as to why I might be experiencing these issues, I am open to suggestions, as well as any changes/optimization to the R script.
My other option is to turn the script into a function and run a parallelized apply using R multicore packages, but I haven't tried it yet. I'd like to get this working on EMR. I used JD Long's and Pete Skomoroch's R/EMR examples as a basis for creating the script.
Nothing obvious jumps out. However, can you run the job using a simple input file of only 10 lines? Make sure these 10 lines are scenarios which did not run in your big test case. Try this to eliminate the possibility that your inputs are causing the R script to not produce an answer.
Debugging EMR jobs is a skill of its own.
EDIT:
This is a total fishing expedition, but fire up an EMR interactive Pig session using the AWS GUI. "Interactive Pig" sessions stay up and running so you can ssh into them. You could also do this from the command-line tools, but it's a little easier from the GUI since, hopefully, you only need to do this once. Then ssh into the cluster, transfer over your test-case infile, your cache files, and your mapper, and then run this:
cat infile.txt | yourMapper.R > outfile.txt
This is just to test if your mapper can parse the infile in the EMR environment with no Hadoop bits in the way.
EDIT 2:
I'm leaving the above text there for posterity, but the real issue is that your script never goes back to stdin to pick up more data. Thus you get one run per mapper and then it ends. If you run the above one-liner you will only get one result, not a result for each line in infile.txt. If you had run the cat test even on your local machine, the error would have popped out!
So let's look at Pete's word count in R example:
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
## **** could do with a single readLines or in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  words <- splitIntoWords(line)
  ## **** can be done as cat(paste(words, "\t1\n", sep=""), sep="")
  for (w in words)
    cat(w, "\t1\n", sep="")
}
close(con)
The piece your script is missing is this bit:
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # do your dance
  # do your dance quick
  # come on everybody tell me what's the word
  # word up
}
You should, naturally, replace the lyrics of Cameo's Word Up! with your actual logic.
Keep in mind that proper debugging music makes the process less painful:
http://www.youtube.com/watch?v=MZjAantupsA
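For concreteness, here is roughly what the mapper from the question looks like once it is wrapped in that loop. The per-case calculation is elided just as in the question, and the output line shown is only a placeholder, so treat this as a sketch rather than the finished script:
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  values <- unlist(strsplit(trimWhiteSpace(line), ","))
  simid <- paste("sim", values[1], ":", sep = "")
  cases <- as.numeric(values[-1])
  ## ... read metadata, build the data frame d, run the simulation ...
  output <- paste(simid, paste(cases, collapse = " "))  # placeholder for the real result
  cat(output, "\n")
}
close(con)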
