I have this code, where I copy a bunch of .tar.gz files into a directory on my hard disk and decompress them with gunzip, using pbsapply:
library(pbapply)
library(parallel)
library(R.utils)
unpack <- function(x, exdir, remove, overwrite, skip){
  copy <- paste(exdir, tail(unlist(strsplit(x, "/")), 1), sep = "")
  file.copy(from = x, to = copy)
  x <- copy
  gunzip(as.character(x), remove = remove, overwrite = overwrite, skip = skip)
}
files <- as.matrix(dir(path.to.files, pattern = ".tar.gz"))
expath <- "C:/temp/"
cl <- makeCluster(detectCores()-1)
clusterExport(cl, "unpack")
clusterExport(cl, "files")
clusterExport(cl, "expath")
pbsapply(cl = cl, t(files), FUN = function(x){
  unpack(x, exdir = expath, overwrite = FALSE, skip = TRUE, remove = TRUE)
})
I use gunzip because I want to keep the resulting .tar files and not extract them.
In principle the code works just fine. However, at random points, I get the error:
Error in checkForRemoteErrors: one node produced an error: No write permission for directory: C:/temp
I'm sure I have write permission.
Since this happens at random points, it's not reproducible.
My question now is, can I catch the error and just skip the file and continue processing?
Any help is appreciated.
Author of R.utils here: This could be because of a race condition where each worker asserts that C:/temp/ exists and that it has write permission to that folder. If a worker finds that C:/temp/ does not exist, it tries to create it. Now, if multiple workers try to create it at the same time, you can end up with a race condition.
Try to make sure that C:\temp\ really exists before launching the parallel code, e.g. dir.create(expath). Let me know if this makes a difference.
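If, on top of that, you want the run to survive an occasional failure instead of aborting, you can catch the error on the worker and simply skip that file. Here is a minimal sketch of that idea (the tryCatch() wrapper and the "skipped" message it returns are additions for illustration, not part of your original code):
# create the target directory once, before the workers are launched
dir.create(expath, showWarnings = FALSE, recursive = TRUE)

pbsapply(cl = cl, t(files), FUN = function(x){
  tryCatch(
    unpack(x, exdir = expath, overwrite = FALSE, skip = TRUE, remove = TRUE),
    error = function(e) paste("skipped:", x, "-", conditionMessage(e))
  )
})
Because the error is caught inside the worker function, it never reaches checkForRemoteErrors(), so the remaining files keep getting processed and you can inspect the returned vector afterwards to see which files were skipped.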
Also, in order to try to reproduce this, how big is detectCores() and roughly how many tar.gz files do you have?
BTW, the line
copy <- paste(exdir, tail(unlist(strsplit(x, "/")), 1), sep = "")
looks complicated. AFAIU, tail(unlist(strsplit(x, "/")), 1) can be replaced by basename(x), e.g. with C:/a/b/c.tar.gz you're getting c.tar.gz. Also, instead of using paste() to build your paths, use file.path(). In other words, do something like:
copy <- file.path(exdir, basename(x))
I am currently working through Coursera's R Programming course and have hit a bit of a snag with this assignment. I have been getting various errors (that I'm not totally sure I've nailed down), but this is a new one, and no matter what I do I can't seem to shake it.
Whenever I run the below code it comes back with
Error in file(file, "rt") : cannot open the connection
pollutantmean <- function (directory, pollutant, id){
  files <- list.files(path = directory, "/", full.names = TRUE)
  dat <- data.frame()
  dat <- sapply(file = directory, "/", read.csv)
  mean(dat["pollutant"], na.rm = TRUE)
}
I have tried numerous solutions posted here on SO for this issue, but none of them has worked. I made sure to set the working directory to the folder with all of the CSV files before running, and I can see all of the files in the file pane. I have also moved that working directory around a few times, since some of the suggestions were to put it on the desktop, etc., but none of that has worked. I am currently running RStudio as an admin, but that does not seem to have done anything, and I have also modified the permissions on the specdata folder to make sure there are no weird restrictions there. Any help is appreciated.
Here are two possible implementations:
# list all files in "directory", read them, combine them and then take the mean of the "pollutant" column
pollutantmean_1 <- function(directory, pollutant){
  files <- list.files(path = directory, full.names = TRUE)
  dat <- lapply(files, read.csv)
  dat <- data.table::rbindlist(dat) |> as.data.frame()
  mean(dat[, pollutant], na.rm = TRUE)
}

# list all files in "directory", read them, take the mean of the "pollutant" column for each file and return them
pollutantmean_2 <- function(directory, pollutant){
  files <- list.files(path = directory, full.names = TRUE)
  dat <- lapply(files, read.csv)
  pollutant_means <- sapply(dat, function(x) mean(x[, pollutant], na.rm = TRUE))
  names(pollutant_means) <- basename(files)
  pollutant_means
}
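For example (assuming the assignment's specdata folder is in the working directory and you are interested in the sulfate column; both names are just placeholders for whatever you actually use):
pollutantmean_1("specdata", "sulfate")   # one overall mean across all files
pollutantmean_2("specdata", "sulfate")   # one mean per file, named by file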
I'm wondering if there's a way to monitor the contents of a file from within R, similar to the behavior of tail -f (details here) in the Linux terminal.
Specifically, I want a function that you could pass a file path and it would
print the last n lines of the file to the console
hold the console
continue printing any new lines, as they are added
There are outstanding questions like "what if previously printed lines in the file get modified?" and honestly I'm not sure how tail -f handles that, but I'm interested in streaming a log file to the console, so it's kind of beside the point for my current usage.
I was looking around in the ?readLines and ?file docs and I feel like I'm getting close, but I can't quite figure it out. Plus, I can't imagine I'm the first one to want to do this, so maybe there's an established best practice (or even an existing function). Any help is greatly appreciated.
Thanks!
I made progress on this using the processx package. I created an R script which I named fswatch.R:
library(processx)
monitor <- function(fpath = "test.csv", wait_monitor = 1000 * 60 * 2){
  # make sure the file exists (requires a system with the `touch` command)
  system(paste0("touch ", fpath))

  # print the last line currently in the file
  print_last <- function(fpath){
    con <- file(fpath, "r", blocking = FALSE)
    lines <- readLines(con)
    print(lines[length(lines)])
    close(con)
  }

  if (file.exists(fpath)){
    print_last(fpath)
  }

  # fswatch is an external command-line file watcher and must be installed separately
  p <- process$new("fswatch", fpath, stdin = "|", stdout = "|", stderr = "|")

  while (
    # TRUE
    p$is_alive() &&
      file.exists(fpath)
  ){
    p$poll_io(wait_monitor)
    p$read_output()
    print_last(fpath)
    # call poll_io twice otherwise endless loop :shrug:
    p$poll_io(wait_monitor)
    p$read_output()
  }

  p$kill()
}
monitor()
Then I ran the script as a "job" in RStudio.
Every time I wrote to test.csv the job printed the last line. I stopped monitoring by deleting the log file:
log_path <- "test.csv"
write.table('1', log_path, sep = ",", col.names = FALSE,
append = TRUE, row.names = FALSE)
write.table("2", log_path, sep = ",", col.names = FALSE,
append = TRUE, row.names = FALSE)
unlink(log_path)
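For reference, the same effect can be approximated without any external tool by polling an open connection: readLines() on a connection you keep open only returns the lines added since the previous read. A minimal sketch of that idea (the file name, poll interval and stop condition are placeholders, not part of the answer above):
tail_f <- function(fpath, n = 10, interval = 1){
  # print the last n lines that are already in the file
  cat(tail(readLines(fpath), n), sep = "\n")

  # keep the connection open so each readLines() call only returns new lines
  con <- file(fpath, open = "r", blocking = FALSE)
  on.exit(close(con))
  invisible(readLines(con))  # skip past what was already printed

  repeat{
    new_lines <- readLines(con)
    if (length(new_lines) > 0) cat(new_lines, sep = "\n")
    if (!file.exists(fpath)) break  # stop when the file is deleted
    Sys.sleep(interval)
  }
}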
I'm trying to do a standardized directory setup through a function call. Inside this function I'm using two file.copy calls to copy some files from a selfmade package into the working directory of a project.
If I run the code line by line, everything works fine, but if I run the whole function, only the directories get created and no files get copied. Unfortunately the function does not throw any error, so I really do not understand what's going on or where to start troubleshooting.
Maybe one of you guys can give me a hint where to find the solution.
abstract (non working) example:
dir_setup <- function() {
  # list directories which shall be created
  dir_names <- c("dir1", "dir2", "dir3", "dir4")

  # create directories
  lapply(dir_names, function(x){ dir.create(path = paste(getwd(), x, sep = '/')) })

  # get path of package library
  lib_path <- .libPaths()

  # shorten list to vector of length 1
  if (length(lib_path) > 1) lib_path <- lib_path[1]

  # list files in source
  files <- list.files(paste0(lib_path, "/package/files/dir1"), full.names = TRUE)

  # copy resource files from package directory to working directory
  file.copy(files, paste(getwd(), "dir1", sep = '/'), overwrite = TRUE)

  # list more files
  files2 <- list.files(paste0(lib_path, "/package/files/dir2"), full.names = TRUE)

  # copy more files from package directory to working directory
  file.copy(files2, paste(getwd(), "dir2", sep = '/'), overwrite = TRUE)
}
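A first thing worth checking here: file.copy() never throws an error when a copy fails, it just returns FALSE for that file, so silent behaviour is expected whenever the source paths are wrong. Capturing its return value, and resolving the package path with system.file() instead of pasting onto .libPaths(), shows whether the source files are being found at all. A rough sketch of that check, keeping "package" and the directory names as the placeholders from the example above:
# system.file() returns "" if the path does not exist in the installed package
src <- system.file("files", "dir1", package = "package")
stopifnot(nzchar(src))

files <- list.files(src, full.names = TRUE)
copied <- file.copy(files, file.path(getwd(), "dir1"), overwrite = TRUE)

# show which files did not get copied instead of failing silently
print(files[!copied])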
I'm trying to upload multiple images to do some machine learning in R. I can upload a single image just fine, but when I try to upload multiple images using either lapply or a for loop, I get the following error: "Error in wrap.url(file, load.image.internal) : File not found". I did a check to make sure the files do exist, my WD is set correctly and R recognizes that the files and directory do exist. No matter what I change, the error is always the same. It doesn't change the outcome if I list the path from the drive it originates in or from the WD onward. I've asked many people for help with no success. I've posted my code using lapply and a for loop below. I'm still relatively new to R so if there is something I'm missing I'd greatly appreciate knowing. Also, I'm using imager here to load the files.
eggs2015 <- list()
file_list <- list.files(path="~/Grad School/Thesis Work/Machine Learning R/a2015_experimental_clustering_R/*.jpg", pattern="*.jpg", full.names = TRUE)
for (i in 1:length(file_list)){
  Path <- paste0("a2015_experimental_clustering_R", file_list[i])
  eggs2015 <- c(eggs2015, list(load.image(Path)))
}
names(eggs2015) <- file_list
eggs2015 <- list.files(path = "~/Grad School/Thesis Work/Machine Learning R/2015_experimental_clustering_R", pattern = ".jpg", all.files = TRUE, full.names = TRUE)
eggs2015 <- lapply(list, FUN = load.image("~/Grad School/Thesis Work/Machine Learning R/a2015_experimental_clustering_R/*.jpg"))
eggs2015 <- as.data.frame(eggs2015)
Personally for this kind of operation I prefer to use sapply so I can identify images with the original file names later on (if needed):
FilesToRead <- list.files(path = "~/Grad School/Thesis Work/Machine Learning R/2015_experimental_clustering_R", pattern = ".jpg", all.files = TRUE, full.names = TRUE)
ListOfImages <- sapply(FilesToRead, FUN = load.image, simplify = FALSE, USE.NAMES = TRUE)
That should work and give you a list whose elements are your images, with the file paths as names.
Or using lapply (sapply is just a wrapper for lapply)
ListOfImages <- lapply(FilesToRead, FUN = load.image)
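Note that with lapply the resulting list comes back unnamed (FilesToRead is a plain character vector), so if you still want the file names attached you can add them afterwards, e.g. names(ListOfImages) <- basename(FilesToRead).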
As you can see, your code just needed a little tweaking.
Hope it helps
I'm processing files through an application using R. The application requires a simple inputfile, outputfilename specification as parameters. Using the below code, this works fine.
input <- "\"7374.txt\""
output <- "\"7374_cleaned.txt\""
system2("DataCleaner", args = c(input, output))
However, I wish to process a folder of .txt files rather than having to do each one individually. If I had access to the source code I would simply alter the application to accept a folder rather than an individual file, but unfortunately I don't. Is it possible to somehow do this in R? I had started trying to create a loop,
input <- dir(pattern=".txt")
but I don't know how I could pass a vector in as an argument without the regex being included as part of it. Also, I would then need to be able to paste '_cleaned' onto the end of the output file names. Many thanks in advance.
Obviously, I can't test it because I don't have your DataCleaner program but how about this...
# make some files
dir.create('folder')
x = sapply(1:5, function(f) {t = tempfile(tmpdir = 'folder', fileext = '.txt'); file.create(t); t})
# find the files
inputfiles = list.files(path = 'folder', pattern = 'txt', full.names = TRUE)
# remove the extension
base = tools::file_path_sans_ext(inputfiles)
# make the output file names
outputfiles = paste0(base, '_cleaned.txt')
mysystem <- function(input, output) {
  system2('DataCleaner', args = c(input, output))
}
lapply(seq_along(inputfiles), function(f) mysystem(inputfiles[f], outputfiles[f]))
It uses lapply to iterate over all the members of the input and output files and calls the system2 function.
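An equivalent, slightly more compact way to pair the two vectors up is Map(), which walks the input and output names in lockstep: Map(mysystem, inputfiles, outputfiles).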