File locks in R

The short version
How would I go about blocking access to a file until a specific function, which both reads from and writes to that very file, has returned?
The use case
I often want to create some sort of central registry and there might be more than one R process involved in reading from and writing to that registry (in kind of a "poor man's parallelization" setting where different processes run independently from each other except with respect to the registry access).
I would not like to depend on any DBMS such as SQLite, PostgreSQL, MongoDB etc. early on in the development process. And even though I might later use a DBMS, a filesystem-based solution might still be a handy fallback option. Thus I'm curious how I could realize it with base R functionality (ideally).
I'm aware that having a lot of reads and writes to the file system in a parallel setting is not very efficient compared to DBMS solutions.
I'm running on MS Windows 8.1 (64-bit).
What I'd like to get a deeper understanding of
What exactly happens when two or more R processes try to write to or read from a file at the same time? Does the OS figure out the "access order" automatically, and does the process that "came in second" wait, or does it trigger an error because the file access is blocked by the first process? How could I prevent the second process from returning with an error and instead have it "just wait" until it's its turn?
Shared workspace of processes
Besides the rredis package: are there any other options for shared memory on MS Windows?
Illustration
Path to registry file:
path_registry <- file.path(tempdir(), "registry.rdata")
Example function that registers events:
registerEvent <- function(
  id=gsub("-| |:", "", Sys.time()),
  values,
  path_registry
) {
  ## Read phase: load the existing registry or create a fresh one
  if (!file.exists(path_registry)) {
    registry <- new.env()
    save(registry, file=path_registry)
  } else {
    load(path_registry)
  }
  message("Simulated additional runtime between reading and writing (5 seconds)")
  Sys.sleep(5)
  ## Write phase: only register the ID if it is not present yet
  if (!exists(id, envir=registry, inherits=FALSE)) {
    assign(id, values, envir=registry)
    save(registry, file=path_registry)
    message(sprintf("Registering with ID %s", id))
    out <- TRUE
  } else {
    message(sprintf("ID %s already registered", id))
    out <- FALSE
  }
  out
}
Example content that is registered:
x <- new.env()
x$a <- TRUE
x$b <- letters[1:5]
Note that the content usually is "nested", i.e. an RDBMS would not really be "useful" anyway, or at least would involve some normalization steps before writing to the DB. That's why I prefer environments (unique variable IDs and pass-by-reference are possible) over lists and, if one does make the step to use a true DBMS, I would rather turn to NoSQL approaches such as MongoDB.
Registration cycle:
The actual calls might be spread over different processes, so there is a possibility of concurrent access attempts.
I want to have other processes/calls "wait" until a registerEvent read-write cycle is finished before doing their read-write cycle (without triggering errors).
registerEvent(values=list(x_1=x, x_2=x), path_registry=path_registry)
registerEvent(values=list(x_1=x, x_2=x), path_registry=path_registry)
registerEvent(id="abcd", values=list(x_1=x, x_2=x),
path_registry=path_registry)
registerEvent(id="abcd", values=list(x_1=x, x_2=x),
path_registry=path_registry)
Check registry content:
load(path_registry)
ls(registry)

See the filelock R package, available since 2018. It is cross-platform. I am using it on Windows and have not found a single problem.
Make sure to read the documentation.
?filelock::lock
Although the docs suggest leaving the lock file in place, I have had no problems removing it on function exit in a multi-process environment:
on.exit({filelock::unlock(lock); file.remove(path.lock)})
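For instance, here is a minimal sketch of guarding the whole registerEvent() read-write cycle with filelock; the wrapper name registerEventLocked and the lock file path (the registry path plus ".lock") are illustrative assumptions, not part of the original code:
library(filelock)
registerEventLocked <- function(..., path_registry) {
  ## hypothetical lock file sitting next to the registry file
  path.lock <- paste0(path_registry, ".lock")
  ## blocks until the lock is acquired (default timeout is Inf)
  lock <- filelock::lock(path.lock)
  on.exit({filelock::unlock(lock); file.remove(path.lock)})
  registerEvent(..., path_registry=path_registry)
}
With this wrapper, a process that "comes in second" simply waits until the first one has finished its read-write cycle instead of failing.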

Related

Is it possible to create dynamic jobs with Dagster?

Consider this example - you need to load table1 from source database, do some generic transformations (like convert time zones for timestamped columns) and write resulting data into Snowflake. This is an easy one and can be implemented using 3 dagster ops.
Now, imagine you need to do the same thing but with 100s of tables. How would you do it with dagster? Do you literally need to create 100 jobs/graphs? Or can you create one job, that will be executed 100 times? Can you throttle how many of these jobs will run at the same time?
You have two main options for doing this:
Use a single job with Dynamic Outputs:
With this setup, all of your ETLs would happen in a single job. You would have an initial op that would yield a DynamicOutput for each table name that you wanted to do this process for, and feed that into a set of ops (probably organized into a graph) that would be run on each individual DynamicOutput.
Depending on what executor you're using, it's possible to limit the overall step concurrency (for example, the default multiprocess_executor supports this option).
Create a configurable job (I think this is more likely what you want)
from dagster import job, op, graph
import pandas as pd

@op(config_schema={"table_name": str})
def extract_table(context) -> pd.DataFrame:
    table_name = context.op_config["table_name"]
    # do some load...
    return pd.DataFrame()

@op
def transform_table(table: pd.DataFrame) -> pd.DataFrame:
    # do some transform...
    return table

@op(config_schema={"table_name": str})
def load_table(context, table: pd.DataFrame):
    table_name = context.op_config["table_name"]
    # load to snowflake...

@job
def configurable_etl():
    load_table(transform_table(extract_table()))

# this is what the configuration would look like to extract from table
# src_foo and load into table dest_foo
configurable_etl.execute_in_process(
    run_config={
        "ops": {
            "extract_table": {"config": {"table_name": "src_foo"}},
            "load_table": {"config": {"table_name": "dest_foo"}},
        }
    }
)
Here, you create a job that can be pointed at a source table and a destination table by giving the relevant ops a config schema. Depending on those config options, (which are provided when you create a run through the run config), your job will operate on different source / destination tables.
The example shows explicitly running this job using python APIs, but if you're running it from Dagit, you'll also be able to input the yaml version of this config there. If you want to simplify the config schema (as it's pretty nested as shown), you can always create a Config Mapping to make the interface nicer :)
From here, you can limit run concurrency by supplying a unique tag to your job, and using a QueuedRunCoordinator to limit the maximum number of concurrent runs for that tag.

How can I be sure that my process has a user interface?

When running a class which can be used interactively or silently by batch, I want to display an hourglass, but only in interactive mode.
I found the function xGlobal::clientKind(), see below, but I am not sure it is sufficient (can't batches also run on the client?).
if (xGlobal::clientKind() == ClientType::Client)
    startLengthyOperation();

// here do the process

if (xGlobal::clientKind() == ClientType::Client)
    endLengthyOperation();
Do not bother to test the client kind when using startLengthyOperation; the method does a sufficient test itself.
Testing should be like this:
if (clientKind() == ClientType::Client)
...
Don't use xGlobal::clientKind; use clientKind without qualification.
The ClientType has four values, matching what you see in "Online Users".
Batch can be called interactively in Basic/Periodic/Batch, but that should rarely be used.

System.IO.Directory::GetFiles() Polling from AX 2009, Only Seeing New Files Every 10s

I wrote code in AX 2009 to poll a directory on a network drive, every 1 second, waiting for a response file from another system. I noticed that using a file explorer window, I could see the file appear, yet my code was not seeing and processing the file for several seconds - up to 9 seconds (and 9 polls) after the file appeared!
The AX code calls System.IO.Directory::GetFiles() using ClrInterop:
interopPerm = new InteropPermission(InteropKind::ClrInterop);
interopPerm.assert();
files = System.IO.Directory::GetFiles(#POLLDIR,'*.csv');
// etc...
CodeAccessPermission::revertAssert();
After much experimentation, it emerges that the first time in my program's lifetime, that I call ::GetFiles(), it starts a notional "ticking clock" with a period of 10 seconds. Only calls every 10 seconds find any new files that may have appeared, though they do still report files that were found on an earlier 10s "tick" since the first call to ::GetFiles().
If, when I start the program, the file is not there, then all the other calls to ::GetFiles(), 1 second after the first call, 2 seconds after, etc., up to 9 seconds after, simply do not see the file, even though it may have been sitting there since 0.5s after the first call!
Then, reliably, and repeatably, the call 10s after the first call, will find the file. Then no calls from 11s to 19s will see any new file that might have appeared, yet the call 20s after the first call, will reliably see any new files. And so on, every 10 seconds.
Further investigation revealed that if the polled directory is on the AX AOS machine, this does not happen, and the file is found immediately, as one would expect, on the call after the file appears in the directory.
But this figure of 10s is reliable and repeatable, no matter what network drive I poll, no matter what server it's on.
Our network certainly doesn't have 10s of latency to see files; as I said, a file explorer window on the polled directory sees the file immediately.
What is going on?
Sounds like your issue is due to SMB caching - from this technet page:
Name, type, and ID: Directory Cache [DWORD] DirectoryCacheLifetime
Registry key the cache setting is controlled by: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters
This is a cache of recent directory enumerations performed by the
client. Subsequent enumeration requests made by client applications as
well as metadata queries for files in the directory can be satisfied
from the cache. The client also uses the directory cache to determine
the presence or absence of a file in the directory and uses that
information to prevent clients from repeatedly attempting to open
files which are known not to exist on the server. This cache is likely
to affect distributed applications running on multiple computers
accessing a set of files on a server – where the applications use an
out of band mechanism to signal each other about
modification/addition/deletion of files on the server.
In short, try setting the registry key
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters\DirectoryCacheLifetime
to 0
Thanks to @Jan B. Kjeldsen, I have been able to solve my problem using FileSystemWatcher. Here is my implementation in X++:
class SelTestThreadDirPolling
{
}

public server static Container SetStaticFileWatcher(str _dirPath, str _filenamePattern, int _timeoutMs)
{
    InteropPermission interopPerm;
    System.IO.FileSystemWatcher fw;
    System.IO.WatcherChangeTypes watcherChangeType;
    System.IO.WaitForChangedResult res;
    Container cont;
    str fileName;
    str oldFileName;
    str changeType;
    ;
    interopPerm = new InteropPermission(InteropKind::ClrInterop);
    interopPerm.assert();

    fw = new System.IO.FileSystemWatcher();
    fw.set_Path(_dirPath);
    fw.set_IncludeSubdirectories(false);
    fw.set_Filter(_filenamePattern);

    watcherChangeType = ClrInterop::parseClrEnum('System.IO.WatcherChangeTypes', 'Created');
    res = fw.WaitForChanged(watcherChangeType, _timeoutMs);

    if (res.get_TimedOut())
        return conNull();

    fileName = res.get_Name();
    //ChangeTypeName can be: Created, Deleted, Renamed and Changed
    changeType = System.Enum::GetName(watcherChangeType.GetType(), res.get_ChangeType());
    fw.Dispose();
    CodeAccessPermission::revertAssert();

    if (changeType == 'Renamed')
        oldFileName = res.get_OldName();

    cont += fileName;
    cont += changeType;
    cont += oldFileName;
    return cont;
}

void waitFileSystemWatcher(str _dirPath, str _filenamePattern, int _timeoutMs)
{
    container cResult;
    str filename, changeType, oldFilename;
    ;
    cResult = SelTestThreadDirPolling::SetStaticFileWatcher(_dirPath, _filenamePattern, _timeoutMs);
    if (cResult)
    {
        [filename, changeType, oldFilename] = cResult;
        info(strfmt("filename=%1, changeType=%2, oldFilename=%3", filename, changeType, oldFilename));
    }
    else
    {
        info("TIMED OUT");
    }
}

void run()
{
    ;
    this.waitFileSystemWatcher(@'\\myserver\mydir', 'filepattern*.csv', 10000);
}
I should acknowledge the following for forming the basis of my X++ implementation:
https://blogs.msdn.microsoft.com/floditt/2008/09/01/how-to-implement-filesystemwatcher-with-x/
I would guess DAXaholic's answer is correct, but you could try other solutions like EnumerateFiles.
In your case I would rather wait for the files than poll for them.
Using FileSystemWatcher there will be a minimal delay from file creation till your process wakes up. It is more tricky to use, but avoiding polling is a good thing. I have never used it over a network.

Invalidate/prevent memoize with plone.memoize.ram

I have a Zope utility with a method that performs network processes.
As the result of the call is valid for a while, I'm using plone.memoize.ram to cache the result.
class MyClass(object):

    @cache(cache_key)
    def do_auth(self, adapter, data):
        # performing expensive network process here
        ...
...and the cache function:
from time import time

def cache_key(method, utility, data):
    return time() // (60 * 60)
But I want to prevent the memoization from taking place when the do_auth call returns empty results (or raises network errors).
Looking at the plone.memoize code, it seems I need to raise the ram.DontCache() exception, but before doing this I need a way to investigate the old cached value.
How can I get the cached data from the cache storage?
I put this together from several pieces of code I wrote...
It's not tested but may help you.
You may access the cached data using the ICacheChooser utility.
Its call method needs the dotted name of the function you cached; in your case that is do_auth itself:
from zope.component import getUtility
from plone.memoize.interfaces import ICacheChooser

key = '{0}.{1}'.format(__name__, method.__name__)
cache = getUtility(ICacheChooser)(key)
storage = cache.ramcache._getStorage()._data
cached_infos = storage.get(key)
In cached_infos there should be all infos you need.

R and RJDBC: Using dbSendUpdate results in ORA-01000: maximum open cursors exceeded

I have encountered an error when trying to insert thousands of rows with R/RJDBC and the dbSendUpdate command on an Oracle database.
Problem can be reproduced by creating a test table with
CREATE TABLE mytest (ID NUMBER(10) not null);
and then executing the following R script
library(RJDBC)
drv <- JDBC("oracle.jdbc.OracleDriver", "ojdbc-11.1.0.7.0.jar") # place your JDBC driver here
conn <- dbConnect(drv, "jdbc:oracle:thin:@MYSERVICE", "myuser", "mypasswd") # place your connection details here
for (i in 1:10000) {
  dbSendUpdate(conn, "INSERT INTO mytest VALUES (?)", i)
}
Searching the Internet provided the information that one should close result cursors, which is obvious (e.g. see java.sql.SQLException: - ORA-01000: maximum open cursors exceeded or Unable to resolve error - java.sql.SQLException: ORA-01000: maximum open cursors exceeded).
But the help file for ??dbSendUpdate claims that result cursors are not used at all:
.. that dbSendUpdate is used with DBML queries and thus doesn't return any result set.
Therefore this behavior doesn't make much sense to me :-(
Can anybody help?
Thanks a lot!
PS: Found something interesting in the RJDBC documentation http://www.rforge.net/RJDBC/
Note that the life time of a connection, result set, driver etc. is determined by the lifetime of the corresponding R object. Once the R handle goes out of scope (or if removed explicitly by rm) and is garbage-collected in R, the corresponding connection or result set is closed and released. This is important for databases that have limited resources (like Oracle) - you may need to add gc() by hand to force garbage collection if there could be many open objects. The only exception are drivers which stay registered in the JDBC even after the corresponding R object is released as there is currently no way to unload a JDBC driver (in RJDBC).
But again, even inserting gc() within the loop produces the same behavior.
Finally, we realized this is a bug in the RJDBC package. If you are willing to patch RJDBC, you can use the patched sources available at https://github.com/Starfox899/RJDBC (as long as they have not been incorporated into the package).
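Until a fixed version is available, a possible interim workaround (just a sketch, assuming your RJDBC version supports dbWriteTable() with append=TRUE; the table and column names follow the example above) is to insert the rows in one bulk call rather than issuing thousands of per-row statements:
dat <- data.frame(ID=1:10000)
## one bulk insert instead of 10000 individual prepared statements
dbWriteTable(conn, "MYTEST", dat, overwrite=FALSE, append=TRUE)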

Resources