Runtime customizable settings in R

I am coming from Python and am looking for an equivalent of runtime-customizable settings.
In a Python package I would just have a
package.settings dictionary
and could change those settings with a simple:
package.settings[key] = value
I tried to simulate dictionaries in R with a named list (I am aware of the hash package).
If I import the package and want to change the list with:
package::settings$key = value
I get the error:
Error in package::settings <- value : object 'package' not found
How can I implement runtime-customizable settings in R?
I know that I sometimes have to change my mental model (especially when switching languages ;), so if a dictionary is not the correct data structure, I am happy to hear it.
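For reference, one common pattern (a sketch, not from the original question; the names .settings, set_setting, and get_setting are made up) is to keep the settings in an environment inside the package, since environments are modified in place even when the namespace binding itself is locked:

# In the package sources: a package-local environment holding the settings.
.settings <- new.env(parent = emptyenv())
.settings$verbose <- FALSE  # an example default

# Exported accessors, so users never assign into the namespace directly.
set_setting <- function(key, value) assign(key, value, envir = .settings)
get_setting <- function(key) get(key, envir = .settings)

A user can then call package::set_setting("verbose", TRUE) at runtime; because nothing is ever assigned to package::settings itself, the object 'package' not found error does not arise.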

Related

'ConcatenatedDoc2Vec' object has no attribute 'docvecs'

I am a beginner in Machine Learning and trying Document Embedding for a university project. I work with Google Colab and Jupyter Notebook (via Anaconda). The problem is that my code runs perfectly in Google Colab, but if I execute the same code in Jupyter Notebook (via Anaconda) I run into an error with the ConcatenatedDoc2Vec object.
With this function I build the vector features for a Classifier (e.g. Logistic Regression).
import numpy as np

def build_vectors(model, length, vector_size):
    vector = np.zeros((length, vector_size))
    for i in range(0, length):
        prefix = 'tag' + '_' + str(i)
        vector[i] = model.docvecs[prefix]
    return vector
I concatenate two Doc2Vec models (d2v_dm, d2v_dbow); both work perfectly through the whole code and have no problems with the function build_vectors():
d2v_combined = ConcatenatedDoc2Vec([d2v_dm, d2v_dbow])
But if I run the function build_vectors() with the concatenated model:
#Compute combined Vector size
d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
I receive this error (but only when I run this in Jupyter Notebook via Anaconda; the same code has no problem in the notebook in Google Colab):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [20], in <cell line: 4>()
1 #Compute combined Vector size
2 d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
----> 4 d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
Input In [11], in build_vectors(model, length, vector_size)
3 for i in range(0, length):
4 prefix = 'tag' + '_' + str(i)
----> 5 vector[i] = model.docvecs[prefix]
6 return vector
AttributeError: 'ConcatenatedDoc2Vec' object has no attribute 'docvecs'
This is mysterious to me: it works in Google Colab but not in Anaconda/Jupyter Notebook, and I did not find anything on the web to solve my problem.
If it's working in one place but not the other, you're probably using different versions of the relevant libraries – in this case, gensim.
Does the following show exactly the same version in both places?
import gensim
print(gensim.__version__)
If not, the most immediate workaround would be to make the place where it doesn't work match the place where it does, by force-installing the same explicit version – pip install gensim==VERSION (where VERSION is the target version) – then ensuring your notebook is restarted to see the change.
Beware, though, that unless starting from a fresh environment, this could introduce other library-version mismatches!
Other things to note:
Last I looked, Colab was using an over-4-year-old version of Gensim (3.6.0), despite more-recent releases with many fixes & performance improvements. It's often best to stay at or close to the latest versions of any key libraries used by your project; this answer describes how to trigger the installation of a more-recent Gensim at Colab. (Though of course, since your code is adapted to the older version, the initial effect of that might be to cause the same breakage at Colab.)
In more-recent Gensim versions, the property formerly called docvecs is now called just dv - so some older code erroring this way may only need docvecs replaced with dv to work. (Other tips for migrating older code to the latest Gensim conventions are available at: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 )
It's unclear where you're pulling the ConcatenatedDoc2Vec class from. A class of that name exists in some Gensim demo/test code, as a very minimal shim class that was at one time used in attempts to reproduce the results of the original "Paragraph Vector" (aka Doc2Vec) paper. But beware: that's not a usual way to use Doc2Vec, & the class of that name that I know of barely does anything outside its original narrow purpose.
Further, beware that as far as I know, no one has ever reproduced the full claimed performance of the two-kinds-of-doc-vectors-concatenated approach reported in that paper, even using the same data, described technique, & evaluation. The claimed results likely relied on some other undisclosed techniques, or some error in the writeup. So if you're trying to mimic that, don't get too frustrated. And know that most uses of Doc2Vec just pick one mode.
If you have your own separate reasons for creating concatenated feature-vectors, from multiple algorithms, you should probably write your own code for that, not limited to the peculiar two-modes-of-Doc2Vec code from that one experiment.

Using reticulate with targets

I'm having this weird issue where my target, which interfaces with a slightly customized Python module (installed with pip install --editable) through reticulate, gives different results when it's called from an interactive R session than when targets is started from the command line directly, even when I make sure the other argument(s) to tar_make are identical (callr_function = NULL, which I use for interactive debugging). The function is deterministic and should be returning the exact same result, but isn't.
It's tricky to provide a reproducible example, but if truly necessary I'll invest the required time in it. I'd like to get tips on how to debug this and identify the exact issue. I already safeguarded against potential pointer issues; the Python object is not getting passed around between different targets/environments (anymore), rather it's immediately used to compute the result of interest. I also checked that the same Python version is being used by printing the result of reticulate::py_config() to screen. I also verified both approaches are using the same version of the customized module.
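For what it's worth, the kind of diagnostic target I'm using to compare the two runs looks roughly like this (a sketch; py_diagnostics is a made-up target name):

# _targets.R (sketch): a target that records the Python configuration,
# so the interactive and command-line runs can be diffed afterwards.
library(targets)
list(
  tar_target(
    py_diagnostics,
    capture.output(reticulate::py_config())
  )
)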
Thanks in advance!

R package: Refresh/Update internal raw data

I am developing a package in R (3.4.3) that has some internal data and I would like to be able to let users refresh it without having to do too much work. But let me be more specific.
The package contains several functions that rely on multiple parameters, but I am really only exporting wrapper functions that do all the work. Because there is a large number of parameters to pass to those wrapper functions, I am putting them in a .csv file and passing them as a list. The data is therefore a simple .csv file called param.csv with the names of the parameters in one column and their values in the next column:
parameter_name,parameter_value
ticker, SPX Index
field, PX_LAST
rolling_window,252
upper_threshold,1
lower_threshold,-1
signal_positive,1
signal_neutral,0
signal_negative,-1
I process it by running
param.df <- read.csv(file = "param.csv", header = TRUE, sep = ",")
param.list <- split(param.df[, 2], param.df[, 1])
and I put the list inside the package like this
usethis::use_data(param.list, overwrite = TRUE)
Then, when I restart the R interactive session and execute
devtools::document()
devtools::load_all()
devtools::install()
require(pkg)
everything works out great: the data is available and can be passed to functions.
First problem: when I change param.csv, save it and repeat the above four lines of code, the internal param.list is not updated. Is there something I am missing here? Or is it simply that the developer of the package must run usethis::use_data(param.list, overwrite = TRUE) each time the .csv file the data comes from changes?
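(For what it's worth, the usual devtools/usethis convention is to keep this regeneration code in a data-raw/ script and re-run it whenever the .csv changes; the snapshot taken by use_data() is only updated when that script runs. A sketch, assuming the file lives at data-raw/param.csv:)

# data-raw/param.R (sketch): regenerate the internal snapshot.
param.df <- read.csv(file = "data-raw/param.csv", header = TRUE, sep = ",")
param.list <- split(param.df[, 2], param.df[, 1])
usethis::use_data(param.list, overwrite = TRUE)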
Because it is intended to be a sort of map, the users will want to tweak parameters for (manual) model calibration. To try and solve this issue I let users provide the functions with their own parameter list (as a .csv file like before). I have a function get_param_list("path/to/file.csv") that does exactly the same processing as above and returns the list (a sketch follows). It allows the users to pass their own parameter list. De facto, the internal data param.list acts as the default parameter set.
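Roughly, get_param_list() just wraps the two processing lines from above (a sketch):

get_param_list <- function(path) {
  # Same processing as for the internal data, but on a user-supplied file.
  param.df <- read.csv(file = path, header = TRUE, sep = ",")
  split(param.df[, 2], param.df[, 1])
}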
Second problem: I'd like to let users modify this default list of parameters param.list from outside the package in a safe way.
So since the internal object is a list, the users can simply modify the element of their choice in the list from outside the package. But that is pretty dangerous in my mind, because users could
- forget parameters, so I have default values inside the functions for that case
- replace a parameter with a value of another type, which can provoke an error or, worse, side effects.
Is there a way to let users modify the internal list of parameters without opening the package? For example, I had thought about a function that could replace the .csv file with a new one provided by the user, but this brings me back to the first problem: when I restart the R prompt and reinstall the package, nothing happens unless I run usethis::use_data(param.list, overwrite = TRUE) from inside the package. Would another type of data be more useful (e.g. making the data internal)?
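One safe-ish middle ground (a sketch, not from the original question; merge_params is a made-up name) is to merge user overrides onto the packaged defaults while checking names and types, instead of letting users touch param.list directly:

merge_params <- function(user.list, defaults = param.list) {
  for (key in names(user.list)) {
    # Reject unknown parameter names outright.
    if (!key %in% names(defaults))
      stop("unknown parameter: ", key)
    # Guard against a replacement value of the wrong type.
    if (!identical(class(user.list[[key]]), class(defaults[[key]])))
      stop("wrong type for parameter: ", key)
  }
  modifyList(defaults, user.list)
}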
This is my first question on this website but I realize it's quite long and might be taken as a coding style question. But if anyone can see an obvious mistake or misunderstanding of package development in R from my side that would really help me.
Cheers

Can I load an RData file while bypassing loading the namespaces?

Let's say some of my users cannot alter their R environments, but I need them to be able to open up RData files. These environment files require a package to be loaded (httpuv to be exact). We don't care about the package, we don't need its capabilities, we just need to get at the data. Is there a way to either force R to bypass loading namespaces when loading the RData file, or force it to save it without namespace dependencies at the originating end? Thanks.
To reproduce, install Shiny. Create and save some R objects to the server's file system from within a Shiny applet as an RData file. Copy the file over to a computer that doesn't have Shiny or the httpuv package installed. Try loading the RData file, even if the actual objects you saved are completely ordinary data.frames that have nothing to do with Shiny or httpuv.
I did strings on the RData, and the damn thing is full of references to httpuv. The software is loading the file and then actively deciding to not continue in the internal loadFromConn2() function. Therefore there must be a way to make it stop doing so.
Really @baptiste should get credit for the link in his comment to some general solutions, especially the R CMD INSTALL --fake trick, and I will accept that if he reposts it as an answer. That is why I am not accepting the following answer of my own to the specific problem that caused it in my case, but I am posting my answer in case it helps someone else.
Some of the objects I was saving were lm fitted objects. Those contain formula/terms objects (at least two each, for some reason... maybe because they've been through stepAIC), and those formulas in turn each have an environment attribute. The environment attribute is .GlobalEnv, which probably does contain copies of package functions someplace. When I dug through the objects inside the fitted models, and then the objects inside all the attributes of those objects, and then the objects inside the attributes of the attributes of those objects... and set every environment attribute I could find to NULL, eventually I was able to save that fitted model to a file that could be opened from a different R installation without getting the error about not being able to load a namespace.
I suppose I could also write a function to iterate through the objects within a fitted model, and their attributes, and remove environments, but that sounds ugly and dangerous. Maybe there is a way to force formulas and fitted models not to retain environments, and that would be better. For the time being, instead of saving fitted models, I will save their call attributes after scrubbing any environment attributes I might find there. If that doesn't work, I'll deparse them into character strings.
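For concreteness, the manual scrubbing looks roughly like this (a sketch; strip_environments is a made-up name, and an lm fit may hide environments in more places than these):

strip_environments <- function(fit) {
  # Drop the environment captured by the terms object.
  attr(fit$terms, ".Environment") <- NULL
  # The model frame carries its own terms attribute with an environment.
  if (!is.null(fit$model))
    attr(attr(fit$model, "terms"), ".Environment") <- NULL
  # Note: predict()/update() may rely on these environments, so test
  # the scrubbed object before saving it.
  fit
}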
PS: I used the RDS format and haven't yet tested it with RData, but I suspect that the problem was the saving of the evaluation environment in some of the attributes, and had nothing to do with the format in which the objects get saved. I'll post an update if it turns out that this doesn't also work with RData.
PPS: I suspect I'm not the only one here who's hearing about the R CMD INSTALL --fake trick for the first time, and perhaps the word should be spread about this... because to the extent other R users don't know about it, this remains an obvious vector for denial-of-service attacks against R!
I will accept my own answer to get rid of the SO auto-nagger, but will unaccept it and accept @baptiste's if they make it possible for me to do so by posting it as an answer. Thanks.

Dynamically Generate Reference Classes

I'm attempting to generate reference classes within an R package on the fly, and it's proving to be fairly difficult. Here are the approaches I've taken and problems I've run into:
I'm creating a package in which I hope to be able to dynamically read in a schema and automatically generate an associated reference class (think SOAP). Of course, this means I won't be able to define my reference classes before-hand in the package sources.
I initially attempted to create a new class using a simple:
myClass <- setRefClass("NewClassName", fields=list(fieldA="character"))
which, of course, works fine when executed interactively, but when included in the package sources, I get a locked binding error. From my reading, it looks like this occurs because, when running interactively, the class information is stored in the global environment, which is not locked, while my package's base environment is locked.
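(The locked-binding behaviour is easy to reproduce outside a package; a quick illustration, not from the original post:)

e <- new.env()
e$x <- 1
lockEnvironment(e, bindings = TRUE)  # package namespaces are sealed like this
e$x <- 2  # Error: cannot change value of locked binding for 'x'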
I then found a thread that suggested using something to the effect of:
myClass <- setRefClass("NewClassName", fields=list(fieldA="character"), where=globalenv())
This actually crashed RStudio when I tried to build the package, so unfortunately I don't have a log of the error it generated, but it certainly didn't work.
Next I tried creating a new environment within my package which I could use to store these reference classes. So I added a .classEnv <- new.env() line in my package sources (not inside of any function) and then attempted to use this environment when creating a new reference class:
myClass <- setRefClass("NewClassName", fields=list(fieldA="character"), where=.classEnv)
This actually seemed to work OK, but generates the following warning:
> myClass <- setRefClass("NewClassName", where=.classEnv)
Warning message:
In getPackageName(where) :
Created a package name, ‘2013-04-23 10:19:14’, when none found
So, for some reason, methods::getPackageName() isn't able to pick up which package my new environment is in?
Is there a way to create my new environment differently so that getPackageName() can properly recognize the package? Can I add some feature which allows me to help getPackageName() detect the package? Will this even work if I can deal with the warning, or am I misusing reference classes by trying to create them dynamically?
To get the conversation going, I found that getPackageName stores the package name in a hidden .packageName variable in the specified environment.
So you can actually get around the warning with
assign(".packageName", "MyPkg", envir=.classEnv)
myClass <- setRefClass("NewClassName", fields=classFields, where=.classEnv)
which resolves the warning, but the documentation says not to trust the .packageName variable indefinitely, and I still feel like I'm hacking this in and may be misunderstanding something important about reference classes and their relationship to environments.
Full details from documentation:
Package names are normally installed during loading of the package, by the INSTALL script or by the library function. (Currently, the name is stored as the object .packageName but don't trust this for the future.)
Edit:
After reading a little further, the setPackageName method may be a more reliable way to set the package name for the environment. Per the docs:
setPackageName can be used to establish a package name in an environment that would otherwise not have one. This allows you to create classes and/or methods in an arbitrary environment, but it is usually preferable to create packages by the standard R programming tools (package.skeleton, etc.)
So it looks like one valid solution would be the following:
setPackageName("MyPkg", .classEnv)
myClass <- setRefClass("NewClassName", fields=classFields, where=.classEnv)
That eliminates the warning message and doesn't rely on anything that's documented as unstable. I'm still not clear why it's necessary, but...
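Putting it together, the package-side setup could look something like this (a sketch; doing the setPackageName call in .onLoad is my assumption, not something the docs mandate):

.classEnv <- new.env()

.onLoad <- function(libname, pkgname) {
  # Give the environment a package name so getPackageName() finds one.
  methods::setPackageName(pkgname, .classEnv)
}

make_class <- function(name, fields) {
  # Generate a reference class on the fly, storing it in .classEnv.
  methods::setRefClass(name, fields = fields, where = .classEnv)
}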
