Using reticulate with targets - r

I'm having a weird issue where a target that interfaces with a slightly customized Python module (installed with pip install --editable) through reticulate gives different results when tar_make() is called from an interactive R session than when targets is started directly from the command line, even when I make sure the other argument(s) to tar_make() are identical (callr_function = NULL, which I use for interactive debugging). The function is deterministic and should return exactly the same result, but it doesn't.
It's tricky to provide a reproducible example, but if truly necessary I'll invest the required time in it. For now I'd like tips on how to debug this and identify the exact issue. I have already safeguarded against potential pointer issues: the Python object is not passed around between different targets/environments (anymore); rather, it is used immediately to compute the result of interest. I also checked that the same Python version is being used by printing the result of reticulate::py_config() to the screen, and I verified that both approaches use the same version of the customized module.
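For reference, the checks I ran look roughly like this ("mymodule" is a placeholder for the editable module, not its real name):
library(reticulate)
print(py_config())                      # confirm both ways of running use the same interpreter
mod <- import("mymodule")               # placeholder name for the pip --editable module
print(py_get_attr(mod, "__file__"))     # confirm both ways of running load the same copy of the module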
Thanks in advance..!

Related

'ConcatenatedDoc2Vec' object has no attribute 'docvecs'

I am a beginner in machine learning and am trying document embedding for a university project. I work with Google Colab and Jupyter Notebook (via Anaconda). The problem is that my code runs perfectly in Google Colab, but if I execute the same code in Jupyter Notebook (via Anaconda) I run into an error with the ConcatenatedDoc2Vec object.
With this function I build the feature vectors for a classifier (e.g. logistic regression):
def build_vectors(model, length, vector_size):
    vector = np.zeros((length, vector_size))
    for i in range(0, length):
        prefix = 'tag' + '_' + str(i)
        vector[i] = model.docvecs[prefix]
    return vector
I concatenate two Doc2Vec models (d2v_dm, d2v_dbow); both work perfectly through the rest of the code and have no problems with the function build_vectors():
d2v_combined = ConcatenatedDoc2Vec([d2v_dm, d2v_dbow])
But if I run the function build_vectors() with the concatenated model:
#Compute combined Vector size
d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
I receive this error (but only when I run this in Jupyter Notebook via Anaconda; the same code runs without problems in Google Colab):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [20], in <cell line: 4>()
1 #Compute combined Vector size
2 d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
----> 4 d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
Input In [11], in build_vectors(model, length, vector_size)
3 for i in range(0, length):
4 prefix = 'tag' + '_' + str(i)
----> 5 vector[i] = model.docvecs[prefix]
6 return vector
AttributeError: 'ConcatenatedDoc2Vec' object has no attribute 'docvecs'
This is mysterious to me: the code works in Google Colab but not in Jupyter Notebook (via Anaconda), and I did not find anything on the web that solves the problem.
If it's working in one place but not the other, you're probably using different versions of the relevant libraries – in this case, Gensim.
Does the following show exactly the same version in both places?
import gensim
print(gensim.__version__)
If not, the most immediate workaround would be to make the place where it doesn't work match the place where it does, by force-installing the same explicit version – pip install gensim==VERSION (where VERSION is the target version) – then restarting your notebook so it sees the change.
Beware, though, that unless starting from a fresh environment, this could introduce other library-version mismatches!
Other things to note:
Last I looked, Colab was using an over-4-year-old version of Gensim (3.6.0), despite more recent releases with many fixes and performance improvements. It's often best to stay at or close to the latest versions of any key libraries used by your project; this answer describes how to trigger the installation of a more recent Gensim at Colab. (Though of course, the initial effect of that might be to cause the same breakage at Colab, since your code is adapted to the older version.)
In more-recent Gensim versions, the property formerly called docvecs is now called just dv - so some older code erroring this way may only need docvecs replaced with dv to work. (Other tips for migrating older code to the latest Gensim conventions are available at: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 )
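For example, a small version-tolerant accessor (just a sketch, assuming only the docvecs-to-dv rename described above) could look like:
def doc_vector(model, tag):
    # Gensim 4.x exposes .dv; Gensim 3.x exposed .docvecs
    kv = model.dv if hasattr(model, 'dv') else model.docvecs
    return kv[tag]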
It's unclear where you're pulling the ConcatenatedDoc2Vec class from. A class of that name exists in some Gensim demo/test code, as a very minimal shim class that was at one time used in attempts to reproduce the results of the original "Paragraph Vector" (aka Doc2Vec) paper. But beware: that's not a usual way to use Doc2Vec, and the class of that name that I know of barely does anything outside its original narrow purpose.
Further, beware that as far as I know, no one has ever reproduced the full claimed performance of the two-kinds-of-doc-vectors-concatenated approach reported in that paper, even using the same data, described technique, and evaluation. The claimed results likely relied on some other undisclosed techniques, or on some error in the write-up. So if you're trying to mimic that, don't get too frustrated. Most uses of Doc2Vec just pick one mode.
If you have your own separate reasons for creating concatenated feature vectors from multiple algorithms, you should probably write your own code for that, not limited to the peculiar two-modes-of-Doc2Vec code from that one experiment.
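If you do go that route, a minimal sketch (hypothetical names, mirroring the build_vectors() function above) might look like:
import numpy as np

def build_combined_vectors(models, length):
    total_size = sum(m.vector_size for m in models)
    combined = np.zeros((length, total_size))
    for i in range(length):
        tag = 'tag' + '_' + str(i)
        # use m.docvecs[tag] instead of m.dv[tag] on Gensim 3.x
        combined[i] = np.concatenate([m.dv[tag] for m in models])
    return combined

# d2v_combined_vec = build_combined_vectors([d2v_dm, d2v_dbow], len(X_tagged))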

R cmd check: how to avoid error for examples when using internal functions

I am developing a small R package to define custom color palettes. I have two functions that I expose to the user with the @export roxygen tag. Both of these functions rely on an internal function, which I marked with @keywords internal. When I run devtools::check() I get an error that the examples provided for my two exported functions could not be run because the internal function was not found. This happens whether I prefix the internal function with my_package::: or call it directly (just my_internal_function). I found this blog post which states that the example code can be wrapped in \dontrun{}, which is what I do to avoid the error from R CMD check. However, this seems a bit of a hack, so what is the preferred way to deal with examples that rely on an internal function?
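For illustration, the setup looks roughly like this (the function names are made up, not my real code):
#' Internal helper
#' @keywords internal
make_palette <- function(n) grDevices::hcl.colors(n, "viridis")

#' Exported palette function
#' @examples
#' my_palette(5)
#' @export
my_palette <- function(n) make_palette(n)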
Update:
I cannot reproduce the problem anymore :-/ Sorry for the bother; it now seems to run without issues. I don't know what is different now, but the problem is more or less solved.

How can I add a breakpoint to debug my R package without recompiling

When I develop an R package I end up doing the following:
load the latest build
use it
realize there is a bug in say function 'f_bug'
try to debug
It would be easy if I could just 're-source' f_bug so that the newly sourced version gets picked up (I'd rebuild the package cleanly later).
But I cannot do that; it looks like package::f_bug is always "chosen" by default when called from within another package function.
Can I do such a thing?
You can't use RStudio's convenient graphical breakpoints, but you can do the same thing using trace:
trace(package::f_bug, browser, at = insertion_point)
Here insertion_point refers not to line numbers but to a vector of substeps. From ?trace:
look at ‘as.list(body(f))’ to get the numbers
associated with the steps in function ‘f’.)
Another option might be to use utils::setBreakpoint which takes a file name and line number as arguments. See the help file for details.
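A sketch of both approaches ("mypkg" and "f_bug" are placeholders for your package and function):
f <- getFromNamespace("f_bug", "mypkg")
as.list(body(f))                    # shows the numbered substeps usable with `at`

trace("f_bug", browser, at = 3, where = asNamespace("mypkg"))
# ... call the function as usual; browser() is entered at substep 3 ...
untrace("f_bug", where = asNamespace("mypkg"))

# or, with source references available, break on a file and line number:
# utils::setBreakpoint("R/f_bug.R", 12)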

Trouble understanding the concept of packages in Common Lisp

A while ago I started learning Common Lisp, but now I have come to my first real stumbling block: understanding a concept. I started to change my learning projects to move from single-file sources to packages. Everything so far went as expected, but then I stumbled upon one file, a sudoku game I coded, that behaves differently than I thought. You can find it here: https://github.com/Silberbogen/cl-sudoku
When I start (spiele-sudoku) after switching into the package via (in-package :cl-sudoku), everything works fine, but when I start it via (cl-sudoku:spiele-sudoku), only my input of coordinates is accepted, while any other input seems not to be interpreted.
What concept am I missing that would let me start the game via (cl-sudoku:spiele)?
You use read-from-string to read your input. That will intern any word encountered as a symbol into the current package.
In your main function, you use case to compare with symbols, but those are interned into the cl-sudoku package. So, if your current package is cl-sudoku, it will work, otherwise not.
You should not use read or read-from-string to parse user input (if you absolutely must, at least bind *read-eval* to nil). Instead, call intern yourself (possibly in combination with string-upcase) to create symbols in the right package. If you want to use package-independent symbols, intern them into the KEYWORD package, so that you can use case on keywords.
It might be helpful to use ecase or ccase, or at least log some debug information on invalid input.
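A minimal sketch of that approach (the function and command names are made up, not taken from cl-sudoku):
(defun parse-command (line)
  "Turn a line of user input into a keyword, independent of *PACKAGE*."
  (intern (string-upcase (string-trim " " line)) :keyword))

(defun dispatch (line)
  (case (parse-command line)
    (:hilfe (format t "Showing help~%"))
    (:ende  (format t "Quitting~%"))
    (t      (format t "Unknown command: ~a~%" line))))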

How can I remove a lock from a linked environment in R?

I tried to run a Bioconductor package function (truncateCDF) that modifies an environment (hgu133plus2cdf) to remove unwanted probesets from an Affymetrix chip.
Everything went fine until I got the following message (translated from French):
> assign(cdfname, cdf.env, env=CDF.env)
Error in assign(cdfname, cdf.env, env = CDF.env) :
impossible to change the value of a locked link for 'hgu133plus2cdf'
The assign call is the final step of the code; it saves the changes made to the environment dataset CDF.env back into the original environment (hgu133plus2cdf) before it is used in analyses of Affymetrix chip results, so it is essential.
My question: what is this locked link to the hgu133plus2cdf environment, and how could I bypass it?
The author of this package successfully ran it around 2005, so I suppose this locking is a feature introduced in R since then (probably not related to Bioconductor, since assign is a base R function, which is why I ask here instead of on Biostar).
I tried to read the docs, but I am overwhelmed when it comes to environments.
Thanks in advance for any help.
I don't think truncateCDF is from a Bioconductor package; at least it is not from a current one. This sounds like this post and the next two from the same thread on the Bioconductor mailing list. It is a result of a change in R: packages now have not-easily-modified name spaces, and these are implemented by locking the environment in which name-space symbols are defined. Removing probes is not an essential part of a typical microarray workflow. Please ask on the Bioconductor mailing list (no subscription required) if you'd like more help.
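For illustration (not specific to the CDF environments involved here), this is how a locked binding behaves in a plain environment:
e <- new.env()
assign("x", 1, envir = e)
lockBinding("x", e)
try(assign("x", 2, envir = e))   # error: cannot change value of locked binding for 'x'
unlockBinding("x", e)
assign("x", 2, envir = e)        # succeeds once the binding is unlocked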
