How do I update scripts from OpenMx 1 to OpenMx 2?

I have an example OpenMx script written a few years ago to do twin modelling.
It was written for OpenMx version 1.0 (script linked here).
When I run it, there are some warnings about updating fit functions and objectives. How should I update it to use OpenMx 2.0 fit function calls?

There are a small number of changes from OpenMx 1.0 to 2.0 and higher. Nearly all old scripts will run fine, but some pre-2012 scripts will not, and most will benefit from new features if you update them for OpenMx 2.x.
An example is referenced here.
The user had hassles with:
1. No path to the helper functions
This is a more general robustness issue for example R code: it is better to source helper functions from web URLs than from disk-based file paths, for example:
source("http://www.vipbg.vcu.edu/~vipbg/Tc24/GenEpiHelperFunctions.R")
A better solution still is a CRAN-based helper package such as umx, which is easier to keep up to date and to access.
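For example, instead of sourcing helpers from a URL, you could load a CRAN package (a sketch; it assumes you want the umx twin-modelling helpers rather than the VCU script):
install.packages("umx")  # one-time install from CRAN
library(umx)             # load the helper functions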
2. Old-style objectives (instead of expectations and fit functions)
Calls like this one are deprecated:
objMZ <- mxFIMLObjective(covariance = "expCovMZ", means = "expMean", dimnames = selVars)
It’s easy to update these across a stack of scripts: replace mxFIMLObjective with mxExpectationNormal plus a call to mxFitFunctionML.
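For example, the deprecated call above might become something like this (a sketch reusing the names from the original script, so expCovMZ, expMean, and selVars are assumed to already exist in the model):
expMZ <- mxExpectationNormal(covariance = "expCovMZ", means = "expMean", dimnames = selVars)
fitML <- mxFitFunctionML()
# both objects then go into the MZ mxModel() in place of objMZ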
In addition, old-style multiple-group objectives look like this:
minus2ll <- mxAlgebra( expression = MZ.objective + DZ.objective, name="m2LL")
obj <- mxAlgebraObjective("m2LL")
Here you should replace mxAlgebraObjective with mxFitFunctionAlgebra.
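A sketch of the updated pair, keeping the algebra from the snippet above:
minus2ll <- mxAlgebra(expression = MZ.objective + DZ.objective, name = "m2LL")
obj <- mxFitFunctionAlgebra("m2LL")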
However, OpenMx 2 has a neat multigroup fit function which handles this in one line and also enables identification checks, reference-model generation, etc.
So you can just replace the whole thing with (for example):
mxFitFunctionMultigroup(c("MZ", "DZ"))

Related

'ConcatenatedDoc2Vec' object has no attribute 'docvecs'

I am a beginner in machine learning and am trying document embedding for a university project. I work with Google Colab and Jupyter Notebook (via Anaconda). The problem is that my code runs perfectly in Google Colab, but if I execute the same code in Jupyter Notebook (via Anaconda) I run into an error with the ConcatenatedDoc2Vec object.
With this function I build the vector features for a Classifier (e.g. Logistic Regression).
import numpy as np

def build_vectors(model, length, vector_size):
    vector = np.zeros((length, vector_size))
    for i in range(0, length):
        prefix = 'tag' + '_' + str(i)
        vector[i] = model.docvecs[prefix]
    return vector
I concatenate two Doc2Vec models (d2v_dm, d2v_dbow); both work perfectly through the whole code and have no problems with the function build_vectors():
d2v_combined = ConcatenatedDoc2Vec([d2v_dm, d2v_dbow])
But if I run the function build_vectors() with the concatenated model:
#Compute combined Vector size
d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
I receive this error (but only if I run this in Jupyter Notebook via Anaconda; there is no problem with this code in the notebook in Google Colab):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [20], in <cell line: 4>()
1 #Compute combined Vector size
2 d2v_combined_vector_size = d2v_dm.vector_size + d2v_dbow.vector_size
----> 4 d2v_combined_vec= build_vectors(d2v_combined, len(X_tagged), d2v_combined_vector_size)
Input In [11], in build_vectors(model, length, vector_size)
3 for i in range(0, length):
4 prefix = 'tag' + '_' + str(i)
----> 5 vector[i] = model.docvecs[prefix]
6 return vector
AttributeError: 'ConcatenatedDoc2Vec' object has no attribute 'docvecs'
This is mysterious to me: it works in Google Colab but not in Anaconda/Jupyter Notebook, and I did not find anything on the web to solve my problem.
If it's working one place, but not the other, you're probably using different versions of the relevant libraries – in this case, gensim.
Does the following show exactly the same version in both places?
import gensim
print(gensim.__version__)
If not, the most immediate workaround would be to make the place where it doesn't work match the place where it does, by force-installing the same explicit version – pip install gensim==VERSION (where VERSION is the target version) – then restarting your notebook so it sees the change.
Beware, though, that unless starting from a fresh environment, this could introduce other library-version mismatches!
Other things to note:
Last I looked, Colab was using an over-4-year-old version of Gensim (3.6.0), despite more-recent releases with many fixes & performance improvements. It's often best to stay at, or closer to, the latest versions of any key libraries used by your project; this answer describes how to trigger the installation of a more-recent Gensim at Colab. (Though of course, the initial effect of that might be to cause the same breakage at Colab, since your code is currently adapted to the older version.)
In more-recent Gensim versions, the property formerly called docvecs is now called just dv - so some older code erroring this way may only need docvecs replaced with dv to work. (Other tips for migrating older code to the latest Gensim conventions are available at: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 )
It's unclear where you're pulling the ConcatenatedDoc2Vec class from. A class of that name exists in some Gensim demo/test code, as a very minimal shim class that was at one time used in attempts to reproduce the results of the original "Paragraph Vector" (aka Doc2Vec) paper. But beware: that's not a usual way to use Doc2Vec, and the class of that name I know barely does anything outside its original narrow purpose.
Further, beware that as far as I know, no one has ever reproduced the full claimed performance of the two-kinds-of-doc-vectors-concatenated approach reported in that paper, even using the same data, described technique, and evaluation. The claimed results likely relied on some other undisclosed techniques, or on some error in the write-up. So if you're trying to mimic that, don't get too frustrated. And know that most uses of Doc2Vec just pick one mode.
If you have your own separate reasons for creating concatenated feature-vectors, from multiple algorithms, you should probably write your own code for that, not limited to the peculiar two-modes-of-Doc2Vec code from that one experiment.

Correct usage of drake::expose_imports() - Where to place call - Is it recursive?

Summary
I've noticed hints/suggestions/warnings in the drake docs suggesting use of expose_imports to ensure that changes in imported packages are tracked reproducibly, but the docs are relatively brief on the correct usage of this.
Example
I've now witnessed an example of the behaviour expose_imports is designed to correct in my own usage of drake, and I'd like to start using it.
In my case, the dependency that wasn't tracked was forcats, which, in version 0.4.0, had a bug in fct_collapse (used by one of my functions) that would assign incorrect groups to the output factor.
Version 0.4.0.9000 resolved this bug, and I updated to it some time ago, but I noticed that targets that must have run against the old version were not invalidated.
Question
I'm guessing that this is a problem that expose_imports might mitigate, but I don't really understand how / where to use it.
If I make scoped calls to my.package in my drake plans like so:
plan <- drake::drake_plan(
  mtc = mtcars,
  mtc_xformed = my.package::transfom_mtc(mtc)
)
and my.package::transform_mtc() has some dependency on another package (e.g. forcats), then:
where should I be calling expose_imports?
In the prework argument of make?
In the top level of a file in my.package/R/ ?
Should I be calling
expose_imports("my.package") ? or
expose_imports("forcats")
Some clarification of this would be awesome
expose_imports() is mostly for packages you update/reinstall a lot. For example, say you write a package to implement a new statistical method, and the package is still under active development. Meanwhile, you are also writing a journal article about the method, and you have a reproducible drake pipeline to run simulation studies and compile the manuscript. Here, it is important to refresh the paper when you make changes to the package. In the project archetype here, your R/packages.R file would look something like this:
library(drake)
library(tidyverse)
library(yourCustomPackage)
expose_imports(yourCustomPackage)
Then, the plan can use functions from yourCustomPackage.
plan <- drake_plan(
  analysis = custom_method(...) # from yourCustomPackage
  # ...
)
Now, drake will invalidate targets in response to changes in custom_method(), along with any nested dependency functions of custom_method() in yourCustomPackage, and the dependencies of those dependencies in yourCustomPackage, etc. (Check vis_drake_graph() to see for yourself.)
expose_imports() is usually something I only recommend for packages directly related to the content of your research. It is not something I usually recommend for utilities like forcats. For those packages, I recommend renv to prevent unexpected changes from happening to begin with. In your case, I would update forcats, lock it down with renv, invalidate the targets you know depend on forcats, and trust that future changes to forcats are unlikely to be necessary.
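For example, a minimal sketch of locking a project down with renv (assuming renv is installed and you are working inside the project directory):
renv::init()      # create a project-private library and an renv.lock lockfile
# ...update forcats and rerun the affected targets...
renv::snapshot()  # record the current package versions in renv.lock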
Scoped calls like my.package::transfom_mtc(mtc) tell drake to track transfom_mtc(), but not any unscoped dependency functions called from my.package::transfom_mtc(mtc). This is a one-foot-in, one-foot-out behavior that I no longer agree with. Next chance I get, I will make drake stop tracking these calls.

Are there any good resources/best-practices to "industrialize" code in R for a data science project?

I need to "industrialize" an R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow even for people who have not worked on the project before and they should be able to redo the whole workflow quite quickly. Therefore I am looking for tips, suggestions, resources and best-practices on how to achieve this objective.
Thank you for your help in advance!
You can make an R package out of your project, because a package has everything you need for a standalone project that you want to share with others:
It is easy to share, download and install.
R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it enables you to document every function precisely, and it makes the code clearer since you can avoid inline comments (but please do add them anyway where needed).
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv.
R also provides a long-format documentation system called vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you will write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, and so on (see the sketch below). Once the package is installed, the vignettes are automatically included and available to all users.
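As a sketch, one common way to create a vignette skeleton is via the usethis package (an assumption on my part; you can also write the .Rmd file by hand):
usethis::use_vignette("project-workflow", title = "Project workflow")  # run once inside the package project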
The only downside is the following: since R is a functional programming language, a package consists mainly of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about that last point: if your project consists of a script that calls a set of functions to do something, the script cannot directly appear within the package. Two options here: a) you write a dispatcher function that runs a set of functions to do the job, so that users just have to call one function to run the whole method (not really good for maintenance); b) you put the whole script in a vignette (see above). With the second option, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole workflow from a terminal or a remote machine with Rscript, with the following (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
And finally, if you write a bash file that calls Rscript, you can automate everything!
Feel free to read Hadley Wickham's book about R packages; it is super clear, full of best practices, and a great help in writing your packages.
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed so that the outputs are reproducible.
Documentation is important: you can use the roxygen skeleton in RStudio (default Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts and use a main.R script that sources the others (see the sketch after this list).
If you use a special set of libraries, you can consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
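As a sketch, such a main.R might look like this (the script names are hypothetical):
set.seed(42)                     # fix the random seed for reproducibility
source("R/01_load_data.R")
source("R/02_clean_data.R")
source("R/03_fit_models.R")
source("R/04_report_results.R")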

How to use/install merge method for data.sets from memisc package?

I have two data.sets (from the memisc package) all set for merge, and the merge goes through without error or warning, but the output is a data.frame, not a data.set. The command is:
datTS <- merge(datT1, datT2, by.x="ryear", by.y="ryear")
(Sorry I don't have a more convenient example with toy data handy.) The following pages seem to make it very clear that there should be a method built into memisc that properly merges the data.sets into one data.set:
http://rpackages.ianhowson.com/rforge/memisc/man/dataset-manip.html
https://github.com/melff/memisc/blob/master/pkg/R/dataset-methods.R
...but it just doesn't seem to be properly triggering on my machine (sorry also for my clumsy lingo). Note the similarity of my code and the example code from the very end of the first page I linked:
ds6 <- merge(ds1,ds5,by.x="a",by.y="c")
I've verified that I have the most recent versions of R, RStudio, memisc, and all dependencies. I've used a number of other memisc methods so far (within, transform, missing.values, etc.) without issue.
So my question is: what else does one need to do to get the merge function to properly produce a data.set when the source data are in data.set form, as per the memisc package? (There's no explicit addressing of this merge capability in the official package documentation.) Since the code in the second link above seems to provide the method for this, is there some workaround, at least, for installing and utilizing that code? Maybe there's just some separate "methods installation" I'm not aware of (but why would it be separate from the main package?).
The help page for pkg:memisc in the released version 0.97 does not describe a merge method for data.sets. You are pointing us to the GitHub version, which may not be the one that has been released. You need to install the GitHub version. See: https://github.com/melff/memisc/releases
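As a sketch, one way to install the development version from GitHub (this assumes the devtools package, and that the package sources live in the pkg/ subdirectory, as the repository layout linked above suggests):
# install.packages("devtools")  # if not already installed
devtools::install_github("melff/memisc", subdir = "pkg")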

when do you want to set up new environments in R

Per the discussion of R programming styles, I saw someone once say that he puts all his custom functions into a new environment and attaches it. I also remember that an R environment may be used as a hash table. Is this good style? When do you want to put your data/functions into a new environment? Or should you just use .GlobalEnv for everything?
EDIT: putting the second part of my question back:
How do I inspect a variable with the same name in different environments?
Martin Mächler suggests that this is the one time you might want to consider attach(), although he suggested it in the context of attaching an .Rdata file to the search path; your Q is essentially the same thing.
The advantage is that you don't clutter the global environment with functions that might get overwritten accidentally, etc. Whilst I wouldn't go so far as to call this bad style, you might be better off sticking your custom functions into your own personal R package. Yes, this will incur a bit of overhead in setting up the package structure and providing some documentation to allow the package to be installed, but in the longer term this is a better solution. With tools like roxygen this process is getting easier to boot.
Personally, I haven't found a need for fiddling with environments in 10+ years of using R; well documented scripts that load, process and analyse data, cleaning up after themselves all the while have served me well thus far.
Another suggestion for the second part of your question (now deleted) is to use with() (following on from @Joshua's example):
> .myEnv <- new.env()
> .myEnv$a <- 2
> a <- 1
> str(a)
num 1
> ls.str(.myEnv, a)
a : num 2
> str(.myEnv$a)
num 2
> with(.myEnv, a)
[1] 2
> a
[1] 1
If your ecosystem of data and code has grown large enough that you are considering isolating it in an environment, you are better off creating a package. A package gives you much more support for:
Managing a project that is growing large and complex by separating code and data into files so there is less to dig through at one time.
A package makes it dead simple to hand off your work to someone else so they can use your code and data.
A package provides additional support for documentation and reporting.
Setting up a package for R is so easy (just call package.skeleton()) that every project I work on gets its code and data stored in a package.
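A minimal sketch of that workflow (the function and package names are hypothetical):
dothis <- function() message("doing this")   # example functions in the global environment
dothat <- function() message("doing that")
package.skeleton(name = "myProjectPkg", list = c("dothis", "dothat"))  # writes a package skeleton to ./myProjectPkg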
The only time I use environments is when I need to isolate a run of some code, usually a script written by someone else, so that its side effects and variable names don't get crossed with mine. I do this with evalq(source('someScript.R', local = TRUE), SomeEnvironment).
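A minimal sketch of that pattern (someScript.R is a placeholder for the external script):
SomeEnvironment <- new.env()
evalq(source("someScript.R", local = TRUE), SomeEnvironment)
ls(SomeEnvironment)  # objects created by the script live here, not in .GlobalEnv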
To answer your second question (that you've now deleted), use ls.str, or just access the object in the environment with $:
.myEnv <- new.env()
.myEnv$a <- 2
a <- 1
str(a)
ls.str(.myEnv, a)
str(.myEnv$a)
