Automating version increase of R packages

Problem
I am developing an R package and I want to increase the version automatically each time I build it, so that I can associate my results with package versions. For now I have been using my own ugly function to do that.
My question is: is there a way to do it better? Or should I avoid doing this in general?
Another option
Another option I was thinking of is to install my package (hosted on GitHub) using devtools::install_github and then save with my results (or add to plots) the GithubSHA1 that is stored in the installed DESCRIPTION file.
For example, I can get the version and GithubSHA1 like this for the devtools package:
read.dcf(file=system.file("DESCRIPTION", package="devtools"),
         fields=c("Version", "GithubSHA1"))
##      Version    GithubSHA1
## [1,] "1.5.0.99" "3ae58a2a2232240e67b898f875b8da5e57d1b3a8"
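That metadata can then be stored alongside the results themselves; a minimal sketch, where mypackage, my_results and results.rds are placeholders for your own names:
meta <- read.dcf(file=system.file("DESCRIPTION", package="mypackage"),
                 fields=c("Version", "GithubSHA1"))
## Save the result together with the version/SHA that produced it
saveRDS(list(results=my_results, meta=meta), file="results.rds")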
My tries so far
I wrote the following function to produce a new DESCRIPTION file with an updated version and date. (Increasing the major version is something I don't mind doing by hand.)
incVer <- function(pkg, folder=".", increase="patch"){
  ## Read DESCRIPTION from the installed package 'pkg' and write a new one in
  ## the specified 'folder'. The two options for 'increase' are "patch" and "minor".
  f <- read.dcf(file=system.file("DESCRIPTION", package=pkg),
                fields=c("Package", "Type", "Title", "Version", "Date",
                         "Author", "Maintainer", "Description", "License",
                         "Depends", "Imports", "Suggests"))
  curVer <- package_version(f[4])
  if(increase == "patch") {
    curVer[[1,3]] <- ifelse(is.na(curVer$patchlevel), 1, curVer$patchlevel + 1)
  } else if (increase == "minor") {
    curVer[[1,2]] <- ifelse(is.na(curVer$minor), 1, curVer$minor + 1)
    curVer[[1,3]] <- 0
  } else {
    stop(paste("Cannot identify the increase argument:", increase))
  }
  f[4] <- toString(curVer)
  ## Also update the date
  f[5] <- format(Sys.time(), "%Y-%m-%d")
  write.dcf(f, file=paste(folder, "DESCRIPTION", sep="/"))
}
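Assuming the package is installed (the function reads the installed DESCRIPTION), it would be called from the package source directory, e.g.:
incVer("mypackage")                   # 0.1.1 -> 0.1.2
incVer("mypackage", increase="minor") # 0.1.2 -> 0.2.0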

If you are using git, then you can use git tags to create a version string. This is how we generate the version string of our igraph library:
git describe HEAD --tags | rev | sed 's/g-/./' | sed 's/-/+/' | rev
It gives you a format like this:
0.8.0-pre+131.ca78343
0.8.0-pre is the last tag on the current branch. (The last released version was 0.7.1, and we create a -pre tag immediately after the release tag.) 131 is the number of commits since the last tag, and ca78343 is the first seven characters of the hex id of the last commit.
This would be great, except that R does not allow version strings like this in packages. So for R we transform this version string using the following script: https://github.com/igraph/igraph/blob/develop/interfaces/R/tools/convertversion.sh
Essentially it creates a version number that is larger than the last released version and smaller than the next versions (the one in the -pre tag). From 0.8.0-pre+131.ca78343 it creates
0.7.999-131
where 131 is the number of commits since the last release.
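The script is no longer available at that URL, but the transformation is simple enough to sketch. A rough reconstruction of the idea (not the original script):
# Sketch: turn e.g. 0.8.0-pre+131.ca78343 into 0.7.999-131
desc=$(git describe HEAD --tags | rev | sed 's/g-/./' | sed 's/-/+/' | rev)
base=${desc%%-pre*}                                       # 0.8.0
ncommits=$(echo "$desc" | sed 's/.*+\([0-9]*\)\..*/\1/')  # 131
major=$(echo "$base" | cut -d. -f1)
minor=$(echo "$base" | cut -d. -f2)
echo "$major.$((minor - 1)).999-$ncommits"                # 0.7.999-131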
I put the generation of the DESCRIPTION file in a Makefile, which fills in the date and the version number:
VERSION=$(shell ./tools/convertversion.sh)
igraph/DESCRIPTION: src/DESCRIPTION version_number
	sed 's/^Version: .*$$/Version: '$(VERSION)'/' $< | \
	sed 's/^Date: .*$$/Date: '`date "+%Y-%m-%d"`'/' > $@
This is quite convenient: you don't need to do anything except add the release tags and the -pre tags.
Btw. this was mostly worked out by my friend and igraph co-developer, Tamás Nepusz, so the credit is his.

For a simpler approach, consider using the crant tool with the -u switch. For instance,
crant -u 3
will increment the third component of the version by one. There is also Git and SVN integration, plus a bunch of other useful switches for roxygenizing, building, checking, etc.

As auto-incrementing version numbering is not going to be built into the devtools package, I figured out a way based on Gabor's answer (the link to igraph in his answer is dead, by the way).
When I am about to commit to our repository, I run this bash script to set the date to today and to set the version number based on the latest tag, the .9000 suffix (as suggested in the book R Packages by Hadley Wickham) and the number of commits within that tag:
echo "••••••••••••••••••••••••••••••••••••••••••••"
echo "• Updating package date and version number •"
echo "••••••••••••••••••••••••••••••••••••••••••••"
sed -i -- "s/^Date: .*/Date: $(date '+%Y-%m-%d')/" DESCRIPTION
# get latest tags
git pull --tags --quiet
current_tag=`git describe --tags --abbrev=0 | sed 's/v//'`
current_commit=`git describe --tags | sed 's/.*-\(.*\)-.*/\1/'`
# combine tag (e.g. 0.1.0) and commit number (like 40) increased by 9000 to indicate beta version
new_version="$current_tag.$((current_commit + 9000))" # results in 0.1.0.9040
sed -i -- "s/^Version: .*/Version: ${new_version}/" DESCRIPTION
echo "First 3 lines of DESCRIPTION:"
head -3 DESCRIPTION
echo
# ... after here more commands like devtools::document() and git commit
To be clear - this script actually makes these changes to the DESCRIPTION file.
EDIT: support for hundreds - now just increases the commit sequence number by 9000. So commit #120 in tag v0.6.1 leads to 0.6.1.9120.

Array job based on R [closed]

I am trying to construct and submit an array job based on R in the HPC cluster of my university.
I'm used to submitting array jobs based on Matlab, and I have some doubts about how to translate the overall procedure to R. Let me give a very simple Matlab example and then my questions.
The code is based on 3 files:
"main" which does some preliminary operations.
"subf" which should be run by each task and uses some matrices created by "main".
a bash file which I qsub in the terminal.
1. main:
clear
%% Do all the operations that are common across tasks
% Here, as an example, I create
% 1) a matrix A that I will sum to the output of each task
% 2) a matrix grid; each task will use some rows of the matrix grid
m=1000;
A=rand(m,m);
grid=rand(m,m);
%% Tasks
tasks=10; %number of tasks
jobs=round(size(grid,1)/tasks); %I split the number of rows of the matrix grid among the tasks
2. subf:
%% Set task ID
idtemp=str2double(getenv('SGE_TASK_ID'));
%% Select local grid
if idtemp<tasks
    grid_local= grid(jobs*(idtemp-1)+1: idtemp*jobs,:);
else
    grid_local= grid(jobs*(idtemp-1)+1: end,:); %for the last task, we should take all the rows of grid that have been left
end
sg_local=size(grid_local,1);
%% Do the task
output=zeros(sg_local,1);
for g=1:sg_local
    output(g,:)=sum(sum(A+repmat(grid_local(g,:),m,1)));
end
%% Save output by keeping track of task ID
filename = sprintf('output.%d.mat', ID);
save(filename,'output')
3. bash
#$ -S /bin/bash
#$ -l h_vmem=6G
#$ -l tmem=6G
#$ -l h_rt=480:0:0
#$ -cwd
#$ -j y
#Run 10 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 10
#$ -t 1-10
#$ -N Example
date
hostname
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
export PATH=/xx/xx/matlab/bin:$PATH
matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; subf; exit"
These are my questions:
Suppose I'm able to translate "main" and "subf" into R. Should I be extra careful about anything in particular concerning the parallelisation? For example, do I have to declare some parallel environment, such as parLapply or %dopar%?
In the "main" file I should also install some R packages. Can I install them locally in my folder directly at the beginning of the "main" file, or should I contact the HPC administrator to install them globally?
I could not find any example of bash file for R in the instructions given by my university. Therefore, I have doubts on how to re-adapt the above bash file. I suppose that the only lines to change are:
export PATH=/xx/xx/matlab/bin:$PATH
matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; subf; exit"
Could you give some hints on how I should change them?
The parallelization is handled by the HPC scheduler, right? In which case, I think "no", nothing special is required.
It depends on how they allow/enable R. On an HPC that I use (not your school), the individual nodes do not have direct internet access, so installing packages would require special care; this might be the exception, I don't know.
Recommendation: if there is a shared filesystem that both you and all of the nodes can access, then create an R "library" there that contains the installed packages you need, then use .libPaths(...) in your R scripts here to add that to the search path for packages. The only gotcha to this might be if there are non-R shared library (e.g., .dll, .so, .a) requirements. For this, either "docker" or "ask admins".
If you don't have a shared filesystem, then you might ask the cluster admins if they use/prefer docker images (you might provide an image or a DOCKERFILE to create one) or if they have preferred mechanisms for enabling various packages.
I do not recommend asking them to install the packages, for two reasons. First, think about them needing to do this for every person who has a job to run, in any number of programming languages, and realize that they may have no idea how to do it for that language. Second, package versions are very important, and asking them to install a package may install either a too-new package or overwrite an older version that somebody else is relying on. (See packrat and renv for discussions of reproducible environments.)
Bottom line: using a path you control (via .libPaths) gives you complete control over package versions. If you have not yet been bitten by the unintended consequences of newer-versioned packages, just wait ... congratulations, you've been lucky.
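A minimal sketch of that setup, assuming /shared/Rlibs is a directory on the shared filesystem (the path and package name are placeholders):
## Run once to populate the shared library:
dir.create("/shared/Rlibs", recursive=TRUE, showWarnings=FALSE)
install.packages("somepackage", lib="/shared/Rlibs")

## At the top of each script that runs on the nodes:
.libPaths(c("/shared/Rlibs", .libPaths()))
library(somepackage)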
I suggest adding source("main.R") to the beginning of subf.R, which would make your bash file perhaps as simple as
export PATH=/usr/local/R-4.x.x/bin:$PATH
Rscript /path/to/subf.R
(Noting that you'll need to reference Sys.getenv("SGE_TASK_ID") somewhere in subf.R.)
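For illustration, a rough R translation of subf along those lines (a sketch written against the Matlab code above, untested on SGE):
## subf.R -- sketch of an R translation of the Matlab "subf"
source("main.R")  # defines m, A, grid, tasks, jobs

## Set task ID from the scheduler environment
idtemp <- as.numeric(Sys.getenv("SGE_TASK_ID"))

## Select local grid
if (idtemp < tasks) {
  grid_local <- grid[(jobs*(idtemp-1)+1):(idtemp*jobs), , drop=FALSE]
} else {
  grid_local <- grid[(jobs*(idtemp-1)+1):nrow(grid), , drop=FALSE]
}

## Do the task (each row of grid_local tiled into an m x m matrix, as repmat does)
output <- apply(grid_local, 1,
                function(row) sum(A + matrix(row, m, m, byrow=TRUE)))

## Save output, keeping track of the task ID
saveRDS(output, file=sprintf("output.%d.rds", idtemp))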

How to avoid Jupyter cell-ids from changing all the time and thereby spamming the VCS diffs?

As discussed in q/66678305, newer Jupyter versions store, in addition to the source code and output of cells, an ID for the purpose of e.g. linking to a cell.
However, these IDs aren't stable; they often change even when the cell's source code was not touched. As a result, if you have the .ipynb file under version control with e.g. git, the commits end up having lots of rather funny-sounding "changed lines" that don't correspond to any actual change made in the commit. Like,
    {
     "cell_type": "code",
     "execution_count": null,
-    "id": "respected-breach",
+    "id": "incident-winning",
     "metadata": {},
     "outputs": [],
Is there a way to prevent this?
Answer for Git on Linux. Probably also works on MacOS, but not Windows.
It is good practice not to put the .ipynb files as saved by Jupyter under version control, but instead a filtered version that does not contain all the volatile information. For this purpose, various git hooks are available; the one I'm using is based on https://github.com/toobaz/ipynb_output_filter/blob/master/ipynb_output_filter.py.
Strangely enough, it turns out this script cannot simply be modified to also remove the "id" field from cells. Namely, if you try to remove that field in the filtering loop, like with
for field in ("prompt_number", "execution_number", "id"):
    if field in cell:
        del cell[field]
then the write function from jupyter_nbformat will just put an id back in. It is possible to merely change the id to something constant, but then Jupyter will complain about non-unique ids.
As a hack to circumvent this, I now use this filter with a simple grep to delete the ID:
#!/bin/bash
grep -v '^ *"id": "[a-z\-]*",$'
Store that in e.g. ~/bin/ipynb_output_filter.sh, make it executable (chmod +x ~/bin/ipynb_output_filter.sh) and ensure you have the following ~/.gitattributes file:
*.ipynb filter=dropoutput_ipynb
and in your git config (either global ~/.gitconfig or project)
[core]
attributesfile = ~/.gitattributes
[filter "dropoutput_ipynb"]
clean = ~/bin/ipynb_output_filter.sh
smudge = cat
If you want to use a standard python filter in addition to that, you can invoke it before the grep in ~/bin/ipynb_output_filter.sh, like
#!/bin/bash
~/bin/ipynb_output_filter.py | grep -v '^ *"id": "[a-z\-]*",$'
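To sanity-check the filter, you can run a notebook through it by hand and confirm the id lines are gone (notebook.ipynb is a placeholder):
~/bin/ipynb_output_filter.sh < notebook.ipynb | grep '"id"'   # should print nothing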

GNAT Metric and RTL files

For running GNAT Metric (on Windows, GPL 2017 or CE 2018) I'd like to include the RTL sources as well. There is a "-a" switch, but it seems to be ineffective. When I force visibility of the RTL sources, only ada.ads and system.ads are processed. Guessing it is a "crunched name" issue (RTL file names forced to 8-character names), I've tried other tricks without success.
My question is: is there a way to get the RTL source metrics (of the source files actually used) with GNAT Metric?
I'm using the command
gnatmetric -a -xs -nt -j0 -Pmyproj.gpr -U somemain.adb
TIA
In the meantime I've found a workaround using the gnathtml.pl script.
I've customized the script a bit by removing the H1 headers.
The result is a few hundred HTML files with the sources of the units actually used: the script finds all dependencies, recursively, through the .ali files, including the RTL.
Then I group the HTML files together, convert them back to text files, pass them through Adalog's Normalize tool to remove comments and empty lines, count lines with the wc command, and the job is done.
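For reference, the pipeline looks roughly like this (a sketch only: html2text stands in for whatever HTML-to-text converter you use, and the Normalize invocation is omitted since it depends on your Adalog setup):
gnathtml.pl somemain.adb                # writes HTML for all units used, RTL included
for f in html/*.htm; do
    html2text "$f" > "${f%.htm}.txt"    # convert each page back to plain text
done
# ...pass the .txt files through Adalog's Normalize to strip comments and blank lines...
wc -l html/*.txt                        # count the remaining source lines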

Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file

In tensorflow, training from scratch produced the following 6 files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the needed ones) into one graph.pb file to be able to transfer it to my Android application.
I tried the freeze_graph.py script, but it requires an input.pb file as input, which I do not have. (I have only the 6 files mentioned above.) How do I proceed to get this single frozen graph.pb file? I saw several threads, but none worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta'  # Your .meta file
output_node_names = ['output']       # Output node names, without the ':0' tensor suffix

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)
    # Load weights; latest_checkpoint expects the directory containing the checkpoint files
    saver.restore(sess, tf.train.latest_checkpoint('path/to/checkpoint/dir'))
    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)
    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
You can explore the graph and find the name with Netron or with the summarize_graph console utility (see the example invocation after this list).
You can use all the nodes as outputs, as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think that's an unusual situation, because if you don't know the output node, you cannot actually use the graph.
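For reference, summarize_graph is typically built and run from the TensorFlow source tree, along these lines (assuming you are in the repository root):
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=graph.pbtxt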
As it may be helpful for others, I also answer here after my answer on github ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools) :
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output, e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
The important flag here is --input_binary=false, as the file graph.pbtxt is in text format. I think it corresponds to the required graph.pb, which is the equivalent in binary format.
Concerning the output_node_names, that's really confusing for me as I still have some problems with this part, but you can use the summarize_graph script in tensorflow, which can take a pb or a pbtxt file as input.
Regards,
Steph
I tried the freeze_graph.py script, but the output_node_name parameter was totally confusing and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow models package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pbtxt file (args.input and args.output here would come from e.g. argparse):
import tensorflow as tf

with tf.Session() as sess:
    # Restore the graph
    _ = tf.train.import_meta_graph(args.input)
    # Save the graph file; as_text=True writes the text (.pbtxt) format
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the name of the output node.
Finally, use
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output, e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
to generate the frozen graph.

TeamCity Current Date variable in MMdd format

In TeamCity, is there an easy way to get a variable for the current date in the format MMdd (e.g. 0811 for 11-Aug)?
My google-fu did not turn up any existing plugins. I looked into writing a plugin, but not having a JDK installed, that looked time-consuming.
This is quite easy to do with a PowerShell build step (no plugin required) using the following source code:
echo "##teamcity[setParameter name='env.BUILD_START_TIME' value='$([DateTime]::Now)']"
or (for UTC):
echo "##teamcity[setParameter name='env.BUILD_START_TIME' value='$([DateTime]::UtcNow)']"
This uses TeamCity's Service Messages feature, which allows you to interact with the build engine at runtime, e.g. to set build parameters.
You can then reference this build parameter from other places in TeamCity using the syntax %env.BUILD_START_TIME%
The advantage of this approach is you don't need to use a plugin. The disadvantage is you need to introduce a build step.
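Note that [DateTime]::Now yields a full date and time; for the MMdd format asked about here, a .NET format string does it (env.BUILD_START_MMDD is just a name I picked):
echo "##teamcity[setParameter name='env.BUILD_START_MMDD' value='$([DateTime]::Now.ToString('MMdd'))']"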
For Unix-based build agents I propose the following custom script as one of the build steps:
export current_build_date_format="+%%Y.%%m.%%d"
export current_build_date="$(date $current_build_date_format)"
echo "##teamcity[setParameter name='env.current_build_date' value='$current_build_date']"
You have to double the % sign so that the FORMAT string passed to the date executable (see %Y.%m.%d) is not interpreted as a reference to an already existing TeamCity variable.
The Groovy Plugin for TeamCity provides build start date/time properties:
Provides build properties:
system.build.start.date / env.BUILD_START_DATE
system.build.start.time / env.BUILD_START_TIME
This blog post has installation / configuration instructions for the Groovy plugin, as well an example of customizing the date/time format.
You can also try the Date Build Number plugin. It provides an additional variable in the build number format rather than a build property.
Similar to the Date Build Number plugin mentioned in this answer, there exists a derived plugin called Formatted Date Parameter. It provides a customizable parameter build.formatted.timestamp that can be used out of the box in fields or other parameters. No need for a separate build step.
An old question, but for those looking for a solution now, there is a system parameter available:
system.buildStartTime
You need to declare it in the build configuration (it's not available until runtime) in order to use it. I set mine to the value [Filled Automatically].
As you can guess, this time is set to the build start time, so it may not be ideal for some needs. But it's easy and reliable.
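Once declared, it can be referenced like any other parameter, e.g. in a command-line build step:
echo "Build started at %system.buildStartTime%"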
To add a dated folder to my build in TeamCity, I added the following to my custom script. What had me stuck was the double % sign in the date string. D'oh!
TARGET_DIR=/Users/admin/build/daily
TARGET=$(date "+%%Y-%%m-%%d")
if [ ! -d ${TARGET_DIR} ]; then
mkdir -vp ${TARGET_DIR}/
fi
mv -v build.dmg ${TARGET_DIR}/build_${TARGET}.dmg
If you only want a one-line bash command in a build step, just use the following:
echo "##teamcity[setParameter name='build.timestamp' value='$(date +%%m%%d)']"
(The double % symbol is TeamCity's escape rule for a literal % character.)
It sets an MMdd parameter value right after execution at runtime, so it is very useful to put in any build step; you can then retrieve the parameter value afterwards.
Note that you must first create the build.timestamp parameter in the TeamCity project.
Going a step further, I made a simple bash script that sets a timestamp, in any bash-supported datetime format, on any given TeamCity parameter name:
name="" # TeamCity parameter name
format="%Y-%m-%dT%H:%M:%S%z" # ISO8601 format by default
result="" # a TeamCity parameter value to be set
for ARGUMENT in "$#"
do
KEY=$(echo "$ARGUMENT" | cut -f1 -d=)
VALUE=$(echo "$ARGUMENT" | cut -f2 -d=)
case "$KEY" in
name) name=${VALUE} ;;
format) format=${VALUE} ;;
*)
esac
done
result=$(date "+$format")
echo "##teamcity[setParameter name='$name' value='$result']"
The usage below sets an ISO8601 timestamp on the build.timestamp parameter:
./teamcity_datetime.sh name=build.timestamp
If you want only MMdd, the invocation would be:
./teamcity_datetime.sh name=build.timestamp format="%%m%%d"
