Particle Swarm Optimization Calling Counts - r

While trying to use the pso or hydroPSO package from CRAN, I need to access the counts (current iteration, current number of function evaluations, and current restart) from within the function I've been writing. However, I can't seem to wrap my head around this. Any suggestions on how to access the current iteration/evaluation/restart counts within the objective function would be great. A piece of example code would be appreciated, as I don't fully understand the documentation.
Background:
My function requires the iteration number because it is a wrapper around code written in FORTRAN, where the input files are generated and the output files are read back into R. I want the iteration number to survive so that I can return to the previous output files for further analysis. An example of this would be:
~/runs/<restart #>/<iteration>/<particle>/input/
~/runs/<restart #>/<iteration>/<particle>/output/
The wrapper function accepts the parameters, automatically generates the input files, runs the FORTRAN model, then parses the output and post-processes it (e.g. performance index calculations).
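Neither package passes these counters to the objective function as far as I know, so one workaround is to keep them yourself in a closure. Below is a minimal sketch only: make_objective, the swarm size s, and the assumption that evaluations happen in particle order within each iteration are all mine, not documented pso/hydroPSO behaviour.

# Sketch: derive iteration/particle indices from a self-maintained evaluation count.
make_objective <- function(s = 40, restart = 1) {
  evals <- 0L
  function(par) {
    evals <<- evals + 1L                  # running function-evaluation count
    iter     <- (evals - 1L) %/% s + 1L   # iteration, assuming s evaluations per iteration
    particle <- (evals - 1L) %%  s + 1L   # particle index, assuming particle-order evaluation
    run_dir  <- file.path("~/runs", restart, iter, particle)
    # write file.path(run_dir, "input"), run the FORTRAN model,
    # read file.path(run_dir, "output"), post-process, and return the
    # performance index; a placeholder value stands in here:
    sum(par^2)
  }
}

library(pso)
obj <- make_objective(s = 40, restart = 1)
res <- psoptim(rep(NA, 2), fn = obj, lower = -5, upper = 5,
               control = list(s = 40, maxit = 10))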

Related

Trying to automate an R script that always runs against one dataset and conditionally against another

Very new to R and trying to modify a script to help my end users.
Every week a group of files is produced, and my modified script reaches out to the network, makes the necessary changes and puts the files back, all nice and tidy. However, every quarter there is a second set of files that needs the EXACT same transformation. My thought was to check whether the files exist on the network with a file.exists statement, run through the script for them, and then continue with the normal weekly one, but with my limited experience I can only think of writing it this way ("lots of stuff" is a couple hundred lines), and I'm sure there's something I can do other than double the size of the program:
if (file.exists("quarterly.txt")) {
  # do lots of stuff
} else {
  # do lots of stuff
}
Both starja and lemonlin were correct: my solution was basically to turn my program into a function and create a program that calls the function with each dataset. I also skipped the 'else' portion of my if statement, which works perfectly (for me).
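A minimal sketch of that refactor (process_files and the file names are illustrative): the few hundred lines become one parameterised function, and the quarterly file is only handled when it exists.

process_files <- function(input_file) {
  # ... the couple hundred lines of "stuff", parameterised by input_file ...
}

process_files("weekly.txt")
if (file.exists("quarterly.txt")) {
  process_files("quarterly.txt")
}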

Model namespaces in code completion with R - or how to organize R code

This is more of a general code structuring question.
At the moment I try to write my code into "namespaces". So for example, I would have:
Mine.FancyPlot.Plot(...)
Mine.FancyPlot.Impl.PlotCanvas(...)
Mine.FancyPlot.Impl.PlotLegend(...)
Mine.BasicPlot.Plot(...)
Mine.BasicPlot.Impl.PlotCanvas(...)
Mine.BasicPlot.Impl.PlotLegend(...)
Mine.BasicPlot.Impl.PlotLines(...)
The idea is that I am trying to hide away "private" functions in an "Impl" (for implementation) namespace. So outside of Mine_FancyPlot.R I wouldn't call Mine.FancyPlot.Impl functions.
This approach works reasonably well, except code completion isn't as nice as it could be.
To begin with, when I type Mine.BasicPlot. and hit TAB, I get all functions, including the Impl ones, and because "I" sorts before "P" they even hide the "public" user functions.
So I started changing the structure to
MyPub.FancyPlot.Plot(...)
MyPriv.FancyPlot.PlotCanvas(...)
MyPriv.FancyPlot.PlotLegend(...)
MyPub.BasicPlot.Plot(...)
MyPriv.Mine.BasicPlot.PlotCanvas(...)
MyPriv.Mine.BasicPlot.PlotLegend(...)
MyPriv.Mine.BasicPlot.PlotLines(...)
This works better in that "private" functions are no longer suggested. However, I still have the issue that if I type MyPub. and hit TAB, I can't actually see all the different "namespaces" (as I would in Java, C++, ...), but rather a long list of functions, all starting with the first "namespace".
Ideally, I'd like code completion in R to cut off all predictions at the next dot and de-duplicate them, so that when I type MyPub. and hit TAB, I would only get a list of "sub-namespaces" and the functions directly in MyPub.
Is this possible? Can the code prediction be altered to reflect this behaviour? Or is there a better way to achieve what I am aiming for?
You should consider putting your functions in a package to organise them. Functions that are not exported will only be accessible via package:::functionNotExported and will not be listed when you type functionNotExpo[TAB].
See for instance debugging a function in R that was not exported by a package
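A rough sketch of that layout (the package name mineplots and these function names are illustrative, not from the question): only the public entry point is exported, so completion suggests just fancyPlot, while the helpers stay reachable only through mineplots:::.

# R/fancy_plot.R inside an illustrative package called "mineplots"
fancyPlot <- function(x, y) {
  .plotCanvas(x, y)
  .plotLegend()
}

# "Private" helpers: not exported, so they never show up in completion.
.plotCanvas <- function(x, y) plot(x, y, type = "n")
.plotLegend <- function() legend("topright", legend = "series 1", lty = 1)

The package's NAMESPACE then contains only export(fancyPlot); anyone who really needs a helper can still call mineplots:::.plotCanvas, as the linked question on unexported functions shows.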

Avoiding using global objects when building an R package with multiple separate functions

I have built an R package that runs a complex Bayesian model (a Dirichlet Process Mixture model on spatial data), including MCMC, thinning, validation, and an interface with Googlemaps. I'm very happy with its performance and it runs without problems. The only issue is that I would like to get it onto CRAN, and it will be rejected because I extensively use global variables.
The package is built around the use of 8 core functions (which the user interacts with):
1) LoadData: Loads in data, extracts key information and sets up a series of global matrices as well as other small list objects.
2) ModelParameters: Sets model parameters, option to plot prior on parameter sigma on Googlemap. Calculates a hyper-prior at this point and saves a large matrix to the global environment
3) GraphicParameters: Sets graphic parameters of maps and plots (see code below)
4) CreateMaps: Creates the prior surface on source location tau and plots the data on a Google map. Keeps a number of global objects saved for repeated plotting of this map.
5) RunMCMC: Runs the bulk of the analysis using MCMC (a time intensive step), creates many global objects.
6) ThinandAnalsye: Thins the posterior samples and constructs the geoprofile (a time intensive step)
7) PlotGP: Plots the data and overlays the geoprofile onto a Google map
8) reporthitscores: OPTIONAL if source data is imported, calculates the hit scores of potential sources
Each one is run in turn before the next, and I pass global variables out which are used by one or more of the other functions.
I built it this way for a reason, as the user must stop and evaluate the results of these functions before rushing ahead to the future ones.
Each of these functions passes not just fixed parameters, but also large map objects, lists and matrices as global objects. I thought it was a nice simple solution with a smooth workflow (you can check the results in your main working environment before moving on, possibly applying transformations etc) and I have given all the objects unique and informative names.
How do I get around this, and pass the checks of CRAN whilst keeping my user friendly workflow of a series of interacting functions?
I don't want to post a lot of code (just the MCMC part is several hundred lines long), but I will include one of the simple examples. GraphicParameters is one of my simple parameter-setting functions, which comes with default values set. This is a simple example; there are much more complex ones in the package. There is a model parameters function that pulls many of its variables from an existing data loading function, for example.
GraphicParameters <- function(Guardrail = 0.05, nring = 20, transp = 0.4,
                              gridsize = 640, gridsize2 = 300, MapType = "roadmap",
                              Location = getwd(), pointcol = "black") {
  Guardrail <<- Guardrail
  nring <<- nring
  transp <<- transp
  gridsize <<- gridsize
  gridsize2 <<- gridsize2
  MapType <<- MapType
  Location <<- Location
  pointcol <<- pointcol
}
Most of the material I have seen on avoiding global objects revolves around a single function that does all the work. I want to keep my step-by-step, multi-function approach, but lose the global objects.
Any help would be greatly appreciated.
I understand this may be a major reworking of the code (which is several 1000 lines currently), so I would also love solutions that minimally affect the overall structure of the package.
P.S. I wish I had known about CRAN's displeasure with global objects before I started!!!
Your problem is very amenable to an OOP-style design. You can use reference classes or S4 to export a single global object, e.g. a MapAnalysis class generator. The idea is then that someone creates one using
ma <- new('MapAnalysis', option1 = ..., option2 = ..., ...) # S4
# or
ma <- MapAnalysis$new(option1 = ..., ...) # refClass
and can then call your methods with
ma$loadData(...)
ma$setParameters(...)
with the object doing any bookkeeping of options and auxiliary objects internally. It should not be that much work to refactor: if you read the page I linked to at the top of this post, you should see it's probably possible to just wrap all your functions with a setRefClass('MapAnalysis', fields = list(...), methods = list(...)) with few further modifications. (Although it would do you a lot of good down the road to re-think the architecture in OOP terms.)
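A minimal sketch of what that might look like (the field and method names here are illustrative, not the package's actual ones); inside reference class methods, <<- assigns to the object's fields rather than to the global environment:

MapAnalysis <- setRefClass(
  "MapAnalysis",
  fields = list(
    data      = "ANY",       # whatever LoadData currently leaves in the global env
    Guardrail = "numeric",
    nring     = "numeric"
  ),
  methods = list(
    loadData = function(df) {
      data <<- df            # <<- writes the field, not a global
    },
    graphicParameters = function(guardrail = 0.05, rings = 20) {
      Guardrail <<- guardrail
      nring <<- rings
    }
  )
)

ma <- MapAnalysis$new()
ma$loadData(data.frame(x = 1:3, y = 4:6))
ma$graphicParameters(guardrail = 0.1)
ma$Guardrail   # 0.1, stored on the object instead of in the workspace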

using value of a function & nested function in R

I wrote a function in R called "filtre": it takes a data frame, and for each line it says whether that line should go into bin 1 or bin 2. At the end, we have two data frames that together add up to the original input, corresponding respectively to the lines thrown into bin 1 or bin 2. These two sets are referred to as filtre1 and filtre2. For convenience, the values of filtre1 and filtre2 are calculated but not returned, because they are an intermediate step in a bigger process (plus they are quite big data frames). I have the following issue:
(i) When I later want to use filtre1 (or filtre2), they simply don't show up... as if their values were stuck within the function and not recognised elsewhere, which would oblige me to copy the whole function every time I feel like using it, quite painful and heavy.
I suspect this is a rather simple thing, but I did search on the web and did not find the answer really (I was not sure of best key words). Sorry for any inconvenience.
Thxs / g.
It's pretty hard to know the optimal way to achieve what you want, as you do not provide a proper example, but I'll give it a try. If your variables filtre1 and filtre2 are defined inside your function and you do not return them, of course they do not show up in your environment. But you could just return the classification and build filtre1 and filtre2 afterwards:
# example data
df <- data.frame(id = 1:20, x = sample(1:20, 20, replace = TRUE))

filtre <- function(df) {
  # example function; this could of course be done by bins <- as.numeric(df$x < 10)
  bins <- numeric(nrow(df))
  for (i in 1:nrow(df)) {
    if (df$x[i] < 10) {
      bins[i] <- 1
    }
  }
  return(bins)
}

bins <- filtre(df)
filtre1 <- df[bins == 1, ]
filtre2 <- df[bins == 0, ]

Strategies for repeating large chunk of analysis

I find myself in the position of having completed a large chunk of analysis and now need to repeat the analysis with slightly different input assumptions.
The analysis, in this case, involves cluster analysis, plotting several graphs, and exporting cluster ids and other variables of interest. The key point is that it is an extensive analysis, and needs to be repeated and compared only twice.
I considered:
Creating a function. This isn't ideal, because then I have to modify my code to know whether I am evaluating in the function or parent environments. This additional effort seems excessive, makes it harder to debug and may introduce side-effects.
Wrap it in a for-loop. Again, not ideal, because then I have to create indexing variables, which can also introduce side-effects.
Creating some preamble code, wrapping the analysis in a separate file and sourcing it. This works, but seems very ugly and sub-optimal.
The objective of the analysis is to finish with a set of objects (in a list, or in separate output files) that I can analyse further for differences.
What is a good strategy for dealing with this type of problem?
Making code reusable takes some time and effort, and holds a few extra challenges, as you mention yourself.
The question of whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?
The answer, I believe, is highly personal and, even then, different case by case. If you are comfortable with programming, you may be quicker to go the reuse route, as the effort will be relatively low for you (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive motivation).
That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, a greater effort would probably go to waste (you indicate the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.
If it turns out in a year or so that the reuse is higher than expected, you can then still invest. And by that time, you will also have (at least) three cases for which you can compare the results from the rewritten and funky reusable version of your code with your current results.
If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way I hardly ever write code that is not in a function (well, barring the two-liners for SO and other out-of-the-box analyses): I find this makes it easier for me to structure my thoughts.
If at all possible, set parameters that differ between sets/runs/experiments in an external parameter file. Then, you can source the code, call a function, even utilize a package, but the operations are determined by a small set of externally defined parameters.
For instance, JSON works very well for this, and the RJSONIO and rjson packages allow you to load the file into a list. Suppose your parameter file is called parametersNN.json. An example is as follows:
{
  "Version": "20110701a",
  "Initialization": {
    "indices": [1,2,3,4,5,6,7,8,9,10],
    "step_size": 0.05
  },
  "Stopping": {
    "tolerance": 0.01,
    "iterations": 100
  }
}
Save that as "parameters01.json" and load it as:
library(RJSONIO)
Params <- fromJSON("parameters01.json")
and you're off and running. (NB: I like to use unique version numbers within my parameter files, just so that I can identify the set later when I'm looking at the "parameters" list within R.) Just call your script and point it to the parameters file, e.g.:
Rscript --vanilla MyScript.R parameters01.json
Then, within the program, identify the parameters file via the commandArgs() function.
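A minimal sketch of that pattern (the field names follow the JSON example above; the fallback default is my own addition):

library(RJSONIO)

args <- commandArgs(trailingOnly = TRUE)                        # e.g. "parameters01.json"
paramFile <- if (length(args) >= 1) args[1] else "parameters01.json"
Params <- fromJSON(paramFile)

step_size <- Params$Initialization$step_size
max_iter  <- Params$Stopping$iterations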
Later, you can break the code out into functions and packages, but this is probably the easiest way to make a vanilla script generalizable in the short term, and it's good practice for the long term, as code should be separated from the specification of run/dataset/experiment-dependent parameters.
Edit: to be more precise, I would even specify input and output directories or files (or naming patterns/prefixes) in the JSON. This makes it very clear how one set of parameters led to one particular output set. Everything in between is just code that runs with a given parametrization, but the code shouldn't really change much, should it?
Update:
Three months, and many thousands of runs, wiser than my previous answer, I'd say that the external storage of parameters in JSON is useful for 1-1000 different runs. When the parameters or configurations number in the thousands and up, it's better to switch to using a database for configuration management. Each configuration may originate in a JSON (or XML), but being able to grapple with different parameter layouts requires a larger scale solution, for which a database like SQLite (via RSQLite) is a fine solution.
I realize this answer is overkill for the original question - how to repeat work only a couple of times, with a few parameter changes, but when scaling up to hundreds or thousands of parameter changes in ongoing research, more extensive tools are necessary. :)
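A rough sketch of the SQLite option mentioned above (table and column names are illustrative): each configuration keeps its raw JSON next to an id, so the JSON workflow still applies once a row is pulled out.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "configurations.sqlite")

# one row per run configuration: an id plus the raw JSON text
dbExecute(con, "CREATE TABLE IF NOT EXISTS configs (id TEXT PRIMARY KEY, json TEXT)")
dbExecute(con, "INSERT OR REPLACE INTO configs VALUES (:id, :json)",
          params = list(id = "20110701a",
                        json = paste(readLines("parameters01.json"), collapse = "\n")))

# fetch one configuration and parse it as before
row <- dbGetQuery(con, "SELECT json FROM configs WHERE id = '20110701a'")
Params <- RJSONIO::fromJSON(row$json)
dbDisconnect(con)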
In those cases I like to work with a combination of a little shell script, a PDF cropping program and Sweave. That gives you back nice reports and encourages you to source. Typically I work with several files, almost like creating a package (at least I think it feels like that :)). I have a separate file for the data juggling and separate files for different types of analysis, such as descriptiveStats.R or regressions.R, for example.
By the way, here's my little shell script:
#!/bin/sh
R CMD Sweave docSweave.Rnw
for file in `ls pdfs`; do
  pdfcrop pdfs/"$file" pdfs/"$file"
done
pdflatex docSweave.tex
open docSweave.pdf
The Sweave file typically sources the R files mentioned above when needed. I am not sure whether that's what you're looking for, but that's my strategy so far. At any rate, I believe creating transparent, reproducible reports is what helps you follow at least A strategy.
Your third option is not so bad. I do this in many cases. You can build in a bit more structure by putting the results of your preamble code into environments and attaching the one you want to use for further analysis.
An example:
setup1 <- local({
  x <- rnorm(50, mean = 2.0)
  y <- rnorm(50, mean = 1.0)
  # ... more preamble code ...
  environment()   # return this local environment so it can be attached later
})

setup2 <- local({
  x <- rnorm(50, mean = 1.8)
  y <- rnorm(50, mean = 1.5)
  # ... more preamble code ...
  environment()
})
Then attach(setup1) and run/source your analysis code:
plot(x, y)
t.test(x, y, paired = T, var.equal = T)
...
When finished, detach(setup1) and attach the second one.
Now, at least you can easily switch between setups. Helped me a few times.
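A small follow-up sketch of cycling through both setups with the same analysis script (analysis.R is an illustrative name):

for (s in list(setup1, setup2)) {
  attach(s, name = "current_setup")
  source("analysis.R")   # the sourced code sees x and y from the attached setup
  detach("current_setup", character.only = TRUE)
}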
I tend to push such results into a global list.
I use Common Lisp but then R isn't so different.
Too late for you here, but I use Sweave a lot, and most probably I'd have used a Sweave file from the beginning (e.g. if I know that the final product needs to be some kind of report).
For repeating parts of the analysis a second and third time, there are then two options:
If the results are rather "independent" (i.e. they should produce three reports, and comparison means the reports are inspected side by side) and the changed input comes in the form of new data files, each input goes into its own directory together with a copy of the Sweave file, and I create separate reports (similar to source, but it feels more natural for Sweave than for plain source).
If I instead need to do exactly the same thing once or twice more inside one Sweave file, I'd consider reusing code chunks. This is similar to the ugly for-loop.
The reason is that then, of course, the results sit together for the comparison, which would be the last part of the report.
If it is clear from the beginning that there will be several parameter sets and a comparison, I write the code so that, as soon as I'm happy with each part of the analysis, it is wrapped into a function (i.e. I'm actually writing the function in the editor window, but evaluating the lines directly in the workspace while writing it).
Given that you are in the described situation, I agree with Nick - nothing wrong with source and everything else means much more effort now that you have it already as script.
I can't comment on Iterator's answer, so I have to post this here. I really like his answer, so I made a short script for creating the parameters and exporting them to external JSON files. I hope someone finds it useful: https://github.com/kiribatu/Kiribatu-R-Toolkit/blob/master/docs/parameter_configuration.md
