I am trying to generate an MDAO problem from an external specification. This requires the automated creation of groups, disciplines and variables. I would like to reuse some analytical functions, but with different arguments. Because the names of these arguments can differ between reused instances, I am looking for a way to write analytical functions whose dictionary-style inputs/outputs keys do not have to match the discipline's input and output variable names.
Is it possible (if so, how?) to employ one of the following reusable functions MyReusableFunction / MyReusableFunctionAlt in the following example?
import openmdao.api as om
### External information
# I can choose the format of disciplinary functions. Some alternatives:
def MyNonReusableFunction1(inputs, outputs): # <- The way it works
# I have to use keys 'A', 'B', 'C' here
outputs['C'] = inputs['A']*inputs['B']
def MyNonReusableFunction2(inputs, outputs): # <- The way it works
# I have to use keys 'D', 'E', 'F' here
outputs['F'] = inputs['D']*inputs['E']
def MyReusableFunction(x, y): # <- The way I want it to work
return x*y
def MyReusableFunctionAlt(inputs, outputs): # <- This would also be fine
outputs['z'] = inputs['x']*inputs['y']
# Given structure of the problem
disciplines = {
'D1': {
'inputs': ['A', 'B'],
'outputs': ['C'],
'function': MyReusableFunction}, # <- instead of MyNonReusableFunction1
'D2': {
'inputs': ['D', 'E'],
'outputs': ['F'],
'function': MyReusableFunction}, # <- instead of MyNonReusableFunction2
}
connections = [('D2.F', 'D1.B')]
### My script starts here
problem = om.Problem()
for disc_name, disc_data in disciplines.items():
    discipline = om.ExplicitComponent()
    discipline.compute = disc_data['function']
    for param_in in disc_data['inputs']:
        discipline.add_input(param_in, 1)
    for param_out in disc_data['outputs']:
        discipline.add_output(param_out, 1)
    problem.model.add_subsystem(disc_name, discipline)
for connection in connections:
    problem.model.connect(connection[0], connection[1])
This feels like a use case for user-defined function registration in ExecComps. This is a brand new feature.
http://openmdao.org/twodocs/versions/latest/features/building_blocks/components/exec_comp.html#registering-user-functions
An example of its use is here:
http://openmdao.org/twodocs/versions/latest/features/building_blocks/components/exec_comp.html#execcomp-example-user-function-registration
This will handle derivatives for you, using either complex step or finite difference, depending on whether the given function is complex-safe.
Here's an example based on your code. It doesn't perfectly replicate the dictionary which stores the user functions, but it's probably a bit easier to get there with this path, as opposed to reassigning compute.
import openmdao.api as om
def MyReusableFunction(x, y): # <- The way I want it to work
return x*y
connections = [('D2.F', 'D1.B')]
problem = om.Problem()
om.ExecComp.register('myfunc', MyReusableFunction, complex_safe=True)
D1 = om.ExecComp('C = myfunc(A, B)')
D2 = om.ExecComp('F = myfunc(D, E)')
problem.model.add_subsystem('D1', D1)
problem.model.add_subsystem('D2', D2)
for connection in connections:
problem.model.connect(connection[0], connection[1])
problem.setup()
problem.run_model()
Let us know if this will not work for your use-case.
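To get from the disciplines dictionary in the question to the ExecComp strings above, the expressions can be generated rather than typed by hand. The following is only a sketch: build_expression is a hypothetical helper, it assumes exactly one output per discipline, and it assumes the 'function' entry holds the name under which the function was registered (here 'myfunc') rather than the callable itself.

```python
def build_expression(func_name, inputs, outputs):
    """Return an ExecComp-style expression such as 'C = myfunc(A, B)'.

    Hypothetical helper: assumes a single output per discipline.
    """
    if len(outputs) != 1:
        raise ValueError("this sketch assumes exactly one output per discipline")
    return f"{outputs[0]} = {func_name}({', '.join(inputs)})"


# Same structure as in the question, but 'function' stores the registered
# name (an assumption made for this sketch) instead of the callable.
disciplines = {
    'D1': {'inputs': ['A', 'B'], 'outputs': ['C'], 'function': 'myfunc'},
    'D2': {'inputs': ['D', 'E'], 'outputs': ['F'], 'function': 'myfunc'},
}

expressions = {
    name: build_expression(data['function'], data['inputs'], data['outputs'])
    for name, data in disciplines.items()
}

for name, expr in expressions.items():
    print(name, '->', expr)
```

Each generated string could then be passed to om.ExecComp(...) and the resulting component added with problem.model.add_subsystem(name, comp), exactly as D1 and D2 are added by hand above.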
In this example, I wish the z_proto could be global for different GPUs. However, in the data parallel mode, it is split into different GPUs as well. How to solve such a problem? Thank you.
class SequencePrototypeTokenClassification(nn.Module):
def __init__(self,seq_model, label_num):
super(SequencePrototypeTokenClassification, self).__init__()
self.seq_model = seq_model
self.label_num = label_num
def forward(self, input_ids, token_type_ids, attention_mask, labels, z_proto, n_query, target_inds):
z, _ = self.seq_model(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
z_dim = z.size(-1)
zq = z.squeeze().view(-1, z_dim)
dists = euclidean_dist(zq, z_proto)
log_p_y = F.log_softmax(-dists, dim=1).view(-1, self.label_num)
loss_val = -log_p_y.gather(1, target_inds).squeeze().view(-1).mean()
_, y_hat = log_p_y.max(1)
return loss_val, y_hat
Based on your above code, z_proto seems to be one of the arguments of the forward function and not part of the model. Therefore, simply storing it in a tensor on the main GPU would enable it to have the same value across GPUs.
Edit
Based on the documentation, it seems that DataParallel splits all the inputs to the forward pass function across the GPUs. A method by which you can circumvent it is by storing it as a class variable inside the model object itself. You can update the value before calling the forward function if it's not a static variable.
class SequencePrototypeTokenClassification(nn.Module):
def __init__(self,seq_model, label_num):
...
self.z_proto = None
...
...
#Training loop
...
model.z_proto = value
model.forward()
...
It turns out that DataParallel only replicates the nn.Parameter objects of the nn.Module. So I randomly initialized an nn.Parameter named z_proto in the module and copied the value of the tensor z_proto into that parameter. The parameter is then replicated to the 4 GPUs.
In my code I use lookup tables quite often, for example to have more verbose versions of column names in a data frame. For instance:
lkp <- c(speed = "Speed in mph", dist = "Stopping Distance in ft")
makePlot <- function(x = names(cars)) {
x <- match.arg(x)
hist(cars[[x]], xlab = lkp[[x]])
}
Now it happens that I want to reverse the lookup vector [*], which is easily done by
setNames(names(lkp), lkp)
If lkp is a bit more complicated, this becomes quite a lot of typing:
setNames(names(c(firstLkp, secondLkp, thirdLkp, youGotTheIdea)),
c(firstLkp, secondLkp, thirdLkp, youGotTheIdea))
with a lot of redundant code. Of course I could create a temporary variable
fullLkp <- c(firstLkp, secondLkp, thirdLkp, youGotTheIdea)
setNames(names(fullLkp), fullLkp)
Or even write a simple function doing it for me
swap_names_content <- function(x) setNames(names(x), x)
However, since this seems to me to be such a common task, I was wondering whether there is already a function in one of the popular packages doing the same?
[*] A common use case for me is the use of shiny's selectInput for instance:
List of values to select from. If elements of the list are named, then that name rather than the value is displayed to the user.
That is, it is exactly the reverse of my typical lookup table.
Little introduction to the question :
I am developing an ecophysiological model, and I use a Reference Class list called S that stores every object the model needs for input/output (e.g. meteo, physiological parameters, etc.).
This list contains 5 objects (see example below):
- two data frames, S$Table_Day (the outputs from the model) and S$Met_c (the meteo input), which both have variables in columns and observations (input or output) in rows.
- a list of parameters S$Parameters.
- a matrix
- a vector
The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i=1 to the last day i=n. This list is passed to the functions that often take data from S$Met_c and/or S$Parameters in input and compute something that is stored in S$Table_Day, using indexes (the ith day). S is a Reference Class list because they avoid copy on modification, which is very important considering the number of computations.
The question itself :
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two solutions to store my data. Storing data by indexing in one of the preallocated dataframes is longer than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations while modifying by index (probably comparing the length, type etc...).
My question is : is there a better way to perform such operations ? In other words, is there a way for me to use/store the inputs/outputs more efficiently (in a data.frame, a list of vectors, or something else) to keep track of all computations of each day ? For example, would it be better to use many vectors (one for each variable) and regroup them in more complex objects (e.g. a list of data frames) at the end ?
By the way, am I right to use Reference Classes to avoid copy of the big objects in S while passing it to functions and modify it from within them ?
Reproducible example for the comparison:
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers :
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
f1= function(i){
a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
f2= function(i){
S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)
And the result :
Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.
I read more about the question, and I'll write down here, for posterity, some of the solutions that were proposed in other posts.
Apparently, both reading and writing are worth considering when trying to reduce the computation time of assignment to a data.frame by index.
The sources are all found in other discussions:
How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)
Faster i, j matrix cell fill
Time in getting single elements from data.table and data.frame objects
Several solutions appeared relevant :
Use a matrix instead of a data.frame if possible to leverage in place modification (Advanced R).
Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
Write functions in C++ and use Rcpp (from this source)
Use .subset2 to read instead of [ (third source)
Use data.table as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
Use [[ instead of [ when possible (index by one value only). This one is not very effective, and very restrictive, so I removed it from the following comparison.
Here is the analysis of performance using the different solutions :
The code :
# Loading packages :
library(data.table)
library(microbenchmark)
library(ggplot2)
# Creating dummy data :
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
# Transforming data objects into simpler forms :
mat= as.matrix(S$Table_Day)
Slist= as.list(S$Table_Day)
Metlist= as.list(S$Met_c)
MetDT= as.data.table(S$Met_c)
SDT= as.data.table(S$Table_Day)
# Setting up the functions for the tests :
f1= function(i){
S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
}
f2= function(i){
mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
}
f3= function(i){
mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f4= function(i){
Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5= function(i){
Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
}
f6= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
}
f8= function(i){
SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
}
i= 6000:6500
res= microbenchmark(f1(i),f3(i),f4(i),f5(i),f7(i),f8(i), times = 10000)
autoplot(res)
And the resulting autoplot :
With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and matrix, f4 using a list and .subset2, f5 using two lists (both reading and writing), f6 using data.table::set, f7 using data.table::set and data.table for the cumulative sum, and f8 using data.table :=.
As we can see, the best solution is to use lists for reading and writing. It is pretty surprising to see that data.table is the worst solution. I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.
I would like to maintain a variable on the GPU, and perform some operations on that variable in place. The following snippet is a minimalish example of this.
import numpy as np
import tensorflow as tf
with tf.Graph().as_default():
i = tf.placeholder(tf.int32, [4], name='i')
y = tf.placeholder(tf.float32, [4], name='y')
_x = tf.get_variable('x', [4], initializer=tf.random_normal_initializer())
x = _x + tf.reduce_sum(tf.mul(_x,y))
assign_op = tf.assign(_x, x).op
permute_op = tf.assign(_x, tf.gather(_x, i))
ii = np.array([1,2,3,0])
yy = np.random.randn(4)
s = tf.Session()
s.run(tf.initialize_all_variables())
xxx0 = s.run(_x)
s.run([permute_op, assign_op], feed_dict={i: ii, y: yy})
xxx1 = s.run(_x)
print('assigned then permuted', np.allclose((xxx0+np.dot(xxx0,yy))[ii], xxx1))
print('permuted then assigned', np.allclose((xxx0[ii]+np.dot(xxx0[ii], yy)), xxx1))
The problem is that this program is ambiguous, in terms of the ordering of the assign_op and permute_op operations. Hence, one or the other of the final two print statements will be true, but which one that is varies randomly across multiple runs of the program. I could break this into two steps, the first running the permute_op and the second running the assign_op, but it seems this will be less efficient.
Is there an efficient way of breaking the race condition, and making the results predictable?
The easiest way to order the two assignments is to use the result of the first assignment as the variable input to the second one. This creates a data dependency between the assignments, which gives them a deterministic order. For example:
assigned = tf.assign(_x, x)
permuted = tf.assign(assigned, tf.gather(assigned, i))
sess.run(permuted.op) # Runs both assignments.
Note that I reversed the order of the permutation and assignment operations from what you said in your question, because doing the permutation first and then updating still has a race. Even if this isn't the semantics you wanted, the principle should hopefully be clear.
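To see why the order matters at all, here is a small pure-Python sketch of the two possible orderings. It does not use TensorFlow; assign_step and permute_step are made-up helpers that mimic the update x + sum(x*y) and the gather from the question, and the numbers are arbitrary.

```python
def assign_step(x, y):
    """Mimic the question's update: x <- x + sum(x * y)."""
    s = sum(a * b for a, b in zip(x, y))
    return [a + s for a in x]


def permute_step(x, idx):
    """Mimic tf.gather(x, idx): pick the elements of x in the order given by idx."""
    return [x[j] for j in idx]


x0 = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 0.0, 0.0, 0.0]
idx = [1, 2, 3, 0]

# Ordering used in the chained-assign example above: assign, then permute.
assign_then_permute = permute_step(assign_step(x0, y), idx)

# The other ordering: permute, then assign.
permute_then_assign = assign_step(permute_step(x0, idx), y)

print(assign_then_permute)   # [3.0, 4.0, 5.0, 2.0]
print(permute_then_assign)   # [4.0, 5.0, 6.0, 3.0]
```

The two results differ because the sum term depends on which element lines up with each weight in y, so whichever order you want has to be enforced explicitly by a data or control dependency.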
An alternative approach is to use with tf.control_dependencies(ops): blocks, where ops is a list of operations (such as assignments) that must run before the operations in the with block. This is slightly trickier to use, because you have to be careful about reading the updated value of a variable. (Like a non-volatile variable in C, the read may be cached.) The typical idiom to force a read is to use tf.identity(var.ref()), so the example would look something like:
assign_op = tf.assign(_x, x).op
with tf.control_dependencies([assign_op]):
# Read updated value of `_x` after `assign_op`.
new_perm = tf.gather(tf.identity(_x.ref()), i)
permute_op = tf.assign(_x, new_perm).op
sess.run(permute_op) # Runs both assignments.
I'm running various modeling algorithms on a data set. I've had best results by modeling my input variables to my responses one at a time, e.g.:
model <- train(y ~ x1 + x2 + ... + xn, ...)
Once I train my models, I'd like to not re-run them each time, so I've been trying to save them as .rda files. Here's an example loop for a random forest model (feel free to suggest a better way than a loop!):
# data_resp contains my measured responses, one per column
# data_pred contains my predictors, one per column
for (i in 1:ncol(data_resp)) {
  model <- train(data_pred_scale[!is.na(data_resp[, i]), ],
                 data_resp[!is.na(data_resp[, i]), i],
                 method = "rf",
                 tuneGrid = data.frame(.mtry = c(3:6)),
                 nodesize = 3,
                 ntrees = 500)
  save(model, file = paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""))
}
When I load the model, however, it's going to be called model.
I haven't found a good way to save the model with its corresponding name so that I can refer back to it later. I found that one can assign an object to a string like so:
assign(paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""), train(...))
But I'm still left with how to refer to the object when I save it:
save(???, file = ...)
I don't know how to call the object by its custom name.
Lastly, even loading has presented a problem. I've tried assign("model_name", load("./model.rda")), but the resultant object, called string ends up just holding a string of the object name, "model".
In looking around, I found THIS question, which seems relevant, but I'm trying to figure out how to apply it to my situation.
I could create a list with the names of each column name in data_resp (my measured responses) and then use lapply to use train(), but I'm still a bit stuck on how to dynamically refer to the new object name to keep the resultant model in.
When you save the model, save another object called 'name' which is a character string of the thing you want to name it as:
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m1"
> save(model,name,file="save1.rda")
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m2"
> save(model,name,file="save2.rda")
Now each file knows what it wants its resulting object to be called. How do you get that back on load? Load into a new environment, and assign:
> e=new.env()
> load("save1.rda",env=e)
> assign(e$name,e$model)
> summary(m1)
Call:
lm(formula = y ~ x, data = d)
You can now safely rm or re-use the 'e' object. You can of course wrap this in a function:
> blargh=function(f){e=new.env();load(f,env=e);assign(e$name,e$model,.GlobalEnv)}
> blargh("save2.rda")
> m2
Call:
lm(formula = y ~ x, data = d)
Note this is a double bad thing to do - firstly, you should probably store all the models in one file as a list with names. Secondly, this function has side effects, and if you had an object called m2 already it would get stomped on.
Using assign like this is nearly always a sign (dyswidt?) that you should use a list instead.
B
There is a fair amount of guesswork involved in this answer but I think this could help:
# get a vector with the column names in data_resp
modNames <- colnames( data_resp )
# create empty list
models <- as.list( NULL )
# iterate through your columns and assign the result as list members
for( n in modNames )
{
models[[n]] <- train(data_pred_scale[!is.na(data_resp[, n]), ], ### this may need modification, can't test without data
data_resp[!is.na(data_resp[, n]), n],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
}
# save the whole bunch
save( models, file = "models.rda" )
You can now retrieve this one object, the list with all your models, just with load( "models.rda" ), and address the models with list notation, either as models[[1]] or by column name, e.g. models[["first"]].
I think the other answers about doing this with a loop are great. I used this as a chance to finally try and understand lapply better, as many of the StackOverflow questions about how to do this ended up suggesting the use of lists and lapply instead of loops.
I really like the idea of combining all results of train() into a list (which @vaettchen did in his loop), and in thinking about how to do this with a list, this is what I came up with. First, I needed my data.frame in list form, one entry per column. Since I don't really work with lists, I hunted around until just trying as.list(df), which worked like a charm.
Next, I want to apply my train function to each element of my list of measured response variables, so I defined the function like this:
# predictors are stored in data_pred
# responses are in data_resp (one per column)
# rows in data_pred/data_resp (perhaps obviously) match, one per observation
train_func <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3:6),
ntrees = 500) }
Now I just need to use lapply to apply the train() call on each element of data_resp. I didn't know how to create an empty, placeholder list, so thanks to @vaettchen for that (I was trying list_name <- list() without success):
models <- lapply(as.list(data_resp), train_func)
Awesomely, I found that models has its elements automatically named after my column names in data_resp, which is just fantastic. I'm using this in conjunction with the shiny package, so this will make it incredibly easy for the user to select a response variable from a drop-down (which can store the response variable name) and do:
predict(models[["resp_name"]], new_data)
I think this is much better than the loop based approach and everything just happened to fall in place nicely. I realize the question explicitly asked for naming variables programmatically, so apologies if that pushed others to answer in that fashion vs. a "bigger picture" answer. The ease of lapply suggests I was trying to force a particular solution when a (at least to my eyes) much better one existed.
Bonus: I didn't realize lists could be multi-dimensional, but in trying it, it appears they can be! This is even better, as I'm using numerous algorithms and I can store everything in one big list object.
func_rf <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3),
ntrees = 100) }
# svmRadial method requires formula syntax to work with factors,
# so the train function has to be a bit different
# add `scale = F` since I had to preProcess the numeric vars ahead of time
# and cbind to the factors. Without it, caret will try to scale the data
# for you, which fails for factors
func_svm <- function(y) { train(y ~ ., cbind(data_pred, y),
method = "svmRadial", tuneGrid = data.frame(.C = 1, .sigma = .2),
scale = F) }
model_list <- list(NULL)
model_list$rf <- lapply(as.list(data_resp), func_rf)
model_list$svm <- lapply(as.list(data_resp), func_svm)
Now I can refer to the desired model and response variable with list syntax!
predict(model_list[["svm"]][["response_variable"]], new_data)
Super happy with this and hopefully it makes the code more efficient, faster, and I really love the "meta-object" I end up with vs. a ton of files, one per model/response variable combination, that I have to load in one at a time later on.
A bit of an old question but still without an accepted answer.
As I understand, you need to programmatically rename a variable and save it so that when reloaded it keeps the new name.
Try this:
saveWithName = function(var.name, obj){
# var.name is a string with the name of the variable you want to assign
# obj is any kind of R object (data.frame, list, etc.) you want to rename and save
assign(var.name, obj)
save(list=var.name, file=sprintf("model_%s.RData", var.name))
}
saveWithName("lab1", c(1,2))
saveWithName("lab2", c(3,4))
load("model_lab1.RData")
load("model_lab2.RData")
print(lab1)
#>[1] 1 2
print(lab2)
#>[1] 3 4