Loading KrigingSurrogate Trained Model without Storing Training Data - openmdao

OpenMDAO's KrigingSurrogate allows the user to cache a trained Kriging surrogate model and load it later using the optional argument training_cache. This works great except for one sometimes inconvenient feature: KrigingSurrogate always checks the training data stored in the cached model against the training data provided at setup to make sure they are the same before loading the cached model; otherwise, the model is retrained with the new training data. Unfortunately, this seems to require the user to separately pickle the training data, both inputs and outputs, if they want to train the model in one script and then load it in another.
Is there any way to skip the training data validation and instead use the training data that is already saved in the trained model?
My current method for creating a Kriging model in one script and then loading it in another looks like this:
# create_model.py
import numpy as np
import openmdao.api as om
import pickle
x = np.arange(0, 11, 1)
y = x**2
surrogate = om.MetaModelUnStructuredComp()
surrogate.add_input('x', training_data=x)
surrogate.add_output('y', training_data=y,
                     surrogate=om.KrigingSurrogate(training_cache='surrogate.dat'))
prob = om.Problem()
prob.model.add_subsystem('surrogate', surrogate)
prob.setup()
prob.run_model() # trains model, saves to surrogate.dat
training_data = {
    'x': x,
    'y': y
}

# pickle training data so it can be reused by the loading script
with open('training_data.pickle', 'wb') as training_data_file:
    pickle.dump(training_data, training_data_file)
# load_model.py
import numpy as np
import openmdao.api as om
import pickle
'''
I want to skip this because the training data is already saved with the model I am about to load,
but I can't because KrigingSurrogate requires training data to check the saved model against.
'''
with open('training_data.pickle', 'rb') as training_data_file:
    training_data = pickle.load(training_data_file)
x = training_data['x']
y = training_data['y']
surrogate = om.MetaModelUnStructuredComp()
surrogate.add_input('x', training_data=x)
surrogate.add_output('y', training_data=y,
                     surrogate=om.KrigingSurrogate(training_cache='surrogate.dat'))
prob = om.Problem()
prob.model.add_subsystem('surrogate', surrogate)
prob.setup()
prob.run_model() # loads trained model

As of OpenMDAO V3.25 the Kriging model still requires a cache validation against the training data. In theory it would be a nice improvement to have the surrogate model include that data in the cache it stores and then reload it from the same file. This would save you the extra step of pickling it.
The problem is that there would be no great way for the surrogate to know whether the input training data had changed for a necessary reason or not. A user might set new training data and then be surprised when it got overwritten by the data in the cache. Maybe you could throw a warning, but I've found that many users ignore warnings :(
If you want to customize the behavior, you can make your own version of the KrigingSurrogate class and remove that validation. It only requires removing a few lines of code.
If you can think of a decent update that sorts out the problem of knowing whether the cache is valid and reloading the inputs (without stomping on any user-provided inputs), feel free to submit a POEM. Otherwise, just make your own surrogate and comment out the cache validity check (and be careful!)
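In the meantime, one way to keep the pickling workaround manageable is to put the training-data handling and the metamodel construction in a single helper module that both scripts import, so the arrays and the cache file are only defined in one place. A minimal sketch using only the calls from the question (the module name kriging_helper.py and the function names are just placeholders, not OpenMDAO API):
# kriging_helper.py -- hypothetical shared module, not part of OpenMDAO
import pickle

import openmdao.api as om


def build_surrogate_comp(x, y, cache_file='surrogate.dat'):
    # build the metamodel component the same way in both scripts
    comp = om.MetaModelUnStructuredComp()
    comp.add_input('x', training_data=x)
    comp.add_output('y', training_data=y,
                    surrogate=om.KrigingSurrogate(training_cache=cache_file))
    return comp


def save_training_data(x, y, path='training_data.pickle'):
    with open(path, 'wb') as f:
        pickle.dump({'x': x, 'y': y}, f)


def load_training_data(path='training_data.pickle'):
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data['x'], data['y']
create_model.py then calls save_training_data(x, y) after run_model(), and load_model.py reduces to x, y = load_training_data() followed by build_surrogate_comp(x, y). The training data still has to be pickled, but only one place defines how the component is wired up against the cache file.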

Related

Exclude specific tensors being updated by optimizer in TensorFlow

I have two graphs that I need to train independently, which means I have two different optimizers, but at the same time one of them uses the tensor values of the other graph. As a result, I need to be able to stop specific tensors from being updated while training one of the graphs. I have assigned two different name scopes to my tensors and use this code to control which tensors each optimizer updates:
mentor_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentor")
train_op_mentor = mnist.training(loss_mentor, FLAGS.learning_rate, mentor_training_vars)
mentee_training_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentee")
train_op_mentee = mnist.training(loss_mentee, FLAGS.learning_rate, mentee_training_vars)
The var_list argument is used like below, in the training method of the mnist object:
def training(loss, learning_rate, var_list):
    # Add a scalar summary for the snapshot loss.
    tf.summary.scalar('loss', loss)
    # Create the gradient descent optimizer with the given learning rate.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Create a variable to track the global step.
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Use the optimizer to apply the gradients that minimize the loss
    # (and also increment the global step counter) as a single training step.
    train_op = optimizer.minimize(loss, global_step=global_step, var_list=var_list)
    return train_op
I'm using the var_list argument of the optimizer's minimize method in order to control which variables the optimizer updates.
Right now I'm not sure whether I have done this appropriately, and whether there is any way to check that an optimizer only updates part of a graph.
I would appreciate it if anyone could help me with this issue.
Thanks!
I have had a similar problem and used the same approach as you, i.e. via the var_list argument of the optimizer. I then checked whether the variables not intended for training stayed the same using:
the_var_np = sess.run(tf.get_default_graph().get_tensor_by_name('the_var:0'))
assert np.equal(the_var_np, pretrained_weights['the_var']).all()
pretrained_weights is a dictionary returned by np.load('some_file.npz') which I used to store the pre-trained weights to disk.
Just in case you need that as well, here is how you can override a tensor with a given value:
value = pretrained_weights['the_var']
variable = tf.get_default_graph().get_tensor_by_name('the_var:0')
sess.run(tf.assign(variable, value))
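If you want a more direct sanity check that an optimizer only updates part of the graph, another option is to snapshot the variables of the other scope before and after a single training step and assert that they are unchanged. A rough sketch using the TF 1.x API; sess, feed_dict, and train_op_mentee are assumed to come from your existing training loop:
import numpy as np
import tensorflow as tf

def snapshot(sess, variables):
    # map variable name -> current numpy value
    return {v.name: sess.run(v) for v in variables}

mentor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mentor")

before = snapshot(sess, mentor_vars)
sess.run(train_op_mentee, feed_dict=feed_dict)   # one mentee-only training step
after = snapshot(sess, mentor_vars)

# every mentor variable should be bit-for-bit identical after the mentee step
for name in before:
    assert np.array_equal(before[name], after[name]), name + " was updated"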

h2o autoencoders high error (h2o.mse)

I am trying to use h2o to create an autoencoder using its deeplearning function. I am feeding a data set of about 4000x50 in size to the deeplearning function (hidden = c(200)) and then using h2o.mse to check its reconstruction error, and I am getting about 0.4, a fairly high value.
Is there any way to reduce that error by changing something in the deeplearning function?
I assume everything is the defaults, except defining a single hidden layer with 200 nodes?
Your first set of things to try are:
Use more epochs (or use less aggressive early stopping criteria)
Use a 2nd hidden layer
Use more nodes in your hidden layer(s)
Get more training data
Note that all of those will increase your training time.
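For example, here is a rough sketch of those tweaks using the h2o Python API (the question's original call was in R; trainDataFrame is a placeholder pandas DataFrame and the parameter values are only illustrative):
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
train_hex = h2o.H2OFrame(trainDataFrame.values)   # your ~4000x50 data

ae = H2OAutoEncoderEstimator(
    activation="Tanh",
    hidden=[200, 200],        # add a second hidden layer
    epochs=400,               # train for more epochs
    stopping_rounds=10,       # relax the early-stopping criteria
    stopping_tolerance=1e-4,
)
ae.train(x=train_hex.col_names, training_frame=train_hex)
print(ae.mse())               # reconstruction MSE on the training data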
You can use H2OGridSearch to find the best autoencoder model with the smallest MSE.
Below is an example in Python. An example in R can be found here.
import h2o
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

def tuneAndTrain(hyperParameters, model, trainDataFrame):
    h2o.init()
    trainData = trainDataFrame.values
    trainDataHex = h2o.H2OFrame(trainData)
    modelGrid = H2OGridSearch(model, hyper_params=hyperParameters)
    modelGrid.train(x=list(range(len(trainDataFrame.columns))), training_frame=trainDataHex)
    # sort ascending so that models[0] is the model with the smallest MSE
    gridperf1 = modelGrid.get_grid(sort_by='mse', decreasing=False)
    bestModel = gridperf1.models[0]
    return bestModel
And you can call the above function to find and train the best model:
hiddenOpt = [[50,50],[100,100], [5,5,5],[50,50,50]]
l2Opt = [1e-4,1e-2]
hyperParameters = {"hidden":hiddenOpt, "l2":l2Opt}
bestModel=tuneAndTrain(hyperParameters,H2OAutoEncoderEstimator(activation="Tanh", ignore_const_cols=False, epochs=200),dataFrameTrainPreprocessed)
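Once the grid finishes, you can compare the tuned model against the roughly 0.4 baseline from the question, for example (a small follow-up sketch, assuming the call above has run):
print(bestModel.mse())        # reconstruction MSE of the best grid model

frame = h2o.H2OFrame(dataFrameTrainPreprocessed.values)
per_row_error = bestModel.anomaly(frame)   # per-row reconstruction MSE
print(per_row_error.mean())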

Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

I am trying to build random forest models by group (School_ID, more than 3,000 groups) on a large model input CSV file using the Spark Scala API. Each group contains about 3000-4000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances.
In R, I can construct models by group and save them to a list like this-
library(dplyr);library(randomForest);
Rf_model <- train %>% group_by(School_ID) %>%
do(school= randomForest(formula=Rf_formula, data=., importance = TRUE))
The list can be stored somewhere and I can call the models when I need to use them, like below:
save(Rf_model.school,file=paste0(Modelpath,"Rf_model.dat"))
load(file=paste0(Modelpath,"Rf_model.dat"))
pred <- predict(Rf_model.school$school[school_index][[1]], newdata=test)
I was wondering how to do that in Spark, whether or not I need to split the data by group first, and how to do it efficiently if that's necessary.
I was able to split up the file by School_ID based on the code below, but it seems to create one individual subsetting job for each iteration and takes a long time to finish. Is there a way to do it in one pass?
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID))
for (i <- 0 to schools.length - 1) {
  bySchoolArray(i).
    write.format("com.databricks.spark.csv").
    option("header", "true").
    save("model_input_bySchool/model_input_" + schools(i))
}
Source:
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
Edit 8/24/2015
I'm trying to convert my dataframe into a format that is accepted by the random forest model. I'm following the instructions in this thread:
How to create correct data frame for classification in Spark ML
Basically, I create a new variable "label" and store my class as a Double. Then I combine all my features using the VectorAssembler and transform my input data as follows:
val assembler = new VectorAssembler().
setInputCols(Array("COL1", "COL2", "COL3")).
setOutputCol("features")
val model_input = assembler.transform(model_input_raw).
select("SCHOOL_ID", "label", "features")
Partial error message (let me know if you need the complete log):
scala.MatchError: StringType (of class
org.apache.spark.sql.types.StringType$)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:57)
This is resolved after converting all the variables to numeric types.
Edit 8/25/2015
The ML model doesn't accept the label I coded manually, so I need to use StringIndexer to work around the problem, as indicated here. According to the official documentation, the most frequent label gets index 0, which causes inconsistent labels across School_IDs. I was wondering if there's a way to create the labels without resetting the order of the values.
val indexer = new StringIndexer().
setInputCol("label_orig").
setOutputCol("label")
Any suggestions or directions would be helpful and feel free to raise any questions. Thanks!
Since you already have a separate data frame for each school there is not much to be done here. Since you use data frames I assume you want to use ml.classification.RandomForestClassifier. If so you can try something like this:
1. Extract the pipeline logic. Adjust the RandomForestClassifier parameters and transformers according to your requirements:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
  val rf = new RandomForestClassifier()
  val pipeline = new Pipeline().setStages(Array(rf))
  pipeline.fit(df)
}
2. Train the models on each subset:
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
3. Save the models:
import java.io._
def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name.toString, model)
}
Optional: Since individual subsets are rather small you can try an approach similar to the one I've described here to submit multiple tasks at the same time.
If you use mllib.tree.model.RandomForestModel you can omit step 3 and use model.save directly. Since there seem to be some problems with deserialization (How to deserialize Pipeline model in spark.ml? - as far as I can tell it works just fine, but better safe than sorry, I guess) it could be the preferred approach.
Edit
According to the official documentation:
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
Since the error indicates your column is a String, you should transform it first, for example using StringIndexer.

How to assign new text to the built model (text mining)

Yesterday I found good R code for emotion classification and took part of it:
happy = readLines("./happy.txt")
sad = readLines("./sad.txt")
happy_test = readLines("./happy_test.txt")
sad_test = readLines("./sad_test.txt")
tweet = c(happy, sad)
tweet_test= c(happy_test, sad_test)
tweet_all = c(tweet, tweet_test)
sentiment = c(rep("happy", length(happy) ),
rep("sad", length(sad)))
sentiment_test = c(rep("happy", length(happy_test) ),
rep("sad", length(sad_test)))
sentiment_all = as.factor(c(sentiment, sentiment_test))
library(RTextTools)
mat = create_matrix(tweet_all, language="english",
                    removeStopwords=FALSE, removeNumbers=TRUE,
                    stemWords=FALSE, weighting=tm::weightTfIdf)
container = create_container(mat, as.numeric(sentiment_all),
trainSize=1:160, testSize=161:180,virgin=FALSE)
models = train_models(container, algorithms=c("MAXENT",
"SVM",
#"GLMNET", "BOOSTING",
"SLDA","BAGGING",
"RF", # "NNET",
"TREE"
))
# test the model
results = classify_models(container, models)
table(as.numeric(as.numeric(sentiment_all[161:180])), results[,"FORESTS_LABEL"])
All is good, but one question. Here we work with data where we ourselves indicate to the machine which text is sad and which is happy. If I have new documents without any indication of what is sad or happy, or what's positive and what's negative (suppose one of these documents is loaded with n = read.csv("C:/1/ttt.csv")), what do I do so that the built model can determine which phrases are negative and which are positive?
Well, what was the purpose of building a model to detect what is sad and what is happy? What is it you want to achieve? (And this does not look much like a SO question/answer.)
So you are using supervised learning on labeled data (you already know whether each text is sad or happy) to learn what defines those classes, so that later on you can use the models you built to predict new content for which you do not have labels.
So, any transformations done to the data for training, you have to do for the new data coming in, and then you ask your model to predict (evaluate) based on this new input data. You use the prediction as the result. This does not change your model; it just evaluates it on new data.
Another scenario is that you come in with new labeled data and want to update your model: you can retrain it on the new data and might learn new models that have perhaps more features.
In your case you should look at classify_model or classify_models functions in that package.

How can I export a Time Series model in R?

Is there a standard (or available) way to export a time series model in R? PMML would work, but when I try to use the pmml library, perhaps incorrectly, I get an error.
For example, my code looks similar to this:
require(fpp)
library(forecast)
library(pmml)
data <- ts(livestock, start = 1970, end = 2000,frequency=3)
model <- ses(data , h=10 )
export <- pmml(model)
And the error I get is:
Error in UseMethod("pmml") : no applicable method for 'pmml' applied to an object of class "forecast"
Here is what I can tell:
When you use ses(), you're not creating a model; you're using a model to find a prediction (in particular, making a forecast via exponential smoothing for a time series). Your result is not a predictive model, but rather a particular prediction of a model for a particular data set. While I'm not that familiar with PMML, from what I can tell, it's not meant for the job you are trying to use it for.
If you want to export the time series and the result, I would say your best bet would be to just export a .csv file with the data; just about anything can read .csv's. A ts object is nothing more than a glorified vector, so you can export the data and the times. Additionally, model is just a table with data. So try this:
write.csv(model, file="forecast.csv")
If you want to write the ts object, try one of the following:
write.csv(data, file="ts1.csv") # No dates for index
write.csv(cbind("time" = time(data), "val" = data), file = "ts2.csv") # Adds dates
