Azure ML Studio not showing datasets under models - azure-machine-learning-studio

I registered a model in an Azure ML notebook along with its datasets. In ML Studio I can see the model listed under the dataset, but no dataset gets listed under the model. What should I do to have datasets listed under models?
Model listed under dataset:
Dataset not listed under the model:
Notebook code:
import pickle
import sys
from azureml.core import Workspace, Dataset, Model
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import assert_all_finite
workspace = Workspace('<snip>', '<snip>', '<snip>')
dataset = Dataset.get_by_name(workspace, name='creditcard')
data = dataset.to_pandas_dataframe()
data.dropna(inplace=True)
X = data.drop(labels=["Class"], axis=1, inplace=False)
y = data["Class"]
model = make_pipeline(StandardScaler(), GradientBoostingClassifier())
model.fit(X, y)
with open('creditfraud_sklearn_model.pkl', 'wb') as outfile:
    pickle.dump(model, outfile)
Model.register(
    workspace=workspace,  # note: the parameter name is lowercase 'workspace'
    model_name='creditfraud_sklearn_model',
    model_path='creditfraud_sklearn_model.pkl',
    description='Gradient Boosting classifier for Kaggle credit-card fraud',
    model_framework=Model.Framework.SCIKITLEARN,
    model_framework_version=sys.modules['sklearn'].__version__,
    sample_input_dataset=dataset,
    sample_output_dataset=dataset)

It looks like add_dataset_references() needs to be called for the datasets to be displayed under the model:
model_registration.add_dataset_references([("input dataset", dataset)])
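A minimal end-to-end sketch (assuming the same workspace and dataset as above): capture the Model object that Model.register() returns and add the reference to it:
model_registration = Model.register(
    workspace=workspace,
    model_name='creditfraud_sklearn_model',
    model_path='creditfraud_sklearn_model.pkl',
    model_framework=Model.Framework.SCIKITLEARN,
    model_framework_version=sys.modules['sklearn'].__version__,
    sample_input_dataset=dataset,
    sample_output_dataset=dataset)
# each reference is a (name, dataset) pair; the name is free-form
model_registration.add_dataset_references([("input dataset", dataset)])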

Related

How to give R's "matchit" a formula object via Python's rpy2?

I use rpy2 to call R from Python. In particular I want to use the MatchIt package, but I am stuck on a detail. The call to the primary function of that package looks like this in R:
# R code
m.out1 <- matchit(
  gruppe ~ geschlecht + alter + pflege,
  data = df,
  method = "nearest",
  distance = "glm"
)
The first argument is a "formula". I have no idea how to create such an object/argument in Python code. The other three arguments are no problem. The error from rpy2 is this:
[WARNING] R[write to console]: Error: 'formula' must be a formula object.
Traceback (most recent call last):
...
File "...\AppData\Roaming\Python\Python39\site-packages\rpy2\rinterface.py", line 813, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error: 'formula' must be a formula object.
This is the Python code producing that problem.
# Python code
import pandas
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
r_package_matchit = robjects.packages.importr('MatchIt')
func_matchit = robjects.r['matchit']
df = pandas.DataFrame({
    'gruppe': list('IICC'),
    'geschlecht': list('mwmw'),
    'alter': range(4),
    'pflege': range(4)
})
# convert the data frame from Pandas to R
with robjects.conversion.localconverter(
        robjects.default_converter + pandas2ri.converter):
    rdf = robjects.conversion.py2rpy(df)
func_matchit(formula='gruppe ~ geschlecht + alter + pflege',
             data=rdf, method='nearest',
             distance='glm')
Much easier than I thought: rpy2 offers a Formula() for cases like this.
import rpy2
from rpy2.robjects import Formula
# ...
func_matchit(formula=Formula('gruppe ~ geschlecht + alter + pflege'),
             data=rdf, method='nearest',
             distance='glm')
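As a side note, a Formula object carries its own R environment, which can be populated when some variables are not columns of the data frame. A small sketch (the vectors here are hypothetical; in the MatchIt case the columns come from rdf instead):
from rpy2.robjects import Formula, FloatVector

fmla = Formula('y ~ x')
# variables are resolved from the formula's environment rather than a data frame
fmla.environment['y'] = FloatVector([1.0, 2.0, 3.0])
fmla.environment['x'] = FloatVector([4.0, 5.0, 6.0])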

Initialize HuggingFace Bert with random weights

How is it possible to initialize BERT with random weights? I want to compare the performance of multilingual vs monolingual vs randomly initialized BERT in a masked language modeling task. While in the former cases it is very straightforward:
from transformers import BertTokenizer, BertForMaskedLM
tokenizer_multi = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model_multi = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')
model_multi.eval()
tokenizer_mono = BertTokenizer.from_pretrained('bert-base-cased')
model_mono = BertForMaskedLM.from_pretrained('bert-base-cased')
model_mono.eval()
I don't know how to load random weights.
Thanks in advance!
You can use the following function:
import torch

def randomize_model(model):
    for _, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
            # re-draw weights from the same distribution BERT uses at initialization
            module.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
        elif isinstance(module, torch.nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, torch.nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    return model
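Alternatively (a minimal sketch; constructing a model from a config alone is standard transformers behavior), you can get a randomly initialized BERT without touching a checkpoint:
from transformers import BertConfig, BertForMaskedLM

# same architecture as bert-base-cased, but freshly initialized random weights
config = BertConfig.from_pretrained('bert-base-cased')
model_random = BertForMaskedLM(config)
model_random.eval()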

Extracting leaf indices that each sample was assigned to in the forest from random forests (RF) in R

I'm trying to transpile code from Python to R in order to do supervised dimensionality reduction with Random Forests and UMAP following instructions from this blog post.
I need to get an array that contains the leaf indices that each sample was assigned to in the forest so I can feed this information into the {uwot} package (for UMAP).
I would like to get this information from the following R packages: {randomForest}, {ranger}, and {extraTrees}. ranger is expected to require the lowest compute time (critical for my project), and ExtraTrees can in some cases outperform RF implementations (in terms of AUC).
In Python, scikit-learn's ExtraTreesClassifier returns a NumPy array that contains the leaf indices each sample was assigned to in the forest (via its apply() method, shown below).
To make things as comparable as possible between R and Python, we will do the brunt of the work in Python via reticulate, following the blog post (but decreasing the size of the simulated data), and compare only the various RF implementations.
Python Implementation
library(reticulate)
repl_python()
Now in the Python interpreter
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from tqdm import tqdm
from umap import UMAP ## if this doesn't work try: from umap.umap_ import UMAP
from pynndescent import NNDescent
from fastcluster import single
from scipy.cluster.hierarchy import cut_tree, fcluster, dendrogram
from scipy.spatial.distance import squareform
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.model_selection import train_test_split
# turning off automatic plot showing, and setting style
plt.style.use('bmh')
# let us generate some data with 10 clusters per class
X, y = make_classification(n_samples=10000, n_features=500, n_informative=5,
                           n_redundant=0, n_clusters_per_class=10, weights=[0.80],
                           flip_y=0.05, class_sep=3.5, random_state=42)
# normalizing to eliminate scaling differences
X = pd.DataFrame(StandardScaler().fit_transform(X))
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 70% training and 30% test
Use ExtraTreesClassifier to get the leaf indices and plot.
In the blog post the author used StratifiedKFold and cross_val_predict. Here I use train_test_split instead to create a train/test split. This way I can use the same split for Python and R and ensure the area under the ROC curve is measured comparably.
# model instance
et = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=500,
                          max_features=0.80, bootstrap=True,
                          class_weight='balanced', n_jobs=-1, random_state=42)
# Train ExtraTreesClassifer
et.fit(X_train, y_train)
# Print the AUC
print('Area under the ROC Curve:', roc_auc_score(y_test, et.predict_proba(X_test)[:,1]))
The model's performance is 0.8504 AUC. Now we move on to the embedding.
# let us train our model with the full data
et.fit(X, y)
# and get the leaves that each sample was assigned to
leaves = et.apply(X)
What are the dimensions of the NumPy array?
leaves.shape
(10000, 100)
# calculating the embedding with hamming distance
sup_embed_et = UMAP(metric='hamming').fit_transform(leaves)
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_et[y == 0,0], sup_embed_et[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_et[y == 1,0], sup_embed_et[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with ExtraTrees')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()
# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_et).sample(5000, random_state=42)
# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)
# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))
This shows 19 clusters (the simulated number was 20).
# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})
# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)
# querying for all the data
nn = index.query(sup_embed_et, k=1)
# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample':range(sup_embed_et.shape[0]), 'cl_sample': nn[0].reshape(-1)})
# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_et[:,0], sup_embed_et[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and extraTrees')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit
R Implementation(s)
Now, in R, create data.frames from the same data we simulated and split in Python (via sklearn's train_test_split):
library(dplyr)
df <- data.frame(py$X)
df$labels <- as.factor(py$y) # convert to factor for classification
d_train <- data.frame(py$X_train)
d_test <- data.frame(py$X_test)
d_train$labels <- py$y_train
d_test$labels <- py$y_test
# Identify the response column
ycol <- "labels"
# Identify the predictor columns
xcols <- setdiff(names(d_train), ycol)
# Convert response to factor (required by randomForest)
d_train[,ycol] <- as.factor(d_train[,ycol])
d_test[,ycol] <- as.factor(d_test[,ycol])
randomForest
First, the randomForest package (one of the oldest and best-known packages, though not the most optimized):
library(randomForest)
library(cvAUC)
# Train a default RF model with 100 trees
set.seed(123) # For reproducibility
system.time(
  model <- randomForest(
    x = d_train[, xcols],
    y = d_train[, ycol],
    xtest = d_test[, xcols],
    ntree = 100,
    nodes = TRUE  # keep information on which leaf each sample was assigned to
  )
) ## user: 30.835 system: 0.008 elapsed: 30.837
# Generate predictions on test dataset
preds <- model$test$votes[, 2]
labels <- d_test[,ycol]
# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
The model's performance is 0.8919 AUC. This is slightly better than what we saw in Python (albeit a bit slower).
Now, as we did in Python, we train on the whole dataset, keeping track of which leaf in the forest each sample was assigned to.
set.seed(123) # For reproducibility
md_full <- randomForest(formula = labels ~ ., data = df, ntree = 100, keep.forest = TRUE)
phat_full <- predict(md_full, newdata = df, type = "prob", nodes = TRUE)
# get the leaf indices that each sample was assigned to in the forest
leaves <- attr(phat_full, "nodes")
dim(leaves)
Take this back into Python and plot, again via repl_python():
# Assign R object as Python object
leafs = r.leaves
# Get embeddings from UMAP
sup_embed_rf = UMAP(metric='hamming').fit_transform(leafs)
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rf[y == 0,0], sup_embed_rf[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_rf[y == 1,0], sup_embed_rf[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with randomForest')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()
# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_rf).sample(5000, random_state=42)
# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)
# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))
This shows 12 clusters (although the simulated number was 20). I'm confused that this method found fewer clusters, considering its AUC was higher.
# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})
# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)
# querying for all the data
nn = index.query(sup_embed_rf, k=1)  # query with the randomForest embedding, not the ExtraTrees one
# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample': range(sup_embed_rf.shape[0]), 'cl_sample': nn[0].reshape(-1)})
# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rf[:,0], sup_embed_rf[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and randomForest')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit
Other R packages are known to outperform randomForest in both accuracy and speed, so I would also like to get this information from ranger and extraTrees. ranger, for example, has excellent speed and support for high-dimensional or wide data (e.g., scRNA-sequencing data).
ranger
Here's what I've got so far with the ranger method:
library(ranger)
library(pROC)
set.seed(123)
# ranger speed
system.time(
  df_ranger <- ranger(
    formula = labels ~ .,
    data = d_train,
    num.trees = 100,
    num.threads = 1,  # default is the number of CPUs on the machine
    probability = TRUE
  )
) # user 13.047 system: 0.01 elapsed: 13.048
# use class probabilities for the AUC; type = "terminalNodes" returns leaf IDs, not probabilities
pred.ranger <- predict(df_ranger, data = d_test, type = "response")
# get model accuracy
ranger.roc <- roc(d_test$labels, pred.ranger$predictions[,2])
pROC::auc(ranger.roc)
As originally written (with type = "terminalNodes"), this reported a near-chance 0.5471 AUC, which is expected: the ROC was being computed on terminal-node IDs rather than class probabilities. With type = "response" the AUC is comparable to the other methods.
set.seed(123)
df_ranger <- ranger(formula = labels ~ ., data = df, num.trees = 100, probability = TRUE)
pred.ranger <- predict(df_ranger, data = df, type = "terminalNodes")
lyves <- pred.ranger$predictions
Now enter repl_python() again and in Python run:
# Assign R object as Python object
lyves = r.lyves
# Get embeddings from UMAP
sup_embed_rg = UMAP(metric='hamming').fit_transform(lyves)
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rg[y == 0,0], sup_embed_rg[y == 0,1], s=1, c='C0', cmap='viridis', label='$y=0$')
plt.scatter(sup_embed_rg[y == 1,0], sup_embed_rg[y == 1,1], s=1, c='C1', cmap='viridis', label='$y=1$')
plt.title('Supervised embedding with ranger')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.legend(fontsize=16, markerscale=5);
plt.show()
# taking a sample of the dataframe
embed_sample = pd.DataFrame(sup_embed_rg).sample(5000, random_state=42)
# running fastcluster hierarchical clustering on the improved embedding
H = single(embed_sample)
# getting the clusters
clusters = cut_tree(H, height=0.35)
print('Number of clusters:', len(np.unique(clusters)))
This shows 17 clusters (although the simulated number was 20).
# creating a dataframe for the clustering sample
clust_sample_df = pd.DataFrame({'cluster': clusters.reshape(-1), 'cl_sample':range(len(clusters))})
# creating an index with the sample used for clustering
index = NNDescent(embed_sample, n_neighbors=10)
# querying for all the data
nn = index.query(sup_embed_rg, k=1)  # query with the ranger embedding, not the ExtraTrees one
# creating a dataframe with nearest neighbors for all samples
to_cluster_df = pd.DataFrame({'sample': range(sup_embed_rg.shape[0]), 'cl_sample': nn[0].reshape(-1)})
# merging to assign cluster to all other samples, and tidying it
final_cluster_df = to_cluster_df.merge(clust_sample_df, on='cl_sample')
final_cluster_df = final_cluster_df.set_index('sample').sort_index()
# plotting the embedding
plt.figure(figsize=(12,7), dpi=150)
plt.scatter(sup_embed_rg[:,0], sup_embed_rg[:,1], s=1, c=final_cluster_df['cluster'], cmap='plasma')
plt.title('Hierarchical clustering and ranger')
plt.xlabel('$x_0$'); plt.ylabel('$x_1$')
plt.show()
exit
extraTrees
The extraTrees method could be the closest comparison to scikit-learn's ExtraTreesClassifier.
library(extraTrees)
y <- "labels"
characteristics <- setdiff(names(d_train), y)
train <- d_train[, characteristics]
test <- d_test[, characteristics]
set.seed(123)
system.time({
  model_extraTrees <- extraTrees(
    x = train,
    y = as.factor(d_train$labels),
    ntree = 500,
    numThreads = 1  # the default; set explicitly to use more threads
  )
}) # user 10.184 system: 0.104 elapsed: 9.770
UPDATE: At this time one cannot get the leaf indices from {extraTrees} so I have made a feature request.
h2o
In this excellent benchmark repo, the machine learning library h2o is shown to achieve a higher AUC, so I want to try it out as well. I'll use 1 core and ntrees = 100, as in the other methods.
library(h2o)
h2o.init(nthreads = 1)
# convert data to h2o objects
train <- as.h2o(d_train)
test <- as.h2o(d_test)
# Convert response to factor (required for classification in h2o)
train[,ycol] <- as.factor(train[,ycol])
test[,ycol] <- as.factor(test[,ycol])
system.time(
  model <- h2o.randomForest(
    x = xcols,
    y = ycol,
    training_frame = train,
    seed = 123,
    ntrees = 100
  )
) ## user: 0.199 system: 0.018 elapsed: 18.399
perf <- h2o.performance(model = model, newdata = test)
h2o.auc(perf)
The model's performance is 0.9077 AUC, the best so far. I'm not sure whether it's possible to get the leaf indices that were assigned.
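For what it's worth, h2o does expose leaf-node assignments via predict_leaf_node_assignment() in both its R and Python clients. A minimal Python sketch (assumptions: a forest trained on the same simulated data, with df being the pandas data frame holding the features plus a 'labels' column):
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init(nthreads=1)
train_h2o = h2o.H2OFrame(df)
train_h2o['labels'] = train_h2o['labels'].asfactor()
rf = H2ORandomForestEstimator(ntrees=100, seed=123)
rf.train(y='labels', training_frame=train_h2o)
# one column per tree; type='Node_ID' gives the terminal node index per tree
leaves_h2o = rf.predict_leaf_node_assignment(train_h2o, type='Node_ID')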
There are at least 32 R packages for random forests.
If anyone is familiar with other packages with good performance that can get leaf indices I'd love to know about it. If you've noticed any errors please let me know, thank you.

How to estimate a brms model with rpy2?

I am trying to estimate the model below, which uses an R package called brms. I am doing all the data manipulation in Python, and to bridge the two languages I am using rpy2. I am able to load the brms package with rpy2, but I can't figure out the syntax to estimate the model. Below is a simple example of what I would like to do. I tried to follow the documentation on rpy2's website, but I can't seem to get it to work. This code works natively in R. How do I translate it to rpy2?
library(brms)
data("kidney", package = "brms")
head(kidney, n = 3)
fit1 <- brm(time | cens(censored) ~ age + sex + disease,
            data = kidney, family = weibull, inits = "0")
summary(fit1)
plot(fit1)
fit2 <- brm(time | cens(censored) ~ age + sex + disease + (1|patient),
            data = kidney, family = weibull(), inits = "0",
            prior = set_prior("cauchy(0,2)", class = "sd"))
summary(fit2)
plot(fit2)
In Python, every non-built-in attribute or object must be qualified with a namespace. Fortunately, in R everything is an object within implicit namespaces! Most new useRs may not know that the built-in core libraries (base, stats, utils) are loaded with each session, so many everyday functions like read.csv, data.frame, and lapply actually live inside libraries and can be called in Python's style with the double-colon operator: utils::read.csv(), base::lapply(), stats::lm(). To find a function's library, check its doc page in R with ? (e.g., ?lapply) and look in the upper left corner.
Therefore, simply retain all of your R syntax while adhering to Python's syntax rules, such as translating dotted names (importr maps . to _ by default) and replacing the <- assignment operator with =. However, rpy2 does not render graphs interactively, so you need to save plots as images to disk, and you should print any console output you want to see. One other challenge is loading built-in datasets; below includes a working mtcars load from R's built-in datasets package as a template. Hopefully it is translatable.
from rpy2.robjects.packages import importr, data
# IMPORT R PACKAGES
base = importr('base')
utils = importr('utils')
datasets = importr('datasets')
stats = importr('stats', robject_translations={'as.formula': 'as_formula'})
graphics = importr('graphics')
grDevices = importr('grDevices')
brms = importr('brms')
# LOADING DATA
# WORKING EXAMPLE: mtcars = data(datasets).fetch('mtcars')['mtcars']
kidney_df = data(brms).fetch('kidney')['kidney']
print(utils.head(kidney_df, n = 3))
# MODELING
formula1 = stats.as_formula("time | cens(censored) ~ age + sex + disease")
fit1 = brms.brm(formula1, data = kidney_df, family = "weibull", inits = "0")
print(stats.summary(fit1))
formula2 = stats.as_formula("time | cens(censored) ~ age + sex + disease + (1|patient)")
fit2 = brms.brm(formula2, data = kidney_df, family = "weibull", inits = "0",
                prior = brms.set_prior("cauchy(0,2)", **{'class': "sd"}))  # 'class' is a Python keyword, so pass it via a dict
print(stats.summary(fit2))
# GRAPHING
grDevices.png('/path/to/plot1.png')
graphics.plot(fit1)
grDevices.dev_off()
grDevices.png('/path/to/plot2.png')
graphics.plot(fit2)
grDevices.dev_off()
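If you also want the R data frame on the Python side (for example, to verify the data manipulation in pandas), a small sketch using rpy2's pandas converter (assuming rpy2 3.x):
from rpy2.robjects import conversion, default_converter, pandas2ri

# convert the R kidney data frame to a pandas DataFrame for inspection
with conversion.localconverter(default_converter + pandas2ri.converter):
    kidney_pd = conversion.rpy2py(kidney_df)
print(kidney_pd.head(3))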

Keras R: how to save a model and continue training

I'm following the example here: https://keras.rstudio.com/articles/examples/lstm_text_generation.html
I'm struggling to figure out how to save the model and then at a later date continue training (possibly on a different computer).
thanks!
In Keras, save your model architecture and weights together; then, each time you want to continue, load the model and fit it on your new input data, like this (the Python API is shown; the R keras package mirrors it with save_model_hdf5() and load_model_hdf5()):
from keras.models import Sequential, load_model
from keras.layers import Dense, SimpleRNN, TimeDistributed

model = Sequential()
model.add(SimpleRNN(input_shape=(None, 2),
                    return_sequences=True,
                    units=5))
model.add(TimeDistributed(Dense(activation='sigmoid', units=3)))
model.compile(loss='mse', optimizer='rmsprop')
model.fit(inputs, outputs, epochs=500, batch_size=32)
model.save('my_model.h5')

# later, possibly on a different computer
model = load_model('my_model.h5')
# continue fitting
model.fit(inputs, outputs, epochs=500, batch_size=32)
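A related pattern worth knowing (a sketch, not part of the original answer): checkpoint during training so an interrupted run can resume from the last saved epoch:
from keras.callbacks import ModelCheckpoint

# save the full model after every epoch; resume later with load_model('checkpoint.h5')
checkpoint = ModelCheckpoint('checkpoint.h5', save_weights_only=False)
model.fit(inputs, outputs, epochs=500, batch_size=32, callbacks=[checkpoint])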
