BERT attribution scores for token probability prediction

I've been trying to find a library or an example for getting token importance when a BERT model predicts a masked span, e.g.:
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

text = 'Brad Pitt is an [MASK] actor.'
tokenized_text = tokenizer.tokenize(text)
masked_index = tokenized_text.index("[MASK]")
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

probs = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)
You could then pick the highest-probability prediction, or the top 5 values.
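For instance, a minimal sketch continuing from the snippet above (torch.topk returns both the scores and the vocabulary indices):

top_probs, top_ids = torch.topk(probs, k=5)
for prob, token_id in zip(top_probs, top_ids):
    # map each vocabulary id back to its token string
    print(tokenizer.convert_ids_to_tokens([token_id.item()])[0], prob.item())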
How would I go about calculating, let's say, vanilla gradients or any other type of saliency method, and seeing which tokens were important when predicting the masked token?
I read Ecco's documentation, but they don't support attribution for BERT yet; AllenNLP has a demo for the MLM task, but it's only for that demo; and I couldn't find anything relevant using SHAP or Captum.
Any help pointing me in the right direction would be appreciated.
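In case it helps others: one way to get vanilla gradients here — a sketch under the assumption that passing pre-computed embeddings via inputs_embeds is acceptable (vanilla_grad_saliency is an illustrative name, not a library function) — is to take the gradient of the predicted token's logit with respect to the input embeddings and reduce it to one score per token:

import torch

def vanilla_grad_saliency(model, tokens_tensor, masked_index):
    # Embed the input ids manually so gradients can flow back to the embeddings.
    embeds = model.bert.embeddings.word_embeddings(tokens_tensor)
    embeds = embeds.detach().requires_grad_(True)
    outputs = model(inputs_embeds=embeds)
    logits = outputs[0]
    # Logit of the top predicted token at the masked position.
    predicted_id = logits[0, masked_index].argmax()
    logits[0, masked_index, predicted_id].backward()
    # One importance score per input token: L2 norm of its embedding gradient.
    return embeds.grad[0].norm(dim=-1)

scores = vanilla_grad_saliency(model, tokens_tensor, masked_index)
for token, score in zip(tokenized_text, scores):
    print(token, score.item())

This is essentially the computation that gradient-based saliency tools (e.g. Captum's Saliency applied over inputs_embeds) perform under the hood.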

Related

randomForest in R is including the class label as a feature, which prevents the classifier from predicting on a new dataset

So I have two datasets, og.data and newdata.df. I have matched their features, and I want to use the class label in og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R; its documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
library(caTools)        # provides sample.split()
library(randomForest)

split <- sample.split(og.data$class_label, SplitRatio = 0.7)
# subset the whole data frame, not just the label column
training_set = subset(og.data, split == TRUE)
test_set = subset(og.data, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
                                    y = training_set$class_label,
                                    ntree = 500)
I then use the test set to calculate the AUC and to visualize the ROC curve, precision, recall, etc.
I do that using prediction probabilities generated like so:
predictions.df <- as.data.frame(predict(rf.classifier.object,
                                        test_set,
                                        type = "prob"))
All is good. I then proceed to try to use the classifier I've trained on the new data, and now I encounter a problem, because the new data does not contain the class label feature. Which is annoying, as the entire purpose of training the classifier is to label this new data.
predictions.df <- as.data.frame(predict(rf.classifier.object,
                                        newdata.df,
                                        type = "prob"))
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
  variables in the training data missing in newdata
As per this Stack post, predict.randomForest(), called here as predict(), uses the rownames of the feature importance to make its predictions. And when I checked the feature names, I found that it is in fact the class label preventing me from making the prediction, as I show below.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata.df))]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly, and this seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all, and should not even be considered a feature, imo.

Initialize HuggingFace Bert with random weights

How is it possible to initialize BERT with random weights? I want to compare the performance of multilingual vs monolingual vs randomly initialized BERT in a masked language modeling task. While in the former cases it is very straightforward:
from transformers import BertTokenizer, BertForMaskedLM
tokenizer_multi = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model_multi = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')
model_multi.eval()
tokenizer_mono = BertTokenizer.from_pretrained('bert-base-cased')
model_mono = BertForMaskedLM.from_pretrained('bert-base-cased')
model_mono.eval()
I don't know how to load random weights.
Thanks in advance!
You can use the following function:
import torch

def randomize_model(model):
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
        elif isinstance(module, torch.nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, torch.nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    return model
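Alternatively, rather than overwriting pretrained weights, you can build the model straight from a config, which leaves it randomly initialized (shown here with bert-base-cased; the same works for the multilingual checkpoint):

from transformers import BertConfig, BertForMaskedLM

# Reuse the architecture/vocabulary settings of an existing checkpoint,
# but do not load its weights: the model is randomly initialized.
config = BertConfig.from_pretrained('bert-base-cased')
model_random = BertForMaskedLM(config)
model_random.eval()

The pretrained tokenizer can still be reused, since tokenization does not depend on the model weights.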

Importance-based variable reduction

I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables getting into the model.
I have already tried two approaches, but I failed twice.
The first thing I tried was to manually check variable importance after modelling and, based on that, remove non-significant variables.
library(dplyr)

# reproducible example
data <- iris
# artificial class imbalance
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine when using a simple Learner:
library(mlr3)
library(mlr3learners)

# creating the Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating the Learner
lrn <- lrn("classif.xgboost")
# setting probability as the prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
Petal.Width Petal.Length
0.90606304 0.09393696
The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp to undersample the majority group, which is then passed to an AutoTuner.
(I skipped some parts of the code, such as the search space, terminator, and tuner, which I believe are not important for this case.)
# undersampling
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine the learner with the pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting up the AutoTuner
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)
at$train(task)
The problem right now is that, despite the importance property still being visible within at, $importance() is unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and try to add filtering into the Learner. And that's where I failed even more. I started by looking into the feature-selection chapter of the mlr3book (https://mlr3book.mlr-org.com/fs.html). I tried to add importance = "impurity" to the Learner just like in the book, but it yielded an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
Which basically means:
Error in 'instance[[nn]] <- dots[[i]]': cannot change value of locked binding for 'importance'
I also tried to work around it with PipeOp filtering, but that failed miserably as well. I believe I won't be able to do it without importance = "impurity".
So my question is: is there a way to achieve what I am aiming for?
In addition, I would be very grateful for an explanation: why is filtering by importance possible before modelling? Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
1. Get the untrained Learner object.
2. Get the trained state of the Learner and put it into that object.
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried:
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.

mlr3: obtaining response (predicted survival time) from surv.gbm

surv.gbm in the mlr3 framework outputs linear predictors; however, what I'm really interested in are predicted survival times per case, which I want to compare with the actual survival times. Is there a way to obtain predicted survival times?
In the mlr3 book, there is an example of a transformation between linear predictors and a distribution.
pod = po("distrcompose", param_vals = list(form = "ph", overwrite = FALSE))
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
Is there a way to change this pipeline so that it converts "lp" to "response" ?
Any help would be appreciated.
Yes, this is definitely possible; it just requires another transformation. Your first step is correct: compose a distribution from the linear predictor. As you're using surv.gbm, only Cox PH is possible as the underlying model, so the default for distrcompose works for this.
Now you need to use crankcompose to create a survival time prediction from the distribution. You could use the mean, median, or mode of the distribution; people usually pick the mean or median, but that's your choice! Just make sure to include response = TRUE, overwrite = FALSE. Example code is below, including creating predictions and scoring with RMSE (surprisingly quite good!). I think the book may need updating...
Thanks,
Raphael
library(mlr3extralearners)
library(mlr3proba)
library(mlr3pipelines)
library(mlr3)
learn = ppl("crankcompositor", ppl("distrcompositor", lrn("surv.gbm")),
            response = TRUE, overwrite = FALSE, method = "mean",
            graph_learner = TRUE)
set.seed(1)
task = tgen("simsurv")$generate(50)
learn$train(task)
p = learn$predict(task)
p$score(msr("surv.rmse"))

IV Estimation with Cluster Robust Standard Errors using the plm package in R

I'm using the plm package for panel data to do instrumental variable estimation. However, it seems that calculating cluster-robust standard errors using the vcovHC() function is not supported.
More specifically, when I use the vcovHC() function, the following error message is displayed:
Error in vcovG.plm(x, type = type, cluster = cluster, l = 0, inner = inner, :
  Method not available for IV
Example:
data("Wages", package = "plm")
IV <- plm(lwage ~ south + exp | wks + south,
          data = Wages, model = "pooling", index = 595)
vcvIV <- vcovHC(IV)
According to this thread, someone worked on a fix two years ago. Has there been any progress on the issue? I know that the packages "lfe" and "ivpack" allow computing cluster-robust standard errors for IV estimation, but neither of them allows for random effects/intercepts.
In fact, it's not implemented. However, you can use Schrimpf's clustered-errors function, which is applied directly to an object of the plm class.
Using your example:
library(plm)
data("Wages", package = "plm")
IV <- plm(lwage ~ south + exp | wks + south, data = Wages, model = "pooling", index = 595)
Wages$id <- rep(1:595, each = 7)
cl.plm(Wages, IV, Wages$id)
Here I'm using Wages$id as the panel's first dimension, around which clusters will be formed. You may want to compare these results with those obtained in other software. Anyway, the code is simple, allowing some tweaks. The cl.plm function is based on Arai's clustering notes, which can help you further.
You can obtain the same result from cl.plm doing this in Stata:
ivregress 2sls lwage south (exp = wks), vce(cluster id) small
Or for the within model:
xtset id time, generic
xtivreg2 lwage south (exp = wks), fe small cluster(id)
Note, however, that I used the small-sample formulation in Stata, which is not a big deal. More about this here. Anyway, cl.plm properly deals with the plm class object.
For the sake of completeness: as suggested by @Helix123, you can use the development version (1.6-1) of the plm package and proceed as you did in your question.
