spaCy error on unnamed vectors when applying a simple training model - python-3.6

I am testing the spaCy sample NER code, copied directly from the spaCy website: https://spacy.io/usage/training. I just added the import spacy and import random lines myself.
import spacy
import random

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]})]

nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
However, when I run the code, it shows this warning:
Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (0, 0))
I searched the community forums but found no clue. Thank you for your help.

Putting nlp.vocab.vectors.name = 'spacy_pretrained_vectors' before creating the optimizer is enough:
import spacy
import random

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]})]

nlp = spacy.blank('en')
nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
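As a quick sanity check, you can reload the saved model and run it on a sample text (a minimal sketch; the sample sentence is made up, and what the model actually finds depends on how well those 20 iterations trained):
import spacy

# Reload the model that was saved with nlp.to_disk('/model').
nlp2 = spacy.load('/model')
doc = nlp2("Uber is hiring in London")
print([(ent.text, ent.label_) for ent in doc.ents])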

Related

Graph isomorphism neural network

I am trying to understand graph isomorphism networks (GIN) and graph attention networks (GAT) through PyTorch for some classification tasks.
However, I can't find already-implemented projects to read and understand as hints.
There are some for GCN, and they are OK.
I wanted to know if anyone can suggest any kind of material, other than raw theoretical papers, that I can refer to.
Graph isomorphism networks (GIN) can be built using the TensorFlow and Spektral libraries.
Here is an example of a GIN network built using the above-mentioned libraries:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from spektral.layers import GINConv, GlobalAvgPool

class GIN0(Model):
    def __init__(self, channels, n_layers):
        super().__init__()
        # First GIN layer, followed by n_layers - 1 further GIN layers.
        self.conv1 = GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
        self.convs = []
        for _ in range(1, n_layers):
            self.convs.append(
                GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
            )
        # Global pooling aggregates node features into one vector per graph.
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(channels, activation="relu")

    def call(self, inputs):
        # x: node features, a: adjacency matrix, i: graph index (disjoint mode).
        x, a, i = inputs
        x = self.conv1([x, a])
        for conv in self.convs:
            x = conv([x, a])
        x = self.pool([x, i])
        return self.dense1(x)
You can use this model for training and testing just like any other TensorFlow model, with some limitations.
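For context, here is a minimal usage sketch under some assumptions: the TUDataset('PROTEINS') choice and loader settings are illustrative, and since GIN0's final Dense layer outputs channels units, I add a small softmax head sized to the dataset's classes for graph classification:
import tensorflow as tf
from spektral.data import DisjointLoader
from spektral.datasets import TUDataset
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

dataset = TUDataset('PROTEINS')  # illustrative graph-classification dataset
loader = DisjointLoader(dataset, batch_size=32, epochs=10)

class GIN0Classifier(Model):
    """GIN0 backbone from above, plus a softmax head for classification."""
    def __init__(self, channels, n_layers, n_out):
        super().__init__()
        self.gin = GIN0(channels, n_layers)
        self.out = Dense(n_out, activation="softmax")

    def call(self, inputs):
        return self.out(self.gin(inputs))

model = GIN0Classifier(64, 3, dataset.n_labels)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=10)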

Initialize HuggingFace BERT with random weights

How is it possible to initialize BERT with random weights? I want to compare the performance of multilingual vs. monolingual vs. randomly initialized BERT on a masked language modeling task. In the former cases it is very straightforward:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer_multi = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model_multi = BertForMaskedLM.from_pretrained('bert-base-multilingual-cased')
model_multi.eval()

tokenizer_mono = BertTokenizer.from_pretrained('bert-base-cased')
model_mono = BertForMaskedLM.from_pretrained('bert-base-cased')
model_mono.eval()
I don't know how to load random weights.
Thanks in advance!
You can use the following function:
import torch

def randomize_model(model):
    # Re-initialize every Linear, Embedding, and LayerNorm module in place,
    # following the usual BERT initialization scheme.
    for name, module_ in model.named_modules():
        if isinstance(module_, (torch.nn.Linear, torch.nn.Embedding)):
            module_.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
        elif isinstance(module_, torch.nn.LayerNorm):
            module_.bias.data.zero_()
            module_.weight.data.fill_(1.0)
        if isinstance(module_, torch.nn.Linear) and module_.bias is not None:
            module_.bias.data.zero_()
    return model
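Alternatively, a simpler route that is part of the standard transformers API: constructing a model directly from a config (rather than from_pretrained) gives you the same architecture with freshly initialized random weights:
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained('bert-base-cased')  # architecture/hyperparameters only
model_random = BertForMaskedLM(config)                  # weights are randomly initialized
model_random.eval()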

'Pipeline' object has no attribute 'feature_importances_'

I have a problem with my code: I want to see the feature importance of the vectors from a word2vec model, but I can't because it's a pipeline. Could someone help me find a solution, please?
## Import the random forest model and the plotting/data libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

## This line instantiates the model.
rf = Pipeline([
    ("word2vec vectorizer", MeanEmbeddingVectorizer(w2v)),
    ("Random_forest", RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0))])

## Fit the model on your training data.
rf.fit(X_train, y_train)

## And score it on your testing data.
rf.score(X_test, y_test)

## Word vectors from the trained word2vec model.
X = model.wv.syn0
X = X.astype(int)

def plot_feat_imp(model, X):
    Feature_Imp = pd.DataFrame([X, model.feature_importances_]).transpose(
        ).sort_values(1, ascending=False)
    plt.figure(figsize=(14, 7))
    sns.barplot(y=Feature_Imp.loc[:, 0], x=Feature_Imp.loc[:, 1], data=Feature_Imp, orient='h')
    plt.title("Feature importance (what best explains satisfaction)", fontsize=21)
    plt.show()
My problem is here:
plot_feat_imp(gbc_w2v, X)
AttributeError: 'Pipeline' object has no attribute 'feature_importances_'
Maybe not the answer you were seeking, but if you want the feature_importances_ of your pipeline object, you first have to reach into the pipeline's final estimator. A Pipeline itself has no feature_importances_; the fitted random forest inside it does.
This is possible with:
rf_fit = rf.fit(X_train, y_train)
feature_importances = rf_fit.named_steps['Random_forest'].feature_importances_
Hope that helps.
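For completeness, best_estimator_ only exists on a fitted search object such as GridSearchCV, not on a plain Pipeline. A hypothetical sketch if you had wrapped the pipeline in a grid search (the parameter grid here is made up for illustration):
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the forest's depth; "Random_forest" is the
# step name from the pipeline above.
search = GridSearchCV(rf, param_grid={'Random_forest__max_depth': [4, 6, 8]})
search.fit(X_train, y_train)
importances = search.best_estimator_.named_steps['Random_forest'].feature_importances_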

NameError: name 'gensim' is not defined (doc2vec similarity)

I have gensim installed on my system, and I did the summarization with gensim. Now I want to find the similarity between sentences, but it shows an error. Sample code is given below. I have downloaded the Google News vectors.
from gensim.models import KeyedVectors
#two sample sentences
s1 = 'the first sentence'
s2 = 'the second text'
#model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)
print ('distance = %.3f' % distance)
Error:
Traceback (most recent call last):
  File "/home/abhi/Desktop/CHiir/CLustering & summarization/.idea/FInal_version/sentence_embedding.py", line 7, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
NameError: name 'gensim' is not defined
Importing with from x import y only lets you use y, but not x.
You can either do import gensim instead of from gensim.models import KeyedVectors, or you can directly use the imported KeyedVectors:
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
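Putting it together, the corrected script would look like this (note that wmdistance expects tokenized documents, i.e. lists of words, so the sentences are split here rather than passed as raw strings):
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)

# wmdistance expects lists of tokens, not raw strings.
s1 = 'the first sentence'.lower().split()
s2 = 'the second text'.lower().split()

distance = model.wmdistance(s1, s2)
print('distance = %.3f' % distance)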

Python equivalent to R caTools random 'sample.split'

Is there a Python (perhaps pandas) equivalent to R's
install.packages("caTools")
library(caTools)
set.seed(88)
split = sample.split(df$col, SplitRatio = 0.75)
that will generate exactly the same split?
My current context for this is, as an example, getting pandas DataFrames that correspond exactly to the R data frames (qualityTrain, qualityTest) created by:
# https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv
quality = read.csv("quality.csv")
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
I think scikit-learn's train_test_split function might work for you.
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
quality = pd.read_csv(url)

train, test = train_test_split(quality, train_size=0.75, random_state=88)
qualityTrain = pd.DataFrame(train, columns=quality.columns)
qualityTest = pd.DataFrame(test, columns=quality.columns)
Unfortunately I don't get the same rows as the R function. I'm guessing it's the seeding, but I could be wrong.
Splitting with sample.split from the caTools library means the class distribution is preserved. scikit-learn's train_test_split does not guarantee that (it splits the dataset into random train and test subsets).
You can get a result equivalent to the R caTools library (regarding class distribution) by using StratifiedShuffleSplit instead:
sss = StratifiedShuffleSplit(quality['PoorCare'], n_iter=1, test_size=0.25, random_state=0)
for train_index, test_index in sss:
    qualityTrain = quality.iloc[train_index, :]
    qualityTest = quality.iloc[test_index, :]
I know this is an old thread, but I just found it while looking for a potential solution. For a lot of online classes in stats and machine learning that are taught in R, if you want to use Python you run into this issue: the course tells you to call set.seed() in R and then use something like caTools' sample.split, and you must get the same split or your later results won't match and you can't get the right answer for some quiz or exercise question. One of the main issues is that although both Python and R use, by default, the Mersenne Twister algorithm for pseudo-random number generation, I discovered, by looking at the random states of their respective PRNGs, that they won't produce the same result given the same seed. And one of them (I forget which) uses signed numbers while the other uses unsigned, so there seems to be little hope of finding a seed to use with Python that would produce the same series of numbers as R.
A small correction to the above: StratifiedShuffleSplit is now part of sklearn.model_selection.
I have some data with X and Y in different numpy arrays. The proportion of 1s to 0s in my Y array is about 4.1%. If I use StratifiedShuffleSplit, it maintains this distribution in the test and train sets made afterwards. See below.
# Class balance in the full dataset.
full_data_Y_np.sum() / len(full_data_Y_np)
# 0.041006701187937859

# sss constructed with the new API, e.g.:
# sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
    X_train = full_data_X_np[train_index]
    Y_train = full_data_Y_np[train_index]
    X_test = full_data_X_np[test_index]
    Y_test = full_data_Y_np[test_index]

# The class balance is preserved in both splits.
Y_train.sum() / len(Y_train)
# 0.041013925152306355
Y_test.sum() / len(Y_test)
# 0.040989847715736043
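As a final note, newer scikit-learn versions also let train_test_split preserve the class distribution directly via its stratify parameter (a sketch; like the approaches above, it still won't reproduce R's exact row selection for a given seed):
import pandas as pd
from sklearn.model_selection import train_test_split

url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
quality = pd.read_csv(url)

# stratify keeps the PoorCare class proportions in both splits.
qualityTrain, qualityTest = train_test_split(
    quality, train_size=0.75, stratify=quality['PoorCare'], random_state=88)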
