NameError: name 'gensim' is not defined (doc2vec similarity)

I have gensim installed on my system and have already used it for summarization. Now I want to find the similarity between two sentences, and it raises an error. Sample code is given below. I have downloaded the Google News vectors.
from gensim.models import KeyedVectors
#two sample sentences
s1 = 'the first sentence'
s2 = 'the second text'
#model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)
print ('distance = %.3f' % distance)
Error:
Traceback (most recent call last):
  File "/home/abhi/Desktop/CHiir/CLustering & summarization/.idea/FInal_version/sentence_embedding.py", line 7, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
NameError: name 'gensim' is not defined

Importing with from x import y binds only the name y in your namespace, not x.
You can either write import gensim instead of from gensim.models import KeyedVectors, or use the imported KeyedVectors directly:
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
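The namespace rule above can be checked with the standard library alone. This is a minimal sketch that uses os and os.path as stand-ins for gensim and KeyedVectors (the stand-ins and the variable names from_ns and plain_ns are assumptions for illustration, not part of the original code):

```python
# Run each import statement in its own empty namespace and inspect what it binds.
from_ns = {}
exec("from os import path", from_ns)

plain_ns = {}
exec("import os", plain_ns)

# `from os import path` binds only `path`; the name `os` stays undefined.
print("path" in from_ns, "os" in from_ns)  # True False

# `import os` binds `os`, and the submodule is reached through it as os.path.
print("os" in plain_ns, hasattr(plain_ns["os"], "path"))  # True True
```

The same applies to the question: after from gensim.models import KeyedVectors, only KeyedVectors exists as a name, so any expression starting with gensim. raises NameError.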

Related

XGBoost Custom Objective Function

This will be a long question. I'm trying to define my own custom objective function.
I want to use XGBClassifier, so I run
from xgboost import XGBClassifier
The xgboost documentation says:
A custom objective function can be provided for the objective parameter. In this case, it should have the signature
objective(y_true, y_pred) -> grad, hess :
y_true: array_like of shape [n_samples], The target values
y_pred: array_like of shape [n_samples], The predicted values
grad: array_like of shape [n_samples], The value of the gradient for each sample point.
hess: array_like of shape [n_samples], The value of the second derivative for each sample point
Now, I've coded this custom objective:
def guess_averse_loss(y_true, y_pred):
    y_true = y_true.astype(int)
    y_pred = y_pred.astype(int)
    ... stuffs ...
    return grad, hess
Everything is compatible with the documentation quoted above.
If I run:
classifier = XGBClassifier(eval_metric=custom_weighted_accuracy, objective=guess_averse_loss, **params_common_model)
classifier.fit(X_train, y_train)
(where custom_weighted_accuracy is a custom metric I defined following the scikit-learn documentation)
I get the error:
-> first_term = np.multiply(cost_matrix[y_true, y_pred], np.exp(y_pred - y_true))
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (4043,) (4043,5)
So, y_pred enters the function as a matrix (n_samples x n_classes) where the element ij is the probability that the sample i belongs to the class j.
Then, I modify the line as
first_term = np.multiply(cost_matrix[y_true, np.argmax(y_pred, axis=1)],np.exp(np.argmax(y_pred, axis=1) - y_true))
so it passes from a matrix to an array,
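The shape mismatch and the argmax reduction described above can be reproduced with NumPy alone. This is a minimal sketch under stated assumptions: the cost matrix, labels and predictions are made-up stand-ins (the real cost_matrix and data are not shown in the question), and only the indexing step is illustrated, not a complete objective.

```python
import numpy as np

n_samples, n_classes = 6, 5
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's cost matrix, labels and predictions.
cost_matrix = np.arange(n_classes * n_classes, dtype=float).reshape(n_classes, n_classes)
y_true = rng.integers(0, n_classes, size=n_samples)  # shape (n_samples,)
y_pred = rng.random((n_samples, n_classes))          # shape (n_samples, n_classes)

# Index arrays of shape (n,) and (n, k) cannot be broadcast together,
# which is the IndexError shown in the traceback.
try:
    cost_matrix[y_true, y_pred.astype(int)]
    mismatch = False
except IndexError:
    mismatch = True

# Reducing the predictions to one class index per sample restores matching shapes.
y_hat = np.argmax(y_pred, axis=1)                    # shape (n_samples,)
first_term = np.multiply(cost_matrix[y_true, y_hat], np.exp(y_hat - y_true))

print(mismatch, first_term.shape)
```

This only shows why the indexing fails and how argmax makes the shapes agree. It does not show what shape xgboost expects for the returned grad and hess.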
This leads to the error:
unknown custom metric
so it seems that the problem is now the metric.
When I remove the custom objective function and use the default one, another error appears:
XGBoostError: Check failed: in_gpair->Size() % ngroup == 0U (3 vs. 0) : must have exactly ngroup * nrow gpairs
What can I do?
You have read what I've tried; I'm expecting some suggestions to solve these problems.

How to give R's "matchit" a formula object via Python's rpy2?

I use rpy2 to call R from Python. In particular, I want to use the MatchIt package, but I am stuck on a detail. The call to the package's primary function in R looks like this:
# R code
m.out1 <- matchit(
    gruppe ~ geschlecht + alter + pflege,
    data = df,
    method = "nearest",
    distance = "glm"
)
The first argument is a "formula". I have no idea how to create such an object in Python code. The other three arguments are no problem. The error from rpy2 is this:
[WARNING] R[write to console]: Error: 'formula' must be a formula object.
Traceback (most recent call last):
...
File "...\AppData\Roaming\Python\Python39\site-packages\rpy2\rinterface.py", line 813, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error: 'formula' must be a formula object.
This is the Python code producing that problem.
# Python code
import pandas
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
r_package_matchit = robjects.packages.importr('MatchIt')
func_matchit = robjects.r['matchit']
df = pandas.DataFrame({
    'gruppe': list('IICC'),
    'geschlecht': list('mwmw'),
    'alter': range(4),
    'pflege': range(4)
})
# convert the data frame from Pandas to R
with robjects.conversion.localconverter(
        robjects.default_converter + pandas2ri.converter):
    rdf = robjects.conversion.py2rpy(df)
func_matchit(formula='gruppe ~ geschlecht + alter + pflege',
             data=rdf, method='nearest',
             distance='glm')
Much easier than I thought. rpy2 offers a Formula class for cases like this.
import rpy2
from rpy2.robjects import Formula
# ...
func_matchit(formula=Formula('gruppe ~ geschlecht + alter + pflege'),
             data=rdf, method='nearest',
             distance='glm')

Pytorch Geometric Graph Classification : AttributeError: 'Batch' object has no attribute 'local_var'

I am currently working on graph classification on the IMDB-Binary dataset using deep learning, specifically the PyTorch Geometric environment.
I have split my data into train/test samples, which are lists of tuples containing a graph and its label. One thing I've had to do is treat the different graphs as a "Batch", a large disconnected graph, using torch_geometric.data.Batch. To start, I am using a data loader with the following collate function
def collate(samples):
    graphs, labels = map(list, zip(*samples))
    datalist = make_datalist(graphs)
    datalist = Batch.from_data_list(datalist)
    return datalist, torch.tensor(labels)
and my classifier is the following :
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.conv1 = GraphConv(in_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, g):
        # Use node degree as the initial node feature. For undirected graphs, the in-degree
        # is the same as the out_degree.
        h = g.in_degrees
        # Perform graph convolution and activation function.
        h = F.relu(self.conv1(g, h))
        h = F.relu(self.conv2(g, h))
        g.ndata['h'] = h
        # Calculate graph representation by averaging all the node representations.
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)
This simply averages the node representations of each graph and feeds the result to an MLP.
The problem I run into is that, when predicting on a batch, I get the error
AttributeError: 'Batch' object has no attribute 'local_var'
and I can't find where it comes from. Would anyone know?
Thank you for taking the time to read!
I am also experimenting with PyTorch Geometric and its dataset capabilities.
Maybe the following information will help someone in the future:
I run into AttributeErrors when I forget to define @property-annotated getters/setters for my dataset class attributes. See https://docs.python.org/3.7/library/functions.html#property
I think that to answer your question we need more information about your make_datalist function.
However, here are the links to the Batch class:
https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html
https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/data/batch.html#Batch
And indeed, it has no attribute called local_var.

Export Linear Mixed Effects Model Outputs in csv using Julia Language

I am new to the Julia programming language. I am fitting a Linear Mixed Effects Model and find it difficult to save the fixed and random effects estimates in .csv files.
Example code:
using MixedModels
@time modelOutput = fit(lmm(Y~ A + B + (0 + A | group), data))
There are references on how to obtain the fixed (fixef(modelOutput)) and random (ranef(modelOutput)) effects, but when I try to put them in a DataFrame I get errors.
Any advice is appreciated.
Okay, I actually took the time to do this for you. A CoefTable is a type defined in statmodels here. Given this information, we can extract the relevant information from the CoefTable instance as follows:
df = DataFrame(variable = ct.rownms,
               Estimate = ct.mat[:,1],
               StdError = ct.mat[:,2],
               z_val = ct.mat[:,3])
This will give an nvar-by-4 DataFrame which you can then write to csv as described earlier using writetable("output.csv",df)
I had a number of problems getting the accepted answer to work; Julia has evolved a lot since then. I rewrote it based primarily on code from the jglmm R package, with some adaptation/cobbling-together from other sources ...
using CSV, DataFrames

"""
    outfun(m, outfn="output.csv")

output the coefficient table of a fitted model to a file
"""
outfun = function(m, outfn="output.csv")
    ct = coeftable(m)
    coef_df = DataFrame(ct.cols);
    rename!(coef_df, ct.colnms, makeunique = true)
    coef_df[!, :term] = ct.rownms;
    CSV.write(outfn, coef_df);
end

Python equivalent to R caTools random 'sample.split'

Is there a Python (perhaps pandas) equivalent to R's
install.packages("caTools")
library(caTools)
set.seed(88)
split = sample.split(df$col, SplitRatio = 0.75)
that will generate exactly the same value split?
My current context, as an example, is getting pandas dataframes that correspond exactly to the R dataframes (qualityTrain, qualityTest) created by:
# https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv
quality = read.csv("quality.csv")
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
I think scikit-learn's train_test_split function might work for you.
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives in model_selection
from sklearn.model_selection import train_test_split
url = 'https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv'
quality = pd.read_csv(url)
train, test = train_test_split(quality, train_size=0.75, random_state=88)
qualityTrain = pd.DataFrame(train, columns=quality.columns)
qualityTest = pd.DataFrame(test, columns=quality.columns)
Unfortunately I don't get the same rows as the R function. I'm guessing it's the seeding, but could be wrong.
Splitting with sample.split from the caTools library preserves the class distribution. scikit-learn's train_test_split does not guarantee that by default (it splits the dataset into random train and test subsets).
You can get a result equivalent to R's caTools (with respect to class distribution) by using sklearn.model_selection.StratifiedShuffleSplit instead (formerly in sklearn.cross_validation, where the labels went to the constructor and n_splits was called n_iter):
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(quality, quality['PoorCare']):
    qualityTrain = quality.iloc[train_index, :]
    qualityTest = quality.iloc[test_index, :]
I know this is an old thread, but I found it while looking for a solution to the same problem. Many online classes in stats and machine learning are taught in R. They tell you to call set.seed() and then use something like caTools' sample.split. If you follow along in Python, you must get the same split, or your later results won't match and you can't get the right answer for some quiz or exercise question. One of the main issues is that, although both Python and R use the Mersenne Twister algorithm by default for pseudo-random number generation, they don't produce the same result given the same seed. I found this by looking at the random states of their respective PRNGs. One of them (I forget which) uses signed numbers and the other unsigned, so there seems to be little hope of finding a seed for Python that would produce the same series of numbers as R.
A small correction to the above: StratifiedShuffleSplit is now part of sklearn.model_selection.
I have some data with X and Y in separate numpy arrays. The proportion of 1s in my Y array is about 4.1%. If I use StratifiedShuffleSplit, it maintains this distribution in the train and test sets it produces. See below.
full_data_Y_np.sum() / len(full_data_Y_np)
0.041006701187937859
for train_index, test_index in sss.split(full_data_X_np, full_data_Y_np):
    X_train = full_data_X_np[train_index]
    Y_train = full_data_Y_np[train_index]
    X_test = full_data_X_np[test_index]
    Y_test = full_data_Y_np[test_index]
Y_train.sum() / len(Y_train)
0.041013925152306355
Y_test.sum() / len(Y_test)
0.040989847715736043
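The loop above uses an sss object whose construction is not shown. Below is a self-contained sketch of how it could be built with the current sklearn.model_selection API. It uses synthetic stand-in data (1000 samples with 4% positives, an assumption for illustration, not the data from the answer) and checks that the class ratio is preserved. It reproduces the stratification only, not the exact rows R's sample.split would pick.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data: 1000 samples, 40 of them positive (4%).
X = np.arange(2000).reshape(1000, 2)
y = np.zeros(1000, dtype=int)
y[:40] = 1

# n_splits controls the number of re-shuffled splits;
# the labels are passed to split(), not to the constructor.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_index, test_index = next(sss.split(X, y))

X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]

# The positive rate should be close to 0.04 in the full data and in both subsets.
print(y.mean(), y_train.mean(), y_test.mean())
```

For a single split, train_test_split(X, y, test_size=0.25, stratify=y, random_state=0) is a shorter way to get the same kind of stratified result.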
