How to give R's "matchit" a formula object via Python's rpy2? - r

I use rpy2 to call R from Python. In particular, I want to use the MatchIt package, but I am stuck on a detail. The call to the package's primary function looks like this in R:
# R code
m.out1 <- matchit(
    gruppe ~ geschlecht + alter + pflege,
    data = df,
    method = "nearest",
    distance = "glm"
)
The first argument is a "formula". I don't know how to create such an object in Python code. The other three arguments are no problem. The error from rpy2 is this:
[WARNING] R[write to console]: Error: 'formula' must be a formula object.
Traceback (most recent call last):
...
File "...\AppData\Roaming\Python\Python39\site-packages\rpy2\rinterface.py", line 813, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error: 'formula' must be a formula object.
This is the Python code producing that problem.
# Python code
import pandas
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
r_package_matchit = robjects.packages.importr('MatchIt')
func_matchit = robjects.r['matchit']
df = pandas.DataFrame({
    'gruppe': list('IICC'),
    'geschlecht': list('mwmw'),
    'alter': range(4),
    'pflege': range(4)
})
# convert the data frame from Pandas to R
with robjects.conversion.localconverter(
        robjects.default_converter + pandas2ri.converter):
    rdf = robjects.conversion.py2rpy(df)
func_matchit(formula='gruppe ~ geschlecht + alter + pflege',
             data=rdf, method='nearest',
             distance='glm')

Much easier than I thought: rpy2 offers a Formula() class for cases like this.
import rpy2
from rpy2.robjects import Formula
# ...
func_matchit(formula=Formula('gruppe ~ geschlecht + alter + pflege'),
             data=rdf, method='nearest',
             distance='glm')
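If the covariate names come from Python data (for example from the DataFrame's columns), the formula text can be assembled first and wrapped in Formula() afterwards. A minimal sketch, assuming a hypothetical helper build_formula that is not part of rpy2 or MatchIt; the rpy2 call appears only in a comment so the snippet runs without R installed:

```python
def build_formula(outcome, covariates):
    """Return an R-style formula string such as 'y ~ a + b'.

    Hypothetical helper for illustration; not part of rpy2.
    """
    if not covariates:
        # 'y ~ 1' is R's notation for an intercept-only model
        return f"{outcome} ~ 1"
    return f"{outcome} ~ " + " + ".join(covariates)


formula_text = build_formula("gruppe", ["geschlecht", "alter", "pflege"])
print(formula_text)

# With rpy2 available, the string would then be wrapped before the call:
#   from rpy2.robjects import Formula
#   func_matchit(formula=Formula(formula_text), data=rdf,
#                method='nearest', distance='glm')
```

The point is unchanged from the answer above: R must receive a formula object, so the plain string always goes through Formula() before it is passed on.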

Related

Azure ML Studio not showing datasets under models

I registered a model in an Azure ML notebook along with its datasets. In ML Studio I can see the model listed under the dataset, but no dataset gets listed under the model. What should I do to have datasets listed under models?
Model listed under dataset:
Dataset not listed under the model:
Notebook code:
import pickle
import sys
from azureml.core import Workspace, Dataset, Model
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import assert_all_finite
workspace = Workspace('<snip>', '<snip>', '<snip>')
dataset = Dataset.get_by_name(workspace, name='creditcard')
data = dataset.to_pandas_dataframe()
data.dropna(inplace=True)
X = data.drop(labels=["Class"], axis=1, inplace=False)
y = data["Class"]
model = make_pipeline(StandardScaler(), GradientBoostingClassifier())
model.fit(X, y)
with open('creditfraud_sklearn_model.pkl', 'wb') as outfile:
    pickle.dump(model, outfile)
Model.register(
    workspace = workspace,
    model_name = 'creditfraud_sklearn_model',
    model_path = 'creditfraud_sklearn_model.pkl',
    description = 'Gradient Boosting classifier for Kaggle credit-card fraud',
    model_framework = Model.Framework.SCIKITLEARN,
    model_framework_version = sys.modules['sklearn'].__version__,
    sample_input_dataset = dataset,
    sample_output_dataset = dataset)
It looks like add_dataset_references() needs to be called on the registered model to have datasets displayed under models. Here model_registration is the object returned by Model.register():
model_registration.add_dataset_references([("input dataset", dataset)])

NameError: name 'gensim' is not defined (doc2vec similarity)

I have gensim installed on my system and used it for summarization. Now I want to find the similarity between two sentences, but I get an error. Sample code is given below. I have downloaded the Google News vectors.
from gensim.models import KeyedVectors
#two sample sentences
s1 = 'the first sentence'
s2 = 'the second text'
#model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)
print ('distance = %.3f' % distance)
Error:
Traceback (most recent call last):
  File "/home/abhi/Desktop/CHiir/CLustering & summarization/.idea/FInal_version/sentence_embedding.py", line 7, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
NameError: name 'gensim' is not defined
Importing with from x import y only lets you use y, but not x.
You can either do import gensim instead of from gensim.models import KeyedVectors, or you can directly use the imported KeyedVectors:
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
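The same behaviour can be reproduced without gensim. A small sketch using a standard-library package as a stand-in (xml.etree plays the role of gensim.models, ElementTree the role of KeyedVectors):

```python
# 'from package.sub import name' binds only `name` in the current namespace.
from xml.etree import ElementTree

# The imported name can be used directly.
root = ElementTree.fromstring("<doc><item/></doc>")
print(root.tag)

# The top-level package name `xml` was never bound, so this fails
# in the same way as gensim.models.KeyedVectors... in the question.
package_is_bound = True
error_text = ""
try:
    xml.etree.ElementTree.fromstring("<doc/>")
except NameError as exc:
    package_is_bound = False
    error_text = str(exc)

print(package_is_bound, error_text)
```

Adding a plain import xml (or import gensim in the question) binds the package name and makes the fully qualified spelling work as well.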

R Leaps Package: Regsubsets - coef "Reordr" Fortran error

I'm using the R leaps package to obtain a fit to some data:
(My dataframe df contains a Y variable and 41 predictor variables)
require(leaps)
N=3
regsubsets(Y ~ ., data = df, nbest=1, nvmax=N+1,force.in="X", method = 'exhaustive')-> regfit
coef(regfit,id = N)
When I run the code more than once (the first time works fine) I get the following error when I run the coef command:
Error in .Fortran("REORDR", np = as.integer(object$np), nrbar = as.integer(object$nrbar), :
"reordr" not resolved from current namespace (leaps)
Any help with why this is happening would be much appreciated.
A.
I had to build the package from source, inserting the PACKAGE = 'leaps' argument into the .Fortran("REORDR", ...) call in the leaps.R file. It now works fine every time.
The solution is related to:
R: error message --- package error: "functionName" not resolved from current namespace

Minimal example of rpy2 regression using pandas data frame

What is the recommended way (if any) for doing linear regression using a pandas dataframe? I can do it, but my method seems very elaborate. Am I making things unnecessarily complicated?
The R code, for comparison:
x <- c(1,2,3,4,5)
y <- c(2,1,3,5,4)
M <- lm(y~x)
summary(M)$coefficients
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
Now, my Python (2.7.10), rpy2 (2.6.0), and pandas (0.16.1) version:
import pandas
import pandas.rpy.common as common
from rpy2 import robjects
from rpy2.robjects.packages import importr
base = importr('base')
stats = importr('stats')
dataframe = pandas.DataFrame({'x': [1,2,3,4,5],
                              'y': [2,1,3,5,4]})
robjects.globalenv['dataframe']\
    = common.convert_to_r_dataframe(dataframe)
M = stats.lm('y~x', data=base.as_symbol('dataframe'))
print(base.summary(M).rx2('coefficients'))
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
By the way, I do get a FutureWarning on the import of pandas.rpy.common. However, when I tried pandas2ri.py2ri(dataframe) to convert a dataframe from pandas to R (as mentioned here), I got
NotImplementedError: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
After calling pandas2ri.activate() some conversions from Pandas objects to R objects happen automatically. For example, you can use
M = R.lm('y~x', data=df)
instead of
robjects.globalenv['dataframe'] = dataframe
M = stats.lm('y~x', data=base.as_symbol('dataframe'))
A complete example:
import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
R = ro.r
df = pd.DataFrame({'x': [1,2,3,4,5],
                   'y': [2,1,3,5,4]})
M = R.lm('y~x', data=df)
print(R.summary(M).rx2('coefficients'))
yields
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
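As a cross-check that needs neither R nor rpy2, the Estimate, Std. Error and t value columns can be reproduced with the textbook closed-form formulas for simple linear regression. This is only a sketch using the five data points from the question; the variable names are mine, and the p-values are left out because they need the t distribution's CDF, which the standard library does not provide.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares estimates
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance with n - 2 degrees of freedom
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in residuals) / (n - 2)

# Standard errors and t statistics of both coefficients
se_slope = math.sqrt(sigma2 / sxx)
se_intercept = math.sqrt(sigma2 * (1 / n + mean_x ** 2 / sxx))
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print("(Intercept)", intercept, se_intercept, t_intercept)
print("x", slope, se_slope, t_slope)
```

Up to rounding, the printed values agree with the first three columns of the coefficients table above.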
The R and Python versions are not strictly identical because you build a data frame in Python/rpy2, whereas you use vectors (without a data frame) in R.
Otherwise, the conversion shipping with rpy2 appears to be working here:
from rpy2.robjects import pandas2ri
pandas2ri.activate()
robjects.globalenv['dataframe'] = dataframe
M = stats.lm('y~x', data=base.as_symbol('dataframe'))
The result:
>>> print(base.summary(M).rx2('coefficients'))
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
I can add to unutbu's answer by outlining how to retrieve particular elements of the coefficients table including, crucially, the p-values.
def r_matrix_to_data_frame(r_matrix):
    """Convert an R matrix into a Pandas DataFrame"""
    import pandas as pd
    from rpy2.robjects import pandas2ri
    array = pandas2ri.ri2py(r_matrix)
    return pd.DataFrame(array,
                        index=r_matrix.names[0],
                        columns=r_matrix.names[1])
# Let's start from unutbu's line retrieving the coefficients:
coeffs = R.summary(M).rx2('coefficients')
df = r_matrix_to_data_frame(coeffs)
This leaves us with a DataFrame which we can access in the normal way:
In [179]: df['Pr(>|t|)']
Out[179]:
(Intercept) 0.637618
x 0.104088
Name: Pr(>|t|), dtype: float64
In [181]: df.loc['x', 'Pr(>|t|)']
Out[181]: 0.10408803866182779

How to call ltm function using rpy package in python

I am trying the following code:
from rpy import *
r.library("ltm")
dat= #some data frame or matrix
r.ltm(r('dat~z1'))
The error is:
RPy_RException: Error in eval(expr, envir, enclos) : object 'dat' not found
Please tell me the right way to call the ltm function using the rpy library.
As a general approach, I'd try rpy2 with something along the lines of:
from rpy2.robjects import *
r("library('ltm')")
r.assign('r_var_name',py_var_name)
r("r_var_name<-as.desired.data.type(r_var_name)")
Then run whatever commands you need on 'r_var_name' with the 'ltm' package functions inside further r("blah") statements.
E.g. getting the coefficients for one of the ltm package examples:
In [30]: py_obj = r("coef(ltm(Abortion ~ z1, control = list(GHk = 20, iter.em = 20)))")
In [32]: py_obj
Out[32]:
<Matrix - Python:0x4db0290 / R:0x52f04f0>
[0.188998, -0.256378, -0.367623, ..., 4.542567, 5.840821, 3.243826]
