Statsmodels Error with R notation

I want to use R notation for my regression and am using the following code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
d = pd.DataFrame(np.arange(3), columns = ['a'], index = np.arange(3))
d['b'] = np.arange(3)+2
smf.OLS('a ~ b', data=d).fit()
However, I then get a long error message, involving ValueError: unrecognized data structures: / .
What is happening here? Many thanks!
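For reference, a minimal sketch of what I believe is the intended call, assuming an ordinary least-squares fit is wanted: the uppercase OLS class is the array interface that expects numeric endog/exog arguments (so it most likely trips over the formula string), while the lowercase smf.ols is the formula interface that parses the R-style string against the DataFrame.
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

d = pd.DataFrame({'a': np.arange(3)})
d['b'] = np.arange(3) + 2

# smf.ols (lowercase) accepts an R-style formula plus a DataFrame;
# it builds the design matrices from the named columns before fitting.
result = smf.ols('a ~ b', data=d).fit()
print(result.params)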

Related

sympy plot raises error "Cannot convert complex to float" / "Cannot convert expression to float"

I have code (in a Jupyter notebook) that was working last week but not anymore. Could someone please help?
Here is my code to solve an equation and plot it.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import math
import sympy as sym
from sympy import solve, Symbol, Eq, symbols
#from sympy.plotting import plot as symplot
x,y=symbols('x y')
from scipy import interpolate
from scipy.interpolate import interp1d
CAL_RR = [
    (12, 33),
    (20, 33),
    (35, 66)
]
from sympy.abc import a, b, x, y
from sympy import exp
eq = Eq(a*exp(x)/(x+b),y)
#eq0=eq.subs([(x,CAL_RR[0][0]),(y,CAL_RR[0][1])])
eq1=eq.subs([(x,CAL_RR[1][0]),(y,CAL_RR[1][1])])
eq2=eq.subs([(x,CAL_RR[2][0]),(y,CAL_RR[2][1])])
print([eq1,eq2])
res=solve([eq1,eq2],Dict=True)
print(res)
def Prr(x):
    return res[a]*exp(x)/(x+res[b])
print(Prr(15))
print(float(Prr(15)))
print(type(float(Prr(15))))
print(type(Prr(15)))
y=Prr(x)
plt.plot(x, Prr(x), label='Prr')
Here is the end of the output; I will post the beginning of it in a comment. Please tell me if you need more:
...
File C:\ProgramData\Anaconda\lib\site-packages\sympy\core\expr.py:345, in Expr.__float__(self)
343 if result.is_number and result.as_real_imag()[1]:
344 raise TypeError("Cannot convert complex to float")
--> 345 raise TypeError("Cannot convert expression to float")
TypeError: Cannot convert expression to float
Thank you for reading
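For what it's worth, a minimal sketch of one way to get a numeric plot out of this, assuming the system really does have a single real solution: pass dict=True (lowercase) so solve returns a list of solution dictionaries, substitute the solution back into the expression, and convert it into a NumPy-callable function with sympy.lambdify so matplotlib only ever receives floats. The plotting range below is purely illustrative.
import numpy as np
import matplotlib.pyplot as plt
import sympy as sym
from sympy import Eq, exp, solve
from sympy.abc import a, b, x, y

CAL_RR = [(12, 33), (20, 33), (35, 66)]
eq = Eq(a * exp(x) / (x + b), y)
eq1 = eq.subs([(x, CAL_RR[1][0]), (y, CAL_RR[1][1])])
eq2 = eq.subs([(x, CAL_RR[2][0]), (y, CAL_RR[2][1])])

# dict=True (note the lowercase keyword) returns a list of solution dicts.
res = solve([eq1, eq2], dict=True)[0]
expr = res[a] * exp(x) / (x + res[b])

# lambdify turns the symbolic expression into a plain NumPy function, so the
# values handed to matplotlib are floats rather than SymPy expressions.
prr = sym.lambdify(x, expr, 'numpy')

xs = np.linspace(25, 35, 200)   # illustrative range, kept away from the pole at x = -b
plt.plot(xs, prr(xs), label='Prr')
plt.legend()
plt.show()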

Pandas equivalent to R 'MAX_VALUE'

I am translating R code to Python using Pandas, and so far I have been able to find a Pandas equivalent for every R operation, but now I have this R code:
dtfr %>% mutate(a_column = ifelse(a_column == "INFINITY", MAX_VALUE, a_column))
This is my Pandas equivalent:
dtfr['a_column'] = np.where(dtfr['a_column'] == 'INFINITY', MAX_VALUE, dtfr['a_column'])
I have been looking for an equivalent to R MAX_VALUE in Pandas, but I haven't found how to replicate it.
There is np.inf: https://numpy.org/devdocs/reference/constants.html#numpy.inf
It is used in pandas to represent infinity (just as np.nan is used to represent "missing values").
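A minimal sketch of the resulting replacement, using a made-up frame for illustration:
import numpy as np
import pandas as pd

# Toy data; in the real frame 'INFINITY' is a string sentinel among numeric-looking values.
dtfr = pd.DataFrame({'a_column': ['1.5', 'INFINITY', '3.0']})

# Swap the sentinel for np.inf, then coerce the column to float so inf is a real number.
dtfr['a_column'] = np.where(dtfr['a_column'] == 'INFINITY', np.inf, dtfr['a_column'])
dtfr['a_column'] = pd.to_numeric(dtfr['a_column'])
print(dtfr)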

Encoding/tokenizing dataset dictionary (BERT/Huggingface)

I am trying to fine-tune my sentiment analysis model. To that end, I have split my pandas DataFrame (a column with reviews, a column with sentiment scores) into a train and a test DataFrame and transformed everything into a dataset dictionary:
#Creating Dataset Objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)
#Get rid of weird columns
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')
#Create Dataset Dictionary
data_dict = datasets.DatasetDict({"train":dataset_train,"test":dataset_test})
I am transforming everything into a dataset dictionary because I am more or less following an existing example and adapting it to my problem. Anyway, here is how I define the tokenize function:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
and call the function with:
data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
I am getting this error after all this:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
What am I missing? Sorry, I am completely new to the whole Huggingface infrastructure…
I found the error on my own: I had to specify the column that has to be tokenized. The correct tokenize function would be:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
instead of
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
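Putting it together, a minimal self-contained sketch; the toy reviews and scores below are made up, the assumption that the text column is literally named "text" comes from the fixed function above, and the checkpoint is the one from the question:
import datasets
import pandas as pd
from transformers import AutoTokenizer

# Hypothetical stand-ins for the train/test DataFrames from the question.
training_data = pd.DataFrame({"text": ["great product", "terrible"], "label": [4, 0]})
testing_data = pd.DataFrame({"text": ["okay I guess"], "label": [2]})

data_dict = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(training_data),
    "test": datasets.Dataset.from_pandas(testing_data),
})

# The tokenizer should come from the same checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

def tokenize(batch):
    # batch is a dict of columns; index the text column explicitly.
    return tokenizer(batch["text"], padding=True, truncation=True)

data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
print(data_encoded)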

Plot all columns from dataframe without having to define them in Plotly

I would like to plot all columns from a dataframe in Plotly without having to define them.
What I need is the same functionality in Plotly as in this matplotlib example:
import glob
import pandas as pd
df = pd.DataFrame({
    'A': ['15', '21', '30'],
    'M': ['12', '24', '31'],
    'I': ['28', '32', '10']})
%matplotlib inline
from matplotlib import pyplot as plt
df=df.astype(float)
df.plot()
Here is my code for Plotly, but as I said, I have no idea how to plot all the columns automatically. One thing I have also noticed is that in Plotly the x-axis needs to be defined, but I can live with that restriction.
import plotly.express as px
import pandas as pd
import numpy as np
import os
# data
df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'A': ['15', '21', '30'],
    'M': ['12', '24', '31'],
    'I': ['28', '32', '10']})
df_long=pd.melt(df , id_vars=['ID'], value_vars=['A', 'M' , 'I'])
fig = px.line(df_long, x='ID', y='value', color='variable')
fig.show()
How can I get Plotly to plot all the columns automatically?
Okay, I have found the solution to my problem:
df_long=pd.melt(df , id_vars=['ID'])
instead of:
df_long=pd.melt(df , id_vars=['ID'], value_vars=['A', 'M' , 'I'])
From the pandas.melt documentation on value_vars: "Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars."
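Combined, the snippet from the question with this fix looks like:
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'A': ['15', '21', '30'],
    'M': ['12', '24', '31'],
    'I': ['28', '32', '10']})

# Without value_vars, melt unpivots every column that is not listed in id_vars.
df_long = pd.melt(df, id_vars=['ID'])
fig = px.line(df_long, x='ID', y='value', color='variable')
fig.show()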

How to use glmulti from python using rpy2?

Consider the following dataframe:
import pickle
a='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\n\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x0b\\x00\\x00\\x00priceToBookq\\x19X\\x04\\x00\\x00\\x00betaq\\x1aX\\x0e\\x00\\x00\\x00price to salesq\\x1bX\\x0c\\x00\\x00\\x00gross profitq\\x1cX\\x0c\\x00\\x00\\x0052WeekChangeq\\x1dX\\n\\x00\\x00\\x00market capq\\x1eX\\x04\\x00\\x00\\x00ebitq\\x1fX\\r\\x00\\x00\\x00total revenueq X\\x0c\\x00\\x00\\x00payout ratioq!X\\x08\\x00\\x00\\x00pe ratioq"etq#bX\\x04\\x00\\x00\\x00nameq$Nu\\x86q%Rq&h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq\\\'}q((h$NX\\x05\\x00\\x00\\x00startq)K\\x00X\\x04\\x00\\x00\\x00stopq*K\\x07X\\x04\\x00\\x00\\x00stepq+K\\x01u\\x86q,Rq-e]q.h\\x0bh\\x0cK\\x00\\x85q/h\\x0e\\x87q0Rq1(K\\x01K\\nK\\x07\\x86q2h\\x12X\\x02\\x00\\x00\\x00f8q3K\\x00K\\x01\\x87q4Rq5(K\\x03X\\x01\\x00\\x00\\x00<q6NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK\\x00tq7b\\x89B0\\x02\\x00\\x00\\xd1#,\\x9b9\\x8c)#Cz\\xe5\\xd5\\x94_\\xf5?\\x92(\\x0ffn9\\xf0?\\n+\\x15TT-\\x17# \\xd5\\xb0\\xdf\\x13\\x03%#u\\xdek\\xad\\xd4\\xb8\\xfb?\\x1c\\xee#\\xb7&\\xbd\\xf3?-\\x98\\xf8\\xa3\\xa8\\xf3\\xf3?H\\xfd\\xf5\\n\\x0b\\xae\\xf1?:;\\x19\\x1c%/\\xf1?\\x9f\\x93\\xde7\\xbe\\xf6\\xf0?\\xbb}V\\x99)\\xad\\xf3?\\xae\\xbby\\xaaC.\\xf3?\\xa5,C\\x1c\\xeb\\xe2\\xf9?d\\x94g^\\x0e\\x13\\x12#\\x9e\\xc7r\\\\\\xd7i\\x06#\\xe4\\xe0\\x0c\\xddp\\xc8\\xcc?%\\x95)\\xe6 \\x18 #\\xa1\\xf4\\x85\\x90\\xf3\\x1e!#y6P\\x85\\xe4\\x89\\x0e#.\\xd9\\xc2=\\xe0\\x1b\\x0c#\\x00\\x00\\x00\\xc6\\x9e\\xe86B\\x00\\x00\\x00fF\\xb83B\\x00\\x00\\x00.\\xdc\\xb6\\x0bB\\x00\\x00\\x80\\x954\\xa5%B\\x00\\x00#\\\'1O3B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xf1\\xda\\x84\\xff\\x9d\\x82\\xd5?f\\xb8>\\x028#\\xa0?\\xc8^\\xef\\xfex/\\xb0\\xbf\\xab\\xd5\\x91\\x02\\x8f\\x18\\xd6?\\xd7\\xc05\\xfb,d\\xd6?r\\x8e\\xb6\\x01\\n\\xbb\\xc8?\\xc0\\xd52\\x00\\xf1F\\xc9?\\x00\\x00\\x00 \\x8b\\x1bqB\\x00\\x00\\x00\\xa0\\x92HKB\\x00\\x00\\x00\\x80\\xcb\\x8a B\\x00\\x00\\x00\\x00\\x98)_B\\x00\\x00\\x00`\\xca+pB\\x00\\x00\\x00\\xa0N\\xe4WB\\x00\\x00\\x00\\x00\\xc0\\x9fQB\\x00\\x00\\x00\\xc5\\x0c\\xc5-B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00lIv\\xf0A\\x00\\x00\\x00\\xd9\\xb83\\x17B\\x00\\x00\\x80\\xa3\\x1c\\x01$B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xc0\\x17\\xcaINB\\x00\\x00\\x00fF\\xb83B\\x00\\x00#\\xdcq\\xaaBB\\x00\\x00\\x00\\x87h\\x00*B\\x00\\x00\\xc0\\xca\\xd3L=B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xa1\\xf81\\xe6\\xae%\\xd0?\\x8b\\xfde\\xf7\\xe4a\\xd9?\\x00\\x00\\x00\\x00\\x00\\x00\\xf8?\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf1\\xf4JY\\x868\\xd6?>(\\xc5\\x1ap\\xce\\xd4?\\xc3\\xf5(\\\\\\x8f\\xc2\\xcd?\\xad\\xbf%\\x00\\xff\\xe05#\\xc8$#gaG\\\'#\\x9a\\x99\\x99\\x99\\x99\\x996#{Ic\\xb4\\x8e\\x82>#^+\\xa1\\xbb$\\x8a;#UL\\xa5\\x9fp\\xbe)#0G\\x8f\\xdf\\xdb\\x84(#q8tq9ba]q:h\\x07h\\x08}q;(h\\nh\\x0bh\\x0cK\\x00\\x85q<h\\x0e\\x87q=Rq>(K\\x01K\\n\\x85q?h\\x15\\x89]q#(h\\x19h\\x1ah\\x1bh\\x1ch\\x1dh\\x1eh\\x1fh 
h!h"etqAbh$Nu\\x86qBRqCa}qDX\\x06\\x00\\x00\\x000.14.1qE}qF(X\\x04\\x00\\x00\\x00axesqGh\\x06X\\x06\\x00\\x00\\x00blocksqH]qI}qJ(X\\x06\\x00\\x00\\x00valuesqKh1X\\x08\\x00\\x00\\x00mgr_locsqLcbuiltins\\nslice\\nqMK\\x00K\\nK\\x01\\x87qNRqOuaustqPbX\\x04\\x00\\x00\\x00_typqQX\\t\\x00\\x00\\x00dataframeqRX\\t\\x00\\x00\\x00_metadataqS]qTub.\')'
a=eval(a)
a
and I want to run the function known as glmulti on it in Python. I tried lots of ways but failed. I then resorted to going through R as follows:
Export the dataset to an Excel file:
a.to_excel('test1.xlsx')
Go to RStudio:
install.packages("glmulti", "rJava", "readxl")
library("glmulti", "rJava", "readxl")
getwd()
setwd(".Gp\\to\\the\\python directory where you are workingin")
my_data <- read_excel("test1.xlsx", sheet = 1)
Rename the columns of the dataframe, because glmulti does not work with the original column names:
j = 1
for (i in paste0("x", 1:length(my_data))) {
    names(my_data)[j] = i
    j = j + 1
}
Select my x and y variables:
y=my_data[,6]
x=my_data[, names(my_data) != names(my_data)[6]]
Finally, I run the function I want in R:
glmulti(names(y), names(x), data=my_data, method="h")
Is there an easier way to run it from Python using rpy2? If so, can you please advise?
Consider converting the Pandas data frame into an R data frame with rpy2, and then calling glmulti from the imported package just as you do now.
However, a few notes about R:
In R, every function or method derives from a package. The same is true of Python except for built-in functions (e.g., list, sum, type), but R's standard library packages (e.g., utils, stats, base) are loaded by default and provide the everyday functions (e.g., read.csv, head, summary).
Though you can qualify the package name on each function call, such as base::names, this is not required as it is in Python; it is, however, helpful in case of name collisions with other packages.
You do not need a for loop to rename all the columns; you can vectorize with base::paste0 and assign with stats::setNames or base::colnames.
Python Processing
import pandas as pd
import pickle
df_py = eval('pickle.loads(...)')
# RE-ORDER COLUMNS BY MOVING SIXTH COLUMN TO FIRST POSITION
cols = df_py.columns.to_list()
new_order = [cols[5]] + cols[0:5] + cols[6:]
df_py = df_py.reindex(new_order, axis=1)
print(df_py.head(10))
R Processing
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
utils = importr('utils')
base = importr('base')
stats = importr('stats')
glmulti = importr('glmulti') # DOES NOT REQUIRE rJava PACKAGE BUT DOES REQUIRE Java LANGUAGE
# CONVERT TO R DATAFRAME
pandas2ri.activate()
df_r = pandas2ri.py2ri(df_py) # USING ABOVE PANDAS DATA FRAME
# RENAME COLUMNS y, x1, x2, x3, ...
df_r = stats.setNames(df_r, base.c("y", base.paste0("x", base.seq(1,base.length(df_r)[0]-1))))
print(utils.head(df_r, 10))
# CALL glmulti()
glmulti.glmulti(y = base.names(df_r)[0],
                xr = base.names(df_r)[1:],
                data = df_r,
                method = "h")
