How to use glmulti from Python using rpy2?

Consider the following data frame:
import pickle
a='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\n\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x0b\\x00\\x00\\x00priceToBookq\\x19X\\x04\\x00\\x00\\x00betaq\\x1aX\\x0e\\x00\\x00\\x00price to salesq\\x1bX\\x0c\\x00\\x00\\x00gross profitq\\x1cX\\x0c\\x00\\x00\\x0052WeekChangeq\\x1dX\\n\\x00\\x00\\x00market capq\\x1eX\\x04\\x00\\x00\\x00ebitq\\x1fX\\r\\x00\\x00\\x00total revenueq X\\x0c\\x00\\x00\\x00payout ratioq!X\\x08\\x00\\x00\\x00pe ratioq"etq#bX\\x04\\x00\\x00\\x00nameq$Nu\\x86q%Rq&h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq\\\'}q((h$NX\\x05\\x00\\x00\\x00startq)K\\x00X\\x04\\x00\\x00\\x00stopq*K\\x07X\\x04\\x00\\x00\\x00stepq+K\\x01u\\x86q,Rq-e]q.h\\x0bh\\x0cK\\x00\\x85q/h\\x0e\\x87q0Rq1(K\\x01K\\nK\\x07\\x86q2h\\x12X\\x02\\x00\\x00\\x00f8q3K\\x00K\\x01\\x87q4Rq5(K\\x03X\\x01\\x00\\x00\\x00<q6NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK\\x00tq7b\\x89B0\\x02\\x00\\x00\\xd1#,\\x9b9\\x8c)#Cz\\xe5\\xd5\\x94_\\xf5?\\x92(\\x0ffn9\\xf0?\\n+\\x15TT-\\x17# \\xd5\\xb0\\xdf\\x13\\x03%#u\\xdek\\xad\\xd4\\xb8\\xfb?\\x1c\\xee#\\xb7&\\xbd\\xf3?-\\x98\\xf8\\xa3\\xa8\\xf3\\xf3?H\\xfd\\xf5\\n\\x0b\\xae\\xf1?:;\\x19\\x1c%/\\xf1?\\x9f\\x93\\xde7\\xbe\\xf6\\xf0?\\xbb}V\\x99)\\xad\\xf3?\\xae\\xbby\\xaaC.\\xf3?\\xa5,C\\x1c\\xeb\\xe2\\xf9?d\\x94g^\\x0e\\x13\\x12#\\x9e\\xc7r\\\\\\xd7i\\x06#\\xe4\\xe0\\x0c\\xddp\\xc8\\xcc?%\\x95)\\xe6 \\x18 
#\\xa1\\xf4\\x85\\x90\\xf3\\x1e!#y6P\\x85\\xe4\\x89\\x0e#.\\xd9\\xc2=\\xe0\\x1b\\x0c#\\x00\\x00\\x00\\xc6\\x9e\\xe86B\\x00\\x00\\x00fF\\xb83B\\x00\\x00\\x00.\\xdc\\xb6\\x0bB\\x00\\x00\\x80\\x954\\xa5%B\\x00\\x00#\\\'1O3B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xf1\\xda\\x84\\xff\\x9d\\x82\\xd5?f\\xb8>\\x028#\\xa0?\\xc8^\\xef\\xfex/\\xb0\\xbf\\xab\\xd5\\x91\\x02\\x8f\\x18\\xd6?\\xd7\\xc05\\xfb,d\\xd6?r\\x8e\\xb6\\x01\\n\\xbb\\xc8?\\xc0\\xd52\\x00\\xf1F\\xc9?\\x00\\x00\\x00 \\x8b\\x1bqB\\x00\\x00\\x00\\xa0\\x92HKB\\x00\\x00\\x00\\x80\\xcb\\x8a B\\x00\\x00\\x00\\x00\\x98)_B\\x00\\x00\\x00`\\xca+pB\\x00\\x00\\x00\\xa0N\\xe4WB\\x00\\x00\\x00\\x00\\xc0\\x9fQB\\x00\\x00\\x00\\xc5\\x0c\\xc5-B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00lIv\\xf0A\\x00\\x00\\x00\\xd9\\xb83\\x17B\\x00\\x00\\x80\\xa3\\x1c\\x01$B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xc0\\x17\\xcaINB\\x00\\x00\\x00fF\\xb83B\\x00\\x00#\\xdcq\\xaaBB\\x00\\x00\\x00\\x87h\\x00*B\\x00\\x00\\xc0\\xca\\xd3L=B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xa1\\xf81\\xe6\\xae%\\xd0?\\x8b\\xfde\\xf7\\xe4a\\xd9?\\x00\\x00\\x00\\x00\\x00\\x00\\xf8?\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf1\\xf4JY\\x868\\xd6?>(\\xc5\\x1ap\\xce\\xd4?\\xc3\\xf5(\\\\\\x8f\\xc2\\xcd?\\xad\\xbf%\\x00\\xff\\xe05#\\xc8$#gaG\\\'#\\x9a\\x99\\x99\\x99\\x99\\x996#{Ic\\xb4\\x8e\\x82>#^+\\xa1\\xbb$\\x8a;#UL\\xa5\\x9fp\\xbe)#0G\\x8f\\xdf\\xdb\\x84(#q8tq9ba]q:h\\x07h\\x08}q;(h\\nh\\x0bh\\x0cK\\x00\\x85q<h\\x0e\\x87q=Rq>(K\\x01K\\n\\x85q?h\\x15\\x89]q#(h\\x19h\\x1ah\\x1bh\\x1ch\\x1dh\\x1eh\\x1fh h!h"etqAbh$Nu\\x86qBRqCa}qDX\\x06\\x00\\x00\\x000.14.1qE}qF(X\\x04\\x00\\x00\\x00axesqGh\\x06X\\x06\\x00\\x00\\x00blocksqH]qI}qJ(X\\x06\\x00\\x00\\x00valuesqKh1X\\x08\\x00\\x00\\x00mgr_locsqLcbuiltins\\nslice\\nqMK\\x00K\\nK\\x01\\x87qNRqOuaustqPbX\\x04\\x00\\x00\\x00_typqQX\\t\\x00\\x00\\x00dataframeqRX\\t\\x00\\x00\\x00_metadataqS]qTub.\')'
a=eval(a)
a
I want to run the glmulti function from Python. I tried many approaches without success, so as a last resort I moved to R as follows.
Export the dataset to an Excel file:
a.to_excel('test1.xlsx')
In RStudio:
install.packages(c("glmulti", "rJava", "readxl"))
library(glmulti)
library(rJava)
library(readxl)
getwd()
setwd(".Gp\\to\\the\\python directory where you are workingin")
my_data <- read_excel("test1.xlsx", sheet = 1)
Rename the columns of the data frame, because glmulti does not work with the original column names:
j=1
for (i in paste0("x",1:length(my_data))){
names(my_data)[j]=i
j=j+1
}
Select my x variable and y variable
y=my_data[,6]
x=my_data[, names(my_data) != names(my_data)[6]]
Finally, I run the function I want in R:
glmulti(names(y), names(x), data=my_data, method="h")
Is there an easier way to run it from python using rpy2? If so can you please advise on this?

Consider converting the pandas data frame into an R data frame with rpy2, and then calling glmulti from the imported package just as you do now.
However, a few notes about R:
In both R and Python, every function or method comes from a package; Python's exceptions are its built-in functions (e.g., list, sum, type). R differs in that its standard packages (e.g., utils, stats, base) are loaded by default, so everyday functions (e.g., read.csv, head, summary) are available without an explicit import.
You can qualify each function call with its package name, as in base::names. Unlike in Python this is not required, but it helps when names collide across packages.
You do not need a for loop to rename all columns; you can vectorize with base::paste0 and assign with stats::setNames or base::colnames.
Python Processing
import pandas as pd
import pickle
df_py = eval('pickle.loads(...)')
# RE-ORDER COLUMNS BY MOVING SIXTH COLUMN TO FIRST POSITION
cols = df_py.columns.to_list()
new_order = [cols[5]] + cols[0:5] + cols[6:]
df_py = df_py.reindex(new_order, axis=1)
print(df_py.head(10))
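The renaming to y, x1, x2, ... can also be done on the pandas side before conversion, which would make the later stats.setNames step unnecessary. A minimal sketch, assuming the response is already in the first column; the helper name glmulti_names is hypothetical and not part of pandas or rpy2:

```python
def glmulti_names(n_columns):
    """Return ['y', 'x1', 'x2', ...] for a frame whose first column is the response."""
    if n_columns < 1:
        raise ValueError("need at least one column")
    return ["y"] + [f"x{i}" for i in range(1, n_columns)]


# Example with 10 columns, as in the data frame above
names = glmulti_names(10)
print(names)
# With pandas this would be applied as:
# df_py.columns = glmulti_names(df_py.shape[1])
```

Plain names without spaces (unlike "price to sales") are also easier to use inside R formulas.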
R Processing
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
utils = importr('utils')
base = importr('base')
stats = importr('stats')
glmulti = importr('glmulti') # rJava NEED NOT BE IMPORTED SEPARATELY BUT MUST BE INSTALLED, ALONG WITH Java
# CONVERT TO R DATAFRAME
pandas2ri.activate()
df_r = pandas2ri.py2ri(df_py) # USING ABOVE PANDAS DATA FRAME (rpy2 >= 3.0 RENAMED py2ri TO py2rpy)
# RENAME COLUMNS y, x1, x2, x3, ...
df_r = stats.setNames(df_r, base.c("y", base.paste0("x", base.seq(1,base.length(df_r)[0]-1))))
print(utils.head(df_r, 10))
# CALL glmulti()
glmulti.glmulti(y = base.names(df_r)[0],
                xr = base.names(df_r)[1:],
                data = df_r,
                method = "h")

Related

Pandas equivalent to R 'MAX_VALUE'

I am translating R code to Python using Pandas and I have been able to find Pandas equivalent to all R actions, but now I got this R code:
dtfr %>% mutate(a_column = ifelse(a_column == "INFINITY", MAX_VALUE, a_column))
This is my Pandas equivalent:
dtfr['a_column'] = np.where(dtfr['a_column'] == 'INFINITY', MAX_VALUE, dtfr['a_column'])
I have been looking for an equivalent to R MAX_VALUE in Pandas, but I haven't found how to replicate it.
There is np.inf: https://numpy.org/devdocs/reference/constants.html#numpy.inf
It is used in pandas to represent infinity (just as np.nan is used to represent missing values).
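A minimal sketch of that replacement, assuming a_column holds numbers stored as strings alongside the "INFINITY" sentinel (the sample values are made up). If a finite stand-in is needed instead of infinity, np.finfo(np.float64).max is the largest finite float64:

```python
import numpy as np
import pandas as pd

# Made-up sample data: numeric strings plus the "INFINITY" sentinel
dtfr = pd.DataFrame({"a_column": ["1.5", "INFINITY", "3"]})

is_inf = dtfr["a_column"] == "INFINITY"
# Blank out the sentinel rows (they become NaN), then parse the rest as floats
numeric = pd.to_numeric(dtfr["a_column"].where(~is_inf))
# Put infinity back where the sentinel was
dtfr["a_column"] = numeric.mask(is_inf, np.inf)

print(dtfr["a_column"].tolist())
```

Converting to a numeric dtype matters here: np.where on a column that mixes strings and floats would otherwise leave an object column.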

How to quickly and easily convert between R and pandas dataframes in Databricks?

I am an R user with minimal python experience. I have some colleagues who use python and I want to be able to easily convert between R and python/pandas dataframes in the same Databricks notebook. I have heard that I have to use spark temp tables to do this and that it is quite straightforward, but I cannot find any complete example code and so far I haven't been able to get it to work.
I get a SparkR dataframe (as I can't get Base R dataframes to work with RegisterTempTable()) and convert it to a temp table:
#Cell 1
jdbc_url <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=user;password=*****"
df_R <- read.jdbc(jdbc_url, "(SELECT TOP 10 * FROM [schema].[table]) as result" )
SparkR:::registerTempTable(df_R,"df_temptable")
Then I try to read that back in as a pandas dataframe:
%python
#Cell 2:
import pandas as pd
pandas_df = df_temptable.select("*").toPandas()
which results in the error:
NameError: name 'df_temptable' is not defined
How do I successfully convert between R and python dataframes and back within Databricks (I would preferably like to go from a Base R dataframe to a pandas dataframe without using any Scala and in as few steps as possible)?
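The error message "NameError: name 'df_temptable' is not defined" indicates that df_temptable is not defined as a Python variable. Registering a temp table from R adds a name to the Spark catalog; it does not create a Python object of that name.
Here is an example of converting Spark DataFrames to and from pandas DataFrames.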
%python
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
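Regarding the NameError: a temp table registered from R is a name in the Spark session's catalog, so on the Python side it would be looked up by name with spark.table rather than used as a variable. A sketch, assuming both cells share the notebook's spark session; the two helper function names and the view name df_from_python are hypothetical:

```python
def temp_view_to_pandas(spark, view_name):
    """Look up a temp view by name in the Spark catalog and collect it as a pandas DataFrame."""
    return spark.table(view_name).toPandas()


def pandas_to_temp_view(spark, pandas_df, view_name):
    """Register a pandas DataFrame as a temp view so cells in other languages can read it."""
    spark.createDataFrame(pandas_df).createOrReplaceTempView(view_name)


# In a Databricks %python cell, where `spark` is predefined:
# pandas_df = temp_view_to_pandas(spark, "df_temptable")
# pandas_to_temp_view(spark, pandas_df, "df_from_python")
```

On the R side, the view registered from Python could then presumably be read back with SparkR, for example collect(sql("SELECT * FROM df_from_python")).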

Importing MICE object to Stata for analysis

I am trying to use imputed data created with MICE in Stata.
My understanding of the steps is:
1) converting the mids object to mi in R
m=20
completed=lapply(1:20,function(i)complete(imp,i))
completed.mi=do.call(Zelig::mi,completed)
2) preparing mice object for exporting in R
(a) mi2stata
STATA=mi::mi2stata(completed.mi, m=20, file="C:\\Users\\STATA.csv",
missing.ind = FALSE)
Note: after loading the data into Stata, version 11 or later, type 'mi
import ice' to register the data as being multiply imputed.
For Stata 10 and earlier, install MIM by typing 'findit mim' and include
'mim:' as a prefix for any command using the MI data.
Error in lapply(X = X, FUN = FUN, ...) :
trying to get slot "data" from an object (class "mi") that is not an S4
object
(b) Following the suggestion from below to write a csv without mi2stata:
data_out <- data.table::rbindlist(completed, idcol="m")
write.csv(data_out, "C:\\deleted\\STATA2.csv", row.names=FALSE)
3) importing the CSV file of the original, nonimputed data into Stata
This appears to have worked fine: all variables from the CSV file appear on the right-hand side.
4) use mi import ice command in Stata
(a) error re: mi2stata (I had actually imported the non-imputed file)
. mi import ice STATA
varlist not allowed
r(101);
(b) error in reading CSV version of imputed data
mi import ice[stata2]
weights not allowed
r(101);
I have encountered errors at steps 2 and 4, and possibly 1 (as the error in step 2 refers back to the conversion of the mice object to mi class data). I would really appreciate user-friendly, step-by-step guidance. Although mi2stata might not work directly for mice objects, I am still interested in learning a solution for this.
Collecting the comments above: you can't use mi::mi2stata with either the data that results from Zelig::mi or from mice::complete. But if you look at the code for mi::mi2stata, it just seems to stack the raw data, and each imputed dataset. It then adds indices to mark each dataset, and each observation.
library(mice)
# don't really need data.table but makes adding the indices easier
library(data.table)
# Function to export mice imputed datasets
mice2stata <- function(imp, path="stata", type="dta"){
completed <- lapply(seq_len(imp$m),function(i) complete(imp,i))
data_out <- rbindlist(completed, idcol="_mj")
data_out <- rbind(imp$data, data_out, fill=TRUE)
data_out[, `_mj` := replace(`_mj`, is.na(`_mj`), 0L)]
data_out[, `_mi` := rowid(`_mj`)]
if(type=="dta") {
foreign::write.dta(data_out, file=paste(path, type, sep="."))
} else {
write.csv(data_out, file=paste(path, type, sep="."), na="", row.names=FALSE)
}
}
An example
imp <- mice(nhanes, m=2, print=FALSE)
mice2stata(imp, type="dta")
Then in Stata use
use path\to\stata.dta
mi import ice
Q4 looks straightforward. The syntax for that command (not function) is documented as
mi import ice [, options]
and so STATA looks like an attempt to specify a variable list. Where does that come from?
If Q2 failed, what was the point of Q3 and Q4?
I hope that some R user can add some comments on Q2. On the face of it, you got an explicit error message, so do you think it's wrong?

What is the best way to import spss file in R with value labels?

I have an SPSS file which contains variables and value labels. I saw the foreign package with its read.spss function:
data <- read.spss("2017.sav", to.data.frame = TRUE, use.value.labels = TRUE)
If I use use.value.labels = TRUE, all strings change to factor variables, and I don't want that because not all of them are factors.
I found one solution, but I don't know if it is the best way to do it:
1. First, read the SPSS file with the previous statement.
2. Select the variables that are not factors and change them to strings with:
cols <- c("x", "ab")
data[cols] <- lapply(data[cols], as.character)
If I don't use use.value.labels = TRUE, I will not have value labels and cannot export the file correctly.
You can also use the memisc package:
sav <- spss.system.file("file.sav")
df <- as.data.set(sav)
My company regularly deals with SAV files and we extract out the metadata separately. With the foreign package, you can get the metadata out in a few different ways (after you have loaded the file in):
data.label.table <- attr(sav, "label.table")
missings <- attr(sav, "missings")
The other bits require various lapply and sapply functions to get them out. The script I have is quite long, so I will not share it here. If you read the data in with read.spss(sav, to.data.frame = TRUE) you can get:
VariableLabels <- unname(attr(sav, "variable.labels"))
I don't know why, but I can't install the foreign package.
Here is what I did instead to import a dataset from SPSS to R (through Excel):
Open your data in SPSS.
Export the dataset from SPSS to Excel, but make sure to choose the "Save value labels where defined instead of data values" option at the very bottom.
Open R.
Import dataset from Excel.
Now, you have a dataset in R with value labels.
Use the haven package:
library(haven)
data <- read_sav("2017.sav")
The labels are shown in the RStudio viewer.

How to use !duplicate with rpy2?

I want to do the equivalent of this R script:
> csvData <- read.csv(file='/homes/ndeklein/test.csv', head=TRUE, sep='\t')
> csv = subset(csvData, !duplicated(id))
in rpy2. However, if I import rpy2.robjects as R, it does not recognize R.r['!duplicated']
(like this):
import rpy2.robjects as R
csvData = R.r['read.csv'](file='/homes/ndeklein/test.csv', head=True, sep='\t')
csv = R.r['subset'](csvData, R.r['!duplicated']('id'))
How can I use !duplicated in rpy2?
edit:
R.r['duplicated']
does work, so I'm looking for how to make ! work in rpy2
I got the answer through a mailing list, in case someone else needs it:
Using R.r['!'] instead of R.r['!duplicated'] works.
# getting the not sign of R
rnot = R.r['!']
# getting duplicated
duplicated = R.r['duplicated']
# keep only the rows with unique ids (first column) in a new data frame
csvUniqID = R.r['subset'](csvData, rnot(duplicated(csvData[0])))
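If the data is already in pandas rather than in an R data frame, the same filter needs no R at all: Series.duplicated marks repeated values and ~ negates the mask, mirroring !duplicated(id). A small sketch with made-up data standing in for test.csv:

```python
import pandas as pd

# Made-up stand-in for the contents of test.csv
csv_data = pd.DataFrame({"id": [1, 2, 2, 3, 1],
                         "value": ["a", "b", "c", "d", "e"]})

# Keep the first row for each id, like subset(csvData, !duplicated(id)) in R
csv_uniq = csv_data[~csv_data["id"].duplicated()]
print(csv_uniq)
```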
