How to use !duplicated with rpy2?

I want to do the equivalent of this R script:
> csvData <- read.csv(file='/homes/ndeklein/test.csv', head=TRUE, sep='\t')
> csv = subset(csvData, !duplicated(id))
in rpy2. However, if I import rpy2.robjects as R, it does not recognize R.r['!duplicated'] when I call it like this:
import rpy2.robjects as R
csvData = R.r['read.csv'](file='/homes/ndeklein/test.csv', head=True, sep='\t')
csv = R.r['subset'](csvData, R.r['!duplicated']('id'))
How can I use !duplicated in rpy2?
edit:
R.r['duplicated']
does work, so I'm looking for how to make ! work in rpy2

I got the answer through a mailing list, in case someone else needs it:
Using R.r['!'] instead of R.r['!duplicated'] works.
# getting the not sign of R
rnot = R.r['!']
# getting duplicated
duplicated = R.r['duplicated']
# get only the rows with unique ids and put it in a new matrix
csvUniqID = R.r['subset'](csvData, rnot(duplicated(csvData[0])))
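For reference, the rpy2 call above mirrors the following plain-R one-liner (a sketch only; csvData[0] in rpy2 corresponds to the first column of the data frame, i.e. the id column of the question's file):
# Plain-R equivalent of the rpy2 answer: keep only the rows whose first
# column (the id column) has not been seen before.
csvData <- read.csv(file = '/homes/ndeklein/test.csv', header = TRUE, sep = '\t')
csvUniqID <- csvData[!duplicated(csvData[[1]]), ]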

Related

Write stata dataframe in R [duplicate]

I am getting an error while converting an R data frame to Stata format. I can write the numeric columns to a Stata file, but when I include strings I get the following error:
library(foreign)
write.dta(newdata, "X.dta")
Error in write.dta(newdata, "X.dta") :
empty string is not valid in Stata's documented format
I have a few string variables, such as location and name, with missing values, which is probably what causes this problem. Is there a way to handle this?
I've had this error many times before, and it's easy to reproduce:
library(foreign)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write.dta(test, 'example.dta')
One solution is to use factor variables instead of character variables, e.g.,
for (colname in names(test)) {
  if (is.character(test[[colname]])) {
    test[[colname]] <- as.factor(test[[colname]])
  }
}
Another is to change the empty strings to something else and change them back in Stata.
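A minimal sketch of that second approach, using "." as the placeholder (the placeholder choice is just an illustration):
library(foreign)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
# Swap empty strings in every character column for a placeholder ("."),
# then recode "." back to empty/missing inside Stata after the export.
char_cols <- sapply(test, is.character)
test[char_cols] <- lapply(test[char_cols], function(x) ifelse(x == "", ".", x))
write.dta(test, 'example.dta')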
This is purely a problem with write.dta, because Stata is perfectly fine with empty strings. But since foreign is frozen, there's not much you can do about that.
Update: (2015-12-04) A better solution is to use write_dta in the haven package:
library(haven)
test <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write_dta(test, 'example.dta')
This way, Stata reads string variables properly as strings.
You could use the great readstata13 package (which, conveniently, imports only the Rcpp package).
readstata13::save.dta13(mtcars, 'mtcars.dta')
The function can already save in the Stata 15/16 MP file format (experimental), which is the next format after Stata 13.
readstata13::save.dta13(mtcars, 'mtcars15.dta', version="15mp")
Note: Of course, this also works with OP's data:
readstata13::save.dta13(data.frame(a="", b=1), 'my_data.dta')

How to use glmulti from python using rpy2?

Consider the following dataframe:
import pickle
a='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\n\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x0b\\x00\\x00\\x00priceToBookq\\x19X\\x04\\x00\\x00\\x00betaq\\x1aX\\x0e\\x00\\x00\\x00price to salesq\\x1bX\\x0c\\x00\\x00\\x00gross profitq\\x1cX\\x0c\\x00\\x00\\x0052WeekChangeq\\x1dX\\n\\x00\\x00\\x00market capq\\x1eX\\x04\\x00\\x00\\x00ebitq\\x1fX\\r\\x00\\x00\\x00total revenueq X\\x0c\\x00\\x00\\x00payout ratioq!X\\x08\\x00\\x00\\x00pe ratioq"etq#bX\\x04\\x00\\x00\\x00nameq$Nu\\x86q%Rq&h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq\\\'}q((h$NX\\x05\\x00\\x00\\x00startq)K\\x00X\\x04\\x00\\x00\\x00stopq*K\\x07X\\x04\\x00\\x00\\x00stepq+K\\x01u\\x86q,Rq-e]q.h\\x0bh\\x0cK\\x00\\x85q/h\\x0e\\x87q0Rq1(K\\x01K\\nK\\x07\\x86q2h\\x12X\\x02\\x00\\x00\\x00f8q3K\\x00K\\x01\\x87q4Rq5(K\\x03X\\x01\\x00\\x00\\x00<q6NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK\\x00tq7b\\x89B0\\x02\\x00\\x00\\xd1#,\\x9b9\\x8c)#Cz\\xe5\\xd5\\x94_\\xf5?\\x92(\\x0ffn9\\xf0?\\n+\\x15TT-\\x17# \\xd5\\xb0\\xdf\\x13\\x03%#u\\xdek\\xad\\xd4\\xb8\\xfb?\\x1c\\xee#\\xb7&\\xbd\\xf3?-\\x98\\xf8\\xa3\\xa8\\xf3\\xf3?H\\xfd\\xf5\\n\\x0b\\xae\\xf1?:;\\x19\\x1c%/\\xf1?\\x9f\\x93\\xde7\\xbe\\xf6\\xf0?\\xbb}V\\x99)\\xad\\xf3?\\xae\\xbby\\xaaC.\\xf3?\\xa5,C\\x1c\\xeb\\xe2\\xf9?d\\x94g^\\x0e\\x13\\x12#\\x9e\\xc7r\\\\\\xd7i\\x06#\\xe4\\xe0\\x0c\\xddp\\xc8\\xcc?%\\x95)\\xe6 \\x18 #\\xa1\\xf4\\x85\\x90\\xf3\\x1e!#y6P\\x85\\xe4\\x89\\x0e#.\\xd9\\xc2=\\xe0\\x1b\\x0c#\\x00\\x00\\x00\\xc6\\x9e\\xe86B\\x00\\x00\\x00fF\\xb83B\\x00\\x00\\x00.\\xdc\\xb6\\x0bB\\x00\\x00\\x80\\x954\\xa5%B\\x00\\x00#\\\'1O3B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xf1\\xda\\x84\\xff\\x9d\\x82\\xd5?f\\xb8>\\x028#\\xa0?\\xc8^\\xef\\xfex/\\xb0\\xbf\\xab\\xd5\\x91\\x02\\x8f\\x18\\xd6?\\xd7\\xc05\\xfb,d\\xd6?r\\x8e\\xb6\\x01\\n\\xbb\\xc8?\\xc0\\xd52\\x00\\xf1F\\xc9?\\x00\\x00\\x00 \\x8b\\x1bqB\\x00\\x00\\x00\\xa0\\x92HKB\\x00\\x00\\x00\\x80\\xcb\\x8a B\\x00\\x00\\x00\\x00\\x98)_B\\x00\\x00\\x00`\\xca+pB\\x00\\x00\\x00\\xa0N\\xe4WB\\x00\\x00\\x00\\x00\\xc0\\x9fQB\\x00\\x00\\x00\\xc5\\x0c\\xc5-B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00lIv\\xf0A\\x00\\x00\\x00\\xd9\\xb83\\x17B\\x00\\x00\\x80\\xa3\\x1c\\x01$B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xc0\\x17\\xcaINB\\x00\\x00\\x00fF\\xb83B\\x00\\x00#\\xdcq\\xaaBB\\x00\\x00\\x00\\x87h\\x00*B\\x00\\x00\\xc0\\xca\\xd3L=B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xa1\\xf81\\xe6\\xae%\\xd0?\\x8b\\xfde\\xf7\\xe4a\\xd9?\\x00\\x00\\x00\\x00\\x00\\x00\\xf8?\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf1\\xf4JY\\x868\\xd6?>(\\xc5\\x1ap\\xce\\xd4?\\xc3\\xf5(\\\\\\x8f\\xc2\\xcd?\\xad\\xbf%\\x00\\xff\\xe05#\\xc8$#gaG\\\'#\\x9a\\x99\\x99\\x99\\x99\\x996#{Ic\\xb4\\x8e\\x82>#^+\\xa1\\xbb$\\x8a;#UL\\xa5\\x9fp\\xbe)#0G\\x8f\\xdf\\xdb\\x84(#q8tq9ba]q:h\\x07h\\x08}q;(h\\nh\\x0bh\\x0cK\\x00\\x85q<h\\x0e\\x87q=Rq>(K\\x01K\\n\\x85q?h\\x15\\x89]q#(h\\x19h\\x1ah\\x1bh\\x1ch\\x1dh\\x1eh\\x1fh 
h!h"etqAbh$Nu\\x86qBRqCa}qDX\\x06\\x00\\x00\\x000.14.1qE}qF(X\\x04\\x00\\x00\\x00axesqGh\\x06X\\x06\\x00\\x00\\x00blocksqH]qI}qJ(X\\x06\\x00\\x00\\x00valuesqKh1X\\x08\\x00\\x00\\x00mgr_locsqLcbuiltins\\nslice\\nqMK\\x00K\\nK\\x01\\x87qNRqOuaustqPbX\\x04\\x00\\x00\\x00_typqQX\\t\\x00\\x00\\x00dataframeqRX\\t\\x00\\x00\\x00_metadataqS]qTub.\')'
a=eval(a)
a
I want to run the glmulti function on it from Python. I tried lots of ways but failed, so as a last resort I went over to R, as follows.
Export the dataset to an Excel file:
a.to_excel('test1.xlsx')
Go to RStudio:
install.packages(c("glmulti", "rJava", "readxl"))
library(glmulti)
library(rJava)
library(readxl)
getwd()
setwd(".Gp\\to\\the\\python directory where you are workingin")
my_data <- read_excel("test1.xlsx", sheet = 1)
Change the column names of the dataframe, because it does not work with the original names in the data:
j <- 1
for (i in paste0("x", 1:length(my_data))) {
  names(my_data)[j] <- i
  j <- j + 1
}
Select my y variable and x variables:
y=my_data[,6]
x=my_data[, names(my_data) != names(my_data)[6]]
Finally, I run the function I want in R:
glmulti(names(y), names(x), data=my_data, method="h")
Is there an easier way to run it from Python using rpy2? If so, can you please advise?
Consider converting the pandas data frame into an R data frame with rpy2, and then calling glmulti from the imported package just as you do now.
However, a few notes about R:
Every function or method comes from a package, which is true of Python as well except for built-ins (e.g., list, sum, type). In R, however, the standard-library packages (e.g., utils, stats, base) are loaded by default, so everyday functions (e.g., read.csv, head, summary) are available without an explicit import.
Though you can qualify each function call with its package, as in base::names, this is not required the way it is in Python; it is mainly helpful in case of a name collision with another package.
You do not need a for loop to rename all the columns; you can vectorize the rename with base::paste0 and assign it with stats::setNames or base::colnames (a small sketch follows).
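A small sketch of that vectorized rename in plain R (illustrative only, using the my_data frame from the question):
# Rename all columns to x1, x2, ... in one vectorized call instead of a loop.
names(my_data) <- paste0("x", seq_along(my_data))
# Or equivalently: my_data <- setNames(my_data, paste0("x", seq_along(my_data)))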
Python Processing
import pandas as pd
import pickle
df_py = eval('pickle.loads(...)')
# RE-ORDER COLUMNS BY MOVING SIXTH COLUMN TO FIRST POSITION
cols = df_py.columns.to_list()
new_order = [cols[5]] + cols[0:5] + cols[6:]
df_py = df_py.reindex(new_order, axis=1)
print(df_py.head(10))
R Processing
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
utils = importr('utils')
base = importr('base')
stats = importr('stats')
glmulti = importr('glmulti') # DOES NOT REQUIRE rJava PACKAGE BUT DOES REQUIRE Java LANGUAGE
# CONVERT TO R DATAFRAME
pandas2ri.activate()
df_r = pandas2ri.py2ri(df_py) # USING ABOVE PANDAS DATA FRAME
# RENAME COLUMNS y, x1, x2, x3, ...
df_r = stats.setNames(df_r, base.c("y", base.paste0("x", base.seq(1,base.length(df_r)[0]-1))))
print(utils.head(df_r, 10))
# CALL glmulti()
glmulti.glmulti(y = base.names(df_r)[0],
                xr = base.names(df_r)[1:],
                data = df_r,
                method = "h")

What is the best way to import an SPSS file in R with value labels?

I have an SPSS file that contains variables and value labels. I saw the foreign package's read.spss function:
data <- read.spss("2017.sav", to.data.frame = TRUE, use.value.labels = TRUE)
If I use use.value.labels = TRUE, all string variables are converted to factors, and I don't want that because not all of them are factors.
I found one solution, but I don't know if it is the best way to do it:
1. First read the SPSS file with the previous statement.
2. Select which variables are not factors and change them to strings with:
cols <- c("x", "ab")
data[cols] <- lapply(data[cols], as.character)
If I don't use use.value.labels = TRUE, I will not have the value labels and I cannot export the file correctly.
You can also use the memisc package:
sav <- spss.system.file("file.sav")
df <- as.data.set(sav)
My company regularly deals with SAV files and we extract out the metadata separately. With the foreign package, you can get the metadata out in a few different ways (after you have loaded the file in):
data.label.table <- attr(sav, "label.table")
missings <- attr(sav, "missings")
The other bits require various lapply and sapply functions to get them out. The script I have is quite long, so I will not share it here. If you read the data in with read.spss(sav, to.data.frame = TRUE) you can get:
VariableLabels <- unname(attr(sav, "variable.labels"))
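A rough sketch of the kind of lapply work involved (not the full script) reshapes the label.table attribute into one value/label lookup per variable:
library(foreign)
sav <- read.spss("2017.sav", to.data.frame = TRUE)
# label.table is a named list with one named vector per labelled variable:
# the vector's values are the codes and its names are the value labels.
label.table <- attr(sav, "label.table")
value.labels <- lapply(label.table, function(tab) {
  if (is.null(tab)) return(NULL)
  data.frame(value = unname(tab), label = names(tab), stringsAsFactors = FALSE)
})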
I don't know why, but I can't install the foreign package.
Here is what I did instead to import a dataset from SPSS into R (through Excel):
1. Open your data in SPSS.
2. Export the dataset from SPSS to Excel, but make sure to choose the "Save value labels where defined instead of data values" option at the very bottom.
3. Open R.
4. Import the dataset from Excel.
Now you have a dataset in R with value labels.
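A sketch of the import step with readxl (the file name here is only a placeholder for whatever you exported):
library(readxl)
# The exported workbook already contains the value labels as text, so a
# plain read gives you the labelled values directly.
data <- read_excel("exported_from_spss.xlsx", sheet = 1)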
Use the haven package:
library(haven)
data <- read_sav("2017.sav")
The labels are shown in the RStudio viewer.
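With haven you can also decide per column which labelled variables become factors, which addresses the concern about read.spss converting every string. A sketch (the column name is hypothetical):
library(haven)
data <- read_sav("2017.sav")
# Convert only the columns you actually want as factors; the other labelled
# columns keep their underlying values plus the label metadata.
data$region <- as_factor(data$region)  # 'region' is a made-up column name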

Normalization of microarray data

I want to normalize data using RMA in an R package, but the problem is that it does not read .txt files. What should I do to normalize data from a .txt file?
Basically, all normalization methods in Bioconductor are based on the AffyBatch class. Therefore, you have to read your text file (probably a matrix) and create an AffyBatch manually:
AB <- new("AffyBatch", exprs = exprs, cdfName = cdfname, phenoData = phenoData,...)
RMA needs an ExpressionSet structure. After reading the file with read.table() and cleaning the column and row names, convert it to a matrix and use:
a <- ExpressionSet(assayData = matrix)
If that didn't work, import your *.txt data into the FlexArray software, which can read it and run RMA.
This may work.
I use the normalizeQuantiles() function from the limma R package:
library(limma)
mydata <- read.table("RDotPsoriasisLogNatTranformedmanuallyTABExport.tab", sep = "\t", header = TRUE) # read from file
b = as.matrix(cbind(mydata[, 2:5], mydata[, 6:11])) # set the numeric data set
m = normalizeQuantiles(b, ties=TRUE) # normalize
mydata_t <- t(as.matrix(m)) # transpose if you need
