Rpy2 - List of Lists of Dataframes

I'm trying to figure out how to use Python to parse XML files into a data structure that I can pass into R.
What I need to create in R is a List of Lists of Dataframes:
Nodes = data.frame()
Edges = data.frame()
NetworkCompListA = list()
NetworkCompListA[['Nodes']] = Nodes
NetworkCompListA[['Edges']] = Edges
Networks = list()
Networks[['NetA']] = NetworkCompListA
Networks[['NetB']] = NetworkCompListB
I know how to create a dataframe from the examples in the Rpy2 documentation.
import rpy2.robjects as robjects
import rpy2.rlike.container as rlc

od = rlc.OrdDict([('value', robjects.IntVector((1, 2, 3))),
                  ('letter', robjects.StrVector(('x', 'y', 'z')))])
df = robjects.DataFrame(od)
How do I insert 'df' into a list, insert that list into another list in Python, and then write that out to an .RData file that can be loaded into another instance of R?
Thanks!

The class ListVector requires an object that implements iteritems() (such as a dict or an OrderedDict). Note that in R, data.frames are just lists with the (loose) constraint that all elements should be vectors of the same length (or a matrix with a matching number of rows can be accepted), and with row names and column names (the list's names being the column names).
from rpy2.robjects.vectors import ListVector, DataFrame
# rpy2's OrdDict was added because there was no ordered dict
# in Python's stdlib. It should be gone by rpy2-2.5.
from collections import OrderedDict

od = OrderedDict((('a', 1), ('b', 2)))
df = DataFrame(od)
od_l = OrderedDict((('df', df),))
df_in_list = ListVector(od_l)
df_in_list_in_list = ListVector(OrderedDict((('df_in_list', df_in_list),)))
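The question also asks how to write the result out to an .RData file. A minimal sketch (assuming the objects from the snippet above are in scope; the name Networks is just an illustration): assign the nested list into R's global environment and call R's save() through rpy2.
```
import rpy2.robjects as robjects

# Make the nested list visible to R under a name of our choosing.
robjects.globalenv['Networks'] = df_in_list_in_list
# Ask R to serialize it; load('networks.RData') in another R session
# restores the object under the name 'Networks'.
robjects.r("save(Networks, file='networks.RData')")
```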

Related

Importing Excel-tables in R

Is there a way to import a named Excel-table into R as a data.frame?
I typically have several named Excel-tables on a single worksheet that I want to import as data.frames, without relying on static row and column references for the location of the Excel-tables.
I have tried to set namedRegion, which is an available argument for several Excel-import functions, but that does not seem to work for named Excel-tables. I am currently using the openxlsx package, which has a function getTables() that returns the Excel-table names from a single worksheet, but not the data in the tables.
Getting your named table takes a little bit of work.
First you need to load the workbook.
library(openxlsx)
wb <- loadWorkbook("name_excel_file.xlsx")
Next you need to extract the name of your named table.
# get the name and the range
tables <- getTables(wb = wb,
                    sheet = 1)
If you have multiple named tables they are all in tables. My named table is called Table1.
Next you need to extract the column numbers and row numbers, which you will later use to extract the named table from the Excel file.
# get the range
table_range <- names(tables[tables == "Table1"])
table_range_refs <- strsplit(table_range, ":")[[1]]
# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)
# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)
Now you re-read the Excel file with the cols and rows parameters.
# finally read it
my_df <- read.xlsx(xlsxFile = "name_excel_file.xlsx",
                   sheet = 1,
                   cols = table_range_col_num[1]:table_range_col_num[2],
                   rows = table_range_row_num[1]:table_range_row_num[2])
You end up with a data frame with only the content of your named table.
I used this a while ago. I found this code somewhere, but I no longer remember where.
This link might be useful for you:
https://stackoverflow.com/a/17709204/10235327
1. Install XLConnect package
2. Save a path to your file in a variable
3. Load workbook
4. Save your data to df
To get the table names you can use the function
getTables(wb, sheet = 1, simplify = TRUE)
Where:
wb - your workbook
sheet - sheet name, or it can be the number as well
simplify - if TRUE (the default), the result is simplified to a vector
https://rdrr.io/cran/XLConnect/man/getTables-methods.html
Here's the code (not mine, copied from the topic above, just a bit modified)
require(XLConnect)
sampleFile = "C:/Users/your.name/Documents/test.xlsx"
wb = loadWorkbook(sampleFile)
myTable <- getTables(wb, sheet = 1)
df <- readTable(wb, sheet = 1, table = myTable)
You can check these packages:
library(xlsx)
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1)
library(readxl)
Data <- read_excel('YourFile.xlsx', sheet = 1)
Both options allow you to define specific regions to load the data into R (e.g. rowIndex/colIndex in xlsx, or the range argument in readxl).
I use read.xlsx from package openxlsx. For example:
library(openxlsx)
fileA <- paste0(some.directory, 'excel.file.xlsx')
A <- read.xlsx(fileA, startRow = 3)
hope it helps

How are vector operations performed on the 20newsgroups_vectorized data set?

When I fetch 20newsgroups_vectorized data by
import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized

newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data
data is a <class 'scipy.sparse.csr.csr_matrix'> with the shape
(18846, 130107)
How can I subset the data by target names (for example, extract only 'rec.sport.baseball') and use vector operations on those sparse row vectors (for example, calculate the mean vector or the distances)?
Unfortunately, subsetting the data by target name is not available in fetch_20newsgroups_vectorized, but it is available in fetch_20newsgroups; you just have to vectorize the data yourself.
Here is how you can do it.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups_train = fetch_20newsgroups(subset='all',
                                      categories=['rec.sport.baseball'])
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)
# (994, 13986)
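The question also asks about vector operations on the sparse rows. A minimal sketch (reusing vectors from the snippet above): scipy sparse matrices support mean(), and pairwise distances are available from sklearn.
```
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Mean vector over all documents; .mean() returns a dense numpy matrix,
# so convert it to a plain 1-D array.
mean_vector = np.asarray(vectors.mean(axis=0)).ravel()
print(mean_vector.shape)  # (13986,)

# Pairwise cosine distances between all document vectors.
dists = cosine_distances(vectors)
print(dists.shape)  # (994, 994)
```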

Igraph Write Communities

We are using igraph and R to detect communities in a network. The detection using cluster_walktrap is working great:
e <- cluster_walktrap(g)
com <- membership(e)
print(com)
write.csv2(com, file ="community.csv", sep=",")
The result prints fine, showing each vertex number and the community it belongs to, but we have a problem writing the result to the csv file; I get an error: cannot coerce class "membership" to a data.frame
How can I write the result of membership in a file ?
Thanks
Convert the membership object to numeric. write.csv and write.csv2 expect a data frame or matrix. The command tries to coerce the object into a data frame, which the class membership resists. Since membership really is just a vector, you can convert it to numeric. Either:
write.csv2(as.numeric(com), file ="community.csv")
Or:
com <- as.numeric(com)
write.csv2(com, file ="community.csv")
Oh, and you don't need the sep = "," argument; write.csv and write.csv2 set the separator themselves.
If you want to create table of vertex names/numbers and groups:
com <- cbind(V(g), e$membership)      # V(g) gives the vertex ids
com <- cbind(V(g)$name, e$membership) # to get names, if your vertices are labeled
I don't know if you guys resolved the problem but I did the following using R:
```
# applying the community method
com = spinglass.community(graph_builted,
                          weights = graph_builted$weights,
                          implementation = "orig",
                          update.rule = "config")
# creating a data frame to store the results
type = c(0)
labels = c(0)
groups = c(0)
res2 = data.frame(type, labels, groups)
labels = com$names # here you get the vertices names
groups = com$membership # here you get the communities indices
# here you save the information
res = data.frame(type = "spinGlass1", labels, groups)
res2 = rbind(res2, res)
# then you save the .csv file
write.csv(res2, "spinglass-communities.csv")
```
That resolves the problem for me.
Best regards.

Select and import dataframes into a list from the global environment

I would like to select ca. 30 dataframes from my global environment and import them into a list.
All my dataframe names end with ".AM", so it should be very easy to select them, but I am still quite new to R.
I tried:
mylist <- list(pattern = "*.AM")
but it doesn't work.
We can use mget to return the datasets in a list. Note that ls's pattern is a regular expression, so the dot should be escaped and $ anchors the match to the end of the name:
mget(ls(pattern = "\\.AM$"))
You are looking for
mylist <- ls(pattern = ".AM")
which returns mylist as a character vector of the matching object names (the names, not the dataframes themselves)
If your dataframes are instead saved as files, set the working directory to wherever they are stored
setwd("C:\\user\\files")
then create a list of the .AM file names
mylist = list.files(pattern = "\\.AM$")
I guess that should work?

Extracting dataframes from a dictionary of dataframes

I have a directory containing many csv files which I have loaded into a dictionary of dataframes
So, here are 3 small sample csv files to illustrate:
import os
import csv
import pandas as pd

# create 3 small csv files for test purposes
os.chdir('c:/test')
with open('dat1990.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['100', '24', '1990'],
            ['120', '33', '1990'],
            ['23', '5', '1990']]
    a.writerows(data)

with open('dat1991.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['400', '35', '1991'],
            ['450', '55', '1991'],
            ['34', '6', '1991']]
    a.writerows(data)

with open('other1991.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = [['Stock', 'Sales', 'Year'],
            ['500', '56', '1991'],
            ['600', '44', '1991'],
            ['56', '55', '1991']]
    a.writerows(data)
create a dictionary for processing the csv files into dataframes
dfcsv_dict = {'dat1990': 'dat1990.csv', 'dat1991': 'dat1991.csv',
              'other1991': 'other1991.csv'}
create a simple import function for importing csv to pandas
def myimport(csvfile):
    return pd.read_csv(csvfile)
iterate through the dictionary to import all csv files into pandas dataframes
df_dict = {}
for k, v in dfcsv_dict.items():
    df_dict[k] = myimport(v)
Given that I may now have thousands of dataframes within the unified dictionary object, how can I select a few and "extract" them out of the dictionary?
So for example, how would I extract just two of these three dataframes nested in the dictionary, something like
dat1990 = df_dict['dat1990']
dat1991 = df_dict['dat1991']
but without using literal assignments. Maybe some sort of looping structure over the dictionary, hopefully with a means to select a subgroup based on a string sequence in the dictionary key:
e.g. all dataframes whose key contains dat or 1991, etc.
I don't want another "sub dictionary" but want to extract them as named "standalone" dataframes as the above code illustrates.
I am using python 3.5.
This is an old question from Jan 2016 but since no one answered, here is an answer from Oct 2019. Might be useful for future reference.
I think you can skip the step of creating a dictionary of dataframes. I previously wrote an answer on how to create a single master dataframe from multiple CSV files, and adding a column in the master dataframe with a string extracted from the CSV filename. I think you could essentially do the same thing here.
Create a dataframe of csv files based on timestamp intervals
Steps:
Create path to folder with files
Create list of files in folder
Create empty dataframe to store CSV dataframes
Loop through each csv as a dataframe
Add a column with the filename as a string
Concatenate the individual dataframe to the master dataframe
Use a dataframe filter mask to create new dataframe
import pandas as pd
import os
# Step 1: create a path to the folder, syntax for Windows OS
path_test_folder = 'C:\\test\\'
# Step 2: create a list of CSV files in the folder
files_in_folder = os.listdir(path_test_folder)
files_in_folder = [x for x in files_in_folder if '.csv' in x]
# Step 3: create empty master dataframe to store CSV files
df_master = pd.DataFrame()
# Step 4: loop through the files in folder
for each_csv in files_in_folder:
    # temporary dataframe for the CSV
    path_csv = os.path.join(path_test_folder, each_csv)
    temp_df = pd.read_csv(path_csv)
    # add a column with the filename
    temp_df['str_filename'] = str(each_csv)
    # combine into the master dataframe
    df_master = pd.concat([df_master, temp_df])
# then filter on your filenames
mask_filter = df_master['str_filename'].isin(['dat1990.csv', 'dat1991.csv'])
df_filter = df_master.loc[mask_filter]
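If you do want to pull selected dataframes out of the existing dictionary by a string in the key, a minimal sketch (reusing df_dict from the question; note that binding names via globals() works but is generally discouraged, and a filtered dict is usually all you need):
```
# Select the entries whose key contains a given substring.
selected = {k: v for k, v in df_dict.items() if '1991' in k}

# If standalone variables are really required, bind each selected
# dataframe to a top-level name (use sparingly).
for name, frame in selected.items():
    globals()[name] = frame
```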
