rpy2 does not convert back to pandas

I have an R object that will not convert to pandas, and the strange part is that it doesn't throw an error.
Updated with the code I'm using; sorry not to supply that up front, and for missing the request for two weeks!
Python code that calls an R script:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import datetime
from rpy2.robjects.conversion import localconverter

def serial_date_to_string(srl_no):
    new_date = datetime.datetime(1970, 1, 1, 0, 0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")

jurisdiction = 'TX'
r = ro.r
r_df = r['source']('farrington.R')
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(r_df)
The issue is that pd_from_r_df is an R object rather than a pandas dataframe:
>>> pd_from_r_df
R object with classes: ('list',) mapped to:
[ListSexpVector, BoolSexpVector]
value: <class 'rpy2.rinterface.ListSexpVector'>
<rpy2.rinterface.ListSexpVector object at 0x7faa4c4eff08> [RTYPES.VECSXP]
visible: <class 'rpy2.rinterface.BoolSexpVector'>
<rpy2.rinterface.BoolSexpVector object at 0x7faa4c4e7948> [RTYPES.LGLSXP]
Here's the R script "farrington.R". It returns a surveillance time series, which ro.conversion.rpy2py (as used above) isn't converting to a pandas dataframe:
library('surveillance')
library(readr)
library(tidyr)
library(dplyr)
w<-1
b<-3
nfreq<-52
steps_back<- 28
alpha<-0.05
counts <- read_csv("Weekly_counts_of_death_by_jurisdiction_and_cause_of_death.csv")
counts<-counts[,!colnames(counts) %in% c('Cause Subgroup','Time Period','Suppress','Note','Average Number of Deaths in Time Period','Difference from 2015-2019 to 2020','Percent Difference from 2015-2019 to 2020')]
wide_counts_by_cause<-pivot_wider(counts,names_from='Cause Group',values_from='Number of Deaths',values_fn=sum)
wide_state <- filter(wide_counts_by_cause,`State Abbreviation`==jurisdiction)
wide_state <- filter(wide_state,Type=='Unweighted')
wide_state[is.na(wide_state)] <-0
important_columns=c('Alzheimer disease and dementia','Cerebrovascular diseases','Heart failure','Hypertensive dieases','Ischemic heart disease','Other diseases of the circulatory system','Malignant neoplasms','Diabetes','Renal failure','Sepsis','Chronic lower respiratory disease','Influenza and pneumonia','Other diseases of the respiratory system','Residual (all other natural causes)')
all_columns <- append(c('Year','Week'),important_columns)
selected_wide_state<-wide_state[, names(wide_state) %in% all_columns]
start<-c(as.numeric(min(selected_wide_state[,'Year'])),as.numeric(min(selected_wide_state[,'Week'])))
freq<-as.numeric(max(selected_wide_state[,'Week']))
sts <- new("sts",epoch=1:nrow(numeric_wide_state),start=start,freq=freq,observed=numeric_wide_state)
sts_4 <- aggregate(sts[,important_columns],nfreq=nfreq)
start_idx=end_idx-steps_back
cntrlFar <- list(range=start_idx:end_idx, w=w, b=b, alpha=alpha)
surveil_ts_4_far <- farrington(sts_4,control=cntrlFar)
far_df<-tidy.sts(surveil_ts_4_far)
far_df
(using the NCHS data here [from a couple months back] https://data.cdc.gov/NCHS/Weekly-counts-of-death-by-jurisdiction-and-cause-o/u6jv-9ijr/ )

In R, when calling source() on a script, the returned object is by default a list of two named components, $value and $visible, where:
$value is the last displayed or defined object, which in your case is the far_df data frame (in R, a data.frame is a class that extends the list type);
$visible is a boolean indicating whether the last object was displayed, which in your case is TRUE. This would be FALSE had you ended the script at far_df <- tidy.sts(surveil_ts_4_far).
In fact, your Python output confirms this, indicating a list of [ListSexpVector, BoolSexpVector].
Therefore, since you only want the first item, index for the first item accordingly, by number or by name:
r_raw = ro.r['source']('farrington.R')       # IN R: r_raw <- source('farrington.R')

r_df = r_raw[0]                              # IN R: r_df <- r_raw[[1]]
r_df = r_raw[r_raw.names.index('value')]     # IN R: r_df <- r_raw$value

with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(r_df)
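Alternatively, rpy2's .rx2() method mirrors R's [[ extraction by name (assuming r_raw is an robjects-level ListVector, as the printed repr above suggests), which avoids the manual names lookup:
r_df = r_raw.rx2('value')                    # IN R: r_df <- r_raw[["value"]]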

Plotly choropleth map in Jupyter notebooks not showing color

I'm trying to make a choropleth map in Plotly using some data I have in a CSV file, but the resulting map shows no color.
Below is the code I have so far:
import json
import pandas as pd
import plotly.express as px

asean_country = json.load(open("aseancovidmap.geojson", "r"))
df = pd.read_csv("covidcases.csv")

id_map = {}
for feature in asean_country['features']:
    feature['id'] = feature['properties']['sform']
    id_map[feature['properties']['name']] = feature['id']

df["iso-2"] = df['Country'].apply(lambda x: id_map[x])

figure = px.choropleth(df, locations='iso-2', locationmode='country names', geojson=asean_country, color='Ttlcases', scope='asia', title='Total COVID 19 cases in ASEAN Countries as on 10/1/2022')
figure.show()
Clearly I don't have access to your files, so I have sourced geometry and COVID data; for reference, that code is at the end of this answer.
The key change I have made: don't loop over the geojson. Define locations as a column in the dataframe and set featureidkey to point at the matching property in the geojson.
With this change the countries are clearly colored.
solution
import json
import pandas as pd
import plotly.express as px

# asean_country = json.load(open("aseancovidmap.geojson","r"))
asean_country = gdf_asean.rename(columns={"adm0_a3": "iso_a2"}).__geo_interface__
# df= pd.read_csv("covidcases.csv")
df = gdf_asean_cases.loc[:, ["iso_code", "adm0_a3", "total_cases", "date"]].rename(
    columns={"iso_code": "iso_a2", "total_cases": "Ttlcases"}
)

figure = px.choropleth(
    df,
    locations="iso_a2",
    featureidkey="properties.iso_a2",
    geojson=asean_country,
    color="Ttlcases",
    title="Total COVID 19 cases in ASEAN Countries as on 10/1/2022",
).update_geos(fitbounds="locations", visible=True).update_layout(margin={"t": 40, "b": 0, "l": 0, "r": 0})
figure.show()
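To make the matching rule concrete, here is a minimal, self-contained sketch: locations values in the dataframe are matched against the attribute named by featureidkey inside each geojson feature. The one-feature geojson and the square polygon around Singapore are made up for illustration.
import pandas as pd
import plotly.express as px

# hypothetical one-feature geojson; "iso_a2" plays the role of the match key
geojson = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"iso_a2": "SG"},  # matched by featureidkey
        "geometry": {"type": "Polygon",
                     "coordinates": [[[103.6, 1.2], [104.1, 1.2],
                                      [104.1, 1.5], [103.6, 1.5], [103.6, 1.2]]]},
    }],
}
df = pd.DataFrame({"iso_a2": ["SG"], "Ttlcases": [100]})  # made-up count

fig = px.choropleth(df, locations="iso_a2", featureidkey="properties.iso_a2",
                    geojson=geojson, color="Ttlcases")
fig.update_geos(fitbounds="locations", visible=False)
fig.show()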
data sourcing
import requests, io
import geopandas as gpd
import pandas as pd

# get asia geometry
gdf = gpd.read_file(
    "https://gist.githubusercontent.com/hrbrmstr/94bdd47705d05a50f9cf/raw/0ccc6b926e1aa64448e239ac024f04e518d63954/asia.geojson"
)
# get countries that make up ASEAN
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_ASEAN_countries_by_GDP")[1].loc[1:]
# no geometry for singapore.... just ASEAN geometry
gdf_asean = (
    gdf.loc[:, ["admin", "adm0_a3", "geometry"]]
    .merge(
        df.loc[:, ["Country", "Rank"]], left_on="admin", right_on="Country", how="right"
    )
)
# get COVID data
dfall = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
# filter to last date in data
dfall["date"] = pd.to_datetime(dfall["date"])
dflatest = dfall.groupby(["iso_code"], as_index=False).last()
# merge geometry and COVID data
gdf_asean_cases = gdf_asean.merge(
    dflatest.loc[:, ["iso_code", "total_cases", "date"]], left_on="adm0_a3", right_on="iso_code"
)

Having an issue using a Julia library

I am trying to run this code in Julia to calculate the knn value, but I get the following error when I run it.
ERROR: LoadError: syntax: extra token "ScikitLearn" after end of expression
Stacktrace:
[1] top-level scope
# e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
in expression starting at e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
The error seems to be the library on line 6. I have searched for a couple of hours to try and find a solution. Any help would be greatly appreciated.
Here is the code:
import Pkg
Pkg.add("ScikitLearn")
using ScikitLearn: fit!, predict, @sk_import
using DataFrames, CSV, DataStructures
from ScikitLearn.neighbors import KNeighborsClassifier
from ScikitLearn.model_selection import train_test_split
from ScikitLearn.metrics import accuracy_score

function splitTrainTest(data, at = 0.8)
    n = nrow(data)
    ind = shuffle(1:n)
    train_ind = view(ind, 1:floor(Int, at*n))
    test_ind = view(ind, (floor(Int, at*n)+1):n)
    return data[train_ind,:], data[test_ind,:]
end

# data preparation
df = open("breast-cancer.data") do file
    read(file, String)
end
print(df)
X, y = splitTrainTest(df)

# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# make model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)

# check accuracy
print(accuracy_score(y_test, knn.predict(x_test)))
That comment should have been an answer: You're doing
from ScikitLearn.neighbors import KNeighborsClassifier
which is Python syntax, not Julia syntax. If you're trying to use a Python model in ScikitLearn.jl you probably want the @sk_import macro, in your case:
julia> @sk_import neighbors: KNeighborsClassifier
PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Convert Pyspark dataframe to dictionary

I'm trying to convert a Pyspark dataframe into a dictionary.
Here's the sample CSV file -
Col0, Col1
-----------
A153534,BDBM40705
R440060,BDBM31728
P440245,BDBM50445050
I've come up with this code -
from rdkit import Chem
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df = spark.read.csv("gs://my-bucket/my_file.csv") # has two columns
# Creating list
to_list = map(lambda row: row.asDict(), df.collect())
#Creating dictionary
to_dict = {x['col0']: x for x in to_list }
This creates a dictionary like below -
'A153534': {'col0': 'A153534', 'col1': 'BDBM40705'}, 'R440060': {'col0': 'R440060', 'col1': 'BDBM31728'}, 'P440245': {'col0': 'P440245', 'col1': 'BDBM50445050'}
But I want a dictionary like this -
{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}
How can I do that?
I tried the rdd solution by Yolo but I'm getting an error. Can you please tell me what I am doing wrong?
py4j.protocol.Py4JError: An error occurred while calling
o80.isBarrier. Trace: py4j.Py4JException: Method isBarrier([]) does
not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Here's a way of doing it using rdd:
df.rdd.map(lambda x: {x.Col0: x.Col1}).collect()
[{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]
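If you want one flat dictionary rather than a list of single-pair dictionaries, a small variation on the same rdd approach (a sketch, assuming the same df) collects key/value tuples and feeds them to dict():
flat = dict(df.rdd.map(lambda x: (x.Col0, x.Col1)).collect())
# {'A153534': 'BDBM40705', 'R440060': 'BDBM31728', 'P440245': 'BDBM50445050'}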
This could help you:
from pyspark.sql.functions import create_map, to_json

df = spark.read.csv('/FileStore/tables/Create_dict.txt', header=True)
df = df.withColumn('dict', to_json(create_map(df.Col0, df.Col1)))
df_list = [row['dict'] for row in df.select('dict').collect()]
df_list
Output is:
['{"A153534":"BDBM40705"}',
'{"R440060":"BDBM31728"}',
'{"P440245":"BDBM50445050"}']

Problem with the "niche.overlap" function of the phyloclim R package

I'm trying to use the niche.overlap function by inputting a pno object obtained with the phyloclim package:
library(phyloclim)

x <- pno(path_bioclim = "C:\\Users\\test phyloclim 2\\Nova pasta (3)\\bio2.asc",
         path_model = "C:\\Users\\Nova pasta (4)",
         subset = NULL, bin_width = 1, bin_number = 100)

niche.overlap(x)
I expect to get a matrix, but instead I got the following error:
Error in niche.overlap(x) : object 'DI' not found
One need only export the object as a .csv file and then import it again as a table; it should then work fine.

Nested loop not working to gather data from NOAA

I'm using the R package rnoaa (along with its required other packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
  for(bid in lid){
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid.year.data <- data.frame(bid.year$data)
    write.csv(bid.year.data, file='cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not substitute the loop variables into the file name; it just names the file bid_range.csv.
This error message shows that there is no data for a given station id in 1990. Because you were using a for loop, once it hits an error, it stops.
Here I introduce the use of the tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "grid" containing all combinations of id and year
The expand function from tidyr can create the combinations of different values.
data_list <- buoys(dataset = 'wlevel')

data_list2 <- data_list %>%
  select(id) %>%
  expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data, and make it suitable for the map2 function
Because we will use map2 to loop through all the combinations of id and year via its .x and .y arguments, we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify: when it meets an error, it stores the error message and moves on to the next item rather than breaking.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
  buoy(dataset, buoyid = buoyid, year = year, ...)
}

# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data are downloaded into wlevel_data. Each element in wlevel_data has two parts: $result holds the data if the download was successful, otherwise it is NULL; $error is NULL if the download was successful, otherwise it holds the error message.
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
