Convert Pyspark dataframe to dictionary - dictionary

I'm trying to convert a Pyspark dataframe into a dictionary.
Here's the sample CSV file -
Col0, Col1
-----------
A153534,BDBM40705
R440060,BDBM31728
P440245,BDBM50445050
I've come up with this code -
from rdkit import Chem
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df = spark.read.csv("gs://my-bucket/my_file.csv") # has two columns
# Creating list
to_list = map(lambda row: row.asDict(), df.collect())
#Creating dictionary
to_dict = {x['col0']: x for x in to_list }
This creates a dictionary like below -
'A153534': {'col0': 'A153534', 'col1': 'BDBM40705'}, 'R440060': {'col0': 'R440060', 'col1': 'BDBM31728'}, 'P440245': {'col0': 'P440245', 'col1': 'BDBM50445050'}
But I want a dictionary like this -
{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}
How can I do that?
I tried the rdd solution by Yolo but I'm getting error. Can you please tell me what I am doing wrong?
py4j.protocol.Py4JError: An error occurred while calling
o80.isBarrier. Trace: py4j.Py4JException: Method isBarrier([]) does
not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Here's a way of doing it using rdd:
df.rdd.map(lambda x: {x.Col0: x.Col1}).collect()
[{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]

This could help you:
df = spark.read.csv('/FileStore/tables/Create_dict.txt',header=True)
df = df.withColumn('dict',to_json(create_map(df.Col0,df.Col1)))
df_list = [row['dict'] for row in df.select('dict').collect()]
df_list
Output is:
['{"A153534":"BDBM40705"}',
'{"R440060":"BDBM31728"}',
'{"P440245":"BDBM50445050"}']

Related

Having issue using Julia library

I am trying to run this code in Julia to calculate the knn value, but I get the following error when I run it.
ERROR: LoadError: syntax: extra token "ScikitLearn" after end of expression
Stacktrace:
[1] top-level scope
# e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
in expression starting at e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
The error seems to be the library on line 6. I have searched for a couple of hours to try and find a solution. Any help would be greatly appreciated.
Here is the code:
import Pkg
Pkg.add("ScikitLearn")
using ScikitLearn: fit!, predict, #sk_import
using DataFrames, CSV, DataStructures
from ScikitLearn.neighbors import KNeighborsClassifier
from ScikitLearn.model_selection import train_test_split
from ScikitLearn.metrics import accuracy_score
function splitTrainTest(data, at = 0.8)
n = nrow(data)
ind = shuffle(1:n)
train_ind = view(ind, 1:floor(Int, at*n))
test_ind = view(ind, (floor(Int, at*n)+1):n)
return data[train_ind,:], data[test_ind,:]
end
# data preparation
df = open("breast-cancer.data") do file
read(file, String)
end
print(df)
X, y = splitTrainTest(df)
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
# make model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# check accuracy
print(accuracy_score(y_test, knn.predict(x_test)))
That comment should have been an answer: You're doing
from ScikitLearn.neighbors import KNeighborsClassifier
which is Python syntax, not Julia syntax. If you're trying to use a Python model in ScikitLearn.jl you probably want the #sk_import macro, in your case:
julia> #sk_import neighbors: KNeighborsClassifier
PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Import data vector from julia to R using RCall

Assume I have a Julia data array like this:
Any[Any[1,missing], Any[2,5], Any[3,6]]
I want to import it to R using RCall so I have an output equivalent to this:
data <- cbind(c(1,NA), c(2,5), c(3,6))
Note: the length of data is dynamic and it may be not 3!
could anyone help me how can I do this? Thank you
You can just interpolate a matrix into R:
a = [ 1 2 3
missing 5 6 ]
R"data <- $a"
To reorgnize your "array of array" into a matrix, you need to concat them
b = Any[Any[1,missing], Any[2,5], Any[3,6]]
a = hcat(b...)
R"data <- $a"

rpy2 does not convert back to pandas

I have an R object that will not convert to Pandas, and the strange part is that it doesn't throw an error.
Updated with the code I'm using, sorry not to supply that up front -- and to miss the request for 2 weeks!
Python code that calls an R script
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import datetime
from rpy2.robjects.conversion import localconverter
def serial_date_to_string(srl_no):
new_date = datetime.datetime(1970,1,1,0,0) + datetime.timedelta(srl_no - 1)
return new_date.strftime("%Y-%m-%d")
jurisdiction='TX'
r=ro.r
r_df=r['source']('farrington.R')
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(r_df)
The issue is that pd_from_r_df returns an R object rather than a Pandas dataframe:
>>> pd_from_r_df
R object with classes: ('list',) mapped to:
[ListSexpVector, BoolSexpVector]
value: <class 'rpy2.rinterface.ListSexpVector'>
<rpy2.rinterface.ListSexpVector object at 0x7faa4c4eff08> [RTYPES.VECSXP]
visible: <class 'rpy2.rinterface.BoolSexpVector'>
<rpy2.rinterface.BoolSexpVector object at 0x7faa4c4e7948> [RTYPES.LGLSXP]
Here's the R script "farrington.R", which returns a surveillance time series, which ro.conversion.rpy2py isn't (as used above) converting to a pandas dataframe
library('surveillance')
library(readr)
library(tidyr)
library(dplyr)
w<-1
b<-3
nfreq<-52
steps_back<- 28
alpha<-0.05
counts <- read_csv("Weekly_counts_of_death_by_jurisdiction_and_cause_of_death.csv")
counts<-counts[,!colnames(counts) %in% c('Cause Subgroup','Time Period','Suppress','Note','Average Number of Deaths in Time Period','Difference from 2015-2019 to 2020','Percent Difference from 2015-2019 to 2020')]
wide_counts_by_cause<-pivot_wider(counts,names_from='Cause Group',values_from='Number of Deaths',values_fn=(`Cause Group`=sum))
wide_state <- filter(wide_counts_by_cause,`State Abbreviation`==jurisdiction)
wide_state <- filter(wide_state,Type=='Unweighted')
wide_state[is.na(wide_state)] <-0
important_columns=c('Alzheimer disease and dementia','Cerebrovascular diseases','Heart failure','Hypertensive dieases','Ischemic heart disease','Other diseases of the circulatory system','Malignant neoplasms','Diabetes','Renal failure','Sepsis','Chronic lower respiratory disease','Influenza and pneumonia','Other diseases of the respiratory system','Residual (all other natural causes)')
all_columns <- append(c('Year','Week'),important_columns)
selected_wide_state<-wide_state[, names(wide_state) %in% all_columns]
start<-c(as.numeric(min(selected_wide_state[,'Year'])),as.numeric(min(selected_wide_state[,'Week'])))
freq<-as.numeric(max(selected_wide_state[,'Week']))
sts <- new("sts",epoch=1:nrow(numeric_wide_state),start=start,freq=freq,observed=numeric_wide_state)
sts_4 <- aggregate(sts[,important_columns],nfreq=nfreq)
start_idx=end_idx-steps_back
cntrlFar <- list(range=start_idx:end_idx,w==w,b==b,alpha==alpha)
surveil_ts_4_far <- farrington(sts_4,control=cntrlFar)
far_df<-tidy.sts(surveil_ts_4_far)
far_df
(using the NCHS data here [from a couple months back] https://data.cdc.gov/NCHS/Weekly-counts-of-death-by-jurisdiction-and-cause-o/u6jv-9ijr/ )
In R, when calling source() by default on a script without named functions, the returned object is a list of two named components, $value and $visible, where:
$value is the last displayed or defined object which in your case is the far_df data frame (which in R data.frame is a class object extending list type);
$visible is a boolean vector indicating if last object was displayed or not which in your case is TRUE. This would be FALSE had you ended script at far_df <- tidy.sts(surveil_ts_4_far).
In fact, your Python error confirms this output indicatating a list of [ListSexpVector, BoolSexpVector].
Therefore, since you only want the first item, index for first item accordingly by number or name.
r_raw = ro.r['source']('farrington.R') # IN R: r_raw <- source('farrington.R')
r_df = r_raw[0] # IN R: r_df <- r_raw[1]
r_df = r_raw[r_raw.names.index('value')] # IN R: r_df <- r_raw$value
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(r_df)

Problem in the function "niche.overlap" of the Phyloclim R Package

I'm trying to use the niche.overlap function by inputting a pno object obtained in the phyloclim package:
library(phyloclim)
x <- pno(path_bioclim = "C:\\Users\\test phyloclim 2\\Nova pasta (3)\\bio2.asc",
path_model = "C:\\Users\\Nova pasta (4)",
subset = NULL , bin_width = 1, bin_number = 100)
niche.overlap(x)
I expect to get a matrix but instead I get got the following error:
Error in niche.overlap(x) : object 'DI' not found
One must only export the object as a .csv file, and then import again as a table. It should work fine.

How to replicate the Column with data set in table

I want to convert question columns into a row.using Python Pandas like below
import pandas as pd
from openpyxl import load_workbook
df = pd.read_excel (r'file' ,sheet_name='results')
d = {'Score.1':'Score','Score.2':'Score','Duration.1':'Duration','Duration.2':'Duration'}
melted=pd.melt(df, id_vars=['userid','Candidate','Score','Duration'], value_vars=['Question 1'],var_name='myVarname', value_name='myValname')
melted1=pd.melt(df, id_vars=['userid','Candidate','Score.1','Duration.1'], value_vars=['Question 2'],var_name='myVarname', value_name='myValname').rename(columns=d)
melted2=pd.melt(df, id_vars=['userid','Candidate','Score.2','Duration.2'], value_vars=['Question 3 '],var_name='myVarname', value_name='myValname').rename(columns=d)
......
melted2=pd.melt(df, id_vars=['userid','Candidate','Score.25','Duration.25'], value_vars=['Question 25 '],var_name='myVarname', value_name='myValname').rename(columns=d)
meltedfinal=[melted,melted1,melted2]
result = pd.concat(meltedfinal)
result.to_excel(r'file') # doctest: +SKIP

Resources