I am trying to run this code in Julia to calculate the knn value, but I get the following error when I run it.
ERROR: LoadError: syntax: extra token "ScikitLearn" after end of expression
Stacktrace:
[1] top-level scope
# e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
in expression starting at e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
The error seems to be the library on line 6. I have searched for a couple of hours to try and find a solution. Any help would be greatly appreciated.
Here is the code:
import Pkg
Pkg.add("ScikitLearn")
using ScikitLearn: fit!, predict, #sk_import
using DataFrames, CSV, DataStructures
from ScikitLearn.neighbors import KNeighborsClassifier
from ScikitLearn.model_selection import train_test_split
from ScikitLearn.metrics import accuracy_score
function splitTrainTest(data, at = 0.8)
n = nrow(data)
ind = shuffle(1:n)
train_ind = view(ind, 1:floor(Int, at*n))
test_ind = view(ind, (floor(Int, at*n)+1):n)
return data[train_ind,:], data[test_ind,:]
end
# data preparation
df = open("breast-cancer.data") do file
read(file, String)
end
print(df)
X, y = splitTrainTest(df)
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
# make model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# check accuracy
print(accuracy_score(y_test, knn.predict(x_test)))
That comment should have been an answer: You're doing
from ScikitLearn.neighbors import KNeighborsClassifier
which is Python syntax, not Julia syntax. If you're trying to use a Python model in ScikitLearn.jl you probably want the #sk_import macro, in your case:
julia> #sk_import neighbors: KNeighborsClassifier
PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>
Related
I have an R object that will not convert to Pandas, and the strange part is that it doesn't throw an error.
Updated with the code I'm using, sorry not to supply that up front -- and to miss the request for 2 weeks!
Python code that calls an R script
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import datetime
from rpy2.robjects.conversion import localconverter
def serial_date_to_string(srl_no):
new_date = datetime.datetime(1970,1,1,0,0) + datetime.timedelta(srl_no - 1)
return new_date.strftime("%Y-%m-%d")
jurisdiction='TX'
r=ro.r
r_df=r['source']('farrington.R')
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(r_df)
The issue is that pd_from_r_df returns an R object rather than a Pandas dataframe:
>>> pd_from_r_df
R object with classes: ('list',) mapped to:
[ListSexpVector, BoolSexpVector]
value: <class 'rpy2.rinterface.ListSexpVector'>
<rpy2.rinterface.ListSexpVector object at 0x7faa4c4eff08> [RTYPES.VECSXP]
visible: <class 'rpy2.rinterface.BoolSexpVector'>
<rpy2.rinterface.BoolSexpVector object at 0x7faa4c4e7948> [RTYPES.LGLSXP]
Here's the R script "farrington.R", which returns a surveillance time series, which ro.conversion.rpy2py isn't (as used above) converting to a pandas dataframe
library('surveillance')
library(readr)
library(tidyr)
library(dplyr)
w<-1
b<-3
nfreq<-52
steps_back<- 28
alpha<-0.05
counts <- read_csv("Weekly_counts_of_death_by_jurisdiction_and_cause_of_death.csv")
counts<-counts[,!colnames(counts) %in% c('Cause Subgroup','Time Period','Suppress','Note','Average Number of Deaths in Time Period','Difference from 2015-2019 to 2020','Percent Difference from 2015-2019 to 2020')]
wide_counts_by_cause<-pivot_wider(counts,names_from='Cause Group',values_from='Number of Deaths',values_fn=(`Cause Group`=sum))
wide_state <- filter(wide_counts_by_cause,`State Abbreviation`==jurisdiction)
wide_state <- filter(wide_state,Type=='Unweighted')
wide_state[is.na(wide_state)] <-0
important_columns=c('Alzheimer disease and dementia','Cerebrovascular diseases','Heart failure','Hypertensive dieases','Ischemic heart disease','Other diseases of the circulatory system','Malignant neoplasms','Diabetes','Renal failure','Sepsis','Chronic lower respiratory disease','Influenza and pneumonia','Other diseases of the respiratory system','Residual (all other natural causes)')
all_columns <- append(c('Year','Week'),important_columns)
selected_wide_state<-wide_state[, names(wide_state) %in% all_columns]
start<-c(as.numeric(min(selected_wide_state[,'Year'])),as.numeric(min(selected_wide_state[,'Week'])))
freq<-as.numeric(max(selected_wide_state[,'Week']))
sts <- new("sts",epoch=1:nrow(numeric_wide_state),start=start,freq=freq,observed=numeric_wide_state)
sts_4 <- aggregate(sts[,important_columns],nfreq=nfreq)
start_idx=end_idx-steps_back
cntrlFar <- list(range=start_idx:end_idx,w==w,b==b,alpha==alpha)
surveil_ts_4_far <- farrington(sts_4,control=cntrlFar)
far_df<-tidy.sts(surveil_ts_4_far)
far_df
(using the NCHS data here [from a couple months back] https://data.cdc.gov/NCHS/Weekly-counts-of-death-by-jurisdiction-and-cause-o/u6jv-9ijr/ )
In R, when calling source() by default on a script without named functions, the returned object is a list of two named components, $value and $visible, where:
$value is the last displayed or defined object which in your case is the far_df data frame (which in R data.frame is a class object extending list type);
$visible is a boolean vector indicating if last object was displayed or not which in your case is TRUE. This would be FALSE had you ended script at far_df <- tidy.sts(surveil_ts_4_far).
In fact, your Python error confirms this output indicatating a list of [ListSexpVector, BoolSexpVector].
Therefore, since you only want the first item, index for first item accordingly by number or name.
r_raw = ro.r['source']('farrington.R') # IN R: r_raw <- source('farrington.R')
r_df = r_raw[0] # IN R: r_df <- r_raw[1]
r_df = r_raw[r_raw.names.index('value')] # IN R: r_df <- r_raw$value
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(r_df)
I'm trying to use the niche.overlap function by inputting a pno object obtained in the phyloclim package:
library(phyloclim)
x <- pno(path_bioclim = "C:\\Users\\test phyloclim 2\\Nova pasta (3)\\bio2.asc",
path_model = "C:\\Users\\Nova pasta (4)",
subset = NULL , bin_width = 1, bin_number = 100)
niche.overlap(x)
I expect to get a matrix but instead I get got the following error:
Error in niche.overlap(x) : object 'DI' not found
One must only export the object as a .csv file, and then import again as a table. It should work fine.
In the book "Machine Learning - A Probabilistic Perspective" by Kevin P. Murphy the first task reads:
Exercise 1.1 KNN classifier on shuffled MNIST data
Run mnist1NNdemo
and verify that the misclassification rate (on the first 1000 test
cases) of MNIST of a 1-NN classifier is 3.8%. (If you run it all on
all 10,000 test cases, the error rate is 3.09%.) Modify the code so
that you first randomly permute the features (columns of the training
and test design matrices), as in shuffledDigitsDemo, and then apply
the classifier. Verify that the error rate is not changed.
My simple understanding is that the exercise is looking for the 1-NN after loading the files(kNN() in R).
The files:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
are taken from the The MNIST DATABASE
I found a popular template for loading the files:
# for the kNN() function
library(VIM)
load_mnist <- function() {
load_image_file <- function(filename) {
ret = list()
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
ret$n = readBin(f,'integer',n=1,size=4,endian='big')
nrow = readBin(f,'integer',n=1,size=4,endian='big')
ncol = readBin(f,'integer',n=1,size=4,endian='big')
x = readBin(f,'integer',n=ret$n*nrow*ncol,size=1,signed=F)
ret$x = matrix(x, ncol=nrow*ncol, byrow=T)
close(f)
ret
}
load_label_file <- function(filename) {
f = file(filename,'rb')
readBin(f,'integer',n=1,size=4,endian='big')
n = readBin(f,'integer',n=1,size=4,endian='big')
y = readBin(f,'integer',n=n,size=1,signed=F)
close(f)
y
}
train <<- load_image_file("train-images.idx3-ubyte")
test <<- load_image_file("t10k-images.idx3-ubyte")
train$y <<- load_label_file("train-labels.idx1-ubyte")
test$y <<- load_label_file("t10k-labels.idx1-ubyte")
}
show_digit <- function(arr784, col=gray(12:1/12)) {
image(matrix(arr784, nrow=28)[,28:1], col=col)
}
According to the comment, in the command line this should work:
# Error "Error in matrix(arr784, nrow = 28) : object 'train' not found"
show_digit(train$x[5,])
The question is how can I use the show_digit function ?
Edit Remove extra question
What I figured out for the problem:
First run the whole file in R Studio or ESS, then call the load_mnist() from the console.
After that execute show_digit(train$x[3,]) in the console again and it works.
Finding the KNN classifier can be done on the whole data set:
a <- knn(train, test, train$y) but it would be a very slow process.
Predictions for the result can be done like table(test$y, a), test$y is predicted, a is the actual result.
I am using the R package msa, a core Bioconductor package, for multiple sequence alignment. Within msa, I am using the MUSCLE alignment algorithm to align protein sequences.
library(msa)
myalign <- msa("test.fa", method=c("Muscle"), type="protein",verbose=FALSE)
The test.fa file is a standard fasta as follows (truncated, for brevity):
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWRFLL
When I run the code on the file, I get:
MUSCLE 3.8.31
Call:
msa("test.fa", method = c("Muscle"), type = "protein", verbose = FALSE)
MsaAAMultipleAlignment with 2 rows and 480 columns
aln
[1] MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
[2] MSVVAIVKEGWLHKRGEYIKTWR---FLL
Con MS?VAIVKEGWLHKRGEYIKTWR???FLL
As you can see, a very reasonable alignment.
I want to write the gapped alignment, preferably without the consensus sequence (e.g., Con row), to a fasta file. So, I want:
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWR---FLL
I checked the msa help, and the package does not seem to have a built in method for writing out to any file type, fasta or otherwise.
The seqinr package looks somewhat promising, because maybe it could read this output as an msf format, albeit a weird one. However, seqinr seems to need a file read in as a starting point. I can't even save this using write(myalign, ...).
I wrote a function:
alignment2Fasta <- function(alignment, filename) {
sink(filename)
n <- length(rownames(alignment))
for(i in seq(1, n)) {
cat(paste0('>', rownames(alignment)[i]))
cat('\n')
the.sequence <- toString(unmasked(alignment)[[i]])
cat(the.sequence)
cat('\n')
}
sink(NULL)
}
Usage:
mySeqs <- readAAStringSet('test.fa')
myAlignment <- msa(mySeqs)
alignment2Fasta(myAlignment, 'out.fasta')
I think you ought to follow the examples in the help pages that show input with a specific read function first, then work with the alignment:
mySeqs <- readAAStringSet("test.fa")
myAlignment <- msa(mySeqs)
Then the rownames function will deliver the sequence names:
rownames(myAlignment)
[1] "sp|P31749|AKT1_HUMAN_RAC" "sp|P31799|AKT1_HUMAN_RAC"
(Not what you asked for but possibly useful in the future.) Then if you execute:
detail(myAlignment) #function actually in Biostrings
.... you get a text file in interactive mode that you can save
2 29
sp|P31749|AKT1_HUMAN_RAC MSDVAIVKEG WLHKRGEYIK TWRPRYFLL
sp|P31799|AKT1_HUMAN_RAC MSVVAIVKEG WLHKRGEYIK TWR---FLL
If you wnat to try hacking a function for which you can get a file written in code, then look at the Biostrings detail function code that is being used
> showMethods( f= 'detail')
Function: detail (package Biostrings)
x="ANY"
x="MsaAAMultipleAlignment"
(inherited from: x="MultipleAlignment")
x="MultipleAlignment"
showMethods( f= 'detail', classes='MultipleAlignment', includeDefs=TRUE)
Function: detail (package Biostrings)
x="MultipleAlignment"
function (x, ...)
{
.local <- function (x, invertColMask = FALSE, hideMaskedCols = TRUE)
{
FH <- tempfile(pattern = "tmpFile", tmpdir = tempdir())
.write.MultAlign(x, FH, invertColMask = invertColMask,
showRowNames = TRUE, hideMaskedCols = hideMaskedCols)
file.show(FH)
}
.local(x, ...)
}
You may use export.fasta function from bio2mds library.
# reading of the multiple sequence alignment of human GPCRS in FASTA format:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
export.fasta(aln)
You can convert your msa alignment first ("AAStringSet") into an "align" object first, and then export as fasta as follows:
library(msa)
library(bios2mds)
mysequences <-readAAStringSet("test.fa")
alignCW <- msa(mysequences)
#https://rdrr.io/bioc/msa/man/msaConvert.html
alignCW_as_align <- msaConvert(alignCW, "bios2mds::align")
export.fasta(alignCW_as_align, outfile = "test_alignment.fa", ncol = 60, open = "w")
I'm trying to run this code, and I'm using mhadaptive package, but the problem is that when I run these code without writing metropolis_hastings (that is one part of mhadaptive package) error does not occur, but when I add mhadaptive package the error occur. What should I do?
li_F1<-function(pars,data) #defining first function
{
a01<-pars[1] #defining parameters
a11<-pars[2]
epsilon<<-pars[3]
b11<-pars[4]
a02<-pars[5]
a12<-pars[6]
b12<-pars[7]
h<-pars[8]
h[[i]]<-list() #I want my output is be listed in the h
h[[1]]<-0.32082184 #My first value of h is known and other values should calculate by formula
for(i in 2:nrow(F_2_))
{
h[[i]]<- ((a01+a11*(h[[i-1]])*(epsilon^2)*(h[[i-1]])*b11)+(F1[,2])*((a02+a12*(h[[i-1]])*(epsilon^2)+(h[[i-1]])*b12)))
pred<- h[[i]]
}
log_likelihood<-sum(dnorm(prod(h[i]),pred,sd = 1 ,log = TRUE))
return(h[i])
prior<- prior_reg(pars)
return(log_likelihood + prior)
options(digits = 22)
}
prior_reg<-function(pars) #defining another function
{
epsilon<<-pars[3] #error
prior_epsilon<-pt(0.95,5,lower.tail = TRUE,log.p = FALSE)
return(prior_epsilon)
}
F1<-as.matrix(F_2_) #defining my importing data and simulatunig data with them
x<-F1[,1]
y<-F1[,2]
d<-cbind(x,y)
#using mhadaptive package
mcmc_r<-Metro_Hastings(li_func = li_F1,pars=c(10,15,10,10,10,15),par_names=c('a01','a02','a11','a12','b11','b12'),data=d)
By running this code this error occur.
Error in h[[i]] <- list() : replacement has length zero
I'll so much appreciate who help me.