Import data vector from Julia to R using RCall

Assume I have a Julia data array like this:
Any[Any[1,missing], Any[2,5], Any[3,6]]
I want to import it into R using RCall so that the result is equivalent to this:
data <- cbind(c(1,NA), c(2,5), c(3,6))
Note: the length of the data is dynamic, so it may not be 3.
Could anyone help me with how to do this? Thank you.

You can just interpolate a matrix into R:
using RCall
a = [ 1 2 3
missing 5 6 ]
R"data <- $a"
To reorganize your "array of arrays" into a matrix, you need to concatenate the inner arrays horizontally, so that each one becomes a column:
b = Any[Any[1,missing], Any[2,5], Any[3,6]]
a = hcat(b...)
R"data <- $a"


Having issue using Julia library

I am trying to run this code in Julia to calculate the knn value, but I get the following error when I run it.
ERROR: LoadError: syntax: extra token "ScikitLearn" after end of expression
Stacktrace:
[1] top-level scope
# e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
in expression starting at e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
The error seems to be the library on line 6. I have searched for a couple of hours to try and find a solution. Any help would be greatly appreciated.
Here is the code:
import Pkg
Pkg.add("ScikitLearn")
using ScikitLearn: fit!, predict, @sk_import
using DataFrames, CSV, DataStructures
from ScikitLearn.neighbors import KNeighborsClassifier
from ScikitLearn.model_selection import train_test_split
from ScikitLearn.metrics import accuracy_score
function splitTrainTest(data, at = 0.8)
n = nrow(data)
ind = shuffle(1:n)
train_ind = view(ind, 1:floor(Int, at*n))
test_ind = view(ind, (floor(Int, at*n)+1):n)
return data[train_ind,:], data[test_ind,:]
end
# data preparation
df = open("breast-cancer.data") do file
read(file, String)
end
print(df)
X, y = splitTrainTest(df)
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
# make model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# check accuracy
print(accuracy_score(y_test, knn.predict(x_test)))
That comment should have been an answer. You're doing
from ScikitLearn.neighbors import KNeighborsClassifier
which is Python syntax, not Julia syntax. If you're trying to use a Python model in ScikitLearn.jl, you probably want the @sk_import macro, in your case:
julia> @sk_import neighbors: KNeighborsClassifier
PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Looking for a read.hclust function

The function write.hclust for hclust objects is available in the RFLPtools package. However, I can't find a corresponding read.*** function despite Googling. Does anyone know of such a function?
If I am understanding your question correctly, you should be able to use read.rflp.
library(RFLPtools)
data(RFLPdata)
res <- RFLPdist(RFLPdata, nrBands = 4)
cl <- hclust(res)
write.hclust(cl, file = "Test.txt", prefix = "Bd4", h = 50)
read.rflp("Test.txt")
Returns:
         Sample Cluster Cluster.ID Gel
Ni_25_B2  Ni_25       1 Bd4_H50_01  B2
Ni_25_B5  Ni_25       2 Bd4_H50_02  B5
Ni_28_A2  Ni_28       3 Bd4_H50_03  A2

How do similar documents transformed into TFIDF valued vector look in vector space

This might be a strange question, but I can't help wondering. Let's say I have three documents:
d1 = "My name is Stefan."
d2 = "My name is David."
d3 = "Hello, how are you?"
If I transform all three documents into TF-IDF-valued vectors, will documents d1 and d2 be closer to each other in vector space than, for example, documents d2 and d3? Sorry if this is a stupid question, but I would really like to visualize this somehow in order to understand it better. Thank you in advance!
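Yes, they will be closer. d1 and d2 share the terms "my", "name" and "is", so their cosine similarity is positive, whereas d2 and d3 share no terms, so theirs is 0.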
Demo:
In [21]: from sklearn.feature_extraction.text import TfidfVectorizer
In [22]: from sklearn.metrics.pairwise import cosine_similarity
In [23]: tfidf = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))
In [24]: r = tfidf.fit_transform(data)
In [25]: s = cosine_similarity(r)
In [26]: s
Out[26]:
array([[1.        , 0.53634991, 0.        ],
       [0.53634991, 1.        , 0.        ],
       [0.        , 0.        , 1.        ]])
In [27]: data
Out[27]: ['My name is Stefan.', 'My name is David.', 'Hello, how are you?']
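To see where the zeros and the non-zero entry come from, here is a minimal from-scratch sketch using only the Python standard library. It is not what TfidfVectorizer does internally: it uses unigrams only (the demo above uses ngram_range=(1,3)), and the smoothed IDF formula and the helper names tokenize, tfidf_vectors and cosine are assumptions made for illustration. Its d1/d2 similarity therefore differs from the 0.536 shown above, but the ordering is the same.

```python
import math
import re
from collections import Counter

docs = ["My name is Stefan.", "My name is David.", "Hello, how are you?"]

def tokenize(text):
    # Lowercase and keep runs of word characters (unigrams only).
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(documents):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {t: counts[t] * idf[t] for t in counts}
        # L2-normalise so that a dot product equals the cosine similarity.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    # Both vectors are unit length; only shared terms contribute.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vectors = tfidf_vectors(docs)
sim_d1_d2 = cosine(vectors[0], vectors[1])  # shared terms: my, name, is
sim_d2_d3 = cosine(vectors[1], vectors[2])  # no shared terms
print(sim_d1_d2, sim_d2_d3)
```

Only terms that occur in both documents contribute to the dot product, which is why d2 and d3 come out orthogonal while d1 and d2 do not.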

Selecting features from a feature set using mRMRe package

I am a new user of R and trying to use mRMRe R package (mRMR is one of the good and well known feature selection approaches) to obtain feature subset from a feature set. Please excuse if my question is simple as I really want to know how I can fix an error. Below is the detail.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in every column of the csv file are real numbers. So how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from the gene.csv file using the mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column, which is probably of class integer because it contains only 1 and -1. You can check that using class(df[[7]]).
To convert it to numeric, as the error message requires, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.

Substring (variable length) values in entire column of dataframe

I have looked for this tirelessly with no luck. I am coming from a Java background and am new to R. (On a side note, I am loving R, but disliking string operations in it as well as the documentation; maybe that's just a Java bias.)
Anyhow, I have a data frame with a single column of strings in which the longitude and latitude are separated by colons, e.g. ROAD:_:-87.4968190989999:38.7414455360001
I would like to create two new data frames, one holding the latitudes and one holding the longitudes.
I have successfully written a piece of code that uses for loops, but I know this is inefficient and that there has to be another way.
Here is a snippet of the inefficient code:
length <- length(fromLatLong)
for (i in 1:length){
fromLat[i] <- strsplit(fromLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
fromLong[i] <- strsplit(fromLatLong[i] ,":")[[1]][3]
}
for (i in 1:length){
toLat[i] <- strsplit(toLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
toLong[i] <- strsplit(toLatLong[i] ,":")[[1]][3]
}
Here is how I tried to optimize it using mutate, but I only get the first value copied over to all rows as such:
fromLat = mutate(fromLatLong, FROM_NODE_ID = (strsplit(as.character(fromLatLong$FROM_NODE_ID),":")[[1]][4]))
fromLong = mutate(fromLatLong, FROM_NODE_ID = (strsplit(fromLatLong$FROM_NODE_ID,":")[[1]][3]))
toLat = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][4]))
toLong = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][3]))
And here is the result:
  FROM_NODE_ID
1 38.7414455360001
2 38.7414455360001
3 38.7414455360001
4 38.7414455360001
5 38.7414455360001
6 38.7414455360001
7 38.7414455360001
8 38.7414455360001
9 38.7414455360001
I would appreciate your help with this. Thanks.
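In your mutate attempt, strsplit returns a list with one element per row, and [[1]][4] takes the fourth piece of the first row only, so that single value is recycled down the whole column. You can apply the split row by row with the map_chr function of the purrr package. For instance: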
fromLat = mutate(fromLatLong, FROM_NODE_ID = map_chr(FROM_NODE_ID, ~ strsplit(as.character(.x),":")[[1]][4]))
The following expression will produce a data frame with each of the colon-delimited components as a separate column. You can then break this up into separate data frames or do whatever else you want with it.
as.data.frame(t(matrix(unlist(strsplit(fromLatLong$coords, ":", fixed=TRUE), recursive=FALSE), nrow=4)),stringsAsFactors=FALSE)
(Assuming the column name of your values in the data frame is coords.)
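The difference between indexing once and indexing per row can also be sketched in Python. This only illustrates the logic and is not R code; apart from the first string, which is the example from the question, the sample values are made up, and the variable names are hypothetical.

```python
# Sample values in the question's "ROAD:_:<long>:<lat>" layout.
# Only the first string comes from the question; the others are made up.
from_node_ids = [
    "ROAD:_:-87.4968190989999:38.7414455360001",
    "ROAD:_:-87.5:38.75",
    "ROAD:_:-86.25:39.5",
]

# Pitfall: split every string, then index into the FIRST result only.
# This yields a single value, which would be repeated for every row.
first_only = [s.split(":") for s in from_node_ids][0][3]

# Fix: pick the field from each split result separately.
# Python indices are 0-based, so R's [4] is [3] here and [3] is [2].
lats = [s.split(":")[3] for s in from_node_ids]
longs = [s.split(":")[2] for s in from_node_ids]

print(first_only)
print(lats)
print(longs)
```

The per-element comprehension plays the same role as map_chr in the R answer: the field is extracted once per row instead of once for the whole column.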
