How can I calculate cosine similarity between two string vectors in R?

I have two vectors of dimension 6 and I would like to get a similarity score between 0 and 1.
a=c("HDa","2Pb","2","BxU","BuQ","Bve")
b=c("HCK","2Pb","2","09","F","G")
Can anyone explain what I should do?

Using the lsa package, following the package manual:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: here is what the myMatrix object looks like:
myMatrix
#       docs
# terms  D1 D2
#   2     1  1
#   2pb   1  1
#   buq   1  0
#   bve   1  0
#   bxu   1  0
#   hda   1  0
#   09    0  1
#   f     0  1
#   g     0  1
#   hck   0  1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
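(The two documents share two of their six terms, so the cosine is 2 / (sqrt(6) * sqrt(6)) = 1/3.)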

You need a dictionary of possible terms first, and then convert your vectors to binary vectors with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors a2 and b2, you can calculate their correlation (a close cousin of the cosine) with cor(a2, b2), but notice that the correlation is between -1 and 1. You could map it to [0, 1] with something like 0.5 * cor(a2, b2) + 0.5.
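A minimal base-R sketch of that recipe, with the plain cosine added for comparison:
a <- c("HDa","2Pb","2","BxU","BuQ","Bve")
b <- c("HCK","2Pb","2","09","F","G")
dict <- union(a, b)                          # dictionary of all possible terms
a2 <- as.integer(dict %in% a)                # 1 where the term occurs, 0 elsewhere
b2 <- as.integer(dict %in% b)
cor(a2, b2)                                  # correlation, in [-1, 1]
0.5 * cor(a2, b2) + 0.5                      # mapped to [0, 1]
sum(a2 * b2) / sqrt(sum(a2^2) * sum(b2^2))   # plain cosine: 1/3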

library(tm)
CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
This approach may scale better for larger data sets.
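For instance, applied to the original vectors (each collapsed into a one-document string; wordLengths = c(1, Inf) keeps the short terms that tm would otherwise drop):
library(tm)
library(lsa)
a <- c("HDa","2Pb","2","BxU","BuQ","Bve")
b <- c("HCK","2Pb","2","09","F","G")
docs <- c(paste(a, collapse = " "), paste(b, collapse = " "))
corp <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))
m <- as.matrix(dtm)
lsa::cosine(m[1, ], m[2, ])  # ~0.3333 again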

A more advanced form of embedding might help you get better output. Please check the following code: it uses the Universal Sentence Encoder model, which maps sentences to fixed-length embedding vectors.
import numpy as np
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(text):
    # returns the embedding for a single string
    return model([text])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]
# inner product of the two embeddings; the encoder's vectors are
# approximately unit-norm, so this is close to their cosine similarity
print(np.inner(embed(paragraph[0]), embed(paragraph[1])))

Clustering longitudinal data with multiple variables in R

I have a dataset containing observations of 30 people, each of whom completed 20 trials. Suppose my data looks like this:
ID   trial reaction response prop_1 prop_2
"s1"     1     2.12        0   0.52   0.48
"s1"     2     1.32        1   0.12   0.88
"s1"     3       NA        1     NA     NA
"s2"     1     2.33        1   0.65   0.35
"s2"     2     2.56        0   0.43   0.57
"s2"     3       NA        1     NA     NA
I want to cluster the participants using these variables. I studied the traj, latrend, and kml packages, but all of them use just one variable to cluster the data. How can I use multiple variables to cluster longitudinal data like this?
Any simple help or guidance would be appreciated.
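One R-native direction worth checking is kml3d, the multivariate companion of kml, which clusters joint trajectories stored as an individuals x times x variables array. A minimal sketch on made-up data (function and argument names follow kml3d's documentation but are worth verifying; NA trials would need imputing or dropping first):
library(kml3d)
set.seed(1)
# toy data: 30 subjects x 20 trials x 2 variables (e.g. reaction, prop_1)
traj <- array(rnorm(30 * 20 * 2), dim = c(30, 20, 2))
cld <- clusterLongData3d(traj = traj, idAll = paste0("s", 1:30),
                         time = 1:20, varNames = c("reaction", "prop_1"))
kml3d(cld, nbClusters = 2:4, nbRedrawing = 5)
getClusters(cld, 3)  # memberships for the 3-cluster solution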
Here is one way to do it.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()

df = pd.read_csv('C:\\business.csv')
df.dropna(axis=0, how='any', subset=['latitude', 'longitude'], inplace=True)

# elbow curve: score k-means for k = 1..9
K_clusters = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

# cluster on both coordinates
X = df[['longitude', 'latitude']].copy()
kmeans = KMeans(n_clusters=5, init='k-means++')
X['cluster_label'] = kmeans.fit_predict(X[['longitude', 'latitude']])
centers = kmeans.cluster_centers_      # coordinates of cluster centers
labels = X['cluster_label'].values     # label of each point
X.head(10)
X.plot.scatter(x='longitude', y='latitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()
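For reference, since the question is in R, here is the same elbow-curve idea in base R on made-up coordinates:
set.seed(1)
df <- data.frame(latitude = rnorm(200), longitude = rnorm(200))
wss <- sapply(1:9, function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss)
plot(1:9, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Total within-cluster SS", main = "Elbow Curve")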
Here's another idea.
# import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from collections import Counter

df = pd.read_csv('C:\\properties_2017.csv')
df = df.head(10000)
df = df.sample(frac=0.2, replace=True, random_state=1)
df.shape

# handle missing values: fill zip codes first, then drop rows
# without coordinates, then zero-fill everything else
df['regionidzip'] = df['regionidzip'].fillna(97000)
df.dropna(axis=0, how='any', subset=['latitude', 'longitude'], inplace=True)
df = df.fillna(0)

# reset indexes so the positional indices from np.where line up below
X = df.loc[:, ['latitude', 'longitude']].reset_index(drop=True)
zp = df.regionidzip.reset_index(drop=True)
id_n = 8
kmeans = KMeans(n_clusters=id_n, random_state=0).fit(X)
id_label = kmeans.labels_

# plot result
ptsymb = np.array(['b.', 'r.', 'm.', 'g.', 'c.', 'k.', 'b*', 'r*', 'm*', 'r^'])
plt.figure(figsize=(12, 12))
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
for i in range(id_n):
    cluster = np.where(id_label == i)[0]
    plt.plot(X.latitude[cluster].values, X.longitude[cluster].values, ptsymb[i])
plt.show()

# revise the clustering based on zip code: give every zip its majority cluster
uniq_zp = np.unique(zp)
for i in uniq_zp:
    a = np.where(zp == i)[0]
    c = Counter(id_label[a])
    id_label[a] = c.most_common(1)[0][0]

# plot result (revised)
plt.figure(figsize=(12, 12))
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
for i in range(id_n):
    cluster = np.where(id_label == i)[0]
    plt.plot(X.latitude[cluster].values, X.longitude[cluster].values, ptsymb[i])
plt.show()
https://www.kaggle.com/xxing9703/kmean-clustering-of-latitude-and-longitude?select=zillow_data_dictionary.xlsx
https://www.kaggle.com/c/zillow-prize-1/data
Also, check this out.
https://towardsdatascience.com/clustering-geospatial-data-f0584f0b04ec
https://raw.githubusercontent.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/master/machine_learning/data_stores.csv

Error in as.Date.numeric() : 'origin' must be supplied

I received a paper that includes the R files for its empirical results. However, I am running into problems when trying to run the code:
data <- vni$R[198:length(vni$R)]; date <- vni$Date[198:length(vni$R)]
l <- length(data)
rw_length <- 52 # 52 weeks (~ 1 year)
bound <- vector()
avr <- vector()
for (i in (rw_length + 1):l) {
  AVR.test <- AutoBoot.test(data[(i - rw_length):i], nboot = 2000, "Normal", c(0.025, 0.975))
  bound <- append(bound, AVR.test$CI.stat)
  avr <- append(avr, AVR.test$test.stat)
}
lower <- bound[seq(1, length(bound), 2)]
upper <- bound[seq(2, length(bound), 2)]
results <- matrix(c(date[(rw_length+1):l], data[(rw_length+1):l], avr, upper, lower), ncol = 5, dimnames = list(c(), c("Date", "Return", "AVR", "Upper", "Lower")))
And I get the following error:
Error in as.Date.numeric(e) : 'origin' must be supplied
for the line:
results <- matrix(c(date[(rw_length+1):l], data[(rw_length+1):l], avr, upper, lower), ncol = 5, dimnames = list(c(), c("Date", "Return", "AVR", "Upper", "Lower")))
My dataset is:
Date P R
1 2001-03-23 259.60 0.0000000000
2 2001-03-30 269.30 0.0366840150
3 2001-04-06 284.69 0.0555748690
4 2001-04-13 300.36 0.0535808860
5 2001-04-20 317.76 0.0563146260
...
935 2019-02-15 950.89 0.0454163960
936 2019-02-22 988.91 0.0392049380
937 2019-03-01 979.63 -0.0094283770
Could you please help me with this issue?
Thanks a lot!
Everything in a matrix must be the same class. This often comes up when a string lands among numbers:
m <- matrix(0, nrow = 2, ncol = 2)
m
#      [,1] [,2]
# [1,]    0    0
# [2,]    0    0
m[1] <- "a"
m
#      [,1] [,2]
# [1,] "a"  "0"
# [2,] "0"  "0"
In this case, you have Date (first column) and numeric (all the others? no idea what AutoBoot is). Because R coerces from the least-complex class to the most-complex one (here from numeric to Date), the non-Date values are being converted:
matrix(c(Sys.Date(), 1.1))
# Error in as.Date.numeric(e) : 'origin' must be supplied
Trying to store this in a matrix is therefore fundamentally flawed. If you want to keep a Date object among numbers, you have two options:
1. Store it in a data.frame, where each column can have its own class.
2. Pre-convert the "Date" column to numeric and store it as a number. If/when you need the dates to be of class Date again, convert back with as.Date(..., origin = "1970-01-01").
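A minimal sketch of both options, with made-up stand-ins for the real vectors:
# Option 1: a data.frame keeps each column's class
date  <- as.Date(c("2005-01-07", "2005-01-14"))
ret   <- c(0.012, -0.004); avr <- c(1.1, 0.9)
upper <- c(1.96, 1.96);    lower <- c(-1.8, -1.9)
results <- data.frame(Date = date, Return = ret, AVR = avr,
                      Upper = upper, Lower = lower)
str(results)  # Date stays a Date, the numeric columns stay numeric
# Option 2: store dates as numbers in the matrix, convert back later
m <- cbind(Date = as.numeric(date), Return = ret)
as.Date(m[, "Date"], origin = "1970-01-01")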

Turning a weighted edgelist into an unweighted one in R

I have data on every interaction that could and did happen at a university club's weekly social hour:
id1 id2 timestalked date
1 2 1 1/1/2010
1 3 0 1/1/2010
...
100 2 4 1/8/2010
...
I want to first load this in as a directed graph for the entire time period, for visualization. For the weighted matrix I did:
library(igraph)
el <- read.csv("el.csv", header = TRUE)
G <- graph.data.frame(el, directed = TRUE)
A <- as_adjacency_matrix(G, type = "both", names = TRUE, sparse = FALSE, attr = "timestalked")
I thought removing attr = "timestalked" would turn the weights > 0 into 1, but that does not seem to work:
library(igraph)
el <- read.csv("el.csv", header = TRUE)
G_unweight <- graph.data.frame(el, directed = TRUE)
A_unweight <- as_adjacency_matrix(G_unweight, type = "both", names = TRUE, sparse = FALSE)
as_adjacency_matrix() doesn't provide an argument to control weights; without attr it just gives the number of edges between each pair of nodes. To turn the weighted edgelist into an unweighted one, try this:
A <- as_adjacency_matrix(G, type = "both", names = TRUE, sparse = FALSE)
A[A > 1] <- 1
Note that you can also use the graph_from_adjacency_matrix() function to create an unweighted igraph graph from the adjacency matrix by specifying weighted = NULL.
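A self-contained toy example of the binarization (data made up; note that rows with timestalked = 0 still create edges, so you may want to subset el[el$timestalked > 0, ] before building the graph):
library(igraph)
el <- data.frame(id1 = c(1, 1, 100), id2 = c(2, 3, 2),
                 timestalked = c(1, 0, 4))
G <- graph.data.frame(el, directed = TRUE)
# weighted adjacency matrix
as_adjacency_matrix(G, type = "both", sparse = FALSE, attr = "timestalked")
# unweighted: edge counts clamped to 0/1
A <- as_adjacency_matrix(G, type = "both", sparse = FALSE)
A[A > 1] <- 1
A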

Optimizing K-means clustering using Genetic Algorithm

I have the following dataset (obtained here):
----------item survivalpoints weight
1 pocketknife 10 1
2 beans 20 5
3 potatoes 15 10
4 unions 2 1
5 sleeping bag 30 7
6 rope 10 5
7 compass 30 1
I can cluster this dataset into three clusters with kmeans(), using a binary string as my initial choice of centers. For example:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
Using this fundamental concept, I tried it out with the GA package to conduct the search, where I am trying to optimize (minimize) the Davies-Bouldin (DB) index.
library(GA) ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
  ## converting matrix to vector to access each row
  binary_rep <- split(x, row(x))
  ## evaluate the fitness of each chromosome
  for (each in 1:nrow(x)) {
    cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
    dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
    ## minimizing db
    return(-dbi)
  }
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received the following error message:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can I correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contain the same number of 1s, corresponding to the k number of clusters (e.g. if k = 3 then the chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with ga, but I can point out issues in your fitness function. Also, kmeans() errors out when all genes are on or off, so fitness is only calculated when that is not the case:
DBI <- function(x) {
  if (sum(x) == nrow(dataset) | sum(x) == 0) {
    score <- -1e6  # heavy penalty: all genes on or off is not a valid set of centers
  } else {
    cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
    dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
    score <- -dbi$DB  # ga() maximises fitness, so negate to minimise the DB index
  }
  return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g@solution
g@fitnessValue
Looks like several gene combinations produced the same "best" fitness value.
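As for the second question (chromosomes with exactly k ones), one simple option is to reject wrong-sized chromosomes inside the fitness function; DBI_k and the -1e6 penalty below are illustrative, not part of the original answer:
k <- 3
DBI_k <- function(x) {
  # reject chromosomes that do not encode exactly k centers
  if (sum(x) != k) return(-1e6)
  cl  <- kmeans(dataset[, -1], dataset[x == 1, -1])
  dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
  -dbi$DB  # negated because ga() maximises
}
g_k <- ga(type = "binary", fitness = DBI_k, popSize = 100, nBits = nrow(dataset))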

How to write lp object to lp file?

I have been using lpSolve and lpSolveAPI. I build my constraint matrix, objective function, etc., feed them to the lp() function, and this works just fine. I want to save the problem as an LP file using write.lp() but am having trouble: I keep getting an error telling me that the object is not an lp object. Any ideas?
> x1 = lp(direction = "min", cost, A , ">=",r,,3:13, , , ,FALSE)
> class(x1)
[1] "lp"
> write.lp(x1, filename, type = "lp", use.names = c(TRUE, TRUE))
Error in write.lp(x1, filename, type = "lp", use.names = c(TRUE, TRUE)) :
  the lp argument does not appear to be a valid linear program record
I don't think you can mix these two packages (lpSolveAPI doesn't import or depend on lpSolve). Consider a simple LP in lpSolve:
library(lpSolve)
costs <- c(1, 2)
mat <- diag(2)
dirs <- rep(">=", 2)
rhs <- c(1, 1)
x1 = lp("min", costs, mat, dirs, rhs)
x1
# Success: the objective function is 3
Based on the project website for lpSolveAPI, you do the same thing with something like:
library(lpSolveAPI)
x2 = make.lp(0, ncol(mat))
set.objfn(x2, costs)
for (idx in 1:nrow(mat)) {
  add.constraint(x2, mat[idx, ], dirs[idx], rhs[idx])
}
Now, we can solve and observe the solution:
x2
# Model name:
# C1 C2
# Minimize 1 2
# R1 1 0 >= 1
# R2 0 1 >= 1
# Kind Std Std
# Type Real Real
# Upper Inf Inf
# Lower 0 0
solve(x2)
# [1] 0
get.objective(x2)
# [1] 3
get.variables(x2)
# [1] 1 1
Getting back to the question, we can now write it out to a file:
write.lp(x2, "myfile.lp")
Here's the contents of the file:
/* Objective function */
min: +C1 +2 C2;
/* Constraints */
R1: +C1 >= 1;
R2: +C2 >= 1;
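As a sanity check, lpSolveAPI can read the model back in with read.lp() and re-solve it:
x3 <- read.lp("myfile.lp", type = "lp")
solve(x3)           # 0 indicates success
get.objective(x3)   # 3, matching the original model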
