Jaccard similarity analysis - R

I'm trying to do a Jaccard similarity analysis in R.
But after processing, my result columns are NULL.
I've tried several approaches, but the problem remains.
Can anyone help me please?
####################### PROCESSING JACCARD ANALYSIS ###################
Jaccard = function (x, y) {
  # pairwise counts: both 1, only x, only y (0/0 pairs are ignored)
  M.11 = sum(x == 1 & y == 1)
  M.10 = sum(x == 1 & y == 0)
  M.01 = sum(x == 0 & y == 1)
  return (M.11 / (M.11 + M.10 + M.01))
}
input.variables = data.frame(Q6_01, Q6_02, Q6_03, Q6_04, Q6_05, Q6_06, Q6_07, Q6_08)
m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))
for (r in 1:length(input.variables)) {
for (c in 1:length(input.variables)) {
if (c == r) {
m[r,c] = 1
} else if (c > r) {
m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
}
}
}
variable.names = sapply(input.variables, attr, "label")
colnames(m) = variable.names
rownames(m) = variable.names
jaccards = m

If you are trying to calculate Jaccard similarity, I would suggest checking out the vegan package. The vegdist() function can calculate lots of different dissimilarity measures; see ?vegdist.
For Jaccard, you would use vegdist(your_dataframe_name, method = "jaccard"). Note that vegdist() returns dissimilarities, so the Jaccard similarity is 1 minus the result; for presence/absence data also set binary = TRUE.
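A minimal sketch of that call on a made-up 0/1 data frame (df and its columns are placeholders, not from the question):
library(vegan)
# toy presence/absence data; in the question this would be the Q6_* columns
df <- data.frame(Q6_01 = c(1, 0, 1, 1),
                 Q6_02 = c(1, 1, 0, 1),
                 Q6_03 = c(0, 1, 1, 0))
# vegdist() compares rows, so transpose to compare the variables (columns)
d <- vegdist(t(df), method = "jaccard", binary = TRUE)  # Jaccard dissimilarity
jaccard_similarity <- 1 - as.matrix(d)                  # similarity matrix between variables
jaccard_similarity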

Related

Trying to run GA in R, getting Error in if (any(x < 0)) { : missing value where TRUE/FALSE needed

I am relatively new to R and need to set up a genetic algorithm to find coefficients a and b of a formula (k^2 + a*k + b) that produces as many prime numbers as possible.
install.packages("GA")
install.packages("matlab")
library(GA)
library(matlab)
f <- function(x)
{
#initialize fitness score
score <- 0
#set test values for k
k <- seq(from = 1, to = 100,by = 1)
#test if the result of the formula (k^2 + ak + b) is a prime number using test k values
for (i in k) {
if (isprime(i ^ 2 + x[1] * i + x[2]) == 2) {
score = score + 1
}
}
#return fitness score
return(score)
}
lbound <- 2
ubound <- 1000
GA <- ga(type="real-valued",fitness=f,popSize = 10,pcrossover = 0.8,pmutation = 0.1, maxiter=30, run=20, lower = lbound, upper = ubound)
When I try to run the GA part, I get the following error:
> GA <- ga(type="real-valued",fitness=f,popSize = 10,pcrossover = 0.8,pmutation = 0.1, maxiter=30, run=20, lower = lbound, upper = ubound)
Error in if (any(x < 0)) { : missing value where TRUE/FALSE needed
Any suggestions for what I might be doing wrong?
Thank you
The error happens because the fitness function tries to access x[2], which doesn't exist: with scalar lower and upper bounds, ga() treats the problem as one-dimensional.
If you read the Rastrigin example in the GA package vignette, for two decision variables you need to 1. specify a function with two inputs and 2. use a wrapper around that function when passing it to ga():
f <- function(x1,x2)
# two variables
{
#initialize fitness score
score <- 0
#set test values for k
k <- seq(from = 1, to = 100,by = 1)
#test if the result of the formula (k^2 + ak + b) is a prime number using test k values
for (i in k) {
if (isprime(i ^ 2 + x1 * i + x2) == 2) {
score = score + 1
}
}
#return fitness score
return(score)
}
lbound <- 2
ubound <- 1000
GA <- ga(type="real-valued",
#the wrapper is here
fitness=function(x)f(round(x[1]),round(x[2])),
popSize = 10,
pcrossover = 0.8,pmutation = 0.1, maxiter=30,
run=20, lower = rep(lbound,2), upper = rep(ubound,2))
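Once the run finishes, the best (a, b) pair can be read off the returned object; a short usage note with standard GA-package accessors (not shown in the original answer):
summary(GA)          # overview of the run and the best fitness found
round(GA@solution)   # best (a, b), rounded to integers as in the fitness wrapper
plot(GA)             # best/mean fitness per generation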

R: How do I form a matrix by using a loop to repeat a function and bind each answer to the original matrix?

I am new to programming so my knowledge is very limited at the moment but I am always looking to improve.
I have this function called gillespied, shown below, which implements the Gillespie algorithm:
> print(gillespied)
function (N, T = 100, dt = 1, ...)
{
tt = 0
n = T%/%dt
x = N$M
S = t(N$Post - N$Pre)
u = nrow(S)
v = ncol(S)
xmat = matrix(ncol = u, nrow = n)
i = 1
target = 0
repeat {
h = N$h(x, tt, ...)
h0 = sum(h)
if (h0 < 1e-10)
tt = 1e+99
else tt = tt + rexp(1, h0)
while (tt >= target) {
xmat[i, ] = x
i = i + 1
target = target + dt
if (i > n)
return(ts(xmat, start = 0, deltat = dt))
}
j = sample(v, 1, prob = h)
x = x + S[, j]
}
}
And I use
out = gillespied(LV,T=100,dt=1)
I would like to create a matrix from the first column of this out result, then repeat the run a further 19 times (so I have 20 in total) and bind the first column of each result to my original matrix, giving me 20 columns of 100 values.
This is my attempt at it; I'm not sure whether it is correct, as R freezes when I try to view my matrix M:
M=matrix(out[,1],ncol=1)
for (i in 1:19) {
out = gillespied(LV, T=100, dt=1)
M = cbind(M,out[,1])
i = i+1
}
print(M)
I was wondering whether this is correct and, if not, what adjustments I should make.
You don't need to increment i; the for loop does this for you.
E.g.
M <- matrix(rnorm(5), ncol = 1)
for (i in 1:5){
out <- rnorm(5)
M <- cbind(M, out)
}
> M
out out out out out
[1,] 0.21701968 2.0296134 -0.26425755 0.3904337 0.1438060 -0.5340556
[2,] 0.07689991 -2.0589758 0.01443763 -0.7506177 -0.8498391 1.0487328
[3,] -0.73329583 0.2709055 0.42298869 0.3271687 1.0450811 -0.9313009
[4,] -1.68460070 0.2864797 -1.83408494 -0.2878682 -0.4297308 0.5282630
[5,] 0.08921503 -1.4390101 0.89112111 -1.6711018 -2.0863797 -0.6924083
Something like i = i+1 is usually used in a while loop. Otherwise your code appears to do what you want.
How big is your matrix? R may struggle to print it to the console. What happens if you try head(M)?
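If you'd rather not grow the matrix column by column inside a loop, the 20 runs can also be collected in one call; a minimal sketch, assuming gillespied() and LV are defined as in the question:
# each call returns a time series; keep its first column (first species)
M <- sapply(1:20, function(run) gillespied(LV, T = 100, dt = 1)[, 1])
dim(M)   # 100 rows (time points) x 20 columns (runs)
head(M)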

How to compute Ochiai distance matrix with pairwise deletion in R

I have a presence/absence dataset and need to calculate an Ochiai distance matrix with pairwise deletion of missing values. What is the simplest way to do this?
I can use designdist from the vegan package to generate a matrix, but I'm not sure what it is doing with the missing values. If they are coded as "?" it produces a result, but if coded as "NA" then it produces a matrix of all NAs. In vegdist you can specify whether you want pairwise deletion, but you can't implement the Ochiai coefficient. None of the other distance matrix functions in other packages offer this combination as far as I can tell. Any ideas?
Cheers,
James
This could be implemented in vegan::designdist(), but with the current design only for terms="minimum". Binary data should be handled with 0/1 transformation of the input either in straight R or using decostand(..., "pa"). The following changes would do this in vegan::designdist():
--- a/R/designdist.R
+++ b/R/designdist.R
@@ -1,7 +1,7 @@
`designdist` <-
function (x, method = "(A+B-2*J)/(A+B)",
terms = c("binary", "quadratic", "minimum"),
- abcd = FALSE, alphagamma = FALSE, name)
+ abcd = FALSE, alphagamma = FALSE, name, na.rm = FALSE)
{
terms <- match.arg(terms)
if ((abcd || alphagamma) && terms != "binary")
@@ -9,13 +9,16 @@
x <- as.matrix(x)
N <- nrow(x)
P <- ncol(x)
+ ## check NA
+ if (na.rm && terms != "minimum" && any(is.na(x)))
+ stop("'na.rm = TRUE' can only be used with 'terms = \"minimum\"\' ")
if (terms == "binary")
x <- ifelse(x > 0, 1, 0)
if (terms == "binary" || terms == "quadratic")
x <- tcrossprod(x)
if (terms == "minimum") {
- r <- rowSums(x)
- x <- dist(x, "manhattan")
+ r <- rowSums(x, na.rm = na.rm)
+ x <- vegdist(x, "manhattan", na.rm = na.rm)
x <- (outer(r, r, "+") - as.matrix(x))/2
}
d <- diag(x)
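If patching vegan is not an option, the Ochiai distance with pairwise deletion can also be computed directly. A small sketch (my own helper, not part of vegan): for two 0/1 vectors the Ochiai similarity is a / sqrt((a + b) * (a + c)), with a, b, c counted only over positions where neither value is NA, and the distance is 1 minus that.
# Ochiai distance (1 - similarity) with pairwise deletion of NAs
# x: a samples-by-species 0/1 matrix or data frame, possibly containing NA
ochiai_dist <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ok  <- !is.na(x[i, ]) & !is.na(x[j, ])       # pairwise deletion
      n11 <- sum(x[i, ok] == 1 & x[j, ok] == 1)    # a: shared presences
      n10 <- sum(x[i, ok] == 1 & x[j, ok] == 0)    # b
      n01 <- sum(x[i, ok] == 0 & x[j, ok] == 1)    # c
      sim <- if ((n11 + n10) == 0 || (n11 + n01) == 0) 0 else
        n11 / sqrt((n11 + n10) * (n11 + n01))
      d[i, j] <- d[j, i] <- 1 - sim
    }
  }
  as.dist(d)
}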

Inverse Association Rules

Association rules are a very common technique when you want to figure out which events happen together (like burgers and bread often selling together). In marketing this technique is used to find complementary products.
I am looking for a technique to extract "substitute products"; to me it is like inverse association rules, finding out which events are less likely to happen together. Is there any algorithm or technique available in Spark, R, Python, etc. for this?
Thanks,
Amir
I've written a fairly practical R implementation of Substitution Rule Mining following Teng, Hsieh and Chen (2002). Maybe it can help you:
# Used packages:
library(arules)
SRM <- function(TransData, MinSup, MinConf, pMin, pChi, itemLabel, nTID){
# Packages ----------------------------------------------------------------
if (sum(search() %in% "package:arules") == 0) {
stop("Please load package arules")
}
# Checking Input data -----------------------------------------------------
if (missing(TransData)) {
stop("Transaction data is missing")
}
if (is.numeric(nTID) == FALSE) {
stop("nTID has to be one numeric number for the count of
Transactions")
}
if (length(nTID) > 1) {
stop("nTID has to be one number for the count of Transactions")
}
if (is.character(itemLabel) == FALSE) {
stop("itemLabel has to be a character")
}
# Concrete Item sets ---------------------------------------------------
# adding complements to transaction data
compl_trans <- addComplement(TransData,labels = itemLabel)
compl_tab <- crossTable(compl_trans,"support")
compl_tab_D <- as.data.frame(compl_tab)
# ordering matrix
compl_tab_D <- compl_tab_D[order(rownames((compl_tab))),order(colnames((compl_tab)))]
# Chi Value ---------------------------------------------------------------
# empty data frame for loop
complement_data <- data.frame(Chi = as.numeric(),
Sup_X.Y = as.numeric(),
X = as.character(),
Sup_X = as.numeric(),
Y = as.character(),
Sup_Y = as.numeric(),
CX = as.character(),
SupCX = as.numeric(),
CY = as.character(),
Sup_CY = as.numeric(),
Conf_X.CY = as.numeric(),
Sup_X.CY = as.numeric(),
Conf_Y.CX = as.numeric(),
SupY_CX = as.numeric())
# first loop for one item
for ( i in 1 : (length(itemLabel) - 1)) {
# second loop combines it with all other items
for (u in (i + 1) : length(itemLabel)) {
# getting chi value from Teng
a <- itemLabel[i]
b <- itemLabel[u]
ca <- paste0("!", itemLabel[i])
cb <- paste0("!", itemLabel[u])
chiValue <- nTID * (
compl_tab[ca, cb] ^ 2 / (compl_tab[ca, ca] * compl_tab[cb, cb]) +
compl_tab[ca, b] ^ 2 / (compl_tab[ca, ca] * compl_tab[b, b]) +
compl_tab[a, cb] ^ 2 / (compl_tab[a, a] * compl_tab[cb, cb]) +
compl_tab[a, b] ^ 2 / (compl_tab[a, a] * compl_tab[b, b]) - 1)
# condition to be dependent
if (compl_tab[a, b] > compl_tab[a, a] * compl_tab[b, b] && chiValue >= qchisq(pChi, 1) &&
compl_tab[a, a] >= MinSup && compl_tab[b, b] >= MinSup ) {
chi_sup <- data.frame(Chi = chiValue,
Sup_X.Y = compl_tab[a, b],
X = a,
Sup_X = compl_tab[a, a],
Y = b,
Sup_Y = compl_tab[b, b],
CX = ca,
SupCX = compl_tab[ca, ca],
CY = cb,
Sup_CY = compl_tab[cb, cb],
Conf_X.CY = compl_tab[a, cb] / compl_tab[a, a],
Sup_X.CY = compl_tab[a, cb],
Conf_Y.CX = compl_tab[ca, b] / compl_tab[b, b],
SupY_CX = compl_tab[ca, b])
try(complement_data <- rbind(complement_data, chi_sup))
}
}
}
if (nrow(complement_data) == 0) {
stop("No complement item sets could have been found")
}
# changing mode of X and Y to character
complement_data$X <- as.character(complement_data$X)
complement_data$Y <- as.character(complement_data$Y)
# calculating support for concrete itemsets with all others and their complements -------------------
## with complements
matrix_trans <- as.data.frame(as(compl_trans, "matrix"))
sup_three <- data.frame(Items = as.character(),
Support = as.numeric())
setCompl <- names(matrix_trans)
# 1. extract all items that are not in the current itemset or its complements
for (i in 1 : nrow(complement_data)) {
value <- setCompl[ !setCompl %in% c(complement_data$X[i],
complement_data$Y[i],
paste0("!", complement_data$X[i]),
paste0("!",complement_data$Y[i]))]
# 2. calculation of support
for (u in value) {
count <- sum(rowSums(matrix_trans[, c(complement_data$X[i], complement_data$Y[i], u )]) == 3)
sup <- count / nTID
sup_three_items <- data.frame(Items = paste0(complement_data$X[i], complement_data$Y[i], u),
Support=sup)
sup_three <- rbind(sup_three, sup_three_items)
}
}
# Correlation of single items-------------------------------------------------------------
# all items of concrete itemsets should be mixed for correlation
combis <- unique(c(complement_data$X, complement_data$Y))
# empty object
rules<- data.frame(
Substitute = as.character(),
Product = as.character(),
Support = as.numeric(),
Confidence = as.numeric(),
Correlation = as.numeric())
# first loop for one item
for (i in 1 : (length(combis) - 1)) {
# second loop combines it with all other items
for (u in (i + 1) : length(combis)) {
first <- combis[i]
second <- combis[u]
corXY <- (compl_tab[first, second] - (compl_tab[first, first] * compl_tab[second, second])) /
(sqrt((compl_tab[first, first] * (1 - compl_tab[first,first])) *
(compl_tab[second, second] * (1 - compl_tab[second, second]))))
# confidence
conf1 <- compl_tab[first, paste0("!", second)] / compl_tab[first, first]
conf2 <- compl_tab[second, paste0("!", first)] / compl_tab[second, second]
two_rules <- data.frame(
Substitute = c(paste("{", first, "}"),
paste("{", second, "}")),
Product = c(paste("=>", "{", second, "}"),
paste("=>", "{", first, "}")),
Support = c(compl_tab[first, paste0("!", second)], compl_tab[second, paste0("!", first)]),
Confidence = c(conf1, conf2),
Correlation = c(corXY, corXY)
)
# conditions
try({
if (two_rules$Correlation[1] < pMin) {
if (two_rules$Support[1] >= MinSup && two_rules$Confidence[1] >= MinConf) {
rules <- rbind(rules, two_rules[1, ])
}
if (two_rules$Support[2] >= MinSup && two_rules$Confidence[2] >= MinConf) {
rules <- rbind(rules, two_rules[2, ])
}
} })
}
}
# Correlation of concrete item pairs with single items --------------------
# adding variable for loop
complement_data$XY <- paste0(complement_data$X, complement_data$Y)
# combination of items
for (i in 1 : nrow(complement_data)){
# set of combinations from dependent items with single items
univector <- c(as.vector(unique(complement_data$X)), as.vector(unique(complement_data$Y)))
univector <- univector[!univector %in% c(complement_data$X[i], complement_data$Y[i])]
combis <- c(complement_data[i,"XY"], univector)
for (u in 2 : length(combis)) {
corXYZ <-(sup_three[sup_three$Items == paste0(combis[1], combis[u]),2] -
complement_data[complement_data$XY == combis[1],"Sup_X.Y"] *
compl_tab[combis[u],combis[u]]) /
(sqrt((complement_data[complement_data$XY == combis[1],"Sup_X.Y"] *
(1 - complement_data[complement_data$XY == combis[1],"Sup_X.Y"]) *
compl_tab[combis[u],combis[u]] * (1 - compl_tab[combis[u],combis[u]]))))
dataXYZ <- data.frame(
Substitute = paste("{", combis[1], "}"),
Product = paste("=>", "{", combis[u], "}"),
Support = sup_three[sup_three$Items == paste0(combis[1], "!", combis[u]),2],
Confidence = sup_three[sup_three$Items == paste0(combis[1], "!", combis[u]),2] /
complement_data[complement_data$XY == combis[1],"Sup_X.Y"],
Correlation = corXYZ)
# conditions
if (dataXYZ$Correlation < pMin && dataXYZ$Support >= MinSup && dataXYZ$Confidence >= MinConf) {
try(rules <- rbind(rules, dataXYZ))
}
}
}
if (nrow(rules) == 0) {
message("Sorry no rules could have been calculated. Maybe change input conditions.")
} else {
return(rules)
}
# end
}
I think a better explanation is in my blog:
http://mattimeyer.github.io/2016-12-21-Substitution-Rule-Mining/
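As a lighter-weight alternative (my own sketch, not part of the answer above), you can also screen for negatively associated item pairs directly in arules: mine all pairwise rules and keep those with lift below 1, i.e. pairs that co-occur less often than expected under independence.
library(arules)
data("Groceries")  # example dataset shipped with arules, used here only for illustration
# mine all pairwise rules; low thresholds so negatively associated pairs survive
pair_rules <- apriori(Groceries,
                      parameter = list(supp = 0.001, conf = 0.01, minlen = 2, maxlen = 2))
# lift < 1 means the two items appear together less often than expected,
# which is the usual first screen for substitution candidates
substitutes <- subset(pair_rules, lift < 1)
inspect(head(sort(substitutes, by = "lift", decreasing = FALSE), 10))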

fminsearch in R is worse than in Matlab

Here is my data (the x and y columns are the relevant ones):
https://www.dropbox.com/s/b61a7enhoa0p57p/Simple1.csv
What I need is to fit the data with a polyline (piecewise linear function). The Matlab code that does this is:
spline_fit.m:
function [score, params] = spline_fit (points, x, y)
min_f = min(x)-1;
max_f = max(x);
points = [min_f points max_f];
params = zeros(length(points)-1, 2);
score = 0;
for i = 1:length(points)-1
in = (x > points(i)) & (x <= points(i+1));
if sum(in) > 2
p = polyfit(x(in), y(in), 1);
pred = p(1)*x(in) + p(2);
score = score + norm(pred - y(in));
params(i, :) = p;
else
params(i, :) = nan;
end
end
test.m:
%Find the parameters
r = [100,250,400];
p = fminsearch('spline_fit', r, [], x, y)
[score, param] = spline_fit(p, x, y)
%Plot the result
y1 = zeros(size(x));
p1 = [-inf, p, inf];
for i = 1:size(param, 1)
in = (x > p1(i)) & (x <= p1(i+1));
y1(in) = x(in)*param(i,1) + param(i,2);
end
[x1, I] = sort(x);
y1 = y1(I);
plot(x,y,'x',x1,y1,'k','LineWidth', 2)
And this works fine, producing the following optimized breakpoints: [102.9842, 191.0006, 421.9912]
I've implemented the same idea in R:
library(pracma);
spline_fit <- function(x, xx, yy) {
min_f = min(xx)-1;
max_f = max(xx);
points = c(min_f, x, max_f)
params = array(0, c(length(points)-1, 2));
score = 0;
for( i in 1:length(points)-1)
{
inn <- (xx > points[i]) & (xx <= points[i+1]);
if (sum(inn) > 2)
{
p <- polyfit(xx[inn], yy[inn], 1);
pred <- p[1]*xx[inn] + p[2];
score <- score + norm(as.matrix(pred - yy[inn]),"F");
params[i,] <- p;
}
else
params[i,] <- NA;
}
score
}
But I get very bad results:
> fminsearch(spline_fit,c(100,250,400), xx = Simple1$x, yy = Simple1$y)
$xval
[1] 100.1667 250.0000 400.0000
$fval
[1] 4452.761
$niter
[1] 2
As you can see, it stops after 2 iterations and doesn't produce good points.
I'll be very glad for any help in resolving this issue.
Also, if anyone knows how to implement this in C# using any free library, that would be even better. I know where to get polyfit, but not fminsearch.
The problem here is that the likelihood surface is very badly behaved -- there are both multiple minima and discontinuous jumps -- which will make the results you get with different optimizers almost arbitrary. I will admit that MATLAB's optimizers are remarkably robust, but I would say that it's pretty much a matter of chance (and where you start) whether an optimizer will get to the global minimum for this case, unless you use some form of stochastic global optimization such as simulated annealing.
I chose to use R's built-in optimizer (which uses Nelder-Mead by default) rather than fminsearch from the pracma package.
spline_fit <- function(x, xx = Simple1$x, yy=Simple1$y) {
min_f = min(xx)-1
max_f = max(xx)
points = c(min_f, x, max_f)
params = array(0, c(length(points)-1, 2))
score = 0
for( i in 1:(length(points)-1))
{
inn <- (xx > points[i]) & (xx <= points[i+1]);
if (sum(inn) > 2)
{
p <- polyfit(xx[inn], yy[inn], 1);
pred <- p[1]*xx[inn] + p[2];
score <- score + norm(as.matrix(pred - yy[inn]),"F");
params[i,] <- p;
}
else
params[i,] <- NA;
}
score
}
library(pracma) ## for polyfit
Simple1 <- read.csv("Simple1.csv")
opt1 <- optim(fn=spline_fit, c(100,250,400), xx = Simple1$x, yy = Simple1$y)
opt1$par
## [1] 102.4365 201.5835 422.2503
This is better than the fminsearch results, but still different from the MATLAB results, and worse than them:
## Matlab results:
matlab_fit <- c(102.9842, 191.0006, 421.9912)
spline_fit(matlab_fit, xx = Simple1$x, yy = Simple1$y)
## 3724.3
opt1$val
## 3755.5 (worse)
The bbmle package offers an experimental/not very well documented set of tools for exploring optimization surfaces:
library(bbmle)
ss <- slice2D(fun=spline_fit,opt1$par,nt=51)
library(lattice)
A 2D "slice" around the optim-estimated parameters. The circles show the optim fit (solid) and the minimum value within each slice (open).
png("splom1.png")
print(splom(ss))
dev.off()
A 'slice' between the matlab and optim fits shows that the surface is quite rugged:
ss2 <- bbmle:::slicetrans(matlab_fit,opt1$par,spline_fit)
png("slice1.png")
print(plot(ss2))
dev.off()
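Given how rugged the surface is, it can also be worth trying a stochastic pass before polishing with Nelder-Mead; a sketch, assuming Simple1 and the spline_fit() defined above:
set.seed(1)
# simulated-annealing pass (optim's SANN method), then a Nelder-Mead polish from its result
sann <- optim(par = c(100, 250, 400), fn = spline_fit,
              xx = Simple1$x, yy = Simple1$y,
              method = "SANN", control = list(maxit = 5000))
polish <- optim(par = sann$par, fn = spline_fit,
                xx = Simple1$x, yy = Simple1$y)
polish$par    # candidate breakpoints
polish$value  # corresponding score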
