How to compute Ochiai distance matrix with pairwise deletion in R - r

I have a presence/absence dataset and need to calculate an Ochiai distance matrix with pairwise deletion of missing values. What is the simplest way to do this?
I can use designdist from the vegan package to generate a matrix, but I am not sure what it does with the missing values. If they are coded as "?" it produces a result, but if they are coded as NA it produces a matrix of all NAs. In vegdist you can specify pairwise deletion, but you can't use the Ochiai coefficient. As far as I can tell, none of the distance matrix functions in other packages offer this combination. Any ideas?
Cheers,
James

This could be implemented in vegan::designdist(), but with the current design only for terms="minimum". Binary data should be handled with 0/1 transformation of the input either in straight R or using decostand(..., "pa"). The following changes would do this in vegan::designdist():
--- a/R/designdist.R
+++ b/R/designdist.R
@@ -1,7 +1,7 @@
`designdist` <-
function (x, method = "(A+B-2*J)/(A+B)",
terms = c("binary", "quadratic", "minimum"),
- abcd = FALSE, alphagamma = FALSE, name)
+ abcd = FALSE, alphagamma = FALSE, name, na.rm = FALSE)
{
terms <- match.arg(terms)
if ((abcd || alphagamma) && terms != "binary")
@@ -9,13 +9,16 @@
x <- as.matrix(x)
N <- nrow(x)
P <- ncol(x)
+ ## check NA
+ if (na.rm && terms != "minimum" && any(is.na(x)))
+ stop("'na.rm = TRUE' can only be used with 'terms = \"minimum\"\' ")
if (terms == "binary")
x <- ifelse(x > 0, 1, 0)
if (terms == "binary" || terms == "quadratic")
x <- tcrossprod(x)
if (terms == "minimum") {
- r <- rowSums(x)
- x <- dist(x, "manhattan")
+ r <- rowSums(x, na.rm = na.rm)
+ x <- vegdist(x, "manhattan", na.rm = na.rm)
x <- (outer(r, r, "+") - as.matrix(x))/2
}
d <- diag(x)
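For small data sets, pairwise deletion can also be sketched directly in base R, without patching vegan. This is only an illustration: the function name ochiai_pairwise is hypothetical, and it assumes the Ochiai distance is taken as 1 - J/sqrt(A*B), with J, A and B counted only over the columns observed in both rows.

```r
# Ochiai distance with pairwise deletion of NAs (illustrative sketch).
# x: sites-by-species matrix; values > 0 count as presence, NA as missing.
ochiai_pairwise <- function(x) {
  x <- as.matrix(x)
  x <- ifelse(x > 0, 1, 0)                 # NA stays NA
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ok <- !is.na(x[i, ]) & !is.na(x[j, ])  # columns observed in both rows
      a  <- sum(x[i, ok])                    # presences in row i
      b  <- sum(x[j, ok])                    # presences in row j
      jj <- sum(x[i, ok] * x[j, ok])         # shared presences
      d[i, j] <- d[j, i] <- 1 - jj / sqrt(a * b)
    }
  }
  as.dist(d)
}

x <- rbind(s1 = c(1, 1, 0, NA),
           s2 = c(1, 0, 1, 1),
           s3 = c(0, 1, 1, 1))
res <- ochiai_pairwise(x)
res
```

A pair in which one row has no presences among the shared columns yields NaN, since the coefficient is undefined there.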

Related

Genetic algorithm (GA) to select the optimal n values of a vector

I have to choose 10 elements of a vector to maximize a function. Since the vector is pretty long, there are too many possibilities (~1000 choose 10) to compute them all. So I started to look into the GA package to use a genetic algorithm.
I came up with this MWE:
values <- 1:1000
# Fitness function which I want to maximise
f <- function(x){
# Choose values
y <- values[x]
# From the first 10 sum up the odd values.
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
# Maximum value of f for this example
y <- ifelse(values %% 2 != 0, values, 0)
sum(sort(y, decreasing = TRUE)[1:10])
# [1] 9900
# Genetic algorithm
GA <- ga(type = "permutation", fitness = f, lower = rep(1, 10), upper = rep(1000, 10), maxiter = 100)
summary(GA)
The results are a bit underwhelming. From summary(GA), I get the feeling that the algorithm always permutes all 1000 values (the solution goes from x1 to x1000), which leads to an inefficient optimization. How can I tell the algorithm that it should only use 10 values (so the solution is x1 .. x10)?
You should read https://www.jstatsoft.org/article/view/v053i04. You don't have a permutation problem but a selection problem, hence you should use the binary type of genetic algorithm. Because you want to select exactly 10 values (10 ones and 990 zeroes), you should probably write your own genetic operators, because that constraint will hardly ever be satisfied by the default operators (even if the fitness function returns -Inf when there are more than 10 ones). One approach:
Population (k tells how many ones you want):
myInit <- function(k){
  function(GA){
    m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
    for(i in seq_len(GA@popSize))
      m[i, sample(GA@nBits, k)] <- 1
    m
  }
}
Crossover
# Requires library(magrittr) for the %>% pipe
myCrossover <- function(GA, parents){
  # positions of the ones in each parent, one parent per row
  parents <- GA@population[parents,] %>%
    apply(1, function(x) which(x == 1)) %>%
    t()
  # positions that one parent has and the other lacks
  parents_diff <- vector("list", 2)
  parents_diff[[1]] <- setdiff(parents[2,], parents[1,])
  parents_diff[[2]] <- setdiff(parents[1,], parents[2,])
  children_ind <- vector("list", 2)
  for(i in 1:2){
    k <- length(parents_diff[[i]])
    change_k <- sample(k, sample(ceiling(k/2), 1))
    # swap some of the parent's positions for positions of the other parent
    children_ind[[i]] <- if(length(change_k) > 0){
      c(parents[i, -change_k], parents_diff[[i]][change_k])
    } else {
      parents[i,]
    }
  }
  children <- matrix(0, nrow = 2, ncol = GA@nBits)
  for(i in 1:2)
    children[i, children_ind[[i]]] <- 1
  list(children = children, fitness = c(NA, NA))
}
Mutation
myMutation <- function(GA, parent){
  ind <- which(GA@population[parent,] == 1)
  n_change <- sample(3, 1)
  ind[sample(length(ind), n_change)] <- sample(setdiff(seq_len(GA@nBits), ind), n_change)
  parent <- integer(GA@nBits)
  parent[ind] <- 1
  parent
}
Fitness (your function adapted for binary GA):
f <- function(x, values){
ind <- which(x == 1)
y <- values[ind]
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
GA:
GA <- ga(
type = "binary",
fitness = f,
values = values,
nBits = length(values),
population = myInit(10),
crossover = myCrossover,
mutation = myMutation,
run = 300,
pmutation = 0.3,
maxiter = 10000,
popSize = 100
)
Chosen values
values[which(GA@solution[1,] == 1)]
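The binary encoding and the -Inf penalty mentioned in the answer can be checked in base R, independently of the GA package. This is only a sketch: the helper name penalised_fitness is hypothetical and is not part of the answer's code.

```r
values <- 1:1000
k <- 10

# Fitness for a 0/1 selection vector: the sum of the odd selected values,
# or -Inf when the vector does not contain exactly k ones.
penalised_fitness <- function(x, values, k) {
  if (sum(x) != k) return(-Inf)
  y <- values[x == 1]
  sum(y[y %% 2 != 0])
}

# Best possible selection for this toy problem: the ten largest odd values.
best <- integer(length(values))
best[seq(981, 999, by = 2)] <- 1L
penalised_fitness(best, values, k)
# [1] 9900

# One extra bit set violates the constraint.
too_many <- best
too_many[1] <- 1L
penalised_fitness(too_many, values, k)
# [1] -Inf
```

A random binary vector of length 1000 almost never has exactly 10 ones, so the penalty alone leaves almost every candidate at -Inf. That is why the answer uses operators that preserve the number of ones.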

Faulty NMI implementation in R?

#calculate NMI(c,t) c : cluster assignment , t : ground truth
NMI <- function(c,t){
n <- length(c) # = length(t)
r <- length(unique(c))
g <- length(unique(t))
N <- matrix(0,nrow = r , ncol = g)
for(i in 1:r){
for (j in 1:g){
N[i,j] = sum(t[c == i] == j)
}
}
N_t <- colSums(N)
N_c <- rowSums(N)
B <- (1/n)*log(t( t( (n*N) / N_c ) / N_t))
W <- B*N
I <- sum(W,na.rm = T)
H_c <- sum((1/n)*(N_c * log(N_c/n)) , na.rm = T)
H_t <- sum((1/n)*(N_t * log(N_t/n)) , na.rm = T)
nmi <- I/sqrt(H_c * H_t)
return (nmi)
}
Running this on some clustering benchmarks gives me a value of the normalized mutual information. But when I compare it with the values of NMI obtained from the aricode library, the values generally differ in the second decimal place.
I would be grateful if someone could pinpoint any error that has crept into this code.
I am including a test case for this using a synthetic case :
library(aricode)
c <- c(1,1,2,2,2,3,3,3,3,4,4,4)
t <- c(1,2,2,2,3,4,3,3,3,4,4,2)
print(aricode::NMI(c , t)) #0.489574
print(NMI(c,t)) #0.5030771
This might be very late for an answer but for the sake of posterity:
The difference is in the way you and the aricode package normalise the index. You divide by sqrt() whereas aricode offers the following options:
function (c1, c2, variant = c("max", "min", "sqrt", "sum", "joint"))
so if you select variant = "sqrt" you should hopefully get the same answer.
The NMI package uses sum.
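The effect of the normalisation can be reproduced in base R with the test case from the question. This is a sketch, not aricode's implementation: the function name nmi_variants is hypothetical, and the "sum" variant is assumed to mean 2*I/(H_c + H_t).

```r
# Mutual information divided by different normalisers (illustrative sketch).
nmi_variants <- function(cl, truth) {
  p  <- table(cl, truth) / length(cl)       # joint distribution
  pc <- rowSums(p)                          # marginal of the clustering
  pt <- colSums(p)                          # marginal of the ground truth
  entropy <- function(q) -sum(q[q > 0] * log(q[q > 0]))
  indep <- outer(pc, pt)                    # joint under independence
  nz <- p > 0
  mi <- sum(p[nz] * log(p[nz] / indep[nz]))
  hc <- entropy(pc)
  ht <- entropy(pt)
  c(max  = mi / max(hc, ht),
    min  = mi / min(hc, ht),
    sqrt = mi / sqrt(hc * ht),
    sum  = 2 * mi / (hc + ht))
}

cl    <- c(1,1,2,2,2,3,3,3,3,4,4,4)
truth <- c(1,2,2,2,3,4,3,3,3,4,4,2)
res <- nmi_variants(cl, truth)
round(res[c("max", "sqrt")], 4)
#    max   sqrt 
# 0.4896 0.5031
```

The "max" value matches the aricode result quoted in the question, and the "sqrt" value matches the result of the hand-written function.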

How to implement the jaccard distance in kproto function

I am trying to implement the Jaccard distance in the kproto function (package clustMixType in R), but without any success. The aim is to do a cluster analysis of my dataset.
The Jaccard distance that I want to use is the complement of the Jaccard similarity coefficient, so
distance of jaccard = 1-[a/(a+b+c)] = [(b+c)/(a+b+c)], or
distance of jaccard = 1-[M11/(M01+M10+M11)] = [(M01+M10)/(M01+M10+M11)].
The source code of the kproto function is presented below. The variable d1 is the Euclidean distance for the numeric variables, and the variable d2 is the distance from the simple matching coefficient for the categorical variables (as factors).
It computes the distances between the observations and the prototypes, not between observations. The prototypes are computed and are not necessarily observations from the data set itself.
So my two questions are:
1) d2 is what I want to modify, but how?
2) should d1 be the sqrt of what is being calculated?
Thank you for all your help. It will be much appreciated.
Here is an excerpt of the dataset I'm working on, where V1 to V4 are factor (binary) variables (to use the Jaccard distance) and V5 to V8 are numeric variables (to use the Euclidean distance):
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
1;1;0;0;1;3;2;1
1;1;0;0;4;3;3;1
1;1;1;0;1;4;1;1
1;0;1;0;4;3;1;1
1;1;0;0;2;4;2;1
1;1;0;0;2;4;2;1
1;1;0;0;6;2;1;1
1;1;0;0;6;2;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;3;1
1;1;0;0;4;4;2;1
1;1;0;0;4;3;3;1
1;1;0;0;4;3;3;1
1;1;0;0;3;4;2;1
1;1;0;0;4;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;4;1
1;1;0;0;4;3;2;1
1;1;0;0;4;3;2;1
1;1;0;0;3;3;2;1
1;1;0;0;3;3;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;1;0;0;3;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;0;0;0;3;4;2;1
1;1;0;0;7;2;4;1
1;1;0;0;7;2;2;1
1;1;0;0;5;2;4;1
1;1;0;0;5;3;4;1
1;1;0;0;5;3;2;1
1;1;0;0;5;3;4;1
1;0;0;0;3;5;3;1
0;1;0;0;6;2;4;1
1;1;0;0;6;2;3;1
1;1;0;0;6;2;4;1
Let's take the first two observations from the dataset provided above as an example:
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
The algorithm first selects the k prototypes from the data set at random, so let's suppose that the second observation is one of the initial prototypes. As I understand it, the algorithm creates a data frame called "protos", initially filled with k random observations from the data set provided, so the second observation would be one of the rows of the "protos" data frame.
The combined distance used to cluster the observations would be d=d1+lambda*d2. Lambda can also be a vector of individual weights, one for each variable. d is the distance between the observations in the data set provided and the "protos" data frame created initially with k random observations.
In this case, considering the first two observations presented, the calculated distances between the observation (yi) and the prototype (yk) would be as follows:
Euclidean for the numeric variables (V5 to V8):
d1=sum[(yij-ykj)^2]^0.5
where,
k=1 to k clusters
i=1 to n observations
j=5 to 8 th variable
d1=[[(6-3)^2]+[(2-5)^2]+[(3-2)^2]+[(3-1)^2]]^0.5
d1=[9+9+1+4]^0.5
d1=4.796
Jaccard, for the set of binary variables (V1 to V4):
d2=[(b+c)/(a+b+c)]
where,
a=1
b=1
c=1
are the match counts between the observation and the prototype for variables 1 to 4 (a: both are 1; b and c: only one of the two is 1).
d2=[(1+1)/(1+1+1)]
d2=2/3
d2=0.667
So the combined distance between this specific observation and the initial prototype of that cluster is:
d=d1+d2
d=4.796+0.667
d=5.463
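The hand calculation above can be reproduced in a few lines of base R. This is only a sketch of the arithmetic, not a modification of kproto: the helper name jaccard_dist is hypothetical, the binary variables are assumed to be coded as numeric 0/1 rather than factors, and a pair with no presences at all is assumed to have distance 0.

```r
obs   <- c(V1 = 1, V2 = 1, V3 = 0, V4 = 0, V5 = 6, V6 = 2, V7 = 3, V8 = 3)
proto <- c(V1 = 0, V2 = 1, V3 = 0, V4 = 1, V5 = 3, V6 = 5, V7 = 2, V8 = 1)
bin <- 1:4   # binary variables V1..V4
num <- 5:8   # numeric variables V5..V8

# Euclidean distance on the numeric part
d1 <- sqrt(sum((obs[num] - proto[num])^2))

# Jaccard distance (b + c) / (a + b + c) on the binary part
jaccard_dist <- function(x, p) {
  a  <- sum(x == 1 & p == 1)   # both present
  bc <- sum(x != p)            # present in exactly one
  if (a + bc == 0) 0 else bc / (a + bc)
}
d2 <- jaccard_dist(obs[bin], proto[bin])

round(c(d1 = d1, d2 = d2), 3)
#    d1    d2 
# 4.796 0.667
d <- d1 + d2   # combined distance with lambda = 1
```

The unrounded sum is about 5.4625; the 5.463 in the text comes from adding the two rounded parts.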
The results, as I understand it, are then stored row by row in a matrix called "dists", of size [number of rows = number of observations, number of columns = number of clusters k].
I'm expecting to correctly calculate the Euclidean and Jaccard distances by modifying the kproto function, while keeping the steps and results provided by the original function.
NOTE: the final function should work for any number of observations, variables and prototypes, and not only for my specific dataset.
I've also tried to mix the code from kproto (clustMixType package) and dist.binary (ade4 package), but they work in different ways.
#K-Prototypes algorithm
kproto.default <- function(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE, keep.data = TRUE, verbose = TRUE, ...){
# initial error checks
if(!is.data.frame(x)) stop("x should be a data frame!")
if(ncol(x) < 2) stop("For clustering x should contain at least two variables!")
if(iter.max < 1 | nstart < 1) stop("iter.max and nstart must not be specified < 1!")
if(!is.null(lambda)){
if(any(lambda < 0)) stop("lambda must be specified >= 0!")
if(!any(lambda > 0)) stop("lambda must be specified > 0 for at least one variable!")
}
# check for numeric and factor variables
numvars <- sapply(x, is.numeric)
anynum <- any(numvars)
catvars <- sapply(x, is.factor)
anyfact <- any(catvars)
if(!anynum) stop("\n No numeric variables in x! Try using kmodes() from package klaR...\n\n")
if(!anyfact) stop("\n No factor variables in x! Try using kmeans()...\n\n")
# treatment of missings
NAcount <- apply(x, 2, function(z) sum(is.na(z)))
if(verbose){
cat("# NAs in variables:\n")
print(NAcount)
}
if(any(NAcount == nrow(x))) stop(paste("Variable(s) have only NAs please remove them:",names(NAcount)[NAcount == nrow(x)],"!"))
if(na.rm) {
miss <- apply(x, 1, function(z) any(is.na(z)))
if(verbose){
cat(sum(miss), "observation(s) with NAs.\n")
if(sum(miss) > 0) message("Observations with NAs are removed.\n")
cat("\n")
}
x <- x[!miss,]
} # remove missings
if(!na.rm){
allNAs <- apply(x,1,function(z) all(is.na(z)))
if(sum(allNAs) > 0){
if(verbose) cat(sum(allNAs), "observation(s) where all variables NA.\n")
warning("No meaningful cluster assignment possible for observations where all variables NA.\n")
if(verbose) cat("\n")
}
}
if(nrow(x) == 1) stop("Only one observation clustering not meaningful.")
k_input <- k # store input k for nstart > 1 as clusters can be merged
# initialize prototypes
if(!is.data.frame(k)){
if (length(k) == 1){
if(as.integer(k) != k){k <- as.integer(k); warning(paste("k has been set to", k,"!"))}
if(nrow(x) < k) stop("Data frame has less observations than clusters!")
ids <- sample(nrow(x), k)
protos <- x[ids,]
}
if (length(k) > 1){
if(nrow(x) < length(k)) stop("Data frame has less observations than clusters!")
ids <- k
k <- length(ids)
if(length(unique(ids)) != length(ids)) stop("If k is specified as a vector it should contain different indices!")
if(any(ids<1)|any(ids>nrow(x))) stop("If k is specified as a vector all elements must be valid indices of x!")
#check for integer
protos <- x[ids,]
}
rm(ids)
}
if(is.data.frame(k)){
if(nrow(x) < nrow(k)) stop("Data frame has less observations than clusters!")
if(length(names(k)) != length(names(x))) stop("k and x have different numbers of columns!")
if(any(names(k) != names(x))) stop("k and x have different column names!")
if(anynum) {if( any(sapply(k, is.numeric) != numvars)) stop("Numeric variables of k and x do not match!")}
if(anyfact) {if( any(sapply(k, is.factor) != catvars)) stop("Factor variables of k and x do not match!")}
protos <- k
k <- nrow(protos)
}
if(k < 1) stop("Number of clusters k must not be smaller than 1!")
# automatic calculation of lambda
if(length(lambda) > 1) {if(length(lambda) != sum(c(numvars,catvars))) stop("If lambda is a vector, its length should be the sum of numeric and factor variables in the data frame!")}
if(is.null(lambda)){
if(anynum & anyfact){
vnum <- mean(sapply(x[,numvars, drop = FALSE], var, na.rm = TRUE))
vcat <- mean(sapply(x[,catvars, drop = FALSE], function(z) return(1-sum((table(z)/sum(!is.na(z)))^2))))
if (vnum == 0){
if(verbose) warning("All numerical variables have zero variance.")
anynum <- FALSE
}
if (vcat == 0){
if(verbose) warning("All categorical variables have zero variance.")
anyfact <- FALSE
}
if(anynum & anyfact){
lambda <- vnum/vcat
if(verbose) cat("Estimated lambda:", lambda, "\n\n")
}else{
lambda <- 1
}
}
}
# initialize clusters
clusters <- numeric(nrow(x))
tot.dists <- NULL
moved <- NULL
iter <- 1
# check for any equal prototypes and reduce cluster number in case of occurence
if(k > 1){
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) message("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# special case only one cluster
if(k == 1){clusters <- rep(1, nrow(x)); size <- table(clusters); iter <- iter.max} # REM: named vector size is needed later...
# start iterations for standard case (i.e. k > 1)
while(iter < iter.max){
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
#a0 <- proc.time()[3]
#d1 <- apply(x[,numvars],1, function(z) sum((z-protos[i,numvars])^2)) # euclidean for numerics
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
#a1 <- proc.time()[3]
#d2 <- lambda * apply(x[,catvars],1, function(z) sum((z != protos[i,catvars]))) # wtd simple matching for categorics
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
#a2 <- proc.time()[3]
dists[,i] <- d1 + d2
#cat(a1-a0, a2-a1, "\n")
}
# assign clusters
old.clusters <- clusters
# clusters <- apply(dists, 1, function(z) which.min(z))
clusters <- apply(dists, 1, function(z) {a <- which.min(z); if (length(a)>1) a <- sample(a,1); return(a)}) # sample in case of multiple minima
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
# prevent from empty classes
#tot.within <- numeric(k)
#totw.list <- by(min.dists, clusters, sum)
#tot.within[names(totw.list)] <- as.numeric(totw.list)
# ...check for empty clusters and eventually reduce number of prototypes
if (length(size) < k){
k <- length(size)
protos <- protos[1:length(size),]
if(verbose) cat("Empty clusters occur. Cluster number reduced to:", k, "\n\n")
}
# trace
tot.dists <- c(tot.dists, sum(tot.within))
moved <- c(moved, sum(clusters != old.clusters))
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
# check for any equal prototypes and reduce cluster number in case of occurence
if(iter == (iter.max-1)){ # REM: for last iteration equal prototypes are allowed. otherwise less prototypes than assigned clusters.
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) cat("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# add stopping rules
if(moved[length(moved)] == 0) break
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
#cat("iter", iter, "moved", moved[length(moved)], "tot.dists",tot.dists[length(tot.dists)],"\n" )
iter <- iter+1
}
### Final update of prototypes and dists
if(iter == iter.max){ # otherwise there have been no moves anymore and prototypes correspond to cluster assignments
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
dists[,i] <- d1 + d2
}
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
}
names(clusters) <- row.names(dists) <- row.names(x)
rownames(protos) <- NULL
# create result:
res <- list(cluster = clusters,
centers = protos,
lambda = lambda,
size = size,
withinss = within,
tot.withinss = tot.within,
dists = dists,
iter = iter,
trace = list(tot.dists = tot.dists, moved = moved))
# loop: if nstart > 1:
if(nstart > 1)
for(j in 2:nstart){
res.new <- kproto(x=x, k=k_input, lambda = lambda, iter.max = iter.max, nstart=1, verbose=verbose)
if(res.new$tot.withinss < res$tot.withinss) res <- res.new
}
if(keep.data) res$data = x
class(res) <- "kproto"
return(res)
}
I've managed to modify the functions to accept a variety of similarity measures and uploaded the R file at http://dx.doi.org/10.17632/63nyn9tjcd.1, in case someone needs it.

Use a function in R formula

Any help with this would be greatly appreciated. I am optimising parameters of a lognormal distribution so that the proportion of estimates matches a set of target values (distances). The proportions are calculated using the following functions:
adj_sumifs <- function(sum_array, condition_array, f, m=1){
n <- length(condition_array)
sm = 0
if (n == length(sum_array)){
fun <- function(x,i){if (f (condition_array[i])){sum_array[i] + x}else{x} }
sm <- Reduce(fun,1:n,0)
}
ifelse(m <= 0, sm , sm/m)
}
and
estimate.inrange <- function(vals,dist,lower,upper,total){
n <- length(lower)
if (n == length(upper)){
sapply(1:n, function(i){ ifelse(i < n ,
adj_sumifs(vals,dist, (function(x) x >= lower[i] && x < upper[i]),total) ,
adj_sumifs(vals,dist, (function(x) x >= lower[i]) , total)
) }
)
}else{
# for a failure in the process
as.numeric()
}
}
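For reference, the binning that the two helpers above perform can be sketched in a vectorised form. This is an illustration under assumptions, not a drop-in replacement: the name inrange_share is hypothetical, and it assumes that every bin is half-open [lower, upper) except the last one, which is open-ended, as in estimate.inrange.

```r
# Share of `vals` whose `dist` falls into each [lower, upper) bin;
# the last bin has no upper limit.
inrange_share <- function(vals, dist, lower, upper, total = sum(vals)) {
  stopifnot(length(vals) == length(dist), length(lower) == length(upper))
  n <- length(lower)
  sums <- vapply(seq_len(n), function(i) {
    keep <- if (i < n) dist >= lower[i] & dist < upper[i] else dist >= lower[i]
    sum(vals[keep])
  }, numeric(1))
  # same convention as adj_sumifs: only divide by a positive total
  if (total <= 0) sums else sums / total
}

shares <- inrange_share(vals  = c(1, 2, 3, 4),
                        dist  = c(1.5, 2.5, 2.7, 9),
                        lower = c(1, 2, 3),
                        upper = c(2, 3, 4))
shares
# [1] 0.1 0.5 0.4
```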
And the function I would like to optimise is:
calculate_Det_ptns <- function(alpha, beta, pxa, low,up, distances, eF){
temp <- numeric()
if ( length(pxa) == length(distances) && length(low) == length(up) )
{
ln_values <- as.numeric(Map(function(pa,d) eF * pa * dlnorm(d, meanlog = alpha, sdlog = beta),pxa,distances))
temp <- estimate.inrange (ln_values,distances,low,up, total = sum(ln_values))
}
temp
}
Optimisation is done using the Levenberg-Marquardt algorithm
lnVals <- nlsLM(target ~ calculate_Det_ptns(alpha = a,beta = b, pxa = odab,low = low, up = up, distances = dist, eF = expF),
start = list(a = mu, b = sd ),
trace = T)
where up, low and target are extracted from the same data file, e.g.
low, up, target
1,2,0.1
2,3,0.4
3,4,0.6
4,5,0.6
5,6,0.9
while odab and dist are vectors of arbitrary length (usually much longer than target, etc.). The process works well when the target file has about 150 rows, and dist and odab have about 500000 values. However, for reasons I cannot fathom, it fails when the target file has about 16 rows. The error message is:
Error in model.frame.default(formula = ~target + odab + low + up + dist) :
variable lengths differ (found for 'odab')
which suggests that the function is not being evaluated in the formula. Can anyone suggest a solution or explanation? It is important that the proportions are re-estimated for every new mu and sd.
You could try wrapping the function call in I(), which will evaluate it as is before evaluating the formula. However, I could not replicate your problem with the code provided because some of the referenced objects are missing (a, b, odab, dist, expF, mu, sd), so I could not confirm whether or not this works.
lnVals <- nlsLM(target ~ I(calculate_Det_ptns(alpha = a, beta = b, pxa = odab, low = low, up = up, distances = dist, eF = expF)),
start = list(a = mu, b = sd),
trace = T)

Autocorrelation plot for only negative values

I would like to do an acf plot in R for only the negative values of a time series. I cannot do this by simply subsetting the data to the negative values beforehand, because that would drop an arbitrary number of positive days between the negative values and make the autocorrelation unreasonably high. Instead, I would like to run the autocorrelation on the whole time series and then keep only the pairs in which the first day is negative.
For example, in theory, I could build a data frame with the original series and all of the lagged series, filter for the negative values in the original series, and then plot the correlations. However, I would like to automate this using the acf plot.
Here is an example of my time series:
> dput(exampleSeries)
c(0, 0, -0.000687, -0.004489, -0.005688, 0.000801, 0.005601,
0.004546, 0.003451, -0.000836, -0.002796, 0.005581, -0.003247,
-0.002416, 0.00122, 0.005337, -0.000195, -0.004255, -0.003097,
0.000751, -0.002037, 0.00837, -0.003965, -0.001786, 0.008497,
0.000693, 0.000824, 0.005681, 0.002274, 0.000773, 0.001141, 0.000652,
0.001559, -0.006201, 0.000479, -0.002041, 0.002757, -0.000736,
-2.1e-05, 0.000904, -0.000319, -0.000227, -0.006589, 0.000998,
0.00171, 0.000271, -0.004121, -0.002788, -9e-04, 0.001639, 0.004245,
-0.00267, -0.004738, 0.001192, 0.002175, 0.004666, 0.006005,
0.001218, -0.003188, -0.004363, 0.000462, -0.002241, -0.004806,
0.000463, 0.000795, -0.005715, 0.004635, -0.004286, -0.008908,
-0.001044, -0.000842, -0.00445, -0.006094, -0.001846, 0.005013,
-0.006599, 0.001914, 0.00221, 6.2e-05, -0.001391, 0.004369, -0.005739,
-0.003467, -0.002103, -0.000882, 0.001483, 0.003074, 0.00165,
-0.00035, -0.000573, -0.00316, -0.00102, -0.00144, 0.003421,
0.005436, 0.001994, 0.00619, 0.005319, 7.3e-05, 0.004513)
I tried to implement your description.
correl <- function(x, lag.max = 10){
library(dplyr)
m <- matrix(ncol = lag.max, nrow = length(x))
for(i in 1:lag.max){
m[,i] <- lag(x, i)
}
m <- m[x<0,]
res <- apply(m, 2, function(y) cor(y, x[x<0], use = "complete.obs"))
barplot(res)
}
correl(exampleSeries)
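The same idea can be sketched without dplyr. This is an illustration under one reading of the question: the name cond_acf and its keep argument are hypothetical, and each selected day x[t] is paired with the later value x[t + k]. Swap the two vectors inside the function to condition on the later day instead.

```r
# Lag-k correlation using only the pairs whose first element passes `keep`.
cond_acf <- function(x, lag.max = 10, keep = function(v) v < 0) {
  n <- length(x)
  out <- vapply(seq_len(lag.max), function(k) {
    first <- x[seq_len(n - k)]   # x[t]
    later <- x[(k + 1):n]        # x[t + k]
    sel <- keep(first)           # keep only pairs with a negative first day
    cor(first[sel], later[sel])
  }, numeric(1))
  names(out) <- seq_len(lag.max)
  out
}

# Toy series in which every negative value is followed by -2 times itself
x <- c(-1, 2, -2, 4, -3, 6, -4, 8)
r1 <- cond_acf(x, lag.max = 1)
r1
#  1 
# -1
```

With a real series, barplot(cond_acf(exampleSeries)) gives a plot comparable to the one produced by correl().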
Maybe just write your own function? Something like:
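negativeACF <- function(x, num.lags = 10)
{
  n <- length(x)
  # correlate the series with itself shifted by i steps (i = 0 gives 1)
  acfs <- sapply(0:num.lags, function(i) cor(head(x, n - i), tail(x, n - i)))
  names(acfs) <- 0:num.lags
  # keep only the lags with a negative correlation
  acfs[acfs < 0]
}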
results <- negativeACF(exampleSeries, num.lags=20)
barplot(results)
Yea I ended up writing my own functions and just replacing the values in the R acf object with my own values that are just the correlations. So:
genACF <- function(series, my.acf, lag.max = NULL, neg){
x <- na.fail(as.ts(series))
x.freq <- frequency(x)
x <- as.matrix(x)
if (!is.numeric(x))
stop("'x' must be numeric")
sampleT <- as.integer(nrow(x))
nser <- as.integer(ncol(x))
if (is.null(lag.max))
lag.max <- floor(10 * (log10(sampleT) - log10(nser)))
lag.max <- as.integer(min(lag.max, sampleT - 1L))
if (is.na(lag.max) || lag.max < 0)
stop("'lag.max' must be at least 0")
if(neg){
indices <- which(series < 0)
}else{
indices <- which(series > 0)
}
series <- scale(series, scale = FALSE)
series.zoo <- zoo(series)
for(i in 0:lag.max){
lag.series <- lag(series.zoo, k = -i, na.pad = TRUE)
temp.corr <- cor(series.zoo[indices], lag.series[indices], use = 'complete.obs', method = 'pearson')
my.acf[i+1] <- temp.corr
}
my.acf[1] <- 0
return(my.acf)
}
plotMyACF <- function(series, main, type = 'correlation', neg = TRUE){
series.acf <- acf(series, plot = FALSE)
my.acf <- genACF(series, series.acf$acf, neg = neg)
series.acf$acf <- my.acf
plot(series.acf, xlim = c(1, dim(series.acf$acf)[1] - (type == 'correlation')), xaxt = "n", main = main)
if (dim(series.acf$acf)[1] < 25){
axis(1, at = 1:(dim(series.acf$acf)[1] - 1))
}else{
axis(1)
}
}
And I get something like this:
