Given this example data frame:
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9))
I am trying to assign each row to a group containing rows adjacent to one another, based on their similarity. I would like to use a cutoff of 0.35, meaning that consecutive rows of values c(1, 0.85, 0.7) can be assigned to one group, but c(0, 1, 0) cannot.
Regarding the columns, column-to-column differences are not important, i.e. c(1, 1, 1) and c(0, 0, 0) could still be assigned to one group. HOWEVER, if the rows in one column meet the criterion (e.g. c(1, 1, 1)) but the rows in another column do not (e.g. c(1, 0, 1)), the row is invalid.
Here is the desired output for the example I gave above:
[1] 1 1 1 2 2 NA NA
I am currently applying the abs(diff()) function to determine the difference between the values, and then for each row I take the largest value (adding 1 at the beginning to account for the first row):
diff <- apply(DF, MARGIN = 2, function (x) abs(diff(x)))
max_diff <- c(1, apply(diff, MARGIN = 1, function (x) max(x, na.rm = T)))
max_diff
[1] 1.00 0.15 0.10 0.90 0.10 0.90 0.90
I am stuck at this point, not quite sure of the best way to proceed with the group assignment. I was initially trying to convert max_diff into a logical vector (max_diff < 0.35) and then running a for loop to group all the TRUEs together. This has a couple of problems:
My dataset has millions of rows, so the for loop takes ages;
I "ignore" the first component of each group - e.g. I would not consider the first row a member of the first group, because its max_diff value of 1 gives FALSE. I don't want to ignore anything.
I will be very grateful for any advice on how to proceed in an efficient way.
PS. The way of determining the difference between rows is not crucial - here it is just a cutoff of 0.35, but this is very flexible. All I am after is an adjustable method of finding similar rows.
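For reference, a vectorized sketch of the grouping idea described above (it assumes, as in the desired output, that groups containing only a single row become NA):
cutoff <- 0.35
grp <- cumsum(max_diff >= cutoff)            # a new group starts at each large jump: 1 1 1 2 2 3 4
grp[ave(grp, grp, FUN = length) == 1] <- NA  # singleton groups -> NA
grp                                          # 1 1 1 2 2 NA NA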
You could do a cluster analysis and play around with different cutoffs h.
cl <- hclust(dist(DF))
DF$group <- cutree(cl, h=.5)
DF
# x y z group
# 1 1.00 0.00 0.0 1
# 2 0.85 0.00 0.0 1
# 3 0.90 0.10 0.0 1
# 4 0.00 0.90 0.9 2
# 5 0.00 1.00 0.9 2
# 6 0.90 0.90 0.0 3
# 7 0.95 0.97 0.9 4
A dendrogram helps to determine h.
plot(cl)
abline(h=.5, col=2)
My current mission: pick some "good" columns from an incomplete matrix, trying to remove NAs while keeping as much real data as possible.
My idea: I can calculate every column's missing-data percentage (NA%). For a given threshold t, all columns with NA% > t will be removed. The removed columns also contain some real data, so in those columns the ratio present/missing shows the "price" of deleting them. I want to find, for each dataset, the lowest "price" that deletes as many NAs as possible.
I have already written my functions up to the last two steps:
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
myfunc3(0.5, setA) # worked
threshold <- seq(from = 0, to = 0.95, by = 0.05)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # does not work; stuck here.
I have 10 datasets, setA to setJ, and I want to test all thresholds from 0 to 0.95. I want the result as a matrix with the 10 datasets as columns and the 20 thresholds (every 0.05) as rows.
Did I do this correctly? Are there better ideas, or already existing libraries that I could use?
----------edit: example-----------
setA <- data.frame(cbind(c(1,2,3,4,NA,6,7,NA), c(1,2,NA,4,5,NA,NA,8),c(1,2,3,4,5,6,NA,8), c(1,2,3,4,5,6,7,8),c(NA,NA,NA,4,NA,6,NA,NA)))
colnames(setA) <- sprintf("col%s", seq(1:5))
rownames(setA) <- sprintf("sample%s", seq(1:8))
View(setA)
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
In setA, there are 8 samples. Each sample has 5 attributes to describe the sample. Unfortunately, some data are missing. I need to delete some columns with too many NAs. First, let me calculate every column's NA%.
> apply(setA, MARGIN = 2, myfunc2)
col1 col2 col3 col4 col5
0.250 0.375 0.125 0.000 0.750
If I set the threshold t = 0.3, that means col2 and col5 have too many NAs and need to be deleted; the others are acceptable. If I delete those 2 columns, I also delete some real data. (I deleted 7 real data points and 9 NAs, 7/9 = 0.78. This means I sacrifice 0.78 real data points for each NA I delete.)
> myfunc3(0.3, setA)
[1] 0.7777778
I want to try every threshold's result and then decide.
threshold <- seq(from = 0, to = 0.9, by = 0.1)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # does not work
I manually calculated the setA part:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
price: 1.667 1.667 1.118 0.778 0.334 0.334 0.334 0.334 NaN NaN
In the end I want a table like:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
setA: 1.667 1.667 1.118 0.778 0.334 0.334 0.334 0.334 NaN NaN
setB:
setC:
...
setJ:
Am I approaching the problem the right way?
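A sketch of one way the sweep could be written: since threshold is a plain vector with no dimensions, apply() with a MARGIN will not accept it, but sapply() will. The named list of datasets below is an assumption, filled in with setA only:
thresholds <- seq(from = 0, to = 0.95, by = 0.05)
sets <- list(setA = setA)  # add setB, ..., setJ here as they become available
# one column per dataset, one row per threshold
price_table <- sapply(sets, function(s) sapply(thresholds, myfunc3, set = s))
rownames(price_table) <- thresholds
price_table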
-----------Edit---------------
I have already solved the problem; please close the thread.
I am struggling with solve.QP to get a solution that minimizes tracking error. I have a benchmark consisting of 6 assets (asset_a to asset_f). For my portfolio I have upper and lower bounds (I cannot have a position in asset_f). The covariance matrix is also given. I want to get the portfolio weights for the 6 assets that minimize tracking error vs. the benchmark (with the position in asset_f equal to zero).
benchmark:
asset_a: 0.3
asset_b: 0.3
asset_c: 0.1
asset_d: 0.1
asset_e: 0.1
asset_f: 0.1
lowerbounds:
asset_a: 0.166
asset_b: 0.133
asset_c: 0.037
asset_d: 0.035
asset_e: 0.039
asset_f: 0
upperbounds:
asset_a: 1
asset_b: 1
asset_c: 1
asset_d: 1
asset_e: 1
asset_f: 0
benchmark weights and bounds:
test.benchmark_weights = c(0.3, 0.3, 0.1, 0.1, 0.1, 0.1)
test.lowerbound = c(0.166, 0.133, 0.037, 0.035, 0.039,0)
test.upperbound = c(1, 1, 1, 1, 1, 0)
covariance matrix (test.dmat):
test.dmat = matrix(c(
  0.0119127162, 0.010862842, 0.010266683, 0.0009550136, 0.008242322, 0.00964462,
  0.0108628421, 0.010603072, 0.009872992, 0.0011019412, 0.007422522, 0.0092528873,
  0.0102666826, 0.009872992, 0.010487808, 0.0012107665, 0.006489204, 0.0096216627,
  0.0009550136, 0.001101941, 0.001210766, 0.0115527788, 0.001181745, 0.0008387247,
  0.0082423222, 0.007422522, 0.006489204, 0.0011817453, 0.012920482, 0.005973886,
  0.00964462, 0.009252887, 0.009621663, 0.0008387247, 0.005973886, 0.0089904809),
  nrow=6, ncol=6)
dvec (test.dvec):
test.dvec = matrix(c(0, 0, 0, 0, 0, 0), nrow=6, ncol=1)
Amat constraints matrix (test.amat):
test.amat = matrix(c(
  1,1,1,1,1,1,
  1,1,1,1,1,0,
  -1,0,0,0,0,0,
  0,-1,0,0,0,0,
  0,0,-1,0,0,0,
  0,0,0,-1,0,0,
  0,0,0,0,-1,0,
  0,0,0,0,0,-1,
  1,0,0,0,0,0,
  0,1,0,0,0,0,
  0,0,1,0,0,0,
  0,0,0,1,0,0,
  0,0,0,0,1,0,
  0,0,0,0,0,0,
  -1,0,0,0,0,0,
  0,-1,0,0,0,0,
  0,0,-1,0,0,0,
  0,0,0,-1,0,0,
  0,0,0,0,-1,0,
  0,0,0,0,0,0), nrow=6, ncol=20)
bvec (test.bvec):
test.bvec = cbind(0, 1, t(test.benchmark_weights), t(test.lowerbound), -t(test.upperbound))
then running the solver
solve.QP(as.matrix(test.dmat), test.dvec, test.amat, test.bvec)
gives me
constraints are inconsistent, no solution!
It seems there is something wrong with your Amat and bvec: you do not need to pass in both "sum of weights of the first 5 assets equals 1" and "sum of all 6 assets equals 1", and the benchmark weights are not constraints; only the bounds are:
library(quadprog)
N = 6L
test.dvec = rep(0, N)
test.amat = cbind(
  rep(1, N),    # equality: weights sum to 1
  diag(1, N),   # w_i >= lower bound
  diag(-1, N))  # -w_i >= -upper bound, i.e. w_i <= upper bound
test.bvec = c(1, test.lowerbound, -test.upperbound)
res = solve.QP(test.dmat, test.dvec, test.amat, test.bvec, meq=1L)
round(res$solution, 2)
#[1] 0.17 0.13 0.10 0.44 0.17 0.00
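A quick sanity check of the returned weights against the constraints (a sketch; res, test.lowerbound and test.upperbound are the objects defined above):
w <- res$solution
sum(w)                            # should be 1 (the meq = 1 equality constraint)
all(w >= test.lowerbound - 1e-8)  # lower bounds respected (small numeric tolerance)
all(w <= test.upperbound + 1e-8)  # upper bounds respected, so asset_f stays at 0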
I am now working on an R problem, and in my last step I would like to check whether my 95% confidence intervals contain the true values of the parameters. I have a question about this step.
library(data.table)
dat <- data.table(low = c(0.9, 1.1, 1, 0.95),
                  up = c(0.99, 1.2, 1.3, 1.4),
                  true = c(1, 1.15, 1.2, 1.5))
For example, in the first row I would like to check whether the true value 1 is within the confidence interval 0.9 to 0.99. Obviously it is not, so return 0. After examining every row, the desired result will be:
result <- data.table(low = c(0.9, 1.1, 1, 0.95),
                     up = c(0.99, 1.2, 1.3, 1.4),
                     true = c(1, 1.15, 1.2, 1.5),
                     conv = c(0, 1, 1, 0))
print(result)
Is there any efficient way I could solve this problem? Thank you very much for your help.
The most efficient way is probably the between operator:
> result[, conv2 := as.integer(true %between% .(low, up))]
> result
low up true conv conv2
1: 0.90 0.99 1.00 0 0
2: 1.10 1.20 1.15 1 1
3: 1.00 1.30 1.20 1 1
4: 0.95 1.40 1.50 0 0
You can skip the as.integer to get a logical column, which is more standard when encoding such data in R.
The between function I'm using here comes from data.table and has two syntaxes:
x %between% list(dn, up)
between(x, dn, up)
Inside DT[...] you can shorten list() to .().
This works idiomatically in data.table
dat[, inConf := ifelse(true >= low & true <= up,T,F)]
###alternatively with 0,1
dat[, inConf := ifelse(true >= low & true <= up,1,0)]
I am generating random variables with a specified range and dimension. I have written the following code for this.
generateRandom <- function(size, scale){
  result <- round(runif(size, 1, scale), 1)
  return(result)
}
flag <- TRUE
x <- generateRandom(300, 6)
y <- generateRandom(300, 6)
while (flag) {
  corrXY <- cor(x, y)
  if (corrXY >= 0.2) {
    flag <- FALSE
  } else {
    x <- generateRandom(300, 6)
    y <- generateRandom(300, 6)
  }
}
I want the following: 6 variables of size 300, all with scale 1 to 6 except for one variable with scale 1 to 7, and with the following correlation structure among them.
1      0.45   -0.35  0.46   0.25   0.3
       1      0.25   0.29   0.5   -0.3
              1     -0.3    0.1    0.4
                     1      0.4    0.6
                            1     -0.4
                                   1
But when I try to increase the threshold value, my program gets very slow. Moreover, I want more than 7 variables of size 300, and between each pair of those variables I want some specific correlation threshold. How would I do this efficiently?
This answer is directly inspired by here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
0.25, 0.29, 0.5, -0.3, -0.3,
0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
We first derive, from this target correlation structure, the correlation structure of the Gaussian copula. This uses the relation r = 2*sin(pi*rho/6), which gives the Pearson correlation r the underlying Gaussian variables must have so that the transformed uniform variables have correlation rho:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, so we use the nearest positive definite matrix instead:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
G <- MASS::mvrnorm(n=300, mu=rep(0, 6), Sigma=Copucov, empirical=TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and despite taking the nearest positive definite matrix to the copula's correlation matrix, etc.), the correlation structure does not change much:
Ur <- round(U, 1)
cor(Ur)
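To see how close the achieved correlations end up to the target, a quick check (a sketch using the objects defined above):
# deviation between achieved and target correlations
round(cor(Ur) - Rhos, 2)
max(abs(cor(Ur) - Rhos))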
I have an R script I'm running now that currently uses 3 correlated variables. I'd like to add a 4th, and am wondering if there's a simple way to input matrix data, particularly for correlation matrices: some Matlab-like technique to enter a correlation matrix, 3x3 or 4x4, in R without the linear-to-matrix reshape I've been using.
In Matlab, you can use the semicolon as an end-row delimiter, so it's easy to keep track of where the cross correlations are.
In R, I first create
corr <- c(1, 0.1, 0.5,
0.1, 1, 0.9,
0.5, 0.9, 1)
cormat <- matrix(corr, ncol=3)
Versus
cormat = [1 0.1 0.5;
0.1 1 0.9;
0.5 0.9 1]
It just feels clunkier, which makes me suspect there's a smarter way I haven't looked up yet. Thoughts?
Welcome to the site! :) You should be able to do it in one step:
MyMatrix = matrix(
c(1, 0.1, 0.5,
0.1, 1, 0.9,
0.5, 0.9, 1),
nrow=3,
ncol=3)
Here is another way:
CorrMat <- matrix(scan(),3,3,byrow=TRUE)
1 0.1 0.5
0.1 1 0.9
0.5 0.9 1
The trailing blank line is important: it tells scan() that the input has ended.
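Note that the interactive scan() prompt only works when typing at the console; inside a sourced script you can pass the same numbers through scan()'s text argument (a sketch):
CorrMat <- matrix(scan(text = "
1   0.1 0.5
0.1 1   0.9
0.5 0.9 1
"), 3, 3, byrow = TRUE)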
If you want to input a symmetric matrix, you can use the xpnd() function in the MCMCpack library.
xpnd() takes a vector which corresponds to the upper-triangle of the matrix (thus you only have to enter each value once). For instance, if you want to input:
$\left(\begin{array}{c c c}
1 & 0.1 & 0.5 \\
0.1 & 1 & 0.9 \\
0.5 & 0.9 & 1
\end{array}\right)$
You would use
library(MCMCpack)
xpnd(c(1, 0.1, 0.5, 1, 0.9, 1), 3)
where 3 refers to the number of rows in the matrix.
Help page for xpnd.
rbind(c(1, 0.1, 0.5),
c(0.1, 1, 0.9),
c(0.5, 0.9, 1))
The existing solutions may only work for a 3x3 matrix. I tried this one:
a<-diag(3)
m<-diag(3)
m[lower.tri(m,diag=F)]<-c(0.1, 0.5, 0.9)
m<-m+t(m)-a
As you are working with correlation matrices, you are probably not interested in entering the diagonal or both the upper and lower parts. You can manipulate/extract those three parts separately using diag(), upper.tri() and lower.tri().
> M <- diag(3) # create 3x3 matrix, diagonal defaults to 1's
> M[lower.tri(M, diag=F)] <- c(0.1, 0.5, 0.9) # read in lower part
> M # lower matrix containing all information
[,1] [,2] [,3]
[1,] 1.0 0.0 0
[2,] 0.1 1.0 0
[3,] 0.5 0.9 1
If you want the full matrix:
> M[upper.tri(M, diag=F)] <- M[lower.tri(M)] # fill upper part
> M # full matrix
[,1] [,2] [,3]
[1,] 1.0 0.1 0.5
[2,] 0.1 1.0 0.9
[3,] 0.5 0.9 1.0
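Since the original question mentions moving from 3x3 to 4x4, the same pattern scales to any size; a sketch with placeholder correlation values:
M4 <- diag(4)                               # 4x4, diagonal of 1's
M4[lower.tri(M4)] <- c(0.1, 0.5, 0.2,       # column 1: rows 2-4 (placeholder values)
                       0.9, 0.3,            # column 2: rows 3-4
                       0.4)                 # column 3: row 4
M4[upper.tri(M4)] <- t(M4)[upper.tri(M4)]   # mirror the lower triangle; works for any size
M4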