Pairwise Similarity Matrix from Function (HPOSim) in R

I am trying to create a pairwise similarity matrix where I compare the similarity of each HPO term to every other HPO term using the "getSimWang" function of the R package HPOSim. Package available here: https://sourceforge.net/projects/hposim/
I can create the pairwise similarity matrix for a subset of the HPO terms (there are ~13,000) using the following:
library(HPOSim)

list1 <- c("HP:0002404", "HP:0011933", "HP:0030286")
custom <- function(x, y) {
  z <- getSimWang(x, y)
  return(z)
}
outer(list1, list1, Vectorize(custom))
[,1] [,2] [,3]
[1,] 1.0000000 0.6939484 0
[2,] 0.6939484 1.0000000 0
[3,] 0.0000000 0.0000000 1
sapply(list1, function(x) sapply(list1, function(y) custom(x,y)))
HP:0002404 HP:0011933 HP:0030286
HP:0002404 1.0000000 0.6939484 0
HP:0011933 0.6939484 1.0000000 0
HP:0030286 0.0000000 0.0000000 1
However, when I tried to expand this code to apply to the rest of the HPO terms, R was calculating for 24+ hours, and when I used pbsapply to estimate the time it would take, it estimated it would be 20 days!
I have also tried mapply - but that only gives me a subset of the calculations (x1y1, x2y2, and x3y3) rather than all combinations (x1y1, x1y2, x1y3, etc).
mapply(custom, list1, list1)
HP:0002404 HP:0011933 HP:0030286
1 1 1
I have also tried the xapply solution from here, but when I run that I lose the information about which terms are being compared:
xapply(FUN = custom, list1, list1)
[[1]]
[1] 1
[[2]]
[1] 0.6939484
[[3]]
[1] 0
[[4]]
[1] 0.6939484
[[5]]
[1] 1
[[6]]
[1] 0
[[7]]
[1] 0
[[8]]
[1] 0
[[9]]
[1] 1
Is there a different method that I am missing in order to get the pairwise (or ideally non-redundant pairwise) calculations for the similarity? Or is this really going to take 20 days?!?
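One way to at least halve the work is to compute only the non-redundant pairs. A minimal sketch, assuming (as your outer() output above suggests) that getSimWang() is symmetric and that every term's self-similarity is 1:

library(HPOSim)

terms <- c("HP:0002404", "HP:0011933", "HP:0030286")
pairs <- combn(terms, 2)                               # one column per unique pair
sims  <- apply(pairs, 2, function(p) getSimWang(p[1], p[2]))

simmat <- diag(1, length(terms))                       # self-similarity assumed 1
dimnames(simmat) <- list(terms, terms)
simmat[t(pairs)] <- sims                               # fill upper triangle
simmat[t(pairs)[, 2:1]] <- sims                        # mirror into lower triangle
simmat

Even so, ~13,000 terms means roughly 84 million unique pairs, so splitting the sims computation across cores is probably still necessary, e.g. with parallel::mcmapply(getSimWang, pairs[1, ], pairs[2, ]) on Unix-alikes.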

Related

How to perform nested loop over a list() in R?

Suppose I have a list like this in R:
> L
[[1]]
[1] 0.6876619 0.7847888 0.6377801 0.2078056 0.8981001
[[2]]
[1] 0.9358160 0.8905056 0.7715877 0.8648426 0.4915060
[[3]]
[1] 0.88095630 0.08010288 0.15140700 0.35400865 0.60317717
[[4]]
[1] 0.07436267 0.85873209 0.49881141 0.92363954 0.87208334
And I want to find the correlation coefficient between each pair of vectors, e.g., cor(L[[i]], L[[j]]). Is there any solution to perform it with the apply family of functions?
Please take it as a specific case of a general question: what if we need to do a triple nested loop over a list() in R?
You can nest lapply calls:
lapply(L, function(x) lapply(L, function(y) cor(x, y)))
If you want the results presented more nicely, put them in a matrix:
L <- list(rnorm(10), rnorm(10), rnorm(10))
matrix(unlist(lapply(L,
                     function(x) lapply(L,
                                        function(y) cor(x, y)))),
       length(L))
#> [,1] [,2] [,3]
#> [1,] 1.0000000 -0.3880931 -0.4164212
#> [2,] -0.3880931 1.0000000 0.4158335
#> [3,] -0.4164212 0.4158335 1.0000000
Created on 2021-05-31 by the reprex package (v2.0.0)
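The same nesting extends to the triple loop the question asks about; a quick sketch, where the innermost body stands in for any f(x, y, z):

res3 <- lapply(L, function(x)
  lapply(L, function(y)
    lapply(L, function(z) cor(x, y) + cor(y, z))))  # placeholder for f(x, y, z)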
You could use mapply. Generate all the combinations of interest (pairs, triples, ...) and then apply
L <- replicate(5, rnorm(5), simplify = FALSE)
tmp <- expand.grid(1:length(L), 1:length(L))
tmp$cor <- mapply(
  function(y, x) cor(L[[y]], L[[x]]),
  tmp$Var1,
  tmp$Var2
)
Var1 Var2 cor
1 1 1 1.0000000
2 2 1 0.1226881
3 3 1 -0.2871613
4 4 1 0.4746545
5 5 1 0.9779644
6 1 2 0.1226881
7 2 2 1.0000000
...
You can cbind the list and call cor with the resulting matrix.
cor(do.call(cbind, L))
# [,1] [,2] [,3] [,4]
#[1,] 1.0000000 -0.46988357 0.14151672 0.14151672
#[2,] -0.4698836 1.00000000 -0.09177819 -0.09177819
#[3,] 0.1415167 -0.09177819 1.00000000 1.00000000
#[4,] 0.1415167 -0.09177819 1.00000000 1.00000000
In case there is one more level in the list, use unlist.
L2 <- lapply(L, list) #Create list with one more level.
cor(do.call(cbind, unlist(L2, FALSE)))
In case the nesting depth is unknown or mixed, a recursive function can be used:
L3 <- list(L[[1]], L[[2]], L2[[3]], L2[[4]])
f <- function(x) {
  if (is.list(x)) sapply(x, f)
  else x
}
cor(f(L3))
Data:
L <- list(c(0.6876619,0.7847888,0.6377801,0.2078056,0.8981001)
, c(0.9358160,0.8905056,0.7715877,0.8648426,0.4915060)
, c(0.88095630,0.08010288,0.15140700,0.35400865,0.60317717)
, c(0.88095630,0.08010288,0.15140700,0.35400865,0.60317717))

Is there a way to normalize rows in an fcm (i.e., transform counts into values from 0 to 1)?

Good day,
I have a feature co-occurrence matrix (fcm, from the quanteda package in R) of dimensions 94966 x 94966 (named fcm2, for illustration). I am able to select rows (class: fcm object) by feature name or row number, e.g.:
a1 <- fcm2[1,]
and perform a normalization calculation on that particular row:
a2 <- a1/(max(a1)-min(a1))
My objective is to normalize each row in my original fcm. The strategy I attempted was to initialize an empty matrix and then use a for loop to go through the rows and perform the calculation. The initialization fails because of memory issues (Windows 10, 12 Gb RAM, R version 3.4.4):
fcm3 <- matrix(data = NA, nrow = 94966, ncol = 94966)
Error: cannot allocate vector of size 33.6 Gb
I had been able to accomplish the normalization by using a dataframe structure, but there is not enough memory to store the entire fcm2 as a dataframe:
Step 1: Extract a "sub-matrix" based on a list of keywords, convert to dataframe, drop unneeded columns
m <- fcm2[keywords(),]
df_m1 <- as.data.frame(m)
df_m1 <- subset(df_m1, select = -c(document, rt))
Step 2: Normalization
k <- 0              # initialize counter
df2 <- data.frame() # initialize
n4 <- nrow(df_m1)   # number of rows of the extracted sub-matrix (df_m1)
for (k in 1:n4) {
  a1 <- df_m1[k, ]            # store the k-th row
  max_k <- max(a1)
  min_k <- min(a1)
  a2 <- a1 / (max_k - min_k)  # normalize so max is 1, 0s are still 0s
  df2 <- rbind(df2, a2)       # append the normalized row to a data.frame
}
Is there a more efficient way to normalize each row for the entire fcm?
Kind thanks!
You can write a function:
norm <- function(mat) {
  mx <- mat[cbind(1:nrow(mat), max.col(mat))]   # row maxima
  mn <- mat[cbind(1:nrow(mat), max.col(-mat))]  # row minima
  mat / (mx - mn)
}
And then use it.
Example
set.seed(1)
mat1=matrix(sample(20),5)
mat1
[,1] [,2] [,3] [,4]
[1,] 6 14 3 7 #max is 14, min is 3 thus divide by 11
[2,] 8 15 2 12
[3,] 11 9 20 17
[4,] 16 19 10 18
[5,] 4 1 5 13
norm(mat1)
[,1] [,2] [,3] [,4]
[1,] 0.5454545 1.27272727 0.2727273 0.6363636
[2,] 0.6153846 1.15384615 0.1538462 0.9230769
[3,] 1.0000000 0.81818182 1.8181818 1.5454545
[4,] 1.7777778 2.11111111 1.1111111 2.0000000
[5,] 0.3333333 0.08333333 0.4166667 1.0833333
You can print the results in fraction form to check that they represent what was needed:
MASS::fractions(norm(mat1))
[,1] [,2] [,3] [,4]
[1,] 6/11 14/11 3/11 7/11
[2,] 8/13 15/13 2/13 12/13
[3,] 1 9/11 20/11 17/11
[4,] 16/9 19/9 10/9 2
[5,] 1/3 1/12 5/12 13/12
I understand the OP has a memory constraint and cannot allocate another copy of that big matrix.
If memory permits, the solution can be:
mat1 <- t(apply(mat1, 1, function(x) x / (max(x) - min(x))))
Under the memory constraint, one can instead write a function that normalises a vector and apply it to each row in a for loop, updating the matrix in place. That should be efficient in this scenario.
# Function to normalise a vector
normalise <- function(x) {
  x / (max(x) - min(x))
}
# Apply over all rows of the matrix
for (i in 1:nrow(mat1)) {
  mat1[i, ] <- normalise(mat1[i, ])
}
mat1
# [,1] [,2] [,3] [,4]
# [1,] 0.5454545 1.27272727 0.2727273 0.6363636
# [2,] 0.6153846 1.15384615 0.1538462 0.9230769
# [3,] 1.0000000 0.81818182 1.8181818 1.5454545
# [4,] 1.7777778 2.11111111 1.1111111 2.0000000
# [5,] 0.3333333 0.08333333 0.4166667 1.0833333
Data: as used by @Onyambu
# Data
set.seed(1)
mat1=matrix(sample(20),5)
The most efficient way is to operate on the sparse values of the fcm object directly, avoiding any transformation into a dense object such as a matrix or data.frame. This is how the dfm and fcm manipulation and computation functions are defined in quanteda, and why these are able to execute quickly and within limited memory.
To define such a function for your type of normalisation, you could use the following function, which I have demonstrated here on a simple fcm.
library("quanteda")
library("Matrix")
myfcm <- fcm(data_char_sampletext, window = 5)
myfcm
## Feature co-occurrence matrix of: 244 by 244 features.
Now we define a function that (for convenience) transforms the fcm into a sparse triplet representation (the dgTMatrix class) and extracts the non-zero values using split(). Each element of the resulting list will represent a row of your fcm, but only for non-zero values. (Because of this, we also have to return a zero for empty rows.)
fcm_customnorm <- function(x) {
  x <- as(x, "dgTMatrix")
  split_x <- split(x@x, x@i)
  norm_x <- lapply(split_x, function(y) {
    result <- y / (max(y) - min(y))
    # transform any divisions by zero into zero
    result[is.nan(result)] <- 0
    result
  })
  x@x <- unlist(norm_x, use.names = FALSE)
  quanteda:::as.fcm(x)
}
Applying this on a subset, we see that it works:
myfcm[1:5, 1:5]
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 5 1 4 1
## we 0 10 5 20 5
## have 0 0 0 4 1
## a 0 0 0 6 4
## Fine 0 0 0 0 0
fcm_customnorm(myfcm[1:5, 1:5])
## Feature co-occurrence matrix of: 5 by 5 features.
## 5 x 5 sparse Matrix of class "fcm"
## features
## features Instead we have a Fine
## Instead 0 1.0 0.8000000 0.3333333 1.00
## we 0 0.2 0.2000000 1.3333333 0.25
## have 0 0 0.6666667 0.3333333 3.00
## a 0 0 0 0.0000000 2.00
## Fine 0 0 0 0 0.00
Another option would have been to extract the simple triplet representation to a data.table (from the data.table package) and then perform your computations using the grouping functions and :=. But this approach is simpler and gets your desired result, which is a normalised fcm.
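For reference, that data.table variant might look something like the following sketch, reusing the myfcm object from above and the same range-based normalisation (the details are an assumption, not tested against quanteda internals):

library(data.table)
library(Matrix)

x <- as(myfcm, "dgTMatrix")               # triplet form: slots @i, @j, @x
dt <- data.table(i = x@i, v = x@x)        # one row per non-zero cell
dt[, v := v / (max(v) - min(v)), by = i]  # normalise within each matrix row
dt[is.nan(v), v := 0]                     # rows where max == min yield NaN
x@x <- dt$v                               # grouped := keeps row order, so write back
quanteda:::as.fcm(x)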

How to get a specific element of a list of lists in R?

I wrote a function (quite long and complex) whose output is a list of two lists. To make my question clear, I built a simple example similar to my function.
my_func <- function(x, y) {
  out1 <- list(x * y + 2, x + y^2)
  out2 <- list(x^2 + y - 2, y - x^2)
  out <- list(out1, out2)
  return(out)
}
Then I got this:
> x <- my_func(x=c(2,3,4),y=c(5,4,7))
> x
[[1]]
[[1]][[1]]
[1] 12 14 30
[[1]][[2]]
[1] 27 19 53
[[2]]
[[2]][[1]]
[1] 7 11 21
[[2]][[2]]
[1] 1 -5 -9
I need to apply my function to 100 different datasets, so I will get 100 outputs, each a list of two lists. How can I use lapply to get a specific element?
I tried this:
> x <- lapply(x,'[[','2')
> x
[[1]]
NULL
[[2]]
NULL
However, with my original function I have used the following code to access the element:
> x[[1]][[1]]$par ##to get the first par (of list 1) of the first output list.
[,1] [,2] [,3] [,4] [,5]
[1,] 0.000000 0.000000 0.000000 0.000000 0
[2,] 2.545870 0.000000 0.000000 0.000000 0
[3,] 3.060626 2.879527 0.000000 0.000000 0
[4,] 2.015691 2.889324 2.967189 0.000000 0
[5,] 2.862635 2.511020 2.480592 1.781923 0
How can I use lapply to access all the $par elements of the first and second lists? Any help please?
If the index used is numeric, we don't need the quotes. With quotes, [[ looks for a name that matches that character string. So,
lapply(x, `[[`, 2)
#[[1]]
#[1] 27 19 53
#[[2]]
#[1] 1 -5 -9
If we had named the sub lists, then the OP's approach would work
x1 <- lapply(x, setNames, 1:2)
lapply(x1, `[[`, '2')
#[[1]]
#[1] 27 19 53
#[[2]]
#[1] 1 -5 -9
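Back in the original setting, where each element carries a $par component (for example an optim() result), the same [[ extraction applies one level deeper. A sketch, with the hypothetical name results standing in for the list of 100 outputs:

# $par of the first element of the first sub-list, for every output
par1 <- lapply(results, function(out) out[[1]][[1]]$par)
# $par of the first element of the second sub-list, for every output
par2 <- lapply(results, function(out) out[[2]][[1]]$par)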

Percentile in list

I have the following list (h):
> h
[[1]]
[1] 0.9613971
[[2]]
[1] 0.9705882
[[3]]
[1] 0.9503676
[[4]]
[1] 0.9632353
[[5]]
[1] 0.9779412
[[6]]
[1] 0.9852941
[[7]]
[1] 0.9852941
[[8]]
[1] 0.9816176
I would like to add a new column that will calculate the percentile of each number in the list.
I tried to use the following and I get errors:
perc.rank <- function(x, xo) length(x[x <= xo])/length(x)*100
perc.rank <- function(x) trunc(rank(x))/length(x)
trunc(rank(h))/length(h)
In addition, given a number such as 0.9503676 (the third number) or its index (3), how can I find its percentile?
You can do this more efficiently by first converting your list into a vector as follows:
h <- unlist(h)
Next, create a function to find the percentile, which you can easily do by creating an empirical cdf function for your list as follow:
perc.rank <- ecdf(h)
To find the percentile for any number, example the third number, do the following:
perc.rank(0.9503676)
This will work even if the number isn't in your list, e.g. perc.rank(0.91) should give you the percentile for 0.91, and you can also pass multiple numbers to the function, like perc.rank(c(0.950, 0.913, 0.6)).
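For example, with the values from the question (a quick demonstration; the percentages follow from counting how many of the 8 values fall at or below the input):

h <- unlist(h)
perc.rank <- ecdf(h)
perc.rank(0.9503676) * 100        # 12.5: the third value is the lowest of the 8
perc.rank(c(0.950, 0.913, 0.6))   # also works for values not in the list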
Converting to a data frame will make things easier. Here is one solution:
library(dplyr)
df<-data.frame(x=rnorm(10))
df%>%mutate(percrank=rank(x)/length(x)*100)
x percrank
1 1.56254900 100
2 -0.52554968 10
3 0.16410991 70
4 0.95150575 80
5 0.01960002 60
6 -0.22860395 30
7 1.43025012 90
8 -0.15836126 40
9 -0.01150753 50
10 -0.39064474 20
This adds two list elements to the current list h.
The second list element is the percentile as you have it.
The third list element is an ordinal rank number.
h <- list(.9613971, .9705882, .9503676, .9632353, .9779412, .9852941, .9852941, .9816176)
# create percentiles
rnk1 <- rank(unlist(h)) / length(h)
# ordinal rank
rnk2 <- rank(unlist(rnk1))
# combine the original lists with the two additional elements
res <- mapply(c, h, rnk1, rnk2, SIMPLIFY=FALSE)
res
[[1]]
[1] 0.9613971 0.2500000 2.0000000
[[2]]
[1] 0.9705882 0.5000000 4.0000000
[[3]]
[1] 0.9503676 0.1250000 1.0000000
[[4]]
[1] 0.9632353 0.3750000 3.0000000
[[5]]
[1] 0.9779412 0.6250000 5.0000000
[[6]]
[1] 0.9852941 0.9375000 7.5000000
[[7]]
[1] 0.9852941 0.9375000 7.5000000
[[8]]
[1] 0.9816176 0.7500000 6.0000000
A lookup function by value:
perc.rank <- function(x, xo) {
  x[[match(xo, sapply(x, "[[", 1))]]
}
perc.rank(res, .9779412)
[1] 0.9779412 0.6250000 5.0000000
Which shows that .9779412 is ranked number 5

How do I use a while() loop to evaluate a matrix one row at a time until I find the row index at which the last column reaches a value < 0?

I am interested in finding the zero point of a multidimensional autocorrelation function.
I can generate the autocorrelation matrix from my data using
acm <- autocorr(x, 1:10)
However, the full matrix can be on the order of 20 x 5000, and this is computationally expensive.
I would therefore like to calculate only 1 or n rows at a time.
Here are the steps that I would like to take:
1. Calculate the first row of the matrix.
2. While any column still has all positive values, calculate and append the next row of the matrix to the rows already calculated.
3. Identify the row index at which the last column reaches a value below zero.
If this is the full matrix:
acm <- cbind( c(10, 9, 8, 7, 6, 5, 4, 3, 1, -1),
c(10, 8, 6, 5, 3, 1, -1, 1, -1, 0 ))
I want a function that will return 10 because the first col is the last to reach a negative value. If I calculated the full matrix first, the following would be sufficient:
max(which(apply(acm, 2, min)))
But I want to avoid calculating more of acm than needed, e.g. because often only 1 or a small fraction of the rows are necessary for the calculation.
There is a loop solution using the break statement. It is a hack that uses indices and a vector tt to keep track of which columns have already shown a negative value.
find.point <- function(x) {
  tt <- rep(FALSE, ncol(x))         # control vector tt
  for (i in 1:nrow(x)) {
    tt[which(x[i, ] < 0)] <- TRUE   # mark columns with a negative value
    if (all(tt)) break              # all columns have gone negative: exit
  }
  i                                 # return index
}
The output is the same as that of the oneliner
max(apply(acm < 0, 2, function(x) match(TRUE, x)))
which I believe you were trying to refer to in your question. I don't really see where your performance problem comes from; it depends on whether you have 5000 columns or 5000 rows. In any case, even with a dataset ten times as big, the calculation finishes within a second for me using the oneliner:
Timings:
> acm <- matrix(rep(seq.int(5000,-5999),100),ncol=22)
> dim(acm)
[1] 50000 22
> system.time(max(apply(acm<0,2,function(x) match(T,x))))
user system elapsed
0.05 0.00 0.05
> system.time(find.point(acm))
user system elapsed
0.05 0.00 0.05
Yet, timing is substantially improved with the function over the oneliner when you have many columns:
> acm <- matrix(rep(seq.int(5000,-5999),100),ncol=50000)
> dim(acm)
[1] 22 50000
> system.time(max(apply(acm<0,2,function(x) match(T,x))))
user system elapsed
0.85 0.01 0.86
> system.time(find.point(acm))
user system elapsed
0.03 0.00 0.04
Heck, you forced me to think out a for-loop solution that works faster than a oneliner. Cool question!
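If generating each row is the expensive part, the same bookkeeping can be wrapped around the row generator itself. A sketch only, assuming a hypothetical autocorr(x, i) call that returns just the row of autocorrelations for lag i:

find.point.lazy <- function(x, max_lag = 5000) {
  tt <- NULL
  for (i in 1:max_lag) {
    row <- autocorr(x, i)              # compute only this row (assumed API)
    if (is.null(tt)) tt <- rep(FALSE, length(row))
    tt[row < 0] <- TRUE                # mark columns that have gone negative
    if (all(tt)) return(i)             # the last column just reached < 0
  }
  NA_integer_                          # no zero point within max_lag rows
}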
I'm not sure exactly what your function is doing, but to answer the question "How can I find the last row of a dynamically generated matrix in which the value of the column goes below zero?":
findlastzero <- function(mat) {
  apply(mat < 0, 2, function(x) tail(which(x), 1))
}
set.seed(1)
a <- cbind(rnorm(10), rnorm(10), rnorm(10), rnorm(10)) + 0.5
a
[,1] [,2] [,3] [,4]
[1,] -0.1264538 2.0117812 1.41897737 1.85867955
[2,] 0.6836433 0.8898432 1.28213630 0.39721227
[3,] -0.3356286 -0.1212406 0.57456498 0.88767161
[4,] 2.0952808 -1.7146999 -1.48935170 0.44619496
[5,] 0.8295078 1.6249309 1.11982575 -0.87705956
[6,] -0.3204684 0.4550664 0.44387126 0.08500544
[7,] 0.9874291 0.4838097 0.34420449 0.10571005
[8,] 1.2383247 1.4438362 -0.97075238 0.44068660
[9,] 1.0757814 1.3212212 0.02184994 1.60002537
[10,] 0.1946116 1.0939013 0.91794156 1.26317575
findlastzero(a)
[1] 6 4 8 5
Not sure if that is what you are asking for, however.
Not sure if I understood your question correctly, but you can use tapply to slice into each row of the matrix and extract the info you want.
I first create a "grouping matrix" of the same size as your a.
This serves as the index for grouping each row to be fed as input into your lambda function.
matrix(rep(1:10,4),nrow=10,ncol=4)
I then run "tapply" on the original matrix with the grouping matrix. This subsets the matrix so that each row vector is passed into the function:
function(x) { return( x[which(x<0)] ) }
which simply returns all the values where value is less than zero per row.
> a
[,1] [,2] [,3] [,4]
[1,] 0.5341781 -0.9263866 -0.5380141 -1.2453310
[2,] 0.2931630 1.0490300 0.8127472 0.2473263
[3,] 1.0936143 -0.3399709 1.8199833 1.0053080
[4,] 1.0002433 0.2002659 1.7730118 1.7578414
[5,] 0.8116914 0.9371518 0.8727981 1.4236349
[6,] -0.1127914 1.1563594 1.0331311 0.7658510
[7,] -0.5423493 1.8905533 -0.8121652 0.1355076
[8,] -1.6589310 0.4081290 0.3560005 1.6043205
[9,] 1.8760435 0.8826245 1.4457357 0.7561550
[10,] -0.8503400 0.2302597 0.5838986 0.1252952
> matrix(rep(1:10,4),nrow=10,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
[4,] 4 4 4 4
[5,] 5 5 5 5
[6,] 6 6 6 6
[7,] 7 7 7 7
[8,] 8 8 8 8
[9,] 9 9 9 9
[10,] 10 10 10 10
> tapply(a, matrix(rep(1:10,4),nrow=10,ncol=4), function(x) { return(x[which(x<0)])})
$`1`
[1] -0.9263866 -0.5380141 -1.2453310
$`2`
numeric(0)
$`3`
[1] -0.3399709
$`4`
numeric(0)
$`5`
numeric(0)
$`6`
[1] -0.1127914
$`7`
[1] -0.5423493 -0.8121652
$`8`
[1] -1.658931
$`9`
numeric(0)
$`10`
[1] -0.85034
