functions for matrices / cluster analysis - r

I have the following problem. Maybe you can help me!
I have 60 matrices (60 trials). Each of those matrices has 16*1000 fields (16 angles and 1000 timestamps). The 16 angles are body angles.
Now I want to calculate the Euclidean distance for each of the pairwise combinations (1770). So I would get 1770 matrices which are 16*1000 fields big.
I would get the list of all combinations through this call:
>comb <- combinations(n=60, r=2, v=1:60, set=TRUE, repeats.allowed=FALSE) # combinations() is from the gtools package
The formula which I want to apply to each of these combinations is:
> dab <- sqrt(sum((a-b)^2)) # a and b are two matrices
I tried to create a function, which unfortunately works only for single values, not for whole matrices:
>dist.fun <- function(x,y)
>{
>z <- sqrt(sum((x)-(y))^2)
> return(z)
>}
Out of those distances I want to create a Euclidean distance matrix to do a cluster analysis.
>plot(hclust((as.dist(m)),method="ward.D2")) # m is the Euclidean distance matrix
I hope someone can help me with this problem. The data is biomechanical data from gymnasts, which I want to investigate in terms of variant and invariant components and prototypes.
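One possible way to approach this, as a minimal sketch: since sqrt(sum((a-b)^2)) treats each matrix as one long vector, every trial can be flattened into a single row and passed to base R's dist(), which returns all 1770 pairwise distances at once. The list name `trials` and the random data are assumptions made for illustration only; they do not come from the question.

```r
# Assumption: the 60 trial matrices are stored in a list called `trials`
# (hypothetical name); random numbers stand in for the real angle data.
set.seed(42)
trials <- replicate(60, matrix(rnorm(16 * 1000), nrow = 16), simplify = FALSE)

# Flatten each 16 x 1000 matrix into one row of length 16000
flat <- t(sapply(trials, as.vector))   # 60 x 16000

# dist() computes sqrt(sum((a - b)^2)) for every pair of rows
d <- dist(flat, method = "euclidean")

# Cross-check one pair against the formula from the question
dab <- sqrt(sum((trials[[1]] - trials[[2]])^2))
check <- unname(as.matrix(d)[1, 2])

# Hierarchical clustering on the resulting distance object
hc <- hclust(d, method = "ward.D2")
# plot(hc)
```

Each pairwise distance is a single number here, so the result is one 60 x 60 distance matrix rather than 1770 separate matrices.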

Related

for loop for scalar product with a matrix

I'm trying to fill a 10 x 1500 matrix with a loop.
I have to fill that matrix with 150 small 10 x 10 matrices. I have tried to implement this with a double loop, but without success. My problem is that each 10*10 matrix is the result of a scalar product.
At the beginning it seemed easy, but then I realized I couldn't figure out how to fit the 150 small 10*10 matrices into the 10 x 1500 matrix.
Here is what I did:
es_var is a 1 x 150 matrix, which I converted to a vector to simplify the scalar product (at least in my opinion).
diax is a 10 x 10 matrix.
I want to multiply each value of the es_var vector by the whole 10*10 diax matrix.
I am having trouble because I can't get R to fill one 10*10 block at a time. Thus in the end I get a 10*1500 matrix, but it is the same 10*10 matrix repeated 150 times.
Here is my code
es_var1 = as.vector(es_var)
v = matrix(0, 10, 10*N)
for (i in 1:N){
v[,] = es_var1[i] * diax
}
Can somebody help me figure this out, please? I spent the whole day trying. I also need to do this without using built-in functions, since it is a small part of a big math demonstration I have to implement.
If I understand your requirement correctly, you can accomplish this with the following line:
v <- matrix(diax,10,1500)*rep(es_var1,each=100);
This constructs a 10x1500 matrix with the 10x10 diax matrix as the initial values, cycled sufficiently to cover the complete 10x1500 size. Then, to apply the es_var1 multiplication, you can replicate each of its elements 100 times, such that they will naturally align with each consecutive 10x10 small matrix during vectorized multiplication.
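As a small sketch of why the original loop repeats one block: `v[,] = ...` overwrites the whole matrix on every iteration, so the loop needs to address only the 10 columns belonging to block i. The example below fixes the loop that way and compares it with the one-liner above. The contents of `diax` and `es_var1` are made-up stand-ins, not the asker's data.

```r
set.seed(1)
N <- 150
diax <- matrix(1:100, 10, 10)   # assumption: stand-in for the real 10 x 10 matrix
es_var1 <- runif(N)             # assumption: stand-in for the real vector

# Loop version: write block i into columns (i-1)*10+1 ... i*10
v_loop <- matrix(0, 10, 10 * N)
for (i in 1:N) {
  cols <- ((i - 1) * 10 + 1):(i * 10)
  v_loop[, cols] <- es_var1[i] * diax
}

# Vectorized version from the answer above
v_vec <- matrix(diax, 10, 10 * N) * rep(es_var1, each = 100)

same <- isTRUE(all.equal(v_loop, v_vec))
```

The vectorized form works because R stores matrices column by column, so each run of 100 consecutive elements corresponds to one 10 x 10 block.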

compute matrix distance using dynamic programming

I have a matrix composed of the values 0, 1, and 2. 99% of the values are 0. The matrix has 1 million rows and 700 columns. There is at least one non-zero value in each row.
I need to compute the distance between each pair of columns using this formula for the distance between columns x and y:
D = Sum(|xi-yi|) / (2L), summed over i from 1 to L, where L = 1 million, i.e. the number of rows.
I wrote a piece of R code but it's taking too long to compute. Is it possible to use dynamic programming to do it faster? Here is my code:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
abs.dist=function(x){return(abs(x[1]-x[2]))}
for(i in 1:(nCols-1)){
for(j in (i+1):nCols){
d1=apply(mac[,c(i,j)],1,abs.dist)
k=sum(d1)/(2*nRows)
distMat[i,j]=k
distMat[j,i]=k
}
}
for(i in 1:nCols) distMat[i,i]=0
Thanks a lot for any help.
I will just summarize what is in the comments already:
#mac is the matrix
nCols=ncol(mac)
nRows=nrow(mac)
#the pairwise distance matrix
distMat=matrix(data=-1,nrow=nCols,ncol=nCols)
for(i in 1:(nCols-1)){
for(j in (i+1):nCols){
d1=abs(mac[,i]-mac[,j])
k=sum(d1)/(2*nRows)
distMat[i,j]=k
distMat[j,i]=k
}
}
diag(distMat) <- 0
This is approximately 100 times faster for a 2000x500 matrix.
It took about half a minute for a 1e6x700 matrix.
Computing a distance matrix for n columns means (n^2-n)/2 pairwise comparisons (244,650 for 700 columns), each over all rows. I'm not surprised it is taking a while.
Since you need all pairs, these calculations have to be done independently. Dynamic programming helps when you build the solution from smaller parts; everything here is independent, so DP won't help (as far as I know).
You said most entries are 0. Try looking at a sparse matrix library. This blog post may give you some ideas for doing this in R.
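As a further sketch, not taken from the answers above: the sum of absolute differences between two columns is the Manhattan distance, so base R's dist() on the transposed matrix gives the same numbers as the double loop. The toy matrix below is an assumption for illustration (much smaller than 1e6 x 700), and the approach does require a transposed copy of the data in memory.

```r
set.seed(123)
nRows <- 2000
nCols <- 50
# Assumption: toy data, roughly 99% zeros with a few 1s and 2s
mac <- matrix(sample(0:2, nRows * nCols, replace = TRUE,
                     prob = c(0.99, 0.005, 0.005)),
              nrow = nRows, ncol = nCols)

# Loop version, as in the answer above
distMat <- matrix(0, nCols, nCols)
for (i in 1:(nCols - 1)) {
  for (j in (i + 1):nCols) {
    k <- sum(abs(mac[, i] - mac[, j])) / (2 * nRows)
    distMat[i, j] <- k
    distMat[j, i] <- k
  }
}

# dist() works on rows, so transpose to compare columns
distMat2 <- as.matrix(dist(t(mac), method = "manhattan")) / (2 * nRows)

same <- isTRUE(all.equal(distMat, distMat2, check.attributes = FALSE))
```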

Generate many sample pairs from normal distribution

I am trying to learn how to use R for statistics, and I would like to know how I can generate 20 000 pairs (K pairs) of samples, each sample with 50 points from the same normal distribution (mean 2.5 and variance 9).
So far I know that this is how I make 50 points from a normal distribution:
rnorm(50,2.5,3)
But how do I generate 20 000 times a set of two samples so I can perform tests on the K pairs later?
x <- lapply(c(1:20000),
function(x){
lapply(c(1:2), function(y) rnorm(50,2.5,3))
})
This produces 20000 paired samples, where each sample is composed of 50 observations from a N(2.5,3^2) distribution. Note that x is a list where each slot is a list of two vectors of length 50.
To t-test the samples, you'll need to extract the vectors and pass them to the function t.test.
t.tests <- lapply(x, function(y) t.test(x=y[[2]], y=y[[1]]))
Something along the lines of
yourresults <- replicate(20000,{yourtest(matrix(rnorm(100,2.5,3),nc=2),<...>)})
or
yourresults <- replicate(20000,{yourtest(rnorm(50,2.5,3),rnorm(50,2.5,3),<...>)})
where yourtest is whatever your function is that's carrying out some test, and <...> is whatever other arguments you pass to yourtest. The first one is suitable if it expects a matrix with two columns, the second is suitable if it expects two vectors. You can adapt this approach to other forms of input - such as a formula interface - in the obvious way.
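To make the replicate() pattern concrete, here is a minimal sketch that uses t.test() as the test and keeps only the p-value from each pair. The choice of t.test and the reduced K are assumptions for a quick demonstration; the question asks for K = 20000.

```r
set.seed(2024)
K <- 1000   # assumption: reduced for a quick run; the question uses 20000

# One p-value per pair of samples drawn from N(2.5, 3^2)
pvals <- replicate(K, {
  a <- rnorm(50, mean = 2.5, sd = 3)
  b <- rnorm(50, mean = 2.5, sd = 3)
  t.test(a, b)$p.value
})

# Share of pairs rejected at the 5% level
reject.rate <- mean(pvals < 0.05)
```

Because both samples come from the same distribution, the rejection rate should land near the nominal 5% level.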

Multiple Matrix Operations in R with loop based on matrix name

I'm a novice R user, who's learning to use this coding language to deal with data problems in research. I am trying to understand how knowledge evolves within an industry by looking at patenting in subclasses. So far I managed to get the following:
# kn.matrices<-with(patents, table(Class,year,firm))
# kn.ind <- with(patents, table(Class, year))
patents is my datafile, with Subclass, app.yr, and short.name as three of the 14 columns
# for (k in 1:37)
# kn.firms = assign(paste("firm", k ,sep=''),kn.matrices[,,k])
There are 37 different firms (in the real dataset, here only 5)
This has given 37 firm-specific and 1 industry-specific 2635 by 29 matrices (in the real dataset). All firm-specific matrices are called firmk with k going from 1 until 37.
I would like to perform many operations in each of the firm-specific matrices (e.g. compare the numbers in app.yr 't' with the average of the 3 previous years across all rows) so I am looking for a way that allows me to loop the operations for every matrix named firm1,firm2,firm3...,firm37 and that generates new matrices with consistent naming, e.g. firm1.3yearcomparison
Hopefully I framed this question in an appropriate way. Any help would be greatly appreciated.
Following comments I'm trying to add a minimal reproducible example
year<-c(1990,1991,1989,1992,1993,1991,1990,1990,1989,1993,1991,1992,1991,1991,1991,1990,1989,1991,1992,1992,1991,1993)
firm<-(c("a","a","a","b","b","c","d","d","e","a","b","c","c","e","a","b","b","e","e","e","d","e"))
class<-c(1900,2000,3000,7710,18000,19000,36000,115000,212000,215000,253600,383000,471000,594000)
These three vectors thus represent columns in a spreadsheet that forms the "patents" matrix mentioned before.
It looks like you already have a 3-dimensional array with all your data. You can basically view this as your firm matrices all piled one on top of the other. You don't want to split this into separate matrices and use loops. Instead, you can use R's apply function and extraction functions. Just view the help topic on the apply() family and it should show you how to do what you want. Here are a few basic examples:
# returns the sums of all columns for all matrices
apply(kn.matrices, 3, colSums)
# extract the 5th row of all matrices
kn.matrices[5, , ]
# extract the 5th column of all matrices
kn.matrices[, 5, ]
# extract the 5th matrix
kn.matrices[, , 5]
# mean of 5th column for all matrices
colMeans(kn.matrices[, 5, ])
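As a rough sketch of how the 3-year comparison from the question might look with apply(), without splitting the array: the function below is applied to the year series of every class/firm combination. The toy `patents` data frame and the name `comparison` are assumptions for illustration, and "difference from the mean of the 3 previous years" is only one possible reading of the comparison the asker wants.

```r
set.seed(7)
# Assumption: toy data; factors with fixed levels keep the table dimensions stable
patents <- data.frame(
  Class = factor(sample(c(1900, 2000, 3000), 200, replace = TRUE),
                 levels = c(1900, 2000, 3000)),
  year  = factor(sample(1989:1993, 200, replace = TRUE), levels = 1989:1993),
  firm  = factor(sample(letters[1:5], 200, replace = TRUE), levels = letters[1:5])
)
kn.matrices <- with(patents, table(Class, year, firm))   # classes x years x firms

# For every class (dim 1) and firm (dim 3), take the vector of yearly counts
# and subtract the mean of the 3 preceding years from each year t >= 4
comparison <- apply(kn.matrices, c(1, 3), function(x) {
  n <- length(x)
  prev3 <- sapply(4:n, function(t) mean(x[(t - 3):(t - 1)]))
  x[4:n] - prev3
})

dim(comparison)   # (years - 3) x classes x firms
```

The result for a single firm can then be extracted with `comparison[, , k]`, which avoids creating separately named objects such as firm1, firm2, and so on.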

R looping over two vectors

I have created two vectors in R, using statistical distributions to build the vectors.
The first is a vector of locations on a string of length 1000. That vector has around 10 values and is called mu.
The second vector is a list of numbers, each one representing the number of features at each location mentioned above. This vector is called N.
What I need to do is generate a random distribution for all features (N) at each location (mu)
After some fiddling around, I found that this code works correctly:
for (i in 1:length(mu)){
a <- rnorm(N[i],mu[i],20)
feature.location <- c(feature.location,a)
}
This produces the right output - a list of numbers of length sum(N), and each number is a location figure which correlates with the data in mu.
I found that this only worked when I used concatenate to get the values into a vector.
My question is: why does this code work? How does R know to loop sum(N) times but for each position in mu? What role does concatenate play here?
Thanks in advance.
To try and answer your question directly, c(...) is not "concatenate", it's "combine". That is, it combines its argument list into a vector. So c(1,2,3) is a vector with 3 elements.
Also, rnorm(n,mu,sigma) is a function that returns a vector of n random numbers sampled from the normal distribution. So at each iteration, i,
a <- rnorm(N[i],mu[i],20)
creates a vector a containing N[i] random numbers sampled from Normal(mu[i],20). Then
feature.location <- c(feature.location,a)
adds the elements of that vector to the vector from the previous iteration. So at the end, you have a vector with sum(N) elements. The loop itself runs only length(mu) times; it is rnorm that produces N[i] values in each iteration.
I guess you're sampling from a series of locations, each a variable number of times.
I'm guessing your data looks something like this:
set.seed(1) # make reproducible
N <- ceiling(10*runif(10))
mu <- sample(seq(1000), 10)
> N;mu
[1] 3 4 6 10 3 9 10 7 7 1
[1] 206 177 686 383 767 496 714 985 377 771
Now you want to take a sample from rnorm of length N[i], with mean mu[i] and sd=20, and store all the results in a vector.
The method you're using (growing the vector) is not recommended, as the vector is re-copied in memory each time elements are added (see Circle 2), although for small examples like this it's not so important.
First, initialize the storage vector:
f.l <- NULL
for (i in 1:length(mu)){
a <- rnorm(n=N[i], mean=mu[i], sd=20)
f.l <- c(f.l, a)
}
Then, each time, a stores your sample of length N[i] and c() combines it with the existing f.l by adding it to the end.
A more efficient approach is
unlist(mapply(rnorm, N, mu, MoreArgs=list(sd=20)))
This replaces the explicit loop. unlist() is used because mapply() returns a list of vectors of varying lengths.
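A short sketch comparing the approaches on the example data above. Two points here are additions rather than part of the answer: `SIMPLIFY = FALSE` is passed so that mapply() always returns a list (by default it would simplify to a matrix if all N happened to be equal), and the last line shows that rnorm() itself accepts a vector of means, which removes the need for mapply() entirely.

```r
set.seed(1)
N  <- ceiling(10 * runif(10))
mu <- sample(seq(1000), 10)

# Loop version: grow the vector one sample at a time
set.seed(99)
f.loop <- NULL
for (i in seq_along(mu)) {
  f.loop <- c(f.loop, rnorm(n = N[i], mean = mu[i], sd = 20))
}

# mapply version: one rnorm() call per (N[i], mu[i]) pair, then flatten
set.seed(99)
f.mapply <- unlist(mapply(rnorm, N, mu,
                          MoreArgs = list(sd = 20), SIMPLIFY = FALSE))

# Single call: repeat each mean N[i] times and draw everything at once
f.vec <- rnorm(sum(N), mean = rep(mu, times = N), sd = 20)

same <- isTRUE(all.equal(f.loop, f.mapply))
```

With the same seed, the loop and the mapply() version draw the same values in the same order, since both call rnorm() once per location.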
