Getting a weighted proportion in R - r

I have created a transition probability matrix using 3 states (A,B,C) as follows:
transition <-prop.table(with(data, table(data$old,
data$new)), 2)
For example, if you wanted to get the probability for A --> B, you would count the number of times you see B follow A and divide it by the number of times you see any state follow A. Now suppose that there is a certain weight/importance associated with each of the rows of data. How would I modify the above to get a weighted probability transition matrix?

You can do this...
transition <- prop.table(tapply(data$weight, list(data$old, data$new), sum), 2)
where data$weight is a column of weights for each row of data.
The tapply with length will reproduce what you have. Changing it to sum adds the weights for each combination rather than just counting them.

Related

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations so i have used the daisy package to calculate the gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package for the aggregated dataset however i want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is:, NULL or a vector with length size of d. See the ‘Details’ section. When you look at the details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For ward etc. you need to correctly take the weights into account yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For ward, you should get a SSQ like formula.

R: Putting Variables in order by a different variable

Once again I have been set another programming task and to most of which I have done, so a quick run through: I've had to take n amount of samples of multivariate normal distribution with dimension p (called it X) then to put it into a matrix (Matx) where the first two values in each row were taken and summed a long with a value randomly drawn from the standard normal distribution. (Call this vector Y) Then we had to order Y numerically and split it up into H groups, and then I had to find out the mean of each row in the matrix and now having to order then in terms of which Y group they were associated. I've struggled a fair bit and have now hit a brick wall. Quite confusing I understand, if anyone could help it'd be greatly appreciated!
Task:Return the pxH matrix which has in the first column the mean of the observations in the first group and in the Hth column the mean in the observations in the Hth group.
Code:
library('MASS')
x<-mvrnorm(36,0,1)
Matx<-matrix(c(x), ncol=6, byrow=TRUE)
v<-rnorm(6)
y1<-sum(x[1:2],v[1])
y2<-sum(x[7:8],v[2])
y3<-sum(x[12:13],v[3])
y4<-sum(x[19:20],v[4])
y5<-sum(x[25:26],v[5])
y6<-sum(x[31:32],v[6])
y<-c(y1,y2,y3,y4,y5,y6)
out<-order(y)
h1<-c(out[1:2])
h2<-c(out[3:4])
h3<-c(out[5:6])
x1<-c(x[1:6])
x2<-c(x[7:12])
x3<-c(x[13:18])
x4<-c(x[19:24])
x5<-c(x[25:30])
x6<-c(x[31:36])
mx1<-mean(x1)
mx2<-mean(x2)
mx3<-mean(x3)
mx4<-mean(x4)
mx5<-mean(x5)
mx6<-mean(x6)
d<-c(mx1,mx2,mx3,mx4,mx5,mx6)[order(out)]
d

Label or score outliers in R

I'm looking for some easy to use algorithms in R to label (outlier or not) or score (say, 7.5) outliers row-wise. Meaning, I have a matrix m that contains several rows and I want to identify rows who represent outliers compared to the other rows.
m <- matrix( data = c(1,1,1,0,0,0,1,0,1), ncol = 3 )
To illustrate some more, I want to compare all the (complete) rows in the matrix with each other to spot outliers.
Here's some really simple outlier detection (using either the boxplot statistics or quantiles of the data) that I wrote a few years ago.
Outliers
But, as noted, it would be helpful if you'd describe your problem with greater precision.
Edit:
Also you say you want row-wise outliers. Do you mean to say that you're interested in identifying whole rows vs observations within a variable (as is typically done)? If so, you'll want to use some sort of distance metric, though which metric you choose will depend on your data.

Looking for an efficient way to compute the variances of a multinomial distribution in R

I have an R matrix which dimensions are ~20,000,000 rows by 1,000 columns. The first column represents counts and the rest of the columns represent the probabilities of a multinomial distribution of these counts. So in other words, in each row the first column is n and the rest of the k columns are the probabilities of the k categories. Another point is that the matrix is sparse, meaning that in each row there are many columns with value of 0.
Here's a toy matrix I created:
mat=rbind(c(5,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),c(2,0.2,0.2,0.2,0.2,0.2,0,0,0,0,0),c(22,0.4,0.6,0,0,0,0,0,0,0,0),c(5,0.5,0.2,0,0.1,0.2,0,0,0,0,0),c(4,0.4,0.15,0.15,0.15,0.15,0,0,0,0,0),c(10,0.6,0.1,0.1,0.1,0.1,0,0,0,0,0))
What I'd like to do is obtain an empirical measure of the variance of the counts for each category. The natural thing that comes to mind is to obtain random draws and then compute the variances over them. Something like:
draws = apply(mat,1,function(x) rmultinom(samples,x[1],x[2:ncol(mat)]))
Where say samples=100000
Then I can run an apply over draws to compute the variances.
However, for my real data dimensions this will become prohibitive at least in terms of RAM. Is whether a more efficient solution in R to this problem?
If all you need is the variance of the counts, just compute it immediately instead of returning the intermediate simulated draws.
draws = apply(mat,1,function(x) var(rmultinom(samples,x[1],x[2:ncol(mat)])))

Getting the next observation from a HMM gaussian mixture distribution

I have a continuous univariate xts object of length 1000, which I have converted into a data.frame called x to be used by the package RHmm.
I have already chosen that there are going to be 5 states and 4 gaussian distributions in the mixed distribution.
What I'm after is the expected mean value for the next observation. How do I go about getting that?
So what I have so far is:
a transition matrix from running the HMMFit() function
a set of means and variances for each of the gaussian distributions in the mixture, along with their respective proportions, all of which was also generated form the HMMFit() function
a list of past hidden states relating to the input data when using the output of the HMMFit function and putting it into the viterbi function
How would I go about getting the next hidden state (i.e. the 1001st value) from what I've got, and then using it to get the weighted mean from the gaussian distributions.
I think I'm pretty close just not too sure what the next part is...The last state is state 5, do I use the 5th row in the transition matrix somehow to get the next state?
All I'm after is the weighted mean for what is to be expect in the next observation, so the next hidden state isn't even necessary. Do I multiply the probabilities in row 5 by each of the means, weighted to their proportion for each state? and then sum it all together?
here is the code I used.
# have used 2000 iterations to ensure convergence
a <- HMMFit(x, nStates=5, nMixt=4, dis="MIXTURE", control=list(iter=2000)
v <- viterbi(a,x)
a
v
As always any help would be greatly appreciated!
Next predicted value uses last hidden state last(v$states) to get probability weights from the transition matrix a$HMM$transMat[last(v$states),] for each state the distribution means a$HMM$distribution$mean are weighted by proportions a$HMM$distribution$proportion, then its all multiplied together and summed. So in the above case it would be as follows:
sum(a$HMM$transMat[last(v$states),] * .colSums((matrix(unlist(a$HMM$distribution$mean), nrow=4,ncol=5)) * (matrix(unlist(a$HMM$distribution$proportion), nrow=4,ncol=5)), m=4,n=5))

Resources