R: How to get a count for a certain value in a matrix row in R?

OK, I have the following problem:
I have a matrix of ranks in R (obtained by ranking asset returns: ranks >= 3 get an NA, ranks < 3 keep the rank number; if some assets share a rank, a row contains fewer NAs). Here are two example rows of the rank matrix and two example rows of a matrix with returns.
ranks<-matrix(c(1,1,2,NA,NA, 1,2,NA,NA,NA),nrow=2,ncol=5)
returns<-matrix(c(0.3,0.1,-0.5,-0.7,0.2,0.1,0.4,0.05,-0.7,-0.3),nrow=2,ncol=5)
Now, if all ranked assets are bought in equal amounts for our portfolio, I can calculate the average return with:
Mat.Ret<-returns*ranks
Mean.Ret<-rowMeans(Mat.Ret,na.rm=TRUE)
However, I want the option of supplying a vector of weights for the two ranks, where each weight says what percentage of the portfolio that particular rank should receive. As an example, we have a vector of
weights<-c(0.7,0.3)
Now how would I use this in my code? I basically want to calculate ranks * returns * weights. If there is only ONE rank 1 and ONE rank 2 in a row, the code works. But how do I make this flexible? One solution would be to count, for each rank, how many times it occurs in a particular row and then divide the weight by this count. Then I would multiply this "net weight" * rank * returns.
But I have no clue how to do this in code... any help?
UPDATE AFTER FIRST COMMENTS
OK, I want to keep it flexible so that the weights adjust depending on how many times a certain rank is given. A user can choose the top 5 ranked assets, so none or several assets may share ranks, and the distribution of weights must therefore be very flexible. I've programmed a function which doesn't work yet, since I'm obviously not yet experienced enough with the matrix and vector selection syntax. This is what I've got so far:
ranks <- apply(ranks, 1, function(x) distributeWeightsPerMatrixRow(x, weights))

distributeWeightsPerMatrixRow <- function(MatrixRow, Weights) {
  if (length(Weights) == length(MatrixRow[!is.na(MatrixRow)])) {
    MatrixRow <- Weights[MatrixRow]
  } else {
    for (i in 1:length(MatrixRow)) {
      if (!is.na(MatrixRow[i])) {
        EqWeights <- length(MatrixRow[MatrixRow == MatrixRow[i]])
        MatrixRow[i] <- sum(Weights[MatrixRow[i]:(MatrixRow[i] + EqWeights - 1)]) / EqWeights
      }
    }
  }
  return(MatrixRow)
}
EDIT2:
The function seems to work; however, the resulting ranks object is now the transposed version of the original matrix, without the column names.

Since your ranks are integers above zero, you can use this matrix to index the weights vector:
mat.weights <- weights[ranks]
mat.weighted.ret <- returns * ranks * mat.weights
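A small illustration of my own (not in the original answer): indexing a numeric vector with a matrix returns a plain vector in column-major order, and the element-wise multiplication above recycles it back over the matrix cells, so the result keeps the 2x5 shape.
weights[ranks]
# [1] 0.7 0.7 0.3  NA  NA 0.7 0.3  NA  NA  NA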
Update based on comment.
I suppose you're looking for something like this:
if (length(unique(na.omit(as.vector(ranks)))) == 1) {
  mat.weights <- (!is.na(ranks)) * 0.5
} else {
  mat.weights <- weights[ranks]
}
mat.weighted.ret <- returns * ranks * mat.weights
If there is only one rank, all weights become 0.5.
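Regarding the flexible weights in the update and the transposition mentioned in EDIT2, here is a rough sketch of my own (not part of the original answer; it treats the ranks purely as selectors rather than multipliers): split each rank's weight equally among the assets that share it, and transpose the apply() result back, since apply() over rows stacks its per-row results as columns.
# sketch only: equal split of each rank's weight among tied assets
net.weights <- t(apply(ranks, 1, function(r) {
  counts <- table(r)                                # how often each rank occurs (NAs dropped)
  weights[r] / as.numeric(counts[as.character(r)])  # NA ranks stay NA
}))
dimnames(net.weights) <- dimnames(ranks)            # restore names lost by apply()/t()
weighted.ret <- rowSums(returns * net.weights, na.rm = TRUE)
Note that if a rank does not appear in a row at all, its weight is simply not assigned; whether you want to redistribute it is up to you.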

Related

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are pulling out the matrices of interest and comparing them directly. The result of this bit of code is a TRUE/FALSE matrix, which R treats as a 0/1 matrix in arithmetic. We can then multiply it by whatever coefficient we want to represent that situation.
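For the full case in the question (99 permutations plus the observed matrix), one way to extend this, sketched here as my own suggestion rather than part of the original answer, is to stack the list into a three-dimensional array and rank each cell across the third dimension; ties get averaged ranks, matching the 1.5 convention above.
# sketch: per-cell rank of the observed matrix, assumed to be the last list element
arr <- simplify2array(ls)                     # dims: nrow x ncol x length(ls)
obs.rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
obs.rank                                      # matrix of ranks of the observed cells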

Forcing a discrete time series to be monotonically decreasing

I have a series of evaluations. Each evaluation can take discrete values ranging from 0 to 4. The series should decrease over time. However, since values are entered manually, errors can happen.
Therefore, I would like to modify my series to be monotonically decreasing. Moreover, I would like to minimize the number of evaluations modified. Finally, if two or more modified series satisfy these criteria, I would choose the one with the higher overall sum of values.
E.g.
Recorded evaluation
4332422111
Ideal evaluation
4332222111
Recorded evaluation
4332322111
Ideal evaluation
4333322111
(in this case, 4332222111 would have satisfied the criteria too, but I chose the one with the higher values)
I tried a brute-force approach by generating all possible combinations, selecting those that are monotonically decreasing, and finally comparing each of these with the recorded series.
However, a series can be up to 20 evaluations long, so there would be far too many combinations.
x1 <- c(4,3,3,2,4,2,2,1,1,1)
x2 <- c(4,3,3,2,3,2,2,1,1,1)
You could almost certainly break this algorithm, but here's a first try: replace locations with increased values by NA, then fill them in with the previous location.
dfun <- function(x) {
  r <- replace(x, which(c(0, diff(x)) > 0), NA)
  zoo::na.locf(r)
}
dfun(x1)
dfun(x2)
This gives the "less-ideal" answer in the second case.
For the record, I also tried
dfun2 <- function(x) {
  s <- as.stepfun(isoreg(-x))
  -s(seq_along(x))
}
but this doesn't handle the first example as desired.
You could also try to do this with discrete programming (about which I know almost nothing), or a slightly more sophisticated form of brute force: use a stochastic algorithm that strongly penalizes non-monotonicity and weakly penalizes the distance from the initial sequence, e.g. optim(..., method = "SANN") with a candidate function that adds or subtracts 1 from a randomly chosen element.
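To make the stochastic idea concrete, here is a rough sketch of my own (the penalty constants are arbitrary and would need tuning):
# sketch: simulated annealing that strongly penalizes increases, weakly penalizes
# changed elements, and slightly rewards a higher overall sum
fit_monotone <- function(x0, iters = 20000) {
  obj <- function(x) {
    1000 * sum(pmax(diff(x), 0)) +   # strong penalty for any increase
      sum(x != x0) -                 # weak penalty per modified evaluation
      0.001 * sum(x)                 # tiny preference for higher values
  }
  cand <- function(x, ...) {         # candidate move: nudge one element by +/- 1
    i <- sample(length(x), 1)
    x[i] <- max(0, min(4, x[i] + sample(c(-1, 1), 1)))
    x
  }
  optim(x0, obj, gr = cand, method = "SANN", control = list(maxit = iters))$par
}
fit_monotone(x2)
There is no guarantee this finds the optimal sequence, but with enough iterations it tends to return a monotone series close to the input.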

Expected value of the difference between a sum of variables and a threshold

I have a custom deck consisting of eight copies of each card in the sequence 2^n, n = 0, ..., 6 (56 cards in total). I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement an R function that calculates the mean of the difference between the sum and the threshold?
I tried to do it using this: How to store values in a vector with nested functions,
but it takes ages... I think there is a way to do it with probabilities/simulations, but I can't figure it out.
The threshold could be greater than the value of any single card (e.g. threshold = 500) or less than the value of a single card (e.g. threshold = 50).
What I have done so far is to find all the subsets that meet the condition that the sum is greater than or equal to the threshold. Then I only subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several hours with the set containing the 56 numbers and it is still running.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <- function(x, index, current, threshold, result) {
  for (i in index:length(x)) {
    if (current + x[i] >= threshold) {
      store <<- append(store, sum(c(result, x[i])))
    } else {
      recursive.subset(x, i + 1, current + x[i], threshold, c(result, x[i]))
    }
  }
}

store <- vector()
inivector <- vector(mode = "numeric", length = 0) # initializing empty vector
threshold <- 500                                  # e.g. the threshold to reach
recursive.subset(set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set, thr) {
  x <- cumsum(sample(set))
  value <- x[min(which(x >= thr))]
  value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: rather than drawing cards one at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the first point where the cumulative sum reaches the threshold or more, and take the corresponding value of x. We run this function many times with replicate to obtain a vector of the values, and use mean(a - thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the dissimilarity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between two rows (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index), set dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all dissimilarity values and select the 10 maximum ones. But the result would be a 100 * 100 matrix, so a more efficient method may be needed if there are a large number of rows. This is just my thought and may not be right, so I need help.
[update]
After looking through some literature, I found the following definition of the maximum dissimilarity method.
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares; that works fine for us,
# since the order will still be the same
dists <- apply(mat, 1, function(row) sum((row - center)^2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids, ]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
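If you do want the iterative maximum-dissimilarity selection described in the update, a rough sketch of my own interpretation (using Euclidean distance and a max-min criterion) could look like this:
# sketch: start from a random row, then repeatedly add the row whose minimum
# distance to the already-selected rows is largest
max_dissim_select <- function(mat, n.select = 10) {
  d <- as.matrix(dist(mat))                      # pairwise Euclidean distances
  selected <- sample(nrow(mat), 1)               # random starting record
  while (length(selected) < n.select) {
    remaining <- setdiff(seq_len(nrow(mat)), selected)
    min.d <- apply(d[remaining, selected, drop = FALSE], 1, min)
    selected <- c(selected, remaining[which.max(min.d)])
  }
  selected
}
most.dissimilar <- mat[max_dissim_select(mat, 10), ]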

General code for calling a row that works for a matrix or a vector

Is there a general method for calling a row when you do not know whether you'll be referencing a matrix or a vector?
I want to subset results to those with accuracy greater than .5 and then select the row of the subset with the highest sensitivity. I repeat this process many times in a loop. The problem I'm running into is that in some runs of the model many rows of the results have accuracy greater than .5 and in some runs only one row has accuracy greater than .5.
To select that row, I've written the following code.
# Subset matrix to just rows with accuracy greater than .5
acc_ID = which(new_data[,"accuracy"] >= 0.5)
new_data2 = new_data[acc_ID,]
## Identify which row has the highest sensitivity
max_sensitivity_ID = which(new_data2[,"sensitivity"] == max(new_data2[,"sensitivity"]))[1]
The problem comes from the last line: if there is only one row with accuracy > .5, then new_data2 drops down to a vector and I need to remove the commas.
Note: this is a big data situation and I'm not uploading a replicable data example. I figure that someone out there will know a general method for calling a row without replicating the problem.
Use drop=FALSE to ensure new_data2 is always a matrix.
new_data2 = new_data[acc_ID,,drop=FALSE]
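A quick illustration with toy data (my own example, not from the original post) of why this matters:
# without drop = FALSE, subsetting a single row silently returns a vector
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("accuracy", "sensitivity")))
class(m[1, ])                            # "integer" -- dimensions dropped
class(m[1, , drop = FALSE])              # still a matrix
m[1, , drop = FALSE][, "sensitivity"]    # column indexing keeps working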
