How to replace ties with NA in R

I am working on a function to return the column name of the largest value for each row. Something like:
colnames(x)[apply(x,1,which.max)]
However, before applying a function like this, is there a straightforward and general way to replace ties with NA (or some other arbitrary placeholder)?
I have the following matrix:
0 1
[1,] 5.000000e-01 0.5000000000
[2,] 9.901501e-01 0.0098498779
[3,] 9.981358e-01 0.0018641935
[4,] 9.996753e-01 0.0003246823
[5,] 9.998598e-01 0.0001402322
[6,] 1.303731e-02 0.9869626938
[7,] 1.157919e-03 0.9988420815
[8,] 6.274074e-07 0.9999993726
[9,] 1.659164e-07 0.9999998341
[10,] 6.517362e-08 0.9999999348
[11,] 8.951474e-06 0.9999910485
[12,] 5.070740e-06 0.9999949293
[13,] 1.278186e-07 0.9999998722
[14,] 9.914646e-08 0.9999999009
[15,] 7.058751e-08 0.9999999294
[16,] 2.847667e-09 0.9999999972
[17,] 1.675766e-08 0.9999999832
[18,] 2.172290e-06 0.9999978277
[19,] 4.964820e-06 0.9999950352
[20,] 1.333680e-07 0.9999998666
[21,] 2.087793e-07 0.9999997912
[22,] 2.358360e-06 0.9999976416
The first row has equal values for both variables, which I would like to replace with NA. While this is simple for this particular example, I want to be able to replace all ties with NA wherever they occur in a matrix of any size, i.e. in this matrix:
1 2 3
[1,] 0.25 0.25 0.5
[2,] 0.3 0.3 0.3
all values would be replaced with NA except for [1,3]
I have looked at the function which.max.simple(), which can deal with ties by returning NA, but it no longer appears to work, and the other methods of dealing with ties I have found do not address my issue.
I hope that makes sense
Thanks,
C

Here's a simple approach to replace any row-wise duplicated values with NA in a matrix m:
is.na(m) <- t(apply(m, 1, FUN = function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}))
But consider the following notes:
1) be extra careful when comparing floating point numbers for equality (see Why are these numbers not equal?);
2) depending on your ultimate goal, there may be simpler ways than replacing the duplicated values in your data (since it seems that you are only interested in column names); and
3) if you are going to replace values in a numeric matrix, don't use arbitrary characters for replacement since that will convert your whole matrix to character class (replacement with NA is not a problem)
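To connect this back to the original goal of extracting column names, here is a minimal end-to-end sketch (the column names "a", "b", "c" and the object name top are just illustrative, not from the question). For floating-point data you may want to round first, e.g. with round(m, 8), so that near-ties are treated as ties:
m <- rbind(c(0.25, 0.25, 0.5),
           c(0.30, 0.30, 0.30))
colnames(m) <- c("a", "b", "c")

# blank out every value that occurs more than once within its row
is.na(m) <- t(apply(m, 1, function(x) duplicated(x) | duplicated(x, fromLast = TRUE)))

# which.max() skips NAs, so guard rows that became entirely NA
top <- apply(m, 1, function(x) if (all(is.na(x))) NA_character_ else colnames(m)[which.max(x)])
top
# [1] "c" NA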

Related

filling columns of a matrix by the values of outputs of different functions

I want to fill each column of an empty matrix with values produced by different functions. I want to use many functions, so speed is important. I have prepared a small example of what I want to do, but I cannot get it to work.
I have an empty matrix whose columns I want to fill with the outputs of these functions. The matrix has a fixed number of columns, and each column has a specific name:
mat<-matrix(ncol = 4)
colnames(mat)<-c("binomial","normal","gamma","exponential")
Then, considering a vector which includes some colnames of this matrix:
remove<-c("gamma","exponential")
I want to fill the columns of this matrix with random values drawn from each distribution, with the condition that any column whose name appears in the remove object must be dropped and not computed.
I wrote this:
mat<-mat[,-which(colnames(mat) %in% remove) ]
mat[,1]<-rnbinom(10, mu = 4, size = 1)
mat[,2]<-rnorm(10)
mat[,3]<-rgamma(10, 0.001)
mat[,4]<-rexp(10)
The final matrix I am looking for is something like this:
binomial normal
1 -0.54948696
6 -0.53396115
1 0.69918478
13 0.92824442
0 0.03331125
I would be very grateful for your kind help.
Here is a method that constructs a function. The random generators are stored in a list, and the subset of them not named in remove is fed to sapply.
randMatGet <- function(sampleSize = 10, remove = NULL) {
  randFuncs <- list("binomial"    = function(x) rnbinom(x, mu = 4, size = 1),
                    "normal"      = function(x) rnorm(x),
                    "gamma"       = function(x) rgamma(x, 0.001),
                    "exponential" = function(x) rexp(x))
  sapply(randFuncs[setdiff(names(randFuncs), remove)], function(f) f(sampleSize))
}
Now, call the function
set.seed(1234)
randMatGet()
binomial normal gamma exponential
[1,] 0 0.375635612 0.000000e+00 1.45891992
[2,] 1 0.310262167 0.000000e+00 1.43920743
[3,] 1 0.005006950 3.099691e-294 2.76404158
[4,] 5 -0.037630263 7.540715e-249 0.02316716
[5,] 0 0.723976061 0.000000e+00 0.89394340
[6,] 0 -0.496738863 0.000000e+00 3.68036715
[7,] 0 0.011395161 0.000000e+00 2.90720399
[8,] 4 0.009859946 9.088837e-34 0.13015222
[9,] 10 0.678271423 0.000000e+00 0.81417829
[10,] 0 1.029563029 0.000000e+00 2.01986489
and then with remove
# reset seed for comparison
set.seed(1234)
randMatGet(remove=remove)
binomial normal
[1,] 0 0.375635612
[2,] 1 0.310262167
[3,] 1 0.005006950
[4,] 5 -0.037630263
[5,] 0 0.723976061
[6,] 0 -0.496738863
[7,] 0 0.011395161
[8,] 4 0.009859946
[9,] 10 0.678271423
[10,] 0 1.029563029
To allow for adjusting the distribution parameters, change the function as follows. This is an example for the mu argument of rnbinom().
randMatGet <- function(sampleSize = 10, remove = NULL, mu = 4) {
  randFuncs <- list("binomial"    = function(x) rnbinom(x, mu = mu, size = 1),
                    "normal"      = function(x) rnorm(x),
                    "gamma"       = function(x) rgamma(x, 0.001),
                    "exponential" = function(x) rexp(x))
  sapply(randFuncs[setdiff(names(randFuncs), remove)], function(f) f(sampleSize))
}
Now, you can do randMatGet(mu=1).
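If you need to override several parameters across different distributions, one possible generalisation (a sketch of my own, with hypothetical names randMatGet2, params, and defaults, not part of the answer above) is to keep per-distribution default arguments in a list and merge user overrides with modifyList():
randMatGet2 <- function(sampleSize = 10, remove = NULL, params = list()) {
  randFuncs <- list(binomial    = rnbinom,
                    normal      = rnorm,
                    gamma       = rgamma,
                    exponential = rexp)
  defaults  <- list(binomial    = list(mu = 4, size = 1),
                    normal      = list(),
                    gamma       = list(shape = 0.001),
                    exponential = list())
  keep <- setdiff(names(randFuncs), remove)
  sapply(keep, function(nm) {
    # merge any user-supplied arguments for this distribution into its defaults
    args <- modifyList(defaults[[nm]],
                       if (is.null(params[[nm]])) list() else params[[nm]])
    do.call(randFuncs[[nm]], c(list(sampleSize), args))
  })
}

# e.g. change the negative binomial mean and the normal standard deviation:
set.seed(1234)
randMatGet2(params = list(binomial = list(mu = 1), normal = list(sd = 2)))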

R: calling a matrix value of column 2 dependent on the value of column 1

I admit that I am totally new to R and have a few beginner's problems;
my problem is the following:
I have quite a long matrix TEST of length 5000 with 2 columns (column 1 = time; column 2 = concentration of a species).
I want to use the right concentration values for calculation of propensities in stochastic simulations.
I already have an algorithm that gives me the simulation time t_sim; what I need is a line of code that returns the corresponding concentration value at t = t_sim.
Also, the time vector might have a large step size, so t_sim would have to be rounded up to the next larger time value in order to look up the respective concentration.
I know this is probably quite an easy problem, but I really do not see the solution in R.
Best wishes and many thanks,
Arne
Without sample data this answer is kind of a shot in the dark, but I think that this might work:
t_conc <- TEST[which.min(abs(t_sim-TEST[,1])),2]
where TEST is the matrix with two columns as described in the OP and the output t_conc is the concentration that corresponds to the value of time in the matrix that is closest to the input value t_sim.
Here's another shot in the dark:
set.seed(1);
N <- 20;
test <- matrix(c(sort(sample(100,N)), rnorm(N,0.5,0.2)), N,
               dimnames=list(NULL, c('time','concentration')));
test;
## time concentration
## [1,] 6 0.80235623
## [2,] 16 0.57796865
## [3,] 19 0.37575188
## [4,] 20 0.05706002
## [5,] 27 0.72498618
## [6,] 32 0.49101328
## [7,] 34 0.49676195
## [8,] 37 0.68876724
## [9,] 43 0.66424424
## [10,] 57 0.61878026
## [11,] 58 0.68379547
## [12,] 61 0.65642726
## [13,] 62 0.51491300
## [14,] 63 0.10212966
## [15,] 67 0.62396515
## [16,] 83 0.48877425
## [17,] 86 0.46884090
## [18,] 88 0.20584952
## [19,] 89 0.40436999
## [20,] 97 0.58358831
t_sim <- 39;
test[findInterval(t_sim,test[,'time']),'concentration'];
## concentration
## 0.6887672
Note that findInterval() returns the index of the lesser time value if t_sim falls between two time values, as my example shows. If you want the greater, you need a bit more work:
i <- findInterval(t_sim,test[,'time']);
if (test[i,'time'] != t_sim && i < nrow(test)) i <- i+1;
test[i,'concentration'];
## concentration
## 0.6642442
If you want the nearest, see R: find nearest index.
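For the "nearest" case, here is a small sketch along the lines of the first answer above (essentially the which.min() approach), applied to the same example matrix test and t_sim:
i <- which.min(abs(test[,'time'] - t_sim));  ## index of the closest time value
test[i,'concentration'];                     ## nearest time is 37, giving 0.6887672 in this example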

Visualization of multi-dimensional data clusters in R

For a set of documents, I have a feature matrix of size 30 x 32, where rows represent documents and columns represent features; so, basically, 30 documents with 32 features each. After running a PSO algorithm, I have been able to find some cluster centroids (which I am not, at the moment, sure are optimal), each of which is a row vector of length 32. I also have a column vector of size 30 x 1 which shows the centroid each document has been assigned to, so index one of this vector contains the index of the centroid to which document 1 has been assigned, and so on. This assignment is obtained by computing the Euclidean distance of each document from the centroids.
I wanted to get some hints on whether there is a way in R to plot this multidimensional data in the form of clusters. Is there a way, for example, to either collapse these dimensions to 1-D, or somehow show them in a graph that is a bit prettier to look at? I have been reading about Multidimensional Scaling. So far, what I understand is that it is a way to reduce multi-dimensional data to lower dimensions, which does seem to be what I want. So, I tried it with this code (centroids[[3]] is a 4 x 32 matrix and represents the 4 centroids):
points <- features.dataf[2:ncol(features.dataf)]
row.names(points) <- features.dataf[,1]
fit <- cmdscale(points, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlab="Coordinate 1", ylab="Coordinate 2", main="Clustering Text Based on PSO", type="n")
text(x, y, labels = row.names(points), cex=.7)
It gives me this error:
Error in cmdscale(pointsPlusCentroids, eig = TRUE, k = 2) :
distances must be result of 'dist' or a square matrix
However, it does seem to produce a plot, although the pch = 19 point symbols do not appear, only the text labels.
In addition to the above, I want to color the points so that documents in cluster 1 get one color, those in cluster 2 a different color, and so on. Is there any way to do this if I have a column vector of centroid assignments like this:
[,1]
[1,] 1
[2,] 3
[3,] 1
[4,] 4
[5,] 1
[6,] 4
[7,] 3
[8,] 4
[9,] 4
[10,] 4
[11,] 2
[12,] 2
[13,] 2
[14,] 2
[15,] 1
[16,] 2
[17,] 1
[18,] 4
[19,] 2
[20,] 4
[21,] 1
[22,] 1
[23,] 1
[24,] 1
[25,] 1
[26,] 3
[27,] 4
[28,] 1
[29,] 4
[30,] 1
Could anyone please help me with this, or suggest any other way to plot multi-dimensional clusters like these? Thank you!
As cmdscale() needs distances, try cmdscale(dist(points), eig = TRUE, k = 2). The symbols do not appear because of type = "n" in your plot() call. For coloring the text by cluster, use: text(x, y, rownames(points), cex = 0.6, col = centroids)
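Putting those fixes together, a sketch of the full call might look like this (assuming points is your feature data frame and centroids is the 30 x 1 assignment vector from the question):
fit <- cmdscale(dist(points), eig = TRUE, k = 2)   # classical MDS on a distance matrix
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, col = centroids,              # one colour per cluster
     xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "Clustering Text Based on PSO")
text(x, y, labels = rownames(points), cex = 0.6, col = centroids, pos = 3)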

An easy solution in R? Binding several numbered data frames

I have 408 "Spatial Points" Data Frames, each numbered as follows:
Points1, Points2,..., Points408.
What I want is to loop through all of them, adding them to a list. I would then like to row bind them to create one huge Spatial Points file.
When I type "Points1" into R, this is printed:
x y
[1,] 40.38285 -11.54500
[2,] 38.41897 -13.55959
[3,] 38.51536 -12.42431
[4,] 38.82389 -12.95476
[5,] 39.88932 -12.77925
[6,] 39.86099 -13.32380
[7,] 38.47942 -14.10968
[8,] 39.85796 -11.84176
[9,] 38.16891 -13.70572
[10,] 39.89386 -12.21040
[11,] 38.32758 -14.03576
[12,] 39.97627 -11.97127
[13,] 38.50884 -14.07678
[14,] 39.06521 -12.19818
[15,] 39.40532 -13.68988
Please note that each Points data frame has the same number of points.
What I want is one huge file that has 15*408=6120 Spatial Points.
Thanks so much!
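A minimal sketch of the approach described in the question (not a posted answer): gather the objects by name with mget() and stack them with do.call(rbind, ...). If they are plain coordinate matrices as printed above, base rbind() is enough; if they are true sp SpatialPoints objects, the sp package also provides rbind methods for its Spatial classes.
point_list <- mget(paste0("Points", 1:408))   # list of the 408 objects, looked up by name
all_points <- do.call(rbind, point_list)      # stack them row-wise
nrow(all_points)                              # 15 * 408 = 6120 rows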

R select entire columns where at least one value meets a condition

I have a large matrix, ~300 rows and 200000 cols. I want to shrink this down by selecting the entire columns that have at least one value that is > 0.5 or less than -0.5 (not just that particular value). I would like to keep the row and column names. I was able to get a matrix of TRUE/FALSE by doing tmp <- mymat > 0.5 | mymat < -0.5. I want to extract all columns that have at least one TRUE in them. I tried simply mymat[tmp], but this just returns a vector of the values that meet the condition. How can I get the actual columns of the original matrix? Thanks.
Try this:
> set.seed(007) # for the example being reproducible
> X <- matrix(rnorm(100), 20) # generating some data
> X <- cbind(X, runif(20, max=.48)) # generating a column with all values < 0.5
> colnames(X) <- paste('col', 1:ncol(X), sep='') # some column names
> X # this is how the matrix looks like
col1 col2 col3 col4 col5 col6
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350 0.335107187
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236 0.419502015
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842 0.346358090
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429 0.212185020
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563 0.224824248
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865 0.415837389
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592 0.057660111
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588 0.007812921
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487 0.298192099
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672 0.216225091
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434 0.026097800
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474 0.190567072
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169 0.402829397
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598 0.248196976
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899 0.406511129
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739 0.162457572
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204 0.383801555
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132 0.347037954
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899 0.262938992
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971 0.139233120
>
> # defining a index for selecting if the condition is met
> ind <- apply(X, 2, function(X) any(abs(X)>0.5))
> X[,ind] # since col6 only has values less than 0.5 it is not taken
col1 col2 col3 col4 col5
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971
# It could be done just in one step avoiding 'ind'
X[, apply(X, 2, function(X) any(abs(X)>0.5))]
An addition to Jilber's answer for the case when only one column remains after filtering:
X[, apply(X, 2, function(X) any(abs(X)>0.5)), drop=FALSE]
Without the drop = FALSE argument, the remaining column will be converted to a vector and you will lose the column name information.
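For a matrix as wide as the one in the question (~200,000 columns), a colSums()-based variant of the same idea may be a bit faster than apply(); here is a sketch, with keep as a hypothetical name:
keep <- colSums(abs(mymat) > 0.5, na.rm = TRUE) > 0   # TRUE for columns with at least one |value| > 0.5
mymat[, keep, drop = FALSE]                           # row and column names are preserved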
