SVD of a very large matrix in R

I have a 60,000 x 60,000 matrix in a txt file and I need to compute its SVD. I use R, but I don't know whether R can handle a matrix of that size.

I think it's possible to compute a (partial) SVD using the irlba package together with bigmemory and bigalgebra, without using a lot of memory.
First, let's create a 20000 x 20000 matrix and save it to a file:
require(bigmemory)
require(bigalgebra)
require(irlba)

con <- file("mat.txt", open = "a")
replicate(20, {
  x <- matrix(rnorm(1000 * 20000), nrow = 1000)
  write.table(x, file = "mat.txt", append = TRUE,
              row.names = FALSE, col.names = FALSE)
})
file.info("mat.txt")$size
## [1] 7.264e+09   # about 7.3 GB
close(con)
Then you can read this matrix using bigmemory::read.big.matrix:
bigm <- read.big.matrix("mat.txt", sep = " ",
                        type = "double",
                        backingfile = "mat.bk",
                        backingpath = "/tmp",
                        descriptorfile = "mat.desc")
str(bigm)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slots
## ..# address:<externalptr>
dim(bigm)
## [1] 20000 20000
bigm[1:3, 1:3]
##            [,1]     [,2]     [,3]
## [1,] -0.3623255 -0.58463 -0.23172
## [2,] -0.0011427  0.62771  0.73589
## [3,] -0.1440494 -0.59673 -1.66319
Now we can use the excellent irlba package, as explained in its vignette.
The first step consists of defining a matrix multiplication operator that works with big.matrix objects; then we call the irlba::irlba function.
### vignette("irlba", package = "irlba")   # for more info
matmul <- function(A, B, transpose = FALSE) {
  ## bigalgebra requires matrix/vector arguments
  if (is.null(dim(B))) B <- cbind(B)
  if (transpose)
    return(cbind((t(B) %*% A)[]))
  cbind((A %*% B)[])
}
dim(bigm)
system.time(
  S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
)
##    user  system elapsed
## 169.820   0.923 170.194
str(S)
## List of 5
## $ d : num [1:2] 283 283
## $ u : num [1:20000, 1:2] -0.00615 -0.00753 -0.00301 -0.00615 0.00734 ...
## $ v : num [1:20000, 1:2] 0.020086 0.012503 0.001065 -0.000607 -0.006009 ...
## $ iter : num 10
## $ mprod: num 310
I forgot to set the seed to make it reproducible, but I just wanted to show that it's possible to do this in R.
EDIT
If you are using a recent version of the irlba package, the above code throws an error because the matmul parameter of the irlba function has been renamed to mult. You should therefore change this part of the code
S <- irlba(bigm, nu = 2, nv = 2, matmul = matmul)
to
S <- irlba(bigm, nu = 2, nv = 2, mult = matmul)
I want to thank @FrankD for pointing this out.

In R 3.x+ you can construct a matrix of that size; the upper limit on vector length is now 2^53 (or maybe 2^53 - 1) elements, up from 2^31 - 1 as it was before, which is why Andrie's out-of-date installation was throwing an error. A numeric element takes 8 bytes of storage; call it 10 per element to be conservative. At any rate:
> 2^53 < 10*60000^2
[1] FALSE # so you are safe on that account.
It would also fit in 64GB (but not in 32GB):
> 64000000000 < 10*60000^2
[1] FALSE
Generally to do any serious work you need at least 3 times the size of your largest object, so this seems pretty borderline even with the new expanded vectors/matrices.
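As a back-of-the-envelope sketch (my own arithmetic, not part of the original answer), the raw storage for a dense double-precision 60,000 x 60,000 matrix works out roughly as follows:
n <- 60000
bytes <- 8 * n^2     # 8 bytes per double for plain dense storage
bytes / 2^30         # roughly 27 GiB for the matrix alone
3 * bytes / 2^30     # roughly 80 GiB of headroom by the 3x rule of thumb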


fill NA raster cells using focal defined by boundary

I have a raster and a shapefile. The raster contains NAs and I am filling them using the focal function:
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NAs still left, so I do this:
na_count <- terra::freq(r_fill, value = NA)
while (na_count$count != 0) {
  r_fill <- terra::focal(r_fill, 5, mean, na.policy = "only", na.rm = TRUE)
  na_count <- terra::freq(r_fill, value = NA)
}
Once all NAs are filled, I clip the raster again using the shapefile:
r_fill <- terra::crop(r_fill, v, mask = TRUE, touches = TRUE)
This is what my before and after looks like:
I wonder whether the while loop is an efficient way to fill the NAs, or, more basically, how to determine how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while() altogether by making a better estimate of focal()'s w= argument in a world where r, the ground truth, isn't available. Were it available, we could readily derive a direct value for w:
r <- rast(system.file("ex/elev.tif", package="terra"))
# and its variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
  layer value count
1     0    NA   256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=. Introducing another variant:
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate an edge length for w=? Turning to boundaries() with its default (inner) values:
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, the sum[s] and freq[s] suggest that while the total area of the holes (let's call them that) is the same, they differ in number, and the holes in r2 are generally larger than those in r3. This is also clear from the first pair of freq[s].
Now we drift into some voodoo, hocus pocus, in pursuit of a better edge estimate:
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
sqrt(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
Taking the larger, i.e. the freq(r2_bi ratio of 7.652087:
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, this is maybe still a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and it perhaps obviates the need for while().
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # 4 squares of 8, with 3 extra length-8 runs
Finally, bringing in v and filling:
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1 , mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1 , mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while() approach, as your r2 looks better, more naturally transitioned, though the r3 looks fine. In my few, brief experiments with a window smaller than the 'hole', i.e. focal(r2, 9, ...), I got the sense it would take 2 passes to fill, which suggests focal(r2, 5, ...) would take 4.
I guess it would be worthwhile to further determine the fill:hole:raster proportions for deciding when to deploy a while loop; a helper wrapping the while-loop approach is sketched below for convenience.
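Here is that minimal sketch: the question's while loop wrapped in a helper with an iteration cap. fill_na_focal() and max_iter are made-up names, not part of terra.
library(terra)

## repeatedly apply focal() until no NA cells remain or max_iter is reached
fill_na_focal <- function(r, w = 5, max_iter = 25) {
  i <- 0
  while (any(is.na(values(r))) && i < max_iter) {
    r <- focal(r, w, mean, na.policy = "only", na.rm = TRUE)
    i <- i + 1
  }
  r
}

## usage, with the objects from the question:
## r_fill <- fill_na_focal(r, w = 5)
## r_fill <- crop(r_fill, v, mask = TRUE, touches = TRUE)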

Combinatorial optimization with discrete options in R

I have a function of five variables that I want to maximize using only a specific set of values for each variable.
Are there any methods in R that can do this, other than brute force (e.g. Particle Swarm Optimization, Genetic Algorithms, greedy search, etc.)? I have read about a few packages, but they seem to generate their own candidate values within a given range. I am only interested in optimizing over the set of options provided.
Here is a simplified version of the problem:
#Example of 5 variable function to optimize
Fn <- function(x){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #Float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
#Full combination
FullCombn = expand.grid(Vars)
Results = data.frame(I = as.numeric(), Sum = as.numeric())
for (i in 1:nrow(FullCombn)){
  ParsI = FullCombn[i, ]
  ResultI = Fn(ParsI)
  Results = rbind(Results, c(I = i, Sum = ResultI))
}
#Best iteration (Largest result)
Best = Results[Results[, 2] == max(Results[, 2]), ]
#Best parameters
FullCombn[Best$I, ]
Two more possibilities. Both minimize by default, so I flip the sign in your objective function (i.e. return -SUM).
#Example of 5 variable function to optimize
Fn <- function(x, ...){
  a = x[1]
  b = x[2]
  c = x[3]
  d = x[4]
  e = x[5]
  SUM = a + b + c + d + e
  return(-SUM)
}
#Parameters for variables to optimize
Vars = list(
  As = c(seq(1.5, 3, by = 0.3)),        #Float
  Bs = c(1, 2),                         #Binary
  Cs = c(seq(1, 60, by = 10)),          #Integer
  Ds = c(seq(60, -60, length.out = 5)), #Negative
  Es = c(1, 2, 3)
)
First, a grid search. It does exactly what you did, just more conveniently, and the implementation also lets you distribute the evaluations of the objective function.
library("NMOF")
gridSearch(fun = Fn,
           levels = Vars)[c("minfun", "minlevels")]
## 5 variables with 6, 2, 6, 5, ... levels: 1080 function evaluations required.
## $minfun
## [1] -119
##
## $minlevels
## [1]  3  2 51 60  3
An alternative: a simple Local Search. You start with a valid initial guess and then move randomly through the feasible solutions. The key ingredient is the neighbourhood function: it picks one element at random and then, again at random, sets this element to one of its allowed values.
nb <- function(x, levels, ...) {
  i <- sample(length(levels), 1)
  x[i] <- sample(levels[[i]], 1)
  x
}
(There would be better algorithms for neighbourhood functions; but this one is simple and so demonstrates the idea well.)
LSopt(Fn, list(x0 = c(1.8, 2, 11, 30, 2),  ## a feasible initial solution
               neighbour = nb,
               nI = 200                    ## iterations
               ),
      levels = Vars)$xbest
## Local Search.
## ...
## Best solution overall: -119
## [1]  3  2 51 60  3
(Disclosure: I am the maintainer of package NMOF, which provides functions gridSearch and LSopt.)
In response to the comment, a few remarks on Local Search and the neighbourhood function above (nb). Local Search, as implemented in LSopt, starts with an arbitrary solution and then changes that solution slightly. This new solution, called a neighbour, is compared (by its objective-function value) to the old solution. If the new solution is better, it becomes the current solution; otherwise it is rejected and the old solution remains the current one. Then the algorithm repeats, for a number of iterations. So, in short, Local Search is not random sampling, but a guided random walk through the search space. It is guided because only better solutions get accepted and worse ones get rejected. In this sense, LSopt will narrow down on good parameter values. A bare-bones version of this accept/reject loop is sketched below.
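To make that loop concrete, here is a bare-bones sketch of the accept/reject idea (an illustration only, not the actual LSopt implementation; local_search() and its argument names are made up). It assumes a minimising objective fn, a neighbourhood function nb, a feasible start x0 and nI iterations:
local_search <- function(fn, x0, nb, nI, ...) {
  xc <- x0                  # current solution
  fc <- fn(xc, ...)         # its objective value
  for (k in seq_len(nI)) {
    xn <- nb(xc, ...)       # propose a random neighbour
    fnew <- fn(xn, ...)
    if (fnew <= fc) {       # accept if not worse ...
      xc <- xn
      fc <- fnew
    }                       # ... otherwise keep the current solution
  }
  list(xbest = xc, fbest = fc)
}
## e.g., with the objects defined above:
## local_search(Fn, x0 = c(1.8, 2, 11, 30, 2), nb = nb, nI = 200, levels = Vars)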
The implementation of the neighbourhood is not ideal, for two reasons. The first is that a solution may not be changed at all, since I sample from all feasible values, including the current one. For a small set of possible values, as here, it will quite often happen that the same value is drawn again. For larger search spaces, however, this inefficiency is typically negligible, since the probability of re-drawing the current value becomes small, often so small that the additional code for testing whether the solution has changed costs more than the occasionally wasted iteration.
A second thing could be improved, albeit through a more complicated function; again, for this small problem it does not matter. In the current neighbourhood, an element is picked and then set to any feasible value, which means that changes from one solution to the next might be large. Instead of picking any feasible value of the As, in realistic problems it will often be better to pick a value close to the current one: for example, when you are at 2.1, move to 1.8 or 2.4, but not to 3.0. (This reasoning is only relevant, of course, if the variable in question is on a numeric or at least ordinal scale.)
Ultimately, which implementation works well can be tested only empirically. Many more details are in this tutorial.
Here is one alternative implementation. A solution is now a vector of positions into the original values: e.g. if x[1] is 2, it "points" to 1.8 (the second value of As); if x[3] is 2, it points to 11; and so on.
## precompute lengths of vectors in Vars
lens <- lengths(Vars)
nb2 <- function(x, lens, ...) {
  i <- sample(length(lens), 1)
  if (x[i] == 1L) {
    x[i] <- 2
  } else if (x[i] == lens[i]) {
    x[i] <- lens[i] - 1
  } else
    x[i] <- x[i] + sample(c(1, -1), 1)
  x
}
## the objective function now needs to map the
## indices in x back to the levels in Vars
Fn2 <- function(x, levels, ...){
  y <- mapply(`[`, levels, x)
  ## => same as
  ## y <- numeric(length(x))
  ## y[1] <- Vars[[1]][x[1]]
  ## y[2] <- Vars[[2]][x[2]]
  ## ....
  SUM <- sum(y)
  return(-SUM)
}
xbest <- LSopt(Fn2,
               list(x0 = c(1, 1, 1, 1, 1),  ## an initial solution
                    neighbour = nb2,
                    nI = 200                ## iterations
                    ),
               levels = Vars,
               lens = lens)$xbest
## Local Search.
## ....
## Best solution overall: -119

## map the solution back to the values
mapply(`[`, Vars, xbest)
## As Bs Cs Ds Es
##  3  2 51 60  3
Here is a genetic algorithm solution with the package GA.
The key is to write a decode function that enforces the constraints; see the package vignette.
library(GA)
#> Loading required package: foreach
#> Loading required package: iterators
#> Package 'GA' version 3.2.2
#> Type 'citation("GA")' for citing this R package in publications.
#>
#> Attaching package: 'GA'
#> The following object is masked from 'package:utils':
#>
#> de
decode <- function(x) {
  As <- Vars$As
  Bs <- Vars$Bs
  Cs <- Vars$Cs
  Ds <- rev(Vars$Ds)
  # fix real variable As
  i <- findInterval(x[1], As)
  if (x[1L] - As[i] < As[i + 1L] - x[1L])
    x[1L] <- As[i]
  else x[1L] <- As[i + 1L]
  # fix binary variable Bs
  if (x[2L] - Bs[1L] < Bs[2L] - x[2L])
    x[2L] <- Bs[1L]
  else x[2L] <- Bs[2L]
  # fix integer variable Cs
  i <- findInterval(x[3L], Cs)
  if (x[3L] - Cs[i] < Cs[i + 1L] - x[3L])
    x[3L] <- Cs[i]
  else x[3L] <- Cs[i + 1L]
  # fix integer variable Ds
  i <- findInterval(x[4L], Ds)
  if (x[4L] - Ds[i] < Ds[i + 1L] - x[4L])
    x[4L] <- Ds[i]
  else x[4L] <- Ds[i + 1L]
  # fix the other, integer variable
  x[5L] <- round(x[5L])
  setNames(x, c("As", "Bs", "Cs", "Ds", "Es"))
}
Fn <- function(x){
  x <- decode(x)
  # a <- x[1]
  # b <- x[2]
  # c <- x[3]
  # d <- x[4]
  # e <- x[5]
  # SUM <- a + b + c + d + e
  SUM <- sum(x, na.rm = TRUE)
  return(SUM)
}
#Parameters for variables to optimize
Vars <- list(
  As = seq(1.5, 3, by = 0.3),        # Float
  Bs = c(1, 2),                      # Binary
  Cs = seq(1, 60, by = 10),          # Integer
  Ds = seq(60, -60, length.out = 5), # Negative
  Es = c(1, 2, 3)
)
res <- ga(type = "real-valued",
          fitness = Fn,
          lower = c(1.5, 1, 1, -60, 1),
          upper = c(3, 2, 51, 60, 3),
          popSize = 1000,
          seed = 123)
summary(res)
#> ── Genetic Algorithm ───────────────────
#>
#> GA settings:
#> Type = real-valued
#> Population size = 1000
#> Number of generations = 100
#> Elitism = 50
#> Crossover probability = 0.8
#> Mutation probability = 0.1
#> Search domain =
#> x1 x2 x3 x4 x5
#> lower 1.5 1 1 -60 1
#> upper 3.0 2 51 60 3
#>
#> GA results:
#> Iterations = 100
#> Fitness function value = 119
#> Solutions =
#> x1 x2 x3 x4 x5
#> [1,] 2.854089 1.556080 46.11389 49.31045 2.532682
#> [2,] 2.869408 1.638266 46.12966 48.71106 2.559620
#> [3,] 2.865254 1.665405 46.21684 49.04667 2.528606
#> [4,] 2.866494 1.630416 46.12736 48.78017 2.530454
#> [5,] 2.860940 1.650015 46.31773 48.92642 2.521276
#> [6,] 2.851644 1.660358 46.09504 48.81425 2.525504
#> [7,] 2.855078 1.611837 46.13855 48.62022 2.575492
#> [8,] 2.857066 1.588893 46.15918 48.60505 2.588992
#> [9,] 2.862644 1.637806 46.20663 48.92781 2.579260
#> [10,] 2.861573 1.630762 46.23494 48.90927 2.555612
#> ...
#> [59,] 2.853788 1.640810 46.35649 48.87381 2.536682
#> [60,] 2.859090 1.658127 46.15508 48.85404 2.590679
apply(res@solution, 1, decode) |> t() |> unique()
#> As Bs Cs Ds Es
#> [1,] 3 2 51 60 3
Created on 2022-10-24 with reprex v2.0.2

In the as.party function how can I clarify which are the indices for the different nodes?

After creating my CART with rpart, I proceed to convert it to a party object with the as.party function from the partykit package. The following error appears:
as.party(tree.hunterpb1)
Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
‘index’ has less than two elements
I can only assume that it's referring to the partitioning made by factor variables, as I've understood from the literature, since the index applies to factors. My tree looks like this:
tree.hunterpb1
n= 354
node), split, n, deviance, yval
* denotes terminal node
1) root 354 244402.100 75.45134
2) hr.11a14>=49.2125 19 3378.322 33.44274 *
3) hr.11a14< 49.2125 335 205592.400 77.83391
6) month=April,February,June,March,May 141 58656.390 68.57493 *
7) month=August,December,January,July,November,October,September 194 126062.800 84.56338
14) presion.11a14>=800.925 91 74199.080 81.32755
28) month=January,November,October 16 9747.934 63.13394 *
29) month=August,December,July,September 75 58025.190 85.20885 *
15) presion.11a14< 800.925 103 50069.100 87.42223 *
The traceback shows that the first partition's conversion to the party class is done correctly, but the second one, based on the factor variable, fails and produces said error.
Previously, when working on similar data, this error did not appear. I can only assume that the as.party function isn't finding the indices. Any advice on how to solve this will be appreciated.
Possibly, the problem is caused by the following situation. (Thanks to Yan Tabachek for e-mailing me a similar example.) If one of the partitioning variables passed on to rpart() is a character variable, then it is processed as if it were a factor by rpart() but not by the conversion in as.party(). As a simple example consider this small data set:
d <- data.frame(y = c(1:10, 101:110))
d$x <- rep(c("a", "b"), each = 10)
Fitting the rpart() tree treats the character variable x as a factor:
library("rpart")
(rp <- rpart(y ~ x, data = d))
## n= 20
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20 50165.0 55.5
## 2) x=a 10 82.5 5.5 *
## 3) x=b 10 82.5 105.5 *
However, the as.party() conversion does not work:
library("partykit")
as.party(rp)
## Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
## 'index' has less than two elements
The best fix is to transform x to a factor variable and re-fit the tree. Then the conversion also works smoothly:
d$x <- factor(d$x)
rp <- rpart(y ~ x, data = d)
as.party(rp)
## Model formula:
## y ~ x
##
## Fitted party:
## [1] root
## | [2] x in a: 5.500 (n = 10, err = 82.5)
## | [3] x in b: 105.500 (n = 10, err = 82.5)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
I also added a fix in the development version of partykit on R-Forge to avoid the problem in the first place. It will be included in the next CRAN release (probably 1.0-1 for which a release date has not yet been scheduled).

Why does the calculation of Cohen's kappa fail across different packages on this contingency table?

I have a contingency table for which I would like to calculate Cohen's kappa, the level of agreement. I have tried using three different packages, which all seem to fail to some degree. The package e1071 has a function specifically for a contingency table, but that too seems to fail. Below is reproducible code; you will need to install the packages concord, e1071, and irr.
# Recreate my contingency table, output with dput
conf.mat<-structure(c(810531L, 289024L, 164757L, 114316L), .Dim = c(2L,
2L), .Dimnames = structure(list(landsat_2000_bin = c("0", "1"
), MOD12_2000_binForest = c("0", "1")), .Names = c("landsat_2000_bin",
"MOD12_2000_binForest")), class = "table")
library(concord)
cohen.kappa(conf.mat)
library(e1071)
classAgreement(conf.mat, match.names=TRUE)
library(irr)
kappa2(conf.mat)
The output I get from running this is:
> cohen.kappa(conf.mat)
Kappa test for nominally classified data
4 categories - 2 methods
kappa (Cohen) = 0 , Z = NaN , p = NaN
kappa (Siegel) = -0.333333 , Z = -0.816497 , p = 0.792892
kappa (2*PA-1) = -1
> classAgreement(conf.mat, match.names=TRUE)
$diag
[1] 0.6708459
$kappa
[1] NA
$rand
[1] 0.5583764
$crand
[1] 0.0594124
Warning message:
In ni[lev] * nj[lev] : NAs produced by integer overflow
> kappa2(conf.mat)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 2
Raters = 2
Kappa = 0
z = NaN
p-value = NaN
Could anyone advise on why these might fail? I have a large dataset, but as this table is simple I didn't think that could cause such problems.
In the first function, cohen.kappa, you need to specify that you are using count data and not just an n*m matrix of n subjects and m raters:
cohen.kappa(conf.mat, 'count')
The second function is much more tricky. For some reason, your matrix is stored as integer rather than numeric (double). Integers can't hold really big numbers, so when you multiply two of your big counts together, it fails. For example:
i=975288
j=1099555
class(i)
# [1] "numeric"
i*j
# 1.072383e+12
as.integer(i)*as.integer(j)
# [1] NA
# Warning message:
# In as.integer(i) * as.integer(j) : NAs produced by integer overflow
So you need to convert your matrix to numeric (double) rather than integer storage:
classAgreement(matrix(as.numeric(conf.mat), nrow = 2))
Finally, take a look at the documentation for ?kappa2. It requires an n*m matrix of individual ratings (n subjects, m raters), as explained above; it just won't work with your (efficient) contingency-table structure.
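If you do want kappa2 on these data anyway, one option (a sketch of my own, not from the irr documentation; counts and ratings are made-up object names) is to expand the table of counts into that subject-by-rater layout:
## one row per subject, one column per rater (about 1.4 million rows here)
counts <- as.data.frame(conf.mat)
ratings <- counts[rep(seq_len(nrow(counts)), counts$Freq), 1:2]
library(irr)
kappa2(ratings)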
Do you need to know specifically why those fail? Here is a function that computes the statistic directly (written in a hurry, so I might clean it up later; see the kappa wiki):
kap <- function(x) {
  ## observed agreement: proportion of cases on the diagonal
  a <- (x[1, 1] + x[2, 2]) / sum(x)
  ## agreement expected by chance, from the marginal proportions
  e <- (sum(x[1, ]) / sum(x)) * (sum(x[, 1]) / sum(x)) +
       (1 - (sum(x[1, ]) / sum(x))) * (1 - (sum(x[, 1]) / sum(x)))
  (a - e) / (1 - e)
}
Tests/output:
> (x = matrix(c(20,5,10,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 20 5
[2,] 10 15
> kap(x)
[1] 0.4
> (x = matrix(c(45,15,25,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 45 15
[2,] 25 15
> kap(x)
[1] 0.1304348
> (x = matrix(c(25,35,5,35), nrow=2, byrow=T))
[,1] [,2]
[1,] 25 35
[2,] 5 35
> kap(x)
[1] 0.2592593
> kap(conf.mat)
[1] 0.1258621

Calculating the Mode for Nominal as well as Continuous variables in [R]

Can anyone help me with this?
If I run:
> mode(iris$Species)
[1] "numeric"
> mode(iris$Sepal.Width)
[1] "numeric"
Then I get "numeric" as the answer.
Cheers
M
The function mode() is used to find out the storage mode of an object; in this case it is stored as mode "numeric". This function is not used to find the most frequently observed value in a data set, i.e. it is not used to find the statistical mode. See ?mode for more on what this function does in R and why it isn't useful for your problem.
For discrete data, the mode is the most frequently observed value in the set:
> set.seed(1) ## reproducible example
> dat <- sample(1:5, 100, replace = TRUE) ## dummy data
> (tab <- table(dat)) ## tabulate the frequencies
dat
1 2 3 4 5
13 25 19 26 17
> which.max(tab) ## which is the mode?
4
4
> tab[which.max(tab)] ## what is the frequency of the mode?
4
26
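If you need this often, you can wrap the idea in a small helper. stat_mode() is a made-up name, not a base R function, and unlike which.max() it returns every value tied for the highest count:
stat_mode <- function(x) {
  tab <- table(x)                # tabulate frequencies
  names(tab)[tab == max(tab)]    # keep all values with the maximal count
}
stat_mode(dat)                   # "4" for the dummy data tabulated above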
For continuous data, the mode is the value of the data at which the probability density function (PDF) reaches a maximum. As your data are generally a sample from some continuous probability distribution, we don't know the PDF but we can estimate it through a histogram or better through a kernel density estimate.
Returning to the iris data, here is an example of determining the mode from continuous data:
> sepalwd <- with(iris, density(Sepal.Width)) ## kernel density estimate
> plot(sepalwd)
> str(sepalwd)
List of 7
$ x : num [1:512] 1.63 1.64 1.64 1.65 1.65 ...
$ y : num [1:512] 0.000244 0.000283 0.000329 0.000379 0.000436 ...
$ bw : num 0.123
$ n : int 150
$ call : language density.default(x = Sepal.Width)
$ data.name: chr "Sepal.Width"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> with(sepalwd, which.max(y)) ## which value has maximal density?
[1] 224
> with(sepalwd, x[which.max(y)]) ## use the above to find the mode
[1] 3.000314
See ?density for more info. By default, density() evaluates the kernel density estimate at n = 512 equally spaced locations. If this is too crude for you, increase the number of locations evaluated and returned:
> sepalwd2 <- with(iris, density(Sepal.Width, n = 2048))
> with(sepalwd2, x[which.max(y)])
[1] 3.000314
In this case it doesn't alter the result.
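If you want the density-based estimate as a reusable function, here is a minimal sketch; density_mode() is a made-up name, and it returns only the single highest peak of the estimated density:
density_mode <- function(x, ...) {
  d <- density(x, ...)    # kernel density estimate
  d$x[which.max(d$y)]     # location of its highest peak
}
density_mode(iris$Sepal.Width)   # approximately 3.0, as above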
See ?mode: mode() gives you the storage mode. If you want the value with the maximum count, then use table():
> Sample <- sample(letters[1:5],50,replace=T)
> tmp <- table(Sample)
> tmp
Sample
a b c d e
9 12 9 7 13
> tmp[which(tmp==max(tmp))]
e
13
Please read the help files if a function is not doing what you think it should.
Some extra explanation:
max(tmp) is the maximum of tmp.
tmp == max(tmp) gives a logical vector of the same length as tmp, indicating whether each value is equal to max(tmp).
which(tmp == max(tmp)) returns the indices of the values in the vector that are TRUE. You then use these indices to select the value(s) in tmp with the maximum count.
See the help files ?which, ?max and the introductory manuals for R. A compact alternative is sketched below.
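For this particular step there is also a shorter equivalent, with the caveat that which.max() returns only the first maximum, so it differs from the which(tmp == max(tmp)) form when there are ties:
tmp[which.max(tmp)]
## or, for just the modal value itself:
names(which.max(tmp))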
If you want to know the mode of a continuous random variable, I recently released the package ModEstM. In addition to the method proposed by Gavin Simpson, it addresses the case of multimodal variables.
For example, if you study the sample:
> x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))
Which is clearly bimodal, you get the answer:
> ModEstM::ModEstM(x2)
[[1]]
[1] 0.8634313 0.1752347
