What's Julia's solution to R's factor concept? - julia

Factors are a type of vector in R for which the elements are
categorical values that could also be ordered. The values are stored
internally as integers with labeled levels.
# In R:
> x = c( "high" , "medium" , "low" , "high" , "medium" )
> xf = factor( x )
> xf
[1] high medium low high medium
Levels: high low medium
> as.numeric(xf)
[1] 1 3 2 1 3
> xfo = factor( x , levels=c("low","medium","high") , ordered=TRUE )
> xfo
[1] high medium low high medium
Levels: low < medium < high
> as.numeric(xfo)
[1] 3 2 1 3 2
I checked the Julia documentation and John Myles White's Comparing Julia and R's Vocabularies (which might be obsolete), and there seems to be no such concept as a factor. Are factors used often, and what is Julia's solution to this problem?

The PooledDataArray in the DataFrames package is one possible alternative to R's factors. The following implements your example using it:
julia> using DataFrames # install with Pkg.add("DataFrames") if required
julia> x = ["high" , "medium" , "low" , "high" , "medium"];
julia> xf = PooledDataArray(x)
5-element DataArrays.PooledDataArray{ASCIIString,UInt32,1}:
"high"
"medium"
"low"
"high"
"medium"
julia> xf.refs
5-element Array{UInt32,1}:
0x00000001
0x00000003
0x00000002
0x00000001
0x00000003
julia> xfo = PooledDataArray(x,["low","medium","high"]);
julia> xfo.refs
5-element Array{UInt32,1}:
0x00000003
0x00000002
0x00000001
0x00000003
0x00000002

The CategoricalArray type from the CategoricalArrays.jl package also resembles R's factors, including support for ordered levels.
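A minimal sketch of the ordered example with CategoricalArrays.jl (assuming a recent Julia; levelcode gives each element's integer code):
julia> using CategoricalArrays  # install with Pkg.add("CategoricalArrays") if required
julia> xfo = categorical(["high","medium","low","high","medium"],
                         ordered=true, levels=["low","medium","high"]);
julia> levelcode.(xfo)  # integer codes matching R's as.numeric(xfo): 3 2 1 3 2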

Related

Measuring bandwidth of a signal in R

I am trying to measure the bandwidth of a signal from the power spectra. I want to be able to extract the min and max values given a relative amplitude value. I have been using "seewave" to calculate the power spectra, and I can make a density plot, and provide the abline, but I cannot figure out how to get R to tell me where the abline intersects with the plot. I will need to change the relative amplitude values of interest, depending on the signal quality, but want to find a straightforward way to measure bandwidth using R. Thanks in advance!
power.spec <- spec(IBK.trill.1, flim=c(0,2))
pow.spec <- as.matrix(power.spec)
head(pow.spec)
# x y
# [1,] 0.000000000 0.007737077
# [2,] 0.007470703 0.029795630
# [3,] 0.014941406 0.021248476
# [4,] 0.022412109 0.015603801
# [5,] 0.029882813 0.014103307
# [6,] 0.037353516 0.014584454
freq <- pow.spec[1:2941,1]
head(freq)
# [1] 0.000000000 0.007470703 0.014941406 0.022412109 0.029882813 0.037353516
ampl <- pow.spec[,2]
head(ampl)
# [1] 0.007737077 0.029795630 0.021248476 0.015603801 0.014103307 0.014584454
plot(ampl ~ freq, type="l",xlim=c(0,2))
abline(h=0.45)
Save the indices of the "y" values that exceed your threshold:
wspec <- which( power.spec[, "y"] > 0.45)
Then use those indices to pull from the "x" values and place vertical lines at the first and last indices:
abline( v= power.spec[ c( wspec[1], tail(wspec, 1) ) , "x"], col="blue" )
BTW, I suggested the original "power.spec" values rather than your as.matrix version because spec already returns a matrix, so coercion is not needed. I tested this on the first example from the ?spec page. I suppose you could get really picky and take the mean of the "x" values at the threshold crossings and the ones just before and after, which would then be:
abline( v = c( mean( myspec[ c( wspec[1]-1, wspec[1]), "x"] ),
               mean( myspec[ c( tail(wspec, 1), tail(wspec, 1)+1 ), "x"] ) ),
        col = "blue" )
I did look at the differences with diff and the typical separation in my example was
mean( diff(myspec[ , "x"]) )
[1] 0.0005549795
So I could have gone back or ahead by half that amount to get a reasonable estimate. (I used max(myspec[, "y"])/2 as my estimate of the "half-height" threshold.)
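Putting the pieces together, here is a self-contained sketch using the first example from ?spec (tico ships with seewave; the 0.45 threshold is the one from the question and should be adapted to your signal):
library(seewave)
data(tico)                        # example recording shipped with seewave
myspec <- spec(tico, plot = TRUE) # two-column matrix: x = frequency (kHz), y = relative amplitude
thresh <- 0.45                    # relative amplitude cutoff
wspec  <- which(myspec[, "y"] > thresh)
abline(h = thresh)
abline(v = myspec[c(wspec[1], tail(wspec, 1)), "x"], col = "blue")
diff(myspec[c(wspec[1], tail(wspec, 1)), "x"])  # bandwidth estimate in kHz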

Global constrained optimization specification in R

I am trying to set up a global constrained optimization in R.
The experiment data might look like
set.seed(123)
data.frame(main.goal    = abs(rnorm(100)),
           minor.goal.1 = abs(rnorm(100)),
           minor.goal.2 = abs(rnorm(100))) -> d2optim
mean(sort(d2optim$minor.goal.1, decreasing = TRUE)[1:20]) -> minor.goal.1.treshhold
mean(sort(d2optim$minor.goal.2, decreasing = TRUE)[1:20]) -> minor.goal.2.treshhold
where I would like to find the indices (ind) of 20 rows that
EDIT
maximize mean(d2optim$main.goal[ind])
subject to mean(d2optim$minor.goal.1[ind]) >= 0.3 * minor.goal.1.treshhold
and mean(d2optim$minor.goal.2[ind]) >= 0.5 * minor.goal.2.treshhold
END EDIT
Is there a way to use a linear programming package such as lpSolve instead of grid-checking all $\binom{100}{20}$ configurations and then sorting them out? Something like:
all_configurations <- combn(100, 20)  # a matrix with choose(100, 20) columns -- doesn't fit in RAM
res <- numeric(ncol(all_configurations))
for (i in seq_len(ncol(all_configurations))) {
  ind <- all_configurations[, i]
  if (mean(d2optim$minor.goal.1[ind]) >= 0.3 * minor.goal.1.treshhold &&
      mean(d2optim$minor.goal.2[ind]) >= 0.5 * minor.goal.2.treshhold) {
    res[i] <- mean(d2optim$main.goal[ind])
  } else {
    res[i] <- 0
  }
}
all_configurations[, which.max(res)]
I am looking for the optimal 20-row subset of the 100 rows that gives the maximal mean of one variable, while the means of the other two variables are not less than 0.3 * minor.goal.1.treshhold and 0.5 * minor.goal.2.treshhold respectively.
I'm no expert in linear programming and don't know how to implement it in R, but here is what I think. I see it as an integer linear programming problem, modeled as follows:
x1, ..., x100 - binary (that is, integer between 0 and 1) variables, where xi indicates whether we take the i-th row of the data.
objective function:
x1*d2optim$main.goal[1] + ... + x100*d2optim$main.goal[100] -> max
constraints:
0 <= x1, ..., x100 <= 1
x1 + ... + x100 = 20
x1*d2optim$minor.goal.1[1] + ... + x100*d2optim$minor.goal.1[100] >= c1
x1*d2optim$minor.goal.2[1] + ... + x100*d2optim$minor.goal.2[100] >= c2
Instead of means we can take sums everywhere (with exactly 20 rows selected, a mean constraint is just a sum constraint divided by 20); c1 and c2 are constants matching your problem specification.
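For completeness, here is a sketch of that model with the lpSolve package (my own translation of the formulation above, not code from the original answer; mean(v[ind]) >= c is equivalent to sum(v[ind]) >= 20*c here):
library(lpSolve)
obj <- d2optim$main.goal
con <- rbind(rep(1, 100),           # x1 + ... + x100 = 20
             d2optim$minor.goal.1,  # sum over selected rows >= c1
             d2optim$minor.goal.2)  # sum over selected rows >= c2
rhs <- c(20,
         20 * 0.3 * minor.goal.1.treshhold,
         20 * 0.5 * minor.goal.2.treshhold)
sol <- lp("max", obj, con, c("=", ">=", ">="), rhs, all.bin = TRUE)
ind <- which(sol$solution == 1)     # the 20 selected row indices
mean(d2optim$main.goal[ind])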
You can achieve this with:
# create an index of the rows meeting conditions 2 and 3
idx <- d2optim$minor.goal.1 >= 0.3 * minor.goal.1.treshhold &
       d2optim$minor.goal.2 >= 0.5 * minor.goal.2.treshhold
# get the row numbers with the highest values for the `main.goal` variable
d2sub <- d2optim[idx, ]
rownames(d2sub[order(-d2sub$main.goal), ][1:20, ])
which gives you the row numbers that match your criteria:
[1] "97" "44" "57" "98" "95" "43" "29" "46" "100" "64" "74" "19" "36" "75" "1" "15" "20" "48" "31" "13"
Because you now have the rows with the highest values for main.goal which also meet the other two conditions, the mean of these values is maximized as well.

Why does the calculation of Cohen's kappa fail across different packages on this contingency table?

I have a contingency table for which I would like to calculate Cohen's kappa, the level of agreement. I have tried three different packages, which all seem to fail to some degree. The package e1071 has a function specifically for a contingency table, but it too seems to fail. Below is reproducible code. You will need to install the packages concord, e1071, and irr.
# Recreate my contingency table, output with dput
conf.mat<-structure(c(810531L, 289024L, 164757L, 114316L), .Dim = c(2L,
2L), .Dimnames = structure(list(landsat_2000_bin = c("0", "1"
), MOD12_2000_binForest = c("0", "1")), .Names = c("landsat_2000_bin",
"MOD12_2000_binForest")), class = "table")
library(concord)
cohen.kappa(conf.mat)
library(e1071)
classAgreement(conf.mat, match.names=TRUE)
library(irr)
kappa2(conf.mat)
The output I get from running this is:
> cohen.kappa(conf.mat)
Kappa test for nominally classified data
4 categories - 2 methods
kappa (Cohen) = 0 , Z = NaN , p = NaN
kappa (Siegel) = -0.333333 , Z = -0.816497 , p = 0.792892
kappa (2*PA-1) = -1
> classAgreement(conf.mat, match.names=TRUE)
$diag
[1] 0.6708459
$kappa
[1] NA
$rand
[1] 0.5583764
$crand
[1] 0.0594124
Warning message:
In ni[lev] * nj[lev] : NAs produced by integer overflow
> kappa2(conf.mat)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 2
Raters = 2
Kappa = 0
z = NaN
p-value = NaN
Could anyone advise on why these might fail? I have a large dataset, but as this table is simple I didn't think that could cause such problems.
In the first function, cohen.kappa, you need to specify that you are passing count data rather than an n*m matrix of n subjects and m raters:
cohen.kappa(conf.mat, 'count')
The second function is much trickier. For some reason your matrix is stored as integer rather than numeric, and R's integers can't hold really big numbers, so multiplying two of your large counts together overflows. For example:
i=975288
j=1099555
class(i)
# [1] "numeric"
i*j
# 1.072383e+12
as.integer(i)*as.integer(j)
# [1] NA
# Warning message:
# In as.integer(i) * as.integer(j) : NAs produced by integer overflow
So you need to convert your matrix to numeric:
# instead of classAgreement(conf.mat):
classAgreement(matrix(as.numeric(conf.mat), nrow = 2))
Finally, take a look at the documentation for ?kappa2. It requires an n*m subjects-by-raters matrix as explained above, so it just won't work with your (efficient) data structure.
Do you need to know specifically why those fail? Here is a function that computes the statistic (written in a hurry, so I might clean it up later; see the kappa wiki):
kap <- function(x) {
  a  <- (x[1, 1] + x[2, 2]) / sum(x)   # observed agreement
  p1 <- sum(x[1, ]) / sum(x)           # first row marginal proportion
  p2 <- sum(x[, 1]) / sum(x)           # first column marginal proportion
  e  <- p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
  (a - e) / (1 - e)
}
Tests/output:
> (x = matrix(c(20,5,10,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 20 5
[2,] 10 15
> kap(x)
[1] 0.4
> (x = matrix(c(45,15,25,15), nrow=2, byrow=T))
[,1] [,2]
[1,] 45 15
[2,] 25 15
> kap(x)
[1] 0.1304348
> (x = matrix(c(25,35,5,35), nrow=2, byrow=T))
[,1] [,2]
[1,] 25 35
[2,] 5 35
> kap(x)
[1] 0.2592593
> kap(conf.mat)
[1] 0.1258621
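As a quick cross-check (my addition, reusing the numeric conversion shown earlier), e1071's classAgreement should now agree with kap():
> classAgreement(matrix(as.numeric(conf.mat), nrow = 2))$kappa
# should match kap(conf.mat), i.e. about 0.1258621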

Calculating the Mode for Nominal as well as Continuous variables in [R]

Can anyone help me with this?
If I run:
> mode(iris$Species)
[1] "numeric"
> mode(iris$Sepal.Width)
[1] "numeric"
Then I get "numeric" as answer
Cheers
M
The function mode() is used to find out the storage mode of the object; in this case it is stored as mode "numeric". This function is not used to find the most frequently observed value in a data set, i.e. it is not used to find the statistical mode. See ?mode for more on what this function does in R and why it isn't useful for your problem.
For discrete data, the mode is the most frequent observed value among the set:
> set.seed(1) ## reproducible example
> dat <- sample(1:5, 100, replace = TRUE) ## dummy data
> (tab <- table(dat)) ## tabulate the frequencies
dat
1 2 3 4 5
13 25 19 26 17
> which.max(tab) ## which is the mode?
4
4
> tab[which.max(tab)] ## what is the frequency of the mode?
4
26
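If you need this often, the table() idiom above is easy to wrap in a small helper (stat_mode is a hypothetical name, not a base R function; note it returns the mode(s) as character and keeps all values tied for the maximum):
stat_mode <- function(x) {
  tab <- table(x)
  names(tab)[tab == max(tab)]  # all values tied for the highest frequency
}
stat_mode(dat)  # "4"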
For continuous data, the mode is the value at which the probability density function (PDF) reaches a maximum. As your data are generally a sample from some continuous probability distribution, we don't know the PDF, but we can estimate it through a histogram or, better, through a kernel density estimate.
Returning to the iris data, here is an example of determining the mode from continuous data:
> sepalwd <- with(iris, density(Sepal.Width)) ## kernel density estimate
> plot(sepalwd)
> str(sepalwd)
List of 7
$ x : num [1:512] 1.63 1.64 1.64 1.65 1.65 ...
$ y : num [1:512] 0.000244 0.000283 0.000329 0.000379 0.000436 ...
$ bw : num 0.123
$ n : int 150
$ call : language density.default(x = Sepal.Width)
$ data.name: chr "Sepal.Width"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> with(sepalwd, which.max(y)) ## which value has maximal density?
[1] 224
> with(sepalwd, x[which.max(y)]) ## use the above to find the mode
[1] 3.000314
See ?density for more info. By default, density() evaluates the kernel density estimate at n = 512 equally spaced locations. If this is too crude for you, increase the number of locations evaluated and returned:
> sepalwd2 <- with(iris, density(Sepal.Width, n = 2048))
> with(sepalwd2, x[which.max(y)])
[1] 3.000314
In this case it doesn't alter the result.
See ?mode: mode gives you the storage mode. If you want the value with the maximum count, then use table.
> Sample <- sample(letters[1:5],50,replace=T)
> tmp <- table(Sample)
> tmp
Sample
a b c d e
9 12 9 7 13
> tmp[which(tmp==max(tmp))]
e
13
Please read the help files if a function is not doing what you think it should.
Some extra explanation:
max(tmp) is the maximum of tmp.
tmp == max(tmp) gives a logical vector of the same length as tmp, indicating whether each value is equal to max(tmp).
which(tmp == max(tmp)) returns the indices of the values in the vector that are TRUE. You then use these indices to select the maximum value in tmp.
See the help files ?which, ?max, and the introductory manuals for R.
If you want to know the mode of a continuous random variable, I recently released the package ModEstM. In addition to the method proposed by Gavin Simpson, it addresses the case of multimodal variables.
For example, for the sample
> x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))
which is clearly bimodal, you get the answer:
> ModEstM::ModEstM(x2)
[[1]]
[1] 0.8634313 0.1752347

Converting coefficient names to a formula in R

When using formulas that have factors, the fitted models name the coefficients XY, where X is the name of the factor and Y is a particular level of it. I want to be able to create a formula from the names of these coefficients.
The reason: If I fit a lasso to a sparse design matrix (as I do below) I would like to create a new formula object that only contains terms for the nonzero coefficients.
require("MatrixModels")
require("glmnet")
set.seed(1)
n <- 200
Z <- data.frame(letter = factor(sample(letters, n, replace = TRUE), letters),
                x = sample(1:20, 200, replace = TRUE))
f <- ~ letter + x:letter + I(x>5):letter
X <- sparse.model.matrix(f, Z)
beta <- matrix(rnorm(dim(X)[2],0,5),dim(X)[2],1)
y <- X %*% beta + rnorm(n)
myfit <- glmnet(X,as.vector(y),lambda=.05)
fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]
fnew
[1] "letterb" "letterc" "lettere"
[4] "letterf" "letterg" "letterh"
[7] "letterj" "letterm" "lettern"
[10] "lettero" "letterp" "letterr"
[13] "letters" "lettert" "letteru"
[16] "letterw" "lettery" "letterz"
[19] "lettera:x" "letterb:x" "letterc:x"
[22] "letterd:x" "lettere:x" "letterf:x"
[25] "letterg:x" "letterh:x" "letteri:x"
[28] "letterj:x" "letterk:x" "letterl:x"
[31] "letterm:x" "lettern:x" "lettero:x"
[34] "letterp:x" "letterq:x" "letterr:x"
[37] "letters:x" "lettert:x" "letteru:x"
[40] "letterv:x" "letterw:x" "letterx:x"
[43] "lettery:x" "letterz:x" "letterb:I(x > 5)TRUE"
[46] "letterc:I(x > 5)TRUE" "letterd:I(x > 5)TRUE" "lettere:I(x > 5)TRUE"
[49] "letteri:I(x > 5)TRUE" "letterj:I(x > 5)TRUE" "letterl:I(x > 5)TRUE"
[52] "letterm:I(x > 5)TRUE" "letterp:I(x > 5)TRUE" "letterq:I(x > 5)TRUE"
[55] "letterr:I(x > 5)TRUE" "letteru:I(x > 5)TRUE" "letterv:I(x > 5)TRUE"
[58] "letterx:I(x > 5)TRUE" "lettery:I(x > 5)TRUE" "letterz:I(x > 5)TRUE"
From this I would like to have a formula
~ I(letter=="d") + I(letter=="e") + ...(etc)
I checked out formula() and all.vars() to no avail. Also, writing a function to parse this is a bit of a pain because of the different types of terms that can arise: for example x:letter, where x is numeric and letter is a factor, or I(x>5):letter as another annoying case.
So, is there some function I am not aware of that converts between a formula and its character representation and back again?
When I ran the code, I got something a bit different, since set.seed() had not been specified. Instead of using the variable name "letter", I used "letter_" as a convenient splitting marker:
> fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]
> fnew
[1] "letter_c" "letter_d" "letter_e" "letter_f" "letter_h" "letter_k" "letter_l"
[8] "letter_o" "letter_q" "letter_r" "letter_s" "letter_t" "letter_u" "letter_v"
[15] "letter_w"
Then made the split and packaged into a character matrix:
> fnewmtx <- cbind( lapply(sapply(fnew, strsplit, split="_"), "[[", 2),
+                   lapply(sapply(fnew, strsplit, split="_"), "[[", 1))
> fnewmtx
         [,1] [,2]
letter_c "c"  "letter"
letter_d "d"  "letter"
letter_e "e"  "letter"
letter_f "f"  "letter"
... (snipped the rest)
Then I wrapped the paste output in as.formula(), which is half of the answer to how to "convert between formula and its character representation and back". The other half is as.character().
form <- as.formula(
  paste("~",
        paste(
          paste(" I(", fnewmtx[, 2], "_ ==", "'", fnewmtx[, 1], "') ", sep = ""),
          sep = "", collapse = "+")
  )
)  # edit: needed to add back the underscore
And the output is now an appropriate class object:
> class(form)
[1] "formula"
> form
~I(letter_ == "c") + I(letter_ == "d") + I(letter_ == "e") +
I(letter_ == "f") + I(letter_ == "h") + I(letter_ == "k") +
I(letter_ == "l") + I(letter_ == "o") + I(letter_ == "q") +
I(letter_ == "r") + I(letter_ == "s") + I(letter_ == "t") +
I(letter_ == "u") + I(letter_ == "v") + I(letter_ == "w")
I find it interesting that the as.formula conversion made the single-quotes around the letters into double-quotes.
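As an aside (my addition, not part of the original answer): base R's reformulate() wraps exactly this paste/as.formula pattern, so the same formula can also be built with:
terms <- paste0("I(", fnewmtx[, 2], "_ == '", fnewmtx[, 1], "')")
form2 <- reformulate(terms)  # equivalent to the as.formula(paste(...)) call above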
Edit: Now that the problem has an additional dimension or two, my suggestion is to skip recreating the formula. Note that the rownames of myfit$beta are exactly the same as the column names of X, so you can instead use the non-zero rownames as indices to select columns of the X matrix:
> str(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] )
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..# i : int [1:429] 9 54 91 157 166 37 55 68 117 131 ...
..# p : int [1:61] 0 5 13 20 28 36 42 50 60 68 ...
..# Dim : int [1:2] 200 60
..# Dimnames:List of 2
.. ..$ : chr [1:200] "1" "2" "3" "4" ...
.. ..$ : chr [1:60] "letter_b" "letter_c" "letter_e" "letter_f" ...
..# x : num [1:429] 1 1 1 1 1 1 1 1 1 1 ...
..# factors : list()
> myfit2 <- glmnet(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] ,as.vector(y),lambda=.05)
> myfit2
Call: glmnet(x = X[, which(colnames(X) %in% rownames(myfit$beta)[
which(myfit$beta != 0)])],
y = as.vector(y), lambda = 0.05)
Df %Dev Lambda
[1,] 60 0.9996 0.05
Christopher, what you are asking for appears, after some consideration and examination of sparse.model.matrix etc., to be somewhat involved. You haven't explained why you do not want to form the full sparse model matrix for X_test, so it is difficult to advise a way forward other than the two options below.
If you have a large number of observations in X_test and hence do not want to produce the full sparse matrix for use in predict() for computational reasons, it might be more expedient to split X_test into two or more chunks of samples and form the sparse model matrix for each in turn, discarding it after use.
Failing that, you will need to study code from the Matrix package in detail. Start with sparse.model.matrix, note that it then calls Matrix:::model.spmatrix, and locate the calls to Matrix:::fac2Sparse in that function. You will probably need to co-opt code from these functions, but with a modified fac2Sparse, to achieve what you want.
Sorry I cannot provide an off-the-shelf script to do this, but it is a substantial coding task. If you go down that route, check out the Sparse Model Matrices vignette in the Matrix package and get the package sources (from CRAN) to see whether the functions I mention are better documented in the source code (there are no Rd files for fac2Sparse, for example). You can also ask the authors of Matrix (Martin Maechler and Doug Bates) for advice, though note that both of these chaps have had a particularly heavy teaching load this term.
Good luck!
