The nearZeroVar() function from the mixOmics R package is called as follows:
nearZeroVar(x, freqCut=95/5, uniqueCut=15) # default values shown
Here is the description of what this function does, straight from the source.
For example, a near zero variance predictor is one that,
for 1000 samples, has two distinct values and 999 of them are a single
value.
To be flagged, first the frequency of the most prevalent value over
the second most frequent value (called the “frequency ratio”) must be
above freqCut. Secondly, the “percent of unique values,” the number of
unique values divided by the total number of samples (times 100), must
also be below uniqueCut.
In the above example, the frequency ratio is 999 and the unique value
percentage is 0.0001.
I understand that the frequency ratio would be 999/1 (because 999 samples share one value and 1 sample has the other), so it would be 999. But shouldn't the unique value percentage be 2/1000 * 100 = 0.2, since there are 2 unique values over 1000 samples? How does one obtain 0.0001 as the answer?
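For reference, the two statistics from the description can be computed directly in R; this is a sketch of the definitions, not the package's internal code:
x <- c(rep(0, 999), 1)                        # 1000 samples, two distinct values
tab <- sort(table(x), decreasing = TRUE)
tab[[1]] / tab[[2]]                           # frequency ratio: 999 / 1 = 999
100 * length(unique(x)) / length(x)           # percent unique: 2 / 1000 * 100 = 0.2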
I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18, 3, 3), matrix(18:10, 3, 3))
I've seen that sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell at that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that takes your ls and produces a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: an expression like ls[[1]] > ls[[2]] pulls the matrices out of the list and compares them element-wise. The result is a TRUE/FALSE matrix, which R silently treats as a 0/1 matrix in arithmetic, so we can multiply it by whatever coefficient we want to represent each situation.
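For the full problem (99 permuted matrices plus the observed one), here's a sketch of the same idea generalized to a list of any length; it assumes the observed matrix is the last element of ls, as in the question:
arr <- simplify2array(ls)           # stack the list into an nrow x ncol x n array
obs_rank <- apply(arr, c(1, 2),     # for each cell, rank along the third dimension
                  function(v) rank(v)[length(v)])
obs_rank                            # rank of the observed (last) matrix, cell by cell
Ties get averaged ranks, which matches the 1.5 convention above.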
There's this string:
X1_X2_X3_X4_X5_X6
It is known that each variable X* can take values from 0 to 100. The sum of all X* variables is always equal to 100. How many possible string variants can be created?
Suppose F(n,s) is the number of strings with n variables, and the variables sum to s, where each variable is between 0 and 100, and suppose s<=100. You want F(6,100).
Clearly
F(1,s) = 1
If the first variable is t, then it can be followed by strings of n-1 variables that sum to s-t. Thus
F(n,s) = Sum{ 0<=t<=s | F(n-1, s-t) }
So it's easy to write a wee function to compute the answer.
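For instance, a minimal sketch of that recurrence in R; it exploits the fact that summing F(n-1, s-t) over t = 0..s is just a cumulative sum, and it assumes s <= 100 so the per-variable upper bound never binds:
count_strings <- function(n, s) {
  f <- rep(1, s + 1)                          # F(1, 0..s) = 1
  for (k in seq_len(n - 1)) f <- cumsum(f)    # F(k+1, s) = Sum{ 0<=t<=s | F(k, t) }
  f[s + 1]
}
count_strings(6, 100)                         # 96560646, which agrees with choose(105, 5)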
I am working with the golub dataset in R (split into the AML and ALL groups) and I am attempting to do a hypothesis test involving two genes. For the AML patient group, I want to find the proportion of patients who have higher expression of gene 900 than of gene 1000, and then test whether that number is less than half. I have a general idea for the second part. For the first part I had something like the code below, but since the comparison returns TRUE/FALSE I converted it to numeric, which gave 0s and 1s; I want the actual numbers, not the logical form.
library(multtest)   # the golub data ships with the multtest Bioconductor package
data(golub)

gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML"))
x <- golub[900, gol.fac == "AML"]
y <- golub[1000, gol.fac == "AML"]
z <- x > y          # TRUE where gene 900 expression exceeds gene 1000
k <- as.numeric(z)
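For the stated test (is the proportion of AML patients with higher gene 900 expression less than one half?), a minimal sketch building on the code above; binom.test is base R, and the exact binomial test is one reasonable choice here:
n_higher <- sum(z)          # TRUEs count as 1, so this is the number of such patients
n_total  <- length(z)
binom.test(n_higher, n_total, p = 0.5, alternative = "less")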
Use max
max(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Or, if you want the element-wise maximum (one value per patient), use pmax
pmax(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Instead of subsetting the matrix twice, pull both rows in a single subset and take the max
max(golub[c(900, 1000), gol.fac == "AML"])
Say, I have two vectors of the same length
A = mtcars$mpg
B = mtcars$cyl
I can calculate correlation between whole vectors
cor (A, B)
and get one single value (-0.852162).
What I need is to calculate the correlation between the two vectors with a sampling rate of 10. That means I start at the first datapoint in A and B, take the 5 values to its right (there are no values to the left), calculate a correlation coefficient, and write it into a vector C. Then I take the next value in A and B, take 5 values to the right and 1 to the left, and write the coefficient into the vector; then I shift to the next value, and so forth. The resulting vector C must contain the same number of values as A or B (N = 32), and each value in C represents the correlation between A and B at a sampling rate of 10 (5 values to the left and 5 to the right of that datapoint, where available).
Is there any elegant and simple way to do it in R?
P.S.: Ease of coding is more important than the time needed for calculations.
The TTR package may provide what you are looking for.
It should be as simple as:
TTR::runCor(A, B)
There is a whole blog post about rolling correlation here.
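If the shrinking centered window at the series edges matters (runCor leaves the first n - 1 values NA), here's a hedged alternative with zoo::rollapply; the width of 11 is my reading of the question, i.e. the point itself plus up to 5 neighbours on each side:
library(zoo)
A <- mtcars$mpg
B <- mtcars$cyl
C <- rollapply(seq_along(A), width = 11, align = "center", partial = TRUE,
               FUN = function(i) cor(A[i], B[i]))
length(C)                   # 32, one correlation per datapoint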
I have the below data in variable X. The data is in the form of pairs of numbers {a, b}.
a represents the actual value while b represents its frequency in the data set.
X = {{20, 30}, {21, 40}, {22, 50}}
I want to calculate the expected value of this data set.
Also, how can I extract all the values of a into a separate data set?
The expected value is (in non-Mma notation) sum(x[i]*p[i], i, 1, n), where x[i] is the i-th distinct value (i.e., the first value in each pair), p[i] is the proportion of that value (i.e., the second value in each pair divided by the total of all the second values), and n is the number of distinct values of x (i.e., the number of pairs). I think this is enough to help you solve it now.
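A quick numeric check of that formula, written in R for concreteness (the question itself is in Mathematica):
a <- c(20, 21, 22)          # the values
b <- c(30, 40, 50)          # their frequencies
a                           # the values of a on their own
sum(a * b) / sum(b)         # expected value: 2540 / 120 = 21.167 (approx.)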