How to extract top features by CAT score in R?

I am running a machine learning algorithm that uses the CAT score for feature selection, as follows:
library(sda)
train1<- data.matrix(train, rownames.force = NA)
ranking.LDA = sda.ranking(train1[,1:lengthvar], train1[,lengthtrain], diagonal=FALSE)
topfs<-which(ranking.LDA[,"score"] >2)
My question is how to get, for example, the top 20 features by CAT score. The only way I could extract features was by setting a threshold, but that gives me a different number of features for each data set. What I want is to always get the top 20 (or any other fixed number of) features.
Thanks in advance for your valuable contribution.

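ranking.LDA is a matrix whose rows are sorted by decreasing score; its idx column holds each feature's original column position. Hence we can simply take its first 20 rows.
# ranking.LDA is already sorted, so the idx values of its first 20 rows pick out the top 20 column names.
colnames(train1)[ranking.LDA[1:20, "idx"]]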
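The selection step can be sketched without the sda package by using a small hand-made matrix as a stand-in for the ranking. This is only an illustration: `mock_ranking`, `feature_names` and `top_n_features` are hypothetical names, and the sketch assumes the ranking has an `idx` column of original column positions and rows sorted by decreasing `score`.

```r
# Hypothetical helper: return the names of the n best-ranked features.
# Assumes `ranking` is sorted by decreasing score and has an "idx" column.
top_n_features <- function(ranking, feature_names, n = 20) {
  n <- min(n, nrow(ranking))            # never ask for more rows than exist
  feature_names[ranking[seq_len(n), "idx"]]
}

# Hand-made stand-in for a ranking matrix (not real sda output).
mock_ranking <- cbind(idx = c(3, 1, 4, 2), score = c(9.1, 5.2, 2.7, 0.4))
feature_names <- c("f1", "f2", "f3", "f4")

top2 <- top_n_features(mock_ranking, feature_names, n = 2)
print(top2)
```

Because the number of rows is fixed by `n` rather than by a score threshold, every data set yields the same number of features.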

Related

"grouping factor must have exactly 2 levels"

Hi y'all, I'm fairly new to R and I'm supposed to calculate the F statistic for this table.
The code I have entered is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed , data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST...If you invoke
?var.test()
you will note that the S3 version you called assumes lhs is numeric and rhs is a 2-level factor.
As for the rest, while I don't know the exact wording of your work/school assignment, it shouldn't be "calculate an F-test", exactly. It should be "analyze these data appropriately". While there are a number of routes you could take, this is normally seen as a regression problem, NOT a problem of comparing the variances of two groups, which is what var.test() is designed to do. (Reading the documentation at, for example, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test should make this clear and is something you should always do when invoking R procedures.)
Using a subset of your data (please do this yourself for stack helpers next time rather than make someone here do it for you)...
df <- data.frame(
ID = 1:4,
TL = c(27.1,29.0,33.0,29.3),
SS = c(86.6,62.4,63.8,62.3)
)
cor.test(df$TL,df$SS) # reports t statistic
# or
summary(lm(df$TL ~ df$SS)) # reports F statistic
Note that F is simply t^2 here in the 2 variable case.
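The relationship between the two statistics can be checked directly on the same four rows. This is a minimal sketch using only base R; the variable names `t_stat` and `f_stat` are introduced here for illustration.

```r
# Same four observations as in the subset above.
df <- data.frame(
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)

# t statistic from the correlation test.
t_stat <- unname(cor.test(df$TL, df$SS)$statistic)

# F statistic from the simple linear regression.
f_stat <- unname(summary(lm(TL ~ SS, data = df))$fstatistic["value"])

# With one predictor, F equals t squared (up to floating point error).
print(all.equal(t_stat^2, f_stat))
```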
Lastly, I should add it is remotely, vaguely possible the assignment is to check if the variances of the 2 distributions are equal even though I can see no reason why anyone would want to know considering they are 2 different measures on two different underlying scales measuring 2 different things. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.

Why does mutate() command create NAs?

I am currently working on an amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the amazon data, and see whether certain products have a higher variance in star ratings than other ones. I have a variable indicating product ID (asin), a variable indicating the star rating (overall), and want to create a variance variable.
I have thus used dplyr's group_by function in combination with the mutate function. Even though all input variables don't have NAs/Missings, my output variable does. I have attempted to look for a solution, yet only found solutions on what to do if the input has NAs.
See my code attached:
any(is.na(data$asin))
#[1] FALSE
any(is.na(data$overall))
# [1] FALSE
#create variable that represents variance of rating, grouped by product type
data <- data %>%
group_by(asin) %>%
mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (NAs hinder the use of tapply) and be as precise as possible in follow-up analyses.
Thank you in advance!
var will return NA if the input is length one. So any ASINs that appear once in your data will have NA variance. Depending what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
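To confirm that the NAs come from products with a single review, the group sizes can be compared with the variances. The sketch below uses base R's `ave()` on a made-up data frame (the `ratings` values are invented for illustration) instead of dplyr, so it runs without extra packages.

```r
# Invented toy data: product "B" has only one review.
ratings <- data.frame(
  asin    = c("A", "A", "B", "C", "C", "C"),
  overall = c(5, 3, 4, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Per-product group size and variance, repeated on every row of the group.
group_size <- ave(ratings$overall, ratings$asin, FUN = length)
ratings$ProductVariance <- ave(ratings$overall, ratings$asin, FUN = var)

# Every NA variance should belong to a group of size one.
print(table(group_size[is.na(ratings$ProductVariance)]))

# Optional: treat single-review products as having zero variance.
ratings$ProductVarianceFilled <- ifelse(
  is.na(ratings$ProductVariance), 0, ratings$ProductVariance
)
```

Whether 0 is a sensible replacement depends on the follow-up analysis; dropping single-review products is an equally defensible choice.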
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with the .drop argument of group_by().
When .drop = TRUE, empty groups are dropped.

Randomly pairing elements of a vector in R to count unique arrangements

Background:
On this combinatorics question, the issue is how to determine the sample space: the ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see comments OP) and 105 (see edit within OP and answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now erased answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, is the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of unlist is from @Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
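The simulated counts can be cross-checked by enumerating every draw exactly. The sketch below is one possible approach, not part of the original answer: `enumerate_pairings` is a hypothetical helper that always pairs the first remaining team with each possible opponent and then recurses on the rest, so each unordered draw is generated exactly once.

```r
# Hypothetical helper: list every way to split `items` into unordered pairs.
enumerate_pairings <- function(items) {
  if (length(items) == 0) return(list(character(0)))
  first <- items[1]
  rest <- items[-1]
  out <- list()
  for (i in seq_along(rest)) {
    # Sort inside the pair so that LIV-BYN and BYN-LIV get the same label.
    pair <- paste(sort(c(first, rest[i])), collapse = "-")
    for (others in enumerate_pairings(rest[-i])) {
      out[[length(out) + 1]] <- c(pair, others)
    }
  }
  out
}

team_names <- c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
draws <- enumerate_pairings(team_names)

n_draws <- length(draws)           # draws ignoring field advantage
n_with_home <- n_draws * 2^4       # each of the 4 pairs can be flipped
print(c(n_draws, n_with_home))
```

The count follows the product 7 * 5 * 3 * 1: the first team has 7 possible opponents, the next unpaired team has 5, and so on.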

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs A,B with a score of 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work at your data set's size. Plus, implementations usually need more than one copy of the matrix. You may need 1 TB of RAM then... 2*8*250000*250000 bytes is a lot.
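The memory estimate is simple arithmetic, assuming 8-byte doubles and two full copies of a dense 250,000 x 250,000 matrix (the variable names are introduced here for illustration):

```r
n <- 250000              # number of variables
bytes_per_double <- 8    # assumption: scores stored as 8-byte doubles
copies <- 2              # assumption: the algorithm keeps two full copies

total_bytes <- copies * bytes_per_double * n^2
total_terabytes <- total_bytes / 1e12   # decimal terabytes
print(total_terabytes)
```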
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
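The single-pass idea can be illustrated on a toy edge list. This is only a sketch of the technique at toy scale, not something to run on 50 GB: it assumes the score is a distance (smaller means closer) and that the pairs are sorted by increasing score, and it uses a plain union-find structure rather than SLINK itself. The function name `single_link` and the example data are hypothetical.

```r
# Hypothetical sketch: single-link merge heights from a sorted edge list.
single_link <- function(edges) {
  edges <- edges[order(edges$dist), ]        # a real file would be pre-sorted
  nodes <- unique(c(edges$a, edges$b))
  parent <- setNames(nodes, nodes)           # each node starts as its own root

  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }

  heights <- numeric(0)
  for (i in seq_len(nrow(edges))) {
    ra <- find_root(edges$a[i])
    rb <- find_root(edges$b[i])
    if (ra != rb) {                          # edge joins two clusters: merge
      parent[[rb]] <- ra
      heights <- c(heights, edges$dist[i])
    }
  }
  heights                                    # one entry per merge, increasing
}

toy_edges <- data.frame(
  a = c("A", "B", "A", "C"),
  b = c("B", "C", "C", "D"),
  dist = c(0.1, 0.4, 0.3, 0.9),
  stringsAsFactors = FALSE
)
merge_heights <- single_link(toy_edges)
print(merge_heights)
```

Each line of the file is read once and either merges two clusters or is skipped, which is why sorted input makes a single pass sufficient. If the score is a similarity instead, the sort order would have to be reversed.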

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I am trying to solve is much more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to solve your question is to arrange the counts in a 2x2 matrix
             obama | not obama
barack         A   |     B
not barack     C   |     D
and score all occurring bigrams in the matrix. That way you can, for instance, use a simple chi-squared test.
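As a rough illustration of the 2x2 approach, the sketch below fills the four cells from the adjacent word pairs of a made-up token sequence and computes the Pearson chi-squared statistic by hand. The `tokens` vector is invented, and the statistic is written out explicitly rather than calling `chisq.test()` only to avoid small-sample warnings on such a tiny example.

```r
# Invented toy text, already lower-cased and tokenized.
tokens <- c("barack", "obama", "met", "barack", "obama",
            "and", "michelle", "obama", "today")

# All adjacent word pairs (bigrams): first word and second word.
first  <- tokens[-length(tokens)]
second <- tokens[-1]

# Cells of the 2x2 table for the bigram "barack obama".
A <- sum(first == "barack" & second == "obama")
B <- sum(first == "barack" & second != "obama")
C <- sum(first != "barack" & second == "obama")
D <- sum(first != "barack" & second != "obama")
N <- A + B + C + D

# Pearson chi-squared statistic for a 2x2 table (no continuity correction).
chi_sq <- N * (A * D - B * C)^2 / ((A + B) * (C + D) * (A + C) * (B + D))
print(c(A = A, B = B, C = C, D = D, chi_sq = chi_sq))
```

Repeating this for every bigram and sorting by the statistic gives a ranking of word pairs by association strength.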
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of X[i] is 1 if word[i] is either the jth or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0; j <= N-2; j++) {
    if (text[j] == word[i] || text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] (note that, for two different words, the dot product is the number of times they are adjacent) divided by the product of the lengths of the two vectors (the length of X[a] is the square root of the number of two-word windows containing word[a], which is at most twice its number of appearances). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a] and word[b] are often adjacent.
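The construction above can be sketched in a few lines of R. The function names `adjacency_vector` and `cosine_similarity` and the toy `tokens` vector are introduced here for illustration; the quantity computed is the normalized dot product described above, which is a cosine similarity rather than a Pearson correlation.

```r
# Indicator vector over two-word windows: 1 if `word` is in window j.
adjacency_vector <- function(tokens, word) {
  n <- length(tokens)
  as.numeric(tokens[-n] == word | tokens[-1] == word)
}

# Dot product divided by the product of the two vector lengths.
cosine_similarity <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Invented toy text, already lower-cased and tokenized.
tokens <- c("barack", "obama", "met", "barack", "obama",
            "and", "michelle", "obama", "today")

x_barack   <- adjacency_vector(tokens, "barack")
x_obama    <- adjacency_vector(tokens, "obama")
x_michelle <- adjacency_vector(tokens, "michelle")

cor_barack_obama    <- cosine_similarity(x_barack, x_obama)
cor_barack_michelle <- cosine_similarity(x_barack, x_michelle)
print(c(cor_barack_obama, cor_barack_michelle))
```

Words that never share a window score 0, and a word compared with itself scores 1.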
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
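The pair counts for the example sentence can be reproduced with base R's `combn()`. This is a minimal sketch; the variable names are introduced here for illustration.

```r
words <- c("Barack", "Obama", "is", "the",
           "Democratic", "candidate", "for", "President")

# Every unordered pair of words, one pair per column (order of appearance kept).
pairs <- combn(words, 2)
n_pairs <- ncol(pairs)

# Pairs that include "Barack", and those where it is paired with "Obama".
has_barack <- pairs[1, ] == "Barack" | pairs[2, ] == "Barack"
n_with_barack <- sum(has_barack)
n_barack_obama <- sum(pairs[1, ] == "Barack" & pairs[2, ] == "Obama")

print(c(n_pairs, n_with_barack, n_barack_obama))
```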
