I have two matrices A1 and A2
(for example A1 = [0.4472 -0.8944; -0.8944 0.4472], A2 = [-0.5558 0.9101; 0.8313 0.4142])
and I want to check if the columns of A2 are optimally ordered and scaled
(its columns are least-squares estimates of the columns of A1).
And if not, how to make them so.
Any help?
Thanks
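In case it helps, here is a minimal R sketch of one interpretation of the problem (the brute-force search over column permutations and the per-column least-squares scale s = (a1'b)/(b'b) are my assumptions about what "optimally ordered and scaled" means): try every ordering of A2's columns, rescale each column against the matching column of A1, and keep the ordering with the smallest residual.

# Brute force over column permutations; fine for small matrices.
# (For k columns you could enumerate permutations with e.g. combinat::permn(k).)
A1 <- matrix(c(0.4472, -0.8944, -0.8944, 0.4472), 2, 2)
A2 <- matrix(c(-0.5558, 0.8313, 0.9101, 0.4142), 2, 2)

perms <- list(c(1, 2), c(2, 1))          # all permutations of 2 columns
best <- NULL
for (p in perms) {
  B <- A2[, p, drop = FALSE]
  s <- colSums(A1 * B) / colSums(B * B)  # least-squares scale per column
  C <- sweep(B, 2, s, `*`)               # reordered and rescaled A2
  err <- sum((A1 - C)^2)
  if (is.null(best) || err < best$err)
    best <- list(perm = p, scale = s, A2fix = C, err = err)
}
best$A2fix   # A2 with columns reordered and rescaled to best match A1

If best$perm is the identity and best$scale is all ones (up to tolerance), the columns were already optimally ordered and scaled.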
I have a dataset with multiple columns of variables that I want to check for correlations. My identifier is the country code and the remaining columns are all variables to be studied. The goal is to find the top 5/10 linear regressions among all combinations, using the R-squared value as the comparison attribute. My data looks like:
Country_code | A1 | A2 | A3 |...... | A193
I want to run lm in a loop and collect the results for all combinations of attributes in a list, so that I can compare their R-squared values and publish/plot the top 5/10 correlations.
It should run like
For A1:
lm(A1~A2) then lm(A1~A3).....
Next: lm(A2~A3)....
I know I need two nested loops, but so far the results are not saved properly because each summary has too many components to be pushed into a list, etc.
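A minimal sketch of one way to do this, assuming the data are in a data frame df whose first column is Country_code and the rest are numeric (df and the top-10 cutoff are placeholders, not your actual names): fit lm for each unordered pair of variables, keep only the R-squared from each summary instead of the whole object, and sort at the end.

vars <- names(df)[-1]                       # drop Country_code
pairs <- t(combn(vars, 2))                  # all unordered pairs of variables
r2 <- apply(pairs, 1, function(p) {
  fit <- lm(reformulate(p[2], response = p[1]), data = df)
  summary(fit)$r.squared                    # keep only the R-squared, not the full summary
})
results <- data.frame(y = pairs[, 1], x = pairs[, 2], r.squared = r2)
head(results[order(-results$r.squared), ], 10)   # top 10 pairs by R-squared

Storing just the scalar R-squared per pair avoids the problem of pushing whole summary objects into a list.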
My problem is:
I have two large arrays: A is of rank 5 (namely, a tensor), which I reshape into a matrix B (NxM) of rank 2. At some point, my problem involved normalizing my arrays, so I was doing:
1) A*norm_scalar;
2) reshaping A to get B.
which is giving a different result than doing
1) reshaping A to get B
2) B*norm_scalar;
Both approaches should give the same result, as I am only multiplying by a scalar. My theory is that something related to rounding precision is going on. If so, which order is the recommended way to proceed?
To check this, I computed B with both methods, namely B1 and B2, and compared them.
I have tried:
julia> isequal(B1,B2)
false
So, yes, they are different.
I know that find(B1.==B2) will give me the indexes where B1 and B2 are equal. Now, is there any command that gives me the indexes where B1 and B2 are different? This would help me a great deal!
find(B1.!=B2) should do what you want. (In Julia 0.7 and later, find was renamed to findall, so there it would be findall(B1 .!= B2).)
Let's say we have a hypothetical complete schedule of potential outcomes from an experiment.
Y0<-c(10,15,20,20,10,15,15)
Y1<-c(15,15,30,15,20,15,30)
budgets<-matrix(data=c(Y0,Y1),nrow=7,ncol=2)
I would like to list all of the ways to choose two elements from Y1 and the remaining 5 from Y0. Ideally, this would look like an array of 21 lists, each with two elements labeled Y1 and five elements labeled Y0.
edit: These are matched pairs, so choosing Y0[1] removes Y1[1] from consideration.
Thanks in advance! I think there are many ways to approach this (sapply?) but would appreciate help on the details.
Here is a longer method; there is probably a more compact solution out there:
# get within-group combinations as matrices
grp0 <- t(combn(Y0, 5))  # choose(7, 5) = 21 rows
grp1 <- t(combn(Y1, 2))  # choose(7, 2) = 21 rows
# get all possible combos of these rows (21 * 21 = 441)
grpCombos <- expand.grid(1:nrow(grp0), 1:nrow(grp1))
# get all combinations as a matrix
allGroups <- cbind(grp0[grpCombos[, 1], ], grp1[grpCombos[, 2], ])
To get all the combinations of 2 elements from Y1 and the remaining 5 elements from Y0, choosing only one element from each matched position, try the following code:
cb <- as.data.frame(combn(1:7, 2))  # all 21 ways to pick the two positions taken from Y1
sapply(cb, FUN = function(x) c(Y1[x], Y0[-x]))  # rows 1-2: Y1 picks; rows 3-7: remaining Y0
Previous version: if you want all combinations of choosing 2 from 7 within Y1 and, independently, 5 from 7 within Y0, the total number of combinations would be 21 * 21.
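For reference, a quick look at what the matched-pairs version returns (the row labels below are added only for readability):

res <- sapply(cb, function(x) c(Y1[x], Y0[-x]))
dim(res)                                   # 7 x 21: one column per choice of the two Y1 positions
rownames(res) <- c("Y1", "Y1", "Y0", "Y0", "Y0", "Y0", "Y0")
res[, 1]                                   # e.g. Y1[1], Y1[2], then Y0[3:7]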
I have to apply a clustering algorithm to my dataset, whose elements are described by attributes of different natures:
A1 -> multivalued, nominal values
A2 -> multivalued, nominal values
A3 -> multivalued, nominal values
A4 -> single nominal value
The domain of each attribute is potentially huge, such as dictionary words.
I've just found the Jaccard measure, which will be great for each attribute's value set, except for the first one.
Consider the following example
E1: [
A1 (ab,bb),
A2 (Mark, Rose, Bet),
A3 (rock, pop, soul),
A4 (France)
],
E2: [
A1 (ab,bb,cc,ca),
A2 (Mark, Peter, Bet, Louise),
A3 (pop, disco),
A4 (Spain)
]
While all the other attributes must be considered as made of atomic values, the first attribute is composed of values that need to be compared with a string similarity, such as the Levenshtein distance.
Which is the best approach? The first attribute generally has a small cardinality, about 10 values, but I have no a priori knowledge of it. All the other attributes have a huge cardinality.
I'm new to clustering stuff, I'm just trying to figure out how to do it in the best way :)
I have found that R should be a good way to implement that kind of clustering (consider a dataset of millions of elements).
Any suggestion?
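Since you mention R, here is a minimal sketch of one possible dissimilarity (the equal weighting of the attributes and the greedy best-match averaging for A1 are assumptions on my part, not a standard recipe): Jaccard on the atomic-valued sets A2-A4, and for A1 a set similarity built from normalized Levenshtein distances using base R's adist.

# Jaccard similarity between two sets of atomic values
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Similarity for A1: for each string in `a`, take the best (smallest)
# normalized Levenshtein distance to any string in `b`, then average.
# adist() is base R's Levenshtein distance.
lev_set_sim <- function(a, b) {
  d <- adist(a, b) / outer(nchar(a), nchar(b), pmax)  # normalize to [0, 1]
  mean(1 - apply(d, 1, min))
}

# Overall similarity between two elements; the equal weights are an
# assumption, tune them for your data.
elem_sim <- function(e1, e2) {
  mean(c(lev_set_sim(e1$A1, e2$A1),
         jaccard(e1$A2, e2$A2),
         jaccard(e1$A3, e2$A3),
         jaccard(e1$A4, e2$A4)))
}

E1 <- list(A1 = c("ab", "bb"), A2 = c("Mark", "Rose", "Bet"),
           A3 = c("rock", "pop", "soul"), A4 = "France")
E2 <- list(A1 = c("ab", "bb", "cc", "ca"), A2 = c("Mark", "Peter", "Bet", "Louise"),
           A3 = c("pop", "disco"), A4 = "Spain")
elem_sim(E1, E2)   # a dissimilarity for clustering would be 1 - elem_sim(...)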
I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0, 1]. The file is approximately 50 GB. The pairs (A, B) whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work at your data set's size. Plus, implementations usually need more than one copy of the distance matrix, so you may need about 1 TB of RAM: 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link clustering in a single pass over your file. But you will have to implement this yourself; don't even think of using R or something fancy.
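To make the single-pass idea concrete: single-link over an edge list sorted by distance is essentially Kruskal's algorithm with a union-find structure. A toy sketch follows (written in R purely to illustrate the logic, and assuming you first convert scores to distances, e.g. 1 - score, and sort by them; for 250,000 variables you would implement this in a compiled language, as said above):

# Toy single-link pass: edges must be sorted by increasing distance.
# parent[] implements union-find with path compression.
parent <- integer(0)
find_root <- function(i) {
  while (parent[i] != i) {
    parent[i] <<- parent[parent[i]]  # path compression
    i <- parent[i]
  }
  i
}

single_link <- function(edges, n) {   # edges: data.frame(a, b, dist), sorted by dist
  parent <<- seq_len(n)
  merges <- list()
  for (k in seq_len(nrow(edges))) {
    ra <- find_root(edges$a[k])
    rb <- find_root(edges$b[k])
    if (ra != rb) {                   # this edge merges two clusters
      parent[rb] <<- ra
      merges[[length(merges) + 1]] <- c(ra, rb, edges$dist[k])
    }
  }
  merges                              # the merge steps form the dendrogram
}

# Tiny example with 4 variables and 5 scored pairs:
edges <- data.frame(a = c(1, 2, 1, 3, 1), b = c(2, 3, 3, 4, 4),
                    dist = c(0.1, 0.2, 0.3, 0.4, 0.5))
single_link(edges, 4)

Each edge is touched once, so the pass is linear in the file size once the edges are sorted; only the union-find array (one integer per variable) has to fit in memory.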