Best combination of lists that provides the most unique values - r

I'm not sure if someone could help me with this problem.
I have 5 lists of values of different lengths.
Note: the same value can be present in different lists.
Does anyone know how to get the combination of 3 lists that will provide the most total unique values?
Thanks in advance,
Miguel

I do not really have an answer to your question, which seems to be more of a combinatorics question than a programming one. My sense is that if you want an exact solution you will have to try all the possible combinations of 3 lists out of 5 (there are 10 of them). One thing to remember if you go that way is that to count the unique elements of the concatenation of 3 lists you do not necessarily have to compute length(unique(c(l1,l2,l3))), which I imagine could be inefficient if you have very long lists. You can instead use the inclusion-exclusion formula for the size of the union of 3 sets, |A ∪ B ∪ C| = |A| + |B| + |C| − |A∩B| − |A∩C| − |B∩C| + |A∩B∩C|, which you can find for example at https://math.stackexchange.com/questions/669249/probability-of-the-union-of-3-events .
This only requires you to compute the lengths of all the possible intersections of the lists. It could be a completely academic exercise: as I said, I am not offering an answer, but if you are not familiar with that formula it is worth reading about, since it is relevant to the problem of finding the size of a union of sets.
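If you do go the brute-force route, here is a minimal sketch in R (assuming the five lists are vectors named l1 through l5; the names are placeholders):
lists <- list(l1, l2, l3, l4, l5)
combos <- combn(5, 3)                      # the 10 ways to choose 3 lists out of 5
counts <- apply(combos, 2, function(idx)
    length(unique(unlist(lists[idx]))))    # unique values for each triple
combos[, which.max(counts)]                # indices of the best triple of lists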

generating completely new vector based on other vectors

Assume I have four vectors (v1, v2, v3, v4), and I want to create a new vector (vec_new) that is not close to any of those four vectors. I was thinking about interpolation and extrapolation. Do you think they are suitable? Do they also apply to vectors, and can they generate a vector of, let's say, 300 dimensions? Another possible option would be a transformation matrix, but I am not sure that fits my needs. I think averaging and concatenation are not good options, as the result might be close to some of those four vectors.
To elaborate on my problem: imagine I divided my vectors into two categories. I need to find a vector that belongs to neither of those categories.
Any other ideas?
Per my comment, I wouldn't expect the creation of synthetic "far away" examples to be useful for realistic goals.
Even things like word antonyms are not maximally cosine-dissimilar from each other, because among the realm of all word-meaning-possibilities, antonyms are quite similar to each other. For example, 'hot' and 'cold' are considered opposites, but are the same kind of word, describing the same temperature-property, and can often be drop-in replacements for each other in the same sentences. So while they may show an interesting contrast in word-vector space, the "direction of difference" isn't going to be through the origin -- as would create maximal cosine-dissimilarity.
And in classification contexts, even a simple 2-category classifier will need actual 'negative' examples. With only positive examples, the 'vector space' won't necessarily model anything about hypothesized-but-not-actually-present negative examples. (It's nearly impossible to divide the space into two categories without training examples showing the real "boundaries".)
Still, there's an easy way to make a vector that is maximally dissimilar to another single vector: negate it. That creates a vector that's in the exact opposite direction from the original, and thus will have a cosine-similarity of -1.0.
If you have a number of vectors against which you want to find a maximally-dissimilar vector, I suspect you can't do much better than negating the average of all the vectors. That is, average the vectors, then negate that average-vector, to find the vector that's pointing exactly-opposite the average.
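A minimal sketch in R (assuming v1 through v4 from the question are numeric vectors of equal length, and that the average is nonzero):
vs <- list(v1, v2, v3, v4)
avg <- Reduce(`+`, vs) / length(vs)  # element-wise average of the vectors
vec_new <- -avg                      # exactly opposite the average
sum(vec_new * avg) / sqrt(sum(vec_new^2) * sum(avg^2))  # cosine similarity: -1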
Good luck!

How do I create an N-Queens Problem with Genetic Algorithm child with constrained row and column?

For the N-Queens problem found here, I am trying to implement a genetic algorithm to solve it.
However, let's say that I am trying to constrain the problem. We know that to get an attacking value of 0, you can't have queens in the same row or column. I limit the boards to always have a different row and column for each queen. I want the genetic algorithm to find a solution where the diagonals are also not attacking.
My problem is with creating a child for this solution using a genetic algorithm. What is a good way to generate a child from two parent boards, such that the child does not have queens in overlapping rows or columns?
Avoiding both overlapping rows and columns is difficult in a genetic algorithm. The typical approach is to implicitly represent the columns by the index in an array, and then have the queens represented by numbers 1..N.
So, a solution to the 8-queen problem would be represented by (5 1 8 4 2 7 3 6).
If you take any subset of the array you can mix it with another array and be guaranteed that there is one queen in each column (or row - however you prefer to think of this).
You can avoid both column and row overlap by restricting boards to permutations (so there are N! arrangements instead of N^N), but the issue is that the representation required to do this directly (you can essentially use a single integer to encode a full configuration) doesn't work as well for crossover operations. You will also run into the limits of integer representations. The array representation above works fairly well, so I would suggest exploring that approach first.
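If you do want children that keep one queen per row and per column, a permutation-preserving operator helps. Here is a minimal sketch in R of a simplified order-crossover-style operator (assuming each parent is a permutation of 1..N, with index = column and value = row):
order_crossover <- function(p1, p2) {
    n <- length(p1)
    cut <- sort(sample(n, 2))                  # choose a random slice
    child <- rep(NA_integer_, n)
    child[cut[1]:cut[2]] <- p1[cut[1]:cut[2]]  # copy the slice from parent 1
    child[is.na(child)] <- p2[!p2 %in% child]  # fill gaps with parent 2's rows, in order
    child                                      # still a permutation: no row or column clash
}
order_crossover(c(5, 1, 8, 4, 2, 7, 3, 6), sample(8))  # e.g. with the board shown above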

Use values from 2 matrices to index a third in R

I'm optimizing a more complex code, but got stuck with this problem.
a <- array(sample(1:10, 100, replace = TRUE), c(10, 10))  # row indices into f
m <- array(sample(1:10, 100, replace = TRUE), c(10, 10))  # column indices into f
f <- array(sample(1:10, 100, replace = TRUE), c(10, 10))  # values to look up
g <- array(NA, c(10, 10))                                 # result goes here
I need to use the values in a & m to index f and assign the value from f to g
i.e. g[1,1] <- f[a[1,1], m[1,1]], but for all indices, and as optimally/fast as possible.
I could obviously write a for loop to do this, but that seems rather dumb and slow. It seems like I should be able to use something from the apply family, but I've had no luck figuring out how. I do need to keep the data structured as it is here so that I can use matrix operations in other parts of my code. I've been searching for an answer to this but haven't found anything particularly helpful yet.
g[] <- f[cbind(c(a), c(m))]
This takes advantage of two facts: a matrix can be filled like a vector (g[] on the left keeps g's dimensions), and a matrix can be indexed by a two-column matrix of (row, column) pairs (cbind(c(a), c(m)) on the right).
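A quick sanity check against the explicit loop (throwaway verification code, using the arrays defined in the question):
g_loop <- array(NA, c(10, 10))
for (i in 1:10) for (j in 1:10) g_loop[i, j] <- f[a[i, j], m[i, j]]
all(g == g_loop)  # TRUE: both approaches agree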

Finding a closest looking segment of data in another sequence

I am doing image processing, in which I came across a situation where I have to compare two vectors and find an instance of the smaller vector in the larger one.
Say the two vectors are A, with 100 elements (or entries), and B, with 10 elements. B is a model, and it may not be present exactly as-is in the vector A. I can compare 10 elements at a time and find the difference. The ideal case is that B is present somewhere and the difference is zero. Otherwise a minimum will occur at some location, and I am missing a way to find that location.
Please help me with an algorithm such that I can find B's closest instance in A.
What you are looking for is the cross-correlation function. The peak of the cross-correlation of the two vectors will be the point where vector B is most similar to vector A.
You may want to read an explanation of how it is implemented in MATLAB HERE, as it gives an easier picture of how this operation can be implemented in software.
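As a rough sketch of the same search in R, you can slide B along A and score each alignment, here with a sum of squared differences rather than a true cross-correlation (closest_segment is a made-up helper name; A and B are assumed to be numeric vectors):
closest_segment <- function(A, B) {
    n <- length(A) - length(B) + 1
    scores <- vapply(seq_len(n), function(i)
        sum((A[i:(i + length(B) - 1)] - B)^2), numeric(1))  # difference at each shift
    which.min(scores)  # starting index of the closest-looking segment
}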

How can I structure and recode messy categorical data in R?

I'm struggling with how to best structure categorical data that's messy, and comes from a dataset I'll need to clean.
The Coding Scheme
I'm analyzing data from a university science course exam. We're looking at patterns in
student responses, and we developed a coding scheme to represent the kinds of things
students are doing in their answers. A subset of the coding scheme is shown below.
Note that within each major code (1, 2, 3) are nested non-unique sub-codes (a, b, ...).
What the Raw Data Looks Like
I've created an anonymized, raw subset of my actual data which you can view here.
Part of my problem is that those who coded the data noticed that some students displayed
multiple patterns. The coders' solution was to create enough columns (reason1, reason2,
...) to hold students with multiple patterns. That becomes important because the order
(reason1, reason2) is arbitrary--two students (like student 41 and student 42 in my
dataset) who correctly applied "dependency" should both register in an analysis, regardless of
whether 3a appears in the reason1 column or the reason2 column.
How Can I Best Structure Student Data?
Part of my problem is that in the raw data, not all students display the same
patterns, or the same number of them, in the same order. Some students may do just one
thing, others may do several. So, an abstracted representation of example students might
look like this:
Note in the example above that student002 and student003 both are coded as "1b", although I've deliberately shown the order as different to reflect the reality of my data.
My (Practical) Questions
Should I concatenate reason1, reason2, ... into one column?
How can I (re)code the reasons in R to reflect the multiplicity for some students?
Thanks
I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and stackoverflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.
Make it "long":
library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))  ## reason1, reason2, ... become rows
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student %in% c(41, 42)) ## see the results
What to do next will depend on the kind of analysis you would like to do. But the long format is the most useful one for irregular data such as yours.
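For instance, once the data is long, the check that motivated the question (did a student use code 3a in any reason column?) no longer depends on column order; melt stores the codes in a column named value by default:
subset(dnow, value == "3a")$Student  # every student coded 3a, whichever column it came from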
You should use ddply from plyr and split on all of the columns if you want to take the different reasons into account; if you want to ignore them, don't use those columns in the split. You'll need to clean up some of the question marks and extra stuff first, though.
library(plyr)
x <- ddply(data, c("split_column1", "split_column3"),  # etc.
           summarise, n = length(Student))             # swap in whatever stats you want
What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?
Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?
Something I'd consider if that's the case - split the data set into smaller random samples for your analysis to reduce the risk of false positives.
Interesting problem though!
