Generating all permutations when length varies - r

Background: I am working with a qualitative data coding scheme that contains seven ordered levels of codes. Five of these contain a single option and two contain two mutually exclusive options. A given code can be a concatenation of up to seven component codes, but they must occur in the order of the levels (thus we have permutations rather than combinations). The hard part is that a code may contain any number of levels, 1-7.
Level 1 : A
Level 2 : B or C
Level 3 : D or E
Level 4 : F
Level 5 : G
Level 6 : H
Level 7 : I
Equally valid example codes : ABDFGHI, ACF, I, FGHI, ACE, FH
Issue: I need to create a list of all valid codes, but am struggling with strategy since the permutations can be of any length and I cannot find relevant existing questions posed here. My initial intent was to use R but any way I could get a complete list is welcome. Any pointers out there?

I am not sure exactly how you need your output, but this works. Assign each level to a variable, adding an NA to each one to represent "level not used". Then use expand.grid like so:
L1<-c("A",NA)
L2<-c("B","C",NA)
L3<-c("D","E",NA)
L4<-c("F",NA)
L5<-c("G",NA)
L6<-c("H",NA)
L7<-c("I",NA)
expand.grid(L1=L1,L2=L2,L3=L3,L4=L4,L5=L5,L6=L6,L7=L7)
Each row of the output will be a combination, with NA for the levels that are not included. Note that the last row, row 288, is all NA.
To get a row without the NAs you could do (using row 283 as an example):
Levels<-expand.grid(L1=L1,L2=L2,L3=L3,L4=L4,L5=L5,L6=L6,L7=L7)
Levels[283,][!is.na(Levels[283,])]
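If the goal is a plain character vector of codes rather than a data frame with NAs, the same idea can be sketched with an empty string standing in for a skipped level. This is only one possible variation on the approach above; the names levels_list, grid and codes are made up for illustration.

```r
# One entry per level; "" stands for "this level is omitted"
levels_list <- list(
  c("A", ""),
  c("B", "C", ""),
  c("D", "E", ""),
  c("F", ""),
  c("G", ""),
  c("H", ""),
  c("I", "")
)

# All 2*3*3*2*2*2*2 = 288 combinations, kept as character columns
grid <- expand.grid(levels_list, stringsAsFactors = FALSE)

# Paste the columns together row by row, in level order
codes <- do.call(paste0, grid)

# Drop the single combination in which every level was omitted
codes <- codes[codes != ""]
```

Because the columns are pasted in level order, every resulting string respects the ordering constraint, and the empty-string trick means no NA handling is needed afterwards.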

Related

Does 0 play any important role in the as.numeric function when using factors in R

Hi guys :) I know this question has been asked before here for example but I would like to ask if 0 plays any important role using the as.numeric function. For example, we have the following simple code
x2<-factor(c(2,2,0,2), label=c('Male','Female'))
as.numeric(x2) # knowing that this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate but returns NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1 ? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector, which becomes the levels. That is, the factor is internally represented as c(2,2,1,2), where 1 corresponds to '0' and 2 to '2'.
You then go further by giving the levels some labels; these are normally identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the levels are still derived from the sorted unique values (0 and 2) and the internal codes are still c(2,2,1,2), but the first level is now labelled Male and the second Female.
We can also decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value gets which level and label.
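To make the three cases concrete, here is a small sketch that inspects the integer codes behind each factor. The variable names are only illustrative.

```r
x <- c(2, 2, 0, 2)

# Default: levels are the sorted unique values, "0" then "2"
f_default <- factor(x)
default_levels <- levels(f_default)
default_codes  <- as.integer(f_default)

# Same codes, but the levels are relabelled in sorted order
f_labelled <- factor(x, labels = c("Male", "Female"))
labelled_codes <- as.integer(f_labelled)

# Explicit levels: 2 becomes the first level, 0 the second
f_explicit <- factor(x, levels = c(2, 0), labels = c("Male", "Female"))
explicit_codes  <- as.integer(f_explicit)
explicit_values <- as.character(f_explicit)
```

In the first two cases the codes are 2 2 1 2, so the 0 in the input is not "replaced by 1"; it simply sorts first and therefore receives code 1. With explicit levels the codes become 1 1 2 1.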

R factor and level

Levels make sense, since they are the unique values of the vector, but I can't get my head around what a factor is. It just seems to repeat the vector values.
factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what a factor is supposed to do, or why I would use it?
I'm starting to wonder if factors are like a code table in a database, where the factor name is the code table name and the levels are the unique options of the code table?
A factor is stored as a vector of integer codes plus a lookup table of levels, rather than as a raw character vector, so your code-table analogy is close. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over, encoded in ASCII. Now imagine if you just had to store the number 16 (in binary) 100,000 times, plus another table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, frequently we test for values "across all categories" (think ANOVA or what you would color a stacked barplot by). We can either repeatedly encode all of our functions to stack up observed choices in a string vector or we can simply create a new type of vector which will tell you what the valid choices are. That is called a factor, and the valid choices are called levels.
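A quick way to see both points is to look inside a small factor. This is just an illustrative sketch; the state vector and its levels are made up.

```r
# Declare three valid choices, even though only two are observed
state <- factor(
  c("New Jersey", "Ohio", "New Jersey"),
  levels = c("New Jersey", "Ohio", "Texas")
)

storage <- typeof(state)      # the underlying storage type
codes   <- as.integer(state)  # the integer codes, one per observation
lookup  <- levels(state)      # the code table mapping codes to labels

# table() counts across all declared levels, including unobserved ones
counts <- table(state)
```

Here the values are held as the integers 1 2 1, and table() reports a count of 0 for Texas because the factor knows it is a valid choice even though it never occurs.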

R commands for finding mode in R seem to be wrong

I watched a video on YouTube about finding the mode in R from a list of numerics. When I enter the commands they do not work. R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then instructor says use
temp <- table(as.vector(x))
to basically sort all unique values in list. R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given. Then he says to use command,
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list. I would like to stay with these commands as far as is possible as the instructor explains them in detail. Am I doing a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
"to basically sort all unique values in list."
That's not exactly what this command does. sort(unique(X)) would give a sorted vector of the unique values. (Note that in R, lists and vectors are different kinds of objects; it's best not to use the words interchangeably.) What table() does is count the number of instances of each unique value, in sorted order. Also, as.vector() is redundant.
"R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given."
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
"Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list."
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --), which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...
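Putting the corrected steps together, each intermediate result can be stored and inspected. The names is_max and mode_value are only for illustration.

```r
X <- c(1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9)

# Count how often each unique value occurs
temp <- table(X)

# TRUE only where the count equals the largest count
is_max <- temp == max(temp)

# Pick the labels (the unique values, as strings) where is_max is TRUE
mode_value <- names(temp)[is_max]

# names() returns character strings, so convert back if a number is needed
mode_number <- as.numeric(mode_value)
```

If several values tie for the highest count, mode_value will contain all of them.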

Finding Chi-Squared with NA values

I have two vectors, both of which have NA values in them. I am trying to find a Chi-Squared value for a table I created with the two vectors, but I get this error:
Error in chisq.test(data.table) :
all entries of 'x' must be nonnegative and finite
Is there a code to remove the NA values from the table?
I did find some codes to do this for vectors but I am not sure how this would work. If an NA value gets deleted from one vector, will the corresponding value from the other vector not go into the Chi-Squared calculation?
The vectors have over 8,000 values each and each row corresponds to one subject, so if that subject failed to answer a question, I wouldn't want to use his/her other answer either. I hope that makes sense.
One solution would be to pull out the NA values from your data before you even run the test.
Reproducibility would be helpful here, but I'm guessing your data look something like this:
control<-c(runif(5),NA,runif(4))
treatment<-c(runif(3),NA,runif(6))
In this case, by putting your data into a data frame, you can remove both values for every subject with an NA in either value:
df<-data.frame(control,treatment)
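df<-df[!is.na(df$treatment),] # keep rows where treatment is present; unlike -which(), this is safe when there are no NAs
df<-df[!is.na(df$control),]   # then keep rows where control is present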
Your data now only includes subjects without any missing data, and can be tested as you please.
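The same row-wise filtering can be done in one step with complete.cases(). The sketch below uses made-up categorical answers (q1 and q2 are hypothetical column names), since a chi-squared test on a contingency table needs categories rather than continuous values.

```r
# Hypothetical answers from 100 subjects, with some missing values
q1 <- rep(c("yes", "no", NA, "yes", "no"), times = 20)
q2 <- rep(c("agree", "disagree", "agree", NA), times = 25)
answers <- data.frame(q1, q2)

# Keep only subjects who answered both questions
complete <- answers[complete.cases(answers), ]

# Cross-tabulate the paired answers and run the test
tab <- table(complete$q1, complete$q2)
result <- chisq.test(tab)
```

Because whole rows are dropped, a subject who skipped one question contributes nothing to the table, which is the behaviour asked for in the question.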

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factor lists are again factorlists and then duplicates are treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs. So the vectors contain pure integer and should give the intersect among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() in here. Just let R handle factors by itself (although to be fair, there are other scenarios where you do want to step in).
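The effect of level ordering on the raw codes can also be shown without c(). As far as I know, from R 4.1.0 onwards c() applied to a factor returns a factor rather than the bare integer codes, so the sketch below uses as.integer() and as.character() to make the stripping explicit. Treat that version detail as an assumption to check against your own R installation.

```r
a <- factor(letters[1:3], levels = letters)
b <- factor(letters[1:3], levels = rev(letters))

# Same values, but different integer codes because the level order differs
codes_a <- as.integer(a)
codes_b <- as.integer(b)

# Intersecting the raw codes compares positions in the level tables
codes_in_common <- intersect(codes_a, codes_b)

# Intersecting the labels compares the actual values
values_in_common <- intersect(as.character(a), as.character(b))
```

Here codes_a is 1 2 3 while codes_b is 26 25 24, so the codes share nothing even though both factors hold the values a, b and c.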
