Index Vectors with Factors in R

I have a factor RFyhat which I'm looking to convert to a numeric vector. I've already discovered that
as.numeric(levels(RFyhat))[RFyhat]
works as desired. Playing around with this construction, I found that
c(1,2,20,4,5,6,7)[RFyhat]
also works as expected (RFyhat has 7 levels).
So I understand the behavior of this construction, but I'm wondering if anyone can explain how this syntax is intended to work, or whether it is just 'syntactic sugar'. More specifically, does [RFyhat] act as an index vector? If it does, how do factors generally behave when used as an index?

Yes, I believe that factors get converted to their integer codes when used for indexing, rather than to characters or anything else.
Look at this example
> fac <- factor(letters[c(1,1,2,1,3,3,2,1)])
> vec <- c(b=1, a=2, c=3)
> vec[fac]
b b a b c c a b
1 1 2 1 3 3 2 1
So element 1 of fac has returned element 1 of vec, regardless of the different order of names.
Personally I'd prefer as.integer(as.character(RFyhat)) to as.numeric(levels(RFyhat))[...].
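To make the difference concrete, here is a small sketch that reuses fac and vec from the example above and compares indexing by the factor itself with indexing by its character values. The names by_code and by_name are only illustrative.

```r
fac <- factor(letters[c(1, 1, 2, 1, 3, 3, 2, 1)])  # values: a a b a c c b a
vec <- c(b = 1, a = 2, c = 3)

# Indexing with the factor uses its integer codes (positions in levels(fac)).
by_code <- vec[fac]

# Indexing with the character values looks elements up by name instead.
by_name <- vec[as.character(fac)]

# The factor index behaves like an explicit integer index.
stopifnot(identical(by_code, vec[as.integer(fac)]))

unname(by_code)  # 1 1 2 1 3 3 2 1
unname(by_name)  # 2 2 1 2 3 3 1 2
```

So a factor index is positional, which is why c(1,2,20,4,5,6,7)[RFyhat] picks the element whose position equals each level code.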

Related

Does 0 play any important role in the as.numeric function when using factors in R

Hi guys :) I know this question has been asked before (here, for example), but I would like to ask whether 0 plays any special role when using the as.numeric function. For example, we have the following simple code
x2<-factor(c(2,2,0,2), label=c('Male','Female'))
as.numeric(x2) # knowing that this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate but returns NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1 ? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector for the levels. I.e. the levels are '0' and '2', and the factor is internally represented as c(2,2,1,2), where 1 corresponds to '0' and 2 to '2'.
You then go further by giving the levels some labels; these are normally identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the levels are still derived from the sorted unique values (0 and 2), so the internal codes are still c(2,2,1,2), but the first level (0) is now labelled Male and the second level (2) Female.
We can decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value gets which level and label: 2 becomes Male and 0 becomes Female.
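A short sketch of the three cases described above may help. The variable names x_plain and x3 are made up for illustration; x2 is the factor from the question (written with the full argument name labels).

```r
# Default levels: sorted unique values, converted to character.
x_plain <- factor(c(2, 2, 0, 2))
levels(x_plain)      # "0" "2"
as.integer(x_plain)  # 2 2 1 2

# Same codes, but the levels are relabelled in sorted order (0 -> Male, 2 -> Female).
x2 <- factor(c(2, 2, 0, 2), labels = c("Male", "Female"))
as.integer(x2)       # 2 2 1 2
as.character(x2)     # "Female" "Female" "Male" "Female"

# Explicit levels: 2 is the first level (Male), 0 is the second (Female).
x3 <- factor(c(2, 2, 0, 2), levels = c(2, 0), labels = c("Male", "Female"))
as.integer(x3)       # 1 1 2 1
as.character(x3)     # "Male" "Male" "Female" "Male"
```

The 0 in the input is never "replaced by 1"; it simply happens to be the smallest value, so it becomes the first level.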

How to find the row of a data.table containing the most matches of a query vector

I have a data.table like
library(data.table)
ffDummy_dt = data.table(Annotation=c("chr10:10..20,-", "chr10:25..30,-"
,"chr10:35..100,-","chr10:106..205,-","chr10:223..250,-","chr10:269..478,-"
,"chr10:699..1001,-","chr10:2000..2210,-","chr10:2300..2500,-"
,"chr10:2678..5678,-"),tpmOne=c(0,0,0.213,1,1.2,0.5,0.7,0.9,0.8,0.86),
tpmTwo=c(100,1000,1001,1500,900,877,1212,1232,1312,0),tpmThree=c(0.2138595,0,0,0
,0,0,0.6415786,0,0,0))
I want to pass a query (can be vector or even a data.table if need be) like:
test_v = c(0,0,0.86)
I want to find out which row is the best match.
In my real use case, test_v is about 20 elements long and nrow(ffDummy_dt) is >>20 (but likely there will only be one perfect match per 20-element vector).
Currently,
which.max(apply(as.matrix(ffDummy_dt[,2:ncol(ffDummy_dt),with=F]), 1,
function(k) sum(test_v%in%k)))
seems to work (gives the correct output in this case, which is 10), but this is not a data.table solution.
I've had a look here but can't quite figure out how to use %in% k above with data.table.
Assuming you actually want the matches to be exclusive (that seems to me to make more sense for a row to be a "best match"), you can do:
Reduce(`+`, lapply(ffDummy_dt, `%in%`, test_v))
#[1] 1 2 1 1 1 1 0 1 1 3
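To go from the per-row counts to the best-matching row inside a data.table call, something like the following might work. It is only a sketch on a shortened, made-up table (small_dt, tpm_cols, hits and best_row are illustrative names), and it assumes exact equality is an acceptable notion of "match" for the numeric values.

```r
library(data.table)

# Shortened, hypothetical stand-in for ffDummy_dt.
small_dt <- data.table(
  Annotation = c("r1", "r2", "r3", "r4"),
  tpmOne     = c(0, 0.213, 1, 0.86),
  tpmTwo     = c(100, 1001, 1500, 0),
  tpmThree   = c(0.2, 0, 0, 0)
)
test_v <- c(0, 0, 0.86)

# Only the numeric columns take part in the comparison.
tpm_cols <- setdiff(names(small_dt), "Annotation")

# For each column, flag cells whose value occurs in test_v, then add the flags row-wise.
hits <- small_dt[, Reduce(`+`, lapply(.SD, `%in%`, test_v)), .SDcols = tpm_cols]

hits                   # 1 1 1 3
best_row <- which.max(hits)
small_dt[best_row]     # the row with the most matching cells
```

Note that this counts cells of a row that appear in test_v, whereas the apply version in the question counts elements of test_v that appear in the row, so the two can disagree when values repeat.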

Combining the common elements in two lists in R, using only logical and arithmetic operators

I'm currently attempting to work out the GCD of two numbers (x and y) in R. I'm not allowed to use loops or if, else, ifelse statements, so I'm restricted to logical and arithmetic operators. So far, using the code below (and the same for y), I've managed to make lists of the factors of x and y.
xfac<-1:x
xfac[x%%xfac==0]
This gives me two lists of factors, but I'm not sure where to go from here. Is there a way I can combine the common elements in the two lists and then return the greatest value?
Thanks in advance.
Yes, max(intersect(xfac,yfac)) should give the gcd.
You have almost solved the problem. Let's take the example x <- 12 and y <- 18. The GCD is in this case 6.
We can start by creating vectors xf and yf containing the divisors of each number, similar to the code you have shown:
xf <- (1:x)[!(x%%(1:x))]
#> xf
#[1] 1 2 3 4 6 12
yf <- (1:y)[!(y%%(1:y))]
#> yf
#[1] 1 2 3 6 9 18
The parentheses after the negation operator ! are not necessary due to specific rules of operator precedence in R, but I think that they make the code clearer in this case (see fortunes::fortune(138)).
Once we have defined these vectors, we can extract the GCD with
max(xf[xf %in% yf])
#[1] 6
Or, equivalently,
max(yf[yf %in% xf])
#[1] 6
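Putting the steps together, a minimal sketch could look like this. The function name gcd_by_divisors is made up here, and it assumes x and y are positive integers; it uses only indexing, %% and comparison, with no loops or if statements.

```r
gcd_by_divisors <- function(x, y) {
  # Divisors of x: candidates 1..x that leave remainder 0.
  xf <- (1:x)[x %% (1:x) == 0]
  # Divisors of y, built the same way.
  yf <- (1:y)[y %% (1:y) == 0]
  # The GCD is the largest divisor the two numbers share.
  max(xf[xf %in% yf])
}

gcd_by_divisors(12, 18)  # 6
gcd_by_divisors(7, 13)   # 1
```

Since 1 divides every positive integer, the set of common divisors is never empty, so max always has something to work on.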

Using nchar function on factor variables

Can somebody explain to me what's going on here? When a variable is coded as a factor and nchar coerces it to character, why can't that function count the number of characters correctly?
> x <- c("73210", "73458", "73215", "72350")
> nchar(x)
[1] 5 5 5 5
>
> x <- factor(x)
> nchar(x)
[1] 1 1 1 1
>
> nchar(as.character(x))
[1] 5 5 5 5
thanks.
It is because with a factor, your data is represented by 1, 2, etc. What you mean to do is count the characters of the levels:
> nchar(levels(x)[x])
[1] 5 5 5 5
See the warning section of ?factor:
The interpretation of a factor depends on both the codes and the
‘"levels"’ attribute. Be careful only to compare factors with the
same set of levels (in the same order). In particular,
‘as.numeric’ applied to a factor is meaningless, and may happen by
implicit coercion. To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
nchar(levels(x))
The other answers are correct, I think, that the issue is that nchar is examining the underlying integer codes, not the labels. However, what I think most directly addresses your question is this piece from ?nchar:
The internal equivalent of the default method of as.character is
performed on x (so there is no method dispatch)
I'm not 100% sure, but I suspect this means that the coercion that takes place in nchar is not the same thing that happens when you directly call as.character, most likely going directly to the integer codes, rather than "smartly" looking at the labels.
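The following sketch separates the two representations explicitly, so it does not rely on how nchar itself treats a factor (as far as I know, recent versions of R refuse a factor argument to nchar with an error rather than returning 1 1 1 1). The names codes and labels_chr are only illustrative.

```r
x <- factor(c("73210", "73458", "73215", "72350"))

# The integer codes are positions in the sorted levels.
codes <- as.integer(x)
codes                        # 2 4 3 1

# Counting characters of the codes reproduces the 1 1 1 1 from the question.
nchar(as.character(codes))   # 1 1 1 1

# Counting characters of the labels gives the expected answer.
labels_chr <- levels(x)[x]
nchar(labels_chr)            # 5 5 5 5

# Both ways of recovering the labels agree.
stopifnot(identical(labels_chr, as.character(x)))
```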

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
    firstcol secondcol
1          1         3
2          2         4
1.1        1         3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, where the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and in many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol] (once mydf is a data.table rather than a plain data.frame), which many find more natural.
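For completeness, here is a base R sketch of a few ways to refer to firstcol without data.table. It assumes what was wanted is the rows where firstcol equals 1; the rows_* names are only illustrative.

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

# Fully qualified: the column is looked up inside mydf explicitly.
rows_dollar <- mydf[mydf$firstcol == 1, ]

# Character index: the column name is passed as a quoted string.
rows_string <- mydf[mydf[["firstcol"]] == 1, ]

# with() evaluates the expression using mydf's columns as variables.
rows_with <- with(mydf, mydf[firstcol == 1, ])

# subset() does the same kind of lookup for its condition argument.
rows_subset <- subset(mydf, firstcol == 1)

rows_dollar$secondcol  # 3 5
```

In the last two forms the bare name firstcol works because the function evaluates it inside the data frame, which the plain `[` operator does not do.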
