Using nchar function on factor variables - r

Can somebody explain what's going on here? When a variable is coded as a factor and nchar coerces it to character, why can't the function count the number of characters correctly?
> x <- c("73210", "73458", "73215", "72350")
> nchar(x)
[1] 5 5 5 5
>
> x <- factor(x)
> nchar(x)
[1] 1 1 1 1
>
> nchar(as.character(x))
[1] 5 5 5 5
thanks.

This happens because a factor stores your data internally as integer codes (1, 2, etc.), and nchar ends up counting the characters of those codes. What you mean to do is count the characters of the levels:
> nchar(levels(x)[x])
[1] 5 5 5 5
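
To make the codes-versus-levels distinction concrete, here is a minimal sketch that reuses the values from the question. The variable names codes and labels are just illustrative. It deliberately avoids calling nchar() on the factor itself, since recent versions of R may refuse a factor argument altogether.

```r
# Illustration only: same values as in the question.
x <- factor(c("73210", "73458", "73215", "72350"))

codes  <- as.integer(x)   # position of each element within levels(x)
labels <- levels(x)[x]    # the original strings, same as as.character(x)

print(levels(x))                   # sorted unique values
print(codes)                       # integer codes stored in the factor
print(nchar(as.character(codes)))  # character counts of the codes
print(nchar(labels))               # character counts of the labels
```

Counting characters on the codes gives 1 for every element because each code is a single digit here; counting on the labels gives the expected lengths.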

See the Warning section of ?factor:
The interpretation of a factor depends on both the codes and the
‘"levels"’ attribute. Be careful only to compare factors with the
same set of levels (in the same order). In particular,
‘as.numeric’ applied to a factor is meaningless, and may happen by
implicit coercion. To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
nchar(levels(x))

The other answers are correct, I think, that the issue is that nchar is examining the underlying integer codes, not the labels. However, what I think most directly addresses your question is this piece from ?nchar:
The internal equivalent of the default method of as.character is
performed on x (so there is no method dispatch)
I'm not 100% sure, but I suspect this means that the coercion that takes place in nchar is not the same thing that happens when you directly call as.character, most likely going directly to the integer codes, rather than "smartly" looking at the labels.

Related

Does 0 play any important role in the as.numeric function when using factors in R

Hi guys :) I know this question has been asked before (here, for example), but I would like to ask whether 0 plays any important role when using the as.numeric function. For example, we have the following simple code:
x2 <- factor(c(2,2,0,2), labels=c('Male','Female'))
as.numeric(x2) # I know this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate, but it returns NAs
This returns
[1] 2 2 1 2
Is 0 being replaced by 1 here? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector, which becomes the levels. That is, the factor is internally represented as c(2,2,1,2), where 1 corresponds to '0' and 2 to '2'.
You then go further by giving the levels some labels; by default these are identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the levels are still derived from the sorted unique values (0 and 2) and the codes are still c(2,2,1,2), but the first level now has the label Male and the second the label Female.
We can also decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value gets which level and label.
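
The three cases above can be checked side by side. This is only a sketch; the names v, f_default, f_labelled and f_explicit are made up for the illustration.

```r
v <- c(2, 2, 0, 2)

f_default  <- factor(v)                                    # levels "0", "2"
f_labelled <- factor(v, labels = c("Male", "Female"))      # 0 -> Male, 2 -> Female
f_explicit <- factor(v, levels = c(2, 0),
                     labels = c("Male", "Female"))         # 2 -> Male, 0 -> Female

print(as.integer(f_default))      # codes start at 1, so 0 maps to code 1
print(as.character(f_labelled))
print(as.integer(f_explicit))
print(as.character(f_explicit))

# Once labels are applied the levels are no longer numbers, which is why
# as.numeric(levels(f))[f] yields NA (with a coercion warning) here.
na_result <- suppressWarnings(as.numeric(levels(f_labelled))[f_labelled])
print(na_result)
```

So 0 is not treated specially: it simply happens to be the smallest unique value, and therefore receives code 1 unless levels= says otherwise.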

Xgboost - Do we have to convert integers to factors if they are only 0 & 1

I have many columns in a data frame that are flags, "0" and "1". They are of class "integer" when I import the data frame.
0 denotes absence and 1 denotes presence in all columns.
Do I need to convert them to factors? (Factors would give levels 1 and 2, whereas currently the values are the nearly equivalent 0 and 1, albeit as integers.)
I plan to later use xgboost to build a predictive model.
Xgboost works only on numeric columns, so if I convert the columns to factors I will then need to one-hot encode them to make them numeric again.
(Side question: do we always need to drop one column when one-hot encoding, to remove collinearity?)
Short answer: it depends. Yes, if only for better variable interpretation. No, because for 0/1 variables integers and factors behave the same.
If you ask my personal opinion, I lean towards yes: you will most likely also have some categorical variables that have string values, more than 2 levels, or 2 integer levels other than 0 and 1. In all of those cases integers and factors are NOT the same; only in the specific case of binary 0/1 levels are an integer variable and a factor equivalent. So you may want to keep your code consistent and adopt factors for the 0/1 case as well.
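To see for yourself:
a <- c(1,2,1,2,1,2,5)   # original numeric values
c <- as.character(a)    # the same values as strings
b <- as.factor(c)       # levels are "1", "2", "5"
d <- as.integer(b)      # integer codes, not the original values
Here I am just playing with a vector, which in the end gives me: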
> d
[1] 1 2 1 2 1 2 3
So if you don't want to debug why values are changing later on, use as.factor() from the start.
Side answer: Yes. Look up model.matrix() and its contrasts.arg argument to do this in R.
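
As a rough sketch of what model.matrix() does with flags and factors, assuming a made-up toy data frame (the column names flag and colour are hypothetical, not from the question):

```r
# Hypothetical toy data for illustration only.
df <- data.frame(
  flag   = c(0L, 1L, 1L, 0L),
  colour = factor(c("red", "green", "blue", "red"))
)

# A 0/1 integer flag turned into a factor gets codes 1/2, not 0/1.
flag_codes <- as.integer(factor(df$flag))

# With an intercept, the default treatment contrasts drop one level of colour.
mm_drop <- model.matrix(~ flag + colour, data = df)

# Without an intercept, the factor keeps one indicator column per level.
mm_full <- model.matrix(~ flag + colour - 1, data = df)

print(flag_codes)
print(colnames(mm_drop))
print(colnames(mm_full))
```

In both matrices the integer flag passes through unchanged as a single 0/1 column, so there is nothing to gain numerically from converting it to a factor first.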
The error states that xgb.DMatrix takes numeric values, whereas the data were integers.
To convert the data to numeric use
train[] <- lapply(train, as.numeric)
and then use
xgb.DMatrix(data=data.matrix(train))

Retrieving minimum non-numeric value

This might be too simple a question, but I'm still familiarising myself with R syntax.
I have a data frame with 2 columns and 3 rows:
The first column is a numeric vector from 1 to 3.
The second column is a character vector with values: best, good, worse.
Which function should I be using in order to obtain the minimum non-numeric value (i.e. "worse")?
Another solution would be to use an ordered factor for the character variable. This way min will know what to do:
dat <- data.frame(a=1:3, b=c("worst","good","best"))
dat$b <- ordered(dat$b, levels=c("worst","good","best"))
min(dat$b)
Result:
> min(dat$b)
[1] worst
Levels: worst < good < best
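
If you also want the row that holds the minimum, something along these lines should work. It is a small sketch on the same data; the names lowest, highest and worst_row are only illustrative.

```r
dat <- data.frame(a = 1:3, b = c("worst", "good", "best"))
dat$b <- ordered(dat$b, levels = c("worst", "good", "best"))

lowest  <- min(dat$b)   # smallest level according to the declared order
highest <- max(dat$b)   # largest level according to the declared order

# Comparison operators also respect the declared order of the levels.
worst_row <- dat[dat$b == lowest, ]

print(as.character(lowest))
print(as.character(highest))
print(worst_row)
```

The order comes entirely from the levels argument, so a plain (unordered) factor or a character column would not give min() a meaningful ranking.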

Index Vectors with Factors in R

I have a factor RFyhat which I'm looking to convert to a numeric vector. I've already discovered that
as.numeric(levels(RFyhat))[RFyhat]
works as desired, and I've played around a bit with this construction:
c(1,2,20,4,5,6,7)[RFyhat]
also works as expected (RFyhat has 7 levels).
So I understand the behavior of this construction, but I'm wondering if anyone can explain how this syntax is intended to work, or whether it is just 'syntactic sugar'. More specifically, does [RFyhat] act as an index vector? If it does, how do factors generally behave when used as an index?
Yes, I believe that factors get converted to integers when used for indexing, rather than to characters or anything else.
Look at this example
> fac <- factor(letters[c(1,1,2,1,3,3,2,1)])
> vec <- c(b=1, a=2, c=3)
> vec[fac]
b b a b c c a b
1 1 2 1 3 3 2 1
So element 1 of fac has returned element 1 of vec, regardless of the different order of names.
Personally I'd prefer as.integer(as.character(RFyhat)) to as.numeric(levels(RFyhat))[...].
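
For comparison, here is a short sketch of indexing by the factor's codes versus indexing by its labels, plus the numeric-recovery idiom on a made-up factor (num_fac and the other new names are hypothetical):

```r
fac <- factor(letters[c(1, 1, 2, 1, 3, 3, 2, 1)])
vec <- c(b = 1, a = 2, c = 3)

by_code <- vec[fac]                # factor used as index: integer codes, positional
by_name <- vec[as.character(fac)]  # character index: matched against names(vec)

print(by_code)
print(by_name)

# Recovering numeric values from a factor whose levels look like numbers.
num_fac  <- factor(c("10", "20", "10", "5"))
codes    <- as.numeric(num_fac)                    # the internal codes
restored <- as.numeric(levels(num_fac))[num_fac]   # the original values

print(codes)
print(restored)
```

The levels(...)[f] construction works because levels(f) is an ordinary vector and f, used as an index, supplies the position of each element's level.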

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
    firstcol secondcol
1          1         3
2          2         4
1.1        1         3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, so the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol], which many find more natural.
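
For reference, a base-R sketch of the main ways to refer to firstcol, using the data frame from the question. The result names are made up, and subset() is shown only as one base-R option that evaluates the column name inside the data frame.

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

rows_by_position  <- mydf[mydf$firstcol, ]       # values 1, 2, 1 used as row numbers
rows_by_condition <- mydf[mydf$firstcol == 1, ]  # logical index: rows where firstcol is 1

col_by_name   <- mydf[, "firstcol"]  # character index needs quotes
col_by_dollar <- mydf$firstcol       # $ takes an unquoted name

# subset() looks firstcol up inside mydf, so no prefix is needed.
rows_subset <- subset(mydf, firstcol == 1)

print(rows_by_position)
print(rows_by_condition)
print(rows_subset)
```

A bare firstcol inside [ ] is evaluated in the calling environment, where no such object exists, which is why mydf[firstcol, ] fails on a plain data frame.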
