rpart - feature types - R

rpart uses different splitting procedures for continuous, ordinal and categorical variables. Is there a way to "inform" rpart about the variable type? For illustration, I have an ordinal variable with integer values (1,...,5). Right now, I need to coerce it into characters so that rpart will not split it like a continuous variable.
I would like to refrain from changing all my variable types, just for RPART. I would prefer to declare it, somehow.
Thanks.

The problem is: how is R to know that foo <- c(1,2,3,2,4,5,1,5) (for example) is not a numeric variable? If you look at the class of foo, you'll see it is numeric.
R> class(foo)
[1] "numeric"
The problem you have is that, at a very basic level, you didn't tell R what the data types were. The simple solution in this case is not to convert this to a character vector, but to convert it to an ordered factor, if only for the fact that that is what the data is! rpart should pick up the factor aspect and treat it accordingly.
Therefore, the way to inform rpart that the variable is ordinal is to tell R that it is ordinal:
foo <- as.ordered(foo)
R> foo
[1] 1 2 3 2 4 5 1 5
Levels: 1 < 2 < 3 < 4 < 5
I suspect you are missing out on other features of R because you fail to tell it the nature of the data. R is making an assumption about it which is not correct.
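If you prefer not to permanently rework your whole data set, you can convert just the one column inside the data frame before fitting. A minimal base-R sketch with made-up column names (the toy data frame stands in for the asker's real one):

```r
# Toy data frame; "rating" plays the role of the 1..5 integer column
set.seed(1)
df <- data.frame(
  rating  = sample(1:5, 100, replace = TRUE),
  outcome = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

# Convert only that one column to an ordered factor; everything else is untouched
df$rating <- factor(df$rating, levels = 1:5, ordered = TRUE)

is.ordered(df$rating)   # TRUE
levels(df$rating)       # "1" "2" "3" "4" "5"
```

The converted column can then be passed to rpart as-is, and the original integer data on disk never changes.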

Related

Does 0 play any important role in the as.numeric function when using factors in R?

Hi guys :) I know this question has been asked before (here, for example), but I would like to ask whether 0 plays any important role in the as.numeric function. For example, we have the following simple code
x2 <- factor(c(2,2,0,2), labels=c('Male','Female'))
as.numeric(x2) # knowing that this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate, but returns NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1 ? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated, as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector for the levels. I.e. the factor is internally represented as c(2,2,1,2), where 1 corresponds to '0' and 2 to '2'.
You then go further by giving the levels some labels; these are normally identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the internal codes are still derived from the sorted, unique values (i.e. c(2,2,1,2)), but the levels now carry the label Male for the first level and Female for the second.
We can decide ourselves which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value should get which level and label.
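The recoding described above can be verified directly in base R:

```r
x <- c(2, 2, 0, 2)

# Default: levels are the sorted unique values, "0" then "2"
f1 <- factor(x)
as.integer(f1)   # internal codes: 2 2 1 2

# Explicit levels reverse the mapping: 2 becomes level 1, labelled "Male"
f2 <- factor(x, levels = c(2, 0), labels = c("Male", "Female"))
as.integer(f2)   # internal codes: 1 1 2 1
```

Either way, 0 is never treated specially; it is just one more level in the sorted (or explicitly supplied) level set.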

What are the practical differences between 'factor' and 'string' data types in R?

From other programming languages I am familiar with the string data type. In addition to this data type, R also has the factor data type. I am new to the R language, so I am trying to wrap my head around the intent behind this new data type.
Question: What are the practical differences between 'factor' and 'string' data types in R?
I get that (on a conceptual/philosophical level) the factor data type is supposed to encode the values of a categorical random variable, but I do not understand (on a practical level) why the string data type would be insufficient for this purpose.
Seemingly having duplicate data types which serve the same practical purpose would be bad design. However, if R were truly poorly designed on such a fundamental level, it would be much less likely to have achieved the level of popularity it has. So either a very improbable event has happened, or I am misunderstanding the practical significance/purpose of the factor data type.
Attempt: The one thing I could think of is the concept of "factor levels", whereby one can assign an ordering to factors (which one can't do for strings), which is helpful when describing "ordinal categorical variables", i.e. categorical variables with an order (e.g. "Low", "Medium", "High").
(Although even this wouldn't seem to make factors strictly necessary. Since the ordering is always linear (i.e. no true partial orders) on countable sets, we could always accomplish the same with a map from some subset of the integers to the strings in question; in practice, however, that would probably be a pain to implement over and over again, and a naive implementation would probably not be as efficient as the implementation of factors and factor levels built into R.)
However, not all categorical variables are ordinal, some are "nominal" (i.e. have no order). And yet "factors" and "factor levels" still seem to be used with these "nominal categorical variables". Why is this? I.e. what is the practical benefit to using factors instead of strings for such variables?
The only other information I could find on this subject is the following quote here:
Furthermore, storing string variables as factor variables is a more efficient use of memory.
What is the reason for this? Is this only true for "ordinal categorical variables", or is it also true for "nominal categorical variables"?
Related but different questions: These questions seem relevant, but don't specifically address the heart of my question -- namely, the difference between factors and strings, and why having such a difference is useful (from a programming perspective, not a statistical one).
Difference between ordered and unordered factor variables in R
Factors ordered vs. levels
Is there an advantage to ordering a categorical variable?
factor() command in R is for categorical variables with hierarchy level only?
Practical differences:
If x is a string it can take any value. If x is a factor it can only take values from the list of its levels. That also makes these variables more memory efficient.
example:
> x <- factor(c("cat1","cat1","cat2"),levels = c("cat1","cat2") )
> x
[1] cat1 cat1 cat2
Levels: cat1 cat2
> x[3] <- "cat3"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "cat3") :
invalid factor level, NA generated
> x
[1] cat1 cat1 <NA>
Levels: cat1 cat2
As you said, you can have ordinal factors, meaning that you can attach extra information to your variable, for instance that level1 < level2 < level3. Characters don't have that. However, the order doesn't necessarily have to be linear; not sure where you found that.
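The memory point from the quoted answer can be checked with object.size(): a factor stores one small integer per element plus each level string once, whereas a character vector stores one string reference per element (the exact savings depend on the R build):

```r
# 100,000 observations drawn from 3 categories
x_chr <- sample(c("Low", "Medium", "High"), 1e5, replace = TRUE)
x_fct <- factor(x_chr, levels = c("Low", "Medium", "High"), ordered = TRUE)

object.size(x_chr)   # roughly 8 bytes per element for the string references
object.size(x_fct)   # roughly 4 bytes per element plus the 3 level strings
```

The fewer distinct values relative to the vector length, the larger the relative saving, which is exactly the categorical-variable case.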

Xgboost - Do we have to convert integers to factors if they are only 0 & 1

I have many columns in a dataframe that are flags "0" and "1". They belong to class "integer" when I import the dataframe.
0 denotes absence and 1 denotes presence in all columns.
Do I need to convert them to factors? [factors will make the levels 1 & 2, while currently the values are 0 & 1, albeit integers]
I plan to later use xgboost to build a predictive model.
Xgboost works only on numeric columns, so if I convert the columns to factors then I will need to one-hot encode them to convert them back to numeric.
(Side question: Do we always need to drop one column if we do one hot encoding to remove collinearity?)
Short answer: it depends. Yes, for the sake of better variable interpretation. No, because for 0/1 variables integers and factors behave the same.
If you ask my personal opinion, I lean towards YES, as you will most likely also have some categorical variables that either hold string values, or have more than 2 levels, or have 2 integer levels other than 0 and 1. In all those cases integer variables and factors are NOT the same; only in the specific case of 0/1 binary levels are they equivalent. So you may want to bring consistency to your coding and adopt factors for the 0/1 case as well.
To see yourself:
a <- c(1, 2, 1, 2, 1, 2, 5)
s <- as.character(a)   # "1" "2" "1" "2" "1" "2" "5"
b <- as.factor(s)      # levels: "1" "2" "5"
d <- as.integer(b)     # level codes, not the original values
Here I am just playing with a vector, which in the end gives me:
> d
[1] 1 2 1 2 1 2 3
So if you don't want to debug why values change later on, use as.factor() from the start.
Side Answer: Yes. Search for model.matrix() and contrasts.arg for getting this done in R.
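The model.matrix() route for one-hot encoding can be sketched like this (toy data; xgboost itself is not needed for the encoding step):

```r
df <- data.frame(g = factor(c("a", "b", "c", "a")))

# Full dummy coding: one column per level (drop the intercept with "- 1")
X_full <- model.matrix(~ g - 1, data = df)

# Treatment coding: the intercept absorbs the reference level, so one
# dummy column is omitted, which removes the collinearity
X_ref <- model.matrix(~ g, data = df)[, -1, drop = FALSE]

ncol(X_full)   # 3 columns, one per level
ncol(X_ref)    # 2 columns, reference level dropped
```

For a regularized or tree-based learner like xgboost the dropped column is not strictly required, but for linear models with an intercept it is what avoids the collinearity the side question asks about.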
If you get an error here, it is because xgb.DMatrix takes numeric values, whereas the data are integers.
To convert the data to numeric use
train[] <- lapply(train, as.numeric)
and then use
xgb.DMatrix(data=data.matrix(train))

R: Is the assignment order of vector elements indexed by ordinal well defined?

When assigning to a vector by ordinal, is the order of assignment well-defined, or is this implementation dependent? Are there any language specifications regarding this?
x <- 1:10
x[c(1,1,2,2,3,3,4,4,3,3)] <- 1:10
In the above code, the resultant vector is 2 4 10 8 5 6 7 8 9 10 on my system. Are all R implementations required to assign to each element in order, or are they free to assign in any order?
From ?"[<-":
Subassignment is done sequentially, so if an index is specified more
than once the latest assigned value for an index will result.
Therefore the result should be consistent.
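Applied to the example from the question, "sequentially" means the last value written to each repeated index wins:

```r
x <- 1:10
x[c(1, 1, 2, 2, 3, 3, 4, 4, 3, 3)] <- 1:10
# Index 1 receives 1 then 2; index 3 receives 5, 6, 9 and finally 10; etc.
x   # 2 4 10 8 5 6 7 8 9 10
```

So the result reported in the question is exactly what the documented sequential semantics require, on every conforming R implementation.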

Does R randomForest's rfcv method actually say which features it selected, or not?

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,
> data(fgl, package="MASS")
> tst <- rfcv(trainx = fgl[,-10], trainy = fgl[,10], scale = "log", step=0.7)
> tst$error.cv
9 6 4 3 2 1
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458
In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,
> attributes(tst)
$names
[1] "n.var" "error.cv" "predicted"
None of these slots tells me what those first three variables that can be harmlessly removed from the dataset actually were.
I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
As you probably found out, this code
rf<-randomForest(type ~ .,data=fgl)
importance(rf)
gives you the relative importance of each of the variables.
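To actually cull the variables, you can rank them by importance and refit on the top ones. A sketch (assumes the randomForest and MASS packages are installed; keeping 6 variables follows the error.cv minimum shown above, and results will vary with the random seed):

```r
library(randomForest)
data(fgl, package = "MASS")

rf   <- randomForest(type ~ ., data = fgl)
imp  <- importance(rf)[, 1]                       # mean decrease in Gini
keep <- names(sort(imp, decreasing = TRUE))[1:6]  # the 6 most important variables

# Refit with more trees on the reduced predictor set
rf2 <- randomForest(fgl[, keep], fgl$type, ntree = 1000)
```

rfcv tells you how many variables you can afford to drop; importance() tells you which ones those are likely to be.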