I have saved models which were created using the rpart package in R. I am trying to retrieve some information from these saved models; specifically from rpart.object. While the documentation - rpart doc - is helpful there are a few things it is not clear about:
How do I find out which variables are categorical and which are numeric? Currently, what I do is refer to the 'index' column in the splits matrix. I've noticed that for numeric variables only, the entry is not an integer. Is there a cleaner way to do this?
The csplit matrix refers to the values a categorical variable can take using integers, i.e. R maps the original names to integers. Is there a way to access this mapping? For example, if my original variable, say Country, can take any of the values France, Germany, Japan, etc., the csplit matrix tells me that a certain split is based on Country == 1, 2. Here, rpart has replaced references to France and Germany with 1 and 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between the names and the integers is?
Generally it is the terms component that would have that sort of information. See ?rpart::rpart.object.
fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
Kyphosis Age Number Start
"factor" "numeric" "numeric" "numeric"
That example doesn't have a csplit node in its structure because none of the variables are factors. You could make one fairly easily:
> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 1 3
[3,] 3 1 3
[4,] 1 3 3
[5,] 3 1 3
[6,] 3 3 1
[7,] 3 1 3
[8,] 1 1 3
> attr(fit$terms, "dataClasses")
Kyphosis
"factor"
Age
"numeric"
factor(findInterval(Number, c(0, 4, 6, Inf)))
"factor"
Start
"numeric"
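The mapping between names and integers is the usual factor one: level k is levels(x)[k], the same correspondence as between as.numeric() and the levels() of a factor. Be careful when reading fit$csplit itself, though. According to ?rpart::rpart.object, each column of csplit corresponds to one level of the factor, and each entry is a direction code: 1 if that level goes left at the split, 3 if it goes right, and 2 if the level is not present at that node. If I were trying to construct a character matrix version of the fit$csplit-matrix, this would be one path; note that it indexes levels() by those codes, so in the output "low" simply marks a 1 and "high" marks a 3, and each column (not each cell) stands for one level of Numlev: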
> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame': 81 obs. of 5 variables:
$ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
$ Age : int 71 158 128 2 1 1 61 37 113 59 ...
$ Number : int 3 3 4 5 4 2 2 3 2 6 ...
$ Start : int 5 14 5 1 15 16 17 16 16 12 ...
$ Numlev : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age +Numlev + Start, data = kyphosis)
> Levels <- fit$csplit
> Levels[] <- levels(kyphosis$Numlev)[Levels]
> Levels
[,1] [,2] [,3]
[1,] "low" "low" "high"
[2,] "low" "low" "high"
[3,] "high" "low" "high"
[4,] "low" "high" "high"
[5,] "high" "low" "high"
[6,] "high" "high" "low"
[7,] "high" "low" "high"
[8,] "low" "low" "high"
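As a hedged sketch of reading csplit the way ?rpart::rpart.object describes it (columns are factor levels; entries 1, 2, 3 mean left, not present, right), the following decodes each categorical split into a named direction vector. The toy data frame d, the helper decode_csplit() and its level_names argument are hypothetical names made up for illustration; the sketch assumes that rows of fit$splits with ncat > 1 are the factor splits and that their index column points at the matching row of fit$csplit.

```r
library(rpart)

# Toy data (assumption): the response is fully determined by Country,
# so rpart is expected to split on the factor.
d <- data.frame(
  Country = factor(rep(c("France", "Germany", "Japan"), each = 20)),
  y       = factor(rep(c("yes", "yes", "no"), each = 20))
)
fit <- rpart(y ~ Country, data = d)

# Hypothetical helper: turn csplit rows into "left"/"absent"/"right" per level.
decode_csplit <- function(fit, level_names) {
  splits    <- fit$splits
  cat_rows  <- which(splits[, "ncat"] > 1)      # factor splits only
  direction <- c("left", "absent", "right")     # meaning of codes 1, 2, 3
  out <- lapply(cat_rows, function(i) {
    lev   <- level_names[[rownames(splits)[i]]]
    codes <- fit$csplit[splits[i, "index"], seq_along(lev)]
    setNames(direction[codes], lev)
  })
  names(out) <- rownames(splits)[cat_rows]
  out
}

decoded <- decode_csplit(fit, list(Country = levels(d$Country)))
print(decoded)
```

Which side counts as "left" depends on how rpart orders the groups, so the sketch only relies on levels that share a code going the same way.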
Response to comment: If you only have the model, use str() to look at it. In the example I created, the factor labels are stored in an attribute of the fit named "xlevels" (it appears just after the "ordered" component in the str() output):
$ ordered : Named logi [1:3] FALSE FALSE FALSE
..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
- attr(*, "xlevels")=List of 1
..$ Numlev: chr [1:3] "low" "med" "high"
- attr(*, "ylevels")= chr [1:2] "absent" "present"
- attr(*, "class")= chr "rpart"
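If only the saved model is available, both pieces of information can be pulled from the object itself. This is a minimal sketch, assuming the model was saved with saveRDS() (the temporary file and the names kyph, model, predictors, categorical and numeric_vars are made up for illustration); it combines the "dataClasses" attribute of the terms with the "xlevels" attribute shown above.

```r
library(rpart)

# Rebuild the example fit, then round-trip it through a file to mimic a saved model.
kyph <- rpart::kyphosis
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      levels = 1:3, labels = c("low", "med", "high"))
fit  <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
model <- readRDS(path)

# Classes of all model variables; drop the response to keep only predictors.
classes    <- attr(model$terms, "dataClasses")
predictors <- classes[-attr(model$terms, "response")]

categorical  <- names(predictors)[predictors %in% c("factor", "ordered")]
numeric_vars <- names(predictors)[predictors == "numeric"]

# Level names per factor predictor, in the order the integer codes refer to.
xlev <- attr(model, "xlevels")

print(categorical)
print(numeric_vars)
print(xlev)
```

Nothing here needs the original data frame once the model has been read back in.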
I was expecting this to produce an object of mode numeric
R> mode(expand.grid(c(1,2),c(3,4)))
R> "list"
Is there an easy fix for making it "numeric"?
expand.grid() returns a data frame, and a data frame is a list of columns, which is why mode() reports "list". The values themselves are already numeric. The code below shows how to get a numeric matrix instead:
> x <- as.matrix(expand.grid(c(1,2), c(3,4)))
> x
Var1 Var2
[1,] 1 3
[2,] 2 3
[3,] 1 4
[4,] 2 4
As you can see, the result is a numeric matrix, and each column is a numeric vector:
> str(x)
num [1:4, 1:2] 1 2 1 2 3 3 4 4
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "Var1" "Var2"
> x[,1]
[1] 1 2 1 2
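To make the distinction concrete, here is a small sketch (the variable names g and m are arbitrary) comparing the mode of the data frame, of its columns, and of the converted matrix:

```r
g <- expand.grid(c(1, 2), c(3, 4))

print(class(g))          # the container is a data frame
print(mode(g))           # a data frame is stored as a list
print(sapply(g, mode))   # each column inside it is numeric

m <- as.matrix(g)        # all columns numeric, so the matrix is numeric too
print(mode(m))
```

The conversion stays numeric only because every column is numeric; a single character or factor column would make as.matrix() return a character matrix.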
I have a dataframe. I want to inspect the class of each column.
x1 = rep(1:4, times=5)
x2 = factor(rep(letters[1:4], times=5))
xdat = data.frame(x1, x2)
> class(xdat)
[1] "data.frame"
> class(xdat$x1)
[1] "integer"
> class(xdat$x2)
[1] "factor"
However, imagine that I have many columns and therefore need to use apply() to help me do the trick. But it's not working.
apply(xdat, 2, class)
x1 x2
"character" "character"
Why can't I use apply() to see the data type of each column? What should I do instead?
Thanks!
You could use
sapply(xdat, class)
# x1 x2
# "integer" "factor"
apply() first coerces its input to a matrix (via as.matrix()), and a matrix can hold only a single type. If any column is non-numeric (a factor or a character column, say), every column is converted to character, so class() reports "character" for all of them. To see this, check
str(apply(xdat, 2, I))
#chr [1:20, 1:2] "1" "2" "3" "4" "1" "2" "3" "4" "1" ...
#- attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "x1" "x2"
Now, if we check
str(lapply(xdat, I))
#List of 2
#$ x1:Class 'AsIs' int [1:20] 1 2 3 4 1 2 3 4 1 2 ...
#$ x2: Factor w/ 4 levels "a","b","c","d": 1 2 3 4 1 2 3 4 1 2 ...
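The same point as a compact sketch: the matrix that apply() works on is already character, whereas iterating over the data frame as a list keeps each column's class. vapply() is used here as a stricter alternative to sapply(); taking only the first class element is an assumption made so that columns with several classes still yield one string each.

```r
xdat <- data.frame(
  x1 = rep(1:4, times = 5),
  x2 = factor(rep(letters[1:4], times = 5))
)

# What apply() sees after coercion: one type for the whole matrix.
print(typeof(as.matrix(xdat)))

# Column-wise over the list of columns: classes are preserved.
col_classes <- vapply(xdat, function(col) class(col)[1], character(1))
print(col_classes)
```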
I've got a list with different types in it. They are arranged in matrix form:
tmp <- list('a', 1, 'b', 2, 'c', 3)
dim(tmp) <- c(2,3)
tmp
[,1] [,2] [,3]
[1,] "a" "b" "c"
[2,] 1 2 3
That's the form I get it out of another more complex function.
Now I want to transpose it and convert to a data.frame. So I do the following:
data <- as.data.frame(t(tmp))
data
V1 V2
1 a 1
2 b 2
3 c 3
This looks great. But it's got the wrong structure:
str(data)
'data.frame': 3 obs. of 2 variables:
$ V1:List of 3
..$ : chr "a"
..$ : chr "b"
..$ : chr "c"
$ V2:List of 3
..$ : num 1
..$ : num 2
..$ : num 3
So how do I get rid of the extra level of lists?
This should do the trick:
df <- data.frame(lapply(data.frame(t(tmp)), unlist), stringsAsFactors=FALSE)
str(df)
# 'data.frame': 3 obs. of 2 variables:
# $ X1: chr "a" "b" "c"
# $ X2: num 1 2 3
The inner data.frame() call converts the matrix into a two column data.frame, with one "character" column and one "numeric" column.**
lapply(..., unlist) strips away extra list() layer.
The outer data.frame() call converts the resulting list into the data.frame you're after.
** (OK, that intermediate "character" column is really of class "factor", but it ends up making no difference in the final result. If you like, you could force it to have class "character" by adding stringsAsFactors=FALSE to the inner data.frame() call as well, but I don't think neglecting to do so would ever make a difference...)
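Or this:
as.data.frame(matrix(unlist(tmp),ncol=2,byrow=TRUE))
Note that unlist() coerces every element to character, so both columns come out as factors (or as character columns in R >= 4.0.0, where stringsAsFactors defaults to FALSE) rather than one character and one numeric column. You can inspect the result: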
str(as.data.frame(matrix(unlist(tmp),ncol=2,byrow=TRUE)))
'data.frame': 3 obs. of 2 variables:
$ V1: Factor w/ 3 levels "a","b","c": 1 2 3
$ V2: Factor w/ 3 levels "1","2","3": 1 2 3
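If the column types need to survive, one option is to unlist each row of the list-matrix separately, so that characters and numbers are never mixed in one vector. This is only a sketch; the names cols and df and the V1, V2 column labels are arbitrary choices.

```r
tmp <- list("a", 1, "b", 2, "c", 3)
dim(tmp) <- c(2, 3)

# Each row of tmp holds values of one type; unlist them row by row.
cols <- lapply(seq_len(nrow(tmp)), function(i) unlist(tmp[i, ]))
names(cols) <- paste0("V", seq_along(cols))

df <- as.data.frame(cols, stringsAsFactors = FALSE)
str(df)
```

This relies on every row of tmp being homogeneous; a row that mixes types would again be coerced to character by unlist().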
Can you use gsub on a data.frame?
dat="1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data=read.table(text=dat,header=F)
gsub("W",3,data)
Why do we get output like the following?
[1] "1:4" "c(2, 1, 2, 3)" "c(16, 16, 16, 64)" "c(2, 2, 1, 3)" "c(2, 3, 1, 1)"
It is hard to understand.
> str(data)
'data.frame': 4 obs. of 5 variables:
$ V1: int 1 2 3 4
$ V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3
$ V3: int 16 16 16 64
$ V4: Factor w/ 3 levels "16","2W","64": 2 2 1 3
$ V5: Factor w/ 3 levels "0","16","W": 2 3 1 1
What is the meaning of the 2 1 2 3 in V2: Factor w/ 3 levels "1","1W","4": 2 1 2 3?
The output is the same as as.character(data).
Since the letter W never appears in any of these strings, gsub has no effect, other than the conversion to character.
As discussed in the comments, as.character has quirky behaviour on data frames. It calls as.vector(x, "character"), which needs to condense each column to a single value, and chooses to return the code needed to recreate the column, ignoring attributes. For factor columns this means that you get the underlying integer codes, not the level labels, which is why W never appears. Those codes are the 2 1 2 3 that str() shows: each value is stored as the position of its label in levels().
You need to apply through each value in your data frame:
apply(data, 1:2, function(x) gsub("W", 3, x))
# V1 V2 V3 V4 V5
# [1,] "1" "13" "16" "23" "16"
# [2,] "2" "1" "16" "23" "3"
# [3,] "3" "13" "16" "16" "0"
# [4,] "4" "4" "64" "64" "0"
@Richie Cotton's comments explain why you need to do it this way.
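An alternative sketch that keeps the result as a data frame instead of a character matrix: run gsub() column by column on the character values, then let type.convert() turn all-digit columns back into integers. Passing stringsAsFactors = FALSE and as.is = TRUE are choices made here to avoid factors; they are not part of the original code.

```r
dat <- "1 1W 16 2W 16
2 1 16 2W W
3 1W 16 16 0
4 4 64 64 0"
data <- read.table(text = dat, header = FALSE, stringsAsFactors = FALSE)

# Replace W with 3 in every column, working on character values.
data[] <- lapply(data, function(col) gsub("W", "3", as.character(col)))

# Convert columns that now contain only digits back to integer.
data[] <- lapply(data, type.convert, as.is = TRUE)
str(data)
```

Assigning into data[] keeps the data frame structure and column names while replacing the contents.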
The "apply" documentation mentions that "Where 'X' has named dimnames, it can be a character vector selecting dimension names." I would like to use apply on a data.frame for only particular columns. Can I use the dimnames feature to do this?
I realize I can subset() X to only include the columns of interest, but I want to understand "named dimnames" better.
Below is some sample code:
> x <- data.frame(cbind(1,1:10))
> apply(x,2,sum)
X1 X2
10 55
> apply(x,c('X2'),sum)
Error in apply(x, c("X2"), sum) : 'X' must have named dimnames
> dimnames(x)
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
[[2]]
[1] "X1" "X2"
> names(x)
[1] "X1" "X2"
> names(dimnames(x))
NULL
If I understand you correctly, you would like to use apply only on certain columns. This is not what named dimnames would accomplish. The apply function on a matrix or data.frame always applies to all the rows or all the columns. The named dimnames allows you to choose to use rows or columns by name instead of the "normal" 1 and 2:
m <- matrix(1:12,4, dimnames=list(foo=letters[1:4], bar=LETTERS[1:3]))
apply(m, "bar", sum) # Use "bar" instead of 2 to refer to the columns
However if you have the column names you'd like to apply to, you could do it by first selecting only those columns:
n <- c("A","C")
apply(m[,n], 2, sum)
# A C
#10 42
Named dimnames are a side effect of the fact that dimnames are stored as a list in the "dimnames" attribute of a matrix or array. Each component of the list corresponds to one dimension and can be named. This is probably more useful for multidimensional arrays...
For a data.frame, there is no "dimnames" attribute. A data.frame is essentially a list, so the list's "names" attribute corresponds to the column names, and an extra "row.names" attribute corresponds to the row names. Because of this, there is no place to store the names of the dimnames (they could have added an extra attribute for that, of course, but they didn't). When you call the dimnames function on a data.frame, it simply creates a list from the "row.names" and "names" attributes.
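A short sketch of both points, using a made-up 3-dimensional array a whose dimnames are named row, col and layer: the names let apply() pick margins by name, while the list that dimnames() builds for a data frame carries no names at all.

```r
# Array with named dimnames (the names row/col/layer are arbitrary).
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(row   = c("r1", "r2"),
                           col   = c("c1", "c2", "c3"),
                           layer = paste0("l", 1:4)))

by_layer   <- apply(a, "layer", sum)           # same as MARGIN = 3
by_row_col <- apply(a, c("row", "col"), sum)   # same as MARGIN = c(1, 2)
print(by_layer)
print(by_row_col)

# A data frame has dimnames, but they are not named.
x <- data.frame(cbind(1, 1:10))
print(names(dimnames(x)))
```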
The issue is that a data frame gives you no way to name the dimnames of x directly, and when apply() coerces x to a matrix the result has unnamed dimnames.
A solution is to coerce to a matrix first, then name the dimnames and then use apply()
> X <- as.matrix(x)
> str(X)
num [1:10, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:10] "1" "2" "3" "4" ...
..$ : chr [1:2] "X1" "X2"
> dimnames(X) <- list(C1 = dimnames(x)[[1]], C2 = dimnames(x)[[2]])
> str(X)
num [1:10, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ C1: chr [1:10] "1" "2" "3" "4" ...
..$ C2: chr [1:2] "X1" "X2"
> apply(X, "C1", mean)
1 2 3 4 5 6 7 8 9 10
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
> rowMeans(X)
1 2 3 4 5 6 7 8 9 10
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5