How can I get the ID (or name) of the terminal node of an rpart model for every row? For a classification tree, predict.rpart can only return the predicted class (number or factor), the class probabilities, or some combination of these (using type="matrix").
I would like to do something like:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(fit) # there are 5 terminal nodes
predict(fit, type = "node_id") # should return IDs of terminal nodes (e.g. 1-5) (does not work)
The partykit package supports predict(..., type = "node"), both in and out of sample. You can simply convert the rpart object to use this:
library("partykit")
predict(as.party(fit), type = "node")
## 9 7 9 9 3 3 3 3 3 8 8 3 9 5 3 3 3 7 3 5 3 9 8 9 9 5 9 8 3 3 3 7 7 3 7 3 5
## 9 5 8
## 9 5 9 9 3 7 3 7 9 7 8 3 9 3 3 3 5 9 5 8 9 9 9 3 3 5 3 7 5 3 7 7 3 7 3 3 7
## 5 7 9
## 5
table(predict(as.party(fit), type = "node"))
## 3 5 7 8 9
## 29 12 14 7 19
For that model there were 4 splits, yielding 5 "terminal nodes" or in the terminology used in rpart: <leaf>s. I do not see why there should be 5 predictions for anything. The predictions are for particular cases and the leaves are the result of a variable number of the splits used to make those predictions. The numbers of rows in the original dataset that ended up in the leaves may be what you want, in which case these are ways of getting those numbers:
# Row of fit$frame (the leaf) that each observation falls into
fit$where
# counts of cases in leaves of prediction rules
table(fit$where)
3 5 7 8 9
29 12 14 7 19
In order to assemble the labels(fit) that apply to a particular leaf, you would need to traverse the rule-tree and accumulate all the labels for all the splits that were applied to produce a particular leaf. You probably want to look at:
?print.rpart
?rpart.object
?text.rpart
?labels.rpart
The method above using $where returns only the row number within the tree frame (fit$frame), not the node number. So with kyphosis$ID = fit$where, an observation can end up labelled with a frame row index rather than the ID of its leaf node.
To get the actual leaf node ID, use the following:
MyID <- row.names(fit$frame)
kyphosis$ID <- MyID[fit$where]
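A quick way to check this mapping is to compare fit$where with the row names of fit$frame. The sketch below assumes the rpart package and its bundled kyphosis data; the names node_id and is_leaf are only illustrative.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# fit$where holds, for each training row, the row index into fit$frame
# of the leaf that the observation falls into.
# The row names of fit$frame are the node numbers that print(fit) shows.
node_id <- as.integer(rownames(fit$frame))[fit$where]

# Every observation should land in a row of fit$frame marked "<leaf>".
is_leaf <- fit$frame$var[fit$where] == "<leaf>"

# Cross-tabulate frame row index against node number.
table(where = fit$where, node = node_id)
```

Each frame row index should pair with exactly one node number in the resulting table.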
To predict leaves for new data, one could use rpart.predict(fit, newdata, nn = TRUE) from the rpart.plot package, which adds the node numbers to the output.
Here is a standalone rpart leaf predictor:
rpart_leaves <- function(fit, newdata, type = c("where", "leaf"), na.action = na.pass) {
  if (is.null(attr(newdata, "terms"))) {
    Terms <- delete.response(fit$terms)
    newdata <- model.frame(Terms, newdata, na.action = na.action,
                           xlev = attr(fit, "xlevels"))
    if (!is.null(cl <- attr(Terms, "dataClasses")))
      .checkMFClasses(cl, newdata, TRUE)
  }
  newdata <- rpart:::rpart.matrix(newdata)
  where <- unname(rpart:::pred.rpart(fit, newdata))
  if (match.arg(type) == "where")
    return(where)
  rownames(fit$frame)[where]
}
Related
Suppose I have the following clusters:
library(linkcomm)
g <- swiss[,3:4]
lc <-getLinkCommunities(g)
plot(lc, type = "members")
getNodesIn(lc, clusterids = c(3, 7, 8))
From the plot you can see the node 6 is present in 3 overlapping clusters: 3, 7 and 8. I am interested to know how to retrieve the direct binary interactions in these clusters as a data frame. Specifically, I would like a data frame with the cluster id as the first column, and the last two columns as "interactor 1" and "interactor 2", where all pairs of interactors can be listed per cluster. These should be direct, i.e. they have an edge in common.
Basically I would like something like this:
Cluster ID Interactor 1 Interactor 2
3 6 14
3 3 7
3 6 7
3 14 3
3 6 3
and so on for the other ids. If possible I would like to avoid duplicates such as 6 and 14, 14 and 6 etc.
Many thanks,
Abigail
You might be looking for the edges. Note: use str(lc) to examine everything that is included in your object of interest.
lc$edges
# node1 node2 cluster
# 1 17 15 1
# 2 17 8 1
# 3 15 8 1
# 4 16 13 2
# 5 16 10 2
# 6 16 29 2
# 7 14 6 3
# 8 ...
res <- setNames(lc$edges, c(paste0("interactor.", 1:2), "cluster"))[c(3, 1, 2)]
res
# cluster interactor.1 interactor.2
# 1 1 17 15
# 2 1 17 8
# 3 1 15 8
# 4 2 16 13
# 5 2 16 10
# 6 2 16 29
# 7 3 14 6
# 8 ...
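To avoid listing the same pair twice in reverse order (6 and 14, then 14 and 6), one option is to order each pair before removing repeated rows. This is a minimal sketch only: the edges data frame below is a hypothetical stand-in for lc$edges, and it assumes numeric node labels (factor columns would need as.character() first).

```r
# Hypothetical stand-in for lc$edges; node labels are assumed numeric here.
edges <- data.frame(
  node1   = c(14, 6, 3, 14),
  node2   = c(6, 14, 7, 3),
  cluster = c(3, 3, 3, 3)
)

# Order each pair so the smaller label comes first, then drop repeated rows.
res <- unique(data.frame(
  cluster      = edges$cluster,
  interactor.1 = pmin(edges$node1, edges$node2),
  interactor.2 = pmax(edges$node1, edges$node2)
))
res
```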
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do so, I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (pairs that were falsely grouped or falsely left ungrouped). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you would like to do, partly because I feel some context is missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
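Explanation: Split df by y, then take min(n, nrow(x)) rows from each subset x that contains more than one row; rbinding the results gives df.grouped. df.ungrouped is then built from nrow(df.grouped) rows of df. Note that sample(k) returns a permutation of 1:k, so as written both calls pick the first k rows in random order rather than sampling from all rows; sample(nrow(x), k) draws k rows from all available rows instead.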
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = TRUE)
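If the goal is to sample pairs of cases rather than individual rows, one possible approach is to enumerate all pairs and split them by whether both members share the same y. This is only a sketch of that idea, not the method used above; pairs, same_group and draw are illustrative names, and enumerating all pairs is only practical for small data.

```r
df <- data.frame(
  x = 1:20,
  y = c(1, 2, 2, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 13, 14, 15, 14, 16, 17, 18)
)

# All unordered pairs of row indices (one pair per row of the matrix).
pairs <- t(combn(nrow(df), 2))

# TRUE where both members of a pair share the same value of y.
same_group <- df$y[pairs[, 1]] == df$y[pairs[, 2]]

# Draw up to n elements from a vector of candidate indices.
# Indexing via sample.int() avoids sample(k) expanding to 1:k.
draw <- function(idx, n) idx[sample.int(length(idx), min(n, length(idx)))]

set.seed(2017)
n <- 3
grouped_pairs   <- pairs[draw(which(same_group), n), , drop = FALSE]
ungrouped_pairs <- pairs[draw(which(!same_group), n), , drop = FALSE]

# Show the sampled pairs as x values side by side.
data.frame(x1 = df$x[grouped_pairs[, 1]], x2 = df$x[grouped_pairs[, 2]])
data.frame(x1 = df$x[ungrouped_pairs[, 1]], x2 = df$x[ungrouped_pairs[, 2]])
```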
My first language isn't English, so I apologize in advance for any mistakes. I'm new to R, as you will notice.
I'm trying to build a co-occurrence matrix. I have several data frames and I am interested in 3 variables: idT, numname and numstim.
This is the single data frame that contains the merged data:
z=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,
df15,df16,df17,df18,df19,df20,df21,df22,df23,df24,df25,df26,df27,df28,df29,df30,df31,df32)
write.csv(z, file = ".../listz.csv")
Then I extracted the 3 variables with :
#Extract columns 3 & 6 from all the files within the list
z1 = z[,c(3,6)]
#Create a new variable 'numname' to convert name groups into numeric groups,
#then obtain levels with facNum
z1$numname <- as.numeric(z1$namegroup)
colnames(z1) <- c("namegroup", "idT", "numname")
facNum <- factor(z1$numname)
write.csv(z1, file = "...D:/z1.csv")
And data look like :
namegroup idT numname
1 GLISSEVIBREVITE 1 6
2 CINETIQUE 1 3
3 VIBRATIONS_LEGERES 1 20
4 DIFFUS 1 5
5 LIQUIDE 1 8
6 PICOTEMENTS 1 10
How to read the table: each idT is classified into a group (namegroup), and this group is then converted into a numeric variable (numname).
# Specify z1 as a data frame to make next operations
z1 = as.data.frame(z1, idT = z1$numstim, numgroup = z1$numname)
tab1 <- table(z1)
write.csv(tab1, file = ".../tab1test.csv")
out1 <- data.matrix(tab1 %*% t(tab1))
write.csv(out1, file = ".../bmtest.csv")
But the bmtest matrix doesn't look like a count of idT pairs: only 22 users participated and there are 32 idT, yet some of the numbers are much higher:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 24 10 7 7 11 7 7 8 10 8 11 8 6 11 11 12
2 10 32 27 7 5 4 7 4 4 4 5 3 2 6 6 14
3 7 27 40 0 3 1 0 2 0 0 2 2 1 2 0 15
4 7 7 0 30 7 14 15 9 15 13 13 7 5 12 13 5
5 11 5 3 7 24 7 9 20 12 13 10 19 14 20 12 7
I would like a matrix that shows how often each pair of idT was grouped together. The matrix should look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 15 3 2 2 3 3 2 1 2 1 3 3 1 3 3 5
2 3 15 9 2 0 1 2 0 0 0 0 0 0 0 1 3
3 2 9 15 0 2 1 0 2 0 0 1 1 1 2 0 2
4 2 2 0 15 1 6 5 1 7 5 6 2 0 1 3 2
5 3 0 2 1 15 1 2 12 4 5 3 13 9 11 3 2
In other words, I want to see which idT have been paired together. I've looked at this topic but didn't find a way to solve my problem.
Also, I tried :
library(igraph)
library(tnet)
idT_numname <- cbind(z1$idT, z1$numname)
igraph <- graph.data.frame(idT_numname)
item_item <- projecting_tm(net = idT_numname, method="sum")
item_item <- tnet_igraph(item_item,type="weighted one-mode tnet")
itemmat <- get.adjacency(item_item,attr="weight")
itemmat #8x8 matrix of items to items
But I get an error message, and I don't know how to get past the "duplicated entries in the edgelist" error, because it seems to me that duplicated entries are necessary to build a co-occurrence matrix:
> idT_numname <- cbind(z1$idT, z1$numname)
> item_item <- projecting_tm(idT_numname, method="sum")
Error in as.tnet(net, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet", method="sum")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet", method = "sum") :
unused argument (method = "sum")
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
Your help is greatly appreciated.
I like to do data analysis and I want to learn more and more everyday !
Thank you
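One common base-R route to such a matrix is to cross-tabulate group membership against idT and take the cross-product. The sketch below is only an illustration under assumptions: the toy data frame z_toy and its user column are hypothetical (the excerpt above does not show a user identifier), and it assumes each user places each idT in exactly one group.

```r
# Hypothetical toy data: two users each sort four idT into named groups.
z_toy <- data.frame(
  user  = c(1, 1, 1, 1, 2, 2, 2, 2),
  idT   = c(1, 2, 3, 4, 1, 2, 3, 4),
  group = c("a", "a", "b", "b", "a", "b", "b", "b")
)

# One row per (user, group) combination that occurs, one column per idT.
# Each cell is 1 if that user put that idT into that group, else 0.
inc <- unclass(table(interaction(z_toy$user, z_toy$group, drop = TRUE),
                     z_toy$idT))

# t(inc) %*% inc: entry [i, j] counts the users who put idT i and idT j
# into the same group; the diagonal counts the users who rated each idT.
co <- crossprod(inc)
co
```

Keying the rows by user and group together keeps the counts bounded by the number of users, since two idT are counted as co-occurring only when the same user put them in the same group.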
I need to extract separate tables from each Excel sheet and store them in a list. I have two lists: "allsheets" contains 38 sheets, each of which includes at least 2 tables, and "dataRowMeta" records which rows are relevant for each table. For example,
a1 <- data.frame(y1=c(1:15),y2=c(6:20))
a2 <- data.frame(y1=c(3:18),y2=c(2:17))
allsheets <- list(a1, a2)
d1<- data.frame(starthead=c(1,9),endhead=c(2,10),startdata =c(3,11),
enddata = c(7,14),footer = c(8,15))
d2<- data.frame(starthead=c(1,10),endhead=c(2,11),startdata =c(3,12),
enddata = c(8,15),footer = c(9,16))
dataRowMeta <- list(d1,d2)
[[1]]
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
8 8 13
9 9 14
10 10 15
11 11 16
12 12 17
13 13 18
14 14 19
15 15 20
[[2]]
y1 y2
1 3 2
2 4 3
3 5 4
4 6 5
5 7 6
6 8 7
7 9 8
8 10 9
9 11 10
10 12 11
11 13 12
12 14 13
13 15 14
14 16 15
15 17 16
16 18 17
and here is dataRowMeta :
[[1]]
starthead endhead startdata enddata footer
1 1 2 3 7 8
2 9 10 11 14 15
[[2]]
starthead endhead startdata enddata footer
1 1 2 3 8 9
2 10 11 12 15 16
I've tried to write a loop that subsets each sheet according to dataRowMeta, but failed to get the desired output.
I am getting an error
Error in sheet[[a[m]:b[m], ]] : incorrect number of subscripts
I guess that's because I am iterating over a list, not matrices... but how do I tell R to subset a list in this case?
So I need the 1st and 4th columns of dataRowMeta (starthead and enddata) as the "start" and "end" row indices of the future tables.
tables <- function(allsheets,dataRowMeta){
  for(i in 1 : length(dataRowMeta)){
    for (j in 1 : nrow(dataRowMeta[[i]])){
      a <-""
      b <- ""
      a <- dataRowMeta[[i]][j:j,1]
      b <- dataRowMeta[[i]][j:j,4]
      for (k in 1 : length(allsheets)){
        sheet <- allsheets[k]
        for ( m in 1 : length(a)){
          tbl <- sheet[[a[m]:b[m],]]
        }
      }
    }
  }
}
Desired output : I have this for the first element of the first list(sheet1):
sheet1 <- allsheets[[1]]
tmp1 <- sheet1[dataRowMeta[[1]][1:1,1] :dataRowMeta[[1]][1:1,4] ,]
> tmp1
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
And need a loop which would do it for all sheets. Please help me to figure out how to get it. Thank you!
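The error comes from the indexing: allsheets[k] returns a one-element list rather than the data frame, and [[a:b, ]] is not row subsetting. A minimal sketch of one way around this, assuming sheet i pairs with dataRowMeta[[i]] and that starthead and enddata mark the first and last row of each table; extract_tables is a hypothetical helper name.

```r
a1 <- data.frame(y1 = 1:15, y2 = 6:20)
a2 <- data.frame(y1 = 3:18, y2 = 2:17)
allsheets <- list(a1, a2)
d1 <- data.frame(starthead = c(1, 9), endhead = c(2, 10), startdata = c(3, 11),
                 enddata = c(7, 14), footer = c(8, 15))
d2 <- data.frame(starthead = c(1, 10), endhead = c(2, 11), startdata = c(3, 12),
                 enddata = c(8, 15), footer = c(9, 16))
dataRowMeta <- list(d1, d2)

# Hypothetical helper: returns a list (one element per sheet) of lists
# (one data frame per table in that sheet).
extract_tables <- function(sheets, meta) {
  Map(function(sheet, m) {
    # One sub-table per row of the metadata for this sheet.
    Map(function(first, last) sheet[first:last, , drop = FALSE],
        m$starthead, m$enddata)
  }, sheets, meta)
}

tables <- extract_tables(allsheets, dataRowMeta)
tables[[1]][[1]]  # rows 1 to 7 of the first sheet
```

Map() walks the sheets and their metadata in parallel, so no explicit loop counters are needed; tables[[i]][[j]] is the j-th table of the i-th sheet.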
I would like to import the data into R as intervals, then count all the numbers falling within these intervals and draw a histogram from these counts.
Example:
start end freq
1 8 3
5 10 2
7 11 5
.
.
.
Result:
number freq
1 3
2 3
3 3
4 3
5 5
6 5
7 10
8 10
9 7
10 7
11 5
Any suggestions?
Thank you very much!
Assuming your data is in df, you can create a data set that has each number in the range repeated by freq. Once you have that it's trivial to use the summarizing functions in R. This is a little roundabout, but a lot easier than explicitly computing the sum of the overlaps (though that isn't that hard either).
dat <- unlist(apply(df, 1, function(x) rep(x[[1]]:x[[2]], x[[3]])))
hist(dat, breaks=0:max(df$end))
You can also do table(dat)
dat
1 2 3 4 5 6 7 8 9 10 11
3 3 3 3 5 5 10 10 7 7 5
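The explicit overlap sum mentioned above can be sketched as follows, assuming the three example intervals from the question are stored in a data frame df with columns start, end and freq; numbers and counts are illustrative names.

```r
df <- data.frame(start = c(1, 5, 7), end = c(8, 10, 11), freq = c(3, 2, 5))

# Every integer covered by at least the overall range of the intervals.
numbers <- min(df$start):max(df$end)

# For each number, add up freq over all intervals that contain it.
counts <- sapply(numbers, function(k) sum(df$freq[df$start <= k & k <= df$end]))

data.frame(number = numbers, freq = counts)

# The counts can be plotted directly, without expanding the data first.
barplot(counts, names.arg = numbers)
```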