I have a dataset like this:
Anno.2013 Giorni.2013 Anno.2014 Giorni.2014 Stagionalità Destagionata2013
1 18 mar 17 mer Bassa 9.3710954
2 9 mer 5 gio Bassa 4.6855477
3 9 gio 2 ven Bassa 4.6855477
4 8 ven 5 sab Bassa 4.1649313
5 4 sab 2 dom Bassa 2.0824656
6 2 dom 0 lun Bassa 1.0412328
7 1 lun 1 mar Bassa 0.5206164
8 0 mar 0 mer Bassa 0.0000000
9 2 mer 0 gio Bassa 1.0412328
10 0 gio 1 ven Bassa 0.0000000
Destagionata2014 Settimana2013 Settimana2014
1 9.4463412 1 1
2 2.7783356 1 1
3 1.1113343 1 1
4 2.7783356 1 1
5 1.1113343 1 1
6 0.0000000 1 2
7 0.5556671 2 2
8 0.0000000 2 2
9 0.0000000 2 2
10 0.5556671 2 2
> str(domanda)
'data.frame': 365 obs. of 9 variables:
$ Anno.2013 : int 18 9 9 8 4 2 1 0 2 0 ...
$ Giorni.2013 : Factor w/ 7 levels "dom","gio","lun",..: 4 5 2 7 6 1 3 4 5 2 ...
$ Anno.2014 : int 17 5 2 5 2 0 1 0 0 1 ...
$ Giorni.2014 : Factor w/ 7 levels "dom","gio","lun",..: 5 2 7 6 1 3 4 5 2 7 ...
$ Stagionalità : Factor w/ 2 levels "Alta","Bassa": 2 2 2 2 2 2 2 2 2 2 ...
$ Destagionata2013: num 9.37 4.69 4.69 4.16 2.08 ...
$ Destagionata2014: num 9.45 2.78 1.11 2.78 1.11 ...
$ Settimana2013 : Factor w/ 53 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Settimana2014 : Factor w/ 53 levels "1","2","3","4",..: 1 1 1 1 1 2 2 2 2 2 ...
I would like to divide each value of Destagionata2013 by the mean of Destagionata2013 grouped by Settimana2013. For example:
Destagionata2013[1:6]/mean(Destagionata2013[1:6])
I tried to use tapply:
Media_Settimana<-as.vector(tapply(domanda$Anno.2013, domanda$Settimana2013, mean))
> Media_Settimana
[1] 8.333333 5.857143 3.142857 4.285714 6.428571 6.714286 13.714286 3.428571
[9] 4.000000 3.285714 11.428571 6.285714 11.714286 7.285714 12.142857 12.000000
[17] 16.000000 20.857143 19.428571 23.428571 33.857143 31.000000 31.714286 32.428571
[25] 38.571429 41.000000 36.000000 38.714286 36.714286 39.857143 40.714286 39.857143
[33] 41.714286 41.857143 41.142857 40.571429 40.428571 37.857143 32.714286 19.714286
[41] 9.000000 4.142857 5.857143 16.285714 11.000000 8.428571 4.428571 6.857143
[49] 6.285714 3.857143 7.000000 5.571429 18.500000
But I am not able to replicate the values across every row.
As MrFlick notes, you need ave instead of tapply, as ave automatically recycles length-1 results to the length of the inputs. Here we do what you are trying to do with iris (normalize Sepal.Length by the mean Sepal.Width within each species):
transform(iris, norm.sep.len=Sepal.Length / ave(Sepal.Width, Species, FUN=mean))
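Applied to the data in the question, that would look something like this (a sketch, assuming the domanda data frame shown above; Indice2013 is just an illustrative name for the new column):

# divide each Destagionata2013 value by its weekly (Settimana2013) mean
domanda$Indice2013 <- domanda$Destagionata2013 /
  ave(domanda$Destagionata2013, domanda$Settimana2013, FUN = mean)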
The first csv file is called "CLAIM", and these are parts of its data.
The second csv file is called "CUSTOMER", and these are parts of its data.
First, I wanted to merge the two data sets based on their common column.
Second, I wanted to remove all columns that include NA values.
Third, I wanted to remove the variables SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE, and RESN_DATE.
Fourth, I wanted to remove the rows where OCCP_GRP_1 is empty.
The expected form is:
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
So I wrote the code like this:
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I need to get the top 3 OCCP_GRP_1 groups that have a high NON_PAY_RATIO
2) I need to get the CUST_IDs whose DMND_AMT value is over 600,000
I don't know how to write this down.
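One possible way to write those two steps with dplyr (a sketch, assuming the data_fin built above and reading "high NON_PAY_RATIO" as the highest mean NON_PAY_RATIO per group):

# 1) top 3 OCCP_GRP_1 groups by mean NON_PAY_RATIO
data_fin %>%
  group_by(OCCP_GRP_1) %>%
  summarise(mean_non_pay_ratio = mean(NON_PAY_RATIO)) %>%
  arrange(desc(mean_non_pay_ratio)) %>%
  slice(1:3)

# 2) CUST_IDs whose DMND_AMT is over 600,000
data_fin %>%
  filter(DMND_AMT > 600000) %>%
  distinct(CUST_ID)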
I have this df:
webvisits1 webvisits2 webvisits3 webvisits4
s001 2 0 11 2
s002 11 2 23 3
s003 12 1 1 5
s004 13 5 5 0
s005 4 3 9 3
I need to create an output data frame with an added column containing the difference between the mean of webvisits(3-4) and webvisits(1-2), like so:
webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
s001 2 0 11 2 -5.5
s002 11 2 23 3 -6.5
s003 12 1 1 5 3.5
s004 13 5 5 0 6.5
s005 4 3 9 3 -2.5
Is there an easy way to do so, considering that column names (webvisits) are important?
Thank you
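For reference, a reproducible version of this data frame (a sketch; here the s001-s005 labels are stored as row names, while the dplyr answer below treats them as a regular first column named s.no):

df <- data.frame(
  webvisits1 = c(2, 11, 12, 13, 4),
  webvisits2 = c(0, 2, 1, 5, 3),
  webvisits3 = c(11, 23, 1, 5, 9),
  webvisits4 = c(2, 3, 5, 0, 3),
  row.names = c("s001", "s002", "s003", "s004", "s005")
)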
The rowSums function can sum the selected columns within each row; we then take the difference between the two row sums and divide by 2 to get the difference of the means:
library(dplyr)
dt %>%
  mutate(difference_mean = (rowSums(dt[, 2:3]) - rowSums(dt[, 4:5])) / 2)
s.no webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
1 s001 2 0 11 2 -5.5
2 s002 11 2 23 3 -6.5
3 s003 12 1 1 5 3.5
4 s004 13 5 5 0 6.5
5 s005 4 3 9 3 -2.5
We subset the dataset into two parts (df[1:2], df[3:4]), take their difference, find the row means with rowMeans, and create the new column 'differenceMean' using transform.
df <- transform(df, differenceMean = rowMeans(df[1:2]- df[3:4]))
df
# webvisits1 webvisits2 webvisits3 webvisits4 differenceMean
#s001 2 0 11 2 -5.5
#s002 11 2 23 3 -6.5
#s003 12 1 1 5 3.5
#s004 13 5 5 0 6.5
#s005 4 3 9 3 -2.5
I am a novice. I have loaded a famous dataset into R, and now I want to do several experiments with it. Below are the scripts I have executed so far.
I have a battles data frame:
str(battles)
'data.frame': 38 obs. of 25 variables:
$ name : Factor w/ 38 levels "Battle at the Mummer's Ford",..: 13 1 7 14 18 10 25 5 3 17 ...
$ year : int 298 298 298 298 298 298 298 299 299 299 ...
$ battle_number : int 1 2 3 4 5 6 7 8 9 10 ...
$ attacker_king : Factor w/ 5 levels "","Balon/Euron Greyjoy",..: 3 3 3 4 4 4 3 2 2 2 ...
$ defender_king : Factor w/ 7 levels "","Balon/Euron Greyjoy",..: 6 6 6 3 3 3 6 6 6 6 ...
$ attacker_1 : Factor w/ 11 levels "Baratheon","Bolton",..: 10 10 10 11 11 11 10 9 9 9 ...
$ attacker_2 : Factor w/ 8 levels "","Bolton","Frey",..: 1 1 1 1 8 8 1 1 1 1 ...
$ attacker_3 : Factor w/ 3 levels "","Giants","Mormont": 1 1 1 1 1 1 1 1 1 1 ...
$ attacker_4 : Factor w/ 2 levels "","Glover": 1 1 1 1 1 1 1 1 1 1 ...
$ defender_1 : Factor w/ 13 levels "","Baratheon",..: 12 2 12 8 8 8 6 11 11 11 ...
$ defender_2 : Factor w/ 3 levels "","Baratheon",..: 1 1 1 1 1 1 1 1 1 1 ...
$ defender_3 : logi NA NA NA NA NA NA ...
$ defender_4 : logi NA NA NA NA NA NA ...
$ attacker_outcome : Factor w/ 3 levels "","loss","win": 3 3 3 2 3 3 3 3 3 3 ...
$ battle_type : Factor w/ 5 levels "","ambush","pitched battle",..: 3 2 3 3 2 2 3 3 5 2 ...
$ major_death : int 1 1 0 1 1 0 0 0 0 0 ...
$ major_capture : int 0 0 1 1 1 0 0 0 0 0 ...
$ attacker_size : int 15000 NA 15000 18000 1875 6000 NA NA 1000 264 ...
$ defender_size : int 4000 120 10000 20000 6000 12625 NA NA NA NA ...
$ attacker_commander: Factor w/ 32 levels "","Asha Greyjoy",..: 8 6 9 22 16 18 6 30 2 28 ...
$ defender_commander: Factor w/ 29 levels "","Amory Lorch",..: 7 4 10 28 12 14 15 1 1 1 ...
$ summer : int 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 28 levels "","Castle Black",..: 8 13 17 9 27 17 4 12 5 23 ...
$ region : Factor w/ 7 levels "Beyond the Wall",..: 7 5 5 5 5 5 5 3 3 3 ...
$ note : Factor w/ 6 levels "","Greyjoy's troop number based on the Battle of Deepwood Motte, in which Asha had 1000 soldier on 30 longships. That comes out to"| __truncated__,..: 1 1 1 1 1 1 1 1 1 2 ...
My requirement is that I want to know how many losses and wins each king has had over his entire span of GoT so far.
select(battles,attacker_outcome,attacker_king)
attacker_outcome attacker_king
1 win Joffrey/Tommen Baratheon
2 win Joffrey/Tommen Baratheon
3 win Joffrey/Tommen Baratheon
4 loss Robb Stark
5 win Robb Stark
6 win Robb Stark
7 win Joffrey/Tommen Baratheon
8 win Balon/Euron Greyjoy
9 win Balon/Euron Greyjoy
10 win Balon/Euron Greyjoy
11 win Robb Stark
12 win Balon/Euron Greyjoy
13 win Balon/Euron Greyjoy
14 win Joffrey/Tommen Baratheon
15 win Robb Stark
16 win Stannis Baratheon
17 loss Joffrey/Tommen Baratheon
18 win Robb Stark
19 win Robb Stark
20 loss Stannis Baratheon
21 win Robb Stark
22 loss Robb Stark
23 win
24 win Joffrey/Tommen Baratheon
25 win Joffrey/Tommen Baratheon
26 win Joffrey/Tommen Baratheon
27 win Robb Stark
28 loss Stannis Baratheon
29 win Joffrey/Tommen Baratheon
30 win
31 win Stannis Baratheon
32 win Balon/Euron Greyjoy
33 win Balon/Euron Greyjoy
34 win Joffrey/Tommen Baratheon
35 win Joffrey/Tommen Baratheon
36 win Joffrey/Tommen Baratheon
37 win Joffrey/Tommen Baratheon
38 Stannis Baratheon
I need two more columns, "number of wins" and "number of losses", for each attacker king.
Note: Please excuse me if my question violates the Stack Overflow question policy in any way, as this is my first question in R.
You can use table from the base package:
table(df$attacker_king,df$attacker_outcome )
# loss win
# Balon/Euron Greyjoy 0 7
# Joffrey/Tommen Baratheon 1 13
# Robb Stark 2 8
# Stannis Baratheon 2 2
One option would be dplyr. After grouping by 'attacker_king', we summarise the output by creating two columns ('NoWins', 'NoLoss') from the sums of the logical vectors for "win" and "loss", and, if needed, filter out the blank elements in 'attacker_king'.
library(dplyr)
battles %>%
  group_by(attacker_king) %>%
  summarise(NoWins = sum(attacker_outcome == "win"),
            NoLoss = sum(attacker_outcome == "loss")) %>%
  filter(nzchar(attacker_king))
# attacker_king NoWins NoLoss
# <chr> <int> <int>
#1 Balon/Euron Greyjoy 7 0
#2 Joffrey/Tommen Baratheon 13 1
#3 Robb Stark 8 2
#4 Stannis Baratheon 2 2
Or we can use dplyr/tidyr. After grouping, we get the frequency count with tally, filter (as above) and then spread (from tidyr) to convert the 'long' to 'wide' format.
library(tidyr)
battles %>%
  group_by(attacker_king, attacker_outcome) %>%
  tally() %>%
  filter(nzchar(attacker_king) & nzchar(attacker_outcome)) %>%
  spread(attacker_outcome, n)
Or use dcast from data.table. This is much easier, as dcast also has a fun.aggregate argument, so we can specify the function (in this case length) while reshaping to the 'wide' format.
library(data.table)
dcast(setDT(battles), attacker_king~attacker_outcome, length)[nzchar(attacker_king)
][, -2, with = FALSE]
# attacker_king loss win
#1: Balon/Euron Greyjoy 0 7
#2: Joffrey/Tommen Baratheon 1 13
#3: Robb Stark 2 8
#4: Stannis Baratheon 2 2
Or use table from base R
table(battles[c("attacker_king", "attacker_outcome")])[-1,-1]
# attacker_outcome
# attacker_king loss win
# Balon/Euron Greyjoy 0 7
# Joffrey/Tommen Baratheon 1 13
# Robb Stark 2 8
# Stannis Baratheon 2 2
I have a data frame where rows are duplicated. I need to create unique rows from it. I tried a couple of options, but they don't seem to work.
l1 <-summarise(group_by(l,bowler,wickets),economyRate,d=unique(date))
This works for some rows but also gives the error "Expecting a single value". The data frame 'l' looks like this:
bowler overs maidens runs wickets economyRate date opposition
(fctr) (int) (int) (dbl) (dbl) (dbl) (date) (chr)
1 MA Starc 9 0 51 0 5.67 2010-10-20 India
2 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
3 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
4 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
5 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
6 MA Starc 6 0 33 2 5.50 2012-02-05 India
7 MA Starc 6 0 33 2 5.50 2012-02-05 India
8 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
9 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
10 MA Starc 8 0 49 0 6.12 2012-02-12 India
The date is unique and can be used to select the distinct rows. Please let me know how this can be done.
In the example dataset, there is more than one unique element of 'date' for each 'bowler'/'wickets' combination. One option would be to paste the unique 'date' values together:
l %>%
  group_by(bowler, wickets) %>%
  summarise(economyRate = mean(economyRate), d = toString(unique(date)))
Or create 'd' as a list column
l %>%
  group_by(bowler, wickets) %>%
  summarise(economyRate = mean(economyRate), d = list(unique(date)))
With respect to 'economyRate', I am guessing the OP needs the mean of it.
If we need to create a column of unique dates in the original dataset, use mutate:
l %>%
  group_by(bowler, wickets) %>%
  mutate(d = list(unique(date)))
As the OP didn't provide the expected output, the below could also be the result:
l %>%
  group_by(bowler, wickets) %>%
  distinct(date)
Or as #Frank mentioned
l %>%
  group_by(bowler, wickets, date) %>%
  slice(1L)
If I get the intention of the OP right, they are simply asking to remove the duplicate rows. So I would use
unique(l1)
That's what ?unique says:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
Data
l <- read.table(text = "bowler overs maidens runs wickets economyRate date opposition
1 MA_Starc 9 0 51 0 5.67 2010-10-20 India
2 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
3 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
4 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
5 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
6 MA_Starc 6 0 33 2 5.50 2012-02-05 India
7 MA_Starc 6 0 33 2 5.50 2012-02-05 India
8 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
9 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
10 MA_Starc 8 0 49 0 6.12 2012-02-12 India")
Distinct
Use dplyr::distinct to remove duplicated rows.
ldistinct <- distinct(l)
# bowler overs maidens runs wickets economyRate date
# 1 MA_Starc 9 0 51 0 5.67 2010-10-20
# 2 MA_Starc 9 0 27 4 3.00 2010-11-07
# 3 MA_Starc 6 0 33 2 5.50 2012-02-05
# 4 MA_Starc 10 0 50 2 5.00 2012-02-10
# 5 MA_Starc 8 0 49 0 6.12 2012-02-12
# opposition
# 1 India
# 2 Sri-Lanka
# 3 India
# 4 Sri-Lanka
# 5 India
l2 <- summarise(group_by(ldistinct,bowler,wickets),
economyRate,d=unique(date))
# Error: expecting a single value
But it's not enough here: there are still many dates for one combination of bowler and wickets.
Collapse values together
By pasting multiple values together you will see that there are many dates and many economyRate values for a single combination of bowler and wickets.
l3 <- summarise(group_by(l, bowler, wickets),
                economyRate = paste(unique(economyRate), collapse = ", "),
                d = paste(unique(date), collapse = ", "))
l3
# bowler wickets economyRate d
# (fctr) (int) (chr) (chr)
# 1 MA_Starc 0 5.67, 6.12 2010-10-20, 2012-02-12
# 2 MA_Starc 2 5.5, 5 2012-02-05, 2012-02-10
# 3 MA_Starc 4 3 2010-11-07
So I took an unusual route to doing this dissection: I let the date remain a factor when it came over from the csv file I created. You could easily convert the date column to a factor with
l1$date <- as.factor(l1$date)
This will make that column a non-date column; you could also convert it to character, and either will work fine. This is what it looks like structurally:
str(l1)
'data.frame': 10 obs. of 10 variables:
$ bowler : Factor w/ 2 levels "(fctr)","MA": 2 2 2 2 2 2 2 2 2 2
$ overs : Factor w/ 2 levels "(int)","Starc": 2 2 2 2 2 2 2 2 2 2
$ maidens : Factor w/ 5 levels "(int)","10","6",..: 5 5 5 5 5 3 3 2 2 4
$ runs : Factor w/ 2 levels "(dbl)","0": 2 2 2 2 2 2 2 2 2 2
$ wickets : Factor w/ 6 levels "(dbl)","27","33",..: 6 2 2 2 2 3 3 5 5 4
$ economyRate: Factor w/ 4 levels "(dbl)","0","2",..: 2 4 4 4 4 3 3 3 3 2
$ date : Factor w/ 6 levels "(date)","3","5",..: 5 2 2 2 2 4 4 3 3 6
$ opposition : Factor w/ 6 levels "(chr)","10/20/2010",..: 2 3 3 3 3 6 6 4 4 5
$ X.1 : Factor w/ 3 levels "","India","Sri": 2 3 3 3 3 2 2 3 3 2
$ X.2 : Factor w/ 2 levels "","Lanka": 1 2 2 2 2 1 1 2 2 1
After that it is about making sure that you are using the sub-setting grammar properly with the most concise query:
l2<-l1[!duplicated(l1$date),]
And this is what is returned, 5 rows of unique data:
bowler overs maidens runs wickets economyRate date opposition X.1 X.2
2 MA Starc 9 0 51 0 5.67 10/20/2010 India
3 MA Starc 9 0 27 4 3 11/7/2010 Sri Lanka
7 MA Starc 6 0 33 2 5.5 2/5/2012 India
9 MA Starc 10 0 50 2 5 2/10/2012 Sri Lanka
11 MA Starc 8 0 49 0 6.12 2/12/2012 India
The only thing you need to be careful of is to keep the comma after !duplicated(l1$date), to be sure that ALL columns are included in the final subset.
If you want dates or characters, you can use as.POSIXct or as.character to convert them to a usable format for the rest of your manipulation.
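For example (a sketch, assuming the l2 subset above and the month/day/year format shown in its output):

# convert the factor columns back to usable types
l2$date <- as.POSIXct(as.character(l2$date), format = "%m/%d/%Y")
l2$opposition <- as.character(l2$opposition)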
I hope this is useful to you!
If I do DF$where <- tree$where after fitting an rpart object using DF as my data, will each row be mapped to its corresponding leaf through the column where?
Thanks!
As an example of how to demonstrate that this is possibly true (modulo my understanding of your question being correct), we work with the first example in ?rpart:
require(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
kyphosis$where <- fit$where
> str(kyphosis)
'data.frame': 81 obs. of 5 variables:
$ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
$ Age : int 71 158 128 2 1 1 61 37 113 59 ...
$ Number : int 3 3 4 5 4 2 2 3 2 6 ...
$ Start : int 5 14 5 1 15 16 17 16 16 12 ...
$ where : int 9 7 9 9 3 3 3 3 3 8 ...
> plot(fit)
> text(fit, use.n = TRUE)
And now look at some tables based on the 'where' vector and some logical tests:
First node:
> with(kyphosis, table(where, Start >= 8.5))
where FALSE TRUE
3 0 29
5 0 12
7 0 14
8 0 7
9 19 0 # so this is the row that describes that split
> fit$frame[9,]
var n wt dev yval complexity ncompete nsurrogate yval2.V1
3 <leaf> 19 19 8 2 0.01 0 0 2.0000000
yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
3 8.0000000 11.0000000 0.4210526 0.5789474 0.2345679
Second node:
> with(kyphosis, table(where, Start >= 8.5, Start>=14.5))
, , = FALSE
where FALSE TRUE
3 0 0
5 0 12
7 0 14
8 0 7
9 19 0
, , = TRUE
where FALSE TRUE
3 0 29
5 0 0
7 0 0
8 0 0
9 0 0
And this is the row of fit$frame that describes the second split:
> fit$frame[3,]
var n wt dev yval complexity ncompete nsurrogate yval2.V1
4 <leaf> 29 29 0 1 0.01 0 0 1.0000000
yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
4 29.0000000 0.0000000 1.0000000 0.0000000 0.3580247
So I would characterize the value of fit$where as describing the "terminal nodes", which are labeled '<leaf>' in fit$frame and which may or may not be what you were calling the "nodes".
> fit$frame
var n wt dev yval complexity ncompete nsurrogate yval2.V1
1 Start 81 81 17 1 0.17647059 2 1 1.00000000
2 Start 62 62 6 1 0.01960784 2 2 1.00000000
4 <leaf> 29 29 0 1 0.01000000 0 0 1.00000000
5 Age 33 33 6 1 0.01960784 2 2 1.00000000
10 <leaf> 12 12 0 1 0.01000000 0 0 1.00000000
11 Age 21 21 6 1 0.01960784 2 0 1.00000000
22 <leaf> 14 14 2 1 0.01000000 0 0 1.00000000
23 <leaf> 7 7 3 2 0.01000000 0 0 2.00000000
3 <leaf> 19 19 8 2 0.01000000 0 0 2.00000000
yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
1 64.00000000 17.00000000 0.79012346 0.20987654 1.00000000
2 56.00000000 6.00000000 0.90322581 0.09677419 0.76543210
4 29.00000000 0.00000000 1.00000000 0.00000000 0.35802469
5 27.00000000 6.00000000 0.81818182 0.18181818 0.40740741
10 12.00000000 0.00000000 1.00000000 0.00000000 0.14814815
11 15.00000000 6.00000000 0.71428571 0.28571429 0.25925926
22 12.00000000 2.00000000 0.85714286 0.14285714 0.17283951
23 3.00000000 4.00000000 0.42857143 0.57142857 0.08641975
3 8.00000000 11.00000000 0.42105263 0.57894737 0.23456790
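As a quick check of that mapping (a sketch): fit$where holds row numbers of fit$frame, so the node number of each observation's leaf can be recovered from the row names:

# tabulate the observations by the node number of their terminal leaf
table(rownames(fit$frame)[fit$where])
# for the kyphosis fit above this gives 19 obs in node 3, 29 in node 4,
# 12 in node 10, 14 in node 22 and 7 in node 23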