Error in Adabag boosting function - r

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
ada = boosting(formula=var1~., data=df1)
Error in cbind(yval2, yprob, nodeprob) :
number of rows of matrices must match (see arg 2)
Hi everyone, I'm trying to use the boosting function from the adabag package, but it tells me that the number of rows of the matrices must match. This is not my original data, but it throws the same error.
Could you help me?
Thank you.

You should not use ID as an explanatory variable.
Unfortunately, your df1 dataset is too small, and it is not possible to tell whether ID is the source of your problem.
Below I generate a bigger data set:
library(adabag)
set.seed(1)
n <- 100
df1 <- data.frame(ID = 1:n,
var1 = sample(letters[1:5], n, replace=T),
var2 = sample(c(0,1), n, replace=T))
head(df1)
# ID var1 var2
#
# 1 1 b 1
# 2 2 b 0
# 3 3 c 0
# 4 4 e 1
# 5 5 b 1
# 6 6 e 0
ada <- boosting(var1~var2, data=df1)
ada.pred <- predict.boosting(ada, newdata=df1)
ada.pred$confusion
#                Observed Class
# Predicted Class  a  b  c  d  e
#               b  5 20  2  7 11
#               c  2  2 10  2  2
#               d  6  3  7 17  4

Pablo, if we take a closer look at your sample data, we notice a property that makes it impossible for the classification algorithm to handle. Your dataset consists of five samples, each with a unique label (the dependent variable var1): a, b, c, d, e. It has only one feature (the independent variable var2, since ID should be excluded from the feature list), which takes two values: 0 and 1. This means that several labels of the dependent variable correspond to the same value of the independent variable. When the algorithm tries to build a model, it runs into a problem fitting the trees because of this property and throws the error (number of rows of matrices must match (see arg 2)).
Marco's data, by contrast, has some healthy diversity: in the six rows shown, there are only three labels (b, c, e) and two feature values (0, 1). The data set is diverse and reliable enough for the algorithm to handle.
So, in order to use adabag's boosting (which fits rpart trees as its base classifiers), you should make your data more diverse and reliable. Good luck!
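The property described above can be checked before fitting anything. The following is a minimal base R sketch, not part of adabag; the names class_counts, n_classes and n_feature_values are made up for illustration.

```r
# The asker's original data
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c('a', 'b', 'c', 'd', 'e'),
                  var2 = c(1, 1, 0, 0, 1))

# How many observations does each class of the response have?
class_counts <- table(df1$var1)
class_counts

# Number of classes versus number of distinct values of the only feature
n_classes <- length(class_counts)
n_feature_values <- length(unique(df1$var2))

if (all(class_counts == 1)) {
  message("Every class appears only once: there is nothing to learn from.")
}
if (n_feature_values < n_classes) {
  message("The feature has fewer distinct values than there are classes.")
}
```

If both messages appear, as they do for this data, the data is probably too small and too uniform for a tree-based classifier.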

Related

How to mutate a mean of certain rows in a data frame

I would like to create a new column which equals the mean of several variables (columns) in my data frame. However, I'm afraid I can't use 'rowMeans' because I don't want to average all variables. Moreover, I'm hesitant to manually type all the variable names (there are many). For example:
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
I want to add a column called avg which is the average of variables a, b, c, d, and e only. Because the variable names in my dataset are long (and complex), and there are more than 10 variables, I prefer not to type them out one by one. So I guess I might need to use the dplyr package and the mutate function? Could you please suggest a clever way for me to do that?
The content below was added in response to your kind comments and answers. Thank you all again:
Actually, the column names that I need are Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). However, I tried
my_data$avg = rowMeans(select(my_data, Mcheck5_1_1:Mcheck5_8_1), na.rm = TRUE)
but an error was thrown to me:
Error in select(my_data, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
Right now I solved the problem by using the following code:
idx = grep("Mcheck5_1_1", names(my_data))
# parentheses are needed: idx:idx+7 is parsed as (idx:idx)+7, i.e. a single column
my_data$avg = rowMeans(my_data[, idx:(idx+7)], na.rm = TRUE)
But is there a more elegant way to do it? Or why couldn't I use select()? Thanks!
I would do something like this
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
several_variables <- c('a', 'b', 'c', 'd', 'e') # or `letters[1:5]`
my_data$avg <- rowMeans(my_data[,several_variables])
my_data
#> a b c d e hello bye avg
#> 1 1 4 10 13 10 1 1 7.6
#> 2 2 5 10 24 8 -1 5 9.8
#> 3 3 6 10 81 6 1 5 21.2
Obviously, if the variables are at some fixed position, and you know they will stay there, you could use the numbered indexing as suggested by Jaap,
my_data$avg <- rowMeans(my_data[,1:5])
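For the Mcheck5_* columns mentioned in the question, the names can also be picked by pattern instead of by position. This is a base R sketch on made-up data (the values and the regular expression are assumptions; only the column names come from the question). As a side note, an "unused argument" error from select() often means that another attached package's select() (MASS is a common culprit) is masking dplyr's, in which case calling dplyr::select() explicitly may help; the post does not confirm that this was the cause.

```r
# Hypothetical data: 8 columns named Mcheck5_1_1 ... Mcheck5_8_1 plus one extra
my_data <- as.data.frame(matrix(1:24, nrow = 3))
names(my_data) <- paste0("Mcheck5_", 1:8, "_1")
my_data$hello <- c(1, -1, 1)

# Select the columns by name pattern rather than by position
cols <- grep("^Mcheck5_[1-8]_1$", names(my_data), value = TRUE)

# Row-wise mean over the selected columns only
my_data$avg <- rowMeans(my_data[, cols], na.rm = TRUE)
my_data$avg
```

Selecting by name keeps working if the columns are later reordered, which positional indexing does not.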

Limiting Duplication of Specified Columns

I'm trying to find a way to add some constraints to a linear programme to force the solution to have a certain level of uniqueness. I'll try to explain what I mean. In the example below, the linear programme returns the maximum possible Score for a combination of 2 males and 1 female.
Looking at the Team/Grade/Rep columns however we can see that there is a lot of duplication from row to row. In fact Shana and Jason are identical.
Name<-c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex<-c("F","M","M","F","F","M")
Score<-c(25,50,36,40,39,62)
Team<-c("A","A","A","B","B","B")
Grade<-c(1,2,1,2,1,2)
Rep<-c("C","D","C","D","D","D")
df<-data.frame(Name,Sex,Score,Team,Grade,Rep)
df
Name Sex Score Team Grade Rep
1 Jane F 25 A 1 C
2 Brad M 50 A 2 D
3 Harry M 36 A 1 C
4 Shana F 40 B 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"),as.numeric(df$Sex == "F"))
direction <- c("==","==")
rhs<-c(2,1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
4 Shana F 40 B 2 D
6 Jason M 62 B 2 D
What I am trying to work out is how to limit the level of duplication across those last three columns. For example, I would like no more than 2 of those columns to be the same across any two rows. This would mean that either the Shana row or the Jason row would be replaced in the model with an alternative.
I'm not sure if this is something that can be easily added into the Rglpk model? Appreciate any help that can be offered.
It sounds like you're asking how to prevent having a pair of individuals who are "too similar" from being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each pair, limiting your solution to have no more than one of those two people.
For instance, if we use your rule of having no more than 2 columns the same, we could easily identify all pairs that we want to block:
pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[,c("Team", "Grade", "Rep")], function(x) {
x[pairs[,1]] == x[pairs[,2]]
})) >= 3,])
# [,1] [,2]
# [1,] 1 3
# [2,] 4 6
We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"),
outer(blocked[,1], seq_len(num), "==") + outer(blocked[,2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs<-c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
# Name Sex Score Team Grade Rep
# 2 Brad M 50 A 2 D
# 5 Debra F 39 B 1 D
# 6 Jason M 62 B 2 D
The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is to compute every pair that needs to be blocked.
For each group of rows having the same last 3 columns we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group then the constraint would look like this:
a'x <= 1
To do that, split the row numbers by the last 3 columns into a list of vectors s, each of whose components is a vector of row numbers for rows having the same last 3 columns. Keep only those components having more than 1 row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there were 2 rows in each of the groups, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.
The data in the question only has groups of size 2 but in general if there were k rows in some group one would need k choose 2 constraint rows to ensure that only one of the k were chosen if this were done pairwise whereas the approach here only requires one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45 so this uses 1 constraint in place of 45.
Finally rbind excl to matrix giving matrix2 and adjust the other Rglpk_solve_LP arguments accordingly giving:
nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <-t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2,
dir = direction2, rhs = rhs2, types = "B", max = TRUE)
df[ sol2$solution == 1, ]
giving:
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D

Resample with replacement by cluster

I want to draw clusters (defined by the variable id) with replacement from a dataset, and in contrast to previously answered questions, I want clusters that are chosen K times to have each observation repeated K times. That is, I'm doing cluster bootstrapping.
For example, the following samples id=1 twice, but repeats the observations for id=1 only once in the new dataset s. I want all observations from id=1 to appear twice.
f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
new.ids <- sample(unique(f$id), replace=TRUE)
s <- f[f$id %in% new.ids, ]
One option would be to lapply over each new.id and save it in a list. Then you can stack that all together:
library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
# id X
#1: 1 1.20118333
#2: 1 -0.01280538
#3: 1 1.20118333
#4: 1 -0.01280538
#5: 3 -0.07302158
#6: 3 -1.26409125
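If you would rather not depend on data.table, the same idea can be sketched in base R with do.call(rbind, ...). Here the seed is set before rnorm() so that the whole example is reproducible, which differs from the question's code; the name pieces is made up for illustration.

```r
set.seed(451)
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))

# Draw cluster ids with replacement
new.ids <- sample(unique(f$id), replace = TRUE)

# One data frame per draw, so an id drawn K times contributes K copies
pieces <- lapply(new.ids, function(x) f[f$id == x, ])

# Stack the pieces and reset the row names
s <- do.call(rbind, pieces)
rownames(s) <- NULL
s
```

For large data the data.table version above is likely to be faster, since rbind on many data frames copies repeatedly.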
Just in case one needs a "new_id" that corresponds to the index number (i.e. the sample order): I needed "new_id" so that I could run mixed effects models without several instances of a cluster being treated as one cluster because they shared the same id.
library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
new.ids,
seq_along(new.ids),
SIMPLIFY=FALSE
))
ss
which gives:
> ss
id X new_id
1: 1 -0.3491670 1
2: 1 1.3676636 1
3: 1 -0.3491670 2
4: 1 1.3676636 2
5: 3 0.9051575 3
6: 3 -0.5082386 3
Note that the values of X are different because set.seed is not called before the rnorm() call, but the ids are the same as in the answer of @Mike H.
This link was useful to me in constructing this answer: R lapply statement with index [duplicate]

How to split into train and test data ensuring same combinations of factors are present in both train and test?

Is there a way to split the data into train and test such that all combinations of categorical predictors in the test data are present in the training data? If it is not possible to split the data given the proportions specified for the test and train sizes, then those levels should not be included in the test data.
Say I have data like this:
SAMPLE_DF <- data.frame("FACTOR1" = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
"FACTOR2" = c(rep(letters[3:5], 2), rep("z", 3), "f"),
"response" = rnorm(10,10,1),
"node" = c(rep(c(1,2),5)))
> SAMPLE_DF
FACTOR1 FACTOR2 response node
1 a c 10.334690 1
2 b d 11.467605 2
3 a e 8.935463 1
4 b c 10.253852 2
5 a d 11.067347 1
6 b e 10.548887 2
7 a z 10.066082 1
8 b z 10.887074 2
9 a z 8.802410 1
10 b f 9.319187 2
11 a c 10.334690 1
12 b d 11.467605 2
13 a e 8.935463 1
14 b c 10.253852 2
15 a d 11.067347 1
16 b e 10.548887 2
17 g z 10.066082 1
18 g z 10.887074 2
19 h z 8.802410 1
20 i f 9.319187 2
If the test data contains a combination of FACTOR1 and FACTOR2 such as a c, then that combination should also be in the train data. The same goes for all other possible combinations.
createDataPartition does this for one level, but I would like it for all levels.
You could try the following: use dplyr to remove the combinations that appear only once (and would therefore end up only in the training set or only in the test set), and then use createDataPartition to make the split:
Data
SAMPLE_DF <- data.frame("FACTOR1" = rep(letters[1:2], 10),
"FACTOR2" = c(rep(letters[3:5], 2), rep("z", 4)),
"num_pred" = rnorm(10,10,1),
"response" = rnorm(10,10,1))
Below you use dplyr to count the number of the combinations of factor1 and factor2. If any of those are 1 then you filter them out:
library(dplyr)
mydf <-
SAMPLE_DF %>%
mutate(all = paste(FACTOR1,FACTOR2)) %>%
group_by(all) %>%
summarise(total=n()) %>%
filter(total>=2)
The above keeps only the combinations of FACTOR1 and FACTOR2 that appear at least twice.
You remove rows from SAMPLE_DF according to the above kept combinations:
SAMPLE_DF2 <- SAMPLE_DF[paste(SAMPLE_DF$FACTOR1,SAMPLE_DF$FACTOR2) %in% mydf$all,]
And finally you let createDataPartition do the split for you:
library(caret)
IND_TRAIN <- createDataPartition(paste(SAMPLE_DF2$FACTOR1,SAMPLE_DF2$FACTOR2))$Resample
#train set
A <- SAMPLE_DF2[ IND_TRAIN,]
#test set
B <- SAMPLE_DF2[-IND_TRAIN,]
>identical(sort(paste(A$FACTOR1,A$FACTOR2)) , sort(paste(B$FACTOR1,B$FACTOR2)))
[1] TRUE
As you can see at the identical line, the combinations are exactly the same!
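The same two steps (drop combinations that occur only once, then split within each remaining combination) can also be sketched in base R without dplyr or caret. This is an illustration under assumptions, not what createDataPartition does internally: the 50% share and the names d, combo, d2, train_idx are made up, and FACTOR2 is written out to full length instead of relying on recycling.

```r
set.seed(42)
# Data shaped like the question's example
d <- data.frame(FACTOR1 = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
                FACTOR2 = rep(c(rep(letters[3:5], 2), rep("z", 3), "f"), 2),
                response = rnorm(20, 10, 1))

# Keep only the combinations that occur at least twice
combo <- paste(d$FACTOR1, d$FACTOR2)
keep <- ave(seq_along(combo), combo, FUN = length) >= 2
d2 <- d[keep, ]
combo2 <- combo[keep]

# Within each combination, send at least one row to train and leave at least one for test
train_idx <- unlist(lapply(split(seq_len(nrow(d2)), combo2), function(rows) {
  n_train <- min(max(1, round(0.5 * length(rows))), length(rows) - 1)
  rows[sample.int(length(rows), n_train)]
}), use.names = FALSE)

train <- d2[train_idx, ]
test <- d2[-train_idx, ]

# Both sets contain the same set of combinations
setequal(paste(train$FACTOR1, train$FACTOR2), paste(test$FACTOR1, test$FACTOR2))
```

Because the split is done per combination, the guarantee holds by construction rather than by chance.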
This is a function I have put together that does this: it splits a data frame into train and test sets (in the proportion the user chooses) such that every factor column has the same levels present in both sets.
getTrainAndTestSamples_BalancedFactors <- function(data, percentTrain = 0.75, inSequence = FALSE, seed = 0) {
  set.seed(seed) # set the seed so that the same sample can be reproduced later
  sample <- NULL
  train <- NULL
  test <- NULL
  # count the observations per level for every factor column
  listOfFactorsAndTheirLevels <- lapply(Filter(is.factor, data), table)
  factorContainingOneElement <- vapply(listOfFactorsAndTheirLevels, function(x) any(x == 1), logical(1))
  if (any(factorContainingOneElement)) {
    warning("This dataframe cannot be reliably split into sets that contain Factors equally represented. At least one factor has a level with only 1 observation.")
  } else {
    # repeat until all factor variables have the same levels present in both train and test
    repeat {
      # select 'percentTrain' of the rows of the data as the train sample
      if (inSequence) {
        sample <- 1:floor(percentTrain * nrow(data))
      } else {
        sample <- sample.int(n = nrow(data), size = floor(percentTrain * nrow(data)), replace = FALSE)
      }
      train <- data[sample, ]  # create train set
      test <- data[-sample, ]  # create test set
      train_factor_only <- Filter(is.factor, train) # df containing only the factor columns of 'train'
      test_factor_only <- Filter(is.factor, test)   # df containing only the factor columns of 'test'
      haveFactorsWithExistingLevels <- NULL
      for (i in seq_len(ncol(train_factor_only))) { # for each factor column
        # levels actually present in each set; droplevels() is needed because
        # subsetting keeps unused levels, which would make the check always pass
        names_train <- levels(droplevels(train_factor_only[[i]]))
        names_test <- levels(droplevels(test_factor_only[[i]]))
        # symmetric difference between the levels present in 'train' and in 'test'
        symmetric_diff <- union(setdiff(names_train, names_test), setdiff(names_test, names_train))
        # empty symmetric difference: both sets contain every level at least once;
        # otherwise one set has levels the other lacks, which would flag up in predict()
        haveFactorsWithExistingLevels <- c(haveFactorsWithExistingLevels, length(symmetric_diff) == 0)
      }
      if (all(haveFactorsWithExistingLevels))
        break # found a split with matching factor levels in 'train' and 'test'
      if (inSequence) {
        # a sequential split is deterministic, so retrying would loop forever
        warning("The sequential split does not have matching factor levels in train and test.")
        break
      }
    }
  }
  return(list("train" = train, "test" = test))
}
Use like:
df <- getTrainAndTestSamples_BalancedFactors(some_dataframe)
df$train # this is your train set
df$test # this is your test set
...and no more annoying errors from R!
This can definitely be improved, and I look forward to more efficient ways of doing this in the comments below; however, one can use the code as it is.
Enjoy!

Simple data-manipulation in R

@Aniko points out that one way to view my problem is that I need to find the connected components of a graph, where the vertices are the groups and the variables group and nominated_group indicate an edge between two groups. My goal is to create a variable parent_group which indexes the connected components. Or, as I put it before:
I have a dataframe with four variables: ID, group, nominated_ID, and nominated_group.
Consider sister-groups: Groups A and B are sister-groups if there is at least one case in the data where group==A and nominated_group==B, or vice versa.
I would like to create a variable parent_group which takes on a unique value for each set of sister-groups. In other words, no nominations should occur between cases in different parent_groups. Making the parent_group sequential numbers seems like a good idea.
Many thanks for the help I already received here! I can't really contribute here but note that I try to pay it forward at stats.exchange and on wikipedia.
In my fake data, A and B are sister-groups. Either of the cases ID=4 or ID=3 (rows 4 and 5) is sufficient to make this true. Each group is also its own sister-group. The goal, the creation of parent_group, should result in one parent_group for all cases in A or B, and another parent_group for group C.
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
group = c("A", "A", "B", "B", "A", "C"),
nominated_ID = c(9, 8, 4, 9, 2, 7) )
df$nominated_group <- with(df, group[match(nominated_ID, ID)])
df
ID group nominated_ID nominated_group
1 9 A 9 A
2 5 A 8 <NA>
3 2 B 4 B
4 4 B 9 A
5 3 A 2 B
6 7 C 7 C
Consider a graph with the groups as its vertices and the edges indicating that the two groups occur for the same ID. Then I think you are looking for connected components of this graph. The following is a quick and dirty (and probably not optimal) implementation of this idea using the graph package:
library(graph)
#make some fake data
nom <- data.frame(group = c("A","A","A","B","B","C","C"),
group2 = c("A","A","B","B","A","C","C"),
stringsAsFactors=FALSE)
#remove duplicated pairs
#it will keep A-B distinct from B-A, could probably be fixed
nom1 <- nom[!duplicated(nom),]
#define empty graph
grps <- union(unique(nom$group), unique(nom$group2))
gg <- new("graphNEL", nodes=grps, edgeL=list())
#add an edge for every pair
for (i in 1:nrow(nom1)) gg <- addEdge(nom1$group[i], nom1$group2[i], gg, 1)
#find connected components
cc <- connComp(gg)
#assign parent by matching within cc
nom$parent <- apply(nom, 1,
function(x) which(sapply(cc, function(y) x["group"] %in% y)))
nom
group group2 parent
1 A A 1
2 A A 1
3 A B 1
4 B B 1
5 B A 1
6 C C 2
7 C C 2
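If installing the graph package is not an option, the connected components can also be found with a small label-propagation loop in base R. This is a sketch of a swapped-in technique, not the method used above: every group starts with its own label, and the labels of two connected groups are repeatedly merged to the smaller one until nothing changes. The names groups, label and edges are made up; the data is the question's df.

```r
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
                 group = c("A", "A", "B", "B", "A", "C"),
                 nominated_ID = c(9, 8, 4, 9, 2, 7),
                 stringsAsFactors = FALSE)
df$nominated_group <- with(df, group[match(nominated_ID, ID)])

# Start with every group as its own component
groups <- sort(unique(df$group))
label <- setNames(seq_along(groups), groups)

# Edges: rows where the nominated group is known
edges <- df[!is.na(df$nominated_group), c("group", "nominated_group")]

# Merge the labels of connected groups until no label changes
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(edges))) {
    a <- edges$group[i]
    b <- edges$nominated_group[i]
    lo <- min(label[a], label[b])
    if (label[a] != lo || label[b] != lo) {
      label[c(a, b)] <- lo
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Renumber the component labels sequentially
df$parent_group <- match(label[df$group], sort(unique(label)))
df
```

This is simple rather than fast: each pass scans all edges, so for large data a dedicated graph library is likely the better choice.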
