Data Partition in Caret Package and Over-fitting - r

I was reading about the caret package and I saw this code:
createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5,
length(y)))
I am wondering about the "times" argument. So, if I use this code,
inTrain2 <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
training2 <- MyData[inTrain2, ]       # ≈ 67% (train)
testing2  <- MyData[-inTrain2[2], ]   # ≈ 33% (test)
Would it cause an overfitting problem? Or is it meant for some kind of (unbiased) resampling method?
Many thanks in advance.
Edit:
I would like to mention that if I use this code:
inTrain2 <- createDataPartition(y = MyData$Class, times = 1, p = .70, list = FALSE)
training2 <- MyData[inTrain2, ]   # 142 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2, ]  # 69 samples,  ≈ 33% (test)
I get 211 samples and an accuracy rate of ≈ 52%. On the other hand, if I use this code:
inTrain2 <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
training2 <- MyData[inTrain2, ]       # 426 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2[2], ]   # 210 samples, ≈ 33% (test)
I get 536 samples and an accuracy rate of ≈ 98%.
Thank you.

It is not clear why you bring overfitting into this question; times simply refers to how many different partitions you want (docs). Let's see an example with the iris data:
library(caret)
data(iris)
ind1 <- createDataPartition(iris$Species, times=1, list=FALSE)
ind2 <- createDataPartition(iris$Species, times=2, list=FALSE)
nrow(ind1)
# 75
nrow(ind2)
# 75
head(ind1)
     Resample1
[1,]         1
[2,]         5
[3,]         7
[4,]        11
[5,]        12
[6,]        18
head(ind2)
     Resample1 Resample2
[1,]         2         1
[2,]         3         4
[3,]         6         6
[4,]         7         9
[5,]         8        10
[6,]        11        11
Both index objects have 75 rows (since we have used the default argument p = 0.5, i.e. half the rows of the initial dataset). The columns (different resamples) of ind2 are independent of each other, and the proportions of the different iris$Species levels are preserved in each, e.g.:
length(which(iris$Species[ind2[,1]]=='setosa'))
# 25
length(which(iris$Species[ind2[,2]]=='setosa'))
# 25
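If you do want times = 3, note that each column of the returned matrix is meant to serve as a separate train/test split; indexing MyData with the whole matrix stacks the three resamples, which is probably why your row count roughly triples and the accuracy looks inflated (test rows can also occur in the stacked training set). A sketch of the intended usage, reusing the MyData/Class objects from the question:
inTrain2 <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
for (k in 1:3) {
  training_k <- MyData[inTrain2[, k], ]   # ~70% of the rows for resample k
  testing_k  <- MyData[-inTrain2[, k], ]  # the remaining ~30%
  # fit and evaluate the model on this split, then summarise over the 3 resamples
}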

Related

Large mixed integer programming in R - possible to solve?

I would like to solve a large mixed integer programming problem, and I have tried with R, using the lpSolveAPI package. The problem is large: 410 variables, each of which can be either 0 or 1, and about 49422 constraints. I have tried to let it run for 1.5 days, but once I try to stop it, it says that R needs to be terminated. The same happens if I let it run for a short time, say 15 minutes, and then try to stop it by clicking on the red button. Since this happens, I am not sure whether there is something wrong with my computer or whether such a problem is simply too large for it. When it runs, it uses maybe 20% of the CPU power and about 70% of memory. My computer is a 2022 Lenovo X1 Yoga with an i7 2.80GHz processor and 16GB of RAM.
The problem itself is constructed in the following way:
library(lpSolveAPI)
#A has length of 410. No constraints to begin with
lprec <- make.lp(0, 410)
set.objfn(lprec, round(A,0))
lp.control(lprec, sense="max")
set.type(lprec, 1:410, "binary")
#Defining the constraints with a for loop. Will not go into details, but it adds 49422 constraints
for (){
...
add.constraint(lprec, MyConstraint, "<=", 1)
...
}
lprec
#This says: Model name: a linear program with 410 decision variables and 49422 constraints
solve(lprec)
The vector "MyConstraint" is different in every iteration, but it has length 410 where 408 elements are 0 and two elements are 1.
That is, I have 410 objects in total, and I want to choose a set of those objects (1 if an object is chosen and 0 otherwise) such that the objective function is maximized. However, some pairs of objects are not allowed, and so each of the 49422 constraints specifies two objects that cannot be chosen at once: each constraint says that their sum cannot be above 1.
So, my question is whether there is any way to solve this. If not, how large can such a problem be and still be solvable?
Thank you!
EDIT: ---------------------------------------------
In the comments I was asked to provide an example, so here it is: a similar, but much smaller, problem. Suppose we have 7 different objects, and these can be allocated into 5 groups. Let us define the groups and the associated savings, denoted by A:
MyGroups <- c(1,0,0,0,1,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,0,1,0,
              0,0,0,1,0,1,0,
              0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups,nrow=5,ncol=7,byrow=TRUE)
rownames(MyGroups) <- paste0("Group", 1:5)
colnames(MyGroups) <- paste0("Object", 1:7)
A=c(50,30,100,100,200)
That is, group 1 consists of Object 1 and Object 5 (denoted by the first row in the matrix MyGroups). Such a group will give a saving of 50. Objective: maximize the total saving by choosing the right groups. Problem: each object can only be part of one group. For example, if group 2 is implemented, then group 3 cannot be implemented, since both groups require object 3. Here we see that the optimal solution is to choose Group 1 and Group 5, which gives a total saving of 50 + 200 = 250. I want to be able to find this for a bigger problem. So first I create constraints that specify which two groups cannot be implemented at the same time.
lprec2 <- make.lp(0, 5)
set.objfn(lprec2, A)
lp.control(lprec2, sense = "max")
set.type(lprec2, c(1:5), "binary")
# Defining the constraints
for (i in 1:(5 - 1)) {
  for (j in (i + 1):5) {
    if (max(colSums(MyGroups[c(i, j), ])) > 1) {
      # group i and group j cannot be together: add a constraint
      MyConstraint <- integer(5)
      MyConstraint[c(i, j)] <- 1
      add.constraint(lprec2, MyConstraint, "<=", 1)
    }
  }
}
lprec2
This gives the corresponding mixed integer problem. When I solve it, the solution is:
solve(lprec2)
get.objective(lprec2)
get.variables(lprec2)
which give 250 and (1 0 0 0 1), respectively.
In the original problem I have 410 possible groups, implying 410 decision variables. The number of constraints is 49422, but in every row there are exactly two 1s and the rest are 0s.
If you could help me to solve such a problem, I would be happy :-). Thanks!
Here is the model formulated using ompr:
library(dplyr)            # for the %>% pipe
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

MyGroups <- c(1,0,0,0,1,0,0,
              0,0,1,1,0,0,0,
              0,0,1,0,0,1,0,
              0,0,0,1,0,1,0,
              0,0,1,1,0,1,0)
MyGroups <- matrix(MyGroups, nrow = 5, ncol = 7, byrow = TRUE)
ngroups  <- nrow(MyGroups)
nobjects <- ncol(MyGroups)
coeffs   <- c(50, 30, 100, 100, 200)

model <- MIPModel() %>%
  add_variable(group[i], i = 1:ngroups, type = 'binary') %>%
  add_variable(assign[i, j], i = 1:ngroups, j = 1:nobjects, type = 'binary',
               MyGroups[i, j] == 1) %>%
  set_objective(sum_over(coeffs[i] * group[i], i = 1:ngroups), sense = 'max') %>%
  add_constraint(sum_over(assign[i, j], i = 1:ngroups, MyGroups[i, j] == 1) <= 1,
                 j = 1:nobjects) %>%
  add_constraint(assign[i, j] == group[i], i = 1:ngroups, j = 1:nobjects,
                 MyGroups[i, j] == 1) %>%
  add_constraint(sum_over(group[i], i = 1:ngroups) <= 2)

result <- solve_model(model, with_ROI("glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (11)
* 12: obj = 2.500000000e+002 infeas = 0.000e+000 (3)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
16 rows, 16 columns, 35 non-zeros
16 integer variables, all of which are binary
Integer optimization begins...
+ 12: mip = not found yet <= +inf (1; 0)
+ 13: >>>>> 2.500000000e+002 <= 2.500000000e+002 0.0% (1; 0)
+ 13: mip = 2.500000000e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
result
Status: success
Objective value: 250
ompr is a model-management wrapper around the ROI package. It uses an algebraic paradigm like GAMS or AMPL, but has less embedded logic to simplify the syntax. With ompr, however, you can test the other solvers that ROI offers as plug-ins: http://roi.r-forge.r-project.org/
Some are free; others, like Mosek, CPLEX, and Gurobi, are commercial products. I suggest running a large subset of the problem and checking the relative performance of the different solvers.
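For example (an illustration of my own; it assumes the corresponding ROI plug-in package is installed), switching solvers only requires changing the name passed to with_ROI():
library(ROI.plugin.symphony)  # assumed installed; registers the "symphony" solver with ROI
result_symphony <- solve_model(model, with_ROI("symphony"))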
Also note that your toy problem is degenerate: Group(1, 3, 4) is also a solution. I added an additional constraint that can limit the number of groups selected. If your objective-function coefficients are integer values, the formulation may have many degenerate solutions; a simple test is to add a small random epsilon to each of the coefficients to eliminate the degeneracy and see whether that improves performance.
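A small sketch of that epsilon idea (my own illustration, not code from the answer):
set.seed(1)
coeffs_eps <- coeffs + runif(length(coeffs), 0, 1e-3)  # tiny random jitter to break ties
# then build the objective from coeffs_eps instead of coeffs in set_objective()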
Keying in on a couple of the OP's statements:
In the original problem I have 410 possible groups, implying 410
decision variables. The number of constraints is 49422, but in all
rows there are exactly two 1 and the remaining are 0.
and
I have 32 objects.
It seems like this can be formulated as a one-sided matching problem with a utility matrix instead of a preference matrix, which can be solved using the matchingR package.
The problem is set up using a savings matrix whose row and column indices refer to objects; each cell (and its mirror across the main diagonal) represents the savings for a group composed of two objects (the row and the column).
First a smaller example with 7 objects and 15 groups:
library(matchingR) # for the roommate function
library(Rfast) # for the rowSort function
set.seed(379327748)
n <- 7L # number of objects
m <- matrix(0L, n, n) # initialize the savings matrix
# specify the savings for the 15 groups
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 15)]] <- sample(1e3, 15, TRUE)
# make the savings matrix symmetric
m[upper.tri(m)] <- t(m)[upper.tri(m)]
# the savings matrix: each cell refers to the savings for the row/column pair
# it is symmetric: 1 paired with 2 has the same savings as 2 paired with 1
m
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 0 692 429 767 133 434 619
#> [2,] 692 0 0 997 146 801 0
#> [3,] 429 0 0 214 966 683 0
#> [4,] 767 997 214 0 835 0 0
#> [5,] 133 146 966 835 0 888 513
#> [6,] 434 801 683 0 888 0 0
#> [7,] 619 0 0 0 513 0 0
# use the roommate function to get the optimal pairings
groups <- unique(rowSort(cbind(1:n, roommate(utils = m))))
# remove objects with no pair in the optimal solution (NAs happen only with an
# odd number of objects)
groups <- groups[!is.na(groups[,2]),]
groups <- groups[m[groups] > 0,]
# show the solution
groups
#> [,1] [,2]
#> [1,] 1 7
#> [2,] 2 4
#> [3,] 3 5
c(savings = sum(m[groups]))
#> savings
#> 2582
Now an example with 32 objects and 410 groups. The solution is provided almost instantly.
n <- 32L
m <- matrix(0L, n, n)
m[which(lower.tri(m))[sample(n*(n - 1L)/2L, 410)]] <- sample(1e3, 410, TRUE)
m[upper.tri(m)] <- t(m)[upper.tri(m)]
system.time(groups <- unique(rowSort(cbind(1:n, roommate(utils = m)))))
#> user system elapsed
#> 0 0 0
groups <- groups[m[groups] > 0,]
groups
#> [,1] [,2]
#> [1,] 1 15
#> [2,] 2 18
#> [3,] 3 32
#> [4,] 4 19
#> [5,] 5 30
#> [6,] 6 9
#> [7,] 7 12
#> [8,] 8 14
#> [9,] 10 29
#> [10,] 11 24
#> [11,] 13 16
#> [12,] 17 20
#> [13,] 21 27
#> [14,] 22 31
#> [15,] 23 26
#> [16,] 25 28
c(savings = sum(m[groups]))
#> savings
#> 14369
# check that each object is used only once
max(tabulate(groups, 32L))
#> [1] 1

Limma to Compare Bulk RNA Seq using makeContrasts and eBayes

After a day of googling, I've decided that it'd be better to ask the question here.
So the experiment is: I have bulk RNA-seq data from 3 patients: A, B, and C.
Their RNA-seq data were obtained pre-treatment and at treatment cycles 1, 2, and 3.
So in total I have 12 bulk RNA-seq samples:
A.PreTreat -> A.Cycle1 -> A.Cycle2 -> A.Cycle3
B.PreTreat -> B.Cycle1 -> B.Cycle2 -> B.Cycle3
C.PreTreat -> C.Cycle1 -> C.Cycle2 -> C.Cycle3
I want to get a differential gene list between different cycles (i.e. cycle 3 to pretreatment, cycle 3 to cycle 2) using model.matrix(), lmFit(), makeContrasts(), contrasts.fit(), eBayes(), all of which are in the limma package.
Here is my minimal working example.
library(limma)
# Already normalized expression set: rows are genes, columns are the 12 samples
normalized_expression <- matrix(data = sample(1:100, 120, replace = TRUE), nrow = 10, ncol = 12)
colnames(normalized_expression) <- c("A.PreTreat", "A.Cycle1", "A.Cycle2", "A.Cycle3", "B.PreTreat", "B.Cycle1", "B.Cycle2", "B.Cycle3", "C.PreTreat", "C.Cycle1", "C.Cycle2", "C.Cycle3")
patient_and_treatment <- factor(colnames(normalized_expression), levels = colnames(normalized_expression))
design.matrix <- model.matrix(~0 + patient_and_treatment)
colnames(design.matrix) <- patient_and_treatment
fit <- lmFit(normalized_expression, design.matrix)
# I want to get a contrast matrix to get differential genes between cycle 3 treatment and pre-treatment in all patients
contrast.matrix <- makeContrasts("A.Cycle3+B.Cycle3+C.Cycle3-A.PreTreat-B.PreTreat-C.PreTreat",
levels = levels(patient_and_treatment))
# Outputs Error of no residual degree of freedom
fit2 <- eBayes( contrasts.fit( fit, contrast.matrix ) )
# Want to run but cannot
summary(decideTests(fit2))
So far I am stuck on the "no residual degrees of freedom" error.
I am not even sure whether this is the statistically correct way in limma to address my question of getting a differential gene list between cycle 3 treatment and pre-treatment across all patients.
Any help will be greatly appreciated.
Thanks!
You cannot have 1 observation per group; this makes the regression meaningless, as you are fitting each data point to itself.
Briefly, what you are looking for are effects common across all patients, for, say, Cycle3 compared to PreTreat and so on. Set up the model like this:
library(limma)
metadata <- data.frame(
  Patient   = gsub("[.][^ ]*", "", colnames(normalized_expression)),
  Treatment = gsub("^[A-Z][.]*", "", colnames(normalized_expression))
)
metadata
Patient Treatment
1 A PreTreat
2 A Cycle1
3 A Cycle2
4 A Cycle3
5 B PreTreat
6 B Cycle1
7 B Cycle2
8 B Cycle3
9 C PreTreat
10 C Cycle1
11 C Cycle2
12 C Cycle3
Now specify the model matrix; the Patient term accounts for differences in starting levels between patients:
design.matrix <- model.matrix(~0 + Treatment+Patient,data=metadata)
fit <- lmFit(normalized_expression, design.matrix)
contrast.matrix <- makeContrasts(TreatmentCycle3 - TreatmentPreTreat,
                                 TreatmentCycle1 - TreatmentPreTreat,
                                 levels = design.matrix)
fit2 = contrasts.fit(fit, contrast.matrix)
fit2 = eBayes(fit2)
You can check that the coefficients give you what you wanted:
fit2$coefficients
Contrasts
TreatmentCycle3 - TreatmentPreTreat
[1,] -3.666667
[2,] -13.666667
[3,] 1.666667
[4,] -40.666667
[5,] 12.000000
[6,] -46.000000
[7,] -32.000000
[8,] 4.666667
[9,] 11.333333
[10,] 5.666667
Contrasts
TreatmentCycle1 - TreatmentPreTreat
[1,] -11.33333
[2,] -19.33333
[3,] -27.33333
[4,] -42.33333
[5,] 27.33333
[6,] -32.66667
[7,] -33.00000
[8,] -30.66667
[9,] 46.00000
[10,] 17.33333
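From here, the differential gene lists the question asks about would typically be pulled out with decideTests() and topTable(); a brief sketch (coef = 1 refers to the first contrast defined above, Cycle3 vs PreTreat):
summary(decideTests(fit2))             # counts of up/down/non-significant genes per contrast
topTable(fit2, coef = 1, number = 10)  # top-ranked genes for TreatmentCycle3 - TreatmentPreTreat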

Trying to add breakpoint lines from strucchange to a plot by "lines" command [duplicate]

This is my first time with strucchange, so bear with me. The problem I'm having seems to be that strucchange doesn't recognize my time series correctly, but I can't figure out why and haven't found an answer on the boards that deals with this. Here's a reproducible example:
require(strucchange)
# time series
nmreprosuccess <- c(0,0.50,NA,0.,NA,0.5,NA,0.50,0.375,0.53,0.846,0.44,1.0,0.285,
0.75,1,0.4,0.916,1,0.769,0.357)
dat.ts <- ts(nmreprosuccess, frequency=1, start=c(1996,1))
str(dat.ts)
Time-Series [1:21] from 1996 to 2016: 0 0.5 NA 0 NA 0.5 NA 0.5 0.375 0.53 ...
To me this means that the time series looks OK to work with.
# obtain breakpoints
bp.NMSuccess <- breakpoints(dat.ts~1)
summary(bp.NMSuccess)
Which gives:
Optimal (m+1)-segment partition:
Call:
breakpoints.formula(formula = dat.ts ~ 1)
Breakpoints at observation number:
m = 1 6
m = 2 3 7
m = 3 3 14 16
m = 4 3 7 14 16
m = 5 3 7 10 14 16
m = 6 3 7 10 12 14 16
m = 7 3 5 7 10 12 14 16
Corresponding to breakdates:
m = 1 0.333333333333333
m = 2 0.166666666666667 0.388888888888889
m = 3 0.166666666666667
m = 4 0.166666666666667 0.388888888888889
m = 5 0.166666666666667 0.388888888888889 0.555555555555556
m = 6 0.166666666666667 0.388888888888889 0.555555555555556 0.666666666666667
m = 7 0.166666666666667 0.277777777777778 0.388888888888889 0.555555555555556 0.666666666666667
m = 1
m = 2
m = 3 0.777777777777778 0.888888888888889
m = 4 0.777777777777778 0.888888888888889
m = 5 0.777777777777778 0.888888888888889
m = 6 0.777777777777778 0.888888888888889
m = 7 0.777777777777778 0.888888888888889
Fit:
m 0 1 2 3 4 5 6 7
RSS 1.6986 1.1253 0.9733 0.8984 0.7984 0.7581 0.7248 0.7226
BIC 14.3728 12.7421 15.9099 20.2490 23.9062 28.7555 33.7276 39.4522
Here's where I start having the problem. Instead of reporting the actual breakdates, it reports fractions, which makes it impossible to plot the break lines onto the graph because they're not at the breakdate (2002) but at 0.333.
plot.ts(dat.ts, main="Natural Mating")
lines(fitted(bp.NMSuccess, breaks = 1), col = 4, lwd = 1.5)
Nothing shows up for me in this graph (I think because it's so small for the scale of the graph).
In addition, when I try fixes that may possibly work around this problem,
fm1 <- lm(dat.ts ~ breakfactor(bp.NMSuccess, breaks = 1))
I get:
Error in model.frame.default(formula = dat.ts ~ breakfactor(bp.NMSuccess, :
variable lengths differ (found for 'breakfactor(bp.NMSuccess, breaks = 1)')
I get errors because of the NA values in the data: the length of dat.ts is 21, while the length of breakfactor(bp.NMSuccess, breaks = 1) is 18 (missing the 3 NAs).
Any suggestions?
The problem occurs because breakpoints() currently can only (a) cope with NAs by omitting them, and (b) cope with times/dates through the ts class. This creates the conflict because when you omit internal NAs from a ts it loses its ts property, and hence breakpoints() cannot infer the correct times.
The "obvious" way around this would be to use a time series class that can cope with this, namely zoo. However, I just never got round to fully integrate zoo support into breakpoints() because it would likely break some of the current behavior.
To cut a long story short: Your best choice at the moment is to do the book-keeping about the times yourself and not expect breakpoints() to do it for you. The additional work is not so huge. First, we create a data frame with the response and the time vector and omit the NAs:
d <- na.omit(data.frame(success = nmreprosuccess, time = 1996:2016))
d
## success time
## 1 0.000 1996
## 2 0.500 1997
## 4 0.000 1999
## 6 0.500 2001
## 8 0.500 2003
## 9 0.375 2004
## 10 0.530 2005
## 11 0.846 2006
## 12 0.440 2007
## 13 1.000 2008
## 14 0.285 2009
## 15 0.750 2010
## 16 1.000 2011
## 17 0.400 2012
## 18 0.916 2013
## 19 1.000 2014
## 20 0.769 2015
## 21 0.357 2016
Then we can estimate the breakpoint(s) and afterwards transform from the "number" of observations back to the time scale. Note that I'm setting the minimal segment size h explicitly here because the default of 15% is probably somewhat small for this short series; 4 is still small, but possibly enough for estimating a constant mean.
bp <- breakpoints(success ~ 1, data = d, h = 4)
bp
## Optimal 2-segment partition:
##
## Call:
## breakpoints.formula(formula = success ~ 1, h = 4, data = d)
##
## Breakpoints at observation number:
## 6
##
## Corresponding to breakdates:
## 0.3333333
We ignore the break "date" at 1/3 of the observations and simply map back to the original time scale:
d$time[bp$breakpoints]
## [1] 2004
To re-estimate the model with nicely formatted factor levels, we could do:
lab <- c(
paste(d$time[c(1, bp$breakpoints)], collapse = "-"),
paste(d$time[c(bp$breakpoints + 1, nrow(d))], collapse = "-")
)
d$seg <- breakfactor(bp, labels = lab)
lm(success ~ 0 + seg, data = d)
## Call:
## lm(formula = success ~ 0 + seg, data = d)
##
## Coefficients:
## seg1996-2004 seg2005-2016
## 0.3125 0.6911
Or for visualization:
plot(success ~ time, data = d, type = "b")
lines(fitted(bp) ~ time, data = d, col = 4, lwd = 2)
abline(v = d$time[bp$breakpoints], lty = 2)
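Optionally (a small addition of my own, not part of the original answer), a confidence interval for the break can be mapped back to calendar years in the same way and added to the plot:
ci <- confint(bp)                                  # interval in (NA-omitted) observation numbers
d$time[ci$confint]                                 # lower bound, breakpoint, upper bound as years
abline(v = d$time[ci$confint[, c(1, 3)]], lty = 3) # dotted lines for the interval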
One final remark: For such short time series where just a simple shift in the mean is needed, one could also consider conditional inference (aka permutation tests) rather than the asymptotic inference employed in strucchange. The coin package provides the maxstat_test() function exactly for this purpose (= short series where a single shift in the mean is tested).
library("coin")
maxstat_test(success ~ time, data = d, dist = approximate(99999))
## Approximative Generalized Maximally Selected Statistics
##
## data: success by time
## maxT = 2.3953, p-value = 0.09382
## alternative hypothesis: two.sided
## sample estimates:
## "best" cutpoint: <= 2004
This finds the same breakpoint and provides a permutation-test p-value. If, however, one has more data and needs multiple breakpoints and/or further regression coefficients, then strucchange would be needed.

Perform basic regression

I want to know how to do the following:
a <- data.frame(num = 1:10, numsqr = (1:10)^2)
b <- data.frame(num = 11:14, numsqr = 0)
fit <- lm(numsqr ~ num, data = a)
b$numsqr <- predict(fit, b)
print(b)
num numsqr
1 11 121
2 12 144
3 13 169
4 14 196
Right now I'm getting the following result
print(b)
num numsqr
1 11 99
2 12 110
3 13 121
4 14 132
How could I get my anticipated result?
To get the squared variable into the formula, you can use I or poly (the model is still linear in the coefficients); otherwise you are just fitting y = ax + b.
fit <- lm(numsqr ~ I(num^2), data=a)
fit <- lm(numsqr ~ poly(num, 2), data=a) # different model, same predictions
predict(fit, newdata=b)
# 1 2 3 4
# 121 144 169 196
If you assume that you have no idea whatsoever about the relationship in the data (in this case you know that y = x^2), it will be very difficult to get the exact values through a linear regression.
You can try converting the response variable to a logarithm for a slightly better fit that incorporates the curvature.
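For instance, going one step further and log-transforming both sides (a sketch of that idea, not the answerer's code) recovers the power relationship exactly for this data:
fit_log <- lm(log(numsqr) ~ log(num), data = a)  # log-log fit: log(y) = b0 + b1*log(x)
b$numsqr <- exp(predict(fit_log, newdata = b))   # back-transform to the original scale
print(b)
#   num numsqr
# 1  11    121
# 2  12    144
# 3  13    169
# 4  14    196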

How to sample/partition panel data by individuals( preferably with caret library)?

I would like to partition panel data and preserve the panel nature of the data:
library(caret)
library(mlbench)
#example panel data where id is the persons identifier over years
data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
## Here for instance the dependent variable is working
inTrain <- createDataPartition(y = data$WORKING, p = .75,list = FALSE)
# subset into training
training <- data[ inTrain,]
# subset into testing
testing <- data[-inTrain,]
# Here we see some intersections of identifiers
str(training$id[10:20])
str(testing$id)
However, when partitioning or sampling the data, I would like to avoid the same person (id) being split across the two data sets. Is there a way to randomly sample/partition the data and assign individuals to the corresponding partitions, rather than observations?
I tried to sample:
mysample <- data[sample(unique(data$id), 1000,replace=FALSE),]
However, that destroys the panel nature of the data...
I think there's a little bug in the sampling approach using sample(): it uses the id variable as if it were a row number. Instead, the code needs to fetch all rows belonging to an ID:
nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ]
testing <- data[!data$id %in% inTrainID, ]
head(training[, 1:5], 10)
# id FEMALE YEAR AGE HANDDUM
# 1 1 0 1984 54 0.0000000
# 2 1 0 1985 55 0.0000000
# 3 1 0 1986 56 0.0000000
# 8 3 1 1984 58 0.1687193
# 9 3 1 1986 60 1.0000000
# 10 3 1 1987 61 0.0000000
# 11 3 1 1988 62 1.0000000
# 12 4 1 1985 29 0.0000000
# 13 5 0 1987 27 1.0000000
# 14 5 0 1988 28 0.0000000
dim(data)
# [1] 27326 41
dim(training)
# [1] 20566 41
dim(testing)
# [1] 6760 41
20566/27326
### 75.26% were selected for training
Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.
table(data$WORKING) / nrow(data)
# 0 1
# 0.3229525 0.6770475
#
table(training$WORKING) / nrow(training)
# 0 1
# 0.3226685 0.6773315
#
table(testing$WORKING) / nrow(testing)
# 0 1
# 0.3238166 0.6761834
### virtually equal
I thought I would point out caret's groupKFold function for anyone looking at this, which would be handy for cross validation with this class of data. From the documentation:
"To split the data based on groups, groupKFold can be used:
set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
folds <- groupKFold(subjects, k = 15)
The results in folds can be used as inputs into the index argument of the trainControl function."
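For the panel data above, a minimal sketch (my own, not from the original answer) of building person-level folds and passing them to trainControl:
set.seed(3527)
id_folds <- groupKFold(data$id, k = 10)                # folds defined by person id, not by row
ctrl <- trainControl(method = "cv", index = id_folds)
# ctrl can then be supplied to caret::train(), e.g.
# train(factor(WORKING) ~ AGE + FEMALE, data = data, method = "glm", trControl = ctrl)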

Resources