Create partition based on two variables in R

I have a data set with two outcome variables, case1 and case2. case1 has 4 levels, while case2 has 50 (the number of levels in case2 could increase later). I would like to create a train/test data partition that keeps the class ratios for both variables. The real data is imbalanced for both case1 and case2. As an example,
library(caret)
set.seed(123)
matris <- matrix(rnorm(10), 1000, 20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p = 0.2)[[1]]
train1 <- df[-split1, ]
test1  <- df[split1, ]
length(split1)
# [1] 201

split2 <- createDataPartition(df$case2, p = 0.2)[[1]]
train2 <- df[-split2, ]
test2  <- df[split2, ]
length(split2)
# [1] 220
If I split separately, I get test sets of different lengths. If I do a single split based on case2 (the one with more classes), I lose the class ratio for case1.
I will be predicting the two cases separately, but in the end my accuracy will be computed from exact matches on both cases (e.g., ix = which(pred1 == case1 & pred2 == case2)), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!

If I understand correctly (which I do not guarantee), I can offer the following approach:
Group by case1 and case2 and get the group indices:
library(tidyverse)

df %>%
  select(case1, case2) %>%
  group_by(case1, case2) %>%
  group_indices() -> indices
Use these indices as the outcome variable in createDataPartition():
split1 <- createDataPartition(as.factor(indices), p = 0.2)[[1]]
Check whether the split is satisfactory:
table(df[split1, 22])  # case2 distribution in the 20% (test) split
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1, 22])  # case2 distribution in the remaining 80% (train)
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1, 21])  # case1 distribution in the 20% (test) split
#output
1 2 3 4
71 70 71 67
table(df[-split1, 21])  # case1 distribution in the remaining 80% (train)
#output
1 2 3 4
176 193 174 178
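The same stratification variable can also be built in base R with interaction(), which combines case1 and case2 into one factor before the split. A minimal sketch of that equivalent route, assuming the df created in the question:
library(caret)

# Combine the two outcomes into a single factor; drop = TRUE removes
# combinations that never occur in the data.
combo <- interaction(df$case1, df$case2, drop = TRUE)

# Stratify the 20% split on the combined factor.
split_combo <- createDataPartition(combo, p = 0.2)[[1]]
train <- df[-split_combo, ]
test  <- df[split_combo, ]

# Both outcomes should keep roughly their original class proportions.
prop.table(table(test$case1))
prop.table(table(test$case2))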

Related

Select every nth row, offset the start and repeat

I am trying to create a new column in a data.frame by selecting every 9th row of a column, starting at the first row (i.e. row 1, row 9, row 17, ...). Once it reaches the end of the column, I need it to repeat the process starting at row 2 (selecting row 2, row 10, row 18, ...). I have a fixed number of rows (96), so it needs to repeat until it would start on the 9th row, and then stop.
Here is an example of what I would like to do:
df <- data.frame(Row = 1:96)
df$nineth <- c(1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89,
               2, 10, 18, 26, 34, 42, 50, 58, 66, 74, 82, 90)
print(df)
Row nineth
1 1 1
2 2 9
3 3 17
4 4 25
5 5 33
6 6 41
7 7 49
8 8 57
9 9 65
10 10 73
11 11 81
12 12 89
13 13 2
14 14 10
15 15 18
16 16 26
17 17 34
18 18 42
19 19 50
20 20 58
21 21 66
22 22 74
23 23 82
24 24 90
Is there a way to do this using a for loop? I am more familiar with loops than with the apply family.
You can use R's matrix/vector duality to do this easily...
df <- data.frame(Row=1:96)
df$nineth <- as.vector(matrix(df$Row, byrow = TRUE, ncol = 8))
head(df,15)
Row nineth
1 1 1
2 2 9
3 3 17
4 4 25
5 5 33
6 6 41
7 7 49
8 8 57
9 9 65
10 10 73
11 11 81
12 12 89
13 13 2
14 14 10
15 15 18
The following also works:
n <- 9
df$nineth <- unlist(lapply(1:(n - 1), function(x) {
  df$Row[seq(x, nrow(df), by = n - 1)]
}))
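Since the question explicitly asks about a for loop, here is a sketch of the same logic as an explicit loop (assuming the 96-row df from the question; the step of 8 and the picked vector name are mine):
df <- data.frame(Row = 1:96)

step <- 8            # distance between selected rows (1, 9, 17, ... differ by 8)
picked <- integer(0)

# For each starting offset 1..8, collect every 8th row, then move to the next offset.
for (start in 1:step) {
  picked <- c(picked, df$Row[seq(start, nrow(df), by = step)])
}

df$nineth <- picked
head(df, 15)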

How to delete values from a column only from a threshold on? [duplicate]

I have a dataframe (k by 4). I have ordered one of the four columns in descending order (from 19 to -9, let's say). I would like to throw away the values that are smaller than 1.5.
I have tried, unsuccessfully, various combinations of the following code:
subset(w, select = -c(columnofinterest, <=1.50))
Can anyone help me?
Thanks a lot!
You can use arrange and filter from the dplyr package:
library(dplyr)

w <- data.frame(use_this = round(runif(100, min = -9, max = 19)),
                second   = runif(100),
                third    = runif(100),
                fourth   = runif(100)) %>%
  arrange(desc(use_this)) %>%
  filter(use_this >= 1.5)
Output:
> w
use_this second third fourth
1 19 0.264306555 0.11234097 0.30149863
2 19 0.574675520 0.50406805 0.71502833
3 19 0.376586752 0.21530618 0.35323250
4 18 0.949974135 0.46726122 0.36008741
5 17 0.339737597 0.11358402 0.04035303
6 16 0.180291264 0.81855913 0.16109650
7 16 0.958398058 0.94827266 0.54693974
8 16 0.297317238 0.28726682 0.63560208
9 16 0.653006870 0.15175848 0.69305851
10 16 0.685338886 0.30493976 0.89360112
11 16 0.493931093 0.52830391 0.68391458
12 16 0.945083084 0.19880501 0.66769341
13 16 0.910927578 0.86032225 0.73062990
14 15 0.662130980 0.19207451 0.44240610
15 15 0.730482762 0.92418574 0.46387086
16 15 0.547101759 0.87847767 0.27973739
17 15 0.487773258 0.05870471 0.40147753
18 15 0.695824922 0.91289504 0.94897518
19 14 0.576095914 0.42914670 0.27707368
20 14 0.156691824 0.02187951 0.31940887
21 13 0.079037019 0.16993999 0.53232350
22 13 0.944372064 0.63485350 0.23548337
23 13 0.016378244 0.42772076 0.76618218
24 13 0.606340182 0.33611591 0.36017352
25 13 0.170346203 0.43325314 0.16285515
26 13 0.605379012 0.95574187 0.23941377
27 12 0.157352454 0.90963650 0.01611328
28 12 0.353934785 0.80058806 0.13782414
29 12 0.464950823 0.81835421 0.12771521
30 12 0.624139506 0.69472154 0.02833191
31 11 0.362033514 0.98849181 0.37684822
32 11 0.067974815 0.24154922 0.49300890
33 11 0.522271380 0.03502680 0.50665790
34 10 0.810183210 0.56598130 0.41279787
35 10 0.609560713 0.46745813 0.34939724
36 10 0.087748839 0.56531646 0.02249387
37 10 0.008262635 0.68432285 0.35648525
38 10 0.757824842 0.57826099 0.89973902
39 10 0.428174539 0.12538288 0.69233083
40 10 0.785175550 0.21516237 0.36578714
41 10 0.631388832 0.63700087 0.40933640
42 10 0.171396873 0.37925970 0.27935731
43 10 0.773437320 0.24710107 0.23902388
44 8 0.443778088 0.77238651 0.08517639
45 8 0.954302451 0.87102748 0.52031446
46 8 0.347608835 0.79912385 0.36169856
47 8 0.839238717 0.54200177 0.52221408
48 8 0.235710838 0.85575923 0.78092366
49 7 0.610772265 0.16833538 0.94704562
50 7 0.242917834 0.02852729 0.87131760
51 7 0.875879507 0.04537683 0.81000861
52 7 0.577880660 0.54259171 0.43301336
53 6 0.541772984 0.06164861 0.62867700
54 6 0.071746509 0.51758874 0.70365933
55 5 0.103953563 0.99147043 0.33944620
56 5 0.504618656 0.95827073 0.65527417
57 5 0.726648637 0.37460291 0.47072657
58 5 0.796268586 0.09644167 0.93960812
59 5 0.796498528 0.68346948 0.23290885
60 5 0.490859592 0.76727730 0.39888256
61 5 0.949232913 0.02954981 0.56672834
62 4 0.360401806 0.62879833 0.31107107
63 4 0.926329930 0.87624801 0.91260914
64 4 0.922783983 0.11524112 0.06240194
65 3 0.518727534 0.23927630 0.37114683
66 3 0.951288192 0.58672287 0.45337659
67 3 0.767943126 0.76102957 0.24347122
68 2 0.786254279 0.39824869 0.58548193
69 2 0.321557042 0.75393236 0.43273743
70 2 0.872124621 0.89918160 0.55623725
71 2 0.242389529 0.85453423 0.78540085
72 2 0.013294874 0.61593974 0.70549476
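If you would rather stick with subset(), which the question attempted, the row condition goes in the second argument rather than in select. A short base-R sketch on a fresh copy of the data (here the unfiltered w_raw is rebuilt purely to illustrate the syntax):
# Rebuild the example data without any filtering (illustration only).
w_raw <- data.frame(use_this = round(runif(100, min = -9, max = 19)),
                    second   = runif(100),
                    third    = runif(100),
                    fourth   = runif(100))

# Keep only rows whose use_this value is at least 1.5,
# then order them by that column in descending order.
w_kept <- subset(w_raw, use_this >= 1.5)
w_kept <- w_kept[order(w_kept$use_this, decreasing = TRUE), ]
head(w_kept)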

Sum a variable based on another variable

I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents over six-month windows; that is, I want to sum 2017M01 through 2017M06 into one number, 2017M07 through 2017M12 into another, and so on.
I'm able to do this by indexing, but I want to be able to write something like "from 2017M01 to 2017M06, sum the Contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and the number of elements per group. In your case, you want to group every 6 rows, so the number of rows in your data frame should be divisible by 6. Using iris to demonstrate (it has 150 rows, so 150 / 6 = 25 groups):
rep(seq(nrow(iris) %/% 6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to call it. Here is a custom function that creates the grouping variable from a range string such as "2017M01:2017M06":
f1 <- function(x, df) {
  # Extract the start and end month numbers from a "YYYYMmm:YYYYMmm" string
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1                        # group size in months
  return(rep(seq(nrow(df) %/% i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: We can make the function handle row counts that are not exact multiples of the group size by appending the value max + 1 to the grouping vector, repeated as many times as the remainder, i.e.
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  final_v <- rep(seq(nrow(df) %/% i1), each = i1)
  if (nrow(df) %% i1 == 0) {
    return(final_v)
  } else {
    remainder <- nrow(df) %% i1
    final_v1 <- c(final_v, rep(max(final_v) + 1, remainder))
    return(final_v1)
  }
}
So for a data frame with 20 rows, grouped in sixes, the above function yields:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
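To then sum Contents per six-month block, the grouping vector can be passed to tapply(). A sketch on made-up data (the toy Time/Contents values below are mine, since the question's full data set is not shown):
# Toy data standing in for the question's Time/Contents columns.
dat <- data.frame(Time     = sprintf("2017M%02d", 1:12),
                  Contents = c(123, 456, 789, 100, 200, 300,
                               400, 500, 600, 700, 800, 900))

# Grouping vector produced by the helper above: 1 for the first six rows,
# 2 for the next six, and so on.
grp <- f1("2017M01:2017M06", dat)

# One sum of Contents per six-month block.
tapply(dat$Contents, grp, sum)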

Create new columns based on similar row values

How do I create a new set of data frame columns based on matched row values?
For instance, for this sample data frame:
x <- data.frame(cbind(numsp   = rep(c(16, 64, 256), each = 12),
                      Colless = rep(c("loIc", "midIc", "hiIc"), each = 4, times = 3),
                      lambdaE = rep(c(TRUE, FALSE), each = 2, times = 9),
                      ntree   = rep(c(1, 2), length.out = 36),
                      metric1 = seq(1:36),
                      metric2 = seq(1:36)))
For a given parameter, e.g. lambdaE, I'd like to create new columns for metric1 and metric2 based on whether lambdaE is TRUE or FALSE.
The data frame would look something like this:
x2 <- data.frame(cbind(numsp   = rep(c(16, 64, 256), each = 6),
                       Colless = rep(c("hiIc", "loIc", "midIc"), each = 2, times = 3),
                       ntree   = rep(c(1, 2), length.out = 18),
                       metric1.lambdE.FALSE = c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),
                       metric2.lambdE.FALSE = c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),
                       metric1.lambdE.TRUE  = c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18),
                       metric2.lambdE.TRUE  = c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18)))
Or alternatively for the parameter "Colless", a new set of columns for metric1 and metric2 for each level of Colless.
Thanks in advance!
Okay, it looks like base R's reshape() offers a quick solution:
reshape(x, direction = "wide", idvar = c("numsp", "Colless", "ntree"), timevar = "lambdaE")
melt and dcast from reshape2 can also be used:
library(reshape2)
mm <- melt(x, id = c('numsp', 'Colless', 'lambdaE', 'ntree'))
dcast(mm, numsp + Colless + ntree ~ lambdaE + variable)
numsp Colless ntree FALSE_metric1 FALSE_metric2 TRUE_metric1 TRUE_metric2
1 16 hiIc 1 11 11 9 9
2 16 hiIc 2 12 12 10 10
3 16 loIc 1 3 3 1 1
4 16 loIc 2 4 4 2 2
5 16 midIc 1 7 7 5 5
6 16 midIc 2 8 8 6 6
7 256 hiIc 1 35 35 33 33
8 256 hiIc 2 36 36 34 34
9 256 loIc 1 27 27 25 25
10 256 loIc 2 28 28 26 26
11 256 midIc 1 31 31 29 29
12 256 midIc 2 32 32 30 30
13 64 hiIc 1 23 23 21 21
14 64 hiIc 2 24 24 22 22
15 64 loIc 1 15 15 13 13
16 64 loIc 2 16 16 14 14
17 64 midIc 1 19 19 17 17
18 64 midIc 2 20 20 18 18
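For completeness, the same wide reshape can also be written with tidyr's pivot_wider(). A sketch, assuming the x defined in the question:
library(tidyr)

# Spread metric1 and metric2 into separate columns for each lambdaE level.
x_wide <- pivot_wider(x,
                      id_cols     = c(numsp, Colless, ntree),
                      names_from  = lambdaE,
                      values_from = c(metric1, metric2))
head(x_wide)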

How to obtain all possible sub-samples of size n from a dataframe of size N in R?

I have a data frame with 20 classroom indexes [1 to 20] and a different number of students in each class. How do I obtain all sub-samples of size n = 8 and store them? I want to use them later for calculations. I used combn(), but that takes only one vector; can I use it with a data frame, and how? (Sorry, I'm new to R.)
Data frame below:
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
9 9 32
10 10 26
11 11 27
12 12 34
13 13 27
14 14 28
15 15 33
16 16 21
17 17 36
18 18 24
19 19 19
20 20 32
It is as simple as passing a function to combn(); simplify = FALSE means that a list will be returned.
Assuming you want all possible combinations of 8 classrooms from the dataset classrooms:
combinations <- combn(nrow(classrooms), 8, function(x, data) data[x, ],
                      simplify = FALSE, data = classrooms)
head(combinations, n = 2)
[[1]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
[[2]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
9 9 32
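The stored list can then be used for whatever later calculations are needed; for example (an illustration only, not part of the original answer), counting the combinations and computing the mean number of students per sub-sample:
# Number of sub-samples: choose(20, 8) = 125970, so the list is large.
length(combinations)

# Mean number of students across the 8 classrooms in each sub-sample.
mean_students <- sapply(combinations, function(d) mean(d$students))
head(mean_students)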
