I have the following data frame and I want to break it up into 10 different data frames: the initial 100-row data frame split into 10 data frames of 10 rows each. I could do the following and get the desired result.
df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))
df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...
Of course, this isn't an elegant way to perform the task when the initial data frame is larger, or when it doesn't divide into a convenient number of segments.
So given the above, let's say we have the following data frame.
df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
Now I want to split it into new data frames of 200 rows each, with the final data frame holding the remaining rows. What would be a more elegant (aka 'quick') way to perform this task?
> str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
$ 0:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
..$ two : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
$ 1:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
..$ two : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
$ 2:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
..$ two : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
$ 3:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
..$ two : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
$ 4:'data.frame': 200 obs. of 3 variables:
..$ one : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
..$ two : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
$ 5:'data.frame': 123 obs. of 3 variables:
..$ one : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
..$ two : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...
If some code might have changed the rownames, it would be safer to use:
split(df, (seq(nrow(df))-1) %/% 200)
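Either way, split() returns a named list, so you can work with the chunks directly. A small usage sketch (the chunk size of 200 is just the one from the question):
chunks <- split(df, (seq(nrow(df)) - 1) %/% 200)
sapply(chunks, nrow)   # 200 200 200 200 200 123
head(chunks[["0"]])    # first chunk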
require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
print(df[min(i):max(i), ])
}
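If you would rather collect the pieces than print them, a similar base-R sketch (no ff needed; chunk_size is just an illustrative name) could be:
chunk_size <- 200
starts <- seq(1, nrow(df), by = chunk_size)
pieces <- lapply(starts, function(s) df[s:min(s + chunk_size - 1, nrow(df)), ])
length(pieces)  # 6 pieces for the 1123-row df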
If you can generate a vector that defines the groups, you can split anything:
f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200 3
$`2`
[1] 200 3
$`3`
[1] 200 3
$`4`
[1] 200 3
$`5`
[1] 200 3
$`6`
[1] 123 3
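Note that 1123 and 200 are hard-coded above; the same idea written against nrow(df) (a sketch, assuming a chunk size of 200) is:
f <- rep(seq_len(ceiling(nrow(df) / 200)), each = 200, length.out = nrow(df))
df1 <- split(df, f)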
This chops df into 1-million-row groups and appends them, a million rows at a time, to a table in SQL:
batchsize = 1000000 # vary to your liking
# cycles through data by batchsize
for (i in 1:ceiling(nrow(df)/batchsize))
{
print(i) # just to show the progress
# below shows how to cycle through data
batch <- df[(((i-1)*batchsize)+1):(batchsize*i), , drop=FALSE] # drop = FALSE keeps it from being converted to a vector
# if below not done then the last batch has Nulls above the number of rows of actual data
batch <- batch[!is.na(batch$ID),] # ID is a variable I presume is in every row
#in this case the table already existed, if new table overwrite = TRUE
(dbWriteTable(con, "df", batch, append = TRUE,row.names = FALSE))
}
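The loop above assumes an existing DBI connection con; a minimal sketch of setting one up (here with RSQLite purely as an example backend, and an illustrative file name) might look like:
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")  # illustrative database file
# ... run the batching loop above ...
dbDisconnect(con)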
Something like this...?
b <- seq(10, 100, 10)
lapply(seq_along(b), function(i) df[(b-9)[i]:b[i], ])
[[1]]
one two three
1 -2.4157992 -0.6232517 1.0531358
2 0.6769020 0.3908089 -1.9543895
3 0.9804026 -2.5167334 0.7120919
4 -1.2200089 0.5108479 0.5599177
5 0.4448290 -1.2885275 -0.7665413
6 0.8431848 -0.9359947 0.1068137
7 -1.8168134 -0.2418887 1.1176077
8 1.4475904 -0.8010347 2.3716663
9 0.7264027 -0.3573623 -1.1956806
10 0.2736119 -1.5553148 0.2691115
[[2]]
one two three
11 -0.3273536 -1.92475496 -0.08031696
12 1.5558892 -1.20158371 0.09104958
13 1.9202047 -0.13418754 0.32571632
14 -0.0515136 -2.15669216 0.23099397
15 0.1909732 -0.30802742 -1.28651457
16 0.8545580 -0.18238266 1.57093844
17 0.4903039 0.02895376 -0.47678196
18 0.5125400 0.97052082 -0.70541908
19 -1.9324370 0.22093545 -0.34436105
20 -0.5763433 0.10442551 -2.05597985
[[3]]
one two three
21 0.7168771 -1.22902943 -0.18728871
22 1.2785641 0.14686576 -1.74738091
23 -1.1856173 0.43829361 0.41269975
24 0.0220843 1.57428924 -0.80163986
25 -1.0012255 0.05520813 0.50871603
26 -0.1842323 -1.61195239 0.04843504
27 0.2328831 -0.38432225 0.95650710
28 0.8821687 -1.32456215 -1.33367967
29 -0.8902177 0.86414661 -1.39629358
30 -0.6586293 -2.27325919 0.27367902
[[4]]
one two three
31 1.3810437 -1.0178835 0.07779591
32 0.6102753 0.3538498 1.92316801
33 -1.5034439 0.7926925 2.21706284
34 0.8251638 0.3992922 0.56781321
35 -1.0832114 0.9878058 -0.16820827
36 -0.4132375 -0.9214491 1.06681472
37 -0.6787631 1.3497766 2.18327887
38 -3.0082585 -1.3047024 -0.04913214
39 -0.3433300 1.1008951 -2.02065141
40 0.6009334 1.2334421 0.15623298
[[5]]
one two three
41 -1.8608051 -0.08589437 0.02370983
42 -0.1829953 0.91139017 -0.01356590
43 1.1146731 0.42384993 -0.68717391
44 1.9039900 -1.70218225 0.06100297
45 -0.4851939 1.38712015 -1.30613414
46 -0.4661664 0.23504099 -0.29335162
47 0.5807227 -0.87821946 -0.14816121
48 -2.0168910 -0.47657382 0.90503226
49 2.5056404 0.27574224 0.10326333
50 0.2238735 0.34441325 -0.17186115
[[6]]
one two three
51 1.51613140 -2.5630782 -0.6720399
52 0.03859537 -2.6688365 0.3395574
53 -0.08695292 -0.5114117 -0.1378789
54 -0.51878363 -0.5401962 0.3946324
55 -2.20482710 0.1716744 0.1786546
56 -0.28133749 -0.4497112 0.5936497
57 -2.38269088 -0.4625695 1.0048914
58 0.37865952 0.5055141 0.3337986
59 0.09329172 0.1560469 0.2835735
60 -1.10818863 -0.2618910 0.3650042
[[7]]
one two three
61 -1.2507208 -1.5050083 -0.63871084
62 0.1379394 0.7996674 -1.80196762
63 0.1582008 -0.3208973 0.40863693
64 -0.6224605 0.1416938 -0.47174711
65 1.1556149 -1.4083576 -1.12619693
66 -0.6956604 0.7994991 1.16073748
67 0.6576676 1.4391007 0.04134445
68 1.4610598 -1.0066840 -1.82981058
69 1.1951788 -0.4005535 1.57256648
70 -0.1994519 0.2711574 -1.04364396
[[8]]
one two three
71 1.23897065 0.4473611 -0.35452535
72 0.89015916 2.3747385 0.87840852
73 -1.17339703 0.7433220 0.40232381
74 -0.24568490 -0.4776862 1.24082294
75 -0.47187443 -0.3271824 0.38542703
76 -2.20899136 -1.1131712 -0.33663075
77 -0.05968035 -0.6023045 -0.23747388
78 1.19687199 -1.3390960 -1.37884241
79 -1.29310506 0.3554548 -0.05936756
80 -0.17470891 1.6198307 0.69170207
[[9]]
one two three
81 -1.06792315 0.04801998 0.08166394
82 0.84152560 -0.45793907 0.27867619
83 0.07619456 -1.21633682 -2.51290495
84 0.55895466 -1.01844178 -0.41887672
85 0.33825508 -1.15061381 0.66206732
86 -0.36041720 0.32808609 -1.83390913
87 -0.31595401 -0.87081019 0.45369366
88 0.92331087 1.22055348 -1.91048757
89 1.30491142 1.22582353 -1.32244004
90 -0.32906839 1.76467263 1.84479228
[[10]]
one two three
91 2.80656707 -0.9708417 0.25467304
92 0.35770119 -0.6132523 -1.11467041
93 0.09598908 -0.5710063 -0.96412216
94 -1.08728715 0.3019572 -0.04422049
95 0.14317455 0.1452287 -0.46133199
96 -1.00218917 -0.1360570 0.88864256
97 -0.25316855 0.6341925 -1.37571664
98 0.36375921 1.2244921 0.12718650
99 0.13345555 0.5330221 -0.29444683
100 2.28548261 -2.0413222 -0.53209956
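The b <- seq(10, 100, 10) above is hard-wired to 100 rows in chunks of 10; a sketch of a generalization for an arbitrary number of rows and chunk size could be:
chunk_size <- 10
b <- seq(chunk_size, ceiling(nrow(df) / chunk_size) * chunk_size, by = chunk_size)
lapply(seq_along(b), function(i) df[(b[i] - chunk_size + 1):min(b[i], nrow(df)), ])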
I have some metabolomics data I am trying to process (validate the compounds that are actually present).
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I tried to write a loop so that if the Delta.ppm value is larger than (m/z - molecular weight)/molecular weight, the entire row is dropped from the resulting data frame.
for (i in 1:nrow(rawdata)) {
ppm <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
rawdata$Molecular.Weight[i]
if (ppm > rawdata$Delta.ppm[i]) {
filtered_data <- rbind(filtered_data, rawdata[i,])
}
}
Instead of giving me a new data frame with the validated compounds, it generates a single number for 'ppm' under the 'Values' section of the environment pane.
Still very new to R, any help is super appreciated!
No need to do this row-by-row, we can remove all undesired rows in one operation:
## base R
good <- with(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
newdat <- rawdat[good, ]
## dplyr
newdat <- filter(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
Iteratively adding rows to a frame using rbind(old, newrow) works in practice but scales horribly; see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows in old, which works but slows down a lot as the frame grows. It is far better to produce a list of the new rows and then rbind them all at once; e.g.,
out <- list()
for (...) {
# ... newrow ...
out <- c(out, list(newrow))
}
alldat <- do.call(rbind, out)
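As a concrete toy illustration of that pattern, here the rows of mtcars passing a made-up condition are collected in a list and bound once at the end:
rows <- list()
for (i in seq_len(nrow(mtcars))) {
  if (mtcars$mpg[i] > 25) {                 # stand-in for your real condition
    rows[[length(rows) + 1]] <- mtcars[i, ]
  }
}
result <- do.call(rbind, rows)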
If you do want the per-row loop (for example, to keep every ppm value), initialize ppm as a vector and filtered_data as an empty frame before the loop, and index into ppm:
ppm <- rep(NA_real_, nrow(rawdata))
filtered_data <- rawdata[0, ]
for (i in 1:nrow(rawdata)) {
  ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm[i] > rawdata$Delta.ppm.[i]) {
    filtered_data <- rbind(filtered_data, rawdata[i,])
  }
}
I changed my dataset to a data.table and I'm using sapply (apply family), but so far that hasn't been sufficient. Is this fully correct?
I already went from this:
library(data.table)
library(lubridate)
buying_volume_before_breakout <- list()
for (e in 1:length(df_1_30sec_5min$date_time)) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
buying_volume_before_breakout[[e]] <- sum(interval$Quantity)
}
To this (created a function and used sapply):
fun_buying_volume_before_breakout <- function(e) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
return(sum(interval$Quantity))
}
buying_volume_before_breakout <- sapply(1:length(df_1_30sec_5min$date_time), fun_buying_volume_before_breakout)
I couldn't make my data reproducible but here are some more insights about its structure.
> str(dolar_tick_data_unified_dt)
Classes ‘data.table’ and 'data.frame': 3120650 obs. of 6 variables:
$ date_time : POSIXct, format: "2017-06-02 09:00:35" "2017-06-02 09:00:35" "2017-06-02 09:00:35" ...
$ Buyer_from : Factor w/ 74 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 29 44 19 44 44 44 44 17 17 17 ...
$ Price : num 3271 3271 3272 3271 3271 ...
$ Quantity : num 5 5 5 5 5 5 10 5 50 25 ...
$ Seller_from: Factor w/ 73 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 34 34 42 28 28 28 28 34 45 28 ...
$ Type : Factor w/ 4 levels "Buyer","Direct",..: 1 3 1 1 1 1 1 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(df_1_30sec_5min)
Classes ‘data.table’ and 'data.frame': 3001 obs. of 13 variables:
$ date_time : POSIXct, format: "2017-06-02 09:33:30" "2017-06-02 09:49:38" "2017-06-02 10:00:41" ...
$ Price : num 3251 3252 3256 3256 3260 ...
$ fast_small_mm : num 3250 3253 3254 3256 3259 ...
$ slow_small_mm : num 3254 3253 3254 3256 3259 ...
$ fast_big_mm : num 3255 3256 3256 3256 3258 ...
$ slow_big_mm : num 3258 3259 3260 3261 3262 ...
$ breakout_strength : num 6.5 2 0.5 2 2.5 0.5 1 2.5 1 0.5 ...
$ buying_volume_before_breakout: num 1285 485 680 985 820 ...
$ total_volume_before_breakout : num 1285 485 680 985 820 ...
$ average_buying_volume : num 1158 338 318 394 273 ...
$ average_total_volume : num 1158 338 318 394 273 ...
$ relative_strenght : num 1 1 1 1 1 1 1 1 1 1 ...
$ relative_strenght_last_6min : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
First, separate the 'buyer' data from the rest. Then add a column for the start of the time interval and do a non-equi join in data.table, which is what @chinsoon is suggesting. I've made a reproducible example below:
library(data.table)
set.seed(123)
N <- 1e5
# Filter buyer details first
buyer_dt <- data.table(
tm = Sys.time()+runif(N,-1e6,+1e6),
quantity=round(runif(N,1,20))
)
time_dt <- data.table(
t = seq(
min(buyer_dt$tm),
max(buyer_dt$tm),
by = 15*60
)
)
t_int <- 300
time_dt[,t1:=t-t_int]
library(rbenchmark)
benchmark(
a={ # Your sapply code
bv1 <- sapply(1:nrow(time_dt), function(i){
buyer_dt[between(tm,time_dt$t[i]-t_int,time_dt$t[i]),sum(quantity)]
})
},
b={ # data.table non-equi join
all_intervals <- buyer_dt[time_dt,.(t,quantity),on=.(tm>=t1,tm<=t)]
bv2 <- all_intervals[,sum(quantity),by=.(t)]
}
,replications = 9
)
#> test replications elapsed relative user.self sys.self user.child
#> 1 a 9 42.75 158.333 81.284 0.276 0
#> 2 b 9 0.27 1.000 0.475 0.000 0
#> sys.child
#> 1 0
#> 2 0
Edit: In general, any join of two tables A and B is a subset of the outer join [A x B]. The rows of [A x B] contain all possible combinations of the rows of A and the rows of B. An equi join subsets [A x B] by checking equality conditions, i.e. if x and y are the join columns in A and B, your join will be: rows from [A x B] where A.x = B.x and A.y = B.y.
In a NON-equi join, the subset condition uses comparison operators OTHER than =, for example, as in your case, where you want rows such that A.x <= B.x <= A.x + delta.
I don't know much about how they are implemented, but data.table has a pretty fast one that has worked well for me with large data frames.
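A tiny self-contained example of a data.table non-equi join (toy tables and illustrative column names):
library(data.table)
intervals <- data.table(grp = c("a", "b"), lo = c(0, 10), hi = c(5, 15))
points <- data.table(x = c(1, 3, 12, 20), val = c(10, 20, 30, 40))
# all (interval, point) pairs with lo <= x <= hi, keeping the interval label
pairs <- points[intervals, .(grp, val), on = .(x >= lo, x <= hi)]
pairs[, .(total = sum(val)), by = grp]   # per-interval sums; x = 20 matches nothing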
EDIT: The problem was not within the geoMean function, but with a wrong use of aggregate(), as explained in the comments
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the built-in function geoMean from the package EnvStats like this
library("EnvStats")
aggregate(df[, 2:3], list(df$species), geoMean, na.rm=TRUE)
it works wonderful and skips the NAs to give me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over NAs and returns NA as the result, even though there are, e.g., 10 numeric values and only one NA. This happens, for example, with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?
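For reference, one NA-tolerant way to get per-species geometric means without aggregate() is to group and summarise with a hand-rolled exp(mean(log(x))). This is only a sketch against the small example df above, not a diagnosis of the aggregate() issue:
library(dplyr)
gm <- function(x) exp(mean(log(x), na.rm = TRUE))   # geometric mean, NAs skipped
df %>%
  group_by(species) %>%
  summarise(across(c(phen, hveg), gm))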
I am very new to R and had a question you may find simple. I have two data frames with exactly the same column names. One data frame has around 58k rows (each row is an article number and each column is a month; the values are quantities). The second data frame is a much smaller subset of the first (around 1000 rows), and its rows will always have a match in the first. What I need to do is subtract the second data frame's quantities for each month/article from the first, larger data frame. It is almost like a vlookup on two values. Any ideas?
UPDATE: this is what I think it would look like in SQL:
SELECT I.Division,
ILS.Brand,
ILS.Cust #,
ILS.Article,
ILS.201811change - SLT.201811change AS '201811change',
ILS.201812change - SLT.201812change AS '201812change',
ILS.201901change - SLT.201901change AS '201901change',
ILS.201903change,
ILS.201904change,
ILS.201905change,
ILS.201906change,
ILS.201907change,
ILS.201808change,
ILS.201809change
FROM ILS LEFT OUTER JOIN SLT ON ILS.Article = SLT.Article
You can use the left_join function from dplyr, which is the analog of SQL's LEFT JOIN. In your case, in simplified form, it would be ILS %>% left_join(SLT, by = "Article"). Please see the full code below:
library(dplyr)
# data.frame simulation
strs3 <- c("Brand", "Cust", "Article", "201808change", "201809change", "201903change", "201904change", "201905change",
"201906change", "201907change", "201811change", "201812change", "201901change")
n <- 1000
total <- cbind(
as.data.frame(matrix(sample(LETTERS, 3 * n, replace = TRUE), ncol = 3)),
matrix(rnorm(n * 10), ncol = 10)
)
names(total) <- strs3
spl <- ceiling(n * 57 / 58)
ils <- total[1:spl, ]
u <- unique(ils$Article)
ul <- length(u)
slt <- total[(spl + 1): (spl + ul), ]
slt$Article <- u
# left join
z <- ils %>% left_join(slt, by = "Article") %>%
mutate(`201811change` = `201811change.x` - `201811change.y`) %>%
mutate(`201812change` = `201812change.x` - `201812change.y`) %>%
mutate(`201901change` = `201901change.x` - `201901change.y`) %>%
select(-ends_with("y")) %>% select(-one_of("201811change.x", "201812change.x", "201901change.x"))
str(z)
Output (structure of the resultant data frame):
'data.frame': 983 obs. of 13 variables:
$ Brand.x : Factor w/ 26 levels "A","B","C","D",..: 16 23 19 20 19 26 7 21 22 9 ...
$ Cust.x : Factor w/ 26 levels "A","B","C","D",..: 21 15 25 3 24 2 1 26 3 23 ...
$ Article : Factor w/ 26 levels "A","B","C","D",..: 13 14 2 17 23 13 4 1 17 15 ...
$ 201808change.x: num -1.398 -0.357 -1.042 -0.653 -1.037 ...
$ 201809change.x: num 1.483 0.604 0.276 0.846 -1.245 ...
$ 201903change.x: num -0.733 -0.413 0.61 -1.037 1.048 ...
$ 201904change.x: num -0.794 -1.0688 0.577 0.3368 0.0472 ...
$ 201905change.x: num -0.427 -0.898 1.124 -0.435 -0.304 ...
$ 201906change.x: num 2.094 0.177 -0.892 -1.655 -1.091 ...
$ 201907change.x: num 0.228 0.546 0.141 -1.166 -0.687 ...
$ 201811change : num 1.5082 0.0148 -0.5335 -0.763 -1.7196 ...
$ 201812change : num 1.415 -2.128 -0.576 1.205 -0.631 ...
$ 201901change : num -0.883 -0.892 -2.032 -2.172 0.483 ...
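If there are many month columns, spelling out one mutate() per column gets tedious. A sketch that loops over a vector of shared column names after the join (coalesce() treats unmatched articles as zero; the suffix and column vector are assumptions for illustration) could be:
library(dplyr)
cols <- c("201811change", "201812change", "201901change")  # extend to all shared months
z <- ils %>% left_join(slt, by = "Article", suffix = c("", ".slt"))
for (cl in cols) {
  z[[cl]] <- z[[cl]] - coalesce(z[[paste0(cl, ".slt")]], 0)  # subtract SLT quantities where present
}
z <- select(z, -ends_with(".slt"))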
How do I automatically split a matrix using R for 5-fold cross-validation?
I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split:
X <- matrix(rnorm(1000), ncol = 5)
id <- sample(1:5, nrow(X), replace = TRUE)
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE]) # list of the 5 matrices
X[id == 2, ] # gives you the second matrix
I'd work with the list, as it allows you to do something like:
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample:
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a test matrix that's double the size of the train matrices.
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like:
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
Another option is a function that returns the train and test indices for each of the K folds:
f_K_fold <- function(Nobs,K=5){
rs <- runif(Nobs)
id <- seq(Nobs)[order(rs)]
k <- as.integer(Nobs*seq(1,K-1)/K)
k <- matrix(c(0,rep(k,each=2),Nobs),ncol=2,byrow=TRUE)
k[,1] <- k[,1]+1
l <- lapply(seq.int(K),function(x,k,d)
list(train=d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test=d[seq(k[x,1],k[x,2])]),k=k,d=id)
return(l)
}
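A quick usage sketch (assuming the matrix X from above):
folds <- f_K_fold(nrow(X), K = 5)
X_train <- X[folds[[1]]$train, ]   # training rows of the first fold
X_test  <- X[folds[[1]]$test, ]    # test rows of the first fold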
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indicies <- lapply(seq_len(k), function(a) list(
test_matrix_indices = which(id==a),
train_matrix_indices = which(id!=a)
))
str(indicies)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
test_matrix = X[id==a, ],
train_matrix = X[id!=a, ]
))
str(matrices)
List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
  # build_model() and performance() are placeholders for your own functions
  m <- build_model(x$train_matrix)
  performance(m, x$test_matrix)
})
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
id <- sample(rep(seq.int(K), length.out=Nobs))
l <- lapply(seq.int(K), function(x) list(
train = which(x!=id),
test = which(x==id)
))
return(l)
}
Edit: Thanks for your answers. I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf):
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
Then I use:
for (k in 1:K) {
  matrix_train <- mydata[block != k, ]
  matrix_test  <- mydata[block == k, ]
  [Algorithm sequence]
}
in order to generate the adequate sets for each iterations.
However, when n is not divisible by K, this solution can leave some individuals out of every test set, so I do not recommend it.
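A quick illustration of why: with n = 999 and K = 5, size is 199, so the block index runs from 1 to 6 and the 4 leftover rows land in a sixth block that no test fold ever selects.
n <- 999; K <- 5
size <- n %/% K                        # 199
block <- (rank(runif(n)) - 1) %/% size + 1
table(block)
#   1   2   3   4   5   6
# 199 199 199 199 199   4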
Below does the trick without having to create separate data.frames/matrices; all you need to do is keep an integer sequence, id, that stores the shuffled indices for each fold.
X <- read.csv('data.csv')
k = 5 # number of folds
fold_size <- nrow(X) / k               # assumes nrow(X) is divisible by k
indices <- rep(1:k, rep(fold_size, k))
id <- sample(indices, replace = FALSE) # random draws without replacement
log_models <- new.env(hash=T, parent=emptyenv())
for (i in 1:k){
train <- X[id != i,]
test <- X[id == i,]
# run algorithm, e.g. logistic regression
log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx , ]
If you have a spatial data set (with coords), you can also visualize your generated folds
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).