Hierarchical clustering of a time-series - r

I am struggling with hierarchical clustering. I have the following time series and I want to cluster it based on time. Would the transpose function work for this?
structure(list(`04:00` = c(0, 0, 0, 0, 0, 0), `04:10` = c(0,
0, 0, 0, 0, 0), `04:20` = c(0, 0, 0, 0, 0, 0), `04:30` = c(0,
0, 0, 0, 0, 0), `04:40` = c(0, 0, 0, 0, 0, 0), `04:50` = c(0,
0, 0, 0, 0, 0), `05:00` = c(0, 0, 0, 0, 0, 0), `05:10` = c(0,
0, 0, 0, 0, 0), `05:20` = c(0, 0, 0, 0, 0, 0), `05:30` = c(0,
0, 0, 0, 0, 0), `05:40` = c(0, 0, 0, 0, 0, 0), `05:50` = c(1,
0, 0, 0, 0, 0), `06:00` = c(1, 0, 0, 0, 0, 0), `06:10` = c(1,
0, 0, 0, 0, 0), `06:20` = c(2, 0, 0, 0, 0, 0), `06:30` = c(0,
0, 0, 0, 0, 0), `06:40` = c(0, 1, 0, 0, 0, 0), `06:50` = c(0,
2, 0, 0, 0, 1), `07:00` = c(0, 0, 0, 0, 0, 2), `07:10` = c(0,
0, 1, 0, 0, 2), `07:20` = c(0, 0, 0, 0, 0, 2), `07:30` = c(0,
0, 1, 0, 0, 0), `07:40` = c(1, 0, 1, 0, 0, 0), `07:50` = c(1,
0, 0, 0, 2, 0), `08:00` = c(1, 0, 0, 0, 0, 0), `08:10` = c(1,
0, 0, 0, 0, 0), `08:20` = c(2, 0, 0, 0, 0, 0), `08:30` = c(2,
0, 0, 0, 0, 0), `08:40` = c(2, 0, 0, 0, 0, 0), `08:50` = c(2,
0, 0, 0, 0, 0), `09:00` = c(0, 0, 0, 0, 0, 0), `09:10` = c(0,
0, 0, 0, 0, 0), `09:20` = c(0, 1, 0, 0, 0, 0), `09:30` = c(0,
1, 0, 2, 0, 0), `09:40` = c(0, 1, 0, 0, 0, 0), `09:50` = c(0,
1, 0, 0, 0, 0), `10:00` = c(0, 0, 0, 0, 0, 0), `10:10` = c(0,
0, 0, 0, 0, 0), `10:20` = c(0, 1, 0, 0, 0, 0), `10:30` = c(0,
1, 0, 0, 0, 0), `10:40` = c(0, 0, 0, 0, 0, 0), `10:50` = c(0,
0, 0, 0, 0, 0), `11:00` = c(2, 0, 0, 1, 0, 0), `11:10` = c(0,
0, 0, 1, 0, 0), `11:20` = c(0, 0, 0, 1, 0, 1), `11:30` = c(0,
0, 0, 1, 0, 1), `11:40` = c(0, 0, 0, 1, 0, 1), `11:50` = c(0,
0, 0, 1, 0, 0), `12:00` = c(0, 0, 0, 1, 2, 0), `12:10` = c(0,
0, 0, 1, 0, 0), `12:20` = c(0, 0, 0, 1, 0, 0), `12:30` = c(0,
0, 0, 1, 0, 0), `12:40` = c(0, 0, 0, 1, 0, 0), `12:50` = c(0,
0, 0, 1, 1, 0), `13:00` = c(0, 0, 0, 0, 1, 0), `13:10` = c(0,
0, 0, 0, 1, 0), `13:20` = c(0, 0, 0, 0, 1, 0), `13:30` = c(0,
0, 0, 0, 1, 0), `13:40` = c(0, 0, 0, 0, 1, 0), `13:50` = c(0,
0, 0, 0, 1, 0), `14:00` = c(0, 0, 0, 0, 1, 0), `14:10` = c(0,
0, 0, 0, 1, 0), `14:20` = c(0, 0, 0, 0, 1, 0), `14:30` = c(0,
0, 0, 0, 1, 0), `14:40` = c(0, 0, 0, 0, 1, 0), `14:50` = c(0,
0, 0, 0, 0, 0), `15:00` = c(0, 0, 0, 0, 0, 0), `15:10` = c(0,
2, 0, 0, 0, 0), `15:20` = c(0, 2, 0, 0, 1, 0), `15:30` = c(0,
2, 0, 0, 1, 1), `15:40` = c(0, 2, 0, 0, 1, 0), `15:50` = c(0,
2, 0, 0, 1, 0), `16:00` = c(0, 2, 0, 0, 1, 0), `16:10` = c(0,
2, 0, 0, 1, 0), `16:20` = c(2, 2, 0, 0, 1, 0), `16:30` = c(2,
2, 0, 0, 1, 2), `16:40` = c(2, 2, 0, 0, 1, 1), `16:50` = c(2,
2, 0, 0, 0, 1), `17:00` = c(0, 2, 0, 0, 2, 0), `17:10` = c(0,
0, 0, 0, 2, 0), `17:20` = c(0, 0, 0, 0, 2, 0), `17:30` = c(0,
0, 0, 0, 2, 0), `17:40` = c(0, 0, 0, 0, 0, 0), `17:50` = c(0,
0, 0, 0, 0, 0), `18:00` = c(0, 2, 0, 0, 0, 2), `18:10` = c(0,
2, 0, 0, 0, 2), `18:20` = c(0, 0, 0, 0, 2, 2), `18:30` = c(0,
0, 0, 0, 0, 2), `18:40` = c(0, 0, 0, 0, 0, 2), `18:50` = c(1,
0, 0, 0, 0, 2), `19:00` = c(1, 0, 0, 1, 1, 0), `19:10` = c(1,
0, 0, 1, 1, 0), `19:20` = c(1, 0, 0, 1, 1, 0), `19:30` = c(1,
0, 1, 1, 1, 0), `19:40` = c(1, 0, 1, 1, 1, 1), `19:50` = c(1,
0, 1, 1, 1, 1), `20:00` = c(0, 0, 1, 1, 1, 1), `20:10` = c(0,
0, 1, 1, 1, 1), `20:20` = c(0, 0, 1, 1, 1, 1), `20:30` = c(0,
1, 2, 1, 1, 1), `20:40` = c(0, 1, 0, 1, 1, 1), `20:50` = c(0,
1, 0, 1, 1, 1), `21:00` = c(0, 1, 0, 1, 1, 1), `21:10` = c(0,
1, 0, 0, 1, 1), `21:20` = c(0, 1, 0, 0, 1, 1), `21:30` = c(0,
1, 1, 0, 1, 1), `21:40` = c(0, 1, 1, 0, 1, 1), `21:50` = c(0,
1, 1, 0, 0, 1), `22:00` = c(0, 1, 1, 0, 0, 0), `22:10` = c(0,
1, 0, 0, 0, 0), `22:20` = c(0, 1, 0, 0, 0, 0), `22:30` = c(0,
1, 0, 0, 0, 0), `22:40` = c(0, 1, 0, 0, 0, 0), `22:50` = c(0,
1, 0, 0, 0, 0), `23:00` = c(0, 0, 0, 0, 1, 0), `23:10` = c(0,
0, 0, 0, 0, 1), `23:20` = c(0, 0, 0, 0, 0, 1), `23:30` = c(0,
0, 0, 0, 0, 1), `23:40` = c(0, 0, 0, 0, 0, 1), `23:50` = c(0,
0, 0, 0, 0, 0), `00:00` = c(0, 0, 0, 0, 0, 0), `00:10` = c(0,
0, 0, 0, 0, 0), `00:20` = c(0, 0, 0, 0, 0, 0), `00:30` = c(0,
0, 0, 0, 0, 0), `00:40` = c(0, 0, 0, 0, 0, 0), `00:50` = c(0,
0, 0, 0, 0, 0), `01:00` = c(0, 0, 0, 0, 0, 0), `01:10` = c(0,
0, 0, 0, 0, 0), `01:20` = c(0, 0, 0, 0, 0, 0), `01:30` = c(0,
0, 0, 0, 0, 0), `01:40` = c(0, 0, 0, 0, 0, 0), `01:50` = c(0,
0, 0, 0, 0, 0), `02:00` = c(0, 0, 0, 0, 0, 0), `02:10` = c(0,
0, 0, 0, 0, 0), `02:20` = c(0, 0, 0, 0, 0, 0), `02:30` = c(0,
0, 0, 0, 0, 0), `02:40` = c(0, 0, 0, 0, 0, 0), `02:50` = c(0,
0, 0, 0, 0, 0), `03:00` = c(0, 0, 0, 0, 0, 0), `03:10` = c(0,
0, 0, 0, 0, 0), `03:20` = c(0, 0, 0, 0, 0, 0), `03:30` = c(0,
0, 0, 0, 0, 0), `03:40` = c(0, 0, 0, 0, 0, 0), `03:50` = c(0,
0, 0, 0, 0, 0)), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
I managed to run hierarchical clustering, but only on the cases (rows) and not on the time points:
d_distance <- dist(as.matrix(df))
plot(hclust(d_distance))
The plot that I generated
As you can see in the plot, the leaves of the dendrogram are labelled with row indexes. How can I label them with the times instead (maybe by transposing)? I would also like to plot each time-series cluster separately, as in the example plot. Would DTW be better than hierarchical clustering?
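A minimal sketch of the transpose idea, assuming a small made-up data frame (`toy`) in place of the real data: `dist()` compares rows, so transposing with `t()` makes the time points the objects being clustered, and the dendrogram leaves then carry the column names. The object names and `k = 2` are illustrative choices, not part of the question.

```r
# Toy data: 3 cases (rows) x 4 time points (columns); values are made up.
toy <- data.frame(
  `04:00` = c(0, 0, 1),
  `04:10` = c(0, 0, 1),
  `04:20` = c(2, 1, 0),
  `04:30` = c(2, 1, 0),
  check.names = FALSE
)

# t() turns the time columns into rows, so dist() compares time points.
time_dist <- dist(t(as.matrix(toy)))
time_tree <- hclust(time_dist)

# The leaf labels are now the times; plot(time_tree) would draw them.
print(time_tree$labels)

# Cut the tree into k groups; k = 2 is an arbitrary choice for this toy case.
time_groups <- cutree(time_tree, k = 2)
print(time_groups)
```

Note that DTW is a distance measure rather than a clustering method, so it would replace the Euclidean distance computed by `dist()` here, not `hclust()` itself.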

Related

Changing a character column into a continuous column, by dividing them into sections (1,2,3,4)

I have a data set that I'm trying to run a glm regression on. However, it contains character columns for age group, race, and comorbidity class. I would like to change those columns into continuous (numeric) variables so that the regression can accept them. In the data below, I want to recode TBI.irace2 as (Hispanic = 1, Black = 2, White = 3, other = 4), age as (18-28 = 1, 29-46 = 2, 47-64 = 3, >64 = 4), and NISS as (0-10 = 1, 11-20 = 2, 21-30 = 3, 31-40 = 4, 41-50 = 5, 51-60 = 6, 61-70 = 7, >70 = 8).
A sample of the data is below:
TBI.crani = c(0, 0, 0, 0, 0, 0), TBI.vte = c(0,
0, 0, 0, 0, 0), TBI.FEMALE = c(0, 0, 1, 0, 1, 0), TBI.iracecat2 = c("Whites",
"Whites", "Whites", "Hispanics", "Whites", "Blacks"), TBI.agecat = c("Age 47-64",
"Age 29-46", "Age > 64", "Age 29-46", "Age 18-28", "Age 18-28"
), TBI.nisscategory = c("NISS 21-30", "NISS 11-20", "NISS 21-30",
"NISS 11-20", "NISS 11-20", "NISS 0-10"), TBI.LOS = c(5, 8, 1,
3, 19, 1), TBI.hospitalteach = c(0, 0, 1, 1, 1, 1), TBI.largebedsize = c(1,
1, 1, 1, 1, 1), TBI.CM_ALCOHOL = c(0, 0, 0, 1, 0, 0), TBI.CM_ANEMDEF = c(0,
0, 0, 0, 0, 0), TBI.CM_BLDLOSS = c(0, 0, 0, 0, 0, 0), TBI.CM_CHF = c(1,
0, 0, 0, 0, 0), TBI.CM_CHRNLUNG = c(0, 0, 0, 0, 0, 0), TBI.CM_COAG = c(0,
0, 0, 0, 1, 0), TBI.CM_HYPOTHY = c(0, 0, 0, 0, 0, 0), TBI.CM_LYTES = c(0,
0, 0, 0, 0, 0), TBI.CM_METS = c(0, 0, 0, 0, 0, 0), TBI.CM_NEURO = c(0,
0, 0, 0, 0, 0), TBI.CM_OBESE = c(0, 0, 0, 0, 0, 0), TBI.CM_PARA = c(0,
0, 0, 0, 0, 0), TBI.CM_PSYCH = c(0, 1, 0, 0, 0, 0), TBI.CM_TUMOR = c(0,
0, 0, 0, 0, 0), TBI.CM_WGHTLOSS = c(0, 0, 0, 0, 0, 0), TBI.UTI = c(0,
0, 0, 0, 0, 0), TBI.pneumonia = c(0, 0, 0, 0, 0, 0), TBI.AMI = c(0,
0, 0, 0, 0, 0), TBI.sepsis = c(0, 0, 0, 0, 0, 0), TBI.arrest = c(0,
0, 0, 0, 0, 0), TBI.spineinjury = c(0, 0, 0, 0, 0, 0), TBI.legfracture = c(0,
0, 0, 0, 0, 0), TBI_time_to_surg.NEW = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
A small tip: provide a sample data set that is just big enough to address your question.
library(data.table)
# took a small sample and changed one value to Asian
dt <- data.table(
TBI.FEMALE = c(0, 0, 1, 0, 1, 0),
TBI.iracecat2 = as.character(c("Whites", "Whites", "Asian", "Hispanics", "Whites", "Blacks"))
)
# define race groups, and note I did not define Asian
convert_race <- c("Hispanics" = 1, "Blacks" = 2, "Whites" = 3) # other will all be not defined
dt[, TBI.irace2 := lapply(TBI.iracecat2, function(x) convert_race[x]), by = TBI.iracecat2]
dt[is.na(TBI.irace2), TBI.irace2 := 4]
dt
# TBI.FEMALE TBI.iracecat2 TBI.irace2
# 1: 0 Whites 3
# 2: 0 Whites 3
# 3: 1 Asian 4
# 4: 0 Hispanics 1
# 5: 1 Whites 3
# 6: 0 Blacks 2
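The age and NISS categories can be recoded along the same lines. The sketch below uses base R's `match()` against an ordered vector of labels. The labels are copied from the sample data where they appear; `"NISS 31-40"` and the new column names (`TBI.age_code`, `TBI.niss_code`) are assumptions for illustration, and the NISS lookup would need extending with the exact spelling of the higher categories.

```r
# Toy input using category labels in the style of the question's data.
toy <- data.frame(
  TBI.agecat = c("Age 47-64", "Age 29-46", "Age > 64", "Age 18-28"),
  TBI.nisscategory = c("NISS 21-30", "NISS 11-20", "NISS 0-10", "NISS 11-20"),
  stringsAsFactors = FALSE
)

# Ordered lookups; the position of a label is its numeric code.
# "NISS 31-40" is an assumed spelling, and higher categories are left out.
age_levels <- c("Age 18-28", "Age 29-46", "Age 47-64", "Age > 64")
niss_levels <- c("NISS 0-10", "NISS 11-20", "NISS 21-30", "NISS 31-40")

# match() returns NA for any label missing from the lookup.
toy$TBI.age_code <- match(toy$TBI.agecat, age_levels)
toy$TBI.niss_code <- match(toy$TBI.nisscategory, niss_levels)
print(toy)
```

Integer codes treat the categories as equally spaced. glm() also accepts factor columns directly, which may be a better fit for an unordered variable such as race.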

R vegan package error says data can't contain NA, but the dataframe doesn't contain NAs

I am trying to run an NMDS on some data, using the metaMDS function in the R vegan package. I've managed to run it with a similar dataframe, but for some reason I'm getting the following error with this one:
>Error in cmdscale(dist, k = k) : NA values not allowed in 'd'
In addition: Warning messages:
1: In distfun(comm, method = distance, ...) :
you have empty rows: their dissimilarities may be meaningless in method “bray”
2: In distfun(comm, method = distance, ...) : missing values in results
As it's a large dataframe, I've put it into a Google sheet here
For context, the rows are samples and the columns are genes, with the value indicating the level of the gene in the sample.
With the NMDS, I want to see how similar the samples are, and as far as I understand, the data are set up correctly for that.
So I tried running the following:
library(vegan)
NMDS <- metaMDS(NMDS, distance="bray")
where NMDS is the data frame. This is where I get the error above, and I'm not sure what I've done wrong.
This also happens after I run the following code:
NMDS[is.na(NMDS)] = 0
Any ideas where I'm going wrong?
dput:
structure(list(X1 = c(0, 0, 0, 0, 0, 0), X2 = c(0, 0, 0, 0, 0,
0), X3 = c(0, 0, 0, 0, 0, 0), X4 = c(0, 0, 0, 0, 0, 0), X5 = c(0,
0, 0, 0, 0, 0), X6 = c(0, 28, 161, 688, 0, 0), X7 = c(0, 3, 14,
0, 0, 0), X8 = c(0, 0, 0, 0, 0, 0), X9 = c(3, 0, 2, 2, 0, 0),
X10 = c(12, 78, 602, 303, 900, 0), X11 = c(0, 52, 856, 28,
191, 0), X12 = c(0, 51, 12, 1, 0, 0), X13 = c(0, 0, 0, 0,
0, 0), X14 = c(0, 0, 2, 0, 0, 0), X15 = c(5, 17, 46, 39,
9, 0), X16 = c(5255, 1531, 6790, 3302, 5084, 0), X17 = c(0,
0, 0, 0, 0, 0), X18 = c(0, 0, 15, 0, 0, 0), X19 = c(0, 0,
0, 0, 0, 0), X20 = c(0, 0, 0, 0, 0, 0), X21 = c(0, 0, 0,
0, 0, 0), X22 = c(0, 0, 0, 0, 0, 0), X23 = c(0, 0, 0, 0,
0, 0), X24 = c(0, 0, 44, 0, 0, 0), X25 = c(0, 0, 0, 0, 0,
0), X26 = c(0, 6, 24, 185, 0, 0), X27 = c(0, 0, 0, 0, 0,
0), X28 = c(0, 0, 13, 0, 0, 0), X29 = c(0, 0, 0, 0, 0, 0),
X30 = c(0, 0, 0, 7, 0, 0), X31 = c(0, 0, 0, 0, 0, 0), X32 = c(0,
0, 0, 0, 0, 0), X33 = c(0, 0, 1, 2, 0, 0), X34 = c(0, 0,
0, 0, 0, 0), X35 = c(0, 0, 0, 0, 0, 0), X36 = c(0, 2, 0,
0, 0, 0), X37 = c(0, 0, 0, 0, 0, 0), X38 = c(0, 0, 0, 0,
0, 0), X39 = c(0, 0, 0, 0, 0, 0), X40 = c(0, 0, 0, 0, 0,
0), X41 = c(0, 0, 0, 0, 0, 0), X42 = c(0, 0, 0, 0, 0, 0),
X43 = c(0, 0, 0, 0, 0, 0), X44 = c(0, 0, 0, 0, 0, 0), X45 = c(0,
0, 0, 1, 0, 0), X46 = c(0, 0, 0, 63, 0, 0), X47 = c(0, 0,
0, 0, 0, 0), X48 = c(0, 0, 0, 0, 0, 0), X49 = c(0, 0, 0,
0, 0, 0), X50 = c(0, 0, 0, 0, 0, 0), X51 = c(0, 0, 0, 0,
0, 0), X52 = c(0, 0, 0, 0, 0, 0), X53 = c(0, 0, 0, 1, 0,
0), X54 = c(0, 0, 0, 0, 0, 0), X55 = c(0, 0, 0, 1, 0, 0),
X56 = c(0, 0, 0, 0, 0, 0), X57 = c(0, 0, 3, 0, 0, 0), X58 = c(0,
0, 0, 0, 0, 0), X59 = c(0, 0, 0, 0, 0, 0), X60 = c(0, 0,
0, 0, 0, 0), X61 = c(0, 0, 44, 0, 0, 0), X62 = c(0, 0, 15,
0, 0, 0), X63 = c(0, 0, 347, 0, 0, 0), X64 = c(0, 0, 0, 0,
0, 0), X65 = c(0, 0, 0, 5, 0, 0), X66 = c(0, 0, 0, 0, 0,
0), X67 = c(1, 8, 2, 11, 6, 0), X68 = c(0, 26, 0, 0, 0, 0
), X69 = c(0, 0, 0, 8, 0, 0), X70 = c(0, 0, 0, 13, 0, 0),
X71 = c(0, 0, 0, 0, 0, 0), X72 = c(0, 2, 0, 0, 0, 0), X73 = c(0,
0, 0, 0, 0, 0), X74 = c(341, 74, 0, 0, 0, 0), X75 = c(4,
6, 10, 17, 13, 0), X76 = c(0, 0, 0, 0, 0, 0), X77 = c(0,
0, 0, 0, 0, 0), X78 = c(0, 0, 0, 6, 0, 0), X79 = c(0, 0,
0, 0, 0, 0), X80 = c(0, 0, 0, 0, 0, 0), X81 = c(403, 86,
0, 0, 0, 0), X82 = c(20, 95, 54, 0, 0, 0), X83 = c(0, 2,
0, 1, 0, 0), X84 = c(0, 0, 3, 1, 0, 0), X85 = c(0, 0, 0,
0, 0, 0), X86 = c(40, 132, 39, 0, 1, 0), X87 = c(0, 0, 0,
0, 0, 0), X88 = c(0, 0, 0, 0, 0, 0), X89 = c(0, 0, 0, 0,
0, 0), X90 = c(0, 0, 0, 0, 0, 0), X91 = c(0, 0, 0, 0, 0,
0), X92 = c(0, 7, 0, 0, 0, 0), X93 = c(0, 0, 0, 0, 0, 0),
X94 = c(0, 0, 0, 0, 0, 0), X95 = c(0, 0, 0, 0, 0, 0), X96 = c(0,
0, 0, 0, 0, 0), X97 = c(0, 0, 0, 0, 0, 0), X98 = c(0, 0,
0, 0, 0, 0), X99 = c(0, 0, 0, 0, 0, 0), X100 = c(0, 0, 0,
0, 0, 0), X101 = c(0, 0, 0, 0, 0, 0), X102 = c(0, 8, 0, 1,
0, 0), X103 = c(0, 0, 0, 0, 0, 0), X104 = c(0, 0, 0, 0, 0,
0), X105 = c(0, 0, 0, 0, 0, 0), X106 = c(0, 0, 0, 0, 0, 0
), X107 = c(0, 0, 0, 0, 0, 0), X108 = c(0, 0, 0, 0, 0, 0),
X109 = c(0, 0, 0, 0, 0, 0), X110 = c(0, 0, 0, 0, 0, 0), X111 = c(0,
0, 0, 0, 0, 0), X112 = c(15, 47, 0, 1, 0, 0), X113 = c(0,
0, 0, 0, 0, 0), X114 = c(0, 0, 0, 0, 0, 0), X115 = c(0, 0,
0, 2, 0, 0), X116 = c(43, 0, 0, 1, 1, 0), X117 = c(0, 0,
0, 0, 0, 0), X118 = c(0, 0, 0, 0, 0, 0), X119 = c(0, 0, 0,
0, 0, 0), X120 = c(387, 0, 0, 0, 0, 0), X121 = c(0, 0, 0,
0, 0, 0), X122 = c(342, 1, 0, 72, 0, 0), X123 = c(0, 0, 0,
0, 0, 0), X124 = c(0, 0, 0, 76, 0, 0), X125 = c(0, 0, 0,
0, 0, 0), X126 = c(0, 0, 0, 0, 0, 0), X127 = c(0, 2, 0, 0,
0, 0), X128 = c(0, 0, 0, 0, 0, 0), X129 = c(0, 0, 0, 0, 0,
0), X130 = c(0, 0, 0, 0, 0, 0), X131 = c(0, 0, 0, 0, 0, 0
), X132 = c(0, 0, 0, 0, 0, 0), X133 = c(0, 0, 0, 0, 0, 0),
X134 = c(0, 0, 0, 11, 0, 0), X135 = c(13, 108, 0, 129, 192,
0), X136 = c(0, 0, 0, 0, 0, 0), X137 = c(18, 129, 0, 23,
0, 0), X138 = c(0, 0, 0, 32, 7, 0), X139 = c(1, 0, 0, 10,
0, 0), X140 = c(0, 0, 0, 3, 0, 0), X141 = c(0, 0, 0, 0, 0,
0), X142 = c(0, 0, 0, 14, 0, 0), X143 = c(0, 0, 0, 0, 0,
0), X144 = c(16, 74, 71, 0, 0, 0), X145 = c(0, 0, 0, 0, 392,
0), X146 = c(0, 24, 224, 1, 0, 0), X147 = c(0, 19, 224, 1,
0, 0), X148 = c(0, 13, 253, 0, 0, 0), X149 = c(49, 17, 17,
0, 0, 0), X150 = c(133, 70, 74, 0, 0, 0), X151 = c(0, 0,
0, 0, 0, 0), X152 = c(0, 0, 0, 0, 0, 0), X153 = c(0, 0, 0,
0, 0, 0), X154 = c(0, 0, 0, 0, 0, 0), X155 = c(0, 0, 0, 0,
0, 0), X156 = c(0, 1, 0, 0, 0, 0), X157 = c(0, 0, 0, 0, 0,
0), X158 = c(0, 0, 0, 22, 0, 0), X159 = c(0, 0, 0, 0, 0,
0), X160 = c(0, 0, 0, 10, 0, 0), X161 = c(0, 0, 0, 106, 0,
0), X162 = c(148, 27, 85, 0, 0, 0), X163 = c(0, 0, 0, 0,
0, 0), X164 = c(0, 0, 0, 0, 0, 0), X165 = c(0, 10, 0, 0,
0, 0), X166 = c(0, 5, 0, 0, 0, 0), X167 = c(0, 0, 0, 0, 0,
0), X168 = c(1, 0, 0, 0, 0, 0), X169 = c(0, 7, 0, 0, 0, 0
), X170 = c(0, 0, 0, 2, 0, 0), X171 = c(0, 0, 0, 0, 0, 0),
X172 = c(0, 0, 0, 0, 0, 0), X173 = c(0, 0, 0, 0, 0, 0), X174 = c(0,
0, 0, 0, 0, 0), X175 = c(0, 0, 0, 2, 0, 0), X176 = c(0, 0,
0, 0, 0, 0), X177 = c(0, 0, 0, 212, 0, 0), X178 = c(0, 1,
0, 0, 0, 0), X179 = c(0, 0, 0, 0, 0, 0), X180 = c(0, 0, 0,
0, 0, 0), X181 = c(0, 0, 0, 0, 0, 0), X182 = c(0, 0, 0, 0,
0, 0), X183 = c(0, 0, 0, 0, 0, 0), X184 = c(0, 0, 0, 0, 0,
0), X185 = c(0, 9, 0, 0, 0, 0), X186 = c(0, 0, 0, 0, 0, 0
), X187 = c(0, 0, 0, 0, 0, 0), X188 = c(0, 0, 0, 0, 0, 0),
X189 = c(0, 0, 0, 0, 0, 0), X190 = c(475, 108, 329, 14, 57,
0), X191 = c(0, 0, 8, 0, 0, 0), X192 = c(0, 0, 0, 0, 0, 0
), X193 = c(0, 0, 0, 0, 0, 0), X194 = c(0, 0, 0, 0, 0, 0),
X195 = c(0, 0, 0, 0, 0, 0), X196 = c(0, 0, 0, 0, 0, 0), X197 = c(0,
0, 0, 0, 0, 0), X198 = c(0, 0, 2, 0, 0, 0), X199 = c(0, 0,
0, 0, 0, 0), X200 = c(0, 0, 0, 0, 0, 0), X201 = c(0, 27,
647, 1, 0, 0), X202 = c(0, 0, 0, 0, 0, 0), X203 = c(0, 0,
0, 0, 0, 0), X204 = c(0, 0, 0, 0, 0, 0), X205 = c(251, 41,
58, 0, 1, 0), X206 = c(0, 0, 0, 0, 0, 0), X207 = c(0, 0,
0, 0, 0, 0), X208 = c(0, 0, 0, 0, 0, 0), X209 = c(0, 0, 0,
0, 0, 0), X210 = c(0, 0, 0, 0, 0, 0), X211 = c(0, 0, 0, 0,
0, 0), X212 = c(0, 0, 0, 0, 0, 0), X213 = c(0, 0, 0, 0, 0,
0), X214 = c(0, 0, 0, 0, 0, 0), X215 = c(0, 0, 0, 0, 0, 0
), X216 = c(0, 0, 0, 0, 0, 0), X217 = c(0, 0, 0, 0, 0, 0),
X218 = c(0, 0, 0, 0, 0, 0), X219 = c(0, 0, 0, 0, 0, 0), X220 = c(0,
0, 0, 0, 0, 0), X221 = c(0, 0, 0, 0, 0, 0), X222 = c(0, 0,
0, 0, 0, 0), X223 = c(0, 0, 0, 0, 0, 0), X224 = c(2, 0, 0,
0, 0, 0), X225 = c(0, 0, 0, 0, 0, 0), X226 = c(0, 0, 0, 0,
0, 0), X227 = c(0, 0, 0, 0, 0, 0), X228 = c(0, 0, 0, 0, 0,
0), X229 = c(0, 0, 0, 0, 0, 0), X230 = c(0, 0, 0, 0, 0, 0
), X231 = c(1, 0, 0, 0, 0, 0), X232 = c(0, 0, 0, 0, 0, 0),
X233 = c(0, 0, 0, 0, 0, 0), X234 = c(0, 0, 0, 0, 0, 0), X235 = c(0,
0, 0, 0, 0, 0), X236 = c(0, 0, 0, 0, 0, 0), X237 = c(0, 0,
0, 0, 0, 0), X238 = c(0, 0, 0, 0, 0, 0), X239 = c(0, 0, 0,
0, 0, 0), X240 = c(1, 0, 0, 0, 0, 0), X241 = c(445, 90, 0,
0, 1, 0), X242 = c(1, 70, 0, 0, 0, 0), X243 = c(23, 154,
11, 0, 0, 0), X244 = c(0, 0, 1, 0, 0, 0), X245 = c(174, 250,
192, 6, 0, 0), X246 = c(0, 2, 0, 1, 0, 0), X247 = c(0, 0,
0, 0, 0, 0), X248 = c(0, 0, 0, 0, 0, 0), X249 = c(29, 73,
20, 0, 0, 0), X250 = c(0, 99, 0, 0, 0, 0), X251 = c(20, 66,
4, 0, 0, 0), X252 = c(265, 48, 191, 0, 1, 0), X253 = c(112,
59, 0, 0, 0, 0), X254 = c(0, 3, 3, 0, 0, 0), X255 = c(0,
1, 0, 0, 0, 0), X256 = c(0, 0, 0, 0, 0, 0), X257 = c(0, 2,
0, 0, 0, 0), X258 = c(0, 0, 0, 0, 0, 0), X259 = c(86, 44,
69, 0, 0, 0), X260 = c(0, 0, 0, 0, 0, 0), X261 = c(13, 27,
0, 0, 1, 0), X262 = c(0, 5, 0, 0, 0, 0), X263 = c(0, 0, 0,
0, 0, 0), X264 = c(0, 0, 0, 0, 0, 0), X265 = c(0, 0, 0, 0,
0, 0), X266 = c(0, 0, 0, 0, 0, 0), X267 = c(0, 1, 0, 0, 0,
0), X268 = c(0, 0, 0, 0, 0, 0), X269 = c(0, 0, 0, 0, 0, 0
), X270 = c(0, 0, 0, 0, 0, 0), X271 = c(0, 0, 0, 4, 0, 0),
X272 = c(0, 0, 0, 0, 0, 0), X273 = c(0, 0, 0, 0, 0, 0), X274 = c(0,
0, 0, 0, 0, 0), X275 = c(291, 200, 115, 0, 0, 0), X276 = c(0,
5, 0, 0, 0, 0), X277 = c(0, 0, 0, 0, 0, 0), X278 = c(0, 5,
0, 5, 0, 0), X279 = c(0, 3, 2, 6, 0, 0), X280 = c(0, 0, 28,
0, 0, 0), X281 = c(0, 1, 0, 0, 0, 0), X282 = c(0, 8, 1, 5,
0, 0), X283 = c(0, 3, 0, 1, 0, 0), X284 = c(0, 0, 17, 0,
0, 0), X285 = c(0, 3, 0, 0, 0, 0), X286 = c(0, 0, 0, 0, 0,
0), X287 = c(0, 1, 1, 4, 0, 0), X288 = c(0, 0, 0, 0, 0, 0
), X289 = c(0, 2, 0, 0, 0, 0), X290 = c(0, 0, 0, 0, 0, 0),
X291 = c(0, 0, 0, 0, 0, 0), X292 = c(0, 0, 0, 4, 0, 0), X293 = c(0,
0, 0, 0, 0, 0), X294 = c(38, 10, 72, 0, 0, 0), X295 = c(0,
58, 0, 0, 0, 0), X296 = c(0, 20, 0, 0, 0, 0), X297 = c(69,
4, 39, 0, 1, 0), X298 = c(0, 15, 304, 3, 0, 0), X299 = c(0,
0, 0, 0, 0, 0), X300 = c(0, 6, 0, 0, 0, 0), X301 = c(0, 1,
0, 0, 0, 0), X302 = c(51, 28, 13, 0, 0, 0), X303 = c(96,
149, 28, 0, 0, 0), X304 = c(34, 25, 24, 0, 0, 0), X305 = c(0,
3, 1, 0, 0, 0), X306 = c(0, 3, 7, 0, 0, 0), X307 = c(0, 4,
0, 0, 0, 0), X308 = c(0, 0, 0, 0, 0, 0), X309 = c(0, 0, 35,
1, 0, 0), X310 = c(262, 9, 137, 0, 0, 0), X311 = c(3, 15,
0, 2, 9, 0), X312 = c(445, 139, 353, 48, 16, 0), X313 = c(0,
0, 0, 0, 0, 0), X314 = c(0, 0, 0, 0, 0, 0), X315 = c(0, 0,
0, 0, 0, 0), X316 = c(0, 0, 0, 0, 0, 0), X317 = c(0, 0, 0,
0, 0, 0), X318 = c(0, 0, 0, 0, 0, 0), X319 = c(0, 0, 0, 0,
0, 0), X320 = c(62, 138, 36, 0, 0, 0), X321 = c(3, 0, 0,
0, 0, 0), X322 = c(0, 0, 0, 0, 0, 0), X323 = c(0, 13, 0,
0, 0, 0), X324 = c(0, 0, 0, 0, 0, 0), X325 = c(142, 0, 104,
0, 0, 0), X326 = c(0, 2, 0, 0, 0, 0), X327 = c(56, 35, 101,
0, 0, 0), X328 = c(0, 0, 0, 10, 0, 0), X329 = c(0, 0, 0,
0, 0, 0), X330 = c(0, 2, 0, 0, 0, 0), X331 = c(259, 27, 107,
0, 2, 0), X332 = c(0, 0, 0, 0, 0, 0), X333 = c(0, 7, 0, 0,
0, 0), X334 = c(0, 0, 0, 0, 0, 0), X335 = c(98, 39, 95, 0,
0, 0), X336 = c(0, 0, 1, 0, 0, 0), X337 = c(0, 0, 0, 0, 0,
0), X338 = c(141, 28, 85, 0, 0, 0), X339 = c(15, 14, 20,
0, 0, 0), X340 = c(0, 6, 0, 0, 0, 0), X341 = c(0, 0, 0, 0,
0, 0), X342 = c(0, 2, 0, 0, 0, 0), X343 = c(0, 0, 0, 0, 0,
0), X344 = c(0, 0, 0, 0, 0, 0), X345 = c(0, 10, 232, 0, 0,
0), X346 = c(0, 4, 0, 0, 0, 0), X347 = c(0, 0, 0, 0, 0, 0
), X348 = c(0, 0, 0, 0, 0, 0), X349 = c(0, 0, 0, 0, 0, 0),
X350 = c(0, 0, 0, 0, 0, 0), X351 = c(0, 0, 0, 0, 0, 0), X352 = c(0,
0, 0, 0, 0, 0), X353 = c(0, 0, 0, 0, 4, 0), X354 = c(0, 0,
0, 0, 0, 0), X355 = c(0, 0, 0, 0, 1, 0), X356 = c(0, 0, 0,
0, 0, 0), X357 = c(0, 0, 0, 0, 0, 0), X358 = c(0, 0, 0, 0,
0, 0), X359 = c(0, 0, 0, 0, 0, 0), X360 = c(0, 0, 0, 0, 0,
0), X361 = c(0, 0, 0, 0, 0, 0), X362 = c(0, 0, 0, 0, 0, 0
), X363 = c(0, 0, 0, 0, 0, 0), X364 = c(0, 0, 0, 0, 2, 0),
X365 = c(0, 0, 0, 0, 0, 0), X366 = c(0, 0, 0, 0, 0, 0), X367 = c(0,
0, 0, 0, 0, 0), X368 = c(0, 0, 0, 0, 0, 0), X369 = c(0, 0,
0, 17, 0, 0), X370 = c(0, 0, 0, 0, 0, 0), X371 = c(0, 0,
0, 0, 0, 0), X372 = c(0, 0, 0, 0, 0, 0), X373 = c(0, 0, 0,
0, 0, 0), X374 = c(0, 0, 0, 0, 0, 0), X375 = c(0, 0, 0, 0,
0, 0), X376 = c(0, 0, 1, 0, 0, 0), X377 = c(0, 0, 0, 0, 0,
0), X378 = c(0, 0, 0, 0, 0, 0), X379 = c(0, 0, 0, 0, 0, 0
), X380 = c(0, 0, 0, 0, 0, 0), X381 = c(0, 0, 0, 0, 0, 0),
X382 = c(0, 0, 0, 0, 0, 0), X383 = c(0, 51, 0, 0, 0, 0),
X384 = c(0, 0, 0, 0, 0, 0), X385 = c(7, 0, 0, 11, 1, 0),
X386 = c(0, 0, 0, 0, 0, 0), X387 = c(0, 0, 1, 0, 0, 0), X388 = c(0,
0, 0, 0, 0, 0), X389 = c(0, 0, 0, 0, 0, 0), X390 = c(0, 5,
0, 0, 0, 0), X391 = c(0, 0, 0, 0, 0, 0), X392 = c(0, 0, 0,
0, 0, 0), X393 = c(2, 16, 0, 0, 0, 0), X394 = c(0, 6, 88,
0, 0, 0), X395 = c(0, 14, 136, 1, 0, 0), X396 = c(0, 41,
350, 2, 0, 0), X397 = c(0, 0, 0, 0, 0, 0), X398 = c(20, 413,
0, 12, 3, 0), X399 = c(0, 0, 0, 0, 0, 0), X400 = c(0, 3,
0, 0, 0, 0), X401 = c(0, 0, 0, 0, 0, 0), X402 = c(0, 2, 0,
0, 0, 0), X403 = c(0, 2, 0, 0, 0, 0), X404 = c(0, 0, 0, 0,
0, 0), X405 = c(0, 0, 0, 0, 0, 0), X406 = c(0, 0, 0, 0, 0,
0), X407 = c(0, 0, 39, 1, 0, 0), X408 = c(10, 73, 31, 0,
0, 0), X409 = c(0, 11, 0, 0, 0, 0), X410 = c(68, 58, 66,
1, 0, 0), X411 = c(4, 32, 3, 0, 0, 0), X412 = c(8, 66, 39,
0, 0, 0), X413 = c(0, 0, 0, 0, 0, 0), X414 = c(2, 53, 7,
0, 0, 0), X415 = c(120, 90, 109, 0, 0, 0), X416 = c(0, 80,
0, 0, 0, 0), X417 = c(62, 79, 24, 0, 0, 0), X418 = c(58,
156, 30, 0, 0, 0), X419 = c(72, 138, 50, 2, 0, 0), X420 = c(0,
0, 0, 0, 0, 0), X421 = c(0, 0, 0, 0, 0, 0), X422 = c(36,
143, 43, 0, 0, 0), X423 = c(0, 0, 0, 0, 0, 0), X424 = c(0,
0, 0, 0, 0, 0), X425 = c(0, 5, 0, 0, 0, 0), X426 = c(12,
109, 0, 18, 26, 0), X427 = c(0, 0, 0, 0, 0, 0), X428 = c(0,
0, 0, 0, 0, 0), X429 = c(0, 3, 0, 0, 0, 0), X430 = c(0, 0,
362, 0, 0, 0), X431 = c(0, 0, 0, 0, 0, 0), X432 = c(0, 0,
685, 0, 0, 0), X433 = c(0, 0, 0, 0, 0, 0), X434 = c(0, 0,
0, 0, 0, 0), X435 = c(0, 0, 0, 0, 0, 0), X436 = c(0, 0, 0,
0, 0, 0), X437 = c(0, 0, 15, 8, 0, 0), X438 = c(0, 0, 184,
0, 0, 0), X439 = c(0, 0, 0, 0, 0, 0), X440 = c(0, 0, 0, 0,
0, 0), X441 = c(0, 0, 0, 0, 0, 0), X442 = c(0, 0, 0, 0, 0,
0), X443 = c(0, 0, 0, 0, 0, 0), X444 = c(0, 6, 0, 0, 0, 0
), X445 = c(0, 0, 0, 0, 0, 0), X446 = c(0, 1, 1, 4, 0, 0),
X447 = c(0, 3, 0, 0, 0, 0), X448 = c(0, 1, 0, 0, 0, 0), X449 = c(616,
28, 368, 0, 0, 0), X450 = c(0, 0, 1, 0, 0, 0), X451 = c(4098,
2120, 3788, 2663, 3524, 0), X452 = c(0, 0, 0, 0, 0, 0), X453 = c(0,
66, 0, 0, 0, 0), X454 = c(0, 9, 0, 0, 0, 0), X455 = c(0,
1, 0, 0, 0, 0), X456 = c(0, 5, 0, 0, 0, 0), X457 = c(57,
111, 36, 0, 0, 0), X458 = c(0, 0, 0, 0, 0, 0), X459 = c(0,
54, 68, 0, 0, 0), X460 = c(0, 0, 0, 0, 0, 0), X461 = c(0,
0, 0, 0, 0, 0), X462 = c(0, 0, 0, 0, 0, 0), X463 = c(0, 0,
0, 0, 0, 0), X464 = c(0, 0, 0, 0, 0, 0), X465 = c(0, 0, 0,
0, 0, 0), X466 = c(0, 0, 0, 0, 0, 0), X467 = c(0, 1, 0, 2,
0, 0), X468 = c(48, 79, 52, 0, 0, 0), X469 = c(24, 244, 178,
0, 0, 0), X470 = c(24, 28, 13, 0, 0, 0), X471 = c(0, 0, 0,
0, 0, 0), X472 = c(96, 52, 45, 0, 0, 0), X473 = c(0, 0, 0,
102, 0, 0), X474 = c(196, 82, 130, 0, 0, 0), X475 = c(106,
30, 33, 0, 0, 0), X476 = c(12, 21, 22, 0, 0, 0), X477 = c(0,
0, 0, 0, 172, 0), X478 = c(0, 28, 280, 0, 0, 0), X479 = c(0,
27, 310, 0, 0, 0), X480 = c(0, 32, 366, 0, 0, 0), X481 = c(0,
7, 0, 0, 0, 0), X482 = c(0, 22, 0, 0, 0, 0), X483 = c(0,
1, 0, 0, 0, 0), X484 = c(0, 13, 0, 0, 0, 0), X485 = c(0,
2, 0, 0, 0, 0), X486 = c(0, 16, 0, 0, 0, 0), X487 = c(0,
6, 0, 0, 0, 0), X488 = c(0, 8, 0, 0, 0, 0), X489 = c(0, 20,
0, 0, 0, 0), X490 = c(0, 3, 0, 0, 0, 0), X491 = c(0, 14,
0, 0, 0, 0), X492 = c(0, 4, 0, 0, 0, 0), X493 = c(0, 2, 0,
0, 0, 0), X494 = c(0, 5, 0, 0, 0, 0), X495 = c(0, 1, 0, 0,
0, 0), X496 = c(0, 4, 0, 0, 0, 0), X497 = c(0, 15, 0, 0,
0, 0), X498 = c(0, 0, 0, 0, 0, 0), X499 = c(0, 7, 0, 0, 0,
0), X500 = c(0, 13, 0, 0, 0, 0), X501 = c(0, 11, 0, 0, 0,
0), X502 = c(0, 7, 0, 0, 0, 0), X503 = c(0, 4, 0, 0, 0, 0
), X504 = c(0, 0, 0, 0, 0, 0), X505 = c(0, 7, 0, 0, 0, 0),
X506 = c(0, 1, 0, 0, 0, 0), X507 = c(0, 1, 0, 0, 0, 0), X508 = c(0,
0, 0, 1, 0, 0), X509 = c(0, 6, 0, 0, 0, 0), X510 = c(0, 0,
0, 0, 0, 0), X511 = c(0, 2, 0, 0, 0, 0), X512 = c(0, 1, 0,
0, 0, 0), X513 = c(0, 14, 0, 0, 0, 0), X514 = c(0, 3, 0,
0, 0, 0), X515 = c(237, 171, 188, 0, 0, 0), X516 = c(291,
222, 163, 0, 0, 0), X517 = c(5, 36, 9, 0, 0, 0), X518 = c(5,
102, 0, 0, 0, 0), X519 = c(0, 0, 0, 0, 0, 0), X520 = c(0,
0, 0, 0, 0, 0), X521 = c(0, 0, 0, 0, 0, 0), X522 = c(96,
69, 109, 0, 0, 0), X523 = c(236, 0, 118, 0, 1, 0), X524 = c(0,
44, 0, 0, 0, 0), X525 = c(0, 0, 0, 0, 0, 0), X526 = c(0,
0, 0, 0, 0, 0), X527 = c(0, 0, 0, 0, 0, 0), X528 = c(0, 0,
0, 0, 0, 0), X529 = c(0, 62, 15, 0, 0, 0), X530 = c(4, 183,
16, 0, 0, 0), X531 = c(3, 187, 19, 0, 0, 0), X532 = c(197,
79, 64, 0, 0, 0), X533 = c(27, 255, 25, 0, 0, 0), X534 = c(0,
2, 0, 0, 0, 0), X535 = c(0, 20, 0, 0, 0, 0), X536 = c(0,
1, 0, 0, 0, 0), X537 = c(0, 10, 0, 0, 0, 0), X538 = c(0,
1, 0, 0, 0, 0), X539 = c(0, 4, 0, 0, 0, 0), X540 = c(0, 0,
0, 0, 0, 0), X541 = c(0, 6, 0, 0, 0, 0), X542 = c(0, 1, 0,
0, 0, 0), X543 = c(0, 12, 113, 0, 0, 0), X544 = c(0, 77,
990, 0, 0, 0), X545 = c(6, 27, 14, 0, 0, 0), X546 = c(0,
0, 0, 0, 0, 0), X547 = c(0, 0, 0, 0, 0, 0), X548 = c(0, 0,
0, 0, 0, 0), X549 = c(0, 0, 0, 0, 0, 0), X550 = c(0, 0, 0,
0, 0, 0), X551 = c(0, 0, 0, 0, 0, 0), X552 = c(0, 0, 0, 0,
0, 0), X553 = c(301, 0, 0, 0, 0, 0), X554 = c(444, 148, 305,
0, 0, 0), X555 = c(0, 0, 0, 0, 0, 0), X556 = c(0, 2, 2, 0,
0, 0), X557 = c(0, 0, 0, 0, 0, 0), X558 = c(0, 1, 0, 0, 0,
0), X559 = c(0, 0, 0, 0, 0, 0), X560 = c(0, 0, 0, 0, 0, 0
), X561 = c(0, 3, 4, 6, 1, 0), X562 = c(120, 77, 26, 0, 0,
0), X563 = c(0, 3, 628, 0, 0, 0), X564 = c(709, 104, 0, 0,
0, 0), X565 = c(0, 0, 0, 0, 0, 0), X566 = c(95, 59, 581,
175, 1219, 0), X567 = c(0, 0, 0, 0, 13, 0), X568 = c(26,
7, 0, 26, 39, 0), X569 = c(18, 33, 0, 35, 36, 0), X570 = c(0,
2, 41, 39, 1, 0), X571 = c(0, 8, 47, 97, 1, 0), X572 = c(216,
291, 52, 279, 688, 0), X573 = c(198, 504, 0, 5, 0, 0), X574 = c(0,
0, 0, 0, 0, 0), X575 = c(110, 102, 895, 254, 1682, 0), X576 = c(1,
2, 0, 0, 0, 0), X577 = c(10, 18, 0, 0, 0, 0), X578 = c(8,
40, 0, 0, 0, 0), X579 = c(0, 0, 0, 0, 0, 0), X580 = c(0,
0, 0, 0, 0, 0), X581 = c(0, 0, 0, 0, 0, 0), X582 = c(0, 0,
0, 0, 0, 0), X583 = c(0, 0, 216, 0, 0, 0), X584 = c(0, 0,
0, 0, 0, 0), X585 = c(0, 0, 0, 0, 0, 0), X586 = c(0, 0, 0,
0, 0, 0), X587 = c(0, 0, 0, 0, 0, 0), X588 = c(0, 0, 0, 0,
0, 0), X589 = c(0, 0, 0, 0, 0, 0), X590 = c(0, 0, 0, 0, 0,
0), X591 = c(31, 32, 0, 52, 213, 0), X592 = c(0, 0, 12, 0,
0, 0), X593 = c(0, 0, 0, 0, 0, 0), X594 = c(28, 77, 21, 0,
0, 0), X595 = c(0, 0, 0, 0, 0, 0), X596 = c(0, 0, 0, 0, 0,
0)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
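Some rows in NMDS contain only zeros. Those are the empty rows the warning refers to: with method "bray" they produce missing dissimilarities, and cmdscale then stops on the NA values.
You can remove rows where every value is 0 using dplyr:
library(dplyr)
# keep only rows that have at least one non-zero value
NMDS <- NMDS %>%
  filter_all(any_vars(. != 0))
NMDS <- metaMDS(NMDS, distance="bray")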
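For reference, a base R sketch of the same idea, assuming all values are non-negative counts so that a row total of zero means the row is empty. The `toy` table and the names `keep` and `toy_nonempty` are made up for illustration.

```r
# Toy abundance table: rows are samples, columns are genes; the last row is empty.
toy <- data.frame(X1 = c(3, 0, 0), X2 = c(12, 78, 0), X3 = c(0, 52, 0))

# Keep only samples whose total abundance is greater than zero.
keep <- rowSums(toy) > 0
toy_nonempty <- toy[keep, , drop = FALSE]
print(toy_nonempty)
```

Checking `which(!keep)` before filtering shows which samples are being dropped, which is worth recording when the rows correspond to real samples.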

xgb.cv with no folds and return the results based on a split of the data

I have some data which looks like:
# A tibble: 50 x 28
sanchinarro date holiday weekday weekend workday_on_holi… weekend_on_holi… protocol_active
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1.01 2010-01-01 1 1 0 1 0 0
2 0.832 2010-01-02 0 0 1 0 0 0
3 1.29 2010-01-03 0 0 1 0 0 0
4 1.04 2010-01-04 0 1 0 0 0 0
5 0.526 2010-01-05 0 1 0 0 0 0
6 -0.292 2010-01-06 1 1 0 1 0 0
7 -0.394 2010-01-07 0 1 0 0 0 0
8 -0.547 2010-01-08 0 1 0 0 0 0
9 -0.139 2010-01-09 0 0 1 0 0 0
10 0.628 2010-01-10 0 0 1 0 0 0
I want to run xgb.cv on the first 40 rows and validate it on the final 10 rows.
I tried the following:
library(xgboost)
library(dplyr)
X_Val <- ddd %>% select(-c(1:2))
Y_Val <- ddd %>% select(c(1)) %>% pull()
dVal <- xgb.DMatrix(data = as.matrix(X_Val), label = as.numeric(Y_Val))
xgb.cv(data = dVal, nround = 30, folds = NA, params = list(eta = 0.1, max_depth = 5))
which gives me this error:
Error in xgb.cv(data = dVal, nround = 30, folds = NA, eta = 0.1,
max_depth = 5) : 'folds' must be a list with 2 or more elements
that are vectors of indices for each CV-fold
How can I run a simple xgb.cv on the first 40 rows and test it on the last 10 rows?
I eventually want to apply a grid search over a list of parameters and save the results in a list. Since I am dealing with time series data, I do not want to mix the folds up; I just want a simple 40:10 split into a training set and a test set.
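To make the intended split concrete, here is a minimal base R sketch of a time-ordered 40:10 holdout. It assumes a made-up 50-row table (`toy`) rather than the real data, and it only builds the two subsets. Because the error message says that folds needs two or more elements, a single holdout like this would have to be passed to a training function that accepts a separate evaluation set rather than to xgb.cv as called above.

```r
# Toy stand-in for the question's 50-row table (values are made up).
toy <- data.frame(
  date = seq(as.Date("2010-01-01"), by = "day", length.out = 50),
  y = seq_len(50)
)

# Sort by date first so the split respects time order.
toy <- toy[order(toy$date), ]

n_train <- 40
train_idx <- seq_len(n_train)
test_idx <- seq(n_train + 1, nrow(toy))

train <- toy[train_idx, ]
test <- toy[test_idx, ]

# Every training date precedes every test date.
print(range(train$date))
print(range(test$date))
```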
Data:
ddd <- structure(list(sanchinarro = c(-1.00742964973274, 0.832453587904369,
1.29242439731365, 1.03688505875294, 0.525806381631517, -0.291919501762755,
-0.394135237187039, -0.547458840323464, -0.138595898626329, 0.628022117055801,
1.19020866188936, 1.5990716035865, 1.5990716035865, -0.70078244345989,
2.11015028070792, 1.95682667757149, 0.985777191040795, 0.883561455616511,
0.985777191040795, 0.270267043070807, 2.51901322240505, 2.41679748698077,
0.372482778495091, -0.291919501762755, -0.905213914308458, -0.905213914308458,
-0.649674575747748, 1.2413165296015, 1.54796373587436, -0.70078244345989,
-0.905213914308458, -0.0363801632020448, 1.54796373587436, 2.00793454528363,
1.54796373587436, -0.445243104899181, -0.445243104899181, 1.03688505875294,
0.628022117055801, -0.496350972611323, 0.168051307646523, -0.649674575747748,
0.0658355722222391, -1.00742964973274, -0.291919501762755, 0.0147277045100972,
0.168051307646523, -0.189703766338471, 0.219159175358665, 0.679129984767943
), date = structure(c(14610, 14611, 14612, 14613, 14614, 14615,
14616, 14617, 14618, 14619, 14620, 14621, 14622, 14623, 14624,
14625, 14626, 14627, 14628, 14629, 14630, 14631, 14632, 14633,
14634, 14635, 14636, 14637, 14638, 14639, 14640, 14641, 14642,
14643, 14644, 14645, 14646, 14647, 14648, 14649, 14650, 14651,
14652, 14653, 14654, 14655, 14656, 14657, 14658, 14659), class = "Date"),
holiday = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), weekday = c(1,
0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1), weekend = c(0, 1, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0), workday_on_holiday = c(1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), weekend_on_holiday = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), protocol_active = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text_broken_clouds = c(0,
1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text_clear = c(0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1), text_fog = c(0, 1, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0), text_partly_cloudy = c(0, 1, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), text_partly_sunny = c(1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1), text_passing_clouds = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1), text_scattered_clouds = c(1, 1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1), text_sunny = c(0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1), month_1 = c(1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_2 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1), month_3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_4 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_5 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), month_6 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_7 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), month_8 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_9 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_10 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), month_11 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_12 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-50L))
EDIT: List data:
The final data comes in the form of lists.
datalst <- list(structure(list(sanchinarro = c(-1.00742964973274, 0.832453587904369,
1.29242439731365, 1.03688505875294, 0.525806381631517, -0.291919501762755,
-0.394135237187039, -0.547458840323464, -0.138595898626329, 0.628022117055801,
1.19020866188936, 1.5990716035865, 1.5990716035865, -0.70078244345989
), date = structure(c(14610, 14611, 14612, 14613, 14614, 14615,
14616, 14617, 14618, 14619, 14620, 14621, 14622, 14623), class = "Date"),
holiday = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), weekday = c(1,
0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1), weekend = c(0, 1,
1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0), workday_on_holiday = c(1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), weekend_on_holiday = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), protocol_active = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text_broken_clouds = c(0,
1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0), text_clear = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0), text_fog = c(0, 1,
0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0), text_partly_cloudy = c(0,
1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), text_partly_sunny = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1), text_passing_clouds = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), text_scattered_clouds = c(1,
1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1), text_sunny = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0), month_1 = c(1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), month_2 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_3 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_4 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), month_5 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), month_6 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), month_7 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), month_8 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_9 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), month_10 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), month_11 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), month_12 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L)), structure(list(sanchinarro = c(0.832179838392013, 1.29225734336885,
1.03665872949283, 0.525461501740789, -0.292454062662475, -0.394693508212883,
-0.548052676538495, -0.139094894336863, 0.627700947291197, 1.19001789781844,
1.59897568002007, 1.59897568002007, -0.701411844864107, 2.11017290777211
), date = structure(c(14611, 14612, 14613, 14614, 14615, 14616,
14617, 14618, 14619, 14620, 14621, 14622, 14623, 14624), class = "Date"),
holiday = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), weekday = c(0,
0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1), weekend = c(1, 1,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0), workday_on_holiday = c(0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), weekend_on_holiday = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), protocol_active = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text_broken_clouds = c(1,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), text_clear = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1), text_fog = c(1, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0), text_partly_cloudy = c(1,
0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0), text_partly_sunny = c(1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), text_passing_clouds = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), text_scattered_clouds = c(1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0), text_sunny = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1), month_1 = c(1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), month_2 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_3 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_4 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), month_5 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), month_6 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), month_7 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), month_8 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_9 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), month_10 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), month_11 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), month_12 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L)), structure(list(sanchinarro = c(1.29293502084952, 1.03729933727253,
0.526027970118536, -0.292006217327851, -0.394260490758649, -0.547641900904846,
-0.138624807181653, 0.628282243549334, 1.19068074741873, 1.59969784114192,
1.59969784114192, -0.701023311051044, 2.11096920829591, 1.95758779814971
), date = structure(c(14612, 14613, 14614, 14615, 14616, 14617,
14618, 14619, 14620, 14621, 14622, 14623, 14624, 14625), class = "Date"),
holiday = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), weekday = c(0,
1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0), weekend = c(1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1), workday_on_holiday = c(0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), weekend_on_holiday = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), protocol_active = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text_broken_clouds = c(0,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1), text_clear = c(0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0), text_fog = c(0, 1,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0), text_partly_cloudy = c(0,
0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0), text_partly_sunny = c(1,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1), text_passing_clouds = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), text_scattered_clouds = c(0,
0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0), text_sunny = c(0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0), month_1 = c(1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), month_2 = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_3 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), month_4 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), month_5 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), month_6 = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), month_7 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), month_8 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), month_9 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), month_10 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), month_11 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), month_12 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L)))
EDIT:
I think this gives me what I am after, but I need to double/triple check it (if you see any errors, please let me know).
splt <- 0.80 * nrow(ddd)
ddd[c(1:splt), "id"] = 1
ddd$id[is.na(ddd$id)] = 2
fold.ids <- unique(ddd$id)
custom.folds <- vector("list", length(fold.ids))
i <- 1
for( id in fold.ids){
custom.folds[[i]] <- which( ddd$id %in% id )
i <- i+1
}
custom.folds
cv <- xgb.cv(params = list(eta = 0.1, max_depth = 5), dVal, nround = 10, folds = custom.folds, prediction = TRUE)
cv$evaluation_log
I now need to find a way to apply this to all 3 list elements in the newly added data.
First, you should split the data into dtrain (the first 40 rows) and dval (the last 10 rows). Second, you need xgb.train rather than xgb.cv.
So your code should be modified to something like this:
library(xgboost)
library(dplyr)
# your code regarding ddd
# features: drop the first two columns (the target and, in the list data, the date)
X <- ddd %>% select(-c(1:2))
# target: pull() returns a plain vector, so it is indexed with [i], not [i, ]
Y <- ddd %>% select(c(1)) %>% pull()
dtrain <- xgb.DMatrix(data = as.matrix(X[1:40, ]), label = as.numeric(Y[1:40]))
dval <- xgb.DMatrix(data = as.matrix(X[41:50, ]), label = as.numeric(Y[41:50]))
watchlist <- list(train = dtrain, val = dval)
model <- xgb.train(params = list(eta = 0.1, max_depth = 5), data = dtrain, watchlist = watchlist, nrounds = 30)
IMHO, with only 40+10 rows and such sparse features, there is little hope of getting good results from XGBoost.
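For the list data in the question, the same chronological split can be repeated for each element with lapply. The following is only a sketch in base R: split_by_time is a hypothetical helper, toy_lst is made-up data standing in for datalst, and the 80/20 fraction is taken from the question's splt.

```r
# Hypothetical helper: chronological train/validation row indices for one data frame.
# Assumes the rows are already ordered by date.
split_by_time <- function(df, train_frac = 0.8) {
  n <- nrow(df)
  train <- seq_len(floor(train_frac * n))   # first train_frac of the rows
  val <- setdiff(seq_len(n), train)         # remaining (most recent) rows
  list(train = train, val = val)
}

# Toy stand-in for datalst: three data frames of 14 rows each (made-up values)
set.seed(42)
toy_lst <- lapply(1:3, function(k) data.frame(y = rnorm(14), x = rnorm(14)))

# One split per list element
splits <- lapply(toy_lst, split_by_time)
```

The index vectors in splits[[k]]$train and splits[[k]]$val could then be used to build one dtrain/dval pair per list element, in the same way as for ddd above.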

How to use ids from one dataframe to sum rows in another dataframe

I feel like this question has been asked before, but I can't seem to find an answer to it. Maybe my title is too vague, so feel free to change it.
I have one data frame, a, with ids that correspond to column names in data frame b. Both data frames are simplified versions of much larger data frames.
Here is data frame a:
a <- structure(list(V1 = structure(c(4L, 5L, 1L, 2L, 3L), .Label = c("GEN[D00105].GT",
"GEN[D00151].GT", "GEN[D00188].GT", "GEN[D86396].GT", "GEN[D86397].GT"
), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
Here is data frame b:
b <- structure(list(`GEN[D01104].GT` = c(0, 0, 0, 0, 1, 0, 0, 2, 0,
1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 0), `GEN[D01312].GT` = c(1, 0,
2, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 2, 0, 0, 0), `GEN[D01878].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 0, 0), `GEN[D01882].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0), `GEN[D01952].GT` = c(0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D01953].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D02053].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0), `GEN[D00316].GT` = c(0,
0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0), `GEN[D01827].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0), `GEN[D01881].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0), `GEN[D02044].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0), `GEN[D02085].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02204].GT` = c(0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02276].GT` = c(0,
0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0), `GEN[D02297].GT` = c(0,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0), `GEN[D02335].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0), `GEN[D02397].GT` = c(0,
0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0), `GEN[D00856].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0), `GEN[D00426].GT` = c(0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0), `GEN[D02139].GT` = c(0,
0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0), `GEN[D02168].GT` = c(0,
0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0)), row.names = c(NA,
-20L), class = "data.frame")
I want to use the ids from data frame a to sum, row by row, the columns in data frame b whose names match those ids.
In the past, I just did something like
b$affected.samples <- (b$`GEN[D86396].GT` + b$`GEN[D86397].GT` + b$`GEN[D00105].GT` + b$`GEN[D00151].GT` + b$`GEN[D00188].GT`)
which got annoying and took too much time, so I moved over to
b$affected.samples <- rowSums(b[,c(1:5)])
This isn't too bad for this example, but in my large data set the samples can be all over the place, and it is starting to take too much time to find where everything is. I was hoping there is a way to use data frame a directly to sum the correct columns in data frame b.
Hopefully I gave you all the information you need! Let me know if you have any questions.
Thanks in advance!!
Extract the 'V1' column as a character vector, use it to select the columns of 'b' (assuming these column names are found in 'b'), and take the rowSums:
rowSums( b[as.character(a$V1)], na.rm = TRUE)
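If some ids might not exist as columns of 'b', indexing with them raises "undefined columns selected". A minimal sketch of a more defensive variant, using made-up toy data (the names s1, s2, s3 and s9 are placeholders, not from the question), keeps only the ids that are present:

```r
# Toy data; all names and values here are illustrative assumptions
a <- data.frame(V1 = factor(c("s1", "s3", "s9")))
b <- data.frame(s1 = c(1, 0, 2), s2 = c(5, 5, 5), s3 = c(0, 1, NA))

ids <- as.character(a$V1)               # factor -> character, so names are matched, not level codes
present <- intersect(ids, names(b))     # ids that exist as columns of b
missing <- setdiff(ids, names(b))       # ids without a matching column

if (length(missing) > 0) {
  message("No matching column for: ", paste(missing, collapse = ", "))
}

# Row-wise sum over the matching columns only
b$affected.samples <- rowSums(b[present], na.rm = TRUE)
```

Whether silently skipping missing ids is acceptable depends on the data; stopping with an error instead may be the safer choice when every id is expected to be present.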

Excluding one of the columns when calculating means

I have a data.frame like this:
> dput(head(dat))
structure(list(`Gene name` = c("at1g01050", "at1g01080", "at1g01090",
"at1g01220", "at1g01320", "at1g01420"), `1_1` = c(0, 0, 0, 0,
0, 0), `1_2` = c(0, 0, 0, 0, 0, 0), `1_3` = c(0, 2.2266502274762,
0, 0, 0, 0), `1_4` = c(0, 1.42835007256373, 0, 0, 0, 0), `1_5` = c(0,
1, 0, 0, 0, 0.680307288653971), `1_6` = c(0, 0.974694551708235,
0.0703315834738149, 0, 0, 1.5411058346636), `1_7` = c(1, 1.06166030205396,
0, 0, 0, 0), `1_8` = c(1, 1.07309874414745, 0.129442847788922,
0, 0, 0), `1_9` = c(1.83566164452602, 0.770848509662441, 1.16522133036595,
1.02360016370994, 0, 0), `1_10` = c(0, 0, 0.96367393959757, 0,
0, 0), `1_11` = c(0, 1, 1.459452636222, 0, 0.992067202742928,
0), `1_12` = c(0, 0, 0.670100384155585, 0, 0.461601636474094,
0), `1_13` = c(0, 0, 1.43074917909221, 0, 1.35246977730244, 0
), `1_14` = c(0, 0, 1.13052717277684, 0, 1.27971261718285, 0),
`1_15` = c(0, 0, 0, 0, 0, 0), `1_16` = c(0, 0, 1.02186950513655,
0, 0.937805171752374, 0), `1_17` = c(0, 0, 0, 0, 1.82226410514639,
0), `1_18` = c(0, 0, 1.2057581396188, 0, 1, 0), `1_19` = c(0,
0, 2.54080080087007, 0, 1.74014162763125, 0), `1_20` = c(0,
0, 0, 0, 0, 0), `1_21` = c(0, 0, 1.85335086627868, 0, 2.93605031878879,
0), `1_22` = c(0, 0, 0, 0, 0, 0), `1_23` = c(0, 0, 0, 0,
0, 0), `1_24` = c(0, 0.59685787388353, 4.74450895485671,
0, 1.64665192735547, 0), `1_25` = c(0, 0, 0, 0, 0, 0), `1_26` = c(0,
0, 0, 0, 0, 0), `1_27` = c(0, 1.70324142554566, 0, 0, 0,
0), `1_28` = c(0, 4.02915818089525, 0, 0, 0, 0), `1_29` = c(0,
1.10050253348262, 0, 0, 0, 1.78705663080963), `1_30` = c(0,
0, 0, 0, 0, 0), `1_31` = c(0.525193634811661, 1.19203674964562,
0, 0, 0, 0), `1_32` = c(0.949695564218912, 0.511935958918944,
0.698256748091399, 0.924419021307232, 0, 0), `1_33` = c(1,
0.392202418854686, 0.981531026331928, 1, 0, 0), `1_34` = c(0,
0, 1.04480642952605, 0, 0, 0), `1_35` = c(0.875709646300199,
0.416787083481068, 0.910412293707794, 0, 0.931813162802324,
0), `1_36` = c(0.235817844851986, 0, 0.695496044366791, 0,
0, 0), `1_37` = c(0, 0, 0, 0, 0, 0), `1_38` = c(0, 0, 0,
0, 0, 0), `1_39` = c(0, 0, 0, 0, 0, 0), `1_40` = c(0, 0.426301584359177,
1.05916031917965, 0, 1.11716924423855, 0), `1_41` = c(0,
0, 0, 0, 0, 0), `1_42` = c(0, 0, 0, 0, 0, 0), `1_43` = c(0,
0, 0, 0, 0, 0), `1_44` = c(0, 0.817605484758179, 1, 0, 1,
0), `1_45` = c(0, 0, 0, 0, 1.83706702696725, 0), `1_46` = c(0,
0, 0, 0, 0, 0), `1_48` = c(0, 0, 0, 0, 0, 0), `1_49` = c(0,
0, 0, 0, 0, 0), `1_50` = c(0, 0, 0, 0, 0, 0), `1_51` = c(0,
0.822966241998042, 0, 0, 0, 0), `1_52` = c(0, 1.38548267401525,
0, 0, 0, 0), `1_53` = c(0, 0.693090058304095, 0, 0, 0, 1.200664746484
), `1_54` = c(0, 7.58136662752864, 0, 0, 0, 0), `1_55` = c(0.519878111919004,
0.530809413647805, 0.343274113384907, 0, 0, 0), `1_56` = c(1.24511715957891,
0.545097856366912, 0.397440073804376, 0, 0, 0), `1_57` = c(1.26748496499576,
0.502893153188496, 1, 1.09278985531586, 0, 0), `1_58` = c(0.696198684496234,
0.68197003689249, 1.30108437738319, 0.778091049180591, 0.533017938104689,
0), `1_59` = c(1.15255606344999, 0.294294436704185, 1.07862692616479,
1, 0.250091116406616, 0), `1_60` = c(1.95634163405497, 0,
1.1602014253913, 0, 0, 0), `1_61` = c(1.09287167009628, 0,
2.05939536537347, 1.08165521287259, 0.68027384701565, 0),
`1_62` = c(0.791776166968497, 0, 0.846107162142824, 0, 0.77013323652256,
0), `1_63` = c(0.378787010943447, 0.391876271945063, 0.623223753921758,
0, 0.651918444771296, 0), `1_64` = c(0.189585762007804, 0.361452381684218,
0.799519726870751, 0, 1.06818683719768, 0), `1_65` = c(0,
0, 2.5212953775211, 0, 0, 0), `1_66` = c(0, 0, 0, 0, 0, 0
), `1_67` = c(0, 0, 0, 0, 2.44827717262786, 0), `1_68` = c(0,
0, 0, 0, 0, 0), `1_69` = c(0, 0, 0, 0, 0, 0), `1_70` = c(0,
0, 2.36142611074334, 0, 2.391093649557, 0), `1_71` = c(0,
0, 0.35565044656798, 0, 0, 0), `1_72` = c(0, 0, 5.86951313801941,
0, 0, 0)), .Names = c("Gene name", "1_1", "1_2", "1_3", "1_4",
"1_5", "1_6", "1_7", "1_8", "1_9", "1_10", "1_11", "1_12", "1_13",
"1_14", "1_15", "1_16", "1_17", "1_18", "1_19", "1_20", "1_21",
"1_22", "1_23", "1_24", "1_25", "1_26", "1_27", "1_28", "1_29",
"1_30", "1_31", "1_32", "1_33", "1_34", "1_35", "1_36", "1_37",
"1_38", "1_39", "1_40", "1_41", "1_42", "1_43", "1_44", "1_45",
"1_46", "1_48", "1_49", "1_50", "1_51", "1_52", "1_53", "1_54",
"1_55", "1_56", "1_57", "1_58", "1_59", "1_60", "1_61", "1_62",
"1_63", "1_64", "1_65", "1_66", "1_67", "1_68", "1_69", "1_70",
"1_71", "1_72"), row.names = c(NA, 6L), class = "data.frame")
This is the code I use to calculate the mean of the 3 replicates I have in the data frame:
## Calculating the mean of 3 "replicates"
ind <- c(1, 25, 49)
dat2 <- dat[-1]
tbl_end <- cbind(dat[1], sapply(0:23, function(i) rowMeans(dat2[ind+i])))
This is the error I get:
Error in `[.data.frame`(dat2, ind + i) : undefined columns selected
Called from: eval(substitute(browser(skipCalls = pos), list(pos = 9 - frame)),
envir = sys.frame(frame))
I have 71 columns of results, but there should be 72, because I have 24 fractions and 3 replicates. I have no idea why one column is missing, but I have to work around it. There is no 1_47, which should be averaged with 1_23 and 1_71. Do you have any idea how I can edit my function so that it ignores fraction 1_47 and still gives the mean of 1_23 and 1_71?
Why not just add in a dummy column for 1_47? That will make your data more regular and make it much easier to extract the indexes you need. To do this, try
# a name starting with a digit must be wrapped in backticks
dat2 <- cbind(dat[1:47], `1_47` = rep(NA, nrow(dat)), dat[48:72])
ind <- c(1, 25, 49)
# dat2 still contains the gene-name column, hence the extra + 1 in the index
tbl_end <- cbind(dat[1], sapply(0:23, function(i) rowMeans(dat2[ind + i + 1], na.rm = TRUE)))
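An alternative that avoids the dummy column is to derive the fraction from each column name and average whatever replicates exist. This is only a sketch on made-up toy data with the same naming scheme; it assumes every measurement column is named 1_k and that columns k, k + 24 and k + 48 are replicates of the same fraction, as in the question.

```r
# Toy data: same column naming as in the question, with 1_47 deliberately absent
cols <- paste0("1_", setdiff(1:72, 47))
vals <- as.data.frame(matrix(seq_len(2 * length(cols)), nrow = 2))
names(vals) <- cols
dat <- cbind(data.frame(`Gene name` = c("g1", "g2"), check.names = FALSE), vals)

# Column number k taken from the name "1_k"
num <- as.integer(sub("^1_", "", names(dat)[-1]))
# Columns k, k + 24 and k + 48 map to the same fraction 1..24
fraction <- (num - 1) %% 24 + 1

# Mean over the replicates that exist for each fraction
means <- sapply(split(names(dat)[-1], fraction), function(cn) rowMeans(dat[cn]))
tbl_end <- cbind(dat[1], means)
```

Here fraction 23 is averaged over 1_23 and 1_71 only, while every other fraction uses all three replicates, so a missing column does not shift any index.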
