Preparing Data with NAs for Sparse Matrix and XGBoost

I have a large data set consisting of factor variables, numeric variables, and a target column. I'm trying to feed it into xgboost properly, with the goal of building an xgb.DMatrix and training a model.
I'm confused about the processing needed to get my data frame into an xgb.DMatrix object. Specifically, I have NAs in both factor and numeric variables, and I want to make a sparse.model.matrix from my data frame before creating the xgb.DMatrix. The proper handling of the NAs is really tripping me up.
I have the following sample data frame df consisting of one binary categorical variable, two continuous variables, and a target. The categorical variable and one of the continuous variables have NAs:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 2 levels "0","1": 1 2 2 1 NA 2 1 1 NA 2
$ v2 : num 3.2 5.4 8.3 NA 7.1 8.2 9.4 NA 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 NA 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 NA 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
sparse.model.matrix from the Matrix package won't accept NAs; it drops those rows (which I don't want). So I'll need to change the NAs to a numeric placeholder like -999.
If I use the simple command:
df[is.na(df)] = -999
it only replaces the NAs in the numeric columns:
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 -999.0 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 -999.0 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
So I (think I) first need to convert the factor variables to numeric and then do the substitution. Doing that I get:
v1 v2 v3 target
1 1 3.2 22.1 0
2 2 5.4 44.1 0
3 2 8.3 57.0 1
4 1 -999.0 64.2 1
5 -999 7.1 33.1 0
6 2 8.2 56.9 0
7 1 9.4 71.2 0
8 1 -999.0 33.9 1
9 -999 9.9 89.3 0
10 2 4.2 97.2 0
But converting the variable back to a factor (I think this is necessary so xgboost will later know it's a factor), I get three levels:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 3 levels "-999","1","2": 2 3 3 2 1 3 2 2 1 3
$ v2 : num 3.2 5.4 8.3 -999 7.1 8.2 9.4 -999 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
I'm now not sure that making the sparse.model.matrix, and ultimately the xgb.DMatrix object, will be meaningful, because v1 appears messed up.
To make matters more confusing, xgb.DMatrix() has a missing argument that I can use to flag numeric values (such as -999) that represent NAs, but it can only be used with a dense matrix. If I submitted a dense matrix I'd simply keep the NAs and wouldn't need it; in the sparse matrix, where I have -999s, I can't use it.
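For illustration, here is roughly how I understand the missing argument to behave with a dense matrix (a rough sketch only; data.matrix() and the -999 sentinel are my illustration, not settled code):
library(xgboost)
X <- data.matrix(df[, c("v1", "v2", "v3")])  # factors become integer codes, NAs are kept
y <- as.numeric(as.character(df$target))     # 0/1 label from the factor target
d1 <- xgb.DMatrix(X, label = y)                  # NA is the default 'missing' value
X[is.na(X)] <- -999
d2 <- xgb.DMatrix(X, label = y, missing = -999)  # equivalent: flag the sentinel instead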
I hope I'm not overlooking something easy. I've been through xgboost.pdf extensively and searched Google.
Please help. Thanks in advance.

options(na.action = 'na.pass'), as mentioned by @mtoto, is the best way to deal with this problem. It makes sure that you don't lose any data while building the model matrix.
Regarding the XGBoost implementation specifically: when there are NAs, it checks which direction gives the higher gain when evaluating splits while growing a tree. For example, if a split on a variable var1 (range [0,1]) would be placed at 0.5, it computes the gain once with the NAs sent to the < 0.5 side and once with them sent to the > 0.5 side, and assigns the NAs to whichever direction yields the higher gain. So the NAs end up with a range, [0, 0.5] or [0.5, 1], but no actual value assigned (i.e. imputed) to them. Refer to the original author tqchen's comment on Aug 12, 2014.
If you impute -99xxx there, you are limiting the algorithm's ability to learn the NAs' proper range (conditional on the labels).
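A minimal sketch of the na.pass workflow, assuming the question's df with columns v1, v2, v3 and target (the formula, label conversion, and training parameters here are illustrative only):
library(Matrix)
library(xgboost)
options(na.action = "na.pass")                        # keep rows that contain NAs
X <- sparse.model.matrix(target ~ . - 1, data = df)   # NAs pass through into the sparse matrix
y <- as.numeric(as.character(df$target))
dtrain <- xgb.DMatrix(data = X, label = y)            # xgboost treats NA as missing by default
bst <- xgboost(data = dtrain, objective = "binary:logistic",
               nrounds = 20, verbose = 0)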

Related

Error in unique combination of keys in anova test

I am currently doing a repeated measure anova test and can't get over that one error message saying:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 82, 83
This is the formula I used:
res.aov <- anova_test(data = data1,
                      dv = AUC, wid = Observation,
                      within = c(SMIP, SMIB))
This is my data frame:
'data.frame': 410 obs. of 8 variables:
$ ID : Factor w/ 125 levels "1","4","6","7",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Observation : Factor w/ 11 levels "1","2","3","4",..: 1 2 1 2 3 1 1 2 3 4 ...
$ IDobs : Factor w/ 407 levels "1.1","1.2","4.1",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SMIP : num 2.26 2.26 1.05 1.05 1.05 1.11 1.23 1.23 1.23 1.23 ...
$ SMIB : num 4.84 4.84 1.95 1.95 1.95 4.78 4.34 4.34 4.34 4.34 ...
$ AUC : num 21.2 16.6 19.2 16.9 18.5 15.4 10.4 12.8 15.4 17.9 ...
My df has different IDs (patient 1, 4, 6) with several observations for each one (obs. 1, 2 for patient 1), and therefore I have no identical keys, even though rows can share the same values in other variables (such as AUC). BUT not every patient has the same number of observations; they vary from 1 to 11.
As you can see, I tried combining the columns ID and Observation (IDobs), but this didn't help.
I also tried removing the offending rows in my Excel sheet, which led to the same error message but with different row numbers.
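For reference, a rough way to see which rows share a key combination (assuming the keys anova_test checks are the wid column plus the within factors) would be something like:
library(dplyr)
# list every Observation/SMIP/SMIB combination that occurs more than once
data1 %>%
  count(Observation, SMIP, SMIB) %>%
  filter(n > 1)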
Any help or suggestions will be greatly appreciated.
Many thanks in advance!
Nigina

What's wrong with this code with permtest function in R?

ID pounds Drug
1 1 46.4 B
2 2 40.4 A
3 3 27.6 B
4 4 93.2 B
5 5 28.8 A
6 6 36.0 A
7 7 81.2 B
8 8 14.4 B
9 9 64.0 A
10 10 29.6 A
My code is
test <-permtest(data1$pounds[Drug=='A'],data1$pounds[Drug=='B'])
But I get an error saying object 'Drug' not found.
Help!
We need to extract the column with $ or [[. Here the code is searching for an object 'Drug' in the global environment, where it was never created; it exists only as a column inside 'data1'. So either use $/[[:
permtest(data1$pounds[data1$Drug=='A'],data1$pounds[data1$Drug=='B'])
Or use with():
with(data1, permtest(pounds[Drug == 'A'], pounds[Drug == 'B']))
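For completeness, the same call can also be written with [[ indexing instead of $ (just a minor variation on the above):
permtest(data1[["pounds"]][data1[["Drug"]] == "A"],
         data1[["pounds"]][data1[["Drug"]] == "B"])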

Descriptive stats for MI data in R: Take 3

As an R beginner, I have found it surprisingly difficult to figure out how to compute descriptive statistics on multiply imputed data (more so than running some of the other basic analyses, such as correlations and regressions).
These types of questions are prefaced with apologies (Descriptive statistics (Means, StdDevs) using multiply imputed data: R), but have not been answered (https://stats.stackexchange.com/questions/296193/pooling-basic-descriptives-from-several-multiply-imputed-datasets-using-mice) or are quickly downvoted.
Here is a description of a miceadds function (https://www.rdocumentation.org/packages/miceadds/versions/2.10-14/topics/stats0), which I find difficult to apply to data stored in the mids format.
I have gotten some output such as the mean, median, min, and max using summary(complete(imp)), but would love to know how to get additional summary output (e.g., skew/kurtosis, standard deviation, variance).
Illustration borrowed from a previous poster above:
> imp <- mice(nhanes, seed = 23109)
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
1 5 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
> summary(complete(imp))
age bmi hyp chl
1:12 Min. :20.40 1:18 Min. :113
2: 7 1st Qu.:24.90 2: 7 1st Qu.:186
3: 6 Median :27.40 Median :199
Mean :27.37 Mean :194
3rd Qu.:30.10 3rd Qu.:218
Max. :35.30 Max. :284
Would someone kindly take the time to illustrate how one might take the mids object to get the basic descriptives?
Below are some steps you can run to better understand what happens to the R objects at each stage. I would also recommend looking at this tutorial:
https://gerkovink.github.io/miceVignettes/
library(mice)
# nhanes object is just a simple dataframe:
data(nhanes)
str(nhanes)
# 'data.frame': 25 obs. of 4 variables:
#  $ age: num 1 2 1 3 1 3 1 1 2 2 ...
#  $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
#  $ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
#  $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
# You can generate multivariate imputations using the mice() function:
imp <- mice(nhanes, seed = 23109)
# The output is an object of class "mids", which you can explore using the str() function:
str(imp)
# List of 17
# $ call : language mice(data = nhanes)
# $ data :'data.frame': 25 obs. of 4 variables:
# ..$ age: num [1:25] 1 2 1 3 1 3 1 1 2 2 ...
# ..$ bmi: num [1:25] NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# ..$ hyp: num [1:25] NA 1 1 NA 1 NA 1 1 1 NA ...
# ..$ chl: num [1:25] NA 187 187 NA 113 184 118 187 238 NA ...
# $ m : num 5
# ...
# $ imp :List of 4
#  ..$ age: NULL
#  ..$ bmi:'data.frame': 9 obs. of 5 variables:
#  .. ..$ 1: num [1:9] 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
#  .. ..$ 2: num [1:9] 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
#  .. ..$ 3: num [1:9] 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
#  .. ..$ 4: num [1:9] 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
#  .. ..$ 5: num [1:9] 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# ...
# You can extract individual components of this object using $, for example
# to view the actual imputations for the bmi column:
imp$imp$bmi
# 1 2 3 4 5
# 1 28.7 27.2 22.5 27.2 28.7
# 3 30.1 30.1 30.1 22.0 28.7
# 4 22.7 27.2 20.4 22.7 20.4
# 6 24.9 25.5 22.5 21.7 21.7
# 10 30.1 29.6 27.4 25.5 25.5
# 11 35.3 26.3 22.0 27.2 22.5
# 12 27.5 26.3 26.3 24.9 22.5
# 16 29.6 30.1 27.4 30.1 25.5
# 21 33.2 30.1 35.3 22.0 22.7
# The above output is again just a regular dataframe:
str(imp$imp$bmi)
# 'data.frame': 9 obs. of 5 variables:
# $ 1: num 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
# $ 2: num 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
# $ 3: num 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
# $ 4: num 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
# $ 5: num 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# The complete() function returns an imputed dataset:
mat <- complete(imp)
# The output of this function is a regular data frame:
str(mat)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num 28.7 22.7 30.1 22.7 20.4 24.9 22.5 30.1 22 30.1 ...
# $ hyp: num 1 1 1 2 1 2 1 1 1 1 ...
# $ chl: num 199 187 187 204 113 184 118 187 238 229 ...
# So you can run any descriptive statistics you need with this object
# Just like you would do with a regular dataframe:
summary(mat)
# age bmi hyp chl
# Min. :1.00 Min. :20.40 Min. :1.00 Min. :113.0
# 1st Qu.:1.00 1st Qu.:24.90 1st Qu.:1.00 1st Qu.:187.0
# Median :2.00 Median :27.50 Median :1.00 Median :204.0
# Mean :1.76 Mean :27.48 Mean :1.24 Mean :204.9
# 3rd Qu.:2.00 3rd Qu.:30.10 3rd Qu.:1.00 3rd Qu.:229.0
# Max. :3.00 Max. :35.30 Max. :2.00 Max. :284.0
There are several mistakes in both your code and the answer from Katia, and the link provided by Katia is no longer available.
To compute even simple statistics after multiple imputation, you must follow Rubin's rules, which is the pooling method mice applies to a selected set of model fits.
When using
library(mice)
imp <- mice(nhanes, seed = 23109)
mat <- complete(imp)
mat
age bmi hyp chl
1 1 28.7 1 199
2 2 22.7 1 187
3 1 30.1 1 187
4 3 22.7 2 204
5 1 20.4 1 113
6 3 24.9 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 30.1 1 229
11 1 35.3 1 187
12 2 27.5 1 229
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 238
16 1 29.6 1 238
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 206
21 1 33.2 1 238
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 284
25 2 27.4 1 186
This only returns the first imputed dataset, whereas you imputed five by default. See ?mice::complete for more information: "The default, action = 1L, returns the first imputed data set."
To get all five imputed datasets, you have to specify the action argument of mice::complete:
mat2 <- complete(imp, "long")
mat2
.imp .id age bmi hyp chl
1 1 1 1 28.7 1 199
2 1 2 2 22.7 1 187
3 1 3 1 30.1 1 187
4 1 4 3 22.7 2 204
5 1 5 1 20.4 1 113
6 1 6 3 24.9 2 184
7 1 7 1 22.5 1 118
8 1 8 1 30.1 1 187
9 1 9 2 22.0 1 238
10 1 10 2 30.1 1 229
...
115 5 15 1 29.6 1 187
116 5 16 1 25.5 1 187
117 5 17 3 27.2 2 284
118 5 18 2 26.3 2 199
119 5 19 1 35.3 1 218
120 5 20 3 25.5 2 218
121 5 21 1 22.7 1 238
122 5 22 1 33.2 1 229
123 5 23 1 27.5 1 131
124 5 24 3 24.9 1 186
125 5 25 2 27.4 1 186
Both summary(mat) and summary(mat2) are misleading.
Let's focus on bmi. The first gives the mean bmi over the first imputed dataset only. The second gives the mean of an artificial dataset that is m times larger than the original, and that stacked dataset also has an inappropriately low variance.
mean(mat$bmi)
27.484
mean(mat2$bmi)
26.5192
I have not found a better solution than applying Rubin's rules manually to the mean estimate. The correct point estimate is simply the mean of the estimates across all imputed datasets:
res <- with(imp, mean(bmi))      # the mean for each imputed dataset, stored in res$analyses
do.call(sum, res$analyses) / 5   # average over the m = 5 mean estimates
26.5192
The variance / standard deviation has to be pooled appropriately as well. You can use Rubin's rules to compute any simple statistic you wish; the procedure is described here: https://bookdown.org/mwheymans/bookmi/rubins-rules.html
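For example, here is a rough sketch of pooling the mean of bmi by hand with Rubin's rules (the within-imputation variance term uses var(x)/n as the squared standard error of the mean, which is my assumption about the estimator, not something mice returns here):
library(mice)
imp  <- mice(nhanes, seed = 23109, printFlag = FALSE)
m    <- imp$m                                  # number of imputations (5 by default)
long <- complete(imp, "long")                  # all m imputed datasets stacked
est <- tapply(long$bmi, long$.imp, mean)       # per-dataset point estimates
se2 <- tapply(long$bmi, long$.imp,
              function(x) var(x) / length(x))  # per-dataset squared standard errors
qbar <- mean(est)                              # pooled point estimate (Rubin's Q-bar)
ubar <- mean(se2)                              # within-imputation variance
b    <- var(est)                               # between-imputation variance
total_var <- ubar + (1 + 1/m) * b              # Rubin's total variance
c(pooled_mean = qbar, pooled_se = sqrt(total_var))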
Hope this helps.

How to compute the mean for the last few rows in each time period in a data frame?

I have data collected for a few subjects every 15 seconds over an hour, split up into periods. Here's what the data frame looks like: the time is "Temps", subjects are "Sujet", and the periods are given by "Palier".
'data.frame': 2853 obs. of 22 variables:
$ Temps : Factor w/ 217 levels "00:15","00:30",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sujet : int 1 1 1 1 1 1 1 1 1 1 ...
$ Test : Factor w/ 3 levels "VO2max","Tlim",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Palier : int 1 1 1 1 1 1 1 1 1 1 ...
$ RPE : int 8 8 8 8 8 8 8 8 8 8 ...
$ Rmec : num 39.1 27.5 23.3 21.5 20.3 21.7 20.5 20.7 20.2 20.1 ...
Here is a glimpse of the data frame:
Temps Sujet Test Palier RPE Rmec Pmec Pchim Fr Vt VE FeO2 FeCO2 VO2 VCO2 RER HR VO2rel VE.VO2 VE.VCO2
1 00:15 1 VO2max 1 8 39.1 185 473.6 19 1854 34.60 16.24 4.48 1353 1268 0.94 121 17.6 0.02557280 0.02728707
2 00:30 1 VO2max 1 8 27.5 185 672.4 17 2602 44.30 15.77 4.78 1921 1731 0.90 124 25.0 0.02306091 0.02559214
3 00:45 1 VO2max 1 8 23.3 185 794.5 18 2793 50.83 15.63 4.85 2270 2015 0.89 131 29.6 0.02239207 0.02522581
4 01:00 1 VO2max 1 8 21.5 185 860.3 20 2756 55.76 15.68 4.88 2458 2224 0.90 137 32.0 0.02268511 0.02507194
5 01:15 1 VO2max 1 8 20.3 185 909.3 23 2709 61.26 15.84 4.88 2598 2446 0.94 139 33.8 0.02357968 0.02504497
6 01:30 1 VO2max 1 8 21.7 185 853.7 21 2899 59.85 16.00 4.89 2439 2395 0.98 140 31.8 0.02453875 0.02498956
Each "Palier" lasts about 5 min and there are from 5 to 10 "Palier". For each subject and "Palier", I need to compute the mean for the last 2 min for all the variables. I haven't figured it out yet with dcast() or ddply(), but I am a newbie!
Any help would be much appreciated!
If you turned it into a data.table (you'd have to install the data.table package), you could do this with:
library(data.table)
dt = as.data.table(d) # assuming your existing data frame was called d
last.two.min = dt[, mean(tail(Rmec, 9)), by=Sujet]
This assumes that your original data frame is called d and that you want the last 9 rows (since the data are every 15 seconds; you might want the last 8 if you mean 58:15 through 60:00).
I assumed Rmec was the variable you want the mean of. If there are several variables you want means for, you can do something like:
last.two.min = dt[, list(mean.Rmec = mean(tail(Rmec, 9)),
                         mean.RPE  = mean(tail(RPE, 9))), by = Sujet]
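Since the question asks for the mean per subject and per "Palier", a sketch grouping on both keys (taking the last 8 rows = the last 2 minutes at 15-second resolution; the column list is just an example) might look like:
library(data.table)
dt <- as.data.table(d)   # d is the original data frame, as above
# mean of the last 8 rows within each subject and Palier
last.two.min <- dt[, lapply(.SD, function(x) mean(tail(x, 8))),
                   by = .(Sujet, Palier),
                   .SDcols = c("Rmec", "RPE")]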

Similar to Excel VLOOKUP

Hi,
I have a 10-year, 5-minute resolution data set of dust concentration,
and, separately, a 15-year data set with daily resolution of the synoptic classification.
How can I combine these two data sets? They are not the same length or resolution.
Here is a sample of the data:
> head(synoptic)
date synoptic
1 01/01/1995 8
2 02/01/1995 7
3 03/01/1995 7
4 04/01/1995 20
5 05/01/1995 1
6 06/01/1995 1
> head(beit.shemesh)
X........................ StWd SHT PRE GSR RH Temp WD WS PM10 CO O3
1 NA 64 19.8 0 -2.9 37 15.2 61 2.2 241 0.9 40.6
2 NA 37 20.1 0 1.1 38 15.2 344 2.1 241 0.9 40.3
3 NA 36 20.2 0 0.7 39 15.1 32 1.9 241 0.9 39.4
4 NA 52 20.1 0 0.9 40 14.9 20 2.1 241 0.9 38.7
5 NA 42 19.0 0 0.9 40 14.6 11 2.0 241 0.9 38.7
6 NA 75 19.9 0 0.2 40 14.5 341 1.3 241 0.9 39.1
No2 Nox No SO2 date
1 1.4 2.9 1.5 1.6 31/12/2000 24:00
2 1.7 3.1 1.4 0.9 01/01/2001 00:05
3 2.1 3.5 1.4 1.2 01/01/2001 00:10
4 2.7 4.2 1.5 1.3 01/01/2001 00:15
5 2.3 3.8 1.5 1.4 01/01/2001 00:20
6 2.8 4.3 1.5 1.3 01/01/2001 00:25
Any ideas?
Make an extra column holding the parsed dates, and then merge. To do this, you have to create a variable with the same name in each data frame, hence you first need some renaming. Also make sure that the merge column has the same type in both data frames:
beit.shemesh$datetime <- beit.shemesh$date
beit.shemesh$date <- as.Date(beit.shemesh$datetime, format = "%d/%m/%Y")
synoptic$date <- as.Date(synoptic$date, format = "%d/%m/%Y")
merge(synoptic, beit.shemesh, by = "date", all.y = TRUE)
Using all.y = TRUE keeps the beit.shemesh dataset intact. If you also want empty rows for all non-matching rows in synoptic, you could use all = TRUE instead.
