How to input data in xgb.train function in R? - r

I'm going to run xgboost in R using the xgb.train function.
I know that, in order to use xgb.train, the input data must first be converted with the xgb.DMatrix function.
But when I applied this function to my data set, I got an error message:
Error in xgb.DMatrix(data = as.matrix(train)) :
[09:01:01] amalgamation/../dmlc-core/src/io/local_filesys.cc:66: LocalFileSystem.GetPathInfo 1 Error:No such file or directory
My full R code follows. How should I transform the input data so that it can be used?
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) credit[,i]=as.factor(credit[,i])
str(credit)
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=credit$Creditability, p=0.7, list=FALSE)
train<-credit[intrain, ]
test<-credit[-intrain, ]
d_train<-xgb.DMatrix(data=as.matrix(train))

If you still want to use factors you should use the model.matrix() function to convert your factors to dummy variables.
For example:
my.dat <- mtcars[c("mpg","cyl","disp")]
my.dat$cyl <- as.factor(my.dat$cyl)
# Convert data frame to X matrix
x.train <- model.matrix(mpg~.,data=my.dat)
head(x.train)
Output:
(Intercept) cyl6 cyl8 disp
Mazda RX4 1 1 0 160
Mazda RX4 Wag 1 1 0 160
Datsun 710 1 0 0 108
Hornet 4 Drive 1 1 0 258
Hornet Sportabout 1 0 1 360
Valiant 1 1 0 225
This creates dummy variables cyl6 and cyl8 where 4 cylinder vehicles would be the base group (where cyl6=0 and cyl8=0).
Then you can pass this matrix into the xgb.DMatrix function:
d_train<-xgb.DMatrix(x.train,label=my.dat$mpg)
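To see why as.matrix() on a data frame with factor columns is a problem, here is a minimal sketch on a small made-up data frame (the names toy, grp, x and y are hypothetical, not from the question). as.matrix() falls back to a character matrix as soon as one column is a factor, and the error message in the question suggests that xgb.DMatrix then tried to read a value as a file name. model.matrix() returns a numeric matrix instead.

```r
# Hypothetical toy data: one factor column among numeric ones
toy <- data.frame(y   = c(1, 0, 1, 0),
                  grp = factor(c("a", "b", "a", "c")),
                  x   = c(10, 20, 30, 40))

# as.matrix() coerces everything to character because of the factor
m_chr <- as.matrix(toy)
typeof(m_chr)

# model.matrix() expands the factor into 0/1 dummy columns and stays numeric
m_num <- model.matrix(y ~ ., data = toy)
typeof(m_num)
colnames(m_num)
```

A numeric matrix like m_num is the kind of input that can be passed to xgb.DMatrix(), with label = toy$y, in the same way as x.train in the answer.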

Related

Multiply two matrices (sumproduct for multiple functions) - to get Fishers discriminant linear function scores

I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant=1 variable (you'll see why when we get to the coefficients)
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficients matrix (the Fisher's discriminant linear functions)
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in the test data by the respective coefficients to get 3 function scores per row. Here is how the values would be calculated:
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a sumproduct of the coefficients against each row, done 3 times (once for each function).
So the multiplied df/matrix should look like this: the same number of rows, with 3 function score variables
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
Not ideal, but I'm currently taking the data out and doing it in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks
Are you just looking for the inner product?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
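As a quick sanity check, the first score can be compared with the hand calculation from the question. The sketch below also reorders the coefficient columns by name before multiplying. That step is an extra precaution, assumed here to be useful in case the two tables list the variables in a different order. The names scores and by_hand are introduced for this example.

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
                    mpg  = c(4.01, 3.49, 3.69),
                    disp = c(0.14, 0.15, 0.22),
                    hp   = c(0.13, 0.15, 0.20))
rownames(coefs) <- c("Function1", "Function2", "Function3")

# Align the coefficient columns to the variable order of the test data
coefs <- coefs[, colnames(testdata)]

# Rows of testdata times rows of coefs: one score per case and function
scores <- as.matrix(testdata) %*% t(as.matrix(coefs))

# Hand calculation for the first row and Function1, as in the question
by_hand <- 1 * (-67.67) + 21 * 4.01 + 160 * 0.14 + 110 * 0.13
all.equal(unname(scores[1, "Function1"]), by_hand)
```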

Using results from the R console and R code to generate the name of an object into R script

I'm interested in using results from the R console and R code to generate the name of an object in an R script.
I created linear regressions with 1 IV each, where the DV is mpg and the 2 IVs of interest are cyl and disp. I call the generic model object lm_DVmpg. I used the code paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "") to generate a unique name for the object containing the lm() in the console. I then pasted the result into the script and used that text as the name of the IV-specific regression object.
For example, the name of the object using the IV of cyl is lm_DVmpg_IVcyl, and the name of the object disp is lm_DVmpg_IVdisp.
Here is some example code from the console:
> ## first lm
> # ---- NOTE: Iv is cyl
> # ---- NOTE: creates lm() object
> lm_DVmpg <- lm(mpg~cyl, data = mtcars)
> # ---- NOTE: creates unique name for lm() object
> paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
[1] "lm_DVmpg_IVcyl"
> # ---- NOTE: moves lm_DVmpg to object unique to IV used
> lm_DVmpg_IVcyl <- lm_DVmpg
Is there any way to use R code in the R script file to change the name of the lm_DVmpg code in the line lm_DVmpg <- lm(mpg~cyl, data = mtcars) say to lm_DVmpg_IVcyl, derived from the paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "") command, using R code without having to copy/paste the text lm_DVmpg_IVcyl from the console to the appropriate R script line?
Alternatively, is there a way to get the text lm_DVmpg_IVcyl into the line lm_DVmpg_IVcyl <- lm_DVmpg, derived from the paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "") command, using R code without having to copy/paste from the console after running the code paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")?
Any help would be appreciated.
FYI, I use a 2013 Macbook Pro with a 2.4 GHz dual-core intel chip, 8 GB of ram, macOS big sur 11.2.2, RStudio Version 1.4.1106, and the R Base Package 4.04.
Thanks.
The R script I used is below:
# Naming objects from returns on console
# ---- NOTE: will do several linear regressions using the mtcars dataset, with IV and the DV being "mpg"
# ---- NOTE: IVs - cyl, disp
## gives info on dataset
head(mtcars)
str(mtcars)
colnames(mtcars)
## first lm
# ---- NOTE: Iv is cyl
# ---- NOTE: creates lm() object
lm_DVmpg <- lm(mpg~cyl, data = mtcars)
# ---- NOTE: creates unique name for lm() object
paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
# ---- NOTE: moves lm_DVmpg to object unique to IV used
lm_DVmpg_IVcyl <- lm_DVmpg
## second lm
# ---- NOTE: Iv is disp
# ---- NOTE: creates lm() object
lm_DVmpg <- lm(mpg~disp, data = mtcars)
# ---- NOTE: creates unique name for lm() object
paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
# ---- NOTE: moves lm_DVmpg to object unique to IV used
lm_DVmpg_IVdisp <- lm_DVmpg
## displays summaries of regression objects
# ---- NOTE: latest regression object summary
summary(lm_DVmpg)
# ---- NOTE: unique regression objects
lm_DVmpg_IVcyl
lm_DVmpg_IVdisp
# ---- NOTE: lm_DVmpg = lm_DVmpg_IVdisp, since the lm() with IV == disp was the latest to be run
Here are my results from the console when running the R script.
> # Naming objects from returns on console
> # ---- NOTE: will do several linear regressions using the mtcars dataset, with IV and the DV being "mpg"
> # ---- NOTE: IVs - cyl, disp
>
> ## gives info on dataset
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
>
> ## first lm
> # ---- NOTE: Iv is cyl
> # ---- NOTE: creates lm() object
> lm_DVmpg <- lm(mpg~cyl, data = mtcars)
> # ---- NOTE: creates unique name for lm() object
> paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
[1] "lm_DVmpg_IVcyl"
> # ---- NOTE: moves lm_DVmpg to object unique to IV used
> lm_DVmpg_IVcyl <- lm_DVmpg
>
> ## second lm
> # ---- NOTE: Iv is disp
> # ---- NOTE: creates lm() object
> lm_DVmpg <- lm(mpg~disp, data = mtcars)
> # ---- NOTE: creates unique name for lm() object
> paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
[1] "lm_DVmpg_IVdisp"
> # ---- NOTE: moves lm_DVmpg to object unique to IV used
> lm_DVmpg_IVdisp <- lm_DVmpg
>
> ## displays summaries of regression objects
> # ---- NOTE: latest regression object summary
> summary(lm_DVmpg)
Call:
lm(formula = mpg ~ disp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.8922 -2.2022 -0.9631 1.6272 7.2305
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
disp -0.041215 0.004712 -8.747 9.38e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
> # ---- NOTE: unique regression objects
> lm_DVmpg_IVcyl
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
> lm_DVmpg_IVdisp
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
> # ---- NOTE: lm_DVmpg = lm_DVmpg_IVdisp, since the lm() with IV == disp was the latest to be run
I am not sure why you need this, but I think assign() + get() is what you are looking for.
lm_DVmpg <- lm(mpg~cyl, data = mtcars)
new_name <- paste("lm_DVmpg","_IV", as.character(lm_DVmpg$call$formula[[3]]), sep = "")
# store the fitted model under the generated name
assign(new_name, lm_DVmpg)
get(new_name)
#Call:
#lm(formula = mpg ~ cyl, data = mtcars)
#Coefficients:
#(Intercept) cyl
# 37.885 -2.876
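A common alternative to assign()/get(), sketched here as a suggestion rather than as what the question asked for, is to keep the models in a named list so that the generated names become list names. The names ivs and models are made up for this example.

```r
ivs <- c("cyl", "disp")

# Fit one model per IV; reformulate() builds mpg ~ <iv> from strings
models <- lapply(ivs, function(iv) {
  lm(reformulate(iv, response = "mpg"), data = mtcars)
})

# Name each list element after its IV
names(models) <- paste0("lm_DVmpg_IV", ivs)

names(models)
coef(models[["lm_DVmpg_IVcyl"]])
```

Each model is then available as models[["lm_DVmpg_IVcyl"]] or models$lm_DVmpg_IVdisp, without copying names from the console.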

Predicting data via regression model and storing in a vector

Apologies for what is probably a very basic question.
I have created a linear model for a massive meteorological dataset using multiple regression. My goal is to use that model to "predict" data during a certain period using predictors 1, 2 and 3. I will then compare those predicted data to the observed data for that period.
My approach thus far has been to create a new vector for the predicted values and loop through the vector, creating predicted values based on the extracted coefficients of the linear model. Then, I will simply subtract the predicted values from the observed values. For some reason, this approach results in the new predicted vector being NULL. Any idea how I could approach this?
A sample is below. "data" refers to the dataset containing the predictors.
coef <- coefficients(multipleRegressionModel)
predictedValues=c()
for(i in 1:length(data$timePeriod)){
predictedValues[i] = append(predictedValues, data$coef[1]+data$predictor1[i]*data$coef[2]+data$predictor2[i]*data$coef[3]+
data$predictor3[i]*data$coef[4])
}
diff=c()
diff=observedValues - predictedValues
It looks like you are making this more difficult than it needs to be. R has a predict() function that does all of this for you. If you had a sample data.frame like so:
set.seed(26)
mydf = data.frame (a=1:20 , b = rnorm(20),
c = 1:20 + runif(20,2,3)*runif(20, 2, 3),
d = 1:20 + rpois(20,5)*runif(1:20)*sin(1:20))
And you wanted to train on some rows, and test on the others
trainRows<-sample(1:20, 16)
mydf.train<-mydf[trainRows,]
mydf.test<-mydf[-trainRows,]
Then fit the model and predict
model<-lm(a~b+c+d, data = mydf.train)
summary(model) #gives info about your model.
mydf.test$pred<-predict(model, newdata = mydf.test)
MSE<-mean((mydf.test$pred-mydf.test$a)^2) #calculate mean squared error
MSE
#[1] 0.06321
View the predictions with mydf.test$pred
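One likely reason the loop in the question fails is that data$coef looks for a column called coef inside the data frame. No such column exists, so the expression evaluates to NULL. The sketch below uses mtcars as stand-in data (the model and the choice of predictors are assumptions for illustration). It shows the NULL and checks that the coefficient arithmetic, done on whole columns, matches predict().

```r
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
b <- coef(fit)

# A data frame has no column named "coef", so this is NULL
is.null(mtcars$coef)

# Vectorized version of the manual calculation: no loop needed
manual <- b[1] + mtcars$disp * b[2] + mtcars$hp * b[3] + mtcars$wt * b[4]

# Same numbers as predict()
all.equal(unname(manual), unname(predict(fit, newdata = mtcars)))

# Observed minus predicted
resid_manual <- mtcars$mpg - manual
```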
Here is a simple example using a glm on the mtcars data.
Line<- #setting up the linear model function
function (train_dat, test_dat, variables, y_var, family = "gaussian")
{
fm <- as.formula(paste(y_var, " ~", paste(variables, collapse = "+"))) #formula
glm1 <- glm(fm, data = train_dat, family = family) #run the model
pred <- predict(glm1, newdata = test_dat) #predict on the test data
pred #return the predictions explicitly
}
data(mtcars)
y_var<-'mpg'
x_vars<-setdiff(names(mtcars),y_var)
mtcars[,'linear_prediction']<-Line(mtcars,mtcars,x_vars,y_var)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb linear_prediction
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22.59951
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 22.11189
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.25064
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.23740
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.69343
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 20.38304

Subsets of a dataset as separate dendrograms, but in the same plot

I know I can plot a dendrogram as follows
library(cluster)
d <- mtcars
d[,8:11] <- lapply(d[,8:11], as.factor)
gdist <- daisy(d, metric = c("gower"), stand = FALSE)
dendro <- hclust(gdist, method = "average")
plot(as.dendrogram(dendro))
However I have some groups identified (eg. by an iterative classification method), given as the last column in d
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
d$Group <- G
head(d)
mpg cyl disp hp drat wt qsec vs am gear carb Group
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 4
I am trying to plot all the dendrograms together on the same plot with the same scale. The groups with only a single member (groups 6 and 7) also need to be plotted.
I am able to plot individual dendrograms for subsets of the data, except when a group has only one member. But I don't think this is the right approach.
layout(matrix(1:9, 3,3,byrow=TRUE))
gdist <- as.matrix(gdist)
N <- max(G)
for (i in 1:N){
rc_tokeep <- row.names(subset(d, G==i))
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend))
}
The loop gives this error for the last two groups (6 and 7), which have only a single member.
Error in hclust(dis, method = "average") :
must have n >= 2 objects to cluster
Essentially I want to reproduce this type of plot. The clusters with single members are also plotted here.
If you want to mimic the last few graphs, you can do something like this:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){ #The idea is to catch the groups with one single element to plot them differently (hclust needs at least 2 objects)
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,
xlim=c(.8,0),axes=FALSE) # giving the same xlim will scale all of them, here i used 0.8 to fit your data but you can change it to whatever
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5) #I don't know how you intend to compute the length of the branch in a group of 1 element, you might want to change that
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
}
}
With your example it gives:
If you want to add the scale you can add a grid in all graphs and a scale in the last one:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1))
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3) #Here the grid
}
}
axis(1,at=seq(0,.8,.1)) #Here the axis
And finally if you want to even the spaces between the different branches in the resulting plot, you can use table(d$Group) to get the number of members of each group and use it as a height for each subplot:
N <- max(G)
layout(matrix(c(0,1:N,0),nc=1), heights=c(3,table(d$Group),3)) #Plus the height of the empty spaces.
gdist <- as.matrix(gdist)
for (i in 1:N){
par(mar=c(0,3,0,7))
rc_tokeep <- row.names(subset(d, G==i))
if(length(rc_tokeep)>1){
dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
dend <- hclust(dis, method = "average")
plot(as.dendrogram(dend),horiz=TRUE,xlim=c(.8,0),xaxt="n",yaxt="n")
abline(v=seq(0,.8,.1),lty=3)
}else{
plot(NA,xlim=c(.8,0),ylim=c(0,1),axes=F,ann=F)
segments(0,.5,.1,.5)
text(0,.5, pos=4,rc_tokeep,xpd=TRUE)
abline(v=seq(0,.8,.1),lty=3)
}
}
axis(1,at=seq(0,.8,.1))
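The group sizes drive both parts of this answer: hclust() needs at least two members, and the subplot heights come from table(d$Group). A small check, using only the G vector from the question (sizes and singletons are names introduced here), shows which groups need the fallback branch.

```r
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)

# Number of members per group, in group order
sizes <- table(G)
sizes

# Groups that cannot be clustered on their own
singletons <- names(sizes)[sizes < 2]
singletons
```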

R flag cases with missingness from regression analysis

When running a regression analysis in R (using glm) cases are removed due to 'missingness' of the data. Is there any way to flag which cases have been removed? I would ideally like to remove these from my original dataframe.
Many thanks
The model fit object returned by glm() records the row numbers of the data that it excludes for their incompleteness. They are a bit buried but you can retrieve them like this:
## Example data.frame with some missing data
df <- mtcars[1:6, 1:5]
df[cbind(1:5,1:5)] <- NA
df
# mpg cyl disp hp drat
# Mazda RX4 NA 6 160 110 3.90
# Mazda RX4 Wag 21.0 NA 160 110 3.90
# Datsun 710 22.8 4 NA 93 3.85
# Hornet 4 Drive 21.4 6 258 NA 3.08
# Hornet Sportabout 18.7 8 360 175 NA
# Valiant 18.1 6 225 105 2.76
## Fit an example model, and learn which rows it excluded
f <- glm(mpg~drat,weights=disp, data=df)
as.numeric(na.action(f))
# [1] 1 3 5
Alternatively, to get the row indices without having to fit the model, use the same strategy with the output of model.frame():
as.numeric(na.action(model.frame(mpg~drat,weights=disp, data=df)))
# [1] 1 3 5
Without a reproducible example I can't provide code tailored to your problem, but here's a generic method that should work. Assume your data frame is called df and your variables are called y, x1, x2, etc. And assume you want y, x1, x3, and x6 in your model.
# Make a vector of the variables that you want to include in your glm model
# (Be sure to include any weighting or subsetting variables as well, per Josh's comment)
glm.vars = c("y","x1","x3","x6")
# Create a new data frame that includes only those rows with no missing values
# for the variables that are in your model
df.glm = df[complete.cases(df[ , glm.vars]), ]
Also, if you want to see just the rows that have at least one missing value, do the following (note the addition of ! (the "not" operator)):
df[!complete.cases(df[ , glm.vars]), ]
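The two answers can be combined on the example data from the first one. The sketch below adds a logical flag column to the data frame and checks that it agrees with the rows reported by na.action(). The column name dropped and the vector model_vars are made up for this example.

```r
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

# Variables used by the model, including the weights variable
model_vars <- c("mpg", "drat", "disp")

# TRUE for rows the model would drop
df$dropped <- !complete.cases(df[, model_vars])
which(df$dropped)

# Rows reported by the model frame itself
mf <- model.frame(mpg ~ drat, weights = disp, data = df)
as.integer(na.action(mf))
```

Rows 2 and 4 also contain an NA, but in cyl and hp, which the model does not use, so they are not flagged.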
