Column changes from "WinorLoss" to "Class" - r

I am working on constructing a logistic model in R (I am a beginner and am following a tutorial on building logistic models). I have done the following, and everything works, but when I run the downSample function the column named "WinorLoss" changes to "Class", and I am sure this causes issues downstream.
Could anyone please let me know if what I am doing makes sense or is there big errors I am making?
# read in the data and inspect it
my_data <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE)
my_data
str(my_data)
library(mlbench)
glm(Map ~ WinorLoss, family="binomial", data=my_data)
table(my_data$Map)
table(my_data$WinorLoss)
# recode the response as a 0/1 factor (W = 1)
my_data$WinorLoss <- ifelse(my_data$WinorLoss == "W", 1, 0)
my_data$WinorLoss <- factor(my_data$WinorLoss, levels = c(0, 1))
my_data
table(my_data$WinorLoss)
# 70/30 train/test split, stratified on the response
library(caret)
'%ni%' <- Negate('%in%')  # "not in" helper
options(scipen=999)
set.seed(100)
trainDataIndex <- createDataPartition(my_data$WinorLoss, p=0.7, list=F)
trainData <- my_data[trainDataIndex, ]
testData <- my_data[-trainDataIndex, ]
trainData
testData
table(trainData$WinorLoss)
table(testData$WinorLoss)
# downsample the majority class in the training set
set.seed(100)
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
                         y = trainData$WinorLoss)
down_train
When I print trainData the columns returned are Date, Event, opponent, Map, Score, WinorLoss, winner, but when I print down_train the columns become Date, Event, opponent, Map, Score, winner, Class.
Help Please!

Yep, downSample and some of the other caret functions do that by default, unless specified otherwise.
If you have a question about a particular function, try its manual page first:
?downSample
Running this shows all of the arguments and their defaults:
downSample(x, y, list = FALSE, yname = "Class")
So by default the function sets yname to "Class", which is what you are seeing.
Thus to get your desired output:
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
                         y = trainData$WinorLoss,
                         yname = "WinorLoss")

Related

r mice - "sample" imputation method not working correctly

I am using mice to impute missing data in a large dataset (24k obs, 98 vars). I am using the "sample" imputation method for some variables (and other methods for the others - many categorical). When I check my imputed data, the variables I've applied "sample" to are not always imputed and I still have missingness in them. I know for sure that I'm applying "sample" to them (I double-checked the methods vector), and I made sure to remove all of their predictors in the prediction matrix. From my understanding, where they sit in the visit sequence shouldn't matter (but I make sure they come immediately after variables with no missingness).
I can't give you a reprex because when I try to recreate the problem, it doesn't happen and everything is imputed just fine. I tried simulating my own data, and I tried subsetting the dataset to a group of the variables I want to use the sample method on. That's part of why I'm so stumped - I coded everything the same and it worked on the subset. I didn't think the sample method would depend on the presence of any other variables.
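For reference, this is the kind of minimal example I tried (toy data, not my real dataset), where "sample" fills every missing value by drawing at random from the observed values, with no predictors needed:
library(mice)
set.seed(1)
toy <- data.frame(a = c(1, 2, NA, 4, NA), b = rnorm(5))
imp <- mice(toy, m = 1, method = c("sample", ""), print = FALSE)
sum(is.na(complete(imp)$a))  # 0 - all missing values imputed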
EDIT:
This is the code I'm using:
# produce prediction matrix
pred1 <- quickpred_ext(data1, mincor = 0.08, include = "age")
pred2 <- pred1
# for vars to not be imputed, set all predictors to 0
data_no_impute <- data1 %>%
  select(contains(c("exp_", "outcome_"))) %>%
  select(sort(names(.))) %>%
  names
data_level3 <- data1 %>%
  select(contains(c("f4", "f5", "f6")), k22) %>%
  select(sort(names(.))) %>%
  names
pred2[data_no_impute, ] <- 0
pred2[data_level3, ] <- 0
# produce initial methods and visit sequence
initial <- mice(data1, max = 0, print = F, vis = "monotone",
                defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
# edit methods to be blank for vars I don't want to impute, "sample" for level 3
meth1 <- initial$meth
meth2 <- meth1
meth2[data_level3] <- "sample"
meth2[data_no_impute] <- ""
visits1 <- initial$visitSequence
visits2 <- visits1
visits2 <- append(visits2, data_level3, 22)
# run mice test
mice_test <- mice(data1, m = 2, print = F,
                  predictorMatrix = pred2,
                  method = meth2,
                  vis = visits2,
                  nnet.MaxNWts = 3000)
# pull second completed dataset
imput1 <- mice::complete(mice_test, 2, include = F)
# look at missingness patterns
missingness_pattern2 <- md.pattern(imput1, plot = F)

An R function cannot work in the local environment of other functions

I use the MatchIt package for propensity score matching. It can generate matched data after matching using the get_matches() function.
However, if I do not run get_matches() in the global environment but instead call it inside another function, the matched data cannot be found in the local environment. (This turned out to be misleading: there is nothing wrong with MatchIt's output. The answer by Noah explains the issue better.)
To produce my data:
dataGen <- function(b0, b1, n = 2000, cor = 0){
  # covariates
  sigma <- matrix(rep(cor, 9), 3, 3)
  diag(sigma) <- rep(1, 3)
  cov <- MASS::mvrnorm(n, rep(0, 3), sigma)
  # error
  error <- rnorm(n, 0, sqrt(18))
  # treatment variable
  logit <- b0 + b1*cov[,1] + 0.3*cov[,2] + cov[,3]
  p <- 1/(1+exp(-logit))
  treat <- rbinom(n, 1, p)
  # outcome variable
  y <- error + treat + cov[,1] + cov[,2]
  data <- as.data.frame(cbind(cov, treat, y))
  return(data)
}
set.seed(1)
data <- dataGen(b0 = -0.92, b1 = 0.8, 900)
For example, the following works; est.m.WLS() can use m.data:
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out, data = data)
est.m.WLS <- function(m.data, fm2){
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data, fm2)
But the next version does not work. It reports:
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
est.m.WLS(m.out, fm2)
Since I want to run parallel loops using the groundhog library for simulation purposes, the get_matches() function also fails inside a foreach() %dopar% {...} environment.
res <- foreach(s = 1:7, .combine = "rbind") %dopar% {
  m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
  m.data <- MatchIt::get_matches(m.out, data = data)
  ...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using a for() loop directly does not run into this problem, since it works in the global environment, but it is too slow... I really hope to run the thousand simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem occurs when lmtest::coeftest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested to be fixed. The problem is that expand.model.frame(), a function that searches for the dataset that the variables supplied to cluster could be in, only searches the global environment for the data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.
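Applied to the function above, the workaround looks like this (a sketch using the objects defined in the question):
est.m.WLS <- function(m.out, fm2){
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  # pass the cluster variable as data rather than as a formula, so
  # vcovCL never has to look m.data up in the global environment
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL,
                                cluster = m.data["subclass"])
  confint(model.2.2, "treat", level = 0.95)
}
est.m.WLS(m.out, fm2)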

How to run predict.boosting for new data?

I am trying to use predict.boosting on new data with the adabag package. I can't find a way to use it (or any other function from that package) for data without labels.
I am trying:
pr <- predict.boosting(modelfit, test[,2:ncol(test)])
It gives:
Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
undefined columns selected
However, if I include labels:
pr <- predict.boosting(modelfit, test)
it works just fine. But there has to be a way to use it as a predictive model for data without labels.
Thanks for any help!
EDIT
Example from the package:
library(rusboost)
library(rpart)
data(iris)
# make it an unbalanced dataset by removing most of the setosa observations
df <- iris[41:150, ]
# create binary variable
df$Setosa <- factor(ifelse(df$Species == "setosa", "setosa", "notsetosa"))
# create index of negative examples
idx <- df$Setosa == "notsetosa"
# run model
test.rusboost <- rusb(Setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = df, boot = F, iters = 20, sampleFraction = .1, idx = idx)
predict.boosting(test.rusboost, df)
predict.boosting(test.rusboost, df[, 1:4])
You should check that all the columns in train (the set you used to train the model) are present in test, with the same names.
Please check:
all(colnames(train) %in% colnames(test))
If it's FALSE, you will need to look at how you built train and test.
If it's TRUE, and in general, please provide a reproducible example.
Edit:
A nice way to check that the columns are the same, and that they contain the same factor levels, is to use sameShape from the dataPreparation package. If it's not the case, it will add the missing levels and columns (and warn you).
To use it:
library(dataPreparation)
test <- sameShape(test, train)
I came up with a workaround: I attached a column with the same name as the labels to my new data and filled it with random factor levels.
df$Setosa <- factor(sample( c("setosa", "notsetosa"), nrow(df), replace=TRUE, prob=c(0.5, 0.5) ))
Then it works just fine.
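Put together with the example above, it looks like this (a sketch: the dummy Setosa column exists only so predict.boosting() can find the formula's response; it affects the reported confusion matrix and error, not the predicted classes):
newdf <- df[, 1:4]  # features only, no labels
newdf$Setosa <- factor(sample(c("setosa", "notsetosa"), nrow(newdf), replace = TRUE))
pr <- predict.boosting(test.rusboost, newdata = newdf)
pr$class  # predicted labels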

R: error with autofitVariogram (automap package)

Using the autofitVariogram() function from the automap package, I get the following error:
Error in vgm_list[[which.min(SSerr_list)]] : attempt to select less
than one element in get1index
Example code:
model <- as.formula(Value ~ Elevation)
data <- matrix(c(11.07, 42.75, 5, 62.5,
                 8.73, 45.62, 234, 75,
                 12.62, 44.03, 12, 75,
                 10.87, 45.38, 67, 75,
                 8.79, 42.53, 64, 75),
               nrow = 5, byrow = TRUE)
data <- as.data.frame(data)
names(data) <- c('Lon', 'Lat', 'Elevation', 'Value')
library('sp')
coordinates(data) = ~Lon+Lat
library('automap')
autofitVariogram(model, data)
What causes this error? Do interpolated values cause some kind of 'singularity'?
Thx!
This error is caused by the fact that gstat cannot generate an experimental variogram given this number of observations:
library(gstat)
library(sp)
data <- matrix(c(11.07, 42.75, 5, 62.5,
                 8.73, 45.62, 234, 75,
                 12.62, 44.03, 12, 75,
                 10.87, 45.38, 67, 75,
                 8.79, 42.53, 64, 75),
               nrow = 5, byrow = TRUE)
data <- as.data.frame(data)
names(data) <- c('Lon', 'Lat', 'Elevation', 'Value')
coordinates(data) = ~Lon+Lat
variogram(Value ~ Elevation, data)
## NULL
When given insufficient observations, gstat::variogram returns NULL. This in turn causes autofitVariogram to fail.
The solution is to simply have more data if you want to use kriging. A rule of thumb is that you need about 30 observations to generate a meaningful variogram to fit a variogram model to.
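For illustration, a sketch with simulated data (hypothetical coordinates and values, 50 points) where gstat can build the experimental variogram and autofitVariogram() runs:
library(sp)
library(automap)
set.seed(1)
n <- 50
sim <- data.frame(Lon = runif(n, 8, 13), Lat = runif(n, 42, 46),
                  Elevation = runif(n, 0, 250))
sim$Value <- 60 + 0.05 * sim$Elevation + rnorm(n)  # Value depends on Elevation plus noise
coordinates(sim) <- ~Lon+Lat
fit <- autofitVariogram(Value ~ Elevation, sim)
plot(fit)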
Recently I also came across this problem. I found out the reason was that there were some Inf values in my data; after I deleted them, the package worked well. Hope this helps.
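A sketch of that clean-up, assuming the plain data frame form of the data above (before the coordinates() conversion):
# keep only rows where both variables are finite
data <- data[is.finite(data$Value) & is.finite(data$Elevation), ]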

Error message when using predict with LARS model on testdata

I use a lars model and apply it to a large data set (75 features) with numerical data and factors.
I train the model by
mm <- model.matrix(target~0+.,data=data)
larsMod <- lars(mm,data$target,intercept=FALSE)
which gives a nice in-sample fit. If I apply it to the test data by
mm.test <- model.matrix(target~0+.,data=test.data)
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
then I get the error message
Error in scale.default(newx, object$meanx, FALSE) :
length of 'center' must equal the number of columns of 'x'
I assume that it has to do with the fact that the factor levels differ between the data sets. However,
which(! colnames(mm.test) %in% colnames(mm) )
gives an empty result
while
which(! colnames(mm) %in% colnames(mm.test) )
gives 3 indices.
Thus 3 factor levels do appear in the training set but not in the test set.
Why does this cause a problem? How can I solve this?
The code below illustrates this with a toy example. In the test dataset the factor does not have the level "l3".
require(lars)
data.train <- data.frame(target = c(0,1,0,1,1,1,1,0,0,0),
                         f1 = rep(c("l1","l2","l1","l2","l3"), 2),
                         n1 = rep(c(1,2,3,4,5), 2))
test.data <- data.frame(f1 = rep(c("l1","l2","l1","l2","l2"), 2),
                        n1 = rep(c(7,4,3,4,5), 2))
mm <- model.matrix(target~0+f1+n1, data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm, data.train$target, intercept = FALSE)
mm.test <- model.matrix(~0+f1+n1, data = test.data)
colnames(mm.test)
length(colnames(mm.test))
which(! colnames(mm.test) %in% colnames(mm))
which(! colnames(mm) %in% colnames(mm.test))
predict(larsMod, mm.test, type = "fit", s = length(larsMod$arc.length))
I might be very much off here, but in my field predict() doesn't work if it can't find a variable it expects. So I tried what happens if I force the model matrix to 0 for the factor level (f1l3) that is not in the test data.
Note 1: I created a target variable in the test data, because I couldn't get your code to run otherwise.
set.seed(123)
test.data$target <- rbinom(nrow(test.data), 1, 0.2)
# proof of concept: add the missing column by hand
mm.test <- model.matrix(target~0+f1+n1, data = test.data)
mm.test1 <- cbind(f1l3 = 0, mm.test)
predict(larsMod, mm.test1[, colnames(mm)], type = "fit", s = length(larsMod$arc.length)) # runs!
Now generalize to allow for creating a 'complete' model matrix when factor levels are missing in the test data:
# missing columns
mis_col <- setdiff(colnames(mm), colnames(mm.test))
# matrix of missing levels
mis_mat <- matrix(0, ncol = length(mis_col), nrow = nrow(mm.test))
colnames(mis_mat) <- mis_col
# bind together and reorder to match mm; ordering matters, it yielded different results in my testing
mm.test2 <- cbind(mm.test, mis_mat)[, colnames(mm)]
predict(larsMod, mm.test2, type = "fit", s = length(larsMod$arc.length)) # runs
Note 2: I don't know what happens if the problem is the other way around (factor levels present in the test data that were not in the training data).
