Weighted sampling problems in R

I want to do weighted sampling in R. My original data is imbalanced between classes 0 and 1, so I used sample() with weights, but the result is still biased:
nsample <- 4000
# higher sampling weight for the minority class (1), lower for the majority class (0)
model_weights <- ifelse(train$Bankrupt == 1, 0.9677419, 0.03225806)
samp_idx <- sample(4107, nsample, replace = TRUE, prob = model_weights)
data.weighted <- data[samp_idx, ]
table(data.weighted$Bankrupt)
   0    1
3761  239

Look at the documentation for the stratified function in the splitstackshape package. You want to do something like this, although it is impossible to tell exactly from the data you provide:
stratified(DF, "Status", c(Bankrupt = 30, NotBankrupt = 1))
The column in the data that holds the groups should be character, and the group names should match those in the named vector of sizes you pass to stratified.
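For the bankruptcy data in the question, a minimal sketch (assuming splitstackshape is installed, that the grouping column is Bankrupt with values "0" and "1", and that 100 rows per class is the desired balanced size; those counts are placeholders):
library(splitstackshape)
data$Bankrupt <- as.character(data$Bankrupt)    # grouping column as character
balanced <- stratified(data, "Bankrupt", size = c("0" = 100, "1" = 100))
table(balanced$Bankrupt)                        # should now show 100 of each class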

Related

How do I loop different percentages of missing values using MCAR?

Using the Cleveland heart disease data from the UCI data repository, I want to generate missing values in the data so that I can apply some imputation techniques.
heart.ds <- read.csv(file.choose())
head(heart.ds)
attach(heart.ds)
sum(is.na(heart.ds))
str(heart.ds)
#Changing Appropriate Variables to Factors
heart.ds$sex<-as.factor(heart.ds$sex)
heart.ds$cp<-as.factor(heart.ds$cp)
heart.ds$fbs<-as.factor(heart.ds$fbs)
heart.ds$exang<-as.factor(heart.ds$exang)
heart.ds$restecg<-as.factor(heart.ds$restecg)
heart.ds$slope<-as.factor(heart.ds$slope)
heart.ds$thal<-as.factor(heart.ds$thal)
heart.ds$target<-as.factor(heart.ds$target)
str(heart.ds)
Now I want to generate missing values using the MCAR mechanism. Below is the loop code:
p = c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
for(i in 1:length(p)){
  hd_mcar[i] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
The problem is that after the above code runs, hd_mcar does not contain the generated data in its original form, i.e. as data frames with the same variables and rows as heart.ds.
But when I use only one missingness percentage, I get the expected result; below is the code for a single missing-data percentage:
#Missing Completely at Random(MCAR)
hd_mcar <- delete_MCAR(heart.ds, 0.05)
sum(is.na(hd_mcar))
I need help solving the looping problem. Thank you.
Now I want to apply MICE and other imputation methods (Hmisc, Amelia, mi, and missForest) inside the loop, but it gives me the error "Error: Data should be a matrix or data frame".
The code below is for MICE only:
#1. Method (MICE)
mice_mcar[[i]] <- mice(hd_mcar, m = ip,
                       method = c("pmm","logreg","polyreg","pmm","pmm","logreg",
                                  "polyreg","pmm","logreg","pmm","polyreg","pmm",
                                  "polyreg","logreg"),
                       maxit = 20)
#Diagnostic check
summary(heart.ds$age)
mice_mcar$imp$age
#Finding the means of the imputations
app1 <- apply(mice_mcar$imp$age, MARGIN = 2, FUN = mean)
min1 <- abs(app1 - mean(heart.ds$age))
#Selecting the minimum index
sm1 <- which(min1 == min(min1))
#Selecting the final imputation
final_clean_hd_mcar <- mice::complete(mice_mcar, sm1)
mice.mcar <- final_clean_hd_mcar
How do I make this fit into the loop so that it works properly?
Your problem was this line:
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
You are creating a vector here rather than a list. You can't assign a data frame to an element of a vector without coercing it into something that is not a data frame. You want to do this:
p <- c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar <- vector(mode = "list", length = length(p))
for(i in 1:length(p)){
  hd_mcar[[i]] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
Note that because it's a list now, hd_mcar[[i]] uses the [[ rather than [ subscript.
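The same pattern extends to the imputation step in the second part of your question: store each mice result in its own list element as well. A minimal sketch, assuming the missMethods and mice packages, default imputation methods (you can pass your method vector from the question instead), and a placeholder value for ip, the number of imputations:
library(missMethods)
library(mice)
p <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1)
ip <- 5   # number of imputations per run; placeholder value
hd_mcar <- vector(mode = "list", length = length(p))
mice_mcar <- vector(mode = "list", length = length(p))
for (i in seq_along(p)) {
  hd_mcar[[i]] <- delete_MCAR(heart.ds, p[i])     # data frame with p[i] missingness
  mice_mcar[[i]] <- mice(hd_mcar[[i]], m = ip, maxit = 20,
                         printFlag = FALSE)       # or supply your method vector here
}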

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately ten times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- confmat_input <- myPredictions$class$max.dist[, x] # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validatie[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of components, so results[[1]] will be the result for 10 components. You will not get the values of prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This produces a list of lists, so that results[[1]][[1]] is the prediction for 10 components, while results[[1]][[2]] and results[[1]][[3]] are confusion.mat and get.BER(confusion.mat) respectively.
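A hedged sketch of that variant (the function name confmat_full and the names of the list elements are my own choices; the structure of myPredictions and data_validatie is taken from the question):
confmat_full <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]          # predictions using x components
  confusion.mat <- get.confusion_matrix(truth = data_validatie[, 1],
                                        predicted = prediction)
  list(prediction = prediction,
       confusion.mat = confusion.mat,
       BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)
names(results) <- paste0("ncomp_", 10:2)   # e.g. results[["ncomp_10"]]$BER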

Logistic regression training and test data

I am a beginner with R and am having trouble with something that feels basic, but I am not sure how to do it. I have a data set with 1319 rows, and I want to set up the training data as observations 1 to 1000 and the test data as observations 1001 to 1319.
Comparing with notes from my class, the professor set this up by creating a Boolean vector from the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns a vector of TRUE/FALSE values.
I understand that, and I would be able to set up a Boolean vector if I were subsetting my data by a variable, but instead I have to subset strictly by row number, which I do not know how to do. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column in your data frame.
You can also subset your data frame by using row numbers:
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data) # the remaining row numbers
train_data = data[train,]
test_data = data[test,]
But be careful: unless the order of rows in your data frame is completely random, you probably want to pick 1000 rows at random rather than the first 1000. You can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all the rows not selected in train.
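If you want the random split to be reproducible, fix the seed before sampling; a small sketch (the seed value is arbitrary, and negative indexing is just an equivalent alternative to setdiff here):
set.seed(42)                          # any fixed seed makes the split reproducible
train <- sample(1:nrow(data), 1000)
train_data <- data[train, ]
test_data <- data[-train, ]           # negative indexing drops the training rows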
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = TRUE),
                   data = sample(seq(0, 1, by = 0.01), 1000, replace = TRUE))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
                               p = 0.8, list = FALSE)
Note y = data$year: this tells caret to sample within the values of the year variable, ensuring you don't get ordered data and introduce bias into the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000
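A quick sanity check on the partition created above, to confirm the 80/20 split and that the year distribution is preserved:
nrow(train) / nrow(data)          # should be close to the requested p = 0.8
prop.table(table(train$year))     # distribution of year in the training set
prop.table(table(test$year))      # ... and in the testing set, for comparison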

How to import a distance matrix for clustering in R

I have a text file containing 200 models all compared to each other, with a molecular distance for each pair of models compared. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
All the way down to 200, where the first number is the first model, the second number is the second model, and the third number is the corresponding molecular distance when these two models are compared.
I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack Overflow and R, but thanks in advance!
Since you don't have the distance between a model and itself, you would need to insert that yourself, using the answer from this question (you can ignore that the numbering of the models differs from your input data; it doesn't serve a purpose, really):
# Create some dummy data that has the same shape as your data:
df <- expand.grid(model1 = 1:120, model2 = 2:120)
df$distance <- runif(n = 119*120, min = 1, max = 10)
head(df)
#   model1 model2 distance
# 1      1      2 7.958746
# 2      2      2 1.083700
# 3      3      2 9.211113
# 4      4      2 5.544380
# 5      5      2 5.498215
# 6      6      2 1.520450
inds <- seq(0, 200*119, by = 200)               # positions where a self-distance needs to be inserted
val <- c(df$distance, rep(0, length(inds)))     # append one 0 per model for the model-vs-itself distance
inds <- c(seq_along(df$distance), inds + 0.5)   # fractional indices slot the 0s in before each block
val <- val[order(inds)]                         # reorder so the 0s end up in the right places
Once that's in place, you can use matrix() with the ncol and nrow to "reshape" your vector of distance in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance in one direction, so only between e.g. model1 - model5 and not model5 - model1, you will have to fill in the values in the upper triangular part of a matrix, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the ones to your distance column.
dist_mat <- diag(200)                               # start from a 200 x 200 matrix
dist_mat[upper.tri(dist_mat)] <- your_data$distance # fill the upper triangle with the distances
To copy the upper-triangular entries to below the diagonal, use:
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
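If you prefer to skip the insertion trick entirely, an alternative sketch that fills a symmetric matrix directly by indexing with the model numbers (the file name and column names are assumptions; the file is taken to be whitespace-separated as in the question):
pairs <- read.table("distances.txt",
                    col.names = c("model1", "model2", "distance"))
n <- max(pairs$model1, pairs$model2)        # 200 models
dist_mat <- matrix(0, nrow = n, ncol = n)   # self-distances stay 0 on the diagonal
# fill both triangles so the matrix is symmetric
dist_mat[cbind(pairs$model1, pairs$model2)] <- pairs$distance
dist_mat[cbind(pairs$model2, pairs$model1)] <- pairs$distance
# e.g. hierarchical clustering on the result
hc <- hclust(as.dist(dist_mat))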
As I do not know from your question what format your file is in, I will assume the most general file format, i.e. CSV.
Then you should look at functions for reading files, such as read.csv or fread.
Example code:
dt <- read.csv(file, sep = "", header = TRUE)
I suggest using the data.table package. Then:
setDT(dt)
dt[, id := paste0(as.character(col1), "-", as.character(col2))]
This creates a new variable out of the first and second model columns, which serves as a unique id.
What I would then do is remove this id and scale the numerical input.
After scaling, run the clustering algorithms.
Merge the result with the id to analyse your results.
Is that what you are looking for?
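A sketch of those remaining steps under the same assumptions about the columns, with kmeans standing in for whichever clustering algorithm you prefer and an arbitrary number of clusters:
num_cols <- setdiff(names(dt), "id")                     # everything except the id
scaled <- scale(as.matrix(dt[, ..num_cols]))             # scale the numerical input
km <- kmeans(scaled, centers = 3)                        # placeholder clustering method
result <- data.table(id = dt$id, cluster = km$cluster)   # attach cluster labels back to the id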

cforest party unbalanced classes

I want to measure the feature importance with the cforest function from the party package.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest using a subsample such that the number of elements of class 1 is the same as the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20).
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
A random forest is a set of decision trees. In general, the decision trees are constructed using only a random subsample of the data. I would like the subsample used for each tree to have the same number of elements in class 1 and in class 0.
EDIT:
The option that I am looking for is definitely available in the randomForest package:
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do that.
I am not sure if the function provides balancing of the data as an argument, but you can do it manually. Below is the code I quickly threw together; a more elegant solution might exist.
# just in case
myData <- iris
# replicate everything *10* times. replicate() is just a "loop 10 times".
replicate(10,
          {
            # split the dataset by class and add the separate classes to a list
            splitList <- split(myData, myData$Species)
            # sample *20* random rows from each data frame in the list
            sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
            # combine the sampled rows into a data.frame
            sampledData <- do.call(rbind, sampledList)
            # your code below
            res.cf <- cforest(Species ~ ., data = sampledData,
                              control = cforest_unbiased(mtry = 2))
            varimp(object = res.cf)
          }
)
Hope you can take it from here.
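Because each iteration returns the named vector from varimp(), replicate() column-binds the results into a matrix with one row per predictor and one column per repetition. A small follow-up sketch, assuming the result of the replicate() call above has been assigned to an object vi:
rowMeans(vi)      # mean importance of each predictor over the 10 balanced subsamples
apply(vi, 1, sd)  # variability of the importance across the subsamples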
