ddply skipping empty dataframe - r

I am trying to get the empirical distribution of a sample across the levels of a factor.
For some reason, running the following:
a <- daply(caseDataset, x, nrow) / nrow(caseDataset)
gives me NA for every level of the factor x that has no rows in the dataset.
So I have to override the result with
a[is.na(a)] <- 0
How can I force daply to behave uniformly (and pass the empty data frame down to nrow)?
Sample for caseDataset:
library(plyr)
dataset <- data.frame(
a1 = c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE),
a2 = c(TRUE,TRUE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,TRUE),
a3 = c(1,6,5,4,7,3,8,7,5),
target = c('+','+','-','+','-','-','-','+','-'),
stringsAsFactors = TRUE) # target must be a factor (no longer the default since R 4.0)
caseDataset <- subset(dataset, target=='-')
daply(caseDataset, "target", nrow)

Does the .drop_i switch do what you are after?
> daply(caseDataset, "target", nrow, .drop_i=FALSE)
- +
5 0
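For comparison, here is a base R sketch that does not rely on plyr. It assumes target is stored as a factor, in which case table() reports a count of zero for levels that have no rows, so the proportions come out without NA. The names counts and props are illustrative only.

```r
# Same sample data as in the question, with target stored explicitly as a factor
dataset <- data.frame(
  a3 = c(1, 6, 5, 4, 7, 3, 8, 7, 5),
  target = factor(c('+', '+', '-', '+', '-', '-', '-', '+', '-'))
)

# subset() keeps the unused factor level "+"
caseDataset <- subset(dataset, target == '-')

# table() counts every level of the factor, including empty ones
counts <- table(caseDataset$target)

# prop.table() divides by the total, giving the empirical distribution
props <- prop.table(counts)
print(props)
```

Because the zero comes from the factor levels themselves, no a[is.na(a)] <- 0 patch is needed afterwards.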

Related

R: upSample in Caret is removing target variable completely

I am trying to upsample an imbalanced dataset in R using the upSample function in Caret. However upon applying the function it completely removes the target variable C_flag from the dataset. Here is my code:
set.seed(100)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[, colnames(train) %ni% "C_flag"], #all predictor variables
y = train$C_flag) #target variable
Here are the counts of each category of C_flag in the training set:
0 = 100193, 1 = 29651.
I check whether C_flag is present and get this result:
print(up_train$C_flag)
NULL
Does anyone know why this function is removing this variable instead of upsampling?
The first thing that comes to mind is whether C_flag is a factor or not. Anyway, I tried this sample dataset:
library(tidyverse)
library(caret)
train <- data.frame(x1 = c(2,3,4,2,3,3,3,8),
x2 = c(1,2,1,2,4,1,1,4),
C_flag = c("A","B","B","A","A","A","A","A"))
train$C_flag <- as.factor(train$C_flag)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag)
up_train$C_flag
It returned NULL. Why? Because the target column was renamed "Class", which is the default value of yname. So if you want the target column to keep the name C_flag, pass that name as yname:
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag,
yname = "C_flag")
print(up_train$C_flag)
[1] A A A A A A B B B B B B
Levels: A B

how to solve error replacement has 0 rows, data has 200 for a loop

I have three data frames, which I put in a list. For each data frame, I standardize values with a zvar function. I then want to fit a model and resample it in a loop, applying zvar.
But I encounter this error:
Error in $<-.data.frame(*tmp*, "var2", value = numeric(0)) :
replacement has 0 rows, data has 200
The message suggests that a variable does not exist, but I am able to print each variable in the data frame. What is the issue here?
A sample of my data frame (all continuous variables):
var1 var2
2 2
1 2
9 7
My code:
zvar <- function(data, mean, sd){
data$var2 <- (data$var2 - mean$var2)/sd$var2
data$var1 <- (data$var1 - mean$var1)/sd$var1
return(data)
}
# its args are automatically calculated in the following function `boot.fiml`.
dflist <- list(group1, group2, group3)
for (i in dflist) {
i <- data.frame(i)
print(i$var2)
print(i$var1)
m_group <- lm(var1 ~ var2, data = i)
print(m_group)
## the previous lines run successfully, but not the following one
boot.fiml(data = i, model = m_group, R = 10, z.function = zvar)
}
Thanks!
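One way this error can arise is sketched below. It is an illustration, not a diagnosis of boot.fiml. If the mean or sd object passed to zvar is a list or data frame with no var2 element, then mean$var2 is NULL, the arithmetic yields numeric(0), and assigning that to a column fails. The objects m, m_ok and s are hypothetical stand-ins for whatever the resampling function actually passes in.

```r
zvar <- function(data, mean, sd) {
  data$var2 <- (data$var2 - mean$var2) / sd$var2
  data$var1 <- (data$var1 - mean$var1) / sd$var1
  data
}

df <- data.frame(var1 = c(2, 1, 9), var2 = c(2, 2, 7))

# Hypothetical inputs: the list of means lacks a var2 element
m <- list(var1 = mean(df$var1))
s <- list(var1 = sd(df$var1), var2 = sd(df$var2))

# m$var2 is NULL, so (df$var2 - NULL) has length 0 and the assignment fails
msg <- tryCatch({
  zvar(df, m, s)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# With both elements present, the same function runs
m_ok <- list(var1 = mean(df$var1), var2 = mean(df$var2))
z <- zvar(df, m_ok, s)
print(z)
```

Printing names(mean) and names(sd) inside zvar would show whether the objects it receives carry the expected elements.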

Mice: partial imputation using where argument failing

I have run into a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which, judging by the help page, seems possible and straightforward. But I can't get it to work.
Here is the example:
I have some missing data in x and y:
library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA
I want only to impute missing data on y:
where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))
I do the imputation
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
I look at the imputed values:
test <- complete(plouf.imp)
Here I still have NAs in y:
> sum(is.na(test$y))
[1] 10
If I use where to request imputation of all missing values, it works:
where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)
> sum(is.na(test$y))
[1] 0
but it imputes x too, which I don't want in this specific case (for speed reasons in a statistical simulation study).
Does anyone have an idea?
This happens because of the following line:
plouf[sample(100,10),c("x","y")] <- NA
Consider your first case, where you want to impute y only (whr below is the same matrix as your where), and check its PredictorMatrix:
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
plouf.imp
#PredictorMatrix:
# ID x y
#ID 0 0 0
#x 0 0 0
#y 1 1 0
It says that y's missing values will be predicted from ID and x, since both have the value 1 in row y.
Now look at your sample data, where you set NA in the x and y columns together: wherever y is NA, x is NA as well.
So when mice consults the PredictorMatrix to impute y, it encounters NA in x and skips those rows, because all predictors (ID and x) are expected to be non-missing in order to predict the missing values in y.
Try this:
library(mice)
#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA
#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1
Here only one value of y is left unimputed: row 39, where both x and y are NA (as in your first case).
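The overlap the answer describes can be checked in base R before mice is called at all. This is a minimal sketch using the question's own construction of plouf. The seed and the name both_missing are assumptions added for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))

# As in the question: the same 10 rows get NA in both x and y
plouf[sample(100, 10), c("x", "y")] <- NA

# Rows where y is missing and its predictor x is missing too
both_missing <- is.na(plouf$x) & is.na(plouf$y)
print(sum(is.na(plouf$y)))
print(sum(both_missing))
```

When the two counts are equal, every row that needs a value for y also lacks x, which is the situation in the first case of the question.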

For loop that span between two variables in R

I am trying to apply a series of functions to the columns between two variables. I need to first impute the missing values, then normalize the variables. To impute, I use the following code.
for(i in (train$B365A:train$BSA)){
data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
}
With the loop above I am trying to impute the missing values; there are approximately 20 variables between the two columns.
I have also come up with this, but it does not change the cells.
convert_num <- function(i) {
i <- as.numeric(i)
}
for (i in c(1:3)){
convert_num(i)
}
The data looks similar to the following
hope coal kite
3 4 5
2 1 5
Right now the columns are of another class but need to be numeric. The data has over 20 variables and 18k rows.
If I understand correctly, the solution to your problem would be the following.
data <- data.frame(c1 = c(rbinom(10,5,0.5)),
c2 = c(rbinom(10,5,0.5)),
c3 = c(rbinom(10,5,0.5)))
data[2:4,1] <- rep(NA,3);data[c(6,8),2] <- rep(NA,2);data[10,3] <- NA
data
# impute data in c1:c3
for(i in 1:3){
data[i][is.na(data[i])] <- round(mean(data[,i], na.rm = T))
}
data
data[] <- lapply(data,as.numeric) # transform to numeric
sapply(data,class)
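Since the question asks for a loop that spans two named columns, here is a sketch that looks up the positions of the first and last column by name and processes everything in between. The toy data and the names first, last and cols are assumptions. The column names come from the question's sample, with an extra column added to show that it is left alone.

```r
# Toy data: three columns to impute, stored as character, plus one to leave alone
data <- data.frame(hope  = c("3", NA, "5"),
                   coal  = c("4", "2", NA),
                   kite  = c("5", "5", "5"),
                   other = c("a", "b", "c"),
                   stringsAsFactors = FALSE)

# Positions of the first and last column of the range, looked up by name
first <- match("hope", names(data))
last  <- match("kite", names(data))
cols  <- first:last

# Convert each column in the range to numeric, then fill NA with the rounded mean
data[cols] <- lapply(data[cols], function(col) {
  col <- as.numeric(col)
  col[is.na(col)] <- round(mean(col, na.rm = TRUE))
  col
})

print(data)
print(sapply(data, class))
```

Assigning the result of lapply back into data[cols] is what changes the data frame. Calling a function inside a for loop without assigning its return value, as in the convert_num attempt, leaves the original untouched.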

how to impute missing value using R's multinom

Suppose my data frame DF has two columns, $A and $B. $A is always present. $B is sometimes coded NaN when the value is missing. I want to predict $B.predicted, the missing values for $B, and create a new column $B.complete such that $B.complete[i] is $B.predicted[i] if $B[i] is NaN and is $B[i] otherwise.
I use multinom, which requires a factor as the dependent variable, to fit a model on the rows where I have a full observation, using:
DF$B.factor <- factor(DF$B)
model.results <- multinom(formula=B.factor ~ A,
data=DF[!is.na(DF$B),])
B.predicted <- predict(model.results, newdata=DF, type="class")
The variable B.predicted is a factor.
My DF$B column is not a factor.
My question is: how do I merge DF$B and B.predicted to create B.complete? In particular, since B.predicted is a factor and DF$B is not, does this code pick up the correct values?
B.complete <- ifelse(is.na(DF$B), B.predicted, DF$B)
Use replace:
library(nnet) # provides multinom
set.seed(1)
DF <- data.frame(A = factor(sample(letters[1:5],30, TRUE)),
B = sample(c(letters[1:3],NA), 30 , TRUE, prob = rep(c(0.3,0.1),c(3,1))),
stringsAsFactors = F)
DF$B.factor <- factor(DF$B)
# no need to include is.na(DF$B) as multinom will omit anyway
model <- multinom(B.factor ~ A, data = DF)
# use replace to replace the NA values (converting to character when necessary)
DF$B.complete <- replace(DF$B, is.na(DF$B), as.character(predict(model, newdata = DF[is.na(DF$B),])))
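To see why the conversion to character matters, here is a small base R sketch that does not need multinom. The vectors B and B.predicted are made-up stand-ins for DF$B and the model's predictions. With ifelse, the factor's internal integer codes end up in the result instead of its labels.

```r
# Stand-ins for DF$B (character with a missing value) and the predicted factor
B <- c("a", NA, "c")
B.predicted <- factor(c("a", "b", "c"))

# ifelse() drops the factor attributes, so the missing slot gets the integer code
wrong <- ifelse(is.na(B), B.predicted, B)
print(wrong)

# replace() with an explicit as.character() keeps the label
right <- replace(B, is.na(B), as.character(B.predicted[is.na(B)]))
print(right)
```

So the ifelse version from the question would not pick up the correct values unless B.predicted is first wrapped in as.character().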
