I have run a model from the mclust package on a data frame containing only non-missing cases. That non-missing data frame was created with the dplyr package, using the select function; as a result, row.names appears as a vector in the data frame passed to the mclust function.
I next have extracted some critical values (the case 'classification') from this function as:
class <- functionobject$classification  # note: this name masks base::class()
Thus, the numeric list of classification values is associated with row.names.
When I attempt to append this vector of values to a new data frame of the same length (the same cases) but without row.names, the ordering is apparently lost. I know this because when I compare classification groups on other variables in the new data frame, the results do not equal the values obtained from the mclust function using those same variables.
The reason I cannot simply append to the nonmissing data frame (with row.names) used in the mclust function is that I require other variables from the data set that were not used in the function and that needed to be merged on ID variables, as:
NEW_DF=merge(mclust_DF, other_DF, by=c("X1", "X2"))
So I end up with a data frame of the same length, but one that no longer has the row.names, onto which I want to append the classification values from the mclust function described above. Although no errors are thrown when I use:
FINAL_DF <- cbind.data.frame(NEW_DF, class)
the data are off: inspection of group (class) means on relevant variables shows they do NOT equal those from the mclust function (which they should, since it is the same core input data).
I realize I am missing something obvious here, but I have not found an answer despite an exhaustive search of the archives. What is the correct way to go about this rather tedious wrangling?
FWIW: a simple, though perhaps still inefficient, solution was to bind the saved classification values from the mclust function to the nonmissing data frame BEFORE merging in the additional validation variables, because when the merge occurs, the row.names vector induced by dplyr's select is lost and cases are re-sorted.
This solution dawned on me when I realized that the mclust function was run on the nonmissing data frame (created with dplyr), so the resulting data objects follow the case ordering of the input data (by row.names).
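The bind-before-merge order of operations can be sketched as follows; all data frame names, column names, and classification values here are hypothetical stand-ins, with a plain vector taking the place of the actual Mclust() output:

```r
# Hypothetical full data set: ID columns X1/X2, a clustering variable
# v1 (with a missing value), and an extra validation variable
full_DF <- data.frame(X1 = c(1, 1, 2, 2),
                      X2 = c("a", "b", "a", "b"),
                      v1 = c(2.1, NA, 3.5, 4.0),
                      extra = c(10, 20, 30, 40))

# non-missing subset that would be passed to Mclust()
mclust_DF <- full_DF[!is.na(full_DF$v1), c("X1", "X2", "v1")]

# stand-in for mclust_fit$classification: one label per row, in the
# same order as the rows of mclust_DF
cls <- c(1, 2, 1)

# attach the labels BEFORE merging, while the row order still matches
mclust_DF$class <- cls

# merge() may re-sort rows, but each label now travels with its case
NEW_DF <- merge(mclust_DF, full_DF[, c("X1", "X2", "extra")],
                by = c("X1", "X2"))
```

Because the classification is attached while the case order still matches the clustering input, any re-sorting done by merge() can no longer scramble the labels.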
I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the
mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes columns such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculations?
I've been using the nhanes dataset that ships with mice for my exploration.
Thanks
I don't know your data, so I can't create an example for you, but you are looking for exactly these parameters of the mice() function:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns you want to use to impute a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define for which cells (and thereby which columns) you want to create imputations.
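Both parameters can be seen together on the nhanes data the question mentions. A minimal sketch (the choice of hyp as the column to exclude is just an illustration):

```r
library(mice)

data(nhanes)  # small demo data shipped with mice: age, bmi, hyp, chl

# predictorMatrix: 1 = "use this column as a predictor"
pred <- make.predictorMatrix(nhanes)
pred[, "hyp"] <- 0            # never use hyp as a predictor

# method: "" switches off the imputation model for a column
meth <- make.method(nhanes)
meth["hyp"] <- ""

# where: logical matrix of cells to impute; the default is is.na(data)
wh <- is.na(nhanes)
wh[, "hyp"] <- FALSE          # leave hyp's missing values as NA

imp <- mice(nhanes, predictorMatrix = pred, method = meth,
            where = wh, m = 2, maxit = 2, seed = 1, printFlag = FALSE)
done <- complete(imp)
```

After this, bmi and chl are filled in, while hyp keeps its missing values and never enters any imputation model.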
I want to impute a part of my data set with mice. My data set has very many variables, which is why I don't want to impute all the variables but only those which I will use in my model.
(I know that as much information as possible should be used for the imputation, but I am already using 41 variables, which according to literature should be more than enough.)
My problem: I don't want every variable to be imputed at all times,
because I have several measurement points. So of course, my variables
at t4 have many missing, but I don't want to impute them when people
just haven't filled out the questionnaire at that point.
So I specified a predictor matrix, in which all of the variables at t0 (e.g. A103.0) are imputed, but not at t4 (A103.4).
However, when running mice, it just uses "pmm" for all of the variables, and every variable is imputed.
Any suggestions on what went wrong are highly appreciated; I have spent quite some time now trying to find out what happened.
This is what I've done:
I create an object with all the columns I want to impute
impute <- c("A103", "A104", "A107", #SVF
"A302.0", "A303.0", "A304.0", "A305.0", "A306.0",
"A502_01.0", "A502_02.0", "A502_03.0", "A502_04.0",
"A504.0","A506.0", "A508.0", "W003.0", "W005.0",
"A509_02.0", "A509_03.0", "A509_06.0", "A509_10.0",
"A302.4", "A303.4", "A304.4", "A305.4", "A306.4",
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4",
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4")
I create a subset of the columns (and all rows of course) which I want to impute
imp <- mice(ds_wide[ ,impute], maxit=0)
imp$predictorMatrix
pred <- imp$predictorMatrix
pred [c("A302.4", "A303.4", "A304.4", "A305.4", "A306.4", #ABB.4
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4", #PSWQ.4
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4"), ] <- 0
View(pred) # looks exactly the way I want it to
imp <- mice(ds_wide[ ,impute], m=5, predictorMatrix = pred)
miceimp <- complete(imp)
anyNA(miceimp)
View(miceimp)
When I check miceimp (my result), there are no missing values whatsoever, so all the variables at t4 are imputed even though I specified otherwise. What did I do wrong?
Actually, what would be best for me would be to impute only those t4 variables for people who do not have all of their t4 values missing. So those people who filled out t4 should be imputed, and those who were not at that measurement point should not.
If anyone has any ideas how to make that possible, that would be great!
Many thanks!
I am not completely sure I understood 100% what you are trying to achieve.
I understood that you do not want to impute all your variables (but you do want to include all of them as input to the algorithm).
You were trying to define the parameter predictorMatrix
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
To me it sounds like this parameter is used to define which variables are used as input.
In comparison, the where parameter sounds like the correct parameter for specifying which variables should be imputed.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
So my conclusion would be to try out the where parameter instead of predictorMatrix.
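Since the stated goal is to impute t4 items only for people who actually reached t4, a where matrix can encode exactly that per cell. A minimal sketch with made-up values in the question's ".0"/".4" naming scheme:

```r
# Hypothetical mini data set: ".0" = baseline items, ".4" = t4 items
d <- data.frame(A302.0 = c(1, NA, 3, 4),
                A302.4 = c(NA, NA, 2, NA),
                A303.4 = c(5, NA, NA, 1))

t4_cols <- grep("\\.4$", names(d), value = TRUE)

# a person "reached" t4 if they answered at least one t4 item
reached_t4 <- rowSums(!is.na(d[, t4_cols])) > 0

# start from the default (impute every NA) and switch off all t4
# cells for people who never reached t4
wh <- is.na(d)
wh[!reached_t4, t4_cols] <- FALSE
```

The resulting wh can then be passed as mice(d, where = wh, ...): baseline NAs are still imputed for everyone, while t4 NAs are imputed only for participants with at least some t4 data.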
In mice, in addition to zeroing out the predictorMatrix entries for the variables that should not be imputed, you must specify "" in method for those variables.
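A short sketch of that combination on mice's built-in nhanes data, with chl as the stand-in for a variable that should stay untouched:

```r
library(mice)
data(nhanes)

meth <- make.method(nhanes)
meth["chl"] <- ""                  # "" = build no imputation model for chl
pred <- make.predictorMatrix(nhanes)
pred[, "chl"] <- 0                 # and keep its NAs out of other models

imp <- mice(nhanes, method = meth, predictorMatrix = pred,
            m = 2, maxit = 2, seed = 1, printFlag = FALSE)
out <- complete(imp)
```

Setting only the predictorMatrix to zero (as in the question) controls what predicts what, but the variable still gets imputed; the empty method string is what actually skips it.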
Very new to R.
I am trying to normalize multiple variables in a matrix, except the last column, which holds a categorical factor variable (in this case good/notgood).
Is there any way to normalize the data without affecting the categorical column? I have tried normalizing while keeping the categorical column out, but I can't seem to add it back again.
minimum <- apply(mywines[,-12],2,min)
maximum <- apply(mywines[,-12],2,max)
mywinesNorm <- scale(mywines[,-12],center=minimum,scale=(maximum-minimum))
I still need the 12th column to build supervised models.
The short version is that you can simply reattach the column using cbind. However, it is just a little more complicated than that: scale returns a matrix, not a data frame, and to mix numbers and factors you need a data.frame. So before the cbind, you will want to convert the scaled matrix back to a data.frame.
mywinesNorm = cbind(as.data.frame(mywinesNorm), mywines[ ,12])
A different approach would be to overwrite just the numeric columns in place, leaving the factor column untouched:
mywines[ ,-12] = scale(mywines[ ,-12], center=minimum, scale=(maximum-minimum))
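The whole recipe can be put together on a small stand-in data set (two numeric columns plus a factor in the last position, mirroring the 12-column case; the column names are made up):

```r
# Hypothetical stand-in for the wine data: numeric columns + a factor
mywines <- data.frame(acid = c(1, 2, 3, 4),
                      sugar = c(10, 20, 30, 40),
                      quality = factor(c("good", "notgood", "good", "good")))

last <- ncol(mywines)                      # index of the factor column
minimum <- apply(mywines[, -last], 2, min)
maximum <- apply(mywines[, -last], 2, max)

# min-max scale the numeric columns only
mywinesNorm <- scale(mywines[, -last],
                     center = minimum, scale = maximum - minimum)

# scale() returns a matrix; convert before reattaching the factor
mywinesNorm <- cbind(as.data.frame(mywinesNorm),
                     quality = mywines[, last])
```

Each numeric column now runs from 0 to 1, and the factor column is back in place for building supervised models.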
I am trying to remove the correlated attributes which are out of boundaries (-1,1). I am using the following code for the correlation:
cor(df[sapply(df, is.numeric)])
After that I get the correlation values. How can I remove values greater than 1 or smaller than -1?
Thank you
The cor() function in R accepts a numeric vector, matrix, or data frame and returns a pairwise correlation matrix. Values in a correlation matrix are always in the range -1 to +1; entries outside that range (or NA/NaN entries) signal a problem with the input rather than genuine correlations. The most frequent cause of an invalid correlation matrix is missing values. Note that cor() has no na.rm argument; missing values are handled through the use argument, which has several options. use = "all.obs" asserts there are no missing observations, and the presence of any missing value causes an error. use = "complete.obs" performs casewise deletion (only complete rows are used), which always yields a valid correlation matrix. use = "pairwise.complete.obs" uses all complete pairs of observations for each pair of variables; this keeps more data but may produce a matrix that is not positive semi-definite.
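A small sketch on made-up data showing how the use options behave, plus one common way to flag pairs beyond some absolute threshold (here 0.8, an arbitrary choice):

```r
# Toy data with missing values; columns a and c each contain an NA
df <- data.frame(a = c(1, 2, 3, 4, NA),
                 b = c(2, 4, 6, 8, 10),
                 c = c(5, 3, NA, 1, 2))

cm_default  <- cor(df)                               # NAs propagate
cm_complete <- cor(df, use = "complete.obs")         # drop incomplete rows
cm_pairwise <- cor(df, use = "pairwise.complete.obs")

# flag off-diagonal pairs with |r| above a chosen threshold
high <- abs(cm_complete) > 0.8 & row(cm_complete) != col(cm_complete)
which(high, arr.ind = TRUE)
```

With a valid matrix, no entry can exceed 1 in absolute value, so "removing values outside (-1, 1)" really amounts to fixing the missing-value handling; thresholding on abs() as above is then the way to drop highly correlated attributes.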