partial imputation with missForest - r

I'm trying to use the missForest package in R to partially impute a dataset. In detail, I would like to impute all the metric variables but leave a few columns alone. Is this possible?

I have a potential solution, if I'm understanding your question correctly. I am going to provide you some code that should be fully reproducible.
## Load the package (it provides both prodNA and missForest) and get some data...
library(missForest)
data(iris)
## The data contains four continuous and one categorical variable.
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.1)
## Impute missing values for just the first four columns of data
iris.mis[,1:4] <- missForest(iris.mis)$ximp[,1:4]
Let me know if an approach like this works. If it doesn't work, see if you can use some example code to show why.

As I understand it, you need to leave a few columns alone and impute the others with the missForest function. A simple solution is
imputedData <- missForest(dataset[c(2, 3)])$ximp
dataset <- data.frame(dataset[1], imputedData)
Pass only the columns that need to be imputed (here 2 and 3), take the imputed data frame from $ximp, and then combine it back with the untouched columns.
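Both answers follow the same pattern: impute, then write back only the columns you want changed. Here is a minimal base-R sketch of that pattern. The helper `impute_columns` and the column-mean imputer are hypothetical stand-ins, chosen so the example runs without extra packages; in practice you would pass something like `function(d) missForest(d)$ximp` instead.

```r
# Hypothetical helper: impute only `cols`, leave every other column untouched.
# `impute_fun` takes a data frame and returns a completed data frame.
impute_columns <- function(data, cols, impute_fun) {
  data[cols] <- impute_fun(data[cols])
  data
}

# Stand-in imputer (NOT missForest): replace NAs with the column mean.
mean_impute <- function(d) {
  d[] <- lapply(d, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  d
}

toy <- data.frame(id = c(1, NA, 3), a = c(NA, 2, 4), b = c(10, NA, 30))
result <- impute_columns(toy, c("a", "b"), mean_impute)
```

Note that this variant, like the second answer, only lets the imputer see the selected columns. The first answer instead lets missForest use every column as a predictor and only discards the imputed values for the columns that should stay as they are.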

Related

R scale function with character variable

I'm relatively new to R and I'm struggling to figure out how to scale a dataset that contains a character variable.
However, when I try to use the scale function to create a dataframe, I get an error:
df<-scale(USArrests)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way to create a dataframe with a character variable to later use it in a cluster analysis?
km.res<-kmeans(df,4,nstart=10)
?scale says scale is designed to center and scale the columns of numeric matrices; see the help entry for further details.
However, df <- USArrests is sufficient to store the built-in dataset as the object df (see your environment), if you have to name it df.
Compare the following:
df <- USArrests
# compare
head(df, n=5)
# to
df1 <- scale(df)
head(df1, n=5)
As you can see, all numeric columns are now scaled while the row ids, Alabama, ..., Wyoming, of course, do not change. Btw, to check the class of all variables you can use lapply(df, class).
I think you shouldn't have any problems then calling km.res <- kmeans(df1,4,nstart=10). To inspect the object, type km.res.
To be honest, I think that before running kmeans() you should have another look at the help page (e.g. help(kmeans)) to get familiar with the arguments centers, iter.max, ...
Further, I think it would be a good idea to investigate whether or not to center and scale the data in the previous step. In any case, it is possible to run kmeans() with scaled (df1) and unscaled (df) data. Why one of those alternatives is more appropriate is of major importance.
EDIT: It is recommended to set a seed (e.g. set.seed(09102021)) before running the algorithm. By doing so you ensure the reproducibility of results.
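If the data frame really does contain a character column, one option is to scale only the numeric columns and leave the rest untouched. The sketch below assumes a `State` column added to USArrests purely for illustration; the seed and the choice of 4 centers are arbitrary.

```r
df <- USArrests
df$State <- rownames(df)  # character column, added here only for illustration

# Identify the numeric columns and scale just those
num_cols <- sapply(df, is.numeric)
df_scaled <- df
df_scaled[num_cols] <- as.data.frame(scale(df[num_cols]))

# Cluster on the scaled numeric columns only
set.seed(123)
km.res <- kmeans(df_scaled[num_cols], centers = 4, nstart = 10)

# Attach the cluster label next to the character column
df_scaled$cluster <- km.res$cluster
```

The character column never reaches scale() or kmeans(), so it cannot trigger the "'x' must be numeric" error, but it is still available for labelling the clusters afterwards.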

Dummy coding omits / removes select variables from the data frame R

I have a fairly large dataset, 1460 (n) x 81 (p). About 38 variables are numeric and the rest are factors with between 2 and 30 levels. I am using dummy.data.frame from the dummies package to encode the factor variables for use in running regression models.
However, when I run the following code:
train_dummy <- dummy.data.frame(train, sep = ".", verbose = TRUE, all = TRUE)
some of the columns from the original dataset are removed.
Has anyone encountered such issue before?
Link to original training dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
A number of columns from the original dataset including response variable SalePrice are being dropped. Any ideas/suggestions on what to try?
I wasn't able to reproduce the issue. I don't think there is enough info here to reproduce the issue, but I do have a few first thoughts.
run dummy data processing before train/test split
I see you're running the dummy data solely on your training data. I've found that it is usually a better strategy to run dummy data processing on the entire dataset as a whole, and then split into train / test.
Sometimes when you split first, you can run into issues with the levels of your factors.
Let's say I have a field called colors which is a factor in my data that contains the levels red, blue, green. If I split my data into train and test, I could run into a scenario where my training data only has red and blue values and no green. Now if my test dataset has all three, there will be a difference between the number of columns in my train vs test data.
I believe one way around that issue is the drop parameter in the dummy.data.frame function which defaults to TRUE.
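To make the red/blue/green scenario concrete, here is a small sketch. It uses base R's model.matrix rather than dummy.data.frame (so it runs without the dummies package), and the toy data and row indices are made up for illustration.

```r
d <- data.frame(
  colors = c("red", "blue", "green", "red", "blue", "green"),
  y = 1:6,
  stringsAsFactors = TRUE
)
train_rows <- c(1, 2, 4, 5)  # happens to contain no "green"
test_rows  <- c(3, 6)

# Encode first, split second: both pieces share the same columns
X <- model.matrix(~ colors - 1, data = d)
train_X <- X[train_rows, , drop = FALSE]
test_X  <- X[test_rows, , drop = FALSE]

# Split first, encode second: the unused level has to be dropped by hand
# here, which mimics what happens when the training data never saw "green"
train_only <- droplevels(d[train_rows, ])
train_X_late <- model.matrix(~ colors - 1, data = train_only)
```

In the first case train_X and test_X both have three dummy columns; in the second, train_X_late has only two, so a model fitted on it cannot be applied directly to test data that contains green.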
things to check
Run these before running dummy data processing for train and test to see what characteristics these fields have that are being dropped:
# find the class of each column
train_class <- sapply(train, class)
test_class <- sapply(test, class)
# find the number of unique values within each column
unq_train_vals <- sapply(train, function(x) length(unique(x)))
unq_test_vals <- sapply(test, function(x) length(unique(x)))
# combine into data frame for easy comparison
mydf <- data.frame(
train_class = train_class,
test_class = test_class,
unq_train_vals = unq_train_vals,
unq_test_vals = unq_test_vals
)
I know this isn't really an "answer", but I don't have enough rep to comment yet.

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time, one per column of the result. To get every subset, repeat the call for m = 1, ..., 8.
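A short sketch of how combn could be looped over every subset size to get all non-empty subsets. The data frame here is random toy data, and the "statistic" (mean of the row sums) is only a placeholder for whatever you actually want to compare.

```r
varnames <- paste0("V", 1:8)

# One list element per subset: sizes 1 through 8
all_subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

# Toy data and a placeholder statistic computed on each subset of columns
set.seed(1)
yourdataframe <- as.data.frame(matrix(rnorm(20 * 8), ncol = 8))
names(yourdataframe) <- varnames
stats <- sapply(all_subsets, function(vars) {
  mean(rowSums(yourdataframe[vars]))
})
names(stats) <- sapply(all_subsets, paste, collapse = "+")
```

Using simplify = FALSE makes combn return a list of character vectors, which is easier to iterate over than the default matrix.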
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn,2^nn),nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
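Regarding the "smarter way" comment above: one possibility, sketched here in base R, is to let expand.grid build the include/exclude indicator matrix for any nn. The row order differs from the hand-built matrix (expand.grid varies the first variable fastest), but the set of subsets is the same.

```r
nn <- 8L
vars <- paste0("V", 1:nn)

# Every TRUE/FALSE pattern over nn variables: 2^nn rows, nn columns
ind <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Column names selected by each row of the indicator matrix
incl <- lapply(seq_len(nrow(ind)), function(i) vars[ind[i, ]])
```

As in the answer, the first element of incl is the empty subset and would need to be skipped before computing statistics.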

Nested data frame

I have a technical problem which, it seems, I am not able to solve by myself. I ran an estimation with the MCMCglmm package. Through results$Sol I get access to the estimated posterior distributions. Applying class() tells me that the object is of class "mcmc". Using as.data.frame() results in a nested data frame which contains other data frames (one data frame which contains many other data frames). I would like to rbind() all data frames within the main data frame in order to produce one data frame (or rather a vector) with all values of all posterior distributions and the name of the (secondary) data frame as a rowname. Any ideas? I would be grateful for every hint!
Update: I didn't manage to produce a useful data set for the purposes of Stack Overflow; with all these sampling chains the data sets would always be too large. If you want to help me, please consider running the following example model
require(MCMCglmm)
data(PlodiaPO)
result <- MCMCglmm(PO ~ plate + FSfamily, data = PlodiaPO, nitt = 50, thin = 2, burn = 10, verbose = FALSE)
result$Sol (an mcmc object) is where all the chains are stored. I want to rbind all chains in order to have a vector with all values of all posterior distributions and the variable names as rownames (or since no duplicated rownames are allowed, as an additional character vector).
I can't (using the example code from MCMCglmm) construct an example where as.data.frame(model$Sol) gives me a dataframe of dataframes. So although there's probably a simple answer I can't check it very easily.
That said, here's an example that might help. Note that if your child dataframes don't have the same colnames then this won't work.
# create a nested data.frame example to work on
a.df <- data.frame(c1=runif(10),c2=runif(10))
b.df <- data.frame(c1=runif(10),c2=runif(10))
full.df <- data.frame(1:10)
full.df$a <- a.df
full.df$b <- b.df
full.df <- full.df[,c("a","b")]
# the solution
res <- do.call(rbind,full.df)
EDIT
Okay, using your new example,
require(MCMCglmm)
data(PlodiaPO)
result<- MCMCglmm(PO ~ plate + FSfamily, data=PlodiaPO,nitt=50,thin=2,burn=10,verbose=FALSE)
library(reshape2)  # assumed source of melt(); the reshape package also provides it
melt(do.call(rbind,(as.data.frame(result$Sol))))
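If you would rather avoid melt(), the same long layout can be sketched in base R. The matrix `sol` below is a made-up stand-in for result$Sol (an mcmc object behaves like a matrix with one column per parameter), and the column names are hypothetical.

```r
# Stand-in for result$Sol: 4 samples of 3 parameters (hypothetical names)
set.seed(42)
sol <- matrix(rnorm(12), nrow = 4,
              dimnames = list(NULL, c("b0", "b1", "b2")))

# Stack the columns into one vector and keep the parameter name alongside
long <- data.frame(
  parameter = rep(colnames(sol), each = nrow(sol)),
  value = as.vector(sol),
  stringsAsFactors = FALSE
)
```

as.vector() unrolls the matrix column by column, which is why repeating each column name nrow(sol) times lines the labels up with the values. With the real model, as.matrix(result$Sol) would take the place of `sol`.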

How do I match fitted(gamm4.model) values with DF despite NAs?

I'm trying to make use of the fitted values from a gamm4 model and need them to match up with the right rows in the dataframe I'm working with.
Here's the model I run:
gam.outcome <- gamm4(formula = outcome ~ male + s(gpa),
random = ~ (1|school),
data=avr, na.action="na.exclude")
With an lmer object the "na.exclude" option leaves NAs in the fitted values so that a fitted(lmer.output) call returns a vector the same length and order as the dataframe. But in gamm4 I've tried fitted(gam.outcome$gam) and fitted(gam.outcome$mer) but don't know how to deal with the results of either. The latter omits all NA, despite the "na.exclude" option. The former includes twice as many NA values as lmer which should be a clue of some kind, but I'm too thick to get it. All I know is that either way the vector doesn't line up with the original data.
I imagine there is more than one way to solve my problem. I greatly appreciate help improving or tagging my question as well as answering it. Thanks!
Approximately (untested):
myfitted <- numeric(nrow(avr))
myfitted[!complete.cases(avr)] <- NA
myfitted[complete.cases(avr)] <- fitted(gam.outcome$mer)
Or (also untested)
avrframe <- model.frame(outcome~male+gpa+school,data=avr,na.action=na.exclude)
napredict(attr(avrframe,"na.action"),fitted(gam.outcome$mer))
The first solution assumes that all of the NA values in avr are either in the columns you are interested in, or are in the same rows as NA values in the columns you are interested in. The second attempts to figure this out automatically.
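The second approach can be checked on a small example. The sketch below uses lm() on made-up data as a stand-in for the gamm4 fit, because the point is only how napredict() pads a too-short fitted vector; the data frame is a toy version of `avr` with hypothetical values.

```r
# Toy stand-in for `avr`, with NAs in rows 2, 4 and 5
set.seed(1)
avr <- data.frame(
  outcome = rnorm(10),
  male    = rep(c(0, 1), 5),
  gpa     = c(3.1, NA, 2.8, 3.5, NA, 3.9, 2.2, 3.0, 3.3, 2.7)
)
avr$outcome[4] <- NA

# Fitting on complete rows only gives a fitted vector shorter than nrow(avr)
fit <- lm(outcome ~ male + gpa, data = avr, na.action = na.omit)
short_fitted <- fitted(fit)

# Recover which rows were dropped, then pad the fitted values with NA
mf <- model.frame(outcome ~ male + gpa, data = avr, na.action = na.exclude)
aligned <- napredict(attr(mf, "na.action"), short_fitted)

avr$fitted <- aligned
```

After this, avr$fitted has one entry per row of the original data, with NA wherever a model variable was missing. This only works if the formula passed to model.frame() names exactly the variables used in the model, including the grouping variable in the mixed-model case.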
