Nested data frame - r

I have a technical problem which, it seems, I am not able to solve by myself. I ran an estimation with the MCMCglmm package. Through result$Sol I get access to the estimated posterior distributions. Applying class() tells me that the object is of class "mcmc". Using as.data.frame() results in a nested data frame which contains other data frames (one data frame which contains many other data frames). I would like to rbind() all data frames within the main data frame in order to produce one data frame (or rather a vector) with all values of all posterior distributions and the name of the (secondary) data frame as a rowname. Any ideas? I would be grateful for every hint!
Update: I didn't manage to produce a useful data set for the purpose of Stack Overflow; with all these sampling chains, the data sets would always be too large. If you want to help me, please consider running the following example model:
require(MCMCglmm)
data(PlodiaPO)
result <- MCMCglmm(PO ~ plate + FSfamily, data = PlodiaPO, nitt = 50, thin = 2, burn = 10, verbose = FALSE)
result$Sol (an mcmc object) is where all the chains are stored. I want to rbind all chains in order to have a vector with all values of all posterior distributions and the variable names as rownames (or since no duplicated rownames are allowed, as an additional character vector).

I can't (using the example code from MCMCglmm) construct an example where as.data.frame(model$Sol) gives me a dataframe of dataframes. So although there's probably a simple answer I can't check it very easily.
That said, here's an example that might help. Note that if your child dataframes don't have the same colnames then this won't work.
# create a nested data.frame example to work on
a.df <- data.frame(c1=runif(10),c2=runif(10))
b.df <- data.frame(c1=runif(10),c2=runif(10))
full.df <- data.frame(1:10)
full.df$a <- a.df
full.df$b <- b.df
full.df <- full.df[,c("a","b")]
# the solution
res <- do.call(rbind,full.df)
EDIT
Okay, using your new example,
require(MCMCglmm)
data(PlodiaPO)
result<- MCMCglmm(PO ~ plate + FSfamily, data=PlodiaPO,nitt=50,thin=2,burn=10,verbose=FALSE)
library(reshape2) # melt() is not in base R; reshape2 provides it
melt(do.call(rbind,(as.data.frame(result$Sol))))
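If the goal is one long vector of draws plus a matching character vector of parameter names, a base R sketch might look like the following. It assumes the samples can be treated as a plain numeric matrix with one column per parameter (for example via as.matrix(result$Sol)); the small matrix sol and its column names are made up for illustration and are not MCMCglmm output.

```r
# Hypothetical stand-in for the posterior samples: 4 draws of 3 parameters.
set.seed(1)
sol <- matrix(rnorm(12), nrow = 4,
              dimnames = list(NULL, c("(Intercept)", "plate", "FSfamily")))

# as.vector() unrolls a matrix column by column, so repeating each
# column name nrow(sol) times keeps names and values aligned.
long <- data.frame(
  variable = rep(colnames(sol), each = nrow(sol)),
  value    = as.vector(sol)
)

head(long)
```

Keeping the names in a separate column avoids the restriction on duplicated rownames mentioned in the question.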

Related

R - Combine two mice mids objects when data frames have different columns

I'm using the mice package on two different but related data frames.
While the large majority of the variables are the same for both data frames, a small number of variables are unique to each data frame and the imputation happens for both data frames separately (they have slightly different imputation models/ predictor matrices, etc.).
In the end, I would like to combine the two resulting mids objects, but as the columns differ, the standard procedure via rbind() (actually, the method rbind.mids is called) does not work.
Is there an easy way around this?
Two alternative approaches I could think of:
Combine the two dfs one time before imputation via dplyr::bind_rows and split them again. Now each data frame has all columns and rbind() would work after the imputations. However, that would also require defining the predictor matrix and method section again for both data frames to tell mice to ignore the new columns.
Use the mice::complete(imp_df, "long", include = TRUE) function on both mids objects, combine the resulting data frames, and use mice::as.mids() to convert them back into a single mids object. But I'm not sure if that would work or mess something else up, e.g.
Here is some example data to illustrate the issue
library(mice)
data("nhanes2") # load test data from mice package
# make test_df 1
df_1 <- nhanes2[1:14,c(1,2,3)]
# make test_df 2
df_2 <- nhanes2[15:25, c(1,2,4)]
# quick and dirty imputation for test purpose
test_imp1 <- mice(df_1)
test_imp2 <- mice(df_2)
rbind(test_imp1, test_imp2)
Error in rbind.mids.mids(x, y, call = call) :
datasets have different variable names

R scale function with character variable

I'm relatively new to R and I'm having trouble figuring out how to scale a dataset that contains a character variable.
However, when I try to use the scale function to create a dataframe, I get an error:
df<-scale(USArrests)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way to create a dataframe with a character variable to later use it in a cluster analysis?
km.res<-kmeans(df,4,nstart=10)
?scale says that scale is designed to center and scale the columns of numeric matrices; see the help entry for further details.
However, df <- USArrests is sufficient to store the required in-built dataset as object df (see environment), if you have to name it df.
Compare the following:
df <- USArrests
# compare
head(df, n=5)
# to
df1 <- scale(df)
head(df1, n=5)
As you can see, all numeric columns are now scaled while the row ids, Alabama, ..., Wyoming, of course, do not change. Btw, to check the class of all variables you can use lapply(df, class).
I think you shouldn't have problems then calling km.res <- kmeans(df1,4,nstart=10). To inspect the object, type km.res.
To be honest, I think that before running kmeans() you should have another look at the help page (e.g. help(kmeans)) to get familiar with the arguments centers, iter.max, ...
Further, I think it would be a good idea to investigate why or why not to center the data in the previous step. In any case, it is possible to run kmeans() with centered (df1) and uncentered (df) data. Why one of those alternatives is more appropriate is of major importance.
EDIT: It is recommended to set a seed (e.g. set.seed(09102021)) before running the algorithm. By doing so you ensure the reproducibility of results.
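For a data frame that really does contain a character column, one possible approach is to scale only the numeric columns and keep the character column aside. The sketch below is an assumption about what the asker's data might look like: the State column is added here purely for illustration and is not part of USArrests.

```r
# Built-in dataset plus a hypothetical character column.
df <- USArrests
df$State <- rownames(df)

# Select the numeric columns and scale only those.
num_cols <- sapply(df, is.numeric)
scaled <- scale(df[, num_cols])

# Fix the seed so the random starts of kmeans() are reproducible.
set.seed(123)
km.res <- kmeans(scaled, centers = 4, nstart = 10)

# Attach the cluster labels back to the original rows.
df$cluster <- km.res$cluster
head(df)
```

The character column never enters scale() or kmeans(), so the "'x' must be numeric" error does not arise, and the labels can still be matched to the states afterwards.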

Dummy coding omits / removes select variables from the data frame R

I have a fairly large dataset, 1460 (n) x 81 (p). About 38 variables are numeric and the rest are factors with between 2 and 30 levels. I am using dummy.data.frame from the dummies package to encode the factor variables for use in running regression models.
However, when I run the following code, some of the columns from the original dataset are removed:
train_dummy <- dummy.data.frame(train, sep = ".", verbose = TRUE, all = TRUE)
Has anyone encountered such issue before?
Link to original training dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
A number of columns from the original dataset including response variable SalePrice are being dropped. Any ideas/suggestions on what to try?
I wasn't able to reproduce the issue. I don't think there is enough info here to reproduce the issue, but I do have a few first thoughts.
run dummy data processing before train/test split
I see you're running the dummy data solely on your training data. I've found that it is usually a better strategy to run dummy data processing on the entire dataset as a whole, and then split into train / test.
Sometimes when you split first, you can run into issues with the levels of your factors.
Let's say I have a field called colors which is a factor in my data that contains the levels red, blue, green. If I split my data into train and test, I could run into a scenario where my training data only has red and blue values and no green. Now if my test dataset has all three, there will be a difference between the number of columns in my train vs test data.
I believe one way around that issue is the drop parameter in the dummy.data.frame function which defaults to TRUE.
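The level mismatch described above can be illustrated with a small sketch. It uses base R's model.matrix() rather than the dummies package, and the data frame full and the split indices are made up for illustration.

```r
# Hypothetical data: a factor with three levels and a response.
full <- data.frame(
  colors = factor(c("red", "blue", "green", "red", "blue", "green")),
  y      = 1:6
)

# A split in which the training rows happen to contain no "green".
train_idx <- c(1, 2, 4, 5)

# Encode the whole dataset first, then split the encoded matrix.
dummies <- model.matrix(~ colors - 1, data = full)
train_x <- dummies[train_idx, , drop = FALSE]
test_x  <- dummies[-train_idx, , drop = FALSE]

colnames(train_x)
colnames(test_x)
```

Because the encoding is done once on the full data, train and test end up with the same columns even though one level is absent from the training rows.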
things to check
Run these before running dummy data processing for train and test to see what characteristics these fields have that are being dropped:
# find the class of each column
train_class <- sapply(train, class)
test_class <- sapply(test, class)
# find the number of unique values within each column
unq_train_vals <- sapply(train, function(x) length(unique(x)))
unq_test_vals <- sapply(test, function(x) length(unique(x)))
# combine into data frame for easy comparison
mydf <- data.frame(
  train_class = train_class,
  test_class = test_class,
  unq_train_vals = unq_train_vals,
  unq_test_vals = unq_test_vals
)
I know this isn't really an "answer", but I don't have enough rep to comment yet.

Getting expected value through regression model and attach to original dataframe in R

My question is very similar to this one here, but I still can't solve my problem and would like a little more help to make it clear. The original dataframe "ddf" looks like:
CONC <- c(0.15,0.52,0.45,0.29,0.42,0.36,0.22,0.12,0.27,0.14)
SPP <- c(rep('A',3),rep('B',3),rep('C',4))
LENGTH <- c(390,254,380,434,478,367,267,333,444,411)
ddf <- as.data.frame(cbind(CONC,SPECIES,LENGTH))
the regression model is constructed based on Species:
model <- dlply(ddf,.(SPP), lm, formula = CONC ~ LENGTH)
the regression model works fine and returns individual models for each species.
What I want to get is the residual and expected value for the 'Length' variable from each model (one per species), and I want those values added to my original dataset ddf as new columns, so the new dataset should look like:
SPP LENGTH CONC EXPECTED RESIDUAL
Firstly, I use the following code to get the expected value:
model_pre <- lapply(model,function(x)predict(x,data = ddf))
I know there might be some mistakes in the above code, but it actually works! The result comes with two columns (predicted value and species). My first question is whether I can trust the result of the above code. (Does R fully understand what I am aiming to do, namely getting the expected value of "length" from the different models?)
Then I used the following code to attach those data to ddf:
ddf_new <- cbind(ddf, model_pre)
This code works fine as well, but here the problem appears. It seems that R just attaches the model_pre result directly to the original dataframe. Since the result of model_pre is not sorted the same way as the original ddf, the result is obviously wrong (judging by the species column in the original dataframe and in model_pre).
I used resid() with similar lapply and cbind code to get the residuals and attach them to the original ddf. The same problem occurs.
Therefore, how can I attach those results correctly in terms of length by species? (Please let me know if what I am trying to explain is confusing.)
Any help would be greatly appreciated!
There are several problems with your code. You refer to columns SPP and Conc., but columns by those names don't exist in your data frame.
Your predicted values are made on the entire dataset, not just the subset corresponding to that model (this may be intended, but seems strange with the later usage).
When you cbind a data frame to a list of data frames, does it really cbind the individual data frames?
Now to more helpful suggestions.
Why use dlply at all here? You could just fit a model with interactions that effectively fits a different regression line to each species:
fit <- lm(CONC ~ SPECIES * LENGTH, data= ddf)
fitted(fit)
predict(fit)
ddf$Pred <- fitted(fit)
ddf$Resid <- ddf$CONC - ddf$Pred
Or, if there is some other reason to really use dlply and the problem is combining two data frames that have different orderings, then either use merge or reorder the data frames to match first (see functions like order, sort.list, and match).
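A self-contained sketch of the interaction-model suggestion is given below. It assumes the species column is called SPP and builds ddf with data.frame() so the numeric columns stay numeric; the column names EXPECTED and RESIDUAL follow the layout the question asks for. The loop at the end is only a cross-check against separate per-species fits.

```r
CONC   <- c(0.15, 0.52, 0.45, 0.29, 0.42, 0.36, 0.22, 0.12, 0.27, 0.14)
SPP    <- c(rep("A", 3), rep("B", 3), rep("C", 4))
LENGTH <- c(390, 254, 380, 434, 478, 367, 267, 333, 444, 411)
ddf    <- data.frame(SPP, LENGTH, CONC)

# One model with a species-specific intercept and slope.
fit <- lm(CONC ~ SPP * LENGTH, data = ddf)

# fitted() and resid() return values in the row order of ddf,
# so they can be attached directly as new columns.
ddf$EXPECTED <- fitted(fit)
ddf$RESIDUAL <- resid(fit)

# Cross-check: fit each species separately and write the fitted
# values back into the matching rows.
per_species <- numeric(nrow(ddf))
for (s in unique(ddf$SPP)) {
  rows <- ddf$SPP == s
  per_species[rows] <- fitted(lm(CONC ~ LENGTH, data = ddf[rows, ]))
}
all.equal(per_species, unname(ddf$EXPECTED))
```

Writing results back by row index, as in the loop, is one way to avoid the ordering problem that cbind() on a list of per-species results runs into. As a side note, predict() for lm objects takes newdata rather than data, so the data = ddf argument in the question appears to be silently ignored.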

Simulating data from a data frame using ddply

I have some plant data almost identical to the 'iris' data set. I would like to simulate new data using a normal distribution. So for each variable~species in the iris data set I would create 10 new observations from a normal distribution. Basically it would just create a new data frame with the same structure as the old one, but it would contain simulated data. I feel that the following code should get me started (I think the data frame would be in the wrong form), but it will not run.
ddply(iris, c("Species"), function(x) data.frame(rnorm(n=10, mean=mean(x), sd=sd(x))))
rnorm is returning an atomic vector so ddply should be able to handle it.
ddply will subset the rows by Species, but you're doing nothing in the function to iterate over the columns of the subsetted data.frame. You cannot get rnorm() to return a list or data.frame for you; you will need to assist with the shaping. How about
ddply(iris, c("Species"), function(x) {
  data.frame(lapply(x[,1:4], function(y) rnorm(10, mean(y), sd(y))))
})
here we calculate new values for the first 4 columns in each group.
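For comparison, roughly the same result can be sketched in base R without plyr, using split(), lapply() and rbind(). The seed value and the object name sim are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the simulated values are reproducible

sim <- do.call(rbind, lapply(split(iris, iris$Species), function(x) {
  # For each species, draw 10 values per measurement column from a
  # normal distribution with that column's mean and standard deviation.
  data.frame(
    Species = x$Species[1],
    lapply(x[, 1:4], function(y) rnorm(10, mean(y), sd(y)))
  )
}))

str(sim)
```

The result has 10 simulated rows per species and the same five columns as iris, with Species first.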
