Does order of data matter? - r

I am using R to perform hierarchical clustering on categorical data.
I am trying out different variables from my sample, in order to identify the ones that provide meaningful clustering results. However, I noticed that if I change the order of the data, the results are different. Is this due to the way hclust works, or am I missing something?
For each trial I extract a subset of the columns (in the following example I used columns 3, 28, 50, 14).
my.data.final <- read.csv("C:\\Final dataset-for R.csv")
library(dplyr)
# convert character and integer columns to factors
my.data.final <- my.data.final %>% mutate_if(is.character, as.factor)
my.data.final <- my.data.final %>% mutate_if(is.integer, as.factor)
# treat Age as an ordered factor
my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)
# pick the columns for this trial and drop incomplete rows
my.data3 <- my.data.final[, c(3, 28, 50, 14)]
my.data3 <- na.exclude(my.data3)
complete.cases(my.data3)   # check that no missing rows remain
library(cluster)
# Gower dissimilarity handles mixed / categorical variables
dist.gower <- daisy(my.data3, metric = "gower")
aggl.clust.c <- hclust(dist.gower, method = "complete")
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
When I change the order of the columns in the line:
my.data3 <- my.data.final[,c(3,28,50,14)]
I noticed that the dendrogram changes. Is this expected behavior with hclust?
I have found that the line:
my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)
somehow affects the result, but I am not quite sure why.
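A first check worth doing, as a sketch reusing the objects from the question: compare the cluster memberships rather than the plots, since two dendrograms can encode the same tree while drawing branches in a different left-to-right order, and tied Gower distances can also be merged in a different order depending on how the data are arranged.
# Sketch: compare clusterings, not pictures (k = 4 chosen arbitrarily)
h1 <- hclust(daisy(my.data3, metric = "gower"), method = "complete")
h2 <- hclust(daisy(my.data3[, c(4, 3, 2, 1)], metric = "gower"),
             method = "complete")
# identical clusterings put all counts on the diagonal of this table
table(cutree(h1, k = 4), cutree(h2, k = 4))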

Related

r mice - "sample" imputation method not working correctly

I am using mice to impute missing data in a large dataset (24k obs, 98 vars). I am using the "sample" imputation method to impute some variables (and other methods for the others - many categorical). When I check my imputed data, those variables that I've applied "sample" to are not always imputed and I have missingness in them. I know for sure that I'm applying "sample" to them (I double checked the methods), and I made sure to remove all predictors of them in the prediction matrix. From my understanding, where they are in the visit sequence shouldn't matter (but I make sure they're immediately after variables with no missingness).
I can't give you a reprex because when I try to recreate the problem, it doesn't happen and everything is imputed just fine. I tried simulating my own data and I tried subsetting the dataset to a group of the variables that I want to use the sample method on. That's part of why I'm so stumped - I coded everything the same and it worked with the subset. I didn't think that the sample method would be at all dependent on the presence of any other vars.
EDIT:
This is the code I'm using:
library(mice)
library(dplyr)
# produce prediction matrix (quickpred_ext is assumed to be available in the
# asker's environment; it is not part of base mice)
pred1 <- quickpred_ext(data1, mincor = 0.08, include = "age")
pred2 <- pred1
# for vars not to be imputed, set all predictors to 0
data_no_impute <- data1 %>%
  select(contains(c("exp_", "outcome_"))) %>%
  select(sort(names(.))) %>%
  names()
data_level3 <- data1 %>%
  select(contains(c("f4", "f5", "f6")), k22) %>%
  select(sort(names(.))) %>%
  names()
pred2[data_no_impute, ] <- 0
pred2[data_level3, ] <- 0
# produce initial methods and visit sequence (dry run, no iterations)
initial <- mice(data1, max = 0, print = F, vis = "monotone",
                defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
# edit methods: blank for vars I don't want to impute, "sample" for level 3
meth1 <- initial$meth
meth2 <- meth1
meth2[data_level3] <- "sample"
meth2[data_no_impute] <- ""
visits1 <- initial$visitSequence
visits2 <- visits1
visits2 <- append(visits2, data_level3, 22)
# run mice test
mice_test <- mice(data1, m = 2, print = F,
                  predictorMatrix = pred2,
                  method = meth2,
                  vis = visits2,
                  nnet.MaxNWts = 3000)
# pull second completed dataset
imput1 <- mice::complete(mice_test, 2, include = F)
# look at missingness patterns
missingness_pattern2 <- md.pattern(imput1, plot = F)
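For reference, a minimal self-contained sketch of the "sample" method on toy data (not the dataset from the question): "sample" draws imputations from the observed values of the same column, needs no predictors at all, and in isolation should leave no missingness behind.
library(mice)
toy <- data.frame(a = c(1, 2, NA, 4, NA),
                  b = factor(c("x", NA, "y", "x", "y")))
imp <- mice(toy, m = 1, method = c("sample", "sample"), print = FALSE)
mice::complete(imp, 1)   # every NA replaced by a random observed value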

Iterating Effect Size Calculations Through Columns

I am currently comparing the sizes of 159 brain regions (ROIs) between an at-risk and a normal population in R. I originally calculated lm model p-values using this loop:
library(dplyr)
library(purrr)
library(tidyr)
# fit one lm per region column (columns 1-8 are not regions)
storage <- list()
for (i in names(ThalPC)[-c(1:8)]) {
  storage[[i]] <- lm(get(i) ~ Status, ThalPC)
}
# tidy all fits into one table of coefficients and p-values
table <- storage %>%
  tibble(dvsub = names(.), untidied = .) %>%
  mutate(tidy = map(untidied, broom::tidy)) %>%
  unnest(tidy)
tab <- as.data.frame(table)
to <- subset(tab, select = -c(2))
newtable <- filter(to, term == "StatusControl")
(ThalPC is my data frame; Status records whether a subject belongs to the control or the at-risk population.)
Now, I have around 59 regions with significant p-values and I am hoping to calculate the effect sizes for them. Currently I am trying to use this loop:
library(effectsize)
stor <- list()
for (i in names(ThalPC)[-c(1:9)]) {
  stor[[i]] <- lm(get(i) ~ Status, ThalPC)
  try <- effectsize(stor[[i]], type = "eta")
}
However, I get the following error:
Error in get(i) : object 'Left_LGN' not found
(Left_LGN being a region that I am studying, all the 159 regions are set up as columns through the data frame)
Perhaps I am overthinking it, does anyone know any simple solution/ better approach to getting the effect sizes for them?
I am still a beginner in R and statistics so I really appreciate your input!!
Thank you!
I would guess you used attach(ThalPC) before running your first script to add columns of ThalPC to the search path. Instead, try constructing your call to lm as:
stor[[i]] <- lm(as.formula(paste(i, "~ Status")), data = ThalPC)
It looks like you might want to collect the output of effectsize as elements of a list too, otherwise you're overwriting it each time.
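Putting both suggestions together, a sketch of a corrected loop (assuming the effectsize package and the same column layout as in the question):
library(effectsize)
stor <- list()
eta <- list()
for (i in names(ThalPC)[-c(1:9)]) {
  # build the formula from the column name instead of relying on get()
  stor[[i]] <- lm(as.formula(paste(i, "~ Status")), data = ThalPC)
  # keep each effect size in a named list instead of overwriting one variable
  eta[[i]] <- effectsize(stor[[i]], type = "eta")
}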

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x ~ ., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
If you don't get this error, re-run the cross-validation; at some point you will get a similar one.
The nature of the problem is that when you do cross-validation, the training and test subsets might not contain exactly the same factor levels. Here our variable z has three levels (D, E, F), and the data contain far more D's than E's and F's. Whenever you take a small subset of the whole data (to do cross-validation), there is a very good chance that every z value in it is at the D level. The E and F levels then get dropped, and we get the error (this answer is helpful for understanding the problem: https://stackoverflow.com/a/51555998/10972294).
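A quick sketch of the mechanism on the example data: a random 90% training fold can easily miss every "F" row, so a model fitted on it has never seen level F, which then turns up in the held-out fold.
set.seed(2)
train_idx <- sample(nrow(data), 900)
table(data$z[train_idx])    # "F" (1 row out of 1000) is often absent here...
table(data$z[-train_idx])   # ...and may then appear in the held-out 10%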
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this is a reproducible example; the actual data I am using has many variables like z, and I would like to avoid deleting them.)
To answer your question in the comment: I don't know whether such a function already exists. Most likely it does, but I have no idea which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A", "B"), times = c(500, 500))
z <- rep(x = c("D", "E", "F"), times = c(997, 2, 1))
data <- data.frame(x, y, z)
# optional: tag rows for later identification
# data$rowid <- 1:nrow(data)
stratified <- function(df, column, percent) {
  # split data frame into groups based on column
  listdf <- split(df, df[[column]])
  testsubgroups <- lapply(listdf, function(x) {
    # pick the number of samples per group, rounding up
    numsamples <- ceiling(percent * nrow(x))
    # select the rows
    whichones <- sample(1:nrow(x), numsamples, replace = FALSE)
    testsubgroup <- x[whichones, ]
  })
  # combine the subgroups into one data frame
  testgroup <- do.call(rbind, testsubgroups)
  testgroup
}
testgroup <- stratified(data, "z", 0.8)
This just splits the initial data by column z. If you are interested in grouping by multiple columns, it could be extended using the group_by function from the dplyr package, but that would be another question.
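For what it's worth, a one-pipe dplyr alternative (a sketch assuming dplyr >= 1.0.0; note that slice_sample rounds the per-group size down, whereas the function above rounds up):
library(dplyr)
testgroup <- data %>%
  group_by(z) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()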
A comment on the statistics: if you have only a few examples of a particular factor level, what type of fit do you expect? A poor one, with wide confidence limits.

MXnet odd error

This is my first ANN, so I imagine a lot of things might be done wrong here.
I'm trying to predict the species of flowers from the iris data set provided in R, but I get the following error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
invalid 'dimnames' given for data frame
My code:
require(mxnet)
train <- iris[1:130, ]
test <- iris[131:150, ]
train.data <- as.data.frame(train[-5])
# one-hot encode Species into three indicator columns
train.label <- data.frame(model.matrix(data = train, object = ~ Species - 1))
test.data <- as.data.frame(test[-5])
test.label <- data.frame(model.matrix(data = test, object = ~ Species - 1))
var1 <- mx.symbol.Variable("data")
layer0 <- mx.symbol.FullyConnected(var1, num.hidden = 3)
cat.out <- mx.symbol.SoftmaxOutput(layer0)
net.model <- mx.model.FeedForward.create(cat.out,
                                         array.layout = "auto",
                                         X = train.data,
                                         y = train.label,
                                         eval.data = list(data = test.data,
                                                          label = test.label),
                                         num.round = 20,
                                         array.batch.size = 20,
                                         learning.rate = 0.1,
                                         momentum = 0.9,
                                         eval.metric = mx.metric.accuracy)
UPDATE:
I managed to get rid of this error by specifying which column to use as the label (train.label[,1] and test.label[,1]).
However, now I'm training my net to predict just one of my binary variables, while I have 3 (one for each species).
I had the same problem; it turned out that:
train.data should be a matrix
train.label should be a numeric vector
Check these two and hopefully it should work.
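A sketch of that conversion, reusing the split from the question (the 0-based integer labels are an assumption of mine about how SoftmaxOutput expects classes to be encoded, not something from the original answer):
# features as a numeric matrix, labels as a 0-based integer class vector
train.x <- data.matrix(train[, -5])
train.y <- as.integer(train$Species) - 1
net.model <- mx.model.FeedForward.create(cat.out,
                                         X = train.x,
                                         y = train.y,
                                         array.layout = "rowmajor",
                                         num.round = 20,
                                         array.batch.size = 20,
                                         learning.rate = 0.1,
                                         momentum = 0.9,
                                         eval.metric = mx.metric.accuracy)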
I had a similar problem but during the prediction step. It turns out that my features were in a Data Frame which was causing the issue. Once I converted the data frame into a matrix, the issue went away.
pred.values <- stats::predict(model, as.matrix(features))
instead of
pred.values <- stats::predict(model, features)
So, the features need to be a matrix both during training and during the process of making predictions.

Error message when using predict with LARS model on testdata

I use a lars model and apply it to a large data set (75 features) with numerical data and factors.
I train the model by
mm <- model.matrix(target ~ 0 + ., data = data)
larsMod <- lars(mm, data$target, intercept = FALSE)
which gives a nice in-sample fit. If I apply it to the test data by
mm.test <- model.matrix(target ~ 0 + ., data = test.data)
predict(larsMod, mm.test, type = "fit", s = length(larsMod$arc.length))
then I get the error message
Error in scale.default(newx, object$meanx, FALSE) :
length of 'center' must equal the number of columns of 'x'
I assume it has to do with the fact that the factor levels differ between the data sets. However
which(! colnames(mm.test) %in% colnames(mm) )
gives an empty result
while
which(! colnames(mm) %in% colnames(mm.test) )
gives 3 indices.
Thus 3 factor levels appear in the training set but not in the test set.
Why does this cause a problem? How can I solve this?
The code below illustrates this with a toy example. In the test data set, the factor does not have the level "l3".
require(lars)
data.train <- data.frame(target = c(0, 1, 0, 1, 1, 1, 1, 0, 0, 0),
                         f1 = rep(c("l1", "l2", "l1", "l2", "l3"), 2),
                         n1 = rep(c(1, 2, 3, 4, 5), 2))
test.data <- data.frame(f1 = rep(c("l1", "l2", "l1", "l2", "l2"), 2),
                        n1 = rep(c(7, 4, 3, 4, 5), 2))
mm <- model.matrix(target ~ 0 + f1 + n1, data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm, data.train$target, intercept = FALSE)
mm.test <- model.matrix(~ 0 + f1 + n1, data = test.data)
colnames(mm.test)
length(colnames(mm.test))
which(!colnames(mm.test) %in% colnames(mm))
which(!colnames(mm) %in% colnames(mm.test))
predict(larsMod, mm.test, type = "fit", s = length(larsMod$arc.length))
I might be very much off here, but in my field predict doesn't work if it can't find a variable it expects. So I tried what happens if I force the model-matrix column for the factor level that is absent from the test data (f1l3) to 0.
Note 1: I created a target variable in the test data, because I couldn't get your code to run otherwise.
set.seed(123)
test.data$target <- rbinom(nrow(test.data), 1, 0.2)
# proof of concept:
mm.test <- model.matrix(target ~ 0 + f1 + n1, data = test.data)
mm.test1 <- cbind(f1l3 = 0, mm.test)
predict(larsMod, mm.test1[, colnames(mm)], type = "fit",
        s = length(larsMod$arc.length))  # runs!
Now generalize, to allow building a 'complete' model matrix whenever factor levels are missing from the test data:
# missing columns
mis_col <- setdiff(colnames(mm), colnames(mm.test))
# matrix of zeros for the missing levels
mis_mat <- matrix(0, ncol = length(mis_col), nrow = nrow(mm.test))
colnames(mis_mat) <- mis_col
# bind together and reorder to match the training matrix
# (the ordering matters -- it yielded different results in my testing)
mm.test2 <- cbind(mm.test, mis_mat)[, colnames(mm)]
predict(larsMod, mm.test2, type = "fit", s = length(larsMod$arc.length))  # runs
Note 2: I don't know what happens if the problem is the other way around (factor levels present in the test data that were not in the training data).
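An alternative sketch (same toy data; this is an assumption of mine, not part of the original answer): align the factor levels of the test data with those of the training data before calling model.matrix, so both matrices get identical columns in the first place.
test.aligned <- test.data
test.aligned$f1 <- factor(test.aligned$f1,
                          levels = levels(factor(data.train$f1)))
mm.test3 <- model.matrix(target ~ 0 + f1 + n1, data = test.aligned)
identical(colnames(mm.test3), colnames(mm))  # TRUE: columns now match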
