How to solve this cluster analysis?

I am trying to find the optimal number of clusters. I used the NbClust function to compute it, but it complains about too many missing values, even though there are no missing values in my data.
It shows:
"Error in NbClust(data = df, distance = "euclidean", min.nc = 2, max.nc = 20, :
The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated."
The data I am using:
dput(read.csv("cluster.csv"))
df = structure(list(St = c("PE", "SU", "PA", "OC", "PE",
"AC", "PP", "RA"), NDDZ91 = c(0.253576604, 0.0551232,
-0.53169303, -0.533246481, -0.533634844, -0.529751216, -0.529751216,
2.349376982), NDDZ92 = c(0.4633855, 0.952926247, -0.905688982,
-0.908031282, 0.815565566, -0.904127448, -0.904127448, 1.390097848
), NDDZ94 = c(0.971257769, 0.602251213, -0.82539626, -0.831562179,
0.018490857, -0.826819164, -0.826819164, 1.718596929), NDDZ95 = c(2.428086592,
-0.050766856, -0.502772844, -0.503557157, -0.289546405, -0.502953839,
-0.502953839, -0.075535652), NDDZ96 = c(0.073650972, 0.482511184,
-0.669130113, -0.675742407, -0.675742407, -0.664721917, -0.09563249,
2.224807178), NDDZ97 = c(2.108725851, 0.193018074, -0.616096838,
-0.618190279, 0.782927149, -0.616096838, -0.616096838, -0.618190279
), NDDZ98 = c(0.422792635, 0.224274925, -0.66324044, -0.674453783,
-0.191577267, -0.670300693, -0.670300693, 2.222805316), NDDZ99 = c(-0.045504148,
0.621635607, -1.030110408, -1.033331082, 0.370677267, 0.370677267,
-1.028730119, 1.774685616), NDDZ103 = c(0.543822029, 1.4294128,
-0.862935822, -0.865183039, 0.206064797, -0.865183039, -0.863310358,
1.277312632), NDDZ105 = c(-0.242116717, -0.327002284, -0.599905416,
-0.602682046, 0.790140631, -0.602682046, -0.598715431, 2.18296331
), NDDZ106 = c(-0.394116657, 1.166937427, -1.070650174, -1.078708713,
0.81841561, -1.078708713, 0.81841561, 0.81841561), NDDZ107 = c(1.493844177,
0.766047601, -1.041282102, -1.04295136, 0.956552995, -0.043914579,
-1.044382153, -0.043914579), NDDZ112 = c(2.137032432, 0.085031825,
-0.601376567, -0.601897927, -0.601897927, -0.601153126, 0.785414418,
-0.601153126), NDDZ113 = c(-0.102481763, -0.288855624, -0.41345193,
-0.41414606, -0.414377436, -0.413220553, 2.45975392, -0.413220553
), NDDZ114 = c(0.100876842, 0.716344963, -0.756031568, -0.758896113,
0.173403417, -0.756850009, -0.756850009, 2.038002477), NDDZ115 = c(-0.058558995,
0.221455542, -0.509307832, -0.505965142, -0.510336352, -0.507765052,
-0.507765052, 2.378242882), NDDZ116 = c(1.377841856, 1.640112838,
-0.676090962, -0.676661736, -0.676947124, -0.67409325, -0.67409325,
0.359931628), NDDZ117 = c(2.177231217, 0.849368214, -0.539426784,
-0.539639833, -0.479549446, -0.53892967, -0.509594639, -0.41945906
), NDDZ119 = c(2.215308855, 0.141088501, -0.679450372, -0.680029439,
-0.106916185, -0.678099214, -0.678099214, 0.466197068), NDDZ122 = c(1.743810041,
0.768581504, -0.772598602, -0.773098804, -0.348192016, -0.772598602,
-0.772598602, 0.926695082), NDDZ123 = c(0.634144889, 1.11554263,
-0.833927192, -0.834643558, -0.021473135, -0.832255672, -0.832255672,
1.60486771)), class = "data.frame", row.names = c(NA, -8L))
The code I have written so far:
rownames(df) = c(df$St)
df = df[,-1]
library(NbClust)
nbclust_out <- NbClust(
  data = df,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 20,
  method = "ward.D"
)
but it fails with this error: "Error in NbClust(data = df, distance = "euclidean", min.nc = 2, max.nc = 20, :
The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated."

max.nc is higher than the number of rows in your dataset (20 clusters requested, 8 rows), which might be causing your issue. Using other packages:
library(factoextra) # provides fviz_nbclust()
library(psych)      # provides fa.parallel()
# remove factor column
df$St <- NULL
# scale df
df.scaled <- scale(df)
# scree plot
scree <- fviz_nbclust(df.scaled, FUNcluster = kmeans, method = "wss", k.max = 7)
# parallel analysis
paral <- fa.parallel(df.scaled, fa = "pc")
Based on the resulting plots, I would suggest 3 clusters. However, the parallel analysis reports an ultra-Heywood case in your dataset and advises examining the results carefully.
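For readers without factoextra installed, the scree ("elbow") idea can be sketched in base R. This is only a rough illustration: the matrix x is random toy data standing in for the scaled data frame, and k_max and wss are names made up for this sketch. The point is that the largest k is capped below the number of rows.

```r
# Toy stand-in data: 8 observations, 5 variables (NOT the asker's data)
set.seed(42)
x <- scale(matrix(rnorm(8 * 5), nrow = 8))

# k cannot exceed the number of rows; stay at least one below it
k_max <- min(20, nrow(x) - 1)

# Total within-cluster sum of squares for k = 1, ..., k_max
wss <- vapply(seq_len(k_max), function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

# Look for the "elbow" where adding clusters stops helping much
plot(seq_len(k_max), wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

With only 8 rows the curve has very few points, so any elbow read from it should be treated with caution.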

Related

uwot throws an error when running the Monocle3 R package's find_gene_modules() function, likely due to how my data is formatted

I am trying to run the Monocle3 function find_gene_modules() on a cell_data_set (cds) but am getting a variety of errors. I have not had any other issues before this. I am working with an imported Seurat object. My first error stated that the number of rows was not the same between my cds and the cds@preprocess_aux$gene_loadings values. I took a look, and it seems my gene loadings were a list under cds@preprocess_aux@listData$gene_loadings. I then ran the following code to make a data frame version of the gene loadings:
test <- seurat@assays$RNA@counts@Dimnames[[1]]
test <- as.data.frame(test)
cds@preprocess_aux$gene_loadings <- test
rownames(cds@preprocess_aux$gene_loadings) <- cds@preprocess_aux$gene_loadings[,1]
This created a cds@preprocess_aux$gene_loadings data frame with the same number of rows and row names as my cds. It resolved my original error but led to a new error thrown by uwot:
15:34:02 UMAP embedding parameters a = 1.577 b = 0.8951
Error in uwot(X = X, n_neighbors = n_neighbors, n_components = n_components, :
No numeric columns found
Running traceback() produces the following information.
> traceback()
4: stop("No numeric columns found")
3: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,
metric = metric, n_epochs = n_epochs, alpha = learning_rate,
scale = scale, init = init, init_sdev = init_sdev, spread = spread,
min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,
local_connectivity = local_connectivity, bandwidth = bandwidth,
gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,
a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,
method = "umap", approx_pow = approx_pow, n_threads = n_threads,
n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,
target_n_neighbors = target_n_neighbors, target_weight = target_weight,
target_metric = target_metric, pca = pca, pca_center = pca_center,
pca_method = pca_method, pcg_rand = pcg_rand, fast_sgd = fast_sgd,
ret_model = ret_model || "model" %in% ret_extra, ret_nn = ret_nn ||
"nn" %in% ret_extra, ret_fgraph = "fgraph" %in% ret_extra,
batch = batch, opt_args = opt_args, epoch_callback = epoch_callback,
tmpdir = tempdir(), verbose = verbose)
2: uwot::umap(as.matrix(preprocess_mat), n_components = max_components,
metric = umap.metric, min_dist = umap.min_dist, n_neighbors = umap.n_neighbors,
fast_sgd = umap.fast_sgd, n_threads = cores, verbose = verbose,
nn_method = umap.nn_method, ...)
1: find_gene_modules(cds[pr_deg_ids, ], reduction_method = "UMAP",
max_components = 2, umap.metric = "cosine", umap.min_dist = 0.1,
umap.n_neighbors = 15L, umap.fast_sgd = FALSE, umap.nn_method = "annoy",
k = 20, leiden_iter = 1, partition_qval = 0.05, weight = FALSE,
resolution = 0.001, random_seed = 0L, cores = 1, verbose = T)
I really have no idea what I am doing wrong or how to proceed from here. Does anyone with experience with uwot know where my error is coming from? Really appreciate the help!
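One possible reading of the error, offered as a guess and not a confirmed diagnosis: the replacement gene_loadings built above holds only gene names, so nothing derived from it can be numeric. The sketch below reproduces just that construction in base R. The gene names and loading values are invented, and it assumes (without checking the Monocle3 source) that find_gene_modules() expects a numeric gene-by-component table.

```r
# Hypothetical gene names standing in for the Seurat feature names
genes <- c("GeneA", "GeneB", "GeneC")

# Same construction as in the question: a one-column data frame of names
loadings_chr <- as.data.frame(genes)
rownames(loadings_chr) <- loadings_chr[, 1]

# No column is numeric, and the matrix form is a character matrix
has_numeric_chr <- any(vapply(loadings_chr, is.numeric, logical(1)))
mat_chr <- as.matrix(loadings_chr)
print(has_numeric_chr)
print(is.character(mat_chr))

# For contrast: genes in rows, numeric components in columns (made-up values)
loadings_num <- matrix(c(0.1, -0.3, 0.2, 0.4, 0.0, -0.1), nrow = 3,
                       dimnames = list(genes, c("PC1", "PC2")))
print(is.numeric(loadings_num))
```

If this reading is right, the row count now matches but the content is no longer a set of loadings, which would explain why the first error disappeared and a different one took its place.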

How do I specify numerical and categorical variables in catboost with R?

The tutorial for catboost with R says this:
library(catboost)
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
loss_function = 'Logloss',
ignored_features = c(4,9),
border_count = 32,
depth = 5,
learning_rate = 0.03,
l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
However, this results in:
Error in catboost.from_data_frame(data, label, pairs, weight, group_id, :
Unsupported column type: character
Many thanks,
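A workaround that is often suggested for this error is to turn the character columns into factors before building the pool. The sketch below shows only the base R conversion. It assumes that catboost's data-frame loader treats factor columns as categorical, which should be checked against the documentation of the installed version. The commented-out call at the end is not executed here.

```r
countries <- c("RUS", "USA", "SUI")
years <- c(1900, 1896, 1896)
phone_codes <- c(7, 1, 41)
domains <- c("ru", "us", "ch")

# Keep strings as character first, mirroring the failing case
dataset <- data.frame(countries, years, phone_codes, domains,
                      stringsAsFactors = FALSE)

# Convert every character column to a factor
is_chr <- vapply(dataset, is.character, logical(1))
dataset[is_chr] <- lapply(dataset[is_chr], factor)

col_classes <- vapply(dataset, function(col) class(col)[1], character(1))
print(col_classes)

# Assumption: with catboost installed, the pool would then be built as
# pool <- catboost::catboost.load_pool(dataset, label = c(0, 1, 1))
```

Passing stringsAsFactors = TRUE to data.frame() achieves the same conversion in one step when the data frame is built from scratch.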

How to fix 'Error in sdata[[paste0("Y", usc(resp))]] : subscript out of bounds' in R, using brms package

I'm trying to set up priors for my MLM using brms. I ran my model with the priors I had set without any error messages and would now like to check them using pp_check. I get the error 'Error in sdata[[paste0("Y", usc(resp))]] : subscript out of bounds' and couldn't find any tips as to why this is happening. Thanks!
Edit: I have checked the structure of my fit and only the init variables are 0, which I think should be the case since I set the initialisation parameter to 0? Otherwise there is nothing problematic as I can see.
I couldn't try anything since googling led to nothing.
library(brms)
df <- data.frame( subjno = as.factor(c('sub-01', 'sub-01','sub-01','sub-01','sub02','sub02','sub02','sub02')),
L1 = c(0.898922096, -0.673393065, -2.240150247,-0.932520537, -0.472701111, -0.188825324,0.808675919, 0.293666248),
L2 = c(0.64888, 2.0891, -0.655322708, 0.007098555, -0.648887797, -0.249716343, -0.698128026,0.119511014),
W1 = c(0.5,0.5,-0.5,-0.5,0.5,-0.5,0.5,-0.5), W2 = c(0.5,-0.5,0.5,-0.5,0.5,0.5,-0.5,-0.5),
t = as.factor(c(12,23,34,45,12,23,34,45)))
ff_s = brmsformula(cbind(L1,L2) ~ W1 * W2 * t +
(W1*W2* t|p|subjno))
get_prior(formula = ff_s, family = gaussian(),
data = df)
pp_s <- c(set_prior('normal(0,1)', class = "b"),
set_prior("normal(0,10)", class = "sd", resp = 'L1'),
set_prior("normal(0,10)", class = "sd", resp = 'L2'),
set_prior("normal(0,5)", class = "sigma",resp = 'L1'),
set_prior("normal(0,5)", class = "sigma",resp = 'L2'),
set_prior("normal(0,10)", class = "Intercept", resp = 'L1'),
set_prior("normal(0,10)", class = "Intercept", resp = 'L2'),
set_prior("lkj(3)", class = "cor"))
fit_s <- brm(formula = ff_s,
data = df, family = gaussian(),
prior = pp_s,
chains = 6, cores = 3,
iter = 2e3, warmup = 1e3,
init = 0,
sample_prior = "only")
pp_check(fit_s)

I found out that I was calling the function pp_check without specifying the level I am interested in, problem solved!

HeatMap: how to cluster only the rows and keep order of the heatmap's column labels as same as in the df?

I want to plot a heatmap and cluster only the rows (i.e. the genes in tydf1).
I also want to keep the order of the heatmap's column labels the same as in the data frame (i.e. tydf1).
Sample data
df1 <- structure(list(Gene = c("AA", "PQ", "XY", "UBQ"), X_T0_R1 = c(1.46559502, 0.220140568, 0.304127515, 1.098842127), X_T0_R2 = c(1.087642983, 0.237500819, 0.319844338, 1.256624804), X_T0_R3 = c(1.424945196, 0.21066267, 0.256496284, 1.467120048), X_T1_R1 = c(1.289943948, 0.207778662, 0.277942721, 1.238400358), X_T1_R2 = c(1.376535013, 0.488774258, 0.362562315, 0.671502431), X_T1_R3 = c(1.833390311, 0.182798731, 0.332856558, 1.448757569), X_T2_R1 = c(1.450753714, 0.247576125, 0.274415259, 1.035410946), X_T2_R2 = c(1.3094609, 0.390028842, 0.352460646, 0.946426593), X_T2_R3 = c(0.5953716, 1.007079177, 1.912258811, 0.827119776), X_T3_R1 = c(0.7906009, 0.730242116, 1.235644748, 0.832287694), X_T3_R2 = c(1.215333041, 1.012914813, 1.086362205, 1.00918082), X_T3_R3 = c(1.069312467, 0.780421013, 1.002313082, 1.031761442), Y_T0_R1 = c(0.053317766, 3.316414959, 3.617213894, 0.788193798), Y_T0_R2 = c(0.506623748, 3.599442788, 1.734075583, 1.179462912), Y_T0_R3 = c(0.713670106, 2.516735845, 1.236204882, 1.075393433), Y_T1_R1 = c(0.740998252, 1.444496448, 1.077023349, 0.869258744), Y_T1_R2 = c(0.648231834, 0.097957459, 0.791438659, 0.428805547), Y_T1_R3 = c(0.780499252, 0.187840968, 0.820430227, 0.51636582), Y_T2_R1 = c(0.35344654, 1.190274584, 0.401845911, 1.223534348), Y_T2_R2 = c(0.220223951, 1.367784148, 0.362815405, 1.102117612), Y_T2_R3 = c(0.432856978, 1.403057729, 0.10802472, 1.304233845), Y_T3_R1 = c(0.234963735, 1.232129062, 0.072433381, 1.203096462), Y_T3_R2 = c(0.353770497, 0.885122768, 0.011662112, 1.188149743), Y_T3_R3 = c(0.396091395, 1.333921747, 0.192594116, 1.838029829), Z_T0_R1 = c(0.398000559, 1.286528398, 0.129147097, 1.452769794), Z_T0_R2 = c(0.384759325, 1.122251177, 0.119475721, 1.385513609), Z_T0_R3 = c(1.582230097, 0.697419716, 2.406671502, 0.477415567), Z_T1_R1 = c(1.136843842, 0.804552001, 2.13213228, 0.989075996), Z_T1_R2 = c(1.275683837, 1.227821594, 0.31900326, 0.835941568), Z_T1_R3 = c(0.963349308, 0.968589683, 1.706670339, 0.807060135), 
Z_T2_R1 = c(3.765036263, 0.477443352, 1.712841882, 0.469173869), Z_T2_R2 = c(1.901023385, 0.832736132, 2.223429427, 0.593558769), Z_T2_R3 = c(1.407713024, 0.911920317, 2.011259223, 0.692553388), Z_T3_R1 = c(0.988333629, 1.095130142, 1.648598854, 0.629915612), Z_T3_R2 = c(0.618606729, 0.497458337, 0.549147265, 1.249492088), Z_T3_R3 = c(0.429823986, 0.471389536, 0.977124788, 1.136635484)), row.names = c(NA, -4L ), class = c("data.table", "data.frame"))
Scripts used
library(dplyr)
library(stringr)
library(tidyr)
gdf1 <- gather(df1, "group", "Expression", -Gene)
gdf1$tgroup <- apply(str_split_fixed(gdf1$group, "_", 3)[, c(1, 2)],
1, paste, collapse ="_")
library(dplyr)
tydf1 <- gdf1 %>%
group_by(Gene, tgroup) %>%
summarize(expression_mean = mean(Expression)) %>%
spread(., tgroup, expression_mean)
#1 heatmap script is being used
library(tidyverse)
tydf1 <- tydf1 %>%
as.data.frame() %>%
column_to_rownames(var=colnames(tydf1)[1])
library(gplots)
library(vegan)
randup.m <- as.matrix(tydf1)
scaleRYG <- colorRampPalette(c("red","yellow","darkgreen"),
space = "rgb")(30)
data.dist <- vegdist(randup.m, method = "euclidean")
row.clus <- hclust(data.dist, "aver")
heatmap.2(randup.m, Rowv = as.dendrogram(row.clus),
dendrogram = "row", col = scaleRYG, margins = c(7,10),
density.info = "none", trace = "none", lhei = c(2,6),
colsep = 1:3, sepcolor = "black", sepwidth = c(0.001,0.0001),
xlab = "Identifier", ylab = "Rows")
#2 heatmap script is being used
df2 <- as.matrix(tydf1[, -1])
heatmap(df2)
Also, I want to add a color key.

It is still unclear to me what the desired output is. Some notes:
You don't need vegdist() to calculate the distance matrix for your hclust() call: all(vegdist(randup.m, method = "euclidean") == dist(randup.m)) returns TRUE;
Specifying Colv = FALSE in your heatmap.2() call prevents reordering of the columns (the default is TRUE);
It may be better to scale your data by row (see the commented-out line);
Your heatmap.2() call already returns the heatmap with a color key.
To sum up: your first script is only missing the Colv = FALSE argument, and after a little adjustment it looks like this:
heatmap.2(randup.m,
          Rowv = as.dendrogram(row.clus),
          Colv = FALSE,
          dendrogram = "row",
          #scale = "row",
          col = scaleRYG,
          density.info = "none",
          trace = "none",
          srtCol = -45,
          adjCol = c(.1, .5),
          xlab = "Identifier",
          ylab = "Rows"
)
However, I am still not sure: is this what you need?
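The same idea (cluster the rows, leave the columns alone) can also be tried with the base stats::heatmap() function, where Colv = NA switches off column reordering. This is a minimal sketch on a made-up 4 x 3 matrix, not the asker's data, and base heatmap() does not draw a color key, so it is no replacement for heatmap.2() on that point.

```r
# Toy expression matrix: 4 genes x 3 conditions (made-up values)
m <- matrix(c(1.4, 0.2, 0.3, 1.2,
              0.6, 2.1, 1.5, 0.9,
              1.1, 0.8, 2.0, 0.7),
            nrow = 4,
            dimnames = list(c("AA", "PQ", "XY", "UBQ"),
                            c("X_T0", "X_T1", "X_T2")))

# Cluster rows only
row_clus <- hclust(dist(m), method = "average")

# Colv = NA: no column dendrogram and no column reordering
hm <- heatmap(m, Rowv = as.dendrogram(row_clus), Colv = NA, scale = "none")

# Column order actually used in the plot
print(hm$colInd)
```

The returned colInd vector is a quick way to confirm that the plotted column order matches the order in the input matrix.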

Putting series summary of ugarchboot into a dataframe

I am looking at the ugarchboot function in rugarch but I am having trouble getting the Series (summary) into a dataframe.
library(rugarch)
data(dji30ret)
spec = ugarchspec(variance.model=list(model="gjrGARCH", garchOrder=c(1,1)),
mean.model=list(armaOrder=c(1,1), arfima=FALSE, include.mean=TRUE,
archm = FALSE, archpow = 1), distribution.model="std")
ctrl = list(tol = 1e-7, delta = 1e-9)
fit = ugarchfit(data=dji30ret[, "BA", drop = FALSE], out.sample = 0,
spec = spec, solver = "solnp", solver.control = ctrl,
fit.control = list(scale = 1))
bootpred = ugarchboot(fit, method = "Partial", n.ahead = 120, n.bootpred = 2000)
bootpred
as.data.frame(bootpred, which = "sigma", type = "q", qtile = c(0.01, 0.05))
## I am trying to get this into a data frame:
Series (summary):
min q.25 mean q.75 max forecast
t+1 -0.24531 -0.016272 0.000143 0.018591 0.16263 0.000743
t+2 -0.24608 -0.018006 -0.000290 0.017816 0.16160 0.000232
t+3 -0.24333 -0.017131 0.001007 0.017884 0.31861 0.000413
t+4 -0.26126 -0.018643 -0.000618 0.017320 0.34078 0.000349
t+5 -0.19406 -0.018545 -0.000453 0.016690 0.33356 0.000372
t+6 -0.23864 -0.017268 -0.000113 0.016001 0.18233 0.000364
t+7 -0.27024 -0.018031 -0.000514 0.017852 0.18436 0.000367
t+8 -0.13926 -0.016676 0.000539 0.017904 0.16271 0.000366
t+9 -0.32941 -0.017221 -0.000194 0.016718 0.13894 0.000366
t+10 -0.19013 -0.015845 0.001095 0.017064 0.14498 0.000366
Thank you for your help.
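The rugarch documentation appears to describe which = "series" together with type = "summary" for as.data.frame() on a bootstrap object, which would be worth trying first. That call is not verified here. As a fallback, a table of the same shape can be computed by hand from a matrix of bootstrapped paths. In the sketch below, sim is a simulated stand-in for such a matrix (draws in rows, horizons in columns), and the forecast column is left out because it comes from the fitted model.

```r
# Made-up stand-in for bootstrapped paths: 2000 draws x 10 horizons
set.seed(1)
n_ahead <- 10
sim <- matrix(rnorm(2000 * n_ahead, sd = 0.02), ncol = n_ahead)

# One row per horizon, one column per summary statistic
series_summary <- data.frame(
  min  = apply(sim, 2, min),
  q.25 = apply(sim, 2, quantile, probs = 0.25),
  mean = colMeans(sim),
  q.75 = apply(sim, 2, quantile, probs = 0.75),
  max  = apply(sim, 2, max),
  row.names = paste0("t+", seq_len(n_ahead))
)

print(is.data.frame(series_summary))
print(head(series_summary, 3))
```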