I'm trying to plot a K-Means cluster to analyze different categories of products based on their inventory average and sold quantity.
All values are non-negative and of the same measurement unit.
I don't know what I did wrong, but the plot contains points with negative values. In fact, I believe none of the plotted points correspond to actual points in my data.
Here is my code:
reduced_dataset = dataset[1:20, 4:5]
# Using the elbow method to find the optimal number of clusters
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(reduced_dataset, i)$withinss)
plot(1:10,
wcss,
type = 'b',
main = paste('The Elbow Method'),
xlab = 'Number of clusters',
ylab = 'WCSS')
# As a result, number of clusters should be 2
# Fitting K-Means to the dataset
kmeans = kmeans(x = reduced_dataset, centers = 2)
y_kmeans = kmeans$cluster
# Visualising the clusters
library(cluster)
clusplot(reduced_dataset,
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste('Clusters of categories - NOT ON SALE'),
xlab = 'Average Sold Quantity',
ylab = 'Average Inventory')
dput(reduced_dataset):
structure(list(Avg_Sold_No_Promo = c(0.255722695, 1.139983236,
0.458651842, 0.784966698, 1.642746914, 0.115264798, 7.50338696,
0.487603306, 1.023373984, 0.956099815, 1.505901506, 0.253837072,
0.834963325, 0.880898876, 6.527699531, 11.54054054, 3.44077135,
0.750182882, 0.251033058, 1.875698324), Avg_Inventory_No_Promo =
c(6.068672335,
22.57865326, 9.00694927, 11.56137012, 28.47530864, 7.485981308,
170.9064352, 11.07438017, 22.80792683, 40.63863216, 41.73463573,
10.87603306, 35.87408313, 46.09213483, 185.5671362, 315.6015693,
165.1129477, 78.18032187, 9.65857438, 198.4385475)), .Names =
c("Avg_Sold_No_Promo",
"Avg_Inventory_No_Promo"), row.names = c(NA, 20L), class = "data.frame")
Can someone please help me?
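The clusplot function does this automatically: it does not plot your raw values but projects the data onto its first two principal components.
This is called PCA, and it is also why the plot has the line reporting the variability explained by the two components. Because the component scores are centered, negative coordinates are expected even when all the original values are non-negative.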
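To see this for yourself, here is a minimal sketch that uses only the first six rows of the data posted in the question. It assumes that the projection clusplot draws is equivalent to calling princomp on the two columns (that is my understanding of the function, not something I have checked against every version). The last lines plot the raw values coloured by cluster, which may be closer to what you expected.

```r
# First six rows of the data from the question's dput() output
reduced_dataset <- data.frame(
  Avg_Sold_No_Promo = c(0.255722695, 1.139983236, 0.458651842,
                        0.784966698, 1.642746914, 0.115264798),
  Avg_Inventory_No_Promo = c(6.068672335, 22.57865326, 9.00694927,
                             11.56137012, 28.47530864, 7.485981308)
)

# PCA scores: the data re-expressed on the principal components
pca <- princomp(reduced_dataset)
scores <- pca$scores

all(reduced_dataset >= 0)  # the raw data are non-negative
any(scores < 0)            # the scores are not, because they are centered
colMeans(scores)           # numerically zero for each component

# Plot the raw values instead, coloured by k-means cluster
set.seed(1)  # k-means starts from random centres
fit <- kmeans(reduced_dataset, centers = 2)
plot(reduced_dataset$Avg_Sold_No_Promo,
     reduced_dataset$Avg_Inventory_No_Promo,
     col = fit$cluster, pch = 19,
     xlab = "Average Sold Quantity", ylab = "Average Inventory")
```

With only two variables nothing is lost by plotting them directly, so the PCA projection is optional here.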
I am performing a meta-analysis of proportions using metaprop function. I am looking at the prevalence of heart fibrosis in people living with HIV.
# mri$lgehivn <- number of people with HIV who have fibrosis
# mri$hivnmri <- total number of people with HIV who have had CMR
lge.prop <- metaprop(event = mri$lgehivn,
n = mri$hivnmri,
subset = c(1:11, 13:16),
studlab = paper,
data = mri,
method = "Inverse",
sm = "PLOGIT",
random = TRUE,
hakn = FALSE,
pscale = 100,
digits = 1)
I am then passing this into a forest plot:
forest.meta(lge.prop,
rightcols=FALSE,
leftcols=c("studlab", "event", "n", "effect", "ci"),
leftlabs = c("Study", "Cases", "Total", "Prevalence", "95% C.I."),
xlim= c(0,110),
smlab = c("Prevalence of LGE (%)"),
digits = 1,
colgap.left = 1)
This then gives me the following forest plot:
[Figure: forest plot of the meta-analysis]
I am trying to remove the line that reports the "Common effect model" and only show the random effect model.
Does anyone know the code for this?
Thank you!
I do not know if it is still relevant but I think that this was introduced with a newer version of the meta package.
When manually installing version 4.15-1, the common effect model was removed automatically.
Please use the code below before creating the forest plot:
lge.prop <- metaprop(event = mri$lgehivn,
n = mri$hivnmri,
subset = c(1:11, 13:16),
studlab = paper,
data = mri,
method = "Inverse",
sm = "PLOGIT",
random = TRUE,
hakn = FALSE,
pscale = 100,
digits = 1,
common = FALSE)
I am using WGCNA package for network analysis with following steps:
Data input
Generate Modules
Get gene id
Phenotype x module correlation
I want to use the package to include the phenotype data together with the gene expression matrix to find which genes group with the phenotypes. Then, I want to get the module of interest and do a network map and check which genes relate to the phenotypes.
I generated modules like:
library(WGCNA)
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
lnames = load(file = "dataInput.RData");
# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=40, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
##Constructing the gene network and identifying modules is now a simple function call:
net_unsigned = blockwiseModules(datExpr, power = 6,
TOMType = "unsigned", minModuleSize = 30, maxBlockSize = 300,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "PopulusTOM_signed",
verbose = 5)
##maxBlockSize = The total number of genes you have in your gene expression matrix that passed the filter from the Data_Input script
##Plotting graph
pdf("Dendogram_Modules_signed.pdf", width = 30, height = 30);
##Convert labels to colors for plotting
mergedColors = labels2colors(net_unsigned$colors)
##Plot the dendrogram and the module colors underneath
plotDendroAndColors(net_unsigned$dendrograms[[1]], mergedColors[net_unsigned$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
##Save
moduleLabels = net_unsigned$colors
moduleColors = labels2colors(net_unsigned$colors)
MEs = net_unsigned$MEs;
geneTree = net_unsigned$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
file = "unsigned-networkConstruction-auto.RData")
This generates modules and then I correlated a module with one phenotype. How can I include the phenotypic data with gene expression? Thank you!
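For reference, the "Phenotype x module correlation" step can be sketched in base R as below. This is only an illustration under stated assumptions: the eigengenes and phenotypes are simulated stand-ins (not real data), and the names datTraits, height and biomass are hypothetical. In the real analysis MEs would be net_unsigned$MEs, with one row per sample in the same order as the phenotype table.

```r
set.seed(42)
n_samples <- 12

# Simulated stand-ins (NOT real data): two module eigengenes and two phenotypes
MEs <- data.frame(ME1 = rnorm(n_samples), ME2 = rnorm(n_samples))
datTraits <- data.frame(height = rnorm(n_samples), biomass = rnorm(n_samples))

# Pearson correlation of every module eigengene with every phenotype
moduleTraitCor <- cor(MEs, datTraits, use = "pairwise.complete.obs")

# Two-sided p-values from the t distribution with n - 2 degrees of freedom
t_stat <- moduleTraitCor * sqrt((n_samples - 2) / (1 - moduleTraitCor^2))
moduleTraitP <- 2 * pt(-abs(t_stat), df = n_samples - 2)

moduleTraitCor
moduleTraitP
```

Rows are modules and columns are phenotypes, so several phenotypes can be screened at once rather than one at a time.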
I'm plotting decision trees built with partykit in ggparty, and struggling to rotate the tree branches around branch nodes, that is, to change the order in which they are displayed. I'm hoping to plot a tree with the branches displayed in an order according to the values of the terminal nodes. This would be somewhat comparable to using reorder() in a standard ggplot geom.
For instance, outlined below using the WeatherPlay data from the ggparty vignette, can the branches of this tree be rotated so that terminal node geom_bar() plots are displayed in increasing order of proportion "yes"? In this case, this would mean "sunny", then "rainy", then "overcast".
For my project I'll have terminal nodes with boxplots, but I'm guessing that the method for rotating branches is modular and can be repurposed.
## Playing with default WeatherPlay data as demonstrated in ggparty examples here:
# https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html
library(partykit)
library(ggparty)
data("WeatherPlay", package = "partykit")
sp_o <- partysplit(1L, index = 1:3)
sp_h <- partysplit(3L, breaks = 75)
sp_w <- partysplit(4L, index = 1:2)
pn <- partynode(1L, split = sp_o, kids = list(
partynode(2L, split = sp_h, kids = list(
partynode(3L, info = "yes"),
partynode(4L, info = "no"))),
partynode(5L, info = "yes"),
partynode(6L, split = sp_w, kids = list(
partynode(7L, info = "yes"),
partynode(8L, info = "no")))))
py <- party(pn, WeatherPlay)
# Node plots
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:4L, partynode))
t2 <- party(n1,
data = WeatherPlay,
fitted = data.frame(
"(fitted)" = fitted_node(n1, data = WeatherPlay),
"(response)" = WeatherPlay$play,
check.names = FALSE),
terms = terms(play ~ ., data = WeatherPlay)
)
t2 <- as.constparty(t2)
# Plot tree
ggparty(t2) +
geom_edge() +
geom_edge_label() +
geom_node_splitvar() +
geom_node_plot(gglist = list(geom_bar(aes(x = "", fill = play),
position = position_fill()),
xlab("play")),
# draw only one label for each axis
shared_axis_labels = TRUE,
# draw line between tree and legend
legend_separator = TRUE
)
I'm trying to create a plot in R that includes a table of the survival probabilities at specified points in time.
Currently the plot looks like the following:
R code for the plot using the survminer package:
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"))
Ideally I would like a table below the "Number at risk by time" table that displays the survival probabilities for each stratum at times 250, 500, 750, and 1000.
I can retrieve the survival probabilities with the following code:
summary(fit, times=0:1000)
I made a function for that a while back. It takes as arguments a survfit object and a time sequence, and returns the survival probabilities.
ConstruirTabela = function(a, sequencia = seq(250, 1000, by = 250)) {
  # Find where each stratum ends: the times restart (decrease) at a new stratum
  quebra = NULL
  for (i in 1:(length(a$time) - 1)) {
    if (a$time[i] > a$time[i + 1]) {
      quebra = c(quebra, i)
    }
  }
  quebra = c(quebra, length(a$time))
  # Split the survival estimates and times by stratum
  lsurv = list()
  ltime = list()
  previous = 0
  for (i in 1:length(quebra)) {
    periodo = c((previous + 1):quebra[i])
    lsurv[[i]] = a$surv[periodo]
    ltime[[i]] = a$time[periodo]
    previous = quebra[i]
  }
  # For each requested time, take the estimate at the closest observed time
  matriz = matrix(ncol = length(ltime), nrow = length(sequencia))
  for (i in 1:length(sequencia)) {
    for (j in 1:length(ltime)) {
      indice = which.min(abs(ltime[[j]] - sequencia[i]))
      matriz[i, j] = lsurv[[j]][indice]
    }
  }
  # One column per stratum, one row per requested time
  retorno = as.data.frame(matriz)
  f = strsplit(names(a$strata), "=")
  names(retorno) = sapply(f, "[[", 2)
  rownames(retorno) = as.character(sequencia)
  return(retorno)
}
It's probably not the best way to achieve this, but check if it works for you.
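A shorter route, sketched here under assumptions, is to pass only the required times to summary(), which the question already uses with times=0:1000. The lung data set from the survival package stands in for your data, and the object name tab is mine. As far as I know, summary() reports the estimate in effect at each requested time (the value of the step function), not the value at the nearest observed time.

```r
library(survival)

# `lung` ships with the survival package; it stands in for the real data here
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# extend = TRUE keeps a row for every requested time, even when a
# stratum has no subjects left at that time
s <- summary(fit, times = c(250, 500, 750, 1000), extend = TRUE)

# One row per stratum and time point
tab <- data.frame(strata = s$strata,
                  time   = s$time,
                  surv   = round(s$surv, 3))
tab
```

The resulting data frame can then be reshaped or drawn as a text table under the plot.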
Try this ggpubr library. Look at the very bottom of this page. It shows a graph with a text table.
Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1
and Fig2
Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.
Explanation of Fig2: This figure visualizes the contributions of rows/columns from the results of Principal Component Analysis (PCA). From it I can see that the variables name, studLoc and finalmark are the most important variables that can be used for further analysis.
Further analysis (where I'm stuck): To derive the contribution of the aforementioned variables name, studLoc and finalmark, I use the principal component variable df.princomp (see above) like df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].
I have to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to assign the column indices dynamically, so that I do not have to manually code the column index [,2:3] in the list df.princomp.
I've already looked at the following similar questions 1, 2, 3 and 4 but cannot find a solution. Any help or suggestions to solve this problem would be helpful.
Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.
If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
There are a lot of ways to extract contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke too soon and forgot you've got factors here, so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Type factoextra::fviz_contrib at the console and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
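Once you have that table, variables can be picked by name rather than by column position. The sketch below is base R only and re-creates dd from the values printed above, so it runs without FactoMineR. The cut-off of 20 and the object names top_fixed and top_theo are illustrative choices; the second cut-off mirrors the 100/length(contrib) reference line computed in the function source for a single axis.

```r
# Contributions to Dim.1, copied from the facto_summarize() output above
dd <- data.frame(
  name = c("ID", "finalmark", "subj1mark", "subj2mark", "name", "studLoc"),
  contrib = c(0.9924561, 21.4149175, 7.1874438,
              16.6831560, 26.8610132, 26.8610132),
  stringsAsFactors = FALSE
)

# Option 1: a fixed cut-off chosen by eye from the plot
top_fixed <- dd$name[dd$contrib > 20]
top_fixed

# Option 2: the uniform reference value, i.e. the contribution each
# variable would have if all of them contributed equally
theo_contrib <- 100 / nrow(dd)
top_theo <- dd$name[dd$contrib > theo_contrib]
top_theo

# Either vector can then index the original data frame by name,
# for example df[top_fixed], with no hard-coded column numbers.
```

Selecting by name keeps working if the column order of the data frame changes, which is the main point of avoiding [,2:3].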
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.