I have the same problem as in this topic: https://stackoverflow.com/questions/50107157/adding-labels-to-cluster
I followed the answer given there, but it still does not work for me.
I then tried another solution from https://stackoverflow.com/questions/8120984/scaling-data-in-r-ignoring-specific-columns, which also did not work.
So far my code just follows https://uc-r.github.io/kmeans_clustering and https://afit-r.github.io/kmeans_clustering, as shown below:
1. library(tidyverse)
2. library(cluster)
3. library(factoextra)
4. dataMCU = read.csv("MCU180721.csv")
5. dataMCU <- na.omit(dataMCU)
6. dataMCU <- scale(dataMCU)
Line 6 fails with this error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
For additional information, the table in my CSV file looks like this:
precint, green, yelloe, orange, red
oregon, 6, 7, 8, 9
My question is how to resolve this problem.
I recently tried running dataMCU <- dataMCU[, c(-1)] before scale().
This works, but not as expected: I want the same result as in https://uc-r.github.io/kmeans_clustering and https://afit-r.github.io/kmeans_clustering.
For additional information, this is the plot from https://afit-r.github.io/kmeans_clustering:
but my code always produces a plot like the one below:
Your data frame contains a character column (the precinct names), and scale() only works on numeric data. You can use the names as row names and remove the first column.
rownames(dataMCU) <- dataMCU[, 1]  # the first column holds the names
dataMCU <- dataMCU[,-1]
Then, you can remove NA rows and scale the data frame.
dataMCU <- na.omit(dataMCU)
dataMCU <- scale(dataMCU)
Proceed with distance matrix and clustering:
distance <- get_dist(dataMCU)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
clust <- kmeans(dataMCU, centers = 2, nstart = 25)
fviz_cluster(clust, data = dataMCU)
P.S.: To avoid the manual workaround, you can simply import the CSV file with the first column as row names:
dataMCU = read.csv("MCU180721.csv", row.names = 1, header = TRUE)
Edit:
Importing the csv with row names:
dataMCU = read.csv("DataPrecint.csv", row.names=1, header=TRUE)
dataMCU <- na.omit(dataMCU)
dataMCU <- scale(dataMCU)
distance <- get_dist(dataMCU)
fviz_dist(distance, gradient=list(low="green", mid="yellow", high="red"))
k4 <- kmeans(dataMCU, centers=4, nstart=25)
str(k4)
k4
fviz_cluster(k4, data = dataMCU)
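The idea in the answer above (names as row names, only numeric columns passed to scale()) can be sketched with base R alone. The CSV text and its values are made up for illustration and only mirror the shape of the table in the question; `csv_text`, `is_num` and `scaled` are hypothetical names.

```r
# Hypothetical data with the same layout as the table in the question
csv_text <- "precint,green,yellow,orange,red
oregon,6,7,8,9
texas,1,4,2,8
ohio,5,3,9,2"

# Read the first column as row names so it never reaches scale()
dataMCU <- read.csv(text = csv_text, row.names = 1, header = TRUE)

# Keep only numeric columns, in case other text columns are present
is_num <- vapply(dataMCU, is.numeric, logical(1))
scaled <- scale(dataMCU[, is_num, drop = FALSE])

# The row names survive scaling, so plots can label points by name
rownames(scaled)
```

Because the names live in the row names rather than in a column, they are carried through na.omit() and scale() without being treated as data.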
I am deploying a Shiny App on Heroku.
However the buildpack does not support rowr as the version of R is not compatible with this package.
How can I replace rowr::cbind.fill with base R or dplyr functions ?
I tried to see what was in the function, but it is not clear to me:
function (..., fill = NULL)
{
    inputs <- list(...)
    inputs <- lapply(inputs, vert)
    maxlength <- max(unlist(lapply(inputs, len)))
    bufferedInputs <- lapply(inputs, buffer, length.out = maxlength,
        fill, preserveClass = FALSE)
    return(Reduce(cbind.data.frame, bufferedInputs))
}
Another way of asking this: is there a dplyr solution to the following question (I need fill = NA)?
cbind a vector of different length to a dataframe
Here is a solution that finds how many elements each column is missing, pads each column with NAs, and binds them together. I assume that the columns are stored in a list:
# test data
test_data <- list(c1 = rep(1, 5),
c2 = rep(2, 7),
c3 = rep(3, 4))
# find the longest column
longest_column <- max(unlist(lapply(test_data, length)))
# calculate how many NAs need to be added
number_append <- lapply(test_data, function(x) longest_column - length(x))
# append the columns
data_appended <- lapply(seq_len(length(test_data)), function(i) {
c(test_data[[i]], rep(NA, number_append[[i]]))
})
# combine the columns
data_combined <- do.call("cbind", data_appended)
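The steps above can also be wrapped into a small helper. This is a minimal base R sketch, not a drop-in copy of rowr::cbind.fill: the name `cbind_fill` is hypothetical, and it assumes every input is a plain vector (data frames or matrices as inputs are not handled).

```r
# Hypothetical helper: pad vectors to a common length, then bind as columns
cbind_fill <- function(..., fill = NA) {
  cols <- list(...)
  n <- max(lengths(cols))  # length of the longest input
  # append `fill` to every vector that is shorter than n
  padded <- lapply(cols, function(x) c(x, rep(fill, n - length(x))))
  as.data.frame(padded, stringsAsFactors = FALSE)
}

res <- cbind_fill(c1 = rep(1, 5), c2 = rep(2, 7), c3 = rep(3, 4))
dim(res)
```

Naming the arguments (c1, c2, c3) gives the resulting columns readable names; unnamed inputs still work but get automatically generated column names.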
I am trying to use Spearman correlation/clustering to draw a heat map for the results of a differential expression experiment.
The code is as follows:
library(ggplot2)
library(data.table)
library(preprocessCore)
library(limma)
library(dplyr)
install.packages("NMF")
library(NMF)
library(RColorBrewer)
library(tidyverse)
rm(list=ls())
file <- "C:/PETER PROJECT/3d. Mass Spec Processing/Serum Addition Apr 19/Serum Addition/DE Proteins for heatmap analysis/WCL/.CSV Files/DE Proteins WCL Apr2019 All samples Reanalysis.txt"
data <- read.delim(file, sep="\t", header=T, dec=".")
head(data) #data <- read.csv("", comment.char="#")
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform column 2-5 into a matrix
rownames(mat_data) <- rnames # assign row names
head(mat_data)
dataD.log2 = log2(mat_data)
datat <- t(dataD.log2)
heatmap <- aheatmap(datat, color = "-RdBu:50", scale = "col", breaks = 0,
annRow = datat["Description"], annColors = "Set2", main = "Comparison of CHO K1 Cells grown in the presence and absence of FBS (Whole Cell Lysates - All Conditions)",
distfun = "spearman", treeheight=c(50, 50),
fontsize=10, cexCol=1, cexRow=1)
The data is a matrix with 52000 elements, so I won't post it here. Instead, I have a new question, now that I have removed the NAs and zeros that were in my data.
Question
Having removed all NAs and zeros, what else in my raw data should I look for that could be causing the error?
You can use the t() function to transpose, and then replace the NA values with 0:
mat_data <- t(mat_data)
mat_data[is.na(mat_data)] <- 0
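As a small illustration of transposing a matrix and replacing its NA values, here is a sketch on a toy matrix. The matrix `m` and its values are invented for the example and are not the data from the question.

```r
# Toy matrix with two missing values (illustrative only)
m <- matrix(c(1, NA, 3, 4, NA, 6), nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

tm <- t(m)            # transpose: rows become x, y, z
tm[is.na(tm)] <- 0    # logical indexing replaces every NA with 0

anyNA(tm)
```

Keep in mind that log2(0) is -Inf, so zeros introduced this way would still need handling before a log transform.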
I am playing a bit with the SOMbrero package. I would like to attach the cluster numbers created like so (taken from here):
my.sc <- superClass(iris.som, k=3)
and X and Y coordinates of the SOM nodes to the training dataset.
In some code, where I use the kohonen package, I create clusters like this:
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
ind <- sapply(SubsetData, is.numeric)
SubsetData[ind] <- lapply(SubsetData[ind], range01)
TrainingMatrix <- as.matrix(SubsetData)
GridDefinition <- somgrid(xdim = 4, ydim = 4, topo = "rectangular", toroidal = FALSE)
SomModel <- som(
data = TrainingMatrix,
grid = GridDefinition,
rlen = 10000,
alpha = c(0.05, 0.01),
keep.data = TRUE
)
nb <- table(SomModel$unit.classif)
groups = 5
tree.hc = cutree(hclust(d=dist(SomModel$codes[[1]]),method="ward.D2",members=nb),groups)
plot(SomModel, type="codes", bgcol=rainbow(groups)[tree.hc])
add.cluster.boundaries(SomModel, tree.hc)
result <- OrginalData
result$Cluster <- tree.hc[SomModel$unit.classif]
result$X <- SomModel$grid$pts[SomModel$unit.classif,"x"]
result$Y <- SomModel$grid$pts[SomModel$unit.classif,"y"]
write.table(result, file = "FinalData.csv", sep = ",", col.names = NA, quote = FALSE)
PS:
Some example code using the iris dataset can be found here.
PPS:
I played a bit with the iris code quoted above and think I have managed to extract the clusters, node ids and prototypes (see code below). What is missing are the X and Y coordinates. I think they are in here:
iris.som$parameters$the.grid$coord
Code:
library(SOMbrero)
set.seed(100)
setwd("D:\\RProjects\\Clustering")
#iris.som <- trainSOM(x.data=iris[,1:4],dimension=c(10,10), maxit=100000, scaling="unitvar", radius.type="gaussian")
iris.som <- trainSOM(x.data=iris[,1:4],dimension=c(3,3), maxit=100000, scaling="unitvar", radius.type="gaussian")
# perform a hierarchical clustering
## with 3 super clusters
iris.sc <- superClass(iris.som, k=3)
summary(iris.sc)
# compute the projection quality indicators
quality(iris.som)
iris1 <- iris
iris1$Cluster = iris.sc$cluster[iris.sc$som$clustering]
iris1$Node = iris.sc$som$clustering
iris1$Pt1Sepal.Length = iris.sc$som$prototypes[iris.sc$som$clustering,1]
iris1$Pt2Sepal.Width = iris.sc$som$prototypes[iris.sc$som$clustering,2]
iris1$Pt3Petal.Length = iris.sc$som$prototypes[iris.sc$som$clustering,3]
iris1$Pt4Petal.Width = iris.sc$som$prototypes[iris.sc$som$clustering,4]
write.table(iris1, file = "Iris.csv", sep = ",", col.names = NA, quote = FALSE)
I think I have figured it out using the iris example (please correct/improve code! - I am not fluent in R):
library(SOMbrero)
set.seed(100)
setwd("D:\\RProjects\\SomBreroClustering")
iris.som <- trainSOM(x.data=iris[,1:4],dimension=c(5,5), maxit=10000, scaling="unitvar", radius.type="letremy")
# perform a hierarchical clustering
# with 3 super clusters
iris.sc <- superClass(iris.som, k=3)
summary(iris.sc)
# compute the projection quality indicators
quality(iris.som)
iris1 <- iris
iris1$Cluster = iris.sc$cluster[iris.sc$som$clustering]
iris1$Node = iris.sc$som$clustering
iris1$Pt1Sepal.Length = iris.sc$som$prototypes[iris.sc$som$clustering,1]
iris1$Pt2Sepal.Width = iris.sc$som$prototypes[iris.sc$som$clustering,2]
iris1$Pt3Petal.Length = iris.sc$som$prototypes[iris.sc$som$clustering,3]
iris1$Pt4Petal.Width = iris.sc$som$prototypes[iris.sc$som$clustering,4]
iris1$X = iris.som$parameters$the.grid$coord[iris.sc$som$clustering,1]
iris1$Y = iris.som$parameters$the.grid$coord[iris.sc$som$clustering,2]
write.table(iris1, file = "Iris.csv", sep = ",", col.names = NA, quote = FALSE)
I am not sure that I got it right, but:
iris.som$parameters$the.grid$coord contains the coordinates of the clusters (it is a two-column array with the x and y coordinates in the mapping space),
so I think that what you want to do is
out.grid <- as.data.frame(iris.som$parameters$the.grid$coord)
out.grid$sc <- iris.sc$cluster
and export out.grid (a three-column table). iris.sc$som$prototypes contains the coordinates of the prototypes of the clusters, but in the original space (the four-dimensional space in which the iris dataset takes its values).
I think my answer captures the requirements: adding the node ids, x + y coordinates, cluster and prototypes to the original data. Would you agree?
yes :)
I have a table of 10 rows and 6 columns, where each entry is a real value.
After applying the kmeans algorithm, I would like R to draw 6*(6-1) = 30 plots, one for each ordered pair of columns used as the axes.
When I do this with the original data, everything works fine. But if I quantile-normalize the data first, it no longer works and R shows only the plot for the first pair.
Here are the data (data.csv):
chrName-chrStart-chrEnd,gm12878,h1-hesc,hela-s3,hepg2,huvec,k562
chr1-66660-66810,0,0,2.825,0.75,0,0.85
chr1-564520-564670,15.6356435644,4.5469879518,57.7813793103,130.2263636364,5.8088888889,101.680952381
chr1-568060-568210,17.9069767442,3.6970588235,15.962745098,34.8866666667,4.1,31.0394736842
chr1-568900-569050,41.7029411765,7.4568181818,28.3984615385,59.464957265,8.5194444444,44.6583333333
chr1-601040-601190,0.4,0.75,0.5333333333,0.4,0.3,0.3
chr1-662500-662650,0,3.45,0.25,63,0.9923076923,5.7469879518
chr1-714040-714190,115.0871428571,125.6707142857,80.8081632653,153.9737931034,70.0197080292,166.5101351351
chr1-730400-730550,1.3730769231,0,0,0.9,7.6690140845,0.76
chr1-753400-753550,1.3517241379,4.1,0.4818181818,0,0.3,1.4285714286
chr1-762820-762970,43.6430769231,17.875,21.2659574468,123.1888888889,14.5743589744,56.7931034483
Here's my working code:
dnaseSignalFile = "data.csv"
originalDataTab <- read.csv(dnaseSignalFile, header=TRUE, sep=",")
originalDataTabSubMatrixChromSel_onlyData <- originalDataTab [,2:7]
cl0 <- kmeans(originalDataTabSubMatrixChromSel_onlyData , 2)
plot(originalDataTabSubMatrixChromSel_onlyData , col = cl0$cluster)
points(cl0$centers, col = 1:2, pch = 8, cex = 2)
It then correctly shows this image:
And that's fine!
But if I try to run a quantile normalization, things no longer work:
library("slam"); library("preprocessCore"); library("nnet");
normQuant<- normalize.quantiles(as.matrix(originalDataTabSubMatrixChromSel_onlyData), copy=TRUE)
roundNormQuant <- round(normQuant)
roundNormQuantTab <- as.data.frame(roundNormQuant)
colnames(roundNormQuantTab) <- colnames(originalDataTabSubMatrixChromSel_onlyData)
roundNormQuantTab <- normQuant
colnames(roundNormQuantTab) <- colnames(originalDataTabSubMatrixChromSel_onlyData)
rownames(roundNormQuantTab) <- rownames(originalDataTabSubMatrixChromSel_onlyData)
dev.new()
cl <- kmeans(roundNormQuantTab, 2)
plot(roundNormQuantTab, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
The only thing I see is the following picture:
Why can't I get the six plots in the second case, too?
What's different between the former case and the latter one?
How could I solve this problem?
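One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the second snippet roundNormQuantTab is reassigned to normQuant, which is a matrix. plot() on a data frame with more than two columns draws a scatterplot matrix, while plot() on a matrix only uses its first two columns. The random matrix `m` below is made up for illustration.

```r
set.seed(1)
# Illustrative 10 x 6 matrix standing in for the normalized data
m <- matrix(runif(60), nrow = 10,
            dimnames = list(NULL, paste0("s", 1:6)))
cl <- kmeans(m, 2)

# Matrix input: only column 1 against column 2 is drawn
plot(m, col = cl$cluster)

# Data frame input: all pairs of columns are drawn
df <- as.data.frame(m)
plot(df, col = cl$cluster)
```

Under that assumption, converting the normalized matrix with as.data.frame() before calling plot() (or calling pairs() directly) should bring back the grid of plots.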
I wonder if anyone is familiar with the Bioconductor RankProd package for ranking and obtaining differentially expressed genes. Some information about the software is available in its paper, manual and documentation.
I ran into some problems while using the program, maybe because of my limited knowledge of R. I tried to replicate the steps in the PDF files above with my own data, although my datasets are not affy .cel files as in the examples but rows and columns in a tab-delimited file. I have two conditions (1 and 2, with 4 replicates each).
Here is my code:
library(RankProd)
library(preprocessCore)
#Read expression data
#gdata <- read.table(file="data2.txt", sep="\t", header=T) #9000 rows of genes X 8 columns of chips
gdata <- read.table(file="data2.txt", sep="\t", header=T, row.names=1) #9000 rows of genes X 8 columns of chips
#colnames(gdata)
# This vector contains the microarray sample names
SampleNames= names(data.frame(gdata[,-1]))
#names(datExpr)=gdata[,1]
# This vector contains the gene names
datExpr.gnames= gdata$GeneName
# Since the first column contains the gene names, exclude it.
# datExpr is then the matrix required
datExpr=data.frame(gdata[,-1])
#convert data into matrix form
datExpr <- as.matrix(datExpr)
#data normalization - quantile normalization
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
#datExpr.norm <- normalize.quantiles(datExpr,copy=TRUE) #without logged data
#datExpr <- datExpr.norm
# Identify two class data - control/treatment (or condition 1/condition2)
nl <- 4
n2 <- 4
cl <- rep(c(0,1), c(nl, n2))
datExpr.cl <- cl
# data were generated under identical or very similar conditions except the
# factor of interest (e.g., control and treatment),
origin <- rep(1, nl + n2)
datExpr.origin <- origin
# Data analysis
datExpr.sub <- datExpr[,which(datExpr.origin == 1)]
datExpr.cl.sub <- datExpr.cl[which(datExpr.origin == 1)]
datExpr.origin.sub <- datExpr.origin[which(datExpr.origin == 1)]
#Rank product analysis and output
#RP.out <- RP(datExpr.sub, datExpr.cl.sub, num.perm = 100, logged = TRUE,na.rm = FALSE, plot = FALSE, rand = 123)
RP.out <- RPadvance(datExpr.sub, datExpr.cl.sub, datExpr.origin.sub, num.perm = 100,logged = TRUE,
na.rm = FALSE, gene.names = datExpr.gnames, plot = FALSE,rand = 123)
# Output a table of the identified genes based on user-specified selection criteria
topGene(RP.out, cutoff = 0.05, method = "pfp", logged = TRUE,logbase = 2, gene.names = datExpr.gnames)
I did run the code, but my fold changes for differentially expressed genes in one condition VS the other were either 0's or infinities. I wonder if anyone with experience with this program can help me.
At first glance, what I notice is this:
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
As long as the first line is commented out, datExpr.log.norm is never created, so the second line cannot assign the normalized data to datExpr.