I generated the following data matrix called arrayDataMatrixQuantile in R:
DNp73flflV2324I DNp73flflV2324J DNp73flflV2324K DNp73nullV2523B DNp73nullV2523C DNp73nullV2523E
ENSMUSG00000028180 8.185794 5.6914560 5.693373 6.9734687 8.8689120 5.9152113
ENSMUSG00000028182 0.000000 0.1749128 0.000000 0.1685122 0.1784736 0.1229401
ENSMUSG00000028185 0.000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
ENSMUSG00000028184 7.439927 8.8635180 10.288115 11.8621800 13.4530467 13.4414667
ENSMUSG00000028187 7.458357 10.0175407 14.108493 11.7789400 19.7581400 12.1482933
ENSMUSG00000028186 0.400568 0.1346390 3.450423 0.1643176 0.0000000 0.0000000
I want to generate log2 of each of the values and output that. The R code I wrote:
log2_matrix<-matrix( nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile)) #opens new matrix
for (i in 1:nrow(arrayDataMatrixQuantile)) {
for (j in 1:ncol(arrayDataMatrixQuantile)) {
add <- ((arrayDataMatrixQuantile[i,j])+10^-5) #Added 10-5 to avoid errors with 0 values
log2_matrix[i,j] <-add }
}
This code gives the following error:
Error in [<-(*tmp*, i, j, value = 2.50880030780749) : subscript out of bounds
However, once I change the line :
log2_matrix<-matrix( nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile))
to
log2_matrix<-matrix(0, nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile))
it works. I don't know how adding a "0" to the new matrix gets rid of the error. I used it because I saw other users put a 0 at the start of each new matrix. Any advice on this?
We could do this either using apply
apply(arrayDataMatrixQuantile, 2, FUN=function(x) x+ 10^-5)
Or directly add the number to the entire dataset
arrayDataMatrixQuantile+10^-5
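Since the stated goal is the log2 of every value, the offset and the logarithm can be applied to the whole matrix in one step, without a loop or a pre-allocated result. A minimal sketch, assuming a small toy matrix `x` in place of arrayDataMatrixQuantile (the name `x` and its values are made up for illustration):

```r
# Toy stand-in for arrayDataMatrixQuantile (values are illustrative only)
x <- matrix(c(8.185794, 0, 7.439927, 0.400568), nrow = 2,
            dimnames = list(c("gene1", "gene2"), c("s1", "s2")))

# Arithmetic and log2() are vectorised, so they work element-wise on a matrix.
# The small offset avoids log2(0) = -Inf for zero entries.
log2_matrix <- log2(x + 1e-5)
log2_matrix
```

The result keeps the dimensions and the row and column names of the input.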
Regarding the error in the OP's code, it happened because the matrix created was not of the same dimensions as the "arrayDataMatrixQuantile"
log2_matrix<- matrix( nrow(arrayDataMatrixQuantile),
ncol(arrayDataMatrixQuantile))
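The call has no explicit data argument, so R matches the arguments by position: nrow(...) is taken as the data and ncol(...) as nrow. With the 6 x 6 example above, the result is a 6 x 1 matrix filled with the value 6, so any index with j > 1 is out of bounds. Instead, we need to add a , before the nrow(..) so that data keeps its default and we get a matrix of NA with dimensions 6,6: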
log2_matrix <- matrix(, nrow(arrayDataMatrixQuantile),
ncol(arrayDataMatrixQuantile))
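The positional matching can be checked directly on small cases. A short sketch using only base R; the variable names are made up for illustration:

```r
# matrix(data = NA, nrow = 1, ncol = 1, ...) is the argument order,
# so two unnamed numbers are read as data and nrow.
wrong <- matrix(6, 6)          # data = 6, nrow = 6, ncol left at 1
dim_wrong <- dim(wrong)

# Leaving data empty (or naming the arguments) gives the intended shape.
empty_data <- matrix(, 6, 6)             # data defaults to NA
named_args <- matrix(nrow = 6, ncol = 6) # same result, more explicit
zeros <- matrix(0, 6, 6)                 # what the "add a 0" fix does

dim_wrong
dim(empty_data)
```

Naming the arguments (nrow =, ncol =) avoids the problem regardless of which fill value is used.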
I am a new R user trying to use the mRMRe R package (mRMR is a well-known feature selection approach) to obtain a feature subset from a feature set. Please excuse me if my question is simple; I really want to know how to fix an error. The details are below.
Suppose I have a CSV file (gene.csv) with a feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates the positive class and '-1' the negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get the best feature subset of 2 attributes (out of the 6 attributes above) and wrote the following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I get the following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the CSV file are real numbers. So how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7,feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from the gene.csv file using the mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric, as the error message requires, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
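The class check and the conversion can be done for all columns at once, and the target index can be looked up by name instead of being hard-coded. A minimal sketch using base R only; the toy data frame and its column names (G1, G2, Output) are made up to stand in for gene.csv:

```r
# Toy stand-in for the data read from gene.csv (a few sample rows, shortened)
df <- data.frame(
  G1 = c(11.688312, 12.538226, 10.791367),
  G2 = c(0.974026, 1.223242, 0.719424),
  Output = c(-1L, 1L, -1L)  # whole numbers are read in as integer
)

# Inspect the class of every column
classes_before <- sapply(df, class)

# Convert only the integer columns to numeric (double)
int_cols <- vapply(df, is.integer, logical(1))
df[int_cols] <- lapply(df[int_cols], as.numeric)

# Position of the target column, usable as target_indices
target_index <- match("Output", names(df))
```

In the original data the target is the seventh column, which is why target_indices = 7 is the right value there.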
I have lots of files in my directory. I want to read all of them, select the second column of each, and put those columns into a matrix as rows, but I get a strange error.
Could anybody help me figure out what is going wrong with my code?
Here is my attempt:
#read all files in one directoy into R and select desired column
nm <- list.files(path="April/mRNA_expression/")
Gene_exp<-do.call(rbind, lapply(nm, function(x) read.table(file=x,header=TRUE, sep= ",")[, 2]))
save(Gene_exp, file="Path to folder")
The error I get is :
## Error in `[.data.frame`(read.table(file = x, header = TRUE, sep = ""), :
## undefined columns selected
To check that my files really have 2 columns, I did this:
b <- read.table("A.genes.normalized_results", sep="")
dim(b)
## [1] 20532 2
My text file looks like this:
gene_id normalized_count
?|100130426 0.0000
?|100133144 10.6340
?|100134869 5.6790
?|10357 106.4628
?|10431 710.8902
?|136542 0.0000
?|155060 132.2883
?|26823 0.5098
?|280660 0.0000
?|317712 0.0000
?|340602 0.0000
?|388795 1.2745
?|390284 5.3527
?|391343 2.5489
?|391714 0.7647
?|404770 0.0000
?|441362 0.0000
The better solution would be to only import the second column when reading it. Use the colClasses argument to completely skip the first:
Gene_exp<-do.call(rbind, lapply(nm, function(x) read.delim(file=x,header=TRUE, colClasses=c('NULL', 'character'))))
I am assuming the second column is character. Change it to the appropriate class if you need to.
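Two details are worth checking in this kind of loop: list.files() returns bare file names unless full.names = TRUE is set, and a separator that does not match the file leaves a single column, so selecting column 2 fails. The sketch below writes two small tab-separated demo files to a temporary directory (the directory, file names and values are made up), reads only the second column of each as numeric, and binds the columns as rows:

```r
# Create a hypothetical demo directory with two small tab-separated files
demo_dir <- file.path(tempdir(), "mRNA_expression_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo <- list(A = c(1.5, 2.5), B = c(3.5, 4.5))
for (s in names(demo)) {
  write.table(
    data.frame(gene_id = c("?|10357", "?|10431"), normalized_count = demo[[s]]),
    file = file.path(demo_dir, paste0(s, ".genes.normalized_results")),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

# full.names = TRUE returns paths that can be opened from any working directory
nm <- list.files(path = demo_dir, full.names = TRUE)

# "NULL" skips the first column; [[1]] extracts the remaining one as a vector
read_second <- function(f) {
  read.delim(f, header = TRUE, colClasses = c("NULL", "numeric"))[[1]]
}

# One row per file, one column per gene
Gene_exp <- do.call(rbind, lapply(nm, read_second))
rownames(Gene_exp) <- basename(nm)
Gene_exp
```

Extracting the column as a vector before rbind() is what turns each file into a row; binding one-column data frames would stack them into a single long column instead.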
I'm working on a text mining/clustering project and am trying to create a table which contains number of clusters as rows and 6 columns representing the following 6 metrics:
max.diameter, min.separation, average.within,average.between,avg.silwidth,dunn.
I need to create the tables for 3 methods - kmeans, pam and hclust.
I was able to create something for kmeans
dtm0.90Dist = dist(dtm0.90)
foreachcluster = function(k) {
kmeans.result = kmeans(dtm0.90, k);
kmeans.stats = cluster.stats(dtm0.90Dist,kmeans.result$cluster);
c(kmeans.stats$min.separation, kmeans.stats$max.diameter,
kmeans.stats$average.within, kmeans.stats$avearge.between,
kmeans.stats$avg.silwidth, kmeans.stats$dunn)
}
rbind(foreachcluster(2), foreachcluster(3), foreachcluster(4), foreachcluster(5),
foreachcluster(6), foreachcluster(7),foreachcluster(8))
and I get the following output
[,1] [,2] [,3] [,4] [,5]
[1,] 3.162278 30.19934 5.831550 0.5403872 0.10471348
[2,] 2.236068 28.37252 5.006058 0.3923446 0.07881104
[3,] 1.000000 28.37252 4.995478 0.2496066 0.03524537
[4,] 1.000000 26.40076 4.387212 0.2633338 0.03787770
[5,] 1.000000 26.40076 4.353248 0.2681947 0.03787770
[6,] 1.000000 26.40076 4.163757 0.1633954 0.03787770
[7,] 1.000000 26.40076 4.128927 0.2676423 0.03787770
I need similar output for hclust and pam methods but for the life of me can't get the same function to work for either of the two methods
OK, so I was able to make the function for HCLUST
forhclust=function(k){dfDist = dist(dtm0.90);
hclust.result = hclust(dfDist);
hclust.cluster = (cutree(hclust.result, k));
cluster.stats(dfDist,hclust.cluster);c(cluster.stats$min.separation)}
But I get an error when I run this:
Error in cluster.stats$min.separation :
object of type 'closure' is not subsettable
What I need is for it to print "min.separation" output.
I would really appreciate all the help and perhaps some guidance in understanding why my approach is failing in hclust.
Also, is there a good source that can explain the functioning and application of these methods, step by step, in detail?
Thank You
foreachcluster2 = function(k) {
hc = hclust(mDist, method = "ave")
hresult = cutree(hc, k)
h.stats = cluster.stats(mDist,hresult);
c( max.dia=h.stats$max.diameter,
min.sep=h.stats$min.separation,
avg.wi=h.stats$average.within,
avg.bw=h.stats$average.between,
silwidth=h.stats$avg.silwidth,
dunn=h.stats$dunn)
}
t2 = rbind(foreachcluster2(2), foreachcluster2(3), foreachcluster2(4), foreachcluster2(5),foreachcluster2(6),
foreachcluster2(7), foreachcluster2(8), foreachcluster2(9), foreachcluster2(10),
foreachcluster2(11), foreachcluster2(12),foreachcluster2(13),foreachcluster2(14))
rownames(t2) = 2:14
t2
This should work. For pam():
pamC <- pam(x=m, k=2)
pamC
pamC$clustering
Use $clustering instead of $cluster; the rest is the same.
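On the error in the question: in forhclust the result of cluster.stats(...) is never stored, so cluster.stats$min.separation tries to subset the function itself, which is what "object of type 'closure' is not subsettable" means. Storing the result in a variable and reading from that variable, as foreachcluster2 does, avoids it. The long rbind(...) call can also be replaced by sapply over the values of k. The sketch below shows that pattern with base R only; the toy data and the cluster-size summary are stand-ins for dtm0.90 and for the cluster.stats() metrics:

```r
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)      # toy data standing in for dtm0.90
d <- dist(x)
hc <- hclust(d, method = "average")

stats_for_k <- function(k) {
  cl <- cutree(hc, k)                 # store the result in a variable ...
  sizes <- table(cl)
  # ... and read from that variable (a cluster.stats() result would be
  # stored and read in the same way)
  c(n.clusters = length(sizes), largest = max(sizes), smallest = min(sizes))
}

ks <- 2:6
tab <- t(sapply(ks, stats_for_k))     # one row per k, one column per metric
rownames(tab) <- ks
tab
```

Because the function returns a named vector, the column names of the table are filled in automatically.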
I am plotting a 52 x 52 matrix with geom_raster through ggplot.
Code is here:
library(ggplot2)
library(reshape2)  # provides melt()
# Pairwise percentage difference between the values in column 4
m <- NULL
for(i in 1:nrow(df)){
  for(z in 1:nrow(df)){
    if(df[i,][4] > df[z,][4]){m<-c(m,((df[i,][[4]]/df[z,][[4]])*100)-100)}
    if(df[i,][4] < df[z,][4]){m<-c(m,((df[z,][[4]]/df[i,][[4]])*100)-100)}
    if(df[i,][4] == df[z,][4]){m<-c(m,0.0)}
  }
}
m <- matrix(m,nrow=nrow(df))
colnames(m) <- df$PDB
rownames(m) <- df$PDB
p1 <- ggplot(melt(m),aes(Var1,Var2,fill=value)) + geom_raster() + labs(x="PDB",y="PDB")
p1 <- p1 + theme(axis.text.x = element_text(angle=90,hjust=1))
print(p1)
ggsave(file="ccs_diff_ehss.pdf")
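The nested loop that builds m can also be written with outer(), which evaluates a function on every pair of values at once. A sketch under the assumption that all values are strictly positive; the vector `x` and the labels in `ids` are made-up stand-ins for df[[4]] and df$PDB:

```r
x <- c(10, 20, 40)            # stands in for df[[4]]
ids <- c("p1", "p2", "p3")    # stands in for df$PDB

# larger / smaller * 100 - 100 for every pair; equal values give 0
m <- outer(x, x, function(a, b) pmax(a, b) / pmin(a, b) * 100 - 100)
dimnames(m) <- list(ids, ids)
m
```

The result is symmetric with a zero diagonal, like the matrix excerpt shown in the question.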
The issue I have is when I save the file I get the following outputs:
Through file > save as >:
Through ggsave:
Output from print(p1):
As you can see, the output from print(p1) is a lot sharper than the files saved through ggsave or manually. How can I save the image as it appears from print(p1)?
Here is a subset of my matrix:
1a29 1cll 1clm 1cm1 1exr 1g4y 1iq5 1lin 1mxe1 1mxe2
1a29 0.000000 18.8967136 19.0727700 3.814554 20.539906 19.3075117 9.330986 1.584507 5.6924883 5.8098592
1cll 18.896714 0.0000000 0.1480750 14.527982 1.382034 0.3455084 8.749329 17.042172 12.4930594 12.3682751
1clm 19.072770 0.1480750 0.0000000 14.697569 1.232134 0.1971414 8.910360 17.215482 12.6596335 12.5346644
1cm1 3.814554 14.5279819 14.6975692 0.000000 16.110797 14.9236857 5.313737 2.195263 1.8089316 1.9219898
1exr 20.539906 1.3820336 1.2321341 16.110797 0.000000 1.0329562 10.252281 18.659734 14.0477512 13.9212424
1g4y 19.307512 0.3455084 0.1971414 14.923686 1.032956 0.0000000 9.125067 17.446563 12.8817324 12.7565169
1iq5 9.330986 8.7493290 8.9103596 5.313737 10.252281 9.1250671 0.000000 7.625650 3.4425319 3.3277870
1lin 1.584507 17.0421722 17.2154824 2.195263 18.659734 17.4465627 7.625650 0.000000 4.0439053 4.1594454
1mxe1 5.692488 12.4930594 12.6596335 1.808932 14.047751 12.8817324 3.442532 4.043905 0.0000000 0.1110494
1mxe2 5.809859 12.3682751 12.5346644 1.921990 13.921242 12.7565169 3.327787 4.159445 0.1110494 0.0000000
I realize this is an older thread but it looks like it gets about 7 views a month. Perhaps this will help those visitors:
There is a chance that the image viewer application itself is applying a smoothing algorithm. I came across your post while trying to resolve the same issue and eventually discovered that I needed to turn off smoothing in the PDF viewer preferences. Now the output file looks identical to the plot in R.
This is the thread that tipped me off (includes some extra directions about where to locate the settings). https://groups.google.com/forum/#!topic/ggplot2/8VLuo1cw6SE
Take a look at the ggsave documentation -- perhaps you can increase your resolution by manually specifying the dimensions or the dpi.
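As one way to follow that suggestion, the plot can be written to a PNG with explicit dimensions and dpi, which also sidesteps any smoothing a PDF viewer applies to raster cells. A minimal sketch, assuming ggplot2 is installed; the small matrix, the output file name and the size values are made up for illustration:

```r
library(ggplot2)

# Small toy matrix in long format (columns Var1, Var2, Freq)
m <- outer(1:5, 1:5, function(a, b) abs(a - b))
dimnames(m) <- list(letters[1:5], letters[1:5])
long <- as.data.frame(as.table(m))

p1 <- ggplot(long, aes(Var1, Var2, fill = Freq)) + geom_raster()

# Passing the plot explicitly and fixing width, height (inches) and dpi
out <- file.path(tempdir(), "raster_demo.png")
ggsave(out, plot = p1, width = 6, height = 6, dpi = 300)
```

Passing plot = p1 explicitly also makes sure the intended plot is saved rather than whichever plot was displayed last.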
I would like to mine specific rhs rules. There is an example in the documentation which demonstrates that this is possible, but only for a specific case (as we see below). First, a data set to illustrate my problem:
input <- matrix( c( rep(10001,6) , rep(10002,3) , rep(10003,3), 100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,rep('a',6),rep('b',6)), ncol=3)
colnames(input) <- c(letters[1:3])
input <- as.data.frame(input)
Now I can create rules:
r <- apriori(input)
To see the rules:
inspect(r)
I would like to only mine rules that have b=... on the rhs. For specific values this can be done by adding:
appearance = list(rhs = c("b=100001", "b=100002"),default="lhs")
to the apriori command. I will also have to adjust the confidence if I want to find them, of course. The problem lies in the number of elements in column b. I can manually type all the elements in the "b=....." format in this example, but I can't in my own data.
I tried to get the values of b using unique() and then passing them to the rhs, but that generates an error because I pass values like "100001" "100002" instead of "b=100001" "b=100002".
Is there a way to only get rhs rules from a specific column?
If not, is there an easy way to generate 'want' from 'current'?
current <- c("100001", "100002", "100003", "100004", "100005", "100006", "100007", "100008")
want <- c("b=100001", "b=100002", "b=100003", "b=100004", "b=100005", "b=100006", "b=100007", "b=100008")
Somewhat related is this question: Creating specific rules with arules in r
But that has the same problem for me, only a different way.
You can use subset:
r <- apriori(input, parameter = list(support = 0.1, confidence = 0.1))
inspect( subset( r, subset = rhs %pin% "b=" ) )
# lhs rhs support confidence lift
# 1 {} => {b=100002} 0.2500000 0.2500000 1.000000
# 2 {} => {b=100003} 0.2500000 0.2500000 1.000000
# 3 {c=b} => {b=100002} 0.1666667 0.3333333 1.333333
# 4 {c=b} => {b=100003} 0.1666667 0.3333333 1.333333
For your second question, you can use paste:
paste0( "b=", current )
# [1] "b=100001" "b=100002" "b=100003" "b=100004" "b=100005" "b=100006" "b=100007"
# [8] "b=100008"
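Combined with unique(), this gives the rhs labels for a whole column without typing them by hand. A small sketch in base R; the toy data frame is a shortened, made-up version of input, and the apriori() call is left as a comment because it needs the arules package:

```r
# Shortened stand-in for the input data frame from the question
input <- data.frame(
  a = c("10001", "10001", "10002", "10003"),
  b = c("100001", "100002", "100002", "100003"),
  c = c("a", "a", "b", "b")
)

# Unique values of column b, prefixed with the column name
rhs_items <- paste0("b=", unique(as.character(input$b)))

# The list that would be passed as the appearance argument
appearance <- list(rhs = rhs_items, default = "lhs")
# r <- apriori(input, appearance = appearance)   # requires arules
rhs_items
```

The labels follow the "column=value" format used in the appearance example in the question.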
The arules documentation now has an example that does exactly what you want:
bItems <- grep("^b=", itemLabels(input), value = TRUE)
rules <- apriori(input, parameter = list(support = 0.1, confidence = 0.1),
appearance = list(rhs = bItems))
I haven't actually tested this with your example code (the arules documentation example uses a transactions object, not a data.frame), but grep-ing those column labels should work out.