As I am new to R, this question may seem like a piece of cake to you.
I have data in a txt file. The first column holds the cluster number and the second column holds the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to rearrange/reshape the data into the following format:
Cluster No.  org1  org2  org3  org4
0            0     0     0     1
1            1     0     0     0
I could not figure out how to do it in R.
Thanks
We could use table:
out <- cbind(ClusterNo = seq_len(nrow(df1)),
             as.data.frame.matrix(table(seq_len(nrow(df1)),
                 factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency:
out1 <- as.data.frame.matrix(table(df1[[1]],
    factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
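# as.integer() stores the org number as a number, so max() below compares numerically rather than as text
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))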
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
    possibleOrgs %in% input[input[, 1] == x, 3], logical(length(possibleOrgs)))
We can then format this result as we like, perhaps using t to transpose it, * 1 to convert the TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1
I'm using LDA for topic modeling:
dtm <- DocumentTermMatrix(docs)
However, some rows of dtm contain only zeros, so I followed the instructions here:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
Then LDA works and I get the topics and everything. My next step is to use LDAvis, as recommended here. Source code:
topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
    # Required packages
    library(topicmodels)
    library(dplyr)
    library(stringi)
    library(tm)
    library(LDAvis)
    # Find required quantities
    phi <- posterior(fitted)$terms %>% as.matrix
    theta <- posterior(fitted)$topics %>% as.matrix
    vocab <- colnames(phi)
    doc_length <- vector()
    for (i in 1:length(corpus)) {
        temp <- paste(corpus[[i]]$content, collapse = ' ')
        doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
    }
    temp_frequency <- inspect(doc_term)
    freq_matrix <- data.frame(ST = colnames(temp_frequency),
                              Freq = colSums(temp_frequency))
    rm(temp_frequency)
    # Convert to json
    json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                   vocab = vocab,
                                   doc.length = doc_length,
                                   term.frequency = freq_matrix$Freq)
    return(json_lda)
}
When I call the topicmodels_json_ldavis function, I receive this error:
Length of doc.length not equal to the number of rows in theta;
both should be equal to the number of documents in the data.
I checked the number of rows in theta and the length of doc.length, and they are different. I assume this is because I pass the original corpus (docs), which produces a dtm with at least one all-zero row. To make the corpus match the document-term matrix, I decided to build a new corpus from dtm.new, as suggested here. Source code:
dtm2list <- apply(dtm, 1, function(x) {
paste(rep(names(x), x), collapse=" ")
})
myCorp <- VCorpus(VectorSource(dtm2list))
I even fitted a new ldaOut with dtm.new and passed the following arguments to topicmodels_json_ldavis: ldaOut22, myCorp, dtm.new.
I still receive the error message that theta and doc.length must have the same length.
I had exactly the same problem. I was able to remove the all-zero rows for the LDA analysis, but then ran into the row count of the sparse matrix no longer matching the number of documents passed to LDAvis. I have solved it, unfortunately only in Python, but you may use the following approach as a starting point.
Let's see what I have first:
import numpy as np

# cvz is the sparse term-frequency matrix built earlier (its construction is not shown here)
print(f'The tf matrix:\n {cvz.toarray()[:100]}\n')
sparseCountMatrix = np.array(cvz.toarray())
x = sparseCountMatrix  # shorthand used in the prints below
print(f'Number of non-zero vectors: {len(x[x>0])} Number of zero vectors: {len(x[x==0])}\n')
print(f'Have a look at the non-zero vectors:\n{x[x>0][:200]}\n')
print(f'This is our sparse matrix with {x.shape[0]} (# of documents) by {x.shape[1]} (# of terms in the corpus):\n{x.shape}')
Output:
The tf matrix:
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
Number of non-zero vectors: 4721 Number of zero vectors: 232354
Have a look at the non-zero vectors:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
This is our sparse matrix with 545 (# of documents) by 435 (# of terms in the corpus):
(545, 435)
How many rows are all-zero vectors?
len(list(np.array(sparseCountMatrix[(sparseCountMatrix==0).all(1)])))
Output: 12
How many rows contain at least one non-zero entry?
len(list(np.array(sparseCountMatrix[~(sparseCountMatrix==0).all(1)])))
Output: 533
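In case the `(sparseCountMatrix==0).all(1)` test is not obvious, here is a minimal sketch of what it does on a tiny made-up matrix. The names toy_counts, is_empty_row, n_empty and n_non_empty, and the values themselves, are only for illustration and are not part of my data:

```python
import numpy as np

# Toy document-term counts (made-up values): rows are documents, columns are terms.
toy_counts = np.array([
    [0, 2, 1],
    [0, 0, 0],   # empty document
    [3, 0, 0],
    [0, 0, 0],   # empty document
])

# True for every row whose entries are all zero.
is_empty_row = (toy_counts == 0).all(axis=1)

n_empty = int(is_empty_row.sum())          # rows with only zeros
n_non_empty = int((~is_empty_row).sum())   # rows with at least one non-zero entry
print(n_empty, n_non_empty)
```

Negating the mask with ~ selects the rows that are kept, which is the same test used in the counts above.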
Remove the 12 rows which contain all zero vectors for LDA Analysis:
cleanedSparseCountMatrix = np.array(sparseCountMatrix[~(sparseCountMatrix==0).all(1)])
Also remove these documents from the original pandas Series (tokens), so that the document count matches the row count of the sparse matrix. This is important for visualizing the LDA results with pyLDAvis.
First, to get the index positions of the all-zero rows, use np.where:
indexesToDrop = np.where((sparseCountMatrix==0).all(1))
print(f"Indexes with all zero vectors: {indexesToDrop}\n")
Output:
Indexes with all zero vectors: (array([ 47, 77, 88, 95, 106, 109, 127, 244, 363, 364, 367, 369],
dtype=int64),)
Second, use this list of indexes to drop original rows in Pandas series with series.drop:
data_tokens_cleaned = data['tokens'].drop(data['tokens'].index[indexesToDrop])
New length of cleaned tokens (should match sparse matrix length!):
len(data_tokens_cleaned)
Output:
533
This is our cleaned sparse matrix, ready for LDA analysis:
print(cleanedSparseCountMatrix.shape)
Output: (533, 435)
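Put together, the idea is that a single boolean mask decides which documents survive, and the same mask is applied to both the count matrix and the documents. Below is a rough, self-contained sketch of that idea with made-up data. The names toy_docs, toy_counts, keep, cleaned_counts and cleaned_docs are hypothetical, and a plain Python list stands in for the pandas Series:

```python
import numpy as np

# Made-up example data: four documents and their term counts.
toy_docs = ["apple banana", "", "cherry", ""]
toy_counts = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
])

# One mask decides which documents survive: True where a row has any non-zero count.
keep = ~(toy_counts == 0).all(axis=1)

# Apply the same mask to the matrix and to the documents so they stay aligned.
cleaned_counts = toy_counts[keep]
cleaned_docs = [doc for doc, kept in zip(toy_docs, keep) if kept]

# Both must describe the same number of documents before visualizing.
assert cleaned_counts.shape[0] == len(cleaned_docs)
print(cleaned_counts.shape, cleaned_docs)
```

The same principle should carry over to the R case in the question: whatever rows are dropped from the document-term matrix must also be dropped from the corpus used to compute doc.length.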
This is proving to be a monster for me, as I have zero experience with R scripting. I have a data frame with 57 columns and 30 rows of data.
Here is what I am trying to do:
1) Go to each column:
2) Count the number of times 2/3/4/5/6/7/8/9 consecutive values are less than -1
3) Print the result as a text file
4) Repeat step 2 and 3 for the second column and so on
I looked around, including on Stack Overflow, and found this related question:
check number of times consecutive value appear based on a certain criteria
This is one column of my data:
data<-c(-0.996,-1.111,-0.638,0.047,0.694,1.901,2.863,2.611,2.56,2.016,0.929,-0.153,-0.617,-0.143
0.199,0.556,0.353,-0.638,0.347,0.045,-0.829,-0.882,-1.143,-0.869,0.619,0.923,-0.474,0.227
0.394,0.789,1.962,1.132,0.1,-0.278,-0.303,-0.606,-0.705,-0.858,-0.723,-0.081,1.206,2.329
1.863,2.1,1.547,2.026,0.015,-0.441,-0.371,-0.304,-0.668,-0.953,-1.256,-1.185,-0.891,-0.569
0.485,0.421,-0.004,0.024,-0.39,-0.58,-1.178,-1.101,-0.882,0.01,0.052,-0.166,-1.703,-1.048
-0.718,-0.036,-0.561,-0.08,0.272,-0.041,-0.811,-0.929,-0.853,-1.047,0.431,0.576,0.642,1.62
2.324,1.251,1.384,0.195,-0.081,-0.335,-0.176,1.089,-0.602,-1.134,-1.356,-1.203,-0.795,-0.752
-0.692,-0.813,-1.172,-0.387,-0.079,-0.374,-0.157,0.263,0.313,0.975,2.298,1.71,0.229,-0.313
-0.779,-1.12,-1.102,-1.01,-0.86,-1.118,-1.211,-1.081,-1.156,-0.972)
When I run the following code:
for (col in 1:ncol(data)) {
    runs <- rle(data[,col])
    print(runs$lengths[which(runs$values < -1)])
}
It gives me this:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
It has counted the number of values < -1, but not the runs. Am I doing something wrong here?
(massive edit)
Fixed data vector (was missing commas):
data <- c(-0.996,-1.111,-0.638,0.047,0.694,1.901,2.863,2.611,2.56,2.016,0.929,-0.153,-0.617,-0.143,
0.199,0.556,0.353,-0.638,0.347,0.045,-0.829,-0.882,-1.143,-0.869,0.619,0.923,-0.474,0.227,
0.394,0.789,1.962,1.132,0.1,-0.278,-0.303,-0.606,-0.705,-0.858,-0.723,-0.081,1.206,2.329,
1.863,2.1,1.547,2.026,0.015,-0.441,-0.371,-0.304,-0.668,-0.953,-1.256,-1.185,-0.891,-0.569,
0.485,0.421,-0.004,0.024,-0.39,-0.58,-1.178,-1.101,-0.882,0.01,0.052,-0.166,-1.703,-1.048,
-0.718,-0.036,-0.561,-0.08,0.272,-0.041,-0.811,-0.929,-0.853,-1.047,0.431,0.576,0.642,1.62,
2.324,1.251,1.384,0.195,-0.081,-0.335,-0.176,1.089,-0.602,-1.134,-1.356,-1.203,-0.795,-0.752,
-0.692,-0.813,-1.172,-0.387,-0.079,-0.374,-0.157,0.263,0.313,0.975,2.298,1.71,0.229,-0.313,
-0.779,-1.12,-1.102,-1.01,-0.86,-1.118,-1.211,-1.081,-1.156,-0.972)
Doing data < -1 gives you a logical vector, and we can count runs of TRUE & FALSE:
runs <- rle(data < -1)
print(runs)
## Run Length Encoding
## lengths: int [1:21] 1 1 20 1 29 2 8 2 4 2 ...
## values : logi [1:21] FALSE TRUE FALSE TRUE FALSE TRUE ...
Then extract the length of only the TRUE runs:
print(runs$lengths[which(runs$values)])
## [1] 1 1 2 2 2 1 3 1 3 4
and iterate over the columns of a data frame, as in your loop:
# make a data frame from sampled versions of data
set.seed(1492) # repeatable
df <- data.frame(V1=data,
                 V2=sample(data, length(data), replace=TRUE),
                 V3=sample(data, length(data), replace=TRUE),
                 V4=sample(data, length(data), replace=TRUE))
# do the extraction
for (col in 1:ncol(df)) {
    runs <- rle(df[, col] < -1)
    print(runs$lengths[which(runs$values)])
}
## [1] 1 1 2 2 2 1 3 1 3 4
## [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1