Vector analysis in R

Your function should take a vector of 0s and 1s as input;
Every time you see a sequence of 1s in the data, you need to increase the number of children by 1;
Be careful with two consecutive sequences of 1s where the gap between them is less than 5 (i.e. when there are fewer than five 0s between them, it is the same child and not a new child);
To help you, the social planner provides some examples of what your function should return:
#Input: c(1,1,1,1,0,0,0,0)
#Output: 1 1 1 1 1 1 1 1
#Input: c(0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1)
#Output: 0 0 0 0 1 1 1 1 1 1 1 1 1 2 2 2
#Input: c(0,0,0,0,1,1,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,1)
#Output: 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
Functions, which might be helpful:
diff()
cumsum()
which()
rle()
I don't quite understand how to approach the question. My thought is to use the diff function after cumsum, as it would help me sustain a run of 1s, but then I lose the length of the vector (it obviously becomes shorter). Also, rle()$lengths seems to help me detect gaps of length 5 or more, so that I can turn 1s into 2s. Sorry for this question, I am only a beginner.

I make use of the which function in R (https://www.r-bloggers.com/which-function-in-r/) and run-length encoding (http://www.cookbook-r.com/Manipulating_data/Finding_sequences_of_identical_values/). Here's my attempt:
vector_analyse <- function(sample_vector){
  # ----------------------------------------------------------------------------
  # Signature: vector --> vector
  # Author: kon_u
  # Description: Given a sample vector of 0s and 1s, return a vector of the same
  # length holding the running number of children. Each run of 1s adds a child,
  # unless fewer than 5 0s separate it from the previous run of 1s (then it is
  # the same child and not a new child).
  # ----------------------------------------------------------------------------
  # Run Length Encoding gives a list of lengths and values
  rle_object <- rle(sample_vector)
  x <- rle_object$lengths  # run lengths
  y <- rle_object$values   # run values
  z <- which(y == 1)       # indices of the runs of 1s in y
  # With fewer than two runs of 1s there is nothing to merge
  # (this guard also stops 2:length(z) from counting backwards).
  if (length(z) >= 2){
    for (i in 2:length(z)){
      # The run just before a run of 1s is the gap of 0s; a gap shorter
      # than 5 means the same child, so this run must not add to the count.
      if (x[z[i] - 1] < 5){
        y[z[i]] <- y[z[i]] - 1
      }
    }
  }
  # The cumulative sum of the adjusted values is the child count per run
  rle_object$values <- cumsum(y)
  new_vector <- inverse.rle(rle_object)
  return(new_vector)
}
vector_analyse(c(1,1,1,1,0,0,0,0)) # 1 1 1 1 1 1 1 1
vector_analyse(c(0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1)) # 0 0 0 0 1 1 1 1 1 1 1 1 1 2 2 2
vector_analyse(c(0,0,0,0,1,1,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,1)) # 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
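The same rule can probably also be written without the explicit loop. The sketch below is one possible vectorised variant, not the answer's own method; the function name `count_children` and the helper variables are made up for illustration, and it assumes the input contains only 0s and 1s with no missing values.

```r
count_children <- function(v) {
  r <- rle(v)
  is_one <- r$values == 1
  # Length of the run just before each run (0 for the very first run).
  # For a run of 1s that is not the first run, this is the preceding gap of 0s.
  prev_len <- c(0, head(r$lengths, -1))
  # The first run of 1s always starts a new child
  first_one <- is_one & cumsum(is_one) == 1
  # Later runs of 1s start a new child only after a gap of 5 or more 0s
  new_child <- is_one & (first_one | prev_len >= 5)
  r$values <- cumsum(new_child)
  inverse.rle(r)
}

count_children(c(1,1,1,1,0,0,0,0))
count_children(c(0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1))
```

On the three example inputs from the question this should reproduce the expected outputs listed there.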

Related

making 1000 contingency tables in R

I have a vector called "combined" with 1's and 0's
combined
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I sampled twice from this vector, each with a sample size of 3 and put it into a contingency table of counts as follows.
2 1
1 2
I want to repeat this sampling 1000 times so that I end up with 1000 contingency tables, each with counts of 1s and 0s from the sampling.
This is what I tried:
sample1 = as.vector(replicate(10000, sample(combined, 3)))
sample2 = as.vector(replicate(10000, sample(combined, 3)))
con_table = table(sample1,sample2)
but I ended up with only 1 table instead of 10000:
8109 7573
7306 7012
Hoping to get some help.
You need to wrap the entire expression, both sample and table, inside replicate. Convert the vector to a factor first to ensure you always get a 2x2 table, even when a sample happens to contain only 0s or only 1s. E.g. a simple version with 2 replications:
combined <- rep(0:1,each=10)
combined <- as.factor(combined)
replicate(2, table(sample(combined,3), sample(combined,3)), simplify=FALSE)
#[[1]]
#
# 0 1
# 0 0 1
# 1 1 1
#
#[[2]]
#
# 0 1
# 0 1 1
# 1 0 1
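Scaled up to the 1000 tables asked for, the same pattern might look like the sketch below. The vector is rebuilt with assumed counts of 13 ones and 14 zeros to resemble the one in the question; `n_tables`, the `sample1`/`sample2` dimension labels and the seed are illustrative choices, not part of the original answer.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Assumed stand-in for the question's vector, stored as a factor
combined <- factor(rep(c(1, 0), times = c(13, 14)), levels = c(0, 1))
n_tables <- 1000

# sample() and table() are both evaluated afresh on every replication
tables <- replicate(
  n_tables,
  table(sample1 = sample(combined, 3), sample2 = sample(combined, 3)),
  simplify = FALSE
)

length(tables)  # one table per replication
tables[[1]]     # each table is 2x2 because both factor levels are kept
```

Keeping `simplify = FALSE` returns a plain list, so an individual table can be pulled out with `tables[[i]]`.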

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column holds the cluster number and the second column holds the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data into the following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
Alternatively, if the first column (the cluster number) should define the rows, tabulate against it instead:
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
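To try the cluster-number variant end to end, a self-contained sketch could look like this. It reads a few of the question's rows into `df1` through `read.table(text = ...)`; the intermediate name `org` is an illustrative addition, and fixing the levels at org1 to org4 is the same assumption the answer makes.

```r
# A few sample rows from the question, read into a two-column data frame
df1 <- read.table(text = "
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
10 org4|gene2439
10 org1|gene1374
", stringsAsFactors = FALSE)

# Strip everything from the '|' onwards, keeping only the organism name;
# fixed levels keep empty columns (org2, org3) in the result
org <- factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))

# One row per cluster number, one column per organism
out1 <- as.data.frame.matrix(table(df1[[1]], org))
out1
```

Cluster 10 appears twice in the input, so its row should show a count in both org1 and org4.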
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- gsub('^org(.+)\\|.*', '\\1', input[, 2])
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(as.integer(input[, 3]))) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
  possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Using LDAvis when doc_term_matrix has at least one row that all elements are zero

I'm using LDA for topic modeling:
dtm <- DocumentTermMatrix(docs)
However, dtm contains rows in which all elements are zero, so I followed advice I found to remove them:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
After that, LDA works and I have the topics and everything. My next step is to use LDAvis with the following recommended helper function. Source code:
topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
  # Required packages
  library(topicmodels)
  library(dplyr)
  library(stringi)
  library(tm)
  library(LDAvis)
  # Find required quantities
  phi <- posterior(fitted)$terms %>% as.matrix
  theta <- posterior(fitted)$topics %>% as.matrix
  vocab <- colnames(phi)
  doc_length <- vector()
  for (i in 1:length(corpus)) {
    temp <- paste(corpus[[i]]$content, collapse = ' ')
    doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
  }
  temp_frequency <- inspect(doc_term)
  freq_matrix <- data.frame(ST = colnames(temp_frequency),
                            Freq = colSums(temp_frequency))
  rm(temp_frequency)
  # Convert to json
  json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                 vocab = vocab,
                                 doc.length = doc_length,
                                 term.frequency = freq_matrix$Freq)
  return(json_lda)
}
When I call topicmodels_json_ldavis function, I receive this error:
Length of doc.length not equal to the number of rows in theta;
both should be equal to the number of documents in the data.
I checked the lengths of theta and doc.length, and they are different. I assume this is because I pass the corpus (docs), which produces a dtm with at least one zero row. To make the corpus match the document-term matrix, I decided to build a new corpus from dtm.new, as suggested elsewhere. Source code:
dtm2list <- apply(dtm, 1, function(x) {
paste(rep(names(x), x), collapse=" ")
})
myCorp <- VCorpus(VectorSource(dtm2list))
I even made a new ldaOut with dtm.new and passed the following parameters to topicmodels_json_ldavis: ldaOut22, myCorp, dtm.new
I still receive the error message that theta and doc.length must have the same length.
I had the exact same problem: I was able to remove the all-zero rows for the LDA analysis, but then stumbled over the row count of the sparse matrix no longer matching the number of documents for LDAvis. I've solved it, unfortunately only in Python, but you may use the following approach as a starting point:
Let's see what I got first:
import numpy as np
# cvz: a sparse document-term count matrix (e.g. a vectorizer's output); not defined in this snippet
print(f'The tf matrix:\n {cvz.toarray()[:100]}\n')
sparseCountMatrix = np.array(cvz.toarray())
x = sparseCountMatrix  # shorthand used in the prints below
print(f'Number of non-zero vectors: {len(x[x>0])} Number of zero vectors: {len(x[x==0])}\n')
print(f'Have a look at the non-zero vectors:\n{x[x>0][:200]}\n')
print(f'This is our sparse matrix with {x.shape[0]} (# of documents) by {x.shape[1]} (# of terms in the corpus):\n{x.shape}')
Output:
The tf matrix:
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
Number of non-zero vectors: 4721 Number of zero vectors: 232354
Have a look at the non-zero vectors:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
This is our sparse matrix with 545 (# of documents) by 435 (# of terms in the corpus):
(545, 435)
How many rows are all-zero vectors?
len(list(np.array(sparseCountMatrix[(sparseCountMatrix==0).all(1)])))
Output: 12
How many rows contain at least one non-zero value?
len(list(np.array(sparseCountMatrix[~(sparseCountMatrix==0).all(1)])))
Output: 533
Remove the 12 rows which contain all zero vectors for LDA Analysis:
cleanedSparseCountMatrix = np.array(sparseCountMatrix[~(sparseCountMatrix==0).all(1)])
Also remove these documents from original Pandas Series (tokens), so document count matches sparse matrix row count, which is important to visualize LDA results with pyLDAVis:
First, to get the index position of rows with all zero vectors, use np.where:
indexesToDrop = np.where((sparseCountMatrix==0).all(1))
print(f"Indexes with all zero vectors: {indexesToDrop}\n")
Output:
Indexes with all zero vectors: (array([ 47, 77, 88, 95, 106, 109, 127, 244, 363, 364, 367, 369],
dtype=int64),)
Second, use this list of indexes to drop original rows in Pandas series with series.drop:
data_tokens_cleaned = data['tokens'].drop(data['tokens'].index[indexesToDrop])
New length of cleaned tokens (should match sparse matrix length!):
len(data_tokens_cleaned)
Output:
533
This is our cleaned sparse matrix, ready for LDA analysis:
print(cleanedSparseCountMatrix.shape)
Output: (533, 435)
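Back in R, the same idea might look like the sketch below: filter the all-zero rows once, then take the document lengths from the filtered matrix itself rather than from the original corpus, so that they match the rows of theta by construction. This is only a sketch under assumptions. A small made-up matrix `m` stands in for as.matrix(dtm), and the names `keep`, `m_new`, `doc_length` and `term_frequency` are illustrative rather than taken from the question. Note that the row sums count only the terms kept in the matrix, which can differ from a raw whitespace token count.

```r
# Toy document-term counts standing in for as.matrix(dtm) (made-up data)
m <- matrix(c(2, 0, 1,
              0, 0, 0,
              0, 3, 1,
              0, 0, 0,
              1, 1, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("apple", "pear", "plum")))

keep <- rowSums(m) > 0            # TRUE for documents with at least one term
m_new <- m[keep, , drop = FALSE]  # the matrix that would be passed to LDA

# Term count per remaining document, one entry per row of m_new
doc_length <- rowSums(m_new)
# Corpus-wide frequency of each term, one entry per column of m_new
term_frequency <- colSums(m_new)

doc_length
term_frequency
```

Because `doc_length` is computed from the same rows that the model was fitted on, its length equals the number of rows in theta. With a real DocumentTermMatrix, `slam::row_sums(dtm.new)` should give the same per-document totals without converting to a dense matrix.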

Updating 0 vector values based on preceding and successive values

I have a data frame which has a cumulative count for each event (an event in this case being represented by a sequence of 1's in the bin column) with separating values given the value 0 and each event given an ID as such:
bin cumul ID
0 0 0
1 1 3
1 1 3
1 1 3
1 1 3
0 0 0
0 0 0
0 0 0
0 0 0
1 2 2
1 2 2
1 2 2
1 2 2
1 2 2
0 0 0
0 0 0
0 0 0
0 0 0
1 3 1
1 3 1
1 3 1
I want to update the ID column so each non-event (0 in the bin column) is assigned an ID value based on the previous and subsequent ID.
Therefore, if a non-event is preceded and succeeded by events of equal ID values (e.g. both 3) the non-event also carries this ID value (3). However if the non-event is preceded by an event with one value but succeeded with an event with a different value then the first half of the non-event is given an ID value equal to the preceding event and the final half of the non-event is given an ID value equal to the ID value of the succeeding event. Giving the final data frame:
bin cumul ID
0 0 3
1 1 3
1 1 3
1 1 3
1 1 3
0 0 3
0 0 3
0 0 2
0 0 2
1 2 2
1 2 2
1 2 2
1 2 2
1 2 2
0 0 2
0 0 2
0 0 1
0 0 1
1 3 1
1 3 1
1 3 1
If the question were how to fill in the zeros with ID that matched the preceding values, or matched successive values, then you could use na.locf from the zoo-package and it would be a one liner. For this task I think you might reach for the rle function:
rle(dat$ID)
#Run Length Encoding
# lengths: int [1:6] 1 4 4 5 4 3
# values : int [1:6] 0 3 0 2 0 1
Then thinking about how to use such result, my thinking was to use an algorithm like:
for each '0' in values; assign the first [`length`/2 + .9] values as $values[ idx-1 ]
assign the next [`length`/2] values as $values[ idx+1 ]
( using `rep` will truncate/floor the fractional indices and adding a number
slightly less than 1.0 will take care of the edge cases where there are an
odd number of zeros in a row.)
( `sum` on the lengths can recover the correct positions.)
and for the beginning and ending 0-cases;
replace with successive and preceding values respectively
After considerable debugging effort (and commenting out the debugging cat-calls):
rldat <- rle(dat$ID)
for (nth in seq_along(rldat$lengths)) { # cat("nth=", nth, "\n")
  if (rldat$values[nth] == 0) {
    if (nth == 1) { # cat("first value=", rldat$values[nth+1], "\n")
      # leading zeros take the ID of the following event
      dat$ID[1:rldat$lengths[nth]] <- rldat$values[nth+1]
    } else if (nth == length(rldat$lengths)) {
      # trailing zeros take the ID of the preceding event
      dat$ID[(length(dat$ID) - rldat$lengths[nth] + 1):length(dat$ID)] <-
        rldat$values[nth-1]
    } else {
      # cat("seq=", (sum(rldat$lengths[1:(nth-1)])+1):sum(rldat$lengths[1:nth]), "\n")
      # interior zeros: first half from the preceding event, rest from the following one
      dat$ID[(sum(rldat$lengths[1:(nth-1)]) + 1):sum(rldat$lengths[1:nth])] <-
        c(rep(rldat$values[nth-1], rldat$lengths[nth]/2 + .9),
          rep(rldat$values[nth+1], rldat$lengths[nth]/2))
    }
  }
}
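The same logic can be wrapped in a small function that works on the ID vector alone. The version below is a sketch rather than the answer's own code: the name `fill_zero_ids` is made up, `ceiling()` and `floor()` replace the "+ .9" trick for splitting odd-length gaps, and run positions come from `cumsum()` on the run lengths. It assumes 0 is used only as the non-event marker.

```r
fill_zero_ids <- function(id) {
  r <- rle(id)
  ends <- cumsum(r$lengths)        # last position of each run
  starts <- ends - r$lengths + 1   # first position of each run
  n <- length(r$values)
  for (k in which(r$values == 0)) {
    len <- r$lengths[k]
    if (n == 1) next               # all zeros: no neighbouring ID to borrow
    if (k == 1) {
      fill <- rep(r$values[k + 1], len)   # leading gap: use the next ID
    } else if (k == n) {
      fill <- rep(r$values[k - 1], len)   # trailing gap: use the previous ID
    } else {
      # interior gap: first half from the previous ID, second half from the
      # next one; for an odd length the extra element goes to the previous ID
      fill <- c(rep(r$values[k - 1], ceiling(len / 2)),
                rep(r$values[k + 1], floor(len / 2)))
    }
    id[starts[k]:ends[k]] <- fill
  }
  id
}

# ID column from the question
id <- c(0, rep(3, 4), rep(0, 4), rep(2, 5), rep(0, 4), rep(1, 3))
fill_zero_ids(id)
```

Applied to a data frame this would be `dat$ID <- fill_zero_ids(dat$ID)`.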

R function for counting how often a value falls below a particular value

This is proving to be a monster for me, as I have zero experience with R scripts. I have a data frame with 57 columns and 30 rows of data.
Here is what I am trying to do:
1) Go to each column:
2) Count the number of times 2/3/4/5/6/7/8/9 consecutive values are less than -1
3) Print the result as a text file
4) Repeat step 2 and 3 for the second column and so on
I looked around, including the R questions on Stack Overflow, for example:
check number of times consecutive value appear based on a certain criteria
This is one column of my data:
data<-c(-0.996,-1.111,-0.638,0.047,0.694,1.901,2.863,2.611,2.56,2.016,0.929,-0.153,-0.617,-0.143
0.199,0.556,0.353,-0.638,0.347,0.045,-0.829,-0.882,-1.143,-0.869,0.619,0.923,-0.474,0.227
0.394,0.789,1.962,1.132,0.1,-0.278,-0.303,-0.606,-0.705,-0.858,-0.723,-0.081,1.206,2.329
1.863,2.1,1.547,2.026,0.015,-0.441,-0.371,-0.304,-0.668,-0.953,-1.256,-1.185,-0.891,-0.569
0.485,0.421,-0.004,0.024,-0.39,-0.58,-1.178,-1.101,-0.882,0.01,0.052,-0.166,-1.703,-1.048
-0.718,-0.036,-0.561,-0.08,0.272,-0.041,-0.811,-0.929,-0.853,-1.047,0.431,0.576,0.642,1.62
2.324,1.251,1.384,0.195,-0.081,-0.335,-0.176,1.089,-0.602,-1.134,-1.356,-1.203,-0.795,-0.752
-0.692,-0.813,-1.172,-0.387,-0.079,-0.374,-0.157,0.263,0.313,0.975,2.298,1.71,0.229,-0.313
-0.779,-1.12,-1.102,-1.01,-0.86,-1.118,-1.211,-1.081,-1.156,-0.972)
When I run the following code:
for (col in 1:ncol(data)) {
runs <- rle(data[,col])
print(runs$lengths[which(runs$values < -1)])
}
It gives me this:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
It has counted the number of values < -1 but not the runs. Is there something that I am doing wrong here?
(massive edit)
Fixed data vector (was missing commas):
data <- c(-0.996,-1.111,-0.638,0.047,0.694,1.901,2.863,2.611,2.56,2.016,0.929,-0.153,-0.617,-0.143,
0.199,0.556,0.353,-0.638,0.347,0.045,-0.829,-0.882,-1.143,-0.869,0.619,0.923,-0.474,0.227,
0.394,0.789,1.962,1.132,0.1,-0.278,-0.303,-0.606,-0.705,-0.858,-0.723,-0.081,1.206,2.329,
1.863,2.1,1.547,2.026,0.015,-0.441,-0.371,-0.304,-0.668,-0.953,-1.256,-1.185,-0.891,-0.569,
0.485,0.421,-0.004,0.024,-0.39,-0.58,-1.178,-1.101,-0.882,0.01,0.052,-0.166,-1.703,-1.048,
-0.718,-0.036,-0.561,-0.08,0.272,-0.041,-0.811,-0.929,-0.853,-1.047,0.431,0.576,0.642,1.62,
2.324,1.251,1.384,0.195,-0.081,-0.335,-0.176,1.089,-0.602,-1.134,-1.356,-1.203,-0.795,-0.752,
-0.692,-0.813,-1.172,-0.387,-0.079,-0.374,-0.157,0.263,0.313,0.975,2.298,1.71,0.229,-0.313,
-0.779,-1.12,-1.102,-1.01,-0.86,-1.118,-1.211,-1.081,-1.156,-0.972)
Doing data < -1 gives you a logical vector, and we can count runs of TRUE & FALSE:
runs <- rle(data < -1)
print(runs)
## Run Length Encoding
## lengths: int [1:21] 1 1 20 1 29 2 8 2 4 2 ...
## values : logi [1:21] FALSE TRUE FALSE TRUE FALSE TRUE ...
Then extract the length of only the TRUE runs:
print(runs$lengths[which(runs$values)])
## [1] 1 1 2 2 2 1 3 1 3 4
and iterate over the columns of a data frame as previously shown:
# make a data frame from sampled versions of data
set.seed(1492) # repeatable
df <- data.frame(V1=data,
V2=sample(data, length(data), replace=TRUE),
V3=sample(data, length(data), replace=TRUE),
V4=sample(data, length(data), replace=TRUE))
# do the extraction
for (col in 1:ncol(df)) {
runs <- rle(df[, col] < -1)
print(runs$lengths[which(runs$values)])
}
## [1] 1 1 2 2 2 1 3 1 3 4
## [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
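To get from the run lengths to the counts the question asks for (how many runs of exactly 2, 3, ..., 9 consecutive values fall below -1), the lengths can be tabulated. The helper below is a sketch: the name `count_runs_below` and its arguments are made up, the example vector is invented, and it assumes the column has no missing values.

```r
count_runs_below <- function(x, threshold = -1, run_lengths = 2:9) {
  runs <- rle(x < threshold)
  below <- runs$lengths[runs$values]   # lengths of the TRUE runs only
  # A factor with fixed levels keeps zero counts for lengths that never occur
  table(factor(below, levels = run_lengths))
}

# Made-up example: runs below -1 have lengths 2, 1, 3 and 2
x <- c(-2, -2, 0, -3, 0, -2, -2, -2, 0, -1.5, -1.2)
count_runs_below(x)
```

For a whole data frame, `lapply(df, count_runs_below)` would give one such table per column, and `capture.output()` or `write.table()` could then send the result to a text file.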
