Changing values in a vector given a location and condition in R

I'm having trouble manipulating vectors in R. I have a vector that looks like this:
stack <- append(append(rep(0,8),c(1,0,0,0,0,1)),rep(0,6))
[1] 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
My overall goal is to manipulate the vector as follows:
*when there is a 1, make the next three values in the vector 1.
*change the original 1 to 0.
so ultimately the vector would look like:
[1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
the second part I can do by:
replace(stack,which(stack == 1),0)
but I can't figure out how to do the first one efficiently. any help would be greatly appreciated.

You can use filter here :
c(filter(stack,c(0,0,0,0,1,1,1),circular=TRUE))
## [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0

Here's a possible base R option
temp <- which(stack == 1)
stack[as.vector(mapply(`:`, temp, temp + 3))] <- c(0, rep(1, 3))
stack
# [1] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0
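A variant of the same index-based idea, as a minimal sketch rather than a definitive solution: build all target positions with outer() and drop any that would fall past the end of the vector. The names ones, idx and result are illustrative and not from the answers above.

```r
# Example vector from the question.
stack <- c(rep(0, 8), 1, 0, 0, 0, 0, 1, rep(0, 6))

# Positions of the original 1s.
ones <- which(stack == 1)

# The three positions after each 1, flattened into one index vector.
idx <- as.vector(outer(1:3, ones, `+`))

# Guard against a 1 that sits within three places of the end.
idx <- idx[idx <= length(stack)]

# Start from all zeros (this also clears the original 1s), then fill in.
result <- numeric(length(stack))
result[idx] <- 1
result
```

Because the result starts from zeros, no separate step is needed to reset the original 1s.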

I would go with regular expressions
stack <- paste0(stack, collapse="")
stack <- gsub("1.{3}", "0111", stack)
# split into single characters and convert back to a numeric vector
stack <- as.numeric(strsplit(stack, "")[[1]])

Related

Find the vertical distance between the top most '1' and the bottom most '1' in a matrix in R

I have successfully imported a csv file into R. It is a 6 by 6 matrix.
0 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
0 0 0 0 0 0
'1' exists in the second row and also exists in the second last row. So the distance between them vertically is 4.
Would I use the dist function to calculate this? And if so how would I implement it to give me the value of 4?
diff(range(which(rowSums(mat) > 0)))
# [1] 3
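Explanation: since the data is binary, the rows that contain a 1 are exactly the rows whose sum is greater than 0, so we take the difference between the first and last such row index. This gives 3 (row 5 minus row 2); add 1 if you want the inclusive count of 4 rows described in the question.
Adapting Sathish's nicely shared data, this works: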
mat <- matrix(as.integer(unlist(strsplit('0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0', " "))),
nrow = 6, ncol = 6, byrow = TRUE)
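Putting the pieces together, here is a small self-contained sketch of the row-sum approach on the matrix from the question. The variable names rows_with_one, index_gap and rows_spanned are illustrative only.

```r
# The 6 x 6 binary matrix from the question, entered row by row.
mat <- matrix(c(0, 0, 0, 0, 0, 0,
                0, 1, 0, 0, 0, 0,
                0, 1, 1, 0, 0, 0,
                0, 1, 0, 0, 0, 1,
                0, 1, 0, 1, 0, 0,
                0, 0, 0, 0, 0, 0),
              nrow = 6, ncol = 6, byrow = TRUE)

# Rows that contain at least one 1.
rows_with_one <- which(rowSums(mat) > 0)

# Difference between the last and first such row index.
index_gap <- diff(range(rows_with_one))

# Number of rows spanned, counting both end rows.
rows_spanned <- index_gap + 1
```

Which of the two numbers counts as "the distance" depends on whether both end rows are included.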

Creating a repeated sequence of zero and ones with uneven "breaks" between

I am trying to create a sequence consisting of 1s and 0s using RStudio.
My desired output is a sequence that first has five 1s, then six 0s, followed by four 1s, then six 0s. This should all be repeated until the end of a given vector.
The result should be like this:
1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 .....
Hope someone has a good solution, and sorry if I have some grammar mistakes
Best,
HB
rep(c(rep(1,5),rep(0,6),rep(1,4),rep(0,6)),n)
repeating your pattern n times.
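If the target is a fixed length rather than a whole number of repetitions, base R's rep_len() can cut the pattern off at that length. This is a sketch that assumes a target length of 50; target_len and pattern are illustrative names.

```r
# One full cycle: five 1s, six 0s, four 1s, six 0s (21 values).
pattern <- c(rep(1, 5), rep(0, 6), rep(1, 4), rep(0, 6))

# Assumed length of the vector to fill; replace with length(your_vector).
target_len <- 50

# Recycle the pattern and truncate it at exactly target_len values.
out <- rep_len(pattern, target_len)
```

The last cycle is simply truncated wherever the target length ends.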
You could use Map.
unlist(Map(function(x, ...) c(rep(x, ...), rep(0, 6)), 1, times=length(v):1))
# [1] 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
Instead of length(v):1 you may also use rev(seq(v)) but it's slower.
Data
v <- c("Vector", "of", "specific", "length", "five")

Counting repeated 5-mers gene from 100 DNA sequence samples

I am a beginner in R and have been struggling with this problem for a few days. Please help a newbie out.
I extracted 100 samples, each of length 1000, from a DNA sequence of length 100,000. Now I want to count how many times "AATAA" appears in each sample.
dog_100
# [1] "GGGTCCTTGAAAGAAGCACAGGGTGGGGGTGGGGGTGGGGGTGGGGGAAGGCAGAGAGGAGGAAACAGGTTTTTGTCCTCAGGGCGTTGCCAGTCTGAAGGAGGTGATGGGATAATTATTTATGAGAGTTCAGGAATGCCAGGCATGGATTAAATGCAAACTAATGGAAATGACACAGAACAATACATTACAC......................................"
#[2] "CCAGGCCAGAACTGAGGCCCTCAGGGCCCCCCAGAATTCCTCATTTGCAGGATAAAAATATACTCAGCTCTTCAATCTTGGTTCTTGCTACTGCACCATGTGCTTCCTGGACTCTGGGAGGCCAGGGGTTAAGTGGGAGTGTTTGAATAAGGGAAAGGATGAGCCCTTTCCCCACACTTTGCCCCAAATAAC......................................"
#[3]
#........
# [4]
#........
# [100]
#........
I wrote a function to identify and count the "AATAA".
library(stringr)
cal_AATAA <- function(DNA){
  sam_pro <- numeric(length(DNA))
  k <- 5
  sam_code <- "AATAA"
  for(i in 1:(length(DNA))){
    Num <- str_length(DNA[i])
    for(j in 1:(Num - k +1)){
      if ((str_sub(DNA[i], j, j+k-1)) == sam_code){
        sam_pro[i] <- sam_pro[i] + 1
      }
      else {
        sam_pro[i] <- sam_pro[i]
      }
    }
    return (sam_pro)
  }
}
sample_100 <- cal_AATAA(dog_100)
What I got after running the function is
> sample_100
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[46] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[91] 0 0 0 0 0 0 0 0 0 0
I tried to debug my code but can't tell where it went wrong. I'd appreciate any tips or guidance.
R has a built-in function called gregexpr which can be used for counting patterns in a string. It outputs a list, so we have to use sapply to loop through the elements of the output. For each element, we count the number of values that are greater than zero, because a value of -1 indicates that no match was found. Look at the output of gregexpr("ap", c("appleap", "orange")) as an example.
dna = c("AGTACGTGCATAGC", "GTAGCTAGCTAGCAT")
sam = "AGC"
sapply(gregexpr(sam, dna), function(x) sum(x > 0))
#[1] 1 3
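One caveat worth sketching: gregexpr with a plain pattern reports non-overlapping matches, while the sliding-window loop in the question counts every start position, and "AATAA" can overlap itself (as in "AATAATAA"). A zero-width lookahead with perl = TRUE counts overlapping occurrences. The toy input below is made up for illustration.

```r
# Hypothetical toy input; the second string contains no match.
dna <- c("AATAATAA", "GGGCCC")
motif <- "AATAA"

# Plain pattern: matches are non-overlapping.
non_overlapping <- sapply(gregexpr(motif, dna), function(x) sum(x > 0))

# Lookahead pattern (needs perl = TRUE): one zero-width match
# at every position where the motif starts.
overlapping <- sapply(
  gregexpr(paste0("(?=", motif, ")"), dna, perl = TRUE),
  function(x) sum(x > 0)
)
```

As for the original function, its return() call sits inside the outer for loop, so the function exits after processing only the first sample and every other count stays at its initial 0. Moving return(sam_pro) after the closing brace of that loop lets all samples be counted.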

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how would I do it without using a lot of logical operators (I don't know what columns have matches or not). What I would like to do is to aggregate the rows that have more than (or equal to) two 1's into a new column, to be able to have a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = T, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0
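A slightly more defensive variant of the same idea, as a sketch: building each row's label inside a single apply() call always returns one string per row, whereas apply(m == 1, 1, which) returns a matrix instead of a list when every row happens to contain the same number of 1s. The name label is illustrative.

```r
# Example data from the question.
m <- data.frame(A = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1),
                B = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0),
                C = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
                D = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 1))

# Drop rows that contain no 1 at all.
m <- m[rowSums(m) > 0, ]

# One label per row: the names of the columns holding a 1, pasted together.
label <- apply(m == 1, 1, function(r) paste(colnames(m)[r], collapse = ""))

# One indicator column per distinct label.
d <- factor(label)
result <- as.data.frame(model.matrix(~ d + 0))
names(result) <- levels(d)
```

Each row of result then has exactly one 1, in the column named after its combination.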

Determining whether each vector element is smaller than all previous elements

I need to compare element i with all previous elements i-1,i-2,..., and if i < i-1, i-2, ... return 1, otherwise return 0.
data <- c(10.3,14.3,7.7,15.8,14.4,16.7,15.3,20.2,17.1,7.7,15.3,16.3,19.9,14.4,18.7,20.7)
The result of comparing should be the following.
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
I tried to make it with
as.integer(cummin(data)==data)
and I get
1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
The first element is easy to fix, but what should I do about the other 1 at position 10?
A possible approach:
v <- rank(data, ties.method = "first")
out <- as.integer(cummin(v)==v)
# [1] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Taking care of the first element:
out[1] <- 0
try this:
sapply(1 : length(data), FUN = function(i) all(data[i] < data[1 : (i - 1)]) * 1)
#[1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
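Another vectorised option, sketched here as an alternative to the answers above: compare each element with the running minimum of the elements before it, which amounts to shifting cummin() by one position. The name prev_min is illustrative.

```r
data <- c(10.3, 14.3, 7.7, 15.8, 14.4, 16.7, 15.3, 20.2,
          17.1, 7.7, 15.3, 16.3, 19.9, 14.4, 18.7, 20.7)

# Running minimum of all *previous* elements; Inf is a placeholder
# for the first element, which has no predecessors.
prev_min <- c(Inf, cummin(data)[-length(data)])

# 1 where an element is strictly smaller than everything before it.
out <- as.integer(data < prev_min)

# By the convention in the question, the first element is reported as 0.
out[1] <- 0L
out
```

The strict comparison is what keeps the repeated 7.7 at position 10 from being flagged.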
