Finding Specific Vector Entries in a Sliding Window - r

I am trying to create a function that returns counts of specific adjacent nucleotides (C and G beside each other) within a window over a sequence that I have stored in a vector.
I would like the windows to be 100 nucleotides long and to shift by 10 each time.
The data is set up like this (up to 10k entries):
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
"a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
So far I have tried this:
library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))
But I always get the error
"Error in match.fun(FUN) :
'count(data, 2)' is not a function, character or symbol"
I have also tried:
starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
print (chunkCG)
}
However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.

EDIT: To get the desired output with a window that slides by 10 observations, you can use a for loop. Since we pre-allocate the result, the loop is reasonably fast. I think this is the best way to solve your problem, since I don't think many grouping functions (if any) support a sliding window:
library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))
#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1
#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))
#Loop over each i, paste a leaded version of your sequence onto current sequence and count "cg"s
for (i in 1:n_windows){
res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
subs2 <- paste0(as.character(subs), shift(as.character(subs), 1L, type="lead")) # the last pair ends in NA, so it can never equal "cg"
res$count_cg[i] <- sum(subs2=="cg")
}
head(res)
window count_cg
1 1-100 10
2 11-110 10
3 21-120 8
4 31-130 9
5 41-140 9
6 51-150 9

Your method doesn't overlap, as you call it with by = 100. Otherwise it looks fine. Just change it to 10.
To keep the results from your last attempt, create a numeric vector that collects the counts; you can pull the relevant count out of each result by name.
counted_cg <- vector(mode = "numeric")
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
counted_cg <- c(counted_cg, chunkCG["cg"])
}
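For what it's worth, the rollapply call in the question fails because FUN has to be a function, whereas count(data, wordsize=2) is the already-evaluated result. Below is a minimal base-R sketch of the overlapping-window count that avoids both packages. The name count_cg_windows and its arguments are hypothetical, and it assumes lowercase single-character entries as in the sample data:

```r
# Sample data from the question
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

# Hypothetical helper: count "c" immediately followed by "g" in sliding windows
count_cg_windows <- function(x, width = 100, by = 10) {
  stopifnot(length(x) >= width)
  # is_cg[i] is TRUE when x[i] == "c" and x[i + 1] == "g"
  is_cg <- x[-length(x)] == "c" & x[-1] == "g"
  # First position of every window that fits completely inside x
  starts <- seq(1, length(x) - width + 1, by = by)
  # A window of `width` entries contains `width - 1` adjacent pairs
  counts <- vapply(starts, function(s) sum(is_cg[s:(s + width - 2)]), integer(1))
  data.frame(start = starts, end = starts + width - 1, count_cg = counts)
}

# Small window so the 24-entry sample yields several windows
res <- count_cg_windows(data, width = 10, by = 5)
print(res)
```

For the real data you would call count_cg_windows(data) with the defaults (width 100, shift 10); the result is a data frame, so the counts are saved rather than just printed.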


add a column based on values in three other columns

I have a data frame ('ju') that has three columns and 230 rows. The first two columns represent a pair of objects. The third column contains one of those objects. I'd like to add a fourth column that will contain the other object from that pair, as shown below.
I wrote code to identify the value for the fourth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
paste(ju$letter1[i])
} else {
paste (ju$letter2[i])
}
}
I cannot see what is wrong with the code. I would also appreciate it if you could suggest how to create this fourth column directly in my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks
This will do it without a for loop:
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1) # row-wise comparison; assumes character columns
Gives:
> ju
letter1 letter2 winner loser
1 a c a c
2 c b b c
3 t j j t
4 r k k r
If you want to print to console, you'll need to add:
cat(ju$letter1[i])
or
print(ju$letter1[i])
Regarding the new-column question, a possible solution (a for loop is sub-optimal here; see the suggestion from @lab_rat_kid):
ju$NewColumn = NA
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
ju$NewColumn[i] <- ju$letter1[i]
} else {
ju$NewColumn[i] <- ju$letter2[i]
}
}
With tidyverse:
library(tidyverse)
dt <- tibble(l1 = c("a", "c", "t", "r"),
l2 = c("c", "b", "j", "k"),
winner = c("a", "b", "j", "k"))
dt <- dt %>%
mutate(loser = if_else(winner == l1, l2, l1))
dt

Cluster hh:mm in 3 minute intervals in dataframe R

I have a dataframe containing tweets from a 1.5 hour time span (from 21:23 to 22:48). One of the columns in my dataframe holds the timestamp as hh:mm (this is of class character). I want to plot my data (e.g. the number of tweets) over time, but grouped into 3-minute intervals (so that a barplot gives a cleaner overview than grouping by the minute). Can someone explain to me how this would work?
So the result I want to achieve is that in the example dt below is:
a, b, c, and d are clustered in group 1
e, f, g, h, and i are clustered in group 2
j, k, and l are clustered in group 3
dt <- data.frame(text = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
time = c("21:23", "21:23", "21:24", "21:25", "21:26", "21:27", "21:27", "21:28", "21:28", "21:29", "21:30", "21:31"))
Check out the bins argument of geom_histogram, which lets you specify the number of bins (buckets). For example,
ggplot(<your options>) + geom_histogram(bins=30)
I used 30 because 30 bins of 3 minutes cover the 90 minutes of data, but you can make it a variable.
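Since the time column is of class character, it may be easier to compute the 3-minute group yourself before plotting. Here is a minimal base-R sketch on the example dt from the question; the minutes and group names are my own, and it assumes all timestamps fall on the same day (no wrap past midnight):

```r
dt <- data.frame(
  text = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
  time = c("21:23", "21:23", "21:24", "21:25", "21:26", "21:27",
           "21:27", "21:28", "21:28", "21:29", "21:30", "21:31"),
  stringsAsFactors = FALSE
)

# Convert "hh:mm" to minutes since midnight
parts <- strsplit(dt$time, ":", fixed = TRUE)
minutes <- vapply(parts,
                  function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
                  integer(1))

# Group index: 1 for the first 3 minutes after the earliest tweet, 2 for the next 3, ...
dt$group <- (minutes - min(minutes)) %/% 3L + 1L

# Number of tweets per 3-minute group
counts <- table(dt$group)
print(counts)
```

On the example this puts a to d in group 1, e to i in group 2, and j to l in group 3; barplot(counts) then draws one bar per interval.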

How to subset a data set based on a specific condition?

I am working with a data set that has over 500 000 observations. It is available here: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/. My goal is to remove some of the rows based on the following buzzwords:
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
If the column Description contains these buzzwords, for example, "Apple" contains "A", I should remove the entire row. I wrote this function:
isUndesirable2 = function(x){
c=FALSE
for (i in 1:length(buzzwords)){
if (str_detect(toupper(x),buzzwords[i])){
c=TRUE
}
break
}
return(c)
}
The function above works well, I tried it on multiple examples. However, when I try to subset the original data set:
dataset43 = dataset2[which(!isUndesirable2(as.character(dataset2$Description))),]
I get this warning:
Warning message:
In if (str_detect(toupper(x), buzzwords[i])) { :
the condition has length > 1 and only the first element will be used
and the new dataset43 is empty. Weird!
If anyone more experienced with R can help - I'd really appreciate it. Thanks! PS Sorry about the formatting, it is my first post.
Consider a more R-like approach
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
dataset2 <- data.frame(Description=c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"))
library(stringr)
I use outer for all-to-all comparison of dataset2$Description and buzzwords in str_detect(i, j). Then filter dataset2 based on whether any buzzword was found or not
is_buzzword_present <- outer(dataset2$Description, buzzwords, function(i, j) str_detect(toupper(i), j))
dplyr::filter(dataset2, !apply(is_buzzword_present, 1, any))
# Description
# 1 Zoo
# 2 You
Using the example above, we can also do this in base R (ignore.case = TRUE mirrors the toupper() call in the question):
subset(dataset2,!grepl(paste(buzzwords,collapse="|"),do.call(paste,dataset2),ignore.case=TRUE))
Description
3 Zoo
7 You
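As for why the original function misbehaves: if() only looks at a single TRUE/FALSE, so passing the whole Description column triggers the warning, and the break sits outside the if, so the loop stops after the first buzzword. A minimal vectorised sketch of the same idea follows; is_undesirable and its words argument are hypothetical names, not the asker's code:

```r
buzzwords <- c("A", "B", "C", "D", "E", "F", "G", "H")

# Returns one TRUE/FALSE per element of x:
# TRUE when the upper-cased text contains at least one buzzword
is_undesirable <- function(x, words = buzzwords) {
  x <- toupper(as.character(x))
  # One logical vector per buzzword, combined element-wise with OR
  Reduce(`|`, lapply(words, function(w) grepl(w, x, fixed = TRUE)))
}

dataset2 <- data.frame(
  Description = c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"),
  stringsAsFactors = FALSE
)

flags <- is_undesirable(dataset2$Description)
# drop = FALSE keeps a data frame even though there is only one column
kept <- dataset2[!flags, , drop = FALSE]
print(kept)
```

Because the function returns a logical vector of the same length as its input, it can be used directly for row subsetting without which().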

Finding occurrence of character from multiple vector or list

I wish to find the number of unique/distinct characters that occur across multiple vectors or in a list.
Perhaps it's best described with an example:
In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find how the number of distinct letters grows as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
book_B = c("c", "d", "a"),
book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code as a loop
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of unique letters", xlab = "Book Number",
main="The number of unique letters as the number of books increases")
Note: the real data set is much bigger. Each vector (book_A, book_B, etc.) is about 7000 in length.
I attempted to solve the problem with dplyr or a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it only plots the cumulative sum rather than what I intended. Any hints would be most helpful.
To add to the complexity, it would be nice if the books with the fewest letters could be plotted first. Something along these lines:
# Example ;
# Find the length of the letters in the book
lapply(book_list, length)
# I know that book_B has the fewest letters (3);
# followed by book_A (4) then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
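For the second part of the question (shortest book first), one option is to reorder the list by length before accumulating. This is a minimal sketch on the example data; by_size and cum_unique are names I made up:

```r
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))

# Sort the books from shortest to longest
by_size <- book_list[order(lengths(book_list))]

# Concatenate the books cumulatively, then count distinct letters at each step
cum_unique <- vapply(Reduce(c, by_size, accumulate = TRUE),
                     function(i) length(unique(i)),
                     integer(1))
print(cum_unique)

# To draw the curve:
# plot(seq_along(cum_unique), cum_unique,
#      xlab = "Book Number", ylab = "Number of unique letters")
```

With vectors of about 7000 entries each, repeatedly concatenating may get slow for many books; accumulating with union instead of c keeps the intermediate vectors small, at the cost of an extra unique() on the first book if it can contain duplicates.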

subsetting in r based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text= ... approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text=..)), even though you evidently find it slow:
cond <- c('a==3 & b == "a"','a==2','a==1 & b=="a" & c=="m"','c=="f"')
names(cond) <- cond
results_vector <- lapply(cond,function(x)
sum(e[eval(parse(text=x), envir=e),"d"]))
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
Here is a function that takes as arguments the condition in each column (if no condition in a column, then NA as argument) and sums in a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF) # pass NA for a column with no condition
{
conds.ls <- list(...)
which_rows <- seq_len(nrow(DF)) # start from all rows
for(i in seq_along(conds.ls))
{
if(is.na(conds.ls[[i]])) next # no condition on this column
# keep only the rows that also satisfy the condition on column i
which_rows <- intersect(which_rows, which(DF[,i] == conds.ls[[i]]))
}
return(sum(DF[which_rows , sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
#all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
con2 = c(NA, "a", NA),
con3 = c(1, NA, "f"),
stringsAsFactors = F)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first word that comes to mind when looking at this, though...
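If the conditions have to stay as strings, one thing that might help is parsing them all in a single call and then only evaluating the resulting expressions against the data frame. This is just a sketch under that assumption; I have not benchmarked it, and parsed is a name I introduced:

```r
e <- data.frame(a = c(1, 2, 3, 4, 5, 1),
                b = c("a", "b", "a", "b", "c", "a"),
                c = c("m", "f", "f", "m", "m", "f"),
                d = 1:6,
                stringsAsFactors = FALSE)

cond <- c('a==3 & b=="a"', 'a==2', 'a==1 & b=="a" & c=="m"', 'c=="f"')

# One parse() call for the whole vector: one expression per condition string
parsed <- parse(text = cond)

# Evaluate each expression with the columns of e in scope and sum column d;
# which() drops rows where the condition evaluates to NA
results_vector <- vapply(as.list(parsed),
                         function(ex) sum(e$d[which(eval(ex, envir = e))]),
                         numeric(1))
names(results_vector) <- cond
print(results_vector)
```

Passing the data frame as envir means the column names are looked up in e first, so attach() is not needed.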
