Replacing values in high-frequency data in R

I've got high-frequency duration data. I've found out that some faulty entries, which I cannot discard, have 1800 times a random number added to them. I was naive enough to try:
for (i in 1:21863924) { while (rr[i] >= 1800) { rr[i] <- rr[i] - 1800 } }
which obviously didn't finish even though I left it running overnight. Is there a more elegant way to do this, given that subsetting the dataset to exclude the faulty entries works in a matter of seconds?

It can be done in a vectorized way. Create a logical vector
i1 <- rr >= 1800
Use that vector to subtract 1800 from the flagged values, assigning the result back into the original vector:
rr[i1] <- rr[i1] - 1800
A recursive function would be
f1 <- function(x, val) {
  i1 <- x >= val
  x[i1] <- x[i1] - val
  if (sum(x >= val) > 0) f1(x, val) else x
}
out <- f1(rr, val = 1800)
sum(out >= 1800)
#[1] 0
data
set.seed(24)
rr <- sample(20000, 100)
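A side check of my own (not part of the original answer): because the loop subtracts 1800 until a value drops below 1800, the result for non-negative durations is the same as taking the remainder modulo 1800, which gives a quick way to verify the recursion.
all(f1(rr, val = 1800) == rr %% 1800)
# expected: TRUE (values already below 1800 are unchanged by %% as well)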

Related

Confusing non-numeric argument to binary operator error

I cannot even seem to create a reproducible example for this, as the code works fine when I step through it one line at a time.
The error message I get is as follows:
"Error in testData[, colCheck][length(testData[, colCheck])] - testData[, :
non-numeric argument to binary operator "
Both colCheck and testData$linearcorrd15N are numeric and like I said, the calculation works fine when I run it at that line. The error comes only when I run the function from QTest(df, colCheck).
Here is an example of what some of the code looks like. It will not produce an error, but maybe you can see something that I don't.
library(dplyr)  # arrange() comes from dplyr
QTest <- function(testData, colCheck)
#%#
# testData <- This is the entire data frame for the std/ref that has too high
#             of a SD, this way the data frame can be returned without the outlier
# colCheck <- The column name for values that were flagged for having too high of a SD
# This Q test info provided by: https://www.statisticshowto.com/dixons-q-test/
#%#
{
  # Get the mean of the highest and lowest values
  testData <- arrange(testData, desc(testData[, colCheck]))
  len <- length(testData[, colCheck]) - 1
  high <- sapply(1:len, function(i) testData[, colCheck][i])
  meanhigh <- mean(high)
  testData <- arrange(testData, (testData[, colCheck]))
  low <- sapply(1:len, function(i) testData[, colCheck][i])
  meanlow <- mean(low)
  # If the mean of the lowest numbers is lower than the mean of the highest numbers, do this
  if (meanlow < meanhigh) {
    QexpVal <- abs((testData[, colCheck][2] - testData[, colCheck][1]) /
                     (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[, colCheck][1]
    closest <- testData[, colCheck][2]
  # else if the mean of the lowest numbers is higher than the mean of the highest numbers, do this
  } else {
    QexpVal <- abs((testData[, colCheck][length(testData[, colCheck]) - 1] - (testData[, colCheck][length(testData[, colCheck])])) /
                     (testData[, colCheck][length(testData[, colCheck])]) - (testData[, colCheck][1]))
    outlier <- testData[, colCheck][length(testData[, colCheck])]
    closest <- testData[, colCheck][length(testData[, colCheck]) - 1]
  }
  return(QexpVal)
}
df <- data.frame(Row = c(1, 2, 3, 4, 5), Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(std1)))
QTestCorrVals <- QTest(df, colCheck)
It seems you really overcomplicate this function by passing the whole table into it, looping over everything, and reading values back out of the whole table each time...
Just getting meanhigh and meanlow requires no more than this:
v <- df[, colCheck]
v <- v[order(v)]
n <- length(v)
meanhigh <- mean(v[2:n])
meanlow <- mean(v[1:(n-1)])
Or, if you use decreasing ordering:
v <- df[, colCheck]
v <- v[order(v, decreasing = T)]
n <- length(v)
meanhigh <- mean(v[1:(n-1)])
meanlow <- mean(v[2:n])
Full function
Here is the full code using this approach. I agree this is not the specific question you asked, but the way you coded it is highly inefficient and error prone: every time you take the whole data.frame, subset it, and recalculate lengths. You also only have to order once; if the lowest value is at the top, the highest is by definition at the bottom. Then just play with the indices: 1 for the first, 2 for the second, n for the last, and n-1 for the second to last.
df <- data.frame(Row = c(1, 2, 3, 4, 5), Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTest <- function(v) {
  v <- v[order(v)]
  n <- length(v)
  meanhigh <- mean(v[2:n])
  meanlow <- mean(v[1:(n-1)])
  if (meanlow < meanhigh) {
    QexpVal <- abs((v[2] - v[1]) / (v[n] - v[1]))
    outlier <- v[1]
    closest <- v[2]
  } else {
    QexpVal <- abs((v[n-1] - v[n]) / (v[n] - v[1]))
    outlier <- v[n]
    closest <- v[n-1]
  }
  return(QexpVal)
}
QTestCorrVals <- QTest(df[, colCheck])
Side note
Using a column index number works slightly differently depending on whether your data is a data.frame or a data.table:
library(data.table)
class(df)
df[, colCheck]
dt <- data.table(df)
class(dt)
dt[, ..colCheck]
dt[, colCheck] # throws an error
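A small addition of my own: if you want one indexing style that behaves the same for both classes, [[ extracts a single column by position from either a data.frame or a data.table.
df[[colCheck]]   # works for the data.frame
dt[[colCheck]]   # and identically for the data.table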

Quickest way to change DNA character string based on quality information (both imported from BAM file) in R

I am attempting to alter the characters in a DNA character string based on the associated quality information. There are a number of tools for quality filtering (cutadapt and Trimmomatic, among others), but none of them seem to do precisely what I want.
I want to convert the basecall to another value (using "Y" in the script below, for no particular reason) based on a quality threshold. The purpose is to exclude that base from downstream analysis, which depends on the presence of A / T / G / C / -.
The quality information is imported from the BAM file (using GenomicAlignments + ScanBam) and I am using Rsubread to convert this to Q-scores.
I have the for loop below working fine, but it takes quite a while on large data sets (~100k reads). Are there ways to get a similar output in a more efficient manner, maybe with apply? I don't know how to turn this into a parallel computation (it seems to use only one core).
Dummy data included for completeness.
#Make dummy sequence data frame
seq1 <- "CCGGGGGATCCGGAGGCACCGGCGGCGGGTCCGGCGGCGGCAGTAATACGACTCACTATAGGGGCACCGGTGGACTGTTCGAGGCCATCGAGGGATTCATCGAAAACGGATGGGAAGGCATGATCGACGGCTGGTACGGCTTTAGGTGACACCAGAACGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCACAT"
seq1DF <- data.frame(seq1)
seq2 <- "CCGGGGGATCCGGAGGCACCGGCGGCGGGTCCGGCGGCGGCAGTAATACGACTCACTATAGGGGCACCGGTGGACTGTTCGAGGCCATCGAGGGATTCATCGAAAACGGATGGGAAGGCATGATCGACGGCTGGTACGGCTTTAGGTGACACCAGAACGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCACAT"
seq2DF <- data.frame(seq2)
seqDF <- as.data.frame(rbind(seq1,seq2))
#Make dummy quality data frame
quality1 <- data.frame(t(c(replicate(220, 40),30:25)))
quality2 <- data.frame(t(c(replicate(220, 40),35:30)))
qualityDF <- as.data.frame(rbind(quality1,quality2))
#Define quality threshold
threshold <- 30
#number of rows in seqDF changes depending on the number of reads
#some reads have slightly longer lengths, so check for the max nchar and go over all sequences to this extent
#the quality information can contain NA where there is no sequencing data, include !is.na to circumvent this
for (x in c(1:nrow(seqDF))) {
  for (n in c(1:max(nchar(seqDF[,1])))) {
    if (qualityDF[x,n] <= threshold && !is.na(qualityDF[x,n])) {
      substr(seqDF[x,1], n, n) <- "Y"
    }
  }
}
This should be more efficient. It's unnecessary to loop through the whole string; only the targeted positions (quality <= threshold) need to be replaced with "Y".
# seq to data.frame split as column
df1 <- data.frame(seq1 = strsplit(seq1,"")[[1]])
df2 <- data.frame(seq2 = strsplit(seq2,"")[[1]])
# NA will be skipped
df1[which(quality1 <= threshold),] <- "Y"
df2[which(quality2 <= threshold),] <- "Y"
# collapse from vector to string
seq1_convert = paste(df1$seq1, sep = '', collapse = '')
seq2_convert = paste(df2$seq2, sep = '', collapse = '')
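The snippet above handles seq1 and seq2 one at a time. Below is a sketch of my own (assuming row i of qualityDF holds the per-base scores for row i of seqDF; convert_read is a hypothetical helper name) showing how the same idea can be applied across every row of seqDF.
convert_read <- function(seq_string, quals) {
  bases <- strsplit(seq_string, "")[[1]]
  # positions at or below the threshold; NA scores are skipped
  bad <- which(!is.na(quals) & quals <= threshold)
  bases[bad[bad <= length(bases)]] <- "Y"
  paste(bases, collapse = "")
}
seqDF$converted <- vapply(seq_len(nrow(seqDF)), function(i)
  convert_read(seqDF[i, 1], unlist(qualityDF[i, ])), character(1))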
Maybe the following is a little bit faster.
MNC <- max(nchar(seqDF[,1]))
for (i in c(1:nrow(seqDF))) {
  for (j in which(!is.na(qualityDF[i,]) & qualityDF[i,] <= threshold)) {
    substr(seqDF[i,1], j, j) <- "Y"
  }
  substr(seqDF[i,1], nchar(seqDF[i,1])+1, MNC)
}
Or you convert the data.frames to matrix.
#Convert to matrix
seqM <- strsplit(seqDF[,1], "", TRUE)
seqM <- simplify2array(lapply(seqM, "[", seq_len(max(lengths(seqM)))))
qualityM <- t(as.matrix(qualityDF))
#overwrite with `Y` where quality is below threshold
seqM[!is.na(qualityM) & qualityM <= threshold] <- "Y"
#In case needed - Convert it back to string
apply(seqM, 2, function(x) paste(x[!is.na(x)], collapse=""))
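If the converted reads should end up back in the original data frame (a usage note of my own, not part of the answer), the collapsed strings can simply be assigned to a new column:
seqDF$converted <- apply(seqM, 2, function(x) paste(x[!is.na(x)], collapse = ""))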

Loop in a dataset simulation

I hope to get help with the following problem in R.
I have the following code to generate a 30-column dataset based on an exponential distribution:
x0=0
xmax=8000
xout=3000
lambda=0.0002
n=1
x1=x0+rexp(n,lambda)-xout
x2=x1+rexp(n,lambda)-xout
x3=x2+rexp(n,lambda)-xout
x4=x3+rexp(n,lambda)-xout
x5=x4+rexp(n,lambda)-xout
x6=x5+rexp(n,lambda)-xout
x7=x6+rexp(n,lambda)-xout
x8=x7+rexp(n,lambda)-xout
x9=x8+rexp(n,lambda)-xout
x10=x9+rexp(n,lambda)-xout
x11=x10+rexp(n,lambda)-xout
x12=x11+rexp(n,lambda)-xout
x13=x12+rexp(n,lambda)-xout
x14=x13+rexp(n,lambda)-xout
x15=x14+rexp(n,lambda)-xout
x16=x15+rexp(n,lambda)-xout
x17=x16+rexp(n,lambda)-xout
x18=x17+rexp(n,lambda)-xout
x19=x18+rexp(n,lambda)-xout
x20=x19+rexp(n,lambda)-xout
x21=x20+rexp(n,lambda)-xout
x22=x21+rexp(n,lambda)-xout
x23=x22+rexp(n,lambda)-xout
x24=x23+rexp(n,lambda)-xout
x25=x24+rexp(n,lambda)-xout
x26=x25+rexp(n,lambda)-xout
x27=x26+rexp(n,lambda)-xout
x28=x27+rexp(n,lambda)-xout
x29=x28+rexp(n,lambda)-xout
x30=x29+rexp(n,lambda)-xout
I have three questions:
1 - Is there any way to write this in a reduced form?
2 - This row (30 columns) needs to be simulated 10,000 times. How can I do this in a loop?
3 - The values of each cell (x1, x2, x3, ...) must be limited to the interval between x0 and xmax (0-8000). How can I do this?
That depends on what you want to do with values over 8000. Here's a solution that just takes those values and wraps them around with a modulo operator.
library(tidyverse)
test <- data.frame(x0 = rep(0, n))
for (i in 1:30) {
  new_col <- sym(paste0("x", i))
  old_col <- sym(paste0("x", i - 1))
  test <- test %>%
    mutate(!!new_col := (!!old_col + rexp(n, lambda) - xout) %% xmax)
}
I don't know how familiar you may or may not be with the tidyverse and tidy evaluation, which I've used liberally here. The !! operator, combined with sym(), turns the variable names into actual variables. The %>% operator "pipes" data from one function to the next. The := operator is needed only when you want to make an assignment with !! on the left-hand side.
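To make the sym()/!!/:= machinery a bit more concrete, here is a tiny illustration of my own (not from the original answer): the programmatic form and the literal form produce the same result.
# both calls overwrite column x1 with 0; the names col, a and b are only for illustration
col <- sym("x1")
a <- test %>% mutate(!!col := 0)
b <- test %>% mutate(x1 = 0)
identical(a, b)
# expected: TRUE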
I think this is my first time actually trying to post an answer on StackOverflow, so be easy on me! :)
As I'm fairly new to R myself, I thought it would be good practice to try to write this out. Perhaps not the most efficient code, but it works:
xmax <- 8000
xout <- 3000
lambda <- 0.0002
n <- 1
iterations <- 30
df <- data.frame(matrix(ncol = 31, nrow = iterations))
names(df) <- c(paste("x", 0:30, sep=""))
for (j in 1:iterations) {
  df$x0[j] <- 0
  df$x1[j] <- df$x0[j] + rexp(n, lambda) - xout
  if (df$x1[j] < 0) {
    df$x1[j] <- 0
  }
  if (df$x1[j] > 8000) {
    df$x1[j] <- 8000
  }
  for (i in 3:31) {
    df[j,i] <- df[j, i-1] + rexp(n, lambda) - xout
    if (df[j,i] < 0) {
      df[j,i] <- 0
    }
    if (df[j,i] > 8000) {
      df[j,i] <- 8000
    }
  }
}
You can change iterations to 10,000 (as in your question); for testing purposes I've used 30. Also, I didn't know whether you wanted to limit the values to 0 and 8000 before or after computing the next value, so I've done it before.
Is there any way to write this function in a reduced form?
I would do it like this. Pretty sure this is equivalent.
ncol = 30
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
Use replicate with the code above:
sim_data = t(replicate(10000, {
  row = rexp(ncol, lambda)
  row = cumsum(row) - xout * (1:ncol)
}))
replicate gives 10000 columns and 30 rows. We use t() to transpose it to 10000 rows with 30 columns.
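If you want the same x1 ... x30 column names as in the original script (my addition, optional), you can name the columns afterwards and convert to a data frame:
colnames(sim_data) <- paste0("x", seq_len(ncol(sim_data)))
sim_data <- as.data.frame(sim_data)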
The values of each cell (x1, x2, x3, ...) must be limited to the interval x0 and xmax (0-8000). How to do this?
Use pmin() and pmax(). Not sure if you want this done before or after the cumulative summing...
sim_data = t(replicate(10000, {
  row = rexp(ncol, lambda)
  row = cumsum(row) - xout * (1:ncol)
  row = pmax(0, row)
  row = pmin(xmax, row)
  row
}))
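If the clamping instead has to happen at every step, so that a clamped value feeds into the next one, cumsum() alone cannot express that. Here is a sketch of my own using Reduce() with accumulate = TRUE (the names one_row and sim_data_stepwise are only for illustration):
# one simulated row, clamping each value to [0, xmax] before it feeds into the next step
one_row <- Reduce(function(prev, step) min(max(prev + step - xout, 0), xmax),
                  rexp(30, lambda), init = 0, accumulate = TRUE)[-1]
# and, as above, wrap it in replicate() for the 10,000 simulated rows
sim_data_stepwise <- t(replicate(10000,
  Reduce(function(prev, step) min(max(prev + step - xout, 0), xmax),
         rexp(30, lambda), init = 0, accumulate = TRUE)[-1]))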

Any ideas to improve this loop?

I've been reading about how to improve code in R, taking a look at some of the answers here and also reading a bit of The R Inferno. Now I have this problem, and the loop I created seems to be taking forever (15 hours and counting).
k <- NROW(unique(df$EndStation.Id))
l <- NROW(unique(df$StartStation.Id))
m1 <- as.matrix(df[,c("Duration","StartStation.Id","EndStation.Id")])
g <- function(m) {
  for (i in 1:l) {
    for (j in 1:k) {
      duration <- m[(m[,2]==i & m[,3]==j), 1]
      if (NROW(duration) <= 1) {
        m[(m[,2]==i & m[,3]==j), 1] <- NA
        next
      }
      duration <- duration/median(duration)
      m[(m[,2]==i & m[,3]==j), 1] <- duration
    }
  }
  return(m)
}
answer <- g(m1)
The number of stations (start and end) is 750 each, and the size of the duration vector can vary a lot, from 1 or 2 up to 80. Is this loop improvable, or should I give up and try to get access to a faster computer?
Best regards,
Fernando
The code is a bit hard to read, but I think this is what you want to do:
library(data.table)
## generate a data table
dt <- setDT(df[,c("Duration","StartStation.Id","EndStation.Id")])
## calculate the duration
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
## replace the result with NA when the vector length == 1
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][
N == 1, Duration := NA
][, N := NULL]
If I understand your function correctly, you want to divide the duration between two stations by its median duration and, if there is only one entry for the pair of stations, set it to NA.
Here is a base solution (it's a bit clunky, I haven't finished my first cup of coffee):
##Some sample data
df <- data.frame(StartStation.Id=sample(LETTERS[1:10], 100, replace =T),
EndStation.Id=sample(LETTERS[11:20], 100, replace =T),
Duration=runif(100, 0.1,100))
res <- tapply(df$Duration, paste0(df$StartStation.Id, df$EndStation.Id), function(x) x/median(x))
res <- data.frame(StartStation.Id=sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 1),
EndStation.Id=sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 2),
durn=unlist(res))
res[res$durn==1,] <- NA
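A more compact base-R alternative of my own (same idea, not from the answer): ave() applies a function within each Start/End group and returns the results in the original row order, so the NA rule for single-entry pairs can live inside the function.
# scaled duration per Start/End pair; single-entry pairs become NA
df$durn <- ave(df$Duration, df$StartStation.Id, df$EndStation.Id,
               FUN = function(x) if (length(x) <= 1) NA_real_ else x / median(x))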

Removing specific rows from a data frame within a loop in R

How can I save the changes made to a data frame within a loop after each iteration? With the following code, I want to delete the rows of my data frame (df) whose V1 value equals v. The code runs, but the problem is that, in the end, only the last value of i in the iterations affects the data frame!
for (i in 1:50) {
  v <- i + 450
  temp <- fn$sqldf("select count(V1) from df where V1='$v' ")
  if (temp[1,] < 1000) {
    g <- temp[1,]
    c <- v
    print(paste("Topic number: ", c, "is less than 1000, with ", g, "records"))
    new_df <- df[df$V1 != v,]
  }
}
A more idiomatic R way would be:
reduced <- subset(df, V1 > 450 & V1 <= 500)
count <- table(reduced$V1)
V1OK <- as.integer(names(count)[count >= 1000])
filtered <- subset(reduced, V1 %in% V1OK)
If you'd rather continue with an sql-centric perspective, your problem appears to be that new_df is rebuilt from the full df on every iteration, so only the last deletion survives. You could instead set up a keep-flag for each row prior to the loop:
V1OK <- rep(TRUE, nrow(df))
and update it within the if block with
V1OK <- V1OK & df$V1 != v
and after the loop you could access
df[V1OK, ]
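For completeness, a sketch of my own that puts the flag idea together with the count condition from the question (using a base-R count in place of the sqldf call; the sqldf query would work just as well):
keep <- rep(TRUE, nrow(df))
for (i in 1:50) {
  v <- i + 450
  cnt <- sum(df$V1 == v)        # same count as the sqldf query
  if (cnt < 1000) {
    keep <- keep & df$V1 != v   # drop rows whose V1 equals this low-count value
  }
}
new_df <- df[keep, ]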
