Summing the counts in a data frame using sliding window - r

I am new to R. I have a data frame in R like the following:
df <- data.frame(ID=c(rep("A1",10),rep("A2",13),rep("A3",12)),
Values=c(10,2,4,23,10,5,20,15,13,21,15,9,19,5,14,25,18,19,31,26,4,21,4,6,7,12,15,18,25,20,16,29,21,19,10))
For every ID I would like to sum the counts in column "Values" in sliding windows of 3 positions. The following data frame is an excerpt from df that includes only the records corresponding to A1:
ID Values
A1 10
A1 2
A1 4
A1 23
A1 10
A1 5
A1 20
A1 15
A1 13
A1 21
I would like to take 3 entries at a time, sum them, and move on to the next 3 entries. When the sliding window can't accommodate 3 entries, I skip those values.
For example, window_1 starts from the first value (10), while window_2 starts from the second value (2) and window_3 starts from the third value (4).
window_1 = [10+2+4] + [23+10+5] + [20+15+13] = 102
window_2 = [2+4+23] + [10+5+20] + [15+13+21] = 113
window_3 = [4+23+10] + [5+20+15] = 77
and report it in a data frame like the following:
ID Window_1 Window_2 Window_3
A1 102 113 77
Likewise, I would like to sum the counts in column Values for every ID in the data frame "df" and report them in a data frame like the following:
ID window_1 window_2 window_3
A1 102 113 77
A2 206 195 161
A3 198 163 175
I tried the following code
sum_win_3=0
sum_win_2=0
sum_win_1=0
win_1_counts=0
win_2_counts=0
win_3_counts=0
for (i in seq(1,length(df$Values),3))
{
if((i+i+1+i+2) %% 3 == 0)
{
win_1_counts=df$Values[i]+df$Values[i+1]+df$Values[i+2]
win_1_counts[is.na(win_1_counts)]=0
#print(win_1_counts)
}
sum_win_1=sum_win_1+win_1_counts
}
#print(sum_win_1)
for (j in seq(2,length(df$Values),3))
{
if((j+j+1+j+2) %% 3 == 0)
{
win_2_counts=df$Values[j]+df$Values[j+1]+df$Values[j+2]
win_2_counts[is.na(win_2_counts)]=0
#print(win_2_counts)
}
sum_win_2=sum_win_2+win_2_counts
}
#print(sum_win_2)
for (k in seq(3,length(df$Values),3))
{
if((k+k+1+k+2) %% 3 == 0)
{
win_3_counts=df$Values[k]+df$Values[k+1]+df$Values[k+2]
win_3_counts[is.na(win_3_counts)]=0
#print(win_3_counts)
}
#sum_win_3=sum_win_3+win_3_counts
}
print(sum_win_3)
output=data.frame(ID=df[1],Window_1=sum_win_1,Window_2=sum_win_2,Window_3=sum_win_3)
The above code sums the counts for window_1, window_2 and window_3 by taking all the IDs together rather than working on every ID separately.
Kindly guide me in getting the output in the desired format stated above.
Thanks in advance

Using the data.table package, I would approach it as follows:
library(data.table)
setDT(df)[, .(w1 = sum(Values[1:(3*(.N%/%3))]),
              w2 = sum(Values[2:(3*((.N-1)%/%3)+1)]),
              w3 = sum(Values[3:(3*((.N-2)%/%3)+2)]))
          , by = ID]
which gives:
ID w1 w2 w3
1: A1 102 113 77
2: A2 206 195 161
3: A3 198 163 175
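To see where the index bounds in the expressions above come from, here is a minimal base R sketch for a single vector. The helper name `window_sums` and its argument `w` are hypothetical, chosen only for illustration, and the guard for vectors too short to hold one complete window is an added assumption.

```r
# Hypothetical helper: for each start offset i = 1..w, sum all complete,
# non-overlapping windows of width w that begin at position i.
window_sums <- function(x, w = 3) {
  n <- length(x)
  sapply(seq_len(w), function(i) {
    k <- (n - i + 1) %/% w        # number of complete windows from offset i
    if (k == 0) return(0)         # assumption: no complete window -> 0
    sum(x[i:(i - 1 + w * k)])     # the windows are contiguous, so one slice suffices
  })
}

a1 <- c(10, 2, 4, 23, 10, 5, 20, 15, 13, 21)  # the A1 values from the question
window_sums(a1)
```

Applied per ID, something like `t(sapply(split(df$Values, df$ID), window_sums))` should give one row per ID.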
Or to avoid the repetition (thanks to @Cath):
setDT(df)[, lapply(1:3, function(i) {sum(Values[i:(3*((.N-i+1)%/%3)+(i-1))])})
, by = ID]
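If you want to rename the V1, V2 & V3 variables, you can do that afterwards. Alternatively, you can add the sums to df by reference as named columns (note that this keeps every row of df and repeats the sums within each ID):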
cols <- c("w1","w2","w3")
setDT(df)[, (cols) := lapply(1:3, function(i) {sum(Values[i:(3*((.N-i+1)%/%3)+(i-1))])})
, by = ID]

This could be done using tapply and aggregate
sumf <- function(x1){
  sum(tapply(x1,
             (seq_along(x1) - 1) %/% 3,
             function(x) ifelse(length(x) == 3, sum(x), 0)))
}
aggregate(Values ~ ID, data = df,
          FUN = function(y){
            cbind(sumf(y), sumf(y[-1]), sumf(y[-c(1,2)]))
          })
# Group.1 x.1 x.2 x.3
#1 A1 102 113 77
#2 A2 206 195 161
#3 A3 198 163 175
This can also be done using filter
sum.filter <- function(z) tapply(head(tail(as.numeric(
filter(z, c(1,1,1))),-1), -1),
0:(length(z)-3) %% 3 +1, sum)
aggregate(Values ~ ID, data = df, FUN = function(y){ cbind(sum.filter(y) )})
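A small sketch of the intermediate steps for the A1 values may make the filter approach easier to follow. It calls `stats::filter` explicitly, because attaching dplyr masks `filter()` with an unrelated function. The variable names are illustrative, and the sketch assumes the input itself contains no NA.

```r
a1 <- c(10, 2, 4, 23, 10, 5, 20, 15, 13, 21)

# Centred moving sum of width 3; the first and last entries are NA
# because no full window is centred there.
s <- as.numeric(stats::filter(a1, rep(1, 3)))

# Drop the two NA ends: s[j] is now the sum of a1[j:(j + 2)].
s <- s[!is.na(s)]

# Every third rolling sum belongs to the same window series.
grp <- (seq_along(s) - 1) %% 3 + 1
win <- tapply(s, grp, sum)
win
```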

This seems to work:
library(zoo)
wins = function(x, w)
rollapply(x, width = w*((length(x)-seq(w)+1) %/% w), align = "left", sum)
aggregate(Values ~ ID, df, wins, 3)
# ID Values.1 Values.2 Values.3
# 1 A1 102 113 77
# 2 A2 206 195 161
# 3 A3 198 163 175
This is the only answer so far to perform the calculation on a rolling basis, which is usually more efficient.
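As an illustration of what computing on a rolling basis can look like without extra packages, the sketch below derives all width-3 rolling sums from one cumulative sum and then adds up every third one. This is not the zoo code above, only a base R analogue; `rolling_window_sums` is a hypothetical name, and whether it is actually faster would need benchmarking on real data.

```r
rolling_window_sums <- function(x, w = 3) {
  # s[j] = sum(x[j:(j + w - 1)]), obtained as a difference of cumulative sums
  s <- diff(c(0, cumsum(x)), lag = w)
  sapply(seq_len(w), function(i) {
    if (i > length(s)) return(0)          # assumption: no complete window -> 0
    sum(s[seq(i, length(s), by = w)])     # every w-th rolling sum, starting at i
  })
}

rolling_window_sums(c(10, 2, 4, 23, 10, 5, 20, 15, 13, 21))
```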

Related

Replacing NA with mean using loop in R

I have to solve this problem using loop in R (I am aware that you can do it much more easily without loops, but it is for school...).
So I have vector with NAs like this:
trades<-sample(1:500,150,T)
trades<-trades[order(trades)]
trades[sample(10:140,25)]<-NA
and I have to create a FOR loop that will replace NAs with mean from 2 numbers before the NA and 2 numbers that come after the NA.
This I am able to do, with loop like this:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T) {
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]), na.rm = T)
}
}
But there is another part to the homework. If there is NA within the 2 previous or 2 following numbers, then you have to replace the NA with mean from 4 previous numbers and 4 following numbers (I presume with removing the NAs). But I just am not able to crack it... I have the best results with this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T && is.na(trades[c(i-1:2)]==T || is.na(trades[c(i+1:2)]==T))) {
trades[i] <- mean(c(trades[c(i-1:4)], trades[c(i+1:4)]), na.rm = T)
}else if (is.na(trades[i])==T){
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]))
}
}
But it still misses some NAs.
Thank you for your help in advance.
We can use na.approx from zoo
library(zoo)
na.approx(trades)
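Note that na.approx fills gaps by linear interpolation between the nearest non-missing values, which is not necessarily the same as the mean of two values on each side that the assignment asks for. A base R sketch of the interpolation idea using `stats::approx`, on a made-up vector rather than the question's data:

```r
x <- c(10, NA, 30, 40, NA, NA, 100)   # made-up example data
idx <- seq_along(x)
ok <- !is.na(x)

# Interpolate linearly over the positions, using only the known points.
filled <- approx(idx[ok], x[ok], xout = idx)$y
filled
```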
Here is another solution using a loop. I took a shortcut by using lead and lag from dplyr. First we use 2 recursive functions to calculate the lead and lag sums. Then we use conditional statements to determine whether any of the neighbouring data are missing. Lastly, we fill the missing data using either the output of the recursive functions or the mean of the previous and following 4 values (with NA removed). I would note that this is not the way that I would go about this issue, but I tried it out with a loop as requested.
library(dplyr)
r.lag <- function(x, n){
if (n == 1) return(lag(x = x, n = 1))
else return( lag(x = x, n = n) + r.lag(x = x, n = n-1))
}
r.lead <- function(x, n){
if (n == 1) return(lead(x = x, n = 1))
else return( lead(x = x, n = n) + r.lead(x = x, n = n-1))
}
lead.vec <- r.lead(trades, 2)
lag.vec <- r.lag(trades, 2)
output <- vector(length = length(trades))
for(i in 1:length(trades)){
if(!is.na(trades[[i]])){
output[[i]] <- trades[[i]]
}
else if(is.na(trades[[i]]) & !is.na(lead.vec[[i]]) & !is.na(lag.vec[[i]])){
output[[i]] <- (lead.vec[[i]] + lag.vec[[i]])/4
}
else
output[[i]] <- mean(
c(trades[[i-4]], trades[[i-3]], trades[[i-2]], trades[[i-1]],
trades[[i+4]], trades[[i+3]], trades[[i+2]], trades[[i+1]]),
na.rm = T
)
}
tibble(
original = trades,
filled = output
)
#> # A tibble: 150 x 2
#> original filled
#> <int> <dbl>
#> 1 7 7
#> 2 7 7
#> 3 12 12
#> 4 18 18
#> 5 30 30
#> 6 31 31
#> 7 36 36
#> 8 NA 40
#> 9 43 43
#> 10 50 50
#> # … with 140 more rows
So it seems that posting to StackOverflow helped me solve the problem.
trades<-sample(1:500,25,T)
trades<-trades[order(trades)]
trades[sample(1:25,5)]<-NA
which gives us:
[1] NA 20 24 30 NA 77 188 217 238 252 264 273 296 NA 326 346 362 368 NA NA 432 451 465 465 490
and if you run this loop:
for (i in 1:length(trades)) {
  if (is.na(trades[i])== T) {
    test1 <- c(trades[c(i+1:2)])
    if (any(is.na(test1))==T) {
      test2 <- c(trades[abs(c(i-1:4))], trades[c(i+1:4)])
      trades[i] <- round(mean(test2, na.rm = T),0)
    }else {
      test3 <- c(trades[abs(c(i-1:2))], trades[c(i+1:2)])
      trades[i] <- round(mean(test3, na.rm = T),0)
    }
  }
}
it changes the NAs to this:
[1] 22 20 24 30 80 77 188 217 238 252 264 273 296 310 326 346 362 368 387 410 432 451 465 465 490
So it works pretty much as expected.
Thank you for all your help.
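The loop above only looks ahead for further NAs and relies on abs() to keep indices in range near the start of the vector. A sketch of an alternative that clamps the index range explicitly and checks both sides is shown below. `fill_na_neighbours` is a hypothetical name, and reading neighbours from the original vector rather than from already-filled values is a design assumption, so results can differ slightly from the output above.

```r
fill_na_neighbours <- function(x) {
  n <- length(x)
  out <- x
  for (i in seq_along(x)) {
    if (!is.na(x[i])) next
    # up to 2 neighbours on each side, clamped to the vector bounds
    near <- setdiff(max(1, i - 2):min(n, i + 2), i)
    if (anyNA(x[near])) {
      # a neighbour is missing too: widen to 4 on each side
      near <- setdiff(max(1, i - 4):min(n, i + 4), i)
    }
    out[i] <- mean(x[near], na.rm = TRUE)
  }
  out
}

fill_na_neighbours(c(NA, 20, 24, 30, NA, 77, 188))
```

Wrap the result in round() if whole numbers are required, as in the loop above.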

Finding the overlap between two data frames in R, how can I make my code more efficient?

I have two data frames in R. The first has two columns, "chr" and "position"; the second has three columns, "chr", "start" and "end". I want to select those rows in the first data frame whose chr value matches a row in the second data frame and whose "position" lies within that row's start-end interval.
For that I have written a function in R that gives me the desired output but it is very slow when I run it with huge data frames.
# My DataFrames are:
bed <- data.frame(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
x1 = c(5,20,44,67,5,20,44,20),
x3=c(12,43,64,94,12,43,64,63))
snv <- data.frame(Chr = c(rep("chr1",6),rep("chr3",6)),
position = c(5,18,46,60,80,90,21,60,75,80,84,87))
# My function is:
get_overlap <- function(df, position, chrom){
  overlap <- FALSE
  for (row in 1:nrow(df)){
    chr = df[row, 1]
    start = df[row, 2]
    end = df[row, 3]
    if(chr == chrom & position %in% seq(start, end)){
      overlap <- TRUE
    }
  }
  return(overlap)
}
# The code is:
overlap_vector = c()
for (row in 1:nrow(snv)){
chrom = snv[row, 1]
position = snv[row, 2]
overlap <- get_overlap(bed, position, chrom)
overlap_vector <- c(overlap_vector, overlap)
}
print(snv[overlap_vector,])
How can I make this more efficient? I have never worked with hash tables, can that be the answer?
I'm sure there's a more elegant data.table solution, but this works. First I load the package.
# Load package
library(data.table)
Then, I define the data tables
# Define data tables
bed <- data.table(Chr = c(rep("chr1",4),rep("chr2",3),rep("chr3",1)),
start = c(5,20,44,67,5,20,44,20),
end = c(12,43,64,94,12,43,64,63))
snv <- data.table(Chr = c(rep("chr1",6),rep("chr3",6)),
position = c(5,18,46,60,80,90,21,60,75,80,84,87))
Here, I do a non-equi join on position and start/end, and an equi join on Chr. I assume you want to keep all columns, so I specified them in the j argument and omitted the rows without matches.
na.omit(bed[snv,
.(Chr, start = x.start, end = x.end, position = i.position),
on = c("start <= position", "end >= position", "Chr == Chr")])
#> Chr start end position
#> 1: chr1 5 12 5
#> 2: chr1 44 64 46
#> 3: chr1 44 64 60
#> 4: chr1 67 94 80
#> 5: chr1 67 94 90
#> 6: chr3 20 63 21
#> 7: chr3 20 63 60
Created on 2019-08-21 by the reprex package (v0.3.0)
Edit
A quick benchmark shows that Nathan's solution is about twice as fast!
Unit: milliseconds
expr min lq mean median uq max neval
NathanWren() 1.684392 1.729557 1.819263 1.751520 1.787829 5.138546 100
Lyngbakr() 3.336902 3.395528 3.603376 3.441933 3.496131 7.720925 100
The data.table package is great for fast merging of tables. It also comes with a vectorized between function for just this type of task.
library(data.table)
# Convert the data.frames to data.tables
setDT(bed)
setDT(snv)
# Use the join syntax for data.table, then filter for the desired rows
overlap_dt <- bed[
snv,
on = "Chr",
allow.cartesian = TRUE # many-to-many matching
][
between(position, lower = x1, upper = x3)
]
overlap_dt
# Chr x1 x3 position
# 1: chr1 5 12 5
# 2: chr1 44 64 46
# 3: chr1 44 64 60
# 4: chr1 67 94 80
# 5: chr1 67 94 90
# 6: chr3 20 63 21
# 7: chr3 20 63 60
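For comparison, the same join-then-filter idea can be sketched in base R with merge(). This is only an illustration on the question's example data; because it pairs every SNV with every interval on the same chromosome before filtering, it may need much more memory than the data.table approaches on large inputs. The names `pairs`, `hits` and `snv_hits` are illustrative.

```r
bed <- data.frame(Chr = c(rep("chr1", 4), rep("chr2", 3), "chr3"),
                  x1 = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3 = c(12, 43, 64, 94, 12, 43, 64, 63))
snv <- data.frame(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))

# Pair every SNV with every interval on the same chromosome ...
pairs <- merge(snv, bed, by = "Chr")
# ... then keep the pairs where the position lies inside the interval.
hits <- pairs[pairs$position >= pairs$x1 & pairs$position <= pairs$x3, ]
# Reduce to the distinct SNV rows, which is what the original loop returns.
snv_hits <- unique(hits[, c("Chr", "position")])
snv_hits
```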

If() statement in R

I am not very experienced with if statements and loops in R.
Perhaps you can help me solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
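The same idea can be sketched in base R, which may make the logic easier to follow: compute how far the sum is from 450, then add 1 to that many rows, picked by largest z. The names `target`, `deficit` and `n_bump` are illustrative, and capping the number of incremented rows at nrow(df) (at most one increment per row) is an assumption.

```r
df <- data.frame(ID_PP = c(3, 6, 22, 30, 1234456),
                 z  = c(12325, 21698, 21725, 8378, 18979),
                 fz = c(134, 67, 70, 88, 88))

target <- 450
deficit <- target - sum(df$fz)          # how many +1 steps are still needed
df$new_value <- df$fz

if (deficit > 0) {
  n_bump <- min(deficit, nrow(df))      # assumption: at most one +1 per row
  top <- order(df$z, decreasing = TRUE)[seq_len(n_bump)]
  df$new_value[top] <- df$new_value[top] + 1
}
df
```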
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

Sum and place it elsewhere in R

I have one column with 950 numbers. I want to sum rows 1:40 and place the result in a new column on row 50, then sum rows 2:41 and place it on row 51 in the new column, and so on. How do I do this?
You can use the function RcppRoll::roll_sum()
Hope this helps:
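r <- 50 # target row for the first sum
df1 <- data.frame(c1 = 1:951)
v1 <- RcppRoll::roll_sum(df1$c1, n=40)
# pad with r - 1 NAs so that the first sum (rows 1:40) lands on row r
df1$c2 <- c(rep(NA, r - 1), v1[1:(nrow(df1) - r + 1)])
View(df1) # in RStudio
The last few sums, whose target row would fall past the end of the data frame, are ignored here; you decide what should happen with them.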
You can use RcppRoll::roll_sum() and dplyr::lag()...
df <- data.frame(v = 1:950)
library(dplyr)
library(RcppRoll)
range <- 40 # how many values to sum, i.e. window size
offset <- 10 # e.g sum(1:40) goes to row 50
df <- mutate(df, roll_sum = RcppRoll::roll_sum(lag(v, n = offset),
n = range, fill = NA, align = "right"))
df[(range+offset):(range+offset+5), ]
# v roll_sum
# 50 50 820
# 51 51 860
# 52 52 900
# 53 53 940
# 54 54 980
# 55 55 1020
sum(1:range); sum(2:(range+1))
# [1] 820
# [1] 860
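If you prefer to avoid extra packages, a base R sketch of the same placement logic using cumulative sums follows. The variable names are illustrative, and dropping sums whose target row falls past the end of the column is an assumption about what is wanted.

```r
v <- 1:950
win <- 40      # how many values to sum
offset <- 10   # sum of rows 1:40 goes to row 40 + 10 = 50

# sums[k] = sum(v[k:(k + win - 1)]), as a difference of cumulative sums
sums <- diff(c(0, cumsum(v)), lag = win)

# Row that receives sums[k]; keep only targets inside the column.
target <- seq_along(sums) + win - 1 + offset
keep <- target <= length(v)

out <- rep(NA_real_, length(v))
out[target[keep]] <- sums[keep]
out[49:51]
```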

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in Excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from today looking back 3 lines. It then divides this sum by the total sum of columns B + C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If I could get [1] to sum the previous 3-line range and
if I could get [2] to also sum the previous 3-line range.
I am still learning how to apply forward and look-back periods within R, so any additional explanation in the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
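To unpack what embed does in approach (4), here is a step-by-step sketch on the first four rows of the input. It computes the two rolling sums separately and forms the ratio directly instead of using prop.table; the intermediate names are illustrative.

```r
DF <- data.frame(adv = c(69, 113, 270, 74), dec = c(376, 293, 150, 371))

# Pad with two NAs so the result has one row per original row.
# Each row of `lagged` holds the current value and the two before it.
lagged <- embed(c(NA, NA, DF$adv), 3)
adv_sum <- rowSums(lagged)                       # NA for the first two rows
dec_sum <- rowSums(embed(c(NA, NA, DF$dec), 3))

frac <- adv_sum / (adv_sum + dec_sum)
frac
```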
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
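One detail worth spelling out: the denominator in the sapply call above is wrapped in parentheses, and that matters. The line of code in the question writes x[1] / x[1] + x[2], and since division binds tighter than addition in R, that evaluates to 1 + x[2]. A small check using the three-row sums for row 3 of the example data:

```r
adv <- 452   # 69 + 113 + 270, the adv sum over rows 1 to 3
dec <- 819   # 376 + 293 + 150, the dec sum over the same rows

wrong <- adv / adv + dec     # parsed as (adv / adv) + dec
right <- adv / (adv + dec)   # the intended ratio

c(wrong = wrong, right = right)
```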
