Generating a sequence where the gap between numbers increases - r

I need to generate a sequence in R where the gap between elements increases each time.
Seq1:
1, 49, 100, 154, ..., 19306
Seq2:
48, 99, 153, 210, ..., 19650
Note that the gap between Seq1 elements increases by 3 each time: 49 - 1 = 48, 100 - 49 = 51, 154 - 100 = 54, ...
The gap between Seq2 elements also increases by 3 each time: 99 - 48 = 51, 153 - 99 = 54, ...

Given the advice from #Dason:
seq1 <- seq(48, 19306, 3)            # the gaps: 48, 51, 54, ...
which(cumsum(seq1) == 19650)         # how many gaps are needed to reach 19650
seq2 <- cumsum(seq1)[1:100]          # Seq2: 48, 99, 153, 210, ..., 19650
seq3 <- seq(47, 19306, 3)            # offsets between Seq2 and Seq1: 47, 50, 53, ...
seq4 <- seq2 - seq3[1:100]           # Seq1: 1, 49, 100, 154, ..., 19306
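For reference, both target sequences can also be built directly as cumulative sums of their gap vectors; this is just a sketch of the same idea (the names gaps1/gaps2 are illustrative, not from the original advice):
## Seq1 starts at 1 and its gaps are 48, 51, 54, ...
gaps1 <- seq(48, by = 3, length.out = 99)
Seq1 <- cumsum(c(1, gaps1))    # 1, 49, 100, 154, ..., 19306
## Seq2 starts at 48 and its gaps are 51, 54, 57, ...
gaps2 <- seq(51, by = 3, length.out = 99)
Seq2 <- cumsum(c(48, gaps2))   # 48, 99, 153, 210, ..., 19650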

Related

How to get the recombined vectors efficiently?

Given two axes, both with positions from 1 to N (N can be several million, but we assume N = 1000 here),
there are two vectors recording the positions of some points on the two axes, respectively. For example:
chrm1 <- c(1, 35, 456, 732) # 4 points on axis 1 at positions 1, 35, 456, 732
chrm2 <- c(23, 501, 980)
If a recombination occurs at position 300, the points beyond 300 on each axis switch to the other axis.
The two vectors recording the positions of the points then become:
chrm1 <- c(1, 35, 501, 980)
chrm2 <- c(23, 456, 732)
If a second recombination occurs at 600, the new vectors will be:
chrm1 <- c(1, 35, 501, 732)
chrm2 <- c(23, 456, 980)
The real data looks like this:
set.seed(1)
chrm1 <- sample.int(1e8, 50)
chrm2 <- sample.int(1e8, 50)
breaks.site <- sample.int(1e8, 5)
My brute-force way (the loop below) was to swap points into the other vector for each break site. But this is quite slow, because I have to do this 2 x 1000 x 20000 times.
How to get the recombined vectors efficiently?
for (i in breaks.site) {
  chrm1.new <- c(chrm1[chrm1 < i], chrm2[chrm2 > i])
  chrm2.new <- c(chrm1[chrm1 > i], chrm2[chrm2 < i])
  chrm1 <- chrm1.new
  chrm2 <- chrm2.new
}
Background about recombination:
https://en.wikipedia.org/wiki/Genetic_recombination
Maybe this:
chrm1 <- c(1, 35, 456, 732)
chrm2 <- c(23, 501, 980)
breaks <- c(300, 600)
#check all points for all breaks,
#get sum of position changes and
#calculate x mod 2
changepos1 <- rowSums(outer(chrm1, breaks, ">")) %% 2
changepos2 <- rowSums(outer(chrm2, breaks, ">")) %% 2
#assemble results and sort
res1 <- sort(c(chrm1[!changepos1], chrm2[as.logical(changepos2)]))
#[1] 1 35 501 732
res2 <- sort(c(chrm2[!changepos2], chrm1[as.logical(changepos1)]))
#[1] 23 456 980
If outer needs too much memory due to the size of your problem, you can use a loop instead.
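That loop might look roughly like this (an untested sketch using the same parity idea: a point switches axes when an odd number of break points lie below it; count_breaks is just an illustrative helper name):
## count, for each point, how many break points lie below it;
## an odd count means the point ends up on the other axis
count_breaks <- function(x, breaks) {
  n <- integer(length(x))
  for (b in breaks) n <- n + (x > b)
  n %% 2
}
changepos1 <- count_breaks(chrm1, breaks)
changepos2 <- count_breaks(chrm2, breaks)
res1 <- sort(c(chrm1[changepos1 == 0], chrm2[changepos2 == 1]))
res2 <- sort(c(chrm2[changepos2 == 0], chrm1[changepos1 == 1]))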

R - sequence with increasing interval

Normal sequences with fixed intervals can be created using seq(from, to, by = ).
Is there a way to create a sequence with increasing intervals, like the one below?
seq1 = c(2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192)
Here each vector element is 2^index.
We can just use
## note the parentheses: `2 ^ 1:13` would be parsed as `(2 ^ 1):13`
2 ^ (1:13)
or
2 ^ seq(1, 13, 1)
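If the intervals grow in some other pattern (say arithmetically rather than doubling), the same idea of building a sequence from its gaps with cumsum applies; a small sketch with an illustrative starting value and gap pattern:
## sequence whose gaps are 2, 5, 8, 11, ... (each gap 3 larger than the last)
gaps <- seq(2, by = 3, length.out = 9)
cumsum(c(2, gaps))
## [1]   2   4   9  17  28  42  59  79 102 128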

Find best match of two vectors with different length in R

From two vectors (a, b) with increasing integers and possibly different lengths, I want to extract two vectors, both with length n (= 5), that lead to the smallest difference when subtracting a from b.
Example
a<-c(25, 89, 159, 224, 292, 358)
b<-c(1, 19, 93, 155, 230, 291)
Subtracting the following elements leads to the smallest difference:
c(25-19, 89-93, 159-155, 224-230, 292-291)
From a, 358 is excluded
From b, 1 is excluded.
The Problem:
The length of the vectors can vary:
Examples
a<-c(25, 89, 159, 224, 292, 358)
b<-c(19, 93, 155, 230, 291)
a<-c(25, 89, 159, 224, 292, 358, 560)
b<-c(19, 93, 155, 230, 291)
a<-c(25, 89, 159, 224, 292, 358)
b<-c(1 , 5, 19, 93, 155, 230, 291)
Because I have to find this “best match” for >1000 vectors, I would like to build a function that takes as input the two vectors with different lengths and gives me as output the two vectors with length n = 5 that lead to the smallest difference.
This works by brute force. The columns of combn.a and combn.b are the combinations of 5 elements from a and b. Each row of the two-column data frame g is a pair of column numbers of combn.a and combn.b, respectively. f evaluates the sum of absolute differences of the a and b subsets corresponding to a row r of g. v holds the distance values found, one per row of g, with ix being the row number in g having the least distance. From g[ix, ] we get the column numbers of the minimizer in combn.a and combn.b, and from those we determine the corresponding a and b subsets.
align5 <- function(a, b) {
  combn.a <- combn(a, 5)
  combn.b <- combn(b, 5)
  g <- expand.grid(a = 1:ncol(combn.a), b = 1:ncol(combn.b))
  f <- function(r) sum(abs(combn.a[, r[1]] - combn.b[, r[2]]))
  v <- apply(g, 1, f)
  ix <- which.min(v)
  rbind(combn.a[, g[ix, 1]], combn.b[, g[ix, 2]])
}
# test
a <- c(25, 89, 159, 224, 292, 358)
b <- c(1, 19, 93, 155, 230, 291)
align5(a, b)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 25 89 159 224 292
## [2,] 19 93 155 230 291
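The subset size is hard-coded to 5 above. Since the question mentions doing this for more than 1000 vector pairs, here is a hedged sketch of the same brute-force idea with the subset size as an argument (align_n is an illustrative name; the combinatorial cost still grows quickly with the input lengths):
align_n <- function(a, b, n = 5) {
  combn.a <- combn(a, n)
  combn.b <- combn(b, n)
  g <- expand.grid(a = seq_len(ncol(combn.a)), b = seq_len(ncol(combn.b)))
  v <- apply(g, 1, function(r) sum(abs(combn.a[, r[1]] - combn.b[, r[2]])))
  ix <- which.min(v)
  rbind(combn.a[, g[ix, 1]], combn.b[, g[ix, 2]])
}
align_n(c(25, 89, 159, 224, 292, 358, 560), c(19, 93, 155, 230, 291))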

R: Average nearby elements in a vector

I have many vectors such as this: c(28, 30, 50, 55, 99, 102), and I would like to obtain a new vector where elements differing by less than 10 from one another are averaged. In this case, I would like to obtain c(29, 52.5, 100.5).
Another way
vec <- c(28, 30, 50, 55, 99, 102)
## a new group starts wherever the gap to the previous element exceeds 10
indx <- cumsum(c(0, diff(vec)) > 10)
tapply(vec, indx, mean)
# 0 1 2
# 29.0 52.5 100.5
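If this has to be applied to many vectors, the same idiom can be wrapped in a small helper (a sketch; avg_nearby and the threshold argument are illustrative, not part of the original answer):
avg_nearby <- function(x, threshold = 10) {
  grp <- cumsum(c(0, diff(x)) > threshold)
  as.vector(tapply(x, grp, mean))
}
avg_nearby(c(28, 30, 50, 55, 99, 102))
## [1]  29.0  52.5 100.5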

Elegant, Fast Way to Perform Rolling Sum By List of Variables

Has anyone developed an elegant, fast way to perform a rolling sum by date? For example, if I wanted to create a rolling 180-day total for the following dataset by Cust_ID, is there a way to do it faster (for instance, something in data.table)? I have been using the following example to calculate the rolling sum, but I am afraid it is far too inefficient.
library("zoo")
library("plyr")
library("lubridate")
##Make some sample variables
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
##Combine into one dataset
Example.Data <- data.frame(Trans_Dates,Cust_ID,Target)
##Create extra variable with 180 day rolling sum
Example.Data2 <- ddply(Example.Data, .(Cust_ID),
function(datc) adply(datc, 1,
function(x) data.frame(Target_Running_Total =
sum(subset(datc, Trans_Dates>(as.Date(x$Trans_Dates)-180) & Trans_Dates<=x$Trans_Dates)$Target))))
#Print new data
Example.Data2
Assuming that your panel is more or less balanced, I suspect that expand.grid and ave will be pretty fast (you'll have to benchmark with your data to be sure). I use expand.grid to fill in the missing days so that I can naively take a rolling sum with cumsum, then subtract all but the most recent 180 days with head.
(As a question for you, and more skilled R users: why does my identical call always fail?)
I build on the same data.
full <- expand.grid(seq(from = min(Example.Data$Trans_Dates),
                        to = max(Example.Data$Trans_Dates), by = 1),
                    unique(Example.Data$Cust_ID))
Example.Data3 <- merge(Example.Data, full,
                       by.x = c("Trans_Dates", "Cust_ID"),
                       by.y = c("Var1", "Var2"), all = TRUE)
Example.Data3 <- Example.Data3[with(Example.Data3, order(Cust_ID, Trans_Dates)), ]
Example.Data3$Target.New <- ifelse(is.na(Example.Data3$Target), 0, Example.Data3$Target)
Example.Data3$Target_Running_Total <-
  ave(Example.Data3$Target.New, Example.Data3$Cust_ID,
      FUN = function(x) cumsum(x) - c(rep(0, 180), head(cumsum(x), -180)))
Example.Data3$Target.New <- NULL
Example.Data3 <- Example.Data3[complete.cases(Example.Data3), ]
row.names(Example.Data3) <- seq(nrow(Example.Data3))
Example.Data3
identical(Example.Data2$Target_Running_Total, Example.Data3$Target_Running_Total)
sum(Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
(Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
Which yields the following.
> (Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
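One plausible explanation for the failing identical() call, even though every element-wise difference is zero, is a storage-mode mismatch: identical() distinguishes integer from double even when the values agree, while all.equal() compares values only. A quick check (a sketch, not part of the original post):
typeof(Example.Data2$Target_Running_Total)
typeof(Example.Data3$Target_Running_Total)
all.equal(as.numeric(Example.Data2$Target_Running_Total),
          as.numeric(Example.Data3$Target_Running_Total))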
I think I stumbled upon an answer that is fairly efficient.
set.seed(1)
Trans_Dates <- as.Date(c(31, 33, 65, 96, 150, 187, 210, 212, 240, 273, 293, 320,
                         32, 34, 66, 97, 151, 188, 211, 213, 241, 274, 294, 321,
                         33, 35, 67, 98, 152, 189, 212, 214, 242, 275, 295, 322),
                       origin = "2010-01-01")
Cust_ID <- c(rep(1, 12), rep(2, 12), rep(3, 12))
Target <- rpois(36, 3)
## Make simulated data into a data.table
library(data.table)
data <- data.table(Cust_ID, Trans_Dates, Target)
## Assign each customer a number that ranks them
data[, Cust_No := .GRP, by = c("Cust_ID")]
## Create a "list" of comparison dates and values per customer
Ref <- data[, list(Compare_Value = list(I(Target)),
                   Compare_Date = list(I(Trans_Dates))), by = c("Cust_No")]
## Compare the two lists and check whether each comparison date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -180) * Ref$Compare_Value[[NUM]])
}, RD = data$Trans_Dates, NUM = data$Cust_No)
## Print out data
data <- data[, list(Cust_ID, Trans_Dates, Target, Roll.Val)][order(Cust_ID, Trans_Dates)]
data
library(data.table)
set.seed(1)
data <- data.table(Cust_ID = c(rep(1, 12), rep(2, 12), rep(3, 12)),
                   Trans_Dates = as.Date(c(31, 33, 65, 96, 150, 187, 210,
                                           212, 240, 273, 293, 320, 32, 34,
                                           66, 97, 151, 188, 211, 213, 241,
                                           274, 294, 321, 33, 35, 67, 98,
                                           152, 189, 212, 214, 242, 275,
                                           295, 322),
                                         origin = "2010-01-01"),
                   Target = rpois(36, 3))
data[, RollingSum := {
  d <- data$Trans_Dates - Trans_Dates
  sum(data$Target[Cust_ID == data$Cust_ID & d <= 0 & d >= -180])
}, by = list(Trans_Dates, Cust_ID)]
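On data.table versions that support non-equi joins (1.9.8 and later), the same per-customer 180-day window can also be expressed as a self-join; the following is an untested sketch of that idea, not one of the original answers (RollingSum2 and windows are illustrative names):
## one [date - 180, date] window per row, then sum Target over all rows
## of the same customer whose dates fall inside that window
windows <- data[, .(Cust_ID, lo = Trans_Dates - 180, hi = Trans_Dates)]
data[, RollingSum2 := data[windows,
                           on = .(Cust_ID, Trans_Dates >= lo, Trans_Dates <= hi),
                           sum(Target), by = .EACHI]$V1]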

Resources