Find best match of two vectors with different lengths in R

From two vectors (a, b) of increasing integers, possibly of different lengths, I want to extract two vectors, both of length n (= 5), that give the smallest total absolute difference when b is subtracted from a element-wise.
Example
a<-c(25, 89, 159, 224, 292, 358)
b<-c(1, 19, 93, 155, 230, 291)
Subtracting the following elements leads to the smallest difference:
c(25-19, 89-93, 159-155, 224-230, 292-291)
From a, 358 is excluded.
From b, 1 is excluded.
The Problem:
The length of the vectors can vary:
Examples
a<-c(25, 89, 159, 224, 292, 358)
b<-c(19, 93, 155, 230, 291)
a<-c(25, 89, 159, 224, 292, 358, 560)
b<-c(19, 93, 155, 230, 291)
a<-c(25, 89, 159, 224, 292, 358)
b<-c(1 , 5, 19, 93, 155, 230, 291)
Because I have to find this “best match” for more than 1000 pairs of vectors, I would like to build a function that takes the two vectors of different lengths as input and returns the two vectors of length n = 5 that lead to the smallest difference.

This works by brute force. The columns of combn.a and combn.b are the combinations of 5 elements from a and b. Each row of the two-column data frame g is a pair of column numbers of combn.a and combn.b, respectively. f evaluates the sum of absolute differences between the a and b subsets corresponding to a row r of g. v holds the distance values found, one per row of g, with ix being the row number in g having the least distance. From g[ix, ] we get the column numbers of the minimizer in combn.a and combn.b, and from those we determine the corresponding a and b subsets. Note that the search examines choose(length(a), 5) * choose(length(b), 5) pairs, which is tiny here but grows quickly with longer inputs.
align5 <- function(a, b) {
  combn.a <- combn(a, 5)   # all 5-element subsets of a, one per column
  combn.b <- combn(b, 5)   # all 5-element subsets of b
  g <- expand.grid(a = 1:ncol(combn.a), b = 1:ncol(combn.b))  # all column pairs
  f <- function(r) sum(abs(combn.a[, r[1]] - combn.b[, r[2]]))
  v <- apply(g, 1, f)      # distance for each pair of subsets
  ix <- which.min(v)       # row of g with the least distance
  rbind(combn.a[, g[ix, 1]], combn.b[, g[ix, 2]])
}
# test
a <- c(25, 89, 159, 224, 292, 358)
b <- c(1, 19, 93, 155, 230, 291)
align5(a, b)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 25 89 159 224 292
## [2,] 19 93 155 230 291
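The same function handles the unequal-length examples directly, since combn() only requires each input to have at least five elements. For instance, with the second example pair it should simply drop 358 from a:
a <- c(25, 89, 159, 224, 292, 358)
b <- c(19, 93, 155, 230, 291)
align5(a, b)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   25   89  159  224  292
## [2,]   19   93  155  230  291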

Related

Writing a median function in R

I have been tasked to write my own median function in R, without using the built-in median function. If the count of numbers is even, the two middle values are averaged, as is usual for the median.
Something I could probably do in Java, but I struggle with some of the syntax in R.
Code:
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
sorted=list1[order(list1)]
sorted
n = length(sorted)
n
if(n%2==0) # problem here, implementing mod() and the rest of logic.
Here is a self-written function mymedian:
mymedian <- function(lst) {
  n <- length(lst)
  s <- sort(lst)
  ifelse(n %% 2 == 1, s[(n + 1) / 2], mean(s[n / 2 + 0:1]))
}
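Note that ifelse() is meant for vectorized conditions and evaluates both branches; it works here, but a plain if/else is the more idiomatic choice for a scalar test. A minimal variant (the name mymedian2 is just for illustration):
mymedian2 <- function(lst) {
  n <- length(lst)
  s <- sort(lst)
  if (n %% 2 == 1) s[(n + 1) / 2] else mean(s[n / 2 + 0:1])
}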
Example
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
list2 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47)
mymedian(list1)
mymedian(list2)
such that
> mymedian(list1)
[1] 44.5
> mymedian(list2)
[1] 42
I believe this should get you the median you're looking for:
homemade_median <- function(vec){
  sorted <- sort(vec)
  n <- length(sorted)
  if (n %% 2 == 0) {
    # even length: average the two middle values
    mid <- sorted[c(floor(n/2), floor(n/2) + 1)]
    med <- sum(mid) / 2
  } else {
    # odd length: take the middle value
    med <- sorted[ceiling(n/2)]
  }
  med
}
homemade_median(list1)
median(list1) # for comparison
A short function that does the trick:
my_median <- function(x){
  # order vector ascending
  x <- sort(x)
  if ((length(x) %% 2) == 0) {
    # for even length, average the two middle values
    return((x[length(x)/2] + x[length(x)/2 + 1]) / 2)
  } else {
    # for odd length, take the value that's right in the center
    return(x[(length(x)/2) + 0.5])
  }
}
Check to see if it returns desired outcomes:
my_median(list1)
44.5
median(list1)
44.5
#
list2 <- c(1,4,5,90,18)
my_median(list2)
5
median(list2)
5
You don't need to test for evenness: you can build an index sequence from floor((length(x)+1)/2) to ceiling((length(x)+1)/2), which contains the single middle position for odd lengths and the two middle positions for even lengths, and take the mean:
x <- rnorm(100)
y <- rnorm(101)
my_median <- function(x)
{
  mid <- seq(floor((length(x) + 1) / 2), ceiling((length(x) + 1) / 2))
  mean(sort(x)[mid])
}
my_median(x)
[1] 0.1682606
median(x)
[1] 0.1682606
my_median(y)
[1] 0.2473015
median(y)
[1] 0.2473015

Generating a sequence where the gap between numbers increases

I need to generate a sequence in R where the gap between elements increases each time.
Seq1:
1, 49, 100, 154, ... 19306
Seq2:
48, 99, 153, 210, ..., 19650
Note the gap between seq1 elements increases by 3 each time. 49-1 = 48, 100-49 =51, 154-100 = 54...
The gap between Seq2 elements also increases by 3 each time 99-48 =51, 153-99 = 54
Following the advice from @Dason:
# the gaps of Seq2 are 48, 51, 54, ...; Seq2 is their cumulative sum
seq1 <- seq(48, 19306, 3)
which(cumsum(seq1) == 19650)   # 100 terms are needed to reach 19650
seq2 <- cumsum(seq1)[1:100]    # Seq2: 48, 99, 153, ..., 19650
# Seq1 is Seq2 minus the shifted offsets 47, 50, 53, ...
seq3 <- seq(47, 19306, 3)
seq4 <- seq2 - seq3[1:100]     # Seq1: 1, 49, 100, ..., 19306
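A more direct construction (a sketch, assuming the 100-term length found above) builds each sequence as the cumulative sum of its own gaps:
n <- 100
seq1.alt <- cumsum(c(1, seq(48, by = 3, length.out = n - 1)))  # 1, 49, 100, ..., 19306
seq2.alt <- cumsum(seq(48, by = 3, length.out = n))            # 48, 99, 153, ..., 19650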

How to get the recombined vectors efficiently?

Given two axes, both with positions from 1 to N (N can be several million, but we assume N = 1000 here),
there are two vectors recording the positions of some points on the two axes, respectively. For example:
chrm1 <- c(1, 35, 456, 732) # 4 points on axis 1 at position 1, 35, 456, 732;
chrm2 <- c(23, 501, 980)
If a recombination occurs at position 300, points beyond 300 on the two axes switch to the other axis.
The two vectors recording the point positions become:
chrm1 <- c(1, 35, 501, 980)
chrm2 <- c(23, 456, 732)
If a second recombination occurs at 600, the new vectors are:
chrm1 <- c(1, 35, 501, 732)
chrm2 <- c(23, 456, 980)
the real data looks like this:
set.seed(1)
chrm1 <- sample.int(1e8, 50)
chrm2 <- sample.int(1e8, 50)
breaks.site <- sample.int(1e8, 5)
My brute-force way is to swap points into the other vector for each break site, as in the loop below. But this is quite slow, because I have to do this 2 x 1000 x 20000 times.
How can I get the recombined vectors efficiently?
for (i in breaks.site) {
  chrm1.new <- c(chrm1[chrm1 < i], chrm2[chrm2 > i])
  chrm2.new <- c(chrm1[chrm1 > i], chrm2[chrm2 < i])
  chrm1 <- chrm1.new
  chrm2 <- chrm2.new
}
background about recombination:
https://en.wikipedia.org/wiki/Genetic_recombination
Maybe this:
chrm1 <- c(1, 35, 456, 732)
chrm2 <- c(23, 501, 980)
breaks <- c(300, 600)
# check every point against every break, count how many breaks
# each point lies beyond, and take that count mod 2:
# an odd count means the point ends up on the other axis
changepos1 <- rowSums(outer(chrm1, breaks, ">")) %% 2
changepos2 <- rowSums(outer(chrm2, breaks, ">")) %% 2
# assemble results and sort
res1 <- sort(c(chrm1[!changepos1], chrm2[as.logical(changepos2)]))
#[1] 1 35 501 732
res2 <- sort(c(chrm2[!changepos2], chrm1[as.logical(changepos1)]))
#[1] 23 456 980
If outer needs too much memory due to the size of your problem, you can use a loop instead, as sketched below.
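A minimal sketch of that loop variant (the helper name recombine is introduced here for illustration); it accumulates crossing counts per point without materializing the outer() matrix:
recombine <- function(chrm1, chrm2, breaks) {
  cross1 <- integer(length(chrm1))
  cross2 <- integer(length(chrm2))
  for (b in breaks) {
    # count how many break sites each point lies beyond
    cross1 <- cross1 + (chrm1 > b)
    cross2 <- cross2 + (chrm2 > b)
  }
  swap1 <- cross1 %% 2 == 1   # odd number of crossings: point switches axis
  swap2 <- cross2 %% 2 == 1
  list(chrm1 = sort(c(chrm1[!swap1], chrm2[swap2])),
       chrm2 = sort(c(chrm2[!swap2], chrm1[swap1])))
}
recombine(c(1, 35, 456, 732), c(23, 501, 980), c(300, 600))
# should reproduce the res1/res2 values above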

R: Average nearby elements in a vector

I have many vectors such as this: c(28, 30, 50, 55, 99, 102), and I would like to obtain a new vector in which runs of elements that differ by less than 10 from one to the next are averaged. In this case, I would like to obtain c(29, 52.5, 100.5).
Another way
vec <- c(28, 30, 50, 55, 99, 102)
# start a new group whenever the gap to the previous element exceeds 10
indx <- cumsum(c(0, diff(vec)) > 10)
tapply(vec, indx, mean)
# 0 1 2
# 29.0 52.5 100.5
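Note that tapply() returns a named array; if a plain numeric vector is needed, coerce the result:
as.numeric(tapply(vec, indx, mean))
# [1]  29.0  52.5 100.5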

Elegant, Fast Way to Perform Rolling Sum By List of Variables

Has anyone developed an elegant, fast way to perform a rolling sum by date? For example, if I wanted to create a rolling 180-day total for the following dataset by Cust_ID, is there a way to do it faster (perhaps something in data.table)? I have been using the following example to calculate the rolling sum, but I am afraid it is far too inefficient.
library("zoo")
library("plyr")
library("lubridate")
##Make some sample variables
set.seed(1)
Trans_Dates <- as.Date(c(31, 33, 65, 96, 150, 187, 210, 212, 240, 273, 293, 320,
                         32, 34, 66, 97, 151, 188, 211, 213, 241, 274, 294, 321,
                         33, 35, 67, 98, 152, 189, 212, 214, 242, 275, 295, 322),
                       origin = "2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
##Combine into one dataset
Example.Data <- data.frame(Trans_Dates,Cust_ID,Target)
##Create extra variable with 180 day rolling sum
Example.Data2 <- ddply(Example.Data, .(Cust_ID), function(datc)
  adply(datc, 1, function(x)
    data.frame(Target_Running_Total =
      sum(subset(datc,
                 Trans_Dates > (as.Date(x$Trans_Dates) - 180) &
                   Trans_Dates <= x$Trans_Dates)$Target))))
#Print new data
Example.Data2
Assuming that your panel is more-or-less balanced, I suspect that expand.grid and ave will be pretty fast (you'll have to benchmark with your data to be sure). I use expand.grid to fill in the missing days so that I can naively take a running sum with cumsum, then subtract everything but the most recent 180 days with head.
(As an aside for more skilled R users: why does my identical() call below always fail?)
I build on your sample data.
full <- expand.grid(seq(from = min(Example.Data$Trans_Dates),
                        to = max(Example.Data$Trans_Dates), by = 1),
                    unique(Example.Data$Cust_ID))
Example.Data3 <- merge(Example.Data, full,
                       by.x = c("Trans_Dates", "Cust_ID"),
                       by.y = c("Var1", "Var2"), all = TRUE)
Example.Data3 <- Example.Data3[with(Example.Data3, order(Cust_ID, Trans_Dates)), ]
# fill the padded days with zero so cumsum works
Example.Data3$Target.New <- ifelse(is.na(Example.Data3$Target), 0, Example.Data3$Target)
# rolling 180-day sum: running total minus the running total from 180 days back
Example.Data3$Target_Running_Total <- ave(Example.Data3$Target.New, Example.Data3$Cust_ID,
  FUN = function(x) cumsum(x) - c(rep(0, 180), head(cumsum(x), -180)))
Example.Data3$Target.New <- NULL
Example.Data3 <- Example.Data3[complete.cases(Example.Data3), ]
row.names(Example.Data3) <- seq(nrow(Example.Data3))
Example.Data3
Example.Data3
identical(Example.Data2$Target_Running_Total, Example.Data3$Target_Running_Total)
sum(Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
(Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
Which yields the following.
> (Example.Data2$Target_Running_Total - Example.Data3$Target_Running_Total)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(The identical() call most likely fails on storage mode rather than values: ifelse(is.na(...), 0, ...) coerces the padded column to double, so the ave() result is double, while the plyr version sums the integer Target and stays integer; identical() distinguishes the two even though every difference is zero.)
I think I stumbled upon an answer that is fairly efficient.
set.seed(1)
Trans_Dates <- as.Date(c(31, 33, 65, 96, 150, 187, 210, 212, 240, 273, 293, 320,
                         32, 34, 66, 97, 151, 188, 211, 213, 241, 274, 294, 321,
                         33, 35, 67, 98, 152, 189, 212, 214, 242, 275, 295, 322),
                       origin = "2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
##Make simulated data into a data.table
library(data.table)
data <- data.table(Cust_ID,Trans_Dates,Target)
## Assign each customer a sequential group number
data[, Cust_No := .GRP, by = c("Cust_ID")]
## Create a per-customer "list" of comparison values and dates
Ref <- data[, list(Compare_Value = list(I(Target)),
                   Compare_Date = list(I(Trans_Dates))), by = c("Cust_No")]
## For each row, sum the customer's targets dated within the past 180 days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -180) * Ref$Compare_Value[[NUM]])
}, RD = data$Trans_Dates, NUM = data$Cust_No)
##Print out data
data <- data[,list(Cust_ID,Trans_Dates,Target,Roll.Val)][order(Cust_ID,Trans_Dates)]
data
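A similar windowed sum can also be written entirely inside a single data.table grouped assignment: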
library(data.table)
set.seed(1)
data <- data.table(Cust_ID = c(rep(1, 12), rep(2, 12), rep(3, 12)),
                   Trans_Dates = as.Date(c(31, 33, 65, 96, 150, 187, 210, 212, 240, 273, 293, 320,
                                           32, 34, 66, 97, 151, 188, 211, 213, 241, 274, 294, 321,
                                           33, 35, 67, 98, 152, 189, 212, 214, 242, 275, 295, 322),
                                         origin = "2010-01-01"),
                   Target = rpois(36, 3))
data[, RollingSum := {
  # for each (date, customer) group, sum this customer's targets
  # dated within the preceding 180 days
  d <- data$Trans_Dates - Trans_Dates
  sum(data$Target[Cust_ID == data$Cust_ID & d <= 0 & d >= -180])
}, by = list(Trans_Dates, Cust_ID)]
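On recent data.table versions (1.9.8 or later), the same window can also be expressed as a non-equi self-join; a sketch to benchmark rather than a drop-in replacement, with the helper column start introduced here for illustration:
data[, start := Trans_Dates - 180]  # lower bound of each row's 180-day window (inclusive)
res <- data[data,
            on = .(Cust_ID, Trans_Dates >= start, Trans_Dates <= Trans_Dates),
            .(Target_Running_Total = sum(Target)), by = .EACHI]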
