Convolution of a piecewise function - r

Suppose I have two functions, f1 and f2, which are defined piecewise from a list of enumerated steps / jump points.
set.seed(1729)
n = 100
x1 = cumsum(runif(n))
x2 = cumsum(runif(n))
val1 = cumsum(runif(n))
val2 = cumsum(runif(n))
f1_list = data.frame(f = val1, x = x1)
f2_list = data.frame(f = val2, x = x2)
For simplicity, let's assume both are right-continuous. The first few values look like
> head(f1_list)
f x
1 0.1371357 0.5852396
2 0.4752026 1.0226336
3 1.0987574 1.5955279
4 1.9413884 1.9487419
5 2.0264764 2.8100133
6 2.3962088 3.2208168
> head(f2_list)
f x
1 0.3294329 0.5373382
2 0.8749826 1.3104701
3 1.5604155 2.0395473
4 1.9325968 2.9311143
5 2.3134223 3.2732812
6 2.4605212 3.6648067
I want to compute the convolution, g(t) = (f1*f2)(t). Programmatically, this does not seem like an easy task, because we need to keep track of all the jump points of f1 and also of the jump points of f2 in reverse, as the definition of convolution requires. What I've tried is
#############
#Say, t = 10#
#############
t = 10
f2_list$x_rev = t - f2_list$x
At which point, I'm stuck, since I don't know how to match up the corresponding intervals of x values for f1_list and f2_list.
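One possible way forward, as a minimal sketch rather than a definitive answer: for a fixed t the integrand f1(s) * f2(t - s) is itself piecewise constant in s, with jumps at the points x1 and t - x2. Merging those two sets of breakpoints and evaluating the integrand once inside each interval gives the integral exactly. The helpers step_fun and conv_at are hypothetical names, and the sketch assumes each function takes the value f[i] on [x[i], x[i+1]) and is zero outside [x[1], x[n]), which the question does not state.

```r
# Hypothetical helper: right-continuous step function taking val[i] on
# [x[i], x[i+1]) and 0 outside [x[1], x[n]) (an assumption, not from the question).
step_fun <- function(x, val) {
  stepfun(x, c(0, val[-length(val)], 0), right = FALSE)
}

# Hypothetical helper: g(t) = integral of f1(s) * f2(t - s) ds.
# The integrand is constant between consecutive breakpoints, so evaluate it
# at the interval midpoints and sum value * width.
conv_at <- function(t, f1, f2, x1, x2) {
  br  <- sort(unique(c(x1, t - x2)))        # jumps of f1(s) and of f2(t - s)
  mid <- (head(br, -1) + tail(br, -1)) / 2  # one point inside each interval
  sum(f1(mid) * f2(t - mid) * diff(br))
}

# Same simulated data as in the question
set.seed(1729)
n <- 100
x1 <- cumsum(runif(n))
x2 <- cumsum(runif(n))
val1 <- cumsum(runif(n))
val2 <- cumsum(runif(n))
f1 <- step_fun(x1, val1)
f2 <- step_fun(x2, val2)
g10 <- conv_at(10, f1, f2, x1, x2)

# Sanity check: two unit boxes on [0, 1) convolve to a triangle
box <- step_fun(c(0, 1), c(1, 1))
conv_at(0.5, box, box, c(0, 1), c(0, 1))  # 0.5
conv_at(1, box, box, c(0, 1), c(0, 1))    # 1
```

To get g on a grid of t values, sapply(t_grid, conv_at, f1 = f1, f2 = f2, x1 = x1, x2 = x2) should work, since t is the first argument.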

Related

Simple linear transformation of variable in R: changing the scope of a variable. How to make it right?

I am trying to change the value range of a variable (array, set of values) while keeping its properties. I don't know the exact name in math, but I mean such a kind of transformation that the variable array has exactly the same properties, the spacing between the values is the same, but the range is different. Maybe the code below will explain what I mean.
I just want to linearly map the values to some other range so that the shape of the distribution stays the same. In other words, I'll just change the range of the variable using the regression equation y = a * x + b. I assume that the transformation is completely linear, so the correlation between the variables is exactly 1, and I calculate the new variable (array) from a regression equation, actually a system of two equations in which I simply substitute the minimum and maximum of both variables:
minimum.y1 = minimum.x1 * a + b
maximum.y2 = maximum.x2 * a + b
from which I can work out the following code to obtain a and b coefficients:
# this is my input variable
x <- c(-1, -0.5, 0, 0.5, 1)
# this is the range i want to obtain
y.pred <- c(1,2,3,4,5)
max_y = 5
min_y = 1
min_x = min(x)
max_x = max(x)
c1 = max_x-min_x
c2 = max_y-min_y
a.coeff = c2/c1
b.coeff = a.coeff-min_x
y = x * a.coeff + b.coeff
y
# hey, it works! :)
[1] 1 2 3 4 5
The correlation between the variable before and after the transformation is exactly 1, so we have a basis for further action. Let's turn it into a function:
linscale.to.int <- function(max.lengt, vector) {
max_y = max.lengt
min_y = 1
min_x = min(vector)
max_x = max(vector)
c1 = max_x-min_x
c2 = max_y-min_y
a.coeff = c2/c1
b.coeff = a.coeff-min_x
return(vector * a.coeff + b.coeff)
}
x <- c(-1, -0.5, 0, 0.5, 1)
linscale.to.int(5,x)
[1] 1 2 3 4 5
and it works again. But here's the thing: when I apply this function to a random distribution, like this:
x.rand <- rnorm(50)
y.rand <- linscale.to.int(5,x.rand)
plot(x.rand, y.rand)
or, more visibly, this:
x.rand <- rnorm(500)
y.rand <- linscale.to.int(20,x.rand)
plot(x.rand, y.rand)
I get values of the second variable that are completely out of range; they should be between 1 and 20, but I get values from about -1 to 15:
And now the question arises - what am I doing wrong here? Where do I go wrong with such a transformation?
What you are trying to do is very straightforward using rescale from the scales package (which you will already have installed if you have ggplot2 / tidyverse installed). Simply give it the new minimum / maximum values:
x <- c(-1, -0.5, 0, 0.5, 1)
scales::rescale(x, c(1, 5))
#> [1] 1 2 3 4 5
If you want to have your own function written in base R, the following one-liner should do what you want:
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1
(Note that it is good practice in R to avoid periods in function names because this can cause confusion with S3 method dispatch)
Testing, we have:
x <- c(-1, -0.5, 0, 0.5, 1)
linscale_to_int(5, x)
#> [1] 1 2 3 4 5
x.rand <- rnorm(50)
y.rand <- linscale_to_int(5, x.rand)
plot(x.rand, y.rand)
y.rand <- linscale_to_int(20, x.rand)
plot(x.rand, y.rand)
Created on 2022-08-31 with reprex v2.0.2
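As for why the original linscale.to.int goes wrong: the slope is right, but the intercept b.coeff = a.coeff - min_x is not the solution of the two equations. Solving min_y = min_x * a + b for b gives b = min_y - a * min_x. With x = c(-1, -0.5, 0, 0.5, 1) and a target range of 1 to 5, both expressions happen to equal 3, which is why the first test passed. A minimal sketch of a corrected version (linscale_fixed and its min_y argument are hypothetical names used only for illustration):

```r
linscale_fixed <- function(max_y, x, min_y = 1) {
  a <- (max_y - min_y) / (max(x) - min(x))  # slope
  b <- min_y - a * min(x)                   # intercept from min_y = a * min(x) + b
  a * x + b
}

linscale_fixed(5, c(-1, -0.5, 0, 0.5, 1))
#> [1] 1 2 3 4 5

set.seed(42)
x_rand <- rnorm(500)
range(linscale_fixed(20, x_rand))  # endpoints are 1 and 20 up to floating-point error
```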

Reverting a linear filter for time series in R

I'm using the stats::filter function in R in order to understand ARIMA simulation (as in the function stats::arima.sim) and estimation. I know that stats::filter applies a linear filter to a vector or time series, but I'm not sure how to "unfilter" my series.
Consider the following example: I want to apply a recursive filter with coefficient 0.7 to my series x = 1:5 (which essentially generates an AR(1) process with phi = 0.7). I can do so with:
x <- 1:5
ar <-0.7
filt <- filter(x, ar, method="recursive")
filt
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1.0000 2.7000 4.8900 7.4230 10.1961
Which returns me essentially c(y1,y2,y3,y4,y5) where:
y1 <- x[1]
y2 <- x[2] + ar*y1
y3 <- x[3] + ar*y2
y4 <- x[4] + ar*y3
y5 <- x[5] + ar*y4
Now imagine I have the y = c(y1,y2,y3,y4,y5) series. How can I use the filter function to return me the original series x = 1:5?
I can write a code to do it like:
unfilt <- rep(NA, 5)
unfilt[1] <- filt[1]
for(i in 2:5){
unfilt[i] <- filt[i] - ar*filt[i-1]
}
unfilt
[1] 1 2 3 4 5
But I do want to use the filter function to do so, instead of writing my own function. How can I do so? I tried stats::filter(filt, -ar, method="recursive"), which returns [1] 1.0000 2.0000 3.4900 4.9800 6.7101, which is not what I want.
stats::filter used with the recursive option is a particular case of an ARMA filter.
a[1]*y[n] + a[2]*y[n-1] + … + a[p+1]*y[n-p] = b[1]*x[n] + b[2]*x[n-1] + … + b[q+1]*x[n-q]
You could implement this filter with the signal package, which allows more options than stats::filter:
a = c(1,-ar)
b = 1
filt_Arma <- signal::filter(signal::Arma(b = b, a = a),x)
filt_Arma
# Time Series:
# Start = 1
# End = 5
# Frequency = 1
# [1] 1.0000 2.7000 4.8900 7.4230 10.1961
identical(filt,filt_Arma)
# [1] TRUE
Reverting an ARMA filter can be done by switching b and a, provided that the inverse filter stays stable (which is the case here):
signal::filter(signal::Arma(b = a, a = b),filt)
# Time Series:
# Start = 2
# End = 6
# Frequency = 1
# [1] 1 2 3 4 5
This corresponds to switching numerator and denominator in the z-transform:
Y(z) = b(z)/a(z) X(z)
X(z) = a(z)/b(z) Y(z)
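If the goal is to stay with stats::filter, the inverse of the AR(1) recursion is a one-sided moving average with coefficients c(1, -ar), that is, method = "convolution" with sides = 1. A minimal sketch, assuming the recursion started from a zero initial value (the default for the recursive method), so a 0 is prepended to the series and the leading NA is dropped afterwards:

```r
x <- 1:5
ar <- 0.7
filt <- stats::filter(x, ar, method = "recursive")

# unfilt[i] = filt[i] - ar * filt[i - 1], with filt[0] taken to be 0
unfilt <- stats::filter(c(0, filt), c(1, -ar),
                        method = "convolution", sides = 1)[-1]
unfilt  # should recover 1:5 up to floating-point error
```

Without the prepended 0, the first element comes back as NA because the one-sided filter has no earlier value to use.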

How to create multiple loops inside a function in r

I want to obtain new values for each step and use them inside an equation for further steps. Normally, I can perform loop but for this problem, I need to use past values, too. For instance, I have flow data like this:
q<-c(10, 15.83333, 21.66667)
I created a loop manually:
z1<-190 #initial elevation
s1<-24011 #initial storage
in1<-q[1] #initial inflow
out1<-1.86*sqrt((z1-110)*19.62) #outflow
z2<- z1+0.3*((in1-out1)/s1) #elevation at second step
in2<-q[2] #second inflow
out2<-1.86*sqrt((z2-110)*19.62) #outflow at z2 elevation
ds2<-0.3*((in1+in2)/2-(out1+out2)/2) #change in storage
s2<-s1+ds2 #net storage value
z3<-z2+0.3*((in2-out2)/s2) #elevation at third step
in3<-q[3]
out3<-1.86*sqrt((z3-110)*19.62)
ds3<-0.3*((in2+in3)/2-(out2+out3)/2)
s3<-s2+ds3
.
.
.
z4<-z3+0.3*((in3-out3)/s3)
Briefly, I am calculating z value using previous values of in,out,s. What I need to find is z values considering q values.
Expected result is:
z q outflows storages
[1,] 190.0000 10.00000 73.68981 24011.00
[2,] 189.9992 15.83333 73.68944 23992.77
[3,] 189.9985 21.66667 73.68911 23976.29
I'll extend my comment here.
You can overwrite the variables. Following your implementation, you could for instance create a temporary variable for your out2:
q = c(10, 15.83333, 21.66667)
#Results storage array
results = array(numeric(),c(length(q),4))
colnames(results) = c("z", "q", "outflows", "storages")
z = 190
s = 24011
infl = q[1]
out = 1.86*sqrt((z-110)*19.62)
#Save init values
results[1,1] = z
results[,2] = q
results[1,3] = out
results[1,4] = s
for (n in 2:length(q)) {
z = z+0.3*((infl-out)/s)
out_tmp = 1.86*sqrt((z-110)*19.62)
ds = 0.3*((infl+q[n])/2-(out+out_tmp)/2)
s = s+ds
infl = q[n]
out = out_tmp
results[n,1] = z
results[n,3] = out
results[n,4] = s
}
View(results)
If you want to avoid creating the temporary variable, you can try something like this:
q = c(10, 15.83333, 21.66667)
results = array(numeric(),c(length(q),4))
colnames(results) = c("z", "q", "outflows", "storages")
z = 190
s = 24011
infl = q[1]
out = 1.86*sqrt((z-110)*19.62)
#Save init values
results[1,1] = z
results[,2] = q
results[1,3] = out
results[1,4] = s
for (n in 2:length(q)) {
z = z+0.3*((infl-out)/s)
out = 1.86*sqrt((z-110)*19.62)
ds = 0.3*((infl+q[n])/2-(results[n-1,3]+out)/2)
s = s+ds
infl = q[n]
results[n,1] = z
results[n,3] = out
results[n,4] = s
}
View(results)
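The same recursion can also be wrapped in a function, which is what the question title asks for. This is only a sketch using the question's constants; the function name simulate_reservoir and the arguments z0, s0 and dt are hypothetical, with defaults copied from the manual calculation. Storing each series in a vector makes the previous step available as index i - 1, so no temporary variables are needed:

```r
simulate_reservoir <- function(q, z0 = 190, s0 = 24011, dt = 0.3) {
  # outflow as a function of elevation, as in the question
  outflow <- function(z) 1.86 * sqrt((z - 110) * 19.62)

  n <- length(q)
  z <- numeric(n)
  out <- numeric(n)
  s <- numeric(n)
  z[1] <- z0
  s[1] <- s0
  out[1] <- outflow(z0)

  for (i in seq_len(n)[-1]) {
    # elevation uses the previous inflow, outflow and storage
    z[i] <- z[i - 1] + dt * (q[i - 1] - out[i - 1]) / s[i - 1]
    out[i] <- outflow(z[i])
    # storage change from averaged inflows and outflows
    s[i] <- s[i - 1] + dt * ((q[i - 1] + q[i]) / 2 - (out[i - 1] + out[i]) / 2)
  }

  data.frame(z = z, q = q, outflows = out, storages = s)
}

res <- simulate_reservoir(c(10, 15.83333, 21.66667))
res
```

The first two rows can be compared against the expected result in the question (z of 190.0000 and 189.9992, storages of 24011.00 and 23992.77).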
1. Initialize the values and create a result table.
2. Add the current state values to the table.
3. Simulate the new state using old or new states.
4. Update all variables to the new state.
library(tidyverse)
data <- tibble(step = numeric(), out = numeric(), y = numeric(), z = numeric())
# Initialization
z <- 190
y <- 1
out <- NA
for (step in seq(5)) {
# save current state
data <- data %>% add_row(step = step, out = out, z = z, y = y)
# use old state of z
new_out <- z / 2
# use old state of y
new_z <- y + 1
# use new state of out
new_y <- new_out
# Lastly, update all new variables
out <- new_out
y <- new_y
z <- new_z
}
data
#> # A tibble: 5 x 4
#> step out y z
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 1 190
#> 2 2 95 95 2
#> 3 3 1 1 96
#> 4 4 48 48 2
#> 5 5 1 1 49
Created on 2021-11-10 by the reprex package (v2.0.1)

optimize R code for min() and sample() by group

I generate a network with npeople(=80), ncomp(=4) components and I want each component to have density equal to dens(=0.2).
I want to optimize 2 lines of the code which take most of the time (especially if I want to have 5k people in the network).
the 2 lines are:
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
I have tried using the lapply() function, but the execution time increased - see below the line of code:
nodes[,lapply(.SD, function(p) min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
rm(list=ls())
library(data.table)
library(intergraph)
library(igraph)
library(Matrix)
library(profvis)
library(ggplot2)
draw.var <- function(n, var1, rho, mean){
C <- matrix(rho, nrow = 2, ncol = 2)
diag(C) <- 1
C <- chol(C)
S <- rnorm(n, mean = mean)
S <- cbind(scale(var1)[1:n],S)
ZS <- S %*% C
return(ZS[,2])
}
set.seed(1123)
profvis({
# create empty list to store data
dt.list <- list()
npeople <- 500
dens <- .2
OC.impact <- FALSE
cor_iv_si <- .6
cor_iv_uc <- 0
cor_uc_oc <- 0.6
ncomp <- 4
beta_oc <- 2 # observed characteristics
beta_uc <- 2 # unobserved characteristics
beta_si <- 1
# create data.table
dt.people <- data.table(ego = 1:npeople)
# draw observed characteristics
dt.people[, OC := abs(rt(npeople,2))]
# draw unobserved variable
dt.people[, UC := draw.var(npeople, dt.people$OC, rho = cor_uc_oc,mean = 5)]
# set component identifier
dt.people$group <- cut_number(dt.people$UC, ncomp,labels = F)
for(q in 1:ncomp){
# subset comp
dt.sub <- dt.people[group == q]
# create undirected graph
nodes <- as.data.table(t(combn(dt.sub$ego, 2)))
setnames(nodes,c("ID","ALTERID"))
# add attributes
nodes <- merge(nodes,dt.people[,list(ID = ego, ID.UC = UC, ID.OC = OC)], by = "ID")
nodes <- merge(nodes,dt.people[,list(ALTERID = ego, ALTERID.UC = UC, ALTERID.OC = OC)], by = "ALTERID")
# calculate distance
nodes[,d := abs(ID.UC - ALTERID.UC)]
# estimate the appropriate density per component
n.edges <- (dens * (npeople * (npeople - 1)))/ncomp
n.nodes <- npeople/ncomp
c.dens <- n.edges/(n.nodes * (n.nodes - 1))
# estimate initial probability of tie based on distance
coefficient <- log(c.dens / (1 - c.dens))
alpha <- coefficient / mean(nodes$d)
nodes[,p := exp(alpha * d) / (1 + exp(alpha * d))]
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
# keep the edges
nodes <- nodes[edge == 1,list(ID,ALTERID)]
# bind the networks
if(q == 1){
net <- copy(nodes)
} else{
net <- rbind(net,nodes)
}
}
# create opposite direction
net <- rbind(net,net[,list(ID = ALTERID, ALTERID = ID)])
})
This incorporates @BenBolker's and @DavidArenburg's suggestions and also uses some of data.table's tools.
Non-Equi joins
The OP code loops through each group. One part of the code also uses combn and multiple joins to get the data in the right format. Using non-equi joins, we can combine all of those steps in one data.table call
dt_non_sub <- dt.people[dt.people,
on = .(ego < ego, group = group),
allow.cartesian = T,
nomatch = 0L,
.(group,
ALTERID = i.ego, ID = x.ego,
ID.UC = UC, ID.OC = OC,
ALTERID.OC = i.OC, ALTERID.UC = i.UC,
d = abs(UC - i.UC)) #added to be more efficient
]
# dt_non_sub[, d:= abs(ID.UC - ALTERID.UC)]
Vectorization
The original code was slow mostly because of the two calls with by groupings. Since each call split the data into around 8,000 individual groups, there were about 8,000 function calls each time. This version eliminates them by using pmin, as suggested by @DavidArenburg, and runif(.N) < p, as suggested by @BenBolker. My addition is that, since your final result doesn't seem to need p, I only assign edge, using {} so that only the last value calculated in the call is returned.
# alpha <- coefficient / mean(nodes$d)
dt_non_sub[,
edge := {
alpha = coefficient / mean(d)
p = exp(alpha * d) / (1 + exp(alpha * d))
p_mean = mean(p)
p = pmin(1, p * (1/(p_mean / c.dens)))
as.numeric(runif(.N)<p)
}
, by = .(group)]
net2 <- rbindlist(list(dt_non_sub[edge == 1, .(group, ALTERID, ID)],
dt_non_sub[edge == 1, .(group, ID = ALTERID, ALTERID = ID)]),
use.names = TRUE) # bind by column name so the second table adds the reversed direction
One thing to note is that the vectorized version is not 100% identical to the original. Your code was recursive: each split updated mean(nodes$p) for the next ID, ALTERID group. If you need that recursive behaviour, there is not much that can be done to make it faster.
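To see the two vectorization tricks in isolation, here is a small base R sketch with made-up numbers (the vector p and the factor k are illustrative and not taken from the network code). pmin caps each element at 1 in a single call, and comparing runif draws against the probabilities yields one Bernoulli draw per element:

```r
p <- c(0.1, 0.6, 0.9)
k <- 1.5  # illustrative rescaling factor

# element-wise cap at 1; min(c(1, p * k)) would instead collapse to one number
p_adj <- pmin(1, p * k)
p_adj  # 0.15 0.90 1.00

# one 0/1 draw per element, equal to 1 with probability p_adj[i]
set.seed(1)
edge <- as.numeric(runif(length(p_adj)) < p_adj)

# for a large sample, the share of ones approaches the probability
share <- mean(runif(1e5) < 0.2)
```

Both calls operate on whole vectors, so no by grouping is needed to get one result per row.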
In the end, the modified code runs in 20 ms vs. the 810 ms of your original function. The results, while different, are somewhat similar in the total number of results:
Original:
net
ID ALTERID
1: 5 10
2: 10 14
3: 5 25
4: 10 25
5: 14 25
---
48646: 498 458
48647: 498 477
48648: 498 486
48649: 498 487
48650: 498 493
Modified
net2
group ALTERID ID
1: 2 4 3
2: 2 6 4
3: 4 7 1
4: 4 8 7
5: 2 9 4
---
49512: 3 460 500
49513: 3 465 500
49514: 3 478 500
49515: 3 482 500
49516: 3 497 500

Using split function in R

I am trying to simulate three small datasets, which contain x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using "split" in R, I get warning messages and the outputs are incorrect. Could someone please give me a hint about what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times; however, I can't figure out what I did wrong. Many thanks in advance.
IND is the global variable built for the full data, sim, so it is longer than the subset. You want to use the subset's own column, e.g.
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
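The recycling problem is easy to reproduce on a toy data frame. The sketch below uses made-up data (df, with six rows and three IND levels) rather than the simulated data from the question: splitting a four-row subset by the six-element global IND uses only the first four elements of IND, so rows land in the wrong groups and the last group comes back empty, while splitting by the subset's own column behaves as intended.

```r
df <- data.frame(id  = 1:6,
                 IND = rep(1:3, each = 2),
                 trt = c("A", "B", "A", "A", "B", "A"))
IND <- df$IND                    # global vector, length 6
Aset <- subset(df, trt == "A")   # 4 rows, with IND values 1, 2, 2, 3

# Mismatched lengths: only IND[1:4] = 1, 1, 2, 2 is used (with a warning)
bad <- suppressWarnings(split(Aset, f = IND))
sapply(bad, nrow)

# Grouping by the column that belongs to the subset
good <- split(Aset, f = Aset$IND)
sapply(good, nrow)
```

A quick check such as length(IND) == nrow(Aset) before calling split would flag the mismatch.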
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Here you can substitute summarise and its following arguments with any function that returns a data.frame or something that can be coerced to one. If a list is the desired output, dlply is your friend.
