Correlation using rolling window on second vector

I'm a bit of an R newbie, and am a little stuck on the way forward to run a correlation on time-series data where the second vector is much longer and I want to use a rolling time window.
My data looks something like this:
set.seed(1)
# "Target sample" (this is always of known fixed length N, e.g. 20 )
target <- data.frame(Date=rep(seq(Sys.Date(),by="1 day",length=20)),Measurement=rnorm(2))
# "Potential Sample" (this is always much longer and of unknown length,e.g. 730 in this example)
potential <- data.frame(Date=rep(seq(Sys.Date()-1095,by="1 day",length=730)),Measurement=rnorm(2))
What I would like to do is take a rolling window of size N (i.e. matching the size of the target sample), incrementing the roll by one day at a time, and then print two columns for each window:
WindowStartDate and the result of cor(target,potentialWindow)
So in pseudo-code (using the generated example above):
Start at Sys.Date()-1095, take window size N values
Print (or,probably better, put in to new data frame) Sys.Date()-1095 and result of cor(target,potentialWindow)
Roll forward +1 day to Sys.Date()-1094, take window size N values
Print (or, probably better, put in to new data frame) Sys.Date()-1094 and result of cor(target,potentialWindow)
etc. etc.
N.B. The roll forward +1 day is obviously a variable that could be tweaked depending on desired overlap.

Here's a way we can do it. Note that in your original example you only specified rnorm(2), which worked because R can recycle arguments, but it's probably not what you wanted. We just need to initialize a few things, and then send it through a for loop.
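For instance, here is that recycling in action (a tiny illustration, separate from the solution below):
# the two draws from rnorm(2) are repeated to fill all six rows
data.frame(Date = 1:6, Measurement = rnorm(2))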
It seems like we can just pull the date you want from the potential data set, but if you want to use the Sys.Date() - X formula, I've shown how to do that as well.
set.seed(1)
# "Target sample" (this is always of known fixed length N, e.g. 20)
target <- data.frame(Date = rep(seq(Sys.Date(), by = "1 day", length = 20)),
                     Measurement = rnorm(20))
# "Potential sample" (this is always much longer and of unknown length, e.g. 730 in this example)
potential <- data.frame(Date = rep(seq(Sys.Date() - 1095, by = "1 day", length = 730)),
                        Measurement = rnorm(730))
# initialize values
N <- 20
len_potential <- nrow(potential) - (N - 1)
time_start <- 1096
result.df <- data.frame(Day = potential[1, 1],
                        Corr = numeric(len_potential),
                        Day2 = potential[1, 1],
                        stringsAsFactors = FALSE)
# use a for loop
for (i in 1:len_potential) {
  result.df[i, 1] <- as.Date(potential[i, 1])
  result.df[i, 2] <- cor(target[, 2], potential[i:(i + N - 1), 2])
  result.df[i, 3] <- Sys.Date() - (time_start - i)
}
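If you'd prefer not to manage the loop yourself, here's a sketch of the same rolling correlation using the zoo package (assuming you're willing to add that dependency):
library(zoo)
roll_cor <- rollapply(potential$Measurement, width = N, align = "left",
                      FUN = function(w) cor(target$Measurement, w))
result2 <- data.frame(Day = potential$Date[seq_along(roll_cor)], Corr = roll_cor)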
Also, as a note on posting questions to SO, it is sometimes helpful to provide the desired output.


Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays, and only the last one is saved in tarr afterwards.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things happening in the MATLAB script that are not in the R code:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]: from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method as Matlab and R handle this type of thing differently.
One glaring thing that stands out to me is that you seem to be treating R: tarr[i] and M: tarr(i, :) as equals, where these are very different. What I think you are trying to achieve is all the columns in a given row i, so in R that would look like tarr[i, ].
Now the use of min is also different: R: min() will return the minimum of the whole matrix (just one number), while M: min() returns the minimum value of each column. So for this in R you can use Rfast::colMins from the Rfast package.
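(If you'd rather avoid the extra dependency, base R's apply can do the same thing; m here stands for whatever matrix you have:)
# column minima without Rfast
apply(m, 2, min)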
The stairs part is something I am not very familiar with, but something like ggplot2::qplot(..., geom = "step") may work.
Now I have tried to create something that works in R but am not sure really what the required output is. But nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick try to achieve something!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while (min(tarr[i, ]) <= T0) {
  # Major alteration: create a temporary row from the previous row in tarr
  temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
  # Join temp row to tarr matrix
  tarr <- rbind(tarr, temp)
  i <- i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem, since in the MATLAB code 0:size(tarr, 1)-1 gives the number of rows less 1, but I can't figure out why, if grabbing colMins (and there are 40 columns), we would create around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since in R T exists as TRUE and is not good to overwrite!
Hope this helps!
I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while (min(tarr[i, ]) <= T0) {
  temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1)  # fixed paren
  tarr <- rbind(tarr, temp)
  i <- i + 1
}
tarr2 <- t(tarr)  # takes the transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom = "step")
Edit: Some extra plotting tweaks -- seems to be closer to the original:
qplot(min_for_each_col, seq_along(min_for_each_col), geom = "step", ylab = "", xlab = "")
# or with ggplot2 (the %>% pipe needs magrittr, which dplyr also attaches)
library(magrittr)
df1 <- cbind(min_for_each_col, 1:length(min_for_each_col)) %>% as.data.frame
colnames(df1)[2] <- "index"
ggplot() +
  geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
  labs(x = "", y = "")
I'm not too familiar with renewal processes or matlab so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints an aleatory number of arrays".
The eighth line seems to overwrite the assignment of tarr with the computation on the right. Thus it takes the single value of tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda) -- which gives 40 different values. These forty different values from the last time through the loop are stored in tarr.
If you want to store all forty values from each time through the loop, I'd suggest storing them in, say, a matrix or a data frame instead. If that's what you want to do, here's an example of storing it in a matrix:
for (i in 1:nrow(yourMatrix)) {
  # computations go here
  yourMatrix[i, ] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop statement.
Then after the loop tarr2 is assigned 1/tarr. Again this will be 40 values from the last time through the loop (line 8)
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
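As a small related aside (just an illustration): the building block -log(runif(n))/lambda used above is the inverse-CDF way of drawing exponential inter-arrival times, which R can also draw directly:
rexp(5, rate = 4)  # five exponential waiting times with rate lambda = 4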
Hope this helped! Let me know if there's more I can do to help.

R unique combinations from given ranges quickly and using less system resource

This is a follow up question from here:
https://stackoverflow.com/a/55912086/3988575
I have a dataset like this:
ID=as.character(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
IQ=c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,121.7,122.3,120.9,122.4,125.7,126.4,128.2,129.1,121.2,128.4,127.6,125.1)
Section=c("A","A","B","B","A","B","B","A","B","A","B","B","A","A","B","B","A","B","B","A")
zz=data.frame(ID,IQ,Section)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
What I would like to do is to match people by the range of their IQ (which was the previous question).
Now, I want to create multiple levels of the ranges. For example, one range can be 10 IQ classes: 120-121, 121-122, 122-123, ..., 129-130. Another example is a single IQ class: 120-130. All the possible combinations of the above can be obtained by:
IQ_Class=c(120,121,122,123,124,125,126,127,128,129,130)
n = length(IQ_Class)-2
all_combin=expand.grid(replicate(n, 0:1, simplify = FALSE))
all_combin$First=1
all_combin$Last=1
all_combin_new=all_combin[c("First",names(all_combin)[1:(length(names(all_combin))-2)],"Last")] #Reorder columns
all_combin_new = t((apply(all_combin_new,1,function(x)(x*IQ_Class)))) #Multiply by IQ classes
all_combin_new = apply(all_combin_new, 1, function(x) { x[x!=0] })
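(A quick sanity check of the objects built above, for reference:)
length(all_combin_new)  # 512 candidate sets of class boundaries
all_combin_new[[1]]     # the coarsest case: a single class from 120 to 130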
Note that the final object all_combin_new provides a list of lists of all the classes (512 classes in total).
Now what I want to do is to take one class (one element from all_combin_new) and create all the combinations of ID's in that particular IQ class by their section. Save this dataset and take the next class from all_combin_new and repeat the operation.
From the previous answer, I was able to modify the code to consider the combinations by Section by changing the following in the previous question:
zz1=list("list",length(all_combin_new))
for (i in 1:length(all_combin_new)){ #changed this line to run for all combinations in all_combin_new
zz2=all_combin_new[[i]]
zz11=zz_new%>%
mutate(ID=as.character(ID),vec=as.character(cut(IQ,zz2,right=F)))%>%
group_by(vec,Section)%>% #Changed this line
summarize(if(n()>1)list(data.frame(t(combn(ID,2)),stringsAsFactors = F))
else list(data.frame(X1=ID,X2=ID,stringsAsFactors = F)))%>%
unnest()%>%
bind_cols(read.csv(text=gsub("[^0-9,]","",.$vec),h=F))
zz1[[i]]=as.data.frame(zz11)
}
My actual dataset has about 10K observations (as compared to zz_new here) with 20 Sections (leading to 2^18 = 262144 ranges of IQ, as compared to the length of the all_combin_new list here = 512). This causes two main issues:
a) Time: The speed is extremely slow. Is there a way to increase the speed?
b) Size of objects created: In my tests, even without considering as high a number of combinations, the lists grow too big and the code fails. What alternate approaches could I use here? Note that in the list of lists that I obtain here, I also need to do further computations.
Any help will be appreciated. Thanks in advance.
P.S.Please let me know if any part is unclear or any part of the code has some inadvertent errors.
Edit: Now with loop to go through all IQ combos and to include Section as a key on join.
I used the sample data in the linked question. Instead of making a list and looping, this does everything at once.
Note there is a Cartesian product, so it may still run into memory issues. If you're having trouble, you can always try data.table, as it supports non-equi joins.
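For instance, a rough sketch of the non-equi join idea in data.table (the classes table and its lo/hi column names are made up for illustration):
library(data.table)
dt <- as.data.table(zz)
classes <- data.table(lo = c(120, 122, 124), hi = c(122, 124, 126))
# keep each person/class pair where IQ falls in [lo, hi)
matched <- dt[classes, on = .(IQ >= lo, IQ < hi), nomatch = 0L]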
library(tidyverse)
zz <- tibble(ID = 1:12,
             IQ = c(120.5, 123, 125, 122.5, 122.1, 121.7, 123.2, 123.7, 120.7, 122.3, 120.1, 122),
             Section = c("A", "A", "B", "B", "A", "B", "B", "A", "B", "A", "B", "B"))
IQ_Class <- c(120, 122, 124, 126)
IQ_Classes <- data.frame(First = 1,
                         expand.grid(replicate(length(IQ_Class) - 2, 0:1, simplify = FALSE)),
                         Last = 1)
IQ_Classes <- IQ_Classes * IQ_Class[col(IQ_Classes)]
IQ_Classes_List <- apply(IQ_Classes, 1, function(x) { x[x != 0] })
all_combos <- lapply(IQ_Classes_List,
                     function(IQs) {
                       z_cut <- zz %>%
                         mutate(cut_range = cut(IQ, IQs, right = F, labels = F)) # cut by this combo's breaks
                       inner_join(z_cut,
                                  z_cut %>%
                                    select(V2 = ID, cut_range, Section),
                                  by = c('cut_range', 'Section')) %>%
                         filter(V2 > ID) %>%
                         mutate(Previous_IQ_class = IQs[cut_range],
                                Next_Class = IQs[cut_range + 1])
                     }) %>%
  bind_rows(.id = 'IQ_List')

I am trying to run a nested for loop in R which subtracts each row of a variable in a data frame

Dataframe:
Number Time
1 10:25:00
2 10:35:15
3 10:42:26
For each number in the data frame I want to subtract Time, for example:
Number 1 = 10:25:00 - 10:35:15
Number 2 = 10:35:15 - 10:42:26
My code:
for (i in df$Number) {
  for (j in df$Time) {
    subtime <- df$Time[j] - df$Time[j+1]
  }
}
This code only results in NA
Because subtime is reassigned on every iteration, only the last value remains when the loops finish. Further, j iterates over the values of df$Time rather than over indices, so df$Time[j] does not pick out the row you expect; and even with a proper index, at the last iteration j + 1 would run past the end of the vector, making df$Time[j + 1] return NA. Either way, the result ends up NA.
Instead, in general you can do:
df$subtime <- c(NA, diff(df$Time))
where the leading NA stands in for the first instance (there is no previous value to subtract); replace it with a suitable default. Your case may require additional treatment depending on the exact class of df$Time.
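For example, if Time is stored as "HH:MM:SS" character strings (an assumption about your data), you could convert before differencing:
tm <- as.POSIXct(df$Time, format = "%H:%M:%S")
df$subtime <- c(NA, diff(tm))  # consecutive differences (later minus earlier); flip the sign for row i minus row i+1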
(You should consider creating an MWE of your data if you need further help. What you provided is pretty close, but not quite enough for us to be of help.)
I think you may be looking for something like this. The result is given in hours.
For example: 10:25:00 - 10:35:15 = - 00:10:15 = - (10/60) - (15/3600) = -0.1708333
a <- data.frame(Number = c(1, 2, 3),
                Time = c("10:25:00", "10:35:15", "10:42:26"),
                stringsAsFactors = FALSE)
timeDiff <- function(x, a) {
  as.difftime(a[x, 2]) - as.difftime(a[x + 1, 2])
}
result <- sapply(1:(nrow(a) - 1), timeDiff, a)
result
Please note that it's impossible to compute such a difference for case Number 3, since a fourth row would be necessary, and the data frame you provided has only 3 rows.
As per Stack Overflow's prompt, I can see you are a new user. For future nested for loops, I recommend you explore sapply or lapply, as they will make your code cleaner and easier to maintain.
If you need any further clarification, don't hesitate to comment on my answer. :-)

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another, depending on the distance between them (the closer, the more dependent), so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for (j in 1:iter) {
  df = freq[0, ]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR == i, ]
    fim = t[nrow(t), 2]
    i = t[1, 2]
    f = i + x
    while (f < fim) {
      rows = which(t$POS >= i & t$POS < f)
      s = sample(rows)
      df = rbind(df, t[s, ])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient way to try would be to use the 'boot' package, whose functions include parallel-processing capabilities.
In particular, 'tsboot', the time-series bootstrap function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments indicate that you want fixed blocks of length l = 1000. However, you could use blocks of any size, e.g. 5 or 10, if you're trying to capture nearest-neighbour dynamics moving over time. The parallel argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
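Once it runs, the returned boot object holds the observed statistic and the bootstrap replicates (a usage sketch; df stands for your data frame):
b <- tsboot(df, cor_hat, R = 1000, sim = "fixed", l = 1000)
b$t0                            # correlation on the original data
quantile(b$t, c(0.025, 0.975))  # a simple percentile interval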
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR), function(x) {
  freq_sel <- freq[freq$CHR == x, ]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999, ])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1), ] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
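To turn one such draw into a bootstrap replicate of the statistic, you can bind the per-chromosome samples and correlate (a sketch):
sampled <- do.call(rbind, res)
cor(sampled$var_A, sampled$var_B)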
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x = unique(freq$CHR), .packages = 'plyr') %dopar% {
  freq_sel <- freq[freq$CHR == x, ]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix + 999, ])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x), 1), ] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
library(dplyr)
l <- 1000
teste <- freq %>%
  mutate(w = ceiling(POS / l)) %>%
  group_by(CHR, w) %>%
  sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with sample_n(1)), and apply whatever statistic of interest you want, as sketched below.
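For example (a sketch reusing the logic behind teste), wrap it in replicate to build the bootstrap distribution of the correlation:
boot_cors <- replicate(1000, {
  s <- freq %>%
    mutate(w = ceiling(POS / l)) %>%
    group_by(CHR, w) %>%
    sample_n(1)
  cor(s$var_A, s$var_B)
})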

How can I force R to display 2 decimal points in a large double?

I'm trying to sum a vector of doubles in a data frame. When the sum is relatively low, this works as intended.
df <- data.frame(
numbers = c(50, 632.5, 12.45)
)
sum(df$numbers)
# 694.95
But when the sum gets higher, R begins to round the displayed sum.
df <- data.frame(
numbers = c(50000000, 632.5, 12.45)
)
sum(df$numbers)
# 50000645
How can I stop R from eliminating these decimal points? The output I want is:
sum(df$numbers)
# 50000644.95
Try setting the number of digits in options; you can set whatever works for you. For example:
options(digits = 10)
sum(df$numbers)
[1] 50000644.95
Alternatively, if you only want to change the setting for this one computation, you could pass digits to print() instead:
print(sum(df$numbers), digits = 10)  # digits is an argument to print() here, not sum()
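If you only need the printed text to show two decimals (leaving the stored value untouched), formatting functions work too, for example:
sprintf("%.2f", sum(df$numbers))                   # "50000644.95"
formatC(sum(df$numbers), format = "f", digits = 2) # "50000644.95"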
