cbind() time series without NAs

I've observed that for many operators on overlapping time series, the result is given only for the overlapping portion, which is nice:
> (ts1 <- ts(1:5, start=1, freq=3))
Time Series:
Start = c(1, 1)
End = c(2, 2)
Frequency = 3
[1] 1 2 3 4 5
> (ts2 <- ts((7:3)^2, start=2, freq=3))
Time Series:
Start = c(2, 1)
End = c(3, 2)
Frequency = 3
[1] 49 36 25 16 9
> ts1 + ts2
Time Series:
Start = c(2, 1)
End = c(2, 2)
Frequency = 3
[1] 53 41
However, this doesn't seem to be the case with cbind(). While the output is aligned properly, NAs are created for the non-overlapping data:
> (mts <- cbind(ts1, ts2))
Time Series:
Start = c(1, 1)
End = c(3, 2)
Frequency = 3
ts1 ts2
1.000000 1 NA
1.333333 2 NA
1.666667 3 NA
2.000000 4 49
2.333333 5 36
2.666667 NA 25
3.000000 NA 16
3.333333 NA 9
Is there a way to perform that cbind() without creating the rows with NA in them? Or, if not, what's a good way to take the result and strip off the rows with the NAs? It's not a simple matter of subscripting, because then it loses its time-series nature:
> mts[complete.cases(mts),]
ts1 ts2
[1,] 4 49
[2,] 5 36
Maybe something with window(), but calculating the start & end times for the window seems a little yucky. Any advice is welcome.
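A sketch of the window() approach I have in mind (it works on this toy example, but computing the start and end by hand is the yucky part):
> st <- max(tsp(ts1)[1], tsp(ts2)[1])   # later of the two start times
> en <- min(tsp(ts1)[2], tsp(ts2)[2])   # earlier of the two end times
> cbind(window(ts1, start = st, end = en), window(ts2, start = st, end = en))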

Why not just na.omit the result?
> na.omit(cbind(ts1,ts2))
Time Series:
Start = c(2, 1)
End = c(2, 2)
Frequency = 3
ts1 ts2
2.000000 4 49
2.333333 5 36
If you want to avoid na.omit, stats:::cbind.ts calls stats:::.cbind.ts, which has a union argument. You could set that to FALSE and call stats:::.cbind.ts directly (after creating appropriate arguments):
> stats:::.cbind.ts(list(ts1,ts2),list('ts1','ts2'),union=FALSE)
Time Series:
Start = c(2, 1)
End = c(2, 2)
Frequency = 3
ts1 ts2
2.000000 4 49
2.333333 5 36
But the na.omit solution seems a tad easier. ;-)
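For what it's worth, stats also exports ts.intersect() (with ts.union() as the union counterpart), which should give the same intersection behaviour without reaching into the unexported internals:
> ts.intersect(ts1, ts2)
The output should match the na.omit() result above.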

Related

From a sequence of numbers, how do I find the number immediately smaller (and the number immediately bigger) than a particular random number, in R?

So I have 10 increasing sequences of numbers, each of which looks like (say x(i) <- c(2, 3, 5, 6, 8, 10, 11, 17) for i ranging from 1 to 10), and I have a random sampling number, say p=9.
Now for each sequence x(i), I need to find the number immediately smaller than p and the number immediately bigger than p, and then for each i (from 1 to 10) I need to take the difference of these two numbers and store the differences in a string.
For the x(i) given here, the number immediately smaller than p=9 would be 8 and the number immediately bigger than p=9 would be 10, so the difference would be (10-8)=2.
I am trying to get code that would create a string of these differences, where the first number of the string would be the difference for i=1, the second number the difference for i=2, and so on. The string would have one number for each i.
I am relatively new to R, so anything connected to loops throws me off a little bit. Any help would be appreciated. Thanks.
EDIT: I am putting the code I am working with for clarification.
fr = 100
dt = 1/1000   # dt in millisecond
duration = 2  # duration in s
nBins = 2000  # SpikeTrain
nTrials = 20  # NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr = 100) {
  # draw nBins uniforms and threshold them to get a 0/1 spike train
  p = runif(nBins)
  q = ifelse(p < fr*dt, 1, 0)
  return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
Spike_times <- function(i) {
  c(dt*which(SpikeMat[i, ] == 1))
}
set.seed(4)
RT <- runif(1, 0, 2)
for (i in 1:nTrials){
The explanation for this code is in my previous question. I have 20 (nTrials, the number of trials) strings named Spike_times(i) here. Each Spike_times(i) is a string of time stamps between 0 and 2 seconds where spikes occurred, and they have different numbers of entries. Now I have a random time sample in the form of RT, which is a random number between 0 and 2 seconds. Say RT is 1.17 seconds and each Spike_times(i) is an increasing sequence of time stamps between 0 and 2 seconds.
Let me give you an example: Spike_times(3) looks like 0.003 0.015 0.017 ... 1.169 1.176 1.189 ... 1.985 1.990 1.997. I need code that picks out 1.169 and 1.176, takes the difference of these entries (0.007), and stores it in another string, say W, as the third entry, c(_, _, 0.007, ...), and does this for all 20 strings Spike_times(i), giving me W with 20 entries.
I hope my question is clear enough. Please let me know if I need to correct something.
This approach should do what you want. I am making a function that extracts the desired result from a single sequence and then applying it to each sequence. I am assuming here that your sequences are row-vectors and are stacked in a matrix. If your actual data structure is different the code can be adapted, but you need to indicate how your sequences are actually stored.
x <- matrix(rep(c(2,3,5,6,8,10,11,17), 10), nrow=10, byrow = T)
x
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 2 3 5 6 8 10 11 17
#> [2,] 2 3 5 6 8 10 11 17
#> [3,] 2 3 5 6 8 10 11 17
#> [4,] 2 3 5 6 8 10 11 17
#> [5,] 2 3 5 6 8 10 11 17
#> [6,] 2 3 5 6 8 10 11 17
#> [7,] 2 3 5 6 8 10 11 17
#> [8,] 2 3 5 6 8 10 11 17
#> [9,] 2 3 5 6 8 10 11 17
#> [10,] 2 3 5 6 8 10 11 17
set.seed(123)
p = sample(10, 1)
# write a function to do what you want on one sequence:
# NOTE: If p appears in the sequence I assume you want the
# closest numbers not equal to p! If you want the closest
# numbers to p including p itself change the less than/
# greater than to <= / >=
get_l_r_diff <- function(row, p) {
  temp <- row - p
  lower <- max(row[temp < 0])
  upper <- min(row[temp > 0])
  upper - lower
}
apply(x, 1, function(row) get_l_r_diff(row, p))
#> [1] 3 3 3 3 3 3 3 3 3 3
apply(x, 1, function(row) get_l_r_diff(row, 9))
#> [1] 2 2 2 2 2 2 2 2 2 2
# if the result really needs to be a string
paste(apply(x, 1, function(row) get_l_r_diff(row, 9)), collapse = "")
#> [1] "2222222222"
For your case you can just apply the two functions to your indices:
spikes <- sapply(1:20, function(i){get_l_r_diff(Spike_times(i), RT)})
By making a small change to your Spike_times function, you can do this with sapply, returning a vector of all calculated values:
Spike_times <- function(i) {
  x <- c(dt*which(SpikeMat[i, ] == 1))
  min(x[x > RT]) - max(x[x < RT])
}
set.seed(4)
RT <- runif(1, 0 , 2)
results <- sapply(1:20, Spike_times)
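If you would rather not change Spike_times itself, a findInterval()-based sketch is another option. It assumes RT falls strictly between two spike times (at least one spike before and one after RT, and RT never exactly equal to a spike time); gap_at is just an illustrative helper name:
gap_at <- function(times, p) {
  k <- findInterval(p, times)  # index of the last spike time <= p
  times[k + 1] - times[k]      # width of the interval containing p
}
W <- sapply(1:nTrials, function(i) gap_at(dt*which(SpikeMat[i, ] == 1), RT))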

Cumsum but with maximum number of datapoints

I'm looking to create a hybrid of cumsum() and TTR::runSum() where cumsum() is used up until a pre-specified number of datapoints, at which point it acts more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum),cumsum,runSum)
Is there a way to get to DesiredResult that doesn't require finagling with NAs?
That is what the partial=TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of one value is NA and we chose IQR since it is a measure of spread that can be calculated for scalars although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6
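One more base R sketch, using the cumulative-sum difference trick (this assumes data contains no NAs):
csum <- cumsum(data)
csum - c(rep(0, n), head(csum, -n))
# [1] 1 3 6 6 6 6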

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one   = c(1,  1,  NA, 13),
                 two   = c(2,  NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4,  9,  12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Within each row, the columns are measurements in weeks 1, 2, 3 and 4, respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value for weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly over the week it was taken and any immediately preceding weeks in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of the runs (NAs occurring multiple times), and use that to fill in the values somehow. However, my real data is large, and I think such a strategy might be time-consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, and finally assign the grouped values with the ave function:
clean <- function(x) {
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y(x);
  int consecNA = 0;
  for (int i=0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;
    } else if (consecNA > 0) {
      const double replacement = x[i] / (consecNA + 1);
      for (int j=i-consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    } else {
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  # flatten row-wise with a -Inf sentinel appended to each row, then reverse
  x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  # reading right-to-left, each non-NA (or sentinel) opens a new group,
  # and the NAs that precede it in the original row fall into that group
  ii <- cumsum(!is.na(x))
  f <- tabulate(ii)        # size of each group
  v <- x[!duplicated(ii)]  # the value that heads each group
  xx <- v[ii]/f[ii]        # spread that value evenly over its group
  xx[xx == -Inf] <- NA     # groups headed by the sentinel (trailing NAs) stay NA
  m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR = t(apply(mat, 1, clean)),
josilberRcpp = t(apply(mat, 1, cleanRcpp)),
Josh = spread_left(df),
Henrik = t(apply(df, 1, fn)),
replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base possibility. I first create a grouping variable (grp), over which the 'spread' is then made with ave.
fn <- function(x){
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
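To see what the grouping variable is doing, here it is for the first row of df:
x <- c(1, 2, NA, 4)          # row 1 of df
rev(cumsum(!is.na(rev(x))))  # 3 2 1 1: weeks "three" and "four" share group 1, so the 4 is spread over both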
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
  # switch to long format
  DT <- data.table(id=(1:nrow(df))*ncol(df), df)
  mDT <- setkey(melt(DT, id.vars="id"), id)
  mDT[, value := as.numeric(value)]
  mDT[, badv := is.na(value)]
  mDT[
    # subset to rows that need modification
    badv | shift(badv),
    # apply @Henrik's function, more or less
    value := {
      g = ave(!badv, id, FUN=function(x) rev(cumsum(rev(x)))) + id
      ave(value, g, FUN=function(x){ n = length(x); x[n]/n })
    }]
  # revert to wide format
  (setDF(dcast(mDT, id ~ variable)[, id := NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
c(NA,1:3),
nr*nc,
replace=TRUE,
prob=c(nafreq,rep((1-nafreq)/3,3))
),nrow=nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT <- data.table(id=(1:nrow(df))*ncol(df), df)
mDT <- setkey(melt(DT, id.vars="id"), id)
mDT[, value := as.numeric(value)]
fill_naseq_long <- function(mDT){
  mDT[, badv := is.na(value)]
  mDT[badv | shift(badv), value := {
    g = ave(!badv, id, FUN=function(x) rev(cumsum(rev(x)))) + id
    ave(value, g, FUN=function(x){ n = length(x); x[n]/n })
  }]
  mDT
}
benchmark(
F2=fill_naseq_long(mDT),F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

Matrix elements manipulation

b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
What I want to do is compare the numbers in the first column of this matrix, and if a number is equal to the next consecutive number in the column (in this case if 1 = 1, and so on), then I want to add the corresponding numbers in the second column together (as in 10 + 10 = 20, and so on), giving a single value, and I then want to store this output in a separate vector.
The output from the matrix I am looking for is as follows:
[,1] [,2] [,3]
[1,] 1 10 20
[2,] 1 10 40
[3,] 2 20 62
[4,] 2 20 85
[5,] 3 30 5
[6,] 3 32
[7,] 4 40
[8,] 4 45
[9,] 1 5
I am quite new to R and struggling with this. Thank you in advance!
This sounds like a job for rle and tapply:
b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
A <- rle(a[, 1])$lengths
tapply(a[, 2], rep(seq_along(A), A), sum)
# 1 2 3 4 5
# 20 40 60 80 5
Explanation:
rle identifies the run-lengths of the items in the first column of matrix "a".
We create a grouping variable for tapply from the run-lengths using rep(seq_along(A), A).
We put those two things together in tapply to get the sums you want.
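For this example the intermediate pieces look like this:
A
# [1] 2 2 2 2 1
rep(seq_along(A), A)
# [1] 1 1 2 2 3 3 4 4 5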
Is this what you want? I bet there are clean base solutions, but I'll give it a try with rollsum from the zoo package:
library(zoo)
mm <- cbind(c(1, 1, 2, 2, 3, 3, 4, 4, 1), c(10, 10, 20, 20, 30, 30, 40, 40, 5))
# calculate all lagged sums of column 2
sums <- rollsum(x = mm[ , 2], k = 2)
# calculate differences between consecutive numbers in column 1
diffs <- diff(mm[ , 1])
# select sums where diff is 0, i.e. where the two consecutive numbers in column 1 are equal.
sums2 <- sums[diffs == 0]
sums2
# [1] 20 40 60 80

Bin a range and form a data frame using the boundaries

Suppose I want to generate bins for the range 1 to 20:
round(seq(1,20,length.out=5))
The output is
1 6 10 15 20
I want to form a data.frame as
[,1] [,2]
[1,] 1 6
[2,] 7 10
[3,] 11 15
[4,] 16 20
so the starts will be 1, 7, 11, 16 and the ends 6, 10, 15, 20, respectively.
Any solution for this?
x = round(seq(1,20,length.out=5))
df = data.frame(a = c(x[1], head(x[-1],-1) + 1), b = x[-1])
df
# a b
#1 1 6
#2 7 10
#3 11 15
#4 16 20
I am not sure if you are looking for the following solution. If you are, you can use the cut and sub functions as in my earlier post:
mydata <- round(seq(1,20,length.out=5))
mydata <- as.data.frame(mydata)
names(mydata) <- "V"           # name the column as V
mydata$V1 <- cut(mydata$V, 5)  # break the data into five intervals, stored as col V1
mydata$lower <- with(mydata, as.numeric(sub("\\((.+),.*", "\\1", V1)))        # extract lower value
mydata$upper <- with(mydata, as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", V1)))  # extract upper value
myfinaldata <- mydata[, c("lower","upper")]  # create data frame of lower and upper values
> myfinaldata
lower upper
1 0.981 4.79
2 4.790 8.60
3 8.600 12.40
4 12.400 16.20
5 16.200 20.00
Note: Although these look like overlapping intervals, they are not. By default cut() produces intervals that are open on the left and closed on the right, so the first row means all data > 0.981 and <= 4.79, whereas the second row means data > 4.79 and <= 8.60.
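Relatedly, a small sketch: if the goal is to keep the boundaries exactly at the rounded integers from the question rather than letting cut() choose its own five intervals, the break points can be passed to cut() directly:
x <- round(seq(1,20,length.out=5))
cut(1:20, breaks = x, include.lowest = TRUE)
The resulting levels are [1,6], (6,10], (10,15] and (15,20], following the same left-open/right-closed convention as above.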
