Converting an arrival time process to a count process in R

I have data for an arrival process and I want to convert it to a count process. This is what I did:
# inter-arrival time in milliseconds
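# (rpareto is not in base R; this location/shape signature matches e.g. EnvStats::rpareto)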
x <- rpareto(100000, location = 10, shape = 1.2)
# arrival time in milliseconds
x.cumsum <- cumsum(x)
# the last arrival
x.max <- max(x.cumsum)
# the time scale for the count data, in this case 1 second
kTimeScale <- 1000
count.length <- ceiling(x.max / kTimeScale)
counts <- rep(0, times = count.length)
for (i in x.cumsum) {
  counts[round(i / kTimeScale)] <- counts[round(i / kTimeScale)] + 1
}
This works, but for a very large dataset (a few million arrivals) it's slow. I was wondering if there is a better, faster way to do this?

You can do this with table:
countsTable <- table(round(x.cumsum / kTimeScale))
counts[1:10]
## [1] 24 41 1 2 33 26 20 45 36 19
countsTable[1:10]
##
## 0 1 2 3 4 5 6 7 8 9
## 5 24 41 1 2 33 26 20 45 36
The difference is that your loop misses the 0 values. table won't put in a 0 for values where there are no observations, but you can do something like this to fix that:
counts2 <- rep(0, length(counts) + 1)
counts2[as.integer(names(countsTable)) + 1] <- countsTable
identical(counts, counts2[-1])
## [1] TRUE
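As an aside, base R's tabulate can be even faster than table and fills empty bins with 0 directly. A minimal sketch using the same rounding convention as above (values that round to bin 0 are silently dropped, just as in the original loop):
counts3 <- tabulate(round(x.cumsum / kTimeScale), nbins = count.length)
# tabulate returns integers while counts is double, so compare with == rather than identical()
all(counts == counts3)
## [1] TRUE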

Related

R ranges: 1:0 - illogical behavior

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index, i, is within 2..(N-1). If it's equal to 1, the second term will return the first element of the array rather than exclude it. If it's equal to N, the first term will return the last element of the array rather than exclude it. seq_len is the only function I'm aware of that does the job, but only for the 2nd term (it indexes 1:n). What I need is a range function that will return NULL (rather than throw an exception like seq does) when its 2nd argument is below its first. The sum function will do the rest. Is anyone aware of one, or do I have to write one myself?
I suggest an alternate path for generating indexing sequences: seq_len, which behaves intuitively at the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after every time. Just do it once and re-use it. This method could easily be a bit faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
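Putting the pieces together, a small helper (the name sumdiff is mine, purely for illustration) returns the summed difference for every i in one pass:
sumdiff <- function(x) {
  N <- length(x)
  Xb <- c(0, cumsum(x)[-N])            # sums before each index i
  Xa <- c(rev(cumsum(rev(x)))[-1], 0)  # sums after each index i
  Xa - Xb
}
sumdiff(1:10)
# [1] 54 51 46 39 30 19 6 -9 -26 -45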

lapply divides a column by an integer value from a different dataset, unexpected result

I have two data.frames, one with genotype counts and one with a number that I need in order to normalize the counts from the first dataset.
countsdata = data.frame(genotype1 = rep(c(10,20,30,40), each = 1),
                        genotype2 = rep(c(100,200,300,400), each = 1),
                        genotype3 = rep(c(40,50,60,70), each = 1),
                        genotype4 = rep(c(40,50,60,70), each = 1))
coldata = data.frame(Group = c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
                     Treatment = rep(c("control","treated"), each = 2),
                     Norm = rep(c(1,2,5,5)))
I made sure my variables aren't factors:
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
                                                     as.character))
coldata=factorsCharacter(coldata)
Then I checked that lapply loops through my counts one column at a time, and through my coldata that contains the normalization value (Norm). All looked good, until I combined the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (each column divided by its normalization number). After further inspection I noticed it's normalizing by rows, in other words across different columns, which shouldn't happen since I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code to make the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. With single brackets, instead of looping through each of the elements in the 'Group' column, we loop over a list of length 1 that contains all the elements at once. So use coldata[, 'Group'], coldata[['Group']], or coldata$Group for looping.
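A quick way to see the difference (a minimal check using the objects defined above):
# `[` keeps the data.frame: a list with ONE element, so lapply calls the function once with the whole column
str(coldata['Group'])
## 'data.frame': 4 obs. of 1 variable:
##  $ Group: chr "genotype1" "genotype2" "genotype3" "genotype4"
# `[[` extracts the column itself: a vector with FOUR elements, one per genotype
str(coldata[['Group']])
##  chr [1:4] "genotype1" "genotype2" "genotype3" "genotype4"
With [[ the division is applied one genotype at a time: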
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']], function(group_i)
  countsdata[, group_i] / coldata$Norm[coldata$Group == group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map:
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")

Maximum Intermediate Volatility

I have two vectors, a and b; the data is listed at the end of the question.
a is the signal and is a probability.
b is the absolute percentage change in the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occurring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06, 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07, and do it over, etc.
The output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vectors a and b.
i is an index that is updated by one every time that a crosses into a new percentile, and τ(i) is the bucket associated with the ith cross.
a is the probability vector, which has length τ. This vector should be analyzed in its 5%-tiles, with the maximum intermediate absolute return being the output. The price change of the next period is the vector b, represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using (the notation might vary slightly): for the ith cross, take max |P_l / P_m − 1| over all pairs τ(i) ≤ m < l ≤ τ(i+1).
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
- For each row you compute which 5% percentile it belongs to
- Whenever that percentile changes, you start a new bucket
- All rows from the same bucket result in a single resulting value
- If there is only a single row in a bucket, the b value from that row is the resulting value
- Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
  if(length(b) == 1) return(b)  # special case of only a single row
  m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))  # compare all pairs
  return(max(m[upper.tri(m)]))  # only keep pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
  if(length(b) == 1) return(b)
  m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
  return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
      percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
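If you would rather stay in base R, the same per-bucket summary can be built with split and lapply (a sketch reusing the aggB helper defined above):
# split the data by bucket, summarise each piece, then stack the pieces
res <- do.call(rbind, lapply(split(d, d$bucket), function(g)
  data.frame(bucket = g$bucket[1], percentile = g$percentile[1],
             n = nrow(g), maxr = aggB(g$b))))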
I am not sure I understand, but here is an attempt. My idea is to group the data by percentiles, then do the calculation on each group using by.
To group the data I create a new variable, split:
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can perform my function on each group. I handle the degenerate cases of NULL prob values or single values separately.
by(dat, dat$split, FUN = function(x) {
  P <- x$b
  if(is.null(P) || length(P) == 1) return(0)
  nn <- length(P)
  ind <- expand.grid(1:nn, 1:nn)  ## generate all index pairs here
  ret <- abs(P[ind[,1]]/P[ind[,2]] - 1)  ## perform P_l/P_m - 1 (vectorized)
  list(P = P,
       ret.max = max(ret),
       ret.ind = ind[which.max(ret),])
})
Here is the result list. For each interval I show:
P (the prob values),
the maximum return,
and the indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1

Adding additional observation in panel data in R

I am trying to add additional years to my panel data. Just wondering if you have any ideas for a quick way of doing it. Keep in mind my real data is T = 6, i = 4000.
# Here is my input
data = data.frame(time=c(30,40,50,30,40,50,30,40,50),
                  id=c(1,1,1,2,2,2,3,3,3),
                  d=c(1,4,7,8,14,2,41,11,61))
# declare panel data individual and time indexes
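library(plm)  # pdata.frame() below comes from the plm package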
pd = pdata.frame(data, c("id","time"), drop.index=FALSE)
#this is what I want out...
data.out = data.frame(time=c(30,40,50,60,30,40,50,60,30,40,50,60),
                      id=c(1,1,1,1,2,2,2,2,3,3,3,3),
                      d=c(1,4,7,8,9,14,2,41,50,11,61,70))
# declare panel data individual and time indexes
pd.data.out = pdata.frame(data.out, c("id","time"), drop.index=FALSE)
I am not quite sure what you are doing, but this might help:
data = data.frame(time=c(30,40,50,30,40,50,30,40,50),
                  id=c(1,1,1,2,2,2,3,3,3),
                  d=c(1,4,7,8,14,2,41,11,61))
newdata = data.frame(time=c(60,60,60),
                     id=c(1,2,3),
                     d=c(9,50,70))
combodata = rbind(data,newdata)
data.out = combodata[order(combodata$id,combodata$time), ]
rownames(data.out) = NULL
to produce
> data.out
time id d
1 30 1 1
2 40 1 4
3 50 1 7
4 60 1 9
5 30 2 8
6 40 2 14
7 50 2 2
8 60 2 50
9 30 3 41
10 40 3 11
11 50 3 61
12 60 3 70
and I think this is what you want for time and id, though d is marginally different. If the rows do not need to be ordered, then the last three lines of the code can be condensed to
data.out = rbind(data,newdata)
Got it... just create a new time and id data.frame and merge into it:
time = rep(c(unique(as.numeric(as.character(pd$time))),
             max(as.numeric(as.character(pd$time))) + 10),
           length(unique(pd$id)))
id = rep(unique(pd$id), each = max(as.numeric(as.character(pd$id))) + 1)
data2 = data.frame(time, id)
data.out = merge(data2, pd, all.x=T)
data.out = data.out[with(data.out, order(id,time) ), ]
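For what it's worth, the grid of all (id, time) pairs can also be built with expand.grid instead of rep, which generalizes to any number of added periods (a sketch; as with the merge above, the new period's d comes back as NA):
grid = expand.grid(time = c(unique(data$time), 60), id = unique(data$id))
data.out = merge(grid, data, all.x = TRUE)  # unmatched (id, time = 60) rows get d = NA
data.out = data.out[with(data.out, order(id, time)), ]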

Grouping R variables based on sub-groups

I have a data formatted as
PERSON_A PERSON_B MEET LEAVE
That basically describes when PERSON_A met PERSON_B at time MEET and when they said "bye" to each other at time LEAVE. The time is expressed in seconds, and a small part of the data is at http://pastie.org/2825794 (simple.dat).
What I need is to count the number of meetings, grouping them by day. At the moment I have code that works, but it isn't beautiful. I'd like help transforming it into code that reflects the grouping I'm trying to do, e.g. using ddply; my main aim is to learn from this case. There are probably many violations of good R practice in this code.
library(plyr)
data = read.table("simple.dat", stringsAsFactors=FALSE)
names(data)=c('PERSON_A','PERSON_B','MEET','LEAVE')
attach(data)
min_interval = min(MEET)
max_interval = max(LEAVE)
interval = max_interval - min_interval
day = 86400
number_of_days = floor(interval/day)
g = data.frame(MEETINGS=c(0:number_of_days)) # just to store the result
g[,1] = 0
start_offset = min_interval # start of the first day
for (interval in c(0:number_of_days)) {
  end_offset = start_offset + day
  meetings = (length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_A) +
                length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_B))
  g[interval+1, ] = meetings
  start_offset = end_offset  # start next day
}
g
This code iterates over the days (intervals of 86400 seconds) and stores the number of meetings in the data frame g. The correct output (shown below) of this code when executed on the linked dataset gives for each line (day) the number of meetings.
MEETINGS
1 38
2 10
3 16
4 18
5 24
6 6
7 4
8 10
9 28
10 14
11 22
12 2
13 .. 44 0 # I simplified the output here
45 2
Anyway, I know that I could use ddply to get the number of meetings for each pair of nodes:
contacts <- ddply(data, .(PERSON_A, PERSON_B), summarise,
                  CONTACTS = length(c(PERSON_A, PERSON_B)) / 2)
but there is a huge hill for me between this and the result I need.
As an end note, I read How to make a great R reproducible example? and tried my best :)
Thanks,
try this:
> d2 <- transform(data, m = floor(MEET/86400) + 1, l = floor(LEAVE/86400) + 1)
> d3 <- subset(d2, m == l)
> table(d3$m) * 2
1 2 3 4 5 6 7 8 9 10 11 12 45
38 10 16 18 24 6 4 10 28 14 22 2 2
floor(x/(60*60*24)) is a quick way to convert seconds into days. The subset m == l keeps only meetings that start and end on the same day, and the * 2 counts both PERSON_A and PERSON_B for each meeting, matching the counting in the question.
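For example (a quick sanity check of the conversion; day 1 starts at second 0):
floor(c(0, 86399, 86400, 172800) / 86400) + 1
## [1] 1 1 2 3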
