Is data with individual and event-sequence (not fixed time) indices considered panel data? - panel-data

I have a dataset of individual events across a time period; some example records are below. Each individual has 2-4 records observed within the period. The event # is ordered by time, but the same event # does not occur on the same date for every individual (A's event #1 occurs on 6/1, while C's event #1 happens on 6/3). Should I analyze this as unbalanced panel data with two dimensions, individual and event # (i.e., the time dimension)? If not, how should I treat this data? Thanks.
obs  ind  event#  date  var1  y
1    A    1       6/1   11    33
2    A    2       6/4   12    23
3    A    3       6/5   13    32
4    A    4       6/5   14    55
5    B    1       6/1   15    44
6    B    2       6/2   18    54
7    C    1       6/3   15    22
8    C    2       6/3   29    55
9    C    3       6/6   31    23
10   D    1       6/3   13    45
11   D    2       6/5   2     12

Related

Fill empty values with their respective KEY

For example, I have one master table containing telemetry data, with one record every 5 seconds and millions of rows:
ID  DateTime             IngestionTime        X        Y   Z
1   2012-12-28T12:04:00  2012-12-28T12:04:00  12       11  10
2   2012-12-28T12:06:00  2012-12-28T12:06:00  2        9   7
3   2012-12-29T12:11:00  2012-12-29T12:11:00  2        9   7
1   2012-12-29T12:15:00  2012-12-29T12:15:00  (empty)  33  7
2   2012-12-29T12:24:00  2012-12-29T12:24:00  (empty)  9   72
3   2012-12-29T12:30:00  2012-12-29T12:30:00  44       40  54
4   2012-12-29T12:35:00  2012-12-29T12:35:00  11       9   92
I have a function demo(datetime:fromTime, datetime:toTime). With it I query from fromTime 2012-12-29T12:11:00 to toTime (later on the same 29 December). If any values in that range are empty, I need to fill them from the previous record with the respective column. The requirement is to fill this last known value in a very optimized way, since we are dealing with millions of rows. The desired result:
ID  DateTime             IngestionTime        X                        Y   Z
1   2012-12-28T12:04:00  2012-12-28T12:04:00  12                       11  10
2   2012-12-28T12:06:00  2012-12-28T12:06:00  8                        9   7
3   2012-12-29T12:11:00  2012-12-29T12:11:00  2                        9   7
1   2012-12-29T12:15:00  2012-12-29T12:15:00  lastKnownValueForThisID  33  7
2   2012-12-29T12:24:00  2012-12-29T12:24:00  lastKnownValueForThisID  9   72
3   2012-12-29T12:30:00  2012-12-29T12:30:00  44                       40  54
4   2012-12-29T12:35:00  2012-12-29T12:35:00  11                       9   92
I am not sure that I fully understand the question; however, for filling missing values in a time series there are a few useful built-in functions, for example series_fill_backward(), series_fill_const(), and more.

R: How to apply a sliding conditional branch to consecutive values in sequential data

I want to apply a conditional statement to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A", "B"), times = c(24, 12)),
                   time = c(1:24, 1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
which gives the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on runs of "visit" values.
If "visit" stays below 10 for 6 consecutive rows, I'd attach "empty", and "busy" otherwise.
In the data above, "A" is continuously below 10 from rows 1 to 6, hence "empty". On the other hand, "B" never has 6 consecutive one-digit values, hence "busy".
If the condition isn't fulfilled in one segment of 6 values, I want to apply it to the next segment of 6 values.
I'd like to achieve this in R. Any advice will be appreciated.
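A minimal base-R sketch of one way to do this, assuming non-overlapping segments of 6 rows per ID and using the visit values printed above (runif() without a seed is not reproducible, so the values are hard-coded here):

```r
# Flag an ID as "empty" if any full, non-overlapping segment of 6
# consecutive visits is entirely below the cutoff, else "busy".
flag_id <- function(visit, width = 6, cutoff = 10) {
  segments <- split(visit, (seq_along(visit) - 1) %/% width)
  full <- segments[lengths(segments) == width]
  if (any(vapply(full, function(v) all(v < cutoff), logical(1)))) "empty" else "busy"
}

# visit values as printed in the question
visit_A <- c(7, 0, 6, 6, 3, 8, 4, 10, 18, 6, 1, 13,
             7, 1, 6, 1, 11, 8, 16, 14, 15, 19, 5, 13)
visit_B <- c(6, 6, 16, 4, 19, 5, 17, 6, 10, 1, 13, 15)

flag_id(visit_A)  # "empty": rows 1-6 are all below 10
flag_id(visit_B)  # "busy": no segment of 6 is entirely below 10
```

Overlapping windows would need a different split, but the question's "next segment of 6 values" reading maps directly onto the non-overlapping blocks used here.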

Creating a summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, a date of purchase, and the total value of the purchase.
I want to create a new dataset where for each client ID I have all the purchases in the last month, or even just the purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others come daily, so the number of observations per period of time varies.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date, and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case I wouldn't mind getting the earliest observation.)
An important note: I know how to solve this with a for loop, but since the dataset has over 4 million observations that isn't practical; it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))
Where dd is your sample data frame above. (Note that the resulting columns are named Date_1, Sum_1, Date_2, ... rather than 1Date, 1Sum.)
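If only the earliest purchase per client within a chosen date range is needed (the question says that would be acceptable), a vectorised base-R filter avoids looping over 4 million rows. A sketch, with placeholder range endpoints lo and hi and a toy data frame mirroring the question's first rows:

```r
# Toy data mirroring the question's first rows
dd <- data.frame(ID   = c(1, 1, 1, 2, 3, 4, 2),
                 Date = c(1, 2, 3, 4, 5, 6, 1),
                 Sum  = c(234, 45, 1, 223, 546, 12, 20))

lo <- 1; hi <- 6                                          # placeholder date range
in_range <- dd[dd$Date >= lo & dd$Date <= hi, ]

# sort by ID then Date, then keep the first (earliest) row per ID
in_range <- in_range[order(in_range$ID, in_range$Date), ]
earliest <- in_range[!duplicated(in_range$ID), ]
earliest
# one row per ID: client 1 -> Date 1 / Sum 234, client 2 -> Date 1 / Sum 20, ...
```

`duplicated()` on a sorted frame is a common base-R idiom for "first row per group" and scales to millions of rows without an explicit loop.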

Split data when time intervals exceed a defined value

I have a data frame of GPS locations with a column of seconds. How can I create a new column based on time gaps? i.e. for this data frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a time gap of 3 or more seconds between locations, and create a new column entitled 'bouts' that gives a running tally of the sections, producing a data frame like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.
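Putting it together on the df above:

```r
df <- data.frame(secs = c(1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14,
                          20, 21, 22, 23, 24, 28, 29, 31))
# a gap of 3+ seconds starts a new bout; cumsum over the gap indicator
# (prefixed with 1 for the first row) numbers the bouts consecutively
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
df$bouts
# 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4
```

The `c(1, ...)` prefix both assigns the first row to bout 1 and restores the length lost by diff().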

I don't know how to create this tree in R

I would like to maximize revenue by applying the better campaign at each hour, and to create a tree that helps me choose the better campaign.
The data below records the revenue per campaign per hour.
Looking at the data, I may see that campaign A is better between hours 1-12, and that campaign B is better between hours 13-24.
How do I create, in R, a tree that would tell me that?
hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5
I'm not sure what kind of tree you are looking for exactly, but a linear model tree for revenue with regressor campaign and partitioning variable hour might be useful. Using lmtree() in package partykit you can fit a tree that starts out by fitting a linear model with two coefficients (intercept and campaign B effect) and then splits the data as long as there are significant instabilities in at least one of the coefficients:
library("partykit")
(tr <- lmtree(revenue ~ campaign | hour, data = d))
## Linear model tree
##
## Model formula:
## revenue ~ campaign | hour
##
## Fitted party:
## [1] root
## | [2] hour <= 12: n = 24
## | (Intercept) campaignB
## | 20.583333 -1.166667
## | [3] hour > 12: n = 24
## | (Intercept) campaignB
## | 4.666667 1.666667
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
## Number of parameters per node: 2
## Objective function (residual sum of squares): 341.1667
In this (presumably artificial) data, this selects a single split at 12 hours and then has two terminal nodes: one with a negative campaign B effect (i.e., A is better) and one with a positive campaign B effect (i.e., B is better). The fitted tree can be visualized with plot(tr).
The plot also brings out that the split is driven by the change in revenue level and not only by the differing campaign effects (which are fairly small).
The underlying tree algorithm is called "Model-Based Recursive Partitioning" (MOB) and is also applicable to models other than linear regression. See the references in the manual and vignette for more details.
Another algorithm that might potentially be interesting is the QUINT (qualitative interaction trees) by Dusseldorp & Van Mechelen, available in the quint package.
For convenient replication of the example above, the d data frame can be recreated by:
d <- read.table(textConnection("hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5"), header = TRUE)
Would something like this work? (Using the d data frame recreated above.)
## create a sequence of hours from column 1 of the data
hr <- as.numeric(unique(d[, 1]))
## set up vectors to hold the A and B campaign "best" hours
A.hours <- NULL
B.hours <- NULL
## loop over the hours, lowest first
for (i in seq_along(hr)) {
  ## subset the rows for the current hour
  sub.data <- d[d[, 1] == hr[i], ]
  ## find the campaign with the highest revenue
  best.camp <- sub.data[which.max(sub.data[, 3]), 2]
  if (best.camp == "A") A.hours <- c(A.hours, hr[i])
  if (best.camp == "B") B.hours <- c(B.hours, hr[i])
}
The code indicates that during the A.hours (hours: 1 3 5 7 9 11 12 15 17 19 20 22 23), campaign A is more profitable.
However, during B.hours (hours: 2 4 6 8 10 13 14 16 18 21 24), campaign B is more profitable.
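The per-hour comparison can also be done without a loop. A vectorised base-R sketch that re-creates d inline and keeps, for each hour, the row with the highest revenue:

```r
# revenue per campaign per hour, as in the question's data
d <- data.frame(hour     = rep(1:24, each = 2),
                campaign = rep(c("A", "B"), times = 24),
                revenue  = c(23, 20, 21, 22, 23, 20, 21, 22, 23, 20, 21, 22,
                             20, 17, 18, 19, 20, 17, 18, 19, 20, 17, 19, 18,
                              8,  9,  6, 11,  9,  8,  6, 11,  9,  8,  6, 11,
                              3,  2,  3,  2,  0,  5,  3,  2,  3,  2,  0,  5))

# sort by hour, highest revenue first, then keep the first row per hour
ord <- order(d$hour, -d$revenue)
top <- d[ord, ][!duplicated(d$hour[ord]), ]

# hours won by each campaign
split(top$hour, top$campaign)
# $A: 1 3 5 7 9 11 12 15 17 19 20 22 23
# $B: 2 4 6 8 10 13 14 16 18 21 24
```

This reproduces the loop's A.hours and B.hours in a single pass, which matters little for 48 rows but generalizes cheaply to larger data.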
