I want to apply a conditional statement to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
                   time = c(1:24, 1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
which gives the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on consecutive values of "visit".
If "visit" stays below 10 for 6 consecutive rows, I'd label the ID "empty", and "busy" otherwise.
In the data above, "A" is below 10 in rows 1 to 6, so it is "empty". "B", on the other hand, never has 6 consecutive single-digit values, so it is "busy".
If the condition isn't fulfilled in one segment of 6 values, I want to apply it to the next segment.
I'd like to achieve this using R. Any advice would be appreciated.
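One possible way to approach this is with run-length encoding via base R's rle(). The sketch below hard-codes the "visit" values printed above (the original call uses runif, so they are not reproducible) and assumes that any run of at least 6 consecutive values below 10, anywhere within an ID, counts as "empty". The helper name has_low_run and the new column name status are made up for illustration; a stricter reading based on non-overlapping blocks of 6 would need different logic.

```r
# Values copied from the printed table, since runif() is not reproducible
data <- data.frame(
  ID = rep(c("A", "B"), times = c(24, 12)),
  time = c(1:24, 1:12),
  visit = c(7, 0, 6, 6, 3, 8, 4, 10, 18, 6, 1, 13, 7, 1, 6, 1, 11, 8, 16, 14,
            15, 19, 5, 13, 6, 6, 16, 4, 19, 5, 17, 6, 10, 1, 13, 15)
)

# Hypothetical helper: TRUE if v contains a run of at least run_length
# consecutive values below threshold
has_low_run <- function(v, threshold = 10, run_length = 6) {
  r <- rle(v < threshold)
  any(r$values & r$lengths >= run_length)
}

# One logical per ID, then mapped back onto every row of that ID
flags <- tapply(data$visit, data$ID, has_low_run)
data$status <- unname(ifelse(flags[as.character(data$ID)], "empty", "busy"))
```

Because rle() collapses the whole series into runs in one pass, the window never has to be moved by hand.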
I have a dataset that includes individual events across a time period. Some example records are shown below; each individual has 2-4 records observed within the period. The event# is ordered by time; however, the same event# does not occur on the same date for everyone (A's #1 event occurs on 6/1, while C's #1 event happens on 6/3). Should I analyze the data as unbalanced panel data with two dimensions, individual and event# (i.e., the time dimension)? If not, how should I treat this data? Thanks.
obs  ind  event#  date  var1  y
1    A    1       6/1   11    33
2    A    2       6/4   12    23
3    A    3       6/5   13    32
4    A    4       6/5   14    55
5    B    1       6/1   15    44
6    B    2       6/2   18    54
7    C    1       6/3   15    22
8    C    2       6/3   29    55
9    C    3       6/6   31    23
10   D    1       6/3   13    45
11   D    2       6/5   2     12
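As a quick way to inspect the structure before choosing a model, the sample records can be entered in R and tabulated. This is only a sketch: the data frame name events and the column name event (standing in for event#) are made up for illustration, and the dates are kept as plain strings exactly as shown above.

```r
# Sample records from the table above
events <- data.frame(
  ind   = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  event = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2),
  date  = c("6/1", "6/4", "6/5", "6/5", "6/1", "6/2",
            "6/3", "6/3", "6/6", "6/3", "6/5"),
  var1  = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 2),
  y     = c(33, 23, 32, 55, 44, 54, 22, 55, 23, 45, 12)
)

# Number of records per individual; unequal counts mean the panel is unbalanced
records_per_ind <- table(events$ind)
is_balanced <- length(unique(as.vector(records_per_ind))) == 1

# Does any individual have more than one event on the same date?
# If so, date alone cannot serve as the within-individual time index.
has_same_day_events <- anyDuplicated(events[c("ind", "date")]) > 0
```

In this sample the individuals contribute different numbers of records, and both A and C have two events on a single date, so event# identifies a record within an individual where date does not.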
I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
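I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%   # number each client's purchases 1, 2, 3, ...
  ungroup() %>%
  # id_cols is named explicitly so it is not matched by position
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))
where dd is your sample data frame above.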
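For the case mentioned in the question where only the earliest purchase per customer within a chosen date range is wanted, a loop-free base R sketch might look like the following. The range bounds from and to are arbitrary values picked for illustration, and dd here holds just the first ten rows of the sample data.

```r
# First ten rows of the sample data from the question
dd <- data.frame(
  ID   = c(1, 1, 1, 2, 3, 4, 2, 4, 6, 3),
  Date = c(1, 2, 3, 4, 5, 6, 1, 3, 2, 5),
  Sum  = c(234, 45, 1, 223, 546, 12, 20, 30, 3, 45)
)

# Hypothetical date range chosen for illustration
from <- 2
to   <- 5

# Keep only purchases inside the range
in_range <- dd[dd$Date >= from & dd$Date <= to, ]

# Sort by customer and date, then keep the first row for each customer
in_range <- in_range[order(in_range$ID, in_range$Date), ]
earliest <- in_range[!duplicated(in_range$ID), ]
```

Both order() and duplicated() are vectorised, so this avoids an explicit loop over customers; on several million rows the same idea could also be written with data.table.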
I've been struggling with this problem for a while now, so I hope someone can help me find a more time efficient solution.
So, I have a dataframe of ID's like this:
IDinsurer<-c(rep(11,3),rep(12,2),rep(11,2),rep(13,2),11)
ClaimFileNum<-c(rep('AA',3),rep('BB',2),rep('CC',2),rep('DD',2),'EE')
IDdriver<-c(rep(11,3),rep(12,2),rep(21,2),rep(13,2),11)
IDclaimant<-c(31,11,32,12,33,11,34,13,11,11)
IDclaimdriver<-c(41,11,32,12,11,21,34,13,12,11)
dt<-data.frame(ClaimFileNum,IDinsurer,IDdriver,IDclaimant,IDclaimdriver)
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver
1 AA 11 11 31 41
2 AA 11 11 11 11
3 AA 11 11 32 32
4 BB 12 12 12 12
5 BB 12 12 33 11
6 CC 11 21 11 21
7 CC 11 21 34 34
8 DD 13 13 13 13
9 DD 13 13 11 12
10 EE 11 11 11 11
What I'd like to do is count the number of different claim files (ClaimFileNum) on which each IDinsurer has appeared in other roles (i.e. not as an insurer). So for each IDinsurer I only want the count of claim files, where his ID appeared in either IDdriver, IDclaimant or IDclaimdriver while at the same time he isn't the IDinsurer of the given claimfile. For example, IDinsurer==11 appeared on all ClaimFileNums, but only on "BB" and "DD" was he not also the IDinsurer, meaning I'd want my program to return 2.
So this is how I'd like my final data frame to look:
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver N
1 AA 11 11 31 41 2
2 AA 11 11 11 11 2
3 AA 11 11 32 32 2
4 BB 12 12 12 12 1
5 BB 12 12 33 11 1
6 CC 11 21 11 21 2
7 CC 11 21 34 34 2
8 DD 13 13 13 13 0
9 DD 13 13 11 12 0
10 EE 11 11 11 11 2
So this is what I was able to come up with so far:
1)
For each of the three other roles (IDdriver, IDclaimant, IDclaimdriver) I separately calculated a new column showing how many claim files each ID appeared on IN THAT ROLE ONLY, excluding claim files where they were also the insurer (for IDclaimdriver, however, it made more sense to exclude the cases where the ID matched either IDclaimant or IDdriver instead). This is the code for the IDdriver counts:
library(data.table)       # data.table(), := and setorder()
library(splitstackshape)  # expandRows()
# Collapse duplicated (ClaimFileNum, IDdriver) rows and add a column with their frequency
count.duplicates <- function(dt){
  x <- do.call('paste', c(dt[,c("ClaimFileNum","IDdriver")], sep = '\r'))
  ox <- order(x)
  rl <- rle(x[ox])
  cbind(dt[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}
dt<-count.duplicates(dt)
dt<-data.table(dt)
# same is 1 when the driver is not the insurer of that claim file
dt[,same:=ifelse(IDinsurer==IDdriver,0,1)]
dt[,N_IDdriver:=sum(same,na.rm = TRUE),by=list(IDdriver)]
dt[,same:=NULL]
setorder(dt,ClaimFileNum)
# restore the original number of rows using the stored frequencies
dt<-expandRows(dt,"count")
dt<-as.data.frame(dt)
And this is the output for my example after all three counts:
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver N_IDdriver N_IDclaimant N_IDclaimdriver
1 AA 11 11 31 41 0 1 1
2 AA 11 11 11 11 0 1 1
3 AA 11 11 32 32 0 1 0
4 BB 12 12 12 12 0 0 1
5 BB 12 12 33 11 0 1 1
6 CC 11 21 11 21 1 1 0
7 CC 11 21 34 34 1 1 0
8 DD 13 13 13 13 0 0 0
9 DD 13 13 11 12 0 1 1
10 EE 11 11 11 11 0 1 1
2) I then used a for loop over the entire IDinsurer column to check, using the match function, whether IDinsurer[i] has appeared among the IDs of any of the other three roles. If a match was found, I simply added the count from the corresponding N_ column to the overall count.
Here is my for loop:
total <- nrow(dt)
dt$N <- 0  # initialise the result column before filling it row by row
for(i in seq_len(total)) {
  j <- match(dt$IDinsurer[i], dt$IDdriver, nomatch = 0)
  k <- match(dt$IDinsurer[i], dt$IDclaimant, nomatch = 0)
  l <- match(dt$IDinsurer[i], dt$IDclaimdriver, nomatch = 0)
  dt$N[i] <- ifelse(j == 0, 0, dt$N_IDdriver[j]) + ifelse(k == 0, 0, dt$N_IDclaimant[k]) + ifelse(l == 0, 0, dt$N_IDclaimdriver[l])
}
Now while this approach gives me all the information I need, it's unfortunately incredibly sluggish, especially on a dataset with over 2 million cases like the one I'll have to work with. I'm sure there must be a more elegant solution and I've been trying to figure out how to do it with some more efficient tools (like data.table) but I just can't get the grasp of it.
EDIT: I decided to try both of the answers to my question on my example and compare them with my attempt so here are the calculation times:
Thom Quinn's for loop: 0.15sec,
my for loop: 0.25 sec,
bounyball's approach: 0.35 sec.
Using my loop on a 1,042,000 row dataset took just under 10 hours.
You don't need match in this case. In fact, you already solved the problem in English; you just need to translate it into computer lingo!
So for each IDinsurer I only want the count of claim files, where his ID appeared in either IDdriver, IDclaimant or IDclaimdriver while at the same time he isn't the IDinsurer of the given claimfile
So, let's do just that. In pseudo-code:
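for each unique IDinsurer:
    count claim files where (IDdriver OR IDclaimant OR IDclaimdriver matches) AND NOT (IDinsurer matches)
In R, this is:
for(i in unique(dt$IDinsurer)){
  # rows where i appears in another role on a claim file it does not insure
  index <- dt$IDinsurer != i & (dt$IDdriver == i | dt$IDclaimant == i | dt$IDclaimdriver == i)
  # count distinct claim files rather than rows, since one file can span several rows
  dt[dt$IDinsurer == i, "N"] <- length(unique(dt$ClaimFileNum[index]))
}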
We can use lapply to compute a count for each ID, do.call to bind the results together, and merge to attach them to the original data.
We first loop over the unique insurer IDs. For each ID, we exclude any rows where the ID equals the IDinsurer. Within that subset, we count the rows where any of the other ID columns equals the ID we're working with. Then we combine the counts into a data frame and join it back onto the data using merge.
res.df <-
do.call('rbind.data.frame',
lapply(unique(dt$IDinsurer), function(x)
c(
x, sum(apply(dt[dt$IDinsurer != x, 3:5] == x, 1, function(y) any(y)))
)
)
)
names(res.df) <- c('ID', 'Count')
merge(dt, res.df, by.x = 'IDinsurer', by.y = 'ID')
IDinsurer ClaimFileNum IDdriver IDclaimant IDclaimdriver Count
1 11 AA 11 31 41 2
2 11 AA 11 11 11 2
3 11 AA 11 32 32 2
4 11 CC 21 11 21 2
5 11 CC 21 34 34 2
6 11 EE 11 11 11 2
7 12 BB 12 12 12 1
8 12 BB 12 33 11 1
9 13 DD 13 13 13 0
10 13 DD 13 11 12 0
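For a dataset with millions of rows, another option is to avoid looping over insurers altogether. The following base R sketch stacks the three role columns into long form, keeps the distinct (ID, claim file) pairs where the ID is not that row's insurer, and tabulates them. It assumes each ClaimFileNum has a single IDinsurer, as in the example; the intermediate names roles, other and counts are made up for illustration.

```r
# Example data from the question
IDinsurer     <- c(rep(11, 3), rep(12, 2), rep(11, 2), rep(13, 2), 11)
ClaimFileNum  <- c(rep("AA", 3), rep("BB", 2), rep("CC", 2), rep("DD", 2), "EE")
IDdriver      <- c(rep(11, 3), rep(12, 2), rep(21, 2), rep(13, 2), 11)
IDclaimant    <- c(31, 11, 32, 12, 33, 11, 34, 13, 11, 11)
IDclaimdriver <- c(41, 11, 32, 12, 11, 21, 34, 13, 12, 11)
dt <- data.frame(ClaimFileNum, IDinsurer, IDdriver, IDclaimant, IDclaimdriver)

# Stack the three non-insurer roles into one long table
roles <- data.frame(
  ID           = c(dt$IDdriver, dt$IDclaimant, dt$IDclaimdriver),
  ClaimFileNum = rep(dt$ClaimFileNum, times = 3),
  IDinsurer    = rep(dt$IDinsurer, times = 3)
)

# Distinct (ID, claim file) pairs where the ID is not the insurer of that file
other <- unique(roles[roles$ID != roles$IDinsurer, c("ID", "ClaimFileNum")])

# Count claim files per insurer ID; using the insurer IDs as factor levels
# guarantees a zero count for insurers that never appear in another role
counts <- table(factor(other$ID, levels = unique(dt$IDinsurer)))

# Look the count up for every row's insurer
dt$N <- as.integer(counts[as.character(dt$IDinsurer)])
```

On the example this gives 2 for insurer 11, 1 for insurer 12 and 0 for insurer 13, matching the desired output in the question.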
The problem:
I would like to construct a variable that measures cumulative work experience within a person-year longitudinal data set. The problem applies to all sorts of longitudinal data sets, and many variables might be constructed in this cumulative way (e.g., number of children, cumulative education, cumulative dollars spent on vacations, etc.).
The case:
I have a large longitudinal data set in which every row constitutes a person year. The data set contains thousands of persons (variable “ID”) followed through their lives (variable “age”), resulting in a data frame with about 1.2 million rows. One variable indicates how many months a person has worked in each person year (variable “work”). For example, when Dan was 15 years old he worked 3 months.
ID age work
1 Dan 10 0
2 Dan 11 0
3 Dan 12 0
4 Dan 13 0
5 Dan 14 0
6 Dan 15 3
7 Dan 16 5
8 Dan 17 8
9 Dan 18 5
10 Dan 19 12
11 Jeff 20 0
12 Jeff 16 0
13 Jeff 17 0
14 Jeff 18 0
15 Jeff 19 0
16 Jeff 20 0
17 Jeff 21 8
18 Jeff 22 10
19 Jeff 23 12
20 Jeff 24 12
21 Jeff 25 12
22 Jeff 26 12
23 Jeff 27 12
24 Jeff 28 12
25 Jeff 29 12
I now want to construct a cumulative work experience variable, which adds the value of year x to year x+1. The goal is to know, at each age of a person, how many months they have worked in their entire career. The variable should look like “cumwork”.
ID age work cumwork
1 Dan 10 0 0
2 Dan 11 0 0
3 Dan 12 0 0
4 Dan 13 0 0
5 Dan 14 0 0
6 Dan 15 3 3
7 Dan 16 5 8
8 Dan 17 8 16
9 Dan 18 5 21
10 Dan 19 12 33
11 Jeff 20 0 0
12 Jeff 16 0 0
13 Jeff 17 0 0
14 Jeff 18 0 0
15 Jeff 19 0 0
16 Jeff 20 0 0
17 Jeff 21 8 8
18 Jeff 22 10 18
19 Jeff 23 12 30
20 Jeff 24 12 42
21 Jeff 25 12 54
22 Jeff 26 12 66
23 Jeff 27 12 78
24 Jeff 28 12 90
25 Jeff 29 12 102
A poor solution: I can construct such a cumulative variable using the following simple loop:
# Generate test data set
x <- data.frame(
  ID = c(rep("Dan", times = 10), rep("Jeff", times = 15)),
  age = c(10:20, 16:29),
  work = c(rep(0, times = 5), 3, 5, 8, 5, 12, rep(0, times = 6), 8, 10, rep(12, times = 7)),
  stringsAsFactors = FALSE
)
# Generate cumulative work experience variable
x$cumwork <- x$work
for(r in 2:nrow(x)){
  # carry the running total forward only while the person stays the same
  if(x$ID[r] == x$ID[r-1]){
    x$cumwork[r] <- x$cumwork[r-1] + x$cumwork[r]
  }
}
However, my dataset has 1.2 million rows and looping through each row is highly inefficient and running this loop would take hours. Does any brilliant programmer have a suggestion of how to construct this cumulative measure most efficiently?
Many thanks in advance!
Best,
Raphael
ave is convenient for these types of tasks. The function you want to use with it is cumsum:
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
x
# ID age work cumwork
# 1 Dan 10 0 0
# 2 Dan 11 0 0
# 3 Dan 12 0 0
# 4 Dan 13 0 0
# 5 Dan 14 0 0
# 6 Dan 15 3 3
# 7 Dan 16 5 8
# 8 Dan 17 8 16
# 9 Dan 18 5 21
# 10 Dan 19 12 33
# 11 Jeff 20 0 0
# 12 Jeff 16 0 0
# 13 Jeff 17 0 0
# 14 Jeff 18 0 0
# 15 Jeff 19 0 0
# 16 Jeff 20 0 0
# 17 Jeff 21 8 8
# 18 Jeff 22 10 18
# 19 Jeff 23 12 30
# 20 Jeff 24 12 42
# 21 Jeff 25 12 54
# 22 Jeff 26 12 66
# 23 Jeff 27 12 78
# 24 Jeff 28 12 90
# 25 Jeff 29 12 102
However, given the scale of your data, I would also strongly suggest the "data.table" package, which also gives you access to convenient syntax:
library(data.table)
DT <- data.table(x)
DT[, cumwork := cumsum(work), by = ID]
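One caveat with either approach: cumsum accumulates in row order, so the rows need to be sorted by age within each person first. A minimal sketch, assuming a small made-up data frame whose rows arrive out of order:

```r
# Small illustrative data set with rows deliberately out of order
x <- data.frame(
  ID   = c("Dan", "Jeff", "Dan", "Jeff", "Dan"),
  age  = c(16, 21, 15, 22, 17),
  work = c(5, 8, 3, 10, 8),
  stringsAsFactors = FALSE
)

# Sort by person and age so the running total follows each life course
x <- x[order(x$ID, x$age), ]

# Cumulative months worked within each person
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
```

With data.table, the equivalent sort could be done with setorder(DT, ID, age) before running the := line above.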