Conditional column-wise or row-wise subtraction in a data frame

I need to do a column-wise subtraction and row-wise subtraction in R.
id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015
For example, in the table above, whenever a new id appears the subtraction should be column-wise (that row's fail minus its on date); otherwise it should be row-wise (that row's fail date minus the previous row's fail date). I need a result like this:
id on fail res
1 10-10-2014 11-11-2014 32
1 11-10-2014 12-12-2014 31
1 12-10-2014 12-01-2015 31
2 13-10-2014 12-02-2015 122
2 14-10-2014 15-03-2015 31
2 15-10-2014 15-04-2015 31
2 16-10-2014 16-05-2015 31
3 17-10-2014 16-06-2015 242
3 18-10-2014 17-07-2015 31
3 19-10-2014 17-08-2015 31
3 20-10-2014 17-09-2015 31
As of now I am using the following code:
data[,2] <- as.Date(data[,2],format="%d-%m-%Y")
data[,3] <- as.Date(data[,3],format="%d-%m-%Y")
x <- as.numeric(diff(data[,3]))

A base R approach: take consecutive differences of fail, then overwrite the first row of each id with fail - on.
DF <- read.table(text="id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015 ", header=TRUE)
DF[,2:3] <- lapply(DF[,2:3], as.Date, format="%d-%m-%Y")
DF$res <- c(NA, diff(DF$fail))
new_id <- c(TRUE, diff(DF$id) != 0)  # TRUE where a new id starts
DF[new_id, "res"] <- DF[new_id, "fail"] - DF[new_id, "on"]
# id on fail res
# 1 1 2014-10-10 2014-11-11 32
# 2 1 2014-10-11 2014-12-12 31
# 3 1 2014-10-12 2015-01-12 31
# 4 2 2014-10-13 2015-02-12 122
# 5 2 2014-10-14 2015-03-15 31
# 6 2 2014-10-15 2015-04-15 31
# 7 2 2014-10-16 2015-05-16 31
# 8 3 2014-10-17 2015-06-16 242
# 9 3 2014-10-18 2015-07-17 31
# 10 3 2014-10-19 2015-08-17 31
# 11 3 2014-10-20 2015-09-17 31
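For comparison, the same per-id rule can also be written with split() and lapply() — a sketch of my own, not part of the original answer — using a small subset of the data for illustration:

```r
# Sketch: per-id version of the rule (first row of each id: fail - on;
# later rows: difference from the previous fail date)
DF <- data.frame(id   = c(1, 1, 2, 2),
                 on   = as.Date(c("2014-10-10", "2014-10-11",
                                  "2014-10-13", "2014-10-14")),
                 fail = as.Date(c("2014-11-11", "2014-12-12",
                                  "2015-02-12", "2015-03-15")))
DF$res <- unlist(lapply(split(DF, DF$id), function(g)
  c(g$fail[1] - g$on[1], diff(g$fail))))
DF$res
# 32 31 122 31
```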

identify sequences of approximately equivalent values in a series using R

I have a series of values that contains runs of values close to one another, as in the sequence below. I have manually labeled the values in V1 with distinct group numbers in V2; roughly at those labeled breaks, the range of the values changes. All the values labeled 1 in V2 are within 20 points of each other, all the values labeled 2 are within 20 points of each other, and so on. The values within a run are not identical (they are all different), but they cluster around a common value.
I identified these clusters manually. How could I automate this?
V1 V2
1 399.710 1
2 403.075 1
3 405.766 1
4 407.112 1
5 408.458 1
6 409.131 1
7 410.477 1
8 411.150 1
9 412.495 1
10 332.419 2
11 330.400 2
12 329.054 2
13 327.708 2
14 326.363 2
15 325.017 2
16 322.998 2
17 319.633 2
18 314.923 2
19 288.680 3
20 285.315 3
21 283.969 3
22 281.950 3
23 279.932 3
24 276.567 3
25 273.875 3
26 272.530 3
27 271.857 3
28 272.530 3
29 273.875 3
30 274.548 3
31 275.894 3
32 275.894 3
33 276.567 3
34 277.240 3
35 278.586 3
36 279.932 3
37 281.950 3
38 284.642 3
39 288.007 3
40 291.371 3
41 294.063 4
42 295.409 4
43 296.754 4
44 297.427 4
45 298.100 4
46 299.446 4
47 300.792 4
48 303.484 4
49 306.848 4
50 327.708 5
51 309.540 6
52 310.213 6
53 309.540 6
54 306.848 6
55 304.156 6
56 302.811 6
57 302.811 6
58 304.156 6
59 305.502 6
60 306.175 6
61 306.175 6
62 304.829 6
I haven't tried anything yet; I don't know how to approach this.
You can use dist and hclust with cutree to detect the clusters, then renumber them so that each run gets a unique level at the breaks:
hc <- hclust(dist(x))
cl <- cutree(hc, k=6)
data.frame(x, seq=cumsum(c(0, diff(cl)) != 0) + 1)
# x seq
# 1 399.710 1
# 2 403.075 1
# 3 405.766 1
# 4 407.112 1
# 5 408.458 1
# 6 409.131 1
# 7 410.477 1
# 8 411.150 1
# 9 412.495 1
# 10 332.419 2
# 11 330.400 2
# 12 329.054 2
# 13 327.708 2
# 14 326.363 2
# 15 325.017 2
# 16 322.998 2
# 17 319.633 3
# 18 314.923 3
# 19 288.680 4
# 20 285.315 4
# 21 283.969 4
# 22 281.950 4
# 23 279.932 4
# 24 276.567 5
# 25 273.875 5
# 26 272.530 5
# 27 271.857 5
# 28 272.530 5
# 29 273.875 5
# 30 274.548 5
# 31 275.894 5
# 32 275.894 5
# 33 276.567 5
# 34 277.240 5
# 35 278.586 6
# 36 279.932 6
# 37 281.950 6
# 38 284.642 6
# 39 288.007 6
# 40 291.371 6
# 41 294.063 7
# 42 295.409 7
# 43 296.754 7
# 44 297.427 7
# 45 298.100 7
# 46 299.446 7
# 47 300.792 7
# 48 303.484 7
# 49 306.848 7
# 50 327.708 8
# 51 309.540 9
# 52 310.213 9
# 53 309.540 9
# 54 306.848 9
# 55 304.156 9
# 56 302.811 9
# 57 302.811 9
# 58 304.156 9
# 59 305.502 9
# 60 306.175 9
# 61 306.175 9
# 62 304.829 9
However, the dendrogram suggests k=4 clusters rather than 6, though the choice of cut height is somewhat arbitrary:
plot(hc)
abline(h=30, lty=2, col=2)
abline(h=18.5, lty=2, col=3)
abline(h=14, lty=2, col=4)
legend('topright', lty=2, col=2:4, legend=paste(c(4, 5, 7), 'cluster'), cex=.8)
Data:
x <- c(399.71, 403.075, 405.766, 407.112, 408.458, 409.131, 410.477,
411.15, 412.495, 332.419, 330.4, 329.054, 327.708, 326.363, 325.017,
322.998, 319.633, 314.923, 288.68, 285.315, 283.969, 281.95,
279.932, 276.567, 273.875, 272.53, 271.857, 272.53, 273.875,
274.548, 275.894, 275.894, 276.567, 277.24, 278.586, 279.932,
281.95, 284.642, 288.007, 291.371, 294.063, 295.409, 296.754,
297.427, 298.1, 299.446, 300.792, 303.484, 306.848, 327.708,
309.54, 310.213, 309.54, 306.848, 304.156, 302.811, 302.811,
304.156, 305.502, 306.175, 306.175, 304.829)
This solution iterates over every value, checks the range of all values in the group up to that point, and starts a new group if the range is greater than a threshold.
# dat is the questioner's data frame; the values are in column V1
maxrange <- 18
grp_start <- 1
grp_num <- 1
V3 <- numeric(length(dat$V1))
for (i in seq_along(dat$V1)) {
  grp <- dat$V1[grp_start:i]
  if (max(grp) - min(grp) > maxrange) {
    grp_num <- grp_num + 1
    grp_start <- i
  }
  V3[i] <- grp_num
}
cbind(dat, V3)
V1 V2 V3
1 399.710 1 1
2 403.075 1 1
3 405.766 1 1
4 407.112 1 1
5 408.458 1 1
6 409.131 1 1
7 410.477 1 1
8 411.150 1 1
9 412.495 1 1
10 332.419 2 2
11 330.400 2 2
12 329.054 2 2
13 327.708 2 2
14 326.363 2 2
15 325.017 2 2
16 322.998 2 2
17 319.633 2 2
18 314.923 2 2
19 288.680 3 3
20 285.315 3 3
21 283.969 3 3
22 281.950 3 3
23 279.932 3 3
24 276.567 3 3
25 273.875 3 3
26 272.530 3 3
27 271.857 3 3
28 272.530 3 3
29 273.875 3 3
30 274.548 3 3
31 275.894 3 3
32 275.894 3 3
33 276.567 3 3
34 277.240 3 3
35 278.586 3 3
36 279.932 3 3
37 281.950 3 3
38 284.642 3 3
39 288.007 3 3
40 291.371 3 4
41 294.063 4 4
42 295.409 4 4
43 296.754 4 4
44 297.427 4 4
45 298.100 4 4
46 299.446 4 4
47 300.792 4 4
48 303.484 4 4
49 306.848 4 4
50 327.708 5 5
51 309.540 6 6
52 310.213 6 6
53 309.540 6 6
54 306.848 6 6
55 304.156 6 6
56 302.811 6 6
57 302.811 6 6
58 304.156 6 6
59 305.502 6 6
60 306.175 6 6
61 306.175 6 6
62 304.829 6 6
A threshold of 18 reproduces your groups, except that group 4 starts one row earlier. You could use a higher threshold, but then group 6 would start later than you have it.
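The loop can be wrapped in a small helper so that different thresholds are easy to compare (the function name is mine, not part of the answer):

```r
# Sketch: start a new group whenever the running range exceeds maxrange
group_by_range <- function(x, maxrange) {
  grp_start <- 1
  grp_num <- 1L
  out <- integer(length(x))
  for (i in seq_along(x)) {
    if (max(x[grp_start:i]) - min(x[grp_start:i]) > maxrange) {
      grp_num <- grp_num + 1L
      grp_start <- i
    }
    out[i] <- grp_num
  }
  out
}
group_by_range(c(10, 12, 15, 40, 42, 80), maxrange = 18)
# 1 1 1 2 2 3
```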

r group by date difference with respect to first date

I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
What I would like to do is, for each Id, get the difference between each date and the earliest date for that Id: (2nd earliest - earliest), (3rd earliest - earliest), and so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks
First change Date1 to the Date class:
df$Date1 = as.Date(df$Date1)
Then subtract the first value within each Id. This can be done using dplyr:
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Or with data.table:
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), Id]
Or with base R:
df$diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0's with NA if you want the output exactly as shown in the question.
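That last replacement can be sketched like this (illustrative data; the Diff column is assumed to have been computed as above):

```r
# Sketch: replace the group-leading 0's in Diff with NA
df <- data.frame(Id = c(1, 2, 2, 2), Diff = c(0L, 0L, 1644L, 1644L))
df$Diff[df$Diff == 0] <- NA
df$Diff
# NA NA 1644 1644
```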

Merging Data frames and creating columns based on conditions

I have 2 data frames
Data Frame A:
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
Data Frame B:
TimeStart TimeEnd Alarm
2 5 556
7 9 556
I would like to create the following joined dataframe:
Time Reading Alarmtime Alarm alarmno
1 20 n/a n/a n/a
2 23 2 556 1
3 25 n/a 556 1
4 22 n/a 556 1
5 24 5 556 1
6 23 n/a n/a n/a
7 24 7 556 2
8 23 n/a 556 2
9 23 9 556 2
10 22 n/a n/a n/a
I can do the join easily enough; however, I'm struggling to fill the following rows with the alarm until the time the alarm ended, and to number each individual alarm so that two alarms with the same code are still counted separately. Any thoughts on how I can do this would be great.
Thanks
library(sqldf)
df_b$AlarmNo <- seq_len(nrow(df_b))
sqldf('
select a.Time
, a.Reading
, case when a.Time in (b.TimeStart, b.TimeEnd)
then a.Time
else NULL
end as AlarmTime
, b.Alarm
, b.AlarmNo
from df_a a
left join df_b b
on a.Time between b.TimeStart and b.TimeEnd
')
# Time Reading AlarmTime Alarm AlarmNo
# 1 1 20 NA NA NA
# 2 2 23 2 556 1
# 3 3 25 NA 556 1
# 4 4 22 NA 556 1
# 5 5 24 5 556 1
# 6 6 23 NA NA NA
# 7 7 24 7 556 2
# 8 8 23 NA 556 2
# 9 9 23 9 556 2
# 10 10 22 NA NA NA
Or
library(data.table)
setDT(df_b)
df_c <-
df_b[, .(Time = seq(TimeStart, TimeEnd), Alarm, AlarmNo = .GRP)
, by = TimeStart]
merge(df_a, df_c, by = 'Time', all.x = T)
# Time Reading TimeStart Alarm AlarmNo
# 1: 1 20 NA NA NA
# 2: 2 23 2 556 1
# 3: 3 25 2 556 1
# 4: 4 22 2 556 1
# 5: 5 24 2 556 1
# 6: 6 23 NA NA NA
# 7: 7 24 7 556 2
# 8: 8 23 7 556 2
# 9: 9 23 7 556 2
# 10: 10 22 NA NA NA
Data used:
df_a <- fread('
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
')
df_b <- fread('
TimeStart TimeEnd Alarm
2 5 556
7 9 556
')

Split data frame based on group into list in defined order in R

I have a simple data.frame:
ts_df <- data.frame(val=c(20,30,40,50,21,26,11,41,47,41),
cycle=c(3,3,3,3,2,2,2,1,1,1),
date=as.Date(c("1985-06-30","1985-09-30","1985-12-31","1986-03-31","1986-06-30","1986-09-30","1986-12-31","1987-03-31","1987-06-30","1987-09-30")))
and I need to split ts_df by group while keeping the order of the resulting list based on date.
list_ts_df <- split(ts_df, ts_df$cycle)
So instead of
> list_ts_df
$`1`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
I need desired output as
> list_ts_df
$`1`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
Is there any simple solution to achieve the desired output? Thank you very much for any advice.
We can order the dataset by date first and then split on 'cycle', creating a factor whose levels follow their order of first appearance:
t1 <- ts_df[order(ts_df$date),]
split(t1, factor(t1$cycle, levels = unique(t1$cycle)))

Appending a tally of subjects to a dataframe in R

I have a list of subjects:
myDat = list(Subject = c(10234, 10234, 10234, 10234, 10242, 10242, 10242, 10242, 10253, 10253, 10253, 10268, 10268, 10268, 10268))
and I would like to add a count (DayNo) to the data frame that restarts with every change in Subject.
Thanks in advance
An ave variant:
df <- as.data.frame(myDat)
df$Day <- ave(df$Subject, df$Subject, FUN=seq_along)
Produces:
Subject Day
1 10234 1
2 10234 2
3 10234 3
4 10234 4
5 10242 1
6 10242 2
7 10242 3
8 10242 4
9 10253 1
10 10253 2
11 10253 3
12 10268 1
13 10268 2
14 10268 3
15 10268 4
Use rle to get the run lengths and use sequence to create sequences of corresponding length.
myDat <- as.data.frame(myDat)
myDat$DayNo <- sequence(rle(myDat$Subject)$lengths)
# Subject DayNo
# 1 10234 1
# 2 10234 2
# 3 10234 3
# 4 10234 4
# 5 10242 1
# 6 10242 2
# 7 10242 3
# 8 10242 4
# 9 10253 1
# 10 10253 2
# 11 10253 3
# 12 10268 1
# 13 10268 2
# 14 10268 3
# 15 10268 4
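If the data.table package is available, its rleid() numbers the runs directly — a sketch of my own, not part of the original answers:

```r
# Sketch: per-run day counter via data.table's rleid()
library(data.table)
dt <- data.table(Subject = c(10234, 10234, 10234, 10242, 10242, 10253))
dt[, DayNo := seq_len(.N), by = rleid(Subject)]
dt$DayNo
# 1 2 3 1 2 1
```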
