R Removing earliest observations when duplicate IDs are present

I have a data frame which looks something like this:
ID = c(1,1,1,1,2,2,3,3,3,3,4,4)
TIME = as.POSIXct(c("2013-03-31 09:07:00", "2013-09-26 10:07:00", "2013-03-31 11:07:00",
"2013-09-26 12:07:00","2013-03-31 09:10:00","2013-03-31 11:11:00",
"2013-03-31 09:06:00","2013-09-26 09:04:00","2013-03-31 10:35:00",
"2013-09-26 09:07:00","2013-09-26 09:07:00","2013-09-26 10:07:00"))
var = c(0,0,1,1,0,1,0,0,1,1,0,1)
DF = data.frame(ID, TIME, var)
ID TIME var
1 1 2013-03-31 09:07:00 0
2 1 2013-09-26 10:07:00 0
3 1 2013-03-31 11:07:00 1
4 1 2013-09-26 12:07:00 1
5 2 2013-03-31 09:10:00 0
6 2 2013-03-31 11:11:00 1
7 3 2013-03-31 09:06:00 0
8 3 2013-09-26 09:04:00 0
9 3 2013-03-31 10:35:00 1
10 3 2013-09-26 09:07:00 1
11 4 2013-09-26 09:07:00 0
12 4 2013-09-26 10:07:00 1
I would like to remove the row containing the earliest TIME value whenever the same combination of ID and var appears more than once in the data, i.e. to end up with something like this:
ID2 = c(1,1,2,2,3,3,4,4)
TIME2 = as.POSIXct(c("2013-09-26 10:07:00","2013-09-26 12:07:00","2013-03-31 09:10:00",
"2013-03-31 11:11:00","2013-09-26 09:04:00","2013-09-26 09:07:00",
"2013-09-26 09:07:00","2013-09-26 10:07:00"))
var2 = c(0,1,0,1,0,1,0,1)
DF2 = data.frame(ID2, TIME2, var2)
ID2 TIME2 var2
1 1 2013-09-26 10:07:00 0
2 1 2013-09-26 12:07:00 1
3 2 2013-03-31 09:10:00 0
4 2 2013-03-31 11:11:00 1
5 3 2013-09-26 09:04:00 0
6 3 2013-09-26 09:07:00 1
7 4 2013-09-26 09:07:00 0
8 4 2013-09-26 10:07:00 1
As you can see, it is not simply about dropping the measurements performed in March 2013, since these are valid. Only the measurements that have "duplicates", i.e. that were performed again in September, should be affected (note, for example, that ID = 2 remains in DF2).
Hope you can help.
Sincerely,
ykl

Here's an option with dplyr:
library(dplyr)
DF %>% group_by(ID, var) %>% filter(n() == 1L | !TIME %in% min(TIME))
#Source: local data frame [8 x 3]
#Groups: ID, var
#
# ID TIME var
#1 1 2013-09-26 10:07:00 0
#2 1 2013-09-26 12:07:00 1
#3 2 2013-03-31 09:10:00 0
#4 2 2013-03-31 11:11:00 1
#5 3 2013-09-26 09:04:00 0
#6 3 2013-09-26 09:07:00 1
#7 4 2013-09-26 09:07:00 0
#8 4 2013-09-26 10:07:00 1
What this does:
Take the data frame DF
group it by ID and var
filter() then subsets by row: it takes a logical vector and returns the rows for which the vector is TRUE. The logic is:
1) if the group has only 1 row, i.e. n() == 1L, always return that row.
2) if the group has more than 1 row, i.e. n() > 1L, check whether each TIME value equals the minimum (earliest) TIME value of the group. The ! negates that vector, so it is FALSE wherever TIME is at its minimum and those rows are dropped.
Conditions 1) and 2) are combined with an OR (|).
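For readers without dplyr, the same keep/drop rule can be sketched in base R with ave(). This is only an illustration of the logic described above, not part of the original answer. The helper names grp, n_in_group, min_time and keep are made up for the example, and tz = "UTC" is an assumption added so the timestamps parse the same way everywhere.

```r
# Rebuild the question's data (tz = "UTC" is an assumption for reproducibility)
ID <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
TIME <- as.POSIXct(c("2013-03-31 09:07:00", "2013-09-26 10:07:00", "2013-03-31 11:07:00",
                     "2013-09-26 12:07:00", "2013-03-31 09:10:00", "2013-03-31 11:11:00",
                     "2013-03-31 09:06:00", "2013-09-26 09:04:00", "2013-03-31 10:35:00",
                     "2013-09-26 09:07:00", "2013-09-26 09:07:00", "2013-09-26 10:07:00"),
                   tz = "UTC")
var <- c(0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1)
DF <- data.frame(ID, TIME, var)

# One grouping factor per (ID, var) combination
grp <- interaction(DF$ID, DF$var, drop = TRUE)

# Group size and group-wise earliest time, repeated for every row
secs <- as.numeric(DF$TIME)
n_in_group <- ave(seq_along(grp), grp, FUN = length)
min_time <- ave(secs, grp, FUN = min)

# Keep singleton groups, otherwise keep every row that is not the earliest
keep <- n_in_group == 1L | secs != min_time
res <- DF[keep, ]
res
```

As with the dplyr version, rows tied for the earliest TIME within a group would all be dropped.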

An option using data.table
library(data.table)
setDT(DF)[ ,{if(.N==1) .SD else .SD[-which.min(TIME)]}, by=list(ID, var)]
# ID var TIME
#1: 1 0 2013-09-26 10:07:00
#2: 1 1 2013-09-26 12:07:00
#3: 2 0 2013-03-31 09:10:00
#4: 2 1 2013-03-31 11:11:00
#5: 3 0 2013-09-26 09:04:00
#6: 3 1 2013-09-26 09:07:00
#7: 4 0 2013-09-26 09:07:00
#8: 4 1 2013-09-26 10:07:00
Or a similar logical approach, as shown by @docendo discimus:
setDT(DF)[DF[,.N==1L|!TIME %in% min(TIME), by=list(ID, var)]$V1]

Related

Find un-arrangeable consecutive time intervals with exactly n days difference

I have a data as follow and I need to group them based on dates that time_right + 1 = time_left (in other rows). The group id is equal to the minimum id of those records that satisfy this condition.
input = data.frame(id = c(1:6),
time_left = c("2016-01-01", "2016-09-05", "2016-09-06","2016-09-08", "2016-09-12","2016-09-15"),
time_right = c("2016-09-07", "2016-09-11", "2016-09-12", "2016-09-14", "2016-09-18","2016-09-21"))
Input
id time_left time_right
1 1 2016-01-01 2016-09-07
2 2 2016-09-05 2016-09-11
3 3 2016-09-06 2016-09-12
4 4 2016-09-08 2016-09-14
5 5 2016-09-12 2016-09-18
6 6 2016-09-15 2016-09-21
Output:
id time_left time_right group_id
1 1 2016-01-01 2016-09-07 1
2 2 2016-09-05 2016-09-11 2
3 3 2016-09-06 2016-09-12 3
4 4 2016-09-08 2016-09-14 1
5 5 2016-09-12 2016-09-18 2
6 6 2016-09-15 2016-09-21 1
Is there any way to do it with dplyr?
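No answer to this question is included here. As a rough base R sketch (not dplyr, which the question asks for), one way to read the rule is: link each row to the row whose time_right + 1 day equals its time_left, then pass the group id down each chain. The sketch assumes each interval has at most one such predecessor and that the first interval of a chain carries the smallest id, as in the sample data. The names pred and group_id are made up for the example.

```r
input <- data.frame(
  id = 1:6,
  time_left  = as.Date(c("2016-01-01", "2016-09-05", "2016-09-06",
                         "2016-09-08", "2016-09-12", "2016-09-15")),
  time_right = as.Date(c("2016-09-07", "2016-09-11", "2016-09-12",
                         "2016-09-14", "2016-09-18", "2016-09-21"))
)

# For each row, index of the row whose time_right + 1 day equals this time_left
# (NA when the row starts a new chain)
pred <- match(as.numeric(input$time_left), as.numeric(input$time_right) + 1)

# Start with every row in its own group, then inherit the predecessor's group.
# Rows are visited in order of time_left so a predecessor is always resolved first.
group_id <- input$id
for (i in order(input$time_left)) {
  if (!is.na(pred[i])) group_id[i] <- group_id[pred[i]]
}
input$group_id <- group_id
input
```

On the sample data this reproduces the group_id column shown in the expected output.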

Min and max value based on another column and combine those in r

So I basically have a while loop function that creates 1's in the "algorithm_column" based on the highest percentages in the "percent" column, until a certain total percentage is reached (90% or something). The rest of the rows that are not taken into account will have a value of 0 in the "algorithm_column" (see: Create while loop function that takes next largest value untill condition is met).
I want to show, based on what the loop function found, the min and max times of the column "timeinterval" (the min is where the 1's start and the max is the last row with a 1; the 0's are out of scope), and then finally create a time interval from this.
So if we have the following data, I want to create in another column, let's say "total_time", a calculation from the min time 09:00 (this is where the 1's start in the algorithm_column) until 11:15, which makes a time interval of 02:15 hours added to the "total_time" column.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group respectively.
I got this function, but I'm a bit stuck if this is what I need.
test <- function(x) {
ind <- x[["algorithm$algorithm_column"]] == 0
Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE);
ind <- x[["algorithm$algorithm_column"]] == 1
Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE);
list(Mn, Mx) ## or return(list(Mn, Mx))
}
test(algorithm)

Here is a dplyr solution.
library(dplyr)
algorithm %>%
mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
filter(algorithm_column == 1) %>%
group_by(pc4, tmp) %>%
summarise(first = first(timeinterval),
last = last(timeinterval)) %>%
select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)
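The answer above returns the first and last time per group but stops short of the "total_time" interval the question asks for. A minimal base R sketch of that last step follows. It assumes a single run of 1's per pc4 and times formatted as HH:MM:SS. The helper to_secs and the column names secs and total_time are made up for the example, and only the columns needed here are rebuilt.

```r
# Reduced copy of the sample data (only the columns used below)
algorithm <- data.frame(
  pc4 = 5464,
  timeinterval = c("08:45:00", "09:00:00", "09:15:00", "09:30:00", "09:45:00", "10:00:00",
                   "10:15:00", "10:30:00", "10:45:00", "11:00:00", "11:15:00", "11:30:00"),
  algorithm_column = c(0, rep(1, 10), 0)
)

# Convert "HH:MM:SS" strings to seconds since midnight
to_secs <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Keep only the rows flagged by the algorithm, then take max - min per pc4
sel <- algorithm[algorithm$algorithm_column == 1, ]
sel$secs <- to_secs(sel$timeinterval)
span <- aggregate(secs ~ pc4, data = sel, FUN = function(s) max(s) - min(s))

# Format the span in seconds as HH:MM
span$total_time <- sprintf("%02d:%02d",
                           as.integer(span$secs %/% 3600),
                           as.integer((span$secs %% 3600) %/% 60))
span
```

With several pc4 groups in the data, aggregate() would return one row per group.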

How to group time by every n minutes in R

I have a dataframe with a lot of time series:
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1
I want to group the time series into every 6 minutes and count the frequency of A and B:
1 0:06 A 2
2 0:06 B 2
3 0:12 A 1
4 0:12 B 1
5 0:18 A 1
6 0:24 A 1
7 0:24 B 1
8 0:18 A 1
9 0:30 A 1
Also, the class of the time series is character. What should I do?

Here's an approach to convert times to POSIXct, cut the times by 6 minute intervals, then count.
First, you need to specify the year, month, day, hour, minute, and seconds of your data. This will help with scaling it to larger datasets.
library(tidyverse)
library(lubridate)
# sample data
d <- data.frame(t = paste0("2019-06-02 ",
c("0:03","0:06","0:09","0:12","0:15",
"0:18","0:21","0:24","0:27","0:30"),
":00"),
g = c("A","A","B","B","B"))
d$t <- ymd_hms(d$t) # convert to POSIXct with `lubridate::ymd_hms()`
If you check the class of your new date column, you will see it is "POSIXct".
> class(d$t)
[1] "POSIXct" "POSIXt"
Now that the data is in "POSIXct", you can cut it by minute intervals! We will add this new grouping factor as a new column called tc.
d$tc <- cut(d$t, breaks = "6 min")
d
t g tc
1 2019-06-02 00:03:00 A 2019-06-02 00:03:00
2 2019-06-02 00:06:00 A 2019-06-02 00:03:00
3 2019-06-02 00:09:00 B 2019-06-02 00:09:00
4 2019-06-02 00:12:00 B 2019-06-02 00:09:00
5 2019-06-02 00:15:00 B 2019-06-02 00:15:00
6 2019-06-02 00:18:00 A 2019-06-02 00:15:00
7 2019-06-02 00:21:00 A 2019-06-02 00:21:00
8 2019-06-02 00:24:00 B 2019-06-02 00:21:00
9 2019-06-02 00:27:00 B 2019-06-02 00:27:00
10 2019-06-02 00:30:00 B 2019-06-02 00:27:00
Now you can group_by this new interval (tc) and your grouping column (g), and count the frequency of occurrences. Getting the frequency of observations in a group is a fairly common operation, so dplyr provides count for this:
count(d, g, tc)
# A tibble: 7 x 3
g tc n
<fct> <fct> <int>
1 A 2019-06-02 00:03:00 2
2 A 2019-06-02 00:15:00 1
3 A 2019-06-02 00:21:00 1
4 B 2019-06-02 00:09:00 2
5 B 2019-06-02 00:15:00 1
6 B 2019-06-02 00:21:00 1
7 B 2019-06-02 00:27:00 2
If you run ?dplyr::count in the console, you'll see that count(d, g, tc) is roughly equivalent to group_by(d, g, tc) %>% summarise(n = n()).

According to the sample dataset, the time series is given as time-of-day, i.e., without date.
The data.table package has the ITime class, a time-of-day class stored as the integer number of seconds in the day. With data.table, we can use a rolling join to map times to the upper limit of the 6-minute intervals (right-closed intervals):
library(data.table)
# coerce from character to class ITime
setDT(ts)[, time := as.ITime(time)]
# create sequence of breaks
breaks <- as.ITime(seq(as.ITime("0:00"), as.ITime("23:59:59"), as.ITime("0:06")))
# rolling join and aggregate
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = -Inf, .(x.breaks, group)
][, .N, by = .(upper = x.breaks, group)]
which returns
upper group N
1: 00:06:00 B 2
2: 00:06:00 A 2
3: 00:12:00 A 1
4: 00:12:00 B 1
5: 00:18:00 B 1
6: 00:18:00 A 1
7: 00:24:00 A 1
8: 00:24:00 B 1
9: 00:30:00 A 1
Addendum
If the direction of the rolling join is changed (roll = +Inf instead of roll = -Inf) we get left-closed intervals
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = +Inf, .(x.breaks, group)
][, .N, by = .(lower = x.breaks, group)]
which changes the result significantly:
lower group N
1: 00:00:00 B 2
2: 00:00:00 A 2
3: 00:06:00 A 1
4: 00:06:00 B 1
5: 00:12:00 B 1
6: 00:18:00 A 2
7: 00:18:00 B 1
8: 00:30:00 A 1
Data
library(data.table)
ts <- fread("
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1"
, header = FALSE
, col.names = c("rn", "time", "group", "value"))
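The right-closed binning used in the rolling join can also be sketched in base R by rounding minutes up to the next multiple of 6. This is an alternative illustration, not the method of either answer. The names obs, mins, upper and n are made up for the example, and a time of exactly 0:00 would land in a bin labelled 0.

```r
# Sample data from the question (times are time-of-day strings)
obs <- data.frame(
  time  = c("0:03", "0:05", "0:05", "0:05", "0:10", "0:10",
            "0:14", "0:18", "0:20", "0:23", "0:30"),
  group = c("B", "A", "A", "B", "A", "B", "B", "A", "A", "B", "A")
)

# Minutes since midnight
hm <- do.call(rbind, strsplit(as.character(obs$time), ":", fixed = TRUE))
mins <- as.numeric(hm[, 1]) * 60 + as.numeric(hm[, 2])

# Upper limit of the right-closed 6-minute bin, e.g. 0:05 -> 6, 0:18 -> 18
obs$upper <- ceiling(mins / 6) * 6

# Count rows per (upper, group)
obs$n <- 1L
res <- aggregate(n ~ upper + group, data = obs, FUN = sum)
res <- res[order(res$upper, res$group), ]
res
```

Replacing ceiling() with floor() gives the left-closed variant shown in the addendum.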

replace missing with na.locf0

I am trying to fill missing values using the zoo package.
My data set is as follows:
a=c("2017-01-12 00:00:00","2017-01-12 00:03:00","2017-01-12 00:08:00",
"2017-01-12 00:11:00","2017-01-12 00:14:00","2017-01-12 04:59:00","2017-01-12 05:10:00",
"2017-01-12 05:30:00")
b=c(NA,NA,1,NA,0,NA,1,NA)
df =data.frame(a,b)
To fill the missing values I am trying:
df$new = na.locf0(df$b,fromLast=F)
The output should be:
a b new
1/12/2017 0:00 NA 0
1/12/2017 0:03 NA 0
1/12/2017 0:08 1 1
1/12/2017 0:11 NA 1
1/12/2017 0:14 0 0
1/12/2017 4:59 NA 0
1/12/2017 5:10 1 1
1/12/2017 5:30 NA 1
Thanks in advance.

na.locf0 (correctly) does not fill in components for which there is no prior value. If you want to fill in those with some particular value then use na.fill. (In the development version of zoo na.fill0 will also work.)
transform(df, new = na.fill(na.locf0(b), 0))
giving:
a b new
1 2017-01-12 00:00:00 NA 0
2 2017-01-12 00:03:00 NA 0
3 2017-01-12 00:08:00 1 1
4 2017-01-12 00:11:00 NA 1
5 2017-01-12 00:14:00 0 0
6 2017-01-12 04:59:00 NA 0
7 2017-01-12 05:10:00 1 1
8 2017-01-12 05:30:00 NA 1

We can use
df$new <- na.locf(df$b,fromLast=F, na.rm = FALSE)
df$new[is.na(df$new)] <- 0
df$new
#[1] 0 0 1 1 0 0 1 1

Option 1
Using zoo::na.locf0
library(zoo);
library(tidyverse);
df %>% mutate(b = na.locf0(b), b = replace(b, is.na(b), 0))
# a b
#1 2017-01-12 00:00:00 0
#2 2017-01-12 00:03:00 0
#3 2017-01-12 00:08:00 1
#4 2017-01-12 00:11:00 1
#5 2017-01-12 00:14:00 0
#6 2017-01-12 04:59:00 0
#7 2017-01-12 05:10:00 1
#8 2017-01-12 05:30:00 1
Option 2
Using tidyr::fill
df %>% fill(b) %>% mutate(b = replace(b, is.na(b), 0))
# a b
#1 2017-01-12 00:00:00 0
#2 2017-01-12 00:03:00 0
#3 2017-01-12 00:08:00 1
#4 2017-01-12 00:11:00 1
#5 2017-01-12 00:14:00 0
#6 2017-01-12 04:59:00 0
#7 2017-01-12 05:10:00 1
#8 2017-01-12 05:30:00 1
Explanation: Both zoo::na.locf0 and tidyr::fill fill NA entries based on previous entries (by default top down); the last replace step replaces leading NA values with 0 (since there are no previous entries, these NAs cannot be filled).
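To make the two steps concrete (carry the last observation forward, then replace the leading NAs), here is a small base R sketch that needs neither zoo nor tidyr. The function name locf_with_default and its default argument are made up for the example. It is not how zoo or tidyr implement the fill.

```r
locf_with_default <- function(x, default = 0) {
  # Position of the most recent non-NA value for each element (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Look the values up; positions with no previous value stay NA
  out <- x[replace(idx, idx == 0, NA)]
  # Leading NAs have nothing to carry forward, so use the default
  out[is.na(out)] <- default
  out
}

b <- c(NA, NA, 1, NA, 0, NA, 1, NA)
filled <- locf_with_default(b)
filled
```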

R: Count by id, number of occurences in a predefined time interval

I want to compute a column that counts the number of occurrences looking backward in a predefined time interval (e.g. 2 days) for a particular ID.
I have the following data structure (see code below) in R and want to compute the column countLast2d automatically:
userID <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3)
datetime <-c("2015-07-02 13:20:00", "2015-07-03 13:20:00", "2015-07-04 01:20:00",
"2015-07-10 01:20:00", "2015-07-23 01:20:00", "2015-07-23 06:08:00", "2015-07-24 06:08:00",
"2015-09-02 09:01:00", "2015-08-19 11:41:00", "2015-08-19 14:38:00", "2015-08-19 17:36:00",
"2015-08-19 20:33:00", "2015-08-19 23:30:00", "2015-08-19 23:46:00", "2015-08-19 05:19:00",
"2015-09-13 17:02:00", "2015-10-01 00:32:00", "2015-10-01 00:50:00")
The outcome should take on these values:
countLast2d <- c(0,1,2,0,0,1,2,0,0,1,0,0,0,1,0,0,0,1)
df <- data.frame(userID, countLast2d, datetime)
df$datetime = as.POSIXct(strptime(df$datetime, format = "%Y-%m-%d %H:%M:%S"))
In Excel, I would use the following formula:
=countifs([datecolumn],"<"&[date cell in that row],[datecolumn],">="&[date cell in that row]-2,[idcolumn],[id cell in that row])
(So for example [C2]=+COUNTIFS($B:$B,"<"&$B2,$B:$B,">="&$B2-2,$A:$A,$A2), if Column A contains the id and column B the date)
I already asked that question once before (https://stackoverflow.com/questions/30998596/r-count-number-of-occurences-by-id-in-the-last-48h) but didn't include an example in my question. So sorry for asking again.

Here's a solution:
df <- data.frame(userID=c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3),datetime=as.POSIXct(c('2015-07-02 13:20:00','2015-07-03 13:20:00','2015-07-04 01:20:00','2015-07-10 01:20:00','2015-07-23 01:20:00','2015-07-23 06:08:00','2015-07-24 06:08:00','2015-09-02 09:01:00','2015-08-19 11:41:00','2015-08-19 14:38:00','2015-08-19 17:36:00','2015-08-19 20:33:00','2015-08-19 23:30:00','2015-08-19 23:46:00','2015-08-19 05:19:00','2015-09-13 17:02:00','2015-10-01 00:32:00','2015-10-01 00:50:00')));
window <- as.difftime(2,units='days');
df$countLast2d <- sapply(1:nrow(df),function(r) sum(df$userID==df$userID[r] & df$datetime<df$datetime[r] & df$datetime>=df$datetime[r]-window));
df;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 15 2 2015-08-19 05:19:00 0
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1
Note that this differs from your expected output because your expected output is incorrect for userID==2.
This solution will work regardless of the ordering of df, which is essential for your example df because it is unordered (or at least not perfectly ordered) for userID==2.
Edit: Here's another possibility, using by() to group by userID and comparing each element only against lesser-index elements, under the assumption that only those elements can be in the lookback window (which holds once the data is sorted by userID and datetime):
df2 <- df[order(df$userID,df$datetime),];
df2$countLast2d <- do.call(c,by(df2$datetime,df2$userID,function(x) c(0,sapply(2:length(x),function(i) sum(x[1:(i-1)]>=x[i]-window)))));
df2;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 15 2 2015-08-19 05:19:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1
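Both versions above compare each row against many others, which can get slow on large data. As a hedged alternative, once the times are sorted within each user, the count of earlier events inside the window can be derived from two findInterval() calls. The helper name count_window is made up for the example, and tz = "UTC" is an assumption added so that a "day" is always 86400 seconds.

```r
df <- data.frame(
  userID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  datetime = as.POSIXct(c(
    "2015-07-02 13:20:00", "2015-07-03 13:20:00", "2015-07-04 01:20:00",
    "2015-07-10 01:20:00", "2015-07-23 01:20:00", "2015-07-23 06:08:00",
    "2015-07-24 06:08:00", "2015-09-02 09:01:00", "2015-08-19 11:41:00",
    "2015-08-19 14:38:00", "2015-08-19 17:36:00", "2015-08-19 20:33:00",
    "2015-08-19 23:30:00", "2015-08-19 23:46:00", "2015-08-19 05:19:00",
    "2015-09-13 17:02:00", "2015-10-01 00:32:00", "2015-10-01 00:50:00"),
    tz = "UTC")
)

# For sorted numeric times t, count the events in [t - w, t) for each element:
# (number of times strictly before t) - (number strictly before t - w)
count_window <- function(t, w) {
  findInterval(t, t, left.open = TRUE) - findInterval(t - w, t, left.open = TRUE)
}

# Sort within user first; findInterval() requires a sorted lookup vector
df2 <- df[order(df$userID, df$datetime), ]
secs <- as.numeric(df2$datetime)
df2$countLast2d <- ave(secs, df2$userID,
                       FUN = function(t) count_window(t, 2 * 86400))
df2
```

The result is in sorted order, like df2 in the by() version, and rows sharing the same timestamp do not count each other because the upper bound is strict.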
