I'm currently working in data.table in R with the following data set:
id age_start age_end cases
1 2 2 1000
1 3 3 500
1 4 4 300
1 2 4 1800
2 2 2 8000
2 3 3 200
2 4 4 100
In the given data set I only want the values of cases where age_start == 2 and age_end == 4.
For each ID that has no row with age_start == 2 and age_end == 4, I need to sum or aggregate the rows to create such a group. In these cases I'd need to sum up the cases of age_start==2 & age_end==2, age_start==3 & age_end==3, and age_start==4 & age_end==4 into one new row with age_start==2 and age_end==4.
After these are summed up into one row, I want to drop the rows that I used to make the new age_start==2 and age_end==4 row (i.e. the age ranges 2-2, 3-3, and 4-4), as they are no longer needed.
Ideally the data set would look like this when I finish these steps:
id age_start age_end cases
1 2 4 1800
2 2 4 8300
Any suggestions on how to accomplish this in data.table are greatly appreciated!
You can use an equi-join for the first requirement (IDs that already have an age 2-4 row) and a non-equi join for the second (IDs whose rows need to be aggregated):
m_equi = x[.(id = unique(id), age_dn = 2, age_up = 4),
  on = .(id, age_start = age_dn, age_end = age_up),
  nomatch = 0
]
m_nonequi = x[!m_equi, on = .(id)][.(id = unique(id), age_dn = 2, age_up = 4),
  on = .(id, age_start >= age_dn, age_end <= age_up),
  .(cases = sum(cases)), by = .EACHI
]
res = rbind(m_equi, m_nonequi)
id age_start age_end cases
1: 1 2 4 1800
2: 2 2 4 8300
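For reference, x above is assumed to be the question's table stored as a data.table; a minimal setup built from the data shown in the question:

library(data.table)
x <- data.table(
  id        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  age_start = c(2L, 3L, 4L, 2L, 2L, 3L, 4L),
  age_end   = c(2L, 3L, 4L, 4L, 2L, 3L, 4L),
  cases     = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)
)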
How it works:
x[i] uses values in i to look up rows and columns in x according to rules specified in on=.
nomatch=0 means unmatched rows of i in x[i] are dropped, so m_equi only ends up with id=1.
x[!m_equi, on=.(id)] is an anti-join that skips id=1 since we already matched it in the equi join.
by=.EACHI groups by each row of i in x[i] for the purpose of doing the aggregation.
An alternative would be to anti-join on rows with start 2 and end 4 so that all groups need to be aggregated (similar to #akrun's answer), though I guess that would be less efficient.
We can specify i with the logical condition, then, grouped by 'id', get the sum of 'cases' while adding 'age_start' and 'age_end' as 2 and 4:
library(data.table)
as.data.table(df1)[age_start != 2 | age_end != 4,
    .(age_start = 2, age_end = 4, cases = sum(cases)), id]
# id age_start age_end cases
#1: 1 2 4 1800
#2: 2 2 4 8300
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age_start = c(2L,
3L, 4L, 2L, 2L, 3L, 4L), age_end = c(2L, 3L, 4L, 4L, 2L, 3L,
4L), cases = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)),
class = "data.frame", row.names = c(NA,
-7L))
Related
How do I find number of continuous weeks by group but counted from the max date in the dataset?
Say I have this dataframe:
id Week
1 A 2/06/2019
2 A 26/05/2019
3 A 19/05/2019
4 A 12/05/2019
5 A 5/05/2019
6 B 2/06/2019
7 B 26/05/2019
8 B 12/05/2019
9 B 5/05/2019
10 C 26/05/2019
11 C 19/05/2019
12 C 12/05/2019
13 D 2/06/2019
14 D 26/05/2019
15 D 19/05/2019
16 E 2/06/2019
17 E 19/05/2019
18 E 12/05/2019
19 E 5/05/2019
My desired output is:
id count
1: A 5
2: B 2
3: D 3
4: E 1
I am currently converting the dates into a factor to get ordered numbers and checking them against a reference number created from the number of rows in each group.
library(data.table)
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
.Label = c("A", "B", "C", "D", "E"), class = "factor"),
Week = structure(c(3L, 4L, 2L, 1L, 5L, 3L, 4L, 1L, 5L, 4L, 2L, 1L, 3L, 4L, 2L, 3L, 2L, 1L, 5L),
.Label = c("12/05/2019", "19/05/2019", "2/06/2019", "26/05/2019", "5/05/2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -19L))
dt <- data.table(df)
dt[, Week_no := as.factor(as.Date(Week, format = "%d/%m/%Y"))]
dt[, Week_no := factor(Week_no)]
dt[, Week_no := as.numeric(Week_no)]
max_no <- max(dt$Week_no)
dt[, Week_ref := max_no:(max_no - .N + 1), by = "id"]
dt[, Week_diff := Week_no - Week_ref]
dt[Week_diff == 0, list(count = .N), by = "id"]
Here's one way to do this:
dt <- dt[, Week := as.Date(Week, format = "%d/%m/%Y")]
ids_having_max <- dt[.(max(Week)), id, on = "Week"]
dt <- dt[.(ids_having_max), on = "id"
][order(-Week), .(count = sum(rleid(c(-7L, diff(Week))) == 1)), by = "id"]
Breaking it into steps:
We convert Week to a Date and keep it as one, because Dates can already be compared,
and you can subtract them to get time differences.
We then get all the ids that contain the maximum date in the whole table.
This is using secondary indices.
We use secondary indices again to filter out those ids that were not part of the previous result
(the dt[.(ids_having_max), on = "id" part).
The last frame is tricky.
We group by id and make sure that rows are ordered by Week in descending order.
Then the logic is as follows.
When you have contiguous weeks,
diff(Week) is always -7 with the chosen sorting.
Computing diff returns a vector that is one element shorter, because each result is the difference between consecutive elements,
so we prepend a -7 to make sure that the first element in the input to rleid is also -7.
With rleid we assign a 1 to the first -7 and keep the 1 until we see something different from -7.
Something different means weeks stopped being contiguous.
The sum(rleid(c(-7L, diff(Week))) == 1) will simply return how many rows had a rleid equal to 1.
Example of the last part for B:
Differences: -7, -14, -7
After prepending -7: -7, -7, -14, -7
After rleid: 1, 1, 2, 3
From the previous, two elements had an rleid equal to 1, so the count for B is 2.
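To see this in code, the same computation for B in isolation (a small standalone sketch with B's weeks typed by hand):

library(data.table)
weeks_B <- as.Date(c("2019-06-02", "2019-05-26", "2019-05-12", "2019-05-05")) # descending
diff(weeks_B)                           # differences: -7 -14 -7
rleid(c(-7L, diff(weeks_B)))            # run ids after prepending -7: 1 1 2 3
sum(rleid(c(-7L, diff(weeks_B))) == 1)  # 2 rows belong to the first run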
Apologies for dplyr solution, but I presume a similar approach can be achieved more concisely with data.table.
library(dplyr)
df$Week = lubridate::dmy(df$Week)
df %>%
  group_by(id) %>%
  arrange(id, Week) %>%
  # Assign a group to each new streak; lag() on a Date column needs a Date default
  # in current dplyr, so a date far in the past is used rather than 0
  mutate(new_streak = cumsum(Week != lag(Week, default = as.Date("1900-01-01")) + 7)) %>%
  add_count(id, new_streak) %>%
  slice(n()) # Only keep last week
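As written this keeps the last row of every id (including C), with the streak length in the n column added by add_count(). To reach the desired output one would, presumably, also drop ids whose latest Week is not the overall maximum; a hedged sketch of that final step, assuming the pipeline above was stored in res:

res %>%
  ungroup() %>%
  filter(Week == max(Week)) %>% # keep only ids whose streak runs up to the latest week
  select(id, count = n)         # n is the streak length added by add_count()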
So I would suggest converting the format of the date column to show the week number "%W", as follows
dt[, Week_no := format(as.Date(Week, format = "%d/%m/%Y"),"%W")]
Then count the number of unique week numbers for each id value
dt[, length(unique(Week_no)), by = "id"]
FULL DISCLOSURE
I realise that when I run this I get a different table from the one you present, as R counts weeks by the week number within the given year.
If this doesn't answer your question, just let me know and I can try to update.
I have two dataframes: one with ID, DATE, and the name of the DRUG; another with ID and the date of an event, date.event.
1. Expected column prev_drug:
How can I count the number of different drugs prior to the current date? For example, for ID=1, prev_drug for row 4 is 2, because it has two drugs (A, B) different from drug C prior to the DATE of row 4.
2. Expected column event.30d.prior:
For each ID and each DATE in the first data frame, how many events happened during the 30 days prior to the DATE? E.g. for row 2, the event for id=1 happened at 1/20/2001, which falls into the 30 days prior to 2/1/2001.
ID DATE DRUG prev_drug event.30d.prior
1 1/1/2001 A 0 0
1 2/1/2001 A 0 1
1 3/15/2001 B 1 0
1 4/20/2001 C 2 1
1 5/29/2001 A 2 0
1 5/2/2001 B 2 0
2 3/2/2001 A 0 1
2 3/23/2001 C 1 1
2 4/4/2001 D 2 0
2 5/5/2001 B 3 0
ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001
Here is a solution using base R with some dplyr methods. It is not the cleanest or best solution, but it should solve your problem.
df<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
DATE = structure(c(11323, 11354, 11396, 11432, 11471, 11444,
11383, 11404, 11416, 11447), class = "Date"), DRUG = structure(c(1L,
1L, 2L, 3L, 1L, 2L, 1L, 3L, 4L, 2L), .Label = c("A", "B",
"C", "D"), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
#Note DATE was converted to a Date object with the following line
#df$DATE<-as.Date(df$DATE, "%m/%d/%Y")
date.event<-read.table(header=TRUE, text="ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001")
date.event$date.event<-as.Date(date.event$date.event, "%m/%d/%Y")
library(dplyr)
# calculate prev_drug: the number of distinct drugs seen so far within each ID, minus one
df <- df %>% group_by(ID) %>% mutate(prev_drug = cumsum(!duplicated(DRUG)) - 1)
# loop over each row of df, keeping only the events with the matching ID
event.30d.prior <- sapply(1:nrow(df), function(i){
  events <- date.event[date.event$ID == df$ID[i], "date.event"]
  sum(between(events, df$DATE[i] - 30, df$DATE[i]))
})
finalanswer<-cbind(df, event.30d.prior=unlist(event.30d.prior))
My data looks like this:
dfin <-
ID TIME CONC STATUS
1 0 5 0
1 1 4 1
1 2 3 0
2 0 2 0
2 10 2 0
2 15 1 0
I want to subset dfin for the first occurrence (for each ID) where STATUS==1 and TIME > 0. If a subject ID has no STATUS==1 recorded at any time, then I need to subset the last row of that subject.
the output here should be:
dfout <-
ID TIME CONC STATUS
1 1 4 1
2 15 1 0
One way with dplyr: we group_by ID and check whether any row satisfies our condition (STATUS == 1 & TIME > 0). If one does, we take the first row which satisfies the condition using which.max; if there is no such row, we just return the last row using n().
library(dplyr)
df %>%
group_by(ID) %>%
slice(ifelse(any(STATUS == 1 & TIME > 0), which.max(STATUS == 1 & TIME > 0), n()))
# ID TIME CONC STATUS
# <int> <int> <int> <int>
#1 1 1 4 1
#2 2 15 1 0
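The piece doing the work here is which.max() on a logical vector: it returns the position of the first TRUE, which is why the any() guard is needed for groups where no row qualifies. A quick illustration:

which.max(c(FALSE, TRUE, TRUE))   # 2 -- position of the first TRUE
which.max(c(FALSE, FALSE, FALSE)) # 1 -- no TRUE at all, hence the any() guard above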
Another approach using only base R. This follows the same logic as the dplyr version, but uses ave, which returns a vector of the same length as the input: for each ID we mark either the first row satisfying the condition or, if none does, the last row of the group, and then subset with that logical vector.
df[with(df, ave(STATUS == 1 & TIME > 0, ID, FUN = function(x)
  if(any(x)) seq_along(x) == which.max(x) else seq_along(x) == length(x))), ]
# ID TIME CONC STATUS
#2 1 1 4 1
#6 2 15 1 0
Here is one approach with data.table. Convert the data.frame to a 'data.table' (setDT(dfin)); then, grouped by 'ID', if any 'STATUS' is 1, use the logical expression where 'STATUS' is 1 and 'TIME' is greater than 0, otherwise take the last row (.N), and subset .SD with it:
library(data.table)
setDT(dfin)[, .SD[if(any(STATUS == 1)) STATUS == 1 & TIME > 0 else .N], ID]
# ID TIME CONC STATUS
#1: 1 1 4 1
#2: 2 15 1 0
It can also be written as
setDT(dfin)[, .SD[(STATUS == 1 & TIME > 0) | (!any(STATUS) & seq_len(.N) == .N)], ID]
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TIME = c(0L, 1L,
2L, 0L, 10L, 15L), CONC = c(5L, 4L, 3L, 2L, 2L, 1L), STATUS = c(0L,
1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
I have a dataframe as follows:
position_time telematic_trip_no lat_dec lon_dec
1 2016-06-05 00:00:01 526132109 -26.6641 27.8733
2 2016-06-05 00:00:01 526028387 -26.6402 27.8059
3 2016-06-05 00:00:01 526081476 -26.5545 28.3263
4 2016-06-05 00:00:04 526140512 -26.5310 27.8704
5 2016-06-05 00:00:05 526140518 -26.5310 27.8704
6 2016-06-05 00:00:19 526006880 -26.5010 27.8490
is_stolen hour_of_day time_of_day day_of_week lat_min
1 0 0 0 Sunday -26.6651
2 0 0 0 Sunday -26.6412
3 0 0 0 Sunday -26.5555
4 0 0 0 Sunday -26.5320
5 0 0 0 Sunday -26.5320
6 0 0 0 Sunday -26.5020
lat_max lon_max lon_min
1 -26.6631 27.8743 27.8723
2 -26.6392 27.8069 27.8049
3 -26.5535 28.3273 28.3253
4 -26.5300 27.8714 27.8694
5 -26.5300 27.8714 27.8694
6 -26.5000 27.8500 27.8480
Now what I want to do is count for each line where is_stolen = 1, the number of rows in the dataframe that fulfill the following conditions:
the lat_dec and lon_dec are between the lat_max, lat_min, lon_max and lon_min (i.e. fit within the 'box' around that GPS point)
the time_of_day and day_of_week are the same as that of the row of interest
the telematic_trip_no of the matching rows needs to be different from that of the row of interest
and finally the is_stolen tag of the matching rows needs to be equal to 0
I've written a script to do this using a for loop, but it ran very slowly, which got me wondering whether there's an efficient way to do complex row counts with many conditions using something like dplyr or data.table.
ps If you're curious I am indeed trying to calculate how many cars a stolen-car passes during a typical trip :)
Given your description of the problem, the following should work
library(dplyr)
library(stats)
# df is the data.frame (see below)
df <- cbind(ID=seq_len(nrow(df)),df)
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times = length(r.stolen)), ] %>%
  setNames(., paste0(names(.), "_not")) %>%
  bind_cols(df[rep(r.stolen, each = length(r.not)), ], .) %>%
  mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                                 time_of_day == time_of_day_not &
                                 day_of_week == day_of_week_not &
                                 lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                                 lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
  group_by(ID) %>%
  summarise(count = sum(in_range)) %>%
  arrange(desc(count)))
The first line just adds a column named ID to df that identifies the row by its row number that we can later dplyr::group_by to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key is to:
replicate each row of stolen cars N times where N is the number of not-stolen car rows,
replicate the rows of not-stolen cars (as a block) M times where M is the number of stolen car rows, and
append the result of (2) to (1) as new columns and change the names of these new columns so that we can reference them in the condition
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in an array fashion (a toy illustration of the row replication follows the list below). The dplyr piped workflow that is the fourth line of the code (wrapped in a print()) does this:
the first command replicates the not-stolen car rows using times
the second command appends _not to the column names to distinguish them from the stolen car columns when we bind the columns. Thanks to this SO answer for that gem.
the third command replicates the stolen car rows using each and appends the previous result as new columns using dplyr::bind_cols
the fourth command uses dplyr::mutate to create a new column named in_range that is the result of applying the condition. The boolean result is converted to {0,1} to allow for easy accumulation
the rest of the commands in the pipe do the counting of in_range grouped by ID and arrange the results in decreasing order of the count. Note that now ID is the column that identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not is the column for rows where is_stolen = 0
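To make the row replication in steps (1)-(3) concrete, here is a tiny standalone illustration with made-up 2-row and 3-row frames (purely hypothetical data, not from the question):

stolen     <- data.frame(s_id = 1:2)
not_stolen <- data.frame(n_id = 1:3)
# each stolen row is repeated once per not-stolen row (each), and the not-stolen
# block is repeated once per stolen row (times): 2 x 3 = 6 rows, i.e. all pairs
pairs <- cbind(stolen[rep(1:2, each = 3), , drop = FALSE],
               not_stolen[rep(1:3, times = 2), , drop = FALSE])
pairs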
This assumes that you want the count for each row that is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each telematic_trip_no that is stolen, then you can use
group_by(telematic_trip_no) %>%
in the pipe instead.
I've tested this using the following data snippet
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
"2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
"2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
"2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
-26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
-26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
"Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
-26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
-26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
-26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
-26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
"telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
"time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
"lon_min"), class = "data.frame", row.names = c(NA, -13L))
Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows that are all is_stolen = 0:
the first added row with telematic_trip_no = 526005880 violates the longitude condition for all not-stolen rows, so its count should be 0
the second added row with telematic_trip_no = 526006880 violates the latitude condition for all not-stolen rows, so its count should be 0
the third added row with telematic_trip_no = 526007880 violates the telematic_trip_no condition for all not-stolen rows, so its count should be 0
the fourth added row with telematic_trip_no = 526006890 satisfies the condition for rows 4 and 5 that are not-stolen, so its count should be 2
the fifth added row with telematic_trip_no = 526106880 satisfies the condition for row 6 that is not-stolen, so its count should be 1
the sixth added row with telematic_trip_no = 526017880 violates the time_of_day condition for all not-stolen rows, so its count should be 0
the seventh added row with telematic_trip_no = 526027880 violates the day_of_week condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:
# A tibble: 7 x 2
ID count
<int> <dbl>
1 10 2
2 11 1
3 7 0
4 8 0
5 9 0
6 12 0
7 13 0
which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.
If one were to group by telematic_trip_no instead, we get the result:
# A tibble: 7 x 2
telematic_trip_no count
<int> <dbl>
1 526006890 2
2 526106880 1
3 526005880 0
4 526006880 0
5 526007880 0
6 526017880 0
7 526027880 0
As a caveat, the above approach does cost memory. Worst case the number of rows grows to N^2/4 where N is the number of rows in the original data frame, and the number of columns doubles for the data frame that is used to evaluate the condition. As with most array processing techniques, there is a trade between speed and memory.
Hope this helps.
The current development version of data.table, v1.9.7, has a new feature, non-equi joins, which makes conditional joins quite straightforward. Using #aichao's data:
require(data.table) # v1.9.7+
setDT(df)[, ID := .I] # add row numbers
not_stolen = df[is_stolen == 0L]
is_stolen = df[is_stolen == 1L]
not_stolen[is_stolen,
           .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
           on = .(time_of_day, day_of_week, lat_min <= lat_dec,
                  lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
           by = .EACHI][, .(ID, N)]
# ID N
# 1: 7 NA
# 2: 8 NA
# 3: 9 0
# 4: 10 2
# 5: 11 1
# 6: 12 NA
# 7: 13 NA
The part not_stolen[is_stolen, performs a subset-like join operation, i.e., for each row in is_stolen, the matching row indices (based on the condition provided to the on= argument) are extracted.
by = .EACHI ensures that, for each row of the i (first) argument, here is_stolen, the expression provided in j (the second argument), .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)), is evaluated on the corresponding matching rows. That returns the result shown above.
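The NA values are stolen rows for which no not-stolen row satisfied the join conditions at all; if those should instead read as a count of 0 (an assumption about the desired output, not something stated in the question or the answer above), the result can be assigned and patched:

res = not_stolen[is_stolen,
                 .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
                 on = .(time_of_day, day_of_week, lat_min <= lat_dec,
                        lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
                 by = .EACHI][, .(ID, N)]
res[is.na(N), N := 0L] # treat "no candidate rows at all" as passing 0 cars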
HTH.
I'm trying to keep/split groups in a data frame which meet a condition for a specific row; the data.frame looks like:
COW PARITY DFC ABCS
1 1 1 0.5
1 1 2 1
1 1 3 0.25
1 2 1 -0.3
1 2 2 0.5
I would like to create groups with the same value of COW and PARITY for which
ABCS > 0 when DFC == 1.
I tried group_by + filter but I'm unable to split correctly.
You could try with logical subsetting, like this:
df1[df1$COW==df1$PARITY & df1$ABCS>0 & df1$DFC==1,]
# COW PARITY DFC ABCS
#1 1 1 1 0.5
This considers three conditions, connected with a logical AND (&). First, the value of COW and PARITY should be equal, then the value of ABCS should be greater than 0, and finally the value of DFC should be equal to one. In the example posted above, only one observation (row) fulfills these three conditions.
edit
Following the suggestion by #docendodiscimus the command can be shortened and rendered more legible by using with(), for instance like this:
df1[with(df1,COW==PARITY & ABCS>0 & DFC==1),]
data
df1 <- structure(list(COW = c(1L, 1L, 1L, 1L, 1L),
PARITY = c(1L, 1L, 1L, 2L, 2L), DFC = c(1L, 2L, 3L, 1L, 2L),
ABCS = c(0.5, 1, 0.25, -0.3, 0.5)),
.Names = c("COW", "PARITY", "DFC", "ABCS"),
class = "data.frame", row.names = c(NA,-5L))