Counting rows based upon conditional grouping with dplyr - r
I have a dataframe as follows:
position_time telematic_trip_no lat_dec lon_dec
1 2016-06-05 00:00:01 526132109 -26.6641 27.8733
2 2016-06-05 00:00:01 526028387 -26.6402 27.8059
3 2016-06-05 00:00:01 526081476 -26.5545 28.3263
4 2016-06-05 00:00:04 526140512 -26.5310 27.8704
5 2016-06-05 00:00:05 526140518 -26.5310 27.8704
6 2016-06-05 00:00:19 526006880 -26.5010 27.8490
is_stolen hour_of_day time_of_day day_of_week lat_min
1 0 0 0 Sunday -26.6651
2 0 0 0 Sunday -26.6412
3 0 0 0 Sunday -26.5555
4 0 0 0 Sunday -26.5320
5 0 0 0 Sunday -26.5320
6 0 0 0 Sunday -26.5020
lat_max lon_max lon_min
1 -26.6631 27.8743 27.8723
2 -26.6392 27.8069 27.8049
3 -26.5535 28.3273 28.3253
4 -26.5300 27.8714 27.8694
5 -26.5300 27.8714 27.8694
6 -26.5000 27.8500 27.8480
Now what I want to do is count for each line where is_stolen = 1, the number of rows in the dataframe that fulfill the following conditions:
the lat_dec and lon_dec are between the lat_max, lat_min, lon_max and lon_min (i.e. fit within the 'box' around that GPS point)
the time_of_day and day_of_week are the same as those of the row of interest
the telematic_trip_no of the rows needs to be different from that of the row of interest
and finally the is_stolen tag of the matching rows needs to be equal to 0
I've written a script to do this using a for loop, but it ran very slowly, which got me wondering whether there's an efficient way to do complex row counts with many conditions using something like dplyr or data.table?
PS: If you're curious, I am indeed trying to calculate how many cars a stolen car passes during a typical trip :)
Given your description of the problem, the following should work:
library(dplyr)
# df is the data.frame (see below)
df <- cbind(ID = seq_len(nrow(df)), df)  # tag each row with its row number
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times = length(r.stolen)), ] %>%
        setNames(., paste0(names(.), "_not")) %>%
        bind_cols(df[rep(r.stolen, each = length(r.not)), ], .) %>%
        mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                                       time_of_day == time_of_day_not &
                                       day_of_week == day_of_week_not &
                                       lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                                       lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
        group_by(ID) %>%
        summarise(count = sum(in_range)) %>%
        arrange(desc(count)))
The first line of code just adds a column named ID to df that identifies each row by its row number, which we can later dplyr::group_by to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key steps, illustrated with a toy sketch after this list, are to:
replicate each row of stolen cars N times where N is the number of not-stolen car rows,
replicate the rows of not-stolen cars (as a block) M times where M is the number of stolen car rows, and
append the result of (2) to (1) as new columns and change the names of these new columns so that we can reference them in the condition
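As a toy sketch of steps (1)-(3) in isolation (the vectors s1, n1, etc. are made up purely for illustration), rep() with each and times together enumerate every (stolen, not-stolen) pair exactly once:
stolen <- c("s1", "s2")        # M = 2 stolen rows
not    <- c("n1", "n2", "n3")  # N = 3 not-stolen rows
data.frame(stolen = rep(stolen, each  = length(not)),
           not    = rep(not,    times = length(stolen)))
#   stolen not
# 1     s1  n1
# 2     s1  n2
# 3     s1  n3
# 4     s2  n1
# 5     s2  n2
# 6     s2  n3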
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so your condition can be applied in a vectorized fashion. The dplyr piped workflow that is the fourth statement of the code (wrapped in a print()) does this:
the first command replicates the not-stolen car rows using times
the second command appends _not to the column names to distinguish them from the stolen car columns when we bind the columns. Thanks to this SO answer for that gem.
the third command replicates the stolen car rows using each and appends the previous result as new columns using dplyr::bind_cols
the fourth command uses dplyr::mutate to create a new column named in_range that is the result of applying the condition. The boolean result is converted to {0,1} to allow for easy accumulation
the rest of the commands in the pipe do the counting of in_range grouped by the ID, arranging the results in decreasing order of the count. Note that ID now identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not identifies the rows for which is_stolen = 0
This assumes that you want the count for each row where is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each stolen telematic_trip_no, then you can use
group_by(telematic_trip_no) %>%
in the pipe instead.
I've tested this using the following data snippet:
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
"2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
"2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
"2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
-26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
-26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
"Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
-26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
-26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
-26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
-26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
"telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
"time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
"lon_min"), class = "data.frame", row.names = c(NA, -13L))
Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows that are all is_stolen = 0:
the first added row with telematic_trip_no = 526005880 violates the longitude condition for all not-stolen rows, so its count should be 0
the second added row with telematic_trip_no = 526006880 violates the latitude condition for all not-stolen rows, so its count should be 0
the third added row with telematic_trip_no = 526007880 violates the telematic_trip_no condition for all not-stolen rows, so its count should be 0
the fourth added row with telematic_trip_no = 526006890 satisfies the condition for rows 4 and 5 that are not-stolen, so its count should be 2
the fifth added row with telematic_trip_no = 526106880 satisfies the condition for row 6 that is not-stolen, so its count should be 1
the sixth added row with telematic_trip_no = 526017880 violates the time_of_day condition for all not-stolen rows, so its count should be 0
the seventh added row with telematic_trip_no = 526027880 violates the day_of_week condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:
# A tibble: 7 x 2
ID count
<int> <dbl>
1 10 2
2 11 1
3 7 0
4 8 0
5 9 0
6 12 0
7 13 0
which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.
If one were to group by telematic_trip_no instead, we get the result:
# A tibble: 7 x 2
telematic_trip_no count
<int> <dbl>
1 526006890 2
2 526106880 1
3 526005880 0
4 526006880 0
5 526007880 0
6 526017880 0
7 526027880 0
As a caveat, the above approach does cost memory. In the worst case, the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles for the data frame that is used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
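To make that worst case concrete, a quick back-of-the-envelope check (the row count here is hypothetical, not taken from the data above):
N <- 1e5   # suppose the original data frame had 100,000 rows
N^2 / 4    # 2.5e9 pair rows to evaluate in the worst case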
Hope this helps.
The current development version of data.table, v1.9.7, has a new feature, non-equi joins, which makes conditional joins quite straightforward. Using @aichao's data:
require(data.table) # v1.9.7+
setDT(df)[, ID := .I] # add row numbers
not_stolen = df[is_stolen == 0L]
is_stolen = df[is_stolen == 1L]
not_stolen[is_stolen,
           .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
           on = .(time_of_day, day_of_week, lat_min <= lat_dec,
                  lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
           by = .EACHI][, .(ID, N)]
# ID N
# 1: 7 NA
# 2: 8 NA
# 3: 9 0
# 4: 10 2
# 5: 11 1
# 6: 12 NA
# 7: 13 NA
The part not_stolen[is_stolen, ...] performs a subset-like join operation, i.e., for each row in is_stolen, the matching row indices (based on the condition provided to the on= argument) are extracted.
by = .EACHI ensures that, for each row of i (the first argument, here is_stolen), the expression provided in j (the second argument), .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)), is evaluated on the corresponding matching rows. That returns the result shown above.
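If the on= and by = .EACHI mechanics are new, here is a minimal sketch of the same join-then-aggregate pattern on made-up tables (scores, limits, and their columns are purely illustrative):
library(data.table)  # needs non-equi join support (v1.9.7 devel or later)
scores <- data.table(grp = c("a", "a", "b"), value = c(1, 7, 9))
limits <- data.table(grp = c("a", "b"), cutoff = c(5, 5))
# For each row of limits, count the scores rows in the same grp
# whose value exceeds that row's cutoff (a non-equi condition):
scores[limits, .(grp, n_above = .N), on = .(grp, value > cutoff), by = .EACHI]
#    grp n_above
# 1:   a       1
# 2:   b       1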
HTH.
Related
How to show percentage up to the current observation?
I am working with the following data frame:
   Group Indicator
1      1         1
2      1         0
3      1         1
4      1         1
5      1         0
6      2         0
7      2         0
8      2         0
9      2         1
10     2         0
I am wondering how I can create a new column which shows the percentage of the indicator column for all previous observations within the group. So the above data frame would become:
   Group Indicator IndicatorPercent
1      1         1               NA
2      1         0             1.00
3      1         1             0.50
4      1         1             0.67
5      1         0             0.75
6      2         0               NA
7      2         0             0.00
8      2         0             0.00
9      2         1             0.00
10     2         0             0.25
Basically, the new column just indicates the percentage (in decimal form) of the indicator up to that point within the group: it divides the sum of the indicator column up to that point by the count of previous observations within the group. My first thought was to use group_by along with row_number in order to reference previous observations, but I couldn't figure out how to make it work.
Data:
structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Indicator = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L),
  IndicatorPercent = c(NA, 1, 0.5, 0.67, 0.75, NA, 0, 0, 0, 0.25)),
  class = "data.frame", row.names = c(NA, -10L))
We get the cummean of the 'Indicator' after grouping by 'Group' and then take the lag of it:
library(dplyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = lag(cummean(Indicator))) %>%
  ungroup
-output
# A tibble: 10 x 3
#    Group Indicator IndicatorPercent
#    <int>     <int>            <dbl>
# 1      1         1           NA
# 2      1         0            1
# 3      1         1            0.5
# 4      1         1            0.667
# 5      1         0            0.75
# 6      2         0           NA
# 7      2         0            0
# 8      2         0            0
# 9      2         1            0
#10      2         0            0.25
If we want to do this based on the value of another column (say color), use replace:
library(tidyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = replace(rep(NA_real_, n()), color == 'red',
                                    lag(cummean(Indicator[color == "red"])))) %>%
  fill(IndicatorPercent) %>%
  ungroup
Or with data.table:
library(data.table)
setDT(df1)[color == 'red', IndicatorPercent := shift(cummean(Indicator)), Group
  ][, IndicatorPercent := nafill(IndicatorPercent, type = 'locf'), Group][]
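In case it helps to see what the lag(cummean(...)) combination is doing, a quick check on a plain vector (x here is just group 1's Indicator values): the running mean is the running sum divided by the running count, and lag shifts it down one position so each row only sees the observations before it.
library(dplyr)
x <- c(1, 0, 1, 1, 0)
cumsum(x) / seq_along(x)  # 1.000 0.500 0.667 0.750 0.600 (running mean)
cummean(x)                # identical values
lag(cummean(x))           # NA 1.000 0.500 0.667 0.750 (matches IndicatorPercent)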
How can I remove rows on more conditions in R?
I have session id's, client id's, a conversion column, and a specific date for each row. I want to delete the rows after the last purchase of a client. My data looks as follows:
SessionId ClientId Conversion Date
        1        1          0 05-01
        2        1          0 06-01
        3        1          0 07-01
        4        1          1 08-01
        5        1          0 09-01
        6        2          0 05-01
        7        2          1 06-01
        8        2          0 07-01
        9        2          1 08-01
       10        2          0 09-01
As output I want:
SessionId ClientId Conversion Date
        1        1          0 05-01
        2        1          0 06-01
        3        1          0 07-01
        4        1          1 08-01
        6        2          0 05-01
        7        2          1 06-01
        8        2          0 07-01
        9        2          1 08-01
It looks quite easy, but it has some conditions. Based on the client id, the sessions after the last purchase of a customer need to be deleted. I have many observations, so deleting after a particular date is not possible. It needs to check every client id for when that client made a purchase. I have no clue what kind of function I need to use for this. Maybe a certain kind of loop? Hopefully someone can help me with this.
If your data is already ordered according to Date, for each ClientId we can select all the rows up to and including the last conversion. This can be done in base R:
subset(df, ave(Conversion == 1, ClientId,
               FUN = function(x) seq_along(x) <= max(which(x))))
Using dplyr:
library(dplyr)
df %>%
  group_by(ClientId) %>%
  filter(row_number() <= max(which(Conversion == 1)))
Or data.table:
library(data.table)
setDT(df)[, .SD[seq_len(.N) <= max(which(Conversion == 1))], ClientId]
We could try
library(dplyr)
df1 %>%
  group_by(ClientId) %>%
  slice(seq_len(tail(which(Conversion == 1), 1)))
data
df1 <- structure(list(SessionId = 1:10, ClientId = c(1L, 1L, 1L, 1L, 1L,
  2L, 2L, 2L, 2L, 2L), Conversion = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
  1L, 0L), Date = c("05-01", "06-01", "07-01", "08-01", "09-01", "05-01",
  "06-01", "07-01", "08-01", "09-01")), class = "data.frame",
  row.names = c(NA, -10L))
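To unpack the slice index used above (the vector below is client 1's Conversion values from df1): which finds the positions of the conversions within a client's rows, tail(..., 1) keeps the last of them, and seq_len expands that position into the full set of row numbers to keep.
Conversion <- c(0, 0, 0, 1, 0)    # client 1's rows
tail(which(Conversion == 1), 1)   # 4: position of the last conversion
seq_len(4)                        # 1 2 3 4: rows kept for this client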
R data.table Aggregating rows by group conditionally
I'm currently working in data.table in R with the following data set:
id age_start age_end cases
 1         2       2  1000
 1         3       3   500
 1         4       4   300
 1         2       4  1800
 2         2       2  8000
 2         3       3   200
 2         4       4   100
In the given data set I only want the values of cases where age_start == 2 and age_end == 4. For each id that does not have a row with age_start == 2 and age_end == 4, I need to sum or aggregate the rows to create one: that is, sum the cases of age_start == 2 & age_end == 2, age_start == 3 & age_end == 3, and age_start == 4 & age_end == 4 into one new row with age_start == 2 and age_end == 4. After these are summed up into one row, I want to drop the rows that I used to make the new age_start == 2 and age_end == 4 row (i.e. the age values 2-2, 3-3, and 4-4), as they are no longer needed. Ideally the data set would look like this when I finish these steps:
id age_start age_end cases
 1         2       4  1800
 2         2       4  8300
Any suggestions on how to accomplish this in data.table are greatly appreciated!
You can use an equi-join for the first requirement, and a non-equi join for the second:
m_equi = x[.(id = unique(id), age_dn = 2, age_up = 4),
           on = .(id, age_start = age_dn, age_end = age_up),
           nomatch = 0]
m_nonequi = x[!m_equi, on = .(id)][.(id = unique(id), age_dn = 2, age_up = 4),
              on = .(id, age_start >= age_dn, age_end <= age_up),
              .(cases = sum(cases)),
              by = .EACHI]
res = rbind(m_equi, m_nonequi)
   id age_start age_end cases
1:  1         2       4  1800
2:  2         2       4  8300
How it works:
x[i] uses values in i to look up rows and columns in x according to rules specified in on=.
nomatch=0 means unmatched rows of i are dropped in x[i], so m_equi only ends up with id=1.
x[!m_equi, on=.(id)] is an anti-join that skips id=1, since we already matched it in the equi-join.
by=.EACHI groups by each row of i in x[i] for the purpose of doing the aggregation.
An alternative would be to anti-join on rows with start 2 and end 4 so that all groups need to be aggregated (similar to @akrun's answer), though I guess that would be less efficient.
We can specify the i with the logical condition, grouped by 'id', and get the sum of 'cases' while adding 'age_start' and 'age_end' as 2 and 4:
library(data.table)
as.data.table(df1)[age_start != 2 | age_end != 4,
                   .(age_start = 2, age_end = 4, cases = sum(cases)), id]
#   id age_start age_end cases
#1:  1         2       4  1800
#2:  2         2       4  8300
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age_start = c(2L,
  3L, 4L, 2L, 2L, 3L, 4L), age_end = c(2L, 3L, 4L, 4L, 2L, 3L, 4L),
  cases = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)),
  class = "data.frame", row.names = c(NA, -7L))
Subset data based on first occurrence of a status flag
My data looks like this:
dfin <-
  ID TIME CONC STATUS
   1    0    5      0
   1    1    4      1
   1    2    3      0
   2    0    2      0
   2   10    2      0
   2   15    1      0
I want to subset dfin to the first occurrence (for each ID) where STATUS == 1 and TIME > 0. If a subject ID has no STATUS == 1 recorded at any time, then I need to keep the last row for that subject. The output here should be:
dfout <-
  ID TIME CONC STATUS
   1    1    4      1
   2   15    1      0
One way with dplyr: we can group_by ID and check whether any row satisfies our condition (STATUS == 1 & TIME > 0); if one does, we take the first row which satisfies the condition using which.max, and if there is no such row we just return the last row using n().
library(dplyr)
df %>%
  group_by(ID) %>%
  slice(ifelse(any(STATUS == 1 & TIME > 0),
               which.max(STATUS == 1 & TIME > 0), n()))
#     ID  TIME  CONC STATUS
#  <int> <int> <int>  <int>
#1     1     1     4      1
#2     2    15     1      0
Another approach using only base R, following the same logic: since ave returns a vector the same length as its input, we have the function return a logical vector that is TRUE only at the row to keep within each ID, and use it to subset df directly.
df[with(df, ave(STATUS == 1 & TIME > 0, ID,
                FUN = function(x) if (any(x)) seq_along(x) == which.max(x)
                                  else seq_along(x) == length(x))), ]
#  ID TIME CONC STATUS
#2  1    1    4      1
#6  2   15    1      0
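As an aside, both versions lean on the fact that which.max applied to a logical vector returns the index of the first TRUE (the logicals are coerced to 0/1 and the first maximum wins):
which.max(c(FALSE, TRUE, TRUE))  # 2: position of the first TRUE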
Here is one approach with data.table. Convert the data.frame to a data.table (setDT(dfin)); grouped by ID, if there is any STATUS equal to 1, get the logical expression where TIME is greater than 0, or else take the last row (.N), and subset with .SD.
library(data.table)
setDT(dfin)[, .SD[if (any(STATUS == 1)) STATUS == 1 & TIME > 0 else .N], ID]
#   ID TIME CONC STATUS
#1:  1    1    4      1
#2:  2   15    1      0
It can also be written as
setDT(dfin)[, .SD[(STATUS == 1 & TIME > 0) | (!any(STATUS) & seq_len(.N) == .N)], ID]
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TIME = c(0L, 1L,
  2L, 0L, 10L, 15L), CONC = c(5L, 4L, 3L, 2L, 2L, 1L), STATUS = c(0L,
  1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
R: How to delete rows of a data frame based on the values of a given column
I have 100 simulated data sets; for example, a single set is shown below:
pid time status
  1    2      1
  1    6      0
  1    4      1
  2    3      0
  2    1      1
  2    7      1
  3    8      1
  3   11      1
  3    2      0
pid denotes patient id. This indicates that each patient has three records in the time and status columns. I want to write R code that, for each patient, deletes any row with 0 status if that row is not the patient's first observation; if a 0 does occur at the first observation, that row is kept and the remaining rows with status 1 following that 0 are deleted for that patient. The output should look like:
pid time status
  1    2      1
  1    4      1
  2    3      0
  3    8      1
  3   11      1
As there are 100 simulated data sets, the positions of the 0's and 1's in the status column are not the same across all the data. Could anyone help with R code that can perform this task? Thank you in advance.
The dplyr package can help. I added a record to your data example to include multiple 0 values for a pid. Group by pid, and with the function first you can hold the first value of status; due to the group_by, this value is held for all the records per pid. Then just filter: keep the row if the first record is 0 and row_number() == 1 (just in case there are more records with 0, see pid 4), or, if the first record has status = 1, keep all the records with status 1.
library(dplyr)
df %>%
  group_by(pid) %>%
  filter((first(status) == 0 & row_number() == 1) |
         (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups:   pid [4]
#    pid  time status
#  <int> <int>  <int>
#1     1     2      1
#2     1     4      1
#3     2     3      0
#4     3     8      1
#5     3    11      1
#6     4     3      0
data:
df <- structure(list(
  pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
  status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)),
  .Names = c("pid", "time", "status"), class = "data.frame",
  row.names = c(NA, -12L))
This question is more appropriate on https://stackoverflow.com. Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L), status = c(1L, 0L,
  1L, 0L, 1L, 1L, 1L, 1L, 0L)), .Names = c("pid", "time", "status"),
  class = "data.frame", row.names = c(NA, -9L))

ind <- unlist(tapply(dat$status, dat$pid, function(x) {
  # browser()
  y <- rep(FALSE, length(x))
  if (x[1] == 1) {
    y[x != 0] <- TRUE
  } else {
    y[1] <- TRUE
  }
  y
}))

dat[ind, ]
#>   pid time status
#> 1   1    2      1
#> 3   1    4      1
#> 4   2    3      0
#> 7   3    8      1
#> 8   3   11      1
ind is a vector of TRUEs and FALSEs, indicating whether each row of dat should be kept according to your rules. I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status) defined by a grouping factor (here INDEX = dat$pid). I used an anonymous function (i.e., FUN = function(x) {...}) to do something with each subset of X. In particular, I first define y, which I will return later, as a vector of FALSEs. If the first status is 1 for a subgroup, I turn all elements where x is non-zero (i.e., y[x != 0]) into TRUE. Otherwise, I turn only the first element (i.e., y[1]) into TRUE. You may uncomment the browser() statement and see at the console what the function does by typing n (for next), or x or y (to see what they are).