How can I remove rows based on multiple conditions in R?

I have session ids, client ids, a conversion column, and a date for each row. I want to delete the rows after the last purchase of a client. My data looks as follows:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
5 1 0 09-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
10 2 0 09-01
As output I want:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
It looks quite easy, but there are some conditions. Based on the client id, the sessions after the last purchase of a customer need to be deleted. I have many observations, so deleting everything after a particular date is not possible; the code needs to check, for every client id, when that client made their last purchase.
I have no clue what kind of function I need to use for this. Maybe a certain kind of loop?
Hopefully someone can help me with this.

If your data is already ordered according to Date, for each ClientId we can select all the rows up to and including the one where the last conversion took place.
This can be done in base R :
subset(df, ave(Conversion == 1, ClientId, FUN = function(x) seq_along(x) <= max(which(x))))
Using dplyr :
library(dplyr)
df %>% group_by(ClientId) %>% filter(row_number() <= max(which(Conversion == 1)))
Or data.table :
library(data.table)
setDT(df)[, .SD[seq_len(.N) <= max(which(Conversion == 1))], ClientId]
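All three variants above assume every client has at least one conversion; for a client who never converted, max(which(Conversion == 1)) returns -Inf with a warning. A hedged dplyr sketch that drops such clients entirely (assuming that is the desired behaviour) avoids which() by keeping a row only while a conversion still occurs at or after it:
library(dplyr)
df %>%
group_by(ClientId) %>%
# reverse cumulative sum: > 0 while a conversion occurs at or after this row
filter(rev(cumsum(rev(Conversion))) > 0) %>%
ungroup()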

We could try
library(dplyr)
df1 %>%
group_by(ClientId) %>%
slice(seq_len(tail(which(Conversion == 1), 1)))
data
df1 <- structure(list(SessionId = 1:10, ClientId = c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), Conversion = c(0L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 0L), Date = c("05-01", "06-01", "07-01", "08-01",
"09-01", "05-01", "06-01", "07-01", "08-01", "09-01")),
class = "data.frame", row.names = c(NA,
-10L))

Related

Count the number of consecutive cells under some conditions

I'd like to create a new column "mzpceyrs" that records the number of consecutive years the country "ccode" remains at peace. The "mzinit" variable codes whether or not "ccode" initiates a conflict in a given year.
ccode year mzinit mzpceyrs
2 1816 1 NA
2 1817 1 0
2 1818 1 0
2 1819 0 1
2 1820 0 ??
2 1821 0 ??
I suppose there would be far more efficient ways to do this, but the following code is what I've come up with. I basically consider four different scenarios (listed after the code):
for (i in 1:nrow(test)) {
  previousindex <- i - 1
  if (i == 1) {  # first row has no previous year
    test$mzpceyrs[i] <- NA
  } else if (test$mzinit[previousindex] == 1 & test$mzinit[i] == 1) {
    test$mzpceyrs[i] <- 0
  } else if (test$mzinit[previousindex] == 1 & test$mzinit[i] == 0) {
    test$mzpceyrs[i] <- 1
  } else if (test$mzinit[previousindex] == 0 & test$mzinit[i] == 1) {
    test$mzpceyrs[i] <- 0
  } else if (test$mzinit[previousindex] == 0 & test$mzinit[i] == 0) {
    test$mzpceyrs[i] <- ??  # this is the case I am stuck on
  }
}
i) If "ccode" initiates a conflict in the previous year AND in the current year, I assign a value of 0 (no peace year).
ii) If "ccode" initiates a conflict in the previous year and DOES NOT in the current year, I assign a value of 1 (one peace year).
iii) If "ccode" does NOT initiate a conflict in the previous year and DOES initiate a conflict in the current year, I assign a value of 0 (no peace year).
iv) If "ccode" DOES NOT initiate a conflict in the previous year and DOES in the current year, I assign the number of peace years "ccode" has remaining.
I'm struggling with how to code the last scenario. Would you be able to share your insights in terms of how to calculate consecutive "0" values after it has 1 in the "mzinit" column? For example, my desired outcome is to have 2 and 3 in the "mzpceyrs" variable in rows 5 and 6. Any advice would be much appreciated.
Using data.table
library(data.table)
setDT(test)[, mzpceyrs := rowid(mzinit) * (mzinit == 0)]
Output:
> test
ccode year mzinit mzpceyrs
<int> <int> <int> <int>
1: 2 1816 1 0
2: 2 1817 1 0
3: 2 1818 1 0
4: 2 1819 0 1
5: 2 1820 0 2
6: 2 1821 0 3
data
test <- structure(list(ccode = c(2L, 2L, 2L, 2L, 2L, 2L), year = 1816:1821,
mzinit = c(1L, 1L, 1L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
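One caveat (an assumption about the intended semantics, since the sample data contains only a single peace run): rowid(mzinit) counts every 0 seen so far across the whole vector, so the counter does not reset when a peace run is interrupted by a conflict. Wrapping rleid() inside rowid() restarts the count per run:
library(data.table)
x <- data.table(mzinit = c(0L, 1L, 0L, 0L))
x[, by_rowid := rowid(mzinit) * (mzinit == 0)]        # 1 0 2 3 -- does not reset
x[, by_rleid := rowid(rleid(mzinit)) * (mzinit == 0)] # 1 0 1 2 -- resets per run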
You can use cumsum() combined with rleid() from the data.table package:
library(data.table)
setDT(df)[, mzpceyrs := cumsum(mzinit == 0), rleid(mzinit)]
Output:
ccode year mzinit mzpceyrs
1: 2 1816 1 0
2: 2 1817 1 0
3: 2 1818 1 0
4: 2 1819 0 1
5: 2 1820 0 2
6: 2 1821 0 3
Input:
structure(list(ccode = c(2L, 2L, 2L, 2L, 2L, 2L), year = 1816:1821,
mzinit = c(1L, 1L, 1L, 0L, 0L, 0L)), row.names = c(NA, -6L
), class = "data.frame")
Alternatively, you can of course use data.table::rleid() within a dplyr pipeline, but it is slower and more verbose:
library(dplyr)
df %>%
group_by(rl = data.table::rleid(mzinit)) %>%
mutate(mzpceyrs = cumsum(mzinit == 0)) %>%
ungroup() %>%
select(-rl)
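For readers without data.table, a base R sketch of the same run-based idea (assuming, as above, that the counter resets after each conflict; with several countries you would apply this within each ccode group):
r <- rle(test$mzinit)
test$mzpceyrs <- unlist(lapply(seq_along(r$lengths), function(i)
if (r$values[i] == 0) seq_len(r$lengths[i]) else rep(0L, r$lengths[i])))
test$mzpceyrs
#[1] 0 0 0 1 2 3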

Counting patients that have not had medication [duplicate]

I have a dataframe in which patients have multiple observations of medication use over time. Some patients have used medication consistently, others have gaps, and I am trying to count the patients who have never used medication.
I can't show the actual data but here is an example data frame of what I am working with.
patid meds
1 0
1 1
1 1
2 0
2 0
3 1
3 1
3 1
4 0
5 1
5 0
So from this, two patients (2 and 4) never used medication. That's what I'm looking for.
I'm fairly new to R and have no idea how to do this; any help would be appreciated.
Here is an alternative using the dplyr package.
library(dplyr)
df <- data.frame(patid = c(1,1,1,2,2,3,3,3,4,5,5),
meds = c(0,1,1,0,0,1,1,1,0,1,0))
df %>%
distinct(patid, meds) %>%
arrange(desc(meds)) %>%
filter(meds == 0 & !duplicated(patid))
# patid meds
#1 2 0
#2 4 0
Try this:
library(dplyr)
#Data
df <- structure(list(patid = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L,
5L, 5L), meds = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-11L))
#Code
df %>% group_by(patid) %>% summarise(sum = sum(meds, na.rm = TRUE)) %>% filter(sum == 0)
# A tibble: 2 x 2
patid sum
<int> <int>
1 2 0
2 4 0
A Base R solution could be
subset(aggregate(meds ~ patid, df, sum), meds == 0)
which returns
patid meds
2 2 0
4 4 0
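For completeness, a dplyr sketch that states the intent ("keep patients for whom every observation is medication-free") directly with all(); this assumes one output row per patient is wanted, and nrow() of the result then gives the count:
library(dplyr)
df %>%
group_by(patid) %>%
filter(all(meds == 0)) %>% # keep patients with no medication use at all
distinct(patid)            # returns patid 2 and 4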

Subset data based on first occurrence of a status flag

My data looks like this:
dfin <-
ID TIME CONC STATUS
1 0 5 0
1 1 4 1
1 2 3 0
2 0 2 0
2 10 2 0
2 15 1 0
I want to subset dfin to the first occurrence (for each ID) where STATUS == 1 and TIME > 0. If a subject ID has no STATUS == 1 recorded at any time, then I need to subset the last row of that subject.
The output here should be:
dfout <-
ID TIME CONC STATUS
1 1 4 1
2 15 1 0
One way with dplyr: we group_by ID and check whether any row satisfies our condition (STATUS == 1 & TIME > 0). If so, we take the first row that satisfies it using which.max; if there is no such row, we just return the last row using n().
library(dplyr)
df %>%
group_by(ID) %>%
slice(ifelse(any(STATUS == 1 & TIME > 0), which.max(STATUS == 1 & TIME > 0), n()))
# ID TIME CONC STATUS
# <int> <int> <int> <int>
#1 1 1 4 1
#2 2 15 1 0
Another approach using only base R. This follows the same logic as the dplyr answer, but since ave returns a vector of the same length as its input, we have the grouped function return a logical vector that marks, within each ID, the first row meeting the condition (or the last row when none does), and use it to subset the data frame.
df[with(df, ave(STATUS == 1 & TIME > 0, ID, FUN = function(x)
if(any(x)) seq_along(x) == which.max(x) else seq_along(x) == length(x))), ]
# ID TIME CONC STATUS
#2 1 1 4 1
#6 2 15 1 0
Here is one approach with data.table. Convert the data.frame to a data.table (setDT(dfin)); then, grouped by 'ID', if any 'STATUS' is 1, use the logical expression STATUS == 1 & TIME > 0 to pick rows, otherwise take the last row (.N), and subset .SD either way.
library(data.table)
setDT(dfin)[, .SD[if(any(STATUS == 1)) STATUS == 1 & TIME > 0 else .N], ID]
# ID TIME CONC STATUS
#1: 1 1 4 1
#2: 2 15 1 0
It can also be written as
setDT(dfin)[, .SD[(STATUS == 1 & TIME > 0) | (!any(STATUS) & seq_len(.N) == .N)], ID]
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TIME = c(0L, 1L,
2L, 0L, 10L, 15L), CONC = c(5L, 4L, 3L, 2L, 2L, 1L), STATUS = c(0L,
1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
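If a subject could have several rows with STATUS == 1 and TIME > 0 and strictly the first such occurrence is wanted, a hedged data.table variant of the same which.max logic as the dplyr answer picks a single row per group:
library(data.table)
setDT(dfin)[, .SD[if(any(STATUS == 1 & TIME > 0)) which.max(STATUS == 1 & TIME > 0) else .N], ID]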

Counting rows based upon conditional grouping with dplyr

I have a dataframe as follows:
position_time telematic_trip_no lat_dec lon_dec
1 2016-06-05 00:00:01 526132109 -26.6641 27.8733
2 2016-06-05 00:00:01 526028387 -26.6402 27.8059
3 2016-06-05 00:00:01 526081476 -26.5545 28.3263
4 2016-06-05 00:00:04 526140512 -26.5310 27.8704
5 2016-06-05 00:00:05 526140518 -26.5310 27.8704
6 2016-06-05 00:00:19 526006880 -26.5010 27.8490
is_stolen hour_of_day time_of_day day_of_week lat_min
1 0 0 0 Sunday -26.6651
2 0 0 0 Sunday -26.6412
3 0 0 0 Sunday -26.5555
4 0 0 0 Sunday -26.5320
5 0 0 0 Sunday -26.5320
6 0 0 0 Sunday -26.5020
lat_max lon_max lon_min
1 -26.6631 27.8743 27.8723
2 -26.6392 27.8069 27.8049
3 -26.5535 28.3273 28.3253
4 -26.5300 27.8714 27.8694
5 -26.5300 27.8714 27.8694
6 -26.5000 27.8500 27.8480
Now what I want to do is count for each line where is_stolen = 1, the number of rows in the dataframe that fulfill the following conditions:
the lat_dec and lon_dec are between the lat_max, lat_min, lon_max and lon_min (i.e. fit within the 'box' around that GPS point)
the time_of_day and day_of_week are the same as that of the row of interest
the telematic_trip_no of the rows need to be different to that of the row of interest
and finally the is_stolen tag of the matching rows needs to be equal to 0
I've written a script to do this using a for loop, but it ran very slowly, and it got me wondering whether there's an efficient way to do complex row counts with many conditions using something like dplyr or data.table.
PS: If you're curious, I am indeed trying to calculate how many cars a stolen car passes during a typical trip :)
Given your description of the problem, the following should work
library(dplyr)
library(stats)
# df is the data.frame (see below)
df <- cbind(ID=seq_len(nrow(df)),df)
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times=length(r.stolen)),] %>%
setNames(.,paste0(names(.),"_not")) %>%
bind_cols(df[rep(r.stolen, each=length(r.not)),], .) %>%
mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not & time_of_day == time_of_day_not & day_of_week == day_of_week_not & lat_dec >= lat_min_not & lat_dec <= lat_max_not & lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
group_by(ID) %>%
summarise(count = sum(in_range)) %>%
arrange(desc(count)))
The first line just adds a column named ID to df that identifies the row by its row number that we can later dplyr::group_by to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key is to:
replicate each row of stolen cars N times where N is the number of not-stolen car rows,
replicate the rows of not-stolen cars (as a block) M times where M is the number of stolen car rows, and
append the result of (2) to (1) as new columns and change the names of these new columns so that we can reference them in the condition
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in a vectorized fashion. The dplyr piped R workflow that is the fourth line of the code (wrapped in a print()) does this:
the first command replicates the not-stolen car rows using times
the second command appends _not to the column names to distinguish them from the stolen car columns when we bind the columns. Thanks to this SO answer for that gem.
the third command replicates the stolen car rows using each and appends the previous result as new columns using dplyr::bind_cols
the fourth command uses dplyr::mutate to create a new column named in_range that is the result of applying the condition. The boolean result is converted to {0,1} to allow for easy accumulation
the rest of the commands in the pipe do the counting of in_range grouped by ID and arrange the results in decreasing order of the count. Note that ID is now the column that identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not identifies the rows with is_stolen = 0
This assumes that you want the count for each row that is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each telematic_trip_no that is stolen, then you can use
group_by(telematic_trip_no) %>%
in the pipe instead.
I've tested this using the following data snippet
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
"2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
"2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
"2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
-26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
-26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
"Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
-26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
-26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
-26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
-26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
"telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
"time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
"lon_min"), class = "data.frame", row.names = c(NA, -13L))
Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows that are all is_stolen = 0:
the first added row with telematic_trip_no = 526005880 violates the longitude condition for all not-stolen rows, so its count should be 0
the second added row with telematic_trip_no = 526006880 violates the latitude condition for all not-stolen rows, so its count should be 0
the third added row with telematic_trip_no = 526007880 violates the telematic_trip_no condition for all not-stolen rows, so its count should be 0
the fourth added row with telematic_trip_no = 526006890 satisfies the condition for rows 4 and 5 that are not-stolen, so its count should be 2
the fifth added row with telematic_trip_no = 526106880 satisfies the condition for row 6 that is not-stolen, so its count should be 1
the sixth added row with telematic_trip_no = 526017880 violates the time_of_day condition for all not-stolen rows, so its count should be 0
the seventh added row with telematic_trip_no = 526027880 violates the day_of_week condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:
# A tibble: 7 x 2
ID count
<int> <dbl>
1 10 2
2 11 1
3 7 0
4 8 0
5 9 0
6 12 0
7 13 0
which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.
If one were to group by telematic_trip_no instead, we get the result:
# A tibble: 7 x 2
telematic_trip_no count
<int> <dbl>
1 526006890 2
2 526106880 1
3 526005880 0
4 526006880 0
5 526007880 0
6 526017880 0
7 526027880 0
As a caveat, the above approach does cost memory. In the worst case, the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles for the data frame used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
Hope this helps.
The current development version of data.table, v1.9.7, has a new feature, non-equi joins, which makes conditional joins quite straightforward. Using #aichao's data:
require(data.table) # v1.9.7+
setDT(df)[, ID := .I] # add row numbers
not_stolen = df[is_stolen == 0L]
is_stolen = df[is_stolen == 1L]
not_stolen[is_stolen,
.(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
on = .(time_of_day, day_of_week, lat_min <= lat_dec,
lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
by=.EACHI][, .(ID, N)]
# ID N
# 1: 7 NA
# 2: 8 NA
# 3: 9 0
# 4: 10 2
# 5: 11 1
# 6: 12 NA
# 7: 13 NA
The part not_stolen[is_stolen, performs a subset-like join operation, i.e., for each row in is_stolen, the matching row indices (based on the condition provided to the on= argument) are extracted.
by = .EACHI ensures that, for each row of the i (first) argument, here is_stolen, the expression provided in j, the second argument, .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)), is evaluated on the corresponding matching rows. That returns the result shown above.
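Note that stolen rows with no matching not-stolen rows come back with N = NA rather than 0 (the default nomatch = NA behaviour). If a zero count is wanted instead, a small post-processing step fixes that; res below is just a hypothetical name for the join result above:
res <- not_stolen[is_stolen,
.(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
on = .(time_of_day, day_of_week, lat_min <= lat_dec,
lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
by = .EACHI][, .(ID, N)]
res[is.na(N), N := 0L] # treat "no candidate rows matched" as zero passes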
HTH.

How do I create an occasion variable (time) for each ID?

I would like to create a variable "Time" which indicates the number of times the ID showed up within each day, minus 1. In other words, the count is lagged by 1: the first time an ID shows up in a day should be left blank, and the second time the same ID shows up on a given day should be 1.
Basically, I want to create the "Time" variable in the example below.
ID Day Time Value
1 1 0
1 1 1 0
1 1 2 0
1 2 0
1 2 1 0
1 2 2 0
1 2 3 1
2 1 0
2 1 1 0
2 1 2 0
Below is the code I am working on; I have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'ID' and 'Day', we take the lag of the row sequence (shift(seq_len(.N))) and assign (:=) it as the "Time" column.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")
