My data looks like this:
dfin <-
ID TIME CONC STATUS
1 0 5 0
1 1 4 1
1 2 3 0
2 0 2 0
2 10 2 0
2 15 1 0
I want to subset dfin to the first occurrence (for each ID) where STATUS == 1 and TIME > 0. If a subject ID has no STATUS == 1 recorded at any time, then I need the last row of that subject.
The output here should be:
dfout <-
ID TIME CONC STATUS
1 1 4 1
2 15 1 0
One way with dplyr: group_by ID and check whether any row satisfies the condition (STATUS == 1 & TIME > 0). If one does, which.max returns the position of the first such row; if none does, we return the last row with n().
library(dplyr)
dfin %>%
group_by(ID) %>%
slice(ifelse(any(STATUS == 1 & TIME > 0), which.max(STATUS == 1 & TIME > 0), n()))
# ID TIME CONC STATUS
# <int> <int> <int> <int>
#1 1 1 4 1
#2 2 15 1 0
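To see why the any() check is needed, here is a minimal base R sketch. The vectors hit and miss are hypothetical stand-ins for STATUS == 1 & TIME > 0 within a single ID, and pick_row() is an illustrative helper, not a dplyr function.

```r
# Hypothetical condition vectors for one ID each
hit  <- c(FALSE, TRUE, FALSE)   # like ID 1: condition met on the 2nd row
miss <- c(FALSE, FALSE, FALSE)  # like ID 2: condition never met

which.max(hit)   # position of the first TRUE
which.max(miss)  # 1, because every element ties for the maximum

# Illustrative helper: first matching row, or the last row if none matches
pick_row <- function(cond) if (any(cond)) which.max(cond) else length(cond)

pick_row(hit)   # 2
pick_row(miss)  # 3, i.e. the last row
```

Without the guard, which.max alone would silently return the first row for an ID that never meets the condition.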
Another approach using only base R. It follows the same logic as the dplyr version: split the row numbers by ID, pick the first row that satisfies the condition within each ID (or the last row of that ID if none does), and use the resulting row numbers to subset the data frame.
cond <- with(dfin, STATUS == 1 & TIME > 0)
idx <- sapply(split(seq_along(cond), dfin$ID), function(i)
if (any(cond[i])) i[which.max(cond[i])] else i[length(i)])
dfin[idx, ]
# ID TIME CONC STATUS
#2 1 1 4 1
#6 2 15 1 0
Here is one approach with data.table. Convert the data.frame to a data.table (setDT(dfin)) and group by 'ID'. If any row in the group has 'STATUS' equal to 1 and 'TIME' greater than 0, take the index of the first such row; otherwise take the last row (.N). Then subset .SD with that index.
library(data.table)
setDT(dfin)[, .SD[if (any(STATUS == 1 & TIME > 0)) which.max(STATUS == 1 & TIME > 0) else .N], ID]
# ID TIME CONC STATUS
#1: 1 1 4 1
#2: 2 15 1 0
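It can also be written with a single logical index. Note that this form keeps every row where STATUS == 1 and TIME > 0, so it gives the requested output only when each ID has at most one such row:
setDT(dfin)[, .SD[(STATUS == 1 & TIME > 0) | (!any(STATUS == 1) & seq_len(.N) == .N)], ID]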
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TIME = c(0L, 1L,
2L, 0L, 10L, 15L), CONC = c(5L, 4L, 3L, 2L, 2L, 1L), STATUS = c(0L,
1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Related
I have session IDs, client IDs, a conversion column, and a date for each row. I want to delete the rows after each client's last purchase. My data looks as follows:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
5 1 0 09-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
10 2 0 09-01
As output I want:
SessionId ClientId Conversion Date
1 1 0 05-01
2 1 0 06-01
3 1 0 07-01
4 1 1 08-01
6 2 0 05-01
7 2 1 06-01
8 2 0 07-01
9 2 1 08-01
It looks quite easy, but it has some conditions. For each client ID, the sessions after that customer's last purchase need to be deleted. I have many observations, so deleting everything after one particular date is not possible. It needs to check, for every client ID, when that client made a purchase.
I have no clue what kind of function I need to use for this. Maybe a certain kind of loop?
Hopefully someone can help me with this.
If your data is already ordered by Date, we can select, for each ClientId, all rows up to and including the last conversion.
This can be done in base R:
subset(df, ave(Conversion == 1, ClientId, FUN = function(x) seq_along(x) <= max(which(x))))
Using dplyr :
library(dplyr)
df %>% group_by(ClientId) %>% filter(row_number() <= max(which(Conversion == 1)))
Or data.table :
library(data.table)
setDT(df)[, .SD[seq_len(.N) <= max(which(Conversion == 1))], ClientId]
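The one-liners above rely on every client having at least one conversion; for a client with none, max(which(x)) evaluates to -Inf with a warning. The base R sketch below makes that case explicit. The sessions data and the keep_through_last() helper are hypothetical, and dropping clients who never purchased is an assumption, not something the question specifies.

```r
# Toy data (hypothetical): client 3 never converts
sessions <- data.frame(
  SessionId  = 1:7,
  ClientId   = c(1, 1, 1, 2, 2, 3, 3),
  Conversion = c(0, 1, 0, 1, 0, 0, 0)
)

# Illustrative helper: TRUE for rows up to and including the last conversion
keep_through_last <- function(conv) {
  hits <- which(conv == 1)
  # Assumption: clients with no purchase at all are dropped entirely
  if (length(hits) == 0) return(rep(FALSE, length(conv)))
  seq_along(conv) <= max(hits)
}

# ave() applies the helper per client; as.logical() restores TRUE/FALSE
keep <- as.logical(ave(sessions$Conversion, sessions$ClientId,
                       FUN = keep_through_last))
result <- sessions[keep, ]
result$SessionId  # 1 2 4
```

If clients without a purchase should be kept in full instead, the early return could be changed to rep(TRUE, length(conv)).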
We could try
library(dplyr)
df1 %>%
group_by(ClientId) %>%
slice(seq_len(tail(which(Conversion == 1), 1)))
data
df1 <- structure(list(SessionId = 1:10, ClientId = c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), Conversion = c(0L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 0L), Date = c("05-01", "06-01", "07-01", "08-01",
"09-01", "05-01", "06-01", "07-01", "08-01", "09-01")),
class = "data.frame", row.names = c(NA,
-10L))
I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 that it was off.
I am trying to find a way to summarize the gaps between the on/off transitions of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G' and then works again in 'I', 'J' and 'M', but is off otherwise.
Is there a way to summarize this?
I wish to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
group_by(rowid) %>%
# This counts the number of times a new streak starts
mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
filter(val == 1) %>%
group_by(rowid, grp_num) %>%
summarise(num_1s = n(),
range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
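The streak counter is the core of this approach, so here is a small base R sketch of the same idea on the second row of the data. It compares each value with its predecessor instead of using lag(); the variable names are illustrative.

```r
# Row 2 of the data, columns A to N
val <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0)

# TRUE wherever the value differs from the previous one (the first is always new)
is_new_run <- c(TRUE, val[-1] != val[-length(val)])

# Cumulative sum of the change flags gives one id per streak
grp_num <- cumsum(is_new_run)
grp_num
# 1 1 1 1 2 2 2 3 4 4 5 5 6 7

# Streak ids that belong to runs of 1s
unique(grp_num[val == 1])  # 2 4 6
```

The ids 2, 4 and 6 are the same grp_num values that appear for rowid 2 in the output above.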
An option with data.table. Convert the 'data.frame' to a 'data.table' while creating a row-number column (setDT with keep.rownames = TRUE), then melt from 'wide' to 'long' format with the row-number column 'rn' as the id variable. Create a run-length id (rleid) on the 'value' column grouped by 'rn', and subset the rows where 'value' is 1. Summarise by 'grp' and 'rn', taking the number of rows (.N) and the pasted range of 'variable' values. Finally, set the column that is no longer needed to NULL and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
rl <- rle(x)
i1 <- rl$values == 1
l1 <- rl$lengths[i1]
nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
FUN = function(y) paste(range(y), collapse="-"))[i1]
data.frame(NumberOfOnes = l1, Range = nm1)}))
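To make the rle step easier to follow, here is a sketch for a single row (row 2 of the data). It derives the start and end position of each run from the cumulative run lengths, which is a slightly different route from the tapply call above; all variable names are illustrative.

```r
# Row 2 of the data as a named vector
x <- c(A = 0, B = 0, C = 0, D = 0, E = 1, F = 1, G = 1,
       H = 0, I = 1, J = 1, K = 0, L = 0, M = 1, N = 0)

runs   <- rle(unname(x))            # run lengths and the value of each run
ends   <- cumsum(runs$lengths)      # last position of each run
starts <- ends - runs$lengths + 1   # first position of each run
on     <- runs$values == 1          # keep only the runs of 1s

summary_row2 <- data.frame(
  NumberOfOnes = runs$lengths[on],
  Range = paste(names(x)[starts[on]], names(x)[ends[on]], sep = "-")
)
summary_row2
#   NumberOfOnes Range
# 1            3   E-G
# 2            2   I-J
# 3            1   M-M
```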
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
I have a data frame like this:
ID TIME AMT CONC
1 0 10 2
1 1 0 1
1 5 20 15
1 10 0 30
1 12 0 16
I want to subset data for each subject ID, from the last time when AMT > 0 till the last row of the data frame for that individual.
output should be this:
ID TIME AMT CONC
1 5 20 15
1 10 0 30
1 12 0 16
I am using RStudio.
We can use slice and create a sequence between the max index where AMT > 0 and the last index for each ID.
library(dplyr)
df %>%
group_by(ID) %>%
slice(max(which(AMT > 0)) : n())
# ID TIME AMT CONC
# <int> <int> <int> <int>
#1 1 5 20 15
#2 1 10 0 30
#3 1 12 0 16
We can use filter
library(dplyr)
df %>%
group_by(ID) %>%
mutate(ind = cumsum(AMT > 0)) %>%
filter(ind == max(ind), ind > 0) %>%
select(-ind)
# A tibble: 3 x 4
# Groups: ID [1]
# ID TIME AMT CONC
# <int> <int> <int> <int>
#1 1 5 20 15
#2 1 10 0 30
#3 1 12 0 16
NOTE: This also works when all the elements of 'AMT' are 0 for a particular group, e.g. after modifying the data like this:
df$ID[4:5] <- 2
df$AMT <- 0
df$AMT[4:5] <- c(1, 0)
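The cumsum indicator can be checked on plain vectors. This is a minimal base R sketch; AMT_dosed is the AMT column from the question and AMT_none is a hypothetical group in which AMT is never positive.

```r
AMT_dosed <- c(10, 0, 20, 0, 0)  # AMT column from the question
AMT_none  <- c(0, 0, 0)          # hypothetical group with no dose at all

# The counter increases at every row where AMT > 0
ind <- cumsum(AMT_dosed > 0)               # 1 1 2 2 2
keep_dosed <- ind == max(ind) & ind > 0    # FALSE FALSE TRUE TRUE TRUE

# With no dose the counter stays at 0, so the ind > 0 test drops every row
ind0 <- cumsum(AMT_none > 0)               # 0 0 0
keep_none <- ind0 == max(ind0) & ind0 > 0  # FALSE FALSE FALSE
```

For comparison, max(which(AMT_none > 0)) evaluates to -Inf with a warning, which is why the indicator-based filter is the safer choice for such groups.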
Or another option in fewer steps (unlike the version above, this keeps all rows of a group in which AMT is never greater than 0):
df %>%
group_by(ID) %>%
filter(row_number() >= which.max(cumsum(AMT > 0)))
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), TIME = c(0L, 1L, 5L,
10L, 12L), AMT = c(10L, 0L, 20L, 0L, 0L), CONC = c(2L, 1L, 15L,
30L, 16L)), class = "data.frame", row.names = c(NA, -5L))
My dataset is set up as follows:
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1
I'm trying to find the users that are present on all three days. I'm using the code below with the dplyr package:
MAU%>%
group_by(User)%>%
filter(c(1,2,3) %in% Day)
# but get this error message:
# Error in filter_impl(.data, quo) : Result must have length 12, not 3
Any idea how to fix this?
Using the input shown reproducibly in the Note at the end, take the distinct rows, count the rows per User, and keep only the users with 3 days:
library(dplyr)
DF %>%
distinct %>%
count(User) %>%
filter(n == 3) %>%
select(User)
giving:
# A tibble: 1 x 1
User
<int>
1 1
Note
Lines <- "
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1"
DF <- read.table(text = Lines, header = TRUE)
We can use all to get a single TRUE/FALSE from the logical vector 1:3 %in% Day
library(dplyr)
MAU %>%
group_by(User)%>%
filter(all(1:3 %in% Day))
# A tibble: 3 x 2
# Groups: User [1]
# User Day
# <int> <int>
#1 1 3
#2 1 2
#3 1 1
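The original error comes from c(1,2,3) %in% Day returning one value per day checked (length 3) rather than one value per row or per group. A short base R sketch of the difference, with an ave-based equivalent of the filter; days_15 and on_all_days are illustrative names.

```r
MAU <- data.frame(User = c(10, 1, 15, 3, 1, 15, 1),
                  Day  = c(2, 3, 1, 1, 2, 3, 1))

# User 15 was present on days 1 and 3 only
days_15 <- MAU$Day[MAU$User == 15]
1:3 %in% days_15       # TRUE FALSE TRUE: one answer per day checked
all(1:3 %in% days_15)  # FALSE: a single answer for the whole group

# Base R equivalent: ave() recycles the per-user answer to every row
on_all_days <- ave(MAU$Day, MAU$User,
                   FUN = function(d) all(1:3 %in% d)) == 1
MAU[on_all_days, ]     # the three rows of User 1
```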
data
MAU <- structure(list(User = c(10L, 1L, 15L, 3L, 1L, 15L, 1L), Day = c(2L,
3L, 1L, 1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
I have a data frame like this named 'a'.
ID V1
1 -1
1 0
1 1
1 1000
1 0
1 1
2 -1
2 0
2 1000
...
I shortened this data frame to show it briefly.
Now I want to create a new column with a conditional mutate, but the condition needs to refer to the new column that mutate itself is creating.
a %>%
group_by(ID) %>%
mutate(V2, ifelse(row_number() == 1, 1,
ifelse(V1 < 1000, 1,
ifelse(V1 >= 1000, lag(V2) + 1))
"Error: Then 'V2' not found" message is produced.
This result is what I want.
ID V1 V2
1 -1 1
1 0 1
1 1 1
1 1000 2
1 0 2
1 1 2
2 -1 1
2 0 1
2 1000 2
How do I get this? Thanks for your help.
We can try
a %>%
group_by(ID) %>%
mutate(V2 = cumsum(V1 >= 1000)+1L)
# ID V1 V2
# <int> <int> <int>
#1 1 -1 1
#2 1 0 1
#3 1 1 1
#4 1 1000 2
#5 1 0 2
#6 1 1 2
#7 2 -1 1
#8 2 0 1
#9 2 1000 2
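The attempt with lag(V2) fails because mutate evaluates the whole expression at once, so V2 does not exist yet when it is referenced. The running count avoids that self-reference. As a sketch, the same result can be obtained in base R with ave; this is an equivalent formulation, not the code from the answer above.

```r
a <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
                V1 = c(-1, 0, 1, 1000, 0, 1, -1, 0, 1000))

# Within each ID, count how many rows so far had V1 >= 1000, then add 1
a$V2 <- ave(as.integer(a$V1 >= 1000), a$ID, FUN = cumsum) + 1L
a$V2
# 1 1 1 2 2 2 1 1 2
```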
data
a <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
V1 = c(-1L,
0L, 1L, 1000L, 0L, 1L, -1L, 0L, 1000L)), .Names = c("ID", "V1"
), class = "data.frame", row.names = c(NA, -9L))
This should work:
a %>% group_by(ID) %>% mutate(V2 = ifelse(row_number() == 1, 1, 0) +
ifelse(row_number() > 1 & V1 <= 1000, 1, 0) +
cumsum(ifelse(V1 >= 1000, 1, 0)))
Update: Changed second ifelse logic statement from row_number() > 1 & V1 < 1000 to that shown above. This alteration should give the results as requested in the comments.