I'm trying to use R to find the average number of attempts before a success in a dataframe with 300,000+ rows. Data is structured as below.
EventID SubjectID ActionID Success DateUpdated
a b c TRUE 2014-06-21 20:20:08.575032+00
b a c FALSE 2014-06-20 02:58:40.70699+00
I'm still learning my way around R. It looks like I can use ddply to split the frame by Subject and Action (I want to see how many times a given subject tries an action before achieving a success), but I can't figure out how to write the function I need to apply.
library(data.table)
# example data
dt = data.table(group = c(1,1,1,1,1,2,2), success = c(F,F,T,F,T,F,T))
# group success
#1: 1 FALSE
#2: 1 FALSE
#3: 1 TRUE
#4: 1 FALSE
#5: 1 TRUE
#6: 2 FALSE
#7: 2 TRUE
dt[, which(success)[1] - 1, by = group]
# group V1
#1: 1 2
#2: 2 1
Replace group with list(subject, action) or whatever is appropriate for your data (after converting it to data.table from data.frame).
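As a concrete sketch using the question's own column names (SubjectID, ActionID, Success) on a small made-up data set, assuming rows are already ordered by DateUpdated within each group:

```r
library(data.table)

# made-up example rows; assumes the data is already sorted by DateUpdated
# within each SubjectID/ActionID group
dt <- data.table(
  SubjectID = c("b", "b", "b", "a"),
  ActionID  = c("c", "c", "c", "c"),
  Success   = c(FALSE, FALSE, TRUE, TRUE)
)

# failures before the first success, per subject/action
attempts <- dt[, .(fails = which(Success)[1] - 1), by = .(SubjectID, ActionID)]
attempts
#    SubjectID ActionID fails
# 1:         b        c     2
# 2:         a        c     0

# overall average number of attempts before a success
mean(attempts$fails, na.rm = TRUE)
# [1] 1
```

Groups that never succeed get NA from `which(Success)[1]`, hence the `na.rm = TRUE`.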
To follow up on Tarehman's suggestion, since I like rle,
foo <- rle(data$Success)
mean(foo$lengths[foo$values==FALSE])
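One caveat: rle() treats the whole column as a single sequence, so a run of FALSEs spanning two subjects would be merged into one streak. A minimal sketch (with made-up example data) that applies the same rle idea within each subject/action group:

```r
# made-up example data
data <- data.frame(
  SubjectID = c("s1", "s1", "s1", "s2", "s2"),
  ActionID  = c("a",  "a",  "a",  "a",  "a"),
  Success   = c(FALSE, FALSE, TRUE, FALSE, TRUE)
)

# lengths of the FALSE-only runs in one group's Success vector
fail_runs <- function(success) {
  r <- rle(success)
  r$lengths[!r$values]
}

runs <- with(data, tapply(Success, list(SubjectID, ActionID), fail_runs))
mean(unlist(runs))
# [1] 1.5
```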
This might be an answer to a totally different question, but does this get close to what you want?
tfs <- sample(c(FALSE,TRUE),size = 50, replace = TRUE, prob = c(0.8,0.2))
tfs_sums <- cumsum(!tfs)
repsums <- tfs_sums[duplicated(tfs_sums)]
mean(repsums - c(0,repsums[-length(repsums)]))
tfs
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[39] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
repsums
1 6 8 9 20 20 20 20 24 26 31 36
repsums - c(0,repsums[-length(repsums)])
1 5 2 1 11 0 0 0 4 2 5 5
The last vector shown gives the length of each continuous "run" of FALSE values in the vector tfs.
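The subtract-the-shifted-vector step is exactly what diff() computes, so the run lengths can be written a bit more compactly (using the repsums values shown above):

```r
repsums <- c(1, 6, 8, 9, 20, 20, 20, 20, 24, 26, 31, 36)

# same as repsums - c(0, repsums[-length(repsums)])
run_lengths <- diff(c(0, repsums))
run_lengths
# [1]  1  5  2  1 11  0  0  0  4  2  5  5

mean(run_lengths)
# [1] 3
```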
You could use a data.table workaround to get what you need, as follows:
library(data.table)
df=data.frame(EventID=c("a","b","c","d"),SubjectID=c("b","a","a","a"),ActionID=c("c","c","c","c"),Success=c(TRUE,FALSE,FALSE,TRUE))
dt=data.table(df)
dt[ , Index := 1:.N , by = c("SubjectID" , "ActionID","Success") ]
Now this Index column will hold the number you need for each subject/action's consecutive experiments. You then need to aggregate to get that number (the max):
result = aggregate(Index ~ SubjectID + ActionID, data = dt, FUN = max)
So this will give you the max index, which is the number of FALSEs before you hit a TRUE. Note that you might need further processing to filter out subjects that have never had a TRUE.
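A sketch of that extra filtering step, reusing the example data from above: keep only the subject/action groups that contain at least one TRUE, then count the failures directly.

```r
library(data.table)

df <- data.frame(EventID = c("a", "b", "c", "d"),
                 SubjectID = c("b", "a", "a", "a"),
                 ActionID = c("c", "c", "c", "c"),
                 Success = c(TRUE, FALSE, FALSE, TRUE))
dt <- data.table(df)

# groups with no success are dropped, because j returns NULL for them
res <- dt[, if (any(Success)) .(fails = sum(!Success)), by = .(SubjectID, ActionID)]
res
#    SubjectID ActionID fails
# 1:         b        c     0
# 2:         a        c     2
```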
Thanks in advance for your kind help. This is my dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c("A",NA,"B","C","A"))
df
And I want to create a new column based on whether the value of dataframe$b is present or absent (TRUE/FALSE). I'm using grepl for this, but I'm not sure how to dynamically create the new column.
I'm creating a vector with the unique values of df$b
list <- as.vector(unique(df$b))
And want to iterate with a for in df$b, in order to get a dataframe like this:
a b A B C
1 1 A TRUE FALSE FALSE
2 2 NA FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
But I'm not sure how to generate the new column inside the for loop. I'm trying to do something like this:
for (i in list) {
logical <- grepl(df$b, i)
df$i <- logical
}
But it generates an error. Any help will be appreciated
This can be done with table:
df <- cbind(df, as.data.frame.matrix(table(df) > 0))
-output
df
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
You can use this for loop
list <- as.vector(unique(na.omit(df$b)))
for (i in seq_along(list)) {
  df[[list[i]]] <- ifelse(!is.na(df$b), list[i] == df$b, FALSE)
}
output
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
I have a transactional data like this
library(data.table)
library(stringr)
sample <- data.table (customerid=c(1,1,2,2,2,3,4,4,5,5,6,6,6,7,7),
product=c("A","A+B","A","A+B+C","A+C","B","B+C+D","C+D","A+D","A+B+D","A+B","A","A+C","B+D","D"))
I am trying to count how many products each customer buys in total and add it to a column named total_product.
I tried this code in data.table
sample[, A:= str_detect(product,"A")]
sample[, B:= str_detect(product,"B")]
sample[, C:= str_detect(product,"C")]
sample[, D:= str_detect(product,"D")]
sample
the code returns
customerid product A B C D
1: 1 A TRUE FALSE FALSE FALSE
2: 1 A+B TRUE TRUE FALSE FALSE
3: 2 A TRUE FALSE FALSE FALSE
4: 2 A+B+C TRUE TRUE TRUE FALSE
5: 2 A+C TRUE FALSE TRUE FALSE
6: 3 B FALSE TRUE FALSE FALSE
7: 4 B+C+D FALSE TRUE TRUE TRUE
8: 4 C+D FALSE FALSE TRUE TRUE
9: 5 A+D TRUE FALSE FALSE TRUE
10: 5 A+B+D TRUE TRUE FALSE TRUE
11: 6 A+B TRUE TRUE FALSE FALSE
12: 6 A TRUE FALSE FALSE FALSE
13: 6 A+C TRUE FALSE TRUE FALSE
14: 7 B+D FALSE TRUE FALSE TRUE
15: 7 D FALSE FALSE FALSE TRUE
I saw a question on Stack Overflow suggesting that I should merge the four columns c(A,B,C,D) and count the TRUEs.
But in my case, the same product would then be counted more than once.
Thanks for your advice!
We can use lapply on the pattern vector (LETTERS[1:4]) and either specify the arguments of the function str_detect
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], str_detect, string = product)]
Or use anonymous/lambda function
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], function(x)
str_detect(product, x))]
Then create the 'total_product' count as the row-wise sum of the logical vectors, i.e. TRUE -> 1 and FALSE -> 0:
sample[, total_product := rowSums(.SD), .SDcols = A:D]
If we want to count the unique elements of 'product' for each 'customerid', an option is to split the column with strsplit and get the unique count with uniqueN:
sample[, .(total_product = uniqueN(unlist(strsplit(product,
'+', fixed = TRUE)))), by = customerid]
-output
# customerid total_product
#1: 1 2
#2: 2 3
#3: 3 1
#4: 4 3
#5: 5 3
#6: 6 3
#7: 7 2
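If the set of products is not known in advance, the per-letter indicator columns can also be built without hardcoding "A".."D", by splitting product and casting to a wide table (a sketch using data.table's dcast):

```r
library(data.table)

sample <- data.table(
  customerid = c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7),
  product = c("A", "A+B", "A", "A+B+C", "A+C", "B", "B+C+D", "C+D",
              "A+D", "A+B+D", "A+B", "A", "A+C", "B+D", "D")
)

sample[, row := .I]                      # keep one row per transaction
long <- sample[, .(item = unlist(strsplit(product, "+", fixed = TRUE))),
               by = .(row, customerid)]

# 1/0 indicator for every product that occurs anywhere in the data
ind <- dcast(long, row + customerid ~ item,
             fun.aggregate = length, value.var = "item")
ind[1:2]
#    row customerid A B C D
# 1:   1          1 1 0 0 0
# 2:   2          1 1 1 0 0
```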
I am looking for a function that takes a column of a data.frame as the reference and finds all subsets with respect to the other variables' levels. For example, let z be a data frame with 4 columns a, b, c, d, where each column has 2 levels. Let a be the reference. Then z would be like:
z$a : TRUE FALSE
z$b : TRUE FALSE
z$c : TRUE FALSE
z$d : TRUE FALSE
Then what I need is a LIST that the elements are combination names such as
aTRUEbTRUEcTRUEdTRUE : subset of the dataframe
aTRUEbFALSEcTRUEdTRUE : subset
...
Here is an example,
set.seed(123)
z=matrix(sample(c(TRUE,FALSE),size = 100,replace = TRUE),ncol=4)
colnames(z) = letters[1:4]
z=as.data.frame(z)
output= list(
'bTRUEcTRUEdFALSE' = subset(z,b==TRUE & c==TRUE & d==FALSE),
'bTRUEcTRUEdTRUE' = subset(z,b==TRUE & c==TRUE & d==TRUE),
'bTRUEcFALSEdFALSE' = subset(z,b==TRUE & c==FALSE & d==FALSE),
'bTRUEcFALSEdTRUE' = subset(z,b==TRUE & c==FALSE & d==TRUE)
# and so on ...
)
output
$bTRUEcTRUEdFALSE
a b c d
13 FALSE TRUE TRUE FALSE
14 FALSE TRUE TRUE FALSE
$bTRUEcTRUEdTRUE
a b c d
4 FALSE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE
16 FALSE TRUE TRUE TRUE
20 FALSE TRUE TRUE TRUE
24 FALSE TRUE TRUE TRUE
$bTRUEcFALSEdFALSE
a b c d
17 TRUE TRUE FALSE FALSE
19 TRUE TRUE FALSE FALSE
22 FALSE TRUE FALSE FALSE
$bTRUEcFALSEdTRUE
a b c d
5 FALSE TRUE FALSE TRUE
11 FALSE TRUE FALSE TRUE
15 TRUE TRUE FALSE TRUE
18 TRUE TRUE FALSE TRUE
21 FALSE TRUE FALSE TRUE
23 FALSE TRUE FALSE TRUE
However, there is an issue with the example. Firstly, I do not know the number of variables (in this case 4, a to d). Secondly, the names of the variables must be obtained from the data (simply speaking, I cannot use subset since I do not know the variable name in the condition; a== could be anything==).
What is the most efficient way of doing this in R?
You can use split and paste like so:
split(z, paste(z$b, z$c, z$d))
But the tricky part of your question is how to programmatically combine the variables in columns 2:end without knowing beforehand the number of columns, their names, or their values. We can use a function like the one below to paste the row values in columns 2:end:
apply(z, 1, function(i) paste(i[-1], collapse=""))
Now combine with split
split(z, apply(z, 1, function(i) paste(i[-1], collapse="")))
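To also get list names in the 'bTRUEcTRUEdFALSE' style the question asks for, the column names can be pasted in alongside the values, still without knowing them in advance:

```r
set.seed(123)
z <- as.data.frame(matrix(sample(c(TRUE, FALSE), size = 100, replace = TRUE),
                          ncol = 4))
colnames(z) <- letters[1:4]

# build a key like "bTRUEcTRUEdFALSE" from every column except the first
keys <- apply(z[-1], 1, function(r) paste0(names(r), r, collapse = ""))
out <- split(z, keys)

names(out)   # e.g. "bFALSEcFALSEdFALSE" "bFALSEcFALSEdTRUE" ...
```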
I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part for a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or a one-liner (it could fit on one line, but I decided to make it obvious with two) suggested by @ColonelBeauvel:
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on comment by #Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Here is a solution with dplyr and tidyr.
gather(data = dat,year,value,-ids) %>%
mutate(year=as.integer(gsub("X","",year))) %>%
group_by(ids) %>%
summarize(finalyear = last(year[value]),
          streak = all(value[year <= finalyear]))
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
There's probably loads of ways of working out finalyear, this just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4
This loop is workable for a small amount of data, but with a huge volume of data it takes quite long. Is there an alternative way to do it in R that would speed up the processing time?
#set correction to the transaction
mins<-45
for (i in 1:nrow(tnx)) {
if(tnx$id[i] == tnx$id[i+1]){
#check trip within 45 mins
if(tnx$diff[i]>=mins){
tnx$FIRST[i+1] <- TRUE
tnx$LAST[i] <- TRUE
}
}
else{
tnx$LAST[i]<-TRUE
}
}
Thanks in advance.
EDIT
What I am trying to do is set the TRUE/FALSE values in the FIRST and LAST columns by checking the diff column.
Data like:
tnx <- data.frame(
id=rep(c("A","C","D","E"),4:1),
FIRST=c(T,T,F,F,T,F,F,T,F,T),
LAST=c(T,F,F,T,F,F,T,F,T,T),
diff=c(270,15,20,-1,5,20,-1,15,-1,-1)
)
EDIT PORTION FOR @thelatemail
# id diff FIRST LAST
#1 A 270 TRUE TRUE
#2 A 15 TRUE FALSE
#3 A 20 FALSE FALSE
#4 A -1 FALSE TRUE
#5 C 5 TRUE FALSE
#6 C 20 FALSE FALSE
#7 C -1 FALSE TRUE
#8 D 15 TRUE FALSE
#9 D -1 FALSE TRUE
#10 E -1 TRUE TRUE
Something like this should work:
I reset the FIRST and LAST values to make it obvious in this example:
tnx$FIRST <- FALSE
tnx$LAST <- FALSE
The next two parts use ?ave to respectively set tnx$FIRST to TRUE for the first row in each id group, and tnx$LAST to TRUE for the last row in each id group.
tnx$FIRST <- as.logical(
with(tnx, ave(diff,id,FUN=function(x) seq_along(x)==1) ))
tnx$LAST <- as.logical(
with(tnx, ave(diff,id,FUN=function(x) seq_along(x)==length(x))))
The final two parts then:
- set tnx$LAST to TRUE when tnx$diff is >=45.
- set tnx$FIRST to TRUE when the previous value for tnx$diff is >=45
tnx$LAST[tnx$diff >= 45] <- TRUE
tnx$FIRST[c(NA,head(tnx$diff,-1)) >= 45] <- TRUE
# id diff FIRST LAST
#1 A 270 TRUE TRUE
#2 A 15 TRUE FALSE
#3 A 20 FALSE FALSE
#4 A -1 FALSE TRUE
#5 C 5 TRUE FALSE
#6 C 20 FALSE FALSE
#7 C -1 FALSE TRUE
#8 D 15 TRUE FALSE
#9 D -1 FALSE TRUE
#10 E -1 TRUE TRUE
This solves the problem about as fast as R can do it. You'll note that the meat and potatoes is 4 lines, and there are no loops of any kind. I first test id against a version of itself shifted by one position, so that a single comparison finds all of the positions where id[i] == id[i+1] at once. After that I just use that logical vector to select, or help select, the values in LAST and FIRST that I want to change.
# First I reset the LAST and FIRST columns and set some variables up.
# Note that if you're starting from scratch with no FIRST column at all then
# you don't need to declare it here yet
tnx$FIRST <- FALSE
tnx$LAST <- FALSE
mins <- 45
n <- nrow(tnx)
# and this is all there is to it
idMatch <- tnx$id == c(as.character(tnx$id[2:n]), 'XX')
tnx$LAST[ idMatch & tnx$diff >= mins] <- TRUE
tnx$LAST[ !idMatch] <- TRUE
tnx$FIRST <- c(TRUE, tnx$LAST[1:(n-1)])
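For completeness, the same logic can be written with data.table's shift, grouped by id, which avoids the manual end-of-vector padding (a sketch using the question's example data):

```r
library(data.table)

tnx <- data.table(
  id = rep(c("A", "C", "D", "E"), 4:1),
  diff = c(270, 15, 20, -1, 5, 20, -1, 15, -1, -1)
)
mins <- 45

tnx[, `:=`(
  FIRST = seq_len(.N) == 1  | shift(diff) >= mins,  # first row of id, or previous gap >= 45
  LAST  = seq_len(.N) == .N | diff >= mins          # last row of id, or gap >= 45
), by = id]
```

Within each group the shift(diff) comparison is NA only on the first row, where the seq_len(.N) == 1 test already forces TRUE, so no explicit NA handling is needed.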