This loop works for a small amount of data, but with a huge volume it takes quite a long time. Is there an alternative way to do this in R that would speed up the processing time?
# set correction to the transaction
mins <- 45
for (i in 1:nrow(tnx)) {
  if (tnx$id[i] == tnx$id[i + 1]) {
    # check trip within 45 mins
    if (tnx$diff[i] >= mins) {
      tnx$FIRST[i + 1] <- TRUE
      tnx$LAST[i] <- TRUE
    }
  } else {
    tnx$LAST[i] <- TRUE
  }
}
Thanks in advance.
EDIT
What I am trying to do is set TRUE/FALSE values in the FIRST and LAST columns by checking the diff column.
Data like:
tnx <- data.frame(
  id = rep(c("A", "C", "D", "E"), 4:1),
  FIRST = c(T, T, F, F, T, F, F, T, F, T),
  LAST = c(T, F, F, T, F, F, T, F, T, T),
  diff = c(270, 15, 20, -1, 5, 20, -1, 15, -1, -1)
)
EDIT (expected output, for @thelatemail)
# id diff FIRST LAST
#1 A 270 TRUE TRUE
#2 A 15 TRUE FALSE
#3 A 20 FALSE FALSE
#4 A -1 FALSE TRUE
#5 C 5 TRUE FALSE
#6 C 20 FALSE FALSE
#7 C -1 FALSE TRUE
#8 D 15 TRUE FALSE
#9 D -1 FALSE TRUE
#10 E -1 TRUE TRUE
Something like this should work:
I reset the FIRST and LAST values to make it obvious in this example:
tnx$FIRST <- FALSE
tnx$LAST <- FALSE
The next two parts use ?ave to respectively set tnx$FIRST to TRUE for the first row in each id group, and tnx$LAST to TRUE for the last row in each id group.
tnx$FIRST <- as.logical(
  with(tnx, ave(diff, id, FUN = function(x) seq_along(x) == 1)))
tnx$LAST <- as.logical(
  with(tnx, ave(diff, id, FUN = function(x) seq_along(x) == length(x))))
The final two parts then:
- set tnx$LAST to TRUE when tnx$diff is >= 45.
- set tnx$FIRST to TRUE when the previous value of tnx$diff is >= 45.
tnx$LAST[tnx$diff >= 45] <- TRUE
tnx$FIRST[c(NA,head(tnx$diff,-1)) >= 45] <- TRUE
# id diff FIRST LAST
#1 A 270 TRUE TRUE
#2 A 15 TRUE FALSE
#3 A 20 FALSE FALSE
#4 A -1 FALSE TRUE
#5 C 5 TRUE FALSE
#6 C 20 FALSE FALSE
#7 C -1 FALSE TRUE
#8 D 15 TRUE FALSE
#9 D -1 FALSE TRUE
#10 E -1 TRUE TRUE
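An alternative base-R sketch of the same idea, using duplicated() instead of ave() for the group boundaries (assuming, as in the question, that rows are already sorted by id):

```r
tnx <- data.frame(
  id = rep(c("A", "C", "D", "E"), 4:1),
  diff = c(270, 15, 20, -1, 5, 20, -1, 15, -1, -1)
)
mins <- 45
# first and last row of each id group
tnx$FIRST <- !duplicated(tnx$id)
tnx$LAST <- !duplicated(tnx$id, fromLast = TRUE)
# a gap of >= 45 mins ends one trip and starts a new one
tnx$LAST[tnx$diff >= mins] <- TRUE
tnx$FIRST[c(FALSE, head(tnx$diff, -1) >= mins)] <- TRUE
```

This produces the same FIRST/LAST columns as the ave() version above, and duplicated() tends to be cheap even on large vectors.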
This solves the problem just about as fast as R can do it. You'll note that the meat and potatoes is 4 lines, with no loops of any kind. I first test id against a version of itself shifted by one position, so that a single test finds all of the positions where id[i] == id[i+1] at once. After that I just use that logical vector to select, or to help select, the values in LAST and FIRST that I want to change.
# First I reset the LAST and FIRST columns and set some variables up.
# Note that if you're starting from scratch with no FIRST column at all then
# you don't need to declare it here yet
tnx$FIRST <- FALSE
tnx$LAST <- FALSE
mins <- 45
n <- nrow(tnx)
# and this is all there is to it
idMatch <- tnx$id == c(as.character(tnx$id[2:n]), 'XX')
tnx$LAST[ idMatch & tnx$diff >= mins] <- TRUE
tnx$LAST[ !idMatch] <- TRUE
tnx$FIRST <- c(TRUE, tnx$LAST[1:(n-1)])
Thanks in advance for your kind help. This is my dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c("A",NA,"B","C","A"))
df
And I want to create a new column based on whether the value of dataframe$b is present or absent (TRUE/FALSE). I'm using grepl for this, but I'm not sure how to dynamically create the new column.
I'm creating a vector with the unique values of df$b
list <- as.vector(unique(df$b))
And want to iterate with a for in df$b, in order to get a dataframe like this:
a b A B C
1 1 A TRUE FALSE FALSE
2 2 NA FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
But I'm not sure how to generate the new column inside the for loop. I'm trying to do something like this:
for (i in list) {
  logical <- grepl(df$b, i)
  df$i <- logical
}
But it generates an error. Any help will be appreciated.
This may need table:
df <- cbind(df, as.data.frame.matrix(table(df) > 0))
-output
df
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
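To see the intermediate step (a sketch, not part of the answer above): table(df) cross-tabulates a against b, the row where b is NA simply gets zero counts in every column, and > 0 turns the counts into logicals:

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c("A", NA, "B", "C", "A"))
# cross-tabulate a against b: one row per value of a, one column per value of b;
# the row where b is NA has zero counts everywhere, so `> 0` makes it all FALSE
flags <- as.data.frame.matrix(table(df) > 0)
```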
You can use this for loop:
list <- as.vector(unique(na.omit(df$b)))
for (i in seq_along(list)) {
  df[[list[i]]] <- ifelse(!is.na(df$b), list[i] == df$b, FALSE)
}
output
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
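A loop-free sketch of the same idea, building every logical column at once with lapply() (the as.character() call is just a guard in case b is a factor):

```r
df <- data.frame(a = 1:5, b = c("A", NA, "B", "C", "A"))
vals <- unique(na.omit(as.character(df$b)))
# one logical column per distinct value of b; NA rows come out FALSE in every column
df[vals] <- lapply(vals, function(v) !is.na(df$b) & df$b == v)
```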
I have an example dataset that looks like this:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable when the category (variable) contains "theta" in it. However, I would like to assign the logical value as "FALSE" when cell values contain "X1" and "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, all cell values that contain "theta" get the logical value TRUE.
Then, I wrote the line below to assign "FALSE" when the cell value has "X" in it.
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous result. What I would like is to combine the two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create the 'logic' by detecting the substring 'theta' at the end and not having 'X' ([^X]) as the starting (^) character:
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"),
         category = case_when(logic ~ str_replace(category, "_", ","),
                              TRUE ~ as.character(category))) %>%
  separate(category, into = c("split1", "split2"), sep = ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Not the cleanest solution in the world (since it adds 2 unnecessary cols), but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!
I am getting some unexpected behavior using %in% c() versus == c() to filter data on multiple conditions. I get incomplete results with the == c() method. Is there a logical explanation for this behavior?
df <- data.frame(region = as.factor(c(1,1,1,2,2,3,3,4,4,4)),
                 value = 1:10)
library(dplyr)
filter(df, region == c(1,2))
filter(df, region %in% c(1,2))
# using base syntax
df[df$region == c(1,2),]
df[df$region %in% c(1,2),]
The results do not change if I convert 'region' to numeric.
I get incomplete results with the == c() method. Is there a logical explanation for this behavior?
That's kind of logical, let's see:
df$region == 1:2
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
df$region %in% 1:2
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The reason is that in the first form you're comparing vectors of different lengths; as @lukeA said in his comment, this form is the same as (see implementation-of-standard-recycling-rules):
# 1 1 1 2 2 3 3 4 4 4 ## df$region
# 1 2 1 2 1 2 1 2 1 2 ## c(1,2) recycled to the same length
# T F T T F F F F F F ## equality of the corresponding elements
df$region == c(1,2,1,2,1,2,1,2,1,2)
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Where each value on the left hand side of the operator is tested with the corresponding value on the right hand side of the operator.
However, when you use df$region %in% 1:2, it works more like:
sapply(df$region, function(x) { any(x==1:2) })
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
I mean each value is tested against the whole second vector, and TRUE is returned if there is at least one match.
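For reference, base R defines %in% as match(x, table, nomatch = 0L) > 0L, so every element is looked up in the whole table and no recycling is involved:

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
# %in% is just a match() comparison: each element of x is looked up in the table
identical(x %in% c(1, 2), match(x, c(1, 2), nomatch = 0L) > 0L)
#[1] TRUE
```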
I'm trying to use R to find the average number of attempts before a success in a dataframe with 300,000+ rows. Data is structured as below.
EventID SubjectID ActionID Success DateUpdated
a b c TRUE 2014-06-21 20:20:08.575032+00
b a c FALSE 2014-06-20 02:58:40.70699+00
I'm still learning my way through R. It looks like I can use ddply to separate the frame out based on Subject and Action (I want to see how many times a given subject tries an action before achieving a success), but I can't figure out how to write the formula I need to apply.
library(data.table)
# example data
dt = data.table(group = c(1,1,1,1,1,2,2), success = c(F,F,T,F,T,F,T))
# group success
#1: 1 FALSE
#2: 1 FALSE
#3: 1 TRUE
#4: 1 FALSE
#5: 1 TRUE
#6: 2 FALSE
#7: 2 TRUE
dt[, which(success)[1] - 1, by = group]
# group V1
#1: 1 2
#2: 2 1
Replace group with list(subject, action) or whatever is appropriate for your data (after converting it to data.table from data.frame).
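With the question's column names, that could look like the sketch below (the data here is made up from the two example rows plus two extra attempts, so the values are illustrative only):

```r
library(data.table)
dt <- data.table(
  SubjectID = c("b", "a", "a", "a"),
  ActionID = c("c", "c", "c", "c"),
  Success = c(TRUE, FALSE, FALSE, TRUE)
)
# failed attempts before the first success, per subject/action pair
res <- dt[, .(attempts = which(Success)[1] - 1L), by = .(SubjectID, ActionID)]
```

Here subject "b" succeeds immediately (0 failed attempts) and subject "a" needs 2 failed attempts first.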
To follow up on Tarehman's suggestion, since I like rle,
foo <- rle(data$Success)
mean(foo$lengths[foo$values==FALSE])
This might be an answer to a totally different question, but does this get close to what you want?
tfs <- sample(c(FALSE,TRUE),size = 50, replace = TRUE, prob = c(0.8,0.2))
tfs_sums <- cumsum(!tfs)
repsums <- tfs_sums[duplicated(tfs_sums)]
mean(repsums - c(0,repsums[-length(repsums)]))
tfs
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[39] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
repsums
1 6 8 9 20 20 20 20 24 26 31 36
repsums - c(0,repsums[-length(repsums)])
1 5 2 1 11 0 0 0 4 2 5 5
The last vector shown is the length of each continuous "run" of FALSE values in the vector tfs
You could use a data.table workaround to get what you need, as follows:
library(data.table)
df = data.frame(EventID = c("a","b","c","d"), SubjectID = c("b","a","a","a"),
                ActionID = c("c","c","c","c"), Success = c(TRUE, FALSE, FALSE, TRUE))
dt=data.table(df)
dt[, Index := 1:.N, by = c("SubjectID", "ActionID", "Success")]
Now this Index column holds a running count within each subject/action/success combination. You need to aggregate to get the number you want (the max):
result = aggregate(Index ~ SubjectID + ActionID, data = dt, FUN = max)
This gives you the max index, which is the number of FALSEs before you hit a TRUE. Note that you might need further processing to filter out subjects that never had a TRUE.
I have a data.frame with a block of columns that are logicals, e.g.
> tmp <- data.frame(a=c(13, 23, 52),
+ b=c(TRUE,FALSE,TRUE),
+ c=c(TRUE,TRUE,FALSE),
+ d=c(TRUE,TRUE,TRUE))
> tmp
a b c d
1 13 TRUE TRUE TRUE
2 23 FALSE TRUE TRUE
3 52 TRUE FALSE TRUE
I'd like to compute a summary column (say: e) that is a logical AND over the whole range of logical columns. In other words, for a given row, if all b:d are TRUE, then e would be TRUE; if any b:d are FALSE, then e would be FALSE.
My expected result is:
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
I want to indicate the range of columns by indices, as I have a bunch of columns and the names are cumbersome. The following code works, but I'd rather use a vectorized approach to improve performance.
> tmp$e <- NA
> for(i in 1:nrow(tmp)){
+ tmp[i,"e"] <- all(tmp[i,2:(ncol(tmp)-1)]==TRUE)
+ }
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
Any way to do this without using a for loop to step through the rows of the data.frame?
You can use rowSums to loop over rows... and some fancy footwork to make it quasi-automated:
# identify the logical columns
boolCols <- sapply(tmp, is.logical)
# sum each row of the logical columns and
# compare to the total number of logical columns
tmp$e <- rowSums(tmp[,boolCols]) == sum(boolCols)
By using rowSums in an ifelse statement, this can be achieved in one go:
tmp$e <- ifelse(rowSums(tmp[,2:4] == T) == 3, T, F)
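Another vectorized option is Reduce(), which folds elementwise & across the selected columns, so no count comparison or hardcoded total is needed:

```r
tmp <- data.frame(a = c(13, 23, 52),
                  b = c(TRUE, FALSE, TRUE),
                  c = c(TRUE, TRUE, FALSE),
                  d = c(TRUE, TRUE, TRUE))
# elementwise AND over a range of columns selected by index
tmp$e <- Reduce(`&`, tmp[2:4])
tmp$e
#[1]  TRUE FALSE FALSE
```

Because tmp[2:4] is a list of column vectors, Reduce() simply chains b & c & d, and adding more logical columns needs no change other than the index range.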