Removing rows based on conditions in R

For my research, I have to remove certain trials to limit contamination in the data. Here are the rules:
Remove the first trial.
For RT calculation, remove error trials (trials with correct == 0; trials with correct == 1 are correct trials) and trials following an error. Say we have 20 trials and 3 errors; then we have to remove 6 trials from the final data, i.e., the mean RT will be calculated from 14 trials. If the errors are in a row, say 3 errors in a row, the RT will be calculated based on 17 trials.
For error rate calculation, remove trials following an error. Say we have 20 trials and 3 errors; then we have to remove the 3 trials following errors from the final data, i.e., the error rate will be calculated as 3/17. If the errors are in a row, say 3 errors in a row, only the first error is retained and the next two errors are excluded, so the error rate will be calculated as 1/18.
I am new to R. I hope someone can help me with the script. Thanks in advance.
trial_no correct
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 0
14 1
15 1
16 1
17 1
18 1
19 0
20 1
21 1
22 1
23 0
24 0
25 1
26 1
27 1
28 1
29 1
30 1
31 1
32 0
33 0
34 1
35 1
36 1
37 1
38 1
39 0
40 1

If I understand you correctly, the two rules work as follows:
A trial is excluded from RT calculation if this trial or the preceding trial has correct==0.
A trial is excluded from error rate calculation if this trial has correct==0 and the preceding trial also has correct==0.
If this is how you want things done, consider the two methods below.
Reproducing your example data:
df <- data.frame(trial_no = 1:40,
                 correct = rep(1, 40))
df$correct[c(13, 19, 23, 24, 32, 33, 39)] <- 0
Solution using base R only:
# usage for RT: keep correct trials only ...
df$use_rt <- df$correct==1
# ... and additionally drop trials whose preceding trial was an error
df$use_rt[2:nrow(df)][df$correct[-nrow(df)]==0] <- FALSE
# usage for error rate: drop a trial only if it and its predecessor are both errors
df$use_er <- TRUE
df$use_er[2:nrow(df)] <- df$correct[-nrow(df)]==1 | df$correct[-1]==1
# calculate error rate
1 - mean(df$correct[df$use_er])
Solution using dplyr:
library(dplyr)
# usage for RT and error rate
df <- df %>%
  mutate(use_rt = correct==1 & lag(correct)==1,
         use_er = correct==1 | lag(correct)==1)
df$use_rt[1] <- df$correct[1]==1 # (without this you would have an NA value in the first row)
# calculate error rate
df %>%
filter(use_er) %>%
summarize(error_rate = 1 - mean(correct))
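As a side note, dplyr's lag() also takes a default argument for the first element, which avoids the manual NA fix; a minimal sketch:
df <- df %>%
  mutate(use_rt = correct==1 & lag(correct, default = 1)==1,
         use_er = correct==1 | lag(correct, default = 1)==1)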
The error rate is 0.1315789 (5 / 38).
For calculating reaction times you can now filter by use_rt, i.e. df[df$use_rt,].
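Your example data contains no RT column; assuming your real data has one (hypothetically named rt here), the mean RT would be:
mean(df$rt[df$use_rt]) # rt is a hypothetical column name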
df resulting from either method:
trial_no correct use_rt use_er
1 1 TRUE TRUE
2 1 TRUE TRUE
3 1 TRUE TRUE
4 1 TRUE TRUE
5 1 TRUE TRUE
6 1 TRUE TRUE
7 1 TRUE TRUE
8 1 TRUE TRUE
9 1 TRUE TRUE
10 1 TRUE TRUE
11 1 TRUE TRUE
12 1 TRUE TRUE
13 0 FALSE TRUE
14 1 FALSE TRUE
15 1 TRUE TRUE
16 1 TRUE TRUE
17 1 TRUE TRUE
18 1 TRUE TRUE
19 0 FALSE TRUE
20 1 FALSE TRUE
21 1 TRUE TRUE
22 1 TRUE TRUE
23 0 FALSE TRUE
24 0 FALSE FALSE
25 1 FALSE TRUE
26 1 TRUE TRUE
27 1 TRUE TRUE
28 1 TRUE TRUE
29 1 TRUE TRUE
30 1 TRUE TRUE
31 1 TRUE TRUE
32 0 FALSE TRUE
33 0 FALSE FALSE
34 1 FALSE TRUE
35 1 TRUE TRUE
36 1 TRUE TRUE
37 1 TRUE TRUE
38 1 TRUE TRUE
39 0 FALSE TRUE
40 1 FALSE TRUE

Related

Sum of column by condition

Trying to sum column 3 if column 1 is > .25. This attempt
if (df$V1 > .25) { sum(df$V3) }
fails with the message:
In if (df$V1 > 0.25) { : the condition has length > 1 and only the first element will be used
Any code to sum column 3 when column 1 is > .25?
V1 V2 V3 V4
0.1287953 3 12 1
1.094262 13 14 3
0.5962845 8 17 4
0.6511204 7 19 5
0.2533915 4 6 2
0.8222555 6 18 6
0.08695875 3 7 1
0.6096232 6 6 2
1.583204 24 7 1
0.08337463 4 7 1
0.06398186 1 11 2
0.2713974 4 11 2
0.6205648 13 4 1
1.276595 15 14 3
If you only want to sum over the entries in column 3 where the entries in column 1 are > 0.25:
inds <- (df$V1 > 0.25)
inds
# [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
Just use that to subset the third column:
sum( df$V3[ inds ] )
# 116
Or short: sum( df$V3[ df$V1 > 0.25 ] )
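If you prefer dplyr (used elsewhere on this page), an equivalent sketch:
library(dplyr)
df %>% filter(V1 > 0.25) %>% summarize(total = sum(V3)) # total = 116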

reduce the length of every connected FALSE block to a number of n

Let's generate some reproducible example data:
vector <- c()
set.seed(1337)
for (i in 1:3) {
  vector <- c(vector, rep(T, sample(4:10, 1)), rep(F, sample(1:10, 1)))
}
df <- data.frame(bools = vector, values = 1:length(vector))
Let's print the data:
> df
bools values
1 TRUE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 9
10 FALSE 10
11 FALSE 11
12 FALSE 12
13 FALSE 13
14 FALSE 14
15 TRUE 15
16 TRUE 16
17 TRUE 17
18 TRUE 18
19 FALSE 19
20 FALSE 20
21 FALSE 21
22 FALSE 22
23 FALSE 23
24 TRUE 24
25 TRUE 25
26 TRUE 26
27 TRUE 27
28 TRUE 28
29 TRUE 29
30 FALSE 30
31 FALSE 31
32 FALSE 32
33 FALSE 33
The rules are (n = 2 in the following example):
Keep all TRUE rows.
A FALSE block "longer" than n = 2 will be reduced to length n = 2.
Keep the first n = 2 rows of such a "too long" FALSE block.
Applying the rules will result in the following data frame, df.new:
df.new <- df[c(1:10,15:20,24:31),]
> df.new
bools values
1 TRUE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 9
10 FALSE 10
15 TRUE 15
16 TRUE 16
17 TRUE 17
18 TRUE 18
19 FALSE 19
20 FALSE 20
24 TRUE 24
25 TRUE 25
26 TRUE 26
27 TRUE 27
28 TRUE 28
29 TRUE 29
30 FALSE 30
31 FALSE 31
How can I reduce df to df.new? Please keep in mind that a FALSE block can be "smaller" than n, and in that case we keep that FALSE block unchanged.
With the valuable help of Roland, I came up with the following working (in my opinion ugly) solution,
using rleidv() from data.table and n applications of duplicated():
library(data.table) # for rleidv()
n <- 2
df$blocks <- rleidv(df$bools) # consecutive id for each run of equal values
df$blocks[df$bools %in% TRUE] <- NA # TRUE rows are always kept
for (i in 1:n) {
  # each pass marks the first remaining row of every FALSE block as kept
  df$blocks[duplicated(df$blocks) %in% FALSE] <- NA
}
df.new <- df[is.na(df$blocks), 1:2]
Print the result:
> df.new
bools values
1 TRUE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 9
10 FALSE 10
15 TRUE 15
16 TRUE 16
17 TRUE 17
18 TRUE 18
19 FALSE 19
20 FALSE 20
24 TRUE 24
25 TRUE 25
26 TRUE 26
27 TRUE 27
28 TRUE 28
29 TRUE 29
30 FALSE 30
31 FALSE 31
Another similar solution could be:
library(plyr)
library(data.table)
df$id <- rleid(df$bools)
ddply(df, .(id), function(x) if (x$bools[1]) {x} else {x[1:min(2, sum(!x$bools)), ]})
Or with dplyr:
library(dplyr)
library(data.table)
df$id <- rleid(df$bools)
df %>% group_by(id) %>%
  slice(if (bools[1]) {1:n()} else {1:min(2, sum(!bools))})
A base R alternative (that is still pretty ugly) that uses the split-apply-combine methodology is
do.call(rbind, lapply(split(df, cumsum(c(1, abs(diff(df$bools))))),
                      function(i) if (!i[1, "bools"]) head(i, 2) else i))
df is split using cumsum(c(1, abs(diff(df$bools)))), which is a base R version of data.table's rleid function. Each subset of df (now stored in a list) is checked as to whether it is a TRUE block or a FALSE block using if ... else: if it is a FALSE block, head(i, 2) keeps the first two observations of the block; otherwise, the full block is returned. The resulting data.frames are then combined with do.call and rbind.
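To see the base R run-id trick in isolation, here is a quick illustration on a small vector:
x <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
cumsum(c(1, abs(diff(x))))
# [1] 1 1 2 2 3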
Also, note that head(i, n) will return i without change if length or nrow of i is less than n.
head(1, 2)
[1] 1
This code returns
bools values
1.1 TRUE 1
1.2 TRUE 2
1.3 TRUE 3
1.4 TRUE 4
1.5 TRUE 5
1.6 TRUE 6
1.7 TRUE 7
1.8 TRUE 8
2.9 FALSE 9
2.10 FALSE 10
3.15 TRUE 15
3.16 TRUE 16
3.17 TRUE 17
3.18 TRUE 18
4.19 FALSE 19
4.20 FALSE 20
5.24 TRUE 24
5.25 TRUE 25
5.26 TRUE 26
5.27 TRUE 27
5.28 TRUE 28
5.29 TRUE 29
6.30 FALSE 30
6.31 FALSE 31
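For completeness, a dplyr-only sketch of the same reduction (my addition, reusing the base R run-id trick so that data.table is not needed):
library(dplyr)
df %>%
  mutate(id = cumsum(c(1, abs(diff(bools))))) %>% # run id per block
  group_by(id) %>%
  filter(bools[1] | row_number() <= 2) %>% # keep TRUE blocks whole, first 2 rows of FALSE blocks
  ungroup() %>%
  select(-id)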

R - add column checking occurrence of something in last n rows of column

I want to create a new column where at each row TRUE is returned if a certain value is found within the last n rows of another column, and FALSE is returned otherwise.
Here is an example dataframe (suppose this is a sample from a much larger dataframe):
A
2
23
1
5
6
15
14
3
7
9
55
3
77
2
And here is what I want (where the conditional value = 1 and n = 10):
A B
2 FALSE
23 FALSE
1 FALSE
5 TRUE
6 TRUE
15 TRUE
14 TRUE
3 TRUE
7 TRUE
9 TRUE
55 TRUE
3 TRUE
77 TRUE
2 FALSE
I can do this with many "OR" conditions in an ifelse statement in dplyr:
df<-df %>% mutate(B=ifelse(lag(A)==1|lag(A,2)==1 ... |lag(A,10)==1,T,F))
But this is far too tedious, especially when n is large. Also, lag in dplyr only takes a single integer shift, so lag(A, 1:10) doesn't work.
Is there an easy way to do this (preferably without a for loop)?
As you've noticed, lag from dplyr does not let you pass a vector of shift amounts. The shift function from data.table does, while offering the same lag/lead functionality as dplyr, so you can combine shift with Reduce:
library(data.table)
setDT(df)
df[, B := Reduce("|", shift(A == 1, n = 1:10, fill = F))]
df
A B
# 1: 2 FALSE
# 2: 23 FALSE
# 3: 1 FALSE
# 4: 5 TRUE
# 5: 6 TRUE
# 6: 15 TRUE
# 7: 14 TRUE
# 8: 3 TRUE
# 9: 7 TRUE
#10: 9 TRUE
#11: 55 TRUE
#12: 3 TRUE
#13: 77 TRUE
#14: 2 FALSE
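To see what shift returns when n is a vector, a quick illustration:
library(data.table)
shift(c(1, 2, 3), n = 1:2, fill = 0)
# [[1]]
# [1] 0 1 2
# [[2]]
# [1] 0 0 1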
We can also do this in dplyr, with do and shift from data.table:
library(dplyr)
df %>%
  do(data.frame(., B = Reduce(`|`, shift(.$A==1, n = 1:10, fill = 0))))
# A B
#1 2 FALSE
#2 23 FALSE
#3 1 FALSE
#4 5 TRUE
#5 6 TRUE
#6 15 TRUE
#7 14 TRUE
#8 3 TRUE
#9 7 TRUE
#10 9 TRUE
#11 55 TRUE
#12 3 TRUE
#13 77 TRUE
#14 2 FALSE
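If you would rather avoid packages entirely, here is a base R sketch of the same logic (my addition):
n <- 10
v <- df$A == 1
# for each row, check whether the value occurred among the previous n rows
df$B <- vapply(seq_along(v),
               function(i) i > 1 && any(v[max(1, i - n):(i - 1)]),
               logical(1))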

R conditional grouping of rows and numbering of groups

I work with data frames for flight movements (~ 1 million rows * 108 variables) and want to group phases during which a certain criterion is met (i.e. the value of a certain variable). In order to identify these groups, I want to number them.
Being an R newbie, I made it work for my case. Now I am looking for a more elegant way. In particular, I would like to get rid of the "useless" gaps in the numbering of the groups.
I provide a simplified example of my dplyr data frame with the variable THR for the threshold criterion. The rows are sorted by the timestamp (and thus I can omit it here).
THR <- c(13,17,19,22,21,19,17,12,12,17,20,20,20,17,17,13, 20,20,17,13)
df <- as.data.frame(THR)
df <- tbl_df(df)
To flag all rows where the criterion is (not) met:
df <- mutate(df, CRIT = THR < 19)
With the following, I managed to conditionally "cumsum" to get a unique group identification:
df <- mutate(df, GRP = ifelse(CRIT == 1, 0, cumsum(CRIT)))
df
THR CRIT GRP
1 13 TRUE 0
2 17 TRUE 0
3 19 FALSE 2
4 22 FALSE 2
5 21 FALSE 2
6 19 FALSE 2
7 17 TRUE 0
8 12 TRUE 0
9 12 TRUE 0
10 17 TRUE 0
11 20 FALSE 6
12 20 FALSE 6
While this does the trick and I can operate on the groups with group_by (e.g. summarise, filter), the numbering is not ideal, as can be seen in the example output: the 1st group is numbered 2 and the 2nd group is numbered 6, in line with the cumsum() result.
I would appreciate it if anybody could shed some light on this. I was not able to find an appropriate solution in other posts.
I don't think you can really avoid that preliminary step of creating CRIT, though I'd suggest adding cumsum when creating it and then just running a simple cumsum/diff wrap-up on it. Also, if you don't need the groups that don't meet the criterion, it is better to assign NA instead of some arbitrary number such as zero. Here's a possible data.table wrap-up (also, you don't need the df <- tbl_df(df) step at all):
library(data.table)
setDT(df)[, CRIT := cumsum(THR < 19)]
df[THR >= 19, GRP := cumsum(c(0L, diff(CRIT)) != 0L) + 1L]
# THR CRIT GRP
# 1: 13 1 NA
# 2: 17 2 NA
# 3: 19 2 1
# 4: 22 2 1
# 5: 21 2 1
# 6: 19 2 1
# 7: 17 3 NA
# 8: 12 4 NA
# 9: 12 5 NA
# 10: 17 6 NA
# 11: 20 6 2
# 12: 20 6 2
# 13: 20 6 2
# 14: 17 7 NA
# 15: 17 8 NA
# 16: 13 9 NA
# 17: 20 9 3
# 18: 20 9 3
# 19: 17 10 NA
# 20: 13 11 NA
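The trick here: within the THR >= 19 subset, CRIT jumps exactly where a new group starts, so diff(CRIT) != 0 flags the group boundaries. A quick illustration on the subset's CRIT values from the output above:
crit_sub <- c(2, 2, 2, 2, 6, 6, 6, 9, 9)
cumsum(c(0L, diff(crit_sub)) != 0L) + 1L
# [1] 1 1 1 1 2 2 2 3 3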
You can do:
x = rle(df$CRIT)
mask = x$values
x$values[mask] = 0 # runs with CRIT == TRUE get group 0
x$values[!mask] = cumsum(!x$values[!mask]) # number the CRIT == FALSE runs consecutively
mutate(df, GRP = inverse.rle(x))
# THR CRIT GRP
#1 13 TRUE 0
#2 17 TRUE 0
#3 19 FALSE 1
#4 22 FALSE 1
#5 21 FALSE 1
#6 19 FALSE 1
#7 17 TRUE 0
#8 12 TRUE 0
#9 12 TRUE 0
#10 17 TRUE 0
#11 20 FALSE 2
#12 20 FALSE 2
#13 20 FALSE 2
#14 17 TRUE 0
#15 17 TRUE 0
#16 13 TRUE 0
#17 20 FALSE 3
#18 20 FALSE 3
#19 17 TRUE 0
#20 13 TRUE 0
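A dplyr-only sketch with the same consecutive numbering (my addition): a new group starts wherever CRIT flips from TRUE to FALSE, so a masked cumsum over those flip points does the job.
library(dplyr)
df %>%
  mutate(GRP = ifelse(CRIT, 0, cumsum(!CRIT & lag(CRIT, default = TRUE))))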

R - identify row if next x rows have equal or smaller values compared to each previous row

I'm trying to work out whether the next x (6 is the current plan but this could be subject to change) balances remain the same or decrease each month.
I did this in Excel such that it would start with the current month's value and compare next month's against it to see if it stayed the same or decreased and so on.
=IF(AND(H3<=H2,H4<=H3,H5<=H4,H6<=H5,H7<=H6,H8<=H7),1,0)
This isn't the most flexible or elegant formula as it was part of an initial exploration. To make everything cleaner and more reproducible, I'd like to put my calculations into R instead.
Here is a basic dataset that is like my data for multiple IDs over many months.
DF <- rbind(data.frame(ID = 1, Month = 1:11, Bal = seq(from = 500, to = 300, by = -20)),
            data.frame(ID = 2, Month = 1:10, Bal = rep(200, 10)),
            data.frame(ID = 3, Month = 1:11, Bal = seq(from = 300, to = 500, by = 20)))
Something that calculates against the raw data at row level, or that works inside a ddply call, would be an ideal solution.
I'm still pretty new to R and I'm sure there's an elegant solution for this, but I really can't see it. Does anyone have a neat solution, or could you point me in the direction of the sorts of key terms I should be researching to try and reach one?
I am not sure if I understood correctly:
checkfun <- function(x, n) {
  # note: this is stats::filter (a moving sum), not dplyr::filter
  rev(stats::filter(rev(c(diff(x) <= 0, NA)), rep(1, n), sides = 1)) == n
}
This function calculates the differences between consecutive values and checks whether they are <= 0. The filter call sums, for each position, how many of the following n differences fulfill the condition; this sum is then compared with n to see if all of them do. (rev is only used so that a one-sided filter can be applied.)
DF$Bal[6] <- 505 # so that the differences are not all equal
library(plyr)
# example checking the next 3 values
ddply(DF,.(ID),transform,check=checkfun(Bal,3))
# ID Month Bal check
# 1 1 1 500 TRUE
# 2 1 2 480 TRUE
# 3 1 3 460 FALSE
# 4 1 4 440 FALSE
# 5 1 5 420 FALSE
# 6 1 6 505 TRUE
# 7 1 7 380 TRUE
# 8 1 8 360 TRUE
# 9 1 9 340 NA
# 10 1 10 320 NA
# 11 1 11 300 NA
# 12 2 1 200 TRUE
# 13 2 2 200 TRUE
# 14 2 3 200 TRUE
# 15 2 4 200 TRUE
# 16 2 5 200 TRUE
# 17 2 6 200 TRUE
# 18 2 7 200 TRUE
# 19 2 8 200 NA
# 20 2 9 200 NA
# 21 2 10 200 NA
# 22 3 1 300 FALSE
# 23 3 2 320 FALSE
# 24 3 3 340 FALSE
# 25 3 4 360 FALSE
# 26 3 5 380 FALSE
# 27 3 6 400 FALSE
# 28 3 7 420 FALSE
# 29 3 8 440 FALSE
# 30 3 9 460 NA
# 31 3 10 480 NA
# 32 3 11 500 NA
If df is your data.frame, you can find the consecutive differences using:
df$diff <- do.call("c", lapply(unique(df$ID), function(x) c(0, diff(df$Bal[df$ID==x]))))
This assumes that you want to keep those calculations separate for the different IDs.
> head(df)
ID Month Bal diff
1 1 1 500 0
2 1 2 480 -20
3 1 3 460 -20
4 1 4 440 -20
5 1 5 420 -20
6 1 6 400 -20
Now, for a given k = 6 (say), check:
k <- 6
sapply(unique(df$ID), function(x) ifelse(sum(df$diff[df$ID==x][1:k] < 0) != 0, 1, 0))
[1] 1 0 0
It returns 1 if any of the first k differences for an ID is negative, and 0 otherwise.
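A dplyr sketch of the same per-ID check (my addition; assumes k is defined as above):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarize(any_decrease = as.integer(any(diff(Bal)[1:k] < 0)))
# any_decrease: 1 0 0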
