Find duplicate in data frame and change identified value - r

I am stuck on what is probably a simple issue.
I have a trigger that codes 1 while a computer key is pressed and 0 once the key is released. I need to identify the start and stop of each trigger (i.e., the first and last 1 of each run) and replace the 1s in between with 0. The recorded data are time (continuous, t below) and value (electrodermal activity, value). To process the data more quickly, I need to preprocess it, that is, identify the 1s that mark the beginning and the end of each window of interest.
Here is an example of the code:
t <- seq(0.1,10,0.1)
value <- rnorm(length(t), mean=1, sd=2)
trig <- c(rep(0,20),rep(c(rep(1,10), rep(0,10)),4))
id <- 1:length(t)
The expected output is:
trig_result <- c(rep(0,20), rep(c(1, rep(0,8),1,rep(0,10)),4)); length(trig_result)
Using duplicated only identifies the first 1 and the last one, not the intermediate values. I have seen similar posts, but none of them solves the identification issue.
I looked into dplyr functions, but I cannot figure out how to replace the 1s with 0 to finish the preprocessing phase.
Your help will be greatly appreciated.
Sincerely yours,

Here's a base R solution with rle and cumsum:
result <- rep(0,length(trig))
result[head(cumsum(rle(trig)$lengths)+c(1,0),-1)] <- 1
all.equal(result,trig_result)
#[1] TRUE
Note that this solution assumes the data begins and ends with 0.
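If the data may start or end in the middle of a key press, the run starts and ends can be computed separately instead of relying on alternating offsets. The following is only a sketch of that idea; run_borders is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: mark the first and last element of every run of `b`,
# wherever the run sits in the vector (including at either end).
run_borders <- function(x, b = 1) {
  r <- rle(x)
  ends <- cumsum(r$lengths)        # last index of each run
  starts <- ends - r$lengths + 1   # first index of each run
  keep <- r$values == b            # runs made of the value of interest
  out <- numeric(length(x))
  out[c(starts[keep], ends[keep])] <- 1
  out
}

trig <- c(rep(0, 20), rep(c(rep(1, 10), rep(0, 10)), 4))
trig_result <- c(rep(0, 20), rep(c(1, rep(0, 8), 1, rep(0, 10)), 4))
res <- run_borders(trig)

# A vector that begins and ends with a run of 1s
edge <- run_borders(c(1, 1, 1, 0, 0, 1, 1))
```

On the example data res matches trig_result, and the edge case keeps the borders of the first and last runs as well.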

Here is another base R solution, using logical vectors.
borders <- function(x, b = 1){
n <- length(x)
d1 <- c(x[1] == b, diff(x) != 0 & x[-1] == b)
d2 <- c(rev(diff(rev(x)) != 0 & rev(x[-n]) == b), x[n] == b)
d1 + d2
}
trig <- c(rep(0,20),rep(c(rep(1,10), rep(0,10)),4))
tr <- borders(trig)
The result is not identical() to the expected output because its class differs (integer rather than numeric), but the values are all.equal().
trig_result <- c(rep(0,20), rep(c(1, rep(0,8),1,rep(0,10)),4))
identical(trig_result, tr) # FALSE
all.equal(trig_result, tr) # TRUE
class(trig_result)
#[1] "numeric"
class(tr)
#[1] "integer"

One option is to create a grouping index with rle or rleid (from data.table)
library(data.table)
out <- ave(trig, rleid(trig), FUN = function(x)
x == 1 & (!duplicated(x) | !duplicated(x, fromLast = TRUE)))
identical(trig_result, out)
#[1] TRUE

You'd like to find the starts and ends of runs of 1s, and remove all 1s that aren't the start or end of a run.
The start of a run of 1s is where the value of the current row is a 1 and the value of the previous row is a 0. You can access the value of the previous row using the lag function.
The end of a run of 1s is where the current row is a 1, and the next row is a zero. You can access the value of the next row using the lead function.
library(tidyverse)
result = tibble(Trig = trig) %>%
mutate(StartOfRun = Trig == 1 & lag(Trig == 0),
EndOfRun = Trig == 1 & lead(Trig == 0),
Result = ifelse(StartOfRun | EndOfRun, 1, 0)) %>%
pull(Result)
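The same previous/next comparison can be sketched in base R by shifting the vector by one position. This is an alternative rendering, not the answer's own code, and it assumes that positions before the first and after the last sample count as 0, so a run touching either end of the data still gets a border.

```r
trig <- c(rep(0, 20), rep(c(rep(1, 10), rep(0, 10)), 4))
trig_result <- c(rep(0, 20), rep(c(1, rep(0, 8), 1, rep(0, 10)), 4))

# Shifted copies of trig, padded with 0 (assumption: outside the data = 0)
prev_val <- c(0, head(trig, -1))   # value of the previous row
next_val <- c(tail(trig, -1), 0)   # value of the next row

# Keep a 1 only if it starts a run (previous is 0) or ends one (next is 0)
result <- as.numeric(trig == 1 & (prev_val == 0 | next_val == 0))
```

Because the padding replaces the NA that lag and lead would produce at the boundaries, result contains no missing values.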

Related

How do I fix this for loop that takes a value in one column to change one in another?

It's been a while since I've touched for loops. I am simply trying to check if the value in the LocationID column is 1, and if so, set the value in the corresponding row in RegionID to 4. It should run through every value in the LocationID column.
for (i in 1:length(df$LocationID)) {
if (df_7$LocationID == 1) {
df_7$RegionID = 4
}
}
From my understanding, the code I've come up with here just checks the first value ~44000 times (the length of the dataset) and then sets every value of RegionID to 4.
Any help is greatly appreciated!
We need to use the index i to subset the value. In the OP's code, the if condition tests the whole 'LocationID' column, but if is not vectorized: it expects a single TRUE/FALSE, i.e. a vector of length 1.
for (i in seq_along(df_7$LocationID)) {
if(df_7$LocationID[i] == 1) {
df_7$RegionID[i] <- 4
}
}
In R, we can vectorize this easily
df_7$RegionID[df_7$LocationID == 1] <- 4
Or with ifelse
df_7$RegionID <- with(df_7, ifelse(LocationID == 1, 4, RegionID))
Or another option with tidyverse
library(dplyr)
df_7 <- df_7 %>%
mutate(RegionID = case_when(LocationID == 1 ~ 4, TRUE ~ RegionID))
Or with data.table
library(data.table)
setDT(df_7)[LocationID == 1, RegionID := 4]

R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transaction by gender.
Gender in my case is 0 or 1.
I want to pick 2 rows: one with Gender == 0 and one with Gender == 1.
The closest I got was to do it like this
df %>% arrange(Gender)
and then select 2 transactions in the middle where one is 1 and the second is 0.
Please advise.
To randomly sample a row/cell where a condition on another column is satisfied, you can use sample like this:
# Dummy data: X = value of interest, G = Gender (0,1)
df1 <- data.frame("X" = rnorm(10, 0, 1), "G" = sample(c(0,1), replace = T, size = 10))
# Sampling
sample(df1[,'X'][df1[,'G'] == 1], size = 1)
sample(df1[,'X'][df1[,'G'] == 0], size = 1)
This takes one value of X for each gender (the condition on G is set by [df1[,'G'] == 1]).
Building on the comment by docendo discimus, you can also use the popular dplyr package with the script below, but note that it ran considerably slower (5 times slower, 3M rows & 1000 iterations) than the sample approach above:
library(dplyr)
pull(df1 %>% group_by(G) %>% sample_n(1), X)
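Since the question asks for two rows rather than two values, a base R sketch that returns one whole row per gender might look like the following. The dummy data and the name picked are assumptions for illustration. Sampling a row index with sample(nrow(d), 1) also sidesteps a known quirk of sample: when given a single number n >= 1 it samples from 1:n instead of returning that number.

```r
set.seed(1)  # only to make the illustration reproducible

# Dummy data (assumption): X = value of interest, G = Gender (0, 1)
df1 <- data.frame(X = rnorm(10), G = rep(c(0, 1), 5))

# Split by gender, draw one random row index per group, and stack the rows
picked <- do.call(rbind, lapply(split(df1, df1$G), function(d) {
  d[sample(nrow(d), 1), ]
}))
```

picked is a two-row data frame with one row where G is 0 and one where G is 1.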

R function or loop that could go through a binary variable (1 and 0) in a dataframe and returns a third variable (y) value from a different column

I need some help. I am trying to build a function or a loop in R that goes through a binary variable (1 and 0) in a dataframe so that, every time a 1 is followed by a 0, I save the value of a third variable (y) from the row where the 0 occurs into a vector. I tried a couple of options based on previous posts, but nothing gives me anything even close to that.
My data looks a bit like that:
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- cbind(ID, variable, y)
In this case, for example, the answer would give me a vector with the y values 30 and 50. Sorry if someone has already answered this; I could not find anything similar. Thanks a lot!
Here's a 'vectorial' solution. Basically, I paste together variable in positions i and i+1. Then I check whether the combination is "10". The position you want is actually the next one (i.e. i+1), so we add 1.
df <- data.frame(ID, variable, y)
idx <- which(paste0(df$variable[-nrow(df)], df$variable[-1]) == "10") + 1
df$y[idx]
Here is an approach with tidyverse:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(y1 = ifelse(lag(variable) == 1 & variable == 0, y, NA)) %>%
pull(y1)
#output
#[1] NA NA 30 NA 50
and in base R:
ifelse(c(NA, df[-nrow(df),2]) == 1 & df[, 2] == 0, df[, 3], NA)
If the lag of variable is 1 and variable is 0, then return y; otherwise return NA.
If you would like to remove the NAs, wrap it in na.omit.
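As a sketch of that last step, the base R expression can be wrapped in na.omit. The names y1 and hits are only illustrative, and as.vector is added here to drop the bookkeeping attribute that na.omit attaches.

```r
ID <- rep(1001, 5)
variable <- c(1, 1, 0, 1, 0)
y <- c(10, 20, 30, 40, 50)
df <- data.frame(ID, variable, y)

# y where the previous row is 1 and the current row is 0, NA elsewhere
y1 <- ifelse(c(NA, df$variable[-nrow(df)]) == 1 & df$variable == 0, df$y, NA)

# Drop the NAs and strip the na.action attribute left by na.omit
hits <- as.vector(na.omit(y1))
hits
#[1] 30 50
```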

R - add column based on intervals in separate data frame

I have the following data frames:
DF <- data.frame(Time=c(1:20))
StartEnd <- data.frame(Start=c(2,6,14,19), End=c(4,10,17,20))
I want to add a column "Activity" to DF that indicates whether the value in the Time column lies within one of the intervals specified in the StartEnd data frame.
I came up with the following:
mapply(FUN = function(Start,End) ifelse(DF$Time >= Start & DF$Time <= End, 1, 0),
Start=StartEnd$Start, End=StartEnd$End)
This doesn't give me the output I want (it gives me a matrix with four columns), but I would like to get a vector that I can add to DF.
I guess the solution is easy but I'm not seeing it :) Thank you in advance.
EDIT: I'm sure I can use a loop but I'm wondering if there are more elegant solutions.
You can achieve this with
DF$Activity <- sapply(DF$Time, function(x) {
ifelse(sum(ifelse(x >= StartEnd$Start & x <= StartEnd$End, 1, 0)), 1, 0)
})
I hope this helps!
If you're using the tidyverse, I think a good way to go would be purrr::map2:
library(tidyverse)
# generate a sequence (n, n + 1, etc.) for each StartEnd row
# (map functions return a list; purrr::flatten_int or unlist can
# squash this down to a vector!)
activity_times = map2(StartEnd$Start, StartEnd$End, seq) %>% flatten_int
# then get a new DF column that is TRUE if Time is in activity_times
DF %>% mutate(active = Time %in% activity_times)
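The four-column matrix produced by the mapply attempt in the question can also be collapsed directly: a time point is active if it falls inside at least one interval, that is, if its row contains at least one TRUE. A minimal base R sketch of that idea follows; the name inside is illustrative.

```r
DF <- data.frame(Time = 1:20)
StartEnd <- data.frame(Start = c(2, 6, 14, 19), End = c(4, 10, 17, 20))

# One logical column per interval: is each Time inside that interval?
inside <- mapply(function(Start, End) DF$Time >= Start & DF$Time <= End,
                 StartEnd$Start, StartEnd$End)

# A row with at least one TRUE lies in some interval
DF$Activity <- as.numeric(rowSums(inside) > 0)
```

This keeps the vectorized comparison from the question and only adds the row-wise reduction that was missing.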

change variable values based on preceding value

I have the following dataset:
df <- data.frame(subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
performance = c(1,0,-1,-1,0,1,1,-1,0,0,0,1,1,1,-1,0,1,1,-1,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
What I would like to do is to change some of the entries in the performance variable. More specifically, if a "-1" entry is preceded by a "1", I want to change the "-1" to "0".
However, this should be done within subjects only, but not across subjects (all of the subjects have a varying number of sessions).
So, this is what I'd like to have in the end:
df2 =data.frame(subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
performance = c(1,0,-1,-1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
Does anyone have an idea how to do this?
Thanks in advance!
S.
Using dplyr,
df %>%
group_by(subject) %>%
mutate(performance = replace(performance, which(performance + lag(performance)==0 & performance == -1), 0))
Here's a data.table approach, where I first create a flag column which is then used to subset the data and update the performance column by reference.
library(data.table)
dt <- as.data.table(df) # or setDT(df)
dt[, flag := performance == -1 & shift(performance, 1L) == 1, by = subject]
dt[(flag), performance := 0][, flag := NULL]
I chose to do it with an intermediate flag-column because I expect that to perform very well for large data sets. If performance is not your concern, you could of course use ifelse or replace instead.
This is ugly, but should work:
dftest <- df
for (i in 2:nrow(dftest)) {
if(
dftest$performance[i] == -1 && dftest$performance[i - 1] == 1
){
if(
dftest$subject[i] == dftest$subject[i - 1]
) {
dftest$performance[i] <- 0
}
}
}
all.equal(df2, dftest) # ONE ERROR
This reports a mismatch in row 29. Can you check whether your example df2 is correct there? If I understand the question correctly, df2$performance[29] should be 0.
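The row-29 discrepancy can be checked without a loop. The sketch below is a base R alternative using ave to build a per-subject "previous value" column; prev, hit and fixed are illustrative names, and only the subject and performance columns from the question are rebuilt.

```r
subject <- c(rep(1, 12), rep(2, 10), rep(3, 11))
df <- data.frame(subject = subject,
  performance = c(1,0,-1,-1,0,1,1,-1,0,0,0,1, 1,1,-1,0,1,1,-1,0,0,1,
                  -1,1,1,0,1,1,-1,0,-1,-1,0))
df2 <- data.frame(subject = subject,
  performance = c(1,0,-1,-1,0,1,1,0,0,0,0,1, 1,1,0,0,1,1,0,0,0,1,
                  -1,1,1,0,1,1,-1,0,-1,-1,0))

# Previous performance within each subject (NA for a subject's first row)
prev <- ave(df$performance, df$subject, FUN = function(p) c(NA, head(p, -1)))

# A -1 directly preceded by a 1 in the same subject becomes 0
hit <- !is.na(prev) & df$performance == -1 & prev == 1
fixed <- df
fixed$performance[hit] <- 0

# Rows where the rule-based result differs from the posted df2
which(fixed$performance != df2$performance)
#[1] 29
```

Row 23 is left untouched because the preceding 1 belongs to a different subject, which is the within-subject behaviour the question asks for.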
A base R solution using by and sapply:
gr <- do.call(c, by(df, df$subject, function(x) {
c(FALSE, unlist(sapply(1:length(x$performance),
function(y) (x$performance[y] == -1) & (x$performance[y-1] == 1))))
}))
df[gr, 3] <- 0
cbind(df, df2)
