Handling data frames with conditional statements in R

Following the previous two questions:
removing the first 3 rows of a group with conditional statement in r
Assigning NAs to rows with conditional statement in r
I'm having some trouble with my code. Instead of deleting rows, I want to assign NAs to every event whose first row has a Value higher than 2. So, if the first row of an event has a Value higher than 2, I want to assign NA to that row and to the next two rows of that event. If the event has fewer rows than that, just assign NAs to the rows it has.
Here is an example, with a column showing the desired output.
Event<- c(1,1,1,1,1,2,2,2,2,3,3,4,5,6,6,6,7,7,7,7)
Value<- c(1,0,8,0,8,8,7,1,10,4,0,1,10,3,0,0,NA,NA,5,0)
Desire_output<- c(1,0,8,0,8,NA, NA, NA,10,NA,NA,1,NA,NA,NA,NA,NA,NA,5,0)
AAA<- data.frame(Event, Value, Desire_output)
Event Value Desire_output
1 1 1 1
2 1 0 0
3 1 8 8
4 1 0 0
5 1 8 8
6 2 8 NA
7 2 7 NA
8 2 1 NA
9 2 10 10
10 3 4 NA
11 3 0 NA
12 4 1 1
13 5 10 NA
14 6 3 NA
15 6 0 NA
16 6 0 NA
17 7 NA NA
18 7 NA NA
19 7 5 5
20 7 0 0
Note: If an event starts with an NA, do nothing (as in event 7).
Please let me know if you have an idea, and thanks in advance for your time.

Here's a dplyr pipe to do that:
library(dplyr)
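AAA %>%
  group_by(Event) %>%
  mutate(
    # flag the first row of an event when its Value is higher than 2
    bad = row_number() == 1 & !is.na(Value) & Value > 2,
    # extend the flag to the next two rows of the same event
    bad = bad | lag(bad, default = FALSE) | lag(bad, 2, default = FALSE),
    # rows that are already NA stay NA
    bad = bad | is.na(Value),
    Value2 = if_else(bad, NA_real_, Value)
  ) %>%
  ungroup()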
# # A tibble: 20 x 5
# Event Value Desire_output bad Value2
# <dbl> <dbl> <dbl> <lgl> <dbl>
# 1 1 1 1 FALSE 1
# 2 1 0 0 FALSE 0
# 3 1 8 8 FALSE 8
# 4 1 0 0 FALSE 0
# 5 1 8 8 FALSE 8
# 6 2 8 NA TRUE NA
# 7 2 7 NA TRUE NA
# 8 2 1 NA TRUE NA
# 9 2 10 10 FALSE 10
# 10 3 4 NA TRUE NA
# 11 3 0 NA TRUE NA
# 12 4 1 1 FALSE 1
# 13 5 10 NA TRUE NA
# 14 6 3 NA TRUE NA
# 15 6 0 NA TRUE NA
# 16 6 0 NA TRUE NA
# 17 7 NA NA TRUE NA
# 18 7 NA NA TRUE NA
# 19 7 5 5 FALSE 5
# 20 7 0 0 FALSE 0
I updated the data with
AAA$Desire_output[9] <- 10
since it was inconsistent with your displayed frame (and the display made more sense).
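As an alternative sketch (not taken from the answer above): because the rule depends only on the first row of each event, it should be possible to test first(Value) once per group and blank the first three rows directly. The column names first_is_high and Value2 are illustrative, and the threshold follows the question's wording ("higher than 2").

```r
library(dplyr)

Event <- c(1,1,1,1,1,2,2,2,2,3,3,4,5,6,6,6,7,7,7,7)
Value <- c(1,0,8,0,8,8,7,1,10,4,0,1,10,3,0,0,NA,NA,5,0)
AAA <- data.frame(Event, Value)

res <- AAA %>%
  group_by(Event) %>%
  mutate(
    # TRUE for the whole event when its first Value is known and above 2
    first_is_high = !is.na(first(Value)) & first(Value) > 2,
    # blank at most the first three rows of such events
    Value2 = if_else(first_is_high & row_number() <= 3, NA_real_, Value)
  ) %>%
  ungroup()
```

Events that start with NA are left alone because first_is_high is FALSE for them, so their existing NAs simply carry over into Value2.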

Related

Fill missing values (NA) before the first non-NA value by group

I have a data frame grouped by 'id' and a variable 'age' which contains missing values, NA.
Within each 'id', I want to replace missing values of 'age', but only "fill up" before the first non-NA value.
data <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
age=c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9))
id age
1 1 NA
2 1 6 # first non-NA in id = 1. Fill up from here
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 NA
8 2 NA
9 2 3 # first non-NA in id = 2. Fill up from here
10 2 8
11 2 NA
12 3 NA
13 3 NA
14 3 7 # first non-NA in id = 3. Fill up from here
15 3 NA
16 3 9
Expected output:
1 1 6
2 1 6
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 3
8 2 3
9 2 3
10 2 8
11 2 NA
12 3 7
13 3 7
14 3 7
15 3 NA
16 3 9
I tried using fill with .direction = "up" like this:
library(dplyr)
library(tidyr)
data1 <- data %>% group_by(id) %>%
fill(!is.na(age[1]), .direction = "up")
You could use cumall(is.na(age)) to find the positions before the first non-NA value.
library(dplyr)
data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()
# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9
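To see what cumall() contributes, here is a minimal sketch on a single vector (the values of id 2 from the question). The names lead_na and first_val are only illustrative.

```r
library(dplyr)

age <- c(NA, NA, 3, 8, NA)

# TRUE only while every element so far has been NA
lead_na <- cumall(is.na(age))

# first observed value in the vector
first_val <- age[!is.na(age)][1]

# overwrite only the leading NAs; the trailing NA is untouched
age2 <- replace(age, lead_na, first_val)
```

Because cumall() switches to FALSE at the first non-NA value and never switches back, later NAs are not filled.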
Another option (agnostic about where the missing and non-missing values start) could be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(is.na(age)), rep(seq_along(lengths), lengths)),
age2 = ifelse(rleid == min(rleid[is.na(age)]),
age[rleid == (min(rleid[is.na(age)]) + 1)][1],
age))
id age rleid age2
<dbl> <dbl> <int> <dbl>
1 1 NA 1 6
2 1 6 2 6
3 1 NA 3 NA
4 1 8 4 8
5 1 NA 5 NA
6 1 NA 5 NA
7 2 NA 1 3
8 2 NA 1 3
9 2 3 2 3
10 2 8 2 8
11 2 NA 3 NA
12 3 NA 1 7
13 3 NA 1 7
14 3 7 2 7
15 3 NA 3 NA
16 3 9 4 9
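The run-id construction in this answer can be followed step by step on one group. This is a sketch in base R using the values of id 1; the names run_id, first_na_run and fill_value are illustrative.

```r
age <- c(NA, 6, NA, 8, NA, NA)

# number each run of consecutive NA / non-NA values
runs <- rle(is.na(age))
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# the first run that consists of NAs, and the first value of the run after it
first_na_run <- min(run_id[is.na(age)])
fill_value <- age[run_id == first_na_run + 1][1]

# fill only that first NA run
age2 <- ifelse(run_id == first_na_run, fill_value, age)
```

Note that if a group starts with a non-NA value, this logic fills the first NA run wherever it occurs, which may or may not be what is wanted.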

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R -
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
  mutate(test = complete.cases(.),
         # a new group starts where a complete row follows an incomplete one
         group = cumsum(test & !lag(test, default = FALSE)),
         # rows in a gap get no group
         group = replace(group, !test, NA))
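The transition idea can be checked on plain vectors. The sketch below rebuilds the dummy data from the question (columns a and b only) and assumes nothing beyond dplyr::lag; the names ok and starts are illustrative.

```r
library(dplyr)

test <- data.frame(a = 1:10, b = c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25))

# TRUE where the row has no missing values
ok <- complete.cases(test)

# TRUE only on the first complete row after a gap (or at the very start)
starts <- ok & !dplyr::lag(ok, default = FALSE)

# running count of block starts, blanked inside the gaps
group <- ifelse(ok, cumsum(starts), NA)
```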
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
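The grouping expression in the second approach relies on gaps in the surviving row numbers. A small sketch with the row numbers that remain after filtering the example data (rn here is typed in by hand for illustration):

```r
# row numbers of the complete cases in the example
rn <- c(1, 2, 3, 7, 8, 10)

# a jump of more than 1 means some rows were filtered out in between
new_block <- diff(rn) > 1

# the first row always opens block 1; every jump opens the next block
group <- c(0, cumsum(new_block)) + 1
```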
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3
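The factor-to-integer step is what turns the leftover run ids into consecutive group numbers. A sketch on a bare logical vector, assuming the same complete-case pattern as the example:

```r
library(data.table)

ok <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

grp <- rleid(ok)   # 1 1 1 2 2 2 3 3 4 5
grp[!ok] <- NA     # drop the ids that belong to gaps, leaving 1, 3 and 5

# factor() keeps only the ids still present; as.integer() returns their rank
group <- as.integer(factor(grp))
```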

Assigning NAs to rows with conditional statement in r

I'm trying to assign NAs to the first two rows of each event, with the following conditional statement:
If the first day of each event has a value of "variable" = 0, check the day before. If the day before (last day of previous event) has a "variable" > 0, then assign NAs to the first two rows of the event having "variable" = 0 on the first day. If the day before has a "variable" = 0, do nothing.
Here is an example:
day <- c(1:16)
event<- c(1,1,2,3,4,4,4,5,5,5,6,6,6,7,7,7)
variable<- c(0,0,5,0,0,0,10,0,1,1,0,0,0,0,0,0)
A<- data.frame(day, event, variable)
day event variable
1 1 1 0
2 2 1 0
3 3 2 5
4 4 3 0
5 5 4 0
6 6 4 0
7 7 4 10
8 8 5 0
9 9 5 1
10 10 5 1
11 11 6 0
12 12 6 0
13 13 6 0
14 14 7 0
15 15 7 0
16 16 7 0
And this is how it should look:
day event variable
1 1 1 0
2 2 1 0
3 3 2 5
4 4 3 NA
5 5 4 0
6 6 4 0
7 7 4 10
8 8 5 NA
9 9 5 NA
10 10 5 1
11 11 6 NA
12 12 6 NA
13 13 6 0
14 14 7 0
15 15 7 0
16 16 7 0
Note: It doesn't matter if event 1 ends up being assigned NAs.
I tried to do this with if conditions, but it is not working well. Any ideas? Thanks in advance!
EDIT: New example data from OP
library(data.table)
event2<- c(1,2,2,3,4,4,4,4,4,5,5)
variable2<- c(140, 0, 69, 569, 28, 0,0,0,100,0,0)
desire_output<- c(140, NA, NA, 569, 28, 0,0,0,100, NA,NA)
A2<- data.frame(event2, variable2, desire_output)
setDT(A2)
A2[,first_days_event:=fifelse(.I==min(.I),1,fifelse(.I==min(.I)+1,2,NA_integer_)),by=.(event2)]
A2[,result:={v <- variable2
  for (i in 2:.N) {
    if (is.na(first_days_event[i])) {
      v[i] <- variable2[i]
    } else if (first_days_event[i]==1 && variable2[i]==0){
      if (variable2[i-1]>0) {
        v[i] <- NA
        # isTRUE() avoids an error when row i is the last row of the table
        if (isTRUE(first_days_event[i+1]==2)) {
          v[i+1] <- NA
        }
      }
    }
  }
  v}]
A2
#> event2 variable2 desire_output first_days_event result
#> 1: 1 140 140 1 140
#> 2: 2 0 NA 1 NA
#> 3: 2 69 NA 2 NA
#> 4: 3 569 569 1 569
#> 5: 4 28 28 1 28
#> 6: 4 0 0 2 0
#> 7: 4 0 0 NA 0
#> 8: 4 0 0 NA 0
#> 9: 4 100 100 NA 100
#> 10: 5 0 NA 1 NA
#> 11: 5 0 NA 2 NA
I would use this simple loop solution. It just needs a flag indicating the first two days of each event.
library(data.table)
day <- c(1:16)
event<- c(1,1,2,3,4,4,4,5,5,5,6,6,6,7,7,7)
variable<- c(0,0,5,0,0,0,10,0,1,1,0,0,0,0,0,0)
A<- data.frame(day, event, variable)
setDT(A)
A[,first_days_event:=fifelse(.I==min(.I),1,fifelse(.I==min(.I)+1,2,NA_integer_)),by=.(event)]
A[,result:={v <- variable  # start from the original values so untouched rows keep them
  for (i in 2:.N) {
    if (is.na(first_days_event[i])) {
      v[i] <- variable[i]
    } else if (first_days_event[i]==1){
      if (variable[i-1]>0) {
        v[i] <- NA
        # isTRUE() avoids an error when row i is the last row of the table
        if (isTRUE(first_days_event[i+1]==2)) {
          v[i+1] <- NA
        }
      } else {
        v[i] <- variable[i]
      }
    }
  }
  v}]
A
#> day event variable first_days_event result
#> 1: 1 1 0 1 0
#> 2: 2 1 0 2 0
#> 3: 3 2 5 1 5
#> 4: 4 3 0 1 NA
#> 5: 5 4 0 1 0
#> 6: 6 4 0 2 0
#> 7: 7 4 10 NA 10
#> 8: 8 5 0 1 NA
#> 9: 9 5 1 2 NA
#> 10: 10 5 1 NA 1
#> 11: 11 6 0 1 NA
#> 12: 12 6 0 2 NA
#> 13: 13 6 0 NA 0
#> 14: 14 7 0 1 0
#> 15: 15 7 0 2 0
#> 16: 16 7 0 NA 0
Here is a potential tidyverse approach.
You can store the last value of each group in a temporary column last_var and use lag to shift it onto the first row of the following group for comparison.
Note that the default in lag determines whether event 1 is compared against 0 or NA.
The final mutate checks whether a row is within the first 2 rows of its group and uses last_var to decide whether to set it to NA or leave it alone.
Edit: The ifelse also needs to check that the variable on the first day of the event is 0.
library(tidyverse)
A %>%
group_by(event) %>%
mutate(last_var = ifelse(row_number() == n(), last(variable), 0)) %>%
ungroup %>%
mutate(last_var = lag(last_var, default = 0)) %>%
group_by(event) %>%
mutate(variable = ifelse(row_number() <= 2 & first(last_var) > 0 & first(variable) == 0, NA, variable)) %>%
select(-last_var)
Output
# A tibble: 16 x 3
# Groups: event [7]
day event variable
<int> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 2 5
4 4 3 NA
5 5 4 0
6 6 4 0
7 7 4 10
8 8 5 NA
9 9 5 NA
10 10 5 1
11 11 6 NA
12 12 6 NA
13 13 6 0
14 14 7 0
15 15 7 0
16 16 7 0
With the second data frame included in the comments:
Output
# A tibble: 11 x 3
# Groups: event [5]
event variable desire_output
<dbl> <dbl> <dbl>
1 1 140 140
2 2 NA NA
3 2 NA NA
4 3 569 569
5 4 28 28
6 4 0 0
7 4 0 0
8 4 0 0
9 4 100 100
10 5 NA NA
11 5 NA NA
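A shorter variant of the same idea, offered as a sketch rather than as the answer's method: if the rows are already sorted by day, the value on the day before an event starts is simply the lagged variable on the event's first row, so the temporary last_var step can be skipped. The column name prev is illustrative, and default = 0 is an assumption that leaves event 1 untouched.

```r
library(dplyr)

A <- data.frame(
  day = 1:16,
  event = c(1,1,2,3,4,4,4,5,5,5,6,6,6,7,7,7),
  variable = c(0,0,5,0,0,0,10,0,1,1,0,0,0,0,0,0)
)

res <- A %>%
  # value on the previous day, computed before grouping
  mutate(prev = lag(variable, default = 0)) %>%
  group_by(event) %>%
  mutate(variable = if_else(
    row_number() <= 2 & first(prev) > 0 & first(variable) == 0,
    NA_real_, variable
  )) %>%
  ungroup() %>%
  select(-prev)
```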

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which counts the NAs in a row. The second mutate then multiplies the number of NAs by three and adds this to the non-NA element immediately before the NAs.
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Note that this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE.
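If several separate NA runs per student are possible, one way around that limitation is to restart the counter at every run. The helper below is a sketch in base R; fill_plus3 is a hypothetical name, and applying it per student with ave() is just one option (it could equally be called inside a data.table `by = student` step).

```r
fill_plus3 <- function(x) {
  na <- is.na(x)
  # position inside each run of consecutive values: 1, 2, 3, ... per run
  pos <- sequence(rle(na)$lengths)
  # last observation carried forward (NA if nothing has been observed yet)
  locf <- c(NA, x[!na])[cumsum(!na) + 1]
  # only the missing entries are replaced
  ifelse(na, locf + 3 * pos, x)
}

data <- data.frame(
  student  = c(1,1,1,1,2,2,2,2,3,3,3,3,4),
  timeHAVE = c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3)
)
data$filled <- ave(data$timeHAVE, data$student, FUN = fill_plus3)

# two separate NA runs within one vector
two_runs <- fill_plus3(c(2, 5, NA, 11, NA, NA))
```

A leading NA stays NA because there is no earlier value to build on.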
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

R: creating multiple new variables based on conditions of selection of other variables with similar names

I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with suffixes x, y, z, a, b; they are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is what the data look like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1
Here is an option with tidyverse
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to subset columns
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA
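For the question's original data frame (with the ID column), the same idea can be looped over a vector of prefixes so that further conditions only need to be added to that vector. This is a sketch in base R; conds, cols and flags are illustrative names, and it assumes every condition's columns share a unique prefix.

```r
df <- data.frame(matrix(1:16, 5, 16))
names(df) <- c("ID", paste0(rep(c("hope", "dream", "joy"), each = 5),
                            c("x", "y", "z", "a", "b")))
df[1, 2:6] <- NA
df[3:5, c(7, 10, 14)] <- NA

conds <- c("hope", "dream", "joy")
cols <- names(df)   # fixed up front so the new columns are never matched

# one 0/1 column per prefix: 1 if any of its columns is non-NA in that row
flags <- sapply(conds, function(p) {
  as.integer(rowSums(!is.na(df[startsWith(cols, p)])) > 0)
})
df <- cbind(df, flags)
```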

Resources