R logical test + replace output within group

I have a large dataframe. As an example:
Week <- c(1, 1, 1, 2, 2, 2)
Outcome <- c( FALSE, FALSE , TRUE , TRUE, FALSE, FALSE)
df <- data.frame(Week, Outcome)
Week Outcome
1 1 FALSE
2 1 FALSE
3 1 TRUE
4 2 TRUE
5 2 FALSE
6 2 FALSE
In Outcome, I would like to change FALSE to NA whenever Outcome contains a TRUE within the same Week.
So in this case the result will be:
Week Outcome
1 1 NA
2 1 NA
3 1 TRUE
4 2 TRUE
5 2 NA
6 2 NA
Thanks in advance for suggestions.

If some weeks could be all FALSE and you don't want to change those, you can do it all in one pass:
df$Outcome[df$Week %in% unique(df$Week[df$Outcome]) & (!df$Outcome)] <- NA
df
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
Extended example:
Week <- c(1, 1, 1, 2, 2, 2, 3, 3)
Outcome <- c( FALSE, FALSE , TRUE , TRUE, FALSE, FALSE, FALSE, FALSE)
df2 <- data.frame(Week, Outcome)
df2$Outcome[df2$Week %in% unique(df2$Week[df2$Outcome]) & (!df2$Outcome)] <- NA
df2
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
#7 3 FALSE
#8 3 FALSE
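To see why the one-pass indexing leaves the all-FALSE week alone, the mask can be built step by step. This is only a minimal sketch in base R, assuming Outcome contains no NA to begin with; the intermediate names weeks_with_true and to_blank are made up for illustration.

```r
df2 <- data.frame(
  Week    = c(1, 1, 1, 2, 2, 2, 3, 3),
  Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Weeks that contain at least one TRUE (weeks 1 and 2, but not 3)
weeks_with_true <- unique(df2$Week[df2$Outcome])

# Rows that are FALSE and belong to one of those weeks
to_blank <- df2$Week %in% weeks_with_true & !df2$Outcome

# Only those rows are overwritten; week 3 is never selected
df2$Outcome[to_blank] <- NA
```

Week 3 never enters weeks_with_true, so its FALSE values stay as they are.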

We can negate 'Outcome' to select the FALSE values and assign NA to them. This ignores Week, so it matches the expected output only when every week contains a TRUE.
df$Outcome[!df$Outcome] <- NA
-output
df
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
If some weeks contain no TRUE, an option is
library(dplyr)
df2 %>%
group_by(Week) %>%
mutate(Outcome = replace(Outcome, any(Outcome) & !Outcome, NA))
-output
# A tibble: 8 x 2
# Groups: Week [3]
# Week Outcome
# <dbl> <lgl>
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
#7 3 FALSE
#8 3 FALSE
Or using base R
df2$Outcome <- with(df2, !(NA^ave(Outcome, Week, FUN = any) & !Outcome))
data
df2 <- structure(list(Week = c(1, 1, 1, 2, 2, 2, 3, 3), Outcome = c(FALSE,
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)), class = "data.frame",
row.names = c(NA,
-8L))
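The NA^ idiom in the base R one-liner relies on NA^0 being 1 while NA^1 stays NA. A more explicit sketch of the same grouped logic is shown below; it is an illustrative rewrite rather than the answer's own code, and the names na_pow, has_true and result are assumptions.

```r
df2 <- data.frame(
  Week    = c(1, 1, 1, 2, 2, 2, 3, 3),
  Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# The trick behind NA^x: FALSE counts as 0 and TRUE as 1,
# and anything to the power 0 is 1, even NA
na_pow <- NA^c(FALSE, TRUE)

# ave() applies any() within each Week and repeats the result on every row
has_true <- as.logical(ave(df2$Outcome, df2$Week, FUN = any))

# Blank out FALSE only in weeks that also contain a TRUE
result <- replace(df2$Outcome, has_true & !df2$Outcome, NA)
```

Spelling out has_true makes the grouping step visible, at the cost of one extra line compared with the one-liner.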

Related

Filter logical columns

I am not able to filter on the logical column below and wanted to check whether there is a way to do it. If I filter on FALSE, each row where Var is NA comes back with every column set to NA. I need the rows where Var is NA, but with the values of the other columns kept.
library(dplyr)
asd <- data.frame(Cat = c("A","B","B","A","B","A"), Start_num = c(2, 5, 1, 6, 6, 4), End_num = c(3, 7, 4, 7, 8, 5))
new <- asd %>% arrange(Cat,Start_num) %>%
group_by(Cat) %>%
mutate(Var=lead(Start_num)>End_num)
new <- as.data.frame(new)
new[new$Var != FALSE,]
Cat Start_num End_num Var
1 A 2 3 TRUE
2 A 4 5 TRUE
NA <NA> NA NA NA
4 B 1 4 TRUE
NA.1 <NA> NA NA NA
If you want to subset on the TRUE values while the column also contains NA values, you need to test for both: not NA and TRUE.
new[!is.na(new$Var) & new$Var == TRUE,]
# A tibble: 3 x 4
# Groups: Cat [2]
Cat Start_num End_num Var
<chr> <dbl> <dbl> <lgl>
1 A 2 3 TRUE
2 A 4 5 TRUE
3 B 1 4 TRUE
Why not use %in%?
new[new$Var %in% TRUE, ]
# Cat Start_num End_num Var
# 1 A 2 3 TRUE
# 2 A 4 5 TRUE
# 4 B 1 4 TRUE
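The all-NA rows in the question appear because an NA in a logical row index returns a row of NAs. A small base R sketch of the three behaviours follows; the data frame is typed in by hand (Var worked out from the question's code) so that no package is needed, and the result names are illustrative.

```r
new <- data.frame(
  Cat       = c("A", "A", "A", "B", "B", "B"),
  Start_num = c(2, 4, 6, 1, 5, 6),
  End_num   = c(3, 5, 7, 4, 7, 8),
  Var       = c(TRUE, TRUE, NA, TRUE, FALSE, NA)
)

# An NA in the logical index produces an all-NA row, as seen in the question
with_na_rows <- new[new$Var != FALSE, ]

# which() drops the NA positions, so only rows where Var is TRUE remain
only_true <- new[which(new$Var), ]

# Keep the rows where Var is NA as well, with their other columns intact
true_or_na <- new[is.na(new$Var) | new$Var, ]
```

The last form may be what the question is after if the NA rows should be kept rather than dropped.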

Mutating column to dataframe using apply function by group

I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems like it is not applied by group and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using the arrange() and lead() functions, but since some timeslot values are shared by many stations in my larger dataset, I could not get this to work.
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE
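For comparison, the grouped check can also be sketched in base R with ave(), which runs a function separately within each id. This assumes the intended rule is "station equals s and some row with the same id has timeslot + 1"; the helper name has_next is made up for illustration.

```r
id       <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station  <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2

# Sort as in the question
df <- df[order(df$id, df$timeslot), ]

# Within each id, flag timeslots whose successor (timeslot + 1) also occurs.
# ave() returns 0/1 here because the input column is numeric.
has_next <- ave(df$timeslot, df$id, FUN = function(t) (t + 1) %in% t) == 1

df$match <- df$station == s & has_next
```

Because %in% is vectorised, no row-wise apply() is needed, which may also help with speed on a larger dataset.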

Counting sequential rows with the same / missing value

I have a time series dataset containing different sensor measurements. The sensor software has some bugs, resulting in missing measurements. I added the missing measurement times, resulting in NAs in the "value" column. The dataset looks as follows:
df <- structure(list(time_id = 1:10, value = c(-1.80603125680195, -0.582075924689333,
NA, NA, -0.162309523556819, NA, NA, NA, 1.6059096288573, NA),
is_missing = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE, TRUE)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))
df
I want to group sequential rows with numeric vs missing values and at the same time count the number of sequential rows in each group. The result should look like this:
df %>% mutate(group = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
seq_NA = c(1:2, 1:2, 1, 1:3, 1, 1))
Help is very much appreciated!
Here is another idea. We use is.na() to flag the NAs and start a new group whenever that flag changes between consecutive rows, i.e.
df %>%
group_by(grp = cumsum(c(1, diff(is.na(value)) != 0))) %>%
mutate(seq_NA = seq(n()))
which gives,
# A tibble: 10 x 5
# Groups: grp [6]
time_id value is_missing grp seq_NA
<int> <dbl> <lgl> <dbl> <int>
1 1 -1.81 FALSE 1 1
2 2 -0.582 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.162 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.61 FALSE 5 1
10 10 NA TRUE 6 1
Here is a base R solution using ave() + rle()
df$group <- with(df, rep(seq_along(z<-rle(is_missing)$lengths),z))
df$seq_NA <- with(df,ave(seq(nrow(df)),group,FUN = seq_along))
such that
> df
time_id value is_missing group seq_NA
1 1 -1.8060313 FALSE 1 1
2 2 -0.5820759 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.1623095 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.6059096 FALSE 5 1
10 10 NA TRUE 6 1
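To make the run-length idea concrete, here is a minimal base R sketch on the is_missing vector alone. It swaps the second ave() call for sequence(), which is a different but equivalent way to restart the counter; the variable name runs is illustrative.

```r
is_missing <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

# rle() collapses consecutive equal values into (length, value) pairs
runs <- rle(is_missing)

# One group id per run, repeated for the length of that run
group <- rep(seq_along(runs$lengths), runs$lengths)

# sequence() counts 1..n separately for every run length
seq_NA <- sequence(runs$lengths)
```

Both vectors can then be attached to the data frame as columns.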

Value matching with NA - missing values - using mutate

I am somewhat stuck. Is there a better way than the below to do value matching considering NAs as "real values" within mutate?
library(dplyr)
data_foo <- data.frame(A= c(1:2, NA, 4, NA), B = c(1, 3, NA, NA, 4))
Not the desired output:
data_foo %>% mutate(irr = A==B)
#> A B irr
#> 1 1 1 TRUE
#> 2 2 3 FALSE
#> 3 NA NA NA
#> 4 4 NA NA
#> 5 NA 4 NA
data_foo %>% rowwise() %>% mutate(irr = A%in%B)
#> Source: local data frame [5 x 3]
#> Groups: <by row>
#>
#> # A tibble: 5 x 3
#> A B irr
#> <dbl> <dbl> <lgl>
#> 1 1 1 TRUE
#> 2 2 3 FALSE
#> 3 NA NA FALSE
#> 4 4 NA FALSE
#> 5 NA 4 FALSE
Desired output: The code below produces the desired column, irr. I am using these somewhat cumbersome helper columns. Is there a shorter way?
data_foo %>%
mutate(NA_A = is.na(A),
NA_B = is.na(B),
irr = if_else(is.na(A)|is.na(B), NA_A == NA_B, A == B))
#> A B NA_A NA_B irr
#> 1 1 1 FALSE FALSE TRUE
#> 2 2 3 FALSE FALSE FALSE
#> 3 NA NA TRUE TRUE TRUE
#> 4 4 NA FALSE TRUE FALSE
#> 5 NA 4 TRUE FALSE FALSE
Using map2
library(tidyverse)
data_foo %>%
mutate(irr = map2_lgl(A, B, `%in%`))
# A B irr
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
Or with setequal
data_foo %>%
rowwise %>%
mutate(irr = setequal(A, B))
The above method is concise, but it is also loopy. We can replace the NA with a different value and then do the ==
data_foo %>%
mutate_all(list(new = ~ replace_na(., -999))) %>%
transmute(A, B, irr = A_new == B_new)
# A B irr
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
Or with bind_cols and reduce
data_foo %>%
mutate_all(replace_na, -999) %>%
reduce(`==`) %>%
bind_cols(data_foo, irr = .)
Maybe simpler than akrun's answer?
Either of the two ways below will produce the expected result. Note that as.character won't do it, because the return value of as.character(NA) is NA_character_.
data_foo %>%
mutate(irr = paste(A) == paste(B))
data_foo %>%
mutate(irr = sQuote(A) == sQuote(B))
#Source: local data frame [5 x 3]
#Groups: <by row>
#
## A tibble: 5 x 3
# A B irr
# <dbl> <dbl> <lgl>
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
Edit.
Following the comments below I have updated the code and it now follows akrun's suggestion.
There is also the excellent idea in tmfmnk's answer. I use a similar one in yet another way of solving the question's problem.
The documentation of all.equal says that
Do not use all.equal directly in if expressions—either use
isTRUE(all.equal(....)) or identical if appropriate.
Though there is no if expression in mutate, I believe that it is more stable than identical and has the same effect if the values being compared are (sort of/in fact) equal.
# all.equal() compares whole vectors, so go row by row
data_foo %>%
rowwise() %>%
mutate(irr = isTRUE(all.equal(A, B)))
Could also be a possibility:
data_foo %>%
rowwise() %>%
mutate(irr = identical(A, B)) %>%
ungroup()
A B irr
<dbl> <dbl> <lgl>
1 1 1 TRUE
2 2 3 FALSE
3 NA NA TRUE
4 4 NA FALSE
5 NA 4 FALSE
The coalesce function is useful if you want to perform an action when a value is NA:
data_foo %>%
mutate(irr = coalesce(A == B, is.na(A) & is.na(B)))
# A B irr
# 1 1 1 TRUE
# 2 2 3 FALSE
# 3 NA NA TRUE
# 4 4 NA FALSE
# 5 NA 4 FALSE
Same thing for > 2 columns
data_foo %>%
mutate(irr = coalesce(reduce(., `==`), rowMeans(is.na(.)) == 1))
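With three or more columns, chaining == through reduce() compares the logical result of the first comparison with the next column, so comparing every column against the first one may be safer. A minimal base R sketch follows; the helper na_equal and the extra column C are hypothetical additions, not part of the question.

```r
data_foo <- data.frame(A = c(1, 2, NA, 4, NA),
                       B = c(1, 3, NA, NA, 4),
                       C = c(1, 2, NA, 4, NA))  # C is an added illustrative column

# TRUE when both are NA, FALSE when exactly one is NA, otherwise the usual ==
na_equal <- function(x, y) {
  ifelse(is.na(x) | is.na(y), is.na(x) & is.na(y), x == y)
}

# Compare each remaining column with the first one (one logical column per pair)
cmp <- sapply(data_foo[-1], na_equal, x = data_foo[[1]])

# A row matches only if every pairwise comparison holds
data_foo$irr <- rowSums(cmp) == ncol(cmp)
```

Rows 1 and 3 come out TRUE here: row 1 has three equal values and row 3 is NA in every column.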

How to merge variables looping through by variable number in R

I have a dataframe with a lot of variables seen in multiple conditions. I'd like to merge each variable by condition.
The example data frame is a simplified version of what I have (3 variables over 2 conditions).
VAR.B_1 <- c(1, 2, 3, 4, 5, 'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_2 <- c(2, 2, 3, 4, 5,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_3 <- c(1, 1, 1, 1, 1,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.E_1 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
VAR.E_2 <- c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5)
VAR.E_3 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
Condition <- c("B", "B","B","B","B","E","E","E","E","E")
#Example dataset
data<-as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3, Condition))
I want to end up with this, appended to the original data frame:
VAR_1 VAR_2 VAR_3
1 2 1
2 2 1
3 3 1
4 4 1
5 5 1
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
I understand that R won't work with i inside the variable name, but I have an example of the kind of for loop I was trying to do. I would rather not call variables by column location, since there will be a lot of variables.
##Example of how I want to merge - this code does not work
for(i in 1:3) {
data$VAR_[,i] <-ifelse(data$Condition == "B", VAR.B_[,i],
ifelse(data$Condition == "E", VAR.E_[,i], NA))
}
This might work for your situation:
library(tidyverse)
library(stringr)
data %>%
mutate_all(as.character) %>%
gather(key, value, -Condition) %>%
filter(!is.na(value), value != "NA") %>%
mutate(key = str_replace(key, paste0("\\.", Condition), "")) %>%
group_by(Condition, key) %>%
mutate(rowid = 1:n()) %>%
spread(key, value) %>%
bind_cols(data)
#> # A tibble: 10 x 12
#> # Groups: Condition [2]
#> Condition rowid VAR_1 VAR_2 VAR_3 VAR.B_1 VAR.B_2 VAR.B_3 VAR.E_1
#> <chr> <int> <chr> <chr> <chr> <fctr> <fctr> <fctr> <fctr>
#> 1 B 1 1 2 1 1 2 1 NA
#> 2 B 2 2 2 1 2 2 1 NA
#> 3 B 3 3 3 1 3 3 1 NA
#> 4 B 4 4 4 1 4 4 1 NA
#> 5 B 5 5 5 1 5 5 1 NA
#> 6 E 1 1 1 1 NA NA NA 1
#> 7 E 2 1 2 1 NA NA NA 1
#> 8 E 3 1 3 1 NA NA NA 1
#> 9 E 4 1 4 1 NA NA NA 1
#> 10 E 5 1 5 1 NA NA NA 1
#> # ... with 3 more variables: VAR.E_2 <fctr>, VAR.E_3 <fctr>,
#> # Condition1 <fctr>
data.frame(lapply(split.default(data[-NCOL(data)], gsub("\\D+", "", head(names(data), -1))),
function(a){
a = sapply(a, function(x) as.numeric(as.character(x)))
rowSums(a, na.rm = TRUE)
}))
# X1 X2 X3
#1 1 2 1
#2 2 2 1
#3 3 3 1
#4 4 4 1
#5 5 5 1
#6 1 1 1
#7 1 2 1
#8 1 3 1
#9 1 4 1
#10 1 5 1
#Warning messages:
#1: In FUN(X[[i]], ...) : NAs introduced by coercion
#2: In FUN(X[[i]], ...) : NAs introduced by coercion
#3: In FUN(X[[i]], ...) : NAs introduced by coercion
Your data appears to have two kinds of NA values in it. It has NA, or R's NA value, and it also has the string 'NA'. In my solution below, I replace both with zero, cast each column in the data frame to numeric, and then just sum together like-numbered VAR columns. Then, drop the original columns which you don't want anymore.
data <- as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3),
stringsAsFactors=FALSE)
data[is.na(data)] <- 0
data[data == 'NA'] <- 0
data <- as.data.frame(lapply(data, as.numeric))
data$VAR_1 <- data$VAR.B_1 + data$VAR.E_1
data$VAR_2 <- data$VAR.B_2 + data$VAR.E_2
data$VAR_3 <- data$VAR.B_3 + data$VAR.E_3
data <- data[c("VAR_1", "VAR_2", "VAR_3")]
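The loop from the question can be made to work by building the column names as strings and looking them up with [[ ]]. This is only a sketch under simplifying assumptions: the data frame below has two variables instead of three and is built directly rather than with cbind(), and the names b, e and merged are illustrative.

```r
# Same shape as the question's data: the string "NA" in the B columns,
# real NA in the E columns
data <- data.frame(
  VAR.B_1 = c(1, 2, 3, 4, 5, "NA", "NA", "NA", "NA", "NA"),
  VAR.B_2 = c(2, 2, 3, 4, 5, "NA", "NA", "NA", "NA", "NA"),
  VAR.E_1 = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1),
  VAR.E_2 = c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5),
  Condition = rep(c("B", "E"), each = 5),
  stringsAsFactors = FALSE
)

for (i in 1:2) {
  # Build the column names as strings and look them up with [[ ]]
  b <- data[[paste0("VAR.B_", i)]]
  e <- data[[paste0("VAR.E_", i)]]
  # Take the value from the column that matches each row's Condition
  merged <- ifelse(data$Condition == "B", b, e)
  # The B columns are character, so convert the merged result back to numeric
  data[[paste0("VAR_", i)]] <- as.numeric(merged)
}
```

Since the column names are assembled from i, the same loop extends to any number of variables without referring to column positions.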