Recode and sum to NA when all values are NA in R

I need to set the total to NA when all the item columns are NA for a given id.
Here is what my sample dataset looks like:
df <- data.frame(id = c(1,2,3),
i1 = c(1,NA,0),
i2 = c(1,NA,1),
i3 = c(1,NA,0),
total = c(3,0,1))
> df
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA 0
3 3 0 1 0 1
For the second id the total should be NA instead of 0 because all the values are NA for the second id. How can I change the dataset to below?
> df1
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA NA
3 3 0 1 0 1

We could create a condition with if_all in case_when that returns NA when all the column values in a row are NA, and otherwise computes rowSums with na.rm = TRUE:
library(dplyr)
df %>%
mutate(total = case_when(if_all(i1:i3, is.na) ~ NA_real_,
TRUE ~ rowSums(across(i1:i3), na.rm = TRUE)))
-output
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA NA
3 3 0 1 0 1
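The same rule can also be written without dplyr. The following is a minimal base R sketch, assuming the item columns are named i1 to i3 as in the question; the helper name `sum_or_na` is made up for illustration and is not part of any package.

```r
# Row sums that return NA when every value in the row is NA.
# `sum_or_na` is a hypothetical helper name used only for this sketch.
sum_or_na <- function(x) {
  x <- as.matrix(x)
  all_missing <- rowSums(!is.na(x)) == 0   # TRUE where a row has no valid value
  totals <- rowSums(x, na.rm = TRUE)       # an all-NA row sums to 0 at this point
  totals[all_missing] <- NA                # overwrite those rows with NA
  totals
}

df <- data.frame(id = c(1, 2, 3),
                 i1 = c(1, NA, 0),
                 i2 = c(1, NA, 1),
                 i3 = c(1, NA, 0))
df$total <- sum_or_na(df[c("i1", "i2", "i3")])
df$total
# 3 NA 1
```

Counting the valid values first avoids relying on the fact that rowSums with na.rm = TRUE returns 0 for an all-NA row.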

Related

Drop columns with many missing values in R

I am trying to drop some columns that have less than 5 valid values. Here is an example dataset.
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
i1 = c(0,1,1,1,1,0,0,1,NA,1),
i2 = c(1,0,0,1,0,1,1,0,0,NA),
i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
> df
id i1 i2 i3 i4
1 1 0 1 NA NA
2 2 1 0 NA 1
3 3 1 0 NA NA
4 4 1 1 NA NA
5 5 1 0 NA NA
6 6 0 1 NA NA
7 7 0 1 NA NA
8 8 1 0 NA NA
9 9 NA 0 NA 1
10 10 1 NA 0 NA
In this case, columns i3 and i4 need to be dropped from the data frame.
How can I get the desired dataset below?
> df
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
You can keep the columns with at least 5 non-missing values with:
df[colSums(!is.na(df)) >= 5]
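To see why this one-liner works, it can help to look at the intermediate values. The sketch below rebuilds the question's data and prints the per-column counts; the variable names `valid_counts`, `keep`, and `df_kept` are only illustrative.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 i1 = c(0, 1, 1, 1, 1, 0, 0, 1, NA, 1),
                 i2 = c(1, 0, 0, 1, 0, 1, 1, 0, 0, NA),
                 i3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0),
                 i4 = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, NA))

# !is.na(df) is a logical matrix; colSums counts the TRUEs per column,
# i.e. the number of non-missing values in each column
valid_counts <- colSums(!is.na(df))
valid_counts
# id i1 i2 i3 i4
# 10  9  9  1  2

keep <- valid_counts >= 5   # logical vector, one entry per column
df_kept <- df[keep]         # single-bracket indexing selects columns
names(df_kept)
# "id" "i1" "i2"
```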
You can use discard from the purrr package:
library(purrr)
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
i1 = c(0,1,1,1,1,0,0,1,NA,1),
i2 = c(1,0,0,1,0,1,1,0,0,NA),
i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
df %>%
discard(~ sum(!is.na(.))<5)
#> id i1 i2
#> 1 1 0 1
#> 2 2 1 0
#> 3 3 1 0
#> 4 4 1 1
#> 5 5 1 0
#> 6 6 0 1
#> 7 7 0 1
#> 8 8 1 0
#> 9 9 NA 0
#> 10 10 1 NA
Created on 2022-11-10 with reprex v2.0.2
While this is likely slower than base R methods (for datasets with very many columns, say more than 1000), I generally find the code far more readable. It is also easy to extend to more complicated conditions.
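As an illustration of a more complicated condition, the sketch below drops only numeric columns with fewer than 5 valid values and leaves columns of other types alone. The `grp` column is a hypothetical addition that is not in the question's data.

```r
library(purrr)

df <- data.frame(id  = 1:10,
                 grp = rep(c("a", NA), times = c(2, 8)),  # hypothetical, mostly missing
                 i1  = c(0, 1, 1, 1, 1, 0, 0, 1, NA, 1),
                 i3  = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0))

# The predicate is evaluated once per column; && is fine because each
# side yields a single TRUE/FALSE for the column
res <- discard(df, ~ is.numeric(.x) && sum(!is.na(.x)) < 5)
names(res)
# "id" "grp" "i1"
```

Here grp survives despite having only two valid values, because the predicate only targets numeric columns.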
Using base R, another approach (counting non-missing values, so that columns with fewer than 5 valid values are dropped):
> df[, sapply(df, function(x) sum(!is.na(x))) >= 5]
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
A performance comparison of the different answers given in this post:
library(dplyr)
library(purrr)
funs = list(
colSums = function(df){df[colSums(!is.na(df)) >= nrow/10]},
sapply = function(df){df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10]},
discard = function(df){df %>% discard(~ sum(!is.na(.)) < nrow/10)},
mutate = function(df){df %>% mutate(across(where(~ sum(!is.na(.)) < nrow/10), ~ NULL))},
select = function(df){df %>% select(where(~ sum(!is.na(.)) >= nrow/10))})
ncol = 10000
nrow = 100
df = replicate(ncol, sample(c(1:9, NA), nrow, TRUE)) %>% as_tibble()
avrtime = map_dbl(funs, function(f){
duration = c()
for(i in 1:10){
t1 = Sys.time()
f(df)
t2 = Sys.time()
duration[i] = as.numeric(t2 - t1)}
return(mean(duration))})
avrtime[order(avrtime)]
The average time taken by each (in seconds):
colSums sapply discard select mutate
0.04510500 0.04692972 0.29207475 0.29451160 0.31755514
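The loop above converts a difftime with as.numeric(t2 - t1), which takes whatever unit R picked for that difference. A small alternative sketch uses system.time, which always reports seconds. The helper name `time_fun` and the smaller data size are assumptions made for illustration, so the numbers will not match the table above.

```r
# `time_fun` is a hypothetical helper: average elapsed seconds over `reps` runs
time_fun <- function(f, data, reps = 10) {
  elapsed <- vapply(seq_len(reps), function(i) {
    system.time(f(data))[["elapsed"]]   # elapsed wall-clock time in seconds
  }, numeric(1))
  mean(elapsed)
}

set.seed(1)
# 100 rows x 200 columns of digits with some NAs (smaller than the benchmark above)
small_df <- as.data.frame(replicate(200, sample(c(1:9, NA), 100, TRUE)))

keep_colSums <- function(d) d[colSums(!is.na(d)) >= 10]
avg <- time_fun(keep_colSums, small_df)
avg
```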
Using select
library(dplyr)
df %>%
select(where(~ sum(complete.cases(.x)) >=5))
-output
id i1 i2
1 1 0 1
2 2 1 0
3 3 1 0
4 4 1 1
5 5 1 0
6 6 0 1
7 7 0 1
8 8 1 0
9 9 NA 0
10 10 1 NA
Or in base R
Filter(\(x) sum(complete.cases(x)) >=5 , df)

how to count number of response values by time thresholds in r

I have a student dataset that includes responses to questions as right or wrong. There is also a time variable in seconds. I would like to create time flags that record the number of correct and incorrect responses up to the 1-minute, 2-minute, and 3-minute thresholds. Here is a sample dataset.
df <- data.frame(id = c(1,2,3,4,5),
gender = c("m","f","m","f","m"),
age = c(11,12,12,13,14),
i1 = c(1,0,NA,1,0),
i2 = c(0,1,0,"1]",1),
i3 = c("1]",1,"1]",0,"0]"),
i4 = c(0,"0]",1,1,0),
i5 = c(1,1,NA,"0]","1]"),
i6 = c(0,0,"0]",1,1),
i7 = c(1,"1]",1,0,0),
i8 = c(0,0,0,"1]","1]"),
i9 = c(1,1,1,0,NA),
time = c(115,138,148,195, 225))
> df
id gender age i1 i2 i3 i4 i5 i6 i7 i8 i9 time
1 1 m 11 1 0 1] 0 1 0 1 0 1 115
2 2 f 12 0 1 1 0] 1 0 1] 0 1 138
3 3 m 12 NA 0 1] 1 <NA> 0] 1 0 1 148
4 4 f 13 1 1] 0 1 0] 1 0 1] 0 195
5 5 m 14 0 1 0] 0 1] 1 0 1] NA 225
The minute thresholds are marked by a ] sign to the right of the score.
For example, for id = 3, the 1-minute threshold is at item i3 and the 2-minute threshold is at item i6. Each student may have different time thresholds.
I need to create flag variables that count the number of correct and incorrect responses up to the 1-minute, 2-minute, and 3-minute thresholds.
How can I achieve the desired dataset below?
> df1
id gender age i1 i2 i3 i4 i5 i6 i7 i8 i9 time one_true one_false two_true two_false three_true three_false
1 1 m 11 1 0 1] 0 1 0 1 0 1 115 2 1 NA NA NA NA
2 2 f 12 0 1 1 0] 1 0 1] 0 1 138 2 2 4 3 NA NA
3 3 m 12 NA 0 1] 1 <NA> 0] 1 0 1 148 1 1 2 2 NA NA
4 4 f 13 1 1] 0 1 0] 1 0 1] 0 195 2 0 3 2 5 3
5 5 m 14 0 1 0] 0 1] 1 0 1] NA 225 1 2 2 3 4 4
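To make the counting rule concrete, here is a small sketch for the single row id = 3, assuming the responses are held as character strings as in the sample data. It locates the items carrying a ] and counts the 1s and 0s up to each of them; the variable names are only illustrative.

```r
# Responses for id = 3, copied from the sample data
row3 <- c(i1 = NA, i2 = "0", i3 = "1]", i4 = "1", i5 = NA,
          i6 = "0]", i7 = "1", i8 = "0", i9 = "1")

# Positions of the items that carry a threshold marker
marks <- which(grepl("]", row3, fixed = TRUE))
names(row3)[marks]
# "i3" "i6"

# Cumulative counts of correct (1) and incorrect (0) responses up to each
# threshold; grepl returns FALSE for NA, so missing responses are not counted
counts <- sapply(marks, function(m) {
  seen <- row3[seq_len(m)]
  c(true = sum(grepl("1", seen)), false = sum(grepl("0", seen)))
})
colnames(counts) <- names(row3)[marks]
counts
#       i3 i6
# true   1  2
# false  1  2
```

These values match one_true, one_false, two_true, and two_false for id = 3 in the desired output; there is no third marker, so the three-minute counts stay NA.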
library(tidyverse)
df %>%
pivot_longer(i1:i9,values_transform = as.character) %>%
group_by(id)%>%
mutate(vs = rev(cumsum(replace_na(str_detect(rev(value),']'),FALSE))))%>%
filter(vs > 0)%>%
mutate(vs = max(vs) - vs + 1)%>%
group_by(vs,.add = TRUE)%>%
summarise(true = sum(str_detect(value, '1'), na.rm = TRUE),
false = sum(str_detect(value, '0'), na.rm = TRUE),
.groups = "drop_last")%>%
mutate(across(c(true, false),cumsum)) %>%
pivot_wider(id_cols = id, names_from = vs, values_from = c(true, false))
# A tibble: 5 x 7
# Groups: id [5]
id true_1 true_2 true_3 false_1 false_2 false_3
<dbl> <int> <int> <int> <int> <int> <int>
1 1 2 NA NA 1 NA NA
2 2 2 4 NA 2 3 NA
3 3 1 2 NA 1 2 NA
4 4 2 3 5 0 2 3
5 5 1 2 4 2 3 4
You could also accomplish the same in base R:
fun <- function(x){
a <- diff(c(0,which(grepl("]", x))))
f_sum <- function(x,y) sum(na.omit(grepl(x,y)))
fn <- function(x) c(true = f_sum('1',x), false = f_sum('0',x))
y <- tapply(x[seq(sum(a))], rep(seq_along(a),a), fn)
s <- do.call(rbind, Reduce("+", y, accumulate = TRUE))
nms <- do.call(paste, c(sep='_',expand.grid(colnames(s), seq(nrow(s)))))
setNames(c(t(s)), nms)
}
fun2 <- function(x){
ln <- lengths(x)
nms <- names(x[[which.max(ln)]])
do.call(rbind, lapply(x, function(x)setNames(`length<-`(x,max(ln)),nms)))
}
fun2(apply(df[4:12],1,fun))
true_1 false_1 true_2 false_2 true_3 false_3
[1,] 2 1 NA NA NA NA
[2,] 2 2 4 3 NA NA
[3,] 1 1 2 2 NA NA
[4,] 2 0 3 2 5 3
[5,] 1 2 2 3 4 4

change all columns conditional on other column

I have the following df:
df = data.frame(a = c(0,1,0,0,1),
b= c(0,0,0,1,0),
SL = c(1,0,1,0,0))
df2 = data.frame(a = c(NA,1,NA,0,1),
b= c(NA,0,NA,1,0),
SL = c(NA,0,NA,0,0))
Now, I would like to change all values in a row to NA if SL == 1, like in df2. I tried with dplyr --> mutate(), across(), mutate_all but wasn't successful.
An option with dplyr would be
library(dplyr)
df <- df %>%
mutate(across(everything(), ~ case_when(SL != 1 ~ .x)))
df
# a b SL
#1 NA NA NA
#2 1 0 0
#3 NA NA NA
#4 0 1 0
#5 1 0 0
Using %in%.
df[df$SL %in% 1, ] <- NA
df
# a b SL
# 1 NA NA NA
# 2 1 0 0
# 3 NA NA NA
# 4 0 1 0
# 5 1 0 0
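One reason to prefer %in% over == here is how the two treat missing values. The sketch below assumes SL can itself contain NA, which is not the case in the question's data: == returns NA for a missing SL, and an NA is not a usable row index for this kind of assignment, while %in% returns FALSE.

```r
SL <- c(1, NA, 0)   # hypothetical SL column with a missing value

SL == 1
# TRUE   NA FALSE
SL %in% 1
# TRUE FALSE FALSE

df <- data.frame(a = c(0, 1, 0), b = c(0, 0, 1), SL = SL)
df[df$SL %in% 1, ] <- NA   # only the row where SL is exactly 1 is blanked
df
```

The row with a missing SL keeps its a and b values, because %in% treats it as "not equal to 1".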

Replace NA with 0 depending on group (rows) and variable names (column)

I have a large data set and want to replace many NAs, but not all.
In one group I want to replace all NAs with 0.
In the other group I want to replace NAs with 0, but only in variables whose names do not include a certain string, e.g. 'b'.
Here is an example:
group <- c(1,1,2,2,2)
abc <- c(1,NA,NA,NA,NA)
bcd <- c(2,1,NA,NA,NA)
cde <- c(5,NA,NA,1,2)
df <- data.frame(group,abc,bcd,cde)
group abc bcd cde
1 1 1 2 5
2 1 NA 1 NA
3 2 NA NA NA
4 2 NA NA 1
5 2 NA NA 2
This is what I want:
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
This is what i tried:
#set 0 in first group: this works fine
df[is.na(df) & df$group==1] <- 0
#set 0 in second group but only if the variable name does not include b: does not work
df[is.na(df) & df$group==2 & !grepl('b',colnames(df))] <- 0
dplyr solutions are welcome, as are base R ones.
For the second group, create a logical column index with grepl and use it to subset the data while assigning:
j1 <- !grepl('b',colnames(df))
df[j1][df$group == 2 & is.na(df[j1])] <- 0
df
# group abc bcd cde
#1 1 1 2 5
#2 1 0 1 0
#3 2 NA NA 0
#4 2 NA NA 1
#5 2 NA NA 2
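Both groups can also be handled in a single assignment by building one logical matrix that marks the cells to overwrite. This is a sketch under the question's rules (group 1: all NAs, group 2: only NAs in columns whose name lacks "b"); the names `no_b`, `row_ok`, and `mask` are illustrative.

```r
group <- c(1, 1, 2, 2, 2)
abc <- c(1, NA, NA, NA, NA)
bcd <- c(2, 1, NA, NA, NA)
cde <- c(5, NA, NA, 1, 2)
df <- data.frame(group, abc, bcd, cde)

no_b <- !grepl("b", colnames(df))   # TRUE for columns without "b" in the name

# outer() crosses a per-row condition with a per-column condition, which a
# plain `&` between two vectors cannot do (it would recycle element-wise)
row_ok <- df$group == 1 | outer(df$group == 2, no_b, "&")

mask <- is.na(df) & row_ok          # cells that are NA and eligible for replacement
df[mask] <- 0
df
```

The result matches the desired table in the question: abc and bcd stay NA in group 2, while cde is filled.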
Using dplyr::mutate_at you can also do:
library(dplyr)
vars_mutate_1 <- names(df)[-1]
vars_mutate_2 <- grep(x = names(df)[-1], pattern = '^(?!.*b).*$', perl = TRUE, value = TRUE)
df %>%
mutate_at(.vars = vars_mutate_1, .funs = ~ if_else(group == 1 & is.na(.), 0, .)) %>%
mutate_at(.vars = vars_mutate_2, .funs = ~ if_else(group == 2 & is.na(.), 0, .))
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
Alternatively, you can use:
library(dplyr)
df2 <- df %>% mutate_at(vars(names(df)[-1]),
function(x) case_when((group==1 & is.na(x) ) ~ 0,
(group==2 & is.na(x) & !grepl("b",deparse(substitute(x)))) ~ 0,
TRUE ~ x))
> df2
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2

sapply function(x) where x is subsetted argument

So, I want to generate a new vector from the information in two existing numerical ones: one sets the id for the participant, the other indicates the observation number. Each participant has been observed a different number of times.
Now, the new vector should state: 0 when obs_no=1; 1 when obs_no=last observation for that id; NA for cases in between.
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
I figure I could do this separately for every id using code like this:
new_vector <- c(0, rep(NA, times=length(obs_no[id==1])-2), 1)
Or I guess just using max() but it wouldn't make any difference.
But adding each participant manually is really inconvenient since I have a lot of cases. I can't figure out how to make a generic function. I tried to define a function(x) using sapply but can't get it to work since x is positioned within subsetting brackets.
Any advice would be helpful. Thanks.
ave to the rescue:
dat$newvar <- NA
dat$newvar <- with(dat,
ave(newvar, id, FUN=function(x) replace(x, c(length(x),1), c(1,0)) )
)
Or use a bit of duplicated() fun:
dat$newvar <- NA
dat$newvar[!duplicated(dat$id, fromLast=TRUE)] <- 1
dat$newvar[!duplicated(dat$id)] <- 0
Both giving:
# id obs_no new_vector newvar
#1 1 1 0 0
#2 1 2 NA NA
#3 1 3 NA NA
#4 1 4 NA NA
#5 1 5 1 1
#6 2 1 0 0
#7 2 2 1 1
#8 3 1 0 0
#9 3 2 NA NA
#10 3 3 1 1
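The duplicated() approach relies on the fact that the first occurrence of a value is never flagged as a duplicate, and that fromLast = TRUE applies the same rule scanning from the end. A short sketch with the id column from the question (the names `first_row` and `last_row` are illustrative):

```r
id <- c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3)

first_row <- !duplicated(id)                   # TRUE at the first row of each id
last_row  <- !duplicated(id, fromLast = TRUE)  # TRUE at the last row of each id

which(first_row)
# 1 6 8
which(last_row)
# 5 7 10

newvar <- rep(NA_real_, length(id))
newvar[last_row]  <- 1   # assign the last rows first ...
newvar[first_row] <- 0   # ... so an id with a single row ends up as 0
newvar
# 0 NA NA NA 1 0 1 0 NA 1
```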
You can also do this with dplyr
str <- "
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
"
dt <- read.table(textConnection(str), header = T)
library(dplyr)
dt %>% group_by(id) %>%
mutate(newvar = if_else(obs_no==1,0L,if_else(obs_no==max(obs_no),1L,as.integer(NA))))
We can use data.table
library(data.table)
i1 <- setDT(df1)[, .I[seq_len(.N) %in% c(1, .N)], id]$V1
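# recent data.table versions do not recycle a length-2 value, so repeat it explicitly
df1[i1, newvar := rep(c(0, 1), length.out = length(i1))]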
df1
# id obs_no new_vector newvar
# 1: 1 1 0 0
# 2: 1 2 NA NA
# 3: 1 3 NA NA
# 4: 1 4 NA NA
# 5: 1 5 1 1
# 6: 2 1 0 0
# 7: 2 2 1 1
# 8: 3 1 0 0
# 9: 3 2 NA NA
#10: 3 3 1 1
Use split:
result = lapply(split(obs_no, id), function (x) c(0, rep(NA, length(x) - 2), 1))
This gives you a list of vectors. You can paste them back together like this:
do.call(c, result)
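The split version above assumes every id has at least two observations, since rep(NA, length(x) - 2) fails for a single-row group, and that the rows are already ordered by id. A sketch that relaxes the first assumption follows; `flag_first_last` is a hypothetical helper name and the data include a made-up single-observation id.

```r
id     <- c(1, 1, 1, 2, 3, 3)   # id 2 has a single observation (hypothetical case)
obs_no <- c(1, 2, 3, 1, 1, 2)

# Fill a vector of NAs, then mark the last and the first position;
# for a single-row group the first-position rule wins and the result is 0
flag_first_last <- function(x) {
  out <- rep(NA_real_, length(x))
  out[length(x)] <- 1
  out[1] <- 0
  out
}

result <- lapply(split(obs_no, id), flag_first_last)
new_vector <- unlist(result, use.names = FALSE)  # assumes rows are sorted by id
new_vector
# 0 NA 1 0 0 1
```

Treating a lone observation as 0 is a choice made here to match the ave answer above; whether 0 or 1 is appropriate depends on the analysis.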
