I get the wrong result; what am I doing wrong?
library(dplyr)
df <- data.frame(x = c(1, 1, NA), y = c(1, NA, NA), z = c(NA, NA, NA))
df <- mutate(df, result = ifelse(is.na(x), NA, ifelse(any(!is.na(y), !is.na(z)), 1, 0)))
I get this (row 2 of result is 1, but df[2, 4] should be 0):
x y z result
1 1 1 NA 1
2 1 NA NA 1
3 NA NA NA NA
Instead of this:
df_wanted <- data.frame(x=c(1,1,NA),y=c(1,NA,NA),z=c(NA,NA,NA), result=c(1,0,NA))
x y z result
1 1 1 NA 1
2 1 NA NA 0
3 NA NA NA NA
We can use | instead of any, because any returns a single TRUE/FALSE as output:
with(df, any(!is.na(y), !is.na(z)))
#[1] TRUE
That single value gets recycled for the entire column, and because the outer ifelse on 'x' already returns NA for the third row, every other row gets 1.
Instead we need to evaluate the condition for each row, which is what | does:
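For comparison, the element-wise version returns one value per row (just an illustration):
with(df, !is.na(y) | !is.na(z))
#[1]  TRUE FALSE FALSE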
library(dplyr)
df %>%
  mutate(result = ifelse(is.na(x), NA, ifelse(!is.na(y) | !is.na(z), 1, 0)))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
Or another option is case_when
df %>%
  mutate(result = case_when(is.na(x) ~ NA_integer_,
                            !is.na(y) | !is.na(z) ~ 1L,
                            TRUE ~ 0L))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
Or with coalesce
df %>%
  mutate(result = x * +coalesce(!is.na(y) | !is.na(z)))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
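To see why this works (my own breakdown, not part of the original answer): !is.na(y) | !is.na(z) never contains NA, so coalesce() is effectively a pass-through here; the unary + turns the logical into 0/1, and multiplying by x carries the NA from x through:
with(df, +(!is.na(y) | !is.na(z)))      # 1 0 0
with(df, x * +(!is.na(y) | !is.na(z)))  # 1 0 NA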
You can use case_when and specify each condition explicitly.
library(dplyr)
df %>%
  mutate(result = case_when(is.na(x) ~ NA_integer_,
                            !(is.na(y) & is.na(z)) ~ 1L,
                            TRUE ~ 0L))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
I need the summed total to be NA when all the columns are NA for an id.
Here is what my sample dataset looks like:
df <- data.frame(id = c(1, 2, 3),
                 i1 = c(1, NA, 0),
                 i2 = c(1, NA, 1),
                 i3 = c(1, NA, 0),
                 total = c(3, 0, 1))
> df
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA 0
3 3 0 1 0 1
For the second id the total should be NA instead of 0, because all the values are NA for that id. How can I change the dataset to the one below?
> df1
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA NA
3 3 0 1 0 1
We could create a condition with if_all inside case_when that returns NA when all the column values in a row are NA, and otherwise take rowSums with na.rm = TRUE:
library(dplyr)
df %>%
  mutate(total = case_when(if_all(i1:i3, is.na) ~ NA_real_,
                           TRUE ~ rowSums(across(i1:i3), na.rm = TRUE)))
Output:
id i1 i2 i3 total
1 1 1 1 1 3
2 2 NA NA NA NA
3 3 0 1 0 1
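A base R sketch of the same idea, in case it helps (my own illustration, not from the answer above): NA when every i-column in a row is NA, otherwise the row sum ignoring NAs.
all_na <- rowSums(!is.na(df[c("i1", "i2", "i3")])) == 0
df$total <- ifelse(all_na, NA, rowSums(df[c("i1", "i2", "i3")], na.rm = TRUE))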
I would like to conditionally subset a dataframe in R, using dplyr::select_if(). More specifically, I have a dataframe that is made up of a grouping variable and numerous other variables that contain a bunch of NAs:
library(dplyr)
data <- tibble(group = sort(rep(letters[1:5], 3)),
               var_1 = c(1, 1, 1, 1, rep(NA, 11)),
               var_2 = c(1, 1, 1, 1, 1, 1, rep(NA, 9)),
               var_3 = 1,
               var_4 = c(1, 1, rep(NA, 10), 1, 1, 1),
               var_5 = c(1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1))
# A tibble: 15 x 6
group var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 1 1 1 1
2 a 1 1 1 1 1
3 a 1 1 1 NA 1
4 b 1 1 1 NA 1
5 b NA 1 1 NA 1
6 b NA 1 1 NA 1
7 c NA NA 1 NA NA
8 c NA NA 1 NA NA
9 c NA NA 1 NA NA
10 d NA NA 1 NA NA
11 d NA NA 1 NA NA
12 d NA NA 1 NA NA
13 e NA NA 1 1 1
14 e NA NA 1 1 1
15 e NA NA 1 1 1
In this dataframe, I need to identify and remove columns like var_4 that only have values in one group (irrespective of whether or not they also show up in the last group, "e"). Importantly, everything else has to remain untouched (i.e. I want to keep variables that look like var_1, var_2, var_3, and var_5). This is what I tried:
library(dplyr)
data %>%
  filter(group != "e") %>%           # Ignore last group.
  select_if(~ function(col)) %>%     # Write function to look for cols that only have values for one group of the total four groups remaining (a-d).
  names() -> cols_to_drop            # Save col names.
data %>% select(-cols_to_drop) -> new_data  # Subset by saved col names.
Unfortunately, I can't figure out how to write that function inside select_if() to specify that grouping variable condition.
A second thing that I have been wondering about is whether I can use select_if() to remove cols based on the percentage of NAs it contains. Is there a way?
I am not sure if select_if would be able to do such grouped selection of columns.
Here is one way to do this by getting the data in long format:
library(dplyr)
cols <- data %>%
  filter(group != "e") %>%
  tidyr::pivot_longer(cols = starts_with('var')) %>%
  group_by(name, group) %>%
  summarise(value = any(!is.na(value))) %>%
  summarise(value = sum(value)) %>%
  filter(value > 1) %>%
  pull(name)
#Select the columns
data %>% select(group, cols)
# group var_1 var_2 var_3 var_5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 1 1 1 1
# 3 a 1 1 1 1
# 4 b 1 1 1 1
# 5 b NA 1 1 1
# 6 b NA 1 1 1
# 7 c NA NA 1 NA
# 8 c NA NA 1 NA
# 9 c NA NA 1 NA
#10 d NA NA 1 NA
#11 d NA NA 1 NA
#12 d NA NA 1 NA
#13 e NA NA 1 1
#14 e NA NA 1 1
#15 e NA NA 1 1
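As for the side question about dropping columns based on the percentage of NAs they contain: yes, a per-column predicate works for that. A minimal sketch with an assumed 50% threshold (note this is a different criterion from the grouped selection above, so it keeps a different set of columns):
data %>% select_if(~ mean(is.na(.)) <= 0.5)
# or, with dplyr >= 1.0.0
data %>% select(where(~ mean(is.na(.x)) <= 0.5))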
data = data.frame(STUDENT = c(1, 2, 3, 4, 5, 6, 7, 8),
                  CAT = c(NA, NA, 1, 2, 3, NA, NA, 0),
                  DOG = c(NA, NA, 2, 3, 2, NA, 1, NA),
                  MOUSE = c(2, 3, NA, NA, NA, NA, NA, NA),
                  WANT = c(2, 3, 2, 2, 3, NA, NA, NA))
I have 'data' and wish to create the 'WANT' variable: it takes the first non-NA value across CAT, DOG and MOUSE that does not equal 1 or 0, and stores it in 'WANT'. The example above shows what I hope to get.
We can use coalesce after changing the values 0 and 1 in the selected columns to NA, then bind the column with the original dataset:
library(dplyr)
data %>%
  transmute(across(CAT:MOUSE, ~ replace(., . %in% 0:1, NA))) %>%
  transmute(WANT2 = coalesce(!!! .)) %>%
  bind_cols(data, .)
# STUDENT CAT DOG MOUSE WANT WANT2
#1 1 NA NA 2 2 2
#2 2 NA NA 3 3 3
#3 3 1 2 NA 2 2
#4 4 2 3 NA 2 2
#5 5 3 2 NA 3 3
#6 6 NA NA NA NA NA
#7 7 NA 1 NA NA NA
#8 8 0 NA NA NA NA
Or use data.table with fcoalesce: convert the 'data.frame' to a 'data.table' (setDT(data)), specify the columns of interest in .SDcols, loop over .SD replacing the values 0 and 1 with NA, then apply fcoalesce and assign (:=) the result to the new column 'WANT2'.
library(data.table)
setDT(data)[, WANT2 := do.call(fcoalesce, lapply(.SD, function(x)
  replace(x, x %in% 0:1, NA))), .SDcols = CAT:MOUSE]
Or with base R, we can use a vectorized option with row/column indexing to extract the first non-NA element after replacing the values 0 and 1 with NA:
m1 <- !is.na(replace(data[2:4], data[2:4] == 1 | data[2:4] == 0, NA))
data$WANT2 <- data[2:4][cbind(seq_len(nrow(m1)), max.col(m1, "first"))]
data$WANT2[data$WANT2 == 0] <- NA
Try this:
data$Want2 <- apply(data[,-c(1,5)],1,function(x) x[min(which(!is.na(x) & x!=0 & x!=1))])
STUDENT CAT DOG MOUSE WANT Want2
1 1 NA NA 2 2 2
2 2 NA NA 3 3 3
3 3 1 2 NA 2 2
4 4 2 3 NA 2 2
5 5 3 2 NA 3 3
6 6 NA NA NA NA NA
7 7 NA 1 NA NA NA
8 8 0 NA NA NA NA
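A side note on the all-NA rows (my own observation, not from the answer): when nothing in a row matches, which() is empty, min() returns Inf with a warning, and indexing with Inf gives NA, which is why those rows come out as NA.
min(which(logical(0)))  # Inf, with a warning
c(2, 3)[Inf]            # NA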
If I have a dataset with three columns like this below
Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F
How could I clean this data such that I see a dataset like the one below?
Id Date Gender
1 03-11-1977 F
2 04-17-2005 M
3 06-04-1999 F
Thanks.
Fill the Gender values within each Id (in both directions, so it does not matter whether the non-NA value appears before or after the Date) and then keep only the rows with a non-NA Date:
library(dplyr)
df %>%
  group_by(Id) %>%
  tidyr::fill(Gender, .direction = "updown") %>%
  filter(!is.na(Date))
# Id Date Gender
# <int> <chr> <chr>
#1 1 03-11-1977 F
#2 2 04-17-2005 M
#3 3 06-04-1999 F
You may use na.omit in a by approach.
dat <- do.call(rbind, by(dat, dat$Id, function(x)
  cbind(x[1, 1, drop = FALSE], lapply(x[-1], na.omit))))
dat
# Id Date Gender
# 1 1 03-11-1977 F
# 2 2 04-17-2005 M
# 3 3 06-04-1999 F
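To see what by() does for a single Id (just an illustration, using the original dat from the Data block below): the inner function keeps the Id and drops the NAs from the remaining columns, which works here because each group has exactly one non-NA Date and one non-NA Gender.
g1 <- dat[dat$Id == 1, ]
cbind(g1[1, 1, drop = FALSE], lapply(g1[-1], na.omit))
#   Id       Date Gender
# 1  1 03-11-1977      F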
Data:
dat <- read.table(header=T,text=' Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F')
I have a large data set and want to replace many NAs, but not all.
In one group I want to replace all NAs with 0.
In the other group I want to replace NAs with 0 only in variables whose names do not include a certain string, e.g. 'b'.
Here is an example:
group <- c(1,1,2,2,2)
abc <- c(1,NA,NA,NA,NA)
bcd <- c(2,1,NA,NA,NA)
cde <- c(5,NA,NA,1,2)
df <- data.frame(group,abc,bcd,cde)
group abc bcd cde
1 1 1 2 5
2 1 NA 1 NA
3 2 NA NA NA
4 2 NA NA 1
5 2 NA NA 2
This is what I want:
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
This is what I tried:
# set 0 in the first group: this works fine
df[is.na(df) & df$group == 1] <- 0
# set 0 in the second group, but only in variables whose name does not include 'b': does not work
df[is.na(df) & df$group == 2 & !grepl('b', colnames(df))] <- 0
dplyr solutions are welcome, as well as base R.
For the second group, create a column index with grepl and use that to subset the data while assigning:
j1 <- !grepl('b',colnames(df))
df[j1][df$group == 2 & is.na(df[j1])] <- 0
df
# group abc bcd cde
#1 1 1 2 5
#2 1 0 1 0
#3 2 NA NA 0
#4 2 NA NA 1
#5 2 NA NA 2
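A quick look at what the pieces select (just an illustration): j1 is TRUE for the columns whose names do not contain 'b', so df[j1] holds only group and cde; the inner logical index then marks, within those columns, the NA cells in rows where group == 2.
colnames(df)
#[1] "group" "abc"   "bcd"   "cde"
j1
#[1]  TRUE FALSE FALSE  TRUE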
Using dplyr::mutate_at you can also do:
library(dplyr)
vars_mutate_1 <- names(df)[-1]
vars_mutate_2 <- grep(x = names(df)[-1], pattern = '^(?!.*b).*$', perl = TRUE, value = TRUE)
# note: funs() is deprecated in recent dplyr versions
df %>%
  mutate_at(.vars = vars_mutate_1, .funs = funs(if_else(group == 1 & is.na(.), 0, .))) %>%
  mutate_at(.vars = vars_mutate_2, .funs = funs(if_else(group == 2 & is.na(.), 0, .)))
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
Alternatively, you can use:
library(dplyr)
df2 <- df %>% mutate_at(vars(names(df)[-1]),
                        function(x) case_when((group == 1 & is.na(x)) ~ 0,
                                              (group == 2 & is.na(x) & !grepl("b", deparse(substitute(x)))) ~ 0,
                                              TRUE ~ x))
> df2
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2