Check whether all elements of a row are NA in specific columns - r

I want my_var to be 0 if my_var_a to my_var_c are all NA
# A tibble: 4 x 5
  my_var my_var_a my_var_b my_var_c my_var_others
   <int>    <int>    <int>    <int>         <int>
1      0       NA       NA       NA            NA
2      1       NA        1       NA            NA
3      0       NA       NA       NA            NA
4     NA       NA       NA       NA            NA
I get my desired result using:
library(tidyverse)
df %>% mutate(my_var = if_else(apply(select(., my_var_a:my_var_c), 1, function(x) all(is.na(x))), 0L, my_var))
Which results in:
  my_var my_var_a my_var_b my_var_c my_var_others
   <int>    <int>    <int>    <int>         <int>
1      0       NA       NA       NA            NA
2      1       NA        1       NA            NA
3      0       NA       NA       NA            NA
4      0       NA       NA       NA            NA
Is there a less complicated way of doing that, or at least a way using purrr? I looked into pmap but couldn't figure out how it would replace apply.
This is the data frame:
structure(list(my_var = c(0L, 1L, 0L, NA), my_var_a = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), my_var_b = c(NA, 1L,
NA, NA), my_var_c = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_), my_var_others = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))

We can use pmap_int from purrr to iterate over multiple columns row-wise.
library(dplyr)
library(purrr)
df %>% mutate(my_var = pmap_int(select(., my_var_a:my_var_c), ~any(!is.na(c(...)))))
# my_var my_var_a my_var_b my_var_c my_var_others
# <int> <int> <int> <int> <int>
#1 0 NA NA NA NA
#2 1 NA 1 NA NA
#3 0 NA NA NA NA
#4 0 NA NA NA NA
In base R, we can use rowSums and assign 1 to rows where there is at least one non-NA value.
cols <- paste0("my_var_",letters[1:3])
df$my_var <- +(rowSums(is.na(df[cols])) < length(cols))
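If your dplyr version is 1.0.4 or newer, if_all() expresses the same row-wise check without apply() or pmap() (a sketch under that version assumption, not part of the answers above):
library(dplyr)
# Set my_var to 0 where my_var_a:my_var_c are all NA, otherwise keep it
df %>%
  mutate(my_var = if_else(if_all(my_var_a:my_var_c, is.na), 0L, my_var))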

Checking for all(is.na(x)) yields TRUE where you want 0, so negate it with !. The ^1 converts the logical to numeric. Fairly uncomplicated in base R (dat here is your df):
dat <- transform(dat, my_var=apply(dat[-1], 1, function(x) !all(is.na(x)))^1)
dat
# my_var my_var_a my_var_b my_var_c my_var_others
# 1 0 NA NA NA NA
# 2 1 NA 1 NA NA
# 3 0 NA NA NA NA
# 4 0 NA NA NA NA

How to take a maximum of several columns in R/dplyr [duplicate]

I have data which looks basically like this:
id <- c(1:5)
VolumeA <- c(12, NA, NA, NA, NA)
VolumeB <- c(NA, 34, NA, NA, NA)
VolumeC <- c(NA, NA, 56, NA, NA)
VolumeD <- c(NA, NA, NA, 78, NA)
VolumeE <- c(NA, NA, NA, NA, 90)
df_now <- tibble(id, VolumeA, VolumeB, VolumeC, VolumeD, VolumeE)
df_now
# A tibble: 5 x 6
id VolumeA VolumeB VolumeC VolumeD VolumeE
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 12 NA NA NA NA
2 2 NA 34 NA NA NA
3 3 NA NA 56 NA NA
4 4 NA NA NA 78 NA
5 5 NA NA NA NA 90
In the IRL dataset, there are MANY more Volume[label] columns, but in each row I only need one of them: the largest one. So I want to create a new variable which has the largest value:
Volume <- c(12, 34, 56, 78, 90)
df_desired <- cbind(df_now, Volume)
df_desired
id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
1 1 12 NA NA NA NA 12
2 2 NA 34 NA NA NA 34
3 3 NA NA 56 NA NA 56
4 4 NA NA NA 78 NA 78
5 5 NA NA NA NA 90 90
After looking at the dplyr documentation, I tried this...
library(tidyverse)
df_try <- df_now %>%
mutate(Volume = across(contains("Volume"), max, na.rm = TRUE))
...but got back a tibble of data, not a single column. Can someone tell me how to do this properly?
(Please assume, due to issues with my IRL data too complicated to explain here, that I cannot just gather and spread my data. I want to use a conditional mutate.)
Since you have "MANY more Volume[label] columns", any solution that works over each row (rowwise) or individually on each column (with reduce or Reduce) is going to be much slower than necessary.
df_now %>%
mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
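Here do.call simply builds a single call with every selected column passed as a separate argument, roughly equivalent to writing the columns out by hand (a sketch for this small example):
df_now %>%
  mutate(Volume = pmax(VolumeA, VolumeB, VolumeC, VolumeD, VolumeE, na.rm = TRUE))
so pmax is invoked exactly once over all columns.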
Proof of relative improvement:
Using Reduce or purrr::reduce, or anything that iterates per column (with nc columns, it will iterate nc-1 times):
mypmax <- function(...) { message("mypmax"); pmax(...); }
df_now %>%
mutate(Volume = reduce(select(., starts_with('Volume')), mypmax, na.rm = TRUE))
# mypmax
# mypmax
# mypmax
# mypmax
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
Anything rowwise does this once per row, which is likely even worse (assuming more rows than columns in your data):
mymax <- function(...) { message("mymax"); max(...); }
df_now %>%
rowwise %>%
mutate(Volume = mymax(c_across(starts_with('Volume')), na.rm = TRUE))
# mymax
# mymax
# mymax
# mymax
# mymax
# # A tibble: 5 x 7
# # Rowwise:
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
Do it once across all columns, all rows:
mypmax <- function(...) { message("mypmax"); pmax(...); }
df_now %>%
mutate(Volume = do.call(mypmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
# mypmax
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
The benchmarking is minor at this scale, but will be more dramatic with larger data:
microbenchmark::microbenchmark(
red = df_now %>% mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE)),
row = df_now %>% rowwise %>% mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE)),
sgl = df_now %>% mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# red 4.9736 5.36240 7.240561 5.68010 6.19915 70.7482 100
# row 4.5813 5.02020 6.082047 5.34460 5.70345 63.1166 100
# sgl 3.8270 4.18605 5.803043 4.43215 4.76030 65.7217 100
We can use pmax (the pmax solution was first posted here). Note that the relative improvement with do.call is very small.
library(dplyr)
library(purrr)
df_now %>%
mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE))
# A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 12 NA NA NA NA 12
#2 2 NA 34 NA NA NA 34
#3 3 NA NA 56 NA NA 56
#4 4 NA NA NA 78 NA 78
#5 5 NA NA NA NA 90 90
Or with c_across and max (using only tidyverse approaches)
df_now %>%
rowwise %>%
mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE))
# A tibble: 5 x 7
# Rowwise:
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 12 NA NA NA NA 12
#2 2 NA 34 NA NA NA 34
#3 3 NA NA 56 NA NA 56
#4 4 NA NA NA 78 NA 78
#5 5 NA NA NA NA 90 90
Benchmarks
system.time({df_now %>% mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE))})
# user system elapsed
# 0.023 0.006 0.029
system.time({df_now %>% rowwise %>% mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE))})
# user system elapsed
# 0.012 0.002 0.015
system.time({df_now %>% mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))})
# user system elapsed
# 0.011 0.001 0.011
NOTE: Not that much difference in timings
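Whether the gap matters depends on scale; to check on data closer to a real size, the same benchmark can be rerun on a larger made-up dataset (a sketch; the sizes and column names are arbitrary):
library(dplyr)
library(purrr)
set.seed(1)
# 100,000 rows x 50 Volume columns of integers with some NAs
big <- as_tibble(matrix(sample(c(NA, 1:100), 1e5 * 50, replace = TRUE),
                        ncol = 50, dimnames = list(NULL, paste0("Volume", 1:50))))
microbenchmark::microbenchmark(
  red = big %>% mutate(Volume = reduce(select(., starts_with("Volume")), pmax, na.rm = TRUE)),
  sgl = big %>% mutate(Volume = do.call(pmax, c(select(., starts_with("Volume")), na.rm = TRUE))),
  times = 10
)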

Mutate with nested ifelse

I get the wrong result; what am I doing wrong?
df <- data.frame(x=c(1,1,NA),y=c(1,NA,NA),z=c(NA,NA,NA))
df <-mutate(df,result=ifelse(is.na(x),NA,ifelse(any(!is.na(y),!is.na(z)),1,0)))
I get this (but df[2, 4] should be 0):
x y z result
1 1 1 NA 1
2 1 NA NA 1
3 NA NA NA NA
Instead of this:
df_wanted <- data.frame(x=c(1,1,NA),y=c(1,NA,NA),z=c(NA,NA,NA), result=c(1,0,NA))
x y z result
1 1 1 NA 1
2 1 NA NA 0
3 NA NA NA NA
We can use | instead of any because any returns a single TRUE/FALSE as output
with(df, any(!is.na(y), !is.na(z)))
#[1] TRUE
and that single value gets recycled for the entire column; because the outer ifelse on 'x' already returns NA for the third row, all the other rows become 1.
Instead, we need the check to happen for each row, which | does:
library(dplyr)
df %>%
mutate(result = ifelse(is.na(x), NA, ifelse(!is.na(y)|!is.na(z), 1, 0)))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
Or another option is case_when
df %>%
mutate(result = case_when(is.na(x) ~ NA_integer_,
!is.na(y)| !is.na(z) ~ 1L,
TRUE ~ 0L))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
Or with coalesce
df %>%
mutate(result = x * +coalesce(!is.na(y)|!is.na(z)))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
You can use case_when and mention each condition explicitly.
library(dplyr)
df %>%
mutate(result = case_when(is.na(x) ~ NA_integer_,
!(is.na(y) & is.na(z)) ~ 1L,
TRUE ~ 0L))
# x y z result
#1 1 1 NA 1
#2 1 NA NA 0
#3 NA NA NA NA
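The same logic also works outside dplyr if that is ever needed (a base R sketch of the same idea, not from the answers above):
# NA when x is NA, 1 when y or z is non-NA, 0 otherwise
df$result <- ifelse(is.na(df$x), NA, as.integer(!is.na(df$y) | !is.na(df$z)))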

Forward fill rows in an R data.table

I have a large data.table in the format below
Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1
I would like to forward fill rows of the data.table to achieve the results below. I know how to forward fill columns.
Name Value 1 2 3 4 5
A 58 1 1 1 1 1
B 47 NA 1 1 1 1
C 89 NA NA 1 1 1
D 68 NA NA NA 1 1
E 75 NA NA NA NA 1
Help!
data.table has its own nafill function.
library(data.table) #v>=1.12.8
library(magrittr)
melt(dt, id = 1:2) %>%
.[, value := nafill(value, "locf"), by = Name] %>%
dcast(., ... ~ variable)
# Name Value 1 2 3 4 5
# 1: A 58 1 1 1 1 1
# 2: B 47 NA 1 1 1 1
# 3: C 89 NA NA 1 1 1
# 4: D 68 NA NA NA 1 1
# 5: E 75 NA NA NA NA 1
Data
dt <- fread("Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1")
Use fill() from tidyr to fill in missing values with the previous value.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(3:7) %>%
group_by(Name) %>%
fill(value) %>%
ungroup() %>%
pivot_wider()
# # A tibble: 5 x 7
# Name Value `1` `2` `3` `4` `5`
# <fct> <int> <int> <int> <int> <int> <int>
# 1 A 58 1 1 1 1 1
# 2 B 47 NA 1 1 1 1
# 3 C 89 NA NA 1 1 1
# 4 D 68 NA NA NA 1 1
# 5 E 75 NA NA NA NA 1
Note: The output above is the same as
df %>% fill(3:7, .direction = "up")
but the logic is different: the former fills rows forward, while the latter fills columns backward. They will differ in other cases, as the sketch below shows.
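For instance, on a made-up two-row example (hypothetical data, just to illustrate the difference):
library(dplyr)
library(tidyr)
df2 <- tibble(Name = c("A", "B"), `1` = c(NA, 1L), `2` = c(5L, NA))
# Filling rows forward: A stays NA 5, B becomes 1 1
df2 %>%
  pivot_longer(-Name) %>%
  group_by(Name) %>%
  fill(value) %>%
  ungroup() %>%
  pivot_wider()
# Filling columns backward: A becomes 1 5, B stays 1 NA
df2 %>% fill(2:3, .direction = "up")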
Data
df <- structure(list(Name = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Value = c(58L, 47L, 89L, 68L, 75L
), `1` = c(1L, NA, NA, NA, NA), `2` = c(NA, 1L, NA, NA, NA),
`3` = c(NA, NA, 1L, NA, NA), `4` = c(NA, NA, NA, 1L, NA),
`5` = c(NA, NA, NA, NA, 1L)), class = "data.frame", row.names = c(NA, -5L))

How can I convert certain rows to columns in R?

I have a dataframe which looks like this:
`Row Labels` Female Male
<chr> <chr> <chr>
1 London <NA> <NA>
2 42 <NA> 1
3 Paris <NA> <NA>
4 36 1 <NA>
5 Belgium <NA> <NA>
6 18 1
7 21 <NA> 1
8 Madrid <NA> <NA>
9 20 1 <NA>
10 Berlin <NA> <NA>
11 37 <NA> 1
12 23 1
13 25 1
14 44 1
The code I used to produce this dataframe looks like this:
structure(list(`Row Labels` = c("London", "42", "Paris","36", "Belgium","18" ,"21", "Madrid", "20", "Berlin", "37","23","25","44"),
Female = c(NA, NA, NA, "1", NA, NA,NA, NA, "1", NA, NA,"1","1","1"), Male = c(NA,"1", NA, NA, NA, "1", NA, NA, NA, "1",NA,NA,NA,NA)),
.Names = c("Row Labels","Female", "Male"), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
I would like to know how I can change multiple rows in this dataframe to become columns.
My ideal output looks like this:
'Row Labels' Female Male 42 36 21 20 37 18 23 25 44
London 1 1
Paris 1 1
Belgium 1 1 1 1
Madrid 1 1
Berlin 3 1 1 1 1 1
Seems very mechanical (note this assumes the city and age rows strictly alternate). Calling your data d:
d1 = d[seq(1, nrow(d), by = 2), ]
d2 = d[seq(2, nrow(d), by = 2), ]
d1[, c("Male", "Female")] = d2[, c("Male", "Female")]
d3 = matrix(nrow = nrow(d2), ncol = nrow(d2))
diag(d3) = 1
colnames(d3) = d2$`Row Labels`
cbind(d2, d3)
# Row Labels Female Male 42 36 21 20 37
# 1 42 <NA> 1 1 NA NA NA NA
# 2 36 1 <NA> NA 1 NA NA NA
# 3 21 <NA> 1 NA NA 1 NA NA
# 4 20 1 <NA> NA NA NA 1 NA
# 5 37 <NA> 1 NA NA NA NA 1
Using tidyverse.
library(dplyr)
library(tidyr)
#cumsum based on country names
df %>% group_by(gr=cumsum(grepl('\\D+',`Row Labels`))) %>%
#Sum Female and Male
mutate_at(vars('Female','Male'), list(~sum(as.numeric(.), na.rm = T))) %>%
#Create RL from country name and number where we are at numbers
mutate(RL=ifelse(row_number()>1, paste0(first(`Row Labels`),',',`Row Labels`), NA)) %>%
filter(!is.na(RL)) %>%
select(RL, gr, Male, Female) %>%
separate(RL, into = c('RL','Age')) %>% mutate(flag=1) %>% spread(Age, flag) %>%
ungroup() %>% select(-gr)
# A tibble: 5 x 12
RL Male Female `18` `20` `21` `23` `25` `36` `37` `42` `44`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Belgium 1 0 1 NA 1 NA NA NA NA NA NA
2 Berlin 1 3 NA NA NA 1 1 NA 1 NA 1
3 London 1 0 NA NA NA NA NA NA NA 1 NA
4 Madrid 0 1 NA 1 NA NA NA NA NA NA NA
5 Paris 0 1 NA NA NA NA NA 1 NA NA NA

R Add missing columns AND rows of data (Dplyr/TidyR & Complete?)

I'm fairly used to adding in missing cases for data but this use case escapes me.
I have a number of dataframes (which differ slightly), an example would be:
> t1
3 4 5
2 1 0 0
3 0 2 2
4 2 6 4
5 1 2 1
structure(list(`3` = c(1L, 0L, 2L, 1L), `4` = c(0L, 2L, 6L, 2L
), `5` = c(0L, 2L, 4L, 1L)), .Names = c("3", "4", "5"), row.names = c("2",
"3", "4", "5"), class = "data.frame")
Row names & column names should be 1:5 and, obviously, where these were missing, the cell values should be set to NA. For the example above this would give:
> t1
1 2 3 4 5
1 NA NA NA NA NA
2 NA NA 1 0 0
3 NA NA 0 2 2
4 NA NA 2 6 4
5 NA NA 1 2 1
In each case ANY one or more rows AND/OR columns might be missing.
I can readily get the missing columns using the method described by Josh O'Brien here but am missing the row method.
Can anyone help?
We can do this in a much easier way with base R by creating a matrix of NAs of the required dimensions and then assign the values of 't1' based on the row names and column names of 't1'
m1 <- matrix(NA, ncol=5, nrow=5, dimnames = list(1:5, 1:5))
m1[row.names(t1), colnames(t1)] <- unlist(t1)
m1
# 1 2 3 4 5
#1 NA NA NA NA NA
#2 NA NA 1 0 0
#3 NA NA 0 2 2
#4 NA NA 2 6 4
#5 NA NA 1 2 1
Or using tidyverse
library(tidyverse)
rownames_to_column(t1, "rn") %>%
gather(Var, Val, -rn) %>%
mutate_at(vars(rn, Var), as.integer) %>%
complete(rn = seq_len(max(rn)), Var = seq_len(max(Var))) %>%
spread(Var, Val)
# A tibble: 5 × 6
# rn `1` `2` `3` `4` `5`
#* <int> <int> <int> <int> <int> <int>
#1 1 NA NA NA NA NA
#2 2 NA NA 1 0 0
#3 3 NA NA 0 2 2
#4 4 NA NA 2 6 4
#5 5 NA NA 1 2 1
Based on the solution you mentioned by Josh O'Brien, you can do the same but use rownames instead of names. Take a look at the code below.
df <- data.frame(a=1:4, e=4:1)
colnms <- c("a", "b", "d", "e")
rownms <- c("1", "2", "3", "4", "5")
rownames(df) <- c("1", "3", "4", "5")
## find missing columns and replace with zero, and order them
Missing <- setdiff(colnms, names(df))
df[Missing] <- 0
df <- df[colnms]
df
## do the same for rows
MissingR <- setdiff(rownms, rownames(df))
df[MissingR,] <- 0
df <- df[rownms,]
df
# > df
# a b d e
#1 1 0 0 4
#2 0 0 0 0
#3 2 0 0 3
#4 3 0 0 2
#5 4 0 0 1
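If you want the added cells to be NA (as in the desired output above) rather than 0, the same pattern works with NA in place of 0 (a sketch, starting again from the original df, colnms and rownms defined above):
MissingC <- setdiff(colnms, names(df))
df[MissingC] <- NA
df <- df[colnms]
MissingR <- setdiff(rownms, rownames(df))
df[MissingR, ] <- NA
df <- df[rownms, ]
# columns b, d and row 2 are now NA instead of 0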
