How to substitute NA by 0 in 20 columns? - r

I want to substitute NA by 0 in 20 columns. I found this approach for 2 columns, however I guess it's not optimal if the number of columns is 20. Is there any alternative and more compact solution?
mydata[,c("a", "c")] <-
apply(mydata[,c("a","c")], 2, function(x){replace(x, is.na(x), 0)})
UPDATE:
For simplicity, let's take this data with 8 columns and substitute NAs in columns b, c, e, f and d
a b c d e f g d
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5
The result must be this one:
a b c d e f g d
1 0 0 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 0 t 5 5

The replace_na function from tidyr can be applied over a vector as well as a dataframe (http://tidyr.tidyverse.org/reference/replace_na.html).
Use it with a mutate_at variation from dplyr to apply it to multiple columns at the same time:
my_data %>% mutate_at(vars(b,c,e,f), replace_na, 0)
or
my_data %>% mutate_at(c('b','c','e','f'), replace_na, 0)
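In dplyr 1.0.0 and later, mutate_at() is superseded by across(). Below is a minimal sketch of the same idea. The toy data frame and the cols vector are made up for illustration, and base replace() is used so the snippet does not depend on how replace_na() treats column types:

```r
library(dplyr)

# Hypothetical toy data: NAs in b, c and d, but only b and c should be filled.
my_data <- data.frame(
  a = 1:3,
  b = c(NA, 2, 3),
  c = c(NA, 3, 4),
  d = c(2, NA, 4)
)

cols <- c("b", "c")  # columns in which NA becomes 0

# all_of() selects columns by the names stored in a character vector.
result <- my_data %>%
  mutate(across(all_of(cols), ~ replace(.x, is.na(.x), 0)))

print(result)
```

With 20 columns, only the cols vector has to grow; the mutate() call stays the same.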

Here is a tidyverse way to replace NA with different values based on the data type of the column.
library(tidyverse)
dataset %>% mutate_if(is.numeric, replace_na, 0) %>%
mutate_if(is.character, replace_na, "")

Another option:
library(tidyr)
v <- c('b', 'c', 'e', 'f')
replace_na(df, as.list(setNames(rep(0, length(v)), v)))
Which gives:
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
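To see what replace_na() receives in the call above, it can help to build the replacement list on its own. This sketch uses base R only and the same v vector. One caveat, as far as I know: newer tidyr releases cast the replacement to the column's type, so for character columns such as b and f a replacement of "0" may be required instead of 0. That is worth checking against the installed version.

```r
# Columns whose NAs should be replaced.
v <- c("b", "c", "e", "f")

# rep() creates one 0 per column, setNames() attaches the column names,
# and as.list() turns the named vector into the named list replace_na() expects.
replacement <- as.list(setNames(rep(0, length(v)), v))

str(replacement)
```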

We can use NAer from qdap to convert the NA to 0. If there are multiple columns, loop over them using lapply.
library(qdap)
nm1 <- c('b', 'c', 'e', 'f')
mydata[nm1] <- lapply(mydata[nm1], NAer)
mydata
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
Or using dplyr
library(dplyr)
mydata %>%
mutate_each_(funs(replace(., which(is.na(.)), 0)), nm1)
# a b c d e f g d.1
#1 1 0 0 2 3 4 7 6
#2 2 g 3 NA 4 5 4 Y
#3 3 r 4 4 0 t 5 5
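The lapply loop does not depend on qdap. As a sketch, the same pattern works with base replace() in place of NAer. The small data frame here is a cut-down, hypothetical version of the example data:

```r
# Hypothetical cut-down version of the example data.
mydata <- data.frame(
  a = 1:3,
  b = c(NA, "g", "r"),
  c = c(NA, 3, 4),
  d = c(2, NA, 4),
  stringsAsFactors = FALSE
)

nm1 <- c("b", "c")  # columns to fix; d is deliberately left alone

# Replace NA with 0 column by column. In the character column b the
# replacement is coerced to the string "0".
mydata[nm1] <- lapply(mydata[nm1], function(x) replace(x, is.na(x), 0))

print(mydata)
```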

Another strategy using tidyr::replace_na()
library(tidyverse)
df <- read.table(header = T, text = 'a b c d e f g h
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5')
df %>%
mutate(across(everything(), ~replace_na(., 0)))
#> a b c d e f g h
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 0 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
Created on 2021-08-22 by the reprex package (v2.0.0)

Knowing that replace_na() accepts a named list for the replace argument, using purrr::map() is a good option here to reduce the amount of code. It is also possible to replace different values in each column using 'map2()'.
code:
library(data.table)
library(tidyverse)
tbl <-read_table("a b c d e f g d
1 NA NA 2 3 4 7 6
2 g 3 NA 4 5 4 Y
3 r 4 4 NA t 5 5")
#> Warning: Duplicated column names deduplicated: 'd' => 'd_1' [8]
nms <- c('b', 'c', 'e', 'f', 'g')
imap_dfc(tbl, ~ if(any(.y == nms)) replace_na(.x, 0) else .x)
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 NA 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
#using data.table
tblDT <- as.data.table(tbl)
#Further explanation here: https://stackoverflow.com/questions/16846380
tblDT[, (nms) := map(.SD, ~replace_na(., 0)), .SDcols = nms]
tblDT %>%
as_tibble()
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 NA 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
#to replace na's in every column:
tbl %>%
replace_na(map(., ~0))
#> # A tibble: 3 × 8
#> a b c d e f g d_1
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 1 0 0 2 3 4 7 6
#> 2 2 g 3 0 4 5 4 Y
#> 3 3 r 4 4 0 t 5 5
Created on 2021-09-25 by the reprex package (v2.0.1)
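The answer mentions replacing a different value in each column with map2() but does not show it. As a rough illustration, here is a base R analogue that uses Map(), which walks over the columns and the replacement values in parallel. The data frame and the fills list are invented for this sketch:

```r
# Hypothetical data with NAs in numeric and character columns.
tbl <- data.frame(
  a = 1:3,
  c = c(NA, 3, 4),
  e = c(3, 4, NA),
  f = c("4", "5", NA),
  stringsAsFactors = FALSE
)

# One replacement value per column, matched by name.
fills <- list(c = 0, e = -1, f = "missing")

# Map() pairs each selected column with its replacement value.
tbl[names(fills)] <- Map(
  function(col, fill) replace(col, is.na(col), fill),
  tbl[names(fills)],
  fills
)

print(tbl)
```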

Related

Remove groups if all NA

Let's say I have a table like so:
df <- data.frame("Group" = c("A","A","A","B","B","B","C","C","C"),
"Num" = c(1,2,3,1,2,NA,NA,NA,NA))
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
7 C NA
8 C NA
9 C NA
In this case, because group C has Num as NA for all entries, I would like to remove rows in group C from the table. Any help is appreciated!
You could group_by on your Group column and filter out the groups in which all values are NA. You can use the following code:
library(dplyr)
df %>%
group_by(Group) %>%
filter(!all(is.na(Num)))
#> # A tibble: 6 × 2
#> # Groups: Group [2]
#> Group Num
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B NA
Created on 2023-01-18 with reprex v2.0.2
In base R you could index based on all the groups that have at least one non-NA value:
idx <- df$Group %in% unique(df[!is.na(df$Num),"Group"])
idx
df[idx,]
# or in one line
df[df$Group %in% unique(df[!is.na(df$Num),"Group"]),]
output
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
Using ave.
df[with(df, !ave(Num, Group, FUN=\(x) all(is.na(x)))), ]
# Group Num
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B NA
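To make the ave() step easier to follow, this sketch splits it in two: first compute, for every row, whether its whole group is NA, then use that vector to drop rows. It uses the same df as the question; the intermediate name all_na is introduced only for illustration.

```r
df <- data.frame(
  Group = rep(c("A", "B", "C"), each = 3),
  Num = c(1, 2, 3, 1, 2, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each group and returns one value per row,
# so all_na is TRUE for every row of a group that is entirely NA.
all_na <- ave(is.na(df$Num), df$Group, FUN = all)

kept <- df[!all_na, ]
print(kept)
```

Group B survives because only one of its three values is NA.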

R: How do I add a column with values to a table, for every row in said table?

For example, if I have two lists:
x <- data.frame(c('a', 'b', 'c'))
y <- data.frame(c('1', '2', '3'))
I want my output to look like:
x y
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
Sadly, I have no idea what such an operation is called, or where to start. Could anyone help me with a solution? Thanks!
Here are a few options:
library(tidyverse)
x <- data.frame(x = c('a', 'b', 'c'))
y <- data.frame(y = c('1', '2', '3'))
#option 1
expand.grid(x = x$x, y = y$y) |>
arrange(x)
#> x y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
#option 2
map_dfr(x$x, ~tibble(x = .x, y = y$y))
#> # A tibble: 9 x 2
#> x y
#> <chr> <chr>
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
#option 3
full_join(x, y, by = character())
#> x y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
Using rep to repeat elements individually, then put them in a data frame.
data.frame(x = rep(x[, 1], each=nrow(y)), y = y[, 1])
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
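The operation in question is usually called a Cartesian product or cross join. As one more sketch, base merge() produces it when by = NULL. The explicit sort is there because merge() varies x fastest in this case, which is not the row order the question asks for:

```r
x <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)
y <- data.frame(y = c("1", "2", "3"), stringsAsFactors = FALSE)

# by = NULL means there are no key columns, so every row of x
# is paired with every row of y.
pairs <- merge(x, y, by = NULL)

# Sort so that all rows for "a" come first, then "b", then "c".
pairs <- pairs[order(pairs$x, pairs$y), ]
rownames(pairs) <- NULL

print(pairs)
```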

How to conditionally update a R tibble using multiple conditions of another tibble

I have two tables. I would like to update the first table from the second one, based on multiple conditions. In base R I would use if...else type constructs to do this but would like to know how to achieve this using dplyr.
The table to be updated (have a field added) looks like this:
> Intvs
# A tibble: 12 x 3
Group From To
<chr> <dbl> <dbl>
1 A 0 1
2 A 1 2
3 A 2 3
4 A 3 4
5 A 4 5
6 A 5 6
7 B 0 1
8 B 1 2
9 B 2 3
10 B 3 4
11 B 4 5
12 B 5 6
The tibble that I would like to use to make the update looks like this:
>Zns
# A tibble: 2 x 4
Group From To Zone
<chr> <chr> <dbl> <dbl>
1 A X 1 5
2 B Y 3 4
I would like to update the Intvs tibble with the Zns tibble using the fields == Group, >= From, and <= To to control the update. The expected output should look like this
> Intvs
# A tibble: 12 x 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
What is the most efficient way to do this using dplyr?
The code below should make the dummy tables Intvs and Zns
# load packages
require(tidyverse)
# Intervals table
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
names(Zns) <- c("Group", "From", "To", "Zone")
Using non-equi join from data.table
library(data.table)
setDT(Intvs)[Zns, Zone := Zone, on = .(Group, From >= From, To <= To)]
-output
> Intvs
Group From To Zone
<char> <num> <num> <char>
1: A 0 1 <NA>
2: A 1 2 X
3: A 2 3 X
4: A 3 4 X
5: A 4 5 X
6: A 5 6 <NA>
7: B 0 1 <NA>
8: B 1 2 <NA>
9: B 2 3 <NA>
10: B 3 4 Y
11: B 4 5 <NA>
12: B 5 6 <NA>
This is the closest I get. It is not giving the expected output:
library(dplyr)
left_join(Intvs, Zns, by="Group") %>%
group_by(Group) %>%
mutate(Zone1 = case_when(From.x <= Zone & From.x >= To.y ~ From.y)) %>%
select(Group, From=From.x, To=To.x, Zone = Zone1)
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 X
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 Y
12 B 5 6 NA
Not sure why your first row does not give NA, since 0 - 1 is not in the range of 1 - 5.
First left_join the two dataframes using the Group column. Here I assign the suffix "_Zns" to columns that come from the Zns dataframe. Then use a single case_when (or ifelse) statement to assign NA to rows that do not fit the range. Finally, drop the columns that end with Zns.
library(dplyr)
left_join(Intvs, Zns, by = "Group", suffix = c("", "_Zns")) %>%
mutate(Zone = case_when(From >= From_Zns & To <= To_Zns ~ Zone,
TRUE ~ NA_character_)) %>%
select(-ends_with("Zns"))
# A tibble: 12 × 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
Data
Note that I have changed your column name order in the Zns dataframe.
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
colnames(Zns) <- c("Group", "Zone", "From", "To")
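Since the question mentions base R as the starting point, here is a sketch of the same join-then-test logic without dplyr. It assumes the corrected Zns column order (Group, Zone, From, To) used in the data above. The names ZoneFrom, ZoneTo and inside are introduced only for this illustration.

```r
Intvs <- data.frame(
  Group = rep(c("A", "B"), each = 6),
  From = rep(0:5, 2),
  To = rep(1:6, 2),
  stringsAsFactors = FALSE
)
Zns <- data.frame(
  Group = c("A", "B"),
  Zone = c("X", "Y"),
  From = c(1, 3),
  To = c(5, 4),
  stringsAsFactors = FALSE
)

# Rename the zone bounds so they do not clash with the interval columns.
zones <- setNames(Zns, c("Group", "Zone", "ZoneFrom", "ZoneTo"))

# Attach each group's zone bounds to every interval of that group.
m <- merge(Intvs, zones, by = "Group")

# Keep the zone label only where the interval lies inside the zone.
inside <- m$From >= m$ZoneFrom & m$To <= m$ZoneTo
m$Zone <- ifelse(inside, m$Zone, NA_character_)

result <- m[order(m$Group, m$From), c("Group", "From", "To", "Zone")]
rownames(result) <- NULL
print(result)
```

This only handles one zone per group. With several zones per group the merge would produce several rows per interval, and those would need to be collapsed afterwards.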

Keeping all NAs in dplyr distinct function

I have a data.frame (the eBird basic dataset) in which several observers may upload a record of the same sighting to a database. In that case the event is given a "group identifier"; records that are not from a group session have NA instead. I'm trying to filter out the duplicates from group events while keeping all NAs, and I'd like to do this without splitting the dataframe in two:
library(dplyr)
set.seed(1)
df <- tibble(
x = sample(c(1:6, NA), 30, replace = T),
y = sample(c(letters[1:4]), 30, replace = T)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want rows with NA in x to be grouped together, as happened here with the "NA b" and "NA c" combinations. The distinct function has no option for leaving NAs out of the comparison. Is splitting the dataframe the only solution?
With distinct an option is to create a new column based on the NA elements in 'x'
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
Or we can use duplicated with an OR (|) condition to return all NA elements in 'x' with filter
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
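The same condition can be written in base R, which may be useful because newer dplyr releases deprecate cur_data() in favor of pick(). This is a small sketch on made-up data, not the eBird dataset:

```r
# Hypothetical data: rows 1-2 are a true duplicate, rows 3-4 both have NA in x.
df <- data.frame(
  x = c(1, 1, NA, NA, 2),
  y = c("a", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep a row if x is NA, or if it is the first occurrence of its (x, y) pair.
kept <- df[is.na(df$x) | !duplicated(df), ]

print(kept)
```

The duplicated "1 a" row is dropped, while both "NA b" rows are kept.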

Using pivot_wider instead of spread for R dataframe [duplicate]

My understanding was that, to an extent, pivot_wider(data, names_from, values_from) is roughly equivalent to spread(data, key, value). I'm not seeing that here.
library(tidyverse)
df = data.frame(
A = sample(letters[1:3], 20, replace=T),
B = sample(LETTERS[1:3], 20, replace=T)
)
df
#> A B
#> 1 c B
#> 2 b C
#> 3 b B
#> 4 b C
#> 5 c C
#> 6 b B
#> 7 c B
#> 8 b A
#> 9 c A
#> 10 b B
#> 11 c B
#> 12 b B
#> 13 b A
#> 14 c C
#> 15 c C
#> 16 a B
#> 17 c C
#> 18 b C
#> 19 b A
#> 20 a C
df %>% count(A,B)
#> A B n
#> 1 a B 1
#> 2 a C 1
#> 3 b A 3
#> 4 b B 4
#> 5 b C 3
#> 6 c A 1
#> 7 c B 3
#> 8 c C 4
df %>% count(A,B) %>% spread(key=B, value=n)
#> A B C
#> 1 NA 1 1
#> 2 3 4 3
#> 3 1 3 4
df %>% count(A,B) %>% pivot_wider(names_from=B, values_from=n)
#> Error: Failed to create output due to bad names.
#> * Choose another strategy with `names_repair`
Created on 2020-10-22 by the reprex package (v0.3.0)
pivot_wider is equivalent to spread; however, it is also stricter. You need to be more explicit when doing transformations.
library(dplyr)
library(tidyr)
df %>% count(A,B)
# A B n
#1 a A 3
#2 a B 1
#3 a C 1
#4 b A 2
#5 b B 4
#6 b C 2
#7 c A 2
#8 c B 4
#9 c C 1
Notice how the A column above is silently overwritten when you use spread.
df %>% count(A,B) %>% spread(key=B, value=n)
# A B C
#1 3 1 1
#2 2 4 2
#3 2 4 1
pivot_wider doesn't allow that. It wants you to state explicitly what you want to do. You already have an A column, and the B column you pass to names_from contains the value 'A', so the result would have another A column. Tibbles don't allow duplicate column names, hence the error.
An option would be to rename the original A column to something else.
df %>%
count(A,B) %>%
rename(A1 = A) %>%
pivot_wider(names_from=B, values_from=n)
# A1 A B C
# <chr> <int> <int> <int>
#1 a 3 1 1
#2 b 2 4 2
#3 c 2 4 1
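If only the cross-tabulated counts are needed, base table() is another way around the name clash: the values of A end up as row names rather than in a column called A. This is a sketch on a small invented data frame. Note that combinations that never occur show up as 0 here, not NA.

```r
# Hypothetical data with the same column names as in the question.
df <- data.frame(
  A = c("a", "a", "b", "b", "b", "c"),
  B = c("A", "B", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# table() counts every (A, B) combination; as.data.frame.matrix() keeps the
# wide layout instead of converting back to long format.
counts <- as.data.frame.matrix(table(df$A, df$B))

print(counts)
```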
