Adding a new column replacing NA values in R

I am analyzing some variables for specific locations. These variables have NA values that should be replaced by values from adjacent columns. I found a way to do it, but it doesn't scale well once there are more columns. Can you help?
Dataset:
location <- rep(c("A", "B", "C"), times = 2)
v1 <- c(11,92,NA,NA,NA,NA)
v2 <- c(NA,NA,NA,50,NA,NA)
v3 <- c(NA,NA,66,NA,NA,79)
v4 <- c(NA,NA,NA,74,23,88)
df <- data.frame(location,v1,v2,v3,v4)
I tried this approach to create a new column in which NA values are replaced by values from other columns.
library(dplyr)
col_1 <- df %>% mutate(new_col = v1 %>% is.na %>% ifelse(v2,v1))
col_2 <- col_1 %>% mutate(new_col_1 = new_col %>% is.na %>% ifelse(v3,new_col))
col_3 <- col_2 %>% mutate(final_col = new_col_1 %>% is.na %>% ifelse(v4,new_col_1))
It solves the problem but I have two questions:
1. Is there any efficient way to do this, instead of creating three columns?
2. For rows where more than one column has a value (e.g. v3 and v4, or v2 and v4), can I replace the NA with the mean of those values instead? How?
Thanks in advance.

We can use coalesce
library(dplyr)
library(purrr)
df %>%
  mutate(new = coalesce(!!! rlang::syms(names(.)[-1])))
Or
df %>%
  mutate(new = reduce(.[-1], coalesce))
# location v1 v2 v3 v4 new
#1 A 11 NA NA NA 11
#2 B 92 NA NA NA 92
#3 C NA NA 66 NA 66
#4 A NA 50 NA 74 50
#5 B NA NA NA 23 23
#6 C NA NA 79 88 79
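For the second question (using the mean when more than one column has a value), one option not shown above is rowMeans() with na.rm = TRUE over the value columns; a minimal sketch:
df %>%
  mutate(new = rowMeans(across(v1:v4), na.rm = TRUE))
# rows with a single non-NA value keep that value unchanged;
# a row where every column is NA would return NaN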

Related

Recode missing values in multiple columns: mutate with across and ifelse

I am working with an SPSS file that has been exported as tab delimited. In SPSS, you can set values to represent different types of missing, and this dataset uses 98 and 99 to indicate missing values.
I want to convert them to NA but only in certain columns (V2 and V3 in the example data, leaving V1 and V4 unchanged).
library(dplyr)
testdf <- data.frame(V1 = c(1, 2, 3, 4),
                     V2 = c(1, 98, 99, 2),
                     V3 = c(1, 99, 2, 3),
                     V4 = c(98, 99, 1, 2))
outdf <- testdf %>%
  mutate(across(V2:V3), . = ifelse(. %in% c(98, 99), NA, .))
I haven't used across before and cannot work out how to have the mutate return the ifelse into the same columns. I suspect I am overthinking this, but can't find any similar examples that have both across and ifelse. I need a tidyverse answer, prefer dplyr or tidyr.
The syntax needs to be slightly different to make this work; check ?across for more info:
1. You need a ~ to build a valid anonymous function (or use \(.) or function(.)).
2. The formula has to go inside the across() call, not as a separate mutate argument.
library(dplyr)
testdf %>%
  mutate(across(V2:V3, ~ ifelse(. %in% c(98, 99), NA, .)))
# V1 V2 V3 V4
# 1 1 1 1 98
# 2 2 NA NA 99
# 3 3 NA 2 1
# 4 4 2 3 2
Note that an alternative is replace:
testdf %>%
  mutate(across(V2:V3, ~ replace(., . %in% c(98, 99), NA)))
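If you prefer to avoid ifelse entirely, dplyr's na_if() converts one value at a time, so the two missing codes can be handled with nested calls (just an alternative sketch):
testdf %>%
  mutate(across(V2:V3, ~ na_if(na_if(., 98), 99)))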
A base R option is lapply with the same ifelse:
cols <- c("V2", "V3")
testdf[, cols] <- lapply(testdf[, cols], function(x) ifelse(x %in% c(98, 99), NA, x))
testdf
#> V1 V2 V3 V4
#> 1 1 1 1 98
#> 2 2 NA NA 99
#> 3 3 NA 2 1
#> 4 4 2 3 2
Created on 2022-10-19 with reprex v2.0.2
Base R, relying on the fact that 98 and 99 are the only values above 97 in these columns:
cols <- c("V2", "V3")
testdf[, cols][testdf[, cols] > 97] <- NA

Selecting Rows with Missing Data in a Range of Columns

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I have not been able to find (or figure out myself) an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA))
df
# D1 D2 V1 V2
# A NA 11 111
# B 0 NA 222
# C 1 33 NA
# D 1 NA NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than in this toy example, so spelling out each column with is.na() and & could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.
You can try this:
df %>% filter(is.na(V1) & is.na(V2))
Output:
D1 D2 V1 V2
1 D 1 NA NA
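Since the question mentions rowSums, here is a base R sketch of the same selection; the column set is written once, so it scales to a longer range:
cols <- c("V1", "V2")
df[rowSums(is.na(df[cols])) == length(cols), ]   # rows where every selected column is NA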
You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...
library(dplyr)
df %>%
  filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
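As a side note (not asked in the question), if_any() is the complement of if_all() and selects rows with an NA in at least one of the chosen columns:
filter(df, if_any(V1:V2, is.na))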

Dplyr: add multiple columns with mutate/across from character vector

I want to add several columns (filled with NA) to a data.frame using dplyr. I've defined the names of the columns in a character vector. Usually, with only one new column, you can use the following pattern:
test %>%
  mutate(!!new_column := NA)
However, I can't get it to work with across:
library(dplyr)
test <- data.frame(a = 1:3)
add_cols <- c("col_1", "col_2")
test %>%
  mutate(across(!!add_cols, ~ NA))
#> Error: Problem with `mutate()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Columns `col_1` and `col_2` don't exist.
#> ℹ Input `..1` is `across(c("col_1", "col_2"), ~NA)`.
test %>%
  mutate(!!add_cols := NA)
#> Error: The LHS of `:=` must be a string or a symbol
expected_output <- data.frame(
  a = 1:3,
  col_1 = rep(NA, 3),
  col_2 = rep(NA, 3)
)
expected_output
#> a col_1 col_2
#> 1 1 NA NA
#> 2 2 NA NA
#> 3 3 NA NA
Created on 2021-10-05 by the reprex package (v1.0.0)
With the first approach, the column names are created correctly, but across() then tries to find them among the existing columns. With the second approach, the LHS of := only accepts a single string or symbol.
Is there a tidyverse solution or do I need to resort to the good old for loop?
The !! operator works for a single element, so one option is a simple loop:
for(nm in add_cols) test <- test %>%
  mutate(!! nm := NA)
Output:
> test
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
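Since !! handles only one name at a time, another option (a sketch, not taken from the answers above) is to splice a named list of NAs into mutate() with !!!, which avoids the explicit loop:
test %>%
  mutate(!!! setNames(rep(list(NA), length(add_cols)), add_cols))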
Or another option is
test %>%
  bind_cols(setNames(rep(list(NA), length(add_cols)), add_cols))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
In base R, this is easier
test[add_cols] <- NA
This can also be used in a pipe:
test %>%
  `[<-`(., add_cols, value = NA)
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
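A slightly more readable way to write that `[<-` call in a pipe is base R's replace(), which performs the same single-bracket assignment under the hood (a sketch):
test %>%
  replace(add_cols, NA)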
across() only works on columns that are already present in the data: it is meant to loop over existing columns and modify them in place, or create modified copies via the .names argument.
We could make use of add_column from tibble, together with clean_names from janitor:
library(tibble)
library(janitor)
add_column(test, !!! add_cols) %>%
  clean_names %>%
  mutate(across(all_of(add_cols), ~ NA))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
Another approach with purrr, writing a small helper that builds a one-column data frame per name (the function in the original post did not run as written; this is a corrected sketch):
library(tidyverse)
f <- function(x) setNames(data.frame(NA), x)  # one-row, one-column data frame named after x
mutate(test, map_dfc(add_cols, ~ f(.x)))      # the unnamed data frame result is spliced into columns and recycled

Replacement of plyr::cbind.fill in dplyr?

I apologize if this question is elementary, but I've been scouring the internet and I can't seem to find a simple solution.
I currently have a list of R objects (named vectors or dataframes of 1 variable, I can work with either), and I want to join them into 1 large dataframe with 1 row for each unique name/rowname, and 1 column for each element in the original list.
My starting list looks something like:
l1 <- list(df1 = data.frame(c(1,2,3), row.names = c("A", "B", "C")),
           df2 = data.frame(c(2,6), row.names = c("B", "D")),
           df3 = data.frame(c(3,6,9), row.names = c("C", "D", "A")),
           df4 = data.frame(c(4,12), row.names = c("A", "E")))
And I want the output to look like:
data.frame("df1" = c(1,2,3,NA,NA),
+ "df2" = c(NA,2,NA,6,NA),
+ "df3" = c(9,NA,3,6,NA),
+ "df4" = c(4,NA,NA,NA,12), row.names = c("A", "B", "C", "D", "E"))
df1 df2 df3 df4
A 1 NA 9 4
B 2 2 NA NA
C 3 NA 3 NA
D NA 6 6 NA
E NA NA NA 12
I don't mind if the fill values are NA or 0 (ultimately I want 0 but that's an easy fix).
I'm almost positive that plyr::cbind.fill does exactly this, but I have been using dplyr in the rest of my script and I don't think using both is a good idea. dplyr::bind_cols does not seem to work with vectors of different lengths. I'm aware a very similar question has been asked here: R: Is there a good replacement for plyr::rbind.fill in dplyr?
but as I mentioned, this solution doesn't actually seem to work. Neither does dplyr::full_join, even wrapped in a do.call. Is there a straightforward solution to this, or is the only solution to write a custom function?
We can convert the row names to a column with rownames_to_column, rename the second column, bind the list elements row-wise with map_dfr, and reshape to 'wide' with pivot_wider:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
map_dfr(l1, ~ rownames_to_column(.x, 'rn') %>%
              rename_at(2, ~ 'v1'), .id = 'grp') %>%
  pivot_wider(names_from = grp, values_from = v1) %>%
  column_to_rownames('rn')
Here's a way with some purrr and dplyr functions. Create column names to represent each data frame—since each has only one column, this is easy with setNames, but with more columns you could use dplyr::rename. Do a full-join across the whole list based on the original row names, and fill NAs with 0.
library(dplyr)
library(purrr)
l1 %>%
  imap(~ setNames(.x, .y)) %>%
  map(tibble::rownames_to_column) %>%
  reduce(full_join, by = "rowname") %>%
  mutate_all(tidyr::replace_na, 0)
#> rowname df1 df2 df3 df4
#> 1 A 1 0 9 4
#> 2 B 2 2 0 0
#> 3 C 3 0 3 0
#> 4 D 0 6 6 0
#> 5 E 0 0 0 12
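A small aside: mutate_all() is superseded in current dplyr, so the final step can also be written with across(); only the last line changes (a sketch, restricted to the numeric columns):
l1 %>%
  imap(~ setNames(.x, .y)) %>%
  map(tibble::rownames_to_column) %>%
  reduce(full_join, by = "rowname") %>%
  mutate(across(where(is.numeric), ~ tidyr::replace_na(., 0)))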
Yet another purrr and dplyr option could be:
l1 %>%
  map2_dfr(.x = ., .y = names(.), ~ setNames(.x, .y) %>%
             rownames_to_column()) %>%
  group_by(rowname) %>%
  summarise_all(~ ifelse(all(is.na(.)), NA, first(na.omit(.))))
rowname df1 df2 df3 df4
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 NA 9 4
2 B 2 2 NA NA
3 C 3 NA 3 NA
4 D NA 6 6 NA
5 E NA NA NA 12
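For completeness, the same full-join idea can be written in base R with Map(), merge(), and Reduce(); the intermediate "rowname" column below is just illustrative (a sketch, no dplyr involved):
tbls <- Map(function(d, nm) setNames(data.frame(rownames(d), d[[1]]), c("rowname", nm)),
            l1, names(l1))
Reduce(function(x, y) merge(x, y, by = "rowname", all = TRUE), tbls)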

combination of pairs of columns BUT not rows in a data frame

How can I calculate combinations of pairs of values across the columns of a data frame, restricted so that combinations are never formed across different rows?
I have a data frame like the following, where each column is a variable.
ID A B C D E F G H I J
1 12 185 NA NA NA NA NA NA NA NA
2 35 20 11 NA NA NA NA NA NA NA
3 45 NA NA NA NA NA NA NA NA NA
I want an output like this:
Var1
12, 185
35, 20
35, 11
20, 11
45, 45
I tried the following code, but it considers ALL possible pairs, combining values across both columns and rows. I want each row to be considered independently of the others. Does someone have an idea? Thanks.
numNetList <- read.csv2("abd.csv", sep = ";")
comb <- lapply(numNetList, function(x) if (length(x) > 1)
  combn(sort(as.numeric(x)), 2))
combb <- do.call(cbind, comb)
pajek_list <- as.data.frame(table(paste(combb[1, ], combb[2, ], sep = ',')))
Not the most efficient method, but it solves the problem:
func <- function(x) {
  vals <- as.character(x[!is.na(x)])   # keep the non-NA values in the row
  if (length(vals) == 1)
    vals <- rep(vals, 2)               # a lone value is paired with itself
  combn(vals, 2)                       # all pairs within the row
}
l <- apply(df[-1], 1, func)
l1 <- as.data.frame(l)
colnames(l1) <- NULL
l2 <- data.frame(t(l1))
library(tidyr)
unite(l2, "new_col", X1, X2, sep = ",")
# new_col
# 12,185
# 35,20
# 35,11
# 20,11
# 45,45
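If, like the table() call in the question, you also want how often each pair occurs, you could tally the united column with dplyr::count() (a small follow-up sketch):
library(dplyr)
unite(l2, "new_col", X1, X2, sep = ",") %>%
  count(new_col)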
I would go with a combination of dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- tibble(A = c(12, 35, 45), B = c(185, 20, NA), C = c(NA, 11, NA))

df %>%
  mutate(group = 1:n()) %>%
  gather(col, val, -group) %>%
  group_by(group) %>%
  expand(col, val) %>%
  distinct(val) %>%
  summarise(val = toString(val))
