Problem with using coalesce to merge 2 columns into 1 in R

I would like to merge values from 2 columns into 1. For example, here is sample data:
id  x  y
 1 12
 1    14
 2 13
 3 15
 3    18
 4    19
I want
id  x  y  z
 1 12    12
 1    14 14
 2 13    13
 3 15    15
 3    18 18
 4    19 19
I tried using coalesce to create a new variable.
coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
    x[which(!is.na(x))[1]]
  })
}
df$z <- coalesce(df$x, df$y)
However, the variable doesn't reflect the columns joined. Am I using this function incorrectly?

You could use the dplyr::coalesce function:
> df$z <- dplyr::coalesce(ifelse(df$x == "", NA, df$x), df$y)
> df
  id  x  y  z
1  1 12    12
2  1    14 14
3  2 13    13
4  3 15    15
5  3    18 18
6  4    19 19
>
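To implement your own version, here called mycoalesce (this relies on the blank cells being empty strings, which sort before any non-empty string, so max picks the non-blank value; with NA cells, max would return NA):
mycoalesce <- function(...) {apply(cbind(...), 1, max)}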
And:
> df$z <- mycoalesce(df$x, df$y)
> df
  id  x  y  z
1  1 12    12
2  1    14 14
3  2 13    13
4  3 15    15
5  3    18 18
6  4    19 19
>
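A possible explanation for why the coalesce from the question appears to do nothing, assuming the blank cells are empty strings rather than NA (which the df$x == "" test above suggests): "" is not NA, so the first non-NA value in each row is always x. The sketch below uses made-up data of that shape; coalesce_na and blank_to_na are hypothetical helper names, not part of any package.

```r
# Made-up data: blanks are stored as "" (an assumption), not as NA
df <- data.frame(id = c(1, 1, 2, 3, 3, 4),
                 x  = c("12", "", "13", "15", "", ""),
                 y  = c("", "14", "", "", "18", "19"),
                 stringsAsFactors = FALSE)

# Same logic as the coalesce in the question: first non-NA value per row
coalesce_na <- function(...) {
  apply(cbind(...), 1, function(v) v[which(!is.na(v))[1]])
}

# "" counts as a value, so x always wins, blanks included
naive <- coalesce_na(df$x, df$y)

# Hypothetical helper: turn empty strings into NA before coalescing
blank_to_na <- function(v) replace(v, which(v == ""), NA)

df$z <- coalesce_na(blank_to_na(df$x), blank_to_na(df$y))
df
```

If the columns should end up numeric, wrapping the result in as.numeric() afterwards is one option.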

This might be cruder and less efficient than the other methods posted above, but it is still worth a try:
df1 <- df
df1[is.na(df1)] <- 0
z <- df1$x + df1$y
df <- cbind(df, z)
df
#   ID  x  y  z
# 1  1 12 NA 12
# 2  2 NA 14 14
# 3  3 13 NA 13
# 4  4 15 NA 15
# 5  5 NA 18 18
# 6  6 NA 19 19
I copied the original data frame to a new one mainly to preserve the NA values in the original. I also assumed that none of the IDs are missing, along with @Park's assumption.
Data: df <- data.frame(ID = 1:6, x = c(12, NA, 13, 15, NA, NA), y = c(NA, 14, NA, NA, 18, 19))

If exactly one of x and y is NA in each row and the other has a value, you can define a custom operator %+% like this:
`%+%` <- function(x, y) mapply(sum, x, y, MoreArgs = list(na.rm = TRUE))
df$z <- df$x %+% df$y
df
  id  x  y  z
1  1 12 NA 12
2  1 NA 14 14
3  2 13 NA 13
4  3 15 NA 15
5  3 NA 18 18
6  4 NA 19 19
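Both sum-based answers rely on exactly one of x and y being present in each row. A small sketch with made-up vectors (the fourth element is a hypothetical row where both values are present) shows how summing would differ from taking the first non-NA value if that assumption did not hold:

```r
`%+%` <- function(x, y) mapply(sum, x, y, MoreArgs = list(na.rm = TRUE))

# Made-up vectors; in the 4th position both x and y have a value
x <- c(12, NA, 13, 5)
y <- c(NA, 14, NA, 7)

summed       <- x %+% y                 # adds the two values in position 4
first_non_na <- ifelse(is.na(x), y, x)  # keeps x in position 4

data.frame(x, y, summed, first_non_na)
```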

Related

Replace values of rows with missing values by values of another row

I'm trying to work with conditionals but can't find an easy way to do it.
I have a dataset with missing values in column A. I want to create a new column C that takes the original value of A for all rows where it is not missing, and for rows with a missing value takes the value from another column (column B).
All columns are character variables.
A       B
13 A 1  15 A 2
15 A 2  15 A 2
NA      15 A 8
10 B 3  15 A 2
NA      15 A 5
What I want is:
A       B       C
13 A 1  15 A 2  13 A 1
15 A 2  15 A 2  15 A 2
NA      15 A 8  15 A 8
10 B 3  15 A 2  10 B 3
NA      15 A 5  15 A 5
I tried a loop, but the result is not satisfactory:
for(i in 1:length(df$A)) {
  if(is.na(df$A[i])) {
    df$C <- df$B
  }
  else {
    df$C <- df$A
  }
}
If anyone can help, thanks in advance.
In general, if you find yourself looping over a data frame, there is probably a more efficient solution: either use vectorised functions, as Jonathan does in his answer, or use dplyr as follows.
We can check whether A is NA: if so, we set c equal to B; otherwise we keep A.
library(dplyr)
dat %>% mutate(c = if_else(is.na(A), B, A))
       A      B      c
1 13 A 1 15 A 2 13 A 1
2 15 A 2 15 A 2 15 A 2
3   <NA> 15 A 8 15 A 8
4 10 B 3 15 A 2 10 B 3
5   <NA> 15 A 5 15 A 5
A base R equivalent uses ifelse:
df$C <- ifelse(is.na(df$A), df$B, df$A)
We could use coalesce:
library(dplyr)
df %>%
  mutate(c = coalesce(A, B))
       A      B      c
1 13 A 1 15 A 2 13 A 1
2 15 A 2 15 A 2 15 A 2
3   <NA> 15 A 8 15 A 8
4 10 B 3 15 A 2 10 B 3
5   <NA> 15 A 5 15 A 5
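For completeness, the loop in the question fails because every iteration overwrites the whole column C, so the result depends only on the last row. A minimal sketch of a repaired loop next to the vectorised form, using the data from the question (the variable name vectorised is just for illustration):

```r
df <- data.frame(A = c("13 A 1", "15 A 2", NA, "10 B 3", NA),
                 B = c("15 A 2", "15 A 2", "15 A 8", "15 A 2", "15 A 5"),
                 stringsAsFactors = FALSE)

# Repaired loop: assign one element per iteration, not the whole column
df$C <- NA_character_
for (i in seq_len(nrow(df))) {
  df$C[i] <- if (is.na(df$A[i])) df$B[i] else df$A[i]
}

# Vectorised equivalent, no loop needed
vectorised <- ifelse(is.na(df$A), df$B, df$A)

identical(df$C, vectorised)
```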

Update a variable if dplyr filter conditions are met

With the command df %>% filter(is.na(df)[,2:4]), the filter function creates a new df containing the rows that have NAs in columns 2, 3 and 4. What I want is not a new subsetted df but rather to assign, for example, 1 to a new variable called "Exclude" in the actual df.
This example with mutate was close, but not exactly what I was looking for:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to work with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas on how the filter subset could be used to update the data frame easily? The workaround would be to generate this subset, create the new variable for all of its rows and then join it back, but that is not tidy code.
We can do this in base R using the vectorized rowSums:
df$Exclude <- NA^!rowSums(is.na(df[-1]))
Output:
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
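The NA^! one-liner is compact but cryptic. The sketch below breaks it into steps on the same data (the intermediate names na_count, no_na and flag are only for illustration). It relies on the fact that in R NA^0 is 1 while NA^1 is NA, and that TRUE/FALSE are coerced to 1/0.

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[c(3, 5), 2:4] <- NA

# Number of missing values per row, ignoring the first column
na_count <- rowSums(is.na(df[-1]))

# TRUE where a row has no missing values (0 is coerced to FALSE, then negated)
no_na <- !na_count

# NA^TRUE is NA^1 = NA; NA^FALSE is NA^0 = 1
flag <- NA^no_na

df$Exclude <- flag
df
```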
Does this work:
library(dplyr)
df %>% rowwise() %>%
  mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that this yields 0 rather than NA for rows without missing values):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(Exclude = ifelse(
    is.na(sum(c_across(where(is.numeric)))), 1, NA
  ))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA
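The answers above flag a row if any of the columns is NA, which coincides with "all of B, C and D are NA" for the example data. If the two conditions need to be told apart, a base R sketch along these lines might help; the partially missing row 5 and the column names ExcludeAny and ExcludeAll are assumptions made for illustration.

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2] <- NA    # hypothetical case: only one of the three columns is missing

check_cols <- 2:4
n_missing  <- rowSums(is.na(df[check_cols]))

# 1 if at least one checked column is NA, otherwise NA
df$ExcludeAny <- ifelse(n_missing > 0, 1, NA)

# 1 only if every checked column is NA, otherwise NA
df$ExcludeAll <- ifelse(n_missing == length(check_cols), 1, NA)

df
```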

Fill column with prior nonmissing value, no ID

I'm trying to fill a missing ID column of a data frame as shown below. It's not blank in the first row it applies to and then blank until the next ID. I wrote ugly code to do this in a for loop, but wonder if there's a tidy-ier way to do this. Any suggestions?
Here's what I've got:
   code data
1     A    1
2          2
3          3
4          4
5          5
6          6
7          7
8          8
9          9
10        10
11    B   11
12        12
13        13
14        14
15        15
16    C   16
17        17
18        18
19        19
20        20
I want:
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
Code I've got now:
# Create mock data frame
df <- data.frame(code = c("A", rep("", 9),
                          "B", rep("", 4),
                          "C", rep("", 4)),
                 data = 1:20)
# For loop over rows (BAD!)
for (i in seq(2, nrow(df))) {
  df[i,]$code <- ifelse(df[i,]$code == "", df[i-1,]$code, df[i, ]$code)
}
There is a tidyr way to do it: the fill function. You also need to replace the zero-length strings with NA for this to work, which you can easily do using the mutate and na_if functions from dplyr.
library(dplyr)
library(tidyr)
df %>%
  mutate(code = na_if(code, "")) %>%
  fill(code)
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
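If loading tidyr is not an option, one base R way to carry the last non-blank value forward is to index the non-blank codes by a running count. This is only a sketch and assumes the first row is never blank; filled_pos and grp are illustrative names.

```r
df <- data.frame(code = c("A", rep("", 9),
                          "B", rep("", 4),
                          "C", rep("", 4)),
                 data = 1:20,
                 stringsAsFactors = FALSE)

# Positions that actually hold a code
filled_pos <- which(df$code != "")

# Running count of non-blank codes seen so far: 1 for the A block, 2 for B, 3 for C
grp <- cumsum(df$code != "")

# Look up, for every row, the most recent non-blank code
df$code <- df$code[filled_pos][grp]
df
```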

How to remove columns with dplyr with NA in specific row?

This code removes all columns which contain at least one NA.
library(dplyr)
df %>%
select_if(~ !any(is.na(.)))
What do I need to modify if I want to remove only the columns that have NA in the eighth row (for my generated data below)?
set.seed(1234)
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
In base R one can simply try:
df[,which(!is.na(df[8,]))]
Or as suggested by @RichScriven:
df[, !is.na(df[8,])]
# A B
# 1 1 11
# 2 2 12
# 3 3 13
# 4 4 NA
# 5 NA 15
# 6 6 16
# 7 7 17
# 8 8 18
# 9 9 19
# 10 10 20
You could do this:
df %>%
select_if(!is.na(.[8,]))
A B
1 1 11
2 2 12
3 3 13
4 4 NA
5 NA 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Another option is keep
library(purrr)
keep(df, ~ !(is.na(.x[8])))
# A B
#1 1 11
#2 2 12
#3 3 13
#4 4 NA
#5 NA 15
#6 6 16
#7 7 17
#8 8 18
#9 9 19
#10 10 20
Or with Filter from base R
Filter(function(x) !(is.na(x[8])), df)
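One thing to watch with the base R bracket approaches: if only a single column survives, df[, cols] drops to a plain vector unless drop = FALSE is given. A small sketch with made-up data (row_id, keep_cols and result are hypothetical names; row 3 is chosen so that only one column remains):

```r
# Made-up data in which only column B is non-missing in row 3
df <- data.frame(A = c(1, 2, NA), B = c(4, NA, 6), C = c(7, 8, NA))
row_id <- 3

# TRUE for columns whose value in the chosen row is not NA
keep_cols <- !vapply(df, function(col) is.na(col[row_id]), logical(1))

# drop = FALSE keeps the result a data frame even with one column left
result <- df[, keep_cols, drop = FALSE]
result
```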

How to generate new variables based on the name of the variables in the data frame

For example, I have a toy dataset as the one I created below,
a1<-1:10
a2<-11:20
v<-c(1,2,1,NA,2,1,2,1,2,1)
data<-data.frame(a1,a2,v,stringsAsFactors = F)
Then I want to create a new variable y which will be assigned the value of a1 or a2, or NA, based on the value of variable v. Therefore, y should equal 1 12 3 NA 15 6 17 8 19 10.
I want to generate it with a command similar to the ones listed below. It doesn't work, I guess because of a vectorization issue, so how can I fix it?
In reality, I have several a columns, say 10, and the actual values are characters instead of numeric ones.
data$y[!is.na(data$v)]<-data[,paste0('a',data$v)]
or
data%>%
mutate(y=ifelse(!is.na(v),get(paste0('a',v)),NA))
You could use standard indexing with cbind for that:
dat$y <- dat[cbind(1:nrow(dat), dat$v)]
The result:
> dat
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
(I used dat instead of data, because it is not wise to call a dataframe the same as a function; see ?data)
Only idea that comes to my mind:
data%>%
mutate(y=ifelse(!is.na(v),paste0('a',v),NA)) %>%
mutate(z=ifelse(!is.na(y),(ifelse(y=="a1",get("a1"),get("a2"))),NA))
a1 a2 v y z
1 1 11 1 a1 1
2 2 12 2 a2 12
3 3 13 1 a1 3
4 4 14 NA <NA> NA
5 5 15 2 a2 15
6 6 16 1 a1 6
7 7 17 2 a2 17
8 8 18 1 a1 8
9 9 19 2 a2 19
10 10 20 1 a1 10
or more directly:
data%>%
mutate(y=ifelse(!is.na(v),(ifelse(v==1, get("a1"),get("a2"))),NA))
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
still based on ifelse :(
You need to use a matrix accessor:
# Get the indices of the non-missing values
ind <- which(!is.na(data$v))
# Transform colnames to indices
tab <- structure(match(c("a1", "a2"), names(data)), .Names = c("a1", "a2"))
# Initialise y so the assignment also works when the last v is NA
data$y <- NA
# Access data with a matrix accessor
data$y[ind] <- data[cbind(ind, tab[paste0('a', data$v[ind])])]
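Since the question mentions that the real values are characters, here is a sketch of the same matrix-indexing idea on made-up character data. The column values and the names m and col_idx are assumptions for illustration; converting only the a columns to a matrix avoids mixing them with v.

```r
# Made-up character data with the same layout as in the question
dat <- data.frame(a1 = c("p", "q", "r", "s"),
                  a2 = c("w", "x", "y", "z"),
                  v  = c(1, 2, NA, 2),
                  stringsAsFactors = FALSE)

# Character matrix holding only the candidate columns
m <- as.matrix(dat[c("a1", "a2")])

# Column to pick in each row; "aNA" matches nothing, so NA stays NA
col_idx <- match(paste0("a", dat$v), colnames(m))

# Two-column index matrix: (row, column) pairs; an NA column yields NA
dat$y <- m[cbind(seq_len(nrow(dat)), col_idx)]
dat
```

With more a columns, extending the vector of names passed to dat[...] should be enough, as long as v holds the matching suffixes.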
