How do I impute missing variables in R using dplyr?

I would like to impute missing values for a variable given the existing values.
In var2, we notice that there are a lot of NAs.
If any 2 ids are the same, then their values for var2 are the same.
If the id has no values for var2, like in the case of id==2, then we just output as NA.
The result should go from df_old to df_new:
df_old<- read.table(header = TRUE, text = "
id var1 var2
1 A 12
1 B NA
1 E NA
2 G NA
2 J NA
")
df_new<- read.table(header = TRUE, text = "
id var1 var2
1 A 12
1 B 12
1 E 12
2 G NA
2 J NA
")
I tried:
df_new<-df_old %>%
group_by(id) %>%
mutate(var2=na.omit(var2))
I believe it doesn't work because of the second case. I was also wondering if using ifelse would be okay. Need help thanks!

If there is only one var2 value per id available you could simply do:
df_old %>%
group_by(id) %>%
mutate(var2 = min(var2, na.rm = TRUE))
Source: local data frame [5 x 3]
Groups: id [2]
id var1 var2
<int> <fctr> <int>
1 1 A 12
2 1 B 12
3 1 E 12
4 2 G NA
5 2 J NA
Another option, provided the non-missing value is always the first row of its group, would be:
mutate(var2 = var2[1])
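Both variants above rely on assumptions about the data: var2[1] needs the known value to come first, and in recent dplyr versions min(var2, na.rm = TRUE) on an all-NA group may return Inf with a warning rather than the NA shown in the output. A minimal sketch that avoids both, assuming each id has at most one distinct non-missing value, picks the first non-NA element by plain indexing (indexing an empty vector with [1] yields NA):

```r
library(dplyr)

df_old <- read.table(header = TRUE, text = "
id var1 var2
1 A 12
1 B NA
1 E NA
2 G NA
2 J NA
")

df_new <- df_old %>%
  group_by(id) %>%
  # Drop the NAs, then take the first remaining value.
  # For an all-NA group the filtered vector is empty, so [1] returns NA.
  mutate(var2 = var2[!is.na(var2)][1]) %>%
  ungroup()

df_new
```

The same expression works regardless of where the known value sits within the group.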

We can use data.table, but unlike the dplyr output above, we have to return NA explicitly for groups where every value is NA; otherwise min() gives Inf:
library(data.table)
setDT(df_old)[, var2 := if(any(!is.na(var2))) min(var2, na.rm = TRUE)
else NA_integer_, by = id]
df_old
# id var1 var2
#1: 1 A 12
#2: 1 B 12
#3: 1 E 12
#4: 2 G NA
#5: 2 J NA

There is now a tidyimpute package available on CRAN, which looks like it might do the trick:
"Functions and methods for imputing missing values (NA) in tables and list
patterned after the tidyverse approach of 'dplyr' and 'rlang'; works with
data.tables as well."
https://cran.r-project.org/web/packages/tidyimpute/tidyimpute.pdf

Related

de-duplicate rows and align columns in R

I have a dataframe with duplicated samples, because only one variable appears per row:
Sample  Var1  Var2
A       1     NA
B       NA    1
A       NA    3
C       NA    2
C       5     NA
B       4     NA
I would like to end up with the row names de-duplicated and corresponding column values side-by-side:
Sample  Var1  Var2
A       1     3
B       4     1
C       5     2
I've tried the group_by() function and that failed miserably!
I would very much appreciate any assistance and happy to clarify anything further if required!
We could use group_by and summarise for this task. Getting the max() will give us the desired output:
library(dplyr)
df %>%
group_by(Sample) %>%
summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
Sample Var1 Var2
<chr> <int> <int>
1 A 1 3
2 B 4 1
3 C 5 2
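The answer above assumes a data frame named df already exists and that every Sample has at least one non-missing value per column; max(..., na.rm = TRUE) on an all-NA group returns -Inf with a warning. A self-contained sketch that returns NA in that case follows. The helper name max_or_na is hypothetical, not part of dplyr.

```r
library(dplyr)

df <- data.frame(
  Sample = c("A", "B", "A", "C", "C", "B"),
  Var1 = c(1L, NA, NA, NA, 5L, 4L),
  Var2 = c(NA, 1L, 3L, 2L, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: max of the non-missing values,
# or a typed NA when nothing is left after dropping NAs.
max_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) x[NA_integer_] else max(x)
}

res <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), max_or_na))

res
```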
data.table approach
library(data.table)
DT <- fread("Sample Var1 Var2
A 1 NA
B NA 1
A NA 3
C NA 2
C 5 NA
B 4 NA")
# or setDT(DT) if DT is not a data.table format
# melt to long format, and remove NA's
DT.melt <- melt(DT, id.vars = "Sample", na.rm = TRUE)
# cast to wide again
dcast(DT.melt, Sample ~ variable, fill = NA)
# Sample Var1 Var2
# 1: A 1 3
# 2: B 4 1
# 3: C 5 2
Using collapse
library(collapse)
fmax(df1[-1], g = df1$Sample)
Var1 Var2
A 1 3
B 4 1
C 5 2
Or in dplyr 1.1.0, we can also use .by in reframe
df1 %>%
reframe(across(where(is.numeric), ~ max(.x, na.rm = TRUE)), .by = 'Sample')
Sample Var1 Var2
1 A 1 3
2 B 4 1
3 C 5 2

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row per id that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If an id has several non-missing values in the same column, max may not be what you want; you can use sum instead.
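To make the difference concrete, here is a small sketch on made-up data (the values are illustrative, not from the question) in which id 1 has two non-missing values in column A. max keeps only the larger one, while sum adds them:

```r
library(dplyr)

# Illustrative data: id 1 has two non-missing values in A (3 and 4).
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  A = c(3, NA, 4, NA, 1),
  B = c(NA, 5, NA, 2, NA)
)

by_max <- df %>%
  group_by(id) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))

by_sum <- df %>%
  group_by(id) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

by_max
by_sum
```

The two also differ when a group has no values at all in a column: sum(..., na.rm = TRUE) returns 0, while max(..., na.rm = TRUE) returns -Inf with a warning, so neither yields NA without an explicit check.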
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, fill each column within its group and keep the first row (fill() comes from tidyr):
library(tidyr)
data.frame(id, A, B, C) %>% group_by(id) %>%
fill(c(A, B, C), .direction = 'downup') %>%
slice_head(n = 1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2

keep last non missing observation for all variables by group

My data has multiple columns, and some of those columns have missing values in different rows. I would like to group (collapse) the data by the variable "g", keeping the last non-missing observation of each variable.
Input:
d <- data.table(a=c(1,NA,3,4),b=c(1,2,3,4),c=c(NA,NA,'c',NA),g=c(1,1,2,2))
Desired output
d_g <- data.table(a=c(1,4),b=c(2,4),c=c(NA,'c'),g=c(1,2))
A data.table (or dplyr) solution is preferred here.
Note: this is related to this question, but the main answers there seem to cause unnecessary NAs in some groups.
Using data.table :
library(data.table)
d[, lapply(.SD, function(x) last(na.omit(x))), g]
# g a b c
#1: 1 1 2 <NA>
#2: 2 4 4 c
One option using dplyr could be:
d %>%
group_by(g) %>%
summarise(across(everything(), ~ if(all(is.na(.))) NA else last(na.omit(.))))
g a b c
<dbl> <dbl> <dbl> <chr>
1 1 1 2 <NA>
2 2 4 4 c
In base R, aggregate could be used:
aggregate(.~g, d, function(x) tail(x[!is.na(x)], 1), na.action = NULL)
# g a b c
#1 1 1 2
#2 2 4 4 c
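In the aggregate output above, the all-NA group shows up as a blank rather than NA. A base R sketch that returns NA explicitly is given below. It assumes d is a plain data.frame rather than a data.table, uses the data frame method of aggregate instead of the formula method, and the helper name last_non_na is hypothetical.

```r
d <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, 3, 4),
  c = c(NA, NA, "c", NA),
  g = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value of x.
# tail() of an empty vector is empty, and [1] turns that into a typed NA.
last_non_na <- function(x) tail(x[!is.na(x)], 1)[1]

# The data frame method of aggregate() applies FUN to each column per group.
res <- aggregate(d[c("a", "b", "c")], by = list(g = d$g), FUN = last_non_na)

res
```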

How to summarise across different types of variables with dplyr::c_across()

I have data with different types of variables. Some are character, some factors, and some numeric, like below:
df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))
I'm trying to count the number of missing values per observation using c_across in dplyr.
However, c_across doesn't seem to be able to combine different types of values, as the error message below suggests:
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across())))
Error: Problem with summarise() input NAs.
x Can't combine a <factor> and b <double>.
ℹ Input NAs is sum(is.na(c_across())).
ℹ The error occurred in row 1.
Indeed, if I include only numeric variables, it works.
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(b:c))))
Same thing if I include only character variables
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(c(a,d)))))
I could solve the issue without using c_across like below, but I have lots of variables, so it's not very practical.
df %>%
rowwise() %>%
summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))
I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.
apply(df, 1, function(x)sum(is.na(x)))
Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?
I would suggest this approach. The issue comes from two things: first, the variables in your dataframe have different types, and second, you need a key variable for the rowwise-style task. So, in the code below, we first convert the variables to a common type, then create an id based on the row number. We pass that id to rowwise() and can then use the c_across() function. Here is the code (I have used your df data):
library(tidyverse)
#Code
df %>%
mutate_at(vars(everything()), as.character) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
We can also avoid mutate_at() by using the newer across() with mutate() to convert the variables to a common type:
#Code 2
df %>%
mutate(across(a:d,~as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
A much faster option is to skip rowwise and c_across and use rowSums instead:
library(dplyr)
df %>%
mutate(NAs = rowSums(is.na(.)))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 3
If we want to count NAs only in certain columns, e.g. the numeric ones:
df %>%
mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 1
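The rowSums approach sidesteps the type problem because is.na() applied to a data frame returns a single logical matrix, so the columns never have to be combined into one vector the way c_across() does. A base R sketch of that idea, using the df from the question (the variable names na_matrix, n_missing and numeric_cols are illustrative):

```r
df <- data.frame(
  a = c("tt", "ss", "ss", NA),
  b = c(2, 3, NA, 1),
  c = c(1, 2, NA, NA),
  d = c("tt", "ss", "ss", NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame yields one logical matrix (rows x columns),
# whatever the column types are.
na_matrix <- is.na(df)

# Count the TRUE values in each row.
n_missing <- unname(rowSums(na_matrix))

# Restrict the count to numeric columns with a base R column filter.
numeric_cols <- vapply(df, is.numeric, logical(1))
n_missing_numeric <- unname(rowSums(is.na(df[numeric_cols])))

n_missing
n_missing_numeric
```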

How do I apply a function within `mutate_at` that conditions rowwise on values in other columns?

I have a data frame within which I would like to transform the values of one set of columns, conditional on values in another set of columns in the same row. I am trying and failing to do this in the tidyverse with a combination of rowwise and mutate_at. Here's a reproducible example.
library(dplyr)
set.seed(20912)
dat <- data.frame(cat1 = sample(LETTERS[1:2], 10, replace = TRUE), cat2 = sample(LETTERS[1:2], 10, replace = TRUE), id = 3, sim_1 = rnorm(10), sim_2 = rnorm(10), stringsAsFactors = FALSE)
> dat
cat1 cat2 id sim_1 sim_2
1 A A 3 -0.1054062 -0.47563580
2 B A 3 -1.7198921 0.76713640
3 A B 3 -0.5946627 -0.33958464
4 B B 3 -1.6547488 -0.13026564
5 B B 3 -0.3779149 1.29590315
6 B B 3 0.6271939 0.08707965
7 B B 3 1.6376711 1.02151753
8 A B 3 1.7675520 1.66983954
9 B A 3 -0.3284081 -1.28175621
10 B B 3 0.8431148 -0.15415091
In that table, I want to transform the values of all columns that begin with "sim_", conditional on the values of cat1 and cat2. Say, for example, I want to replace the values in all the "sim_*" columns with NA, but only in rows where cat1 == cat2. So my expected result would be:
cat1 cat2 id sim_1 sim_2
1 A A 3 NA NA
2 B A 3 -1.7198921 0.7671364
3 A B 3 -0.5946627 -0.3395846
4 B B 3 NA NA
5 B B 3 NA NA
6 B B 3 NA NA
7 B B 3 NA NA
8 A B 3 1.7675520 1.6698395
9 B A 3 -0.3284081 -1.2817562
10 B B 3 NA NA
I tried a few variations on the theme of rowwise plus mutate_at with no luck. For example:
> dat %>% rowwise() %>% mutate_at(vars(starts_with("sim_")), function(x) { ifelse(cat1 == cat2, NA, x) })
Error in ifelse(cat1 == cat2, x, 0) : object 'cat1' not found
What am I missing? I realize that this would be easier if I were to reshape the data from wide to long first, but I'm hoping to learn something about tidyverse functions or syntax and find a way to do this without reshaping the data.
We can use replace. Both ifelse and replace are vectorized, so rowwise is not needed:
library(dplyr)
dat %>%
mutate_at(vars(starts_with('sim')), ~ replace(., cat1 == cat2, NA_real_))
Or, since these are numeric columns, we can do the transformation arithmetically:
dat %>%
mutate_at(vars(starts_with('sim')), ~ . * NA^(cat1 == cat2))
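The arithmetic version works because NA^0 is 1 and NA^1 is NA in R, so multiplying by NA^(condition) leaves a value unchanged where the condition is FALSE and turns it into NA where it is TRUE. The sketch below checks that and then applies the replace() approach with across(), which supersedes mutate_at() in dplyr 1.0.0 and later. The dat values here are made up for illustration and are not the random draws from the question.

```r
library(dplyr)

# FALSE is treated as 0 and TRUE as 1: NA^0 is 1, NA^1 is NA.
na_power <- NA^c(FALSE, TRUE)

# Illustrative data with the same column layout as in the question.
dat <- data.frame(
  cat1 = c("A", "B", "A", "B"),
  cat2 = c("A", "A", "B", "B"),
  id = 3,
  sim_1 = c(0.5, -1.2, 0.3, 2.0),
  sim_2 = c(1.1, 0.7, -0.4, -0.9),
  stringsAsFactors = FALSE
)

# The lambda can refer to cat1 and cat2 directly because it is
# evaluated in the context of the data frame being mutated.
res <- dat %>%
  mutate(across(starts_with("sim_"), ~ replace(.x, cat1 == cat2, NA_real_)))

res
```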

Resources