Fill in missing values based on other entries in R

I have a dataset input with a couple of missing values, and I have to create a dataset output with the following logic:
If a value is missing in any of the columns b, c, or d, then
check the corresponding a column and fill the missing value with the
corresponding value from another row of the same a group for that specific column.
I tried to do that with the _join functions from dplyr but was unsuccessful.
I could do it manually, but that option is off the table because I have a big dataset with multiple instances like this.
Input
library(dplyr)
input <- tibble(a = rep(c("A", "B", "C", "D"), 2),
                b = c(1:3, NA, rep(NA, 4)),
                c = c(21:28),
                d = c(rep(NA, 4), 54, NA, 34, 11)) %>%
  arrange(a)
Input view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
#1 A 1 21 NA
#2 A NA 25 54
#3 B 2 22 NA
#4 B NA 26 NA
#5 C 3 23 NA
#6 C NA 27 34
#7 D NA 24 NA
#8 D NA 28 11
Output - expected view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11

Use function na.locf from package zoo to carry the last observation forward within each group, and then again in the opposite direction.
suppressPackageStartupMessages(library(dplyr))

input <- tibble(a = rep(c("A", "B", "C", "D"), 2),
                b = c(1:3, NA, rep(NA, 4)),
                c = c(21:28),
                d = c(rep(NA, 4), 54, NA, 34, 11)) %>%
  arrange(a)

input %>%
  group_by(a) %>%
  mutate(across(b:d, zoo::na.locf, na.rm = FALSE)) %>%
  mutate(across(b:d, zoo::na.locf, na.rm = FALSE, fromLast = TRUE))
#> # A tibble: 8 × 4
#> # Groups: a [4]
#> a b c d
#> <chr> <int> <int> <dbl>
#> 1 A 1 21 54
#> 2 A 1 25 54
#> 3 B 2 22 NA
#> 4 B 2 26 NA
#> 5 C 3 23 34
#> 6 C 3 27 34
#> 7 D NA 24 11
#> 8 D NA 28 11
Created on 2022-05-14 by the reprex package (v2.0.1)
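Note that with dplyr 1.1 or later, passing extra arguments through the ... of across() is deprecated and triggers a warning; if that happens, the equivalent lambda form can be used instead (a minimal sketch of the same calls):
input %>%
  group_by(a) %>%
  mutate(across(b:d, ~ zoo::na.locf(.x, na.rm = FALSE))) %>%
  mutate(across(b:d, ~ zoo::na.locf(.x, na.rm = FALSE, fromLast = TRUE)))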

This is hasty imputation:
library(dplyr)
input %>%
  group_by(a) %>%
  mutate(across(b:d, ~ if_else(is.na(.), na.omit(.)[1], .))) %>%
  ungroup()
# # A tibble: 8 x 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11
I think the group_by(a) is fairly intuitive and makes sense. The "hasty" part of my first sentence is that we find the first non-NA value and use it. Other imputation techniques may use the average, median, previous valid data ("locf" as in Rui's answer), or random sampling.
The mice package specializes in imputation.
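For completeness, tidyr's fill() performs the same group-wise forward/backward carry without needing zoo; a minimal sketch, assuming tidyr 1.0 or later (for the .direction = "downup" option):
library(tidyr)

input %>%
  group_by(a) %>%
  fill(b:d, .direction = "downup") %>%
  ungroup()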

Related

collapse a dataframe in R that contains both numeric and character variables

I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4], 6),
                   "date" = c(sapply(1:3, function(x) rep(x, 8))),
                   "num_var1" = 1:24,
                   "num_var2" = 24:1,
                   "alpha_var1" = LETTERS[1:24],
                   "alpha_var2" = LETTERS[25:2])
and I would like to summarize (mean) its rows by ag and date using dplyr. The issue is that some columns contain characters: in that case, I would like to get the first entry by group (the example dataset is already sorted).
Since my dataset has many columns, I would like the code to recognize automatically whether a variable is numeric (including integer) or character. However, the best solution that I have so far is the following one:
data %>%
  dplyr::group_by(ag, date) %>%
  summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
  dplyr::group_by(ag, date) %>%
  summarise(across(everything(),
                   ~ if (is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
  group_by(ag, date) %>%
  summarise(across(where(is.numeric), mean),
            across(where(is.character), first),
            .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
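As a small variation (just a sketch, with the illustrative helper name summarise_mixed not taken from either answer), the numeric/character branch can be pulled into a named helper so the summarise() call stays short and the rule is easy to reuse:
library(dplyr)

summarise_mixed <- function(x) if (is.numeric(x)) mean(x) else dplyr::first(x)

data %>%
  group_by(ag, date) %>%
  summarise(across(everything(), summarise_mixed), .groups = "drop")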

Mutate based on conditions?

df <- data.frame(x1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
                 ind = c("O", "O", "C", "C", "O", "O", "O", "O"),
                 num = c(6, 12, 18, 24, 6, 12, 18, 24))
set.seed(1)
df <- df[sample(nrow(df)), ]
df2 <- df %>%
  group_by(x1) %>%
  arrange(x1, num)
> df2
# A tibble: 8 x 3
# Groups: x1 [2]
x1 ind num
<fct> <fct> <dbl>
1 a O 6
2 a O 12
3 a C 18
4 a C 24
5 b O 6
6 b O 12
7 b O 18
8 b O 24
I want to create some new columns in this data. The first should, for each unique value of the column x1, take the minimum value of the column num where the column ind is equal to C; for the value a this should return 18. It should then do the same again, but checking where ind is equal to O instead. If it finds nothing, it should just return NA. So the two columns should look like this:
x1 ind num min_O min_C
<fct> <fct> <dbl> <dbl> <dbl>
1 a O 6 6 18
2 a O 12 6 18
3 a C 18 6 18
4 a C 24 6 18
5 b O 6 6 NA
6 b O 12 6 NA
7 b O 18 6 NA
8 b O 24 6 NA
I've tried a variation of grouping by the x1 and ind columns but couldn't get it to work, since I want the minimum only where ind equals a particular value. I am sure there is an easy way!
This looks a bit cumbersome but does the job
library(dplyr)
library(tidyr)
df2 %>%
  group_by(x1, ind) %>%
  pivot_wider(names_from = ind, values_from = num,
              values_fn = min, names_prefix = 'min_') %>%
  left_join(df2, by = 'x1')
# A tibble: 8 x 5
# Groups: x1 [2]
x1 min_O min_C ind num
<chr> <dbl> <dbl> <chr> <dbl>
1 a 6 18 O 6
2 a 6 18 O 12
3 a 6 18 C 18
4 a 6 18 C 24
5 b 6 NA O 6
6 b 6 NA O 12
7 b 6 NA O 18
8 b 6 NA O 24
Another way could be
library(tidyr)
library(dplyr)
df %>%
  arrange(x1, num) %>%
  group_by(x1) %>%
  mutate(min_C = min(num[ind == "C"]),
         min_O = min(num[ind == "O"]),
         across(starts_with("min"), ~ ifelse(.x == Inf, NA_real_, .x)))
which returns
# A tibble: 8 x 5
# Groups: x1 [2]
x1 ind num min_C min_O
<chr> <chr> <dbl> <dbl> <dbl>
1 a O 6 18 6
2 a O 12 18 6
3 a C 18 18 6
4 a C 24 18 6
5 b O 6 NA 6
6 b O 12 NA 6
7 b O 18 NA 6
8 b O 24 NA 6
but it also returns a warning, since there are no "C" rows in group b.
If you leave out the across(...) part, those values come out as Inf instead of NA.
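Both the warning and the Inf can be avoided by wrapping min() in a small helper that returns NA for empty input; this is just a sketch, and the helper name min_or_na is illustrative:
library(dplyr)

min_or_na <- function(x) if (length(x) == 0) NA_real_ else min(x)

df %>%
  arrange(x1, num) %>%
  group_by(x1) %>%
  mutate(min_C = min_or_na(num[ind == "C"]),
         min_O = min_or_na(num[ind == "O"])) %>%
  ungroup()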

Filter tibble for a specific variable but keep non matching rows and fill with NA/NULL

I have this tibble and I want to filter it by all ids of a given item, but additionally keep all non-matching rows and fill them with NA or 0.
library(tibble)
df <- tibble::tibble(item = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                     id = c("10", "11", "12", "10", "15", "16", "8", "9", "12"),
                     val = c(25, 27, 31, 24, 38, 39, 12, 14, 39))
df
# A tibble: 9 x 3
item id val
<chr> <chr> <dbl>
1 a 10 25
2 a 11 27
3 a 12 31
4 b 10 24
5 b 15 38
6 b 16 39
7 c 8 12
8 c 9 14
9 c 12 39
For example, I want to filter by all ids of item a (i.e. 10, 11, 12). My desired tibble would look like this, where I keep all non-matching rows and fill the <chr> column with NA and the <dbl> column with 0.
# A tibble: 9 x 3
item id val
<chr> <chr> <dbl>
1 a 10 25
2 a 11 27
3 a 12 31
4 b 10 24
5 b NA 0
6 b NA 0
7 c NA 0
8 c NA 0
9 c 12 39
You can use replace or ifelse -
library(dplyr)
keep <- c(10, 11, 12)
df %>%
  mutate(val = replace(val, !id %in% keep, 0),
         id = replace(id, !id %in% keep, NA))
# item id val
# <chr> <chr> <dbl>
#1 a 10 25
#2 a 11 27
#3 a 12 31
#4 b 10 24
#5 b NA 0
#6 b NA 0
#7 c NA 0
#8 c NA 0
#9 c 12 39
If you have many columns you may use across -
df %>%
  mutate(across(c(where(is.character), -item), ~ replace(., !id %in% keep, NA)),
         across(where(is.numeric), ~ replace(., !id %in% keep, 0)))
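If the reference ids should come from the data instead of being typed out, keep can be derived from the item of interest first; a small sketch, assuming item "a" is the reference:
keep <- df %>% filter(item == "a") %>% pull(id)

df %>%
  mutate(val = replace(val, !id %in% keep, 0),
         id = replace(id, !id %in% keep, NA))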

Using dplyr: Within groups, select the first value meeting a condition

I need assistance obtaining a solution that will scan backwards in time and obtain the first value meeting a condition. I have data similar to:
set.seed(42)
df <- data.frame(
  id = sample(LETTERS[1:3], 20, replace = TRUE),
  time.var = sample(1:20, 20, replace = TRUE),
  x = sample(c(1:10), 20, replace = TRUE)
)
df <- df[order(df$id, df$time.var), ]
id time.var x
A 5 2
A 14 8
A 19 7
A 20 1
B 1 1
B 2 5
B 9 10
B 11 10
B 13 6
B 15 4
B 19 3
C 1 7
C 3 5
C 8 9
C 8 4
C 17 7
C 17 4
C 17 8
C 19 4
C 19 10
For the last member of each group defined in time order by time.var, I'd like to obtain the first value from x less than 5 by scanning in descending time order.
I have tried:
test <- df %>%
  group_by(id) %>%
  arrange(id, time.var) %>%
  mutate(less.5 = which.max(x[x < 5]))
What strategy can I use to obtain this type of output:
id time.var x previous.less.5
A 5 2
A 14 8
A 19 7
A 20 1 2
B 1 1
B 2 5
B 9 10
B 11 10
B 13 6
B 15 4
B 19 3 4
C 1 7
C 3 5
C 8 9
C 8 4
C 17 7
C 17 4
C 17 8
C 19 4
C 19 10 4
Using library(dplyr):
df %>%
  arrange(id, time.var) %>%
  group_by(id) %>%
  mutate(previous.less.5 = tail(c(x[c((x[-n()] < 5), FALSE)]), 1)) %>%
  group_by(id) %>%
  mutate(previous.less.5 = if_else(row_number() == n(), previous.less.5, NULL))
or
df %>%
  arrange(id, time.var) %>%
  group_by(id) %>%
  slice(1:(n() - 1)) %>%
  filter(x < 5) %>%
  slice(n()) %>%
  select(-time.var) %>%
  right_join(df, ., by = "id", suffix = c("", ".y")) %>%
  group_by(id) %>%
  mutate(previous.less.5 = if_else(row_number() == n(), x.y, NULL)) %>%
  select(-x.y)
giving:
#> # A tibble: 20 x 4
#> # Groups: id [3]
#> id time.var x previous.less.5
#> <fct> <int> <int> <int>
#> 1 A 3 10 NA
#> 2 A 4 8 NA
#> 3 A 4 6 NA
#> 4 A 5 2 NA
#> 5 A 5 8 NA
#> 6 A 5 7 NA
#> 7 A 11 6 NA
#> 8 A 13 3 NA
#> 9 A 15 2 3
#> 10 B 2 1 NA
#> 11 B 4 3 NA
#> 12 B 4 6 NA
#> 13 B 8 5 NA
#> 14 B 8 4 NA
#> 15 B 20 7 4
#> 16 C 1 2 NA
#> 17 C 2 10 NA
#> 18 C 10 6 NA
#> 19 C 13 2 NA
#> 20 C 18 5 2
Update:
If there's a group with no record less than 5 (or with only the last record less than 5), then the following works:
df %>%
  arrange(id, time.var) %>%
  group_by(id) %>%
  mutate(previous.less.5 = if_else(row_number() == n(),
                                   max(tail(c(x[c(x[-n()] < 5, FALSE)]), 1)),
                                   NULL)) %>%
  mutate(previous.less.5 = replace(previous.less.5, is.infinite(previous.less.5), NA))
Data:
set.seed(42) # I am getting different data than what you've shown with this seed
df <- data.frame(
  id = sample(LETTERS[1:3], 20, replace = TRUE),
  time.var = sample(1:20, 20, replace = TRUE),
  x = sample(c(1:10), 20, replace = TRUE)
)
df <- df[order(df$id, df$time.var), ]
We can reverse the values of x within each id and then get the first number that is less than 5 using which. The final replace assigns NA to every value of previous.less.5 except the last one.
library(dplyr)
df %>%
  # Data is already sorted by `id` and `time.var`; if yours is not, add
  # arrange(id, time.var) %>%
  group_by(id) %>%
  mutate(rev_x = c(NA, rev(x)[-1]),
         previous.less.5 = rev_x[which(rev_x < 5)[1]],
         previous.less.5 = replace(previous.less.5, row_number() != n(), NA)) %>%
  select(-rev_x)
# id time.var x previous.less.5
# <fct> <int> <int> <int>
# 1 A 5 2 NA
# 2 A 14 8 NA
# 3 A 19 7 NA
# 4 A 20 1 2
# 5 B 1 1 NA
# 6 B 2 5 NA
# 7 B 9 10 NA
# 8 B 11 10 NA
# 9 B 13 6 NA
#10 B 15 4 NA
#11 B 19 3 4
#12 C 1 7 NA
#13 C 3 5 NA
#14 C 8 9 NA
#15 C 8 4 NA
#16 C 17 7 NA
#17 C 17 4 NA
#18 C 17 8 NA
#19 C 19 4 NA
#20 C 19 10 4
This should also handle the case and return NA's if there is no value less than 5 in an id.
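For readability, the same backward scan can also be wrapped in a small helper that returns the last qualifying value before a group's final row, or NA when there is none; this is only a sketch, and the helper name prev_below_5 is illustrative:
library(dplyr)

prev_below_5 <- function(x) {
  cand <- head(x, -1)     # drop the group's final row
  cand <- cand[cand < 5]  # keep only values below the threshold
  tail(cand, 1)[1]        # last such value, or NA if there were none
}

df %>%
  arrange(id, time.var) %>%
  group_by(id) %>%
  mutate(previous.less.5 = if_else(row_number() == n(), prev_below_5(x), NA_integer_)) %>%
  ungroup()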

debugging: function to create multiple lags for multiple columns (dplyr)

I want to create multiple lags of multiple variables, so I thought writing a function would be helpful. My code throws a warning ("Truncating vector to length 1") and produces wrong results:
library(dplyr)
time <- c(2000:2009, 2000:2009)
x <- c(1:10, 10:19)
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
df <- data.frame(id, time, x)
three_lags <- function(data, column, group, ordervar) {
  data <- data %>%
    group_by_(group) %>%
    mutate(a = lag(column, 1L, NA, order_by = ordervar),
           b = lag(column, 2L, NA, order_by = ordervar),
           c = lag(column, 3L, NA, order_by = ordervar))
}

df_lags <- three_lags(data = df, column = x, group = id, ordervar = time) %>%
  arrange(id, time)
I also wondered if there might be a more elegant solution using mutate_each, but I didn't get that to work either. I could of course just write long code with a line for each new lagged variable, but I'd like to avoid that.
EDIT:
akrun's dplyr answer works, but takes a long time to compute for large data frames. The solution using data.table seems to be more efficient. So a dplyr (or other) solution that can also be applied to several columns and several lags is still to be found.
EDIT 2:
For multiple columns and no groups (e.g. "ID"), the following solution (using data.table's shift()) seems very well suited to me, due to its simplicity. The code may of course be shortened, but step by step:
df <- arrange(df, time)
df.lag <- shift(df[, 1:24], n = 1:3, give.names = T) ## column indexes of the columns to be lagged as "[, startcol:endcol]"; "n = 1:3" specifies the number of lags (lag1, lag2 and lag3 in this case)
df.result <- bind_cols(df, df.lag)
We can use shift from data.table, which can take multiple values for 'n':
library(data.table)
setDT(df)[order(time), c("a", "b", "c") := shift(x, 1:3) , id][order(id, time)]
Suppose we need to do this for multiple columns:
df$y <- df$x
setDT(df)[order(time), paste0(rep(c("x", "y"), each =3),
c("a", "b", "c")) :=shift(.SD, 1:3), id, .SDcols = x:y]
shift can also be used within dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, time) %>%
  do(data.frame(., setNames(shift(.$x, 1:3), c("a", "b", "c"))))
# id time x a b c
# <dbl> <int> <int> <int> <int> <int>
#1 1 2000 1 NA NA NA
#2 1 2001 2 1 NA NA
#3 1 2002 3 2 1 NA
#4 1 2003 4 3 2 1
#5 1 2004 5 4 3 2
#6 1 2005 6 5 4 3
#7 1 2006 7 6 5 4
#8 1 2007 8 7 6 5
#9 1 2008 9 8 7 6
#10 1 2009 10 9 8 7
#11 2 2000 10 NA NA NA
#12 2 2001 11 10 NA NA
#13 2 2002 12 11 10 NA
#14 2 2003 13 12 11 10
#15 2 2004 14 13 12 11
#16 2 2005 15 14 13 12
#17 2 2006 16 15 14 13
#18 2 2007 17 16 15 14
#19 2 2008 18 17 16 15
#20 2 2009 19 18 17 16
We could also create a function that outputs a tibble:
library(tidyverse)
lag_multiple <- function(x, n_vec) {
  map(n_vec, lag, x = x) %>%
    set_names(paste0("lag", n_vec)) %>%
    as_tibble()
}

tibble(x = 1:30) %>%
  mutate(lag_multiple(x, 1:5))
#> # A tibble: 30 x 6
#> x lag1 lag2 lag3 lag4 lag5
#> <int> <int> <int> <int> <int> <int>
#> 1 1 NA NA NA NA NA
#> 2 2 1 NA NA NA NA
#> 3 3 2 1 NA NA NA
#> 4 4 3 2 1 NA NA
#> 5 5 4 3 2 1 NA
#> 6 6 5 4 3 2 1
#> 7 7 6 5 4 3 2
#> 8 8 7 6 5 4 3
#> 9 9 8 7 6 5 4
#> 10 10 9 8 7 6 5
#> # ... with 20 more rows
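Regarding the OP's edit about several columns and several lags in plain dplyr: with current dplyr (1.0 or later), across() accepts a named list of lambdas, so grouped multi-lag columns can be built in one mutate(). This is only a sketch under that assumption:
library(dplyr)

lags <- list(lag1 = ~ lag(.x, 1), lag2 = ~ lag(.x, 2), lag3 = ~ lag(.x, 3))

df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(across(c(x), lags, .names = "{.col}_{.fn}")) %>%  # add more columns inside c() to lag several variables
  ungroup()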
