Replace NA in a series of variables by a factor level

This is my data, and I want to replace NA with "No". I can replace the missing values one variable at a time, but I need to replace the NAs across all of the s_ variables at once in code. As a reminder, all of these variables are factors.
id x s_0 s_1 s_2 s_3
1 5 75 A 4 110
2 9 36 NA NA 921
3 11 13 B 7 769
4 11 34 C 2 912
5 11 NA C NA 835
6 13 39 NA 4 NA
7 14 45 B 4 577
8 19 42 D 6 NA
9 20 4 NA 7 577
10 13 28 NA 3 573

If these are already factors, you can use forcats::fct_explicit_na() (deprecated since forcats 1.0.0 in favor of fct_na_value_to_level()):
library(dplyr)
library(forcats)
# Make sample data vars factors
dat <- dat %>%
  mutate(across(starts_with("s_"), as.factor))
# Add 'No' as factor level
dat %>%
  mutate(across(starts_with("s_"), ~ fct_explicit_na(.x, "No")))
# A tibble: 10 x 6
id x s_0 s_1 s_2 s_3
<dbl> <dbl> <fct> <fct> <fct> <fct>
1 1 5 75 A 4 110
2 2 9 36 No No 921
3 3 11 13 B 7 769
4 4 11 34 C 2 912
5 5 11 No C No 835
6 6 13 39 No 4 No
7 7 14 45 B 4 577
8 8 19 42 D 6 No
9 9 20 4 No 7 577
10 10 13 28 No 3 573
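
For newer forcats versions, the same idea can be sketched with fct_na_value_to_level(). This is a minimal sketch, assuming forcats 1.0.0 or later and dplyr 1.0.0 or later are installed; the small `dat` below is hypothetical toy data, not the asker's full table.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data with factor columns that contain NA
dat <- data.frame(
  id  = 1:4,
  s_0 = factor(c("75", "36", NA, "39")),
  s_1 = factor(c("A", NA, "B", NA))
)

# Turn NA values into an explicit "No" level in every s_ column
out <- dat %>%
  mutate(across(starts_with("s_"), ~ fct_na_value_to_level(.x, level = "No")))

out
```

The lambda form (`~ ...`) keeps the extra argument next to the function it belongs to, which avoids passing arguments through across() itself.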

In base R, you need to add "No" as a factor level before turning the NAs into "No".
cols <- grep('s_\\d+', names(df))
df[cols] <- lapply(df[cols], function(x) {
  levels(x) <- c(levels(x), 'No')
  x[is.na(x)] <- 'No'
  x
})
df
# id x s_0 s_1 s_2 s_3
#1 1 5 75 A 4 110
#2 2 9 36 No No 921
#3 3 11 13 B 7 769
#4 4 11 34 C 2 912
#5 5 11 No C No 835
#6 6 13 39 No 4 No
#7 7 14 45 B 4 577
#8 8 19 42 D 6 No
#9 9 20 4 No 7 577
#10 10 13 28 No 3 573
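
To see why the level has to exist first, here is a small base R sketch on a hypothetical toy factor (the names `x` and `y` are only for illustration). Assigning a value that is not among the levels produces a warning and leaves NA in place.

```r
# Hypothetical toy factor with one missing value
x <- factor(c("A", NA, "B"))

# Without the extra level the assignment is rejected: R warns about an
# invalid factor level and the value stays NA
y <- x
suppressWarnings(y[is.na(y)] <- "No")
anyNA(y)

# Adding the level first makes the same assignment valid
levels(x) <- c(levels(x), "No")
x[is.na(x)] <- "No"
x
```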

Related

rbind dataframes by filling missing rows from the first dataframe

I have 4 datasets from 4 rounds of a survey, with the first round containing 5 variables and the next ones containing only 3. This is because the ID (same sample) and the other two variables (v1 and v2) are fixed over time.
df1 <- data.frame(id = c(1:5), round=1, v1 = c(6:10), v2 = c(11:15), v3=c(16:20))
df2 <- data.frame(id = c(1:5), round=2, v3=c(26:30))
df3 <- data.frame(id = c(1:5), round=3, v3=c(36:40))
df4 <- data.frame(id = c(1:5), round=4, v3=c(46:50))
Binding the rows:
library(dplyr)
list(df1, df2, df3, df4) %>%
  bind_rows(.id = 'grp') %>%
  group_by(id)
Now when I rbind them, I end up with missing values for the two fixed variables (v1 and v2) in rounds 2 to 4:
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 NA NA 26
7 2 2 2 NA NA 27
8 2 3 2 NA NA 28
9 2 4 2 NA NA 29
10 2 5 2 NA NA 30
11 3 1 3 NA NA 36
12 3 2 3 NA NA 37
13 3 3 3 NA NA 38
14 3 4 3 NA NA 39
15 3 5 3 NA NA 40
16 4 1 4 NA NA 46
17 4 2 4 NA NA 47
18 4 3 4 NA NA 48
19 4 4 4 NA NA 49
20 4 5 4 NA NA 50
but I need v1 and v2 to be filled for the next rounds as well by matching the respective ID.
Please let me know if there is any way to do this in R (or in Python).
Thank you.

library(dplyr)
library(tidyr)
list(df1, df2, df3, df4) %>%
  bind_rows(.id = 'grp') %>%
  group_by(id) %>%
  fill(v1:v3) # fill() is from tidyr
  #fill(4:6) # alternative syntax: columns 4-6
  #fill(-c(1:3)) # alternative syntax: everything except columns 1:3
  #fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
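
Because v1 and v2 are fixed per id, another way to get the same table is to join them back from the first round instead of filling downward. The following is a minimal base R sketch using the question's data; the helper names `fixed`, `long` and `out` are made up for illustration.

```r
df1 <- data.frame(id = 1:5, round = 1, v1 = 6:10, v2 = 11:15, v3 = 16:20)
df2 <- data.frame(id = 1:5, round = 2, v3 = 26:30)
df3 <- data.frame(id = 1:5, round = 3, v3 = 36:40)
df4 <- data.frame(id = 1:5, round = 4, v3 = 46:50)

# Time-invariant columns, taken once from the first round
fixed <- df1[c("id", "v1", "v2")]

# Stack only the columns that every round has
long <- rbind(df1[c("id", "round", "v3")], df2, df3, df4)

# Attach the fixed columns to every round by id, then restore the order
out <- merge(long, fixed, by = "id")
out <- out[order(out$round, out$id), c("id", "round", "v1", "v2", "v3")]
rownames(out) <- NULL
out
```

Unlike fill(), the join does not depend on the rows being sorted so that round 1 comes first within each id.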

tidyverse: binding list elements of same dimension

Using reduce(bind_cols), list elements of the same dimension can be combined. However, I would like to know how to combine only the elements of the same dimension (perhaps with the dimension specified in some way) from a list whose elements may have different dimensions.
library(tidyverse)
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls1
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls1 %>%
reduce(bind_cols)
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
ls2 <- list(df1, df2, df3)
ls2
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
[[3]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls2 %>%
reduce(bind_cols)
Error: Can't recycle `..1` (size 10) to match `..2` (size 20).
Run `rlang::last_error()` to see where the error occurred.
Question
Looking for a function to combine all data.frames in a list with an argument of number of rows.

One option could be:
map(split(ls2, map_int(ls2, NROW)), bind_cols)
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
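
The question asks for a function that takes the number of rows as an argument. One possible base R sketch is shown below; `bind_by_nrow` is a hypothetical helper name, not a function from any package.

```r
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls2 <- list(df1, df2, df3)

# Hypothetical helper: column-bind only the data frames with exactly n rows
bind_by_nrow <- function(lst, n) {
  keep <- Filter(function(d) nrow(d) == n, lst)
  if (length(keep) == 0) return(NULL)  # nothing matches the requested size
  do.call(cbind, keep)
}

bind_by_nrow(ls2, 10)  # columns A1, A2, C
bind_by_nrow(ls2, 20)  # column B only
```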

You can use:
n <- seq_len(max(sapply(ls2, nrow)))
res <- do.call(cbind, lapply(ls2, `[`, n, , drop = FALSE))
res
# A1 A2 B C
#1 1 10 11 31
#2 2 9 12 32
#3 3 8 13 33
#4 4 7 14 34
#5 5 6 15 35
#6 6 5 16 36
#7 7 4 17 37
#8 8 3 18 38
#9 9 2 19 39
#10 10 1 20 40
#NA NA NA 21 NA
#NA.1 NA NA 22 NA
#NA.2 NA NA 23 NA
#NA.3 NA NA 24 NA
#NA.4 NA NA 25 NA
#NA.5 NA NA 26 NA
#NA.6 NA NA 27 NA
#NA.7 NA NA 28 NA
#NA.8 NA NA 29 NA
#NA.9 NA NA 30 NA
A little shorter with purrr::map_dfc:
purrr::map_dfc(ls2, `[`, n, , drop = FALSE)

We can use cbind.fill from rowr
library(rowr)
do.call(cbind.fill, c(ls2, fill = NA))

A base R option using tapply + sapply:
tapply(
  ls2,
  sapply(ls2, nrow),
  function(x) do.call(cbind, x)
)
gives
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30

You may also use if inside reduce if you want to combine only the list elements whose row count matches the first one (that is, the first item in the list has priority):
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls2 <- list(df1, df2, df3)
library(tidyverse)
reduce(ls2, ~if(nrow(.x) == nrow(.y)){bind_cols(.x, .y)} else {.x})
#> A1 A2 C
#> 1 1 10 31
#> 2 2 9 32
#> 3 3 8 33
#> 4 4 7 34
#> 5 5 6 35
#> 6 6 5 36
#> 7 7 4 37
#> 8 8 3 38
#> 9 9 2 39
#> 10 10 1 40
Created on 2021-06-09 by the reprex package (v2.0.0)

Here's another tidyverse option.
We're creating a dummy ID in each data.frame based on the row_number(), then joining all data.frames by the dummy ID, and then dropping the dummy ID.
ls2 %>%
  map(~ mutate(.x, id = row_number())) %>%
  reduce(full_join, by = "id") %>%
  select(-id)
This gives us:
A1 A2 B C
1 1 10 11 31
2 2 9 12 32
3 3 8 13 33
4 4 7 14 34
5 5 6 15 35
6 6 5 16 36
7 7 4 17 37
8 8 3 18 38
9 9 2 19 39
10 10 1 20 40
11 NA NA 21 NA
12 NA NA 22 NA
13 NA NA 23 NA
14 NA NA 24 NA
15 NA NA 25 NA
16 NA NA 26 NA
17 NA NA 27 NA
18 NA NA 28 NA
19 NA NA 29 NA
20 NA NA 30 NA

We can also use the Reduce function from base R:
lst <- list(df1, df2, df3)
# First, add an id column (the row number) to each data frame
ls2 <- lapply(lst, \(x) {
  x$id <- seq_len(nrow(x))
  x
})
# Then merge only the data frames whose row count matches the accumulated result
Reduce(function(x, y) {
  if (nrow(x) == nrow(y)) merge(x, y, by = "id") else x
}, ls2)
id A1 A2 C
1 1 1 10 31
2 2 2 9 32
3 3 3 8 33
4 4 4 7 34
5 5 5 6 35
6 6 6 5 36
7 7 7 4 37
8 8 8 3 38
9 9 9 2 39
10 10 10 1 40

Repeat the first two rows for each id two times

I would like to repeat the first two rows for each id two times. I don't know how to do that. Does anyone have a suggestion?
id <- rep(1:4,each=6)
scored <- c(12,13,NA,NA,NA,NA,14,20,NA,NA,NA,NA,23,56,NA,NA,NA,NA, 45,78,NA,NA,NA,NA)
df <- data.frame(id,scored)
df
id scored
1 1 12
2 1 13
3 1 NA
4 1 NA
5 1 NA
6 1 NA
7 2 14
8 2 20
9 2 NA
10 2 NA
11 2 NA
12 2 NA
13 3 23
14 3 56
15 3 NA
16 3 NA
17 3 NA
18 3 NA
19 4 45
20 4 78
21 4 NA
22 4 NA
23 4 NA
24 4 NA
I want it to look like:
df
id score
1 1 12
2 1 13
3 1 12
4 1 13
5 1 12
6 1 13
7 2 14
8 2 20
9 2 14
10 2 20
11 2 14
12 2 20
13 3 23
14 3 56
15 3 23
16 3 56
17 3 23
18 3 56
19 4 45
20 4 78
21 4 45
22 4 78
23 4 45
24 4 78

We can group by id and use rep() to recycle the non-NA elements of 'scored' to the length of each group:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(scored = rep(scored[!is.na(scored)], length.out = n()))
# A tibble: 24 x 2
# Groups: id [4]
# id scored
# <int> <dbl>
# 1 1 12
# 2 1 13
# 3 1 12
# 4 1 13
# 5 1 12
# 6 1 13
# 7 2 14
# 8 2 20
# 9 2 14
#10 2 20
# … with 14 more rows
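
The same recycling idea can be sketched in base R with ave(), which applies a function within each group and puts the results back in the original row order. This uses the question's data and assumes every id has at least one non-NA score.

```r
id <- rep(1:4, each = 6)
scored <- c(12, 13, NA, NA, NA, NA, 14, 20, NA, NA, NA, NA,
            23, 56, NA, NA, NA, NA, 45, 78, NA, NA, NA, NA)
df <- data.frame(id, scored)

# Within each id, recycle the non-NA scores to the length of the group
df$scored <- ave(
  df$scored, df$id,
  FUN = function(x) rep(x[!is.na(x)], length.out = length(x))
)

head(df, 6)
```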

R recode multiple variables following same rules

data=data.frame("x1"=c(1:10),
"x2"=c(1:4,4,6:10),
"x3"=c(1:3,2:5,5:7),
"x4"=c(21:30),
"x5"=c(35:44))
recode=c("x1","x2","x3")
data <- data[recode %in% c(4,5)] <- NA
I want to store a specific set of variable names; in the example above, I store x1, x2, and x3 in 'recode'. Then I want to change the values of all variables listed in recode so that any value of 4 or 5 is set to NA.

We can use replace() with lapply():
data[recode] <- lapply(data[recode], function(x) replace(x, x %in% 4:5, NA))
data
# x1 x2 x3 x4 x5
#1 1 1 1 21 35
#2 2 2 2 22 36
#3 3 3 3 23 37
#4 NA NA 2 24 38
#5 NA NA 3 25 39
#6 6 6 NA 26 40
#7 7 7 NA 27 41
#8 8 8 NA 28 42
#9 9 9 6 29 43
#10 10 10 7 30 44
Or with dplyr:
library(dplyr)
data %>%
  mutate_at(vars(recode), ~ na_if(., 4)) %>%
  mutate_at(vars(recode), ~ na_if(., 5))
# x1 x2 x3 x4 x5
#1 1 1 1 21 35
#2 2 2 2 22 36
#3 3 3 3 23 37
#4 NA NA 2 24 38
#5 NA NA 3 25 39
#6 6 6 NA 26 40
#7 7 7 NA 27 41
#8 8 8 NA 28 42
#9 9 9 6 29 43
#10 10 10 7 30 44

One dplyr possibility could be:
data %>%
  mutate_at(vars(recode), ~ replace(., . %in% 4:5, NA))
x1 x2 x3 x4 x5
1 1 1 1 21 35
2 2 2 2 22 36
3 3 3 3 23 37
4 NA NA 2 24 38
5 NA NA 3 25 39
6 6 6 NA 26 40
7 7 7 NA 27 41
8 8 8 NA 28 42
9 9 9 6 29 43
10 10 10 7 30 44
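
In current dplyr, mutate_at() is superseded by across(). A minimal sketch of the same recode, assuming dplyr 1.0.0 or later, with all_of() used to select columns from the character vector:

```r
library(dplyr)

data <- data.frame(x1 = 1:10,
                   x2 = c(1:4, 4, 6:10),
                   x3 = c(1:3, 2:5, 5:7),
                   x4 = 21:30,
                   x5 = 35:44)
recode <- c("x1", "x2", "x3")

# Set every 4 or 5 in the listed columns to NA; x4 and x5 are left alone
out <- data %>%
  mutate(across(all_of(recode), ~ replace(.x, .x %in% 4:5, NA)))

out
```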

Use Map().
data[recode] <- Map(function(x) ifelse(x %in% c(4, 5), NA, x), data[recode])
data
# x1 x2 x3 x4 x5
# 1 1 1 1 21 35
# 2 2 2 2 22 36
# 3 3 3 3 23 37
# 4 NA NA 2 24 38
# 5 NA NA 3 25 39
# 6 6 6 NA 26 40
# 7 7 7 NA 27 41
# 8 8 8 NA 28 42
# 9 9 9 6 29 43
# 10 10 10 7 30 44

Replace NA with other row value based on id

I would like to replace NA with the value from other rows based on ID.
I've found similar questions, but I haven't found a solution for my problem.
Below is part of the table:
XCODE Age Sex ResultA ResultB ResultC
1 X001 12 2 2 3 4
2 X002 23 2 4 6 66
3 X003 NA NA NA NA NA
4 X004 32 1 1 7 3
5 X005 NA NA NA NA NA
6 X001 NA NA NA NA NA
7 X002 NA NA NA NA NA
8 X003 33 1 8 7 6
9 X004 NA NA NA NA NA
10 X005 55 2 8 8 8
I have an SPSS file with over 6000 columns.
I used:
library(data.table)
setDT(dataset)[, Age:= Age[!is.na(Age)][1L] , by = XCODE]
but this works only for a single column, and I need to deal with many columns.
So how can I run the code above on all columns?

Using data.table, we can select the columns in which we want to replace the NAs:
library(data.table)
setDT(df)[, (2:ncol(df)) := lapply(.SD, function(x)
  replace(x, is.na(x), x[!is.na(x)][1])), by = XCODE]
df
# XCODE Age Sex ResultA ResultB ResultC
# 1: X001 12 2 2 3 4
# 2: X002 23 2 4 6 66
# 3: X003 33 1 8 7 6
# 4: X004 32 1 1 7 3
# 5: X005 55 2 8 8 8
# 6: X001 12 2 2 3 4
# 7: X002 23 2 4 6 66
# 8: X003 33 1 8 7 6
# 9: X004 32 1 1 7 3
#10: X005 55 2 8 8 8
Using the same logic in dplyr, we can replace NAs with the first non-NA value of the group for all columns:
library(dplyr)
df %>%
  group_by(XCODE) %>%
  mutate_all(~replace(., is.na(.), .[!is.na(.)][1]))
# XCODE Age Sex ResultA ResultB ResultC
# <fct> <int> <int> <int> <int> <int>
# 1 X001 12 2 2 3 4
# 2 X002 23 2 4 6 66
# 3 X003 33 1 8 7 6
# 4 X004 32 1 1 7 3
# 5 X005 55 2 8 8 8
# 6 X001 12 2 2 3 4
# 7 X002 23 2 4 6 66
# 8 X003 33 1 8 7 6
# 9 X004 32 1 1 7 3
#10 X005 55 2 8 8 8
Or only for selected columns:
cols <- c("Age", "Sex", "ResultA","ResultB")
df %>%
  group_by(XCODE) %>%
  mutate_at(vars(cols), ~ replace(., is.na(.), .[!is.na(.)][1]))

We can group by XCODE and use fill() to fill in the NAs with the most recent non-NA value. In this case we need to fill in both directions. Also note that since you are filling all variables, the everything() helper can be used:
library(tidyverse)
df %>%
  group_by(XCODE) %>%
  fill(everything()) %>%
  fill(everything(), .direction = 'up')
which gives,
# A tibble: 10 x 6
# Groups: XCODE [5]
XCODE Age Sex ResultA ResultB ResultC
<fct> <int> <int> <int> <int> <int>
1 X001 12 2 2 3 4
2 X001 12 2 2 3 4
3 X002 23 2 4 6 66
4 X002 23 2 4 6 66
5 X003 33 1 8 7 6
6 X003 33 1 8 7 6
7 X004 32 1 1 7 3
8 X004 32 1 1 7 3
9 X005 55 2 8 8 8
10 X005 55 2 8 8 8
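
The two fill() calls can also be written as one by using .direction = "downup". This is a minimal sketch, assuming tidyr 1.0.0 or later; the four-row `df` is hypothetical toy data in the shape of the question's table.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: each XCODE has one complete row and one empty row
df <- data.frame(
  XCODE = c("X001", "X002", "X001", "X002"),
  Age   = c(12, NA, NA, 23),
  Sex   = c(2, NA, NA, 1)
)

# Fill down first, then up, within each XCODE
out <- df %>%
  group_by(XCODE) %>%
  fill(Age, Sex, .direction = "downup") %>%
  ungroup()

out
```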
