For each person in my dataset (1 row per person), I am trying to search a set of variables (months, Jan - Jul in my example) to see whether any of them matches the value in a different variable (follow-up month). I want to create a new variable that says "Yes" or "No" depending on whether any of the month variables matches the follow-up month.
Basically I am trying to create a timeline for a follow-up visit. I have 'Have' and 'Want' data sets below.
Thank you!
HAVE:
ID  Jan  Feb  Mar  Apr  May  June  Jul  Follow-up month
1   NA   2    3    4    NA   NA    NA   4
2   NA   NA   NA   4    NA   NA    NA   6
3   1    NA   3    4    5    NA    NA   5
4   NA   NA   NA   NA   NA   6     7    9
WANT:
ID  Jan  Feb  Mar  Apr  May  June  Jul  Follow-up month  Follow_up_Status
1   NA   2    3    4    NA   NA    NA   4                Yes
2   NA   NA   NA   4    NA   NA    NA   6                No
3   1    NA   3    4    5    NA    NA   5                Yes
4   NA   NA   NA   NA   NA   6     7    9                No
Here is a version with pivoting:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    -c(ID, Follow.up_month)
  ) %>%
  group_by(ID) %>%
  mutate(Follow_up_status = ifelse(Follow.up_month %in% value, "Yes", "No")) %>%
  pivot_wider(
    names_from = name,
    values_from = value
  )
output:
ID Follow.up_month Follow_up_status Jan Feb Mar Apr May June Jul
<int> <int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 4 Yes NA 2 3 4 NA NA NA
2 2 6 No NA NA NA 4 NA NA NA
3 3 5 Yes 1 NA 3 4 5 NA NA
4 4 9 No NA NA NA NA NA 6 7
I think rowwise and c_across will work for you:
library(dplyr)
quux %>%
  rowwise() %>%
  mutate(
    Follow2_int = which(c_across(Jan:Jul) %in% Follow.up.month)[1],
    Follow2_lgl = !is.na(Follow2_int)
  ) %>%
  ungroup()
# # A tibble: 4 x 12
# ID Jan Feb Mar Apr May June Jul Follow.up.month Follow_up_Status Follow2_int Follow2_lgl
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <int> <lgl>
# 1 1 NA 2 3 4 NA NA NA 4 Yes 4 TRUE
# 2 2 NA NA NA 4 NA NA NA 6 No NA FALSE
# 3 3 1 NA 3 4 5 NA NA 5 Yes 5 TRUE
# 4 4 NA NA NA NA NA 6 7 9 No NA FALSE
Edited to include both the logical and the first column number (counting within Jan:Jul) that matches.
Data
quux <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow.up.month = c(4L, 6L, 5L, 9L), Follow_up_Status = c("Yes", "No", "Yes", "No")), class = "data.frame", row.names = c(NA, -4L))
Another dplyr solution.
library(dplyr)
dat2 <- dat %>%
  mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
  mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
  transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
  bind_cols(dat, .)
dat2
# ID Jan Feb Mar Apr May June Jul Follow_up_month Follow_up_status
# 1 1 NA 2 3 4 NA NA NA 4 Yes
# 2 2 NA NA NA 4 NA NA NA 6 No
# 3 3 1 NA 3 4 5 NA NA 5 Yes
# 4 4 NA NA NA NA NA 6 7 9 No
Data
dat <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)), class = "data.frame", row.names = c(NA, -4L))
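The rowSums idea can also be written without recoding counts. That matters if the follow-up month could match more than one month column: recode() above only maps "0" and "1", so a count of 2 would be left as "2". Below is a minimal base R sketch of that variant, assuming the same example data as dat; month_cols and n_match are illustrative names, not part of the original answer.

```r
# Same example data as dat above
dat <- data.frame(
  ID = 1:4,
  Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA),
  Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L),
  Jul = c(NA, NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)
)

# Illustrative name (an assumption, not from the answer)
month_cols <- c("Jan", "Feb", "Mar", "Apr", "May", "June", "Jul")

# Compare every month column with the follow-up month, column by column;
# comparisons against NA are dropped by na.rm = TRUE, so they count as no match
n_match <- rowSums(dat[month_cols] == dat$Follow_up_month, na.rm = TRUE)

# Any positive count means at least one month matched
dat$Follow_up_status <- ifelse(n_match > 0, "Yes", "No")
dat$Follow_up_status
```

Testing n_match > 0 instead of recoding the count keeps the result a plain Yes/No even when several months match.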
Performance
When the data frame is small, all the solutions here work. When the data frame is large, the pivoting approach and the rowwise approach may be slow. Below I compare the performance of the three solutions. The final outputs differ in data type and column order, but I compare them anyway, assuming these differences are acceptable.
Here is the setup.
library(microbenchmark)
library(dplyr)
library(tidyr)
pivot_fun <- function(x){
  x2 <- x %>%
    pivot_longer(
      -c(ID, Follow_up_month)
    ) %>%
    group_by(ID) %>%
    mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>%
    pivot_wider(
      names_from = name,
      values_from = value
    )
  return(x2)
}
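rowwise_fun <- function(x){
  # Row-wise check with c_across(), mirroring the rowwise answer above.
  # Note: the rowwise_fun timings shown below were measured with a body
  # identical to pivot_fun, so re-run the benchmark to time this version.
  x2 <- x %>%
    rowwise() %>%
    mutate(Follow_up_status = ifelse(Follow_up_month %in% c_across(Jan:Jul), "Yes", "No")) %>%
    ungroup()
  return(x2)
}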
rowSums_fun <- function(x){
  x2 <- x %>%
    mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
    mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
    transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
    bind_cols(x, .)
  return(x2)
}
Here is the comparison on the original example. The solution provided in this post is the fastest.
set.seed(1)
microbenchmark(pivot_fun(dat), rowwise_fun(dat), rowSums_fun(dat))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat) 11.037401 11.927201 13.58003 12.659001 13.882151 30.0207 100
# rowwise_fun(dat) 10.907602 11.670701 13.56004 12.295051 13.614201 24.4249 100
# rowSums_fun(dat) 6.590502 7.147702 8.48469 7.714351 8.808602 17.0109 100
And here is a comparison on a larger data frame. The solution provided in this post is about 10 times faster than other answers.
set.seed(12)
n <- 100000
dat_n <- data.frame(
ID = 1:n,
Jan = sample(dat$Jan, size = n, replace = TRUE),
Feb = sample(dat$Feb, size = n, replace = TRUE),
Mar = sample(dat$Mar, size = n, replace = TRUE),
Apr = sample(dat$Apr, size = n, replace = TRUE),
May = sample(dat$May, size = n, replace = TRUE),
June = sample(dat$June, size = n, replace = TRUE),
Jul = sample(dat$Jul, size = n, replace = TRUE),
Follow_up_month = sample(1:12, size = n, replace = TRUE)
)
set.seed(123)
microbenchmark(pivot_fun(dat_n), rowwise_fun(dat_n), rowSums_fun(dat_n))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat_n) 1168.416 1405.5724 1496.6545 1471.0253 1574.3927 2327.1624 100
# rowwise_fun(dat_n) 1159.790 1401.0586 1494.9987 1465.8929 1580.0092 1982.5099 100
# rowSums_fun(dat_n) 84.494 102.0946 122.2843 111.8158 123.6288 296.3234 100
df <- data.frame(A1 = c(6, 8, NA, 1, 5),
A2 = c(NA, NA, 9, 3, 6),
A3 = c(9, NA, 1, NA, 4),
B1 = c(NA, NA, 9, 3, 6),
B2 = c(9, NA, 1, NA, 4),
B3 = c(NA, NA, 9, 3, 6)
)
I have a dataset with multiple questionnaires that each have multiple items. I would like to replace the missing data with the row mean of the observed values for each questionnaire (missing values in A items replaced by the row mean of A1 to A3, and missing values in B items replaced by the row mean of B1 to B3). What is the best way to do that?
You may try:
library(dplyr)
df <- data.frame(A1 = c(6, 8, NA, 1, 5),
A2 = c(NA, NA, 9, 3, 6),
A3 = c(9, NA, 1, NA, 4),
B1 = c(NA, NA, 9, 3, 6),
B2 = c(9, NA, 1, NA, 4),
B3 = c(NA, NA, 9, 3, 6)
)
df1 <- df %>%
select(starts_with("A"))
df2 <- df %>%
select(starts_with("B"))
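# (row, col) positions of the missing A items, filled with that row's mean
x1 <- which(is.na(df1), arr.ind = TRUE)
df1[x1] <- rowMeans(df1, na.rm = TRUE)[x1[, 1]]
# same for the B items
x2 <- which(is.na(df2), arr.ind = TRUE)
df2[x2] <- rowMeans(df2, na.rm = TRUE)[x2[, 1]]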
df <- cbind(df1, df2)
df
A1 A2 A3 B1 B2 B3
1 6 7.5 9 9 9 9
2 8 8.0 8 NaN NaN NaN
3 5 9.0 1 9 1 9
4 1 3.0 2 3 3 3
5 5 6.0 4 6 4 6
You may use split.default to split data in different groups and replace NA with row-wise mean (taken from this answer https://stackoverflow.com/a/6918323/3962914 )
as.data.frame(lapply(split.default(df, sub('\\d+', '', names(df))), function(x) {
  k <- which(is.na(x), arr.ind = TRUE)
  x[k] <- rowMeans(x, na.rm = TRUE)[k[, 1]]
  x
})) -> result
names(result) <- names(df)
result
# A1 A2 A3 B1 B2 B3
#1 6 7.5 9 9 9 9
#2 8 8.0 8 NaN NaN NaN
#3 5 9.0 1 9 1 9
#4 1 3.0 2 3 3 3
#5 5 6.0 4 6 4 6
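Both answers above rely on the same indexing trick, so here is a small worked sketch of it on a two-row toy data frame. The toy data and the names m, k and row_means are made up for illustration. which(..., arr.ind = TRUE) returns one (row, col) pair per missing cell, and the row part of each pair selects the matching row mean.

```r
# Toy data, invented for illustration: one questionnaire with three items
m <- data.frame(A1 = c(6, NA), A2 = c(NA, 4), A3 = c(9, 2))

# One (row, col) pair per NA cell
k <- which(is.na(m), arr.ind = TRUE)

# Mean of the observed values in each row
row_means <- rowMeans(m, na.rm = TRUE)

# k[, 1] holds the row number of each NA cell, so this picks the
# row mean that belongs to each missing cell
m[k] <- row_means[k[, 1]]
m
```

A row with no observed values has a row mean of NaN, which is why row 2 of the B items stays NaN in the outputs above.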
You could also do:
library(dplyr)
library(tidyr)
df %>%
  reshape(varying = names(.), direction = 'long', sep = "") %>%
  group_by(id) %>%
  mutate(across(A:B, ~replace(.x, is.na(.x), mean(.x, na.rm = TRUE)))) %>%
  pivot_wider(id_cols = id, names_from = time, values_from = A:B, names_sep = "") %>%
  ungroup() %>%
  select(-id)
# A tibble: 5 x 6
A1 A2 A3 B1 B2 B3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 7.5 9 9 9 9
2 8 8 8 NaN NaN NaN
3 5 9 1 9 1 9
4 1 3 2 3 3 3
5 5 6 4 6 4 6
We can use split.default with na.aggregate
library(purrr)
library(zoo)
library(dplyr)
library(stringr)
map_dfc(split.default(df, str_remove(names(df), "\\d+")), ~
as_tibble(t(na.aggregate(t(.x)))))
# A tibble: 5 × 6
A1 A2 A3 B1 B2 B3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 7.5 9 9 9 9
2 8 8 8 NaN NaN NaN
3 5 9 1 9 1 9
4 1 3 2 3 3 3
5 5 6 4 6 4 6
Expand the rowMeans into a matrix with the same shape as the items and use it to replace the NAs, inside an lapply that greps the columns of each questionnaire.
do.call(cbind, lapply(c('A', 'B'), function(q) {
  s <- df[, grep(q, names(df))]
  na <- is.na(s)
  replace(s, na, rowMeans(s, na.rm=TRUE)[row(s)][na])
}))
# A1 A2 A3 B1 B2 B3
# 1 6 7.5 9 9 9 9
# 2 8 8.0 8 NaN NaN NaN
# 3 5 9.0 1 9 1 9
# 4 1 3.0 2 3 3 3
# 5 5 6.0 4 6 4 6
Data:
df <- structure(list(A1 = c(6, 8, NA, 1, 5), A2 = c(NA, NA, 9, 3, 6
), A3 = c(9, NA, 1, NA, 4), B1 = c(NA, NA, 9, 3, 6), B2 = c(9,
NA, 1, NA, 4), B3 = c(NA, NA, 9, 3, 6)), class = "data.frame", row.names = c(NA,
-5L))
I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.
The form asks for the age (AGE) and gender (SEX) of each household member, allowing up to 5 people per household. However, some users filled in information for persons 1, 3 and 4 but not for person 2: they filled out person 2 incorrectly, crossed out the details, and then entered person 2's details in the person 3 boxes, and so on.
The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
This is what the table looks like currently
(I have replaced NA with - to make reading easier)
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    -     -     15    F     M     -     -     F
3    26    -     35    15    -     M     -     M     F     -
4    47    13    -     11    -     M     F     -     F     -
5    24    57    -     -     -     M     F     -     -     -
6    -     28    26    -     -     -     F     M     -     -
But I would like it to look like this:
ref  AGE1  AGE2  AGE3  AGE4  AGE5  SEX1  SEX2  SEX3  SEX4  SEX5
1    45    -     -     -     -     M     -     -     -     -
2    36    24    15    -     -     F     M     F     -     -
3    26    35    15    -     -     M     M     F     -     -
4    47    13    11    -     -     M     F     F     -     -
5    24    57    -     -     -     M     F     -     -     -
6    28    26    -     -     -     F     M     -     -     -
Is there a way of correcting this using dplyr? If not, is there another way in R of correcting the data?
Here is a way using dplyr and tidyr. The approach involves pivoting the data to longer format, sorting the NA values to the end, renumbering the column names, and then pivoting to wide form again.
library(dplyr)
library(tidyr)
# df is the data frame defined in the question
df %>%
  mutate(across(starts_with("AGE"), as.character)) %>%  # one common type for the value column
  pivot_longer(2:11) %>%
  separate(name, into = c("cat", "num"), sep = 3) %>%   # "AGE1" -> "AGE", "1"
  arrange(is.na(value)) %>%                             # move NA values to the end
  group_by(ref, cat) %>%
  mutate(num = seq_along(value)) %>%                    # renumber within household and category
  ungroup() %>%
  arrange(cat) %>%
  unite(name, cat, num, sep = "") %>%
  pivot_wider(id_cols = ref) %>%
  mutate(across(starts_with("AGE"), as.numeric))
# A tibble: 6 x 11
ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 45 NA NA NA NA M NA NA NA NA
2 2 36 24 15 NA NA F M F NA NA
3 3 26 35 15 NA NA M M F NA NA
4 4 47 13 11 NA NA M F F NA NA
5 5 24 57 NA NA NA M F NA NA NA
6 6 28 26 NA NA NA F M NA NA NA
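Here's a way using the dplyr and tidyr libraries. Note that arrange(ref, AGE, SEX) sorts the members of each household by age, so in the output below they are reordered rather than kept in their original order.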
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -ref,
               names_to = c('.value', 'num'),
               names_pattern = '([A-Z]+)(\\d+)') %>%
  arrange(ref, AGE, SEX) %>%
  group_by(ref) %>%
  mutate(num = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = num, values_from = c(AGE, SEX))
# ref AGE_1 AGE_2 AGE_3 AGE_4 AGE_5 SEX_1 SEX_2 SEX_3 SEX_4 SEX_5
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
#1 1 45 NA NA NA NA M NA NA NA NA
#2 2 15 24 36 NA NA F M F NA NA
#3 3 15 26 35 NA NA F M M NA NA
#4 4 11 13 47 NA NA F F M NA NA
#5 5 24 57 NA NA NA M F NA NA NA
#6 6 26 28 NA NA NA M F NA NA NA
Try the base R code below
u1 <- reshape(
  setNames(df, sub("(\\d)", ".\\1", names(df))),
  direction = "long",
  idvar = "ref",
  varying = -1
)
u2 <- reshape(
  transform(
    u1[with(u1, order(is.na(AGE), is.na(SEX))), ],
    time = ave(time, ref, FUN = seq_along)
  ),
  direction = "wide",
  idvar = "ref"
)
out <- u2[match(names(df), sub("\\.", "", names(u2)))]
and you will get
> out
ref AGE.1 AGE.2 AGE.3 AGE.4 AGE.5 SEX.1 SEX.2 SEX.3 SEX.4 SEX.5
1.1 1 45 NA NA NA NA M <NA> <NA> <NA> <NA>
2.1 2 36 24 15 NA NA F M F <NA> <NA>
3.1 3 26 35 15 NA NA M M F <NA> <NA>
4.1 4 47 13 11 NA NA M F F <NA> <NA>
5.1 5 24 57 NA NA NA M F <NA> <NA> <NA>
6.2 6 28 26 NA NA NA F M <NA> <NA> <NA>
data
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
Here is a solution using package dedupewider:
library(dedupewider)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
age_moved <- na_move(df, cols = names(df)[grepl("^AGE\\d$", names(df))]) # 'right' direction is by default
sex_moved <- na_move(age_moved, cols = names(df)[grepl("^SEX\\d$", names(df))])
sex_moved
#> ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
#> 1 1 45 NA NA NA NA M <NA> <NA> NA NA
#> 2 2 36 24 15 NA NA F M F NA NA
#> 3 3 26 35 15 NA NA M M F NA NA
#> 4 4 47 13 11 NA NA M F F NA NA
#> 5 5 24 57 NA NA NA M F <NA> NA NA
#> 6 6 28 26 NA NA NA F M <NA> NA NA
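For comparison, the same "shift the non-missing values to the left" idea can be sketched in base R without extra packages. This is only a sketch under one assumption: AGE and SEX are missing together for the same person (as in the example data), so shifting each block separately keeps each person's age and sex paired. shift_left, age_cols and sex_cols are illustrative names.

```r
df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)

# Hypothetical helper: keep non-NA values in their original order, NAs last
shift_left <- function(x) c(x[!is.na(x)], x[is.na(x)])

age_cols <- paste0("AGE", 1:5)
sex_cols <- paste0("SEX", 1:5)

# apply() over rows returns one column per row, hence the t()
df[age_cols] <- t(apply(df[age_cols], 1, shift_left))
df[sex_cols] <- t(apply(df[sex_cols], 1, shift_left))
df
```

If a person could have an age but no sex recorded (or the reverse), the two blocks would drift out of step, and one of the pivoting approaches that treats AGE and SEX together would be safer.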
Given the data frame:
df1 <- data.frame(Company = c('A','B','C','D','E'),
`X1980` = c(NA, 5, 3, 8, 13),
`X1981` = c(NA, 12, NA, 11, 29),
`X1982` = c(33, NA, NA, 41, 42),
`X1983` = c(45, 47, 53, NA, 55))
I would like to create a new data frame where each value is replaced by the sum of the current value and the previous value of the row. NAs should be kept as they are.
This should result in the following data frame:
Company 1980 1981 1982 1983
A NA NA 33 78
B 5 17 NA 47
C 3 NA NA 53
D 8 19 60 NA
E 13 42 84 139
Here is a tidyverse approach
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
  pivot_longer(matches("\\d{4}$")) %>%
  group_by(Company) %>%
  mutate(value = accumulate(value, ~if (is.na(out <- .x + .y)) .y else out)) %>%
  pivot_wider()
Output
# A tibble: 5 x 5
# Groups: Company [5]
Company X1980 X1981 X1982 X1983
<chr> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 33 78
2 B 5 17 NA 47
3 C 3 NA NA 53
4 D 8 19 60 NA
5 E 13 42 84 139
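The accumulate step carries a running total along each row and restarts it whenever the sum becomes NA. A minimal base R sketch of the same rule, using Reduce(accumulate = TRUE) in place of purrr::accumulate; carry_sum and year_cols are illustrative names, not part of the answer above.

```r
df1 <- data.frame(Company = c('A', 'B', 'C', 'D', 'E'),
                  X1980 = c(NA, 5, 3, 8, 13),
                  X1981 = c(NA, 12, NA, 11, 29),
                  X1982 = c(33, NA, NA, 41, 42),
                  X1983 = c(45, 47, 53, NA, 55))

# Hypothetical helper: add the current value to the running total;
# if that sum is NA, fall back to the current value (which may itself be NA)
carry_sum <- function(prev, cur) {
  s <- prev + cur
  if (is.na(s)) cur else s
}

year_cols <- grep("^X\\d{4}$", names(df1), value = TRUE)

# Apply the running rule along each row; t() restores rows x years
df1[year_cols] <- t(apply(df1[year_cols], 1, function(r) {
  Reduce(carry_sum, r, accumulate = TRUE)
}))
df1
```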
I have a data frame with NA values. I want to replace these NAs with a sequence between the values before and after the NAs.
Consider the following example:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
# x1 x2 x3
# 5 NA 10
# NA 2 NA
# NA NA 15
# 10 -10 NA
# NA NA 20
The NAs between two values should be replaced with a sequence. NAs at the beginning or the end should remain NA:
# Expected output
# x1 x2 x3
# 5 NA 10
# 6.666667 2 12.5
# 8.333333 -4 15
# 10 -10 17.5
# NA NA 20
How could I replace NAs between two values in an automated way?
The na.approx function in zoo does this interpolation very easily.
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
df
#> x1 x2 x3
#> 1 5 NA 10
#> 2 NA 2 NA
#> 3 NA NA 15
#> 4 10 -10 NA
#> 5 NA NA 20
zoo::na.approx(df)
#> x1 x2 x3
#> [1,] 5.000000 NA 10.0
#> [2,] 6.666667 2 12.5
#> [3,] 8.333333 -4 15.0
#> [4,] 10.000000 -10 17.5
#> [5,] NA NA 20.0
Created on 2019-02-10 by the reprex package (v0.2.0).
Here is a solution with imputeTS package:
# Example data
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
x2 = c(NA, 2, NA, - 10, NA),
x3 = c(10, NA, 15, NA, 20))
library("imputeTS")
na.interpolation(df, option = "linear")
For imputeTS::na.interpolation you can choose a different interpolation method via the parameter option (option = "spline" or option = "stine"). In imputeTS 3.0 and later the function is named na_interpolation.
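If neither package is available, the linear case can be sketched with stats::approx from base R. With rule = 1, approx returns NA outside the range of the observed points, which leaves leading and trailing NAs untouched as requested in the question. interp_col is an illustrative name, and the sketch assumes equally spaced rows.

```r
df <- data.frame(x1 = c(5, NA, NA, 10, NA),
                 x2 = c(NA, 2, NA, -10, NA),
                 x3 = c(10, NA, 15, NA, 20))

# Hypothetical helper: linear interpolation over the row index
interp_col <- function(y) {
  idx <- seq_along(y)
  ok <- !is.na(y)
  # approx() needs at least two observed points; otherwise leave the column as is
  if (sum(ok) < 2) return(y)
  # rule = 1 gives NA outside the observed range, so edge NAs stay NA
  approx(idx[ok], y[ok], xout = idx, rule = 1)$y
}

# df[] keeps the data frame structure while replacing every column
df[] <- lapply(df, interp_col)
df
```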