How to fill in values on stepped data hierarchy in R

Is there an elegant/tidy way to fill in the data if there are non-null values to the right? I have a wonky work-around but wanted to know if there was a nice dplyr way to do this.
actual <-
tibble(
a = c("A", NA, NA, NA, NA, NA, NA, "B", NA, NA, NA),
b = c(NA, "A", NA, NA, NA, "C", NA, NA, "E", NA, NA),
c = c(NA, NA, "B", NA, NA, NA, "D", NA, NA, "F", "G"),
d = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)
desired <-
tibble(
w = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
x = c(NA, "A", "A", "A", "A", "C", "C", NA, "E", "E", "E"),
y = c(NA, NA, "B", "B", "B", NA, "D", NA, NA, "F", "G"),
z = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)

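We can use fill() from tidyr together with group_by() from dplyr: fill each column in turn, grouping by the column just filled so that values do not carry over into the next block.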
library(dplyr)
library(tidyr)
dat <- actual %>%
fill(a) %>%
group_by(a) %>%
fill(b) %>%
group_by(b) %>%
fill(c) %>%
group_by(c) %>%
fill(d) %>%
ungroup()
print(dat)
# # A tibble: 11 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 A NA NA NA
# 2 A A NA NA
# 3 A A B NA
# 4 A A B C
# 5 A A B D
# 6 A C NA NA
# 7 A C D NA
# 8 B NA NA NA
# 9 B E NA NA
# 10 B E F NA
# 11 B E G NA
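One caveat with the chain above: each group_by() call replaces the previous grouping, so if the same value of b ever appeared under two different values of a, the fill of c could carry across them. That does not happen with this data. A minimal sketch of a stricter variant, assuming dplyr 1.0.0 or later for the .add argument, accumulates the grouping columns instead (the name filled is only illustrative):

```r
library(dplyr)
library(tidyr)

actual <- tibble(
  a = c("A", NA, NA, NA, NA, NA, NA, "B", NA, NA, NA),
  b = c(NA, "A", NA, NA, NA, "C", NA, NA, "E", NA, NA),
  c = c(NA, NA, "B", NA, NA, NA, "D", NA, NA, "F", "G"),
  d = c(NA, NA, NA, "C", "D", NA, NA, NA, NA, NA, NA)
)

# Fill each level only within the combination of all levels to its left.
filled <- actual %>%
  fill(a) %>%
  group_by(a) %>%
  fill(b) %>%
  group_by(b, .add = TRUE) %>%   # now grouped by a and b
  fill(c) %>%
  group_by(c, .add = TRUE) %>%   # now grouped by a, b and c
  fill(d) %>%
  ungroup()

print(filled)
```

On the example data this should give the same values as the shorter chain; the difference only matters when labels repeat across parent levels.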

Related

How to merge rows with duplicate ID, replacing NAs with data in the other row, and leading with data present in both duplicate rows?

I have a df like this:
data <- tribble(~id, ~othervar, ~it_1, ~it_2, ~it_3, ~it_4, ~it_5, ~it_6,
"k01", "lum", "a", "b", "c", "a", NA, NA,
"k01", "lum", NA, NA, NA, NA, "a", "d",
"k02", "nar", "a", "b", "c", "b", NA, NA,
"k03", "lum", "a", "b", "a", "c", NA, NA,
"k03", "lum", "b", "b", "a", NA, "d", "e")
I want to merge rows with duplicated IDs into a single row, where NAs are replaced with the information available in the other row. Where both rows have a non-NA value, either one can be kept. I've tried pivoting the table, but I don't know how to handle this.
I expect something like this:
id othervar it_1 it_2 it_3 it_4 it_5 it_6
k01 lum a b c a a d
k02 nar a b c b NA NA
k03 lum a b a c d e
With ifelse and summarise:
library(dplyr)
data %>%
group_by(id) %>%
summarise(across(everything(), ~ ifelse(any(complete.cases(.x)),
first(.x[!is.na(.x)]),
NA)))
# A tibble: 3 × 8
id othervar it_1 it_2 it_3 it_4 it_5 it_6
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 k01 lum a b c a a d
2 k02 nar a b c b NA NA
3 k03 lum a b a c d e
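Without ifelse, using a pipe chain (note that coalesce() with a single argument returns .x unchanged; the two subsetting steps pick the first non-NA value):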
data %>%
group_by(id) %>%
summarise(across(everything(),
~coalesce(.x) %>%
`[`(!is.na(.)) %>%
`[`(1) ))
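Both answers above boil down to "take the first non-missing value of each column within each id". A minimal sketch of that idea with plain subsetting, relying on the base R rule that indexing past the end of a vector returns NA (the name merged is only illustrative):

```r
library(dplyr)

data <- tibble::tribble(
  ~id,   ~othervar, ~it_1, ~it_2, ~it_3, ~it_4, ~it_5, ~it_6,
  "k01", "lum",     "a",   "b",   "c",   "a",   NA,    NA,
  "k01", "lum",     NA,    NA,    NA,    NA,    "a",   "d",
  "k02", "nar",     "a",   "b",   "c",   "b",   NA,    NA,
  "k03", "lum",     "a",   "b",   "a",   "c",   NA,    NA,
  "k03", "lum",     "b",   "b",   "a",   NA,    "d",   "e"
)

# Drop the NAs of each column within a group, then keep the first remaining
# value; an all-NA column yields NA because element 1 of an empty vector is NA.
merged <- data %>%
  group_by(id) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

print(merged)
```

Where both duplicate rows hold a value, this keeps the one from the earlier row.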

Based on common values on one column, assign same values in another column

Absolute newbie to R.
I have a dataframe with repeated values in one column (C1), but only one of the corresponding rows has a value in the other column (C2). I want to copy that value into all of the empty/NA cells in C2 that share the same value in C1.
This would make more sense:
df:
C1 C2
A NA
A val10
A NA
B val14
B NA
B NA
B NA
C NA
C val9
What I wanted it to look like is
C1 C2
A val10
A val10
A val10
B val14
B val14
B val14
B val14
C val9
C val9
(C2 and C1 don't have any particular pattern or sequence between each other)
I'm assuming I would do a group_by on C1, but I'm a bit confused about how to copy the values, whether with transmute/mutate or paste. I tried a few iterations but wasn't successful.
You can use the fill function from tidyr, which makes it really easy to take care of the NAs.
library(tidyr)
library(dplyr)
df %>%
dplyr::group_by(C1) %>%
tidyr::fill(C2) %>% #default direction down
tidyr::fill(C2, .direction = "up")
Output
# A tibble: 9 × 2
# Groups: C1 [3]
C1 C2
<chr> <chr>
1 A val10
2 A val10
3 A val10
4 B val14
5 B val14
6 B val14
7 B val14
8 C val9
9 C val9
Data
df <- structure(list(C1 = c("A", "A", "A", "B", "B", "B", "B", "C",
"C"), C2 = c(NA, "val10", NA, "val14", NA, NA, NA, NA, "val9"
)), class = "data.frame", row.names = c(NA, -9L))
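The two fill() calls above can also be collapsed into one. A short sketch, assuming a tidyr version that supports the "downup" direction (1.0.0 or later); the name filled is only illustrative:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  C1 = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  C2 = c(NA, "val10", NA, "val14", NA, NA, NA, NA, "val9"),
  stringsAsFactors = FALSE
)

# Within each C1 group, fill downwards first and then upwards,
# so every row picks up the group's single non-NA value.
filled <- df %>%
  group_by(C1) %>%
  fill(C2, .direction = "downup") %>%
  ungroup()

print(filled)
```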
I doubt this is the most elegant solution, but a Tidyverse-style method could be:
df <- tibble::tribble(
~C1, ~C2,
"A", NA,
"A", "val10",
"A", NA,
"B", "val14",
"B", NA,
"B", NA,
"B", NA,
"C", NA,
"C", "val9"
)
df %>%
filter(!is.na(C2)) %>%
rename(C3 = C2) %>%
right_join(df, by = "C1") %>%
select(-C2) %>%
rename(C2 = C3)
Which gives you C2 filled in for every row, matching the desired output above.

check if subset of rows is NA then move adjacent rows to replace them

I have a dataframe that's a result of combining multiple sheets from excel. The columns did not align properly. I need to check if a subset of rows is all NA. If they are NA, then I need to check if the adjacent equally sized subset has content, and if it does, I need to copy over that row to replace the NAs.
This is what the data looks like from my dput:
structure(list(id = 1:20, A = c(NA, NA, NA, NA, NA, "c", "d",
"q", "p", "m", NA, NA, NA, NA, NA, "k", "o", "i", "a", "b"),
B = c(NA, NA, NA, NA, NA, "h", "a", "f", "b", "e", NA, NA,
NA, NA, NA, "m", "c", "s", "g", "p"), C = c(NA, NA, NA, NA,
NA, "a", "f", "j", "s", "g", NA, NA, NA, NA, NA, "l", "m",
"o", "k", "t"), D = c(NA, NA, NA, NA, NA, "n", "r", "l",
"h", "g", NA, NA, NA, NA, NA, "j", "p", "f", "d", "q"), E = c("j",
"p", "n", "i", "g", NA, NA, NA, NA, NA, "k", "e", "s", "m",
"l", NA, NA, NA, NA, NA), F = c("o", "d", "r", "q", "a",
NA, NA, NA, NA, NA, "h", "s", "f", "j", "k", NA, NA, NA,
NA, NA), G = c("f", "c", "a", "l", "m", NA, NA, NA, NA, NA,
"n", "t", "s", "e", "r", NA, NA, NA, NA, NA), H = c("r",
"c", "h", "i", "j", NA, NA, NA, NA, NA, "f", "e", "b", "l",
"n", NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = "data.frame")
If each row has the same number of non-missing values, as in the shared example, you can simply drop the NA values in each row.
df1 <- as.data.frame(t(apply(df, 1, na.omit)))
# V1 V2 V3 V4 V5
#1 1 j o f r
#2 2 p d c c
#3 3 n r a h
#4 4 i q l i
#5 5 g a m j
#6 6 c h a n
#7 7 d a f r
#8 8 q f j l
#9 9 p b s h
#10 10 m e g g
#11 11 k h n f
#12 12 e s t e
#13 13 s f s b
#14 14 m j e l
#15 15 l k r n
#16 16 k m l j
#17 17 o c m p
#18 18 i s o f
#19 19 a g k d
#20 20 b p t q
To check whether the first half of the columns is all NA in a row and, if so, take the second half instead, we can do:
cbind(df[1], t(apply(df[-1], 1, function(x) {
x1 <- (length(x)/2)
if(all(is.na(x[1:x1]))) x[(x1+1):length(x)]
else x[1:x1]
})))
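The row-wise check above can also be written without apply(). The following is a sketch on a small made-up data frame (the values and the names left, right, left_empty and result are hypothetical, not the asker's data), assuming the misaligned block always sits in the second half of the non-id columns:

```r
# Toy data: rows 1 and 3 have their values shifted into columns E and F.
df <- data.frame(
  id = 1:4,
  A = c(NA, "c", NA, "k"),
  B = c(NA, "h", NA, "m"),
  E = c("j", NA, "k", NA),
  F = c("o", NA, "h", NA),
  stringsAsFactors = FALSE
)

half  <- (ncol(df) - 1) / 2                     # width of each block
left  <- as.matrix(df[, 2:(1 + half)])          # columns A, B
right <- as.matrix(df[, (2 + half):ncol(df)])   # columns E, F

# Rows whose left block is entirely NA take the right block instead.
left_empty <- rowSums(!is.na(left)) == 0
left[left_empty, ] <- right[left_empty, ]

result <- cbind(df["id"], as.data.frame(left, stringsAsFactors = FALSE))
print(result)
```

Working on matrices keeps the replacement positional, so the differing column names of the two blocks do not matter.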

Create date range based on sparse variable by group in R

I have sparse data which has a score taken at periodic intervals and a measurement taken at more regular intervals for multiple subjects, along with corresponding dates. I would like to generate date ranges based on the score dates for each subject ID, i.e. starting at the score date and ending at the next score date (or starting/ending at the first/last subject observation if the score doesn't fall on those dates).
I would then like to average the measurement variable within these date ranges. The averaging step should be straightforward but I am stuck on generating the date ranges.
Below is a sample of the data and an example of how I would envision the resulting data
sample data:
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "D", "D"), date = c("1/21/2020", "1/27/2020",
"2/1/2020", "2/3/2020", "2/5/2020", "2/6/2020", "2/8/2020", "2/9/2020",
"2/11/2020", "2/12/2020", "2/13/2020", "2/15/2020", "2/18/2020",
"2/20/2020", "2/21/2020", "2/22/2020", "2/25/2020", "2/1/2020",
"2/5/2020", "2/7/2020", "2/8/2020", "2/11/2020", "2/12/2020",
"1/30/2020", "2/10/2020", "2/11/2020", "2/6/2020", "2/7/2020",
"2/8/2020", "2/9/2020", "2/11/2020", "2/13/2020", "2/14/2020",
"2/16/2020", "2/17/2020", "2/20/2020", "2/23/2020", "2/26/2020",
"3/1/2020", "3/3/2020", "3/5/2020"), score = c(0.5, 2, NA, NA,
3, NA, NA, NA, NA, NA, 2.5, NA, NA, 1.5, NA, NA, NA, 3, NA, NA,
2.5, NA, 1, 0.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14,
NA, NA, 11.5, NA, 9.5, NA), measure = c(0.394160734, 0.722462998,
0.82984815, 0.738432745, 0.321792398, 0.167492308, 0.218020898,
0.929210786, 0.686818585, 0.939678073, 0.708172942, 0.299863884,
0.48216267, 0.290307369, 0.801947902, 0.579418467, 0.78101844,
0.219494852, 0.875129822, 0.517971003, 0.475625007, 0.723003744,
0.257473477, 0.629818537, 0.817369151, 0.628573413, 0.364660834,
0.5971024, 0.002274261, 0.318937617, 0.983917106, 0.685933928,
0.487922831, 0.151769304, 0.392413694, 0.012429414, 0.149627658,
0.011724992, 0.536998203, 0.798399999, 0.763353822)), class = "data.frame", row.names = c(NA,
-41L))
answer data:
structure(list(ID = c("A", "A", "A"), startDate = c("1/21/2020",
"1/27/2020", "2/5/2020"), endDate = c("1/27/2020", "2/5/2020",
"2/13/2020"), score = c(0.5, 2, 3), measure = c(0.394160734,
0.763581298, 0.543835508)), class = "data.frame", row.names = c(NA,
-3L))
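Here's a way with dplyr: a new group starts at every non-NA score (cumsum(!is.na(score))), and each group's end date is the start date of the next group within the same ID: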
library(dplyr)
df %>%
group_by(ID, grp = cumsum(!is.na(score))) %>%
summarise(start_date = first(date),
score = first(score),
measure = mean(measure)) %>%
mutate(end_date = lead(start_date, default = last(start_date))) %>%
select(-grp)
# ID start_date score measure end_date
# <chr> <chr> <dbl> <dbl> <chr>
# 1 A 1/21/2020 0.5 0.394 1/27/2020
# 2 A 1/27/2020 2 0.764 2/5/2020
# 3 A 2/5/2020 3 0.544 2/13/2020
# 4 A 2/13/2020 2.5 0.497 2/20/2020
# 5 A 2/20/2020 1.5 0.613 2/20/2020
# 6 B 2/1/2020 3 0.538 2/8/2020
# 7 B 2/8/2020 2.5 0.599 2/12/2020
# 8 B 2/12/2020 1 0.257 2/12/2020
# 9 C 1/30/2020 0.5 0.692 1/30/2020
#10 D 2/6/2020 NA 0.449 2/17/2020
#11 D 2/17/2020 14 0.185 2/26/2020
#12 D 2/26/2020 11.5 0.274 3/3/2020
#13 D 3/3/2020 9.5 0.781 3/3/2020
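The grouping key is the part that does the work here, so a tiny illustration may help. The first vector is the first six scores of subject A from the sample data; the second is a shortened, made-up version of subject D's pattern of leading NAs:

```r
# Each non-NA score bumps the counter, so the NAs that follow it
# inherit the same group number.
score <- c(0.5, 2, NA, NA, 3, NA)
grp <- cumsum(!is.na(score))
print(grp)
# 1 2 2 2 3 3

# Leading NAs come before any score, so they share a group of their own
# (0 here), which is why subject D gets a first range with score NA.
leading_na <- c(NA, NA, 14, NA)
grp_leading <- cumsum(!is.na(leading_na))
print(grp_leading)
# 0 0 1 1
```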
Using data.table (the shift is done by ID so that end dates do not leak from one subject into the next):
library(data.table)
setDT(df)[, .(start_date = first(date),
score = first(score),
measure = mean(measure)),
by = .(ID, grp = cumsum(!is.na(score)))
][, end_date := shift(start_date, type = 'lead', fill = last(start_date)), by = ID
][, grp := NULL][]

Remove NAs after pivot_wider to match up rows

I spread a column using pivot_wider so I could compare two groups (var1 vs var2) using an xy plot. But I can't compare them because there is a corresponding NA in the column.
Here is an example dataframe:
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"), var1 = c(3, NA, 1, NA, 2, NA),
var2 = c(NA, 2, NA, 4, NA, 8))
I would like it to look like:
df2 <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
You can use summarize. But this treats the symptom not the cause. You may have a column in id_cols which is one-to-one with your variable in values_from.
library(dplyr)
df %>%
group_by(group) %>%
summarize_all(sum, na.rm = TRUE)
# A tibble: 3 x 3
group var1 var2
<fct> <dbl> <dbl>
1 a 3 2
2 b 1 4
3 c 2 8
This solution is a bit more robust, with a slightly more general data.frame to begin with:
df <- data.frame(col_1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
col_2 = c(1, 3, NA, NA, NA, NA, 4, NA, NA),
col_3 = c(NA, NA, 2, 5, NA, NA, NA, 5, NA),
col_4 = c(NA, NA, NA, NA, 5, 6, NA, NA, 7))
df %>% dplyr::group_by(col_1) %>%
dplyr::summarise_all(purrr::discard, is.na)
Here is a way to do it, assuming each group has only two rows and each variable has exactly one non-NA value per group:
library(dplyr)
df %>% group_by(group) %>%
summarise(var1=max(var1,na.rm=TRUE),
var2=max(var2,na.rm=TRUE))
With na.rm=TRUE the NAs are ignored, so max() is taken over the single non-NA value in each group.
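The first answer's remark about id_cols points at the likely cause rather than the symptom. A minimal sketch of how that can happen, using a hypothetical long table (the column row_id and the names long, staggered and aligned are made up for illustration): a column that is unique per row gets treated as an identifier, so each value lands on its own output row.

```r
library(tidyr)

long <- tibble::tibble(
  row_id = 1:6,                                  # unique per row
  group  = rep(c("a", "b", "c"), each = 2),
  name   = rep(c("var1", "var2"), times = 3),
  value  = c(3, 2, 1, 4, 2, 8)
)

# row_id is picked up as an id column by default: 6 rows, staggered NAs.
staggered <- pivot_wider(long, names_from = name, values_from = value)

# Restricting the identifiers to group aligns the values: 3 rows, no NAs.
aligned <- pivot_wider(long, id_cols = group,
                       names_from = name, values_from = value)

print(staggered)
print(aligned)
```

If the original long data is still available, fixing the pivot this way avoids having to collapse the NAs afterwards.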
