R Fill in missing rows - r

I have a question similar to this one: Fill in missing rows in R
However, the gaps I need to fill are not only months, but also missing years in between for one ID. This is an example:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 3 1 10
4 A 3 2 9
5 A 3 3 8
6 B 2 1 7
7 B 2 2 6
8 B 2 3 5
9 B 3 3 4
And this is what I want it to look like:
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 1 3 0
4 A 2 1 0
5 A 2 2 0
6 A 2 3 0
7 A 3 1 10
8 A 3 2 9
9 A 3 3 8
10 B 2 1 7
11 B 2 2 6
12 B 2 3 5
13 B 3 1 0
14 B 3 2 0
15 B 3 3 4
Does anyone have an idea how to solve this? I have already tried the solutions from the question linked above.

library(tidyverse)
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
df %>%
complete(ID, A, B, fill = list(Var1 = 0))
#> # A tibble: 18 x 4
#> ID A B Var1
#> <chr> <int> <int> <dbl>
#> 1 A 1 1 12
#> 2 A 1 2 11
#> 3 A 1 3 0
#> 4 A 2 1 0
#> 5 A 2 2 0
#> 6 A 2 3 0
#> 7 A 3 1 10
#> 8 A 3 2 9
#> 9 A 3 3 8
#> 10 B 1 1 0
#> 11 B 1 2 0
#> 12 B 1 3 0
#> 13 B 2 1 7
#> 14 B 2 2 6
#> 15 B 2 3 5
#> 16 B 3 1 0
#> 17 B 3 2 0
#> 18 B 3 3 4
Created on 2021-03-03 by the reprex package (v1.0.0)

You could use the solution described there altering it slightly for your problem.
df
full <- with(df, unique(expand.grid(ID = ID, A = A, B = B)))
complete <- merge(df, full, by = c('ID', 'A', 'B'), all.y = TRUE)
complete$Var1[is.na(complete$Var1)] <- 0

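Just in case somebody else has the same question, this is what I came up with, thanks to the answers provided:
library(tidyverse)
df %>% group_by(ID) %>% complete(A = full_seq(A, 1), B, fill = list(Var1 = 0))
Because the data is grouped by ID, A is only filled between each ID's own minimum and maximum, so no rows are created for combinations an ID never covers (such as A = 1 for ID B). The grouping column ID is not listed inside complete(); recent tidyr versions reject completing on a grouping column.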
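For readers without the tidyverse, the same per-ID completion can be sketched in base R. This is only a sketch and not the method from the answers above: the helper name fill_id is made up for illustration, and it assumes A is an integer sequence with step 1 and that each ID should get every value of B observed for that ID.

```r
df <- data.frame(
  ID   = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  A    = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L),
  B    = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L),
  Var1 = 12:4
)

# fill_id() is a hypothetical helper: it completes the rows of a single ID
fill_id <- function(d) {
  grid <- expand.grid(
    ID = d$ID[1],
    A  = seq(min(d$A), max(d$A)),  # fill gaps in A, but only within this ID's range
    B  = sort(unique(d$B)),
    stringsAsFactors = FALSE
  )
  out <- merge(grid, d, by = c("ID", "A", "B"), all.x = TRUE)
  out$Var1[is.na(out$Var1)] <- 0L  # rows that did not exist before get 0
  out[order(out$A, out$B), ]
}

filled <- do.call(rbind, lapply(split(df, df$ID), fill_id))
rownames(filled) <- NULL
filled
```

With the example data this gives 15 rows: nine for ID A (A = 1 to 3) and six for ID B (A = 2 to 3).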

Related

Replace NA values in dataframe with variables in the next column (R)

I am new and still trying to learn R, and I could not find the answer I am looking for in any other thread.
I have a dataset with (for simplicity) 5 columns. Columns 1, 2, and 4 always have values, but in some rows column 3 doesn't. Below is an example:
Current
A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6
I want to make it so that the NA's are replaced by the value in column D, and then the value in col E is shifted to D, etc.
Desired output:
A B C D E
1 1 2 3 NA
1 2 4 5 NA
1 2 3 4 NA
1 3 9 7 NA
1 2 5 6 NA
I tried code from several other Stack Overflow threads, but none of it achieved what I wanted.
na.omit removes the whole row. Any help is greatly appreciated.
Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))
Code
library(dplyr)
data %>%
mutate(
aux = C,
C = if_else(is.na(aux),D,C),
D = if_else(is.na(aux),E,D),
E = NA
) %>%
select(-aux)
Output
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Replacement operation all in one go:
dat[is.na(dat$C), c("C","D","E")] <- c(dat[is.na(dat$C), c("D","E")], NA)
dat
# A B C D E
#1 1 1 2 3 NA
#2 1 2 4 5 NA
#3 1 2 3 4 NA
#4 1 3 9 7 NA
#5 1 2 5 6 NA
Where dat was:
dat <- read.table(text="A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6", fill=TRUE, header=TRUE)
Using shift_row_values
library(hacksaw)
shift_row_values(df1)
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))
A base R universal approach using order without prior knowledge of NA positions.
setNames(data.frame(t(apply(data, 1, function(x)
x[order(is.na(x))]))), colnames(data))
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
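To see why ordering by is.na() shifts values to the left, it may help to look at a single row. The vector x below is just the second row of the example data written out by hand:

```r
x <- c(A = 1L, B = 2L, C = NA, D = 4L, E = 5L)  # one row of the example data

is.na(x)         # FALSE FALSE TRUE FALSE FALSE
order(is.na(x))  # FALSE sorts before TRUE, ties keep their original order
shifted <- unname(x[order(is.na(x))])
shifted          # non-missing values first, NA moved to the end
```

Note that apply() converts the data frame to a matrix first, so this approach is best suited to data frames whose columns all share one type.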
Using dplyr
library(dplyr)
t(data) %>%
data.frame() %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
t() %>%
as_tibble()
# A tibble: 5 × 5
A B C D E
<int> <int> <int> <int> <int>
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))

How to merge datasets with repeated measures

I have three datasets that I want to merge/join.
These examples include only the first few participants; I have 25 in total.
df1
ID Grup pretest
1 1 A 2
2 1 A 1
3 1 A 3
4 2 B NA
5 2 B 1
6 2 B 3
7 3 A 2
8 3 A 1
9 3 A NA
10 4 B 2
11 4 B 1
12 4 B 3
df2 (this is missing one ID (5))
ID Grup posttest
1 1 A NA
2 1 A 5
3 1 A 4
4 2 B 2
5 2 B 4
6 2 B 3
7 3 A 5
8 3 A 6
9 3 A 3
10 6 B 4
11 6 B 2
12 6 B NA
Updated
df3 (this has 5 measurements per ID)
ID Grup traning
1 1 A 2
2 1 A 6
3 1 A 3
4 1 A NA
5 1 A 1
6 2 B 3
7 2 B 4
8 2 B 1
9 2 B NA
10 2 B 2
11 3 A 1
12 3 A 3
I’ve been trying merge() and full_join() but both end up creating duplicates that I don’t want.
It won’t recognize the ID as an independent value, it’s creating 9 IDs for every ID value.
New <- merge(df1, df2, by= 'ID')
New <- full_join(df1, df2, by = "ID")
Setting all = TRUE doesn’t help.
I need the dataset to look like this
ID Grup pretest posttest traning
1 1 A 2 NA 3
2 1 A 1 5 4
3 1 A 3 4 4
4 1 A NA NA 4
5 1 A NA NA 3
6 2 B 3 3 NA
7 2 B 2 5 3
8 2 B NA 6 2
9 2 B NA NA 5
10 2 B NA NA 4
11 3 A 1 2 3
12 3 A 3 3 4
Since you are relying on the order of the frames, you can simply use cbind()
cbind(df1, df2[, 3, drop = FALSE])
Output:
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
You can add a helper column iid to separate the entries.
df1 <- cbind(iid = 1:nrow(df1), df1)
df2 <- cbind(iid = 1:nrow(df2), df2)
With dplyr
library(dplyr)
left_join(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
With base R merge
merge(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 4 B 2 4
3 4 B 1 2
4 4 B 3 NA
5 1 A 1 5
6 1 A 3 4
7 2 B NA 2
8 2 B 1 4
9 2 B 3 3
10 3 A 2 5
11 3 A 1 6
12 3 A NA 3
Data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), pretest = c(2L, 1L, 3L, NA, 1L, 3L, 2L, 1L, NA,
2L, 1L, 3L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12"))
df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), posttest = c(NA, 5L, 4L, 2L, 4L, 3L, 5L, 6L,
3L, 4L, 2L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Another option is joining by row names, i.e. row numbers:
library(tibble)
library(dplyr)
left_join(rownames_to_column(df1), df2 %>% rownames_to_column() , by="rowname") %>%
select(ID = ID.x, Grup = Grup.x, pretest, posttest)
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
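The answers above rely on both frames having the same number of rows in the same order. When the frames have different numbers of measurements per ID (as with df3 in the question), one possible approach is to number the measurements within each ID and join on that number as well. The sketch below uses base R only; the column name obs is a made-up helper, and the data is a shortened version of df1 and df3 from the question.

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L, 2L),
                  Grup = c("A", "A", "A", "B", "B", "B"),
                  pretest = c(2L, 1L, 3L, NA, 1L, 3L))
df3 <- data.frame(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                  Grup = rep(c("A", "B"), each = 5),
                  traning = c(2L, 6L, 3L, NA, 1L, 3L, 4L, 1L, NA, 2L))

# "obs" is a hypothetical helper column: 1, 2, 3, ... within each ID
df1$obs <- ave(seq_along(df1$ID), df1$ID, FUN = seq_along)
df3$obs <- ave(seq_along(df3$ID), df3$ID, FUN = seq_along)

# joining on ID, Grup and obs pairs the k-th measurement with the k-th measurement
merged <- merge(df1, df3, by = c("ID", "Grup", "obs"), all = TRUE)
merged <- merged[order(merged$ID, merged$obs), ]
merged
```

Because each (ID, obs) pair is unique in both frames, the join cannot multiply rows; IDs with fewer pretest measurements simply get NA for the missing ones.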

Merge columns of dataframe with all combinations of variables

"w" "n"
"1" 2 1
"2" 3 1
"3" 4 1
"4" 2 1
"5" 5 1
"6" 6 1
"7" 3 2
"8" 7 2
I tried the following command, but it did not change the data the way I expected.
w2 <- w1 %>%
expand(w,n)
My output should look like this
w n
2 1
2 2
3 1
3 2
4 1
4 2
5 1
5 2
6 1
6 2
7 1
7 2
data
w1 <- structure(list(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L,
2L, 1L, 1L, 1L, 2L)), .Names = c("w", "n"), row.names = c(NA,
-7L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L, 2L, 1L,
1L, 1L, 2L), .rows = list(1L, 2L, 3L, 4L, 5L, 6L, 7L)), .Names = c("w",
"n", ".rows"), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE))
The issue is that your data frame is grouped, so expand() works within each group. Consider:
w1 %>%
ungroup() %>%
expand(w, n)
Output:
# A tibble: 12 x 2
w n
<int> <int>
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
We can use complete from tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
distinct(w, .keep_all = TRUE) %>%
complete(w, n)
dat2
# # A tibble: 12 x 2
# w n
# <int> <int>
# 1 2 1
# 2 2 2
# 3 3 1
# 4 3 2
# 5 4 1
# 6 4 2
# 7 5 1
# 8 5 2
# 9 6 1
# 10 6 2
# 11 7 1
# 12 7 2
DATA
dat <- read.table(text = "w n
2 1
3 1
4 1
2 1
5 1
6 1
3 2
7 2",
header = TRUE)
Using the original data frame df, you can create a new data frame that repeats each unique value of w for each unique value of n (uniqueN() comes from data.table):
library(data.table)
data.frame(w = rep(unique(df$w),
each = uniqueN(df$n)),
n = rep(unique(df$n),
times = uniqueN(df$w)))
Output:
w n
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
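The same grid can also be built with expand.grid() from base R, without any extra packages. This is a small sketch using the question's values in a plain (ungrouped) data frame:

```r
w1 <- data.frame(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L),
                 n = c(1L, 1L, 2L, 1L, 1L, 1L, 2L))

# the first argument varies fastest in expand.grid(), so n is listed first
# and the columns are reordered afterwards to get rows sorted by w
grid <- expand.grid(n = sort(unique(w1$n)), w = sort(unique(w1$w)))[, c("w", "n")]
grid
```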

How to fix a time series with missing dates across multiple observations?

Let us consider the following time series with numbered days :
test=data.table( day=sample(1:9, 15, TRUE), name=sort(rep(c("a", "b", "c"), 5)), value=sample(1:3, 15, TRUE) )
test[test[, !duplicated(day), by=name][,V1]][order(name, -day)]
day name value
1: 7 a 3
2: 4 a 2
3: 2 a 2
4: 1 a 2
5: 9 b 1
6: 8 b 3
7: 6 b 3
8: 5 b 2
9: 3 b 3
10: 7 c 1
11: 6 c 1
12: 4 c 1
13: 3 c 3
14: 1 c 2
As you can see, we made some measurements on three objects a, b and c over 9 days. We would like to perform a day-to-day value comparison between the three objects. Unfortunately, some dates are randomly missing, which breaks an algorithm that would otherwise be straightforward.
I would like to inject rows into this data.table so that all objects have the same days. Injected rows would default the value to 0.
All days available across all objects are listed with :
> sort(unique(test[,day]) )
[1] 1 2 3 4 5 6 7 8 9
So for instance the object a is missing days : 3, 5, 6, 8, 9
After the row injection the datatable for a would look like :
test[name=="a"]
day name value
1: 1 a 2
2: 2 a 1
3: 3 a 0
4: 4 a 3
5: 5 a 0
6: 6 a 0
7: 7 a 3
8: 8 a 0
9: 9 a 0
Any idea how to tackle this problem? Maybe a library such as lubridate already knows how to do that.
Using the data that you posted, which I copied and put into a data.table, you can do this using:
library(data.table)
## create a table with all days and names
all.dates <- setDT(expand.grid(day=sort(unique(test[,day])),name=sort(unique(test[,name]))))
## perform a left-outer-join of all.dates with test
setkey(all.dates)
setkey(test,day,name)
test <- test[all.dates]
## set those NA's to zero
test[is.na(test)] <- 0
## day name value
##1 1 a 2
##2 1 b 0
##3 1 c 2
##4 2 a 2
##5 2 b 0
##6 2 c 0
##7 3 a 0
##8 3 b 3
##9 3 c 3
##10 4 a 2
##11 4 b 0
##12 4 c 1
##13 5 a 0
##14 5 b 2
##15 5 c 0
##16 6 a 0
##17 6 b 3
##18 6 c 1
##19 7 a 3
##20 7 b 0
##21 7 c 1
##22 8 a 0
##23 8 b 3
##24 8 c 0
##25 9 a 0
##26 9 b 1
##27 9 c 0
Data:
test <- structure(list(day = c(7L, 4L, 2L, 1L, 9L, 8L, 6L, 5L, 3L, 7L,
6L, 4L, 3L, 1L), name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
value = c(3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 1L, 1L,
3L, 2L)), .Names = c("day", "name", "value"), class = c("data.table",
"data.frame"), row.names = c(NA, -14L))
## the .internal.selfref pointer printed by dput() is not valid R input, so it is omitted here
## day name value
## 1: 7 a 3
## 2: 4 a 2
## 3: 2 a 2
## 4: 1 a 2
## 5: 9 b 1
## 6: 8 b 3
## 7: 6 b 3
## 8: 5 b 2
## 9: 3 b 3
##10: 7 c 1
##11: 6 c 1
##12: 4 c 1
##13: 3 c 3
##14: 1 c 2
In the tidyverse, the tidyr package has a function, complete(), that wraps the expand-then-join pattern (the equivalent of expand.grid followed by left_join).
library(tidyverse)
test$day <- factor(test$day, levels = 1:9)
test$name = factor(test$name, levels = c("a", "b", "c"))
test %>%
complete(day, name, fill = list(value = 0))
#> # A tibble: 32 × 3
#> day name value
#> <fctr> <fctr> <dbl>
#> 1 1 a 0
#> 2 1 b 0
#> 3 1 c 0
#> 4 2 a 0
#> 5 2 b 0
#> 6 2 c 1
#> 7 3 a 1
#> 8 3 b 0
#> 9 3 c 0
#> 10 4 a 3
#> # ... with 22 more rows
You can also do it with expand.grid and a left join.
possibilities = expand.grid(levels(test$day), unique(test$name))
possibilities %>%
left_join(test, by = c("Var1" = "day", "Var2" = "name")) %>%
mutate(value = ifelse(is.na(value), 0, value))
#> Var1 Var2 value
#> 1 1 a 0
#> 2 2 a 0
#> 3 3 a 1
#> 4 4 a 3
#> 5 5 a 1
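For comparison, the expand-and-join idea can be sketched in base R alone, using the data from the first answer (typed in here as a plain data frame rather than a data.table). The names all_days and filled are illustrative only.

```r
test <- data.frame(
  day   = c(7L, 4L, 2L, 1L, 9L, 8L, 6L, 5L, 3L, 7L, 6L, 4L, 3L, 1L),
  name  = rep(c("a", "b", "c"), times = c(4, 5, 5)),
  value = c(3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 1L, 1L, 3L, 2L)
)

# every combination of observed day and name
all_days <- expand.grid(day = sort(unique(test$day)),
                        name = sort(unique(test$name)),
                        stringsAsFactors = FALSE)

# left join: keep every combination, NA where no measurement exists
filled <- merge(all_days, test, by = c("day", "name"), all.x = TRUE)
filled$value[is.na(filled$value)] <- 0L
filled <- filled[order(filled$name, filled$day), ]
filled[filled$name == "a", ]
```

Each object ends up with one row per day from 1 to 9, with 0 on the days that had no measurement.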

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.
We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as @docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
For the second question,
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA
Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4
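A shorter base R variant is possible with ave(), which applies a function per group and writes the result back in the original row order. This is a sketch under the assumption that rows are already sorted by day within each group; day_if_pos is a made-up helper name.

```r
df <- data.frame(
  A   = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c"),
  B   = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L),
  day = c(1L, 2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)
)

# blank out the days where B < 1, then take the last remaining day per group
day_if_pos <- ifelse(df$B >= 1, df$day, NA)
df$last <- ave(day_if_pos, df$A, FUN = function(d) {
  d <- d[!is.na(d)]
  if (length(d)) d[length(d)] else NA_integer_  # NA when a group has no B >= 1
})
df
```

Unlike split() followed by rbind(), this keeps the original row names and does not need the groups to be reassembled afterwards.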

Resources