R Fill in missing rows

I have a question similar to this one: Fill in missing rows in R
However, the gaps I need to fill are not only missing months but also missing years in between for each ID. This is an example:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 3 1 10
4 A 3 2 9
5 A 3 3 8
6 B 2 1 7
7 B 2 2 6
8 B 2 3 5
9 B 3 3 4
And this is what I want it to look like:
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 1 3 0
4 A 2 1 0
5 A 2 2 0
6 A 2 3 0
7 A 3 1 10
8 A 3 2 9
9 A 3 3 8
10 B 2 1 7
11 B 2 2 6
12 B 2 3 5
13 B 3 1 0
14 B 3 2 0
15 B 3 3 4
Does anyone have an idea how to solve this? I have already played around with the solutions mentioned in the linked question.

library(tidyverse)
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
df %>%
complete(ID, A, B, fill = list(Var1 = 0))
#> # A tibble: 18 x 4
#> ID A B Var1
#> <chr> <int> <int> <dbl>
#> 1 A 1 1 12
#> 2 A 1 2 11
#> 3 A 1 3 0
#> 4 A 2 1 0
#> 5 A 2 2 0
#> 6 A 2 3 0
#> 7 A 3 1 10
#> 8 A 3 2 9
#> 9 A 3 3 8
#> 10 B 1 1 0
#> 11 B 1 2 0
#> 12 B 1 3 0
#> 13 B 2 1 7
#> 14 B 2 2 6
#> 15 B 2 3 5
#> 16 B 3 1 0
#> 17 B 3 2 0
#> 18 B 3 3 4
Created on 2021-03-03 by the reprex package (v1.0.0)

You could use the solution described there, altering it slightly for your problem.
df
full <- with(df, unique(expand.grid(ID = ID, A = A, B = B)))
complete <- merge(df, full, by = c('ID', 'A', 'B'), all.y = TRUE)
complete$Var1[is.na(complete$Var1)] <- 0

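Just in case somebody else has the same question, this is what I came up with, thanks to the answers provided:
library(tidyverse)
df %>% group_by(ID) %>% complete(A = full_seq(A, 1), B, fill = list(Var1 = 0))
Because the data is grouped by ID, the completion happens within each ID, so ID itself is not passed to complete() (recent tidyr versions reject completing a grouping column). This avoids producing rows for combinations that are not needed.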
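
For readers who prefer to avoid the tidyverse, the same per-ID idea can be sketched in base R. This is a minimal sketch, not the answer author's code: it assumes that A should span min to max within each ID and that B should cover the values observed for that ID, which matches the desired output in the question. The name filled is made up for illustration.

```r
# Example data from the question
df <- data.frame(
  ID   = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  A    = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L),
  B    = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L),
  Var1 = 12:4,
  stringsAsFactors = FALSE
)

# Build the full A x B grid separately for each ID, then merge the data onto it
filled <- do.call(rbind, lapply(split(df, df$ID), function(g) {
  grid <- expand.grid(
    ID = g$ID[1],
    A  = seq(min(g$A), max(g$A)),   # assumption: fill gaps between min and max of A
    B  = sort(unique(g$B)),         # assumption: B values observed for this ID
    stringsAsFactors = FALSE
  )
  out <- merge(grid, g, by = c("ID", "A", "B"), all.x = TRUE)
  out$Var1[is.na(out$Var1)] <- 0L   # rows that were missing get Var1 = 0
  out[order(out$A, out$B), ]
}))
rownames(filled) <- NULL
filled
```

With the example data this gives 9 rows for ID A and 6 rows for ID B, rather than the 18 rows produced by completing across all IDs at once.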

Related

Replace NA values in dataframe with variables in the next column (R)

I am new to R and still learning, and I could not find the answer I am looking for in any other thread.
I have a dataset with (for simplicity) 5 columns. Columns 1, 2, and 4 always have values, but in some rows column 3 doesn't. Below is an example:
Current
A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6
I want to make it so that the NA's are replaced by the value in column D, and then the value in col E is shifted to D, etc.
Desired output:
A B C D E
1 1 2 3 NA
1 2 4 5 NA
1 2 3 4 NA
1 3 9 7 NA
1 2 5 6 NA
I tried code from several other Stack Overflow threads, but none of it achieved what I wanted.
na.omit gets rid of the whole row. Any help is greatly appreciated.

Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))
Code
library(dplyr)
data %>%
mutate(
aux = C,
C = if_else(is.na(aux),D,C),
D = if_else(is.na(aux),E,D),
E = NA
) %>%
select(-aux)
Output
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA

Replacement operation all in one go:
dat[is.na(dat$C), c("C","D","E")] <- c(dat[is.na(dat$C), c("D","E")], NA)
dat
# A B C D E
#1 1 1 2 3 NA
#2 1 2 4 5 NA
#3 1 2 3 4 NA
#4 1 3 9 7 NA
#5 1 2 5 6 NA
Where dat was:
dat <- read.table(text="A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6", fill=TRUE, header=TRUE)

Using shift_row_values
library(hacksaw)
shift_row_values(df1)
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))

A base R universal approach using order without prior knowledge of NA positions.
setNames(data.frame(t(apply(data, 1, function(x)
x[order(is.na(x))]))), colnames(data))
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Using dplyr
library(dplyr)
t(data) %>%
data.frame() %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
t() %>%
as_tibble()
# A tibble: 5 × 5
A B C D E
<int> <int> <int> <int> <int>
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))
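
The order(is.na(x)) idiom used in the last answer can look cryptic, so here is a small sketch of what it does on a single row. The vector x is made up for illustration; it mimics the second row of the example data.

```r
# One row of the example data, with a gap in column C
x <- c(A = 1L, B = 2L, C = NA, D = 4L, E = 5L)

# is.na() marks the gap; FALSE sorts before TRUE and ties keep their
# original order, so non-missing positions come first, left to right
idx <- order(is.na(x))
idx

# Indexing with idx moves the values left and pushes NA to the end
shifted <- unname(x)[idx]
shifted
```

Because the reordering is driven only by is.na(), this works for any number of NA values per row, which is why the answer calls it a universal approach.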

How to merge datasets with repeated measures

I have three datasets that I want to merge/join.
These examples only include the first few participants; I have a total of 25.
df1
ID Grup pretest
1 1 A 2
2 1 A 1
3 1 A 3
4 2 B NA
5 2 B 1
6 2 B 3
7 3 A 2
8 3 A 1
9 3 A NA
10 4 B 2
11 4 B 1
12 4 B 3
df2 (this is missing one ID, 5)
ID Grup posttest
1 1 A NA
2 1 A 5
3 1 A 4
4 2 B 2
5 2 B 4
6 2 B 3
7 3 A 5
8 3 A 6
9 3 A 3
10 6 B 4
11 6 B 2
12 6 B NA
Updated
df3 (this has 5 measurements per ID)
ID Grup traning
1 1 A 2
2 1 A 6
3 1 A 3
4 1 A NA
5 1 A 1
6 2 B 3
7 2 B 4
8 2 B 1
9 2 B NA
10 2 B 2
11 3 A 1
12 3 A 3
I’ve been trying merge() and full_join() but both end up creating duplicates that I don’t want.
It won’t recognize the ID as an independent value, it’s creating 9 IDs for every ID value.
New <- merge(df1, df2, by= 'ID')
New <- full_join(df1, df2, By = "ID")
Setting all = TRUE doesn’t help.
I need the dataset to look like this
ID Grup pretest posttest traning
1 1 A 2 NA 3
2 1 A 1 5 4
3 1 A 3 4 4
4 1 A NA NA 4
5 1 A NA NA 3
6 2 B 3 3 NA
7 2 B 2 5 3
8 2 B NA 6 2
9 2 B NA NA 5
10 2 B NA NA 4
11 3 A 1 2 3
12 3 A 3 3 4

Since you are relying on the row order of the frames, you can simply use cbind(), provided both frames have the same number of rows in the same order:
cbind(df1, df2[, 3, drop = FALSE])
Output:
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA

You can add a helper column iid to separate the entries.
df1 <- cbind(iid = 1:nrow(df1), df1)
df2 <- cbind(iid = 1:nrow(df2), df2)
With dplyr
library(dplyr)
left_join(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
With base R merge
merge(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 4 B 2 4
3 4 B 1 2
4 4 B 3 NA
5 1 A 1 5
6 1 A 3 4
7 2 B NA 2
8 2 B 1 4
9 2 B 3 3
10 3 A 2 5
11 3 A 1 6
12 3 A NA 3
Data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), pretest = c(2L, 1L, 3L, NA, 1L, 3L, 2L, 1L, NA,
2L, 1L, 3L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12"))
df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), posttest = c(NA, 5L, 4L, 2L, 4L, 3L, 5L, 6L,
3L, 4L, 2L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Another option is joining by row names, i.e. row numbers:
library(tibble)
library(dplyr)
left_join(rownames_to_column(df1), df2 %>% rownames_to_column() , by="rowname") %>%
select(ID = ID.x, Grup = Grup.x, pretest, posttest)
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
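
The answers above pair rows by position, which only works when both frames have the same number of rows per ID. Since df3 in the question has five measurements per ID, one possible extension is to number the measurements within each ID and do a full join on that index. The sketch below uses small made-up frames, and the names add_obs and obs are hypothetical.

```r
# Small made-up frames: 3 measurements per ID in df1/df2, 5 per ID in df3
df1 <- data.frame(ID = rep(1:2, each = 3), Grup = rep(c("A", "B"), each = 3),
                  pretest = c(2L, 1L, 3L, NA, 1L, 3L))
df2 <- data.frame(ID = rep(1:2, each = 3), Grup = rep(c("A", "B"), each = 3),
                  posttest = c(NA, 5L, 4L, 2L, 4L, 3L))
df3 <- data.frame(ID = rep(1:2, each = 5), Grup = rep(c("A", "B"), each = 5),
                  traning = c(2L, 6L, 3L, NA, 1L, 3L, 4L, 1L, NA, 2L))

# Hypothetical helper: number the measurements 1, 2, 3, ... within each ID
add_obs <- function(d) {
  d$obs <- ave(seq_len(nrow(d)), d$ID, FUN = seq_along)
  d
}

# Full outer join of all three frames on ID, Grup and the measurement index
frames <- lapply(list(df1, df2, df3), add_obs)
merged <- Reduce(function(x, y) merge(x, y, by = c("ID", "Grup", "obs"), all = TRUE),
                 frames)
merged <- merged[order(merged$ID, merged$obs), ]
rownames(merged) <- NULL
merged
```

Joining on the within-ID index avoids the 3 x 3 = 9 duplicated rows per ID that a join on ID alone produces, and measurements that exist in only one frame show up with NA in the other columns.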

Merge columns of dataframe with all combinations of variables

"w" "n"
"1" 2 1
"2" 3 1
"3" 4 1
"4" 2 1
"5" 5 1
"6" 6 1
"7" 3 2
"8" 7 2
I tried the following command, but it didn't produce the change I expected.
w2 <- w1 %>%
expand(w,n)
My output should look like this
w n
2 1
2 2
3 1
3 2
4 1
4 2
5 1
5 2
6 1
6 2
7 1
7 2
data
w1 <- structure(list(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L,
2L, 1L, 1L, 1L, 2L)), .Names = c("w", "n"), row.names = c(NA,
-7L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L, 2L, 1L,
1L, 1L, 2L), .rows = list(1L, 2L, 3L, 4L, 5L, 6L, 7L)), .Names = c("w",
"n", ".rows"), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE))

The issue is that your data frame is grouped, so expand() works within each group. Consider:
w1 %>%
ungroup() %>%
expand(w, n)
Output:
# A tibble: 12 x 2
w n
<int> <int>
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2

We can use complete from tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
distinct(w, .keep_all = TRUE) %>%
complete(w, n)
dat2
# # A tibble: 12 x 2
# w n
# <int> <int>
# 1 2 1
# 2 2 2
# 3 3 1
# 4 3 2
# 5 4 1
# 6 4 2
# 7 5 1
# 8 5 2
# 9 6 1
# 10 6 2
# 11 7 1
# 12 7 2
DATA
dat <- read.table(text = "w n
2 1
3 1
4 1
2 1
5 1
6 1
3 2
7 2",
header = TRUE)

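Using the original data frame (called df here) you can create a new data frame that repeats each unique value of w once for each unique value of n. Note that uniqueN() comes from data.table:
library(data.table)
data.frame(w = rep(unique(df$w),
each = uniqueN(df$n)),
n = rep(unique(df$n),
times = uniqueN(df$w)))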
Output:
w n
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
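
If no extra packages are wanted, the same table of combinations can also be sketched with base R's expand.grid(). This is only an illustration on an ungrouped copy of the question's data; the name combos is made up.

```r
# Ungrouped copy of the data from the question
w1 <- data.frame(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L),
                 n = c(1L, 1L, 2L, 1L, 1L, 1L, 2L))

# expand.grid() varies its first argument fastest, so n is listed first
# to get rows ordered by w and then by n
combos <- expand.grid(n = sort(unique(w1$n)), w = sort(unique(w1$w)))
combos <- combos[, c("w", "n")]   # restore the original column order
combos
```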

How to fix a time series with missing dates across multiple observations?

Let us consider the following time series with numbered days :
test=data.table( day=sample(1:9, 15, TRUE), name=sort(rep(c("a", "b", "c"), 5)), value=sample(1:3, 15, TRUE) )
test[test[, !duplicated(day), by=name][,V1]][order(name, -day)]
day name value
1: 7 a 3
2: 4 a 2
3: 2 a 2
4: 1 a 2
5: 9 b 1
6: 8 b 3
7: 6 b 3
8: 5 b 2
9: 3 b 3
10: 7 c 1
11: 6 c 1
12: 4 c 1
13: 3 c 3
14: 1 c 2
As you can see, we made some measurements on three objects a, b and c over 9 days. We would like to perform a day-to-day value comparison between the three objects. Unfortunately some dates are randomly missing, which makes it hard to run an algorithm that would otherwise be straightforward.
I would like to inject rows into this data.table so that all objects have the same days. Injected rows would default the value to 0.
All days available across all objects are listed with:
> sort(unique(test[,day]) )
[1] 1 2 3 4 5 6 7 8 9
So for instance the object a is missing days : 3, 5, 6, 8, 9
After the row injection the datatable for a would look like :
test[name=="a"]
day name value
1: 1 a 2
2: 2 a 1
3: 3 a 0
4: 4 a 3
5: 5 a 0
6: 6 a 0
7: 7 a 3
8: 8 a 0
9: 9 a 0
Any idea how to tackle this problem? Maybe a library such as lubridate already knows how to do that.

Using the data that you posted, which I copied and put into a data.table, you can do this using:
library(data.table)
## create a table with all days and names
all.dates <- setDT(expand.grid(day=sort(unique(test[,day])),name=sort(unique(test[,name]))))
## perform a left-outer-join of all.dates with test
setkey(all.dates)
setkey(test,day,name)
test <- test[all.dates]
## set those NA's to zero
test[is.na(test)] <- 0
## day name value
##1 1 a 2
##2 1 b 0
##3 1 c 2
##4 2 a 2
##5 2 b 0
##6 2 c 0
##7 3 a 0
##8 3 b 3
##9 3 c 3
##10 4 a 2
##11 4 b 0
##12 4 c 1
##13 5 a 0
##14 5 b 2
##15 5 c 0
##16 6 a 0
##17 6 b 3
##18 6 c 1
##19 7 a 3
##20 7 b 0
##21 7 c 1
##22 8 a 0
##23 8 b 3
##24 8 c 0
##25 9 a 0
##26 9 b 1
##27 9 c 0
Data:
test <- structure(list(day = c(7L, 4L, 2L, 1L, 9L, 8L, 6L, 5L, 3L, 7L,
6L, 4L, 3L, 1L), name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
value = c(3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 1L, 1L,
3L, 2L)), .Names = c("day", "name", "value"), class = c("data.table",
"data.frame"), row.names = c(NA, -14L))
## day name value
## 1: 7 a 3
## 2: 4 a 2
## 3: 2 a 2
## 4: 1 a 2
## 5: 9 b 1
## 6: 8 b 3
## 7: 6 b 3
## 8: 5 b 2
## 9: 3 b 3
##10: 7 c 1
##11: 6 c 1
##12: 4 c 1
##13: 3 c 3
##14: 1 c 2
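
The same left join can be sketched a little more compactly with data.table's CJ() (cross join) and the on= argument, without setting keys. This is a hedged variant rather than the answer author's code; it uses a shortened, made-up version of the data, and the names grid and filled are chosen for illustration.

```r
library(data.table)

# Shortened made-up data: object a has days 1, 2, 4, 7 and object b has days 8, 9
test <- data.table(day   = c(7L, 4L, 2L, 1L, 9L, 8L),
                   name  = c("a", "a", "a", "a", "b", "b"),
                   value = c(3L, 2L, 2L, 2L, 1L, 3L))

# All combinations of the observed days and names
grid <- CJ(day = sort(unique(test$day)), name = unique(test$name))

# Join test onto the grid; combinations missing from test get value = NA
filled <- test[grid, on = .(day, name)]

# Replace the injected NA values with 0 and sort for readability
filled[is.na(value), value := 0L]
setorder(filled, name, day)
filled
```

Replacing only the value column, instead of assigning into every NA cell of the table, keeps the fill from touching other columns by accident.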

In the tidyverse, one of the packages (tidyr) has a wrapper around expand.grid and left_join.
library(tidyverse)
test$day <- factor(test$day, levels = 1:9)
test$name = factor(test$name, levels = c("a", "b", "c"))
test %>%
complete(day, name, fill = list(value = 0))
#> # A tibble: 32 × 3
#> day name value
#> <fctr> <fctr> <dbl>
#> 1 1 a 0
#> 2 1 b 0
#> 3 1 c 0
#> 4 2 a 0
#> 5 2 b 0
#> 6 2 c 1
#> 7 3 a 1
#> 8 3 b 0
#> 9 3 c 0
#> 10 4 a 3
#> # ... with 22 more rows
You can also do it with expand.grid and a left join.
possibilities = expand.grid(levels(test$day), unique(test$name))
possibilities %>%
left_join(test, by = c("Var1" = "day", "Var2" = "name")) %>%
mutate(value = ifelse(is.na(value), 0, value))
#> Var1 Var2 value
#> 1 1 a 0
#> 2 2 a 0
#> 3 3 a 1
#> 4 4 a 3
#> 5 5 a 1

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.

We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as @docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
For the second question,
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA

Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4
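
The answers above assume every group contains at least one row with B >= 1. A possible base R variant using ave() that returns NA for groups without such a row is sketched below. The extra group d and the helper name last_day are made up for illustration and are not part of the original question.

```r
# Data from the question plus a hypothetical group "d" where B is always 0
df1 <- data.frame(
  A   = c(rep("a", 4), rep("b", 5), rep("c", 3), rep("d", 2)),
  B   = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 0L, 0L),
  day = c(1L, 2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L, 1L, 2L)
)

# Hypothetical helper: given the row numbers of one group, return the day of
# the last row with B >= 1, or NA if the group has no such row
last_day <- function(i) {
  hit <- i[df1$B[i] >= 1]
  if (length(hit) > 0) df1$day[max(hit)] else NA_integer_
}

# ave() applies the helper per group and recycles the result over the group
df1$last <- ave(seq_len(nrow(df1)), df1$A, FUN = last_day)
df1
```

Like the other answers, this takes "last" to mean the last row in the current row order, so the data should be sorted by day within each group first.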
