Replace NA values in dataframe with variables in the next column (R) - r

I am new still trying to learn R and I could not find the answers I am looking for in any other thread.
I have a dataset with (for simplicity) 5 columns. Columns 1,2, and4 always have values, but in some rows column 3 doesn't. Below is an example:
Current
A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6
I want to make it so that the NA's are replaced by the value in column D, and then the value in col E is shifted to D, etc.
Desired output:
A B C D E
1 1 2 3 NA
1 2 4 5 NA
1 2 3 4 NA
1 3 9 7 NA
1 2 5 6 NA
I copied what was on different Stack overflow threads and none achieved what I wanted.
na.omit gets rid of the row. Any help is greatly appreciated.

Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))
Code
library(dplyr)
data %>%
mutate(
aux = C,
C = if_else(is.na(aux),D,C),
D = if_else(is.na(aux),E,D),
E = NA
) %>%
select(-aux)
Output
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA

Replacement operation all in one go:
dat[is.na(dat$C), c("C","D","E")] <- c(dat[is.na(dat$C), c("D","E")], NA)
dat
# A B C D E
#1 1 1 2 3 NA
#2 1 2 4 5 NA
#3 1 2 3 4 NA
#4 1 3 9 7 NA
#5 1 2 5 6 NA
Where dat was:
dat <- read.table(text="A B C D E
1 1 2 3
1 2 NA 4 5
1 2 3 4
1 3 NA 9 7
1 2 NA 5 6", fill=TRUE, header=TRUE)

Using shift_row_values
library(hacksaw)
shift_row_values(df1)
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))

A base R universal approach using order without prior knowledge of NA positions.
setNames(data.frame(t(apply(data, 1, function(x)
x[order(is.na(x))]))), colnames(data))
A B C D E
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Using dplyr
library(dplyr)
t(data) %>%
data.frame() %>%
mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
t() %>%
as_tibble()
# A tibble: 5 × 5
A B C D E
<int> <int> <int> <int> <int>
1 1 1 2 3 NA
2 1 2 4 5 NA
3 1 2 3 4 NA
4 1 3 9 7 NA
5 1 2 5 6 NA
Data
data <- structure(list(A = c(1L, 1L, 1L, 1L, 1L), B = c(1L, 2L, 2L, 3L,
2L), C = c(2L, NA, 3L, NA, NA), D = c(3L, 4L, 4L, 9L, 5L), E = c(NA,
5L, NA, 7L, 6L)), class = "data.frame", row.names = c(NA, -5L
))

Related

R Recode variable for all observations that do not occur more than once

I have a simple dataframe that looks like the following:
Observation X1 X2 Group
1 2 4 1
2 6 3 2
3 8 4 2
4 1 3 3
5 2 8 4
6 7 5 5
7 2 4 5
How can I recode the group variable such that all non-recurrent observations are recoded as "unaffiliated"?
The desired output would be the following:
Observation X1 X2 Group
1 2 4 Unaffiliated
2 6 3 2
3 8 4 2
4 1 3 Unaffiliated
5 2 8 Unaffiliated
6 7 5 5
7 2 4 5
We may use duplicated to create a logical vector for non-duplicates and assign the 'Group' to Unaffiliated for those non-duplicates
df1$Group[with(df1, !(duplicated(Group)|duplicated(Group,
fromLast = TRUE)))] <- "Unaffiliated"
-output
> df1
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
data
df1 <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
unfaffil takes a vector of Group numbers and returns "Unaffiliated" if it has one element and otherwise returns the input. We can then apply it by Group using ave. This does not overwrite the input. No packages are used but if you use dplyr then transform can be replaced with mutate.
unaffil <- function(x) if (length(x) == 1) "Unaffiliated" else x
transform(dat, Group = ave(Group, Group, FUN = unaffil))
giving
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
Note
dat <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
One way could be first grouping then checking for maximum of row number and finishing with an ifelse:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Group = ifelse(max(row_number()) == 1, "Unaffiliated", as.character(Group))) %>%
ungroup()
Observation X1 X2 Group
<int> <int> <int> <chr>
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5

Expand a numerical variable based on sequence and value to multi columns

I have a data like that :
structure(list(time = c(3L, 4L, 2L, 1L, 2L, 3L,
1L, 4L, 2L)), class = "data.frame", row.names = c(NA,
-9L))
These numbers are time in which participants have been in the study. I want to form the column to have a binomial column for each time.
structure(list(time = c(3L, 4L, 2L, 1L, 2L, 3L, 1L, 4L, 2L),
t1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), t2 = c(1L, 1L,
1L, NA, 1L, 1L, NA, 1L, 1L), t3 = c(1L, 1L, NA, NA, NA, 1L,
NA, 1L, NA), t4 = c(NA, 1L, NA, NA, NA, NA, NA, 1L, NA)), class = "data.frame", row.names = c(NA,
-9L))
Another option with base R:
m <- matrix(nrow = nrow(dtt), ncol = max(dtt$time))
m[col(m) <= dtt$time] <- 1L
cbind(dtt, m)
# time 1 2 3 4
# 1 3 1 1 1 NA
# 2 4 1 1 1 1
# 3 2 1 1 NA NA
# 4 1 1 NA NA NA
# 5 2 1 1 NA NA
# 6 3 1 1 1 NA
# 7 1 1 NA NA NA
# 8 4 1 1 1 1
# 9 2 1 1 NA NA
Here's a pretty direct approach with base R (calling your input df):
max_length = max(df$time)
rows = lapply(df$time, function(t) c(rep(1, t), rep(NA, max_length - t)))
result = cbind(df, do.call(rbind, rows))
names(result)[-1] = paste0("t", names(result)[-1])
result
# time t1 t2 t3 t4
# 1 3 1 1 1 NA
# 2 4 1 1 1 1
# 3 2 1 1 NA NA
# 4 1 1 NA NA NA
# 5 2 1 1 NA NA
# 6 3 1 1 1 NA
# 7 1 1 NA NA NA
# 8 4 1 1 1 1
# 9 2 1 1 NA NA
Here's a tidyverse approach :
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
uncount(time, .remove = FALSE) %>%
group_by(row) %>%
mutate(col = row_number()) %>%
pivot_wider(names_from = col, values_from = col,
values_fn = length, names_prefix = 't') %>%
ungroup %>%
select(-row)
# time t1 t2 t3 t4
# <int> <int> <int> <int> <int>
#1 3 1 1 1 NA
#2 4 1 1 1 1
#3 2 1 1 NA NA
#4 1 1 NA NA NA
#5 2 1 1 NA NA
#6 3 1 1 1 NA
#7 1 1 NA NA NA
#8 4 1 1 1 1
#9 2 1 1 NA NA

R Fill in missing rows

I have a similar question like this one: Fill in missing rows in R
However, the gaps I need to fill are not only months, but also missing years in between for one ID. This is an example:
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 3 1 10
4 A 3 2 9
5 A 3 3 8
6 B 2 1 7
7 B 2 2 6
8 B 2 3 5
9 B 3 3 4
And this is what I want it to look like:
ID A B Var1
1 A 1 1 12
2 A 1 2 11
3 A 1 3 0
4 A 2 1 0
5 A 2 2 0
6 A 2 3 0
7 A 3 1 10
8 A 3 2 9
9 A 3 3 8
10 B 2 1 7
11 B 2 2 6
12 B 2 3 5
13 B 3 1 0
14 B 3 2 0
15 B 3 3 4
Has someone an idea how to solve it? I have already played around with the solutions mentioned above.
library(tidyverse)
df <- structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B",
"B"), A = c(1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 3L), B = c(1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 3L), Var1 = 12:4), class = "data.frame", row.names = c(NA,
-9L))
df %>%
complete(ID, A, B, fill = list(Var1 = 0))
#> # A tibble: 18 x 4
#> ID A B Var1
#> <chr> <int> <int> <dbl>
#> 1 A 1 1 12
#> 2 A 1 2 11
#> 3 A 1 3 0
#> 4 A 2 1 0
#> 5 A 2 2 0
#> 6 A 2 3 0
#> 7 A 3 1 10
#> 8 A 3 2 9
#> 9 A 3 3 8
#> 10 B 1 1 0
#> 11 B 1 2 0
#> 12 B 1 3 0
#> 13 B 2 1 7
#> 14 B 2 2 6
#> 15 B 2 3 5
#> 16 B 3 1 0
#> 17 B 3 2 0
#> 18 B 3 3 4
Created on 2021-03-03 by the reprex package (v1.0.0)
You could use the solution described there altering it slightly for your problem.
df
full <- with(df, unique(expand.grid(ID = ID, A = A, B = B)))
complete <- merge(df, full, by = c('ID', 'A', 'B'), all.y = TRUE)
complete$Var1[is.na(complete$Var1)] <- 0
Just in case somebody else has the same question, this is what I came up with, thanks to the answers provided:
library(tidyverse)
df %>% group_by(ID) %>% complete(ID, A = full_seq(A,1), B, fill = list(Var1 = 0))
This code avoids that too many unused datasets are produced.

Merge columns of dataframe with all combinations of variables

"w" "n"
"1" 2 1
"2" 3 1
"3" 4 1
"4" 2 1
"5" 5 1
"6" 6 1
"7" 3 2
"8" 7 2
I tried the following command,but didnt show any change as I expect.
w2 <- w1 %>%
expand(w,n)
My output should look like this
w n
2 1
2 2
3 1
3 2
4 1
4 2
5 1
5 2
6 1
6 2
7 1
7 2
data
w1 <- structure(list(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L,
2L, 1L, 1L, 1L, 2L)), .Names = c("w", "n"), row.names = c(NA,
-7L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L, 2L, 1L,
1L, 1L, 2L), .rows = list(1L, 2L, 3L, 4L, 5L, 6L, 7L)), .Names = c("w",
"n", ".rows"), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE))
The issue was in your data frame being grouped, consider:
w1 %>%
ungroup() %>%
expand(w, n)
Output:
# A tibble: 12 x 2
w n
<int> <int>
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
We can use complete from tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
distinct(w, .keep_all = TRUE) %>%
complete(w, n)
dat2
# # A tibble: 12 x 2
# w n
# <int> <int>
# 1 2 1
# 2 2 2
# 3 3 1
# 4 3 2
# 5 4 1
# 6 4 2
# 7 5 1
# 8 5 2
# 9 6 1
# 10 6 2
# 11 7 1
# 12 7 2
DATA
dat <- read.table(text = "w n
2 1
3 1
4 1
2 1
5 1
6 1
3 2
7 2",
header = TRUE)
Using the original data frame df you can create a new data frame that copies w for each unique value of n:
data.frame(w = rep(unique(df$w),
each = uniqueN(df$n)),
n = rep(unique(df$n),
times = uniqueN(df$w)))
Output:
w n
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.
We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as #docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
For the second question,
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA
Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4

Resources