I have a question about R. I have a dataset in which I would like to change cell values based on the value in the adjacent column.
Data <- tibble(a = 1:5,
b = c("G","H","I","J","K"),
c = c("G","H","J","I","J"))
I would like to change the values to NA in every row where b and c contain the same string.
Desired output
Data <- tibble(a = 1:5,
b = c("NA","NA","I","J","K"),
c = c("NA","NA","J","I","J"))
Thanks a lot for your help in advance.
library(data.table)
setDT(Data)[b == c, c("b", "c") := NA]
# a b c
# 1: 1 <NA> <NA>
# 2: 2 <NA> <NA>
# 3: 3 I J
# 4: 4 J I
# 5: 5 K J
With base R:
Data[Data$b == Data$c, c('b', 'c')] <- "NA"
Data
# # A tibble: 5 x 3
# a b c
# <int> <chr> <chr>
# 1 1 NA NA
# 2 2 NA NA
# 3 3 I J
# 4 4 J I
# 5 5 K J
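Note that the base R assignment above stores the literal string "NA" (as in the question's desired output), whereas the data.table answer stores a true missing value. A minimal sketch of the difference, assuming a plain data.frame instead of a tibble; the names same, as_string and as_missing are only illustrative:

```r
# Sketch: the string "NA" versus a real missing value (plain data.frame assumed).
Data <- data.frame(a = 1:5,
                   b = c("G", "H", "I", "J", "K"),
                   c = c("G", "H", "J", "I", "J"),
                   stringsAsFactors = FALSE)

same <- Data$b == Data$c                 # TRUE for rows where b and c match

as_string  <- Data
as_missing <- Data
as_string[same, c("b", "c")]  <- "NA"            # two-character string
as_missing[same, c("b", "c")] <- NA_character_   # true missing value

is.na(as_string$b)    # the string "NA" is not detected as missing
is.na(as_missing$b)   # the first two entries are detected as missing
```

If later steps rely on is.na(), na.omit() or similar, the NA_character_ variant is probably the one to use.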
Using which to subset Data on the rows where b and c have the same values:
Data[c("b","c")][which(Data$b == Data$c),] <- NA
Result:
Data
# A tibble: 5 x 3
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J
With dplyr
library(dplyr)
Data %>%
rowwise() %>%
mutate(b = ifelse(b %in% c & c %in% b, "NA", b))%>%
mutate(c = ifelse(b == "NA", "NA", c))
Output:
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J
Another base R option
cols <- c("b", "c")
Data[cols] <- replace(Data[cols], Data[cols] == Data[rev(cols)], NA)
gives
> Data
# A tibble: 5 x 3
a b c
<int> <chr> <chr>
1 1 NA NA
2 2 NA NA
3 3 I J
4 4 J I
5 5 K J
Related
How can I expand a group to length of the max group:
df <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 3L), col1 = c("A",
"B", "O", "U", "L", "R")), class = "data.frame", row.names = c(NA,
-6L))
ID col1
1 A
1 B
2 O
3 U
3 L
3 R
Desired Output:
1 A
1 B
NA NA
2 O
NA NA
NA NA
3 U
3 L
3 R
You can take advantage of the fact that df[n_bigger_than_nrow,] gives a row of NAs
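A small base R sketch of that indexing behaviour, using the question's data; one_group and padded are illustrative names:

```r
# Sketch: row indices beyond nrow() produce rows filled with NA.
df <- data.frame(ID = c(1L, 1L, 2L, 3L, 3L, 3L),
                 col1 = c("A", "B", "O", "U", "L", "R"),
                 stringsAsFactors = FALSE)

one_group <- df[df$ID == 2L, ]       # this group has a single row
padded <- one_group[seq_len(3), ]    # indices 2 and 3 exceed nrow(one_group)
padded                               # one real row followed by two all-NA rows
```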
dplyr
max_n <- max(count(df, ID)$n)
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),])
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 9 × 2
#> # Groups: ID [3]
#> ID col1
#> <int> <chr>
#> 1 1 A
#> 2 1 B
#> 3 1 <NA>
#> 4 2 O
#> 5 2 <NA>
#> 6 2 <NA>
#> 7 3 U
#> 8 3 L
#> 9 3 R
base R
n <- tapply(df$ID, df$ID, length)
max_n <- max(n)
i <- lapply(n, \(x) c(seq(x), rep(Inf, max_n - x)))
i <- Map(`+`, i, c(0, cumsum(head(n, -1))))
df <- df[unlist(i),]
rownames(df) <- NULL
df$ID <- rep(as.numeric(names(i)), each = max_n)
df
#> ID col1
#> 1 1 A
#> 2 1 B
#> 3 1 <NA>
#> 4 2 O
#> 5 2 <NA>
#> 6 2 <NA>
#> 7 3 U
#> 8 3 L
#> 9 3 R
Here's a base R solution.
Split the df by the ID column, then use lapply to iterate over the pieces and rbind each one with a data frame of NA rows if it has fewer rows than the largest group (max(table(df$ID)), which is 3 here).
do.call(rbind,
lapply(split(df, df$ID),
\(x) rbind(x, data.frame(ID = NA, col1 = NA)[rep(1, max(table(df$ID)) - nrow(x)), ]))
)
ID col1
1.1 1 A
1.2 1 B
1.3 NA <NA>
2.3 2 O
2.1 NA <NA>
2.1.1 NA <NA>
3.4 3 U
3.5 3 L
3.6 3 R
Here is a possible tidyverse solution. We can use add_row inside of summarise to add rows to each group. I use max(count(df, ID)$n) to get the max group length, then subtract the number of rows in each group from it to get the number of rows that need to be added to that group. I use rep to produce the correct number of NA values for each group. Finally, I replace ID with NA wherever col1 is NA.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(add_row(cur_data(),
col1 = rep(NA_character_,
unique(max(count(df, ID)$n) - n()))),
.groups = "drop") %>%
mutate(ID = replace(ID, is.na(col1), NA))
Output
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
Or another option without using add_row:
library(dplyr)
# Get maximum number of rows for all groups
N = max(count(df,ID)$n)
df %>%
group_by(ID) %>%
summarise(col1 = c(col1, rep(NA, N-length(col1))), .groups = "drop") %>%
mutate(ID = replace(ID, is.na(col1), NA))
Another option could be:
df %>%
group_split(ID) %>%
map_dfr(~ rows_append(.x, tibble(col1 = rep(NA_character_, max(pull(count(df, ID), n)) - group_size(.x)))))
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
A base R option using merge + rle
merge(
transform(
data.frame(ID = with(rle(df$ID), rep(values, each = max(lengths)))),
q = ave(ID, ID, FUN = seq_along)
),
transform(
df,
q = ave(ID, ID, FUN = seq_along)
),
all = TRUE
)[-2]
gives
ID col1
1 1 A
2 1 B
3 1 <NA>
4 2 O
5 2 <NA>
6 2 <NA>
7 3 U
8 3 L
9 3 R
A data.table option may also work
> setDT(df)[, .(col1 = `length<-`(col1, max(df[, .N, ID][, N]))), ID]
ID col1
1: 1 A
2: 1 B
3: 1 <NA>
4: 2 O
5: 2 <NA>
6: 2 <NA>
7: 3 U
8: 3 L
9: 3 R
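The data.table call above relies on the replacement function `length<-`, which pads a vector with NA when it is lengthened. A minimal base R sketch of the same idea, assuming the question's data; max_n, padded and result are illustrative names:

```r
# Sketch: pad every group to the size of the largest group with `length<-`.
df <- data.frame(ID = c(1L, 1L, 2L, 3L, 3L, 3L),
                 col1 = c("A", "B", "O", "U", "L", "R"),
                 stringsAsFactors = FALSE)

max_n <- max(table(df$ID))                        # size of the largest group

# Lengthening a vector fills the new positions with NA.
padded <- lapply(split(df$col1, df$ID), `length<-`, max_n)

# Reassemble; the padded rows keep their ID here.
result <- data.frame(ID = rep(as.integer(names(padded)), each = max_n),
                     col1 = unlist(padded, use.names = FALSE),
                     stringsAsFactors = FALSE)
result
```

This version keeps the ID on the padded rows, like the data.table output; set those IDs to NA afterwards if the question's exact output is required.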
An option to tidyr::complete the ID and row_new, using row_old to replace ID with NA.
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(
row_new = row_number(),
row_old = row_number()) %>%
ungroup() %>%
complete(ID, row_new) %>%
mutate(ID = if_else(is.na(row_old),
NA_integer_,
ID)) %>%
select(-matches("row_"))
# A tibble: 9 x 2
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA <NA>
4 2 O
5 NA <NA>
6 NA <NA>
7 3 U
8 3 L
9 3 R
n <- max(table(df$ID))
df %>%
group_by(ID) %>%
summarise(col1 =`length<-`(col1, n), .groups = 'drop') %>%
mutate(ID = `is.na<-`(ID, is.na(col1)))
# A tibble: 9 x 2
ID col1
<int> <chr>
1 1 A
2 1 B
3 NA NA
4 2 O
5 NA NA
6 NA NA
7 3 U
8 3 L
9 3 R
Another base R solution using sequence.
print(
df[
sequence(
abs(rep(i <- rle(df$ID)$lengths, each = 2) - c(0L, max(i))),
rep(cumsum(c(1L, i))[-length(i) - 1L], each = 2) + c(0L, nrow(df)),
),
],
row.names = FALSE
)
#> ID col1
#> 1 A
#> 1 B
#> NA <NA>
#> 2 O
#> NA <NA>
#> NA <NA>
#> 3 U
#> 3 L
#> 3 R
Question updated 9/10 !
DF<-data.frame(id=c(1,1,1,2,2,2),rank=c("1","2","3","1","2","3"),code=c("A","B","B","B","B","A"))
DF
id rank code
1 A1 1 A
2 A1 2 B
3 A1 3 B
4 B2 1 B
5 B2 2 B
6 B2 3 A
Desired output:
id rank code type1 type2 type3
1 A1 1 A aa MIX MIX
2 A1 2 B NA MIX MIX
3 A1 3 B NA NA MIX
4 B2 1 B bb bb MIX
5 B2 2 B NA bb MIX
6 B2 3 A NA NA MIX
All is grouped by id
type1 gets code where rank = 1.
type2 gets code where rank = 1-2. If code is different in rank 1 and 2, then MIX
type3 gets code where rank = 1-3. etc. etc.
Anyone? :)
Here's a dplyr solution using ifelse and a temporary column to reduce boilerplate:
library(dplyr)
DF %>%
group_by(id) %>%
mutate(a = code[rank == 1],
type1 = ifelse(rank > 1, NA,
ifelse(all(code[!(rank > 1)] == a[1]), a[1], "MIX")),
type2 = ifelse(rank > 2, NA,
ifelse(all(code[!(rank > 2)] == a[1]), a[1], "MIX")),
type3 = ifelse(rank > 3, NA,
ifelse(all(code[!(rank > 3)] == a[1]), a[1], "MIX"))) %>%
select(-a)
#> # A tibble: 6 x 6
#> # Groups: id [2]
#> id rank code type1 type2 type3
#> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 A A MIX MIX
#> 2 1 2 B NA MIX MIX
#> 3 1 3 B NA NA MIX
#> 4 2 1 B B B MIX
#> 5 2 2 B NA B MIX
#> 6 2 3 A NA NA MIX
Using dplyr with case_when statements:
DF %>%
group_by(id) %>%
mutate(type2_grp = if_else(rank <= 2, 1, 0),
type3_grp = if_else(rank <= 3, 1, 0)) %>%
mutate(type1 = case_when(rank == 1 ~ code)) %>%
group_by(id, type2_grp) %>%
mutate(type2 = case_when(type2_grp == 1 & length(unique(code)) > 1 ~ "MIX",
type2_grp == 1 & code == "A" ~ "A",
type2_grp == 1 & code == "B" ~ "B")) %>%
group_by(id, type3_grp) %>%
mutate(type3 = case_when(type3_grp == 1 & length(unique(code)) > 1 ~ "MIX",
type3_grp == 1 & code == "A" ~ "A",
type3_grp == 1 & code == "B" ~ "B")) %>%
ungroup() %>%
select(-type2_grp, -type3_grp)
Which creates:
# A tibble: 6 x 6
id rank code type1 type2 type3
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 1 A A MIX MIX
2 1 2 B NA MIX MIX
3 1 3 B NA NA MIX
4 2 1 B B B MIX
5 2 2 B NA B MIX
6 2 3 A NA NA MIX
A base R solution for an arbitrary number of "type" columns
maxtype=3
do.call(rbind,
by(DF,list(DF$id),function(x){
y=list()
for (i in 1:maxtype) {
tmp=rep(NA,nrow(x))
idx=as.numeric(x$rank)<=i
if (length(unique(x$code[idx]))==1) {
tmp[idx]=x$code[1]
} else {
tmp[idx]="MIX"
}
y[[paste0("type",i)]]=tmp
}
cbind(x,y)
})
)
id rank code type1 type2 type3
1.1 1 1 A A MIX MIX
1.2 1 2 B <NA> MIX MIX
1.3 1 3 B <NA> <NA> MIX
2.4 2 1 B B B MIX
2.5 2 2 B <NA> B MIX
2.6 2 3 A <NA> <NA> MIX
Assuming DF is sorted by id then rank, the type columns for each id form an upper triangular matrix of "MIX" whose top-left block is overwritten by an upper triangular matrix of the first code value, sized by the number of consecutive rows in which that value appears.
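To make the triangular structure concrete, here is a minimal sketch for a single id, using a plain character matrix rather than the factor codes of the data.table version; code, m, k and sub are illustrative names:

```r
# Sketch: build the type columns for one id as a character matrix.
code <- c("B", "B", "A")          # codes of id 2, ordered by rank
n <- length(code)

m <- matrix("MIX", nrow = n, ncol = n)
m[lower.tri(m)] <- NA             # typeK is NA for ranks greater than K

k <- rle(code)$lengths[1]         # number of leading rows sharing the first code
sub <- matrix(NA_character_, nrow = k, ncol = k)
sub[upper.tri(sub, diag = TRUE)] <- code[1]
m[seq_len(k), seq_len(k)] <- sub  # overwrite the top-left block

m                                 # columns correspond to type1, type2, type3
```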
A data.table solution:
library(data.table)
DF <- data.frame(id=c(1,1,1,2,2,2),rank=c("1","2","3","1","2","3"),code=c("A","B","B","B","B","A"))
setDT(DF)[, `:=`(rank = factor(rank), code = factor(code))]
maxRank <- nlevels(DF$rank)
naLvl <- nlevels(DF$code) + 2L
mTri <- matrix(nlevels(DF$code) + 1L, nrow = maxRank, ncol = maxRank)
mTri[lower.tri(mTri)] <- naLvl
typeMat <- function(rank, code) {
firstrep <- rle(code)[[1]][1]
mSubTri <- matrix(naLvl, nrow = firstrep, ncol = firstrep)
mSubTri[upper.tri(mSubTri, diag = TRUE)] <- code[1]
mOut <- mTri
mOut[1:firstrep, 1:firstrep] <- mSubTri
return(mOut[rank,, drop = FALSE])
}
DF <- cbind(DF, as.data.table(do.call(rbind, DF[, (type = list(list(typeMat(as.integer(rank), as.integer(code))))), by = id]$V1)))
typeCols <- 4:(3 + maxRank)
DF[, (typeCols) := lapply(.SD, function(x) {factor(x, levels = 1:naLvl, labels = c(levels(code), "MIX", NA), exclude = NULL)}), .SDcols = typeCols]
setnames(DF, 4:(3 + maxRank), paste0("type", 1:maxRank))
> DF
id rank code type1 type2 type3
1: 1 1 A A MIX MIX
2: 1 2 B <NA> MIX MIX
3: 1 3 B <NA> <NA> MIX
4: 2 1 B B B MIX
5: 2 2 B <NA> B MIX
6: 2 3 A <NA> <NA> MIX
I am trying to fill in blank cells with the values of the rows above, similar to the na.locf function, but I have a pattern that needs to be matched. I don't necessarily know how many rows there are between new values (i.e. between a,b and c,d).
I have used the na.locf and searched around for a solution to no avail.
df <- data.frame(col1 = c("a","b", NA, NA, NA, NA, "c", "d", NA, NA))
df
# col1
# 1 a
# 2 b
# 3 <NA>
# 4 <NA>
# 5 <NA>
# 6 <NA>
# 7 c
# 8 d
# 9 <NA>
# 10 <NA>
Solution I would like:
df
col1
a
b
a
b
a
b
c
d
c
d
ave(df$col1,
with(rle(!is.na(df$col1)), rep(cumsum(values), lengths)),
FUN = function(x){
rep(x[!is.na(x)], length.out = length(x))
})
# [1] a b a b a b c d c d
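To see how the grouping vector in the ave call is built, the pieces can be run one at a time. A sketch on a plain character vector; r, grp and filled are illustrative names:

```r
# Sketch: step through the grouping used by the ave() call.
col1 <- c("a", "b", NA, NA, NA, NA, "c", "d", NA, NA)

r <- rle(!is.na(col1))                    # runs of non-missing / missing values
grp <- rep(cumsum(r$values), r$lengths)   # each new non-missing run starts a group

# Within each group, recycle the non-missing pattern over the whole group.
filled <- ave(col1, grp, FUN = function(x) {
  rep(x[!is.na(x)], length.out = length(x))
})
filled
```

This assumes the vector starts with a non-missing value; leading NAs would fall into a group with no pattern to recycle.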
Here's a way with dplyr. You can drop the group column if needed.
df %>%
group_by(group = cumsum(is.na(lag(col1)) & !is.na(col1))) %>%
mutate(
col1 = rep(col1[!is.na(col1)], length.out = n())
) %>%
ungroup()
# A tibble: 10 x 2
col1 group
<chr> <int>
1 a 1
2 b 1
3 a 1
4 b 1
5 a 1
6 b 1
7 c 2
8 d 2
9 c 2
10 d 2
I have 2 dataframes in R (df1, df2).
df1 as
A C D
1 1 1
2 2 2
df2 as
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
Columns are sorted and column values are added. Both DFs have same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
By using left_join() from dplyr you won't lose the column
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.
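If that assumption may not hold, one way to avoid the special cases is to pad each data frame with zero-filled columns for the names it lacks and then add the two. A minimal sketch; pad_cols is a hypothetical helper, not part of any package:

```r
# Sketch: add two data frames over the union of their column names.
df1 <- data.frame(A = 1:2, C = 1:2, D = 1:2)
df2 <- data.frame(A = 1:2, B = 1:2, C = 1:2)

all_names <- sort(union(names(df1), names(df2)))

# Hypothetical helper: add missing columns as 0, then order the columns.
pad_cols <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) {
    df[[nm]] <- 0L
  }
  df[cols]
}

result <- pad_cols(df1, all_names) + pad_cols(df2, all_names)
result
```

Because a missing column is treated as zero, this works whether or not either data frame has columns the other lacks, as long as both have the same number of rows.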
Ok, here is yet another missing value filling question.
I am looking for a way to fill NAs based on both the previous and next existent values in a column. Standard filling in a single direction is not sufficient for this task.
If the previous and next valid values in a column are not the same, then the chunk remains as NA.
The code for the sample data frame is:
df_in <- tibble(id= 1:12,
var1 = letters[1:12],
var2 = c(NA,rep("A",2),rep(NA,2),rep("A",2),rep(NA,2),rep("B",2),NA))
Thanks,
Comparing na.locf() (last observation carried forward) and na.locf(fromLast = TRUE) (backward):
mutate(df_in,
var_new = if_else(
zoo::na.locf(var2, na.rm = FALSE) ==
zoo::na.locf(var2, na.rm = FALSE, fromLast = TRUE),
zoo::na.locf(var2, na.rm = FALSE),
NA_character_
))
# # A tibble: 12 x 4
# id var1 var2 var_new
# <int> <chr> <chr> <chr>
# 1 1 a NA NA
# 2 2 b A A
# 3 3 c A A
# 4 4 d NA A
# 5 5 e NA A
# 6 6 f A A
# 7 7 g A A
# 8 8 h NA NA
# 9 9 i NA NA
# 10 10 j B B
# 11 11 k B B
# 12 12 l NA NA
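If zoo is not available, the same forward/backward comparison can be sketched in base R. The locf function below is a hypothetical helper written for this example, not the zoo implementation:

```r
# Sketch: fill a gap only when the values before and after it agree.
var2 <- c(NA, rep("A", 2), rep(NA, 2), rep("A", 2), rep(NA, 2), rep("B", 2), NA)

# Hypothetical helper: last observation carried forward, keeping leading NAs.
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # index of the latest non-missing value so far
  c(NA, x[!is.na(x)])[idx + 1L]   # position 1 holds NA for leading gaps
}

fwd <- locf(var2)                 # fill forward
bwd <- rev(locf(rev(var2)))       # fill backward

agree <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
var_new <- ifelse(agree, fwd, NA_character_)
var_new
```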
Something like this?
df_in %>% mutate(var_new = {
tmp <- var2
tmp[is.na(tmp)] <- "NA"
rl <- rle(tmp)
tibble(before = c(NA, head(rl$values, -1)),
value = rl$values,
after = c(tail(rl$values, -1), NA),
lengths = rl$lengths) %>%
mutate(value = ifelse(value == "NA" & before == after, before, value),
value = ifelse(value == "NA", NA, value)) %>%
select(value, lengths) %>%
unname() %>%
do.call(rep, .)})
# # A tibble: 12 x 4
# id var1 var2 var_new
# <int> <chr> <chr> <chr>
# 1 1 a NA <NA>
# 2 2 b A A
# 3 3 c A A
# 4 4 d NA A
# 5 5 e NA A
# 6 6 f A A
# 7 7 g A A
# 8 8 h NA <NA>
# 9 9 i NA <NA>
# 10 10 j B B
# 11 11 k B B
# 12 12 l NA <NA>
Explanation
Convert NA to "NA" (because rle treats every NA as a separate run instead of grouping consecutive NAs).
Create a run-length encoded representation of tmp.
Now you can look at the values before and after the relevant blocks.
Replace the values.
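The steps above can also be sketched in base R without the intermediate tibble, using inverse.rle to expand the runs again. This assumes, like the answer, that the data never contains the literal string "NA"; before, after and fill are illustrative names:

```r
# Sketch: the same four steps on a plain character vector.
var2 <- c(NA, rep("A", 2), rep(NA, 2), rep("A", 2), rep(NA, 2), rep("B", 2), NA)

tmp <- var2
tmp[is.na(tmp)] <- "NA"                  # 1. placeholder so NAs form runs
rl <- rle(tmp)                           # 2. run-length encoding

before <- c(NA, head(rl$values, -1))     # 3. value of the run before each run
after  <- c(tail(rl$values, -1), NA)     #    value of the run after each run

# 4. fill a placeholder run only when both neighbours exist and agree
fill <- rl$values == "NA" & !is.na(before) & !is.na(after) & before == after
rl$values[fill] <- before[fill]
rl$values[rl$values == "NA"] <- NA       # remaining placeholders become NA again

var_new <- inverse.rle(rl)               # expand the runs to full length
var_new
```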