I would like to update one column based on two other columns.
My example dataframe contains 3 columns:
df <- data.frame(n1 = c(1,2,1,2,5,6),
n2 = c("a", "a", "a", NA, "b", "c"),
n3 = c("red", "red", NA, NA, NA, NA))
df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a <NA>
4 2 <NA> <NA>
5 5 b <NA>
6 6 c <NA>
I would like to add "red" to rows 3 and 4. The condition is that their n1 values (i.e. 1 and 2) match n1 values that occur together with n2 == "a", even though in the fourth row n2 itself does not match (it is NA).
The main point is: if n2 == "a", then every row whose n1 value is associated with "a" should get "red" in n3.
My desired output
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
Any suggestions for this case? I hope my explanation is clear enough. Since my data is very long, I am trying to find a good way to handle it.
In base R, create a logical vector to subset the rows of 'df' based on the unique values of 'n1' where 'n2' is "a", then assign the first non-NA element of 'n3' to those rows.
i1 <- with(df, n1 %in% unique(n1[n2 %in% 'a']))
df$n3[i1] <- na.omit(df$n3[i1])[1]
-output
> df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
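For reference, the logical index i1 from the example data evaluates to
i1
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
i.e. TRUE for every row whose n1 value (1 or 2) co-occurs with n2 == "a".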
Update:
df %>%
mutate(group = rep(row_number(), each=2, length.out = n())) %>%
group_by(group) %>%
mutate(n3 = ifelse(n1 %in% c(1,2) & any(n2 %in% "a", na.rm = TRUE), "red", n3)) %>%
ungroup() %>%
select(-group)
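As an aside on the update: the helper group column created by rep() simply pairs consecutive rows; for the six-row example it is
rep(1:6, each = 2, length.out = 6)
# [1] 1 1 2 2 3 3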
We could use an ifelse statement with conditions defined using any.
library(dplyr)
df %>%
mutate(n3 = ifelse(n1==1 | n1==2 & any(n2[3:4] %in% "a"), "red", n3))
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
library(dplyr)
library(tidyr)
df %>%
group_by(n1) %>%
fill(n3) %>%
group_by(n2) %>%
fill(n3)
# # A tibble: 6 × 3
# # Groups: n2 [4]
# n1 n2 n3
# <dbl> <chr> <chr>
# 1 1 a red
# 2 2 a red
# 3 1 a red
# 4 2 NA red
# 5 5 b NA
# 6 6 c NA
I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be less than or equal to 2 (<= 2); this means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
group_by(group) %>%
filter(abs(diff(sort(value, decreasing=T)[1:2])) <= 2) %>%
ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
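To illustrate the filter condition on group B (values 8, 2, 2, 3): the two largest values are 8 and 3, so the gap is 5 and the group is dropped.
abs(diff(sort(c(8, 2, 2, 3), decreasing = TRUE)[1:2]))
# [1] 5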
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
  abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
  list(df[df$group == g & v, ]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
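Here grp holds one logical per group, indicating whether the gap condition is met:
grp
#   group value
# 1     A  TRUE
# 2     B FALSE
# 3     C  TRUE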
A possibility would be to first create a vector of the groups that meet your condition and then filter the original data.frame. Here is how I thought about it:
library(dplyr)
group_to_keep <-
df %>%
group_by(group) %>%
slice_max(value, n = 2) %>%
filter(abs(diff(value)) <= 2) %>%
pull(group) %>%
unique()
df %>%
filter(group %in% group_to_keep)
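With the example data, group_to_keep evaluates to:
group_to_keep
# [1] "A" "C"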
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you are sure that you always have at least two values per group, you can use:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
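To see what the first helper does, applied to group B's values it returns the gap between the largest and second-largest value (the Inf padding means single-value groups return Inf and are dropped):
x <- c(8, 2, 2, 3)
diff(sort(c(-x, Inf)))[1]
# [1] 5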
Question updated 9/10!
DF <- data.frame(id = c(1,1,1,2,2,2), rank = c("1","2","3","1","2","3"), code = c("A","B","B","B","B","A"))
DF
  id rank code
1  1    1    A
2  1    2    B
3  1    3    B
4  2    1    B
5  2    2    B
6  2    3    A
Desired output:
  id rank code type1 type2 type3
1  1    1    A     A   MIX   MIX
2  1    2    B    NA   MIX   MIX
3  1    3    B    NA    NA   MIX
4  2    1    B     B     B   MIX
5  2    2    B    NA     B   MIX
6  2    3    A    NA    NA   MIX
Everything is grouped by id.
type1 gets the code where rank = 1.
type2 gets the code for ranks 1-2. If the code differs between ranks 1 and 2, then MIX.
type3 gets the code for ranks 1-3, etc.
Anyone? :)
Here's a dplyr solution using ifelse and a temporary column to reduce boilerplate:
library(dplyr)
DF %>%
group_by(id) %>%
mutate(a = code[rank == 1],
type1 = ifelse(rank > 1, NA,
ifelse(all(code[!(rank > 1)] == a[1]), a[1], "MIX")),
type2 = ifelse(rank > 2, NA,
ifelse(all(code[!(rank > 2)] == a[1]), a[1], "MIX")),
type3 = ifelse(rank > 3, NA,
ifelse(all(code[!(rank > 3)] == a[1]), a[1], "MIX"))) %>%
select(-a)
#> # A tibble: 6 x 6
#> # Groups: id [2]
#> id rank code type1 type2 type3
#> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 A A MIX MIX
#> 2 1 2 B NA MIX MIX
#> 3 1 3 B NA NA MIX
#> 4 2 1 B B B MIX
#> 5 2 2 B NA B MIX
#> 6 2 3 A NA NA MIX
Using dplyr with case_when statements:
DF %>%
group_by(id) %>%
mutate(type2_grp = if_else(rank <= 2, 1, 0),
type3_grp = if_else(rank <= 3, 1, 0)) %>%
mutate(type1 = case_when(rank == 1 ~ code)) %>%
group_by(id, type2_grp) %>%
mutate(type2 = case_when(type2_grp == 1 & length(unique(code)) > 1 ~ "MIX",
type2_grp == 1 & code == "A" ~ "A",
type2_grp == 1 & code == "B" ~ "B")) %>%
group_by(id, type3_grp) %>%
mutate(type3 = case_when(type3_grp == 1 & length(unique(code)) > 1 ~ "MIX",
type3_grp == 1 & code == "A" ~ "A",
type3_grp == 1 & code == "B" ~ "B")) %>%
ungroup() %>%
select(-type2_grp, -type3_grp)
Which creates:
# A tibble: 6 x 6
id rank code type1 type2 type3
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 1 A A MIX MIX
2 1 2 B NA MIX MIX
3 1 3 B NA NA MIX
4 2 1 B B B MIX
5 2 2 B NA B MIX
6 2 3 A NA NA MIX
A base R solution for an arbitrary number of "type" columns
maxtype=3
do.call(rbind,
by(DF,list(DF$id),function(x){
y=list()
for (i in 1:maxtype) {
tmp=rep(NA,nrow(x))
idx=as.numeric(x$rank)<=i
if (length(unique(x$code[idx]))==1) {
tmp[idx]=x$code[1]
} else {
tmp[idx]="MIX"
}
y[[paste0("type",i)]]=tmp
}
cbind(x,y)
})
)
id rank code type1 type2 type3
1.1 1 1 A A MIX MIX
1.2 1 2 B <NA> MIX MIX
1.3 1 3 B <NA> <NA> MIX
2.4 2 1 B B B MIX
2.5 2 2 B <NA> B MIX
2.6 2 3 A <NA> <NA> MIX
Assuming DF is sorted by id and then rank, the type columns for each id form an upper-triangular matrix of "MIX" whose top-left corner is overwritten by an upper-triangular matrix of the first code value, for as many rows as that code repeats at the start.
A data.table solution:
library(data.table)
DF <- data.frame(id=c(1,1,1,2,2,2),rank=c("1","2","3","1","2","3"),code=c("A","B","B","B","B","A"))
setDT(DF)[, `:=`(rank = factor(rank), code = factor(code))]
maxRank <- nlevels(DF$rank)
naLvl <- nlevels(DF$code) + 2L
mTri <- matrix(nlevels(DF$code) + 1L, nrow = maxRank, ncol = maxRank)
mTri[lower.tri(mTri)] <- naLvl
typeMat <- function(rank, code) {
firstrep <- rle(code)[[1]][1]
mSubTri <- matrix(naLvl, nrow = firstrep, ncol = firstrep)
mSubTri[upper.tri(mSubTri, diag = TRUE)] <- code[1]
mOut <- mTri
mOut[1:firstrep, 1:firstrep] <- mSubTri
return(mOut[rank,, drop = FALSE])
}
DF <- cbind(DF, as.data.table(do.call(rbind, DF[, (type = list(list(typeMat(as.integer(rank), as.integer(code))))), by = id]$V1)))
typeCols <- 4:(3 + maxRank)
DF[, (typeCols) := lapply(.SD, function(x) {factor(x, levels = 1:naLvl, labels = c(levels(code), "MIX", NA), exclude = NULL)}), .SDcols = typeCols]
setnames(DF, 4:(3 + maxRank), paste0("type", 1:maxRank))
> DF
id rank code type1 type2 type3
1: 1 1 A A MIX MIX
2: 1 2 B <NA> MIX MIX
3: 1 3 B <NA> <NA> MIX
4: 2 1 B B B MIX
5: 2 2 B <NA> B MIX
6: 2 3 A <NA> <NA> MIX
I have a dataframe
df <- data.frame(id1 = c("a" , "b", "b", "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "a", "e"),
n1 = c(2,2,2,3),
n2 = c(2,1,1,1),
n3 = c(0,1,1,3),
n4 = c(0,1,1,2))
I want to collapse the 2nd and 3rd rows into one. Afterwards, I will aggregate by column id3 for rows sharing the same character (i.e. "a").
My real dataframe is long and contains many different Latin names, so filtering by a specific name (i.e. "a") doesn't make sense in this case. I am thinking of collapsing rows with the condition id3 == id2, but I could not do it. Any suggestions for me?
My desired output looks like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
b a a 2 1 1 1
c NA e 3 1 3 2
# After that, it should be:
id1 id3 n1 n2 n3 n4
a a 4 3 1 1
c e 3 1 3 2
(I just updated the dataframe, sorry for my mistake)
We get the distinct rows to generate the first expected output:
library(dplyr)
df %>%
distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2
The final output we can get from the above, i.e. after the distinct step, by grouping by the coalesced 'id2'/'id1' along with 'id3' and then taking the sum of the numeric columns:
df %>%
distinct %>%
group_by(id1 = coalesce(id2, id1), id3) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')
-output
# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
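The key step is coalesce(), which takes id2 when it is not NA and falls back to id1 otherwise; for the three distinct rows:
coalesce(c(NA, "a", NA), c("a", "b", "c"))
# [1] "a" "a" "c"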
Here is a slightly different way using slice after group_by instead of distinct:
df %>%
group_by(id1, id3) %>%
dplyr::slice(1L) %>%
mutate(id1 = coalesce(id2,id1)) %>%
summarise(across(where(is.numeric), sum))
output:
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
I have a list of data frames, and I wish to mutate a new column with unique values for each element of the list. For example, I want to mutate a column named id with row numbers n >= 1.
Naturally, on a dataframe I would do this:
dat %>% mutate(id = row_number())
For a list, I would do this:
dat %>% map(~ mutate(., id = row_number()))
where dat is:
dat <- list(data.frame(x = c("a", "b", "c", "d", "e", "f", "g")),
            data.frame(y = c("p", "lk", "n", "m", "g", "f", "t")))
And I would get output like so:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 1
2 lk 2
3 n 3
4 m 4
5 g 5
6 f 6
7 t 7
Though, how would I mutate a new column id such that the row numbering continues from the first list element?
Expected output:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 8
2 lk 9
3 n 10
4 m 11
5 g 12
6 f 13
7 t 14
An option is to bind them into a single dataset, create the 'id' with row_number(), split by 'grp', loop over the list and remove any columns that have all NA values
library(dplyr)
library(purrr)
dat %>%
bind_rows(.id = 'grp') %>%
mutate(id = row_number()) %>%
group_split(grp) %>%
map(~ .x %>%
select(where(~ any(!is.na(.))), -grp))
-output
#[[1]]
# A tibble: 7 x 2
# x id
# <chr> <int>
#1 a 1
#2 b 2
#3 c 3
#4 d 4
#5 e 5
#6 f 6
#7 g 7
#[[2]]
# A tibble: 7 x 2
# y id
# <chr> <int>
#1 p 8
#2 lk 9
#3 n 10
#4 m 11
#5 g 12
#6 f 13
#7 t 14
Or an easier approach is to unlist (assuming single column), get the sequence, add a new column with map2
map2(dat, relist(seq_along(unlist(dat)), skeleton = dat),
~ .x %>% mutate(id = .y))
Or using a for loop
dat[[1]]$id <- seq_len(nrow(dat[[1]]))
for(i in seq_along(dat)[-1]) dat[[i]]$id <-
seq(tail(dat[[i-1]]$id, 1) + 1, length.out = nrow(dat[[i]]), by = 1)
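After running the loop, the id column of the second element continues from the first:
dat[[2]]$id
# [1]  8  9 10 11 12 13 14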
I am trying to fill in blank cells with the values from the rows above. This is similar to the na.locf function, but I have a pattern that needs to be matched, and I don't necessarily know how many rows there are between new values (i.e. between a,b and c,d).
I have used na.locf and searched around for a solution, to no avail.
df <- data.frame(col1 = c("a", "b", NA, NA, NA, NA, "c", "d", NA, NA))
df
# col1
# 1 a
# 2 b
# 3 <NA>
# 4 <NA>
# 5 <NA>
# 6 <NA>
# 7 c
# 8 d
# 9 <NA>
# 10 <NA>
Solution I would like:
df
col1
a
b
a
b
a
b
c
d
c
d
ave(df$col1,
with(rle(!is.na(df$col1)), rep(cumsum(values), lengths)),
FUN = function(x){
rep(x[!is.na(x)], length.out = length(x))
})
# [1] a b a b a b c d c d
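The grouping vector built with rle() puts each non-NA run together with the NA rows that follow it:
with(rle(!is.na(df$col1)), rep(cumsum(values), lengths))
# [1] 1 1 1 1 1 1 2 2 2 2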
Here's a way with dplyr. You can drop the group column if needed.
df %>%
group_by(group = cumsum(is.na(lag(col1)) & !is.na(col1))) %>%
mutate(
col1 = rep(col1[!is.na(col1)], length.out = n())
) %>%
ungroup()
# A tibble: 10 x 2
col1 group
<chr> <int>
1 a 1
2 b 1
3 a 1
4 b 1
5 a 1
6 b 1
7 c 2
8 d 2
9 c 2
10 d 2
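The grouping condition starts a new group whenever a non-NA value appears right after an NA (or at the start of the vector), which gives the same two blocks:
cumsum(is.na(dplyr::lag(df$col1)) & !is.na(df$col1))
# [1] 1 1 1 1 1 1 2 2 2 2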