Code  count  AA   BB  CC
101   1      No   NO  4
101   2      Yes  NO  5
101   3      Yes  NO  10
102   1      Yes  NO  7
102   2      Yes  NO  40
102   3      Yes  NO  6
102   4      No   NO  12
I want to keep only the rows that meet the following conditions:
If count is 1 (the minimum) within a Code, then AA should be "NO" and BB should be "NO".
For counts strictly between the minimum and maximum count within a Code, AA can be "NO" or "YES" and BB should be "NO".
For the maximum count within a Code, AA should be "NO" and BB should be "NO".
Code  count  AA   BB  CC
101   1      No   NO  4
101   2      Yes  NO  5
102   2      Yes  NO  40
102   3      Yes  NO  6
102   4      No   NO  12
Hi @Darren Tsai, whatever the case, if the count column is 1 then the row is deleted completely. Using your code I am getting the output below:
Code  count  AA   BB  CC
101   2      Yes  NO  5
102   2      Yes  NO  40
102   3      Yes  NO  6
102   4      No   NO  12
A dplyr solution:
library(dplyr)
df %>%
group_by(Code) %>%
mutate(flag = count %in% range(count)) %>%
filter(flag & if_all(c(AA, BB), ~ toupper(.x) == 'NO') | !flag & toupper(BB) == 'NO') %>%
ungroup() %>%
select(-flag)
# # A tibble: 5 × 5
# Code count AA BB CC
# <int> <int> <chr> <chr> <int>
# 1 101 1 No NO 4
# 2 101 2 Yes NO 5
# 3 102 2 Yes NO 40
# 4 102 3 Yes NO 6
# 5 102 4 No NO 12
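To see what the flag in the pipeline above does, here is a minimal base R sketch on the Code 102 group alone. The vector names (count, AA, BB, keep) are only for illustration; the logic mirrors the filter() condition rather than replacing it.

```r
# One group (Code 102) from the example data
count <- c(1L, 2L, 3L, 4L)
AA    <- c("Yes", "Yes", "Yes", "No")
BB    <- c("NO", "NO", "NO", "NO")

# range() returns c(min, max), so the flag marks the first and last count
flag <- count %in% range(count)

# Min/max rows need AA and BB both "NO"; rows in between only need BB "NO"
keep <- flag & toupper(AA) == "NO" & toupper(BB) == "NO" |
  !flag & toupper(BB) == "NO"

flag
keep
```

Here flag is TRUE for counts 1 and 4, and keep drops only the count 1 row, because its AA is "Yes".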
A base equivalent:
df |>
transform(flag = ave(count, Code, FUN = \(x) x %in% range(x))) |>
subset(flag & toupper(AA) == 'NO' & toupper(BB) == 'NO' | !flag & toupper(BB) == 'NO', -flag)
Data
df <- structure(list(Code = c(101L, 101L, 101L, 102L, 102L, 102L, 102L),
count = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), AA = c("No", "Yes",
"Yes", "Yes", "Yes", "Yes", "No"), BB = c("NO", "NO", "NO", "NO",
"NO", "NO", "NO"), CC = c(4L, 5L, 10L, 7L, 40L, 6L, 12L)), class = "data.frame", row.names = c(NA,-7L))
Update with another dataset
This dataset has 12 rows with 3 IDs: 8540, 2254, and 607. After running my code, the 2nd, 4th, and 12th rows are removed.
library(dplyr)
df2 <- structure(list(Unique_Id = c(8540, 8540, 2254, 2254, 607, 607, 607, 607, 607, 607, 607, 607),
AA = c("No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"),
count = c(1, 2, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8),
BB = c("No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L))
df2
# A tibble: 12 × 4
Unique_Id AA count BB
<dbl> <chr> <dbl> <chr>
1 8540 No 1 No
2 8540 Yes 2 Yes
3 2254 No 1 No
4 2254 No 2 Yes
5 607 No 1 No
6 607 No 2 No
7 607 No 3 No
8 607 No 4 No
9 607 No 5 No
10 607 No 6 No
11 607 No 7 No
12 607 No 8 Yes
df2 %>%
group_by(Unique_Id) %>%
mutate(flag = count %in% range(count)) %>%
filter(flag & if_all(c(AA, BB), ~ toupper(.x) == 'NO') | !flag & toupper(BB) == 'NO') %>%
ungroup() %>%
select(-flag)
# A tibble: 9 × 4
Unique_Id AA count BB
<dbl> <chr> <dbl> <chr>
1 8540 No 1 No
2 2254 No 1 No
3 607 No 1 No
4 607 No 2 No
5 607 No 3 No
6 607 No 4 No
7 607 No 5 No
8 607 No 6 No
9 607 No 7 No
Related
I am in need of a conditional way to lag back to the last row where the value is one number or "level" lower than the current row. Whenever type = "yes", I want to go back one level lower to the last "no" and get the quantity. For example, rows 2 and 3 here are type "yes" and level 5. In that case, I'd like to go back to the last level 4 "no" row, get the quantity, and assign it to a new column. When type is "no" no lagging needs to be done.
Data:
row_id level type quantity
1 4 no 100
2 5 yes 110
3 5 yes 115
4 2 no 500
5 2 no 375
6 3 yes 250
7 3 yes 260
8 3 yes 420
Desired output:
row_id level type quantity lagged_quantity
1 4 no 100 NA
2 5 yes 110 100
3 5 yes 115 100
4 2 no 500 NA
5 2 no 375 NA
6 3 yes 250 375
7 3 yes 260 375
8 3 yes 420 375
Data:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420)), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
Desired output:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420), lagged_quantity = c("NA", "100", "100", "NA", "NA",
"375", "375", "375")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
@Mossa
A direct solution would be:
library(dplyr)
library(tidyr)  # provides fill()
df1 %>%
  mutate(
    # increments each time level drops; not used by the steps below
    level_id = 1 + cumsum(c(1, diff(level)) < 0)
  ) %>%
  mutate(lagged_quantity = if_else(type == "yes", NA_real_, quantity)) %>%
  fill(lagged_quantity) %>%
  mutate(lagged_quantity = if_else(type == "no", NA_real_, lagged_quantity))
First we keep the quantity only on the "no" rows, then the missing entries are filled with the last known value, and finally the "no" rows, which need no lagging, are set back to NA.
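The same three steps can be sketched in base R without any packages. This is only an illustration: the intermediate names (kept, idx, filled) are made up here. Like the pipeline above, it takes the most recent "no" row and does not check that its level is exactly one lower.

```r
df1 <- data.frame(
  row_id   = 1:8,
  level    = c(4, 5, 5, 2, 2, 3, 3, 3),
  type     = c("no", "yes", "yes", "no", "no", "yes", "yes", "yes"),
  quantity = c(100, 110, 115, 500, 375, 250, 260, 420)
)

# Step 1: keep quantity only on "no" rows
kept <- ifelse(df1$type == "no", df1$quantity, NA_real_)

# Step 2: carry the last non-missing value forward
idx    <- cumsum(!is.na(kept))                  # how many "no" rows seen so far
filled <- c(NA_real_, kept[!is.na(kept)])[idx + 1]

# Step 3: blank out the "no" rows again
df1$lagged_quantity <- ifelse(df1$type == "yes", filled, NA_real_)
df1
```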
An option with data.table
library(data.table)
setDT(df1)[df1[, .(lagged_qty = last(quantity)), .(level, type)][,
lagged_qty := shift(lagged_qty), .(grp = cumsum(type == 'no'))],
lagged_qty := lagged_qty, on = .(level, type)]
-output
> df1
row_id level type quantity lagged_qty
<int> <int> <char> <int> <int>
1: 1 4 no 100 NA
2: 2 5 yes 110 100
3: 3 5 yes 115 100
4: 4 2 no 500 NA
5: 5 2 no 375 NA
6: 6 3 yes 250 375
7: 7 3 yes 260 375
8: 8 3 yes 420 375
I am working with a dataset where I need to evaluate hundreds of columns at a time to create new variables computed by row. I need three new variables. The first uses the "or" operator to decide whether there is any "yes" across the ~100 columns. The second counts how many "yes" values there are in total across the variables. The third is a constellation variable that shows the names of the variables with the "yes" value. All of this is by row. I have code for the first two, but I am stuck on the third. Also, I am using only a few variables for example purposes, but I actually have ~100 variables. My code is below:
#making the data - I am using actually ~100 variables
test.data <- data.frame(var1 = c("yes", "no", "no", "N/A", NA, NA),
var2 = c(NA, NA, "yes", "no", "yes", NA),
var3 = c("yes", "yes", "yes", "no", "yes", "N/A"),
var4 = c("N/A", "yes", "no", "no", "yes", NA))
# code for the first two variables: is.positive and number.pos - not elegant nor efficient since I need to work with ~100 vars
final.data <- data.frame(test.data %>%
mutate(is.positive = ifelse(var1=="yes" | var2=="yes" | var3=="yes" | var4=="yes", 1,
ifelse((is.na(var1) | var1=="N/A") &
(is.na(var2) | var2=="N/A") &
(is.na(var3) | var3=="N/A") &
(is.na(var4) | var4=="N/A"), NA, 0))) %>%
rowwise() %>%
mutate(number.pos = sum(c_across(c(var1, var2, var3, var4))=="yes",na.rm=TRUE)))
You could do it by making a list column for which ones are positive and then deriving the other values from that.
library(tidyverse)
test.data <- data.frame(var1 = c("yes", "no", "no", "N/A", NA, NA),
var2 = c(NA, NA, "yes", "no", "yes", NA),
var3 = c("yes", "yes", "yes", "no", "yes", "N/A"),
var4 = c("N/A", "yes", "no", "no", "yes", NA))
nv <- test.data %>%
select(var1:var4) %>%
names()
out <- test.data %>%
rowwise() %>%
mutate(which_pos = list(nv[which(c_across(var1:var4) == "yes")]),
num.positive = length(which_pos),
is.positive = num.positive > 0)
out
#> # A tibble: 6 × 7
#> # Rowwise:
#> var1 var2 var3 var4 which_pos num.positive is.positive
#> <chr> <chr> <chr> <chr> <list> <int> <lgl>
#> 1 yes <NA> yes N/A <chr [2]> 2 TRUE
#> 2 no <NA> yes yes <chr [2]> 2 TRUE
#> 3 no yes yes no <chr [2]> 2 TRUE
#> 4 N/A no no no <chr [0]> 0 FALSE
#> 5 <NA> yes yes yes <chr [3]> 3 TRUE
#> 6 <NA> <NA> N/A <NA> <chr [0]> 0 FALSE
out$which_pos
#> [[1]]
#> [1] "var1" "var3"
#>
#> [[2]]
#> [1] "var3" "var4"
#>
#> [[3]]
#> [1] "var2" "var3"
#>
#> [[4]]
#> character(0)
#>
#> [[5]]
#> [1] "var2" "var3" "var4"
#>
#> [[6]]
#> character(0)
Created on 2022-05-26 by the reprex package (v2.0.1)
If you wanted a normal column for the variable identifying which ones are positive, you could simply paste the names together to create a string that has comma-separated names:
library(tidyverse)
test.data <- data.frame(var1 = c("yes", "no", "no", "N/A", NA, NA),
var2 = c(NA, NA, "yes", "no", "yes", NA),
var3 = c("yes", "yes", "yes", "no", "yes", "N/A"),
var4 = c("N/A", "yes", "no", "no", "yes", NA))
nv <- test.data %>%
select(var1:var4) %>%
names()
out <- test.data %>%
rowwise() %>%
mutate(which_pos = paste(nv[which(c_across(var1:var4) == "yes")], collapse=","),
num.positive = sum(c_across(var1:var4) == "yes", na.rm=TRUE),
is.positive = num.positive > 0)
out
#> # A tibble: 6 × 7
#> # Rowwise:
#> var1 var2 var3 var4 which_pos num.positive is.positive
#> <chr> <chr> <chr> <chr> <chr> <int> <lgl>
#> 1 yes <NA> yes N/A "var1,var3" 2 TRUE
#> 2 no <NA> yes yes "var3,var4" 2 TRUE
#> 3 no yes yes no "var2,var3" 2 TRUE
#> 4 N/A no no no "" 0 FALSE
#> 5 <NA> yes yes yes "var2,var3,var4" 3 TRUE
#> 6 <NA> <NA> N/A <NA> "" 0 FALSE
The list column might be easier to use in subsequent analyses if needed, but the comma-separated variable may be easier to use for visual inspection.
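If both forms are wanted, one can be derived from the other. The following base R sketch uses a small hand-written list (which_pos here is a stand-in for the list column, not the actual tibble) to show the conversion in both directions.

```r
# Stand-in for the list column: names of the "yes" variables per row
which_pos <- list(c("var1", "var3"), character(0), c("var2", "var3", "var4"))

# Counts come straight from the list
n_pos <- lengths(which_pos)

# List -> comma-separated string (empty rows become "")
as_text <- vapply(which_pos, paste, character(1), collapse = ",")

# String -> list again (an empty string splits to character(0))
back <- strsplit(as_text, ",", fixed = TRUE)

n_pos
as_text
```

This round trip only works as long as no variable name contains a comma.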
Using Base R:
is.na(test.data) <- test.data == 'N/A'
idx <- test.data == 'yes'
test.data['num.positive'] <- rowSums(idx, na.rm = TRUE)
test.data['is.positive'] <- +(test.data[['num.positive']] > 0)
idx2 <- data.frame(which(idx, TRUE))
df1 <- aggregate(col~row, idx2, \(x)paste(names(test.data)[x], collapse = '-'))
df2 <- merge(cbind(test.data, row = seq(nrow(test.data))), df1, all.x =TRUE)
df2
row var1 var2 var3 var4 num.positive is.positive col
1 1 yes <NA> yes <NA> 2 1 var1-var3
2 2 no <NA> yes yes 2 1 var3-var4
3 3 no yes yes no 2 1 var2-var3
4 4 <NA> no no no 0 0 <NA>
5 5 <NA> yes yes yes 3 1 var2-var3-var4
6 6 <NA> <NA> <NA> <NA> 0 0 <NA>
I've the following table
S/N  Unique ID  Code
1    111        YES
2    111        YES
3    111        NO
4    111        YES
5    222        YES
6    222        YES
7    222        YES
8    222        YES
9    333        NO
10   333        NO
11   333        YES
12   333        YES
How do I derive the following table based on these conditions:
For each unique ID, if YES repeats, keep only the first YES. If NO appears, keep the next YES after it. I tried using mutate and it's giving me all sorts of errors.
S/N  Unique ID  Code
1    111        YES
4    111        YES
5    222        YES
11   333        YES
Thanks!
base R
ind <- ave(dat$Code == "YES", dat$`Unique ID`,
FUN = function(z) z & c(TRUE, !z[-length(z)]))
dat[ind,]
# S/N Unique ID Code
# 1 1 111 YES
# 4 4 111 YES
# 5 5 222 YES
# 11 11 333 YES
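To unpack the expression passed to FUN, here is a small sketch on the Code values of ID 111 only. The names code, z, prev_not_yes and keep are illustrative and not part of the answer.

```r
code <- c("YES", "YES", "NO", "YES")   # rows of Unique ID 111

z <- code == "YES"

# TRUE where the previous row was not YES; the first row has no
# predecessor, so it is treated as TRUE
prev_not_yes <- c(TRUE, !z[-length(z)])

# Keep a YES only if it starts a run of YES values
keep <- z & prev_not_yes
which(keep)
```

The kept positions are 1 and 4, which matches the rows for ID 111 in the output above.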
dplyr
library(dplyr)
dat %>%
group_by(`Unique ID`) %>%
filter(Code == "YES" & lag(Code == "NO", default = TRUE)) %>%
ungroup()
# # A tibble: 4 x 3
# `S/N` `Unique ID` Code
# <int> <int> <chr>
# 1 1 111 YES
# 2 4 111 YES
# 3 5 222 YES
# 4 11 333 YES
data.table
library(data.table)
as.data.table(dat)[, .SD[Code == "YES" & shift(Code == "NO", fill = TRUE),], by = `Unique ID`]
# Unique ID S/N Code
# <int> <int> <char>
# 1: 111 1 YES
# 2: 111 4 YES
# 3: 222 5 YES
# 4: 333 11 YES
Data
dat <- structure(list("S/N" = 1:12, "Unique ID" = c(111L, 111L, 111L, 111L, 222L, 222L, 222L, 222L, 333L, 333L, 333L, 333L), Code = c("YES", "YES", "NO", "YES", "YES", "YES", "YES", "YES", "NO", "NO", "YES", "YES")), class = "data.frame", row.names = c(NA, -12L))
Suppose I have a data frame (df) like this:
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1 10 5 10 5 10
2: Gen2 id2 1 2 3 4 5
3: Gen1 id3 10 5 10 5 10
4: Gen2 id4 1 2 3 4 5
5: Gen3 id5 7 7 7 7 7
For each 'Names', I would like to sum 'Thing' columns, and collapse the strings in 'ID':
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1|id3 20 10 20 10 20
2: Gen2 id2|id4 2 4 6 8 10
3: Gen3 id5 7 7 7 7 7
I am able to achieve this via dplyr:
df1 <- df %>%
group_by(Names)%>%
summarise_each(funs(paste(unique(.), collapse='|')),matches('^\\D+$'))
df2 <- df %>%
group_by(Names)%>%
summarise_each(funs(sum = sum(., na.rm=TRUE)), starts_with('Thing' ))
bind_cols(df1, df2[-1])
However, this solution takes very long since I have a data frame with more than 10k rows and more than 10k columns!
Is there any possible solution with data.table?
The closest I have gotten is this here:
> setDT(df)[, c(paste(df$ID,collapse = "-", sep = ""), lapply(.SD, sum, na.rm = TRUE)),
by = Names, .SDcols = !"ID"]
Names Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1-id2-id3-id4-id5 20 10 20 10 20
2: Gen2 id1-id2-id3-id4-id5 2 4 6 8 10
3: Gen3 id1-id2-id3-id4-id5 7 7 7 7 7
Obviously this is not what I am going for since it will collapse all IDs and not just the ones that were aggregated by summarizing via "Names".
I would very much appreciate your help!
Here is the example data:
df <- structure(list(Names = c("Gen1", "Gen2", "Gen1", "Gen2","Gen3"),
ID=c("id1","id2","id3","id4","id5"),
Thing1 = c(10L, 1L, 10L, 1L, 7L),
Thing2 = c(5L, 2L, 5L, 2L,7L),
Thing3 = c(10L, 3L, 10L, 3L, 7L),
Thing4 = c(5L, 4L, 5L,4L, 7L),
Thing5 = c(10L, 5L, 10L, 5L, 7L)),
.Names = c("Names","ID","Thing1", "Thing2", "Thing3", "Thing4", "Thing5"),
class = "data.frame", row.names = c(1:5L))
If you don't heavily rely on data.table you could use aggregate two times and merge the results.
merge(aggregate(.~Names, df[-2], sum), aggregate(ID ~ Names, df, paste, collapse="|"))
# Names Thing1 Thing2 Thing3 Thing4 Thing5 ID
# 1 Gen1 20 10 20 10 20 id1|id3
# 2 Gen2 2 4 6 8 10 id2|id4
# 3 Gen3 7 7 7 7 7 id5
Try it this way.
Using tidyverse:
library(tidyverse)
df %>%
  group_by(Names) %>%
  summarise(across(where(is.character), ~ str_c(.x, collapse = "|")),
            across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
# A tibble: 3 x 7
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
<chr> <chr> <int> <int> <int> <int> <int>
1 Gen1 id1|id3 20 10 20 10 20
2 Gen2 id2|id4 2 4 6 8 10
3 Gen3 id5 7 7 7 7 7
use data.table
library(data.table)
dt <- copy(df)
setDT(dt)
out_sum <- dt[, lapply(.SD, sum), by = Names, .SDcols=!"ID"]
out_id <- dt[, list(id = sapply(list(ID), paste0, collapse = "|")), by = Names]
merge(out_id, out_sum)
Names id Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1|id3 20 10 20 10 20
2: Gen2 id2|id4 2 4 6 8 10
3: Gen3 id5 7 7 7 7 7
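The two grouped calls and the merge can probably be folded into a single grouped call. The sketch below assumes the Thing columns can be selected with patterns("^Thing"); it has not been benchmarked on a 10k-column table. The key difference from the attempt in the question is that ID is referenced directly inside j, so only the current group's IDs are pasted, whereas df$ID always refers to the whole column.

```r
library(data.table)

dt <- data.table(
  Names  = c("Gen1", "Gen2", "Gen1", "Gen2", "Gen3"),
  ID     = c("id1", "id2", "id3", "id4", "id5"),
  Thing1 = c(10L, 1L, 10L, 1L, 7L),
  Thing2 = c(5L, 2L, 5L, 2L, 7L),
  Thing3 = c(10L, 3L, 10L, 3L, 7L),
  Thing4 = c(5L, 4L, 5L, 4L, 7L),
  Thing5 = c(10L, 5L, 10L, 5L, 7L)
)

# Collapse ID per group and sum every Thing column in one pass
out <- dt[, c(list(ID = paste(ID, collapse = "|")),
              lapply(.SD, sum, na.rm = TRUE)),
          by = Names, .SDcols = patterns("^Thing")]
out
```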
I have data like this:
g1 g2 var
1 a Yes
1 a No
1 a No
1 b Yes
1 b Yes
1 b Yes
2 a No
2 a No
2 a No
I would like to change all values in var to Yes if in each g1&g2 group, there is at least one Yes in var. I tried to use combinations of group_by and mutate, replace, ifelse with no success. Any help is appreciated.
We can use if/else instead of ifelse. Grouped by 'g1', 'g2', if 'Yes' is %in% 'var', then return "Yes" or else return 'var'
library(dplyr)
df1 %>%
group_by(g1, g2) %>%
mutate(var = if("Yes" %in% var) "Yes" else var)
# A tibble: 9 x 3
# Groups: g1, g2 [3]
# g1 g2 var
# <int> <chr> <chr>
#1 1 a Yes
#2 1 a Yes
#3 1 a Yes
#4 1 b Yes
#5 1 b Yes
#6 1 b Yes
#7 2 a No
#8 2 a No
#9 2 a No
Or with case_when
df1 %>%
group_by(g1, g2) %>%
mutate(var = case_when("Yes" %in% var ~ "Yes", TRUE ~ var))
data
df1 <- structure(list(g1 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), g2 = c("a",
"a", "a", "b", "b", "b", "a", "a", "a"), var = c("Yes", "No",
"No", "Yes", "Yes", "Yes", "No", "No", "No")), class = "data.frame",
row.names = c(NA, -9L))
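For comparison, the same group-wise rule can be sketched in base R with ave(). This is an alternative to the dplyr code, not part of it, and any_yes is a helper name chosen for this example.

```r
df1 <- data.frame(
  g1  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  g2  = c("a", "a", "a", "b", "b", "b", "a", "a", "a"),
  var = c("Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No")
)

# TRUE for every row whose (g1, g2) group contains at least one "Yes"
any_yes <- ave(df1$var == "Yes", df1$g1, df1$g2, FUN = any)

# Overwrite var only in those groups; other rows keep their value
df1$var <- ifelse(any_yes, "Yes", df1$var)
df1
```

Here ifelse() is safe because any_yes already has one value per row; the if/else form is needed in the grouped mutate() because the test there is a single TRUE or FALSE per group.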
You can also do:
df1 %>%
group_by(g1, g2) %>%
mutate(var = ifelse(any(var == "Yes"), "Yes", "No"))
g1 g2 var
<int> <chr> <chr>
1 1 a Yes
2 1 a Yes
3 1 a Yes
4 1 b Yes
5 1 b Yes
6 1 b Yes
7 2 a No
8 2 a No
9 2 a No
Here, if any value (per "g1" and "g2") in "var" is equal to Yes, it returns Yes, otherwise No.
An extra line of code from the above two solutions, but using ifelse or if_else by creating a new column then deleting and renaming:
library(tidyverse)
df1 %>%
group_by(g1, g2) %>%
mutate(var2 = if_else("Yes" %in% var, "Yes", "No")) %>%
select(-var, var = var2)
result:
g1 g2 var
<dbl> <chr> <chr>
1 1 a Yes
2 1 a Yes
3 1 a Yes
4 1 b Yes
5 1 b Yes
6 1 b Yes
7 2 a No
8 2 a No
9 2 a No
A way without case_when or if_else, just for fun:
df1 %>%
group_by(g1,g2) %>%
arrange (g1,g2,var) %>%
mutate(var=last(var))
# rows are sorted alphabetically by var within each group, so "Yes" (if present) sorts after "No" and last(var) returns it
g1 g2 var
<int> <chr> <chr>
1 1 a Yes
2 1 a Yes
3 1 a Yes
4 1 b Yes
5 1 b Yes
6 1 b Yes
7 2 a No
8 2 a No
9 2 a No