I have the following data frame structure:
id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0
Here a, b, c and d are unique ids and status is a flag taking the values 0, 1 and 2.
I need to select every id whose status changed from 0 to 1 at any point during the whole time frame, so the expected output is the two ids 'b' and 'c'.
I thought of using lag to accomplish this, but then I won't be able to handle id 'c', which has a 0 at the beginning but only reaches 1 at a later stage. Any thoughts on how to achieve this using window functions (or any other technique)?
You want to find the ids that have a status of 1 after having had a status of 0.
Here is a dplyr solution:
library(dplyr)
# Generate data
mydf <- tibble(
id = c(rep("a", 3), rep("b", 4), rep("c", 4), rep("d", 3)),
status = c(1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 0, 2, 0)
)
mydf %>% group_by(id) %>%
# Keep only 0's and 1's
filter(status %in% c(0,1)) %>%
# Compute diff between two status
mutate(dif = status - lag(status, 1)) %>%
# A difference of 1 means a 0 => 1 transition
filter(dif == 1) %>%
# Catch corresponding id's
select(id) %>%
unique
One possible way using dplyr (Edited to include id only when a 1 appears after a 0):
library(dplyr)
df %>%
group_by(id) %>%
filter(status %in% c(0, 1)) %>%
filter(status == 0 & lead(status, default = 0) == 1) %>%
select(id) %>% unique()
#> # A tibble: 2 x 1
#> # Groups: id [2]
#> id
#> <chr>
#> 1 b
#> 2 c
Data
df <- read.table(text = "id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0", header = TRUE, stringsAsFactors = FALSE)
I don't know if this is the most efficient way, but: split by id, check the statuses for a 0, and if there is one, check for a 1 at any position after the first 0:
lst <- split(df$status, df$id)
f <- function(x) {
if (!any(x==0)) return(FALSE)
any(x[which.max(x==0):length(x)]==1)
}
names(lst)[(sapply(lst, f))]
# [1] "b" "c"
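The "a 1 somewhere after a 0" check can also be written with a running count of zeros, which avoids slicing by index. This is only a minimal base R sketch, not taken from the answers above; the names changed and ids are chosen for illustration, and the data is rebuilt inline so the block runs on its own.

```r
# Same data as in the question
df <- data.frame(
  id = c(rep("a", 3), rep("b", 4), rep("c", 4), rep("d", 3)),
  status = c(1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 0, 2, 0)
)

# cumsum(s == 0) > 0 is TRUE from the first 0 onwards, so a group qualifies
# when some row has status 1 at a position where a 0 has already been seen.
changed <- tapply(df$status, df$id, function(s) any(s == 1 & cumsum(s == 0) > 0))

ids <- names(changed)[changed]
ids
```

Because cumsum is a window-style function, the same condition should also work inside a grouped dplyr summarise or filter.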
I need to create multiple new data frames based on different filters, each containing two computed variables "d" and "e" based on the values in columns "a", "b" and "c". I have created a function for this that works as long as at least one column has a value. However, sometimes certain groups will have no answer for a, b or c. I want both d and e to return zero when this happens, but instead the result says "No data available in table". I've added my code below.
f_calculate_net = function(data)
{ data %>% mutate(a = ifelse("a" %in% colnames(data), a, 0)) %>%
mutate(b = ifelse("b" %in% colnames(data), b, 0)) %>%
mutate(c = ifelse("c" %in% colnames(data), c, 0)) %>%
mutate(d = ifelse(a + b + c == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b + c == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }
A sample of the data frame is:
wt   beet  ilo  age  country  ine  sex
647  a     3    19   1        24   1
875  b     3    18   1        27   2
647  c     1    24   1        3    2
875  b     3    20   1        27   2
435  b     2    66   4        31   1
643  a     1    32   3        5    1
496  b     2    47   2        1    2
511  c     2    23   4        2    1
774  a     2    37   5        5    1
550  b     1    24   1        1    2
I take the main dataset, apply a filter, and count the number of responses for the variable beet:
data2 <- df_beet %>% filter(age == 18 & sex == 1 & ilo == 2) %>%
count(beet, wt = wt) %>%
pivot_wider(names_from = beet, values_from = n) %>%
f_calculate_net()
There are no results: the resulting data frame has the columns d and e, but instead of zeros it shows "No data available".
Your main problem here is in the way you are using ifelse. The expression "a" %in% colnames(data) always returns a length-1 logical vector (either TRUE or FALSE). So the output of the expression ifelse("a" %in% colnames(data), a, 0) will also be of length 1. It will return either the first element of a or a single 0. Since this is inside a mutate call, a will either be overwritten by the first element of a, or will be created as a column of zeros. Instead of ifelse you should use
if(!"a" %in% colnames(data)) data$a <- 0
And the same for columns b and c.
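The length-1 behaviour of ifelse described above is easy to confirm in isolation. The following is a small illustrative sketch; the vector a and the result names are made up for the demonstration and are not part of the original function.

```r
a <- c(5, 6, 7)

# The result of ifelse() takes its length from the test, not from yes/no.
scalar_true  <- ifelse(TRUE, a, 0)          # length-1 test: only a[1] comes back
scalar_false <- ifelse(FALSE, a, 0)         # length-1 test: a single 0 comes back
full_length  <- ifelse(rep(TRUE, 3), a, 0)  # length-3 test: all of a comes back

scalar_true
scalar_false
full_length
```

This is why a plain if statement is the better fit when the condition is a single TRUE or FALSE about the whole data frame.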
You will sometimes still get a NaN entry in columns d and e if both a and b are 0 but c isn't, since your expression ((a/(a+b))*c)+a divides by the sum of a and b. You should only check whether a + b == 0, and return 0 in that case.
So the fixed function would be something like:
f_calculate_net = function(data) {
if(!"a" %in% colnames(data)) data$a <- 0
if(!"b" %in% colnames(data)) data$b <- 0
if(!"c" %in% colnames(data)) data$c <- 0
data %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e)
}
Let's create some random data to test this:
set.seed(123)
df <- data.frame(a = rpois(5, 1), b = rpois(5, 2), c = rpois(5, 1))
df
#> a b c
#> 1 0 0 3
#> 2 2 2 1
#> 3 1 4 1
#> 4 2 2 1
#> 5 3 2 0
And we see that we get the expected output:
f_calculate_net(df)
#> d e
#> 1 0.0 0.0
#> 2 2.5 2.5
#> 3 1.2 4.8
#> 4 2.5 2.5
#> 5 3.0 2.0
Created on 2022-08-15 by the reprex package (v2.0.1)
When a and b are both zero, a/(a+b) is 0/0, which is NaN. If you want this case to be zero, try changing a + b + c == 0 to (a + b) == 0.
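To see the difference between the two conditions, here is a minimal sketch on a single made-up case where a and b are zero but c is not. The names c_val, old_d and new_d are illustrative only.

```r
a <- 0
b <- 0
c_val <- 3

# Original guard: a + b + c is 3, so the division 0/0 is still carried out.
old_d <- ifelse(a + b + c_val == 0, 0, ((a / (a + b)) * c_val) + a)

# Revised guard: a + b is 0, so the result is set to 0 directly.
new_d <- ifelse(a + b == 0, 0, ((a / (a + b)) * c_val) + a)

old_d  # NaN
new_d  # 0
```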
Based on Allan's explanation and comment, another possibility is to make a logical vector of the same length as the number of rows:
f_calculate_net = function(data)
{ data %>%
mutate(a = ifelse(rep("a" %in% colnames(data), nrow(data)), a, 0)) %>%
mutate(b = ifelse(rep("b" %in% colnames(data), nrow(data)), b, 0)) %>%
mutate(c = ifelse(rep("c" %in% colnames(data), nrow(data)), c, 0)) %>%
mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
select(d,e) }
What's the easiest way to calculate row-wise sums? For example, what if I wanted to calculate the sum of all variables whose names contain "txt_"? (See the example below.)
df <- data.frame(var1 = c(1, 2, 3),
txt_1 = c(1, 1, 0),
txt_2 = c(1, 0, 0),
txt_3 = c(1, 0, 0))
base R
We can first use grepl to find the column names that contain txt_ (use the pattern "^txt_" to match only names that start with it), then use rowSums on that subset.
rowSums(df[, grepl("txt_", names(df))])
[1] 3 1 0
If you want to add it back to the original data frame, we can cbind the output to it.
cbind(df, sums = rowSums(df[, grepl("txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
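One thing to watch with the base R version: if only a single column matches, df[, cols] drops to a plain vector and rowSums fails. A hedged variant, assuming you want a strict prefix match, uses startsWith together with drop = FALSE. The names txt_cols and sums are illustrative.

```r
df <- data.frame(var1 = c(1, 2, 3),
                 txt_1 = c(1, 1, 0),
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0))

# TRUE only for names that begin with "txt_"
txt_cols <- startsWith(names(df), "txt_")

# drop = FALSE keeps a data frame even when exactly one column matches
sums <- rowSums(df[, txt_cols, drop = FALSE])

# Single-column case still works
one_col <- rowSums(df[, names(df) == "txt_1", drop = FALSE])
```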
Tidyverse
library(tidyverse)
df %>%
mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or if you want just the vector, then we can use pull:
df %>%
mutate(sum = rowSums(across(starts_with("txt_")))) %>%
pull(sum)
[1] 3 1 0
Data Table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[ ,sum := rowSums(.SD), .SDcols = grep("txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0
Another dplyr option:
df %>%
rowwise() %>%
mutate(sum = sum(c_across(starts_with("txt"))))
How to filter out rows in one column when condition is met in another column for different groups?
For example:
library(dplyr)
df1 <-tribble(
~group, ~var1, ~var2,
"a", 0, 0,
"a", 1, 0,
"a",1, 0,
"a",0, 1,
"a", 1, 0,
"b", 1, 0,
"b", 0, 1,
"b", 1, 0,
"b", 0, 1)
I want to allow ones in var1 only after having the first 1 in var2. Therefore, in this example, I would like to get:
group var1 var2
<chr> <dbl> <dbl>
a 0 0
a 0 1
a 1 0
b 0 1
b 1 0
b 0 1
I can identify from where I want to start filtering the data, but don't know exactly how to proceed:
df1 %>%
group_by(var2,group) %>%
mutate(test = case_when(row_number() == 1 & var2 == 1 ~ "exclude_previous_rows",
T ~ "n"))
I'm sure there is a simple way to do this with dplyr, but couldn't find it so far.
We can use a cumulative sum. I think this is what you want:
df1 %>%
group_by(group) %>%
filter(cumsum(var2 == 1) > 0)
# # A tibble: 5 x 3
# # Groups: group [2]
# group var1 var2
# <chr> <dbl> <dbl>
# 1 a 0 1
# 2 a 1 0
# 3 b 0 1
# 4 b 1 0
# 5 b 0 1
This will keep all rows including and after the first 1 in var2, by group. I'm not really sure what you mean by "I want to allow ones in var1" - your code seems to ignore var1, and mine follows suit.
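If the intent is the six-row output shown in the question (rows with var1 == 0 are always kept, and rows with var1 == 1 are kept only once a 1 has appeared in var2 within the group), one possible reading is sketched below in base R. This is an assumption about what was meant, not something the question confirms; seen_one and result are illustrative names.

```r
df1 <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  var1  = c(0, 1, 1, 0, 1, 1, 0, 1, 0),
  var2  = c(0, 0, 0, 1, 0, 0, 1, 0, 1)
)

# Per group: TRUE from the first row where var2 == 1 onwards
seen_one <- ave(as.integer(df1$var2 == 1), df1$group, FUN = cumsum) > 0

# Keep rows with var1 == 0, plus any row at or after the first var2 == 1
result <- df1[df1$var1 == 0 | seen_one, ]
result
```

In dplyr, the same condition would be filter(var1 == 0 | cumsum(var2 == 1) > 0) after group_by(group).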
An option using data.table
library(data.table)
setDT(df1)[df1[, .I[cumsum(var2 == 1) > 0], group]$V1]
I have a data frame about whether a patient meets the study criteria, and each row is a patient, each column is a criterion. So some columns are inclusion criteria and some columns are exclusion criteria, and I want to output the reasons for ineligibility. For example,
test <- data.frame(A = c(0, 0, 1),
B = c(0, 0, 0),
C = c(0, 1, 1),
D = c(1, 0, 0),
E = c(1, 0, 1))
where A, B, C are inclusion criteria and D, E are exclusion criteria, and I want to output the column names (could be more than one) if the inclusion criteria == 0 or exclusion criteria == 1.
The expected output would be
output <- data.frame(A = c(0, 0, 1),
B = c(0, 0, 0),
C = c(0, 1, 1),
D = c(1, 0, 0),
E = c(1, 0, 1),
failed_incl = c("A, B, C", "A, B", "B"),
failed_excl = c("D, E", "", "E"))
Is there a way to do it efficiently without having to write out every possible scenario? The actual data frame has many more columns.
There are multiple ways. One option is to use apply to loop over the rows (MARGIN = 1), take the names of the elements where the logical condition holds (e.g. x == 0), and paste them together:
test$failed_incl <- apply(test[1:3], 1, function(x) toString(names(x)[x == 0]))
test$failed_excl <- apply(test[4:5], 1, function(x) toString(names(x)[x == 1]))
Output:
test
# A B C D E failed_incl failed_excl
#1 0 0 0 1 1 A, B, C D, E
#2 0 0 1 0 0 A, B
#3 1 0 1 0 1 B E
Or using tidyverse
library(dplyr)
test %>%
rowwise %>%
mutate(failed_incl = toString(names(.)[which(c_across(A:C) == 0)]),
failed_excl = toString(c('D', 'E')[which(c_across(D:E) == 1)])) %>%
ungroup
# A tibble: 3 x 7
# A B C D E failed_incl failed_excl
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#1 0 0 0 1 1 A, B, C "D, E"
#2 0 0 1 0 0 A, B ""
#3 1 0 1 0 1 B "E"
Here's a tidyverse approach, which pivots the column names and then summarizes based on the inclusion/exclusion conditions:
library(dplyr)
library(tidyr)
failed_df <-
test %>%
# add_rownames() is deprecated; an integer key also keeps the rows in their original order
mutate(rowname = row_number()) %>%
pivot_longer(-rowname) %>%
group_by(rowname) %>%
summarise(failed_incl = paste(name[value == 0 & name %in% c("A", "B", "C")], collapse = ", "),
failed_excl = paste(name[value == 1 & name %in% c("D", "E")], collapse = ", ")) %>%
select(-rowname)
bind_cols(test, failed_df)
A B C D E failed_incl failed_excl
1 0 0 0 1 1 A, B, C D, E
2 0 0 1 0 0 A, B
3 1 0 1 0 1 B E
There may be a more elegant way to do this with rowwise and c_across.
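Since the real data has many more columns, it may help to pass the criteria as named vectors instead of positions. The following is a minimal base R sketch built on the apply idea above; the helper failed_names and its arguments are hypothetical names introduced for illustration.

```r
test <- data.frame(A = c(0, 0, 1),
                   B = c(0, 0, 0),
                   C = c(0, 1, 1),
                   D = c(1, 0, 0),
                   E = c(1, 0, 1))

# Hypothetical helper: for each row, list the columns in `cols`
# whose value equals `fail_value` ("" when none do).
failed_names <- function(data, cols, fail_value) {
  hits <- as.matrix(data[cols]) == fail_value
  unname(apply(hits, 1, function(hit) toString(cols[hit])))
}

incl_cols <- c("A", "B", "C")  # inclusion criteria fail when 0
excl_cols <- c("D", "E")       # exclusion criteria fail when 1

test$failed_incl <- failed_names(test, incl_cols, 0)
test$failed_excl <- failed_names(test, excl_cols, 1)
test
```

Only the two column-name vectors need to change when the list of criteria grows.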
I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0, but only if its SampleID is duplicated.
SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)
I want to delete rows with:
SampleID size
a 0
d 0
And keep:
SampleID size
a 1
b 1
b 2
b 3
c 0
d 1
e 0
Note: the actual dataset is very large, so I am not looking for a way to just remove a known row by row number.
In dplyr we can do this using group_by and filter:
library(dplyr)
data %>%
group_by(SampleID) %>%
filter(!(size == 0 & n() > 1)) # equivalently: filter(size != 0 | n() == 1)
#> # A tibble: 7 x 2
#> # Groups: SampleID [5]
#> SampleID size
#> <fct> <dbl>
#> 1 a 1
#> 2 b 1
#> 3 b 2
#> 4 b 3
#> 5 c 0
#> 6 d 1
#> 7 e 0
Using the data.table framework, first convert your data to a data.table:
require(data.table)
setDT(data)
Build a list of the ids that have at least one non-zero size, i.e. ids whose zero-size rows can be dropped:
dropable_ids = unique(data[size != 0, SampleID])
Finally, keep the rows that are either not in the droppable list or have a non-zero value:
data = data[!(SampleID %in% dropable_ids & size == 0), ]
Please note that !(a & b) is equivalent to !a | !b (De Morgan's law), so the same filter could also be written as data[!(SampleID %in% dropable_ids) | size != 0, ].
Hope it helps
A solution that works in base R without data.table and is easy to follow for R beginners:
#Find all duplicates
data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)
#Subset to relevant
data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)
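The helper columns above can be folded into a single logical vector if you prefer not to modify the data. This is a compact sketch of the same base R idea, with the sample data rebuilt inline; is_dup and kept are illustrative names.

```r
SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data <- data.frame(SampleID, size)

# TRUE for every row whose SampleID occurs more than once
is_dup <- duplicated(data$SampleID) | duplicated(data$SampleID, fromLast = TRUE)

# Drop only the rows that are both duplicated and have size 0
kept <- data[!(is_dup & data$size == 0), ]
kept
```

Note that, like the dplyr answer, this drops every zero-size row of a duplicated id, even when all of that id's rows have size 0.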