How to keep only rows that meet a specific condition in R

I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.
Here is a dummy data frame:
df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))
There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.
I want to leave only one row based on the following condition:
C > B > A
I want to leave the row with C first; if not, leave the row with B; If C and B are absent, leave the row with A.
Desired output:
output <- data.frame (ID = c("1", "2", "3", "4", "5"),
c1 = c("C", "C", "A", "B", "C"),
x1 = c(5, 1, 3, 2, 5),
x2 = c(7, 1, 2, 7, 0))

You can arrange the data in the required priority order and, for each ID, select its first row.
library(dplyr)
req_order <- c('C', 'B', 'A')
df %>%
arrange(ID, match(c1, req_order)) %>%
distinct(ID, .keep_all = TRUE)
# ID c1 x1 x2
# <chr> <chr> <dbl> <dbl>
#1 1 C 5 7
#2 2 C 1 1
#3 3 A 3 2
#4 4 B 2 7
#5 5 C 5 0
In base R, this can be written as:
df1 <- df[order(match(df$c1, req_order)), ]
df1[!duplicated(df1$ID), ]
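To see why sorting on match(c1, req_order) works, it can help to look at the ranks it produces. The following is a minimal base R sketch using the dummy data from the question; the names prio, best_prio and best are only illustrative. Instead of sorting, it keeps every row whose rank equals the best (smallest) rank within its ID.

```r
df <- data.frame(ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
                 c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
                 x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
                 x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0),
                 stringsAsFactors = FALSE)
req_order <- c("C", "B", "A")

# match() returns the position of each modality in req_order:
# C -> 1, B -> 2, A -> 3 (smaller means higher priority)
prio <- match(df$c1, req_order)

# ave() computes the smallest rank within each ID, repeated on every row of that ID
best_prio <- ave(prio, df$ID, FUN = min)

# Keep the rows that carry their ID's best rank
best <- df[prio == best_prio, ]
best
```

Unlike the distinct() and duplicated() versions, this variant keeps all tied rows if an ID has the same top modality more than once.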

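Here is another base R approach. It sorts c1 in decreasing alphabetical order, which works here only because the priority C > B > A happens to be reverse alphabetical order: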
df.srt <- df[order(df$c1, decreasing=TRUE), ]
df.spl <- split(df.srt, df.srt$ID)
first <- lapply(df.spl, head, n=1)
result <- do.call(rbind, first)
result
# ID c1 x1 x2
# 1 1 C 5 7
# 2 2 C 1 1
# 3 3 A 3 2
# 4 4 B 2 7
# 5 5 C 5 0
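Sorting c1 with decreasing = TRUE relies on the modality labels sorting alphabetically in priority order. If the real labels do not, one option is to turn the column into a factor whose levels encode the priority. The sketch below uses hypothetical modality names (MRI, CT, US) and an assumed priority MRI > CT > US purely for illustration; neither comes from the question.

```r
# Hypothetical data: modality names and their priority are assumptions
exams <- data.frame(ID = c("1", "1", "2", "2", "2"),
                    modality = c("CT", "MRI", "US", "CT", "US"),
                    stringsAsFactors = FALSE)
priority <- c("MRI", "CT", "US")   # highest priority first

# Factors sort by level order, not alphabetically
exams$modality <- factor(exams$modality, levels = priority)
sorted <- exams[order(exams$modality), ]

# After sorting, the first row seen for each ID is its highest-priority exam
picked <- sorted[!duplicated(sorted$ID), ]
picked <- picked[order(picked$ID), ]
picked
```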

Related

Ifelse with conditional on grouped data

This has got to be simple, but I'm stuck. I want to mutate some grouped data using a where statement within an ifelse statement. Here's an example that works:
example <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5,
"2", "C", 2
)
example %>% group_by(Group) %>%
mutate(GroupStatus=ifelse(Value[Code=="C"]==5, 1, 0))
This gives the desired result:
Group Code Value GroupStatus
<chr> <chr> <dbl> <dbl>
1 1 A 1 1
2 1 B 1 1
3 1 C 5 1
4 2 A 1 0
5 2 B 5 0
6 2 C 2 0
The problem is when one of the groups is missing Code C, as below:
example2 <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5
)
example2 %>% group_by(Group) %>%
mutate(GroupStatus=ifelse(Value[Code=="C"]==5, 1, 0))
This gives me an error: Error: Problem with mutate() column GroupStatus.
i GroupStatus = ifelse(Value[Code == "C"] == 5, 1, 0).
i GroupStatus must be size 2 or 1, not 0.
i The error occurred in group 2: Group = "2".
What I'd like is for "GroupStatus" in any group that is missing Code C to just be set to zero. Is that possible?
Another possible solution, based on a nested ifelse:
library(dplyr)
example2 <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5
)
example2 %>%
group_by(Group) %>%
mutate(GroupStatus = ifelse("C" %in% Code,
ifelse(Value[Code == "C"] == 5, 1, 0), 0)) %>%
ungroup
#> # A tibble: 5 × 4
#> Group Code Value GroupStatus
#> <chr> <chr> <dbl> <dbl>
#> 1 1 A 1 1
#> 2 1 B 1 1
#> 3 1 C 5 1
#> 4 2 A 1 0
#> 5 2 B 5 0
You really only have a single condition to check per group, so we can simplify to an any() instead of ifelse():
example2 %>%
group_by(Group) %>%
mutate(GroupStatus = as.integer(any(Value == 5 & Code == "C")))
# # A tibble: 5 × 4
# # Groups: Group [2]
# Group Code Value GroupStatus
# <chr> <chr> <dbl> <int>
# 1 1 A 1 1
# 2 1 B 1 1
# 3 1 C 5 1
# 4 2 A 1 0
# 5 2 B 5 0
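The error in the question comes from a zero-length vector: in a group without Code "C", Value[Code == "C"] selects nothing, so ifelse() returns nothing for mutate() to recycle. A small base R sketch of what happens inside group 2 (the variable names are only illustrative):

```r
# Group 2 from example2: it has no row with Code "C"
value <- c(1, 5)
code  <- c("A", "B")

picked <- value[code == "C"]                # numeric(0): nothing selected
status_ifelse <- ifelse(picked == 5, 1, 0)  # ifelse() on an empty test is also empty
length(status_ifelse)                       # 0, which is the size mutate() complains about

# any() over a condition that is never TRUE is FALSE, which becomes 0L
status_any <- as.integer(any(value == 5 & code == "C"))
status_any
```

This is why the any() version always yields exactly one value per group, whether or not the group contains Code "C".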

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
A join is another option. It also works when the drugs have unequal numbers of conditions:
library(dplyr)
df %>%
  filter(cond != 0) %>%
  right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
  select(-cond0)
This uses a revised df, with an extra drug D that has only the reference condition:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the code above:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You can also fill in the missing cond value if desired.
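For readers who prefer base R, the same join can be sketched with merge(). This is not part of the original answer; it assumes the revised df with drug D, and the names treated, reference and out are only illustrative.

```r
df <- data.frame(
  drug   = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
  cond   = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
  result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10),
  stringsAsFactors = FALSE
)

treated   <- df[df$cond != 0, ]                     # conditions 1 and 2
reference <- df[df$cond == 0, c("drug", "result")]  # condition 0 only

# all.y = TRUE keeps drugs that only have a reference row (like right_join);
# the suffixes rename the duplicated column to result / result0
out <- merge(treated, reference, by = "drug",
             suffixes = c("", "0"), all.y = TRUE)
out
```

Dropping cond from the reference table before merging avoids the extra cond0 column, so no select() step is needed.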
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
  # look up each drug's reference result (cond == 0) by drug name
  mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
  filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data:
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")

Count occurrences per entry in dataframe

I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this dataframe you can see that for some ids there are multiple banks, i.e. for id==1, bank=c(a,b,c).
The information I would like to extract from this dataframe is the overlap of ids between different banks, with counts.
For example, bank a has two persons (unique ids): 1 and 4. For these persons, I want to know which other banks they have:
For person 1: banks b and c
For person 4: bank c
The total number of other banks is 3, of which b = 1 and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
Took me a while to get a result, so I'm posting it. Not as sexy as Ronak Shah's, but it gives the same result (each pair of banks counted once).
library(data.table)  # rbindlist()
library(dplyr)       # %>%, group_by(), summarise()
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)
for(i in seq_along(dflist)) {
if(nrow(dflist[[i]]) < 2) {
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
result <- setNames(data.table(rbindlist(resultlist)), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that you have b for id==2 which is not overlapping with other values. If you don't want that in the final product, just apply na.omit() on the output.
An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
Do you need to cover both banks in both directions? Here a -> b is the same as b -> a. We can use combn to create all combinations of unique banks taken 2 at a time, then count the ids common to each pair.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
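The overlap counts discussed in the answers above can also be read off a co-occurrence matrix. The following base R sketch is an alternative not taken from the answers: it builds an id-by-bank incidence table and multiplies it with itself via crossprod(), so that each off-diagonal cell counts the ids shared by two banks. The names incidence, co and long are only illustrative.

```r
id   <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df   <- data.frame(id, bank, stringsAsFactors = FALSE)

# Rows are ids, columns are banks, cells are 0/1
incidence <- unclass(table(df$id, df$bank))

# t(incidence) %*% incidence: cell [x, y] counts the ids that have both bank x and bank y
co <- crossprod(incidence)

# Reshape to long form and drop the diagonal (a bank paired with itself)
long <- as.data.frame(as.table(co), stringsAsFactors = FALSE)
names(long) <- c("bank", "overlap", "amount")
long <- long[long$bank != long$overlap, ]
long <- long[order(long$bank, long$overlap), ]
long
```

The diagonal of co holds the number of ids per bank, which is why it is removed before building the overlap table.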

Subset of dataframe for which 2 variables match another dataframe in R

I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
Where the output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
merge(df1,
df2,
by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7
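The same filtering can be sketched in base R without a join, by building one key per row from the two matching columns. This is an illustrative alternative, not part of the answers above; key1, key2 and kept are hypothetical names, and the "|" separator assumes that character does not occur in ID.

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3), value = seq(4, 9),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1),
                  stringsAsFactors = FALSE)

# One key per row, combining ID and day
key1 <- paste(df1$ID, df1$day, sep = "|")
key2 <- paste(df2$ID, df2$day, sep = "|")

# Keep the rows of df1 whose (ID, day) combination appears in df2
kept <- df1[key1 %in% key2, ]
kept
```

Like semi_join(), this never duplicates rows of df1, even if df2 lists the same combination twice, whereas merge() would return one copy per matching row of df2.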

match rows across two columns

Given a data frame
df=data.frame(
E=c(1,1,2,1,3,2,2),
N=c(4,4,10,4,3,2,2)
)
I would like to create a third column that labels matching rows: rows match when they have the same value in E and also the same value in N, and each distinct match gets a new letter.
dfx=data.frame(
E=c(1,1,2,1,3,2,2,3, 2),
N=c(4,4,10,4,3,2,2,6, 10),
matched=c("A", "A", "B","A", NA, "C", "C", NA, "B")
)
Thanks!
Here, df is:
df <- structure(list(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2), N = c(4, 4,
10, 4, 3, 2, 2, 6, 10)), .Names = c("E", "N"), row.names = c(NA,
-9L), class = "data.frame")
You can do:
dfx <- transform(df, matched = {
  i <- as.character(interaction(df[c("E", "N")]))
  # reorder the counts by first appearance of each E.N combination
  tab <- table(i)[unique(i)]
  LETTERS[match(i, names(tab)[tab > 1])]
})
# E N matched
# 1 1 4 A
# 2 1 4 A
# 3 2 10 B
# 4 1 4 A
# 5 3 3 <NA>
# 6 2 2 C
# 7 2 2 C
# 8 3 6 <NA>
# 9 2 10 B
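A variant of the same idea can be sketched with duplicated() instead of table(). This is an illustrative alternative under the same assumptions as the answer above (the 9-row df, letters assigned in order of first appearance); key, repeated and matched are hypothetical names.

```r
df <- data.frame(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2),
                 N = c(4, 4, 10, 4, 3, 2, 2, 6, 10))

# One key per row, combining both columns
key <- paste(df$E, df$N)

# Keys that occur more than once, in order of first appearance
repeated <- unique(key[key %in% key[duplicated(key)]])

# Rows with a repeated key get a letter; all other rows get NA
matched <- LETTERS[match(key, repeated)]
dfx <- cbind(df, matched)
dfx
```

As with the LETTERS-based answer, this runs out of labels after 26 distinct matches.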
