Count number of shared observations between samples using dplyr

I have a list of observations grouped by samples. I want to find the samples that share the most identical observations. An identical observation is one where both the start and end numbers match between two samples. I'd like to use R, preferably dplyr, to do this if possible.
I've been getting used to dplyr for simpler data handling, but this task is beyond what I can currently do. I suspect the solution involves grouping on start and end together, i.e. group_by(start, end), but I also need to keep the information about which sample each observation belongs to and compare between samples.
example:
sample start end
a 2 4
a 3 6
a 4 8
b 2 4
b 3 6
b 10 12
c 10 12
c 0 4
c 2 4
Here samples a, b and c share 1 observation (2, 4)
sample a and b share 2 observations (2 4, 3 6)
sample b and c share 2 observations (2 4, 10 12)
sample a and c share 1 observation (2 4)
I'd like an output like:
abc 1
ab 2
bc 2
ac 1
and also to see what the shared observations are if possible:
abc 2 4
ab 2 4
ab 3 6
etc
Thanks in advance

Here's something that should get you going:
df %>%
  group_by(start, end) %>%
  summarise(
    samples = paste(unique(sample), collapse = ""),
    n = length(unique(sample)))
# Source: local data frame [5 x 4]
# Groups: start [?]
#
# start end samples n
# <int> <int> <chr> <int>
# 1 0 4 c 1
# 2 2 4 abc 3
# 3 3 6 ab 2
# 4 4 8 a 1
# 5 10 12 bc 2
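From that summary you can read off the combinations shared by all samples (the "abc" rows) directly. If you also want the pairwise counts from the question's desired output (ab 2, bc 2, ac 1), one way is a self-join on (start, end). This is a sketch against the example df; pair and n_shared are just illustrative names, and recent dplyr versions may warn about the many-to-many join, which is expected here:
df %>%
  inner_join(df, by = c("start", "end")) %>%                    # pair up samples sharing an observation
  filter(as.character(sample.x) < as.character(sample.y)) %>%   # keep each pair once
  group_by(pair = paste0(sample.x, sample.y)) %>%
  summarise(n_shared = n())
To also list which observations each pair shares, keep start and end in the grouping instead of summarising them away.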

Here is an idea via base R:
grps <- Filter(nrow, split(df, list(df$start, df$end)))  # drop empty start/end combinations
final_d <- data.frame(count1 = sapply(grps, nrow),
                      pairs1 = sapply(grps, function(i) paste(i[[1]], collapse = '')))
# count1 pairs1
#0.4 1 c
#2.4 3 abc
#3.6 2 ab
#4.8 1 a
#10.12 2 bc


count unique combinations of variable values in an R dataframe column [duplicate]

I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
                 status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c","d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice total (id's 1 and 2).
These seem to be similar questions, but I couldn't work out how to apply them to my case:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse: group by 'id', paste the 'status' values together, then count the combinations.
library(dplyr)
library(stringr)
df %>%
  group_by(id) %>%
  summarise(status = str_c(status, collapse = "")) %>%
  count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
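Note that str_c(status, collapse = "") pastes the statuses in row order and keeps duplicates. If the data aren't guaranteed to be sorted and de-duplicated within each id, a more defensive variant (a sketch) is:
df %>%
  group_by(id) %>%
  summarise(status = str_c(sort(unique(status)), collapse = "")) %>%  # canonical order, no repeats
  count(status)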
Here is a base R option via aggregate
> aggregate(. ~ status, rev(aggregate(. ~ id, df, paste0, collapse = "")), length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can also get there with the apply family, using tapply and lapply together with table.
tap <- tapply(df$status, df$id, FUN = function(x) unique(x))    # statuses per id
lap <- lapply(tap, FUN = function(x) paste0(x, collapse = ""))  # collapse to one string
status <- unlist(lap)
df1 <- data.frame(table(status))                                # count each combination
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2
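Another common option, assuming you have data.table installed, is a sketch like this: aggregate per id, then count the resulting strings.
library(data.table)
setDT(df)[, .(status = paste(status, collapse = "")), by = id][, .N, by = status]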

Finding Average of multiple samples

I am trying to find the average of "Answer" for each ID (1, 2, 3). I have created a subset of the data, called "LRi", that includes only students not in the lab ("N") and questions pertaining to lab "L". I need a way to average the Answers in "LRi" for each ID number, and I would also like the result as a numeric vector.
ID StudentLab QuestionLab Question Answer
1 N L 1 4
2 N L 1 2
3 N L 1 3
1 N L 1 5
2 N L 1 1
3 N L 1 4
1 N L 1 7
2 N L 1 3
3 N L 1 5
Results
ID Answer
1 5.3
2 2
3 4
Group entries by ID and summarise Answers by calculating the average.
library(dplyr)
library(magrittr)
df %>% group_by(ID) %>% summarise(Answer = mean(Answer))
## A tibble: 3 x 2
# ID Answer
# <int> <dbl>
#1 1 5.33
#2 2 2.00
#3 3 4.00
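Since the question also asks for the result as a numeric vector, base R's tapply returns exactly that, a named numeric vector of group means (here assuming LRi is the subset described in the question):
avg <- tapply(LRi$Answer, LRi$ID, mean)
avg
#        1        2        3
# 5.333333 2.000000 4.000000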

R: Filtering by two columns using "is not equal" operator dplyr/subset

This question must have been answered before, but I cannot find it anywhere. I need to filter/subset a dataframe using values in two columns to remove rows. In the example I want to keep all the rows that are not equal (!=) to both replicate "1" and treatment "a". However, both the subset and filter functions remove all of replicate 1 and all of treatment a. I can solve it with which and then indexing, but that is not ideal with the pipe operator. Do you know why filter/subset don't remove rows only when both conditions are true?
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
#filtering data
> filter(df, replicate!=1, treatment!="a")
replicate treatment
1 2 b
2 3 b
3 2 b
4 3 b
> subset(df, (replicate!=1 & treatment!="a"))
replicate treatment
8 2 b
9 3 b
11 2 b
12 3 b
#solution by which - indexing
index = which(df$replicate==1 & df$treatment=="a")
> df[-index,]
replicate treatment
2 2 a
3 3 a
5 2 a
6 3 a
7 1 b
8 2 b
9 3 b
10 1 b
11 2 b
12 3 b
I think you're looking to use an "or" condition here. How does this look:
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
df %>%
filter(replicate != 1 | treatment != "a")
replicate treatment
1 2 a
2 3 a
3 2 a
4 3 a
5 1 b
6 2 b
7 3 b
8 1 b
9 2 b
10 3 b
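The underlying logic is De Morgan's law: filter(df, replicate != 1, treatment != "a") keeps rows where NOT replicate 1 AND NOT treatment a, whereas you want NOT (replicate 1 AND treatment a). You can also write that negation directly, which reads closer to the original intent:
df %>%
  filter(!(replicate == 1 & treatment == "a"))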

Delete the lower value in one column based on repeat values in another column in R (large data set)

I have a large data set loaded into R that contains multiple duplicates in one column (colA) and different unique values in another column (colB). I need to figure out a way to delete the lowest values in colB that correspond to the same value in colA.
For example,
A 1
A 2
A 3
B 8
B 9
B 10
should become
A 3
B 10
If this were something like Python, it would be an easy command to code, but I am new to R and greatly appreciate the help.
Here's a dplyr solution
d <- read.table(textConnection("A 1
A 2
A 3
B 8
B 9
B 10"))
library(dplyr)
d %>%
  group_by(V1) %>%
  summarize(max = max(V2))
# A tibble: 2 × 2
V1 max
<fctr> <int>
1 A 3
2 B 10
You can do this with aggregate (here the columns are assumed to be named A and B):
aggregate(df$B, list(df$A), max)
Group.1 x
1 A 3
2 B 10
library(plyr)
data<-data.frame("x"=c(rep("A",3),rep("B",3)),"y"=c(1:3,8:10))
ddply(data,~x,summarise,max=max(y))
x max
1 A 3
2 B 10
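If you need to keep the entire row rather than just the maximum value, slice_max() (available in dplyr >= 1.0.0) does that; a sketch using d from the first answer:
library(dplyr)
d %>%
  group_by(V1) %>%
  slice_max(V2, n = 1) %>%   # keeps the row(s) with the largest V2 per group
  ungroup()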

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series designated by 'time', if 'value' ever exceeds 3, I'd like to make all following entries also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapt cummax (cumulative maximum) to check whether we've already seen a 3.
dat$value2 <- with(dat, ave(value, ID, FUN =
  function(x) ifelse(cummax(x) >= 3, 3, x)))
dat
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN = cummax if you want never-decreasing values. I wasn't sure whether a sequence like c(1, 2, 1) should be kept unchanged or not.
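A dplyr version of the same cummax idea (a sketch, assuming rows are already ordered by time within each ID, as in the example):
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(value2 = ifelse(cummax(value) >= 3, 3, value)) %>%  # cap everything after the first 3
  ungroup()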
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.
