Finding Average of multiple samples - r

I am trying to find the average of "Answers" for a given ID (1,2,3). I have created a subset of data that includes only students not in the lab "N", and questions pertaining to lab "L" called "LRi". So I need to find a way to average of Answers for the subset data "LRi" for each ID number. I would also like to assign it as a numeric vector.
ID StudentLab QuestionLab Question Answer
1 N L 1 4
2 N L 1 2
3 N L 1 3
1 N L 1 5
2 N L 1 1
3 N L 1 4
1 N L 1 7
2 N L 1 3
3 N L 1 5
Results
ID Answer
1 5.3
2 2
3 4

Group entries by ID and summarise Answers by calculating the average.
library(dplyr)
library(magrittr)
df %>% group_by(ID) %>% summarise(Answer = mean(Answer))
## A tibble: 3 x 2
# ID Answer
# <int> <dbl>
#1 1 5.33
#2 2 2.00
#3 3 4.00

Related

R: splitting dataframe into distinct subgroups containing sequence of groups

This question is similar to one already answered: R: Splitting dataframe into subgroups consisting of every consecutive 2 groups
However, rather than splitting into subgroups that have a type in common, I need to split into subgroups that contain two consecutive types and are distinct. The groups in my actual data have differing numbers of rows as well.
df <- data.frame(ID=c('1','1','1','1','1','1','1'), Type=c('a','a','b','c','c','d','d'), value=c(10,2,5,3,7,3,9))
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
So subgroup 1 would be Type a and b:
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
And subgroup 2 would be Type c and d:
ID Type value
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
I have tried manipulating the code from this previous example, but I can't figure out how to make this happen without having overlapping Types in each group. Any help would be greatly appreciated - thanks!
EDIT: thanks for pointing out I didn't actually include the correct link.
We can do a little manipulation of a dense_rank of the Type variable to make an appropriate grouping variable:
library(dplyr)
df %>%
group_by(g = (dense_rank(match(Type, Type)) - 1) %/% 2) %>%
group_split()
# [[1]]
# # A tibble: 3 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 a 10 0
# 2 1 a 2 0
# 3 1 b 5 0
#
# [[2]]
# # A tibble: 4 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 c 3 1
# 2 1 c 7 1
# 3 1 d 3 1
# 4 1 d 9 1
Explanation: match(Type, Type) converts Type into integers ordered by number of appearance - but not dense. dense_rank() makes that dense (no gaps). We then subtract 1 to make it start at 0 and %/% 2 to see how many 2s go into it, effectively grouping by pairs.
Here is a rle way, written as a function. Pass the data.frame and the split column name as a character string.
df <- data.frame(ID=c('1','1','1','1','1','1','1'),
Type=c('a','a','b','c','c','d','d'),
value=c(10,2,5,3,7,3,9))
split_two <- function(x, col) {
r <- rle(x[[col]])
r$values[c(FALSE, TRUE)] <- r$values[c(TRUE, FALSE)]
split(x, inverse.rle(r))
}
split_two(df, "Type")
#> $a
#> ID Type value
#> 1 1 a 10
#> 2 1 a 2
#> 3 1 b 5
#>
#> $c
#> ID Type value
#> 4 1 c 3
#> 5 1 c 7
#> 6 1 d 3
#> 7 1 d 9
Created on 2023-02-09 with reprex v2.0.2

Which group meet the criterion a < b < c depending on condition

My title might not be very informative but this is an example which exposes my problem :
I have this dataframe :
df=data.frame(cond1=c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
group=c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
gene=c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
value=c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain is for each gene, the sum of how many different cond1 have their value corresponding with F group smaller than their value corresponding with V their value corresponding with M.
In the 3 first lines, we are in gene A for the cond1. value correspoding to group F=1, V=2, M=3. So F<V<M for the A gene for the cond1=1 group.
My expected output for the gene A is 3 as all cond1 groups meet F<V<M for value.
My expected output for the gene B is 1 as only cond1=3 group meet F<V<M for value.
My desired output would be ideally a dataframe with gene and the sum of cond1 than meet my criterion :
gene count
1 A 3
2 B 1
I would be very grateful if you could provide me any tips on how should I proceed
Check if all the data is in increasing order and count how many such values exist for each gene.
library(dplyr)
df %>%
#If the data is not ordered, order it using arrange
#arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
group_by(gene, cond1) %>%
summarise(cond = all(diff(value) > 0)) %>%
summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1
Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1

Finding the maximum value of a variable

I would like to find the maximum value of a variable (column) and then retain this value (the maximum value) and all values below it. Along with these values, I would like to retain the corresponding values from all other variables (columns) within the data frame. I want to exclude all values above this point from the data frame, for all variables within it. Included is the script for an example data frame (df), and an expected data frame (df2) i.e. what I am trying to achieve. I would be so grateful for some script to do this.
Ba <- c(1,1,1,2,2)
Sr <- c(1,1,1,2,2)
Mn <- c(1,1,2,1,1)
df <- data.frame(Ba, Sr, Mn)
df
# Ba Sr Mn
# 1 1 1 1
# 2 1 1 1
# 3 1 1 2
# 4 2 2 1
# 5 2 2 1
Showing 1 to 5 of 5 entries, 3 total columns
This is what I want to achieve in R:
Ba2 <- c(1,2,2)
Sr2 <- c(1,2,2)
Mn2 <- c(2,1,1)
df2 <- data.frame(Ba2, Sr2, Mn2)
df2
# Ba2 Sr2 Mn2
# 1 1 1 2
# 2 2 2 1
# 3 2 2 1
Showing 1 to 3 of 3 entries, 3 total columns
You can subset df with the sequence from min to nrow(df) of which.max per column:
df[min(sapply(df, which.max)):nrow(df),]
# Ba Sr Mn
#3 1 1 2
#4 2 2 1
#5 2 2 1
Does this work:
df[max(apply(df, 1, which.max)):nrow(df),]
Ba Sr Mn
3 1 1 2
4 2 2 1
5 2 2 1
Using cummax
library(dplyr)
library(purrr)
df %>%
filter(cummax(invoke(pmax, cur_data())) == max(cur_data()))
Ba Sr Mn
1 1 1 2
2 2 2 1
3 2 2 1

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value [i] from vector "v" is bigger than the value in column "u".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from 'w' column of 'q', then do a group_by 'w' and get the first row index where u is greater than the corresponding 'vector' column value
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
OP mentioned that comparison starts from the beginning and it is not correct because we have a group_by operation. If we create a column of sequence, it resets at each group
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2

Count number of shared observations between samples using dplyr

I have a list of observations grouped by samples. I want to find the samples that share the most identical observations. An identical observation is where the start and end number are both matching between two samples. I'd like to use R and preferably dplyr to do this if possible.
I've been getting used to using dplyr for simpler data handling but this task is beyond what I am currently able to do. I've been thinking the solution would involve grouping the start and end into a single variable: group_by(start,end) but I also need to keep the information about which sample each observation belongs to and compare between samples.
example:
sample start end
a 2 4
a 3 6
a 4 8
b 2 4
b 3 6
b 10 12
c 10 12
c 0 4
c 2 4
Here samples a, b and c share 1 observation (2, 4)
sample a and b share 2 observations (2 4, 3 6)
sample b and c share 2 observations (2 4, 10 12)
sample a and c share 1 observation (2 4)
I'd like an output like:
abc 1
ab 2
bc 2
ac 1
and also to see what the shared observations are if possible:
abc 2 4
ab 2 4
ab 3 6
etc
Thanks in advance
Here's something that should get you going:
df %>%
group_by(start, end) %>%
summarise(
samples = paste(unique(sample), collapse = ""),
n = length(unique(sample)))
# Source: local data frame [5 x 4]
# Groups: start [?]
#
# start end samples n
# <int> <int> <chr> <int>
# 1 0 4 c 1
# 2 2 4 abc 3
# 3 3 6 ab 2
# 4 4 8 a 1
# 5 10 12 bc 2
Here is an idea via base R,
final_d <- data.frame(count1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), nrow),
pairs1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), function(i) paste(i[[1]], collapse = '')))
# count1 pairs1
#0.4 1 c
#2.4 3 abc
#3.6 2 ab
#4.8 1 a
#10.12 2 bc

Resources