Which groups meet the criterion a < b < c depending on a condition - r

My title might not be very informative, but here is an example that illustrates my problem:
I have this data frame:
df <- data.frame(cond1 = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
                 group = c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
                 gene = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
                 value = c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain, for each gene, is a count of how many different cond1 values have their value for group F smaller than their value for group V, which in turn is smaller than their value for group M.
In the first 3 lines we are looking at gene A for cond1 = 1. The values corresponding to groups F, V and M are 1, 2 and 3, so F < V < M holds for gene A when cond1 = 1.
My expected output for gene A is 3, as all cond1 groups meet F < V < M for value.
My expected output for gene B is 1, as only the cond1 = 3 group meets F < V < M for value.
My desired output would ideally be a data frame with the gene and the count of cond1 values that meet my criterion:
gene count
1 A 3
2 B 1
I would be very grateful for any tips on how I should proceed.

Check whether, within each gene/cond1 combination, the values are in increasing order, then count how many such combinations exist for each gene.
library(dplyr)
df %>%
  # If the data is not ordered F, V, M within each cond1, order it first with arrange:
  # arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
  group_by(gene, cond1) %>%
  summarise(cond = all(diff(value) > 0)) %>%
  summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1

Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1
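For completeness, a base R sketch of the same idea (not from the original answers; like the solutions above, it assumes the rows are ordered F, V, M within each gene/cond1 combination):
# check the increasing-order condition per gene/cond1 combination
ok <- aggregate(value ~ gene + cond1, df, function(v) all(diff(v) > 0))
# count the combinations that satisfy it, per gene
aggregate(value ~ gene, ok, sum)
#  gene value
#1    A     3
#2    B     1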

Compare values in a grouped data frame with corresponding value in a vector

Let's say I have a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now, for each group [i], I want to find the first row where the value in column "u" is bigger than value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function(m) {
  first(which(m[, 2] > v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame whose 'w' column holds the unique values of 'w' from 'q', then group_by 'w' and get the first row index where 'u' is greater than the corresponding joined value.
library(dplyr)
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  summarise(n = which(u > new)[1])
  # or use findInterval
  # summarise(n = findInterval(new[1], u) + 1)
Output:
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
Or use Map after splitting the data by the 'w' column:
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
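If a data frame like the expected output is needed rather than a list, the result of the Map call can be collapsed; a minimal sketch (the res name is just for illustration):
res <- Map(function(x, y) which(x$u > y)[1], split(q, q$w), v)
data.frame(w = names(res), n = unlist(res), row.names = NULL)
#  w n
#1 a 3
#2 b 4
#3 c 2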
The OP mentioned that the comparison seems to start from the beginning of the data, but that is not the case here because of the group_by operation. If we create a sequence column, it resets within each group:
q %>%
  left_join(data.frame(w = unique(q$w), new = v)) %>%
  group_by(w) %>%
  mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2

Finding Average of multiple samples

I am trying to find the average of "Answer" for a given ID (1, 2, 3). I have created a subset of the data, called "LRi", that includes only students not in the lab ("N") and questions pertaining to lab "L". So I need to find a way to average the Answer values in the subset "LRi" for each ID number. I would also like to assign the result as a numeric vector.
ID StudentLab QuestionLab Question Answer
1 N L 1 4
2 N L 1 2
3 N L 1 3
1 N L 1 5
2 N L 1 1
3 N L 1 4
1 N L 1 7
2 N L 1 3
3 N L 1 5
Results
ID Answer
1 5.3
2 2
3 4
Group entries by ID and summarise Answers by calculating the average.
library(dplyr)
library(magrittr)
df %>% group_by(ID) %>% summarise(Answer = mean(Answer))
## A tibble: 3 x 2
# ID Answer
# <int> <dbl>
#1 1 5.33
#2 2 2.00
#3 3 4.00
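The question also asks for the result as a numeric vector; a minimal base R sketch of that step (assuming the filtered subset described in the question is stored in a data frame called LRi; answer_means is just an illustrative name):
answer_means <- tapply(LRi$Answer, LRi$ID, mean)
answer_means
#        1        2        3
#5.333333 2.000000 4.000000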

Get first and last value from groups using rle

I want to get the first and last value for groups, using grouping similar to what the rle() function does.
For example I have this data frame:
> df
df time
1 1 A
2 1 B
3 1 C
4 1 D
5 2 E
6 2 F
7 2 G
8 1 H
9 1 I
10 1 J
11 3 K
12 3 L
13 3 M
14 2 N
15 2 O
16 2 P
I want to get something like this:
> want
df first last
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
As you can see, I want to group my values the way the rle() function does: elements are grouped only when the same value appears in consecutive rows. group_by groups elements in a different way.
> rle(df$df)
Run Length Encoding
lengths: int [1:5] 4 3 3 3 3
values : num [1:5] 1 2 1 3 2
Is there a solution for my problem? Any advice will be appreciated.
There is a function rleid from data.table that does that job, i.e.
library(data.table)
setDT(df)[, .(df = head(df, 1),
              first = head(time, 1),
              last = tail(time, 1)),
          by = .(grp = rleid(df))][, grp := NULL][]
Which gives,
df first last
1: 1 A D
2: 2 E G
3: 1 H J
4: 3 K M
5: 2 N P
Adding a dplyr approach, as @RonakShah mentions:
library(dplyr)
df %>%
  group_by(grp = cumsum(c(0, diff(df)) != 0)) %>%
  summarise(df = first(df),
            first = first(time),
            last = last(time)) %>%
  select(-grp)
Giving,
# A tibble: 5 x 3
df first last
<int> <chr> <chr>
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
Here is an option using base R with rle. Once we run rle on the first column, we replicate the sequence of run values by the run lengths, use that to create logical indices with duplicated (from the front and from the back), and then subset the rows of the original dataset based on those indices.
rl <- rle(df[,1])
i1 <- rep(seq_along(rl$values), rl$lengths)
i2 <- !duplicated(i1)
i3 <- !duplicated(i1, fromLast = TRUE)
wanted <- data.frame(df = df[i2,1], first = df[i2,2], last = df[i3,2])
wanted
# df first last
#1 1 A D
#2 2 E G
#3 1 H J
#4 3 K M
#5 2 N P
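As a side note (not part of the original answers), recent versions of dplyr (1.1.0 or later) provide consecutive_id(), which behaves like data.table's rleid(), so the same grouping can also be written as:
library(dplyr)
df %>%
  group_by(grp = consecutive_id(df)) %>%
  summarise(df = first(df), first = first(time), last = last(time)) %>%
  select(-grp)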

match / find rows based on multiple required values in a single row in R

This must be a duplicate but I can't find it. So here goes.
I have a data.frame with two columns. One contains a group and the other contains a criterion. A group can contain many different criteria, but only one per row. I want to identify groups that contain three specific criteria (which will appear on different rows). In my case, I want to identify all groups that contain the criteria "I", "E", and "C". Groups may contain any number and combination of these and several other letters.
test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))
> test
grp val
1 1 C
2 1 I
3 2 E
4 2 I
5 2 C
6 3 E
7 3 I
8 3 A
9 4 C
10 4 I
11 4 E
12 4 E
13 4 A
In the above example, I want to identify grp 2, and 4 because each of these contains the letters E, I, and C.
Thanks!
Here's a dplyr solution. %in% is vectorized so c("E", "I", "C") %in% val returns a logical vector of length three. For the target groups, passing that vector to all() returns TRUE. That's our filter, and we run it within each group using group_by().
library(dplyr)
test %>%
  group_by(grp) %>%
  filter(all(c("E", "I", "C") %in% val))
# Source: local data frame [8 x 2]
# Groups: grp [2]
#
# grp val
# (dbl) (fctr)
# 1 2 E
# 2 2 I
# 3 2 C
# 4 4 C
# 5 4 I
# 6 4 E
# 7 4 E
# 8 4 A
Or if this output would be handier (thanks @Frank):
test %>%
  group_by(grp) %>%
  summarise(matching = all(c("E", "I", "C") %in% val))
# Source: local data frame [4 x 2]
#
# grp matching
# (dbl) (lgl)
# 1 1 FALSE
# 2 2 TRUE
# 3 3 FALSE
# 4 4 TRUE
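If only the matching group numbers are wanted, the summary can be filtered down; a minimal follow-up sketch:
test %>%
  group_by(grp) %>%
  summarise(matching = all(c("E", "I", "C") %in% val)) %>%
  filter(matching) %>%
  pull(grp)
#[1] 2 4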
library(data.table)
test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))
setDT(test) # convert the data.frame into a data.table
group.counts <- dcast(test, grp ~ val) # count number of same values per group and create one column per val with the count in the cell
group.counts[I>0 & E>0 & C>0,] # now filtering is easy
Results in:
grp A C E I
1: 2 0 1 1 1
2: 4 1 1 2 1
Instead of returning only the group numbers, you could also "join" the resulting group numbers with the original data to show the "raw" data rows of each matching group:
test[group.counts[I>0 & E>0 & C>0,], .SD, on="grp" ]
This shows:
grp val
1: 2 E
2: 2 I
3: 2 C
4: 4 C
5: 4 I
6: 4 E
7: 4 E
8: 4 A
PS: To make the solution easier to follow, here are the counts for all groups:
> group.counts
grp A C E I
1: 1 0 1 0 1
2: 2 0 1 1 1
3: 3 1 0 1 1
4: 4 1 1 2 1
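For reference, the same membership check can be sketched in base R (not from the original answers; has_all is just an illustrative name):
has_all <- tapply(test$val, test$grp, function(v) all(c("E", "I", "C") %in% v))
names(has_all)[has_all]
#[1] "2" "4"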

Determining if values of previous rows repeat in dataframe

I have some data organized like this:
set.seed(12)
ids <- matrix(replicate(1000, sample(LETTERS[1:4], 2)), ncol = 2, byrow = TRUE)
df <- data.frame(
  event = 1:100,
  id1 = ids[, 1],
  id2 = ids[, 2],
  grp = rep(1:10, each = 100), stringsAsFactors = FALSE)
head(df,10)
event id1 id2 grp
1 1 A C 1
2 2 D A 1
3 3 A D 1
4 4 A B 1
5 5 A D 1
6 6 B C 1
7 7 B D 1
8 8 B D 1
9 9 B D 1
10 10 C A 1
There are pairs of ids (id1 & id2). Within a row they are never the same. There is a variable called grp. There are 10 groups. Each group could be considered a separate sample of data. The event variable goes from 1-100 in each group.
The first question I have is quite straightforward. Within each group, for each row, is the combination of the two ids (id1-id2) the same as in the previous row, the reverse of the previous row, or neither of these two options? Obviously, if there is an A-C combination on row 100 of one group, I am not interested in whether it is reversed or repeated on row 1 of the following group.
This is my temporary solution:
library(dplyr)  # for lag()

#Give each id pair an identifier:
df$pair <- paste(pmin(df$id1, df$id2), pmax(df$id1, df$id2))
#For each grp, work out using `lag` if the previous row contains the same pair of ids, and if they are in the same or reversed order:
df.sp <- split(df, df$grp)
df$value <- unlist(lapply(df.sp, function(x)
  ifelse(x$pair != lag(x$pair), NA, ifelse(x$id1 == lag(x$id1), 1, 0))))
This gives:
head(df,10)
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D NA
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C NA
This works - showing 0 as a reversal, 1 as a copy and NA as neither.
The more complex question I am interested in is the following. Within each group (grp), for each row, find whether its combination of two ids (the pair) previously occurred in that grp. If it did, return whether the pair was in the same or reversed order the most recent previous time it occurred.
That result would look like this:
event id1 id2 grp pair value
1 1 A C 1 A C NA
2 2 D A 1 A D NA
3 3 A D 1 A D 0
4 4 A B 1 A B NA
5 5 A D 1 A D 1
6 6 B C 1 B C NA
7 7 B D 1 B D NA
8 8 B D 1 B D 1
9 9 B D 1 B D 1
10 10 C A 1 A C 0
E.g. row 10 is returned as 0 because the combination A-C previously occurred in the reverse order (row 1). On row 5, a 1 is returned because A-D previously occurred in the same order on row 3.
You're almost there! The second question is equivalent to the first question, just grouping by pair as well as group. I converted the code to dplyr (though I appreciate the spirit behind keeping the question in base). I also removed the second ifelse, replacing it with a numeric conversion of the logical, which should be more performant (and some will find easier to read).
df %>%
  group_by(grp) %>%
  mutate(
    pair = paste(pmin(id1, id2), pmax(id1, id2)),
    prev_row = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))
  ) %>%
  group_by(grp, pair) %>%
  mutate(prev_any = ifelse(pair != lag(pair), NA, as.numeric(id1 == lag(id1)))) %>%
  head(10)
# Source: local data frame [10 x 7]
# Groups: grp, pair [5]
#
# event id1 id2 grp pair prev_row prev_any
# (int) (chr) (chr) (int) (chr) (dbl) (dbl)
# 1 1 A C 1 A C NA NA
# 2 2 D A 1 A D NA NA
# 3 3 A D 1 A D 0 0
# 4 4 A B 1 A B NA NA
# 5 5 A D 1 A D NA 1
# 6 6 B C 1 B C NA NA
# 7 7 B D 1 B D NA NA
# 8 8 B D 1 B D 1 1
# 9 9 B D 1 B D 1 1
# 10 10 C A 1 A C NA 0
For such grouping, filtering and mutating tasks, I find dplyr very helpful. Here is one way to achieve your goal:
df %>%
  group_by(grp) %>%
  mutate(value = ifelse(id1 == lag(id1) & id2 == lag(id2), 1,
                        ifelse(id1 == lag(id2) & id2 == lag(id1), 0, NA)))
Within each group, you compare the ID values and conditionally assign a new value column. Hope this helps.
