Compare values in a grouped data frame with corresponding value in a vector - r

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find, within each group i, the first row where the value in column "u" is bigger than the corresponding value v[i] from the vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems that ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?

We can join the vector to the data by building a data.frame whose 'w' column holds the unique values of 'w' in 'q' and whose 'new' column holds 'v'. Then we group by 'w' and take the first row index where 'u' is greater than the corresponding 'new' value:
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# or use findInterval (requires u to be sorted in increasing order within each group)
#summarise(n = findInterval(new[1], u)+1)
Output:
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
Or use Map after splitting the data by the 'w' column:
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP mentioned that the comparison starts from the beginning of the data. That is not the case, because we have a group_by operation. If we create a sequence column, it resets at each group:
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
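As a side note on the ddply attempt in the question, one plausible reading of why every group returned 3 is that fun(q) receives the whole data frame rather than the current group, so "v" is recycled along all 15 rows of "u". A minimal base R sketch of that recycling (the names cmp and first_true are illustrative only, not from the original post):

```r
u <- rep(1:5, 3)
v <- c(2, 3, 1)

# v has length 3 and u has length 15, so v is silently recycled:
# u[1] is compared with v[1], u[2] with v[2], u[3] with v[3], u[4] with v[1], ...
cmp <- u > v

# Index of the first TRUE over the whole column, not per group
first_true <- which(cmp)[1]
first_true
```

Under this reading, each group gets the same answer because the same whole-column comparison is evaluated once per group.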

Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
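One difference between which.max(...) and which(...)[1] may be worth keeping in mind: when no value in a group exceeds the threshold, which.max on an all-FALSE vector returns 1, while which(...)[1] returns NA. A small base R sketch with a hypothetical threshold vector (the third value, 10, is made up to trigger the edge case):

```r
q <- data.frame(w = rep(c("a", "b", "c"), each = 5), u = rep(1:5, 3))
v <- c(2, 3, 10)  # hypothetical: no u in group "c" is greater than 10

groups <- split(q$u, q$w)

# which.max picks the first maximum; for an all-FALSE vector that is index 1
res_which_max <- mapply(function(x, y) which.max(x > y), groups, v)

# which(...)[1] yields NA when there is no TRUE at all
res_which <- mapply(function(x, y) which(x > y)[1], groups, v)

res_which_max
res_which
```

If a group can fail to contain a qualifying row, the NA from which(...)[1] is usually easier to detect than a misleading 1.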

Related

Which group meet the criterion a < b < c depending on condition

My title might not be very informative, but here is an example that illustrates my problem:
I have this dataframe :
df=data.frame(cond1=c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
group=c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
gene=c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
value=c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain, for each gene, is the number of distinct cond1 values for which the value of group F is smaller than the value of group V, which in turn is smaller than the value of group M.
In the first 3 lines, we are in gene A with cond1=1. The values corresponding to the groups are F=1, V=2, M=3, so F<V<M holds for gene A when cond1=1.
My expected output for gene A is 3, as all cond1 groups meet F<V<M for value.
My expected output for gene B is 1, as only the cond1=3 group meets F<V<M for value.
My desired output would ideally be a dataframe with gene and the count of cond1 values that meet my criterion:
gene count
1 A 3
2 B 1
I would be very grateful if you could give me any tips on how I should proceed.
Check if all the data is in increasing order and count how many such values exist for each gene.
library(dplyr)
df %>%
#If the data is not ordered, order it using arrange
#arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
group_by(gene, cond1) %>%
summarise(cond = all(diff(value) > 0)) %>%
summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1
Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1
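For readers without dplyr or data.table, the same idea can be sketched in base R with aggregate. This is only an illustration under the assumption that rows are first ordered F, V, M within each gene and cond1; the names inc and counts are not from the original answers:

```r
df <- data.frame(cond1 = rep(rep(1:3, each = 3), 2),
                 group = rep(c("F", "V", "M"), 6),
                 gene  = rep(c("A", "B"), each = 9),
                 value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3, 2, 4, 3, 2, 2, 3, 4))

# Make sure the rows are ordered F, V, M within each gene/cond1 combination
df <- df[order(df$gene, df$cond1, match(df$group, c("F", "V", "M"))), ]

# TRUE when the values are strictly increasing within a gene/cond1 combination
inc <- aggregate(value ~ gene + cond1, data = df,
                 FUN = function(x) all(diff(x) > 0))

# Count the TRUE values per gene
counts <- aggregate(value ~ gene, data = inc, FUN = sum)
names(counts)[2] <- "count"
counts
```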

Generate pairwise data.frame of all combinations of two data.frame with different number of rows

I have two dataframes a and b that I want to combine into a final dataframe c
a <- data.frame(city=c("a","b","c"),detail=c(1,2,3))
b <- data.frame(city=c("x","y"),detail=c(5,6))
the dataframe c should look like
city.a detail.a city.b detail.b
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
I think I could use crossing from tidyr but for crossing(a,b) I get:
error: Column names `city`, `detail` must not be duplicated.
Use .name_repair to specify repair.
Yes, crossing is the right function, but as the error message suggests, column names must not be duplicated. Try changing the column names first:
names(a) <- paste0(names(a), ".a")
names(b) <- paste0(names(b), ".b")
tidyr::crossing(a, b)
# city.a detail.a city.b detail.b
# <fct> <dbl> <fct> <dbl>
#1 a 1 x 5
#2 a 1 y 6
#3 b 2 x 5
#4 b 2 y 6
#5 c 3 x 5
#6 c 3 y 6
crossing is a wrapper over expand_grid so after correcting the names you can also use it directly.
tidyr::expand_grid(a, b)
Here is a base R solution by using rep() + cbind(), which gives duplicated column names:
C <- `row.names<-`(cbind(a[rep(seq(nrow(a)),each = nrow(b)),],b),NULL)
such that
> C
city detail city detail
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
Or get a data frame having different column names by using data.frame():
C <- data.frame(a[rep(seq(nrow(a)),each = nrow(b)),],b,row.names = NULL)
such that
> C
city detail city.1 detail.1
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
With base R, we can use merge
merge(setNames(a, paste0(names(a), ".a")), b)
# city.a detail.a city detail
#1 a 1 x 5
#2 b 2 x 5
#3 c 3 x 5
#4 a 1 y 6
#5 b 2 y 6
#6 c 3 y 6
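A variation on the merge approach, sketched here as an alternative rather than as part of the original answer: with by = NULL, merge returns the Cartesian product, and the suffixes argument renames the clashing columns, so no prior renaming is needed. The final ordering step is only there to match the row order asked for in the question.

```r
a <- data.frame(city = c("a", "b", "c"), detail = c(1, 2, 3))
b <- data.frame(city = c("x", "y"), detail = c(5, 6))

# by = NULL requests a cross join; suffixes disambiguate the shared column names
crossed <- merge(a, b, by = NULL, suffixes = c(".a", ".b"))

# Reorder so that each row of a is followed by its pairings with b
crossed <- crossed[order(crossed$city.a, crossed$city.b), ]
rownames(crossed) <- NULL
crossed
```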

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod over the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option is Reduce with accumulate = TRUE over the columns in reverse order:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
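For the follow-up about restricting the calculation to columns b and c, one possible sketch is to apply the same row-wise reverse cumprod to just that subset. The vector cols and the result column names are assumptions chosen for illustration:

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include in the reverse cumulative product; "a" is left out
cols <- c("b", "c")

# For each row: reverse, take the cumulative product, reverse back
rev_cumprod <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

x[paste0("result_", c("e", "f"))] <- rev_cumprod
x
```

Column "a" is untouched and does not contribute to either result column.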

R delete fathers row based on sons in hierarchical data

I'm working with some data like these:
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.frame(id,name)
data
> data
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
7 3 e
8 3 f
9 3 k
10 4 f
11 4 u
My goal is this: if a father has even one son that I do not want, remove all the rows with that father. For example, if I don't like the son e, the result should be:
> data_e
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Because ids 2 and 3 both have e among their names.
This could be also a task like " I do not like e and f together":
> data_eandf
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Or, "I don't want you if you have e or f":
> data_eorf
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
# 10 4 f
# 11 4 u
As you may have noticed, to make it clearer, I've "commented" the rows that must be deleted.
I've searched, but I've found a lot of questions based on only one column, like data[which(data$name=='e'),]. That only removes rows at the sons' level, not all the rows of the corresponding father. I've also thought of putting the data in wide format, pasting all the names of an id into a single cell, and checking whether e is present with a function like grepl(), but I think this could be a problem with a large dataset (these data are just an example).
Do you have any idea about how to manage this?
Thanks in advance
Here's a function to handle the different cases
dislike1 <- c('e')
dislike2 <- c('e', 'f')
myfun <- function(df, dislike, ops = NULL) {
require(dplyr)
if (is.null(ops) || ops == 'OR') {
df %>%
group_by(id) %>%
filter(!any(name %in% dislike)) %>%
ungroup
} else if (ops == 'AND') {
df %>%
group_by(id) %>%
filter(!all(dislike %in% name)) %>%
ungroup
}
}
myfun(data, dislike1)
# A tibble: 5 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 4 f
# 5 4 u
myfun(data, dislike2, 'AND')
# A tibble: 8 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 4 f
# 8 4 u
myfun(data, dislike2, 'OR')
# A tibble: 3 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
data[!(data$id %in% unique(data[data$name == 'e', 'id'])),]
unique(data[data$name == 'e', 'id']) will get the unique id's that have 'e' in the name field. Then you can use the %in% operator to find all the rows with those id's. The ! is a negation operator.
I have a data.table solution
require(data.table)
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.table(id,name)
# names to be deleted
to_del <- c("e","f")
# returns only id's without any of the names to be deleted
data[ , .SD[ !any(name %in% to_del) ,name ] , by = "id"]
id V1
1: 1 a
2: 1 b
3: 1 k
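The one-line base R answer above handles a single disliked name. A possible extension to the "e or f" and "e and f" cases, sketched under the same toy data (the names ids_any, ids_all, data_or and data_and are illustrative):

```r
data <- data.frame(id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
                   name = c("a", "b", "k", "b", "e", "g", "e", "f", "k", "f", "u"))
dislike <- c("e", "f")

# OR: ids that contain at least one disliked name
ids_any <- unique(data$id[data$name %in% dislike])
data_or <- data[!data$id %in% ids_any, ]

# AND: ids that contain every disliked name
ids_all <- Reduce(intersect,
                  lapply(dislike, function(d) data$id[data$name == d]))
data_and <- data[!data$id %in% ids_all, ]

data_or
data_and
```

Here ids_any is 2, 3, 4 and ids_all is 3, which matches the data_eorf and data_eandf examples in the question.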

understanding apply and outer function in R

Suppose I have data which looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period of variable B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if/else statements, it can be solved like the following:
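K <- L <- df[0, ]  # empty data frames with the same columns as df
for (i in unique(df$ID)) {
  sub <- df[df$ID == i, ]  # rows of the current ID, assumed ordered by B
  m <- 0  # number of changes in A within this ID
  for (j in seq_len(nrow(sub) - 1)) {
    if (sub$A[j] != sub$A[j + 1]) m <- m + 1
  }
  # no change in A: the ID goes to L; otherwise it goes to K
  if (m == 0) L <- rbind(L, sub) else K <- rbind(K, sub)
}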
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Could someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check whether A has any variance within each ID (that is, over each cycle of B) and negate the result with ! to get a logical. This requires converting A to a numeric value (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead). Then split accordingly. With base R we can use ave for such a task, for example:
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
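# var() does not work directly on a factor or character column, so use integer codes
setDT(df)[, indx := !var(as.integer(factor(A))), by = ID]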
split(df, df$indx)
Or dplyr
library(dplyr)
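df %>%
group_by(ID) %>%
# var() does not work directly on a factor or character column, so use integer codes
mutate(indx = !var(as.integer(factor(A)))) %>%
# refer to the new column through the piped data, not a global object
split(., .$indx)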
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all df$A grouped by df$ID, and apply the function on df$A within each groupings (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
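To go from those per-ID labels to the two data frames, one possible sketch is to look up each row's label by its ID and pass the result to split. The data frame below is retyped from the question, A is assumed to be a character column, and the names label and parts are illustrative:

```r
df <- data.frame(ID = rep(1:3, each = 4),
                 A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                 B  = rep(1:4, 3),
                 C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10))

# One label per ID: "K" if A changes within the ID, "L" otherwise
label <- tapply(df$A, df$ID,
                function(x) if (length(unique(x)) > 1) "K" else "L")

# Expand the per-ID labels to one label per row, then split on them
row_label <- as.vector(label[as.character(df$ID)])
parts <- split(df, row_label)

parts$K
parts$L
```

The result is a list with elements K and L, which avoids creating separate objects in the global environment.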
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset based on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts the '0' counts to FALSE and all other frequencies to TRUE. Taking rowSums then counts the distinct 'A' values per 'ID', and comparing that with 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx', use that to replace each 'ID' with TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need individual datasets in the global environment, we can rename the list elements to the object names we want and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)
