How to extract rows with the same value in one column but positive and negative values in another - r

I am trying to get the rows that share the same value in one column but have both positive and negative values in another. The input is the data frame below:
data <- data.frame(X = c(1,3,5,7,7,8,9,10,10,11,11,12,12),
                   Y = sample(36476545:36476557),
                   timepoint = c(0,1,0,-0.31,1,1,1,1,-1,1,1,1,1))
Output looks something like this
X Y timepoint
4 7 36476557 -0.31
5 7 36476545 1.00
8 10 36476556 1.00
9 10 36476548 -1.00
I was looking at this link, but it is not what I am looking for.

After grouping by 'X', filter the groups that have both negative and positive 'timepoint' values: take the sign of 'timepoint' and keep the groups where the number of distinct signs (n_distinct) is 2 (assuming there are no zeros).
library(dplyr)
data %>%
group_by(X) %>%
filter(n_distinct(sign(timepoint)) == 2)
# A tibble: 4 x 3
# Groups: X [2]
# X Y timepoint
# <dbl> <int> <dbl>
#1 7 36476547 -0.31
#2 7 36476556 1
#3 10 36476549 1
#4 10 36476557 -1
NOTE: the 'Y' values differ because the example was created without set.seed.
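The n_distinct check works because sign() collapses each value to -1, 0, or 1; a group that contains zeros alongside only positive values would also show two distinct signs, which is why the no-zero assumption matters. A quick illustration (my own sketch, not part of the original answer):
sign(c(-0.31, 1, 0))
#[1] -1  1  0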
If there are zeros as well, check explicitly that both signs are present:
data %>%
group_by(X) %>%
filter(all(c(-1, 1) %in% sign(timepoint)))
Or using base R with ave
data[with(data, ave(sign(timepoint), X, FUN = function(x) length(unique(x))) == 2),]
Or another base R option with table
subset(data, X %in% names(which(rowSums(with(subset(data, timepoint != 0),
  table(X, sign(timepoint))) > 0) == 2)))

In base R, we can use ave and select groups where there is at least one timepoint value greater than 0 and one timepoint value less than 0.
data[with(data, ave(timepoint > 0, X, FUN = function(x) any(x) & any(!x))), ]
# X Y timepoint
#4 7 36476553 -0.31
#5 7 36476551 1.00
#8 10 36476556 1.00
#9 10 36476554 -1.00
In dplyr this would be
library(dplyr)
data %>%
group_by(X) %>%
filter(any(timepoint > 0) & any(timepoint < 0))

Related

grouped summarize still gives result for each individual row

I have the following data:
library(tidyverse)
df <- data.frame(id = c(1,1,1,2,2,2),
                 x = rep(letters[1:2], each = 3),
                 y = c(3,4,3,5,6,5),
                 z = c(7,8,9,10,11,12))
I now want to summarize the data by id so that I get the sum of z depending on the y values. The y condition itself depends on the value of x.
I thought I could use the code below, but this returns a row for every input row and doesn't summarize. The result is correct, but I still want to have one row per id.
df %>%
  group_by(id) %>%
  summarize(test = case_when(x == 'a' ~ sum(z[y == 3]),
                             x == 'b' ~ sum(z[y == 5])))
# A tibble: 6 x 2
# Groups: id [2]
id test
<dbl> <dbl>
1 1 16
2 1 16
3 1 16
4 2 22
5 2 22
6 2 22
The following works, but I don't understand why it does and the above code does not.
df %>%
  group_by(id) %>%
  summarize(test = case_when(all(x == 'a') ~ sum(z[y == 3]),
                             all(x == 'b') ~ sum(z[y == 5])))
# A tibble: 2 x 2
id test
<dbl> <dbl>
1 1 16
2 2 22
Also, is there a more straightforward way to do my summarization?
Because case_when, like ifelse(test, x, y), returns a vector of the same length as its conditions: within a group, x == 'a' has the group's length, whereas all(x == 'a') has length 1, so the returned value has length 1.
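A quick way to see this length behaviour outside of summarize (a minimal sketch, not part of the original answer):
library(dplyr)
x <- c('a', 'a', 'a')
# Vectorised conditions: one result per element of x
length(case_when(x == 'a' ~ 1, x == 'b' ~ 2))
#[1] 3
# Length-1 conditions wrapped in all(): a single result
length(case_when(all(x == 'a') ~ 1, all(x == 'b') ~ 2))
#[1] 1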

findInterval by group with dplyr [duplicate]

This question already has answers here: How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame (11 answers). Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                            })) %>%
  nest(bp = -gr) %>%
  mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a single group_by + mutate step:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied for each gr separately.
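To see that the breakpoints really are group-specific, you can compute the quartiles per group directly (a small check of my own, using the df defined in the question):
df %>%
  group_by(gr) %>%
  summarise(q25 = quantile(val, 0.25),
            q50 = quantile(val, 0.50),
            q75 = quantile(val, 0.75))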

Filter data.frame with multiple conditions without writing them out

How can I effectively filter a data.frame by multiple conditions without actually writing them out?
To make it clearer, let us look at the following small and simplified example, where one wishes to extract all integers from 1 to 100 that fall between either 1 and 2, 4 and 6, or 60 and 65:
df <- data.frame(number = 1:100, someothermeasure = rnorm(100))
filters <- matrix(c(1,2,4,6,60,65), ncol = 2, byrow = T)
I would like the same result as below, but without listing the individual conditions by hand:
dplyr::filter(df, (number >= filters[1,1] & number <= filters[1,2]) |
                  (number >= filters[2,1] & number <= filters[2,2]) |
                  (number >= filters[3,1] & number <= filters[3,2]))
Writing it out is possible only when one has a small number of conditions to filter over. But what should one do when the number of filter conditions, dim(filters)[1], is, for example, 10000? How can this situation be handled?
A dplyr solution with rowwise() and filter().
library(dplyr)
df %>%
rowwise() %>%
filter(any(number >= filters[, 1] & number <= filters[, 2])) %>%
ungroup()
Or you can use pmap_dfr() from purrr, which automatically combines all the filtered data by rows.
library(purrr)
pmap_dfr(as.data.frame(filters),
         ~ filter(df, number >= .x & number <= .y))
Both methods give
# # A tibble: 11 x 2
# number someothermeasure
# <int> <dbl>
# 1 1 -0.319
# 2 2 0.497
# 3 4 0.501
# 4 5 1.20
# 5 6 -0.741
# 6 60 0.954
# 7 61 1.59
# 8 62 1.10
# 9 63 0.348
# 10 64 0.242
# 11 65 -0.170
apply is a pretty good tool to apply a function multiple times:
apply(X = filters, MARGIN = 1, FUN = function(x, y) {
  y %>%
    dplyr::filter(number >= x[1] & number <= x[2])
}, y = df)
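Note that apply() here returns a list with one filtered data frame per row of filters. If you want a single combined data frame, as in the other answers, you can bind the pieces together; a minimal sketch:
res_list <- apply(X = filters, MARGIN = 1, FUN = function(x, y) {
  dplyr::filter(y, number >= x[1] & number <= x[2])
}, y = df)
do.call(rbind, res_list)  # one data frame with all matching rows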

rowwise correlation for specific columns

For all rows (Symbols) I would like to correlate the five columns F1_G-F5_G with the five columns F1_P-F5_P. This should give one correlation value for each Symbol.
Symbol F1_G F2_G F3_G F4_G F5_G F1_P F2_P F3_P F4_P F5_P
1 abca2 0.7696639 1.301428 0.8447565 0.6936672 0.6987410 9.873610 9.705205 8.044027 8.311364 9.961380
2 aco2 7.4274715 7.234892 7.8543164 8.0142983 8.1512194 9.620114 9.767721 7.607115 7.854960 9.472660
3 adat3 -2.0560126 -1.536868 -0.4181457 -1.1946602 -0.7707472 8.975871 8.645235 7.926262 7.432755 8.633583
4 adat3 -2.0560126 -1.536868 -0.4181457 -1.1946602 -0.7707472 9.620114 9.237699 7.162386 7.972086 8.872763
5 adnp 1.4228436 0.932214 0.8964153 0.8125162 0.9921002 9.177645 9.323443 8.507508 8.080413 8.633583
6 arhgap8 -2.6517712 -2.067164 -1.4918958 -2.6517712 -1.5474257 9.395681 8.861322 8.333381 8.038053 8.872763
I tried something like this, but it does not consider each row:
res <- outer(df[, c(2,3,4,5,6)], df[, c(7,8,9,10,11)], function(X, Y) {
  mapply(function(...) cor.test(..., na.action = "na.exclude")$estimate,
         X, Y)
})
out:
Symbol Cor
abca2 0.14
aco2 0.12
Since you want to do it row-wise, we can use apply with MARGIN = 1:
#Get column indices ending with G
g_cols <- grep("G$", names(df))
#Get column indices ending with P
p_cols <- grep("P$", names(df))
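# apply() converts the data frame to a character matrix (because of the
# character Symbol column), so each row arrives as a character vector and
# needs as.numeric() below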
apply(df, 1, function(x) cor.test(as.numeric(x[g_cols]),
                                  as.numeric(x[p_cols]),
                                  na.action = "na.exclude")$estimate)
# 1 2 3 4 5 6
# 0.21890 -0.52925 -0.52776 -0.82073 0.60473 -0.11785
A tidyverse approach would be
library(tidyverse)
df %>%
  mutate(row = row_number()) %>%
  select(-Symbol) %>%
  gather(key, value, -row) %>%
  group_by(row) %>%
  # match the gathered column names by pattern, since g_cols/p_cols above
  # hold numeric indices rather than names
  summarise(ans = cor.test(value[grepl("G$", key)], value[grepl("P$", key)],
                           na.action = "na.exclude")$estimate)
# row ans
# <int> <dbl>
#1 1 0.219
#2 2 -0.529
#3 3 -0.528
#4 4 -0.821
#5 5 0.605
#6 6 -0.118

Multiple values in one cell

I have data looking somewhat similar to this:
number type results
1 5 x, y, z
2 6 a
3 8 x
1 5 x, y
Basically, I have data in Excel that has commas in a couple of individual cells, and I need to count the values that are separated by commas after a certain requirement is met by subsetting.
Question: How do I obtain a total count of 5 when subsetting the data with number == 1 and type == 5 in R?
If we need the total count, then another option is str_count after subsetting
library(stringr)
with(df, sum(str_count(results[number==1 & type==5], "[a-z]"), na.rm = TRUE))
#[1] 5
Or with gregexpr from base R
with(df, sum(lengths(gregexpr("[a-z]", results[number==1 & type==5])), na.rm = TRUE))
#[1] 5
If there is no matching pattern for an element, use
with(df, sum(unlist(lapply(gregexpr("[a-z]", results[number==1 & type==5]),
                           `>`, 0)), na.rm = TRUE))
Here is an option using dplyr and tidyr: filter() keeps the rows that meet the conditions, separate_rows() splits the comma-separated values into separate rows, group_by() groups the data, and tally() counts the rows.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
group_by(results) %>%
tally()
# # A tibble: 3 x 2
# results n
# <chr> <int>
# 1 x 2
# 2 y 2
# 3 z 1
Or you can use count(results) only as the following code shows.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
count(results)
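If you only need the overall total of 5 rather than the per-value counts, you can sum the tallies from either version above (a small follow-up of my own):
sum(dt2$n)
#[1] 5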
DATA
dt <- read.table(text = "number type results
1 5 'x, y, z'
2 6 a
3 8 x
1 5 'x, y'",
header = TRUE, stringsAsFactors = FALSE)
Here is a method using base R. You split results on the commas, take the length of each split element, and then add these up, grouping by number.
aggregate(sapply(strsplit(df$results, ","), length), list(df$number), sum)
Group.1 x
1 1 5
2 2 1
3 3 1
Your data:
df = read.table(text="number type results
1 5 'x, y, z'
2 6 'a'
3 8 'x'
1 5 'x, y'",
header=TRUE, stringsAsFactors=FALSE)
