Can I slice a data frame using an ifelse function? - r

I have a big data frame, and for each id I only want a single row from it: the one where the condition x >= 4 is first met. However, only 43 of my 50 entries reach x >= 4. For the others, I want to take the highest value x reaches. So I want code that filters for x >= 4 and takes that row, unless 4 is never reached, in which case I want the tail end.
I currently have the following code, and I am not sure how to incorporate the ifelse statement:
selection_T01 <- df_T01 %>%
  group_by(id) %>%
  filter(X >= 0) %>%
  slice(1) %>%
  ungroup()

The idea is to create a separate condition column and use that as a grouping variable. I recommend the nest-map-unnest approach when dealing with groups of dataframes.
library(dplyr)
library(tidyr)
library(purrr)
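get_selection <- function(condition, df) {
  # condition is a single TRUE/FALSE per group, so a plain if/else
  # is the idiomatic way to choose between the two functions
  func_slice <- if (condition) slice_max else slice_min
  func_slice(df, Sepal.Length)
}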
selection <- iris |>
  mutate(
    condition = Sepal.Length > 6
  ) |>
  group_by(Species, condition) |>
  nest() |>
  mutate(
    selection = map2(condition, data, get_selection)
  ) |>
  select(-data) |>
  unnest(cols = c(selection))
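Applied to the data in the question, the same idea can also be written without nesting. The following is only a sketch: the toy `df_T01` (with columns `id` and `x`) is a made-up stand-in for the real data, and it assumes the intended rule is "the first row where x >= 4, otherwise the row with the largest x".

```r
library(dplyr)

# Hypothetical stand-in for the asker's df_T01; column names id and x are assumed
df_T01 <- tibble(
  id = c(1, 1, 1, 2, 2, 2),
  x  = c(2, 4.5, 6, 1, 3, 2)
)

selection_T01 <- df_T01 %>%
  group_by(id) %>%
  # Within each id: keep rows with x >= 4 if any exist,
  # otherwise keep the row(s) at the group maximum
  filter(if (any(x >= 4)) x >= 4 else x == max(x)) %>%
  slice(1) %>%   # first qualifying row per id
  ungroup()

print(selection_T01)
```

Here id 1 reaches the threshold, so its first row with x >= 4 is kept, while id 2 never does and falls back to its maximum.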

Related

How to get this function to work with the pipe in r?

I have created this function that quickly does some summarization operations (mean, median, geometric mean and arranges them in descending order). This is the function:
summarize_values <- function(tbl, variable){
  tbl %>%
    summarize(summarized_mean = mean({{variable}}),
              summarized_median = median({{variable}}),
              geom_mean = exp(mean(log({{variable}}))),
              n = n()) %>%
    arrange(desc(n))
}
I can do this and it works:
summarize_values(data, lifeExp)
However, I would like to be able to do this:
data %>%
  select(year, lifeExp) %>%
  summarize_values()
or something like this:
data %>%
  summarize_values(year, lifeExp)
What am I missing to make this work?
thanks
With the pipe, we don't need to specify the first argument (tbl); the left-hand side is passed in as that argument, so only the column has to be named:
library(dplyr)
data %>%
  summarize_values(lifeExp)
A reproducible example:
> mtcars %>%
    summarize_values(gear)
  summarized_mean summarized_median geom_mean  n
1          3.6875                 4  3.619405 32
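To make the equivalence concrete, here is a small sketch using `mtcars` in place of the asker's `data` (which is not shown). It repeats the function from the question and checks that the piped and unpiped calls return the same result. The `select()` step is included only to illustrate that selecting columns first does not remove the need to name the variable.

```r
library(dplyr)

# Same function as in the question
summarize_values <- function(tbl, variable) {
  tbl %>%
    summarize(
      summarized_mean   = mean({{ variable }}),
      summarized_median = median({{ variable }}),
      geom_mean         = exp(mean(log({{ variable }}))),
      n                 = n()
    ) %>%
    arrange(desc(n))
}

direct <- summarize_values(mtcars, gear)      # tbl given explicitly
piped  <- mtcars %>% summarize_values(gear)   # tbl supplied by the pipe

# Selecting columns first is fine, but the variable must still be named
narrow <- mtcars %>%
  select(mpg, gear) %>%
  summarize_values(gear)

all.equal(direct, piped)
```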

Filter a dataframe based on condition in columns selected by name pattern

I have a dataframe that contains several columns
# Temp Data
library(dplyr)
df <- as.data.frame(matrix(seq(1:40),ncol=6,nrow=40))
colnames(df) <- c("A_logFC", "B_logFC", "C_logFC", "A_qvalue", "B_qvalue", "C_qvalue")
I would like to keep only those rows that have a q-value below a threshold in all conditions (A, B, C).
I could do the obvious by filtering each column separately
df %>%
  filter(A_qvalue < 0.05 & B_qvalue < 0.05 & C_qvalue < 0.05)
but the real dataframe has 15 columns with q-values.
I also tried reshaping the dataframe (as found here)
df_ID = DEGs_df %>% mutate(ID = 1:n())
df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  filter(value < 0.05) %>%
  semi_join(df_ID)
but then I cannot filter for those rows whose q-value is below the threshold in all conditions.
Conceptually, it would be something like
df %>%
  filter(grep("q_value",.) < 0.05)
but this does not work either.
Any suggestions on how to solve that? Thank you in advance!
You can filter multiple columns at once using if_all:
library(dplyr)
df %>%
  filter(if_all(matches("_qvalue"), ~ . < 0.05))
In this case I use the filtering condition x < 0.05 on all columns whose name matches _qvalue.
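As a quick illustration of what `if_all()` does, and how it differs from its sibling `if_any()`, here is a sketch on a tiny made-up table (the `toy` data and its values are invented for this example, and `ends_with()` is used in place of `matches()` purely for readability):

```r
library(dplyr)

# Hypothetical toy data: two q-value columns plus one unrelated column
toy <- tibble(
  A_logFC  = c(1.2, -0.4, 2.0),
  A_qvalue = c(0.01, 0.20, 0.03),
  B_qvalue = c(0.04, 0.01, 0.30)
)

# Rows where every *_qvalue column is below the threshold
all_sig <- toy %>%
  filter(if_all(ends_with("_qvalue"), ~ .x < 0.05))

# Rows where at least one *_qvalue column is below the threshold
any_sig <- toy %>%
  filter(if_any(ends_with("_qvalue"), ~ .x < 0.05))

nrow(all_sig)
nrow(any_sig)
```

Only the first row passes in every q-value column, whereas each of the three rows passes in at least one.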
Your second approach can also work if you group by ID first and then use all inside filter:
df_ID = df %>% mutate(ID = 1:n())
df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  group_by(ID) %>%
  filter(all(value < 0.05)) %>%
  # keep the original (wide) rows of df_ID whose ID passed the filter
  semi_join(df_ID, ., by = "ID")

How to print a grouped_df grouped by two variables on two tables with dplyr in R

I want to group by two variables, compute a mean for the groups, then print the result on distinct tables.
Unlike the below where I get all my means in a single table, I would like one output table for x==1 and another one for x==2
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
              y=factor(sample(letters[1:2],10,rep=TRUE)),
              z=1:10)
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
print(res)
res %>% knitr::kable() %>% kableExtra::kable_styling()
You want separate outputs for when x==1 and x==2. A simple way with dplyr would be to filter:
library(dplyr)
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
              y=factor(sample(letters[1:2],10,rep=TRUE)),
              z=1:10)
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
x1 = res %>%
  filter(x == 1)
x2 = res %>%
  filter(x == 2)
x1 %>% knitr::kable() %>% kableExtra::kable_styling()
x2 %>% knitr::kable() %>% kableExtra::kable_styling()
I'm not sure why you have this line of code:
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
It doesn't create a new object, so its output won't be available to subsequent lines of code. If you did use it, it would give you the mean of z for each x value, without splitting by y.
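If x had many levels, writing one filter per level would get tedious. One possible alternative, sketched here with base R's `split()`, is to break the summary into a named list with one table per level of x. The toy data below is deterministic (no `sample()`) so the result is reproducible, and `print()` stands in for the `knitr::kable()` call.

```r
library(dplyr)

# Deterministic toy data with the same columns as in the question
data <- tibble(
  x = factor(rep(1:2, each = 5)),
  y = factor(rep(c("a", "b"), times = 5)),
  z = 1:10
)

res <- data %>%
  group_by(x, y) %>%
  summarize(Mean_z = mean(z), .groups = "drop")

# One table per level of x, stored in a named list
tables <- split(res, res$x)

# Print each table in turn; knitr::kable() could be used instead of print()
for (name in names(tables)) {
  cat("x ==", name, "\n")
  print(tables[[name]])
}
```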

r: combine filter with n_distinct in data frame

Simple question. Considering the data frame below, I want to count distinct IDs: once for all records and once after filtering on status. However, the %>% doesn't seem to work here. I just want a single value as output (so for total this should be 10, for closed it should be 5), not a data frame. Neither of the commented-out lines works.
dat <- data.frame(ID = as.factor(c(1:10)),
                  status = as.factor(rep(c("open","closed"))))
total <- n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(ID)
n_distinct() expects a vector as input, but you are passing a data frame. You can do:
library(dplyr)
dat %>%
  filter(status == "closed") %>%
  summarise(n = n_distinct(ID))
#   n
# 1 5
Or without using filter:
dat %>% summarise(n = n_distinct(ID[status == "closed"]))
You can add %>% pull(n) to above if you want a vector back and not a dataframe.
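Since the goal is a single number, another way to write this (a sketch, not the only option) is to pull the column out as a vector before counting, so that `n_distinct()` receives the vector it expects. The data frame is rebuilt here with an explicit `times = 5` so the example is self-contained.

```r
library(dplyr)

dat <- data.frame(
  ID     = as.factor(1:10),
  status = as.factor(rep(c("open", "closed"), times = 5))
)

closed <- dat %>%
  filter(status == "closed") %>%
  pull(ID) %>%     # extract the ID column as a plain vector
  n_distinct()     # count the distinct values in that vector

total <- n_distinct(dat$ID)

closed
total
```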
An option with data.table
library(data.table)
setDT(dat)[status == "closed"][, .(n = uniqueN(ID))]

R: Create new column based on a list of values from multiple columns

I want to create a new column (T/F) based on any value from a list being present in multiple columns. For this example I'm using mtcars, searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
  filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
  filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
  filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
  select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 rows with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
  filter_at(vars(mtcars_cols), any_vars(
    . %in% mtcars_numbs
  )
  )
# shows set with all 6 cyl, plus one 8 cyl with 19.2 mpg
out1 %>%
  select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as TRUE
# I tried to change == to %in% but that results in an error
out2 <- mtcars %>%
  mutate(
    myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
  )
# only 6 rows returned
out2 %>%
  filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to do the corresponding checks, it may be better to use map2
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
  mtcars %>%
    filter(!! rlang::sym(.x) == .y)) %>%
  distinct
NOTE: Comparing floating point numbers with == can get into trouble, because two values that print the same may differ slightly in precision, and the comparison then returns FALSE.
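A standard illustration of this pitfall, together with dplyr's `near()` as one tolerance-based alternative (the numbers are chosen only for demonstration):

```r
library(dplyr)

x <- 0.1 + 0.2

exact  <- x == 0.3        # exact comparison of two doubles
approx <- near(x, 0.3)    # comparison within a small tolerance

exact
approx
```

The exact comparison fails because 0.1 + 0.2 is not stored as exactly 0.3, while `near()` treats the two as equal.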
Also, note that == lines up as expected only when the lhs and rhs have the same length or the rhs is of length 1 (in which case it is recycled). If the rhs has length greater than 1 but not equal to the length of the lhs, it is recycled down the columns (column order), so its elements are matched against successive rows rather than against the corresponding columns.
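The effect is easier to see on a tiny made-up data frame (the names `small` and `vals` are invented for this sketch). Comparing against a length-2 vector recycles it down each column, whereas indexing the vector with `col()` lines each value up with its own column:

```r
# Hypothetical 4 x 2 data frame and one target value per column
small <- data.frame(a = c(1, 2, 1, 2), b = c(2, 1, 2, 1))
vals  <- c(1, 2)

# vals is recycled in column order: 1, 2, 1, 2 down column a,
# then 1, 2, 1, 2 down column b
naive <- small == vals

# col() gives the column index of every cell, so vals[col(small)]
# repeats vals[1] for all of column a and vals[2] for all of column b
aligned <- small == vals[col(small)]

naive
aligned
```

In `naive`, column a happens to match the alternating pattern in every row and column b in none, which is not the comparison that was intended. `aligned` marks the rows where a is 1 and where b is 2.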
We can replicate the values to make the lengths equal, and then it works:
mtcars %>%
  mutate(
    myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
  ) %>%
  pull(myset) %>%
  sum
#[1] 8
In the above code select is used twice for better understanding. Otherwise, we can also use rep:
mtcars %>%
  mutate(
    myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
  ) %>%
  pull(myset) %>%
  sum
#[1] 8
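For the "any value in any column" check that the original filter_at() call performs, a possible alternative in newer versions of dplyr is `if_any()` inside `mutate()`. This is a sketch that assumes a dplyr version providing `if_any()`. Unlike the column-by-column comparison above, it tests every listed value against every selected column.

```r
library(dplyr)

mtcars_cols  <- c("mpg", "cyl")
mtcars_numbs <- c(19.2, 6)

out3 <- mtcars %>%
  # TRUE when any of the selected columns holds any of the listed values
  mutate(myset = if_any(all_of(mtcars_cols), ~ .x %in% mtcars_numbs))

sum(out3$myset)
```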
