Use dplyr to substitute apply - r

I have table like this (but number of columns can be different, I have a number of pairs ref_* + alt_*):
+--------+-------+-------+-------+-------+
| GeneID | ref_a | alt_a | ref_b | alt_b |
+--------+-------+-------+-------+-------+
| a1     | 0     | 1     | 1     | 3     |
| a2     | 1     | 1     | 7     | 8     |
| a3     | 0     | 1     | 1     | 3     |
| a4     | 0     | 1     | 1     | 3     |
+--------+-------+-------+-------+-------+
and need to filter out rows that have ref_a + alt_a < 10 and ref_b + alt_b < 10. It's easy to do with apply by creating additional columns and filtering, but I'm learning to keep my data tidy, so I'm trying to do it with dplyr.
I would use mutate first to create columns with the sums and then filter by those sums, but I can't figure out how to use mutate in this case.
Edited:
Number of columns is not fixed!

You do not need to mutate here. Just do the following:
library(tidyverse)
df %>%
  filter(ref_a + alt_a < 10 & ref_b + alt_b < 10)
If you want to use mutate first you could go with:
df %>%
  mutate(sum1 = ref_a + alt_a, sum2 = ref_b + alt_b) %>%
  filter(sum1 < 10 & sum2 < 10)
Edit: The fact that we don't know the number of variables in advance makes it a bit more complicated. However, I think you could use the following code to perform this task (assuming that the variable names are all formatted with "_a", "_b" and so on). I hope there is a shorter way to perform this task :)
df$GeneID <- as.character(df$GeneID)

df %>%
  gather(variable, value, -GeneID) %>%
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%
  filter(sum < 10) %>%
  summarise(keepGeneID = ifelse(n() == (ncol(df) - 1)/2, TRUE, FALSE)) %>%
  filter(keepGeneID == TRUE) %>%
  select(GeneID) -> ids

df %>%
  filter(GeneID %in% ids$GeneID)
Edit 2: After some rework I was able to improve the code a bit:
df$GeneID <- as.character(df$GeneID)

df %>%
  gather(variable, value, -GeneID) %>%
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%
  group_by(GeneID) %>%
  summarise(max = max(sum)) %>%
  filter(max < 10) -> ids

df %>%
  filter(GeneID %in% ids$GeneID)
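Since tidyr 1.0, pivot_longer() with a ".value" spec can reshape all the pairs at once, which shortens the variable-column version considerably. A sketch, assuming every non-ID column follows the ref_x / alt_x naming pattern and that rows should be kept when every pair sums to less than 10:

```r
library(dplyr)
library(tidyr)

ids <- df %>%
  # split ref_a, alt_a, ref_b, ... into ref/alt columns plus a pair key
  pivot_longer(-GeneID,
               names_to = c(".value", "pair"),
               names_sep = "_") %>%
  group_by(GeneID) %>%
  # keep genes where every ref + alt pair stays below 10
  filter(all(ref + alt < 10)) %>%
  distinct(GeneID)

df %>%
  filter(GeneID %in% ids$GeneID)
```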

Related

Pass a variable into a filter - R dplyr

Here is a sample of the dataset that I have. I am looking to find the state that has the maximum number of stores (in this case, CA) and also to see how many IDs come from that state.
| ID  | State | Stores |
| --- | ----- | ------ |
| a11 | CA    | 16585  |
| a12 | CA    | 45552  |
| a13 | AK    | 7811   |
| a14 | MA    | 4221   |
I have this code using dplyr:
max_state <- df %>%
  group_by(State) %>%
  summarise(total_stores = sum(Stores)) %>%
  top_n(1) %>%
  select(State)
This gives me "CA".
Can I use this variable max_state to pass through a filter and use summarise(n()) to count the number of IDs for CA?
A few ways:
# this takes your max_state (CA) and brings in the parts of
# your original table that have the same State
max_state %>%
  left_join(df) %>%
  summarize(n = n())

# filter the State in df to match the State in max_state
df %>%
  filter(State == max_state$State) %>%
  summarize(n = n())

# Add Stores_total for each State, only keep the State rows which
# match that of the max State, and count the # of IDs therein
df %>%
  group_by(State) %>%
  mutate(Stores_total = sum(Stores)) %>%
  filter(Stores_total == max(Stores_total)) %>%
  count(ID)
You can combine more operations into one summarize call that will be applied to the same group:
df |>
  group_by(State) |>
  summarize(gsum = sum(Stores), nids = n()) |>
  filter(gsum == max(gsum))
#> # A tibble: 1 × 3
#>   State  gsum  nids
#>   <chr> <dbl> <int>
#> 1 CA    62137     2
Where the dataset df is obtained by:
df <- data.frame(ID = c("a11", "a12", "a13", "a14"),
                 State = c("CA", "CA", "AK", "MA"),
                 Stores = c(16585, 45552, 7811, 4221))
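In dplyr 1.0 and later, slice_max() is the documented replacement for top_n() and avoids the filter(gsum == max(gsum)) step; a sketch on the same sample data:

```r
library(dplyr)

df <- data.frame(ID = c("a11", "a12", "a13", "a14"),
                 State = c("CA", "CA", "AK", "MA"),
                 Stores = c(16585, 45552, 7811, 4221))

df %>%
  group_by(State) %>%
  summarize(gsum = sum(Stores), nids = n()) %>%
  slice_max(gsum, n = 1)  # keep the state(s) with the largest total
```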

Select and add in columns in R

How can I select other columns in the sf_MX data frame to add to sumbyweek? I am stuck.
sumbyweek <- sf_MX %>%
  filter(CVE_ENT %in% c("09", "15", "17")) %>%
  group_by(CVE_ENT) %>%
  summarise(across(starts_with('cumul')[13:32],
                   sum, na.rm = TRUE, .names = '{col}_total')) %>%
  select(Col1, col2) # unable to get the ideal result
sf_MX data table:
Col1 | Col2 | Col3 | Cumul1 | Cumul2 | Cumul3 ...
Expected result:
Col1 | Col2 | Cumul1_total | Cumul2_total | Cumul3_total
We could do
library(dplyr)
sumbyweek <- sf_MX %>%
  filter(CVE_ENT %in% c("09", "15", "17")) %>%
  group_by(CVE_ENT) %>%
  summarise(across(starts_with('cumul'),
                   sum, na.rm = TRUE, .names = '{col}_total'))
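Note that summarise() drops columns that are neither grouped nor summarised, which is why Col1 and Col2 disappear. One way to keep them, assuming they are constant within each CVE_ENT group (the column names here are taken from the question), is to add them to group_by():

```r
library(dplyr)

sumbyweek <- sf_MX %>%
  filter(CVE_ENT %in% c("09", "15", "17")) %>%
  group_by(CVE_ENT, Col1, Col2) %>%  # carry Col1/Col2 through the summary
  summarise(across(starts_with('cumul'),
                   sum, na.rm = TRUE, .names = '{col}_total'))
```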

Obtain more variables after grouping, summarising with select (dplyr)

My data frame:
date | weekday | price
-----+---------+------
2018 |       1 |    25
2018 |       1 |    35
2019 |       2 |    40
I try to run this code under dplyr:
pi %>%
  group_by(date) %>%
  group_by(date) %>%
  summarise(price = sum(price, na.rm = T)) %>%
  select(price, date, weekday) %>%
  print()
It doesn't work.
Any solution? Thanks in advance
Follow the order: select --> group_by --> summarise
df %>%
  select(price, date, weekday) %>%
  group_by(date, weekday) %>%
  summarise(sum(price, na.rm = T))
People are correctly suggesting to group_by date and weekday, but if you have a lot of columns, that could be a pain to write out. Here's another idiom I frequently use for data.frames with lots of columns:
pi %>%
  group_by(date) %>%
  mutate(price = sum(price, na.rm = T)) %>%
  filter(row_number() == 1)
This keeps the first row of each group, with all of its columns, without having to write them out explicitly.
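The same idiom can also be written with distinct(), which states the intent a little more directly; a sketch, reconstructing the question's small data frame (named pi there):

```r
library(dplyr)

pi <- data.frame(date    = c(2018, 2018, 2019),
                 weekday = c(1, 1, 2),
                 price   = c(25, 35, 40))

pi %>%
  group_by(date) %>%
  mutate(price = sum(price, na.rm = TRUE)) %>%  # group total replaces price
  distinct(date, .keep_all = TRUE)              # first row per date, all columns kept
```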

How to get percentage value of each column across all rows in R

Using R's tidyverse, how do I get the percentage value of each column across rows? Using the mpg dataset as an example, I've tried the following code:
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%
  spread(model, n) %>%
  mutate_if(is.integer, as.numeric)
new_mpg[,-1] %>%
  mutate(sum = rowSums(.))
I'm looking to create the following output:
manufacturer | 4runner4wd | a4        | a4 quattro | a6 quattro | altima
-------------+------------+-----------+------------+------------+-------
audi         | NA         | 0.3888889 | 0.444444   | 0.166667   | NA
However, when I get to
new_mpg[,-1] %>%
  mutate(sum = rowSums(.))
the sum column returns NA, so I'm unable to calculate n()/sum; I just get NA. Any ideas how to fix this?
As @camille mentioned in the comments, you need na.rm = TRUE in the rowSums call. To get the percentage of each model within a manufacturer, first count each model grouped by manufacturer and model, and then compute the percentage grouped only by manufacturer. dplyr is smart in this way because it removes one layer of grouping after the summarise, so you just need to add a mutate:
library(dplyr)
library(tidyr)
library(ggplot2)
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%
  mutate(n = n/sum(n)) %>%
  spread(model, n) %>%
  mutate_if(is.integer, as.numeric)
new_mpg[,-1] %>%
  mutate(sum = rowSums(., na.rm = TRUE))
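spread() has since been superseded by pivot_wider(); with current tidyr (>= 1.0) the same table can be built, and since the percentages are computed before widening, no rowSums step is needed at all. A sketch:

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # for the mpg dataset

mpg %>%
  count(manufacturer, model) %>%   # n per manufacturer/model
  group_by(manufacturer) %>%
  mutate(pct = n / sum(n)) %>%     # share of each model per manufacturer
  select(-n) %>%
  pivot_wider(names_from = model, values_from = pct)
```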

Remove old date rows in R

I have table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
5/1/1 | B       | 4
4/1/1 | C       | 5
1/1/1 | A       | 1
7/1/1 | B       | 2
1/1/1 | C       | 3
I need table:
Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
4/1/1 | C       | 5
7/1/1 | B       | 2
How to remove old rows based on two criteria (Column1, Column2)?
Group by Column1 and Column2, arrange Date in descending order within each group, then keep the first row with slice, like this:
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(as.Date(Date))) %>% # will sort within group now
  slice(1) %>%                     # keep first row entry of each group
  ungroup()
Your error is occurring because your date format is a bit unusual. I recommend using lubridate::parse_date_time, which is more robust than base R's datetime functions:
library(lubridate)
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # will sort within group now
  # the date format is specified as month-day-year
  slice(1) %>% # keep first row entry of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to
library(lubridate)
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the max-Date row of each group
  ungroup()
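dplyr 1.0 also added slice_max(), which names the intent directly and replaces the which.max() call; a sketch under the same month-day-year assumption about the Date column:

```r
library(dplyr)
library(lubridate)

ans <- df %>%
  group_by(Column1, Column2) %>%
  slice_max(parse_date_time(Date, orders = "mdy"), n = 1) %>%
  ungroup()
```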
