Keep columns while grouping - r

I am a beginner in R and was looking for help online, but the examples I found under similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns and what I am looking for is something like a summary, but I want to retain every other column after the summary. I thought about using aggregate but since it's not a function, I am uncertain. The most promising approach I can think of is shown below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose any other column.
I definitely appreciate any kind of advice! Thank you very much.

Expanding on TTS's comment, if you want to keep any other column you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
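To make the difference between the two verbs concrete, here is a minimal sketch that runs both on the sample data above and compares the shapes of the results. The object names `collapsed` and `kept` are only illustrative and do not appear in the original answer.

```r
library(dplyr)

# Same sample data as in the answer above
T1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L, 4L),
  UniProt_Accession = c("P25702", "F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"),
  AGI = c("AT1G54630", "AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"),
  stringsAsFactors = FALSE
)

# summarise() returns one row per group and only the grouping and summary columns
collapsed <- T1 %>%
  group_by(ID) %>%
  summarise(AGI = paste(AGI, collapse = "; "))

# mutate() keeps every row and every column, overwriting AGI within each group
kept <- T1 %>%
  group_by(ID) %>%
  mutate(AGI = paste(AGI, collapse = "; ")) %>%
  ungroup()

dim(collapsed)  # 4 2
dim(kept)       # 6 3
```

With over 100 columns the same pattern should apply unchanged, since mutate() only touches the column named on the left-hand side.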

Related

Looking for an efficient way of making a new data frame of totals across categories in R

Total R beginner here, looking for the quickest / most sensible way to do this:
I have a data frame that looks similar to this (but much longer):
dataframe:
|date|a|b|c|
|:-:|:-:|:-:|:-:|
|1/1/2021|4|3|2|
|1/2/2021|2|2|1|
|1/3/2021|5|3|5|
I am attempting to create a new data frame showing totals for a, b, and c (which go on for a while), and don't need the dates. I want to make a data frame that would look like this:
|letter|total|
|:-:|:-:|
|a|11|
|b|8|
|c|8|
So far, the closest I have got to this is by writing a pipe like this:
dataframe <- totals %>%
summarize(total_a = sum(a), total_b = sum(b), total_c = sum(c))
which almost gives me what I want, a data frame that looks like this:
|a|b|c|
|:-:|:-:|:-:|
|11|8|8|
Is there a way (besides manually typing out a new data frame for totals) to quickly turn my totals table into the format I'm looking for? Or is there a better way to write the pipe that will give me the table I want? I want to use these totals to make a pie chart but am running into problems when I attempt to make a pie chart out of the table like I have it now. I really appreciate any help in advance and hope I was able to explain what I'm trying to do correctly.
One efficient way is to use colSums from base R to get the sum of each column, excluding the date column (hence the -1 in df[,-1]). Then, I use stack to put the result into long format. The [,2:1] just changes the order of the output columns, so that letter is first and total is second. I wrap this in setNames to rename the columns.
setNames(nm=c("letter", "total"),stack(colSums(df[,-1]))[,2:1])
letter total
1 a 11
2 b 8
3 c 8
Or with tidyverse, we can get the sum of every column, except for date. Then, we can put it into long format using pivot_longer.
df %>%
summarise(across(-date, sum)) %>%
pivot_longer(everything(), names_to = "letter", values_to = "total")
Or another option using data.table:
library(data.table)
dt <- as.data.table(df)
melt(dt[,-1][, lapply(.SD, sum)], id.vars=integer(), variable.name = "letter", value.name = "total")
Data
df <- structure(list(date = c("1/1/2021", "1/2/2021", "1/3/2021"),
a = c(4L, 2L, 5L), b = c(3L, 2L, 3L), c = c(2L, 1L, 5L)),
class = "data.frame", row.names = c(NA, -3L))
Try this:
totals %>% select(a:c) %>% colSums() %>% as.list() %>% as_tibble() %>%
pivot_longer(everything(), names_to = "letter", values_to = "total")
Actually totals %>% select(a:c) %>% colSums() gives what you need as a named vector and the next steps are to turn that into a tibble again. You can skip that part if you don't need it.
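As a rough illustration of that last point, the named vector from colSums can be turned into the two-column layout with base R alone. This is only a sketch: it uses the `df` sample data from the first answer, and `sums` and `totals_long` are names made up for the example.

```r
# Sample data from the first answer
df <- data.frame(
  date = c("1/1/2021", "1/2/2021", "1/3/2021"),
  a = c(4L, 2L, 5L),
  b = c(3L, 2L, 3L),
  c = c(2L, 1L, 5L),
  stringsAsFactors = FALSE
)

# Named numeric vector of column totals, skipping the date column
sums <- colSums(df[, -1])

# The names become the letter column, the values become the total column
totals_long <- data.frame(
  letter = names(sums),
  total = unname(sums),
  stringsAsFactors = FALSE
)

totals_long
```

Base R's pie() accepts a named vector directly, so for a quick pie chart `sums` on its own may already be enough.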

Use a separate dataframe to assign groups to another dataframe

I am currently working with genetic data where the headers are cell sample names. There are 2 samples from each type of cell collected, and they need to be plotted in a box plot. Due to inconsistent sample naming, I am using a separate .csv file where the user writes the sample name and the group it belongs to. I am trying to use the group_by() function to access the sample data but then use the grouping information from the other .csv file. Is there a way to accomplish what I am trying to do?
Cell Sample Data CSV:
Sample A1 Sample A2 Sample B1 Sample B2
1 3 3 5
Grouping CSV
Samples Group
Sample A 1
Sample B 1
Sample C 2
Sample D 2
My current idea is doing something like this
library(dplyr)
groupFile <- data %>% group_by(groupFile$Group)
however that didn't work, and I am stuck at how to make the data correspond to the grouping file.
Note: I previously uploaded this question without sample data and code and it was closed. I'm hoping this describes the problem well enough.
First let's improve your example cell sample data by including samples that are in different groups:
celldata <- structure(list(`Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
`Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L),
class = "data.frame", row.names = c(NA, -1L))
And your groups data:
groupdata <- structure(list(Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
Group = c(1L, 1L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
Life will be much easier with data in "long" format rather than wide, and with everything in the one dataframe.
We can use tidyr::gather to reshape the cell data, then dplyr::mutate to get Samples without the numeric suffixes and finally, dplyr::left_join to bring samples and groups together:
library(dplyr)
library(tidyr)
celldata %>%
gather(Sample, Value) %>%
mutate(Samples = gsub("\\d+", "", Sample)) %>%
left_join(groupdata)
Result:
Sample Value Samples Group
1 Sample A1 1 Sample A 1
2 Sample A2 3 Sample A 1
3 Sample B1 3 Sample B 1
4 Sample B2 5 Sample B 1
5 Sample C1 6 Sample C 2
6 Sample C2 7 Sample C 2
Now you can group on Group. Depending on what you want to do next, you may want to convert Group to a factor. And if you're using ggplot2, you may not even need to group_by.
For example:
library(ggplot2)
celldata %>%
gather(Sample, Value) %>%
mutate(Samples = gsub("\\d+", "", Sample)) %>%
left_join(groupdata) %>%
mutate(Group = factor(Group)) %>%
ggplot(aes(Group, Value)) +
geom_boxplot() +
geom_jitter(aes(color = Samples)) +
theme_bw()
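In current versions of tidyr, gather is superseded by pivot_longer. The reshaping and join steps above could be sketched as follows, with the join key stated explicitly so that left_join does not have to guess it. The result name `long` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same example data as above; check.names = FALSE keeps the spaces in the column names
celldata <- data.frame(
  `Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
  `Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L,
  check.names = FALSE
)
groupdata <- data.frame(
  Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
  Group = c(1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

long <- celldata %>%
  # One row per sample column
  pivot_longer(everything(), names_to = "Sample", values_to = "Value") %>%
  # Drop the trailing replicate number to recover the name used in groupdata
  mutate(Samples = sub("\\d+$", "", Sample)) %>%
  # Attach the group for each sample name
  left_join(groupdata, by = "Samples")

long
```

Samples that are listed in groupdata but absent from celldata (Sample D here) simply do not appear in the result, because left_join keeps only the rows of the left-hand table.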

find duplicates with grouped variables

I guess this can be done somehow with dplyr and duplicates, but I don't know how to address multiple columns while distinguishing between the values of a grouping variable.
I have a df that looks like this:
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids that appear in more than one group.
The expected result for the sample df is 1 and 4, because they appear in both the metro and the train group.
Thank you in advance!
Using base R we can split the first two columns based on group and find the intersecting value between the groups using intersect
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
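The one-liner relies on split recycling df$group across the twelve unlisted values. A step-by-step sketch of the same idea, with the recycling written out explicitly, might look like this. The intermediate names `ids`, `grp`, `by_group` and `shared` are made up for the example.

```r
df <- data.frame(
  from = c(1L, 2L, 3L, 4L, 6L, 8L),
  to = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = rep(c("metro", "train"), each = 3),
  stringsAsFactors = FALSE
)

# All ids in one vector: the six 'from' values followed by the six 'to' values
ids <- unlist(df[1:2], use.names = FALSE)

# Repeat the group labels once per id column so both vectors line up
grp <- rep(df$group, times = 2)

# One vector of ids per group
by_group <- split(ids, grp)

# Ids present in every group
shared <- Reduce(intersect, by_group)
shared
# [1] 1 4
```

Note that Reduce(intersect, ...) returns ids common to all groups. With only two groups this is the same as "more than one group", but with three or more groups the two notions differ.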
We gather the 'from' and 'to' columns into 'long' format, group by 'val', keep the groups that have more than one unique 'group' value, then pull the unique 'val' elements
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, from:to) %>%
group_by(val) %>%
filter(n_distinct(group) > 1) %>%
distinct(val) %>%
pull(val)
#[1] 1 4
Or using base R we can just use table to find the frequency, and get the ids out of it
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count unique values, using data.table. melt is used to convert to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and pulling in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
.[, .(n = uniqueN(group)), value] %>%
.[n > 1, unique(value)]
# [1] 1 4

Separating Column Based on First Value of String

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to try to separate these variables into columns based on their prefixes, but cannot seem to figure it out. Any ideas on how I would do this? Thank you in advance.
We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
mutate(i = row_number()) %>%
spread(grp, STR_ID) %>%
select(-i)
# A tibble: 3 x 2
# grp1 grp2
# <int> <int>
#1 1434233 2343535
#2 1243435 NA
#3 1434355 NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
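Since spread is superseded by pivot_wider in current tidyr, the same reshaping could be sketched as below. This is only one possible variant of the answer above, and the result name `wide` is made up for the example.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L))

wide <- df1 %>%
  # Group label built from the first digit of the id
  mutate(grp = paste0("grp", substr(STR_ID, 1, 1))) %>%
  group_by(grp) %>%
  # Running index within each group, so ids from different groups share a row
  mutate(i = row_number()) %>%
  ungroup() %>%
  # One column per group label
  pivot_wider(names_from = grp, values_from = STR_ID) %>%
  select(-i)

wide
```

Because the groups have different sizes, the shorter column is padded with NA, as in the output shown above.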

Dynamic Grouping in R | Grouping based on condition on applied function

In R's aggregate() function, how can I specify a stopping condition for the grouping, based on the function applied to the variable?
For example, I have data-frame like this: "df"
Input Data frame
Note: Assume each row in the input data frame denotes a single ball played by a player in that match, so counting the rows tells us the number of balls required.
And, I want my data frame like this one: Output data frame
My need is: How many balls are required to score 10 runs?
Currently, I am using this R code:
group_data <- aggregate(df$score, by=list(Category=df$player,df$match), FUN=sum,na.rm = TRUE)
Using this code, I cannot stop the grouping where I want; it only stops once it has grouped all rows. I don't want all rows to be considered.
How can I add a constraint like "stop grouping as soon as score >= 10"?
My sole purpose in adding this constraint is to count the number of rows needed to satisfy the condition.
Thanks in advance.
Here is one option using dplyr
library(dplyr)
df1 %>%
group_by(match, player) %>%
filter(!lag(cumsum(score) >= 10, default = FALSE)) %>%
summarise(score = sum(score), Count = n())
# A tibble: 2 x 4
# Groups: match [?]
# match player score Count
# <int> <int> <dbl> <int>
#1 1 30 12 2
#2 2 31 15 3
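Since the question started from aggregate(), the count can also be sketched in base R by passing a small helper as FUN. The helper `balls_needed` and its `target` argument are hypothetical names introduced for this example, and the threshold follows the "score >= 10" condition from the question.

```r
df1 <- data.frame(
  match = c(1L, 1L, 1L, 2L, 2L, 2L),
  player = c(30L, 30L, 30L, 31L, 31L, 31L),
  score = c(6, 6, 6, 3, 6, 6)
)

# Position of the first ball at which the running total reaches the target,
# or NA if the target is never reached (hypothetical helper)
balls_needed <- function(score, target = 10) {
  reached <- which(cumsum(score) >= target)
  if (length(reached) == 0) NA_integer_ else reached[1]
}

# One row per match/player combination
res <- aggregate(score ~ match + player, data = df1, FUN = balls_needed)
names(res)[3] <- "balls"

res
#   match player balls
# 1     1     30     2
# 2     2     31     3
```

This assumes the rows within each match and player are already in the order the balls were played, because cumsum works on row order.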
data
df1 <- structure(list(match = c(1L, 1L, 1L, 2L, 2L, 2L), player = c(30L,
30L, 30L, 31L, 31L, 31L), score = c(6, 6, 6, 3, 6, 6)), .Names = c("match",
"player", "score"), row.names = c(NA, -6L), class = "data.frame")
