Use a separate dataframe to assign groups to another dataframe - r

I am currently working with genetic data where the headers are cell sample names. There are 2 samples from each type of cell collected, and they need to be plotted in a box plot. Due to inconsistent sample naming, I am using a separate .csv file where the user writes the sample name and the group it belongs to. I am trying to use the group_by() function to access the sample data but then use the grouping information from the other .csv file. Is there a way to accomplish what I am trying to do?
Cell Sample Data CSV:
Sample A1 Sample A2 Sample B1 Sample B2
1 3 3 5
Grouping CSV
Samples Group
Sample A 1
Sample B 1
Sample C 2
Sample D 2
My current idea is doing something like this
library(dplyr)
groupFile <- data %>% group_by(groupFile$Group)
however that didn't work, and I am stuck at how to make the data correspond to the grouping file.
Note: I previously uploaded this question without sample data and code and it was closed. I'm hoping this describes the problem well enough.

First let's improve your example cell sample data by including samples that are in different groups:
celldata <- structure(list(`Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
`Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L),
class = "data.frame", row.names = c(NA, -1L))
And your groups data:
groupdata <- structure(list(Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
Group = c(1L, 1L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
Life will be much easier with data in "long" format rather than wide, and with everything in the one dataframe.
We can use tidyr::gather to reshape the cell data, then dplyr::mutate to get Samples without the numeric suffixes and finally, dplyr::left_join to bring samples and groups together:
library(dplyr)
library(tidyr)
celldata %>%
  gather(Sample, Value) %>%
  mutate(Samples = gsub("\\d+", "", Sample)) %>%
  left_join(groupdata, by = "Samples")
Result:
Sample Value Samples Group
1 Sample A1 1 Sample A 1
2 Sample A2 3 Sample A 1
3 Sample B1 3 Sample B 1
4 Sample B2 5 Sample B 1
5 Sample C1 6 Sample C 2
6 Sample C2 7 Sample C 2
Now you can group on Group. Depending on what you want to do next, you may want to convert Group to a factor. And if you're using ggplot2, you may not even need to group_by.
For example:
library(ggplot2)
celldata %>%
  gather(Sample, Value) %>%
  mutate(Samples = gsub("\\d+", "", Sample)) %>%
  left_join(groupdata, by = "Samples") %>%
  mutate(Group = factor(Group)) %>%
  ggplot(aes(Group, Value)) +
  geom_boxplot() +
  geom_jitter(aes(color = Samples)) +
  theme_bw()
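As a follow-up to the reshape-and-join approach above, here is a minimal sketch of the same idea using tidyr::pivot_longer, which the tidyr documentation recommends over gather for new code. The object names long and group_means are only illustrative, and the example data are rebuilt with data.frame() so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same example data as above, rebuilt so this block is self-contained
celldata <- data.frame(`Sample A1` = 1L, `Sample A2` = 3L, `Sample B1` = 3L,
                       `Sample B2` = 5L, `Sample C1` = 6L, `Sample C2` = 7L,
                       check.names = FALSE)
groupdata <- data.frame(Samples = c("Sample A", "Sample B", "Sample C", "Sample D"),
                        Group = c(1L, 1L, 2L, 2L))

long <- celldata %>%
  # one row per sample column
  pivot_longer(everything(), names_to = "Sample", values_to = "Value") %>%
  # strip the trailing replicate number to get the key used in the grouping file
  mutate(Samples = sub("\\d+$", "", Sample)) %>%
  # attach the group; naming the key avoids the "Joining with `by`" message
  left_join(groupdata, by = "Samples")

# Once Group is a column, group_by() works as usual
group_means <- long %>%
  group_by(Group) %>%
  summarise(mean_value = mean(Value))
```

This assumes the replicate number is always a run of digits at the end of the column name; samples named differently would need a different pattern or an explicit lookup column in the grouping file.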

Related

Looking for an efficient way of making a new data frame of totals across categories in R

Total R beginner here, looking for the quickest / most sensible way to do this:
I have a data frame that looks similar to this (but much longer):
dataframe:
date      a  b  c
1/1/2021  4  3  2
1/2/2021  2  2  1
1/3/2021  5  3  5
I am attempting to create a new data frame showing totals for a, b, and c (which go on for a while), and don't need the dates. I want to make a data frame that would look this:
letter  total
a       11
b       8
c       8
So far, the closest I have got to this is by writing a pipe like this:
dataframe <- totals %>%
summarize(total_a = sum(a), total_b = sum(b), total_c = sum(c))
which almost gives me what I want, a data frame that looks like this:
|a|b|c|
|:-:|:-:|:-:|
|11|8|8|
Is there a way (besides manually typing out a new data frame for totals) to quickly turn my totals table into the format I'm looking for? Or is there a better way to write the pipe that will give me the table I want? I want to use these totals to make a pie chart but am running into problems when I attempt to make a pie chart out of the table like I have it now. I really appreciate any help in advance and hope I was able to explain what I'm trying to do correctly.
One efficient way is to use colSums from base R to get the sum of each column, excluding the date column (hence the -1 in df[,-1]). Then I use stack to put the result into long format. The [,2:1] just reverses the column order, so that letter comes first and total second. I wrap this in setNames to rename the columns.
setNames(nm=c("letter", "total"),stack(colSums(df[,-1]))[,2:1])
letter total
1 a 11
2 b 8
3 c 8
Or with tidyverse, we can get the sum of every column, except for date. Then, we can put it into long format using pivot_longer.
library(dplyr)
library(tidyr)
df %>%
  summarise(across(-date, sum)) %>%
  pivot_longer(everything(), names_to = "letter", values_to = "total")
Or another option using data.table:
library(data.table)
dt <- as.data.table(df)
melt(dt[,-1][, lapply(.SD, sum)], id.vars=integer(), variable.name = "letter", value.name = "total")
Data
df <- structure(list(date = c("1/1/2021", "1/2/2021", "1/3/2021"),
a = c(4L, 2L, 5L), b = c(3L, 2L, 3L), c = c(2L, 1L, 5L)),
class = "data.frame", row.names = c(NA, -3L))
Try this:
totals %>% select(a:c) %>% colSums() %>% as.list() %>% as_tibble() %>%
pivot_longer(everything(), names_to = "letter", values_to = "total")
Actually, totals %>% select(a:c) %>% colSums() already gives what you need as a named vector; the remaining steps only turn it back into a tibble. You can skip them if a named vector is enough.
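To illustrate the named-vector point, here is a small base R sketch that builds the two-column table directly from the names and values of the colSums result. The names totals_vec and totals_df are only illustrative, and the data frame is the example data from the question.

```r
# Example data from the question
df <- data.frame(date = c("1/1/2021", "1/2/2021", "1/3/2021"),
                 a = c(4L, 2L, 5L), b = c(3L, 2L, 3L), c = c(2L, 1L, 5L))

# Named numeric vector of column totals, dropping the date column
totals_vec <- colSums(df[, -1])

# The names become the letter column and the values become the total column
totals_df <- data.frame(letter = names(totals_vec),
                        total = unname(totals_vec))
```

For a quick base R pie chart the named vector may already be enough, since pie() accepts the values and takes labels from names(totals_vec) if you pass them as the labels argument.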

Keep columns while grouping

I am a beginner in R and was looking for help online, but the examples I found under similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns and what I am looking for is something like a summary, but I want to retain every other column after the summary. I thought about using aggregate but since it's not a function, I am uncertain. The most promising approach I can think of is shown below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose any other column.
I definitely appreciate any kind of advice! Thank you very much
Expanding on TTS's comment, if you want to keep any other column you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
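If the goal is instead one row per ID while still keeping every column, a possible variation is to collapse all non-grouping columns at once with across(). This is only a sketch under the assumption that collapsing the unique values of each column into one string is acceptable for the other columns too; the name T2 follows the question.

```r
library(dplyr)

# Example data from the answer above
T1 <- data.frame(ID = c(1L, 1L, 2L, 3L, 3L, 4L),
                 UniProt_Accession = c("P25702", "F4HWZ6", "Q9C5M0",
                                       "Q9SR37", "Q9LKR7", "Q9FXI7"),
                 AGI = c("AT1G54630", "AT1G54630", "AT5G19760",
                         "AT3G09260", "AT5G28510", "AT1G19890"))

T2 <- T1 %>%
  group_by(ID) %>%
  # everything() here covers all columns except the grouping column ID;
  # unique() avoids repeating identical values within a group
  summarise(across(everything(), ~ paste(unique(.x), collapse = "; ")))
```

Unlike the mutate version, this returns four rows (one per ID), and every collapsed column becomes character.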

Count frequency of same value in several columns

I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a dataframe called clg where basically I have 3 columns date, X1, X2.
X1 and X2 are name of country teams. X1 and X2 have the same list of countries.
I'm simply trying to count the frequency of each country in the two columns as a total.
So far, I've only been able to count the frequency of the X1 column but I didn't find a way to sum both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame where in the first columns I have unique countries, and in the second column the sum of occurrences in X1 + X2.
You can use unlist() and table() to get the overall counts. Wrapping it in data.frame() will give you the desired two-column output.
clg <- data.frame(date=1:3,
X1=c("nor", "swe", "alg"),
X2=c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1
With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
data
clg <- structure(list(date = 1:3, X1 = structure(c(2L, 3L, 1L), .Label = c("alg",
"nor", "swe"), class = "factor"), X2 = structure(c(3L, 1L, 2L
), .Label = c("alg", "jpn", "swe"), class = "factor")),
class = "data.frame", row.names = c(NA,
-3L))
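The same long-format count can also be sketched with tidyr::pivot_longer, which supersedes gather in current tidyr. The column names column, country and total are illustrative choices, and the data are the small example from the first answer, stored as character rather than factor.

```r
library(dplyr)
library(tidyr)

clg <- data.frame(date = 1:3,
                  X1 = c("nor", "swe", "alg"),
                  X2 = c("swe", "alg", "jpn"),
                  stringsAsFactors = FALSE)

country_counts <- clg %>%
  # stack X1 and X2 into a single country column
  pivot_longer(c(X1, X2), names_to = "column", values_to = "country") %>%
  # one row per country with the number of appearances across both columns
  count(country, name = "total")
```

If the real data contain missing team names, dropping them first (for example with filter(!is.na(country)) before count) keeps NA from showing up as its own row.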
You can reach your goal in two steps. First, count the occurrences of each country in each column separately. Then join the two summaries on the country name and add the counts.
X1_sum <- clg %>%
  dplyr::group_by(X1) %>%
  dplyr::summarize(n_x1 = dplyr::n())
X2_sum <- clg %>%
  dplyr::group_by(X2) %>%
  dplyr::summarize(n_x2 = dplyr::n())
final_summary <- X1_sum %>%
  # full join on country name, so countries that appear in only one column are kept
  dplyr::full_join(X2_sum, by = c("X1" = "X2")) %>%
  # a country missing from one column has NA there; treat it as 0 before adding
  dplyr::mutate(n_sum = dplyr::coalesce(n_x1, 0L) + dplyr::coalesce(n_x2, 0L))

find duplicates with grouped variables

I guess this can be done with dplyr and some kind of duplicate check, but I don't know how to address multiple columns while distinguishing between groups.
I have a df that looks like this:
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids which exist in more than one group variable.
The expected result for the sample df is: 1 and 4. Because they exist in the metro and the train group.
Thank you in advance!
Using base R we can split the first two columns based on group and find the intersecting value between the groups using intersect
Reduce(intersect, split(unlist(df1[1:2]), df1$group))
#[1] 1 4
With tidyverse, we gather the 'from' and 'to' columns into 'long' format, group by 'val', filter the groups having more than one unique 'group' value, then pull the unique 'val' elements
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, from:to) %>%
group_by(val) %>%
filter(n_distinct(group) > 1) %>%
distinct(val) %>%
pull(val)
#[1] 1 4
Or using base R we can just use table to find the frequency, and get the ids out of it
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count unique values, using data.table. melt converts to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and pulling values in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
.[, .(n = uniqueN(group)), value] %>%
.[n > 1, unique(value)]
# [1] 1 4
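For comparison, the "count distinct groups per id" logic used in the answers above can be spelled out step by step in base R. This is only a sketch; the names long, n_groups and shared_ids are illustrative.

```r
df1 <- data.frame(from = c(1L, 2L, 3L, 4L, 6L, 8L),
                  to = c(2L, 4L, 4L, 5L, 1L, 7L),
                  group = c("metro", "metro", "metro", "train", "train", "train"))

# Stack 'from' and 'to' into one id column, repeating the group label for each
long <- data.frame(id = c(df1$from, df1$to),
                   group = rep(df1$group, 2))

# Number of distinct groups each id appears in
n_groups <- tapply(long$group, long$id, function(g) length(unique(g)))

# Ids present in more than one group
shared_ids <- as.integer(names(n_groups)[n_groups > 1])
shared_ids
# [1] 1 4
```

Unlike Reduce(intersect, ...), which returns ids present in every group, this keeps ids present in at least two groups, which matters once there are more than two group values.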

Split column into intervals based on row content

I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
Here is a function that reformats your data according to how you want the final dataset to look. You supply the dataframe DF, the variable var, a vector of column names in the correct order colnames, and byitem to choose the output format (the default, TRUE, outputs a dataframe with one row per item):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE){
if(byitem){
ID = sym("rowid")
}else{
ID = sym("id")
}
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
}
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
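The core of the function above is the cumsum(grepl("item", var)) trick, which turns every "item" row into the start of a new interval. A minimal base R sketch of just that step, using the sample rows from the question typed in as a character vector (the names is_header, item_id and parts are illustrative):

```r
x <- c("item 1", "$600", "red",
       "item 2", "$70",
       "item 3", "$430", "orange", "10/11/2017")

# TRUE on the rows that start a new item
is_header <- grepl("^item", x)

# Running count of headers seen so far: 1 1 1 2 2 3 3 3 3
item_id <- cumsum(is_header)

# Split the detail rows by item; the explicit factor levels keep items
# that have no detail rows as empty entries instead of dropping them
parts <- split(x[!is_header],
               factor(item_id[!is_header], levels = seq_len(sum(is_header))))
names(parts) <- x[is_header]
```

This gives a named list with one character vector per item, which sidesteps the positional column naming; deciding which element is a price, colour or date would still need its own rule.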
