How summarise count equal rows between factor columns R? - r

I have followed data example
df <- tibble(var1 = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
var2 = factor(c("a", "b", "b", "c", "c", "c", "d", "d")))
I would like to summarise this data as one row and three columns (1X3) in tibble format. Where the first column show the counting of similar row values, the second column show the counting of different and the third column the total of values with final format as:
final <- tibble(equal = 2, different = 6, total = 8)
thank you all

You could use
library(dplyr)
df %>%
summarise(equal = sum(as.numeric(var1) == as.numeric(var2)),
different = sum(as.numeric(var1) != as.numeric(var2)),
total = n())
This returns
# A tibble: 1 x 3
equal different total
<int> <int> <int>
1 2 6 8

Related

How to merge two data frames and fill with different options based on coincidences

I have two data frame y and z
y <- data.frame(ID = c("A", "A", "A", "B", "B"), gene = c("a", "b", "c", "a", "c"))
z <- data.frame(A = c(2,6,3), B = c(8,4,9), C=c(1,6,2))
rownames(z) <- c("a", "b", "c")
So for y I have a table with patients ID and gene for each patients and in Z I have the same patients IDs in the first row and a list of genes with a specific value (which is not important here). The genes in y are in z, but in z there are genes that are not included in y.
What I want to do is to merge this frames and have something like this:
a b c
A 1 1 1
B 1 0 1
So for each patient, if the genes in z are also in y, fill with 1 and if not, fill with 0
I don't really know how to handle this, any ideas?
Thank you
I've made RE from your question (add this to your question next time):
y <- data.frame(ID = c("id_A", "id_A", "id_A", "id_B", "id_B"), gene = c("a", "b", "c", "a", "c"))
z <- data.frame(id_A = c(2,6,3), id_B = c(8,4,9), id_C=c(1,6,2))
rownames(z) <- c("a", "b", "c")
The idea here is to pivot_longer your tables so you can join than easily.
To do so, you first need to make your rownames into a field:
z <- tibble::rownames_to_column(z, "gene")
Then, you pivot longer you z table:
library(tidyr)
z_long <- pivot_longer(z, starts_with("id_"), names_to = "ID")
and join it with your y table:
library(dplyr)
table_join <- left_join(y, z_long)
Finally, you just have to calculate the frequencies:
table(table_join$ID, table_join$gene)
a b c
id_A 1 1 1
id_B 1 0 1

Plotting Number of Times Value Appears in Two Dataframes in R

I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and how many times it appears in dataset #2. If a molecule is in one dataset and not the other, it appears 0 times.
Example:
dat1 <- data.frame(
name = c("A", "B", "D", "E")
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F")
count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.
Try joining the data.frames
full_join(dat1, dat2, by="name") %>%
mutate_all(function(xx) ifelse(is.na(xx), 0, xx)) %>%
ggplot(aes(count.x, count.y)) +
geom_point()
which produces
You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
#Code
dat1 %>% full_join(dat2 %>% rename(count2=count)) %>%
replace(is.na(.),0) %>%
ggplot(aes(x=count,y=count2))+
geom_point()+
geom_text(aes(label=name),vjust=-0.5)
Output:

Replace values in vector where not %in% vector

Short question:
I can substitute certain variable values like this:
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values)
What's the easiest way to replace all the values of df$values by "x" (where the value is neither "a" or "b")?
Output should be:
c("a", "b", "a", "b", "x", "a", "b")
Your example is a bit unclear and not reproducible.
However, based on guessing what you actually want, I could suggest trying this option using the data.table package:
df[values %in% c("a", "b"), values := "x"]
or the dplyr package:
df %>% mutate(values = ifelse(values %in% c("a","b"), x, values))
What about:
df[!df[, 1] %in% c("a", "b"), ] <- "x"
values
1 a
2 b
3 a
4 b
5 x
6 a
7 b

How to drop observations based on conditions

I have a subset data that has a total count for each observation from a bigger dataset. If I want to drop duplicates based on a higher count and drop codes that appear less if the name is the same, how would I go about that? So for instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would be left with these:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)
If you can use dplyr, this should do the trick:
library(dplyr)
data %>%
group_by(name) %>%
filter(n == max(n)) %>%
ungroup()

How to avoid for loop when iterating through unique values in a column [R]

Let's assume that we have following toy data:
library(tidyverse)
data <- tibble(
subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represent network relationships for each subject. For example, there are three unique subjects in the data and the network for the first subject could be represented as sequence of relations:
a -- b, a --c, b -- c
The task is to compute centralities for each network. Using for loop this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
# Compute centrality of nodes for each graph
for (i in 1:length(subjects_uniq)) {
current_data <- data %>% filter(subject == i) %>% select(-subject)
current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
centrality <- eigen_centrality(current_graph)$vector
}
Question: My dataset is huge so I wonder how to avoid explicit for loop. Should I use apply() and its modern cousins (maybe map() in the purrr package)? Any suggestions are greatly welcome.
Here is an option using map
library(tidyverse)
library(igraph)
map(subjects_uniq, ~data %>%
filter(subject == .x) %>%
select(-subject) %>%
graph_from_data_frame(directed = FALSE) %>%
{eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#[[2]]
# a b c d
#1.0000000 0.8546377 0.8546377 0.4608111
#[[3]]
#a b c
#1 1 1

Resources