Counting frequencies of each letter for multiple columns [duplicate] - r

I have a data frame as below:
> dfnew
C1 C2 C3 C4 C5 C6
1 A A G A G A
2 A T T T G G
3 T A G A T A
4 C A A A A G
5 C A T T T C
6 C A A A T A
7 T C T G A A
8 G A G C T A
9 C T A T G A
10 G A A A G G
11 G G T T T A
12 G A C T T A
13 T T C T T T
14 A T A G C T
15 A C A A A A
16 A A C A A A
17 T G G A A T
18 A A A A G T
19 G T G G <NA> <NA>
I want to get the result below in one line of R code, without an explicit loop:
A 6 10 7 9 5 10
C 4 2 3 1 1 1
G 5 2 5 3 5 3
T 4 5 4 6 7 4

We can use sapply to loop over the columns, convert each column to a factor with the levels specified, and get the frequencies with table:
sapply(dfnew, function(x) table(factor(x, levels = c("A", "C", "G", "T"))))
Or using tidyverse
library(dplyr)
library(tidyr)
dfnew %>%
  gather(key, val, na.rm = TRUE) %>%
  count(key, val) %>%
  spread(key, n)
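To illustrate why the sapply answer fixes the factor levels first, here is a minimal base R sketch on a made-up two-column data frame (`toy` and `lv` are names chosen for this illustration, not part of the original question). Without fixed levels, each column's table can have a different length, so sapply cannot simplify the result to a matrix.

```r
# Hypothetical toy data: C1 has no "T", C2 has no "C" or "G"
toy <- data.frame(C1 = c("A", "C", "G"),
                  C2 = c("A", "A", "T"),
                  stringsAsFactors = FALSE)
lv <- c("A", "C", "G", "T")

# Tables of different lengths cannot be simplified, so this stays a list
without_levels <- sapply(toy, table)

# Fixing the levels gives every column the same four counts (zeros included),
# so sapply returns a 4 x 2 matrix with the letters as row names
with_levels <- sapply(toy, function(x) table(factor(x, levels = lv)))
print(with_levels)
```

NA values are dropped by table by default, which matches how the original answers treat the missing entries in row 19.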

If you use stack to reshape everything to long form, you can call table on the result:
dfnew <- data.frame(C1 = c("A", "A", "T", "C", "C", "C", "T", "G", "C", "G", "G", "G", "T", "A", "A", "A", "T", "A", "G"),
C2 = c("A", "T", "A", "A", "A", "A", "C", "A", "T", "A", "G", "A", "T", "T", "C", "A", "G", "A", "T"),
C3 = c("G", "T", "G", "A", "T", "A", "T", "G", "A", "A", "T", "C", "C", "A", "A", "C", "G", "A", "G"),
C4 = c("A", "T", "A", "A", "T", "A", "G", "C", "T", "A", "T", "T", "T", "G", "A", "A", "A", "A", "G"),
C5 = c("G", "G", "T", "A", "T", "T", "A", "T", "G", "G", "T", "T", "T", "C", "A", "A", "A", "G", NA),
C6 = c("A", "G", "A", "G", "C", "A", "A", "A", "A", "G", "A", "A", "T", "T", "A", "A", "T", "T", NA),
stringsAsFactors = FALSE)
table(stack(dfnew))
#> ind
#> values C1 C2 C3 C4 C5 C6
#> A 6 10 7 9 5 10
#> C 4 2 3 1 1 1
#> G 5 2 5 3 5 3
#> T 4 5 4 6 7 4

Using data.table and its chaining workflow with [:
library(data.table)
tab <- fread("
C1 C2 C3 C4 C5 C6
A A G A G A
A T T T G G
T A G A T A
C A A A A G
C A T T T C
C A A A T A
T C T G A A
G A G C T A
C T A T G A
G A A A G G
G G T T T A
G A C T T A
T T C T T T
A T A G C T
A C A A A A
A A C A A A
T G G A A T
A A A A G T
G T G G NA NA")
tab[, melt(.SD, measure.vars = paste0("C", 1:6), na.rm = TRUE)][
  , dcast(.SD, value ~ variable, fun.aggregate = length, drop = TRUE)
]
#> value C1 C2 C3 C4 C5 C6
#> 1: A 6 10 7 9 5 10
#> 2: C 4 2 3 1 1 1
#> 3: G 5 2 5 3 5 3
#> 4: T 4 5 4 6 7 4

Related

R - Creating a new variable based on multiple observations

My dataset represents patients who have been treated multiple times. The dataset is in long format; patients get treatment A, C, or S, or a combination. A and C are never combined.
Simply put, the data looks something like this:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
             treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
I would like to create a new variable based on whether a patient had treatment A, treatment C, or neither, so that the end result looks something like:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", "S"),
group = c("A", "A", "A", "C", "C", "C", "C", "C", "C", "C", "C", "A", "A", "A", "S", "S"))
How can I best approach this? I'm struggling with how to deal with multiple observations per ID.
Thank you!
You can use group_by() in combination with mutate() and case_when() to achieve this:
library(tidyverse)
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
df %>%
  group_by(PatientID) %>%
  mutate(groups = case_when("A" %in% treatment ~ "A",
                            "C" %in% treatment ~ "C",
                            TRUE ~ "S"))
#> # A tibble: 16 × 3
#> # Groups: PatientID [6]
#> PatientID treatment groups
#> <dbl> <chr> <chr>
#> 1 1 A A
#> 2 1 A A
#> 3 1 S A
#> 4 2 C C
#> 5 2 S C
#> 6 3 S C
#> 7 3 C C
#> 8 3 C C
#> 9 3 <NA> C
#> 10 4 C C
#> 11 4 <NA> C
#> 12 5 <NA> A
#> 13 5 S A
#> 14 5 A A
#> 15 6 S S
#> 16 6 <NA> S
Created on 2022-08-18 with reprex v2.0.2
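For readers who prefer not to load the tidyverse, the same per-patient classification can be sketched in base R with ave. This is only an illustrative alternative to the answer above; the helper name `classify` is made up for this sketch.

```r
df <- data.frame(
  PatientID = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6),
  treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA,
                "C", NA, NA, "S", "A", "S", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one label per patient, checking A first, then C
classify <- function(trt) {
  if ("A" %in% trt) "A" else if ("C" %in% trt) "C" else "S"
}

# ave() applies the function within each PatientID and returns a vector
# aligned with the original rows; rep() repeats the label for every row
df$groups <- ave(df$treatment, df$PatientID,
                 FUN = function(trt) rep(classify(trt), length(trt)))
print(df)
```

Because `%in%` never returns NA, the missing treatments do not need special handling here.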

Choose a set of markers to distinguish individuals

I have some genotype data from 8 markers and 20 individuals. I would like to select 5 of the 8 markers that together form a unique genotype pattern for each individual. The goal is to select as few markers as possible to distinguish the 20 individuals.
I know that I need to select 5 columns out of 8, and then compare each row. If we find duplicated rows, then we need to re-select another 5 columns, until we find no duplicated rows.
But I don't know how I can translate it into R. Could somebody help? Thanks!
sample data
Indiv MN1 MN2 MN3 MN4 MN5 MN6 MN7 MN8
1 A C C A C G A T
2 A C T A T A A T
3 A C T G C A A C
4 A C T G C G G C
5 A T C G C A A C
6 A T C G C A G C
7 A T T A T A A T
8 A T T A T A G T
9 A T T A T G G C
10 G C C A C A A C
11 G C C A C G A T
12 G C C G C G G T
13 G C C G T G G T
14 G C T G C G A T
15 G C T G T A G C
16 G T C A T A G T
17 G T C G T A A C
18 G T T A C G G T
19 G T T G T G G T
This is impossible with 5 markers. Assuming that we can't change the order of the markers, you need at least 6 markers to distinguish the individuals. Consider this function (a brute-force solution):
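distinct_combn <- function(df, m) {
  # Try every combination of m columns; keep the column names only when
  # all rows are distinct for that combination
  out <- combn(df, m, function(x) {
    if (nrow(unique(x)) == nrow(x)) names(x) else character(0L)
  }, simplify = FALSE)
  # Drop the combinations that failed
  out[lengths(out) > 0L]
}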
Then we can see that
> distinct_combn(df[, -1L], 5)
list()
> distinct_combn(df[, -1L], 6)
[[1]]
[1] "MN1" "MN2" "MN3" "MN5" "MN6" "MN7"
[[2]]
[1] "MN1" "MN2" "MN3" "MN5" "MN7" "MN8"
[[3]]
[1] "MN1" "MN2" "MN4" "MN5" "MN6" "MN7"
[[4]]
[1] "MN1" "MN2" "MN4" "MN5" "MN7" "MN8"
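Since the stated goal is to use as few markers as possible, the brute-force idea above can be wrapped in a loop that tries 1, 2, 3, ... markers and stops at the first size that works. The following is a rough sketch on a tiny made-up data set; `min_distinct_markers` and `toy` are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: smallest number of columns whose combination
# makes every row unique, plus all column sets of that size that work
min_distinct_markers <- function(markers) {
  for (m in seq_len(ncol(markers))) {
    hits <- combn(names(markers), m, function(cols) {
      # anyDuplicated() returns 0 when no row is repeated
      if (anyDuplicated(markers[cols]) == 0L) cols else NULL
    }, simplify = FALSE)
    hits <- Filter(Negate(is.null), hits)
    if (length(hits) > 0L) return(list(m = m, sets = hits))
  }
  NULL  # even all columns together leave duplicated rows
}

# Made-up example: no single marker separates the 4 rows, but M1 + M2 do
toy <- data.frame(M1 = c("A", "A", "G", "G"),
                  M2 = c("C", "T", "C", "T"),
                  M3 = c("A", "A", "A", "G"),
                  stringsAsFactors = FALSE)
res <- min_distinct_markers(toy)
print(res)
```

The number of combinations grows quickly with the number of markers, so this exhaustive search is only practical for small marker panels like the one in the question.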
Data I used
> df
Indiv MN1 MN2 MN3 MN4 MN5 MN6 MN7 MN8
1 1 A C C A C G A T
2 2 A C T A T A A T
3 3 A C T G C A A C
4 4 A C T G C G G C
5 5 A T C G C A A C
6 6 A T C G C A G C
7 7 A T T A T A A T
8 8 A T T A T A G T
9 9 A T T A T G G C
10 10 G C C A C A A C
11 11 G C C A C G A T
12 12 G C C G C G G T
13 13 G C C G T G G T
14 14 G C T G C G A T
15 15 G C T G T A G C
16 16 G T C A T A G T
17 17 G T C G T A A C
18 18 G T T A C G G T
19 19 G T T G T G G T
> dput(df)
structure(list(Indiv = 1:19, MN1 = c("A", "A", "A", "A", "A",
"A", "A", "A", "A", "G", "G", "G", "G", "G", "G", "G", "G", "G",
"G"), MN2 = c("C", "C", "C", "C", "T", "T", "T", "T", "T", "C",
"C", "C", "C", "C", "C", "T", "T", "T", "T"), MN3 = c("C", "T",
"T", "T", "C", "C", "T", "T", "T", "C", "C", "C", "C", "T", "T",
"C", "C", "T", "T"), MN4 = c("A", "A", "G", "G", "G", "G", "A",
"A", "A", "A", "A", "G", "G", "G", "G", "A", "G", "A", "G"),
MN5 = c("C", "T", "C", "C", "C", "C", "T", "T", "T", "C",
"C", "C", "T", "C", "T", "T", "T", "C", "T"), MN6 = c("G",
"A", "A", "G", "A", "A", "A", "A", "G", "A", "G", "G", "G",
"G", "A", "A", "A", "G", "G"), MN7 = c("A", "A", "A", "G",
"A", "G", "A", "G", "G", "A", "A", "G", "G", "A", "G", "G",
"A", "G", "G"), MN8 = c("T", "T", "C", "C", "C", "C", "T",
"T", "C", "C", "T", "T", "T", "T", "C", "T", "C", "T", "T"
)), class = "data.frame", row.names = c(NA, -19L))

Loop through the columns to search for multiple variables in R

Sorry that this is a follow-up question. I am trying to count how many times 'S' and 'T' appear in each column, as 'downstream' for rows 1 to 10 and then as 'upstream' for rows 15 to 25.
ST <- data.frame(scale = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
aa = c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'))
#input (example)
V1 V2 V3 V4 V5
1 C D E R N
2 C A M K P
3 V T Q Q E
4 A T S S S
5 C D E R N
6 C A M K P
7 V T Q Q E
8 A T S S S
9 R V D S A
10 W R H I C
11 S N I P T
12 Q A S D E
13 C D E R N
14 C A M K P
15 V T Q Q E
16 A T S S S
17 C D E R N
18 C A M K P
19 V T Q Q E
20 A T S S S
21 R V D S A
22 W R H I C
23 S N I P T
24 G A D S S
25 N T T S A
When I had a data frame with 'S' only, the script below worked but with 'ST', it doesn't. Could someone tell me why? Of course, I could get 'S' and 'T' separately and then add it later but is there a way to do it through this single data frame 'ST'?
#sum values from positions 1 to 10 and then from 15 to 25 works well for 1 letter only
count_aa <- df_trial %>%
summarise(across(everything(), ~ c(sum(.[1:10] == 'T'), sum(.[15:25] == 'T')))) %>%
mutate(categ = c('upstream', 'downstream'), .before = 1)
#view(count_aa)
df_count_aa<- as.data.frame(t(count_aa))
#view(df_count_aa)
We can use %in% instead of == when there is more than one element to compare against:
library(dplyr)
df_trial %>%
  summarise(across(everything(), ~
    c(sum(.[1:10] %in% c('S', 'T')),
      sum(.[15:25] %in% c('S', 'T'))))) %>%
  mutate(categ = c('upstream', 'downstream'), .before = 1)
-output
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
The == operator does an elementwise comparison. If we compare against more than one element, as in == c("S", "T"), the shorter vector is recycled along the full length of the column: 'S' is compared to the first element of the column, 'T' to the second, 'S' again to the third, and so on. The comparison is therefore based on position rather than on set membership.
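A small made-up vector makes the difference between the two operators visible (the names `x`, `positional`, and `membership` are chosen for this sketch only):

```r
x <- c("T", "S", "A", "T")

# == recycles c("S", "T") to S, T, S, T and compares position by position
positional <- x == c("S", "T")

# %in% asks, for each element of x, whether it is one of "S" or "T"
membership <- x %in% c("S", "T")

print(positional)       # FALSE FALSE FALSE  TRUE
print(membership)       #  TRUE  TRUE FALSE  TRUE
print(sum(positional))  # 1
print(sum(membership))  # 3
```

Only the last "T" happens to line up with a recycled "T", which is why the == version undercounts.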
In base R we can do colSums
colSums(df_trial == 'S') + colSums(df_trial == 'T')
In base R, you can do this with sapply:
data.frame(categ = c('upstream', 'downstream'),
sapply(df_trial, function(x)
c(sum(x[1:10] %in% c('S', 'T')), sum(x[15:25] %in% c('S', 'T')))))
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
Using base R
> rbind(downstream = sapply(df[1:10,], function(x) sum(grepl('[ST]',x))),
+ upstream = sapply(df[15:25,], function(x) sum(grepl('[ST]',x))))
V1 V2 V3 V4 V5
downstream 0 4 2 3 2
upstream 1 5 3 5 4
>
Data Used:
> dput(df)
structure(list(V1 = c("C", "C", "V", "A", "C", "C", "V", "A",
"R", "W", "S", "Q", "C", "C", "V", "A", "C", "C", "V", "A", "R",
"W", "S", "G", "N"), V2 = c("D", "A", "T", "T", "D", "A", "T",
"T", "V", "R", "N", "A", "D", "A", "T", "T", "D", "A", "T", "T",
"V", "R", "N", "A", "T"), V3 = c("E", "M", "Q", "S", "E", "M",
"Q", "S", "D", "H", "I", "S", "E", "M", "Q", "S", "E", "M", "Q",
"S", "D", "H", "I", "D", "T"), V4 = c("R", "K", "Q", "S", "R",
"K", "Q", "S", "S", "I", "P", "D", "R", "K", "Q", "S", "R", "K",
"Q", "S", "S", "I", "P", "S", "S"), V5 = c("N", "P", "E", "S",
"N", "P", "E", "S", "A", "C", "T", "E", "N", "P", "E", "S", "N",
"P", "E", "S", "A", "C", "T", "S", "A")), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))
>

Replace a particular value with a value from the row above

I have a large table in which a few columns contain "-". I want to replace each "-" with the value from the row above in the same column.
library(tidyverse)
# This is the df I have
df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "-", "b", "b", "b", "-", "c", "c", "c", "-"),
bad = c("d", "d", "d", "-", "e", "e", "e", "-", "f", "f", "f", "-"),
table = c("g", "g", "g", "-", "h", "h", "h", "-", "i", "i", "i", "-")
)
# This is the desired output:
output_df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
bad = c("d", "d", "d", "d", "e", "e", "e", "e", "f", "f", "f", "f"),
table = c("g", "g", "g", "g", "h", "h", "h", "h", "i", "i", "i", "i")
)
# What I have tried unsuccessfully:
df %>%
mutate_at(c("my", "bad", "table"), .funs = str_replace("-", NA))
I'm a bit stumped with this one.....any ideas?
One option is to change the "-" values to NA and then use fill:
library(tidyverse)
df %>%
  mutate_all(na_if, "-") %>%
  fill(my, bad, table)
# or, if there are many columns
# fill(!!! rlang::syms(names(.)))
# or, as H1 suggested
# fill(everything())
# my bad table
#1 a d g
#2 a d g
#3 a d g
#4 a d g
#5 b e h
#6 b e h
#7 b e h
#8 b e h
#9 c f i
#10 c f i
#11 c f i
#12 c f i
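The same "carry the last value forward" idea can also be sketched in base R, without tidyr. The helper `fill_down` below is a hypothetical name for this illustration; it tracks, for each position, the index of the most recent non-missing value.

```r
# Hypothetical helper: replace the marker with the last value seen above it
fill_down <- function(x, marker = "-") {
  x[x %in% marker] <- NA
  # Index of the latest non-NA element at or before each position (0 if none)
  last_seen <- cummax(seq_along(x) * (!is.na(x)))
  last_seen[last_seen == 0L] <- NA  # nothing above to copy from
  x[last_seen]
}

df <- data.frame(my    = c("a", "a", "-", "b", "-"),
                 table = c("g", "-", "-", "h", "h"),
                 stringsAsFactors = FALSE)

# Apply column by column while keeping the data frame structure
df[] <- lapply(df, fill_down)
print(df)
```

A "-" in the first row has nothing above it, so this sketch leaves it as NA, which is also what fill does with a leading NA in its default downward direction.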

Extract list of values from column based upon other column

The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, it returns NA by default:
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
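A closely related base R idiom is a named vector used as a lookup table. This is only a sketch of an alternative to match, assuming the letters are unique; `lookup` and `scores_for` are names invented for the illustration.

```r
df <- data.frame(letter = c("a", "b", "c", "d", "e", "f"),
                 score = 1:6,
                 stringsAsFactors = FALSE)

# Named vector: names are the letters, values are the scores
lookup <- setNames(df$score, df$letter)

# Indexing by name returns NA for names that are not present
scores_for <- function(keys) unname(lookup[keys])

print(scores_for(c("f", "a", "d", "e")))
print(scores_for(c("c", "o", "f", "f", "e", "e")))
```

The lookup vector is convenient when the same table is queried many times, while match avoids creating the intermediate object.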
If you want to use dplyr I would use a join:
df <- data.frame(
  "letter" = c("a", "b", "c", "d", "e", "f"),
  "score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
