I have some genotype data from 8 markers and 20 individuals. I would like to select 5 of the 8 markers such that they form a unique genotype pattern for each individual. The purpose is to select as few markers as possible to distinguish the 20 individuals.
I know that I need to select 5 columns out of 8, and then compare each row. If we find duplicated rows, then we need to re-select another 5 columns, until we find no duplicated rows.
But I don't know how I can translate it into R. Could somebody help? Thanks!
sample data
Indiv MN1 MN2 MN3 MN4 MN5 MN6 MN7 MN8
1 A C C A C G A T
2 A C T A T A A T
3 A C T G C A A C
4 A C T G C G G C
5 A T C G C A A C
6 A T C G C A G C
7 A T T A T A A T
8 A T T A T A G T
9 A T T A T G G C
10 G C C A C A A C
11 G C C A C G A T
12 G C C G C G G T
13 G C C G T G G T
14 G C T G C G A T
15 G C T G T A G C
16 G T C A T A G T
17 G T C G T A A C
18 G T T A C G G T
19 G T T G T G G T
Impossible. Assuming that we can't change the order of the markers, you need at least 6 markers to distinguish the individuals. Consider this function (a brute-force solution).
distinct_combn <- function(df, m) {
  # Try every m-column subset; keep its column names if all rows are unique
  out <- combn(df, m, function(x) {
    if (nrow(unique(x)) == nrow(x)) names(x) else character(0L)
  }, simplify = FALSE)
  out[lengths(out) > 0L]  # drop the subsets that left duplicated rows
}
Then we can see that
> distinct_combn(df[, -1L], 5)
list()
> distinct_combn(df[, -1L], 6)
[[1]]
[1] "MN1" "MN2" "MN3" "MN5" "MN6" "MN7"
[[2]]
[1] "MN1" "MN2" "MN3" "MN5" "MN7" "MN8"
[[3]]
[1] "MN1" "MN2" "MN4" "MN5" "MN6" "MN7"
[[4]]
[1] "MN1" "MN2" "MN4" "MN5" "MN7" "MN8"
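If the smallest workable number of markers is not known in advance, the same brute-force idea can be wrapped in a loop over m. This is only a sketch: min_distinct_markers and the toy data below are hypothetical names made up for illustration, and the search grows exponentially with the number of columns.

```r
# Hypothetical helper (not from the answer above): find the smallest m for
# which some m-column subset makes every row unique, and return those subsets.
min_distinct_markers <- function(markers) {
  for (m in seq_along(markers)) {
    hits <- combn(names(markers), m, function(cols) {
      # anyDuplicated() returns 0 when no row is repeated
      if (anyDuplicated(markers[cols]) == 0L) cols else character(0L)
    }, simplify = FALSE)
    hits <- hits[lengths(hits) > 0L]
    if (length(hits) > 0L) return(list(m = m, sets = hits))
  }
  NULL  # even all columns together leave duplicated rows
}

# Toy data, invented for this example
toy <- data.frame(
  M1 = c("A", "A", "G", "G"),
  M2 = c("C", "T", "C", "T"),
  M3 = c("A", "A", "A", "G"),
  stringsAsFactors = FALSE
)

res <- min_distinct_markers(toy)
res$m     # 2
res$sets  # list(c("M1", "M2"))
```

Returning character(0L) rather than NULL from the inner function mirrors the answer above; assigning NULL into the list that combn builds would drop elements.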
Data I used
> df
Indiv MN1 MN2 MN3 MN4 MN5 MN6 MN7 MN8
1 1 A C C A C G A T
2 2 A C T A T A A T
3 3 A C T G C A A C
4 4 A C T G C G G C
5 5 A T C G C A A C
6 6 A T C G C A G C
7 7 A T T A T A A T
8 8 A T T A T A G T
9 9 A T T A T G G C
10 10 G C C A C A A C
11 11 G C C A C G A T
12 12 G C C G C G G T
13 13 G C C G T G G T
14 14 G C T G C G A T
15 15 G C T G T A G C
16 16 G T C A T A G T
17 17 G T C G T A A C
18 18 G T T A C G G T
19 19 G T T G T G G T
> dput(df)
structure(list(Indiv = 1:19, MN1 = c("A", "A", "A", "A", "A",
"A", "A", "A", "A", "G", "G", "G", "G", "G", "G", "G", "G", "G",
"G"), MN2 = c("C", "C", "C", "C", "T", "T", "T", "T", "T", "C",
"C", "C", "C", "C", "C", "T", "T", "T", "T"), MN3 = c("C", "T",
"T", "T", "C", "C", "T", "T", "T", "C", "C", "C", "C", "T", "T",
"C", "C", "T", "T"), MN4 = c("A", "A", "G", "G", "G", "G", "A",
"A", "A", "A", "A", "G", "G", "G", "G", "A", "G", "A", "G"),
MN5 = c("C", "T", "C", "C", "C", "C", "T", "T", "T", "C",
"C", "C", "T", "C", "T", "T", "T", "C", "T"), MN6 = c("G",
"A", "A", "G", "A", "A", "A", "A", "G", "A", "G", "G", "G",
"G", "A", "A", "A", "G", "G"), MN7 = c("A", "A", "A", "G",
"A", "G", "A", "G", "G", "A", "A", "G", "G", "A", "G", "G",
"A", "G", "G"), MN8 = c("T", "T", "C", "C", "C", "C", "T",
"T", "C", "C", "T", "T", "T", "T", "C", "T", "C", "T", "T"
)), class = "data.frame", row.names = c(NA, -19L))
I have two data matrices of different dimensions stored as objects in R (I am using Rstudio with R v4.0.2 in Windows 10):
m1 = 1 column x 44 rows (this is a list of names with no spaces).
m2 = 500,000 columns x 164 rows (this contains a string of characters, the first row being a list of names).
I want to check how many (and which) of the rows of m1 are found in m2 (meaning it will be anywhere between 0 and 44). The end goal is that I have 4000 different matrices that will substitute the place of m2, and I need to see the extent of missing entries (found in m1) in all of the m2s (i.e., I am looking at the extent of missing data of those 44 names).
I am still a beginner to R, so apologies if my description is a bit off.
I tried storing each matrix, saved as CSV files, as such:
m1 <- read.csv("names-file.csv")
m2 <- read.csv("data-file.csv")
and tried to use the row.match function in the prodlim package, and ran row.match(m1, m2) but only got numeric values. I am looking to see just a number of how many of the names from m1 (first column) are found in m2 (first column), which values those are, and what the percentage would be (x out of 44).
As an example:
m1 =
Tom
Harry
Cindy
Megan
Jack
m2 =
Tom XXXXXXXXXXXX----XXXXXXXX
Stephanie XXXXXXXXXXXXXXXX----XXXX
Megan XXXXXXXXXXXXXXXXXXXXXXXX
Ryan XXXXXXXXXXXXXXXXXXXXXX-X
David XXXXXX---XXXXXXXXXXXXXXX
Josh XXXXXXXXXXXXXXXXXXXXXXXX
In the m2 matrix, column 1 holds the names, and each subsequent X (which represents an A, T, C, or G) is a subsequent column (so each column contains an A, T, C, G, or a "-"). I am looking to write code that would show how many of the names from m1 are found in m2 (and conversely, how much data is missing from m2 as a percentage). In this case, the desired outputs would be:
2
Tom
Megan
60%
Here are my specific datafile using dput() (please let me know if I am using dput() correctly):
m1:
structure(list(V1 = c("Taxon1", "Taxon2", "Taxon3", "Taxon4",
"Taxon5", "Taxon6", "Taxon7", "Taxon8")), class = "data.frame", row.names = c(NA,
-8L))
m2:
structure(list(V1 = c("Taxon1", "Taxon3", "Taxon4", "Taxon6",
"Taxon7", "Taxon9", "Taxon10", "Taxon11", "Taxon12", "Taxon13",
"Taxon14", "Taxon15", "Taxon16", "Taxon17", "Taxon18", "Taxon19",
"Taxon20", "Taxon21", "Taxon22", "Taxon23", "Taxon24", "Taxon25",
"Taxon26", "Taxon27", "Taxon28", "Taxon29", "Taxon30"), V2 = c("A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "C", "C", "C", "C", "C", "C", "C"
), V3 = c("G", "G", "G", "G", "G", "C", "C", "G", "G", "G", "G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G",
"G", "G", "G"), V4 = c("C", "C", "C", "C", "C", "T", "G", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C"), V5 = c("T", "T", "G", "T", "G",
"G", "G", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T", "T", "T", "T", "T", "T", "T", "T"), V6 = c("G", "G",
"C", "G", "C", "C", "C", "G", "G", "G", "G", "G", "G", "G", "G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G"),
V7 = c("C", "C", "A", "C", "A", "A", "A", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "G", "G", "G", "G", "G", "G",
"G", "G", "G", "G", "G"), V8 = c("T", "T", "A", "T", "A",
"A", "A", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T", "T", "T", "T", "T", "T", "T", "T", "T"), V9 = c("A",
"A", "A", "A", "A", "T", "T", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T")), class = "data.frame", row.names = c(NA, -27L))
Thank you!
You might want to have a look at the %in% operator in R. According to your question, you might want something like this:
m1[,1] %in% m2[,1]
#[1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
You can then pair it with functions such as mean or sum which will help you to find the percentage as required:
sum(m1[,1] %in% m2[,1])
#[1] 5
mean(m1[,1] %in% m2[,1])
#[1] 0.625
EDIT: As requested by the OP in the comments of this post, to list which names are found (or missing) there are various methods, my personal favourite being the which function:
m1[which(m1[,1] %in% m2[,1]),]
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
m1[which(!(m1[,1] %in% m2[,1])),]
#[1] "Taxon2" "Taxon5" "Taxon8"
Again, this is only one method out of many (I can count 3 right now...), so I suggest you explore the other options...
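To connect this back to the small example in the question, here is a minimal sketch that produces the three requested outputs (count, matching names, percentage missing). The vectors m1_names and m2_names are hypothetical stand-ins for the first columns of m1 and m2.

```r
# Names taken from the example in the question
m1_names <- c("Tom", "Harry", "Cindy", "Megan", "Jack")
m2_names <- c("Tom", "Stephanie", "Megan", "Ryan", "David", "Josh")

found <- m1_names %in% m2_names   # one TRUE/FALSE per name in m1

n_found     <- sum(found)                             # 2
which_found <- m1_names[found]                        # "Tom" "Megan"
pct_missing <- sprintf("%.0f%%", 100 * mean(!found))  # "60%"
```

With the real data, m1[, 1] and m2[, 1] would take the place of the two vectors.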
To get the names common to both data frames you can use intersect; to calculate the missing percentage you can use %in% with mean:
common_names <- intersect(m1$V1, m2$V1)
missing_percentage_in_m1 <- mean(!m1$V1 %in% m2$V1) * 100
missing_percentage_in_m2 <- mean(!m2$V1 %in% m1$V1) * 100
common_names
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
missing_percentage_in_m1
#[1] 37.5
missing_percentage_in_m2
#[1] 81.48148
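Since the question mentions repeating this check over many tables, the same expression can be applied across a list. This is a sketch under assumptions: the list name tables, its element names, and the toy contents are invented here; in practice each element would come from read.csv on one of the real files.

```r
# Reference names (stand-in for m1$V1)
reference <- c("Taxon1", "Taxon2", "Taxon3")

# Hypothetical collection of m2-like tables, keyed by an invented label
tables <- list(
  run_a = data.frame(V1 = c("Taxon1", "Taxon3", "Taxon9"), stringsAsFactors = FALSE),
  run_b = data.frame(V1 = "Taxon2", stringsAsFactors = FALSE)
)

# Percentage of reference names missing from each table
pct_missing <- vapply(
  tables,
  function(tbl) 100 * mean(!reference %in% tbl$V1),
  numeric(1)
)
pct_missing
```

vapply is used instead of sapply so that the result is guaranteed to be a numeric vector with one value per table.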
This code gives results in the format you asked for:
2
Tom
Megan
60%
1. How many of the names from m1 are found in m2
library(dplyr)
m1 <- t(m1)
res1 <-m2 %>%
rowwise %>%
mutate(n = m1 %in% c_across(V1:V9) %>% sum)
res1
# A tibble: 27 x 10
# Rowwise:
V1 V2 V3 V4 V5 V6 V7 V8 V9 n
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 Taxon1 A G C T G C T A 1
2 Taxon3 A G C T G C T A 1
3 Taxon4 A G C G C A A A 1
4 Taxon6 A G C T G C T A 1
5 Taxon7 A G C G C A A A 1
6 Taxon9 A C T G C A A T 0
7 Taxon10 A C G G C A A T 0
8 Taxon11 A G C T G C T A 0
9 Taxon12 A G C T G C T A 0
10 Taxon13 A G C T G C T A 0
# ... with 17 more rows
res1 %>% select(n) %>% sum
[1] 5
res2 <-res1 %>%
filter(n >0) %>%
pull(V1) %>%
unique
res2
[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
2. How much data is missing from m2 (as a proportion)
res3 <- res2 %>% length
1 - res3 / length(unique(m2$V1))
[1] 0.8148148
Sorry that this is a follow-up question. I am trying to count how many 'S' and 'T' appear in each column, as 'downstream' for rows 1 to 10 and then as 'upstream' for rows 15 to 25.
ST <- data.frame(scale = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0),
aa = c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'))
#input (example)
V1 V2 V3 V4 V5
1 C D E R N
2 C A M K P
3 V T Q Q E
4 A T S S S
5 C D E R N
6 C A M K P
7 V T Q Q E
8 A T S S S
9 R V D S A
10 W R H I C
11 S N I P T
12 Q A S D E
13 C D E R N
14 C A M K P
15 V T Q Q E
16 A T S S S
17 C D E R N
18 C A M K P
19 V T Q Q E
20 A T S S S
21 R V D S A
22 W R H I C
23 S N I P T
24 G A D S S
25 N T T S A
When I had a data frame with 'S' only, the script below worked but with 'ST', it doesn't. Could someone tell me why? Of course, I could get 'S' and 'T' separately and then add it later but is there a way to do it through this single data frame 'ST'?
#sum values from positions 1 to 10 and then from 15 to 25 works well for 1 letter only
count_aa <- df_trial %>%
summarise(across(everything(), ~ c(sum(.[1:10] == 'T'), sum(.[15:25] == 'T')))) %>%
mutate(categ = c('upstream', 'downstream'), .before = 1)
#view(count_aa)
df_count_aa<- as.data.frame(t(count_aa))
#view(df_count_aa)
We can use %in% instead of == when there is more than one element to compare
library(dplyr)
df_trial %>%
summarise(across(everything(), ~
c(sum(.[1:10] %in% c('S', 'T')),
sum(.[15:25] %in% c('S', 'T'))))) %>%
mutate(categ = c('upstream', 'downstream'), .before = 1)
-output
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
The == operator does an elementwise comparison. If we compare with more than one element, as in == c("S", "T"), the shorter vector is recycled along the entire length of the column: 'S' is compared to the first element of the column, 'T' to the second, 'S' again to the third, and so on. The comparison is therefore based on position, whereas %in% checks each element against the whole set.
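A small illustration of the difference, using a made-up vector x (not the question's data):

```r
x <- c("S", "T", "A", "S", "T", "S")

# == recycles c("S", "T") by position: S==S, T==T, A==S, S==T, T==S, S==T
by_position <- x == c("S", "T")
by_position        # TRUE TRUE FALSE FALSE FALSE FALSE

# %in% tests each element of x against the whole set
by_membership <- x %in% c("S", "T")
by_membership      # TRUE TRUE FALSE TRUE TRUE TRUE

sum(by_position)   # 2, an undercount
sum(by_membership) # 5, the intended count
```

Because the length of x is a multiple of 2, R recycles silently here; with a length that is not a multiple it would also emit a warning.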
In base R we can use colSums (as written, this counts over all rows of each column, so subset the rows first to count a range)
colSums(df_trial == 'S') + colSums(df_trial == 'T')
In base R, you can do this with sapply:
data.frame(categ = c('upstream', 'downstream'),
sapply(df_trial, function(x)
c(sum(x[1:10] %in% c('S', 'T')), sum(x[15:25] %in% c('S', 'T')))))
# categ V1 V2 V3 V4 V5
#1 upstream 0 4 2 3 2
#2 downstream 1 5 3 5 4
Using base R
> rbind(downstream = sapply(df[1:10,], function(x) sum(grepl('[ST]',x))),
+ upstream = sapply(df[15:25,], function(x) sum(grepl('[ST]',x))))
V1 V2 V3 V4 V5
downstream 0 4 2 3 2
upstream 1 5 3 5 4
>
Data Used:
> dput(df)
structure(list(V1 = c("C", "C", "V", "A", "C", "C", "V", "A",
"R", "W", "S", "Q", "C", "C", "V", "A", "C", "C", "V", "A", "R",
"W", "S", "G", "N"), V2 = c("D", "A", "T", "T", "D", "A", "T",
"T", "V", "R", "N", "A", "D", "A", "T", "T", "D", "A", "T", "T",
"V", "R", "N", "A", "T"), V3 = c("E", "M", "Q", "S", "E", "M",
"Q", "S", "D", "H", "I", "S", "E", "M", "Q", "S", "E", "M", "Q",
"S", "D", "H", "I", "D", "T"), V4 = c("R", "K", "Q", "S", "R",
"K", "Q", "S", "S", "I", "P", "D", "R", "K", "Q", "S", "R", "K",
"Q", "S", "S", "I", "P", "S", "S"), V5 = c("N", "P", "E", "S",
"N", "P", "E", "S", "A", "C", "T", "E", "N", "P", "E", "S", "N",
"P", "E", "S", "A", "C", "T", "S", "A")), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))
>
I have a large table which has a few columns that have "-" in them. I want to replace "-" with the value from the row above in the same column
library(tidyverse)
# This is the df I have
df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "-", "b", "b", "b", "-", "c", "c", "c", "-"),
bad = c("d", "d", "d", "-", "e", "e", "e", "-", "f", "f", "f", "-"),
table = c("g", "g", "g", "-", "h", "h", "h", "-", "i", "i", "i", "-")
)
# This is the desired output:
output_df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
bad = c("d", "d", "d", "d", "e", "e", "e", "e", "f", "f", "f", "f"),
table = c("g", "g", "g", "g", "h", "h", "h", "h", "i", "i", "i", "i")
)
# What I have tried unsuccessfully:
df %>%
mutate_at(c("my", "bad", "table"), .funs = str_replace("-", NA))
I'm a bit stumped with this one.....any ideas?
One option is to change the "-" to NA and then use fill
library(tidyverse)
df %>%
mutate_all(na_if, "-") %>%
fill(my, bad, table)
# or if there are many columns
# fill(!!! rlang::syms(names(.)))
# or as H1 suggested
# fill(everything())
# my bad table
#1 a d g
#2 a d g
#3 a d g
#4 a d g
#5 b e h
#6 b e h
#7 b e h
#8 b e h
#9 c f i
#10 c f i
#11 c f i
#12 c f i
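For comparison, the same two steps (turn "-" into NA, then carry the last seen value down) can be sketched in base R. fill_down is a hypothetical helper written for this illustration, not a tidyr function; it assumes character columns and leaves any leading blanks as NA.

```r
# Hypothetical helper: replace `blank` with the most recent non-blank value
fill_down <- function(x, blank = "-") {
  x[x %in% blank] <- NA
  seen <- cumsum(!is.na(x))   # how many non-missing values so far
  seen[seen == 0L] <- NA      # nothing to carry down before the first value
  x[!is.na(x)][seen]          # look up the most recent non-missing value
}

# Shortened toy version of the question's data
toy <- data.frame(
  my  = c("a", "a", "-", "b", "-"),
  bad = c("d", "d", "-", "e", "-"),
  stringsAsFactors = FALSE
)

filled <- toy
filled[] <- lapply(toy, fill_down)   # apply column by column, keep data frame shape
filled
```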
I am quite new to R programming, and am having some difficulty with ANOTHER step of my project. I am not even sure at this point if I am asking the question correctly. I have a dataframe of actual and predicted values:
actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a
The issue that I am having is that I need to create a vector of mismatches between the actual value and each of the four predicted values. This should result in a single vector: c(2,1,2,4)
I am trying to use a boolean mask to sum over the TRUE values...but something is not working right. I need to do this sum for each of the four predicted values to actual value comparisons.
discordant_sums(df[,seq(1,ncol(df),2)]!=,df[,seq(2,ncol(df),2)])
Any suggestions would be greatly appreciated.
You can use apply to compare values in 1st column with values in each of all other columns.
apply(df[-1], 2, function(x)sum(df[1]!=x))
# predicted.1 predicted.2 predicted.3 predicted.4
# 2 1 2 4
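A variant of the same column-by-column comparison, sketched with vapply so that the result is guaranteed to be an integer vector. The data frame is rebuilt here from the table in the question so the snippet runs on its own.

```r
df <- data.frame(
  actual      = c("a", "a", "b", "b", "c", "c", "d", "d"),
  predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
  predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
  predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
  predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a"),
  stringsAsFactors = FALSE
)

# For each predicted column, count the rows that disagree with `actual`
mismatches <- vapply(df[-1], function(pred) sum(pred != df$actual), integer(1))
unname(mismatches)  # 2 1 2 4
```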
Data:
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)
We can replicate the first column to make the lengths equal between the comparison objects and do the colSums
as.vector(colSums(df[,1][row(df[-1])] != df[-1]))
#[1] 2 1 2 4
data
df <- structure(list(actual = c("a", "a", "b", "b", "c", "c", "d",
"d"), predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")),
.Names = c("actual",
"predicted.1", "predicted.2", "predicted.3", "predicted.4"),
class = "data.frame", row.names = c(NA,
-8L))
I have a data frame as below:
> dfnew
C1 C2 C3 C4 C5 C6
1 A A G A G A
2 A T T T G G
3 T A G A T A
4 C A A A A G
5 C A T T T C
6 C A A A T A
7 T C T G A A
8 G A G C T A
9 C T A T G A
10 G A A A G G
11 G G T T T A
12 G A C T T A
13 T T C T T T
14 A T A G C T
15 A C A A A A
16 A A C A A A
17 T G G A A T
18 A A A A G T
19 G T G G <NA> <NA>
I want to get answer as below in one line of code in R without looping:
A 6 10 7 9 5 10
C 4 2 3 1 1 1
G 5 2 5 3 5 3
T 4 5 4 6 7 4
We can use sapply to loop over the columns, convert it to factor with levels specified and get the frequency with table
sapply(dfnew, function(x) table(factor(x, levels = c("A", "C", "G", "T"))))
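The explicit levels are what make this work as a single matrix: every column is tabulated over the same four categories, so zero counts are kept and sapply can bind the results. A small sketch with made-up data (toy and bases are names invented here) shows the difference:

```r
toy <- data.frame(
  C1 = c("A", "A", "T"),
  C2 = c("G", "G", NA),
  stringsAsFactors = FALSE
)
bases <- c("A", "C", "G", "T")

# With fixed levels: every table has length 4, so sapply returns a 4 x 2 matrix
counts <- sapply(toy, function(x) table(factor(x, levels = bases)))
counts

# Without fixed levels: the tables have different lengths, so sapply returns a list
ragged <- sapply(toy, table)
is.list(ragged)  # TRUE
```

NA values are dropped by table by default, which is why the last row of the question's data does not add to the C5 and C6 counts.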
Or using tidyverse
library(dplyr)
library(tidyr)
dfnew %>%
gather(key, val, na.rm = TRUE) %>%
count(key, val) %>%
spread(key, n)
If you use stack to reshape everything to long form, you can call table on the result:
dfnew <- data.frame(C1 = c("A", "A", "T", "C", "C", "C", "T", "G", "C", "G", "G", "G", "T", "A", "A", "A", "T", "A", "G"),
C2 = c("A", "T", "A", "A", "A", "A", "C", "A", "T", "A", "G", "A", "T", "T", "C", "A", "G", "A", "T"),
C3 = c("G", "T", "G", "A", "T", "A", "T", "G", "A", "A", "T", "C", "C", "A", "A", "C", "G", "A", "G"),
C4 = c("A", "T", "A", "A", "T", "A", "G", "C", "T", "A", "T", "T", "T", "G", "A", "A", "A", "A", "G"),
C5 = c("G", "G", "T", "A", "T", "T", "A", "T", "G", "G", "T", "T", "T", "C", "A", "A", "A", "G", NA),
C6 = c("A", "G", "A", "G", "C", "A", "A", "A", "A", "G", "A", "A", "T", "T", "A", "A", "T", "T", NA),
stringsAsFactors = FALSE)
table(stack(dfnew))
#> ind
#> values C1 C2 C3 C4 C5 C6
#> A 6 10 7 9 5 10
#> C 4 2 3 1 1 1
#> G 5 2 5 3 5 3
#> T 4 5 4 6 7 4
Using data.table and its pipe workflow with [:
library(data.table)
tab <- fread("
C1 C2 C3 C4 C5 C6
A A G A G A
A T T T G G
T A G A T A
C A A A A G
C A T T T C
C A A A T A
T C T G A A
G A G C T A
C T A T G A
G A A A G G
G G T T T A
G A C T T A
T T C T T T
A T A G C T
A C A A A A
A A C A A A
T G G A A T
A A A A G T
G T G G NA NA")
tab[, melt(.SD, measure.vars = paste0("C", 1:6), na.rm = TRUE)][
, dcast(.SD, value ~ variable, fun = length, drop = TRUE)
]
#> value C1 C2 C3 C4 C5 C6
#> 1: A 6 10 7 9 5 10
#> 2: C 4 2 3 1 1 1
#> 3: G 5 2 5 3 5 3
#> 4: T 4 5 4 6 7 4