Calculation of allele frequency using wig file in R

I have a matrix (similar to a wig file) like this:
Position  reference  A  C  G  T  N  sum(total read counts)
68773265  A          1  0  0  0  0  1
68773266  C          0  1  0  1  0  2
68773267  C          0  1  1  2  0  4
To get the variant (non-reference) allele ratio, I want to compute (sum - reference base's count) / sum * 100 per position, creating this:
Position  reference  frequency(%)  sum(total read counts)
68773265  A          0             1
68773266  C          50            2
68773267  C          75            4
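For instance, at position 68773266 the reference base C accounts for 1 of the 2 total reads, so the non-reference frequency is (2 - 1) / 2 * 100 = 50%.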
Please give me some advice on this problem. Thanks in advance!!

Using the subset of column names "nm1", match the "reference" column against "nm1" to get the column index, and cbind it with 1:nrow(df1) to create a row/column index. Get the rowSums of the "nm1" columns ("Sum1"), then use this to compute "frequencyPercent" with the formula from the post.
nm1 <- c('A', 'C', 'G', 'T') # this could include `N` also
indx <- cbind(1:nrow(df1), match(df1$reference, nm1))
Sum1 <- rowSums(df1[nm1])
data.frame(df1[1:2],
           frequencyPercent = 100 * (Sum1 - df1[nm1][indx]) / Sum1,
           SumTotalCounts = df1[, ncol(df1)])
Or use transform on the original dataset:
transform(df1,
          frequencyPercent = 100 * (Sum1 - df1[nm1][indx]) / Sum1,
          check.names = FALSE)[c(1:2, 8:9)]
#   Position reference sum(total read counts) frequencyPercent
# 1 68773265         A                      1                0
# 2 68773266         C                      2               50
# 3 68773267         C                      4               75
data
df1 <- structure(list(Position = 68773265:68773267, reference = c("A",
"C", "C"), A = c(1L, 0L, 0L), C = c(0L, 1L, 1L), G = c(0L, 0L,
1L), T = 0:2, N = c(0L, 0L, 0L), `sum(total read counts)` = c(1L,
2L, 4L)), .Names = c("Position", "reference", "A", "C", "G",
"T", "N", "sum(total read counts)"), class = "data.frame",
row.names = c(NA, -3L))
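For comparison, a tidyverse sketch of the same calculation, assuming the nm1 vector of base columns defined above (like Sum1, it recomputes the total from those columns):
library(dplyr)
library(tidyr)

# Reshape to long form, then compute the non-reference fraction per position
df1 %>%
  pivot_longer(all_of(nm1), names_to = "base", values_to = "count") %>%
  group_by(Position, reference) %>%
  summarise(frequencyPercent = 100 * sum(count[base != reference]) / sum(count),
            `sum(total read counts)` = sum(count), .groups = "drop")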

Related

Remove trailing 0s and last 1 from list of observations in R

I have a dataframe (df) that looks similar to this:
person outcome
a      1
a      1
a      0
a      0
a      0
b      1
b      0
b      1
c      1
c      1
c      0
c      0
c      0
For persons whose last observation is a 0, I would like to remove the trailing 0s plus the last 1, so that the final df looks like this:
person outcome
a      1
b      1
b      0
b      1
c      1
The trailing three 0s plus the last 1 were removed for a and c, but b was left alone because its last observation was a 1. Is there a way to do this, or does it have to be done by hand?
Maybe this helps. cumsum(outcome) increments at each 1, so within a group whose last outcome is 0, every row from the final 1 onward shares the maximum cumulative sum and can be filtered out:
library(dplyr)
df %>%
  group_by(person) %>%
  mutate(new = cumsum(outcome)) %>%
  filter(if (last(outcome) == 0) new < max(new) else TRUE) %>%
  ungroup %>%
  select(-new)
-output
# A tibble: 5 × 2
  person outcome
  <chr>    <int>
1 a            1
2 b            1
3 b            0
4 b            1
5 c            1
data
df <- structure(list(person = c("a", "a", "a", "a", "a", "b", "b",
"b", "c", "c", "c", "c", "c"), outcome = c(1L, 1L, 0L, 0L, 0L,
1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-13L))
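A base R sketch of the same idea, assuming (as in the example) that every group ending in 0 contains at least one 1:
# For each person, flag the rows to keep: if the last outcome is 0,
# drop everything from the final 1 onward; otherwise keep all rows
keep <- with(df, ave(outcome, person, FUN = function(x) {
  if (x[length(x)] == 0) as.integer(seq_along(x) < max(which(x == 1)))
  else rep(1L, length(x))
}))
df[keep == 1, ]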

Iterate 2x2 tables and Fisher's exact test over multiple columns in a large dataframe in R

I'm new to R and I'm looking for help with looping over a large data frame to create multiple 2x2 tables with the table() function, followed by fisher.test() on each table to export the odds ratios and p-values.
My dataframes have cluster(s) and genes as columns like this:
df

Tag Cluster Gene1 Gene2 Gene3 Gene4
1   a       1     0     0     1
2   b       0     1     1     1
3   c       1     1     0     0
4   d       0     0     0     1
5   e       1     1     0     0
This is the code I want iterated and looped:
tab <- table(df$Cluster == "a", df$Gene1)
fisher.test(tab)
I'm not sure where to begin and would appreciate any help.
Thanks!
P.S. My actual data frame dimensions are 1010 x 191.
If we want to apply fisher.test to all the 'Gene' columns using the same condition on 'Cluster':
library(dplyr)
out <- df %>%
  summarise_at(vars(starts_with('Gene')), ~
    list(table(Cluster == 'a', .) %>%
           fisher.test))
out$Gene1
out$Gene2
In base R, this can be done with lapply
out2 <- lapply(df[startsWith(names(df), "Gene")], function(v)
  fisher.test(table(df$Cluster == "a", v)))
out2$Gene1
out2$Gene2
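Since the stated goal is to export the odds ratio and p-value for each gene, one way (a sketch using the out2 list from above) is to collect them into a single data frame:
res <- data.frame(
  gene       = names(out2),
  odds.ratio = sapply(out2, function(x) x$estimate),
  p.value    = sapply(out2, function(x) x$p.value)
)
res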
data
df <- structure(list(Tag = 1:5, Cluster = c("a", "b", "c", "d", "e"
), Gene1 = c(1L, 0L, 1L, 0L, 1L), Gene2 = c(0L, 1L, 1L, 0L, 1L
), Gene3 = c(0L, 1L, 0L, 0L, 0L), Gene4 = c(1L, 1L, 0L, 1L, 0L
)), class = "data.frame", row.names = c(NA, -5L))

Sum the values of groups of 4 contiguous columns in R

Starting from a table of 372 columns and 12,000 rows in R, I need to create a new table whose columns contain, per row, the sum of columns 1:4, then 5:8, then 9:12, and so on up to column 372 of the original table. Here is a short example:
Input:
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L,
  1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"), c("a", "b",
  "c", "d", "e", "f", "g", "h")))
Which looks like this:
a b c d e f g h
r1 3 2 3 1 1 2 3 1
r2 1 6 1 8 5 1 7 1
Expected output:
A B
r1 9 7
r2 16 14
So, A = a+b+c+d and B = e+f+g+h. Easy to do with a small table in Excel. Columns a-d correspond to one group and e-h to another, if that helps.
The question is currently underspecified, but supposing you have a matrix...
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L,
7L, 1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"),
c("a", "b", "c", "d", "e", "f", "g", "h")))
Make your column mapping:
map = data.frame(old = colnames(m), new = rep(LETTERS, each=4, length.out=ncol(m)))
old new
1 a A
2 b A
3 c A
4 d A
5 e B
6 f B
7 g B
8 h B
And then rowsum by it:
res = rowsum(t(m), map$new)
r1 r2
A 9 16
B 7 14
We have to transpose the data with t here because R has rowsum but no colsum. You can transpose it back afterwards, like t(res).
A base R solution, suppose df is your data frame:
ncols <- ncol(df)  # 8 in this example
do.call(cbind, lapply(seq(1, ncols, 4), function(i) rowSums(df[i:(i + 3)])))
# [,1] [,2]
# r1 9 7
# r2 16 14
Another way:
df <- data.frame(t(matrix(colSums(matrix(t(df), nrow = 4)), nrow = nrow(df))))
## X1 X2
##1 9 7
##2 16 14
1. First transpose the data to a 4 x (ncol(df)/4 * nrow(df)) matrix, where each column now holds one group of four values from a row of the original data frame.
2. Sum each column using colSums.
3. Transpose the result back to a data frame with the original number of rows.
You can do this in a vectorised way if you transform your original data to a matrix with 4 columns, then use rowSums on that, and then transform it back to match the rows of the original data frame. Here it is in one long command
df <- read.table(header = TRUE, text = "a b c d e f g h
3 2 3 1 1 2 3 1
1 6 1 8 5 1 7 1")
matrix(rowSums(matrix(as.vector(t(as.matrix(df))), ncol = 4, byrow = TRUE)),
       ncol = ncol(df) / 4, byrow = TRUE)
# [,1] [,2]
#[1,] 9 7
#[2,] 16 14
Edit: To preserve the row names, if e.g. rownames(df) <- c("r1", "r2"), just apply them to the resulting matrix (the row order is preserved), ie run rownames(result) <- rownames(df).
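If the data are in a data frame rather than a matrix, another possible idiom is to split the columns into groups of four and rowSums each group. A sketch, assuming df is the data.frame version of m; note the LETTERS labels only cover 26 groups, so the real 93 groups would need longer names:
# 0,0,0,0,1,1,1,1,... : one label per column, one group per four columns
groups <- (seq_len(ncol(df)) - 1) %/% 4
# split.default() splits a data frame column-wise; rowSums each piece
res <- sapply(split.default(df, groups), rowSums)
colnames(res) <- LETTERS[seq_len(ncol(res))]
res
#     A  B
# r1  9  7
# r2 16 14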

Give percentage by group in R

For a sample dataframe:
df1 <- structure(list(i.d = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), group = c(1L,
1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L), cat = c(0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, NA)), .Names = c("i.d", "group", "cat"), class = "data.frame", row.names = c(NA,
-9L))
I wish to add an additional column to my dataframe ("pc.cat") which records the percentage of 1s in column 'cat' BY the group ID variable.
For example, there are four values in group 1 (i.d's a, b, d and i). Value i is NA, so it can be ignored for now. Only one of the three remaining values is a 1, so the percentage would read 33.33 (to 2 dp). This value is then filled into column 'pc.cat' for every row in group 1 (even the NA rows). The process would then be repeated for the other groups (2 and 3).
If anyone could help me with the code for this I would greatly appreciate it.
This can be accomplished with the ave function:
df1$pc.cat <- ave(df1$cat, df1$group, FUN=function(x) 100*mean(na.omit(x)))
df1
# i.d group cat pc.cat
# 1 a 1 0 33.33333
# 2 b 1 0 33.33333
# 3 c 2 1 66.66667
# 4 d 1 1 33.33333
# 5 e 3 0 0.00000
# 6 f 3 0 0.00000
# 7 g 2 1 66.66667
# 8 h 2 0 66.66667
# 9 i 1 NA 33.33333
Or with data.table, summarising the proportion per group (this returns one row per group rather than adding a column):
library(data.table)
setDT(df1)
df1[!is.na(cat), mean(cat), by = group]
With data.table:
library(data.table)
DT <- data.table(df1)
# na.omit in the denominator as well, otherwise NAs deflate the percentage
DT[, list(100 * sum(na.omit(cat)) / length(na.omit(cat))), by = "group"]
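For completeness, a dplyr sketch of the same per-row calculation:
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(pc.cat = 100 * mean(cat, na.rm = TRUE)) %>%
  ungroup()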

In R, compare one column value to all other columns

I'm very new to R and I have a question which might be very simple for the experts here.
Let's say I have a table "sales", which includes 4 customer IDs (123-126) and 4 products (A, B, C, D):
ID A B C D
123 0 1 1 0
124 1 1 0 0
125 1 1 0 1
126 0 0 0 1
I want to calculate the overlaps between products. So for A, the number of IDs that have both A and B will be 2. Similarly, the overlap between A and C will be 0 and that between A and D will be 1. Here is my code for A and B overlap:
overlap <- sales[which(sales[, "A"] == 1 & sales[, "B"] == 1), ]
countAB <- count(overlap, "ID")
I want to repeat this calculation for all 4 products, so A overlaps with B, C, D and B overlaps with A, C, D, etc. How can I change the code to accomplish this?
I want the final output to be the number of IDs for each two-product combination. It's a product affinity exercise and I want to find out, for each product, which product sold the most with it. For example, for A, the most-sold products with it will be B, followed by D, then C. Some sorting needs to be added to the code to get to this, I think.
Thanks for your help!
#x1 is your dataframe
x1<-structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L,
1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID",
"A", "B", "C", "D"), class = "data.frame", row.names = c(NA,
-4L))
# get all pairwise combinations of the colnames but the first ("ID")
k1 <- combn(colnames(x1[, -1]), 2)
# create two lists a1 and a2 so that we can iterate over each element
a1 <- as.list(k1[seq(1, length(k1), 2)])
a2 <- as.list(k1[seq(2, length(k1), 2)])
# apply your own function with varying i and j
mapply(function(i, j) length(x1[which(x1[, i] == 1 & x1[, j] == 1), 1]), a1, a2)
[1] 2 0 1 1 1 0
Here's a possible solution:
sales <-
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
# get product names
prods <- colnames(sales)[-1]
# generate all product pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))
# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')
# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
# N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition)
# since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})
> res
P1 P2 count
1 A B 2
2 A C 0
3 A D 1
4 B C 1
5 B D 1
6 C D 0
You can use matrix multiplication:
library(reshape) # provides melt(); reshape2's melt would name the columns Var1/Var2 instead
m <- as.matrix(d[-1])
z <- melt(crossprod(m, m))
z[as.integer(z$X1) < as.integer(z$X2), ]
# X1 X2 value
# 5 A B 2
# 9 A C 0
# 10 B C 1
# 13 A D 1
# 14 B D 1
# 15 C D 0
where d is your data frame:
d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
[Update]
To calculate the product affinity, you can do:
z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
# X1 X2 value
# A A B 2
# B B A 2
# C C B 1
# D D A 1
You might want to take a look at the arules package. It does exactly what you are looking for.
Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
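For instance, a minimal sketch of the arules route, assuming d is the data frame defined above (crossTable() gives the pairwise co-occurrence counts directly):
library(arules)

# Coerce the 0/1 product columns (dropping the ID column) to transactions
trans <- as(as.matrix(d[-1]) > 0, "transactions")

# Pairwise co-occurrence counts between all products
crossTable(trans, measure = "count")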
