transform a dataframe of frequencies to a wider format - r

I have a dataframe that looks like this.
input dataframe
position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2
These data are nucleotide differences at a given position in a hypothetical genome. mean_freq is relative to the reference, so the first row means the proportion of C's is 0.002, implying the A's are at 0.998.
I want to transform this to a different structure by creating new columns such that,
desired_output
position,G,C,T,A,sampleid
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2
I have attempted this approach
per_position_full_nt_freq <- function(x){
df <- data.frame(A=0, C=0, G=0, T=0)
idx <- names(df) %in% x$alternative
df[,idx] <- x$mean_freq
idx2 <- names(df) %in% x$reference
df[,idx2] <- 1 - x$mean_freq
df$position <- x$position
df$sampleName <- x$sampleName
return(df)
}
desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)
I ran into an error
In matrix(value, n, p) :
data length [8905] is not a sub-multiple or multiple of the number of columns
Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr.
How do I conveniently transform the input dataframe to the desired output dataframe format?
Thank you.

One option would be to create a matrix of 0's with the 'G', 'C', 'T', 'A' column names, match the 'alternative' and 'reference' values against those column names, use the row/column index to assign the values, and then cbind with the original dataset's 'position' and 'sample_id' columns.
m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))] <- df1$mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))] <- 1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
# position G C T A sample_id
#1 1 0.00 0.002 0.00 0.998 name1
#2 2 0.96 0.000 0.04 0.000 name1
#3 3 0.00 0.030 0.00 0.970 name2

The following should do the trick:
library(readr)
library(dplyr)
library(tidyr)
input_df <- read_csv(
'position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2'
)
input_df %>%
mutate(ref_val = 1 - mean_freq) %>%
spread(alternative, mean_freq, fill=0) %>%
spread(reference, ref_val, fill=0) %>%
select( position, G, C, T, A, sample_id )
One assumption here is that the reference and alternative alleles are always distinct; otherwise you will get two columns with the same name but different values. You would need to handle that with a couple of commands at the beginning of your code if need be.
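On newer versions of tidyr (1.1+), spread is superseded by pivot_wider. A sketch of the equivalent pipeline, under the same assumption that reference and alternative are distinct:
input_df %>%
mutate(ref_val = 1 - mean_freq) %>%
pivot_wider(names_from = alternative, values_from = mean_freq, values_fill = 0) %>%
pivot_wider(names_from = reference, values_from = ref_val, values_fill = 0) %>%
select(position, G, C, T, A, sample_id)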

What is the most efficient way to replace a vector's values in a data.table's column with correlating values from another data.table?

Here's a scaled down sample of my problem. I have a data.table with a column of multiple IDs in vector form. These IDs all correspond to words in another data.table.
ID.table <- data.table(IDs = list(c(4, 5, 6), c(2, 3, 4)))
word.table <- data.table(ID = c(1, 2, 3, 4, 5, 6), word = c("This", "is", "a", "test", "sentence", "."))
which yields
IDs
1: 4,5,6
2: 2,3,4
and
ID word
1: 1 This
2: 2 is
3: 3 a
4: 4 test
5: 5 sentence
6: 6 .
I need to convert all the IDs in ID.table to the corresponding words in word.table, like in the following.
IDs
1: test,sentence,.
2: is,a,test
I know I can do this using a for loop and looping through every vector in ID.table, but my actual table has thousands of rows, which means it runs very slowly.
row <- 1
for(ID.row in ID.table[, IDs]){
word.row <- word.table[ID %in% ID.row]$word
ID.table[row] <- word.row
row <- row + 1
}
Is there a more efficient way to do this?
EDIT: I made a mistake by listing sequential IDs starting from 1 in word.table. ID.table and word.table would look something more like this.
IDs
1: 608,609,610
2: 606,607,608
and
ID word
1: 605 This
2: 606 is
3: 607 a
4: 608 test
5: 609 sentence
6: 610 .
where each row of ID.table is a vector of sequential numbers that do not start from 1, and the ID column of word.table contains ID numbers that are not always sequential and do not start from 1.
You can use match:
library(data.table)
ID.table[, IDs := lapply(IDs,function(x) word.table$word[match(x,word.table$ID)])]
ID.table
# IDs
#1: test,sentence,.
#2: is,a,test
If you are OK with using tidyverse functions, another option is to unnest the IDs and join with word.table.
library(dplyr)
ID.table %>%
mutate(row = row_number()) %>%
tidyr::unnest(IDs) %>%
left_join(word.table, by = c('IDs' = 'ID')) %>%
group_by(row) %>%
summarise(Ids = list(word)) %>%
select(-row)
We could build a named vector as a lookup table, loop over the list column 'IDs' to replace each ID, and assign (:=) the output back to 'IDs':
ID.table[, IDs := lapply(IDs, function(x)
setNames(word.table$word, word.table$ID)[as.character(x)])]
If the IDs are sequential starting from 1, it is even easier: use the IDs as a numeric index to pick the corresponding values from the 'word' column.
ID.table[, IDs := lapply(IDs, function(x) word.table$word[x])]
ID.table
# IDs
#1: test,sentence,.
#2: is,a,test
It may also be better to do this once without looping, by unlisting, replacing the values, and then relisting:
ID.table[, IDs := relist(word.table$word[unlist(IDs)], skeleton= IDs)]
NOTE: Both methods are simple, direct, and efficient.
Or using a compact tidyverse method
library(purrr)
library(dplyr)
ID.table %>%
mutate(IDs = map(IDs, ~ word.table$word[.x]))
# IDs
#1: test,sentence,.
#2: is,a,test
This doesn't change the original attribute structure of the data.table.
Benchmarks
On a slightly bigger dataset
ID.table1 <- ID.table[rep(seq_len(.N), 1e6)]
ID.table2 <- copy(ID.table1)
ID.table3 <- copy(ID.table1)
ID.table4 <- copy(ID.table1)
system.time(ID.table1[, IDs := lapply(IDs, function(x)
setNames(word.table$word, word.table$ID)[as.character(x)])])
#user system elapsed
# 29.971 0.492 30.264
system.time(ID.table2[, IDs := lapply(IDs, function(x) word.table$word[x])])
#user system elapsed
# 8.079 0.086 8.097
system.time(ID.table3[, IDs := relist(word.table$word[unlist(IDs)], skeleton= IDs)])
# user system elapsed
# 14.085 0.109 14.081
system.time(ID.table4 %>%
mutate(IDs = map(IDs, ~ word.table$word[.x])))
#user system elapsed
# 3.724 0.018 3.734
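One note on the first timing: setNames(word.table$word, word.table$ID) is rebuilt inside every lapply call, which dominates its cost. Building the lookup vector once outside the loop (a small sketch on a fresh copy) should narrow that gap considerably:
lookup <- setNames(word.table$word, word.table$ID)
ID.table5 <- ID.table[rep(seq_len(.N), 1e6)]
system.time(ID.table5[, IDs := lapply(IDs, function(x) lookup[as.character(x)])])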

Pull names of variables with a threshold of missing values

I am working with a data set containing 93 columns and many of them have a large percentage of missing values. I am looking for a way to streamline screening each column for the percentage of missing values and then return a list of names of those variables above that threshold to include in a new data set.
I have a function to check for missing values and return the percentage of missing:
#check for missing data
pMiss <- function(x) {
sum(is.na(x))/length(x)*100
}
#percent of data missing per column
x <- apply(dt2,2,pMiss)
How can I retrieve all the names [from x] of the columns where the percent of missing values is less than 20%? I would like to retrieve these names as a list that I can paste into a new data set, so I don't have to manually copy and paste each name from x.
Thank you in advance.
This'll work:
# example dataset
set.seed(123)
dat <- data.frame(a=sample(c(1,2,NA), size=20, replace=TRUE),
b=sample(c(1,2,NA), size=20, replace=TRUE),
c=sample(c(1:10,NA), size=20, replace=TRUE))
threshold <- .25 # for example
# get subset of colnames s.t. NA proportion is greater than threshold
names(dat)[sapply(dat, function(x) mean(is.na(x)) > threshold)]
## [1] "a" "b"
You can use the tidyverse approach:
require(tidyverse)
set.seed(123)
dat <- data.frame(a=sample(c(1,2,NA), size=20, replace=TRUE),
b=sample(c(1,2,NA), size=20, replace=TRUE),
c=sample(c(1:10,NA), size=20, replace=TRUE))
threshold <- .43
dat %>%
gather(var, value) %>%
group_by(var) %>%
summarise(prop.missing = sum(is.na(value)) / n()) %>%
filter(prop.missing < threshold)
var prop.missing
<chr> <dbl>
1 a 0.400
2 c 0.
df <- data.frame(a=c(NA,NA,1,1),b=c(NA,1,1,1),c=c(1,1,1,1))
x <- colMeans(is.na(df))
# a b c
# 0.50 0.25 0.00
x[x < .3]
# b c
# 0.25 0.00
names(x[x < .3])
# [1] "b" "c"
or all in one line:
names(df)[colMeans(is.na(df)) < .3]
# [1] "b" "c"

assign row name while rbind a row in a data frame

I want to assign a rowname (A_B) while I rbind a row into a new dataframe (d).
The row is the result of the ratio of two rows of another data frame (df).
df <- data.frame(ID = c("A", "B" ),replicate(3,sample(1:100,2,rep=TRUE)))
d <- data.frame()
d<-rbind(d, df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4])
Actual output
X1 X2 X3
1 0.08 0.14 0.66
Expected output. Instead of rowname 1 I want A_B as result of A_B ratio
X1 X2 X3
A_B 0.08 0.14 0.66
Maybe it's a simple solution, but have you tried giving it the row name directly? Something like this:
rbind(d, your_name = (df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4]))
For me it's working... Regards.
Updated the solution to address multiple rows.
You can get the desired row names with the following workaround:
df <- data.frame(ID = c("A", "B" ),replicate(3, sample(1:100,8,rep=TRUE)))
# This is where you control what you like to see as the row names
rownames(df) <- make.names( paste("NAME", df[ ,"ID"]) , unique = TRUE)
d <- data.frame()
rbind(d, df[df$ID == "A", 2:4]/df[df$ID == "B", 2:4], make.row.names = TRUE)
output
X1 X2 X3
NAME.A 0.8690476 1.1851852 2.40909091
NAME.A.1 1.8181818 0.8095238 1.01408451
NAME.A.2 0.8235294 5.4444444 2.50000000
NAME.A.3 1.4821429 1.8139535 0.05617978
You can also assign the row name after the rbind. Note that you cannot pipe into a replacement function such as rownames<-, so save the result first and then set its row names:
res <- rbind(d, df[df$ID == "A", 2:4]/df[df$ID == "B", 2:4])
rownames(res) <- "A_B"
Just make sure you set the name on the row you intended.
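If you would rather build the name from the data than hard-code it, a small sketch (assuming the numerator and denominator IDs are "A" and "B"):
res <- rbind(d, df[df$ID == "A", 2:4]/df[df$ID == "B", 2:4])
rownames(res) <- paste("A", "B", sep = "_")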

Looping by row across subset of columns

I have a data frame with column 1 being the gene and all other columns being gene expression data for that gene under different conditions. I want to go gene by gene and divide all the expression values by the median expression value for that gene. I have the medians in a data frame called s.med.df.
I’m trying to direct R to divide all the expression columns (2:n) but not the first column by the median value for each gene. I'm new to R, but the script I have so far is as follows:
Con1 <- c(5088.77, 274.62, 251.97, 122.21)
Con2 <- c(4382.59, 288.55, 208.12, 171.93)
Con3 <- c(4732.81, 417.43, 305.58, 132.93)
Solid.df <- data.frame(Gene = c("A", "B", "C", "D"), Con1=Con1, Con2=Con2, Con3=Con3)
Gene Con1 Con2 Con3
A 5088.77 4382.59 4732.81
B 274.62 288.55 417.43
C 251.97 208.12 305.58
D 122.21 171.93 132.93
n <- ncol(Solid.df)
genes = levels(s.med.df$Gene)
Solid.mt.df = Solid.df
for (i in 1:length(genes)) {
gene = genes[i]
Solid.mt.df[2:n][Solid.mt.df$Gene == gene] = Solid.mt.df[2:n][Solid.mt.df$Gene == gene] / s.med.df$Medians[i]
print(gene)
}
Thank you in advance
This can be achieved by direct division. Change s.med.df to a vector. See the following example.
d1 <- data.frame(ge=c("A", "B", "C"), e1=1:3, e2=7:9,
stringsAsFactors = FALSE)
m1 <- data.frame(md=4:6, stringsAsFactors = FALSE)
d1[,2:3]/unlist(m1)
# e1 e2
# 1 0.25 1.75
# 2 0.40 1.60
# 3 0.50 1.50
You can also bind the gene names back with the results.
cbind(d1[,1], d1[,2:3]/unlist(m1))
For anything to do with applying a function over columns or rows, you're looking for apply:
median_centered <- t(apply(Solid.df[, 2:ncol(Solid.df)], 1, function(x) x / median(x)))
genes2 <- cbind(Solid.df[, 1, drop = FALSE], median_centered)
This takes the data frame except for the first column, iterates over the first margin (rows), and applies x / median(x) to each row. Since R broadcasts scalar operations over vectors, you get the desired result, but transposed, so calling t() turns it back into the original orientation. Then we cbind it back with the gene names.
As #VenYao pointed out, you can use direct division if you turn your medians into a vector. It would help to show the structure of your s.med.df file.
This can be achieved using data.table pretty easily:
cbind your dataframes into a data.table:
library(data.table)
combined <- data.table(cbind(Solid.df, s.med.df))
combined[, med.con1 := Con1/median]
# assume median is the column in s.med.df that stores median values.
# then you can repeat that for all three conditions:
combined[, med.con2 := Con2/median]
combined[, med.con3 := Con3/median]
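If there are many conditions, a more compact variant (a sketch, assuming the medians column is literally named median and the condition columns are Con1 to Con3) loops over them with .SDcols:
cols <- c("Con1", "Con2", "Con3")
new_cols <- paste0("med.", tolower(cols))
combined[, (new_cols) := lapply(.SD, function(x) x/median), .SDcols = cols]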

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a list of the above values split using strsplit. I then loop through the entire table, getting the ith element of those three lists, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit to split each column at the relevant characters (specified via the split argument), and unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and we want to split only these three, we can specify .SDcols = 1:3 (i.e., the column index or the actual column names), assign (:=) the output to new columns, and subset only the columns needed in the output.
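For instance, a sketch of the assign-by-reference variant, using the column names from the example data below and keeping only the third MAF value as the question asks:
setDT(df1)[, c("ref", "alt") := tstrsplit(Alleles, ">", fixed = TRUE)]
df1[, c("chrom", "base") := tstrsplit(NCBI.Base, ":", fixed = TRUE, type.convert = TRUE)]
df1[, af := tstrsplit(MAFinPercent, "/", fixed = TRUE, type.convert = TRUE)[[3]]]
df1[, .(chrom, base, ref, alt, af)]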
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful with that last MAFinPercent.EA.AA.All.: it is a horrible column name, so you may have to rename or quote it depending on how exactly R has stored it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
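For instance, if the data were read with readr so the literal name survived, backticks let separate address it directly (a sketch with that hypothetical raw name):
total.esp %>% separate(`MAFinPercent(EA/AA/All)`, c("af1", "af2", "af3"), sep = "/")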
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply(fraction, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)
