Looping by row across subset of columns - r

I have a data frame with column 1 being the gene and all other columns being gene expression data for that gene under different conditions. I want to go gene by gene and divide all the expression values by the median expression value for that gene. I have the medians in a data frame called s.med.df.
I’m trying to direct R to divide all the expression columns (2:n) but not the first column by the median value for each gene. I'm new to R, but the script I have so far is as follows:
Con1 <- c(5088.77, 274.62, 251.97, 122.21)
Con2 <- c(4382.59, 288.55, 208.12, 171.93)
Con3 <- c(4732.81, 417.43, 305.58, 132.93)
Solid.df <- data.frame(Gene = c("A", "B", "C", "D"), Con1=Con1, Con2=Con2, Con3=Con3)
Gene Con1 Con2 Con3
A 5088.77 4382.59 4732.81
B 274.62 288.55 417.43
C 251.97 208.12 305.58
D 122.21 171.93 132.93
n <- ncol(Solid.df)
genes = levels(s.med.df$Gene)
Solid.mt.df = Solid.df
for (i in 1:length(genes)) {
gene = genes[i]
Solid.mt.df[2:n][Solid.mt.df$Gene == gene] = Solid.mt.df[2:n][Solid.mt.df$Gene == gene] / s.med.df$Medians[i]
print(gene)
}
Thank you in advance

This can be achieved by direct division. Convert s.med.df to a vector first. See the following example.
d1 <- data.frame(ge=c("A", "B", "C"), e1=1:3, e2=7:9,
stringsAsFactors = FALSE)
m1 <- data.frame(md=4:6, stringsAsFactors = FALSE)
d1[,2:3]/unlist(m1)
# e1 e2
# 1 0.25 1.75
# 2 0.40 1.60
# 3 0.50 1.50
You can also bind the gene names to the results.
cbind(d1[,1], d1[,2:3]/unlist(m1))
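A minimal sketch of the same idea applied to the data from the question. The layout of s.med.df (a Gene column plus a Medians column) is an assumption, since the question does not show it; match() is used only so that the medians line up with the rows even if the two data frames are ordered differently.

```r
# Example data from the question
Solid.df <- data.frame(
  Gene = c("A", "B", "C", "D"),
  Con1 = c(5088.77, 274.62, 251.97, 122.21),
  Con2 = c(4382.59, 288.55, 208.12, 171.93),
  Con3 = c(4732.81, 417.43, 305.58, 132.93)
)

# ASSUMED layout of s.med.df: one row per gene with a Medians column
s.med.df <- data.frame(
  Gene    = Solid.df$Gene,
  Medians = apply(Solid.df[, -1], 1, median)
)

# Align the medians to the row order of Solid.df
med <- s.med.df$Medians[match(Solid.df$Gene, s.med.df$Gene)]

# Dividing a data frame by a vector of length nrow() recycles the vector
# down each column, so row i is divided by med[i]
Solid.mt.df <- Solid.df
Solid.mt.df[, -1] <- Solid.df[, -1] / med
```

After this, every gene's expression values have a median of 1, and the Gene column is untouched.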

For anything to do with applying a function over columns or rows, you're looking for apply:
median_centered <- t(apply(genes[,2:length(genes)], 1, function(x) x / median(x)))
genes2 <- cbind(genes[,1], median_centered)
This takes the data frame except for the first column, iterates over the 1st axis (rows), and applies x / median(x) to those rows. Since R broadcasts scalar operations to vectors, you'll get the desired result, but it will be transposed, so calling t() on it turns it back into the original format. Then we can cbind it back with the gene names.
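As a possible alternative, base R's sweep() divides each row by a supplied statistic without the transpose step. The sketch below uses a small made-up data frame named genes (the values are illustrative only) and checks that both approaches agree.

```r
# Small illustrative data frame; the values are made up
genes <- data.frame(
  Gene = c("A", "B", "C"),
  e1 = c(1, 2, 3),
  e2 = c(7, 8, 9),
  e3 = c(4, 10, 6)
)

expr <- as.matrix(genes[, -1])

# Row medians, then divide every row by its own median
row_medians <- apply(expr, 1, median)
centered_sweep <- sweep(expr, 1, row_medians, "/")

# The apply() version from the answer, for comparison
centered_apply <- t(apply(expr, 1, function(x) x / median(x)))

# Reattach the gene names
genes2 <- data.frame(Gene = genes$Gene, centered_sweep)
```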

As @VenYao pointed out, you can use direct division if you turn your medians into a vector. It would be helpful to show the structure of your s.med.df.
This can also be done quite easily with data.table.
First, cbind your data frames into a data.table:
library(data.table)
combined <- data.table(cbind(Solid.df, s.med.df))
combined[, med.con1 := Con1/median]
# assume median is the column in s.med.df that stores median values.
# then you can repeat that for all three conditions:
combined[, med.con2 := Con2/median]
combined[, med.con3 := Con3/median]

Related

How would I run a t test on 58 (variables) columns to compare 2 different data frames

I have 58 columns in each data frame that I would like to compare to see if there is a significant difference between them (individually and as a whole) as each of the 58 comprise a water basin and would be a sum of the whole, but still individually represent different things. I am not sure how to run a t.test on this. I am really new to coding and to R
Here is a way of conducting t-tests on all columns of two data.frames using an lapply loop. Each test returns a list of class "htest", and the sapply calls extract the list members of interest.
tests_list <- lapply(seq_along(df1), function(i){
t.test(df1[[i]], df2[[i]])
})
sapply(tests_list, '[[', 'statistic')
sapply(tests_list, '[[', 'p.value')
sapply(tests_list, '[[', 'conf.int')
Test data
set.seed(2021)
n <- 20
df1 <- matrix(rnorm(n*4), ncol = 4)
df2 <- matrix(rnorm(n*4), ncol = 4)
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)
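With 58 columns it may be convenient to collect the results in one table. The sketch below reuses the test data above; the Benjamini-Hochberg adjustment via p.adjust() is an addition that the answer does not mention, included only as one common way to account for running many tests at once.

```r
set.seed(2021)
n <- 20
df1 <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
df2 <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))

# One Welch t-test per pair of columns
tests_list <- lapply(seq_along(df1), function(i) {
  t.test(df1[[i]], df2[[i]])
})

# Gather the statistic and p-value of each test into a data frame
results <- data.frame(
  column    = names(df1),
  statistic = sapply(tests_list, function(tt) unname(tt$statistic)),
  p.value   = sapply(tests_list, "[[", "p.value")
)

# Optional (not part of the answer): adjust for multiple comparisons
results$p.adjusted <- p.adjust(results$p.value, method = "BH")
results
```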
In the simplest case, you would loop through each column and run a t-test for each one. One such test is shown below.
# Dataframe 1: Col 1: It has 100 values, mean = 1, SD = 1
df_1_col_1 = rnorm(100, 1, 1)
# Dataframe 2: Col 1: It has 75 values, mean = 2, SD = 1
df_2_col_1 = rnorm(75, 2, 1)
# Null hypothesis: difference between x and y is = 0
t.test(df_1_col_1, df_2_col_1)
# P-value < 0.05 you reject the null hypothesis.
Or, you can aggregate the 58 columns row-wise to get one value per row, for example by taking the mean of the 58 column values. This gives you one vector of values for each data frame (like df_1_col_1 and df_2_col_1 in the code above), which you can compare with a single t-test. If you don't like a simple mean, you can run PCA on your data frames and use the first principal component from each for the t-test.
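A rough sketch of the row-wise aggregation, using simulated data in place of the real basins (the sample sizes and means are made up for illustration):

```r
set.seed(1)
# Simulated stand-ins for the two data frames, 58 columns each
df1 <- as.data.frame(matrix(rnorm(30 * 58, mean = 1), ncol = 58))
df2 <- as.data.frame(matrix(rnorm(25 * 58, mean = 2), ncol = 58))

# Collapse the 58 columns into one value per row
row_means_1 <- rowMeans(df1)
row_means_2 <- rowMeans(df2)

# A single t-test on the aggregated values
overall <- t.test(row_means_1, row_means_2)
overall$p.value
```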

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is gonna be of dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
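# function to keep the lowest 10 and highest 10 values, setting the rest to NA
lowHigh <- function(x)
{
  # rank() gives the position of each value in sorted order
  # (order() returns a sorting permutation, which is not what we need here)
  r <- rank(x, ties.method = "first")
  test <- x
  test[!(r <= 10 | r > (length(x) - 10))] <- NA
  test
}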
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your row-name requirement conflicts from column to column: this matrix has only 20 rows, and the same 20 row names cannot apply to all 10 columns. Instead, here is the matrix of row indices:
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
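If the ids live in the row names, one way to carry them along is to look them up with the same order() indices, column by column. This is only a sketch: the row names below are hypothetical, and the result is a list with one small data frame per column rather than a single matrix.

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(10 * 1000, 1, .5), ncol = 10))
rownames(df) <- paste0("id", seq_len(nrow(df)))  # hypothetical ids

# For each column: indices of the 10 highest and 10 lowest values,
# then the matching row names and values
extremes <- lapply(df, function(k) {
  idx <- order(k, decreasing = TRUE)[c(1:10, (length(k) - 9):length(k))]
  data.frame(id = rownames(df)[idx], value = k[idx])
})

head(extremes[[1]])
```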
I offer a couple of answers to this.
A base R implementation ( I have used %>% to make it easier to read)
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
unlist %>% unique %>% sort
df[ix,]
This abuses the fact that data frames are lists, finds the row id satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df
An alternative using dplyr (since you mentioned it), which if I remember correctly doesn't particularly like row names
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
gather("col",value,-id) %>%
group_by(col) %>%
filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
ungroup %>%
select(id) %>%
left_join(df)
Edited: To fix code alignment and make a neater filter
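For comparison, here is a base R sketch without the pipe that also blanks out the non-extreme values, which may be closer to what the question asks for. The row names are hypothetical ids, and rank() with ties.method = "first" is an assumption about how ties should be handled.

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(10 * 1000, 1, .5), ncol = 10))
rownames(df) <- paste0("id", seq_len(nrow(df)))  # hypothetical ids

# Logical matrix: TRUE where a value is among the 10 lowest
# or 10 highest of its column
keep <- sapply(df, function(x) {
  r <- rank(x, ties.method = "first")
  r <= 10 | r > length(x) - 10
})

# Blank out everything else, then drop rows with nothing left
masked <- df
masked[!keep] <- NA
result <- masked[rowSums(keep) > 0, ]

dim(result)
```

Each column of result holds exactly 20 non-missing values, and the row names of df are preserved.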
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't set the values outside the top and bottom 10 of each column to 0 / NA. But it's unclear what your question is trying to do.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira= {
test2 <- apply(df[,2:11], 2, lowHigh);
rownames(test2) <- rownames(df);
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N);
l2 <- unique(do.call("c", l1));
df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
  ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525    50   b
 alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280    50  a

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a list of the above values split using strsplit. I then loop through the entire database, getting the ith value from those three lists, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and you want to split only those 3, specify .SDcols = 1:3 (i.e. the column indices, or the actual column names), assign (:=) the output to new columns, and subset the columns that are needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name, so you may have to rename or quote it depending on how exactly R has stored it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from @akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
       expr      min       lq     mean   median       uq      max neval
      tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703   100
 data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345   100
       base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544   100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)
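A sketch of the same base R idea that keeps the existing columns and adds the new ones in whole-vector assignments. The original loop is slow largely because each total.esp$col[i] <- ... assignment modifies the data frame one element at a time. The sample rows and the simplified column name MAFinPercent are taken from the other answers' test data; pick() is a hypothetical helper.

```r
total.esp <- data.frame(
  Alleles      = c("C>G", "A>C", "A>G"),
  NCBI.Base    = c("10:130448", "2:40483", "5:30821291"),
  MAFinPercent = c(".02/.03/.20", ".05/.03/.04", ".02/.04/.03"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: i-th piece of every split, always a character vector
pick <- function(parts, i) vapply(parts, function(p) p[i], character(1))

pos      <- strsplit(total.esp$NCBI.Base, ":", fixed = TRUE)
change   <- strsplit(total.esp$Alleles, ">", fixed = TRUE)
fraction <- strsplit(total.esp$MAFinPercent, "/", fixed = TRUE)

# One vectorised assignment per new column, no row-by-row loop
total.esp$chrom <- pick(pos, 1)               # kept as character (e.g. "X")
total.esp$base  <- as.integer(pick(pos, 2))
total.esp$ref   <- pick(change, 1)
total.esp$alt   <- pick(change, 2)
total.esp$af    <- as.numeric(pick(fraction, 3))
```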

How can we rank rows of a matrix based on the mean of each column?

I want to rank each row of my data based on the mean of each column
Here you can find an example data
https://gist.github.com/anonymous/2c69
I calculate the mean of each row and each column with
C <- colMeans(data, na.rm = FALSE, dims = 1)
R <- rowMeans(data, na.rm = FALSE, dims = 1)
Then I divide each row mean by each column mean and somehow rank them. Is there any idea?
After reading the dataset (read.table('Nemo.txt', ...)), remove the first (character) column (data2 <- data[,-1]). Then compute the row means and column means, expand each to the full dimensions of the data (e.g. rowMeans(...)[row(data2)]), and divide them to create a matrix "m1". If we need the ranks of the rows within each column of "m1", use mutate_each from dplyr.
data <- read.table('Nemo.txt', header=TRUE, stringsAsFactors=FALSE)
data2 <- data[,-1]
m1 <- rowMeans(data2, na.rm=FALSE, dims=1)[row(data2)]/colMeans(data2,
na.rm=FALSE, dims=1)[col(data2)]
dim(m1) <- dim(data2)
library(dplyr)
d1 <- as.data.frame(m1)
d1 %>%
mutate_each(funs(rank(., ties.method='min')))
But, suppose we need to get the aggregate rank of each row, (not sure if this is what you want), perhaps we can get the row means of "m1" and rank it.
rnk <- rank(rowMeans(m1))
head(rnk)
#[1] 1234 1557 1052 1176 575 290
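Then order the original data by rnk as follows (order(rnk) gives the row indices sorted by rank, whereas indexing with rnk itself would not sort the rows):
rankeddata <- data[order(rnk),]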
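A tiny illustration of the difference between rank() and order() when sorting rows, using made-up scores:

```r
scores <- c(a = 0.9, b = 0.1, c = 0.5)
dat <- data.frame(name = names(scores), score = unname(scores))

rnk <- rank(scores)   # position of each row in sorted order
rnk

# order(rnk) lists the rows from lowest to highest rank,
# so this sorts the data frame by score
sorted <- dat[order(rnk), ]
sorted

# Indexing with rnk directly only permutes the rows; it does not sort them
permuted <- dat[rnk, ]
permuted
```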

Function for pasting corrected values inside existing dataframe

Does something like the 'paste_over' function below already exist within base R or one of the standard R packages?
paste_over <- function(original, corrected, key){
corrected <- corrected[order(corrected[[key]]),]
output <- original
output[
original[[key]] %in% corrected[[key]],
names(corrected)
] <- corrected
return(output)
}
An example:
D1 <- data.frame(
k = 1:5,
A = runif(5),
B = runif(5),
C = runif(5),
D = runif(5),
E = runif(5)
)
D2 <- data.frame(
k=c(4,1,3),
D=runif(3),
E=runif(3),
A=runif(3)
)
D2 <- D2[order(D2$k),]
D3 <- D1
D3[
D1$k %in% D2$k,
names(D2)
] <- D2
D4 <- paste_over(D1, D2, "k")
all(D4==D3)
In the example D2 contains some values that I want to paste over corresponding cells within D1. However D2 is not in the same order and does not have the same dimension as D1.
The motivation for this is that I was given a very large dataset, reported some errors within it, and received a subset of the original dataset with some corrected values. I would like to be able to 'paste over' the new, corrected values into the old dataset without changing the old dataset in terms of structure. (As the rest of the code I've written assumes the old dataset's structure.)
Although the paste_over function seems to work I can't help but think this must have been tackled before, and so maybe there's already a well known function that's both faster and has error checking. If there is then please let me know what it is.
Thanks.
We can accomplish this using data.table as follows:
library(data.table)
setkeyv(setDT(D1), "k")
cols = c("D", "E", "A")
D1[D2, (cols) := D2[, cols]]
setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table.
setkeyv() (like setkey()) sorts the data.table by the column specified (here k) and marks that column as sorted (by setting the attribute sorted) by reference. This allows us to perform joins using binary search.
x[i] in data.table performs a join. Briefly, for each row of column k in D2, it finds the matching row indices in D1 by matching on D1's key column (here k).
x[i, LHS := RHS] performs the join to find matching rows, and the LHS := RHS part adds/updates x with the columns specified in LHS with the values specified in RHS by reference. LHS should be a vector of column names or numbers, and RHS should be a list of values.
So, D1[D2, (cols) := D2[, cols]] finds matching rows in D1 for k=c(1,3,4) from D2 and updates the columns D,E,A specified in cols by the list (a data.frame is also a list) of corresponding columns from D2 on RHS.
D1 will now be modified in-place.
HTH
You could use the replacement method for data frames in your function, perhaps like this. It does adequate checking for you. I chose to pass the logical row subset as an argument, but you can change that.
pasteOver <- function(original, corrected, key) {
"[<-.data.frame"(original, key, names(corrected), corrected)
}
(p1 <- pasteOver(D1, D2, D1$k %in% D2$k))
k A B C D E
1 1 0.18827167 0.006275082 0.3754535 0.8690591 0.73774065
2 2 0.54335829 0.122160101 0.6213813 0.9931259 0.38941407
3 3 0.62946977 0.323090601 0.4464805 0.5069766 0.41443988
4 4 0.66155954 0.201218532 0.1345516 0.2990733 0.05296677
5 5 0.09400961 0.087096652 0.2327039 0.7268058 0.63687025
p2 <- paste_over(D1, D2, "k")
identical(p1, p2)
# [1] TRUE
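Both versions above rely on the corrected rows being in the same key order as the original. A possible variant, sketched here under the assumption that the key is unique in the original data, uses match() to locate the target rows so that no sorting is needed. The name paste_over_match is hypothetical.

```r
# Hypothetical variant: look up target rows by key instead of relying on order
paste_over_match <- function(original, corrected, key) {
  idx <- match(corrected[[key]], original[[key]])
  if (anyNA(idx)) stop("some keys in 'corrected' are missing from 'original'")
  cols <- setdiff(names(corrected), key)
  # Columns are assigned by position, so use the same 'cols' on both sides
  original[idx, cols] <- corrected[, cols, drop = FALSE]
  original
}

set.seed(1)
D1 <- data.frame(k = 1:5, A = runif(5), B = runif(5), C = runif(5))
D2 <- data.frame(k = c(4, 1, 3), C = runif(3), A = runif(3))  # unsorted on purpose

D4 <- paste_over_match(D1, D2, "k")
```

Rows 4, 1 and 3 of D4 now carry the values of A and C from D2, while column B and the remaining rows are unchanged.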
