gene ranking microarray - r

I wonder if anyone is familiar with the Bioconductor RankProduct package for ranking and obtaining differentially expressed genes. Some info about the software are as follows paper, manual, documentation.
I ran into some problems while using the program, maybe because of my little knowledge of R language. I tried to replicate the steps in the pdf files above with my own data. Although my own datasets were not in the afffy .cel files as in the examples, but only as rows and columns in a tab-delimited file. I have two conditions (1 and 2, replicate = 4 for each)
Here is my code:
library(RankProd)
library(preprocessCore)
#Read expression data
#gdata <- read.table(file="data2.txt", sep="\t", header=T) #9000 rows of genes X 8 columns of chips
gdata <- read.table(file="data2.txt", sep="\t", header=T, row.names=1) #9000 rows of genes X 8 columns of chips
#colnames(gdata)
# This vector contains the microarray sample names
SampleNames= names(data.frame(gdata[,-1]))
#names(datExpr)=gdata[,1]
# This vector contains the gene names
datExpr.gnames= gdata$GeneName
# Since the first column contains the gene names, exclude it.
# dataExp is then the matix required
datExpr=data.frame(gdata[,-1])
#convert data into matrix form
datExpr <- as.matrix(datExpr)
#data normalization - quantile normalization
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
#datExpr.norm <- normalize.quantiles(datExpr,copy=TRUE) #without logged data
#datExpr <- datExpr.norm
# Identify two class data - control/treatment (or condition 1/condition2)
nl <- 4
n2 <- 4
cl <- rep(c(0,1), c(nl, n2))
datExpr.cl <- cl
# data were generated under identical or very similar conditions except the
# factor of interest (e.g., control and treatment),
origin <- rep(1, nl + n2)
datExpr.origin <- origin
# Data anslysis
datExpr.sub <- datExpr[,which(datExpr.origin == 1)]
datExpr.cl.sub <- datExpr.cl[which(datExpr.origin == 1)]
datExpr.origin.sub <- datExpr.origin[which(datExpr.origin == 1)]
#Rank product analysis and output
#RP.out <- RP(datExpr.sub, datExpr.cl.sub, num.perm = 100, logged = TRUE,na.rm = FALSE, plot = FALSE, rand = 123)
RP.out <- RPadvance(datExpr.sub, datExpr.cl.sub, datExpr.origin.sub, num.perm = 100,logged = TRUE,
na.rm = FALSE, gene.names = datExpr.gnames, plot = FALSE,rand = 123)
# Output a table of the identified genes based on user-specified selection criteria
topGene(RP.out, cutoff = 0.05, method = "pfp", logged = TRUE,logbase = 2, gene.names = datExpr.gnames)
I did run the code, but my fold changes for differentially expressed genes in one condition VS the other were either 0's or infinities. I wonder if anyone with experience with this program can help me.

At a first glance what I note is that
#datExpr.log.norm <- normalize.quantiles((log2(datExpr)),copy=TRUE) #with logged data
datExpr <- datExpr.log.norm
Here as long as the first line is commented out datExpr will result empty.

Related

How to efficiently split each row into test and train subsets using R?

I have a data table that provides the length and composition of given vectors
for example:
set.seed(1)
dt = data.table(length = c(100, 150),
n_A = c(30, 30),
n_B = c(20, 100),
n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
sample_vec <- sample( c( rep("A", dt$n_A[i]), # create a randomised vector with the given nnumber of each component.
rep("B", dt$n_B[i]),
rep("C", dt$n_C[i]) ) )
sample_vec_80 <- sample_vec[1:floor(length(sample_vec)*0.8)] # subset 80% of the vector
dt_80_list[[i]] <- data.table( length = length(sample_vec_80), # count the number of each component in the subset and output to list
n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
n_C = length(sample_vec_80[which(sample_vec_80 == "C")])
)
dt_20_list[[i]] <- data.table( length = dt$length[i] - dt_80_list[[i]]$length, # subtract the number of each component in the 80% to identify the number in the 20%
n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
n_C = dt$n_C[i] - dt_80_list[[i]]$n_C
)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows (but only a few colums).)
Here's a version I came up with, with mainly three changes
use .N and by= to count the number of "A","B","C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for-loop
## draw training data
dt_80 <- dcast(
dt[,row:=1:nrow(dt)
][, .(draw=sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*length) )
, by=row
][,.N,
by=.(row,draw)],
row~draw,value.var="N")[,length80:=A80+B80+C80]
## draw test data
dt_20 <- dt[dt_80,
.(A20=n_A-A80,
B20=n_B-B80,
C20=n_C-C80),on="row"][,length20:=A20+B20+C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial first idea, I did not post this because the code above is much faster. But this one might be more memory-efficient which seems crucial in your case. So, even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[,row:=1:nrow(dt)]
## sampling function
sampfunc <- function(n_A,n_B,n_C){
draw <- sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*(n_A+n_B+n_C))
out <- Rfast::Table(draw)
return(as.list(out))
}
## draw training data
dt_80 <- dt[,sampfunc(n_A,n_B,n_C),by=row]

Neural Network model doesn't run

for some reason, my model is not running. I created a model matrix to run a simple model with the package neuralnet. I know it might be challenging to debug other people code especially without the data but in case you think you could assist me here is the code:
library(tidyverse)
library(neuralnet)
#Activity 1 Load Data
featchannels <-read.csv("features_channel.csv")
trainTargets <-read.table("traintargets.txt")
#Activity 2 Normalize every column of the features dataset using min-max
normalization to range [0-1].
normalized <- function(x) {
return((x-min(x)) /(max(x) -min(x)))
}
featchannels <- normalized(featchannels)
#Activity 3 Add a target feature named response to the features dataset
with 0-1 values read from trainTargets.txt, with 1 indicating P300
response and 0 otherwise.
colnames(trainTargets)[1] <- "State"
featchannels <- cbind(featchannels, trainTargets)
# Changing rows to P300 and others.
featchannels <- within(featchannels, State <- factor(State, labels =
c("Other", "P300")))
featchannels$State <- as.factor(featchannels$State)
#4. Take the first 3840 rows of the dataset as the training data set, and
the remaining 960 rows as the testing data set.
training <- featchannels[1:3840,]
testing <- featchannels[3841:4800,]
enter code here
#Activitry 6
#Creating model matrix before runing the model
df_comb_training <- training
y <- model.matrix(~ df_comb_training$State + 0, data = df_comb_training[,
c('State'), drop=FALSE])
# fix up names for as.formula
y_feats <- gsub("^[^ ]+\\$", "", colnames(y))
colnames(y) <- y_feats
df_comb_training <- df_comb_training[, !(colnames(df_comb_training) ==
"State")]
feats <- colnames(df_comb_training)
df_comb_training <- cbind(y, df_comb_training)
# Concatenate strings
f <- paste(feats, collapse=' + ')
y_f <- paste(y_feats, collapse=' + ')
f <- paste(y_f, '~', f)
# Convert to formula
f <- as.formula(f)
model_h5 <- neuralnet(f, df_comb_training, stepmax = 1e+08, hidden = 5)

R - How to repeat data frame manipulation 100x with new random numbers and plot removals

I am a new user to R and am trying to create multiple subsamples of a data frame. I have my data assigned to 4 stratum (STRATUM = 1, 2, 3, 4), and want to randomly keep only a specified number of rows in each stratum. To achieve this, I import my data, sort by the stratification value, then assign a random number to each row. I want to keep my original random number assignments since I need to use them again in future analyses, so I save a .csv with these values. Next, I subset the data by their stratum, and then specify the number of records that I want to retain in each stratum. Finally, I rejoin the data and save as a new .csv. The code works, however, I want to repeat this process 100 times. In each case I want to save the .csv with random numbers assigned, as well as the final .csv of randomly selected plots. I am unsure of how to get this block of code to repeat 100x, and also how to assign a unique file name for each iteration. Any help would be much appreciated.
DataFiles <- "//Documents/flownData_JR.csv"
PlotsFlown <- read.table (file = DataFiles, header = TRUE, sep = ",")
#Sort the data by the stratification
FlownStratSort <- PlotsFlown[order(PlotsFlown$STRATUM),]
#Create a new column with a random number (no duplicates)
FlownStratSort$RAND_NUM <- sample(137, size = nrow(FlownStratSort), replace = FALSE)
#Sort by the stratum, then random number
FLOWNRAND <- FlownStratSort[order(FlownStratSort$STRATUM,FlownStratSort$RAND_NUM),]
#Save a csv file with the random numbers
write.table(FLOWNRAND, file = "//Documents/RANDNUM1_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
#Subset the data by stratum
FLOWNRAND1 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='1'),]
FLOWNRAND2 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='2'),]
FLOWNRAND3 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='3'),]
FLOWNRAND4 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='4'),]
#Remove data from each stratum, specifying the number of records we want to retain
FLOWNRAND1 <- FLOWNRAND1[1:34, ]
FLOWNRAND2 <- FLOWNRAND2[1:21, ]
FLOWNRAND3 <- FLOWNRAND3[1:7, ]
FLOWNRAND4 <- FLOWNRAND4[1:7, ]
#Rejoin the data
FLOWNRAND_uneven <- rbind(FLOWNRAND1, FLOWNRAND2, FLOWNRAND3, FLOWNRAND4)
#Save the table with plots removed from each stratum flown in 2017
write.table(FLOWNRAND_uneven, file = "//Documents/Flown_RAND_uneven_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
Here's a data.table solution if you just need to know which rows are in each set.
library(data.table)
df <- data.table(dat = runif(100),
stratum = sample(1:4, 100, replace = T))
# Gets specified number randomly from each strata
get_strata <- function(df, n, i){
# Subset data frame to randomly chosen w/in strata
# replace stratum with var name
f <- df[df[, .I[sample(.N, n)], by = stratum]$V1]
# Save as CSV, replace path
write.csv(f, file = paste0("path/df_", i),
row.names = F, col.names = T)
}
for (i in 1:100){
# replace 10 with number needed
get_strata(df, 10, i)
}

How to apply a function that gets data from differents data frames and has conditions in it?

I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_costumers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables i´m using contains one column for ages and another one for its corresponding survival probabilities.
Output:
I want to create a separate dataframe with the same size as that of the term table. The function will take the data from the different mortality tables (depending on the gender) and then apply the function above (psup2) taking the ages from the table X and the terms from the matrix terms.
Up to now I managed to create a very inefficient way to do this...but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
P1 = 1
for (i in 1:term) {
P <- x[age + i, 2]
P1 <- P1*P
}
return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V2 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
for (j in 1:3) {
prob <- ifelse(table_customers[j, 3] == 1,
psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])),
psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])))
output[j, i] <- prob
}
}
your psup function can be simplified into:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually, we won't use it, we'll use the formula directly.
We'll put your male and female df next to each other, so we can use the value of the gender column to choose one or another.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df,1,function(x){sapply(1:3,function(i){prod(mf[x["age"]+(1:x[3+i]),x["gender"]+1])})}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame

Loop through (subsets) using jags

I have a big dataframe with 10000 rows and 12 columns (discountdataset).
The columns contain different variables. The first 210 rows represents subject 1 (there is also a column with "subject1"), the next 210 rows represent subject 2, and so on.
I want to use jags and a loop function to loop through all 52 subjects in the dataframe, and assign a function to each of them. My code looks like this:
#subsetting the dataframe by the variable subjectid
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
Here my plan is to loop and assign the following jags function to all subjects in the subset), but, it doesn't work. I think my mistake is that the variables "nt", "Choice" that I want to pass on to jags are not defined right, or, are not updated.
library(rjags)
for (i in 1:length(subsetdiscount))
{
nt <- nrow (subsetdiscount)
Choice <- subsetdiscount$choice
amountSS <- subsetdiscount$val_basic
amountLL <- subsetdiscount$val_d
delayDIFF <- subsetdiscount$delay
con <- subsetdiscount$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(
list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt", n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
Try:
library(rjags)
library(R2jags)
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
output_models <- lapply(subsetdiscount, function(x) {
nt <- nrow(x)
Choice <- x$choice
amountSS <- x$val_basic
amountLL <- x$val_d
delayDIFF <- x$delay
con <- x$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt",
n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
return(samples)
})
output_models should be a list containing outputs for each of the factors you split main dataset by.
Please note that it is quite hard to test this without any provided data. So, if this fails to work, you may want to provide some data for testing.
I hope it helps.

Resources