Problems following a code example. InformationValue::Woe - r

I'm learning new feature selection methods from this blog post:
https://www.machinelearningplus.com/machine-learning/feature-selection/
I'm following point 9 and have run into some problems. The first was the CV, which I have solved:
library(InformationValue)
adult <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T,stringsAsFactors = FALSE)
colnames(adult) <- c('age', 'WORKCLASS', 'fnlwgt', 'EDUCATION',
'educatoin_num', 'MARITALSTATUS', 'OCCUPATION', 'RELATIONSHIP', 'RACE', 'SEX',
'capital_gain', 'capital_loss', 'hours_per_week', 'NATIVECOUNTRY', 'ABOVE50K')
inputData <- adult
print(head(inputData))
But I can't get the next chunk to work:
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K)
df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
I keep getting 0 for every IV and, of course, "Not predictive" in the STRENGTH column of the data frame.
I've tried adding
factor_vars=cat_vars
but it doesn't seem to work, and quite frankly I can't figure out why.

Just solved it. In the first place, the stringsAsFactors = FALSE argument is unnecessary, since we need factors.
Then, after consulting the IV function and looking at the summary of the dataset, I noticed that although the response is a factor, the function requires a numeric input and cannot extract the "value" (level) from the factor itself. So we must work around it.
as.numeric(inputData$ABOVE50K)
"Solves it", although maybe I should change the values, since it gives 1-2 instead of the classic 0-1 response. I'm working on it.
I think there has to be an easier solution, but:
levels(inputData$ABOVE50K)
inputData$ABOVE50K2 = as.numeric(inputData$ABOVE50K)
inputData$ABOVE50K3= ifelse(inputData$ABOVE50K2 ==1,0, ifelse(inputData$ABOVE50K2==2,1,NA))
inputData$ABOVE50K3 <- factor(inputData$ABOVE50K3)
And the output is the same. So there is no need to change the levels to 0-1.
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
factor_vars= cat_vars
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3)
df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
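On the remark above that there ought to be an easier way to get a 0/1 response: one possibility is to compare the factor against its positive level directly. This is only a sketch on a made-up toy data frame (toy and ABOVE50K_num are hypothetical names), and it assumes the levels are "<=50K" and ">50K", as in the UCI file once whitespace is stripped.

```r
# Toy stand-in for the adult data; levels are set explicitly so the
# internal codes are predictable
toy <- data.frame(
  ABOVE50K = factor(c("<=50K", ">50K", "<=50K", ">50K", ">50K"),
                    levels = c("<=50K", ">50K"))
)

# as.numeric() on a factor returns the internal level codes (1, 2, ...),
# which is why the two-step recoding above was needed
codes <- as.numeric(toy$ABOVE50K)

# Comparing against the positive level gives a 0/1 response in one step
toy$ABOVE50K_num <- as.integer(toy$ABOVE50K == ">50K")
```

The resulting ABOVE50K_num column could then be passed as Y in the loop above in place of ABOVE50K3.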

Related

R for loop going wrong when applied to function

I am trying to work on a for loop to make running a function I've developed more efficient.
However, when I put it in a for loop, it is overwriting columns that it should not be and returning incorrect results.
Edit: The error is that, in the resulting data frame MiSeq_Bord_Outliers_table0, the columns labelled Outlier_type contain incorrect values.
As per the Outlier_Hunter function, when Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
respective Q1 Thresholds, their respective Outlier_type columns should read "Lower_Outlier"; if between the Q1 and Q3 Thresholds, "Normal"; and if above the Q3 Threshold, "Upper_Outlier". But when the for loop is executed, only "Upper_Outlier" is shown in the Outlier_type columns.
Edit: The inputs have been simplified and tested on a different computer with a clean console. Any artifacts that were there before should now be eliminated, and there should be no errors. It is important to run the outlier_results_1var part first. If you test run this code and get errors, please let me know which part failed.
Edit: MiSeq_Bord_Outliers_table0_error is the error that is being reproduced. This is the error result, not an input.
Can someone please tell me why it is returning these incorrect results and what I can do to fix it? I will post the relevant code below. Or is there another way to do this without a for loop?
#libraries used
library(tidyverse)
library(datapasta)
library(data.table)
library(janitor)
library(ggpubr)
library(labeling)
#2.) Outlier_Hunter Function
#Function to Generate the Outlier table
#Outlier Hunter function takes 4 arguments: the dataset, column/variable of interest,
#Q1 and Q3. Q1 and Q3 are stored in the results of Quartile_Hunter.
#Input ex: MiSeq_Bord_final_report0, Avg_Trim_Cov, MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1]
#Usage ex: Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
#MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1])
#Here is the Function to get the Outlier Table
Outlier_Hunter <- function(Platform_Genus_final_report0, my_col, Q1, Q3) {
#set up and generalize the variable name you want to work with
varname <- enquo(my_col)
#print(varname) #just to see what variable the function is working with
#get the outliers
Platform_Genus_Variable_Outliers <- Platform_Genus_final_report0 %>%
select(ReadID, Platform, Genus, !!varname) %>%
#Tell if it is an outlier, and if so, what kind of outlier
mutate(
Q1_Threshold = Q1,
Q3_Threshold = Q3,
Outlier_type =
case_when(
!!varname < Q1_Threshold ~ "Lower_Outlier",
!!varname >= Q1_Threshold & !!varname <= Q3_Threshold ~ "Normal",
!!varname > Q3_Threshold ~ "Upper_Outlier"
)
)
}
#MiSeq_Bord_Quartiles entries
MiSeq_Bord_Quartiles <- data.frame(
stringsAsFactors = FALSE,
row.names = c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed"),
Platform = c("MiSeq", "MiSeq"),
Genus = c("Bord", "Bord"),
Min = c(0.03, 295),
First_Quartile = c(80.08, 687613.25),
Median = c(97.085, 818806.5),
Third_Quartile = c(121.5625, 988173.75),
Max = c(327.76, 2836438)
)
#Remove the hashtag below to test if what you have is correct
#datapasta::df_paste(head(MiSeq_Bord_Quartiles, 5))
#dataset entry
MiSeq_Bord_final_report0 <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001","A050_20210122_S6_L001",
"A073_20210122_S7_L001",
"A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq","MiSeq",
"MiSeq","MiSeq","MiSeq"),
Genus = c("Bordetella",
"Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Raw_Read_bp = c(232.85,241.09,
248.54,246.99,248.35),
Avg_Trimmed_Read_bp = c(204.32,232.6,
238.56,242.54,244.91),
Avg_Trim_Cov = c(72.04,101.05,
92.81,41.77,54.83),
Genome_Size_Mb = c(4.1, 4.1, 4.1, 4.1, 4.1),
S1_Input_reads = c(1450010L,
1786206L,1601542L,710792L,925462L),
S1_Contaminant_reads = c(12220L,6974L,
7606L,1076L,1782L),
S1_Total_reads_removed = c(12220L,6974L,
7606L,1076L,1782L),
S1_Result_reads = c(1437790L,
1779232L,1593936L,709716L,923680L),
S2_Read_Pairs_Written = c(712776L,882301L,
790675L,352508L,459215L),
S2_Total_Read_Pairs_Processed = c(718895L,889616L,
796968L,354858L,461840L)
)
MiSeq_Bord_final_report0
#Execution for 1 variable
outlier_results_1var <- Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
MiSeq_Bord_Quartiles$First_Quartile[1], MiSeq_Bord_Quartiles$Third_Quartile[1])
#Now do it with a for loop
col_var_outliers <- row.names(MiSeq_Bord_Quartiles)
#col_var_outliers <- c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed")
#change the line above to change which variables are fed into the Outlier_Hunter function
outlier_list_MiSeq_Bord <- list()
for (y in col_var_outliers) {
outlier_results0 <- Outlier_Hunter(MiSeq_Bord_final_report0, y, MiSeq_Bord_Quartiles[y, "First_Quartile"], MiSeq_Bord_Quartiles[y, "Third_Quartile"])
outlier_results1 <- outlier_results0
colnames(outlier_results1)[5:7] <- paste0(y, "_", colnames(outlier_results1[, c(5:7)]), sep = "")
outlier_list_MiSeq_Bord[[y]] <- outlier_results1
}
MiSeq_Bord_Outliers_table0 <- reduce(outlier_list_MiSeq_Bord, left_join, by = c("ReadID", "Platform", "Genus"))
#the columns labelled Outlier_type are where the code goes wrong.
#When Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their
#respective Q1 Thresholds their respective Outlier_type columns should read
#"Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3
#Threshold then "Upper_Outlier". But when the for loop is executed, only
#"Upper_Outlier" is shown in the Outlier_type columns.
datapasta::df_paste(head(MiSeq_Bord_Outliers_table0, 5))
MiSeq_Bord_Outliers_table0_error <- data.frame(
stringsAsFactors = FALSE,
ReadID = c("A005_20160223_S11_L001",
"A050_20210122_S6_L001",
"A073_20210122_S7_L001","A076_20210426_S11_L001",
"A080_20210426_S12_L001"),
Platform = c("MiSeq",
"MiSeq","MiSeq","MiSeq",
"MiSeq"),
Genus = c("Bordetella","Bordetella","Bordetella",
"Bordetella","Bordetella"),
Avg_Trim_Cov = c(72.04,
101.05,92.81,41.77,54.83),
Avg_Trim_Cov_Q1_Threshold = c(80.08,
80.08,80.08,80.08,80.08),
Avg_Trim_Cov_Q3_Threshold = c(121.5625,
121.5625,121.5625,121.5625,
121.5625),
Avg_Trim_Cov_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier"),
S2_Total_Read_Pairs_Processed = c(718895L,
889616L,796968L,354858L,
461840L),
S2_Total_Read_Pairs_Processed_Q1_Threshold = c(687613.25,
687613.25,687613.25,
687613.25,687613.25),
S2_Total_Read_Pairs_Processed_Q3_Threshold = c(988173.75,
988173.75,988173.75,
988173.75,988173.75),
S2_Total_Read_Pairs_Processed_Outlier_type = c("Upper_Outlier","Upper_Outlier",
"Upper_Outlier","Upper_Outlier",
"Upper_Outlier")
)
For use in a loop like yours, it would be more useful to write your Outlier_Hunter() function to take the target column as a character string rather than an expression. In the loop, enquo(my_col) captures the symbol y, which evaluates to a character string, so the comparisons in case_when() are made against the column name rather than the column's values.
To do that, try replacing all instances of !!varname in your function with .data[[my_col]], and remove the enquo() line altogether.
Note that with these changes, you also need to change how you call the function when you don't have the column name in a variable. For example, your single execution would become:
Outlier_Hunter(
MiSeq_Bord_final_report0,
"Avg_Trim_Cov",
MiSeq_Bord_Quartiles$First_Quartile[1],
MiSeq_Bord_Quartiles$Third_Quartile[1]
)
For more info about programming with tidy evaluation functions, you may find this rlang vignette useful.
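Put together, the change described above might look like the following. This is a minimal sketch rather than a drop-in replacement: Outlier_Hunter_chr and the three-row toy data frame are hypothetical names and values, and all_of() is used in select() on the assumption that a reasonably recent dplyr is installed.

```r
library(dplyr)

# Hypothetical string-based variant: my_col is a column name, e.g. "Avg_Trim_Cov"
Outlier_Hunter_chr <- function(Platform_Genus_final_report0, my_col, Q1, Q3) {
  Platform_Genus_final_report0 %>%
    select(ReadID, Platform, Genus, all_of(my_col)) %>%
    mutate(
      Q1_Threshold = Q1,
      Q3_Threshold = Q3,
      Outlier_type = case_when(
        .data[[my_col]] < Q1_Threshold ~ "Lower_Outlier",
        .data[[my_col]] >= Q1_Threshold & .data[[my_col]] <= Q3_Threshold ~ "Normal",
        .data[[my_col]] > Q3_Threshold ~ "Upper_Outlier"
      )
    )
}

# Made-up data: one value below Q1, one between Q1 and Q3, one above Q3
toy <- data.frame(
  ReadID = c("r1", "r2", "r3"),
  Platform = "MiSeq",
  Genus = "Bordetella",
  Avg_Trim_Cov = c(72.04, 101.05, 130),
  stringsAsFactors = FALSE
)

# The column name is held in a variable, as it would be inside the for loop
y <- "Avg_Trim_Cov"
res <- Outlier_Hunter_chr(toy, y, 80.08, 121.5625)
res$Outlier_type
```

Because my_col is now an ordinary string, the existing for loop over col_var_outliers can pass y unchanged.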

Subsetting a data set and plotting means

I have a data set with Year, Site, and Species Count. In some years the counts were done twice. For those years I need to find the mean count at each site for each species (there are two species) and plot those means. This is the code I have written:
DataSet1 <- subset(channel_islands,
channel_islands$SpeciesName=="Hypsypops ubicundus, adult" |
channel_islands$SpeciesName=="Paralabrax clathratus,adult")
years<-unique(DataSet1$Year)
Hypsypops_mean <- NULL
Paralabrax_mean <- NULL
Mean <- NULL
years <- unique(DataSet1$Year)
for(i in 1:length(years)){
data_year <- DataSet1[which(DataSet1$Year == years[i]), ]
Hypsypops<-data_year[which(data_year$SpeciesName=="Hypsypops rubicundus,adult"), ]
Paralabrax<-data_year[which(data_year$SpeciesName=="Paralabrax clathratus,adult"), ]
UNIQUESITE<-unique(unique(data_year$Site))
for(m in 1:(length(UNIQUESITE))){
zz<-Hypsypops[Hypsypops$Site==m,]
if(length(zz$Site)>=2){
Meanp <- mean(Hypsypops$Count[Hypsypops$Site==UNIQUESITE[m]])
Hypsypops_mean <- rbind(Hypsypops_mean,
c(UNIQUESITE[m], years[i], round(Meanp,2),
'Hypsypops rubicundus,adult'))
}
kk <- Paralabrax[Paralabrax$Site==m, ]
if(length(kk$Site)>=2){
Meane <- mean(Paralabrax$Count[Paralabrax$Site==UNIQUESITE[m]])
Paralabrax_mean <- rbind(Paralabrax_mean,
c(UNIQUESITE[m], years[i], round(Meane, 2),
'Paralabrax clathratus,adult'))
}
}
if(i==1){
Mean<-rbind(Hypsypops_mean, Paralabrax_mean)
}
if(i>1){
Mean<-rbind(DataMean, Hypsypops_mean, Paralabrax_mean)
}
Hypsypops_mean<-NULL
Paralabrax_mean<-NULL
}
Mean <- as.data.frame(Mean,stringsAsFactors=F)
names(Mean) <- c('Site','Year','mean_count','SpeciesName')
Mean$Site <- as.integer(Mean$Site)
Mean$Year <- as.integer(Mean$Year)
Mean$mean_count <- as.numeric(Mean$mean_count)
par(mfrow=c(5,5), oma=c(4,2,4,2), mar=c(5.5,4,3,0))
for(i in 1:length(years)){
if(any(Mean$Year==years[i])) {
year1<-Mean[which(Mean$Year==years[i]),]
Species<-unique(as.character(year1$SpeciesName))
Colors<-c("pink","purple")[Species]
Data_Hr<-year1[year1$SpeciesName=="Hypsypops rubicundus,adult",]
Data_Pc<-year1[year1$SpeciesName=="Paralabrax clathratus,adult",]
plot(Data_Hr$mean_count~Data_Pc$mean_count,
xlab=c("Hypsypops rubicundus"),
ylab=c("Paralabrax clathratus"),main=years[i],pch=16)
}
}
It's a lot, I'm sorry; I'm not sure how to streamline the process. But I keep getting an error:
Error in names(Mean) <- c("Site", "Year", "mean_count", "SpeciesName")
: 'names' attribute [4] must be the same length as the vector [0]
Not sure how I can debug this.
I'm not sure why you want to do this with an elaborate loop. It sounds like you are trying to summarise your data.
This can be done in different ways. Here is a solution using dplyr:
library(dplyr)
DataSet1 %>%
group_by(Year, SpeciesName, Site) %>%
summarise(nrecords = n(),
Count = mean(Count))
To get a better answer, it might be helpful to post a subset of the data and the intended result you are after.
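For comparison, the same kind of summary can be sketched in base R with aggregate(), including the restriction to groups that were counted at least twice. The toy data frame below is invented for illustration and only assumes the column names mentioned in the question (Year, Site, SpeciesName, Count).

```r
# Made-up data: site 1 was counted twice in 2000, everything else once
toy <- data.frame(
  Year = c(2000, 2000, 2000, 2001),
  Site = c(1, 1, 2, 1),
  SpeciesName = "Hypsypops rubicundus,adult",
  Count = c(4, 6, 3, 5),
  stringsAsFactors = FALSE
)

# Mean count and number of records per Year/Site/Species group
means <- aggregate(Count ~ Year + Site + SpeciesName, data = toy, FUN = mean)
n <- aggregate(Count ~ Year + Site + SpeciesName, data = toy, FUN = length)
means$nrecords <- n$Count

# Keep only the groups where the count was done at least twice
repeated <- means[means$nrecords >= 2, ]
```

Here repeated plays the role that the Mean data frame was meant to play in the loop version.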

Getting error when applying Smbinning in R

I am working through an example from http://r-statistics.co/Logistic-Regression-With-R.html and have a problem with the smbinning code. I am trying to get the Information Value using smbinning.
library(smbinning)
# segregate continuous and factor variables
factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")
continuous_vars <- c("AGE", "FNLWGT","EDUCATIONNUM", "HOURSPERWEEK", "CAPITALGAIN", "CAPITALLOSS")
iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(14)) # init for IV results
# compute IV for categoricals
for(factor_var in factor_vars){
smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var) # WOE table
if(class(smb) != "character"){ # check if some error occurred
iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
}
}
This is the code given. I cannot understand the reason for checking the class of the smbinning result. My general understanding of smbinning is also not that good.
for(vars in factor_vars){
smb <- smbinning.factor(trainingData, y = "ABOVE50K", x = vars )
iv_df[iv_df$VARS == vars, "IV"] <- smb["iv"]
}
When I run this code I get NA for some of the values. So the class check is apparently needed, but why?
Thank you very much.
Following the example to the letter, your problem would be the following:
If you run smb <- smbinning.factor(trainingData, y="ABOVE50K", x="EDUCATION") and then print smb, you get
[1] "Too many categories"
str(trainingData) shows that:
$ EDUCATION : Factor w/ 16 levels...
While the smbinning documentation says that
maxcat - Specifies the maximum number of categories. Default value is 10. Name of x
must not have a dot.
Therefore the solution is to use smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var, maxcat=16) in the for loop.
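This also explains why the original code checks the class: when binning fails, the returned value is a plain character string rather than the usual list, so there is no iv element to extract. The behaviour can be reproduced without smbinning at all. In the sketch below, safe_iv is a hypothetical helper, and the list on the last line merely stands in for a successful result.

```r
# What the failed call above returned for EDUCATION
smb <- "Too many categories"

# Indexing a character vector by a name it does not have yields NA,
# which is where the NA values in iv_df come from
missing_iv <- smb["iv"]

# Hypothetical helper: return the IV only when a list came back
safe_iv <- function(smb) {
  if (is.list(smb)) smb$iv else NA_real_
}

iv_failed <- safe_iv(smb)
iv_ok <- safe_iv(list(iv = 0.3))  # stand-in for a successful result
```

Inside the loop, the same check lets the message be printed for any variable that fails, which makes problems like the maxcat limit visible straight away.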

Potential bug in stargazer omit.labels

There appears to be a bug in version 5.2 of the stargazer package: the omit.labels functionality works inconsistently depending on the order of the included models:
library(stargazer)
library(ggplot2)
as.data.frame(data("midwest"))
fit.1 <- lm(poptotal ~ popadults, data = midwest)
fit.2 <- lm(poptotal ~ popadults + state, data = midwest)
# Works, column listed as "Yes":
stargazer(fit.2, omit = c("state"), omit.labels = c("States"))
# Does not work, both columns listed as "No":
stargazer(fit.1, fit.2, omit = c("state"), omit.labels = c("States"))
# Works, first column "Yes", second "No":
stargazer(fit.2, fit.1, omit = c("state"), omit.labels = c("States"))
Does anyone know of a workaround?
I just manually specified the indicator for each column using the add.lines argument. For your example:
stargazer(fit.1, fit.2, omit = c("state"),
add.lines = list(
c("States", "No", "Yes")
)
)
Here's one approach, using a wrapper function to generate the add.lines values automatically. This also has (to me) a more natural syntax than having separate "omit" and "omit.labels" arguments. Plus, you can omit variables without having an indicator:
gazer<- function(...,indicate=NULL, staroptions=NULL){
dots <- list(...)
if (is.null(indicate)==FALSE) {
indicate.lines<-sapply(names(indicate), function(indic)
ifelse(
sapply(dots,function(x) length(grep(indic,names(coef(x))))>0
) ,"Yes","No"
)
)
indicate.lines<-rbind(unlist(indicate),indicate.lines)
staroptions$omit <- c(staroptions$omit,names(indicate))
staroptions$add.lines <- c(split(indicate.lines,rep(1:ncol(indicate.lines), each=nrow(indicate.lines))),staroptions$add.lines)
}
do.call(stargazer,c(dots,staroptions))
}
You provide a named list of labels in the indicate argument and all your other stargazer options as a list in staroptions.
For your example:
gazer(fit.1,fit.2,indicate=list(state="State"))
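The core of the wrapper is the test for whether a model has any coefficient whose name matches the pattern. That step can be tried on its own, without stargazer, as in this sketch. It uses the built-in mtcars data instead of midwest, and fit.a, fit.b, has_cyl and indicator_line are names invented for the illustration.

```r
# Two models: only the second includes the cylinder dummies
fit.a <- lm(mpg ~ wt, data = mtcars)
fit.b <- lm(mpg ~ wt + factor(cyl), data = mtcars)
models <- list(fit.a, fit.b)

# TRUE for each model that has at least one coefficient name matching "cyl"
has_cyl <- sapply(models, function(m) any(grepl("cyl", names(coef(m)))))

# One row in the shape add.lines expects: a label, then one entry per model
indicator_line <- c("Cylinders", ifelse(has_cyl, "Yes", "No"))
```

A vector like indicator_line can then be wrapped in list() and passed as add.lines, in the same way as the manual c("States", "No", "Yes") above.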

Huge data file and running multiple parameters and memory issue, Fisher's test

I have R code that I am trying to run on a server, but it stops partway through or freezes, probably because of memory limits. The data files are huge (one has 20 million lines), and in the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data describe human and rat genes; human has 36 replicates and rat has 281, so the loop runs 281*36 steps. I want to process the data with the function getGeneType and use Fisher's test to see how independent the expression is across the different replicate combinations. The data in rat_processed_7_25_FDR_05.out look like this:
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have a few questions about how to make this more efficient. I think R takes up a lot of memory when I run this code, which ultimately causes problems. I am wondering whether there is a more efficient way of doing this.
Another possible culprit is the double for loop. Will sapply help? If so, how should I apply it?
At the end I want to write result to a CSV file. I know it is a bit overwhelming to post code like this, but any optimization or more efficient coding will help A LOT! I really need to run the whole thing at least once to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0). It's used in the for loop below in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 == 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.
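A variation on the same idea, swapped in here for illustration, is to append each result row to a single CSV file inside the loop instead of creating one file per iteration. Either way, result no longer grows in memory through repeated rbind() calls. The sketch below is self-contained and uses made-up 2x2 tables and a temporary file; out_file, tab and the loop sizes are assumptions, not values from the script above.

```r
# Temporary output file; the header is written once, before the loops
out_file <- tempfile(fileext = ".csv")
writeLines("humanReplicate,ratReplicate,pvalue", out_file)

for (i in 1:2) {
  for (j in 1:3) {
    # Made-up contingency table standing in for contingencyTable
    tab <- matrix(c(3 + i, 1, 2, 4 + j), nrow = 2)
    ft <- fisher.test(tab)

    # Build one row and append it; nothing accumulates in memory
    newLine <- data.frame(humanReplicate = j, ratReplicate = i,
                          pvalue = ft$p.value)
    write.table(newLine, file = out_file, sep = ",", append = TRUE,
                row.names = FALSE, col.names = FALSE)
  }
}

# Read the file back to check its shape
res <- read.csv(out_file)
```

Calling gc() every few iterations, next to the existing progress print, would show whether memory use still grows once the rbind() is gone.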

Resources