Different spacing while printing to log - r

I am printing importance matrix of xgBoost into log using write command (write works with file connection and direct it to stderr well). Here is the command I am using:
importance_matrix <- xgb.importance(names, model=bst)
write("The top 30 variables are:",stderr())
write(paste0("Feature",'\t','\t','Gain','\t','Cover','\t','Frequency'),stderr())
write(t(as.matrix(importance_matrix[1:30,])),sep="\t",ncolumns = length(names(importance_matrix)),stderr())
Output comes in format:
Feature Gain Cover Frequency
pctTillDate 0.560359696 0.1314074664 0.024278250
colr_per 0.183149483 0.0962457545 0.049618673
date 0.050528297 0.1143752021 0.066395735
GREG_D 0.025648433 0.0381476142 0.018070143
LNGTD_I 0.020346020 0.0485235001 0.101322109
LATTD_I 0.019241497 0.0421892270 0.093867103
which make it look a bit clumsy (much clumsy in log than appearing here in SO). So in order to make it better looking I want to change last line of t(as.matrix(importance_matrix[1:30,])),sep="\t" such that first sep will be 2 tabs ('\t','\t') and rest single tab ('\t'); instead of current uniform spacing. Simple but search doesn't give any idea. Any suggestions?

Consider padding the column names and first char column of matrix with whitespace to align each to largest character size of first column:
write.table(importance_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710
new_matrix <- importance_matrix
# FIRST COLUMN LARGEST CHAR LENGTH
charmax <- max(nchar(new_matrix[,1]))
# PAD COLUMN HEADERS
colnames(new_matrix) <- lapply(1:ncol(new_matrix), function(i)
paste0(colnames(new_matrix)[i],
paste(rep(" ", charmax - nchar(colnames(new_matrix)[i])), collapse=""))
)
# PAD FIRST COLUMN
new_matrix[,1] <- sapply(1:nrow(new_matrix), function(i)
paste0(new_matrix[i,1],
paste(rep(" ", charmax - nchar(new_matrix[i,1])), collapse=""))
)
write.table(new_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710

Related

Print out the text value of the points on a cluster when using UMAP and HDBScan and BERT sentence transformer

I have seen a number of questions similar to this but my cluster labels consist of sentence embeddings, thus a better question may be how do I get text values from the sentence embeddings?
How can I get from my sentence embeddings to print a text output?
umap_embeddings = umap.UMAP(n_neighbors=50,
n_components=5,
metric='cosine').fit_transform(embeddings)
cluster = hdbscan.HDBSCAN(min_cluster_size=3,
metric='euclidean',
cluster_selection_method='eom').fit(umap_embeddings)
# Prepare data
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=24).fit_transform(embeddings)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_
# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#202020', s=25)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=25, cmap='hsv_r'
)
some previous answers have suggested;
textdata_with_label_113 = textData[clusterer.labels_ == 113]
However, this returns the embedded value oppoesd to the text value.
With more time on the problem I realised that the embeddings are in the same sequence as the original DF.
therefore you can work back quite easily.
lbls=[]
#seperate the clustered labels into seperate lists (0,1,2,3)
for x in range(len(clustered.labels)):
lbls.append(clustered[clustered.labels == x])
df_desc=[]
# extract the rows from the data frame using the lbls list and use column 6 only in my case
for x in range(len(lbls)):
df_desc.append(df.iloc[lbls[x].index,5])
for i in range(4):
txt = "Cluster {number}"
print(txt.format(number = i))
print(df_desc[i])

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss) what this question is part of is attempting to get the benefits of SPSS style labeling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.
Whilst CSV does a reasonable job of storing data it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard code the meta data in. But for larger projects it makes more sense to store this meta data in a second csv file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
SO the common elements are the column names which are the key to both files
The first column of the metadata file describes the role of the row for the data file
so
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
missing describes how to treat missing data
varlabel describes the label for a factor level starting at one on up to as many labels as there are.
Right! Here's the code that works:
```#Libraries
library(expss)
library(data.table)
library(magrittr)```
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt datatable simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery
this code works just fine on a single columns when I don't use a for loop variable
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't make vector as you expect. Contrary to usual data.frame data.table doesn't drop dimensions when you provide single column in the index. And as.vector do nothing with data.frames/data.tables. So tlabels remains data.table. This line need to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"

How to coerce stslist.freq to dataframe

I am doing some describtive sequence analysis using the "TraMineR" library. I want to report my findings via R-Markdown in html format. For formating tables I use "kable" and "kableExtra".
To get the frequency and propotions of the most common sequences I use seqtab(). The result is an stslist.freq object. When I try to coerce it to a dataframe, the dataframe is not containing any frequencies and proportions.
I tried to print the results of seqtab() and store this result again. This gives me the dataframe I desire. However there are two "problems" with that: (1) I don't understand what is happening here and it seems like a "dirty" trick, (2) as a result I also get the output of the print command in my final html document if I don't split the code in multiple chunks and disable the ouput in the specific chunk.
Here is some code to replicate the problem:
library("TraMineR")
#Data creation
data.long <- data.frame(
id=rep(1:50, each=4),
time = c(0,1,2,3),
status = sample(letters[1:2], 200, replace = TRUE),
weight=rep(runif(50, 0, 1), each=4)
)
#reshape
data.wide <- reshape(data.long, v.names = "status", idvar="id", direction="wide", timevar="time")
#sequence
sequence <- seqdef(data.wide,
var=c("status.0", "status.1", "status.2", "status.3"),
weights=data.wide$weight)
#frequencies of sequences
##doesn't work:
seqtab.df1 <- as.data.frame(seqtab(sequence))
##works:
seqtab.df2 <- print(seqtab(sequence))
I expect the dataframe to be the same as the one saved in seqtab.df2, however either without using the print command or with "silently" (no output printed) using the print command.
Thank you very much for your help and let me know if I forgot something to make answering the question possible!
If you look at the class() of the object returned by seqtab, it has the type
class(seqtab(sequence))
# [1] "stslist.freq" "stslist" "data.frame"
so if we look at exactly, what's happening in the print statement for such an object we can get a clue what's going on
TraMineR:::print.stslist.freq
# function (x, digits = 2, width = 1, ...)
# {
# table <- attr(x, "freq")
# print(table, digits = digits, width = width, ...)
# }
# <bytecode: 0x0000000003e831f8>
# <environment: namespace:TraMineR>
We see that what it's really giving you is the "freq" attribute. You can extract this directly and skip the print()
attr(seqtab(sequence), "freq")
# Freq Percent
# a/3-b/1 4.283261 20.130845
# b/1-a/1-b/2 2.773341 13.034390
# a/2-b/1-a/1 2.141982 10.067073
# a/1-b/1-a/1-b/1 1.880359 8.837476
# a/1-b/2-a/1 1.723489 8.100203
# b/1-a/2-b/1 1.418302 6.665861
# b/2-a/1-b/1 1.365099 6.415813
# a/1-b/3 1.241644 5.835586
# a/1-b/1-a/2 1.164434 5.472710
# a/2-b/2 1.092656 5.135360

Fast reading (by chunk?) and processing of a file with dummy lines at regular interval in R

I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info).
For example:
library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
write(paste(i, 'is the current iteration'), myfile, append=T)
z = matrix(runif(nx*ny), nrow = ny) # random numbers with dim(nx, ny)
write.fwf(z, myfile, append=T, rownames=F, colnames=F) #write in fixed width format
}
With nx=5 and ny=2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given the output is regular, I thought readr would be a good idea (?).
The only way I can think of, is to do it manually by chunks in order to eliminate the useless info lines:
library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
z = as.vector(t(z))
ifirst = (i-1)*ny*nx + 1 # appropriate index
ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in DF for future analysis:
x = rep(rep(seq(1:nx), ny), niter)
y = rep(rep(seq(1:ny), each=nx), niter)
myDF = data.frame(x=x, y=y, z=z)
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there no reading function accepting a vector with precise locations as skip argument, rather than a single number of initial rows?
PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...
EDIT
The following solution (reading all lines with readLines and removing the undesirable ones and then processing the rest) is a faster alternative with niter very high:
bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # Then process output from readLines
library(data.table)
ztot <- fread(asOneVector)
ztot <- c(t(ztot))
Discussion on how to proceed results from the readLines can be found here
Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:
tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
Lines can be removed on the basis of a pattern or of indices for example.
This was suggested by #Roland from here.
Not sure if I still understood your problem correctly. Running your script created a file with 1310 lines. With This is iteration 1or2or3 printed at lines
Line 1: This is iteration 1
Line 132: This is iteration 2
Line 263: This is iteration 3
Line 394: This is iteration 4
Line 525: This is iteration 5
Line 656: This is iteration 6
Line 787: This is iteration 7
Line 918: This is iteration 8
Line 1049: This is iteration 9
Line 1180: This is iteration 10
Now there is data between these lines that you want to read and skip this 10 strings.
You can do this by tricking read.table saying your comment.char is "T" which will make read.table thinks all lines starting with letter "T" are comments and will skip those.
data<-read.table("bigFile.txt",comment.char = "T")
this will give you a data.frame of 1300 observations with 150 variables.
> dim(data)
[1] 1300 150
For a non-consisted strings. Read your data with read.table with fill=TRUE flag. This will not break your input process.
data<-read.table("bigFile.txt",fill=TRUE)
Your data looks like this
> head(data)
V1 V2 V3 V4 V5 V6 V7
1: 1.0000000 is the current iteration NA NA
2: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
3: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
4: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
5: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
6: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
Now if you see how the strings are distributed in columns. Now you can simply subset your data set with pattern matching. Matching columns that match these strings. For example
library(data.table)
data<-as.data.table(data)
cleaned_data<-data[!(V3 %like% "the"),]
> head(cleaned_data)
V1 V2 V3 V4 V5 V6 V7
1: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
2: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
3: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
4: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
5: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
6: 0.3943193 0.508373900 0.2131134905 0.92474343 0.432134031 0.4585807 0.9811607

Huge data file and running multiple parameters and memory issue, Fisher's test

I have a R code that I am trying to run in a server. But it is stopping in the middle/get frozen probably because of memory limitation. The data files are huge/massive (one has 20 million lines) and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data has specific data of human and rats' genes and human has 36 replicates, while rat has 281. So, the loop is basically 281*36 steps. What I want to do is to process data using the function getGeneType and see how different/independent are the expression of different replicate combinations. Using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this :
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have few questions on how to make this more efficient. I think when I run this code, R takes up lots of memory that ultimately causes problems. I am wondering if there is any way of doing this more efficiently
Another possibility is the usage of double for-loop'. Will sapply help? In that case, how should I apply sapply?
At the end I want to convert result into a csv file. I know this is a bit overwhelming to put code like this. But any optimization/efficient coding/programming will be A LOT! I really need to run the whole thing at least one to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 = 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.

Resources