Reading in all files with a specific extension - r

I have several csv files which are stored in a folder "C://Users//Prices//". I want to read these files into R and store them as a data frame. I tried a for loop, but it would take hours to read in all the files (I measured with system.time()).
Can this be done without a for loop?
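The slow part is usually not the for loop itself but growing a data frame with rbind() on every iteration. The usual replacement is to read every file with fread() and combine once with rbindlist(). A minimal self-contained sketch (the temporary folder and toy files here stand in for the asker's "C://Users//Prices//" data):

```r
library(data.table)

# create a few small CSVs in a temporary folder (illustrative data only)
dir <- file.path(tempdir(), "Prices")
dir.create(dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = letters[1:4], price = rnorm(4)),
            file.path(dir, paste0("prices_", i, ".csv")), row.names = FALSE)
}

# read every .csv in the folder and bind the pieces into one table in a single pass
files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
prices <- rbindlist(lapply(files, fread))
nrow(prices) # 12 rows: 3 files x 4 rows each
```

The single rbindlist() call avoids the quadratic copying that repeated rbind() inside a loop incurs.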

I will reiterate that fread is significantly quicker, as shown in this post on Stack Overflow: Quickly reading very large tables as dataframes in R. In summary, the tests (on a 51 MB file, 1e6 rows x 6 columns) showed a performance improvement of over 70% against the best alternative methods, including sqldf, ff and read.table with and without the optimised settings recommended in the answer by @lukeA. This is backed up in the comments, which report a 4 GB file loading in under a minute with fread, compared to 15 hours with base functions.
I ran some tests of my own, to compare alternative methods of reading and combining CSV files. The experimental setup is as follows:
Generate a 4-column CSV file (1 character column, 3 numeric) for each run. There are 6 runs, each with a different number of rows, from 10^1 through 10^6 records in the data file.
Import the CSV file into R 10 times, joining with rbind or rbindlist to create a single table.
Test out read.csv & read.table, with and without optimised arguments such as colClasses, against fread.
Using microbenchmark, repeat every test 10 times (probably unnecessarily many!) and collect the timings for each run.
The results again favour fread with rbindlist over optimised read.table with rbind.
This table shows the median total duration of 10 file reads & combines for each method and number of rows per file. The first 3 columns are in milliseconds, the last 3 in seconds.
   expr                  10       100     1000     10000    1e+05     1e+06
1: FREAD            3.93704  5.229699 16.80106 0.1470289 1.324394  12.28122
2: READ.CSV        12.38413 18.887334 78.68367 0.9609491 8.820387 187.89306
3: READ.CSV.PLUS   10.24376 14.480308 60.55098 0.6985101 5.728035  51.83903
4: READ.TABLE      12.82230 21.019998 74.49074 0.8096604 9.420266 123.53155
5: READ.TABLE.PLUS 10.12752 15.622499 57.53279 0.7150357 5.715737  52.91683
This plot shows the comparison of timings when run 10 times on the HPC:
Normalising these values against the fread timing shows how much longer these other methods take for all scenarios:
                      10      100     1000    10000    1e+05     1e+06
FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
READ.CSV        3.145543 3.611553 4.683256 6.535784 6.659941 15.299223
READ.CSV.PLUS   2.601893 2.768861 3.603998 4.750835 4.325023  4.221001
READ.TABLE      3.256838 4.019352 4.433693 5.506811 7.112887 10.058576
READ.TABLE.PLUS 2.572370 2.987266 3.424355 4.863232 4.315737  4.308762
Table of results for 10 microbenchmark iterations on the HPC
Interestingly, for 1 million rows per file the optimised versions of read.csv and read.table take roughly 4.2 and 4.3 times as long as fread, whilst without optimisation this leaps to roughly 15 times and 10 times as long.
Note that when I conducted this experiment on my laptop rather than on the HPC cluster, the performance gains were somewhat smaller (around 81% slower for the optimised readers, as opposed to around 400% slower on the HPC). That is interesting in itself; I'm not sure I can explain it, however!
                      10      100     1000    10000    1e+05     1e+06
FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
READ.CSV        2.595057 2.166448 2.115312 3.042585 3.179500  6.694197
READ.CSV.PLUS   2.238316 1.846175 1.659942 2.361703 2.055851  1.805456
READ.TABLE      2.191753 2.819338 5.116871 7.593756 9.156118 13.550412
READ.TABLE.PLUS 2.275799 1.848747 1.827298 2.313686 1.948887  1.832518
Table of results for only 5 `microbenchmark` iterations on my i7 laptop
Given that the data volume is reasonably large, I'd suggest the benefits will lie not only in reading the files with fread but also in the subsequent manipulation of the data with the data.table package, as opposed to traditional data.frame operations. I was lucky to learn this lesson at an early stage and would recommend that others follow suit.
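To make the last point concrete, here is a toy sketch (not part of the benchmark) of the kind of grouped operation that stays fast in data.table syntax once the data is loaded:

```r
library(data.table)

# toy table in the same shape as the benchmark files: 1 character + numeric columns
dt <- data.table(col1 = rep(c("a", "b"), each = 3), col2 = 1:6)

# grouped aggregation in data.table syntax, computed without intermediate copies
agg <- dt[, .(mean_col2 = mean(col2)), by = col1]
agg # one row per group
```

The same `dt[i, j, by]` form covers filtering, computing and grouping in one pass, which is where the post-load gains come from on large tables.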
Here is the code used in the tests.
rm(list = ls()); gc()
library(data.table); library(microbenchmark)

#=============== FUNCTIONS TO BE TESTED ===============
f_FREAD = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = fread("file.csv") else x = rbindlist(list(x, fread("file.csv")))
  }
}
f_READ.TABLE = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.table("file.csv") else x = rbind(x, read.table("file.csv"))
  }
}
f_READ.TABLE.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.table("file.csv", sep = ",", header = TRUE, comment.char = "",
                     colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.table("file.csv", sep = ",", header = TRUE, comment.char = "",
                              colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}
f_READ.CSV = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.csv("file.csv") else x = rbind(x, read.csv("file.csv"))
  }
}
f_READ.CSV.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.csv("file.csv", header = TRUE,
                   colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.csv("file.csv", comment.char = "", header = TRUE,
                            colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}

#=============== MAIN EXPERIMENTAL LOOP ===============
NUM_ITERATIONS = 10 # microbenchmark repetitions of each test
for (i in 1:6) {
  NUM_ROWS = 10^i # the loop allows us to test performance over varying numbers of rows
  NUM_READS = 10
  # create a test data.table with the specified number of rows and write it to file
  dt = data.table(
    col1 = sample(letters, NUM_ROWS, replace = TRUE),
    col2 = rnorm(NUM_ROWS),
    col3 = rnorm(NUM_ROWS),
    col4 = rnorm(NUM_ROWS)
  )
  write.csv(dt, "file.csv", row.names = FALSE)
  # run the imports for each method, recording results with microbenchmark
  results = microbenchmark(
    FREAD = f_FREAD(NUM_READS),
    READ.TABLE = f_READ.TABLE(NUM_READS),
    READ.TABLE.PLUS = f_READ.TABLE.PLUS(NUM_READS),
    READ.CSV = f_READ.CSV(NUM_READS),
    READ.CSV.PLUS = f_READ.CSV.PLUS(NUM_READS),
    times = NUM_ITERATIONS)
  results = data.table(NUM_ROWS = NUM_ROWS, results)
  if (i == 1) results.all = results else results.all = rbindlist(list(results.all, results))
}
results.all[, time := time / 1e9] # convert from nanoseconds to seconds

Speed up your read.table command by:
- predefining colClasses = c("numeric", "factor", ...)
- setting stringsAsFactors = FALSE
- disabling CSV comment parsing with comment.char = ""
Via http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/

Related

How to find number of lines of a Large CSV file without reading it - using R? [duplicate]

I have a CSV file of size ~1 GB, and as my laptop is of basic configuration, I'm not able to open the file in Excel or R. But out of curiosity, I would like to get the number of rows in the file. How am I to do it, if at all I can do it?
For Linux/Unix:
wc -l filename
For Windows:
find /c /v "A String that is extremely unlikely to occur" filename
Option 1:
Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.
length(count.fields(filename))
If you have a header row, you can skip it with skip = 1
length(count.fields(filename, skip = 1))
There are other arguments that you can adjust for your specific needs, like skipping blank lines.
args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
# comment.char = "#")
# NULL
See help(count.fields) for more.
It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.
nrow(data.table::fread("Batting.csv"))
# [1] 99846
system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
# user system elapsed
# 0.528 0.000 0.503
l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740
(The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.
system.time(nrow(fread("Batting.csv", select = 1L)))
# user system elapsed
# 0.063 0.000 0.063
Estimate number of lines based on size of first 1000 lines
size1000 <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000)))
sizetotal <- file.size("dgrp2.tgeno")
1000 * sizetotal / size1000
This is usually good enough for most purposes - and is a lot faster for huge files.
Here is something I used:
testcon <- file("xyzfile.csv",open="r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
  nooflines <- nooflines + linesread
close(testcon)
nooflines
Check out this post for more:
https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
Implementing Tony's answer in R:
file <- "/path/to/file"
cmd <- paste("wc -l <", file)
as.numeric(system(cmd, intern = TRUE))
This is about 4x faster than data.table for a file with 100k lines
> microbenchmark::microbenchmark(
+ nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)),
+ as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))
+ )
Unit: milliseconds
                                                               expr       min        lq      mean   median        uq      max neval
                 nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)) 128.06701 131.12878 150.43999 135.1366 142.99937 629.4880   100
 as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))  27.70863  28.42997  34.83877  29.5070  33.32973 270.3104   100

Is there a way to count rows in R without loading data first? [duplicate]


Missing results from foreach and Parallel

Due to memory constraints in a previous script, I modified it following the advice in a similar issue to mine (do not give workers more data than they need -
reading global variables using foreach in R). Unfortunately, now I'm struggling with missing results.
The script iterates over a 1.9M-column matrix, processes each column and returns a one-row data frame (the rbind combine in foreach stacks the rows). However, when it prints the results there are fewer rows (results) than the number of columns, and the count changes on every run. Seemingly there is no error in the function inside the foreach loop, as it ran smoothly in the previous script, and no error or warning message pops up.
New Script:
if (!require(R.utils))    { install.packages("R.utils");    require(R.utils) }
if (!require(foreach))    { install.packages("foreach");    require(foreach) }
if (!require(doParallel)) { install.packages("doParallel"); require(doParallel) }
if (!require(data.table)) { install.packages("data.table"); require(data.table) }
registerDoParallel(cores = 6)
out.file  = "data.result.167_6_inside.out"
out.file2 = "data.result.167_6_outside.out"
data1 = fread("data.txt", sep = "auto", header = FALSE, stringsAsFactors = FALSE, na.strings = "NA")
data2 = transpose(data1)
rm(data1)
data3 = data2[, 3:dim(data2)[2]]
levels2 = data2[-1, 1:2]
rm(data2)
colClasses = c(ID = "character", Col1 = "character", Col2 = "character", Col3 = "character",
               Col4 = "character", Col5 = "character", Col6 = "character")
res_table = dataFrame(colClasses, nrow = 0) # dataFrame() from R.utils builds an empty data.frame
write.table(res_table, file = out.file,  append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)
write.table(res_table, file = out.file2, append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)
tableRes = foreach(col1 = data3, .combine = "rbind") %dopar% {
  id1 = col1[1]
  df2function = data.frame(levels2[, 1, drop = F], levels2[, 2, drop = F], as.numeric(col1[-1]))
  mode(df2function[, 1]) = "numeric"
  mode(df2function[, 2]) = "numeric"
  values1 <- try(genericFunction(df2function), TRUE) # genericFunction is the user's analysis routine
  if (is.numeric(try(values1$F, TRUE))) {
    res_table[1, 1] = id1
    res_table[1, 2] = values1$F[1, 1]
    res_table[1, 3] = values1$F[1, 2]
    res_table[1, 4] = values1$F[1, 3]
    res_table[1, 5] = values1$F[2, 2]
    res_table[1, 6] = values1$F[2, 3]
    res_table[1, 7] = values1$F[3, 3]
  } else {
    res_table[1, 1] = id1
    res_table[1, 2:7] = NA
  }
  write.table(res_table, file = out.file, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
  return(res_table[1, ])
}
write.table(tableRes, file = out.file2, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
In the previous script, the foreach call looked like this:
tableRes = foreach(i=1:length(data3), iter=icount(), .combine="rbind") %dopar% { (same code as above) }
Thus, I would like to know the possible causes of this behaviour.
I'm running this script on a cluster, requesting 80 GB of memory (and 6 cores in this example). This is the largest amount of RAM I can request on a single node, to be sure the script will not fail for lack of memory. (Each node has a pair of 14-core Intel Xeon Skylake CPUs at 2.6 GHz and 128 GB of RAM; OS: RHEL 7.)
Ps 1: Although the new script no longer pages (even with more than 8 cores), each child process still seems to load large amounts of data into memory (~6 GB), as I tracked using the top command.
Ps 2: The new script prints the results both inside and outside the foreach loop, to track whether the loss of data occurs during the loop or after it finishes; every run gives me a different number of printed results, both inside and outside the loop.
Ps 3: The fastest run was with 20 cores (6 s for 1000 iterations) and the slowest was 56 s on a single core (tests performed using microbenchmark with 10 replications). However, more cores leads to fewer results being returned for the full matrix (1.9M columns).
I really appreciate any help you can provide.
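Not an answer to the root cause, but one pattern worth ruling out first: if several workers write.table() to the same file concurrently (as the script above does inside the loop), lines can interleave or be lost, and the counts will differ from run to run. A self-contained sketch (toy matrix and a hypothetical column summary, not the asker's genericFunction) of the side-effect-free shape, where workers only return rows and tryCatch turns silent failures into visible rows:

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

m <- matrix(rnorm(40), nrow = 10) # toy stand-in for the 1.9M-column matrix

# each worker only *returns* its row; all file output happens afterwards in the master
res <- foreach(j = seq_len(ncol(m)), .combine = rbind) %dopar% {
  tryCatch(
    data.frame(id = j, mean = mean(m[, j]), error = NA_character_),
    error = function(e) data.frame(id = j, mean = NA_real_, error = conditionMessage(e))
  )
}
stopImplicitCluster()

nrow(res) # one row per column; failures appear as rows carrying an error message instead of vanishing
```

With this shape a single write.table() after the loop replaces the per-worker writes, so nothing competes for the output file.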

mcapply: all scheduled cores encountered errors in user code

The following is my code. I am trying to get the list of all the files (~20000) that end with .idat and read each file using the function illuminaio::readIDAT.
library(illuminaio)
library(parallel)
library(data.table)

# number of cores to use
ncores = 8

# this gets all the files with .idat extension, ~20000 files
files <- list.files(path = "./",
                    pattern = "\\.idat$",
                    full.names = TRUE)

# function to read an idat file and create a data.table of the filename and two more columns,
# written out as CSV using fwrite
get.chiptype <- function(x)
{
  idat <- readIDAT(x)
  res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
  fwrite(res, file = "output.csv", append = TRUE)
}

# using mclapply, call get.chiptype on all 20000 files, 8 cores at a time
mclapply(files, FUN = get.chiptype, mc.cores = ncores)
After reading and writing info about 1200 files, I get the following message:
Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
all scheduled cores encountered errors in user code
How do I resolve it?
Calling mclapply() in some instances requires you to specify a random number generator that allows for multiple streams of random numbers.
R version 2.14.0 has an implementation of Pierre L'Ecuyer's multiple pseudo-random number generator.
Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':
set.seed( my.seed, kind = "L'Ecuyer-CMRG" );
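Another angle on the same warning (a sketch with a toy function, since the .idat files cannot be reproduced here): when each element catches its own error, mclapply() returns the error objects in place, so you can see exactly which inputs failed rather than just that "all scheduled cores encountered errors":

```r
library(parallel)

# toy worker that fails on one input, standing in for readIDAT on a corrupt file
f <- function(x) if (x == 3) stop("bad input: ", x) else x^2

# wrap the body in tryCatch so each element records its own error
res <- mclapply(1:5,
                function(x) tryCatch(f(x), error = function(e) e),
                mc.cores = if (.Platform$OS.type == "windows") 1L else 2L)

failed <- vapply(res, inherits, logical(1), what = "error")
which(failed)              # index 3 failed
conditionMessage(res[[3]]) # "bad input: 3"
```

Running this over the real file list would reveal whether a handful of malformed .idat files (or concurrent appends to output.csv) is what trips the workers.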

Huge data file and running multiple parameters and memory issue, Fisher's test

I have R code that I am trying to run on a server, but it stops in the middle / freezes, probably because of a memory limitation. The data files are huge (one has 20 million lines) and, if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data holds gene-level measurements for human and rat; human has 36 replicates while rat has 281, so the loop runs 281*36 times. What I want to do is process the data using the function getGeneType and see, using Fisher's test, how different/independent the expression of the various replicate combinations is. The data rat_processed_7_25_FDR_05.out looks like this:
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have a few questions on how to make this more efficient. I think that when I run this code R takes up lots of memory, which ultimately causes problems, and I am wondering if there is any way of doing this more efficiently.
Another possibility is the double for loop. Would sapply help? If so, how should I apply sapply?
At the end I want to convert the result into a CSV file. I know it is a bit overwhelming to post code like this, but any optimization/efficient-coding advice would help A LOT! I really need to run the whole thing at least once to get the data soon.
# this one compares reg vs rev
date()
ratRawData   <- read.table("rat_processed_7_25_FDR_05.out",
                           col.names = c("alignment", "ratGene", "start", "end", "chrom",
                                         "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out",
                           col.names = c("humanGene", "start", "end", "chrom",
                                         "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ",")

# keep only gene, alignment number, replicate and RNAtype; discard the other columns
ratRawData   <- ratRawData[, c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c("humanGene", "humanReplicate", "alignment", "RNAtype")]

# capitalize the RNA type names for rat (reg -> REG, dup -> DUP, rev -> REV)
# to make building the contingency table easier
capitalize <- function(x) toupper(x)
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))

# split the data into replicates
ratSplit   <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
# HyRy: when some gene has reg, rev, REG, REV
# HnRy: when some gene has only reg, REG, REV
# HyRn: add 1 when some gene has only reg, rev, REG
# HnRn: add 1 when some gene has only reg, REG
# function used to aggregate
getGeneType <- function(types) {
  types <- as.character(types)
  if ("rev" %in% types) {
    return(ifelse("REV" %in% types, "HyRy", "HyRn"))
  } else {
    return(ifelse("REV" %in% types, "HnRy", "HnRn"))
  }
}

# logical function to test whether x is integer(0); used in the loop below in case any HmYn count is zero
is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}

result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate",
                     pvalue = "p-value", alternative = "alternative_hypothesis",
                     Conf.int1 = "conf.int1", Conf.int2 = "conf.int2", oddratio = "Odd_Ratio")
for (i in 1:length(ratSplit)) {
  for (j in 1:length(humanSplit)) {
    ratReplicateName   <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])
    # merge the two datasets based on the one-to-one gene mapping in geneList, defined above
    mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData   <- merge(geneList, ratSplit[[i]],   by.x = "rat",   by.y = "ratGene")
    mergedHumanData <- mergedHumanData[, c(1, 2, 4, 5)] # rearrange columns
    mergedRatData   <- mergedRatData[, c(2, 1, 4, 5)]   # rearrange columns
    mergedHumanRatData <- rbind(mergedHumanData, mergedRatData) # columns are now "human", "rat", "alignment", "RNAtype"
    agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData, FUN = getGeneType) # aggregate into HmYn form
    HmRnTable <- table(agg$RNAtype) # table of HmRn, i.e. RNAtype in human and rat
    # now assign these counts to the HmYn variables; is.integer0 handles the case
    # where some form of HmRy is absent from the table
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
    contingencyTable <- matrix(c(HnRn, HnRy, HyRn, HyRy), nrow = 2)
    # contingencyTable:
    #  HnRn | HyRn
    # ------|-----
    #  HnRy | HyRy
    fisherTest <- fisher.test(contingencyTable)
    # make a new row out of the result of fisherTest
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName,
                              pvalue = fisherTest$p.value, alternative = fisherTest$alternative,
                              Conf.int1 = fisherTest$conf.int[1], Conf.int2 = fisherTest$conf.int[2],
                              oddratio = fisherTest$estimate[[1]])))
    result <- rbind(result, newLine) # append the new row to result
    if (j %% 10 == 0) print(c(i, j))
  }
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.
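A self-contained sketch of that suggestion, shrunk to a 2 x 3 loop with a placeholder row standing in for the Fisher's test output:

```r
# toy sketch of the per-iteration-file approach described above
out_dir <- file.path(tempdir(), "fisher_parts")
dir.create(out_dir, showWarnings = FALSE)

for (i in 1:2) {
  for (j in 1:3) {
    newLine <- data.frame(i = i, j = j, pvalue = runif(1)) # placeholder for the Fisher test row
    csvFileName <- file.path(out_dir, sprintf("compareRegAndRev%03d_%03d.csv", i, j))
    write.table(newLine, csvFileName, sep = ",", row.names = FALSE, col.names = FALSE)
  }
}

# the headerless parts can also be concatenated back inside R rather than with `cat`
parts <- list.files(out_dir, pattern = "^compareRegAndRev", full.names = TRUE)
combined <- do.call(rbind, lapply(parts, read.csv, header = FALSE))
nrow(combined) # 6: one row per (i, j) pair
```

Because each iteration's memory is released once its small file is written, peak usage stays near that of a single iteration instead of the whole 281 x 36 result.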
