Why is my for loop only half-working? (R)

I'm not sure how to make a reproducible example of my problem, and this post is fairly verbose, but I'm hoping the issue will jump out. Basically, this for loop obtains output from an external program, makes some calculations in R, and then posts the results of those calculations back to an external file.
The first iteration of the loop runs perfectly. It does everything correctly, returns to the top of the loop, and changes to the correct directory (flist[2]), but when it reaches get_stress_table on the second iteration, it chokes, printing "NA" into the files rather than the file names (flist is a vector of file names).
The file names and subdirectories being iterated through in this loop share a common name. What confuses me is that setwd correctly changes to the right subdirectory, yet the first function then prints 'NA' as the file name, so I don't understand the problem.
Anything sticking out?
Here is the for loop I am trying to run:
for (i in 1:length(flist)) {
  setwd(paste0(solutions_dir, "\\", flist[i]))
  max_stress <- get_stress_table(solutions_dir = solutions_dir, flist = flist[i],
                                 lsdynadir = lsdynadir, states = 5)
  xy_table <- element_time_series(stressed_eid = max_stress, solutions_dir = solutions_dir,
                                  flist = flist[i], lsdynadir = lsdynadir)
  damp_coeff <- find_damp(xy_table = file_xy)
  setwd(kfile_complete)
  erode_damp(erosion_lines = erosion_lines, damp_coef = damp_coeff, kfile_mesh = flist[i])
}
Here is the traceback from the error:
3. file(con, "r")
2. readLines(flist[i])
1. get_stress_table(solutions_dir = solutions_dir, flist = flist[i],
       lsdynadir = lsdynadir, states = 5)
Here is the inside of that function:
biggest_stresses <- data.frame(eid = numeric(),
                               stress = numeric(),
                               stringsAsFactors = FALSE)
for (j in 1:states) {
  fileconn <- file("get_stresses.cfile")
  line_one   <- paste0("open d3plot ", solutions_dir, "\\", flist[i], "\\", "d3plot")
  line_two   <- "ac"
  line_three <- "fringe 14"
  line_four  <- "pfringe"
  line_five  <- "anim forward"
  line_six   <- "anim stop; state 100;"
  line_seven <- paste0("output ", solutions_dir, "\\", flist[i], "\\", flist[i], " ", j,
                       " 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1.000000")
  writeLines(c(line_one, line_two, line_three, line_four, line_five, line_six, line_seven),
             fileconn)
  close(fileconn)
  system(paste0(lsdynadir, "\\lsprepost4.3_x64.exe c=get_stresses.cfile -nographics"))
  stresses <- readLines(flist[i])
  start <- grep(stresses, pattern = "*KEYWORD", fixed = T)
  stop  <- grep(stresses, pattern = "$Interpreted from averaged nodal data", fixed = T)
  stresses <- stresses[-seq(start, stop, by = 1)]
  writeLines(stresses, flist[i])
  stresses <- read.table(flist[i], header = FALSE)
  names(stresses) <- c("eid", "stress")
  max_stress <- which(stresses$eid == which.max(stresses$stress))
  biggest_stresses <- rbind(biggest_stresses, stresses[max_stress, ])
}
return(biggest_stresses[which.max(biggest_stresses$stress), 1])
}
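One thing that may stick out, offered as a guess rather than a confirmed diagnosis: the function receives a single file name (flist = flist[i]) but its body indexes that argument with i again, and that i is found in the calling environment. Indexing a length-1 vector at position 2 yields NA, which matches both the traceback above and the "NA" file names:
f <- "file_B.txt"  # what the function receives on iteration 2, i.e. flist = flist[2]
i <- 2             # the loop counter, visible from the calling environment
f[i]               # indexing a length-1 vector at position 2
#> [1] NA
# so readLines(flist[i]) becomes readLines(NA), and paste0(..., flist[i], ...) writes "NA"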


%dopar% safe way of write to csv inside foreach loop

[EDITED]
This is a general question: I have seen posts saying that it is not a good idea to call write.csv inside a foreach loop, because different cores may try to write to the file at the same time, resulting in missing results. Still, I need to write to an external file inside the parallel loop to get my output (500,000+ rows and 10+ columns); otherwise the job crashes due to memory issues. So I would like to know if there is a safer way to write a results file within a foreach loop.
I appreciate any help on this.
I am adding some more info and much simpler code and data than what I actually have.
Description: I have two different polygon layers (sf, POLYGON), each with 500,000+ features. I need to calculate the area of each class of a raster (one raster layer with 3 classes) within each of the polygons. This is the most time-consuming part of the script, mainly because I need to call sf::st_intersection multiple times. Then I use many different combinations of if-else rules to populate a data frame with values.
This is the original code, which runs into memory issues with the original data:
require(sf)
require(raster)
require(rgdal)
require(rgeos)
require(dplyr)
require(stars)
## Sample data
set.seed(131)
sample_raster = raster(nrows = 1, ncols = 1, res = 0.5, xmn = 0, xmx = 11, ymn = 0, ymx = 11)
values(sample_raster) = rep(1:3, length.out = ncell(sample_raster))
crs(sample_raster) = CRS('+init=EPSG:4326')
plot(sample_raster, axes=T)
sample_raster
##
m = rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0))
p = st_polygon(list(m))
n = 100
l = vector("list", n)
for (i in 1:n)
  l[[i]] = p + 10 * runif(2)
sample_poly = st_sfc(l)
data = data.frame(PR_ID = seq(1:100),
                  COND1 = rep(1:10, length.out = 100))
sample_poly = st_sf(cbind(data, sample_poly))
plot(sample_poly, col = sf.colors(categorical = TRUE, alpha = .5), add=T)
sample_poly = sample_poly %>% st_set_crs(4326)
sample_poly
##
## Code
require(parallel)
require(foreach)
require(doParallel)
idall = as.character(sample_poly$PR_ID)
area = as.numeric(st_area(sample_poly))/10000
# i=1
# listID = idall
# mainpoly = sample_poly
# mainras = sample_raster
# mainpolyarea = area
per.imovel.paralallel = function(listID, mainpoly, mainras, mainpolyarea) { # Starting the function
  ## Setting up the parallel work on your computer
  UseCores = detectCores() - 1
  cl = parallel::makeCluster(UseCores, output = "")
  doParallel::registerDoParallel(cl)
  writeLines(c(""), "log.txt") # Creates a LOG FILE in the folder to follow processing
  FOREACH.RESULT = foreach(i = 1:length(listID),
                           .packages = c('raster', 'rgdal', 'rgeos', 'dplyr', 'parallel',
                                         'doParallel', 'sf', 'stars'),
                           .inorder = T, .combine = 'rbind') %dopar% { # Starting the parallel loop
    sink("log.txt", append = TRUE) # LOG FILE in the home folder
    cat(paste(i, "of", length(listID), as.character(Sys.time()), "\n")) # Write to LOG FILE
    sink() # end diversion of output
    ########################
    ### Pick one poly
    px = sf::st_buffer(mainpoly[mainpoly$PR_ID == listID[i], ], # Select the geometry whose PR_ID is in position i
                       dist = 0.1) # buffer = 0 w/ byid selects the geometry
    ########################
    ### Intersect with raster and get area
    px2 = sf::st_buffer(px, dist = 0.1) # Buffer because raster::mask() masks out partially covered cells, since it calls rasterize() first
    desm_prop = raster::crop(mainras, as_Spatial(px2))
    desm_prop_shp = if (all(is.na(values(desm_prop)))) { NULL
    } else { sf::st_intersection(st_cast(sf::st_as_sf(stars::st_as_stars(desm_prop)), "POLYGON"), px) }
    names(desm_prop_shp)[1] = if (any(names(desm_prop_shp) == "layer")) { "values"
    } else { NULL }
    desm_prop_bet0108 = if (is.null(desm_prop_shp)) { NULL
    } else { desm_prop_shp[desm_prop_shp$values == 1, ] }
    desm_prop_bet0108 = if (is.null(desm_prop_bet0108) | length(desm_prop_bet0108) == 0) { NULL
    } else if (length(desm_prop_bet0108$values) == 0) { NULL
    } else { desm_prop_bet0108 }
    desm_prop_after08 = if (is.null(desm_prop_shp)) { NULL
    } else { desm_prop_shp[desm_prop_shp$values == 2, ] }
    desm_prop_after08 = if (is.null(desm_prop_after08) | length(desm_prop_after08) == 0) { NULL
    } else if (length(desm_prop_after08$values) == 0) { NULL
    } else { desm_prop_after08 }
    desm_prop_upto00 = if (is.null(desm_prop_shp)) { NULL
    } else { desm_prop_shp[desm_prop_shp$values == 3, ] }
    desm_prop_upto00 = if (is.null(desm_prop_upto00) | length(desm_prop_upto00) == 0) { NULL
    } else if (length(desm_prop_upto00$values) == 0) { NULL
    } else { desm_prop_upto00 }
    area_desm_prop_bet0108 <- if (is.null(desm_prop_bet0108)) { 0
    } else { sum(as.numeric(sf::st_area(desm_prop_bet0108)/10000)) } # Deforestation area in px 2001-2008
    area_desm_prop_after08 <- if (is.null(desm_prop_after08)) { 0
    } else { sum(as.numeric(sf::st_area(desm_prop_after08)/10000)) } # Deforestation area in px after 2008
    area_desm_prop_upto00 <- if (is.null(desm_prop_upto00)) { 0
    } else { sum(as.numeric(sf::st_area(desm_prop_upto00)/10000)) } # Deforestation area in px up to 2000
    ########################
    # RESULTS
    TEMP.RESULTS = data.frame(PR_ID = as.character(listID[i]),
                              PR_AREA_HA = mainpolyarea[i],
                              PR_D09 = area_desm_prop_after08,
                              PR_D0108 = area_desm_prop_bet0108,
                              PR_D00 = area_desm_prop_upto00)
    return(TEMP.RESULTS)
  } # Ending the loop
  parallel::stopCluster(cl) # stop the cluster before returning; statements after return() would never run
  stopImplicitCluster()
  gc()
  return(FOREACH.RESULT)
} # Ending the function
#####################################################################################################
results_feach = per.imovel.paralallel (listID = idall, mainpoly = sample_poly, mainras = sample_raster, mainpolyarea = area)
warnings()
I have also tried @mischva11's suggestion (modified) by adding this:
length_of_chunk = round(length(idall)/(length(idall)/10)) # generate chunks of 10 lines
lchunks = split(idall, sort(rep_len(1:length_of_chunk, length(idall))))
for (z in 1:length_of_chunk) {
  # split up the data in chunks
  idall_chunk = as.vector(unlist(lchunks[z]))
  results_chunk = per.imovel.paralallel(listID = idall_chunk, mainpoly = sample_poly,
                                        mainras = sample_raster, mainpolyarea = area)
  # save your foreach results for each chunk, appending after the first one
  if (z == 1) { write.table(results_chunk, file = "TESTDATAresults1.csv")
  } else { write.table(results_chunk, file = "TESTDATAresults1.csv", append = TRUE, col.names = FALSE) }
  print(NULL) # print(results_chunk)
}
It works like a charm for this example.
BUT, I hit a setback when running it with the real script/data: it takes ages for the foreach to close. I am watching my machine's performance and the log file: after processing all lines of my sf object, CPU usage goes down as expected, but it still takes more than 30 minutes (I did not wait for it to finish completely) to close the foreach function.
Because of this, I thought about writing the output on the fly inside the foreach loop. But clearly that is not a good idea, as explained here. I have seen some posts about the package 'flock', which locks the output file while writing to it. I have not tested it, but it sounds promising.
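For what it's worth, a minimal sketch of how that could look, assuming flock's lock()/unlock() API (untested here, and the .lock file name is arbitrary); each worker blocks until it holds the lock, so concurrent writes cannot interleave:
library(flock)
# Hypothetical placement: inside the %dopar% body, after TEMP.RESULTS is built
lck <- flock::lock("results.csv.lock")  # blocks until this worker holds the lock
write.table(TEMP.RESULTS, file = "TESTDATAresults1.csv", append = TRUE,
            col.names = FALSE, row.names = FALSE)
flock::unlock(lck)                      # release so other workers can write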
The problem here is that you need communication between the cores. One core has to wait until the next one has finished writing to the csv. That is not easily done, and not possible with foreach as far as I know (foreach does keep the output ordered via the .inorder argument, TRUE by default, but that does not synchronize writes). You are telling us you have memory issues, so one solution is to chunk up your output, if that is possible. I do not have a good dataset for this example, so I use mtcars.
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(4)
# split your output here; I use 5 chunks, my data is mtcars
n_chunks <- 5
chunk_size <- ceiling(nrow(mtcars) / n_chunks)
for (z in 1:n_chunks) {
  # here the data gets split up
  rows <- ((z - 1) * chunk_size + 1):min(z * chunk_size, nrow(mtcars))
  data <- mtcars[rows, ]
  # foreach over the rows of this chunk
  results <- foreach(i = 1:nrow(data), .combine = rbind) %dopar% {
    # ***your code***
    y <- data[i, ]
    return(y)
  }
  print(results)
  # save your foreach results and then begin again
  if (z == 1) { write.table(results, file = "test.csv")
  } else { write.table(results, file = "test.csv", append = TRUE, col.names = FALSE) }
}

How to read a .MAP file extension in R?

Is there a simple way to read a file with a .MAP extension in R? I have tried a few options below but had no success. Here is a .MAP file for a reproducible example.
Context: for some odd reason, the spatial regionalization used in health planning policies in Brazil is only available in this format. I would like to convert it to GeoPackage so we can add it to the geobr package.
# none of these options work
mp <- sf::st_read("./se_mapas_2013/se_regsaud.MAP")
mp <- rgdal::readGDAL("./se_mapas_2013/se_regsaud.MAP")
mp <- rgdal::readOGR("./se_mapas_2013/se_regsaud.MAP")
mp <- raster::raster("./se_mapas_2013/se_regsaud.MAP")
mp <- stars::read_stars("./se_mapas_2013/se_regsaud.MAP")
PS: there is a similar question on SO focused on Python, unfortunately unanswered.
UPDATE
We have found a publication that uses a custom function that reads the .MAP file. See example below. However, it returns a "polylist" object. Is there a simple way to convert it to a simple feature?
original custom function
read.map = function(filename){
  zz = file(filename, "rb")
  #
  # header of .map
  #
  versao = readBin(zz, "integer", 1, size = 2) # 100 = version 1.00
  # Bounding box
  Leste = readBin(zz, "numeric", 1, size = 4)
  Norte = readBin(zz, "numeric", 1, size = 4)
  Oeste = readBin(zz, "numeric", 1, size = 4)
  Sul   = readBin(zz, "numeric", 1, size = 4)
  geocodigo = ""
  nome = ""
  xleg = 0
  yleg = 0
  sede = FALSE
  poli = list()
  i = 0
  #
  # repeat for each object in the file
  #
  repeat {
    tipoobj = readBin(zz, "integer", 1, size = 1) # 0 = Poligono (polygon), 1 = PoligonoComSede (polygon with seat), 2 = Linha (line), 3 = Ponto (point)
    if (length(tipoobj) == 0) break
    i = i + 1
    Len = readBin(zz, "integer", 1, size = 1) # length byte of the Pascal string
    geocodigo[i] = readChar(zz, 10)
    Len = readBin(zz, "integer", 1, size = 1) # length byte of the Pascal string
    nome[i] = substr(readChar(zz, 25), 1, Len)
    xleg[i] = readBin(zz, "numeric", 1, size = 4)
    yleg[i] = readBin(zz, "numeric", 1, size = 4)
    numpontos = readBin(zz, "integer", 1, size = 2)
    sede = sede || (tipoobj == 1)
    x = 0
    y = 0
    for (j in 1:numpontos) {
      x[j] = readBin(zz, "numeric", 1, size = 4)
      y[j] = readBin(zz, "numeric", 1, size = 4)
    }
    # separate polygons
    xInic = x[1]
    yInic = y[1]
    for (j in 2:numpontos) {
      if (x[j] == xInic & y[j] == yInic) { x[j] = NA; y[j] = NA }
    }
    poli[[i]] = c(x, y)
    dim(poli[[i]]) = c(numpontos, 2)
  }
  class(poli) = "polylist"
  attr(poli, "region.id") = geocodigo
  attr(poli, "region.name") = nome
  attr(poli, "centroid") = list(x = xleg, y = yleg)
  attr(poli, "sede") = sede
  attr(poli, "maplim") = list(x = c(Oeste, Leste), y = c(Sul, Norte))
  close(zz)
  return(poli)
}
using original custom function
mp <- read.map("./se_mapas_2013/se_regsaud.MAP")
class(mp)
>[1] "polylist"
# plot
plot(attributes(mp)$maplim, type='n', asp=1, xlab=NA, ylab=NA)
title('Map')
lapply(mp, polygon, asp=T, col=3)
The problems were: use of readChar() with trailing nul bytes (changed to readBin()); 8-bit characters that rawToChar() would not accept (on my UTF-8 system); multiple slivers in some files that needed dropping; and some others. I added the edited read.map() function above to maptools, but with a different name and not exported. So now (with maptools rev 370 from https://r-forge.r-project.org/R/?group_id=943 when the build completes):
library(maptools)
o <- maptools:::readMAP2polylist("se_regsaud.MAP")
oo <- maptools:::.makePolylistValid(o)
ooo <- maptools:::.polylist2SpP(oo, tol=.Machine$double.eps^(1/4))
rn <- row.names(ooo)
df <- data.frame(ID=rn, row.names=rn, stringsAsFactors=FALSE)
res <- SpatialPolygonsDataFrame(ooo, data=df)
library(sf)
res_sf <- st_as_sf(res)
res_sf
plot(st_geometry(res_sf))
This approach re-uses the maptools code dating back almost twenty years, with minor edits to handle subsequent changes in reading binary files, and fixing slivers.
EDIT: looks like this doesn't work generally across all files so proper conversion to sf would need a deeper look.
Here's a quick stab at resurrection. It might be incorrect to cumulatively sum to get the multiple linestrings; I tested with se_municip.MAP and it only had NAs as the closing row of each ring. If the data potentially has non-connected multi-rings (multipolygons), this approach won't work completely.
x <- read.map("se_municip.MAP")
df <- setNames(as.data.frame(do.call(rbind, x)), c("x", "y"))
df$region.name <- rep(attr(x, "region.name"), unlist(lapply(x, nrow)))
## in case there are multi-rings
df$linestring_id <- cumsum(c(0, diff(is.na(df$x))))
df$polygon_id <- as.integer(factor(df$region.name))
df <- df[!is.na(df$x), ]
sfx <- sfheaders::sf_polygon(df, x = "x", y = "y", linestring_id = "linestring_id", polygon_id = "polygon_id", keep = TRUE)
#sf::st_crs(sfx) <- sf::st_crs(<whatever it is probably 4326>)
plot(sf::st_geometry(sfx), reset = FALSE)
maps::map(add = TRUE)
Interesting that you came across an official version of a forgotten legacy!
(BTW can I publish the data sets in a package?)
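Given the stated goal of converting to GeoPackage for geobr, the sf object produced by either answer can then be written straight out (a sketch; the output file name is arbitrary):
sf::st_write(res_sf, "se_regsaud.gpkg")  # or st_write(sfx, ...) for the sfheaders route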

R Interactive Sankey Diagram + Hierarchize Nodes

I am trying to visualize sequences of events by using Sankey diagrams.
I have a set of events (Event1 to Event16) over sequences of different lengths.
The steps of the sequences are denoted by T0, T0 - 1, T0 - 2 ...
The width of a flow corresponds to the frequency of that sequence.
I would like all the nodes corresponding to a given step to be aligned vertically.
By using the googleVis package I managed to obtain the following:
Sankey with GoogleVis
As you can see, some events T0-1, T0-2 and T0-3... are on the far right, instead of with the others of their time step.
It seems to be due to the fact that it is not possible to have nodes without children...
Do you know a way to hierarchize nodes and/or allow nodes without children in googleVis?
If not, do you know another R package that could provide these characteristics for interactive plots?
My R code is below. The main variable containing the sequences is a list of lists, see picture.
Data containing sequences
My code:
# Packages
library(googleVis)
library(dplyr)
library(reshape2)
library(tidyverse)
# Load
load("SeqCh")
# Loop -------------------------------------------------------------
# Inits
From = c()
To = c()
Freq = c()
Target = SeqCh
# Get maximum length of sequence
maxls = 0
for (kk in 1:length(Target)){
  temp = length(Target[[kk]])
  if (temp > maxls){
    maxls = temp
  }
}
# Loop on length of sequences
for (zz in 2:maxls){
  # Suffix to add to manage the same event repeated
  if (zz == 2){
    SufixFrom = "(T0)"
    SufixTo = "(T0 - 1)"
  } else {
    SufixFrom = paste("(T0 - ", as.character(zz-2), ")", sep = "")
    SufixTo = paste("(T0 - ", as.character(zz-1), ")", sep = "")
  }
  # Message
  cat("\n")
  print(paste(" Processing events from ", SufixFrom, " to ", SufixTo))
  # Loop on Target
  ind = lapply(Target, function(x) length(x) == zz)
  TargetSub = Target[unlist(ind)]
  FreqSub = Support[unlist(ind)]
  for (jj in 1:length(TargetSub)){
    temp = TargetSub[[jj]]
    TempFrom = paste(temp[zz-1], SufixFrom, sep = " ")
    TempTo = paste(temp[zz], SufixTo, sep = " ")
    From = c(From, TempFrom)
    To = c(To, TempTo)
    Freq = c(Freq, FreqSub[jj])
  }
} # end for loop on length of sequences
# All in same variable
Flows = data.frame("From" = From, "To" = To, "Occurence_Frequency" = Freq,
                   stringsAsFactors = FALSE)
# Plot --------------------------------------------------------------------
plot(gvisSankey(Flows, from = 'From', to = 'To', weight = "Occurence_Frequency",
                options = list(height = 900, width = 1800,
                               sankey = "{link:{color:{fill:'lightblue'}}}")))
Thanks, Romain.
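Not from googleVis, but one alternative worth sketching: plotly's Sankey trace accepts manual node coordinates, so every node of a given time step can be pinned to the same x column even if it has no children. A minimal example with invented nodes and frequencies (the labels and x/y values are assumptions, not your data):
library(plotly)
labels <- c("Event1 (T0)", "Event2 (T0)", "Event1 (T0 - 1)", "Event3 (T0 - 1)")
node_x <- c(0.01, 0.01, 0.5, 0.5)  # one x column per time step
node_y <- c(0.2, 0.8, 0.3, 0.7)
plot_ly(
  type = "sankey",
  arrangement = "snap",            # snap nodes to the supplied coordinates
  node = list(label = labels, x = node_x, y = node_y, pad = 15),
  link = list(source = c(0, 1),    # 0-based indices into labels
              target = c(2, 3),
              value  = c(10, 5)))  # flow widths (frequencies)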

starting a function in asynch mode in R, as a separate process

I am looking for the ability to start R processes asynchronously from within R.
Something like the function below:
startFunctionInAsynchMode <- function(workingDir, filesToSource, functionName, ...){
  # workingDir    - the dir that should be set as wd
  # filesToSource - vector of file names to be sourced
  # functionName  - the actual function to be run asynchronously
  # ...           - other parameters to be passed to the function
  # Return value  - should be the system process id that was started
}
Would anyone have quick ideas? I checked packages like parallel, but they don't seem to fit.
Thanks in advance
Here is an implementation using R CMD BATCH. The basic version is tested, with some open items remaining.
library(stringr) # str_split(), str_detect() used below
library(plyr)    # ldply() used below

startFunctionInAsynchMode <- function(workingDir, filesToSource, functionName, ...){
  wd <- getwd()
  setwd(workingDir)
  fs <- makeFiles()
  scriptFile <- fs$ScriptFile
  cat(file = scriptFile, paste0("source(\"", filesToSource, "\")", collapse = "\n"))
  cat(file = scriptFile, "\n", append = T)
  functionCall <- getFunctionCall(functionName, as.list(match.call()), startIndex = 5)
  cat(file = scriptFile, functionCall, append = T)
  commandsToRun <- paste0("(R CMD BATCH ", scriptFile, " ", fs$LogFile, " --slave ) &")
  print(commandsToRun)
  system(commandsToRun)
  Sys.sleep(5)
  pids <- getPids(scriptFile, "--restore")
  cat(file = fs$KillScript, paste0("kill -9 ", pids$PID[1]))
  setwd(wd)
  return(as.character(pids$PID[1]))
}

makeFiles <- function(){
  res <- list()
  dir.create("./temp/tempRgen", recursive = T, showWarnings = F)
  tf <- tempfile("rGen-", "./temp/tempRgen", fileext = "")
  res$ScriptFile <- paste0(tf, ".R")
  res$LogFile <- paste0(tf, ".log")
  res$KillScript <- paste0(tf, "-kill.sh")
  file.create(res$KillScript, showWarnings = F)
  file.create(res$ScriptFile, showWarnings = F)
  res
}

# Open items to be handled:
# 1. Named arguments
# 2. Non-string arguments
getFunctionCall <- function(functionName, argList, startIndex){
  res <- paste0(functionName, "(")
  if(!is.null(argList)){
    if(length(argList) >= startIndex){
      first <- T
      for(i in startIndex:length(argList)){
        if(first){
          first <- F
        } else {
          res <- paste0(res, ",")
        }
        res <- paste0(res, "\"", argList[[i]], "\"")
      }
    }
  }
  res <- paste0(res, ")")
  res
}

getPids <- function(grepFor, refineWith){
  numCols <- length(unlist(str_split(system("ps aux", intern = T)[1], "\\s+")))
  psOutput <- system(paste0("ps auxww | grep ", grepFor), intern = T)
  psOutput <- psOutput[str_detect(psOutput, refineWith)]
  pidDf <- ldply(psOutput, parseEachPsLine)
  # Remove the process that actually grep-ed for my search string
  pidDf <- pidDf[!str_detect(pidDf$COMMAND, "grep"), ]
  return(pidDf)
}

parseEachPsLine <- function(line){
  tabular <- read.table(textConnection(line), header = F, sep = " ")
  tabular <- tabular[!is.na(tabular)]
  psTitles <- c("USER", "PID", "CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND")
  psColNames <- setNames(seq(1, length(psTitles)), psTitles)
  COMMAND <- paste0(tabular[(psColNames["COMMAND"]):length(tabular)], collapse = " ")
  return(data.frame("PID" = tabular[psColNames["PID"]], "STARTED" = tabular[psColNames["START"]],
                    "COMMAND" = COMMAND, "STATUS" = tabular[psColNames["STAT"]]))
}
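A hypothetical usage example (helper.R and longJob are invented names standing in for your own script and function; the extra argument is passed through ... into the generated script):
# Start longJob("input.csv") in a detached R process and capture its PID
pid <- startFunctionInAsynchMode(workingDir = ".",
                                 filesToSource = "helper.R",
                                 functionName = "longJob",
                                 "input.csv")
# The generated *-kill.sh file under ./temp/tempRgen can later terminate it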

Huge data file and running multiple parameters and memory issue, Fisher's test

I have R code that I am trying to run on a server, but it stops in the middle/freezes, probably because of a memory limitation. The data files are massive (one has 20 million lines), and in the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data contains gene data for human and rat; human has 36 replicates while rat has 281, so the loop is basically 281*36 steps. What I want to do is process the data using the function getGeneType and see how different/independent the expression of different replicate combinations is, using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this:
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have a few questions on how to make this more efficient. When I run this code, R takes up lots of memory, which ultimately causes problems; I am wondering if there is any way of doing this more efficiently.
Another concern is the use of the double for loop. Would sapply help? If so, how should I apply it?
At the end I want to convert result into a csv file. I know it is a bit overwhelming to post code like this, but any optimization/efficiency advice would help A LOT! I really need to run the whole thing at least once to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
  capital <- toupper(x) ## capitalize
  paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
# HyRy : when some gene has reg, rev, REG, REV
# HnRy : when some gene has only reg, REG, REV
# HyRn : when some gene has only reg, rev, REG
# HnRn : when some gene has only reg, REG
# function to be used to aggregate
getGeneType <- function(types) {
  types <- as.character(types)
  if ('rev' %in% types) {
    return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
  } else {
    return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
  }
}
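To make the classification concrete, a small worked example with hypothetical inputs: a gene seen as reg and rev in the human replicate but only REG in the rat replicate contains 'rev' but not 'REV', so it is classified HyRn:
getGeneType(c("reg", "rev", "REG"))
#> [1] "HyRn"
getGeneType(c("reg", "REG", "REV"))  # no 'rev' in human, but 'REV' in rat
#> [1] "HnRy"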
# logical function to test whether x is integer(0); used in the loop below
# in case any one of the HxRx counts is zero
is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate",
                     pvalue = "p-value", alternative = "alternative_hypothesis",
                     Conf.int1 = "conf.int1", Conf.int2 = "conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])
    # merging the above two based on the one-to-one gene mapping defined in geneList
    mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
    mergedHumanData <- mergedHumanData[, c(1,2,4,5)] # rearrange columns
    mergedRatData <- mergedRatData[, c(2,1,4,5)] # rearrange columns
    mergedHumanRatData <- rbind(mergedHumanData, mergedRatData) # now the columns are "human", "rat", "alignment", "RNAtype"
    agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData, FUN = getGeneType) # aggregate into HxRx form
    HmRnTable <- table(agg$RNAtype) # table of the HxRx types, i.e. RNAtype in human and rat
    # now assign these counts to the HxRx variables; is.integer0 handles cases where
    # some form of HxRx is not present in the table
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
    contingencyTable <- matrix(c(HnRn, HnRy, HyRn, HyRy), nrow = 2)
    # contingencyTable layout:
    #   HnRn | HyRn
    #   HnRy | HyRy
    fisherTest <- fisher.test(contingencyTable)
    # make a new line out of the result of fisherTest
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName,
                              pvalue = fisherTest$p.value, alternative = fisherTest$alternative,
                              Conf.int1 = fisherTest$conf.int[1], Conf.int2 = fisherTest$conf.int[2],
                              oddratio = fisherTest$estimate[[1]])))
    result <- rbind(result, newLine) # append new line to result
    if(j %% 10 == 0) print(c(i, j))
  }
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.
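A minimal sketch of that change, replacing result <- rbind(result, newLine) inside the inner loop (newLine is built exactly as in the question):
# inside the double loop, after fisher.test():
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv", i, j)
write.table(newLine, file = csvFileName, row.names = FALSE,
            col.names = FALSE, sep = ",")
# afterwards, concatenate the per-iteration files outside R (e.g. with cat on Unix)
# and add a single header row at the top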
