R: Memory Management during xmlEventParse of Huge (>20GB) Files

Building on this previous question (see here), I am attempting to read in many large XML files via xmlEventParse while saving node-varying data. I am working with this sample XML: https://www.nlm.nih.gov/databases/dtd/medsamp2015.xml.
The code below uses xpathSApply to extract the necessary values and a series of if statements to combine the values in a way that matches the unique value (PMID) to each of the non-unique values (LastName) within a record, for which there may be no LastNames. The goal is to write a series of small CSVs along the way (here, after every 1000 LastNames) to minimize the amount of memory used.
When run on the full-sized data set, the code successfully outputs files in batches, but something is still being stored in memory that eventually causes a system error once all RAM is used. I've watched the task manager while the code runs and can see R's memory use grow as the program progresses. If I stop the program mid-run and then clear the R workspace, including hidden items, the memory still appears to be in use by R. Only when I shut down R is the memory freed again.
Run this a few times yourself and you'll see R's memory usage grow even after clearing the workspace.
Please help! This problem appears to be common to others reading in large XML files in this manner (See for example comments in this question).
My code is as follows:
library(XML)

filename <- "~/Desktop/medsamp2015.xml"

tempdat <- data.frame(pmid = as.numeric(),
                      lname = character(),
                      stringsAsFactors = FALSE)
cnt <- 1

branchFunction <- function() {
  func <- function(x, ...) {
    v1 <- xpathSApply(x, path = "//PMID", xmlValue)
    v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
    print(cbind(c(rep(v1, length(v2))), v2))
    # below is where I store/write the temp data along the way
    # but even without doing this, memory is used (even after clearing)
    tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))), v2))
    if (nrow(tempdat) > 1000) {
      outname <- paste0("~/Desktop/outfiles", cnt, ".csv")
      write.csv(tempdat, outname, row.names = FALSE)
      tempdat <<- data.frame(pmid = as.numeric(),
                             lname = character(),
                             stringsAsFactors = FALSE)
      cnt <<- cnt + 1
    }
  }
  list(MedlineCitation = func)
}

myfunctions <- branchFunction()

# RUN
xmlEventParse(
  file = filename,
  handlers = NULL,
  branches = myfunctions
)

Here is an example: we have a launch script, invoke.sh, that calls an R script and passes the URL and filename as parameters. In this case, I had previously downloaded the test file medsamp2015.xml and put it in the ./data directory.
My sense would be to create a loop in the invoke.sh script and iterate through the list of target file names. For each file, you invoke a fresh R instance, download the file, process it, and move on to the next; one way to drive that loop from R is sketched just below.
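If you would rather drive that per-file loop from R than from bash, a rough equivalent is to launch one Rscript process per file with system2(), so each file is parsed in a fresh R session whose memory is returned to the OS when the process exits. This is only a sketch: the xml_files vector is a placeholder, and it reuses the 47162861.R script and base URL shown below.
# Hypothetical list of target XML files; each one is handled by its own
# short-lived R process running the 47162861.R script shown below.
xml_files <- c("medsamp2015.xml")
for (f in xml_files) {
  system2("Rscript",
          args = c("47162861.R",
                   "https://www.nlm.nih.gov/databases/dtd",  # base URL, as in invoke.sh
                   f))
}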
Caveat: I didn't check or change your function against any other download files or formats. I would turn off the printing of the output by removing the print() wrapper on line 62:
print( cbind(c(rep(v1, length(v2))), v2))
See runtime.txt for the printout.
The output .csv files are placed in the ./data directory.
Note: This is a derivative of a previous answer provided by me on this subject:
R memory not released in Windows. I hope it helps by way of example.
Launch Script
#!/usr/local/bin/bash -x

R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"
R File - 47162861.R
# Set working directory
projectDir <- "~/dev/stackoverflow/47162861"
setwd(projectDir)
# -----------------------------------------------------------------------------
# Load required Packages...
requiredPackages <- c("XML")
ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
# -----------------------------------------------------------------------------
# Load required Files
# trailingOnly=TRUE means that only your arguments are returned
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 0) {
  dataDir <- file.path(projectDir, "data")
  fileUrl <- args[1]
  fileName <- args[2]
} else {
  dataDir <- file.path(projectDir, "data")
  fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
  fileName <- "medsamp2015.xml"
}
# -----------------------------------------------------------------------------
# Download file
# Does the directory exist? If it doesn't, create it
if (!file.exists(dataDir)) {
  dir.create(dataDir)
}
# Now we check whether we have downloaded the data already; if not, we download it
if (!file.exists(file.path(dataDir, fileName))) {
  download.file(fileUrl, file.path(dataDir, fileName), method = "wget")
}
# -----------------------------------------------------------------------------
# Now we extract the data
tempdat <- data.frame(pmid = as.numeric(), lname = character(),
                      stringsAsFactors = FALSE)
cnt <- 1

branchFunction <- function() {
  func <- function(x, ...) {
    v1 <- xpathSApply(x, path = "//PMID", xmlValue)
    v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
    print(cbind(c(rep(v1, length(v2))), v2))
    # below is where I store/write the temp data along the way
    # but even without doing this, memory is used (even after clearing)
    tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))), v2))
    if (nrow(tempdat) > 1000) {
      outname <- file.path(dataDir, paste0(cnt, ".csv"))  # Create FileName
      write.csv(tempdat, outname, row.names = FALSE)      # Write File to created directory
      tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
                             stringsAsFactors = FALSE)
      cnt <<- cnt + 1
    }
  }
  list(MedlineCitation = func)
}
myfunctions <- branchFunction()
# -----------------------------------------------------------------------------
# RUN
xmlEventParse(file = file.path(dataDir, fileName),
              handlers = NULL,
              branches = myfunctions)
Test File and output
~/dev/stackoverflow/47162861/data/medsamp2015.xml
$ ll
total 2128
drwxr-xr-x# 7 hidden staff 238B Nov 10 11:05 .
drwxr-xr-x# 9 hidden staff 306B Nov 10 11:11 ..
-rw-r--r--# 1 hidden staff 32K Nov 10 11:12 1.csv
-rw-r--r--# 1 hidden staff 20K Nov 10 11:12 2.csv
-rw-r--r--# 1 hidden staff 23K Nov 10 11:12 3.csv
-rw-r--r--# 1 hidden staff 37K Nov 10 11:12 4.csv
-rw-r--r--# 1 hidden staff 942K Nov 10 11:05 medsamp2015.xml
Runtime Output
> ./invoke.sh > runtime.txt
+ R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
Loading required package: XML
File: runtime.txt

Related

Error: C stack usage is too close to the limit at R startup

Every time I open a new session in RStudio, I'm greeted with the error message:
Error: C stack usage 7953936 is too close to the limit
Based on suggestions for similar issues posted here and here, I tried using the ulimit command in the terminal, but still get the following error:
Isabels-MacBook-Pro ~ % ulimit -s
8176
Isabels-MacBook-Pro ~ % R --slave -e 'Cstack_info()["size"]'
Error: C stack usage 7954496 is too close to the limit
Execution halted
Yet, when I run ulimit on its own, I get:
Isabels-MacBook-Pro ~ % ulimit
unlimited
Just to double-check, I try setting it to unlimited again:
Isabels-MacBook-Pro ~ % ulimit -s unlimited
but then get a new error:
Isabels-MacBook-Pro ~ % R --slave -e 'Cstack_info()["size"]'
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Execution halted
I have no clue what this means in this context. Is Cstack_info() the bit getting stuck on infinite recursion? I'd love to get this figured out, as it's getting in the way of installing some necessary packages!
In case it's helpful, here's my session info
R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1
And contents of .Rprofile
# REMEMBER to restart R after you modify and save this file!
# First, execute the global .Rprofile if it exists. You may configure blogdown
# options there, too, so they apply to any blogdown projects. Feel free to
# ignore this part if it sounds too complicated to you.
if (file.exists("~/.Rprofile")) {
base::sys.source("~/.Rprofile", envir = environment())
}
# Now set options to customize the behavior of blogdown for this project. Below
# are a few sample options; for more options, see
# https://bookdown.org/yihui/blogdown/global-options.html
options(
  # to automatically serve the site on RStudio startup, set this option to TRUE
  blogdown.serve_site.startup = FALSE,
  # to disable knitting Rmd files on save, set this option to FALSE
  blogdown.knit.on_save = TRUE,
  # build .Rmd to .html (via Pandoc); to build to Markdown, set this option to 'markdown'
  blogdown.method = 'html'
)
# fix Hugo version
options(blogdown.hugo.version = "0.82.0")
Here are the contents of /Library/Frameworks/R.framework/Resources/library/base/R/Rprofile:
### This is the system Rprofile file. It is always run on startup.
### Additional commands can be placed in site or user Rprofile files
### (see ?Rprofile).
### Copyright (C) 1995-2020 The R Core Team
### Notice that it is a bad idea to use this file as a template for
### personal startup files, since things will be executed twice and in
### the wrong environment (user profiles are run in .GlobalEnv).
.GlobalEnv <- globalenv()
attach(NULL, name = "Autoloads")
.AutoloadEnv <- as.environment(2)
assign(".Autoloaded", NULL, envir = .AutoloadEnv)
T <- TRUE
F <- FALSE
R.version <- structure(R.Version(), class = "simple.list")
version <- R.version # for S compatibility
## for backwards compatibility only
R.version.string <- R.version$version.string
## NOTA BENE: options() for non-base package functionality are in places like
## --------- ../utils/R/zzz.R
options(keep.source = interactive())
options(warn = 0)
# options(repos = c(CRAN="#CRAN#"))
# options(BIOC = "http://www.bioconductor.org")
## setting from an env variable added in 4.0.2
local({to <- as.integer(Sys.getenv("R_DEFAULT_INTERNET_TIMEOUT", 60))
if (is.na(to) || to <= 0) to <- 60L
options(timeout = to)
})
options(encoding = "native.enc")
options(show.error.messages = TRUE)
## keep in sync with PrintDefaults() in ../../main/print.c :
options(scipen = 0)
options(max.print = 99999)# max. #{entries} in internal printMatrix()
options(add.smooth = TRUE)# currently only used in 'plot.lm'
if(isFALSE(as.logical(Sys.getenv("_R_OPTIONS_STRINGS_AS_FACTORS_",
"FALSE")))) {
options(stringsAsFactors = FALSE)
} else {
options(stringsAsFactors = TRUE)
}
if(!interactive() && is.null(getOption("showErrorCalls")))
options(showErrorCalls = TRUE)
local({dp <- Sys.getenv("R_DEFAULT_PACKAGES")
if(identical(dp, "")) ## it fact methods is done first
dp <- c("datasets", "utils", "grDevices", "graphics",
"stats", "methods")
else if(identical(dp, "NULL")) dp <- character(0)
else dp <- strsplit(dp, ",")[[1]]
dp <- sub("[[:blank:]]*([[:alnum:]]+)", "\\1", dp) # strip whitespace
options(defaultPackages = dp)
})
## Expand R_LIBS_* environment variables.
Sys.setenv(R_LIBS_SITE =
.expand_R_libs_env_var(Sys.getenv("R_LIBS_SITE")))
Sys.setenv(R_LIBS_USER =
.expand_R_libs_env_var(Sys.getenv("R_LIBS_USER")))
local({
if(nzchar(tl <- Sys.getenv("R_SESSION_TIME_LIMIT_CPU")))
setSessionTimeLimit(cpu = tl)
if(nzchar(tl <- Sys.getenv("R_SESSION_TIME_LIMIT_ELAPSED")))
setSessionTimeLimit(elapsed = tl)
})
.First.sys <- function()
{
for(pkg in getOption("defaultPackages")) {
res <- require(pkg, quietly = TRUE, warn.conflicts = FALSE,
character.only = TRUE)
if(!res)
warning(gettextf('package %s in options("defaultPackages") was not found', sQuote(pkg)),
call. = FALSE, domain = NA)
}
}
## called at C level in the startup process prior to .First.sys
.OptRequireMethods <- function()
{
pkg <- "methods" # done this way to avoid R CMD check warning
if(pkg %in% getOption("defaultPackages"))
if(!require(pkg, quietly = TRUE, warn.conflicts = FALSE,
character.only = TRUE))
warning('package "methods" in options("defaultPackages") was not found',
call. = FALSE)
}
if(nzchar(Sys.getenv("R_BATCH"))) {
.Last.sys <- function()
{
cat("> proc.time()\n")
print(proc.time())
}
## avoid passing on to spawned R processes
## A system has been reported without Sys.unsetenv, so try this
try(Sys.setenv(R_BATCH=""))
}
local({
if(nzchar(rv <- Sys.getenv("_R_RNG_VERSION_")))
suppressWarnings(RNGversion(rv))
})
.sys.timezone <- NA_character_
.First <- NULL
.Last <- NULL
###-*- R -*- Unix Specific ----
.Library <- file.path(R.home(), "library")
.Library.site <- Sys.getenv("R_LIBS_SITE")
.Library.site <- if(!nzchar(.Library.site)) file.path(R.home(), "site-library") else unlist(strsplit(.Library.site, ":"))
.Library.site <- .Library.site[file.exists(.Library.site)]
invisible(.libPaths(c(unlist(strsplit(Sys.getenv("R_LIBS"), ":")),
unlist(strsplit(Sys.getenv("R_LIBS_USER"), ":")
))))
local({
popath <- Sys.getenv("R_TRANSLATIONS", "")
if(!nzchar(popath)) {
paths <- file.path(.libPaths(), "translations", "DESCRIPTION")
popath <- dirname(paths[file.exists(paths)][1])
}
bindtextdomain("R", popath)
bindtextdomain("R-base", popath)
assign(".popath", popath, .BaseNamespaceEnv)
})
local({
## we distinguish between R_PAPERSIZE as set by the user and by configure
papersize <- Sys.getenv("R_PAPERSIZE_USER")
if(!nchar(papersize)) {
lcpaper <- Sys.getlocale("LC_PAPER") # might be null: OK as nchar is 0
papersize <- if(nchar(lcpaper))
if(length(grep("(_US|_CA)", lcpaper))) "letter" else "a4"
else Sys.getenv("R_PAPERSIZE")
}
options(papersize = papersize,
printcmd = Sys.getenv("R_PRINTCMD"),
dvipscmd = Sys.getenv("DVIPS", "dvips"),
texi2dvi = Sys.getenv("R_TEXI2DVICMD"),
browser = Sys.getenv("R_BROWSER"),
pager = file.path(R.home(), "bin", "pager"),
pdfviewer = Sys.getenv("R_PDFVIEWER"),
useFancyQuotes = TRUE)
})
## non standard settings for the R.app GUI of the macOS port
if(.Platform$GUI == "AQUA") {
## this is set to let RAqua use both X11 device and X11/TclTk
if (Sys.getenv("DISPLAY") == "")
Sys.setenv("DISPLAY" = ":0")
## this is to allow gfortran compiler to work
Sys.setenv("PATH" = paste(Sys.getenv("PATH"),":/usr/local/bin",sep = ""))
}## end "Aqua"
## de-dupe the environment on macOS (bug in Yosemite which affects things like PATH)
if (grepl("^darwin", R.version$os)) local({
## we have to de-dupe one at a time and re-check since the bug affects how
## environment modifications propagate
while(length(dupes <- names(Sys.getenv())[table(names(Sys.getenv())) > 1])) {
env <- dupes[1]
value <- Sys.getenv(env)
Sys.unsetenv(env) ## removes the dupes, good
.Internal(Sys.setenv(env, value)) ## wrapper requries named vector, a pain, hence internal
}
})
local({
tests_startup <- Sys.getenv("R_TESTS")
if(nzchar(tests_startup)) source(tests_startup)
})
Is there anything glaring here that could be causing the issue?
Your user .Rprofile file is loading itself recursively for some reason:
if (file.exists("~/.Rprofile")) {
base::sys.source("~/.Rprofile", envir = environment())
}
From your comments it seems that these lines are inside ~/.Rprofile (~ expands to the user home directory, i.e. /Users/mycomputer in your case, assuming mycomputer is your user name).
Delete these lines (or comment them out); they don't belong there. In fact, the file looks like a template for a project-specific .Rprofile configuration. It would make sense inside a project directory, but not as the user-wide ~/.Rprofile.
The logic for these files is as follows:
If there is an .Rprofile file in the current directory, R attempts to load that.
Otherwise, if the environment variable R_PROFILE_USER is set to the path of a file, R attempts to load this file.
Otherwise, if the file ~/.Rprofile exists, R attempts to load that.
Now, this implies that ~/.Rprofile is not loaded automatically if a project-specific (= in the current working directory) .Rprofile exists. This is unfortunate, so many projects add lines similar to the above to their project-specific .Rprofile files to cause the user-wide ~/.Rprofile to be loaded as well. However, the above implementation ignores the R_PROFILE_USER environment variable. A better implementation would therefore look as follows:
rprofile = Sys.getenv('R_PROFILE_USER', '~/.Rprofile')

if (file.exists(rprofile)) {
  base::sys.source(rprofile, envir = environment())
}

rm(rprofile)
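If you do want to keep sourcing a shared profile from project-level files, one extra safeguard (my own suggestion, not part of the original fix) is a session-level guard option so the user profile can never be sourced more than once, even if a stray copy of the snippet ends up in ~/.Rprofile again:
# Source the user profile at most once per session; the option name is arbitrary.
if (!isTRUE(getOption("user.rprofile.sourced"))) {
  options(user.rprofile.sourced = TRUE)
  rprofile <- Sys.getenv("R_PROFILE_USER", "~/.Rprofile")
  if (file.exists(rprofile)) {
    base::sys.source(rprofile, envir = environment())
  }
  rm(rprofile)
}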
Success! Thank you to everyone in the comments. The issue was resolved by deleting /Library/Frameworks/R.framework/Resources/library/base/R/Rprofile and re-installing R and RStudio.

How to fix C function R_nc4_get_vara_double returned error in ncdf4 parallel processing in R

I want to download NetCDF (.nc) data through OPeNDAP from remote storage. I use a parallel backend with a foreach %dopar% loop as follows:
# INPUTS
inputs=commandArgs(trailingOnly = T)
interimpath=as.character(inputs[1])
gcm=as.character(inputs[2])
period=as.character(inputs[3])
var=as.character(inputs[4])
source='MACAV2'
cat('\n\n EXTRACTING DATA FOR',var, gcm, period, '\n\n')
# CHANGING LIBRARY PATHS
.libPaths("/storage/home/htn5098/local_lib/R40") # local library for packages
setwd('/storage/work/h/htn5098/DataAnalysis')
source('./src/Rcodes/CWD_function_package.R') # Calling the function Rscript
# CALLING PACKAGES
library(foreach)
library(doParallel)
library(parallel)
library(filematrix)
# REGISTERING CORES FOR PARALLEL PROCESSING
no_cores <- detectCores()
cl <- makeCluster(no_cores)
registerDoParallel(cl)
invisible(clusterEvalQ(cl,.libPaths("/storage/home/htn5098/local_lib/R40"))) # Really have to import library paths into the workers
invisible(clusterEvalQ(cl, c(library(ncdf4))))
# EXTRACTING DATA FROM THE .NC FILES TO MATRIX FORM
url <- readLines('./data/external/MACAV2_OPENDAP_allvar_allgcm_allperiod.txt')
links <- grep(x = url,pattern = paste0('.*',var,'.*',gcm,'_.*',period), value = T)
start=c(659,93,1) # lon, lat, time
count=c(527,307,-1)
spfile <- read.csv('./data/external/SERC_MACAV2_Elev.csv',header = T)
grids <- sort(unique(spfile$Grid))
clusterExport(cl,list('ncarray2matrix','start','count','grids')) #exporting data into clusters for parallel processing
cat('\nChecking when downloading all grids\n')
# k <- foreach(x = links,.packages = c('ncdf4')) %dopar% {
# nc <- nc_open(x)
# nc.var=ncvar_get(nc,varid=names(nc$var),start=start,count=count)
# return(nc.var)
# nc_close(nc)
# }
k <- foreach(x = links, .packages = c('ncdf4'), .errorhandling = 'pass') %dopar% {
  nc <- nc_open(x)
  print(nc)
  nc.var <- ncvar_get(nc, varid = names(nc$var), start = c(659,93,1), count = c(527,307,-1))
  nc_close(nc)
  return(dim(nc.var))
  Sys.sleep(10)
}
# k <- parSapply(cl,links,function(x) {
# nc <- nc_open(x)
# nc.var=ncvar_get(nc,varid=names(nc$var),start=start,count=count)
# nc_close(nc)
# return(nc.var)
# })
print(k)
However, I keep getting this error:
<simpleError in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset, scaleFact, start = start, count = count, verbose = verbose, signedbyte = signedbyte, collapse_degen = collapse_degen): C function R_nc4_get_vara_double returned error>
What could be the reason for this problem? Can you recommend a solution for this that is time-efficient (I have to repeat this for about 20 files)?
Thank you.
I had the same error in my code. The problem was not the code itself; it was one of the files that I wanted to read. It had something wrong with it, so R couldn't open it. I identified the file and downloaded it again, and the same code worked perfectly.
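A minimal sketch of one way to pin down such a broken file before running the parallel loop: open and read each link sequentially inside tryCatch() and record the failures (this reuses the links vector and the start/count values from the question; everything else is an assumption):
library(ncdf4)
# Try each URL sequentially and record which ones fail to open or read,
# so the offending file can be re-downloaded or excluded.
bad <- character(0)
for (x in links) {
  msg <- tryCatch({
    nc <- nc_open(x)
    v <- ncvar_get(nc, varid = names(nc$var),
                   start = c(659, 93, 1), count = c(527, 307, -1))
    nc_close(nc)
    "ok"
  }, error = function(e) conditionMessage(e))
  if (!identical(msg, "ok")) {
    bad <- c(bad, x)
    message("Failed: ", x, " -- ", msg)
  }
}
print(bad)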
I also encountered the same error. For me, restarting the R session did the trick.

How to apply rma() normalization to a unique CEL file?

I have implemented an R script that performs batch correction on a gene expression dataset. To do the batch correction, I first need to normalize the data in each CEL file through the Affy rma() function of Bioconductor.
If I run it on the GSE59867 dataset obtained from GEO, everything works.
I define a batch as the data collection date: I put all the CEL files having the same date into a specific folder, and then consider that date/folder as a specific batch.
On the GSE59867 dataset, a batch/folder contains only 1 CEL file. Nonetheless, the rma() function works on it perfectly.
But if I try to run my script on another dataset (GSE36809), I run into trouble: when I apply the rma() function to a batch/folder containing only 1 file, I get the following error:
Error in `colnames<-`(`*tmp*`, value = "GSM901376_c23583161.CEL.gz") :
attempt to set 'colnames' on an object with less than two dimensions
Here's my specific R code, to let you understand.
You first have to download the file GSM901376_c23583161.CEL.gz:
setwd(".")
options(stringsAsFactors = FALSE)
fileURL <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM901nnn/GSM901376/suppl/GSM901376%5Fc23583161%2ECEL%2Egz"
fileDownloadCommand <- paste("wget ", fileURL, " ", sep="")
system(fileDownloadCommand)
Library installation:
source("https://bioconductor.org/biocLite.R")
list.of.packages <- c("easypackages")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
listOfBiocPackages <- c("oligo", "affyio","BiocParallel")
bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")
if( length(bioCpackagesNotInstalled) ) {
biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}
library("easypackages")
libraries(list.of.packages)
libraries(listOfBiocPackages)
Application of rma()
thisFileDate <- "GSM901376_c23583161.CEL.gz"
thisDateRawData <- read.celfiles(thisFileDate)
thisDateNormData <- rma(thisDateRawData)
After the call to rma(), I get the error.
How can I solve this problem?
I also tried to skip this normalization by saving the thisDateRawData object directly. But then I have the problem that I cannot combine this thisDateRawData (which is an ExpressionFeatureSet) with the outputs of rma() (which are ExpressionSet objects).
(EDIT: I extensively edited the question, and added a piece of R code you should be able to run on your pc.)
Hmm. This is a puzzling problem. The oligo::rma() function might be buggy for class GeneFeatureSet with single samples. I got it to work with a single sample by using lower-level functions, but it means I also had to create the expression set from scratch by specifying the slots:
# source("https://bioconductor.org/biocLite.R")
# biocLite("GEOquery")
# biocLite("pd.hg.u133.plus.2")
# biocLite("pd.hugene.1.0.st.v1")
library(GEOquery)
library(oligo)
# # Instead of using .gz files, I extracted the actual CELs.
# # This is just to illustrate how I read in the files; your usage will differ.
# projectDir <- "" # Path to .tar files here
# setwd(projectDir)
# untar("GSE36809_RAW.tar", exdir = "GSE36809")
# untar("GSE59867_RAW.tar", exdir = "GSE59867")
# setwd("GSE36809"); gse3_cels <- dir()
# sapply(paste(gse3_cels, sep = "/"), gunzip); setwd(projectDir)
# setwd("GSE59867"); gse5_cels <- dir()
# sapply(paste(gse5_cels, sep = "/"), gunzip); setwd(projectDir)
#
# Read in CEL
#
# setwd("GSE36809"); gse3_cels <- dir()
# gse3_efs <- read.celfiles(gse3_cels[1])
# # Assuming you've read in the CEL files as a GeneFeatureSet or
# # ExpressionFeatureSet object (i.e. gse3_efs in this example),
# # you can now fit the RMA and create an ExpressionSet object with it:
exprsData <- basicRMA(exprs(gse3_efs), pnVec = featureNames(gse3_efs))
gse3_expset <- new("ExpressionSet")
slot(gse3_expset, "assayData") <- assayDataNew(exprs = exprsData)
slot(gse3_expset, "phenoData") <- phenoData(gse3_efs)
slot(gse3_expset, "featureData") <- annotatedDataFrameFrom(attr(gse3_expset,
'assayData'), byrow = TRUE)
slot(gse3_expset, "protocolData") <- protocolData(gse3_efs)
slot(gse3_expset, "annotation") <- slot(gse3_efs, "annotation")
Hopefully the above approach will work in your code.
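As for the follow-up point about mixing object types: once every batch has been turned into an ExpressionSet as above, the per-batch sets can be merged with Biobase's combine(). A sketch only; gse5_expset stands in for a second batch processed the same way, and combine() assumes the sets share featureNames and annotation:
library(Biobase)
# Merge two ExpressionSet objects sample-wise (gse5_expset is hypothetical here).
merged_expset <- combine(gse3_expset, gse5_expset)
dim(exprs(merged_expset))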

Read in large text file in chunks

I'm working with limited RAM (AWS free tier EC2 server - 1 GB).
I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read this file into memory.
So I was researching ways of reading it in chunks. I know that the dimensions of the resulting data frame should be 300K x 300. If I were able to read in the file, e.g., 10K lines at a time, and then save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit a little more slowly and with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces, reading in each piece, and saving it as a data frame and then as RDS? Or any other alternatives?
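(As an aside on the read_tsv_chunked() attempt above: the "bad 'file' argument" error most likely comes from handing the integer chunk position straight to saveRDS(), which expects a file path. A minimal sketch of the chunk-to-RDS idea described above, treating the file as space-delimited and using an assumed file-naming scheme:)
library(readr)
# Read vector.txt 10,000 rows at a time, skipping the "299567 300" header line,
# and write each chunk to its own RDS file.
read_delim_chunked(
  "vector.txt",
  delim = " ",
  skip = 1,
  col_names = FALSE,
  chunk_size = 10000,
  callback = function(x, pos) saveRDS(x, paste0("vectors_chunk_", pos, ".rds"))
)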
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
                       every_nlines,
                       table_name,
                       dbname = sub("\\.txt$", ".sqlite", tsv),
                       ...) {
  # Prepare reading
  con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
  init <- TRUE
  fill_sqlite <- function(df) {
    if (init) {
      RSQLite::dbCreateTable(con, table_name, df)
      init <<- FALSE
    }
    RSQLite::dbAppendTable(con, table_name, df)
    NULL
  }
  # Read and fill by parts
  bigreadr::big_fread1(tsv, every_nlines,
                       .transform = fill_sqlite,
                       .combine = unlist,
                       ... = ...)
  # Returns
  con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
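Once the table is in SQLite (by either route), it can be queried lazily from R without ever loading the full 300K x 300 matrix. A small sketch, assuming the vector.sqlite file and vectors table produced by the csv2sqlite() call above, and that the dbplyr backend is installed:
library(DBI)
library(dplyr)
# Connect to the SQLite file and treat the table as a lazy dplyr tbl;
# rows are only pulled into memory when collect() is called.
con <- dbConnect(RSQLite::SQLite(), "vector.sqlite")
vectors <- tbl(con, "vectors")
first_rows <- vectors %>%
  head(10) %>%
  collect()
dbDisconnect(con)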

Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I'm Attempting to Do
I'm attempting to download several weather data files from the US National Climatic Data Centre's FTP server but am running into problems with an error message after successfully completing several file downloads.
After successfully downloading two station/year combinations, I start getting a "530 Not logged in" error. I've tried starting at the offending year and running from there and get roughly the same results: it downloads a year or two of data and then stops with the same error message about not being logged in.
Working Example
Following is a working example (or not) with the output truncated and pasted below.
options(timeout = 300)

ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()

station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
             "984260-41231", "984290-99999", "984300-99999", "984320-99999",
             "984330-99999")
years <- 1960:2016

for (i in years) {
  remote_file_list <- RCurl::getURL(
    paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
    crlf = TRUE, ssl.verifypeer = FALSE)
  remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]
  file_list <- paste0(station, "-", i, ".op.gz")
  file_list <- file_list[file_list %in% remote_file_list]
  file_list <- paste0(ftp, i, "/", file_list)
  Map(function(ftp, dest) utils::download.file(url = ftp,
                                               destfile = dest, mode = "wb"),
      file_list, file.path(td, basename(file_list)))
}
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes
...
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes
Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  cannot download all files
In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
  URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
  status was '530 Not logged in'
Different Methods and Ideas I've Tried but Haven't Yet Been Successful
So far I've tried slowing the requests down using Sys.sleep in a for loop, and every other manner of retrieving the files more slowly by opening then closing connections, etc. It's puzzling because: i) it works for a bit and then stops, and it's not related to a particular year/station combination per se; ii) I can use nearly the exact same code to download much larger annual files of global weather data without any errors over a long run of years like this; and iii) it doesn't always stop at the same point; sometimes it stops after 1961 going into 1962, sometimes after 1960 going into 1961, etc., but it does seem to fail consistently between years, not within them, from what I've found.
The login is anonymous, but you can use userpwd "ftp:your#email.address". So far I've been unsuccessful in using that method to ensure that I was logged in to download the station files.
I think you're going to need a more defensive strategy when working with this FTP server:
library(curl) # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar
# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
ssl_verifypeer=FALSE, ftp_response_timeout=30)
# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)
# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6
# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {
  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_memory(url, handle=handle)
    if (!is.null(res$result)) return(res$result)
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }
}
# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {
  # you should prbly be a bit more thorough than `basename` since
  # i think there are issues with the 1971 and 1972 filenames.
  # Gotta leave some work up to the OP
  cache_file <- sprintf("%s/%s", cache_dir, basename(url))
  if (file.exists(cache_file)) return()
  i <- 0
  repeat {
    i <- i + 1
    if (i==6) { stop("Too many retries...server may be under load") }
    res <- s_curl_fetch_disk(url, cache_file)
    if (!is.null(res$result)) return()
  }
}
# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
"984260-41231", "984290-99999", "984300-99999", "984320-99999",
"984330-99999")
years <- 1960:2016
# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {
  # the year we're working on
  year_url <- sprintf(ftp_base, yr)
  # fetch the directory listing
  tmp <- retry_cfm(year_url, handle=dir_list_handle)
  con <- rawConnection(tmp$content)
  fils <- readLines(con)
  close(con)
  # sift out only the target stations
  map(station, ~grep(., fils, value=TRUE)) %>%
    keep(~length(.)>0) %>%
    flatten_chr() -> fils
  # grab the stations files
  walk(paste(year_url, fils, sep=""), retry_cfd)
  # tick off progress
  pb$tick()$print()
})
You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.
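On the "530 Not logged in" point from the question: explicit anonymous credentials can also be attached to the curl handles used above via the standard libcurl userpwd option. A sketch only; the email address is a placeholder:
library(curl)
# Same directory-listing handle as above, but with anonymous credentials
# supplied explicitly; the download handles can be given the same option.
dir_list_handle <- new_handle(ftp_use_epsv = FALSE, dirlistonly = TRUE, crlf = TRUE,
                              ssl_verifypeer = FALSE, ftp_response_timeout = 30,
                              userpwd = "anonymous:your@email.address")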

Resources