Error reading in large file using shell/bash script? - r

I have a text file that has 185405149 lines and a header. I am reading in this file within this bash script:
#!/bin/bash
#PBS -N R_Job
#PBS -l walltime=4:00:00
#PBS -l vmem=20gb
module load R/4.2.1
cd filepath/
R --save -q -f script.R
Part of the script is below:
# import the gtex data
gtex_data <- read.table("/filepath/file.txt", header=TRUE)
I get the error: Error: cannot allocate vector of size 2.0 Gb
Execution halted.
It's got nothing to do with the directory/filepath. I suspect it's to do with memory. Even after zipping the file, e.g. file.txt.gz, and using the command:
gtex_data <- read.table(gzfile("/filepath/file.txt.gz"), header=TRUE)
It doesn't read the data.
I've tried with a smaller file, e.g. reading the first 100 lines of file.txt, renaming it, and loading it, and that works fine.
I've even tried to increase vmem, but I'm not sure what else to do. I would be grateful for advice/help.
I've also checked the size of the file.
ls -lh file.txt
-rw-r--r-- 1 ... 107M Oct 26 16:50 file.txt
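For reference, one approach that often helps with files of this size (a sketch, assuming the data.table package is available on the cluster and that the site allows a larger vmem request) is to raise the memory request in the PBS header and read the file with data.table::fread, which typically needs much less memory than read.table:
#PBS -l vmem=40gb   # in the job script: request more memory than the current 20gb
# in script.R: use fread instead of read.table
library(data.table)
gtex_data <- fread("/filepath/file.txt", header = TRUE)
# if only some columns are needed, reading just those cuts memory further;
# the column names below are placeholders
# gtex_data <- fread("/filepath/file.txt", select = c("SNP", "gene"))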

Related

How can I use system or system2 if the command has pipe to head like "cmd | head"

I noticed that when I run a long command in Linux (I am using a CentOS 7.3 distro, R 4.0.3, on the terminal) and pipe it to head, only the first lines of output are shown to me (and the command stops)
ls -R /opt # on my system I would get tons of output for 10s of seconds
ls -R /opt | head # just get the top 5 and command is stopped straight away
When I try the equivalent in R, I cannot get the same behaviour:
system(command = "ls -R /opt | head") # will take a long time (I assume the time for ls -R /opt to finish)
Is there a way for me to get the same behaviour in R as the one I get on my system command line?
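One way to get the same effect from inside R (a sketch, not necessarily a fix for system() itself) is to open the command as a connection with pipe() and read only the first lines; closing the connection should then stop ls the next time it tries to write:
con <- pipe("ls -R /opt")              # start the command as a connection
first_lines <- readLines(con, n = 10)  # read only the first 10 lines
close(con)                             # closing the pipe terminates the child process
print(first_lines)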

Running R file using sh/bash

I have created below .sh file to run R code saved in separate .R file.
cat EE.sh
#!/bin/bash
VARIABLES=( 20190719 20190718 )
for i in "${VARIABLES[@]}"; do
VARIABLENAME=$i
/usr/lib/R/bin/Rscript -e 'source("/home/EER.R")'
done
Basically, what it is expected to do is take the dates from VARIABLES, pass them to /home/EER.R, and R will do its execution based on the passed date (after correct formatting).
Then I ran the code below:
sudo chmod a+rx EE.sh
and
sudo bash EE.sh
But I then get the error message below.
sudo bash EE.sh
EE.sh: line 2: $'\r': command not found
EE.sh: line 3: $'\r': command not found
EE.sh: line 4: $'\r': command not found
Can anyone help me to resolve this issue?
I am using Ubuntu 18 with R version 3.4.4 (2018-03-15)
This problem looks to be related to carriage returns (which come when we copy text from a Windows machine to a Unix machine), so to identify them use:
cat -v Input_file
If you see carriage returns in your file then try:
tr -d '\r' < Input_file > temp && mv temp Input_file
Once they are removed then try to run your program.
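Note also that, as written, the loop never passes $i to the R script. A minimal sketch of one way to do that (assuming EER.R is changed to read its date from the command line) is:
#!/bin/bash
VARIABLES=( 20190719 20190718 )
for i in "${VARIABLES[@]}"; do
# pass the date as a command-line argument instead of relying on a shell variable
/usr/lib/R/bin/Rscript /home/EER.R "$i"
done
# and inside EER.R:
# args <- commandArgs(trailingOnly = TRUE)
# run_date <- args[1]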

GNUPlot cannot be executed after mpirun command in PBS script

I have a PBS script something like this:
#PBS -N marcell_single_cell
#PBS -l nodes=1:ppn=1
#PBS -l walltime=20000:00:00
#PBS -e stderr.log
#PBS -o stdout.log
# Specific the shell types
#PBS -S /bin/bash
# Specific the queue type
#PBS -q dque
#uncomment this if you want to debug the process
#set -vx
cd $PBS_O_WORKDIR
ulimit -s unlimited
NPROCS=`wc -l < $PBS_NODEFILE`
#export PATH=$PBS_O_PATH
echo This job has allocated $NPROCS nodes
echo Cleaning old files...
rm -rf *.png *.plt *.log
echo Cleaning success
/opt/Lib/openmpi-2.1.3/bin/mpirun -np $NPROCS /scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It got an error something like this, shown in the PBS error log:
/var/spool/torque/mom_priv/jobs/6265.node01.SC: line 28: gnuplot: command not found
I've already made sure that the path of GNUPlot has been added to the PATH environment variable.
However, the strange part is that if I interchange the sequence of the commands, i.e. gnuplot first and then mpirun, there isn't any error. I suspect that some commands after mpirun need some special configuration, but I don't know how to do that.
I already tried following this solution, but to no avail:
sleep command not found in torque pbs but works in shell
EDITED:
It seems that both before and after mpirun I still get the error, and this is the which result:
which: no gnuplot in (/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/bin/intel64:/opt/pgi/linux86-64/9.0-4/bin:/opt/openmpi/bin:/usr/kerberos/bin:/prog/tools/grace/grace/bin:/home/prog/ansys_inc/v121/fluent/bin:/bin:/usr/bin:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/opt/intel/composer_xe_2011_sp1.9.293/mpirt/bin/intel64:/scratch7/feber/jdk1.8.0_101:/scratch7/feber/code/apache-maven/bin:/usr/local/bin:/scratch7/cml/bin)
It's strange, since when I try to find gnuplot, there is one in /usr/local/bin:
ls -l /usr/local/bin/gnuplot
-rwxr-xr-x 1 root root 3262113 Sep 18 2017 /usr/local/bin/gnuplot
Moreover, if I run those commands without PBS, they execute as I expected:
/scratch4/marcell/CellMLSimulator/bin/CellMLSimulator -ionmodel grandi2010 -solverType CVode -irepeat 4 -dt 0.01
gnuplot -p plotting.gnu
It's very likely that your system has different "login/head nodes" and "compute nodes". This is a commonly used practice in many supercomputing clusters. While you build and launch your application from the head node, it gets executed on one or more compute nodes.
The compute nodes can have different hardware and software compared to the head nodes. In your case, gnuplot is installed only on the head node, as you can see from the different outputs of which gnuplot. To solve this, you have three approaches:
Request the system administrators to install gnuplot on the compute nodes.
Build and install your own version of gnuplot in a file-system accessible from the compute nodes. It could be your home directory or somewhere else depending on your cluster. In general, the filesystem where your application is will be available. In your case, anywhere under /scratch4/marcell/ would probably work.
Run gnuplot on the head node after the MPI job finishes, as a post-processing step. PBS/Torque does not provide a direct way to do this, so you'll need to write a separate bash (not PBS) script to drive it, as sketched below.
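A minimal sketch of that third approach (assuming the PBS script above is saved as run_sim.pbs, and that qstat exits non-zero once the job has left the queue, which is the usual Torque behaviour):
#!/bin/bash
# plain bash script, run on the head node (not submitted with qsub)
JOBID=$(qsub run_sim.pbs)           # submit the MPI job
while qstat "$JOBID" >/dev/null 2>&1; do
sleep 60                            # wait until the job is no longer in the queue
done
gnuplot -p plotting.gnu             # post-process on the head node, where gnuplot works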

Rscript not working in qsub cluster

I have two Rscripts named iHS.hist.R and Fst.hist.R. I know both scripts work. When I use the following commands in my directory in my Ubuntu terminal, I get a histogram plot for each script (two total if I do both scripts):
module load R
Rscript iHS.hist.R
or I could do Rscript Fst.hist.R
The point is I know they both work.
The problem is that each Rscript takes about 20 minutes to run because my data is pretty big. And unfortunately it's only going to get bigger. I have access to a cluster and I would like to make use of that. I have created two .sh scripts to send to the cluster with qsub, but I am running into issues. Here is my iHS.hist.sh script for my iHS.hist.R script.
#PBS -N iHS.plots
#PBS -S /bin/bash
#PBS -l walltime=2:00:00
#PBS -l nodes=1:ppn=8
#PBS -l mem=4gb
#PBS -o $HOME/${PBS_JOBNAME}.o${PBS_JOBID}.log
#PBS -e $HOME/${PBS_JOBNAME}.e${PBS_JOBID}.err
###############related commands
###edit it
#code in qsub
###############cut columns we don't need
###
cut -f1,2,3,4 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/matched_snps_annotated.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.matched_snps_annotated.txt
cut -f1,2 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/input_snps_insufficient_matches.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.input_snps_insufficient_matches.txt
###
###############only needed columns remain
cd /group/stranger-lab/ebeiter
module load R
Rscript iHS.hist.R
The cuts in the beginning are for setting up the data in the right format.
I have tried qsub iHS.hist.sh and it gives me a job. I check on it, and after about 10 minutes it finishes. So I'm assuming it's running my Rscript. I check the error file and it's empty. I check the log file and it does not give me the usual null device 1 that I get after my jpeg is completed in my Rscript. I don't get the output jpeg file for the Rscript when the cluster job is done. I do get the output jpeg file if I just run the Rscript on its own, as at the top of this. Any idea what is going on?
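One way to narrow this down (a diagnostic sketch, not a confirmed fix; the plot file extensions below are assumptions): R's console output only reaches the PBS log if you capture it, so redirecting it and echoing the working directory shows whether the script actually ran and where the plot was written. For example, the last lines of iHS.hist.sh could be changed to something like:
cd /group/stranger-lab/ebeiter
module load R
echo "Running in: $(pwd)"                 # confirm the working directory
Rscript iHS.hist.R > iHS.hist.Rout 2>&1   # capture R's console output, including errors
echo "Rscript exit status: $?"
ls -l *.jpeg *.jpg 2>/dev/null            # check whether the plot file was produced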

Installing Pear, what did I do by entering these commands on my terminal

I'm trying to figure out how to install Pear on my Mac (10.6.6).
Not understanding what they're telling me at pear.php.net, I got some code from http://clickontyler.com/blog/2008/01/how-to-install-pear-in-mac-os-x-leopard/
First, I entered curl http://pear.php.net/go-pear > go-pear.php in my terminal.
It resulted in this output
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 88004 100 88004 0 0 47537 0 0:00:01 0:00:01 --:--:-- 59744
What does that all mean? Am I on the right track?
Next, I entered sudo php -q go-pear.php
and it gave me the long output below. In short I have no idea where I am in the installation process. However, I'm pretty sure that I'm not where I'm supposed to be at following the tutorial at http://clickontyler.com/blog/2008/01/how-to-install-pear-in-mac-os-x-leopard/
because the tutorial tells me to select all the default choices, and I don't see any options to select.
The next line of code is asking me to modify the php.ini file, and it requires a password, so I'm worried about doing it... Can anyone tell me if I'm on the right track?
sudo cp /etc/php.ini.default /etc/php.ini
Usage: php [options] [-f] <file> [--] [args...]
php [options] -r <code> [--] [args...]
php [options] [-B <begin_code>] -R <code> [-E <end_code>] [--] [args...]
php [options] [-B <begin_code>] -F <file> [-E <end_code>] [--] [args...]
php [options] -- [args...]
php [options] -a
-a Run interactively
-c <path>|<file> Look for php.ini file in this directory
-n No php.ini file will be used
-d foo[=bar] Define INI entry foo with value 'bar'
-e Generate extended information for debugger/profiler
-f <file> Parse and execute <file>.
-h This help
-i PHP information
-l Syntax check only (lint)
-m Show compiled in modules
-r <code> Run PHP <code> without using script tags <?..?>
-B <begin_code> Run PHP <begin_code> before processing input lines
-R <code> Run PHP <code> for every input line
-F <file> Parse and execute <file> for every input line
-E <end_code> Run PHP <end_code> after processing all input lines
-H Hide any passed arguments from external tools.
-s Output HTML syntax highlighted source.
-v Version number
-w Output source with stripped comments and whitespace.
-z <file> Load Zend extension <file>.
args... Arguments passed to script. Use -- args when first argument
starts with - or script is read from stdin
--ini Show configuration file names
--rf <name> Show information about function <name>.
--rc <name> Show information about class <name>.
--re <name> Show information about extension <name>.
--ri <name> Show configuration for extension <name>.
php does not have a -q argument. It's also mentioned in go-pear.php (http://pear.php.net/go-pear) itself, but I don't know what it wants to tell me. However, try
sudo php go-pear.php
and then follow the instructions.
Update:
-q was used to start the interpreter in "quiet" mode. It seems that this option does not exist anymore, because php always starts "quiet", but it should not cause an error anyway. Now make sure you are in the same directory as the file go-pear.php before you call php go-pear.php.
The first part shows that you successfully downloaded the file to go-pear.php.
The second part is showing that -q isn't a valid option. The third part is asking for the root password, since you're doing 'sudo'.
I used this, though I wasn't installing on Mac:
Getting and installing the PEAR package manager
