To speed up reading I'm setting colClasses; my readfile function looks like the following:
readfile = function(name, save = 0, rand = 1)
{
  # read a few rows first to infer the column classes, then read the sampled file
  tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
  classes <- sapply(tab5rows, class)
  data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)),
                     header = TRUE, colClasses = classes, sep = ",")
  if (save == 1)
  {
    out = paste(name, "Rdata", sep = ".")  # note: 'name', not 'file', which is undefined here
    save(data, file = out)
  }
  else
  {
    data
  }
}
Contents of myscript.sh:
#!/bin/sh
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1)print $0; else if(rand() < prob) print $0;}'
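For reference, a call might look like this (rand here is the keep-probability handed to the awk script; the file name is just an example):
d <- readfile("a.csv", save = 0, rand = 0.1)  # keep roughly 10% of the rows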
As an extension to this, I needed to read the file incrementally. Say the file had 10 lines at 10 am and 100 lines at 11 am; I needed those newly added 90 lines plus the header (without which I would not be able to do further processing in R). I changed the readfile function to use the command:
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, ")<", name, "| ./myscript.sh", rand)),
                   header = TRUE, colClasses = classes, sep = ",")
Here skip gives me the number of lines to be tailed (calculated by some other script; let's say I have these already). I call this function readfileIncrementally.
a, b, c, d are CSV files, each with 18 columns. Now I run this inside a for loop, say for i in a b c d.
The four files have different values of skip; let's say skip = 10,000 for a and 20,000 for b. If I run these individually (not in the for loop), it works fine. But inside the loop it gives me an error in scan: line n does not have 18 columns. Usually this happens when the skip value is greater than about 3000.
However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv; it confirms that every line surely has 18 columns.
It looks like a timing issue to me. Is there any way to give R the required amount of time before moving on to the next file, or is there something I'm missing? Run individually, it works fine (though it takes a few seconds).
The following worked for me:
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ")<", name, "| ./myscript.sh", rand)),
                   header = TRUE, colClasses = classes, sep = ",")
Basically, the last line was not yet completely written by the time R was reading the file, hence the error that line n didn't have 18 columns. Making it read one line less works fine for me.
Apart from this I didn't find any R feature to overcome such scenarios.
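For completeness, a minimal sketch of how readfileIncrementally might look, assembled from the pieces above (skip is assumed to be computed elsewhere, as described):
readfileIncrementally = function(name, skip, save = 0, rand = 1)
{
  tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
  classes <- sapply(tab5rows, class)
  # header + last 'skip' lines, dropping the (possibly partial) final line
  cmd <- paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ")<", name, "| ./myscript.sh", rand)
  data <- read.table(pipe(cmd), header = TRUE, colClasses = classes, sep = ",")
  if (save == 1) save(data, file = paste(name, "Rdata", sep = ".")) else data
}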
Related
Let me explain my problem. I want to measure the scalability of an algorithm on my data set (thousands of rows). For this, I want to subset the data set, increasing the size of the subsets by 500 rows each time (so the 1st subset has 500 rows, the 2nd 1000 rows, the 3rd 1500 rows, and so on).
I will use Slurm and the SLURM_ARRAY_TASK_ID environment variable to do this. This is my R code:
# load packages
library(SpiecEasi)
library(optparse)
args <- commandArgs(trailingOnly = F)
# get options
option_list = list(
  make_option(c("-s", "--subset"), type = "integer", default = NULL,
              help = "Number of rows to keep in the subset")  # integer so it can index rows directly
);
opt_parser = OptionParser(usage = "Usage: %prog -s [SUBSET SIZE]",
                          option_list = option_list,
                          description = "Description:")
opt = parse_args(opt_parser)
# main code
print('Load matrix')
data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv",
                   h = T, row.names = 1, sep = "\t")
print('Subset matrix')
data = data[1:opt$subset, ]
#print(data)
print('Transpose')
data = t(data)
#print(data)
se_gl <- spiec.easi(data, method = 'glasso', lambda.min.ratio = 1e-2, nlambda = 20)
size = format(object.size(se_gl), units = "Gb")
print(size)
######!!!!######
save(se_gl, file = "/home/vipailler/PROJET_M2/data/se_gl.RData")
My problem is this: if I use 5 array tasks to measure the scalability of the spiec.easi algorithm (so from 500 to 2500 rows), I would like it to create 5 different se_gl variables. As it stands, my last command line will only save the last result (2500 rows) and will overwrite the 4 others.
So, how can I create 5 different variables from the same se_gl variable? I know that with Slurm this code will be executed 5 times (if I set up 5 array tasks), but the problem is my last command line...
Some help?
Bests
You have several options. Since you mention Slurm, you will probably want to just modify the filename in order to keep the solution scalable:
save(se_gl, file = sprintf("/home/vipailler/PROJET_M2/data/se_gl_%s.RData", opt$subset))
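If the subset size is derived from the Slurm array index rather than passed on the command line, a minimal sketch (assuming task 1 maps to 500 rows, task 2 to 1000 rows, and so on):
# read the array index from the environment so every array task writes its own file
task_id     <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
subset_size <- task_id * 500
save(se_gl, file = sprintf("/home/vipailler/PROJET_M2/data/se_gl_%d.RData", subset_size))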
I have 50 R scripts for which I need to change the same line. Is there a way to do it for all of them at the same time, instead of one by one using "find" and "replace"?
Loop through the files, read each one line by line (readLines gives a character vector), update the Nth row, and write the output to a new file:
lapply(list.files(path = ".", pattern = "\\.R$", full.names = TRUE),
       function(i) {
         x <- readLines(i)
         # if we want, for example, to change the 6th row:
         x[6] <- "# MY NEW LINES"
         # then write the output
         write(x, file = paste0("myCleanOutput/folder/path/", basename(i)))
       })
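If the line is easier to identify by its content than by its position, a variant of the same idea (a sketch; the old_setting pattern and replacement value are purely hypothetical):
lapply(list.files(path = ".", pattern = "\\.R$", full.names = TRUE),
       function(i) {
         x <- readLines(i)
         # replace every line that starts with the (hypothetical) old setting
         x[grepl("^old_setting <-", x)] <- "old_setting <- \"new value\""
         writeLines(x, con = file.path("myCleanOutput/folder/path", basename(i)))
       })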
But if all the R scripts are the same, maybe use Passing command line arguments to R CMD BATCH and keep only one R script file which takes arguments.
I have a massive (8 GB) dataset which I am simply unable to read into R with my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing the total number of rows in the dataset; (2) my method was not a true "random sample."
These attempts to get the number of rows have failed (they take as long as simply reading the data in):
length(count.fields("file.dat", sep = "|"))
read.csv.sql("file.dat", header = FALSE, sep = "|",
             sql = "select count(*) from file")
Is there any way via R or some other program to generate a random sample from a large underlying dataset?
Potential idea: is it possible, given a "sample" of the first several rows, to get a sense of the average amount of information contained per row, and then back out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ball-park figure that I could just undercut.
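As a rough sketch of that idea (file name as above; the 1,000-line sample size is arbitrary):
# estimate the row count from the average number of bytes per row
bytes_total   <- file.size("file.dat")
sample_lines  <- readLines("file.dat", n = 1000)
bytes_per_row <- mean(nchar(sample_lines, type = "bytes")) + 1  # +1 for the newline
est_rows      <- floor(bytes_total / bytes_per_row)             # ball-park only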
Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. Using this option we can run a gawk script to extract the required lines. Note that you may need to install gawk if it is not already on your system; if you have awk instead, you can use that.
First let's create a dummy file to test on:
library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)
Now we can use the shell command wc to find how many lines there are in the file:
nl = read.table(pipe("wc -l test.csv"))[[1]]
Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.
N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N)) #start sample at line 2 to exclude header
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")
Now we are ready to read in the sample using fread and gawk (based on this answer). You can also try some of the other gawk scripts in the linked question, which may be more efficient on very large data.
dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")
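As a quick sanity check on the result (using the names above):
nrow(dt.sample)  # 20, one row per sampled line number
head(dt.sample)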
I am trying to run a simple 5-line command over 9000 different files. I wrote the following for loop:
# packages providing drop.tip(), phylo4d() and abouheif.moran()
library(ape)
library(phylobase)
library(adephylo)

setwd("/Users/morandin/Desktop/Test")
output_file <- "output_file.txt"
files <- list.files("/Users/morandin/Desktop/Test")
for (i in files) {
  chem.w.in <- scan(i, sep = ",")
  pruned.tree <- drop.tip(mytree, which(chem.w.in %in% NA))
  plot(pruned.tree)
  pruned.tree.ja.chem.w.in <- phylo4d(pruned.tree, c(na.omit(chem.w.in)))
  plot(pruned.tree.ja.chem.w.in)
  out <- abouheif.moran(pruned.tree.ja.chem.w.in)
  print(out)
}
Hey, I am editing my question: the above code now runs the for loop perfectly (thanks for all your help). I am still having an issue with the output.
I can redirect the entire output from R through bash commands, but I would need the name of the processed file as well. My output looks like this:
class: krandtest
Monte-Carlo tests
Call: as.krandtest(sim = matrix(res$result, ncol = nvar, byrow = TRUE),
obs = res$obs, alter = alter, names = test.names)
Number of tests: 1
Adjustment method for multiple comparisons: none
Permutation number: 999
Test Obs Std.Obs Alter Pvalue
1 dt 0.1458514 0.7976225 greater 0.2
other elements: adj.method call
Is there a way to print the Pvalue results together with the name of the file (element i)?
Thanks
Since Paul Hiemstra's answer covered #1, here's an answer to #2, assuming that by "answers" you mean "the printed output of abouheif.moran(pruned.tree.ja.chem.w.in)".
Use cat() with the argument append = TRUE. For example:
output_file = "my_output_file.txt"
for (i in files) {
  # do stuff
  # make plots
  out <- abouheif.moran(pruned.tree.ja.chem.w.in)
  # capture the printed krandtest output as text so it can be written with cat()
  out_text <- paste(capture.output(print(out)), collapse = "\n")
  cat(sprintf("-------\n %s:\n-------\n%s\n\n", i, out_text),
      file = output_file, append = TRUE)
}
This will produce a file called my_output_file.txt that looks like:
-------
file_1:
-------
output_goes_here
-------
file_2:
-------
output_goes_here
Obviously the formatting is entirely up to you; I just wanted to demonstrate what could be done here.
An alternative solution would be to sink() the entire script, but I'd rather be explicit about it. A middle road might be to sink() just a small piece of the code, but except in extreme cases it's a matter of preference or style.
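If you only need the Pvalue and the file name rather than the whole printout, a minimal sketch for the body of the same loop (assuming the krandtest object returned by abouheif.moran() stores its p-values in a pvalue component; check str(out) to confirm):
out <- abouheif.moran(pruned.tree.ja.chem.w.in)
cat(sprintf("%s\t%s\n", basename(i), paste(out$pvalue, collapse = "\t")),
    file = output_file, append = TRUE)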
I suspect what is going wrong here is that list.files() by default returns only the names of the files, not the entire path to each file. Setting full.names to TRUE will fix this issue. Note that you then do not have to paste the directory onto the filename yourself, as list.files() already returns the full path to each existing file.
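For example:
files <- list.files("/Users/morandin/Desktop/Test", full.names = TRUE)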
I have a series of measurements in several files. Every file looks like this:
1 151.973938 144.745789 152.21991 17:57:14
2 151.995697 144.755737 152.21991 17:57:14
3 152.015747 144.765076 152.21991 17:57:14
.
.
.
I'm looking for a way to compute the average of the same field across several files. At the end of the process I would like to have a graph of the averaged measurement.
Is that possible with gnuplot? I wasn't able to find a fitting option in gnuplot by myself. If not, which other way to achieve this would you recommend?
Best regards, Juuro
You cannot do it all in gnuplot; gnuplot is limited to processing columns from one file at a time, so you need some other utility to preprocess your data. Assuming the data is in the format you demonstrated (with spaces instead of semicolons), the script below averages the second, third and fourth columns across files and prints lines with the first and fifth columns taken as-is and the middle columns averaged. Run this bash script in a directory containing only the .txt files you want to process.
#!/bin/bash
# number of .txt files to average over
sum=$(ls *.txt | wc -l)
# paste all files side by side (5 columns per file) and average columns 2-4 across files
paste -d" " *.txt | nawk -v s="$sum" '{
    for (i = 0; i <= s-1; i++)
    {
        t1 = 2 + (i*5)
        temp1 = temp1 + $t1
        t2 = 3 + (i*5)
        temp2 = temp2 + $t2
        t3 = 4 + (i*5)
        temp3 = temp3 + $t3
    }
    print $1" "temp1/s" "temp2/s" "temp3/s" "$5
    temp1 = 0
    temp2 = 0
    temp3 = 0
}'
From inside gnuplot, you can run the script like this:
!myscript.sh > output.out
plot 'output.out' u 1:2 # or whatever columns you need
or like this:
plot '<myscript.sh' u 1:2
(code inspired by what I found here.)
I don't think it's possible with gnuplot alone. I would first write a script which does the averaging and prints the results to stdout. Say this script is called average.py:
plot '<average.py FILE1 FILE2 FILE3' w l
The script average.py could look like this, for example:
#!/usr/bin/python
from numpy import loadtxt, mean, ones
import sys

# number of files:
nrfiles = len(sys.argv[1:])

# load the first file just to get the dimensions
data = loadtxt(str(sys.argv[1]))
rows = data.shape[0]
cols = data.shape[1]

# initialize array (one layer per file)
all_data = ones((int(nrfiles), int(rows), int(cols)))

# load all files:
n = 0
for fname in sys.argv[1:]:
    data = loadtxt(str(fname))
    all_data[n, :, 0:cols] = data
    n = n + 1

# calculate the mean across files:
mean_all = mean(all_data, axis=0)

# print to stdout
for i in range(rows):
    a = ''
    for j in range(cols):
        a = a + str('%010.5f ' % mean_all[i, j])
    print(a)
A limitation of this script is that all files must have the same data structure (the same number of rows and columns).