Powershell, R, Import-Csv, select-object, Export-csv - r

I'm performing several tests using different approaches for cleaning a big csv file and then importing it into R.
This time I'm playing with Powershell in Windows.
While things work well and most accurate than when using cut() with pipe(), the process is horribly slow.
This is my command:
shell(shell = "powershell",
"Import-Csv In.csv |
select-object col1, col2, etc |
Export-csv new.csv")
And these are the system.time() results:
user system elapsed
0.61 0.42 1568.51
I've seen some other posts that use C# via streaming taking couple of dozens of seconds, but I don't know C#.
My question is, how can improve the PowerShell command in order to make it faster?
Thanks,
Diego

There's a fair amout of overhead in reading in the csv, converting the rows to powershell objects, and the converting back to csv. Doing it through the pipeline that way also causes it to do this one record at a time. You should be able to speed that up considerably if you switch to using Get-Content with a -ReadCount parameter, and extracting your data using a regular expression in a -replace operator, e.g.:
shell(shell = "powershell",
"Get-Content In.csv -ReadCount 1000 |
foreach { $_ -replace '^(.+?,.+?),','$1' |
Add-Content new.csv")
This will reduce the number if disk reads, and the -replace will be functioning as an array operator, doing 1000 records at a time.

First and foremost, my first test was wrong in the sense that due some errors I had before, several other sessions of powershell remained open and delayed the whole process.
These are the real numbers:
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.09 0.05 824.53
Now, as I said I couldn't find a good pattern for splitting the columns of my csv file.
I maybe need to add that it is a messy file, with fields containing commas, multiline fields, summary lines, and so on.
I tried other approaches, like one very famous in stack overflow that uses embedded C# code in PowerShell for splitting csv files.
While it works faster than the more common approach I showed previously, results are not accurate for these types of messy files.
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.01 0.00 212.96
Both approaches showed similar RAM consumption (~40Mb), and CPU usage (~50%) most of the time.
So while the former approach took 4 times the amount of the later, the accuracy of the results, the low cost in terms of resources, and the lesser developing time make me consider it the most efficient for big and messy csv files.

Related

Rascal MPL get line count of file without loading its contents

Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc whether some of this info might be saved in conjunction with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory counting lines is always linear in the size of the file so you won't win much time.

How to delete a row in a csv file with powershell in R?

Good morning,
I'm new about powershell and I'd like to ask you if somebody can help me.
I have a big csv file around 3.5gb and my goal is to load it with fread (a data.table function) in R environment, but this function makes a error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried to use different way (I putted in my code fill=true, but doesn't work) to solve the problem, but I couldn't do it.
After different researches I found this kind of solution (always to do in R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The focus about the use of powershell in R is to delete a specific row and to load with fread the file.
My question is:
How I can delete a specific row in a csv in powershell but without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit of manipulation, it can be done before reading it into R.
A file as large as 3,5 GiB is a bit on the large side to read in memory as such. Sure, it can be done in the days of 64 bit systems, but for simple row processing it's unwieldy. A scalable solution uses .Net methods and row-by-row approach.
To process a file on row-by-row basis, use .Net methods for efficient row reading. A StringBuilder is created to store rows that contain valid data, others are discarded. The StringBuilder is flushed on disk every so often. Even on days of SSDs, a write operation for each row is relatively slow in respect to writing in a bulk of, say, 10 000 rows a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$colonCount = 30
while($null -ne ($line = $reader.ReadLine())) {
# Split the line on semicolons
$elements = $line -split ';'
# If there were $colonCount elements, add those to builder
if($elements.count -eq $colonCount) {
# If $line's contents need modifications, do it here
# before adding it into the builder
[void]$sb.AppendLine($line)
++$i
}
# Write builder contents into file every now and then
if($i -ge $MaxRows) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
[void]$sb.Clear()
$i = 0
}
}
# Flush the builder after the loop if there's data
if($sb.Length -gt 0) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
}
This is easy done in powershell: Read csv in generic list, remove line and write back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = #()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow - in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow due to decorating each line read with additional properties (see green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of use of the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).

How can I download multiple objects from S3 simultaneously?

I have lots (millions) of small log files in s3 in with its name (date/time) helping to define it i.e. servername-yyyy-mm-dd-HH-MM. e.g.
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
...
s3://my_bucket/uk4339-2015-05-07-19-23.csv
s3://my_bucket/uk4339-2015-05-07-19-24.csv
...
etc
From EC2, using the AWS CLI, I would like to simultaneously download all files that are have the minute equal 16 for 2015, for all only server uk4339 and uk4338
Is there a clever way to do this?
Also if this is a terrible file structure in s3 to query data, I would be extremely grateful for any advice on how to set this up better.
I can put a relevant aws s3 cp ... command into a loop in a shell/bash script to sequentially download the relevant files but, was wondering if there was something more efficient.
As an added bonus I would like to row bind the results together too as one csv.
A quick example of a mock csv file can be generated in R using this line of R code
R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE)
The csv that is created is uk4339-2015-05-07-19-24.csv. FYI, I will be importing the combined data into R at the end.
As you didn't answer my questions, nor indicate what OS you use, it is somewhat hard to make any concrete suggestions, so I will briefly suggest you use GNU Parallel to parallelise your S3 fetch requests to get around the latency.
Suppose you somehow generate a list of all the S3 files you want and put the resulting list in a file called GrabMe.txt like this
s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
Then you can get them in parallel, say 32 at a time, like this:
parallel -j 32 echo aws s3 cp {} . < GrabMe.txt
or if you prefer reading left-to-right
cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} .
You can obviously alter the number of parallel requests from 32 to any other number. At the moment, it just echoes the command it would run, but you can remove the word echo when you see how it works.
There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.

R writing to stdout very slow. Any ways to improve?

I am writing a simple command-line Rscript that reads some binary data and outputs it to as a stream of numeric characters. The data is of specific format and R has a very quick library to deal with the binary files in question. The file (of 7 million characters) is read quickly - in less than a second:
library(affyio)
system.time(CEL <- read.celfile("testCEL.CEL"))
user system elapsed
0.462 0.035 0.498
I want to write a part of read data to stdout:
str(CEL$INTENSITY$MEAN)
num [1:6553600] 6955 225 7173 182 148 ...
As you can see it's numeric data with ~6.5 million integers.
And the writing is terribly slow:
system.time(write(CEL$INTENSITY$MEAN, file="TEST.out"))
user system elapsed
8.953 10.739 19.694
(Here the writing is done to a file, but doing it to standard output from Rscript takes the same amount of time"
cat(vector) does not improve the speed at all. One improvement I found is this:
system.time(writeLines(as.character(CEL$INTENSITY$MEAN), "TEST.out"))
user system elapsed
6.282 0.016 6.298
It is still a far cry from the speed it got when reading the data in (and it read 5 times more data than this particular vector). Moreover I have an overhead of transforming the entire vector to character before I can proceed. Plus when sinking to stdout I cannot terminate the stream with CTRL+C if by accident I fail to redirect it to file.
So my question is - is there a faster way to simply output numeric vector from R to stdout?
Also why is reading the data in so much faster than writing? And this is not only for binary files, but in general:
system.time(tmp <- scan("TEST.out"))
Read 6553600 items
user system elapsed
1.216 0.028 1.245
Binary reads are fast. Printing to stdout is slow for two reasons:
formatting
actual printing
You can benchmark / profile either. But if you really want to be "fast", stay away from formatting for printing lots of data.
Compiled code can help make the conversion faster. But again, the fastest solution will to
remain with binary
not write to stdout out or file (but use eg something like Redis).

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated, however the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or any other method of finding out / countin how many times each line is repetead inside a large file (a ~10 GB CSV-like file) ?
Thanks for any help!
Are you sure you're running out of the Memory (RAM?) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of diskspace for sort to create it temporary files. Also recall that diskspace used to sort is usually in /tmp or /var/tmp.
So check out your available disk space with :
df -g
(some systems don't support -g, try -m (megs) -k (kiloB) )
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
The help says
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means can combine a bunch of -T=/dr1/ -T=/dr2 ... to get to your 10GB*sortFactor space or not. My experience was that it only used the last dir in the list, so try to use 1 dir that is big enough.
Also, note that you can go to the whatever dir you are using for sort, and you'll see the acctivity of the temporary files used for sorting.
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
. 1) Read the FAQs
. 2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
. 3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
. 4) As you receive help, try to give it too, answering questions in your area of expertise
There are some possible solutions:
1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes
2 - Can / Want to remove the duplicate lines, leaving just one occurence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the files but with some criteria? Supposing all your lines begin with letters:
- break your original_file in 20 smaller files: grep "^a*$" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.

Resources