How to use wdiff with diff as input? - unix

I am using wdiff to diff two files and I am trying to use the "-d --diff-input" option so I can stream in diff output as the document. Imagine the following two text files:
one.txt: My hovercraft is full of eeels.
two.txt: My hovercraft is full of slippery eels.
(The -123 flags suppress wdiff's normal word-by-word output and -s prints statistics.)
If I do wdiff -s123 one.txt two.txt I get:
one.txt: 6 words 5 83% common 0 0% deleted 1 17% changed
two.txt: 7 words 5 71% common 0 0% inserted 2 29% changed
This is what I'd expect. However, if I do: diff one.txt two.txt | wdiff -s123d I get:
(null): 17 words 16 94% common 1 6% deleted 0 0% changed
(null): 16 words 16 100% common 0 0% inserted 0 0% changed
From what I can tell from the docs and from googling, this is the intended use case and the two invocations should return the same output, but they obviously do not. Does anyone know what I am missing?
EDIT: I am using wdiff 1.1.2 on Mint/Ubuntu.
EDIT:
I missed the word "unified" in the man page. wdiff is expecting unified diff input, so I should be running diff -u one.txt two.txt | wdiff -s123d. That gives better results, but unfortunately the unified diff still has a two-line header which gets diffed.
(null): 15 words 11 73% common 0 0% deleted 4 27% changed
(null): 16 words 11 69% common 0 0% inserted 5 31% changed
So now the problem is how to get diff not to emit the header lines. Again I have googled and experimented with no results. Of course I could write a little script to strip those lines before wdiff; hopefully the result will still parse.
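For example, a minimal sketch of that stripping step (untested; it assumes only the two ---/+++ header lines need to go and that wdiff copes with the @@ hunk headers on its own):
diff -u one.txt two.txt | tail -n +3 | wdiff -s123d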
On a side note, because it uses unified diff output, this should work with git diff output as well.

Related

R and GNU Parallel - How to limit number of cores used

(New to GNU Parallel)
My aim is to run the same Rscript, with different arguments, over multiple cores. My first problem is to get this working on my laptop (2 real cores, 4 virtual), then I will port this over to one with 64 cores.
Currently:
I have an Rscript, "Test.R", which takes in arguments, does a thing (say, adds some numbers and writes the result to a file), then stops.
I have a "commands.txt" file containing the following:
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 100
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 200 1000
So this tells GNU Parallel to run Test.R using Rscript (which I installed via Anaconda).
In the terminal (after navigating to the desktop which is where Test.R and commands.txt are) I use the command:
parallel --jobs 2 < commands.txt
What I want this to do is use 2 cores and run the commands from commands.txt until all tasks are complete. (I have tried variations on this command, such as changing the 2 to a 1; in that case, 2 of the cores run at 100% and the other 2 run at around 20-30%.)
When I run this, all 4 cores go to 100% (as seen in htop), the first 2 jobs complete, and then no more jobs complete, despite all 4 cores still being at 100%.
When I run the same command on the 64-core computer, all 64 cores go to 100%, and I have to cancel the jobs.
Any advice on resources to look at, or what I am doing wrong would be greatly appreciated.
Bit of a long question, let me know if I can clarify anything.
The output from htop, as requested, while running the above command (sorted by CPU%):
1 [||||||||||||||||||||||||100.0%] Tasks: 490, 490 thr; 4 running
2 [|||||||||||||||||||||||||99.3%] Load average: 4.24 3.46 4.12
3 [||||||||||||||||||||||||100.0%] Uptime: 1 day, 18:56:02
4 [||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||5.83G/8.00G]
Swp[|||||||||| 678M/2.00G]
PID USER PRI NI VIRT RES S CPU% MEM% TIME+ Command
9719 user 16 0 4763M 291M ? 182. 3.6 0:19.74 /Users/user/anaconda3
9711 user 16 0 4763M 294M ? 182. 3.6 0:20.69 /Users/user/anaconda3
7575 user 24 0 4446M 94240 ? 11.7 1.1 1:52.76 /Applications/Utilities
8833 user 17 0 86.0G 259M ? 0.8 3.2 1:33.25 /System/Library/StagedF
9709 user 24 0 4195M 2664 R 0.2 0.0 0:00.12 htop
9676 user 24 0 4197M 14496 ? 0.0 0.2 0:00.13 perl /usr/local/bin/par
Based on the output from htop, the script /Users/name/anaconda3/lib/R/bin/Rscript uses more than one CPU thread (182%). You have 4 CPU threads, and since you run 2 Rscripts at a time we cannot tell whether a single Rscript would eat all 4 CPU threads if it ran by itself. Maybe it will eat all the CPU threads that are available (your test on the 64-core machine suggests this).
If you are using GNU/Linux you can limit which CPU threads a program can use with taskset:
taskset 9 parallel --jobs 2 < commands.txt
This should force GNU Parallel (and all its children) to only use CPU threads 1 and 4 (9 in binary: 1001). Running that should therefore limit the two jobs to two CPU threads only.
By using 9 (1001 binary) or 6 (0110 binary) we are reasonably sure that the two CPU threads are on two different cores. 3 (0011 binary) might refer to two threads on the same CPU core and would therefore probably be slower. The same goes for 5 (0101 binary).
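If juggling bit masks feels error-prone, taskset also accepts an explicit CPU list with -c. A sketch of the same pinning (taskset numbers CPUs from 0, so the mask 9 corresponds to CPUs 0 and 3):
taskset -c 0,3 parallel --jobs 2 < commands.txt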
In general you want to use as many CPU threads as possible as that will typically make the computation faster. It is unclear from your question why you want to avoid this.
If you are sharing the server with others, a better solution is to use nice. That way you can use all the CPU power that others are not using.
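For example (a sketch, not a tested recommendation): GNU Parallel defaults to roughly one job per CPU core when --jobs is not given, so running it at the lowest priority lets it soak up whatever CPU time is otherwise idle:
nice -n 19 parallel < commands.txt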

How to match ID in column in unix?

I am fully aware that similar questions may have been posted, but after searching it seems that the details of our questions are different (or at least I did not manage to find a solution that could be adapted to my case).
I currently have two files: "messyFile" and "wantedID". "messyFile" is of size 80,000,000 x 2,500, whereas "wantedID" is of size 1 x 462. On the 253rd line of "messyFile", there are 2500 IDs. However, all I want is the 462 IDs in the file "wantedID". Assuming that the 462 IDs are a subset of the 2500 IDs, how can I process the file "messyFile" so that it only contains information about the 462 IDs (i.e., of size 80,000,000 x 462)?
Thank you so much for your patience!
ps: Sorry for the confusion. But yeah, the question can be boiled down to something like this. In the 1st row of "File#1", there are 10 IDs. In the 1st row of "File#2", there are 3 IDs ("File#2" consists of only 1 line). The 3 IDs are a subset of the 10 IDs. Now, I hope to process "File#1" so that it contains only information about the 3 IDs listed in "File#2".
ps2: "messyFile" is a VCF file, whereas "wantedID" can be a text file (I said "can be" because it is small, so I can convert it to almost any format)
ps3: "File#1" should look something like this:
sample#1 sample#2 sample#3 sample#4 sample#5
0 1 0 0 1
1 1 2 0 2
"File#2" should look something like this:
sample#2 sample#4 sample#5
Desired output should look like this:
sample#2 sample#4 sample#5
1 0 1
1 0 2
For parsing VCF format, use bcftools:
http://samtools.github.io/bcftools/bcftools.html
Specifically for your task see the view command:
http://samtools.github.io/bcftools/bcftools.html#view
Example:
bcftools view -Ov -S 462sample.list -r chr:pos -o subset.vcf superset.vcf
You will need to get the position of the SNP to specify chr:pos above.
You can do this using DbSNP:
http://www.ncbi.nlm.nih.gov/SNP/index.html
Just make sure to match the genome build to the one used in the VCF file.
You can also use plink:
https://www.cog-genomics.org/plink2
But, PLINK is finicky about duplicated SNPs and other things, so it may complain unless you address these issues.
I've done what you are attempting in the past using the awk programming language. For your sanity, I recommend using one of the above tools :)
OK, I have no idea what a VCF file is, but if the File#1 and File#2 samples you gave are files containing tab-separated columns, this will work:
declare -a data=(`head -1 data.txt`)      # header row of File#1 (the big, tab-separated file)
declare -a header=(`head -1 header.txt`)  # the single row of wanted IDs (File#2)
declare fields
declare -i count                          # -i makes count=$count+1 arithmetic
for i in "${header[@]}" ; do
    count=0
    for j in "${data[@]}" ; do
        count=$count+1
        if [ "$i" == "$j" ] ; then
            fields=$fields,$count         # collect the matching column numbers
        fi
    done
done
cut -f ${fields:1} data.txt               # ${fields:1} drops the leading comma
If they aren't tab-separated values, it can perhaps be amended for the actual data format.
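As a rough, untested sketch of that amendment (assuming File#1 is saved as data.txt and File#2 as wanted.txt), awk can do the same column selection for general whitespace-separated data:
awk '
  NR == FNR { for (i = 1; i <= NF; i++) want[$i] = 1; next }             # remember the wanted IDs
  FNR == 1  { for (i = 1; i <= NF; i++) if ($i in want) keep[++n] = i }  # map IDs to column numbers
  { line = ""; for (k = 1; k <= n; k++) line = line (k > 1 ? OFS : "") $keep[k]; print line }
' wanted.txt data.txt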

How to monitor or print transfer state and speed for different percentage of SFTP file transfer with WinSCP?

I am automating an SFTP transfer, so I am running these commands:
open sftp://username:password@192.xxx.xxx.x/
# Change LOCAL directory
lcd "C:\Users\Desktop\"
# copy an individual file
put -nopermissions -preservetime "C:\Users\Desktop\xyz.webm" xyz.webm
and getting an output
C:\Users\Desktop\xyz.webm | 60734 KB | 3160.3 KB/s | binary | 100%
While transferring this, I want the same output at different percentages, i.e. I would like to know the size and throughput at 20%, 40%, 60% and so on.
Here I am getting one consolidated output line, but I would like the output in steps.
Is there a way to do it or a command to get output in steps?
Thanks
You had better use the WinSCP .NET assembly instead of plain scripting.
The assembly has the Session.FileTransferProgress event.
Handle the event to monitor FileTransferProgressEventArgs.FileProgress; when it exceeds one of your thresholds, read the state from FileTransferProgressEventArgs.FileName, .FileProgress and .CPS.
See the FileTransferProgressEventArgs class.

Downloading the entire Bitcoin transaction chain with R

I'm pretty new here so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create 2 tables
1) A full list of all Bitcoin addresses and their balances, e.g.:
| ID | Address | Balance |
-------------------------------
| 1 | 7d4kExk... | 32 |
| 2 | 9Eckjes... | 0 |
| . | ... | ... |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network
| ID | Sender | Receiver | Transactions |
--------------------------------------------------
| 1 | 7d4kExk... | klDk39D... | 2 |
| 2 | 9Eckjes... | 7d4kExk... | 3 |
| . | ... | ... | .. |
To do this I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far but I'm running into two main issues
1 - It's very slow... I can imagine it's going to take at least a week at the rate that it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using
library(XML)  # readHTMLTable() comes from the XML package

url_start <- "http://blockexplorer.com/b/"
url_end <- ""

readUrl <- function(url) {
  table <- try(readHTMLTable(url)[[1]])
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <- errors + 1
    return(NA)
  } else {
    processed <- processed + 1
    return(table)
  }
}
block_loop <- function(end, start = 0) {
  ...
  addr_row <- 1   # starting row to fill out table
  links_row <- 1  # starting row to fill out table
  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste(url_start, i, url_end, sep = "")
    table <- readUrl(url)
    if (is.na(table)) { next }
    ....
There are very close to 250,000 blocks on the site you mentioned (at least, 260,000 gives a 404). Curling from my connection (1 MB/s down) takes about half a second per page on average. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or a day and a half. This is the absolute best you can get using any methods because you have to request the page.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming 100 records per block's table (which seems about average), you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Even assuming you can fit this table in memory, concatenating the tables will be a problem.
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and Pandas to replace R's dataframe. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and Pandas is very quick too. Its dataframe class is modeled after R's, so you probably can work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.

What do the numbers in rsync's output mean?

When I run rsync with the --progress flag, I get information about the transfers as follows.
path/to/file
16 100% 0.01kB/s 0:00:01 (xfer#10857, to-check=427700/441502)
What do the numbers in the second row mean? I know what some of them are, but what do the others mean (marked with ??? below)?
16        ???
100%      amount of transfer completed in this file
0.01kB/s  speed of the current file transfer
0:00:01   time elapsed in the current file transfer
10857     count of files transferred
427700    ???
441502    ???
When the file transfer finishes, rsync replaces the progress line with a summary line that looks like this:
1238099 100% 146.38kB/s 0:00:08 (xfer#5, to-check=169/396)
In this example, the file was 1238099 bytes long in total, the average rate of transfer for the whole file was 146.38 kilobytes per second over the 8 seconds that it took to complete, it was the 5th transfer of a regular file during the current rsync session, and there are 169 more files for the receiver to check (to see if they are up-to-date or not) remaining out of the 396 total files in the file-list.
From http://samba.anu.edu.au/ftp/rsync/rsync.html, under the --progress switch.
path/to/file
16 100% 0.01kB/s 0:00:01 (xfer#10857, to-check=427700/441502)
The 16 is the number of bytes in this file transferred so far. The 100% is the percentage of the file transferred: 100% in this case. For very short files the kB/s figure often comes out a bit weird: small measuring errors cause big differences in the calculated overall speed. Then there is the total time. Then comes the transfer number: in the example given, only 10857 of the files checked so far needed to be transferred; based on the modification times, rsync decided that no transfer was needed for the others. Next there is the number of files left to check (427700) and the total (441502). Modern rsync implementations build the list that counts towards the "total" on the fly, only adding to it when the number of unchecked files drops below 1000.
