Awk, compare 2 files and write out incomparables - unix

I usually use
awk 'BEGIN {FS=" "} NR==FNR {arr[$1]=$0; next} $1 in arr {print arr[$1] FS $0}' infile1 infile2 > outfile
to extract the records common to 2 files based on a field of interest. But this time I need the incomparables. The 2 files have the same number of lines, but 200 lines in the second file do not have the same coding as in file1.
I tried:
paste f1 f2 | sort -n -k1,2
sorting on both fields, hoping to then test $1==$2 and keep the unequal rows, but I don't get $1==$2 even where I should.
How can I do this?

Since you seem to compare by the first field, and since I don't know what your data files look like, I am going to attempt this blindly:
$ cat data1.txt
dana 100
john 101
fiona 102
$ cat data2.txt
dana 100
john 501
fiona 102
$ cat data[12].txt|sort|uniq -u
john 101
john 501
The above prints the lines that do not appear identically in both files; a record whose coding differs shows up as a pair of lines, one from each file. Since I don't fully understand your data files, I am going to ask: does the following solve your problem?
diff data1.txt data2.txt
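If you specifically want the rows of the second file whose key exists in the first file but whose record differs, a small variant of your usual awk one-liner should do it (a minimal sketch, assuming space-separated fields and a unique key in column 1; data1.txt and data2.txt are the sample files above):
awk 'NR==FNR {arr[$1]=$0; next} ($1 in arr) && arr[$1] != $0' data1.txt data2.txt
On the sample data this prints john 501, i.e. the record whose coding differs from file1.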

Related

List a count of similar filenames in a folder

I have a folder which contains multiple files with similar names (differentiated by a datetime in the filename).
I would like to be able to get a count of each file group/type in that folder.
i.e.
file1_25102019_111402.csv
file1_24102019_111502.csv
file1_23102019_121402.csv
file1_22102019_101402.csv
file2_25102019_161404.csv
file2_24102019_131205.csv
file2_23102019_121306.csv
I need to be able to return something like this:
file1 4
file2 3
Ideally the answer would be something like "count of files whose first x characters are ABCD"
The filenames can be anything. The date part in the example was just to demonstrate that the filenames begin with similar text but are distinguished by "something" further along in the name (in this case a date).
So I want to be able to group them by the first X characters in the filename,
i.e. I want to be able to say, "give me counts of all files grouped by the first 4 characters, or the first 5 characters, etc."
In SQL I'd do something like this:
select substr(object_name,1,5),
count(*)
from all_objects
group by substr(object_name,1,5)
Edited to show more examples:
File1weifwoeivnw
File15430293fjwnc
File15oiejfiwem
File2sidfsfe
File29fu09f4n
File29ewfoiwwf
File22sdiufsnvfvs
Pseudo Code:
Example 1:
ls count of first 4 characters
Output
File 7
Example 2:
ls count of first 5 characters
Output
File1 3
File2 4
Example 3
ls count of first 6 characters
Output
File1w 1
File15 2
File2s 1
File29 2
File22 1
If you want to extract the first 5 characters you can use
ls | cut -c1-5 | sort | uniq -c |awk '{ print $2,$1 }'
which prints for the first example from the question
file1 4
file2 3
If you want to have a different number of characters, change the cut command as necessary, e.g. cut -c1-6 for the first 6 characters.
If you want to separate the fields with a TAB character instead of a space, change the awk command to
awk -vOFS=\\t '{ print $2,$1 }'
This would result in
file1	4
file2	3
Other solutions that work with the first example (file names with a date and time string), but not with the additional examples added later:
With your first example files, the command
ls | sed 's/_[0-9]\{8\}_[0-9]\{6\}/_*/' | sort | uniq -c
prints
4 file1_*.csv
3 file2_*.csv
Explanation:
The sed command replaces the sequence of a _, 8 digits, another _ and another 6 digits with _*.
With your first example file names, you will get file1_*.csv 4 times and file2_*.csv 3 times.
sort sorts the lines.
uniq -c counts the number of subsequent lines that are equal.
Or if you want to strip everything from the first _ up to the end, you can use
ls | sed 's/_.*//' | sort | uniq -c
which will print
4 file1
3 file2
You can add the awk command from the first solution to change the output format.
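If you want the prefix length as a parameter, much like the substr()/group by in your SQL example, an awk one-liner can do the grouping in a single pass (a minimal sketch; n=5 is just an example value):
ls | awk -v n=5 '{count[substr($0, 1, n)]++} END {for (p in count) print p, count[p]}' | sort
The trailing sort only orders the groups; drop it if you don't care about the order.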

how to remove bad lines using cmd in data.table?

I have a csv.gz file whose content looks like:
bogusfile <- '1,2,3
1,2,3
2,,3
1,2,3,4
1,2,3
1,2,3
1,2,3'
I know there are only 3 columns, but sometimes an extra bogus 4th column pops up and messes up my parsing with fread.
Fortunately there is a cmd argument in fread. How can we use it to discard all the lines that contain more than 2 commas (these would be the offending rows with extra columns)?
Something like fread(cmd = ' linux magic to clean myfile.csv.gz')?
I was not able to make it work.
What do you think?
Thanks!
data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "[^,]*,[^,]*,[^,]*," Noobie.txt')
# V1 V2 V3
# 1: 1 2 3
# 2: 1 2 3
# 3: 2 NA 3
# 4: 1 2 3
# 5: 1 2 3
# 6: 1 2 3
I had to use grep -E instead of egrep because of Windows, and I had to specify the full path to grep because Rtools is not in my default PATH. If you are on something other than Windows, you should be able to shorten this to fread(cmd="egrep -v ..."). (And make sure you are in the correct directory or provide a relative/absolute path to the file.)
The regex "[^,]*,[^,]*,[^,]*," is a bit literal; it can be shortened to "([^,]*,){3,}", which says
([^,]*,) a group of non-commas followed by a comma
{3,} repeated three or more times
-v omit lines that match the pattern
so
data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "([^,]*,){3,}" Noobie.txt')
If the data is compressed (gz), on non-Windows platforms you can choose from among:
gzip -cd filename.csv.gz | egrep -v "([^,]*,){3,}"
gunzip -c filename.csv.gz | egrep -v "([^,]*,){3,}"
zgrep -E -v "([^,]*,){3,}" filename.csv.gz
This won't work on Windows, since system and similar functionality in R on Windows does not use bash for its shell, so the | infix operator doesn't do what one expects. There might be a way to get | to work in system et al., but I don't know how to get it to work with data.table::fread(..., cmd=).
Admittedly untested since ... I'm on Windows :-(
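To sanity-check the filter itself on the sample rows from the question, here is a quick shell test (a minimal sketch; myfile.csv.gz is a hypothetical file built from those rows):
printf '1,2,3\n1,2,3\n2,,3\n1,2,3,4\n1,2,3\n1,2,3\n1,2,3\n' | gzip > myfile.csv.gz
gzip -cd myfile.csv.gz | grep -Ev '([^,]*,){3,}'
Only the 1,2,3,4 row is dropped; the six 3-column rows remain, which is what fread then parses.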

Split a text file according to data into multiple files in Unix

Assume the input file is sorted according to column 3 (the barcode begins with "TCGA"):
Joe 1 TCGA-A8-A08L-01A-11W-A019-09 T
John 2 TCGA-A8-A08L-01A-11W-A019-09 T
Jack 3 TCGA-A8-CVDL-01A-11W-A019-09 T
Jane 4 TCGA-A8-CVDL-01A-11W-A019-09 F
Justin 5 TCGA-A8-E08L-01A-11W-A019-09 F
Jasmine 6 TCGA-A8-E08L-01A-11W-A019-09 T
Jacob 7 TCGA-A8-E08L-01A-11W-A019-09 T
I want to split this text into new files according to the content of the 3rd column, outputting just the 1st column values:
File-1:
Joe
John
File-2:
Jack
Jane
File-3:
Justin
Jasmine
Jacob
How can I achieve this?
Edit: The names of the files can be anything; that is not a problem.
I tried many things, such as using split, adding a prefix and suffix for each segment, keeping track of the previous line, etc., but there is an extremely simple solution that I could not think of at first:
awk -F' ' '{print $1 > $3}' inputfile
The names of the files will be column 3's content.
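If you prefer sequential names like File-1, File-2, ... as in the desired output, rather than the barcode itself as the file name, a small variant works because the input is already sorted on column 3 (a minimal sketch):
awk '$3 != prev {prev = $3; n++} {print $1 > ("File-" n)}' inputfile
The counter n goes up each time the barcode in column 3 changes, so the consecutive blocks land in File-1, File-2 and File-3.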

grep command to count number of lines which match certain parameters and does not match certain parameters

I know how to use grep to count the lines that match certain parameters, and separately the lines that do not match certain parameters. But how can I combine the two?
For Example:
Name Date Value
A 04-08-2014 1000
B 04-08-2014 2000
C 04-06-2014 3000
D 04-06-2014 1000
2000 04-08-2014 5000
So, I am looking for a command which counts the number of lines that match Date 04-08-2014 but do not have Value 2000.
Note: 2000 should be part of the "Value" column and the dates should be in the "Date" column.
Any help would be highly appreciated!
Using awk:
$ awk '$2=="04-08-2014" && $3!="2000" {count++}END{print count}' file
2
What about:
cat file | grep 04-08-2014 | grep -c -v 2000
or, without the extra cat:
grep "04-08-2014" file | grep -vc "2000"
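Note that a plain grep -v 2000 matches 2000 anywhere on the line, so the last sample row (Name 2000, Value 5000) would also be excluded, which the note in the question rules out. If you want to stay with grep, anchoring the pattern keeps the test on the Value column (a minimal sketch, assuming Value is the last, space-separated field as in the sample):
grep "04-08-2014" file | grep -vc " 2000$"
On the sample data this prints 2, matching the awk answer.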

Mixing other languages with R

I use R for most of my statistical analysis. However, cleaning/processing data, especially when dealing with sizes of 1 GB+, is quite cumbersome, so I use common UNIX tools for that. But my question is: is it possible to, say, run them interactively in the middle of an R session? An example: let's say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can easily be extracted through cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut -f1,2 < data.txt | awk something_else"))
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or to not even read the unwanted fields assuming there are 4 fields:
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can also be used. In this example we assume a csv file, data.csv, and that the desired fields are called a and b. If it's not a csv file, use the appropriate read.csv.sql arguments to specify a different separator, etc.:
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
I think you may be looking for littler, which integrates R into Unix command-line pipelines.
Here is a simple example computing the file size distribution of /bin:
edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4 5736 23580 61180 55820 1965000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
1 | 01111112233459
2 | 3
3 | 15
4 |
5 |
6 |
7 |
8 |
9 | 5
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 6
edd@max:~/svn/littler/examples$
and all it takes for that is three lines:
edd@max:~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
See ?system for how to run shell commands from within R.
Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:
You can combine several different programming languages in one script, execute them separately or in sequence, export the results or the code, and so on.
It is a little bit like Sweave, except that the code blocks can be Python, bash, R, SQL, and numerous others. Check it out: org-mode and babel, and an example using different programming languages.
Apart from that, I think org-mode and babel are the perfect way of writing even pure R scripts.
Preparing data before working with it in R is quite common, and I have a lot of Unix and Perl pre-processing scripts, and have, at various times, maintained pre-processing scripts/programs for MySQL, MongoDB, Hadoop, C, etc.
However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.

Resources