Bash/R searching columns in a huge table - r

I have a huge table I want to extract information from. Firstly, I want to extract a certain line based on a pattern -> I've done that successfully with grep. However this line has loads of columns and I'm interested only in a couple of them that have a certain pattern in them (partial match - beginning of the string). Is it possible to extract only the columns and the number of the column (the nth column) for some partial matches? Hope I was clear enough.
Languages: Preferably in bash but I can also work in R, alternatively I'm open to suggestions if you think another language can be more helpful.
Thanks!

Awk is perfect for stuff like this. To help you write a script I think we need more details. But I'm guessing you'll want to use the print feature of awk. To print out the nth column of a file "your_file" do:
awk '{print $n}' your_file
In solving your problem you may also want to loop over all N columns which you can do via:
for i in {1..N} ;
do
awk -v col=${i} '{print $col}' your_file ;
done

Related

Computing the density of a word in a file using command line

I'm trying to find the density of a word by finding the number of lines containing a word and the total number of lines. I tried this:
echo $((grep 'word' filename | wc -l)/(wc -l filename))
But it's throwing me a syntax error. I'm sure it's something basic, but I'm pretty new so any help would be appreciated!
grep and expr are the wrong tools. You want a simple awk script:
awk '/word/{count++} END { print count/NR}' input-file
Note that you're not even explicitly calling expr, but your attempt to use / as a division operator implies that is your intent. But in that context, / is integer division, so very likely not what you want. Using grep piped to wc will work, but that forces you to read the input mulitple times, which you don't want. Using awk to scan the file once, counting the lines that match seems like your best bet.

Unix redact data

I want to mask only the 2nd column of the data.
Input:
First_name,second_name,phone_number
ram,prakash,96174535
hari,pallavi,98888234
anurag,aakash,82783784
Output Expected:
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
The sed program will do this just fine:
sed '2,$s/,[^,]*,/,*****,/'
The 2,$ only operates on lines 2 through to the end of the file (to leave the header line alone) and the substitute command s/,[^,]*,/,*****,/ will replace anything between the first and second comma with the mask *****.
Note that I've specifically used a fixed number of asterisks in the replacement string. Whether you're hiding passwords or anonymising data (as seems to be the case here), you don't want to leak any information, including the size of the names being replaced.
If you really want to use the same number of characters as in the original data, and you also want to cater for the possibility of replacing multiple fields, you can use something like:
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
This will also leave the first line untouched but will anonymise columns two and four (albeit with the information leakage previously mentioned):
echo 'First_name,second_name,phone_number,other
ram,prakash,96174535,abc
hari,pallavi,98888234,def
anurag,aakash,82783784,g
bob,santamaria,124,xyzzy' | awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
First_name,second_name,phone_number,other
ram,*******,96174535,***
hari,*******,98888234,***
anurag,******,82783784,*
bob,**********,124,*****
Doing multiple columns with full anonymising would entail using $2="*****" rather than the gsub (for both columns of course).
Another in awk. Using gsub to replace every char in $2 with an *:
$ awk 'BEGIN{FS=OFS=","}NR>1{gsub(/./,"*",$2)}1' file
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
try following too once and let me know if this helps you.
awk -F"," 'NR>1{$2="*******"} 1' OFS=, Input_file

grep: how to show the next lines after the matched one until a blank line [not possible!]

I have a dictionary (not python dict) consisting of many text files like this:
##Berlin
-capital of Germany
-3.5 million inhabitants
##Earth
-planet
How can I show one entry of the dictionary with the facts?
Thank you!
You can't. grep doesn't have a way of showing a variable amount of context. You can use -A to show a set number of lines after the match, such as -A3 to show three lines after a match, but it can't be a variable number of lines.
You could write a quick Perl program to read from the file in "paragraph mode" and then print blocks that match a regular expression.
as andy lester pointed out, you can't have grep show a variable amount of context in grep, but a short awk statement might do what you're hoping for.
if your example file were named file.dict:
awk -v term="earth" 'BEGIN{IGNORECASE=1}{if($0 ~ "##"term){loop=1} if($0 ~ /^$/){loop=0} if(loop == 1){print $0}}' *.dict
returns:
##Earth
-planet
just change the variable term to the entry you're looking for.
assuming two things:
dictionary files have same extension (.dict for example purposes)
dictionary files are all in same directory (where command is called)
If your grep supports perl regular expressions, you can do it like this:
grep -iPzo '(?s)##Berlin.*?\n(\n|$)'
See this answer for more on this pattern.
You could also do it with GNU sed like this:
query=berlin
sed -n "/$query/I"'{ :a; $p; N; /\n$/!ba; p; }'
That is, when case-insensitive $query is found, print until an empty line is found (/\n$/) or the end of file ($p).
Output in both cases (minor difference in whitespace):
##Berlin
-capital of Germany
-3.5 million inhabitants

Remove lines which are between given patterns from a file (using Unix tools)

I have a text file (more correctly, a “German style“ CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write tmp.txt >> | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
you are also use awk
awk '/start/,/end/' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools and where you would previously have used multiple tools piped together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)
use grep -L (print none matching lines)
Sorry - thought you just wanted lines without 0,000 at the end

script,unix,compare

I have two files ...
file1:
002009092312291100098420090922111
010555101070002956200453T+00001190.81+00001295.920010.87P
010555101070002956200449J+00003128.85+00003693.90+00003128
010555101070002956200176H+00000281.14+00000300.32+00000281
file2:
002009092410521000098420090709111
010560458520002547500432M+00001822.88+00001592.96+00001822
010560458520002547500432D+00000106.68+00000114.77+00000106
In both files in every record starting with 01, the string from 3rd char to 25th char, i.e up to alphabet is the key.
Based on this key, I have to compare two files, and if there is any record matching in file 2, then I have to replace that record in file1, or else append it if it won't match.
Well, this is a fairly unspecific (and basic) programming question. We'll be better able to help us if you explain exactly what you did and where you got stuck.
Also, it looks a bit like homework, and people are wary of giving too much help on homework problems, as it might look like cheating.
To get you started:
I'd recommend Perl to solve this, but awk or another scripting language will also do. I'd recommend against sh/bash, as they are weak on text manipulation; also combining grep et al will become rather cumbersome.
First write a Perl program that filters records starting with 01. Then extract the key and put it into a hash (a Perl structure). Then output a new, combined file as required.
Using awk get the fields from 3-25 but doing something like
awk -F "" '/^01/{print $1}' file_name | cut -c 3-25 and match the first two fields with 01 from both files and get all the lines in two different buffers and compare both the buffers using for line in in a shell script.
Whenever the line in second buffer matches the first one grep the line in second buffer in first file and replace the line in first file with the line in second. I think you need to work a bit around the logic.

Resources