I want to use the grep command to print only those lines in a file that contain a specific string. My file is in xlsx format and looks like:
I only want to extract the lines that contain 5'UTR in the Annotation column.
The command I use is:
grep 'UTR$' Lee2012.xslx > utr.txt
However, I end up getting an empty file. I'd appreciate it if someone could explain what I am doing wrong. Thank you!
Your sixth column ends with UTR, not the whole line; that is why you get no results.
To get all lines where the sixth column ends with UTR, you can use
awk '$6 ~ /UTR$/' Lee2012.xslx > utr.txt
This assumes your column (field) separators are whitespace.
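Note also that a real .xlsx file is a zipped binary format that grep and awk cannot read directly; if that is what you have, export the sheet to plain text first. Assuming a tab-separated export (the name Lee2012.tsv is hypothetical):
awk -F '\t' '$6 ~ /UTR$/' Lee2012.tsv > utr.txt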
I have a sorted .csv file that is something like this:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I want to split the file into smaller files of roughly two lines each, but I do not want rows with like values in the first column separated.
Here, I would end up with three files:
x00000:
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
x00001:
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
x00002:
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
My actual data is about 7 gigs in size and contains over 100 million lines. I want to split it into files of about 100K lines each or ~6MB. I am fine with using either file size or line numbers for splitting.
I know that I can use "split", such as:
split -a 5 -d -l 2 file.csv
Here, that would give me four files, and like values in the first column would be split across files in most cases.
I think I probably need awk, but, even after reading through the manual, I am not sure how to proceed.
Help is appreciated! Thanks!
An awk script:
BEGIN { FS = "," }

!name { name = sprintf("%06d-%s.txt", NR, $1) }

count >= 2 && prev != $1 {
    close(name)
    name = sprintf("%06d-%s.txt", NR, $1)
    count = 0
}

{
    print > name
    prev = $1
    ++count
}
Running this on the given data will create three files:
$ awk -f script.awk file.csv
$ cat 000001-AABB1122.txt
AABB1122,ABC,BLAH,4
AABB1122,ACD,WHATEVER,1
AABB1122,AGT,CAT,4
$ cat 000004-CCDD4444.txt
CCDD4444,AYT,DOG,4
CCDD4444,ACG,MUMMY,8
$ cat 000006-CCEE4444.txt
CCEE4444,AOP,RUN,5
DDFF9900,TUI,SAT,33
DDFF9900,WWW,INDOOR,5
I have arbitrarily chosen the filenames: the line number in the original file from which the output file's first line was taken, followed by the first field of that line.
The script counts the number of lines printed to the current output file; if that count has reached 2, and the first field's value differs from the previous line's first field, the current output file is closed, a new output name is constructed, and the count is reset.
The last block simply prints to the current filename, remembers the first field in the prev variable, and increments the count.
The BEGIN block initializes the field delimiter (before the first line is read) and the !name block sets the initial output file name (when reading the very first line).
To get exactly the filenames that you have in the question, use
name = sprintf("x%05d", n++)
to set the output filename in both places where this is done (note the post-increment, so that the numbering starts at x00000).
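Put together, the modified script would look like this (the same logic as above, only the naming differs):

BEGIN { FS = "," }

!name { name = sprintf("x%05d", n++) }

count >= 2 && prev != $1 {
    close(name)
    name = sprintf("x%05d", n++)
    count = 0
}

{
    print > name
    prev = $1
    ++count
}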
With csplit, if available. With the given data:
csplit -s infile %^A% /^C/ %^C% /^D/ /^Z/ {*}
Here /re/ writes out a section ending just before the next line that matches the pattern, %re% advances in the same way but discards the lines instead of writing them, -s suppresses the usual size output, and {*} repeats the last pattern as many times as possible.
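As a simpler illustration of the /re/ syntax (assuming GNU csplit; file.csv is the sample data above):
$ csplit -s file.csv '/^C/' '/^D/'
This writes the AABB lines to xx00, the CCDD and CCEE lines to xx01, and the DDFF lines to xx02. That grouping also keeps like values together, though it differs slightly from the one in the question.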
I have a list like this:
DEL075MD1BWP30P140LVT
AN2D4BWP30P140LVT
INVD0P7BWP40P140
IND2D6BWP30P140LVT
I want to replace everything between D and BWP with a *.
How can I do that in Unix and Tcl?
Do you have the whole list available at the same time, or are you getting one item at a time from somewhere?
Should all D-BWP groups be processed, or just one per item?
If just one per item, should it be the first or last (those are the easiest alternatives)?
Tcl REs don't have any lookbehind, which would have been nice here. But you can do without both lookbehinds and lookaheads if you capture the goalposts and paste them into the replacement as back references. The regular expression for the text between the goalposts should be [^DB]+, i.e. one or more characters of any text that doesn't include D or B (to make sure the match doesn't escape the goalposts and stick to other Ds or Bs in the text). So: {(D)[^DB]+(BWP)} (putting braces around the RE is usually a good idea).
If you have the whole list and want to process all groups, try this:
set result [regsub -all {(D)[^DB]+(BWP)} $lines {\1*\2}]
(If you can only work with one line at a time, it's basically the same: you just use a variable for a single line instead of a variable for the whole list. In the following examples, I use lmap to generate individual lines, which means I need to have the whole list anyway; this is just an example.)
Process just the first group in each line:
set result [lmap line $lines {
    regsub {(D)[^DB]+(BWP)} $line {\1*\2}
}]
Process just the last group in each line:
set result [lmap line $lines {
    regsub {(D)[^DB]+(BWP[^D]*)$} $line {\1*\2}
}]
The {(D)[^DB]+(BWP[^D]*)$} RE extends the right goalpost to ensure that there is no D (and hence possibly a new group) anywhere between the goalpost and the end of the string.
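To try the all-groups variant interactively in tclsh, here is a self-contained snippet using the sample items from the question (the list is treated as a plain string here, which is safe for these items since they contain no whitespace or Tcl special characters):

set lines {DEL075MD1BWP30P140LVT AN2D4BWP30P140LVT INVD0P7BWP40P140 IND2D6BWP30P140LVT}
set result [regsub -all {(D)[^DB]+(BWP)} $lines {\1*\2}]
puts $result

This prints:
DEL075MD*BWP30P140LVT AN2D*BWP30P140LVT INVD*BWP40P140 IND2D*BWP30P140LVT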
Documentation:
lmap (for Tcl 8.5),
lmap,
regsub,
set,
Syntax of Tcl regular expressions
Good morning.
First things first: I know next to nothing about shell scripting in Unix, so please pardon my naivety.
Here's what I'd like to do, and I think it's relatively simple: I would like to create a .ksh file to do two things: 1) take a user-provided numerical value (argument) and paste it into a new column at the end of a dataset (a separate .txt file), and 2) execute a different .ksh script.
I envision calling this script at the Unix prompt, with the input value added thereafter. Something like, "paste_and_run.ksh 58", where 58 would populate a new, final (un-headered) column in an existing dataset (specifically, it'd populate the 77th column).
To be perfectly honest, I'm not even sure where to start with this, so any input would be very appreciated. Apologies for the lack of code within the question. Please let me know if I can offer any more detail, and thank you for taking a look.
I have found the answer: the "nawk" command.
TheNumber=$3
PE_Infile=$1
The above variables correspond to the third and first arguments from the command line, respectively. PE_Infile represents the file (with full path) to be manipulated, and TheNumber represents the number to populate the final column. Then:
nawk -F"|" -v TheNewNumber=$TheNumber '{print $0 "|" TheNewNumber/10000}' $PE_Infile > $BinFolder/Temp_Input.txt
Here, the -F"|" dictates the delimiter, and the -v dictates what is to be added. For reasons unknown to myself, the declaration of a new varible (TheNewNumber) was necessary to perform the arithmetic manipulation within the print statement. print $0 means that the whole line would be printed, while tacking the "|" symbol and the value of the command line input divided by 10000 to the end. Finally, we have the input file and an output file (Temp_PE_Input.txt, within a path represented by the $Binfolder variable).
Running the desired script afterward was as simple as typing out the script name (with path), and adding corresponding arguments ($2 $3) afterward as needed, each separated by a space.
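For completeness, here is a minimal sketch of how the pieces might fit together in paste_and_run.ksh. The BinFolder path and the name of the follow-up script are placeholders, not values from the original setup:

#!/bin/ksh
# Sketch: append a derived value as a new final column, then run another script.
# Argument positions follow the answer above: $1 = input file, $3 = the number.
PE_Infile=$1             # file (with full path) to be manipulated
TheNumber=$3             # number used to populate the final column
BinFolder=/path/to/bin   # placeholder; set to wherever Temp_Input.txt should go

# Append "|<number/10000>" to every line of the input file
nawk -F"|" -v TheNewNumber="$TheNumber" '{print $0 "|" TheNewNumber/10000}' "$PE_Infile" > "$BinFolder/Temp_Input.txt"

# Run the other .ksh script, passing along the remaining arguments
/path/to/other_script.ksh "$2" "$3"   # placeholder script name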
A bit new to UNIX, but I have a question with regard to altering csv files going into a data feed.
There are a few |-separated columns where the date has come back as (for example)
|07-04-2006 15:50:33:44:55:66|
and this needs to be changed to
|07-04-2006|
It doesn't matter if all the data gets written to another file. There are thousands of rows in these files.
Ideally, I'm looking for a way of going to the 3rd and 7th piped columns, taking the first 10 characters, and removing everything else up to the next |.
Thanks in advance for your help.
What exactly do you want? You can replace |07-04-2006 15:50:33:44:55:66| with |07-04-2006| using file I/O.
This operates on all columns, but that should be fine unless there are date columns that must not be changed:
sed 's/|\(..-..-....\) ..:..:..:..:..:..|/|\1|/g'
If you want to change the files in place, you can use sed's -i option.
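If only the 3rd and 7th piped columns should be touched, as the question suggests, an awk alternative along the same lines (a sketch; infile and outfile are placeholders, and it assumes the dates really are in fields 3 and 7 as awk counts them):
awk -F'|' -v OFS='|' '{ $3 = substr($3, 1, 10); $7 = substr($7, 1, 10); print }' infile > outfile
Reassigning a field makes awk rebuild the line with the output separator, so only those two columns are truncated to their first 10 characters.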