I have a text file that looks like this:
>long_name
AAC-TGA
>long_name2
CCTGGAA
And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:
cols="2 4 7"
I need to replace each of those columns, in every row that doesn't start with >, with a single character, e.g. an N, to get:
>long_name
ANCNTGN
>long_name2
CNTNGAN
Additional details: the file has ~200K lines. All lines that don't start with > are the same length, and the column indices will never exceed the length of those lines.
It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.
E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):
sed -i.bak '/^[^>]/s/ /N/g' input.txt
And I can use awk to replace specific columns of lines like this (I think...):
awk '$2=N'
But I am struggling to stitch this together
With GNU awk, set the input and output field separators to the empty string so that each character becomes its own field, which you can then update directly.
awk -v cols='2 4 7' '
BEGIN {
    split(cols,f)
    FS=OFS=""
}
!/^>/ {
    for (i in f)
        $(f[i])="N"
}
1' file
Also see Save modifications in place with awk.
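If you want to overwrite the file rather than print to stdout, here is a minimal sketch using GNU awk 4.1+ and its inplace extension (same script as above):
# GNU awk 4.1+ only: -i inplace rewrites "file" instead of printing to stdout
gawk -i inplace -v cols='2 4 7' '
BEGIN { split(cols,f); FS=OFS="" }   # empty FS/OFS: one character per field
!/^>/ { for (i in f) $(f[i])="N" }   # overwrite the requested columns with N
1' file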
You can generate a list of replacement commands first and then pass them to sed
$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN
You can also use {} grouping:
$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2; s/./N/4; s/./N/7; }
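To actually apply it, the generated one-liner can be fed back to sed with command substitution, starting from the cols variable (a sketch, assuming GNU sed, which accepts the ; before the closing }):
cols="2 4 7"
# build the grouped sed command from $cols and run it on the input
sed "$(printf '%s' "$cols" | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|')" ip.txt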
Using any awk in any shell on every UNIX box:
$ awk -v cols='2 4 7' '
BEGIN { split(cols,c) }
!/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
I have the following script:
curl -s -S 'https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-NBT&tickInterval=thirtyMin&_=1521347400000' | jq -r '.result|.[] |[.T,.O,.H,.L,.C,.V,.BV] | @tsv | tostring | gsub("\t";",") | "(\(.))"'
This is the output:
(2018-03-17T18:30:00,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(2018-03-17T19:00:00,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(2018-03-17T19:30:00,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(2018-03-17T20:00:00,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
I want to replace the date with timestamp.
I can make this conversion with date in the shell
date -d '2018-03-17T18:30:00' +%s%3N
1521325800000
I want this result:
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
This data is stored in MySQL.
Is it possible to execute the date conversion with jq or another command like awk, sed, perl in a single command line?
Here is an all-jq solution that assumes the "Z" (UTC+0) timezone.
In brief, simply replace .T by:
((.T + "Z") | fromdate | tostring + "000")
To verify this, consider:
timestamp.jq
[splits("[(),]")]
| .[1] |= ((. + "Z")|fromdate|tostring + "000") # milliseconds
| .[1:length-1]
| "(" + join(",") + ")"
Invocation
jq -rR -f timestamp.jq input.txt
Output
(1521311400000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521313200000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521315000000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521316800000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
Here is a non-portable awk solution. It is not portable because it shells out to the system's date command; on the (BSD-style) system I'm using, the relevant invocation looks like: date -j -f "%Y-%m-%eT%T" STRING "+%s"
awk -F, 'BEGIN{OFS=FS}
NF==0 { next }
{ sub(/\(/,"",$1);
  cmd="date -j -f \"%Y-%m-%eT%T\" " $1 " +%s";
  cmd | getline $1;
  close(cmd);      # close each command so long inputs don't exhaust open pipes
  $1=$1 "000";     # milliseconds
  printf "%s", "(";
  print;
}' input.txt
Output
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
Solution with sed:
sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
Test:
<commande_curl> | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
Or:
<commande_curl> > results_curl.txt
cat results_curl.txt | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
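With GNU sed you can also skip ksh entirely and let sed execute each generated line itself, using the e flag of the s command (a sketch, assuming GNU sed and GNU date):
# the trailing "e" executes the substituted pattern space in a shell and keeps its output
sed -E 's/\(([^,]+)(,.*)/echo "($(date -d \1 +%s%3N)\2"/e' results_curl.txt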
Is it possible to count the occurrences of each word, as uniq -c does, but with the count after the word rather than before?
Example scenario
An input file named test1.txt contains the following data:
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
The fields used in the above data are car_company:car_model:customerID
Desired result
cilo 2
Thar 1
nano 1
quanto 2
(car_model and number of cars sold grouped by car_model)
My code
cat test1.txt | cut -d: -f2 | uniq -c
Actual Result
2 cilo
1 Thar
1 nano
2 quanto
Is it possible to do the above process without using uniq -c, so that I can swap the order of the fields (columns)?
You can use uniq, and simply post-process its output to swap the columns:
cut -d: -f2 test1.txt | uniq -c | awk '{printf "%s\t%s\n", $2, $1}'
EDIT: Added \n, as noted in a comment.
Save your command's output into a file named "badresult":
cat test1.txt | cut -d: -f2 | uniq -c > badresult
Then cut the seventh field and save it into a file named "counts" (use a space (" ") as the separator):
cut -d" " -f7 badresult > counts
Then cut the eighth field and save it into a file named "models" (again using a space (" ") as the separator):
cut -d" " -f8 badresult > models
Now you have your counts and models in separate files. All you have to do is display the two files side by side with the "pr" command (-m: merge, one file per column; -T: no page headers):
pr -m -T models counts
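If you would rather have the two columns joined by a single tab than padded to pr's page width, paste does the same job:
paste models counts    # model<TAB>count, one pair per line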
Using awk:
cat test1.txt | cut -d: -f2 | uniq -c | awk '{ t = $1; $1 = $2; $2 = t; print }'
The little awk code exchanges fields 1 and 2 using a temporary variable.
You just need awk for this:
$ awk -F: '{a[$2]++} END {for (i in a) print i, a[i]}' file
cilo 2
quanto 2
nano 1
Thar 1
This goes through every line, keeping track of how many times the second field has appeared. Since everything is stored in the array a, it is just a matter of looping through it at the END and printing its contents.
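One thing to keep in mind: the order of a for (i in a) loop is unspecified, so if you need a particular order, pipe the result through sort, for example by count, highest first:
awk -F: '{a[$2]++} END {for (i in a) print i, a[i]}' file | sort -k2,2nr   # sort on the 2nd column, numerically, descending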
Well, I have about 114 files that I want to join side by side based on the 1st column, which each file shares: the ID number. Each file consists of 2 columns and over 400000 lines. I used write.table to join those tables together into one table, and I got X's in my header. For example, my header should be like:
ID 1_sample1 2_sample2 3_sample3
But I get it like this:
ID X1_sample1 X2_sample2 X3_sample3
I read about this problem and found out that check.names gets rid of it, but in my case when I use check.names I get the following error:
"unused argument (check.name = F)"
Thus, I decided to use sed to fix the problem; it actually works great, BUT it joins the 2nd line onto the 1st line. For instance, my first two lines should be something like this:
ID 1_sample1 2_sample2 3_sample
cg123 .0235 2.156 -5.546
But I get the following instead:
ID 1_sample1 2_sample2 3_sample cg123 .0235 2.156 -5.546
Can anyone check this code for me, please? I might've done something wrong that keeps the lines from being separated.
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
tail -n +2 beta.norm.txt >> outFILE
If your data is tab-delimited, the simple fix would be:
sed '1,1s/\tX/\t/g' < inputfile > outputfile
1,1 only operate on the range "line 1 to line 1"
\tX find tab followed by X
\t replace with a tab
g all occurrences
It does seem as though your original attempt does more than just strip the X - it also rewrites successive dots as -, ( and ), but you don't show in your example why you need that. The reason your code joins the first two lines is that your last tr command replaces every \n with \t - which leaves you with no \n at the end of the header line.
You need to attach a \n at the end of your first line before concatenating lines 2 and beyond with your second command. Experiment with
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
echo "\n" >> outFile
tail -n +2 beta.norm.txt >> outFILE
whether that works depends on your OS. There are other ways to add a newline...
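A more portable way to append that newline is printf, which does not depend on how a given echo treats backslash escapes:
printf '\n' >> outFILE   # always emits exactly one newline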
edit using awk is probably much cleaner - for example
awk '(NR==1){gsub(" X"," ", $0);}{print;}' inputFile > outputFile
Explanation:
(NR==1) for the first line only (record number == 1) do:
{gsub(" X","", $0);} do a global substitution of "space followed by X", with "space"
for all lines (including the one that was just modified) do:
{print;}' print the whole line
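Since the question says the data is tab-delimited, the same idea with a tab instead of a space would be (a sketch; adjust the separator if your header differs):
awk 'NR==1{gsub(/\tX/,"\t")} {print}' inputFile > outputFile   # strip X after a tab, on the first line only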
What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
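paste is not limited to two files; the delimiter list is reused when it runs out, so extra files can simply be appended:
paste -d '\n' file1 file2 file3   # interleave three (or more) files line by line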
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 has at least as many lines as file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 | sort -t. -k 2.1
Here it's specified that the separator is "." and that we are sorting on the first character of the second field.
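For the example files above this should reproduce the interleaved order, since the key after the dot is the line number and ties are broken by sort's last-resort whole-line comparison:
$ cat file1 file2 | sort -t. -k 2.1
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3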