Skip 2 lines after reading 1 line of a file using AWK - unix

I have a huge file which I am reading through awk; using awk I am calculating the sum of values in that file.
Below is the file format I have:
18/11/13 00:00:50 585 17353 296883 666
18/11/13 00:01:50 965 26536 216201 558
18/11/13 00:02:50 990 38685 390537 768
18/11/13 00:03:50 1004 22435 377633 404
18/11/13 00:04:50 709 15754 161435 12062
18/11/13 00:05:50 96 7084 403551 0
18/11/13 00:06:50 107 14588 504683 597
18/11/13 00:07:50 115 27562 457555 814
awk '{sum+=$4; ++n} END {print " Tot="n," Avg="sum/n}' filename
Now I think I want to skip 2 rows after reading a row from the file.

Per your comment under your question, you just need to skip lines 2 and 3:
awk 'NR==1||NR>3{sum+=$4; ++n} END {print " Tot="n," Avg="sum/n}' filename
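If instead the goal is, as in the title, to skip 2 lines after every line read throughout the whole file, a modulo test on NR does it (a minimal sketch, reusing the same sum/average logic):
awk 'NR % 3 == 1 {sum+=$4; ++n} END {print " Tot="n," Avg="sum/n}' filename
This processes lines 1, 4, 7, and so on, skipping the two lines after each one it reads.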

Related

AWK apply to all the rows except first few

I have a report xyz.rpt in the below format
# ajsh askj sdjh 54 12
# jhgj 765 2839
kjsd sdh sdsdf sdffsdf sdff
5464 765 67 65 76
2356 423 34 45 34
and so on
I want to print the first 2 header lines as well, along with the output of the awk command:
awk '{if ( $5 < 65 ) print}' xyz.rpt > aaa.rpt
One awk idea:
awk '
FNR<=2 # print 1st 2 lines
FNR>2 && $5+0==$5 && $5<65 # print lines where $5 is numeric and $5 < 65
' xyz.rpt
Or as a one-liner:
awk 'FNR<=2; FNR>2 && $5+0==$5 && $5 < 65' xyz.rpt
This generates:
ajsh askj sdjh 54 12
jhgj 765 2839
2356 423 34 45 34
Not as elegant as I hoped:
nawk '($5==+$(5*(+$5<=4^3)))+(3>FNR)'
ajsh askj sdjh 54 12
jhgj 765 2839
2356 423 34 45 34
$ awk 'NR<=2 || ($5+0==$5 && $5<65)' file
My solution would be:
awk 'BEGIN { for(i=0;i<2;i++) {getline; print;}}$5<65{print}' input
The BEGIN block is executed at the start. getline reads a line, which is then printed.
$5<65: instead of the if construction, it's better to use a pattern.
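Since an awk pattern with no action prints the whole record by default, the same thing can be written without the explicit {print} (same behaviour, just a touch shorter):
awk 'BEGIN { for(i=0;i<2;i++) {getline; print;} } $5<65' input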

AWK match lines/columns then compare another column and print

Relatively new to AWK here. I want to compare two files. The first two columns have to match in order to compare the 3rd column. The 3rd column needs to be more than 100 larger in order to print that line from the second file. Some data may exist in one file but not in the other. I don't think it matters to AWK, but the spacing isn't very consistent for delimiting. Here is a small snippet.
File1
USTL_WR_DATA MCASYNC#L -104 -102 -43 -46
USTL_WR_DATA SMC#L 171 166 67 65
TC_MCA_GCKN SMC#L -100 -100 0 0
WDF_ARRAY_DW0(0) DCDC#L 297 297 101 105
WDF_ARRAY_DW0(0) MCASYNC#L 300 300 50 50
WDF_ARRAY_DW0(0) MCMC#L 12 11 34 31
File2
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270
WDF_ARRAY_DW0(0) MCASYNC#L 300 300 50 50
WDF_ARRAY_DW0(1) SMCw#L 300 300 50 50
WDF_ARRAY_DW0(2) DCDC#L 896 927 279 286
WDF_ARRAY_DW0(2) MCASYNC#L 300 300 50 50
Output
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270
Here is my code. Not working. Not sure why.
awk 'NR==FNR{a[$1,$2];b[$3];next} (($1,$2) in a) && ($3> (b[$1]+100))' File1 File2
NR==FNR{a[$1,$2];b[$3];next} makes two arrays from the first file (I had issues making it one): the first two columns go in a to confirm we're comparing the same thing, and the third column I'm using to compare, since late mode high seems like a reasonable assert to compare.
(($1,$2) in a) makes sure the first two columns in the second file are the ones we're comparing against.
&& ($3> (b[$1]+100)) I think this is what's giving the issue. It's supposed to check whether the second file's column 3 is 100 or more greater than the first file's column 3 (the first and only column in array b).
You need to key the value with the same ($1,$2) combination. Since we don't use a for any other purpose, just store the value there; see the note on composite subscripts after the output below.
$ awk 'NR==FNR {a[$1,$2]=$3; next}
($1,$2) in a && $3>a[$1,$2]+100' file1 file2
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270
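As a side note on why the keying must match: awk's ($1,$2) subscript is really a single string built by joining the fields with SUBSEP, so both files have to build the key the same way. A minimal illustration (the "foo"/"bar" keys are hypothetical, just to show the mechanics):
awk 'BEGIN {
  a["foo","bar"] = 1                        # stored under the single key "foo" SUBSEP "bar"
  if (("foo","bar") in a)        print "found via the (k1,k2) test"
  if (("foo" SUBSEP "bar") in a) print "found via the explicit SUBSEP key"
}'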

How can I grep content from file1 matching file2 and put them in the order of file1

I have file1.txt with content:
rs002
rs113
rs209
rs227
rs151
rs104
I have file2.txt with content:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
I want to get the lines of file2.txt that match the records in file1.txt, for which I tried:
grep -Fwf file1.txt file2.txt
with output as follows:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
This extracts all the matching lines, but it is in the order of occurrence in file2.txt. Is there any way to extract the matching records while maintaining the order from file1.txt? The desired output is as follows:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
One (admittedly not very elegant) solution is to loop over file1.txt and look for a match for each line:
while IFS= read -r line; do
grep -wF "$line" file2.txt
done < file1.txt
which gives the output
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
If you know that each line occurs only once at most, this can be accelerated a bit by telling grep to stop after the first match:
grep -m 1 -wF "$line" file2.txt
This is a GNU extension, as far as I can tell.
Notice that looping over one file to do some processing on another file in each iteration is usually a sign that there is a much more efficient way to do things, so this should probably only be used for files small enough that the effort of coming up with a better solution would take longer than just processing them this way.
This is too complicated for grep. If file2.txt is not huge, i.e. it fits into memory, you should probably be using awk:
awk 'FNR==NR { f2[$1] = $2; next } $1 in f2 { print $1, f2[$1] }' file2.txt file1.txt
Output:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
Create a sed-command file from file2
sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2 > tmp.sed
sed -f tmp.sed file1
These 2 lines can be combined, avoiding the temporary file:
sed -f <(sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2) file1
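For the sample file2 shown above, the generated script (tmp.sed, or the process substitution) would contain one command per line, each appending the second column to any line of file1 that matches the first column (assuming the fields are separated by a single space):
/rs113/ s/$/ 113/
/rs002/ s/$/ 002/
/rs227/ s/$/ 227/
/rs209/ s/$/ 209/
/rs104/ s/$/ 104/
/rs151/ s/$/ 151/
Because sed then walks through file1 line by line, the output naturally keeps file1's order.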
That should help (but will not be optimal for big input):
$ for line in $(cat file1.txt); do grep "$line" file2.txt; done

Dividing a file in R and automatically creating notepad files

I have a file which is like this :
"1943" 359 1327 "t000000" 8
"1944" 359 907 "t000000" 8
"1946" 359 472 "t000000" 8
"1947" 359 676 "t000000" 8
"1948" 326 359 "t000000" 8
"1949" 359 585 "t000000" 8
"1950" 359 1157 "t000000" 8
"2460" 275 359 "t000000" 8
"2727" 22 556 "t000000" 8
"2730" 22 676 "t000000" 8
"479" 17 1898 "t0000000" 5
"864" 347 720 "t000s" 12
"3646" 349 691 "t000s" 7
"6377" 870 1475 "t000s" 14
"7690" 566 870 "t000s" 14
"7691" 870 2305 "t000s" 14
"8120" 870 1179 "t000s" 14
"8122" 44 870 "t000s" 14
"8124" 870 1578 "t000s" 14
"8125" 206 870 "t000s" 14
"8126" 870 1834 "t000s" 14
"6455" 1 1019 "t000t" 13
"4894" 126 691 "t00t" 9
"4896" 126 170 "t00t" 9
"560" 17 412 "t0t" 7
"130" 65 522 "tq" 18
"1034" 17 990 "tq" 10
"332" 3 138 "ts" 2
"2063" 61 383 "ts" 5
"2089" 127 147 "ts" 11
"2431" 148 472 "ts" 15
"2706" 28 43 "ts" 21
.....................
The first column is the random row number (obtained after some sorting that I needed); the fourth column contains the pattern for which I actually want different notepad files.
What I want is to get individual notepad files, named for example f1.txt, f2.txt, f3.txt, ..., each containing all the rows for one value in column 4. For example, I get a different file for "t000000", then a different one for "t000s", then a separate one for "t00t", and so on...
I did this,
list2env(split(sort, sort[,4]),envir=.GlobalEnv)
Here sort is the name of my data set (read from the text file) and 4 is that column.
And then I can use the write.table command, but since my file is huge, I get hundreds of files like that, and doing write.table manually for each is very difficult. Is there any way I can automate it?
Using the excellent data.table package:
library(data.table)
# get your source file
the_file <- fread('~/Desktop/file.txt') #replace with your file path
# vector of unique values of column 4 & the roots of your output filename
fl_names <- unique(the_file$V4)
# dump all the relevant subsets to files
for (f in fl_names) write.table(the_file[V4==f, ], paste0(f, '.txt'), row.names=FALSE)
You've already figured out split, but instead of list2env, which will make more work for you, just use lapply:
# Generally confusing to name a data.frame
# the same as a common function!
X <- split(sort, sort[, 4])
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
Proof of concept:
Dir <- getwd() # Won't be necessary in your actual script
setwd(tempdir()) # I just don't want my working directory filled
list.files(pattern=".csv") # with random csv files, so I'm using tempdir()
# character(0) # Note that there are no csv files presently
X <- split(sort, sort[, 4]) # You've already figured this step out
## invisible is just so you don't have to see an empty list
## printed in your console. The rest is pretty straightforward
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
list.files(pattern=".csv") # Check that the files are there
# [1] "t000000.csv" "t0000000.csv" "t000s.csv" "t000t.csv"
# [5] "t00t.csv" "t0t.csv" "tq.csv" "ts.csv"
setwd(Dir) # Won't be necessary for your actual script

How to de-merge column name and data in the 1st row when importing a .csv file?

Code:
data <- read.csv("./data.csv",header=T)
data
Output:
X224786 X578 X871 X9719
1 230034 546 969 10262
2 236562 599 845 10120
Expected Output:
A B C D
224786 578 871 9719
230034 546 969 10262
236562 599 845 10120
Obviously, your *.csv file has no header line. So, try:
data <- read.csv("./data.csv", header=F)
names(data) <- c("A","B","C","D")
Try read.table instead of read.csv. read.csv requires commas between each field.
