How to delete nearly duplicate lines in vim - unix

I have the following lines:
123 abcd 456 xyz
123 abcd 678 xyz
234 egfs 434 ert
345 fggfgf 456 455 rty
234 egfs 422 ert 33
So here, if the first field is the same for multiple lines, they are considered duplicates. In the above example, 123 is the same in two lines, so they are considered duplicates (even though they differ in one field in the middle). Similarly, the lines starting with 234 are duplicates.
I need to remove these duplicate lines.
Since they aren't 100% duplicates, :sort u doesn't work. Does anyone know how I can delete these duplicate lines?

This would be a very easy task for awk, so I would do it with awk. In vim, you can do:
:%!awk '\!a[$1]++'
then you get:
123 abcd 456 xyz
234 egfs 434 ert
345 fggfgf 456 455 rty
if you do it in shell, you don't have to escape the !:
awk '!a[$1]++' file
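As a side note on why the one-liner works, here is a commented sketch of the same idiom (seen and file are placeholder names):
# seen[$1] is 0 (false) the first time a given first field appears; the
# post-increment then bumps it, so !seen[$1]++ is true exactly once per key,
# and the default action (print the line) keeps only the first occurrence.
awk '!seen[$1]++' file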

A pure Vim alternative: this deletes a line when its leading keyword also begins the line directly above it:
g/\%(^\1\>.*$\n\)\@<=\(\k\+\).*$/d

This is easy with my PatternsOnText plugin. It allows you to specify a pattern that is ignored for the duplicate check; in your case, that would be everything after the first (space-delimited) field:
%DeleteDuplicateLinesIgnoring / .*/

Related

AWK match lines/columns then compare another column and print

Relatively new to AWK here. I want to compare two files. The first two columns have to match in order to compare the 3rd column, and the 3rd column needs to be 100 larger in order to print that line from the second file. Some data may exist in one file but not in the other. I don't think it matters to AWK, but the spacing isn't very consistent for delimiting. Here is a small snippet.
File1
USTL_WR_DATA MCASYNC#L -104 -102 -43 -46
USTL_WR_DATA SMC#L 171 166 67 65
TC_MCA_GCKN SMC#L -100 -100 0 0
WDF_ARRAY_DW0(0) DCDC#L 297 297 101 105
WDF_ARRAY_DW0(0) MCASYNC#L 300 300 50 50
WDF_ARRAY_DW0(0) MCMC#L 12 11 34 31
File2
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270
WDF_ARRAY_DW0(0) MCASYNC#L 300 300 50 50
WDF_ARRAY_DW0(1) SMCw#L 300 300 50 50
WDF_ARRAY_DW0(2) DCDC#L 896 927 279 286
WDF_ARRAY_DW0(2) MCASYNC#L 300 300 50 50
Output
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270
Here is my code. It is not working, and I am not sure why.
awk 'NR==FNR{a[$1,$2];b[$3];next} (($1,$2) in a) && ($3> (b[$1]+100))' File1 File2
NR==FNR{a[$1,$2];b[$3];next} makes two arrays from the first file (I had issues making it one): the first two columns go into a to confirm we're comparing the same thing, and the third column goes into b, since late-mode high seems like a reasonable value to compare.
(($1,$2) in a) makes sure the first two columns in the second file are the ones we're comparing against.
&& ($3> (b[$1]+100)) is, I think, what's giving the issue. It is supposed to check whether the second file's column 3 is 100 or more greater than the first file's column 3 (the first and only column in array b).
You need to key the value with the same ($1,$2) combination. Since we don't use a for any other purpose, just store the value there:
$ awk 'NR==FNR {a[$1,$2]=$3; next}
($1,$2) in a && $3>a[$1,$2]+100' file1 file2
TC_MCA_GCKN SMC#L 200 200 0 0
WDF_ARRAY_DW0(0) DCDC#L 842 867 271 270

gsub not working while implementing in a loop

Here I have the following data frame df in R:
kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685
I want to replace the industry values that are incorrectly written as Apparel or APPEARELS with Apparels.
I tried creating a list and running it through a loop:
l<-c('Apparel ','APPEARELS','apparels')
for(i in range(1:3)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
It is not working; only one element changes.
But when I run the statement individually, it works without raising an error:
df$industry<-gsub(pattern='Apparel',"Apparels",df$industry)
But this is a large dataset, so I need this to work in a loop in R. Please help.
sub without a loop, using |:
l <- c("Apparel" , "APPEARELS", "apparels")
# Using OPs data
sub(paste(l, collapse = "|"), "Apparels", df$industry)
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT" "IT"
I'm using sub instead of gsub as there's only one occurrence of the pattern per string (at least in the example).
While range returns a sequence in Python, it returns the minimum and maximum of a vector in R:
range(1:3)
# [1] 1 3
Instead, you could use 1:3 or seq(1,3) or seq_along(l), which all return
# [1] 1 2 3
Also note the difference between 'Apparel' and 'Apparel '.
So
df<-read.table(header=T, text="kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685")
l<-c('Apparel','APPEARELS','apparels')
for(i in seq_along(l)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
df
# kyid industry amount
# 1 112 Apparels 345436
# 2 234 Apparels 234567
# 3 213 Apparels 345678
# 4 345 Airlines 235678
# 5 123 IT 456789
# 6 124 IT 897685
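If all the misspellings share a common prefix, a single case-insensitive substitution avoids the loop entirely. This is only a hedged sketch: it assumes every variant of the value starts with "app" (in any case) and that no other industry does.
# assumption: every misspelling of "Apparels" begins with "app" (any case)
df$industry <- gsub("^app.*", "Apparels", df$industry, ignore.case = TRUE)
df$industry
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT" "IT"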

Make vector of rows that match a condition in R

I have a dataframe 'likes' that looks like this:
uid Likes
123 Harry Potter
123 Fitness
123 Muesli
123 Fanta
123 Nokia
321 Harry Potter
321 Muesli
455 Harry Potter
455 Muesli
699 Muesli
123 Belgium
Furthermore I have a bunch of strings, for example: WhatLikes <- c("Harry Potter","Muesli")
I want a vector of the uids that 'like' Harry Potter OR Muesli. Take note that WhatLikes is much bigger than in this example.
The solution should thus be a vector containing 123, 321, 455, 699.
Help me out! Thanks!
We can use %in% to get a logical index of elements in 'Likes' that are found in 'WhatLikes'. Get the corresponding 'uid' from the dataset. Apply unique to remove the duplicate 'uid's.
unique(df1$uid[df1$Likes %in% WhatLikes])
#[1] 123 321 455 699
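For completeness, a minimal reproducible sketch built from the data in the question (the data frame is called df1 here only to match the answer above):
df1 <- data.frame(
  uid   = c(123, 123, 123, 123, 123, 321, 321, 455, 455, 699, 123),
  Likes = c("Harry Potter", "Fitness", "Muesli", "Fanta", "Nokia",
            "Harry Potter", "Muesli", "Harry Potter", "Muesli", "Muesli", "Belgium")
)
WhatLikes <- c("Harry Potter", "Muesli")
unique(df1$uid[df1$Likes %in% WhatLikes])
# [1] 123 321 455 699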

R: How to separate string in a cell into several cells in a row?

I have the following data:
Data <- data.frame(
X = c("123 234 345 456", "222 333 444 555 666")
)
Data
# X
# 123 234 345 456
# 222 333 444 555 666
The string sits in one cell, and its length is not the same in each row.
I want the following result
>Result
# X Y Z A B
# 123 234 345 456
# 222 333 444 555 666
i.e. one word per cell.
Can anybody help?
strsplit is not required here; read.table should work fine. Try:
read.table(text = as.character(Data$X), header=FALSE, fill=TRUE)
You will have to rename the resulting columns yourself, though.
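A hedged follow-up sketch: the same call, but assigning the column names shown in the desired Result (X, Y, Z, A, B); the missing fifth value in the first row comes back as NA:
Result <- read.table(text = as.character(Data$X), header = FALSE, fill = TRUE,
                     col.names = c("X", "Y", "Z", "A", "B"))
Result
#     X   Y   Z   A   B
# 1 123 234 345 456  NA
# 2 222 333 444 555 666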

Dynamically change what is being awked

I have this input below:
IDNO H1 H2 H3 HT Q1 Q2 Q3 Q4 Q5 M1 M2 EXAM
OUT OF 100 100 150 350 30 30 30 30 30 100 150 400
1434 22 95 135 252 15 20 12 18 14 45 121 245
1546 99 102 140 341 15 17 14 15 23 91 150 325
2352 93 93 145 331 14 17 23 14 10 81 101 260
(et cetera)
H1 H2 H3 HT Q1 Q2 Q3 Q4 Q5 M1 M2 EXAM
OUT OF 100 100 150 350 30 30 30 30 30 100 150 400
I need to write a Unix script that uses awk to find whatever column is entered as input and display it on the screen. I have successfully awked specific columns, but I can't seem to figure out how to make it change based on different columns. My instructor will simply pick a column for the test data, and my program needs to find that column.
What I was trying was something like:
#!/bin/sh
awk '{print $(I dont know what goes here)}' testdata.txt
EDIT: Sorry, I should have been more specific: he is entering the header name as the input, for example "H3". Then it needs to awk that column.
I think you are just looking for:
#!/bin/sh
awk 'NR==1{ for( i = 1; i <= NF; i++ ) if( $i == header ) col=i }
{ print $col }' header=${1?No header entered} testdata.txt
This makes no attempt to deal with a column header that does not appear
in the input. (Left as an exercise for the reader.)
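Taking up that exercise, here is one hedged way to bail out when the requested header is not found; it keeps the same assumptions as the answer above (the header names are on line 1 of testdata.txt, and writing to /dev/stderr works in gawk and most modern awks):
#!/bin/sh
# exit with an error if the requested header does not appear on the first line
awk 'NR==1 { for (i = 1; i <= NF; i++) if ($i == header) col = i
             if (!col) { print "no such column: " header > "/dev/stderr"; exit 1 } }
     { print $col }' header="${1?No header entered}" testdata.txt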
Well, your question is quite diffuse, and in principle you want someone else to write your awk script... You should check the man pages for awk; they are quite descriptive.
My 2 cents' worth, as an example (for selecting rows):
myscript.sh:
#!/bin/sh
cat $1 | awk -v a=$2 -v b=$3 '{if ($(a)==b){print $0}}';
If you just want a column:
#!/bin/sh
cat $1 | awk -v a=$2 '{print $(a)}';
Your invocation would be:
myscript.sh file_of_data col_num
Again, reiterating: please study the man pages of awk. Also, when asking a question, present what you have tried (code) and the errors (logs). This will make people more ready to help you.
Your line format has a lot of variation (in the number of fields). That said, what about something like this:
echo "Which column name?"
read column
case $column in
(H1) N=2;;
(H2) N=3;;
(H3) N=4;;
...
(*) echo "Please try again"; exit 1;;
esac
awk -v N=$N '{print $N}' testdata.txt
