Channel: Jive Syndication Feed

GNU Tools for checking input files: using awk to check for duplicate keys


Last time I made the case that it's a good idea to check your input file prior to a migration. I wrote about how you can use the uniq command to check for duplicate lines in your input files in "Using GNU tools to quickly check your input files - duplicates lines".


But maybe you don't want to compare complete lines, but only check whether certain field combinations (e.g. key fields!) appear more than once.

 

Let's say you have a file like this:

 

 

ABC;XYZ;MATNR;DBBD;LGORT;SOMETHIG_ELSE

 

12121;13213;MAT12;dfhsf;1000;sdfsdjhf

1sad21;13213;MAT12;dfhsf;1000;sdfsdjsadhf

12121;13213;MAT12;;1200;sdfsdjhf

121;13213;MAT45;;1200;sdfsdjhf

 

 

-> each line is clearly unique. However, if MATNR and LGORT are the key fields, then we have a problem: the first two data lines share the key MAT12 / 1000.

 

We can find out with the help of awk (I'm using gawk):

 

 

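cat [filename] | gawk -F';' '{ print $3, $5 }'

 

-> it reads the file, interpreting ";" as the field separator (-F';'), then prints the 3rd and 5th field ($3, $5). The comma in the print statement inserts the "output field separator" (OFS), which by default is a single space. The single quotes keep a Unix shell from expanding $3 and $5 itself; on the Windows command prompt, double quotes are typically needed instead.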

 

So the output in the example is:

MAT12 1000

MAT12 1000

MAT12 1200

MAT45 1200
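If you want to try the extraction end to end on the sample data, here is a small sketch. The temporary file, the variable names, the NR > 1 condition (which assumes the first line is a header) and the "|" output separator are assumptions for illustration only; plain awk is used, and gawk should behave the same for these features.

```shell
# Write the sample data from above to a temporary file (location is illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC;XYZ;MATNR;DBBD;LGORT;SOMETHIG_ELSE
12121;13213;MAT12;dfhsf;1000;sdfsdjhf
1sad21;13213;MAT12;dfhsf;1000;sdfsdjsadhf
12121;13213;MAT12;;1200;sdfsdjhf
121;13213;MAT45;;1200;sdfsdjhf
EOF

# NR > 1 skips the header line (assumption: line 1 is a header).
# OFS is set to "|" so the two key parts stay visibly separated.
keys=$(awk -F';' -v OFS='|' 'NR > 1 { print $3, $5 }' "$sample")
printf '%s\n' "$keys"

rm -f "$sample"
```

Setting OFS explicitly is optional; it mainly helps when the key values themselves may contain spaces.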

 

 

-> as we now only have the key-values we wanted to compare, we can easily pipe it into the uniq -d we already know to see if there are any duplicates.

(and as this might be a lot of lines, we just count them with wc -l)
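One caveat is worth checking before relying on the count: uniq only compares neighbouring lines, so duplicate keys that are not directly below each other go unnoticed unless the input is sorted first. The sketch below uses three made-up key lines (not taken from the sample file) to show the difference.

```shell
# Three key lines where the duplicate pair is NOT adjacent (made-up data).
keys='MAT12 1000
MAT45 1200
MAT12 1000'

# Without sort, no two neighbouring lines are equal, so uniq -d prints nothing.
unsorted=$(printf '%s\n' "$keys" | uniq -d | wc -l | tr -d ' ')

# With sort, equal keys end up next to each other and are detected.
sorted=$(printf '%s\n' "$keys" | sort | uniq -d | wc -l | tr -d ' ')

echo "without sort: $unsorted, with sort: $sorted"
```

The tr -d ' ' only strips the padding that some wc implementations put in front of the number.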

 

So here is our one-liner for this task (with a sort in front of uniq -d, because uniq only compares adjacent lines):

 

cat [filename] | gawk -F';' '{ print $3, $5 }' | sort | uniq -d | wc -l

 

(-> if it’s 0, everything is fine!)
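If the count is not 0, you will probably want to know which keys are affected. One possible alternative, sketched here as an assumption rather than as part of the original recipe, is to let awk count the keys itself in an associative array, which also avoids the need to sort. The array name seen, the ";" used to join the key parts, and the NR > 1 header skip are illustrative choices.

```shell
# Write the sample data from above to a temporary file (location is illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC;XYZ;MATNR;DBBD;LGORT;SOMETHIG_ELSE
12121;13213;MAT12;dfhsf;1000;sdfsdjhf
1sad21;13213;MAT12;dfhsf;1000;sdfsdjsadhf
12121;13213;MAT12;;1200;sdfsdjhf
121;13213;MAT45;;1200;sdfsdjhf
EOF

# seen[key] counts how often each MATNR/LGORT combination occurs
# (header line skipped); the END block prints only keys seen more than once,
# followed by their count.
dups=$(awk -F';' 'NR > 1 { seen[$3 ";" $5]++ }
    END { for (k in seen) if (seen[k] > 1) print k, seen[k] }' "$sample")
printf '%s\n' "$dups"

rm -f "$sample"
```

For the sample file this reports the single duplicated key MAT12;1000 with a count of 2. With several duplicated keys, the order of the output lines is not guaranteed, so pipe the result through sort if you need a stable listing.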

