Last time I made the case that it's a good idea to check your input file prior to a migration. I wrote about how you can use the uniq command to check for duplicate lines in your input files in "Using GNU tools to quickly check your input files - duplicates lines".
But maybe you don't want to compare complete lines, but only check whether certain field combinations (e.g. key fields!) appear more than once.
Let's say you have a file like this:
ABC;XYZ;MATNR;DBBD;LGORT;SOMETHIG_ELSE
12121;13213;MAT12;dfhsf;1000;sdfsdjhf
1sad21;13213;MAT12;dfhsf;1000;sdfsdjsadhf
12121;13213;MAT12;;1200;sdfsdjhf
121;13213;MAT45;;1200;sdfsdjhf
-> Each line is clearly unique. However, if MATNR and LGORT are key fields, then we have a problem: the first two data lines share the same combination (MAT12, 1000).
We can find out with the help of awk (I'm using gawk):
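cat [filename] | gawk -F";" "{print $3, $5}"
-> It reads the file, interprets “;” as the field separator (-F";"), then prints the 3rd and 5th fields ($3, $5). The comma in the print statement inserts the "output field separator" (OFS), which by default is a single space. (In a Unix shell, put the program in single quotes instead, '{print $3, $5}', so that the shell does not expand $3 and $5 itself.)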
So the output in the example is (the header line, which comes out as MATNR LGORT, is left out here):
MAT12 1000
MAT12 1000
MAT12 1200
MAT45 1200
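To try this step yourself, here is a minimal sketch for a Unix shell. It assumes plain awk is available (only standard awk features are used, so gawk works the same way); the temporary file is just a stand-in for your real input file, and NR > 1 is my addition to skip the header line.

```shell
# Temporary stand-in for the real input file, filled with the example data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC;XYZ;MATNR;DBBD;LGORT;SOMETHIG_ELSE
12121;13213;MAT12;dfhsf;1000;sdfsdjhf
1sad21;13213;MAT12;dfhsf;1000;sdfsdjsadhf
12121;13213;MAT12;;1200;sdfsdjhf
121;13213;MAT45;;1200;sdfsdjhf
EOF

# NR > 1 skips the header line. The single quotes keep the shell from
# interpreting ";" and from expanding $3 and $5 before awk sees them.
keys=$(awk -F';' 'NR > 1 { print $3, $5 }' "$sample")
echo "$keys"
```

This prints the same four key lines as listed above.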
-> As we now only have the key values we wanted to compare, we can sort them and pipe them into the uniq -d we already know to see if there are any duplicates (uniq only compares neighbouring lines, so the sort matters).
(And as this might be a lot of lines, we just count them with wc -l.)
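One thing to watch out for: uniq -d only reports duplicates that sit on neighbouring lines. The following sketch (Unix shell, with a made-up key list in which the duplicate lines are not adjacent) shows the difference a sort makes.

```shell
# Made-up key list: the two "MAT12 1000" lines are separated by another key.
keys='MAT12 1000
MAT45 1200
MAT12 1000'

# Without sorting, uniq never sees the duplicates next to each other.
unsorted=$(printf '%s\n' "$keys" | uniq -d | wc -l | tr -d ' ')

# After sorting, equal keys are adjacent and uniq -d reports them once.
sorted=$(printf '%s\n' "$keys" | sort | uniq -d | wc -l | tr -d ' ')

echo "without sort: $unsorted, with sort: $sorted"
```

Here the unsorted variant counts 0 duplicated keys and the sorted one counts 1. In the example file above the duplicates happen to be adjacent already, but real input files rarely are that kind.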
So here is our one-liner for this task:
cat [filename] | gawk -F";" "{print $3, $5}" | sort | uniq -d | wc -l
(-> if it’s 0, everything is fine!)
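If the result is not 0, you will probably want to know which key combinations are affected. One possible sketch, again for a Unix shell with plain awk, counts every MATNR/LGORT pair in an awk array and prints only the pairs seen more than once. The function name report_dup_keys is made up for this example, and the field numbers 3 and 5 as well as the skipped header line are assumptions taken from the sample file.

```shell
# report_dup_keys is a hypothetical helper name, not part of any tool.
# Usage: report_dup_keys FILE
report_dup_keys() {
    awk -F';' '
        # Skip the header line and count each combination of field 3 and 5.
        NR > 1 { count[$3 " " $5]++ }
        END {
            # Print every key that occurred more than once, with its count.
            for (key in count)
                if (count[key] > 1)
                    print key, count[key]
        }
    ' "$1"
}
```

Called as report_dup_keys [filename] on the example file, this prints MAT12 1000 2. No sort is needed here because the array collects equal keys regardless of where they appear, but the order of the output lines is not guaranteed when several keys are duplicated.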