Identifying and removing duplicates
I have a table that contains duplicates of some rows on all fields except the transaction ID. The table is quite large (1/2 gig) and there are several columns in each row (32).
I am using the following coding to identify and remove the duplicates:-
ALTER TABLE jobtransactionactuallabour ADD COLUMN duplicate TINYINT(1)NOT NULL DEFAULT '0';
UPDATE jobtransactionactuallabour SET duplicate=if(@SI_JobYearNumber=SI_JobYearNumber and @I_JobNumber=I_JobNumber and @DT_ActualDate=DT_ActualDate and @I_Hours1=I_Hours1 and @C_OperationCode=C_OperationCode and @C_EmployeeCode=C_EmployeeCode,1,0), SI_JobYearNumber=(@SI_JobYearNumber:=SI_JobYearNumber), I_JobNumber=(@I_JobNumber:=I_JobNumber), DT_ActualDate=(@DT_ActualDate:=DT_ActualDate), I_Hours1=(@I_Hours1:=I_Hours1), C_OperationCode=(@C_OperationCode:=C_OperationCode), C_EmployeeCode=(@C_EmployeeCode:=C_EmployeeCode) ORDER BY SI_JobYearNumber, I_JobNumber, DT_ActualDate, I_Hours1, C_OperationCode, C_EmployeeCode;
DELETE FROM jobtransactionactuallabour WHERE duplicate=1;
ALTER TABLE jobtransactionactuallabour DROP COLUMN duplicate;
On first running this doesn't work but sometimes, on subsequent runs, it achieves the correct result.
It would appear that the ORDER BY clause in the UPDATE doesn't always get the data in the correct order to identity duplicates.
I am at a loss to identify what it is that is causing the variable results and how to change my coding to achieve a consistent result each time.
Is there anyone that can shed some light on this problem?
Answer Answered by Ian Gilfillan in this comment.
It appears the values in all of your ORDER BY fields will be the same for more than one row, so your example does not specify a determinate order.
But since you have a unique ID, there's a much easier way to do this, without adding a new fields, or updating anything. See Removing Duplicates.