FM 13: Anti-deduping, part 3

Today we have two great examples of using demos from this site as a starting point and making significant improvements: one from Malcolm Fitzgerald in response to FM 13: Anti-deduping, part 1, and one from Ralph Learmont in response to FM 13: Anti-deduping, part 2.

As you may recall the challenge was to retain duplicates and omit unique entries from within an existing found set, as opposed to starting from all records… otherwise we could have just searched on ! (find all duplicates), but since the ! operator does not play nicely with constrain, it was apparently not an option for this particular challenge.

Or so I thought. These gentlemen convincingly prove otherwise — and like all great techniques, what you’re about to see has the potential to be useful in a variety of situations, not just the narrow confines of this particular challenge.

Malcolm Fitzgerald – Example #1

4-25-2015 5-06-32 PM

Demo file:  MF-anti-deduping.zip

In this demo we are looking to anti-dedupe based on zip code, and when the file is run locally with all 20K records visible, here are timings for the four methods we looked at in part 1:

  • GetNthRecord: 89 seconds
  • FilterValues: 83 seconds
  • PatternCount: 16 seconds
  • Position: 7 seconds

Well, folks, we have a new winner for local performance, because Malcolm’s “bump into temporary table” approach clocks in at a blazing 4 seconds.

The performance boost is due to two clever insights:

  1. If you import the found set into an empty temporary table, so the table consists of only the records in the found set, then you can use the ! operator to isolate the duplicates, and then use Go To Related Records to assemble the found set in the main table.
  2. You can defer the cleanup of the temporary table till the user isn’t paying attention, rather than making them wait for it to happen before displaying the results.

Here’s the script, with Malcolm’s original version shown for clarity — the version in the demo has been expanded to optionally restore the starting found set.

4-20-2015 12-57-25 PM

I emailed Malcolm…

Wow. On my system it's almost twice as fast as my Position method on 20K records (4 seconds vs 7 seconds). The technique really shows its worth on the larger found sets, because with 10K records, it appears to perform about the same as Position, but with 20K significantly faster.

He replied:

It was Bruce Robertson who put me onto export/import years back.

Doing the cleanup first, which is needed for the script to perform properly added extra time. My initial idea was that the house-keeping could be done at any time. When didn’t matter, because it was very low priority. if it wasn’t done the script would delete the records itself. It could be triggered by OnWindowClose, for example, but that’s when I thought that I’d use an OnTimer call.

Using onTimer allows gives control back to the user sooner. The user perception is that everything is done. The fact that another process then runs in the background is invisible to them. Also, I think that users respond comparatively slowly. They get the custom dialog, OK it and then move on to do something else. The decision making process after OK-ing the message may take seconds and in that time the clean-up script has run.

Ralph Learmont – Example #2

Update 9 May 2015: don’t miss Ralph’s simplification of this technique + great demo + explanation here — Successfully Find Duplicate Values Within A Set Of Records

4-24-2015 3-24-21 PM

Demo file:  RL-anti-deduping.zip

As you may recall from part 2, the challenge here is to omit unique first names within the current found set.

Ralph writes:

I used your demo file part 2 and added my technique as an extra script. It uses a button I added at the right hand side. Within my script, there’s this covering explanation:

The exclamation "!" operator finds all duplicates in the whole TABLE, not in the FOUND SET.

Details: There's a quirk in the way Filemaker deals with duplicate records. This makes it difficult to find duplicate records WITHIN a found set of records. If you try to constrain the Find to the current set of records, you might discover extra spurious records appearing. These records have "partner-duplicates" outside of the found set. These unwanted spurious records will appear as single occurrences when you inspect a sorted column. Technically they are duplicates. It's just that their partners lie "outside", hidden in the omitted slab of records.

This technique overcomes that problem. It’s able to reveal duplicates which exist in the current found set of records.

4-24-2015 4-44-44 PM

A bit of background…. Back when I did this a few months ago, I almost had it nailed, but not quite. I just felt there was a more native way of getting the results we need, based basically on Filemaker’s ability to fairly QUICKLY find duplicates — it was just a pity it looked at the whole table rather than just the found set!

When it finally worked, I was ecstatic because it really is extremely fast, even though the field it searches is unstored. The script is quite simple too. (I’ve not tested on a network.)

This is a superb example of outside-the-box problem solving, and the runtime speed is nothing short of astonishing — 20K records processed locally in 1 second, as opposed to 15 seconds for the Position method, which was the best performer in part 2 of this series.

I’m not going to pretend I understand why it works, but this much is obvious: there is an unstored calculation echoing the FirstName field when the window name = “Temp”.

4-24-2015 4-15-51 PM

It appears that the records that are outside the found set are to some degree unaware that the window has been renamed — as suggested by the field name “FoundSetAware” — at least as far as the constrain on ! is concerned.

4-25-2015 3-16-11 PMTo explore this behavior here are some tests you can do.

  1. Start from a found set of 10K records (so your found set is smaller than the total record count)
  2. Open the script debugger, run the script to the point that the window has been renamed to “Temp”, then cancel the script and exit the debugger
  3. Do a find ! on FoundSetAware and note that your starting found set of 10K was not respected
  4. Reset your found set to 10K records
  5. Now do a constrain ! on FoundSetAware and note that the number of records returned is considerably smaller, i.e., your starting found set was respected, and anti-deduping only occurred within that found set.

So much for anti-deduping… Ralph points out that the technique can be used for standard deduping as well:

I also added a script which does the reverse…

4-24-2015 3-40-41 PM

It finds all the non-duplicate values, in case that’s of interest.

4-24-2015 4-04-09 PM

An Experiment

Demo file:  RL-anti-deduping-KF-experiment.zip

Can we simplify things slightly? What if we re-define FountSetAware like so…

4-25-2015 4-23 PM

…and then modify the salient portion of the script?

4-25-2015 4-13-37 PM

Initial testing suggests that this is a viable alternative.

Conclusion

I am too busy shaking my head in astonishment to say anything more, except thank you Malcolm and Ralph for sharing these amazing demos.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s