FM 13: Anti-deduping, part 2

In part one, we explored various “anti-deduping” techniques. As you may recall, the challenge was to retain duplicates and omit unique entries from within an existing found set, as opposed to starting from all records… otherwise we could have just searched on ! (find all duplicates), but the ! operator does not play nicely with constrain, and cannot help us with this particular challenge.

Update 9 May 2015: it turns out that there is a way to reliably use contrain with the ! operator from within a found set — see Ralph Learmont’s technique + great demo + explanation here — Successfully Find Duplicate Values Within A Set Of Records

Also, last time, we were seeking to omit unique zip (postal) codes, which, in the example file, were always five digits long, i.e., of fixed length. Today our goal is to omit unique (a.k.a. “orphaned”) first names, and it turns out there are some additional challenges when the field in question is of variable length, which we will explore in today’s demo file: Anti-deduping, part 2.

Question: if we say we want to locate “all the Marias”, what do we really mean? Here are some possibilities…

12-8-2014 9-46-46 AM

…and you can easily try these searches out for yourself in the demo, like so:

12-7-2014 4-52-01 PM

What you’ll notice, moving from top to bottom, is that each search is more restrictive than its predecessor, and I think most of us would agree that an exact field match search (==maria) is the one we want. To be clear, we will not be using searches as part of our anti-deduping strategy, but the above exercise serves as a reminder that “Marian”, “Marianne” and “Maria” (for example) are not the same name and should not be mistakenly identified as duplicates.

Of course we want our anti-deduping technique to be both accurate (no false positives or false negatives) and fast, and we’re going to look at three methods, using:

FilterValues
PatternCount
Position

As in part one, all of today’s approaches begin by using a summary list field…

12-8-2014 11-55-49 AM

…to build a list of first names in the current found set…

12-8-2014 11-57-18 AM

…which is then pushed into a global variable named $$summaryList.

12-8-2014 11-59-37 AM

Filter Values vs. PatternCount

Let’s start by searching the FirstName field for “marian”.

12-8-2014 1-47-42 PM

Note that 26 records are returned, but only the first two are exact matches.

12-8-2014 11-50-44 AM

Here are the first names for these 26 records, aggregated in the summary list…

12-8-2014 1-51-40 PM

…and here’s how FilterValues evaluates the first row:

12-8-2014 1-55-31 PM

As you may recall from part one, FilterValues is accurate but slow. PatternCount is considerably faster, so perhaps we can use it instead?

12-8-2014 2-07-25 PM

The first attempt doesn’t meet our needs, because it’s counting all occurrences of the text string “Marian” regardless of where they fall within the FirstName field.

Pay close attention to the next test, because depending on how your data is ordered (I won’t say sorted, because what we’re about to see can happen whether sorted or not), it may or may not yield accurate results. First let’s sort by last name, make the first “Marian” record active, and give it a shot:

12-8-2014 2-24-48 PM

Hmmm… looks good doesn’t it? But that’s what’s so insidious about this method — it works under many, but not all, circumstances. Let’s unsort, make sure one of the “Marian” records is active, and try again.

12-8-2014 2-29-58 PM

What the heck is going on? Well, it turns out that when the same first name occurs more than once in sequence, the hard-return between them is not double counted by PatternCount.

It may help to visualize the first few lines of ¶$$summaryList¶ as “¶Marian¶Marian¶”. We’ve instructed PatternCount to look for “¶Marian¶”, which consumes the first eight characters, leaving “Marian¶” which of course does not equal “¶Marian¶”. In other words…

PatternCount ( "¶Marian¶Marian¶" ; "¶Marian¶" ) = 1

So, is there a way to trick PatternCount into yielding a correct answer even when the Marians occur sequentially? Yes, there is…

12-8-2014 2-51-04 PM

…and here it is:

12-8-2014 5-00-00 PM

In case it’s not clear, we’re using Substitute to convert each ¶ inside $$summaryList to ¶¶, so now the first few lines of the x variable look like “¶Marian¶¶Marian¶” and…

PatternCount ( "¶Marian¶¶Marian¶" ; "¶Marian¶" ) = 2

We’ll do some speed comparisons below, but for the time being, you can take my word for it that PatternCount is considerably faster than FilterValues… and (given a found set of sufficient size) Position is considerably faster than PatternCount.

Position

As we saw in part one, when we use the previous two methods, we count all the occurrences of the field entry in question and consider an entry to be unique if the count = 1. By contrast, with Position we don’t actually have to count the values, just test to see if there’s a second occurrence. (There may be numerous occurrences, but we don’t need to waste processor time finding out how many.)

This time let’s begin by finding all the Marias.

0001

As with PatternCount, the most straight-forward use of Position can produce false positives…

0003

…and this approach has the same problems as its PatternCount predecessor.

0004

Fortunately the same trick that made PatternCount viable can be used here.

0002

Performance

Let’s generate a found set of 10K records, and compare the three approaches — and to ensure that caching isn’t affecting the results, let’s restart FileMaker between each test. Also let’s make sure the records are unsorted (we examined the effects of sorting and a work around to mitigate those effects in part one).

12-8-2014 9-08-55 PM

Here are the results I get with a 10K starting found set on a local file:

FilterValues: 21 seconds
PatternCount: 10 seconds
Position: 6 seconds

Here are the results I get with the same found set accessing the file across a WAN:

FilterValues: 23 seconds
PatternCount: 13 seconds
Position: 8 seconds