Demo file: sql-multi-table-query-optimization-v2.zip
Recently a colleague requested help with a SQL query that was performing slowly. I wrote back:
1. Make sure there are no records open locally, i.e., on your machine, in the tables you are querying (for more information see this article by Wim Decorte: ExecuteSQL – The Good, The Bad & The Ugly)
2. Only query stored values — if you’re querying a single table it should be fast
3. If you’re querying multiple tables, make sure to construct the query to run optimally
Well, today we’re going to take a look at #3 and see what we can learn about optimizing multi-table queries. But first, I should point out that what follows reflects my current understanding of how things work. I hope this can be the start of a conversation where others will chime in and share their knowledge.
Let’s take a look at the demo file. Here are the table definitions, with over 100K sales records and 4.6K customer records, as well as a states table with one entry for each of the 50 U.S. states. The remaining tables (types and dev) don’t concern us today.
Here is the relationships graph….
…and just a reminder that SQL doesn’t care about joins on the graph; all of today’s SQL examples would be equally valid if the graph looked like this.
Note 1: I’ve intentionally avoided using reserved SQL words as table or field names in the demo file to help keep the example syntax clean. If you’re curious, you can see a complete list of FileMaker’s reserved SQL words here — https://support.claris.com/s/article/Reserved-words-in-FileMaker-Pro-1503693036814?language=en_US
Note 2: the demo is self-updating, and on startup will generate sales data as necessary to bring entries current as of yesterday’s date.
Note 3: all the examples in today’s article are intended to be run in the data viewer. At the risk of stating the obvious, you should keep the “Automatically evaluate” box unchecked in the data viewer unless you know that the query you are about to run is safe (i.e., is well constructed, and that it won’t cause FileMaker to lock up for an inordinate amount of time).
Okay, time for our first challenge. We’d like to produce a mini-report of aggregated sales data, summarized by “st” (state abbreviation), for a single date, like so…
…and we can produce this report by pasting this code into the data viewer.
ExecuteSQL ( " SELECT customers.st, SUM ( sales.amt_sale ) FROM customers, sales WHERE sales.id_customer = customers.id AND sales.date_sale = ? GROUP BY customers.st " ; " • " ; "" ; Date(1;20;2022) )
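If you’d like to experiment with the same query shape outside FileMaker, here’s a rough sketch in Python using SQLite. The table and field names mirror the demo file, but the handful of sample rows are invented for illustration, and SQLite’s query planner behaves differently from FileMaker’s SQL engine, so treat this as a logic check only, not a performance comparison.

```python
import sqlite3

# Minimal SQLite sketch of the same query shape. Table/field names mirror
# the demo file; the sample rows below are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, st TEXT);
    CREATE TABLE sales (id_customer INTEGER, date_sale TEXT, amt_sale REAL);
    INSERT INTO customers VALUES (1, 'CA'), (2, 'CA'), (3, 'NY');
    INSERT INTO sales VALUES
        (1, '2022-01-20', 100), (2, '2022-01-20', 50),
        (3, '2022-01-20', 75),  (1, '2022-01-21', 999);
""")

# The "?" placeholder plays the same role as ExecuteSQL's bound parameter.
rows = con.execute("""
    SELECT customers.st, SUM(sales.amt_sale)
    FROM customers, sales
    WHERE sales.id_customer = customers.id AND sales.date_sale = ?
    GROUP BY customers.st
    ORDER BY customers.st
""", ("2022-01-20",)).fetchall()
print(rows)  # [('CA', 150.0), ('NY', 75.0)]
```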
Actually, let’s live a little dangerously and check the “Automatically evaluate” box in the data viewer. Now change the date and note that it takes a second or two to update.
Caution: I’m suggesting you change a few specific bits of syntax while you have “Automatically evaluate” checked. If you make other changes to the SQL syntax, you may lock FileMaker up for a long time while it attempts to perform the query. When in doubt, uncheck that box.
Is there anything we can do to improve performance? Yes there is… we can swap the order of the two tables in the FROM clause, like so:
ExecuteSQL ( " SELECT customers.st, SUM ( sales.amt_sale ) FROM sales, customers WHERE sales.id_customer = customers.id AND sales.date_sale = ? GROUP BY customers.st " ; " • " ; "" ; Date(1;20;2022) )
Again, with the “Automatically evaluate” box checked, change the date, and notice how much faster the updated result appears. Why does this one little change make a difference? Here’s what I think is going on: by putting sales before customers in the FROM clause we’re instructing the SQL engine to…
- First gather the sales records for the specified date
- Then join that set of sales records with the corresponding set of customers
Ideally in a multi-table query we want to start from the smallest possible set of matching rows, and proceed from there, but in the first query we told FileMaker’s SQL engine to…
- First join all 100K sales records with the corresponding set of customers
- Then filter down the sales records to just those matching the specified date
…in other words, in the first query we forced the SQL engine to do a lot more work.
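To make that reasoning concrete, here’s a back-of-the-envelope Python sketch. This is my own illustration of a naive nested-loop join, not FileMaker’s actual engine, and the row counts are toy stand-ins, but it shows how filtering the big table first slashes the number of row comparisons.

```python
# Toy illustration (not FileMaker's actual engine): count the row
# comparisons a naive nested-loop join would make under each ordering.
customers = [{"id": i, "st": "CA" if i % 2 else "NY"} for i in range(100)]
sales = [
    {"id_customer": i % 100,
     "date_sale": "2022-01-20" if i % 10 == 0 else "2022-01-21",
     "amt_sale": 10}
    for i in range(1000)
]
target = "2022-01-20"

# Order 1: join all sales to all customers first, filter by date afterward.
comparisons_join_first = len(sales) * len(customers)

# Order 2: filter sales down to the target date first, then join.
matching = [s for s in sales if s["date_sale"] == target]
comparisons_filter_first = len(sales) + len(matching) * len(customers)

print(comparisons_join_first)    # 100000
print(comparisons_filter_first)  # 11000
```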
What if we use proper JOIN syntax instead of sneaking the join logic into the WHERE clause? If we put customers first in the FROM clause, performance will be similar to what we saw in the first example (i.e., somewhat sluggish).
ExecuteSQL ( " SELECT customers.st, SUM ( sales.amt_sale ) FROM customers JOIN sales ON sales.id_customer = customers.id WHERE sales.date_sale = ? GROUP BY customers.st " ; " • " ; "" ; Date(1;20;2022) )
But, if we instead put sales in the FROM clause, things are noticeably zippier.
ExecuteSQL ( " SELECT customers.st, SUM ( sales.amt_sale ) FROM sales JOIN customers ON sales.id_customer = customers.id WHERE sales.date_sale = ? GROUP BY customers.st " ; " • " ; "" ; Date(1;20;2022) )
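In terms of results (as opposed to speed), the implicit and explicit forms are interchangeable. Here’s a small Python/SQLite check, with invented sample data and SQLite standing in for FileMaker’s SQL engine, confirming that the WHERE-style and JOIN … ON versions return identical rows.

```python
import sqlite3

# Sample data invented for illustration; SQLite stands in for FileMaker's
# SQL engine only to show the two syntaxes are logically equivalent.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, st TEXT);
    CREATE TABLE sales (id_customer INTEGER, date_sale TEXT, amt_sale REAL);
    INSERT INTO customers VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO sales VALUES (1, '2022-01-20', 100), (2, '2022-01-20', 50);
""")

q_where = """
    SELECT customers.st, SUM(sales.amt_sale)
    FROM sales, customers
    WHERE sales.id_customer = customers.id AND sales.date_sale = ?
    GROUP BY customers.st ORDER BY customers.st
"""
q_join = """
    SELECT customers.st, SUM(sales.amt_sale)
    FROM sales JOIN customers ON sales.id_customer = customers.id
    WHERE sales.date_sale = ?
    GROUP BY customers.st ORDER BY customers.st
"""
rows_where = con.execute(q_where, ("2022-01-20",)).fetchall()
rows_join = con.execute(q_join, ("2022-01-20",)).fetchall()
print(rows_where == rows_join)  # True
```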
[To avoid getting sidetracked, I haven’t attempted to format the sales values as currency in today’s examples. If that’s something you’d like to do, check out Fun with SQL Joins and Currency Formatting.]
Next, let’s look at a more complicated challenge… this time we’d like to summarize sales by state and customer, include the full state name, rather than the two-character abbreviation, and aggregate sales across a date range rather than for a single date. So, our query will need to interrogate three tables to produce a little report like this.
Given that the result columns appear in this order (state, customer, sales), it might seem reasonable to reflect that sequence in the FROM clause…
ExecuteSQL ( " SELECT states.state, customers.customer, SUM ( sales.amt_sale ) FROM states, customers, sales WHERE sales.date_sale >= ? AND sales.date_sale <= ? AND sales.id_customer = customers.id AND customers.st = states.st GROUP BY states.state, customers.customer " ; " • " ; "" ; Date(1;25;2020) ; Date(1;31;2020) )
…and performance isn’t too bad, but not as zippy as it could be.
How about we construct the FROM clause to go from sales through customers to states (basically echoing the relational structure on the graph)?
ExecuteSQL ( " SELECT states.state, customers.customer, SUM ( sales.amt_sale ) FROM sales, customers, states WHERE sales.date_sale >= ? AND sales.date_sale <= ? AND sales.id_customer = customers.id AND states.st = customers.st GROUP BY states.state, customers.customer " ; " • " ; "" ; Date(1;25;2020) ; Date(1;31;2020) )
Now that’s more like it. Once again, we start from the smallest number of matching records, and work our way outward from there.
And this might be a good time to mention that the order of the entries in the WHERE clause appears to be irrelevant, and also irrelevant is the choice of which field goes on which side of the = sign when you join two tables (either implicitly or explicitly).
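That claim is easy to sanity-check outside FileMaker. In this Python/SQLite sketch (invented sample data, SQLite as a stand-in), three variants of the same query — predicates in a different order, and the join fields flipped around the = sign — all return identical results.

```python
import sqlite3

# Sample data invented; SQLite stands in for FileMaker's SQL engine to
# check that predicate order and the side of "=" don't change the result.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, st TEXT);
    CREATE TABLE sales (id_customer INTEGER, date_sale TEXT, amt_sale REAL);
    INSERT INTO customers VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO sales VALUES (1, '2022-01-20', 100), (2, '2022-01-20', 50);
""")

base = ("SELECT customers.st, SUM(sales.amt_sale) FROM sales, customers "
        "WHERE {} GROUP BY customers.st ORDER BY customers.st")
variants = [
    "sales.id_customer = customers.id AND sales.date_sale = '2022-01-20'",
    "sales.date_sale = '2022-01-20' AND sales.id_customer = customers.id",
    "customers.id = sales.id_customer AND sales.date_sale = '2022-01-20'",
]
results = [con.execute(base.format(v)).fetchall() for v in variants]
print(all(r == results[0] for r in results))  # True
```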
Here’s the proper “JOIN” version of the preceding.
ExecuteSQL ( " SELECT states.state, customers.customer, SUM ( sales.amt_sale ) FROM sales JOIN customers ON sales.id_customer = customers.id JOIN states ON customers.st = states.st WHERE sales.date_sale >= ? AND sales.date_sale <= ? GROUP BY states.state, customers.customer " ; " • " ; "" ; Date(1;25;2020) ; Date(1;31;2020) )
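Here’s the three-table version sketched in Python/SQLite, again with a few invented rows. Note that the date-range predicate works here because ISO date strings compare correctly with >= and <=; in FileMaker, the bound Date(…) parameters handle that for you.

```python
import sqlite3

# Three-table version sketched in SQLite (sample rows invented). ISO date
# strings compare correctly with >= and <=, mirroring the range query.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE states (st TEXT, state TEXT);
    CREATE TABLE customers (id INTEGER, customer TEXT, st TEXT);
    CREATE TABLE sales (id_customer INTEGER, date_sale TEXT, amt_sale REAL);
    INSERT INTO states VALUES ('CA', 'California'), ('NY', 'New York');
    INSERT INTO customers VALUES (1, 'Acme', 'CA'), (2, 'Bolt', 'NY');
    INSERT INTO sales VALUES
        (1, '2020-01-25', 100), (1, '2020-01-28', 50),
        (2, '2020-01-31', 75),  (2, '2020-02-05', 999);
""")

rows = con.execute("""
    SELECT states.state, customers.customer, SUM(sales.amt_sale)
    FROM sales
    JOIN customers ON sales.id_customer = customers.id
    JOIN states ON customers.st = states.st
    WHERE sales.date_sale >= ? AND sales.date_sale <= ?
    GROUP BY states.state, customers.customer
    ORDER BY states.state, customers.customer
""", ("2020-01-25", "2020-01-31")).fetchall()
print(rows)  # [('California', 'Acme', 150.0), ('New York', 'Bolt', 75.0)]
```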
Both versions are blazingly fast. I tend to prefer quietly embedding my join criteria in the WHERE clause, since the whole thing reads more like natural language and, for me at any rate, requires less brain power to understand. On the other hand, using proper JOIN syntax makes it harder to write ill-conceived queries that will bring FileMaker to its knees.
As the saying goes, YMMV (“your mileage may vary”).
5 thoughts on “SQL Multi-Table Query Optimization”
Kevin, in the demo file’s “update entries” script, steps 62-65 have unresolved references in those Set Field calcs. Is there a plug-in required or a missing custom function?
Thanks Shawn — fix coming soon.
Demo has been updated to v2. Thanks again Shawn.
It’s a great demo, but I have tried the “JOIN” syntax and there is still a limit to the performance. It starts to return very slowly after a few hundred thousand to several million records, even on the local machine.
Hello Bing, thank you for the kind words regarding the demo. I’m not surprised to learn that there’s a point at which performance becomes unacceptable, even with a well-constructed query.