I’m always interested in bits about the performance of specific types of SQL queries, so I was curious when I came across Bill Schneider‘s post about PostgreSQL performance of “where exists”. He was comparing the performance of two different queries that provide the same results, so I took a look at some of my data and converted them to fit data that I already had:
Query 1
select course_abbr from courses_basetable where exists ( select 1 from editions where editions.course_abbr = courses_basetable.course_abbr )
Query 2
select distinct(editions.course_abbr) from editions join courses on editions.course_abbr = courses.course_abbr
Both of these queries return the same list of course abbreviations, but perform slightly different. To avoid the suspense I’m going to give away the ending now: Query 1 runs faster than Query 2 (tested using PostgreSQL 7.4.6), but for Bill Query 2 ran faster (using PostgreSQL 7.3.x). For my tests Query 1 took an average time of 29 ms and Query 2 took an average of 31 ms. So what, 2 ms you say, I throw that much CPU time away several times a minute! The absolute times don’t really matter, the percentage difference does. In this case Query 1 is more than 5% faster than Query 2.
To find out more about why these two queries perform so differently I ran them both using EXPLAIN ANALYZE to see what was going on. It should be rather obvious why Query 2 takes longer after seeing the EXPLAIN ANALYZE results:
Query 1
Seq Scan on courses_basetable (cost=0.00..35.82 rows=19 width=8) (actual time=0.369..9.485 rows=36 loops=1) Filter: (subplan) SubPlan -> Seq Scan on editions (cost=0.00..1.81 rows=2 width=0) (actual time=0.207..0.207 rows=1 loops=38) Filter: ((course_abbr)::text = ($0)::text)
Query 2
Unique (cost=6.06..6.38 rows=36 width=8) (actual time=4.697..5.252 rows=36 loops=1) -> Sort (cost=6.06..6.22 rows=65 width=8) (actual time=4.682..4.849 rows=65 loops=1) Sort Key: editions.course_abbr -> Hash Join (cost=1.47..4.10 rows=65 width=8) (actual time=1.652..3.357 rows=65 loops=1) Hash Cond: (("outer".course_abbr)::text = ("inner".course_abbr)::text) -> Seq Scan on editions (cost=0.00..1.65 rows=65 width=8) (actual time=0.017..0.462 rows=65 loops=1) -> Hash (cost=1.38..1.38 rows=38 width=8) (actual time=1.034..1.034 rows=0 loops=1) -> Seq Scan on courses_basetable (cost=0.00..1.38 rows=38 width=8) (actual time=0.106..0.510 rows=38 loops=1)
Query 2 has more work to than Query 1. If anyone else has other data points for this query let me know.
On a side note, Blogger really needs to support TrackBack, I’m tired of giving a response in a post and then having to leave a comment to let the person know where to find it.