Categories
Database PostgreSQL

PostgreSQL Performance of “where exists”

I’m always interested in bits about the performance of specific types of SQL queries, so I was curious when I came across Bill Schneider‘s post about PostgreSQL performance of “where exists”. He was comparing the performance of two different queries that provide the same results, so I took a look at some of my data and converted them to fit data that I already had:

Query 1

select course_abbr from courses_basetable
where exists (
  select 1 from editions
  where editions.course_abbr = courses_basetable.course_abbr
)

Query 2

select distinct(editions.course_abbr) from editions
join courses on
  editions.course_abbr = courses.course_abbr

Both of these queries return the same list of course abbreviations, but perform slightly different. To avoid the suspense I’m going to give away the ending now: Query 1 runs faster than Query 2 (tested using PostgreSQL 7.4.6), but for Bill Query 2 ran faster (using PostgreSQL 7.3.x). For my tests Query 1 took an average time of 29 ms and Query 2 took an average of 31 ms. So what, 2 ms you say, I throw that much CPU time away several times a minute! The absolute times don’t really matter, the percentage difference does. In this case Query 1 is more than 5% faster than Query 2.

To find out more about why these two queries perform so differently I ran them both using EXPLAIN ANALYZE to see what was going on. It should be rather obvious why Query 2 takes longer after seeing the EXPLAIN ANALYZE results:

Query 1

Seq Scan on courses_basetable  (cost=0.00..35.82 rows=19 width=8)
(actual time=0.369..9.485 rows=36 loops=1)
  Filter: (subplan)
  SubPlan
    ->  Seq Scan on editions  (cost=0.00..1.81 rows=2 width=0)
          (actual time=0.207..0.207 rows=1 loops=38)
          Filter: ((course_abbr)::text = ($0)::text)

Query 2

Unique  (cost=6.06..6.38 rows=36 width=8)
(actual time=4.697..5.252 rows=36 loops=1)
  ->  Sort  (cost=6.06..6.22 rows=65 width=8)
        (actual time=4.682..4.849 rows=65 loops=1)
        Sort Key: editions.course_abbr
        ->  Hash Join  (cost=1.47..4.10 rows=65 width=8)
              (actual time=1.652..3.357 rows=65 loops=1)
              Hash Cond: (("outer".course_abbr)::text = ("inner".course_abbr)::text)
              ->  Seq Scan on editions  (cost=0.00..1.65 rows=65 width=8)
                    (actual time=0.017..0.462 rows=65 loops=1)
              ->  Hash  (cost=1.38..1.38 rows=38 width=8)
                    (actual time=1.034..1.034 rows=0 loops=1)
                    ->  Seq Scan on courses_basetable  (cost=0.00..1.38 rows=38 width=8)
                          (actual time=0.106..0.510 rows=38 loops=1)

Query 2 has more work to than Query 1. If anyone else has other data points for this query let me know.

On a side note, Blogger really needs to support TrackBack, I’m tired of giving a response in a post and then having to leave a comment to let the person know where to find it.