Postgres Initial Impressions

2010/5/25 2:52

One of the driving factors for completing the WURFL Capabilities GUI was to try out Postgres...

Firstly, I admit to having the same bias as many other developers: if something is popular, then there must be better alternatives. Perl formed the backend for Amazon and eBay, so PHP must be better. PHP is supporting Digg and Facebook, which means Python is actually better suited for those applications. And of course, MySQL... It's hard to come by an explanation for MySQL's popularity. Most tech guys I've talked to seem to think that it's just a convenient backend that lucked out; its relative popularity is a total fluke.

If I could sum up my experience with Postgres in two words, they'd be: "temporary table". I'll expand on that further on; right now I should go into my system setup.

I'm doing development on a Mac (strike one?), with a custom Apache 2 build and custom PHP build living in /opt/php. Postgres offers a nice installer with a few management apps and start/stop controls, but it's only 32-bit, whereas my PHP/Apache install really, really wanted 64-bit (or vice versa, I can't remember). I'll admit, I probably didn't bump up the shared memory resources as much as I could have; I think they're around 64 megs, but I'm not throwing anything too complex at this database.
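For reference, bumping those shared memory resources means editing postgresql.conf (and, on a Mac, raising the kernel's kern.sysv.shmmax limit to match, or the server won't start). The values below are just an illustrative sketch of the relevant knobs, not tuned recommendations:

```
# postgresql.conf -- illustrative values, not recommendations
shared_buffers = 64MB     # shared memory used to cache table/index pages
work_mem = 4MB            # per-sort / per-hash memory within a query
temp_buffers = 8MB        # session-local buffers for temporary tables
```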

But back to Temporary Tables.

Here's a simplified version of the wurfl database in pseudo-sql:

devices (
  id INTEGER DEFAULT nextval(device_sequence),
  name VARCHAR(255),
  PRIMARY KEY(id)
)

capabilities (
  id INTEGER DEFAULT nextval(capability_sequence),
  name VARCHAR(255),
  PRIMARY KEY(id)
)

device_capabilities (
  device_id INTEGER,
  capability_id INTEGER,
  value VARCHAR(255),
  FOREIGN KEY(device_id) REFERENCES devices(id),
  FOREIGN KEY(capability_id) REFERENCES capabilities(id)
)
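To make the shape of the data concrete, here's a sketch of how a single device might land in this schema. The ids, names, and capability labels are made up for illustration; the point is the one-row-per-(device, capability) layout:

```sql
-- hypothetical rows, for illustration only
INSERT INTO devices (id, name) VALUES (42, 'HTC Dream');

INSERT INTO capabilities (id, name) VALUES (1, 'model_name'), (2, 'brand_name');

-- one row per (device, capability) pair: the entity-attribute-value layout
INSERT INTO device_capabilities (device_id, capability_id, value)
VALUES (42, 1, 'android'),
       (42, 2, 'htc');
```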

Being mostly versed in MySQL I am comfortable writing big, ugly queries with slews of joins. So for a simple search for a single capability, I'm inclined to write:


SELECT d.id, d.name
FROM devices d
INNER JOIN device_capabilities c1
  ON d.id = c1.device_id
  AND c1.capability_id = 1
  AND c1.value LIKE 'android'

This works pretty well in Postgres, too: the round trip from client request to backend to database and back to the client takes about three seconds. The problem arises with more criteria...

SELECT d.id, d.name
FROM devices d
INNER JOIN device_capabilities c1
  ON d.id = c1.device_id
  AND c1.capability_id = 1
  AND c1.value LIKE 'android'
INNER JOIN device_capabilities c2
  ON d.id = c2.device_id
  AND c2.capability_id = 2
  AND c2.value LIKE 'htc'

Suddenly the query takes upwards of five minutes to complete! Now, maybe there's an error I don't see, but this looks pretty standard to me.
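One way to find out where the time goes is to ask the planner directly: EXPLAIN ANALYZE runs the query and reports the actual plan with per-node timings. A sketch (the plan output will vary with your data and settings, and the suggested index is just a guess at what might help, not something I've benchmarked):

```sql
-- show the actual execution plan and timings for the slow query
EXPLAIN ANALYZE
SELECT d.id, d.name
FROM devices d
INNER JOIN device_capabilities c1
  ON d.id = c1.device_id AND c1.capability_id = 1 AND c1.value LIKE 'android'
INNER JOIN device_capabilities c2
  ON d.id = c2.device_id AND c2.capability_id = 2 AND c2.value LIKE 'htc';

-- if the plan shows sequential scans over device_capabilities,
-- an index on (capability_id, value) may be worth trying
CREATE INDEX idx_devcap_capability_value
  ON device_capabilities (capability_id, value);
```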

So my solution was to structure the queries as such:

CREATE TEMPORARY TABLE temp(d_id INTEGER, PRIMARY KEY(d_id));

INSERT INTO temp(d_id)
SELECT device_id
FROM device_capabilities c1
WHERE c1.capability_id = 1
  AND c1.value LIKE 'android';

And then:

DELETE FROM temp
WHERE d_id NOT IN (
  SELECT device_id
  FROM device_capabilities c1
  WHERE c1.capability_id = 2
    AND c1.value LIKE 'htc'
);
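To actually get the result set back to the client, the last step (not shown above) is just a join from the surviving ids back to the devices table, then dropping the scratch table. Roughly:

```sql
-- read back the devices that matched every criterion
SELECT d.id, d.name
FROM devices d
INNER JOIN temp t ON d.id = t.d_id;

-- temp tables vanish at session end anyway, but dropping
-- explicitly keeps long-lived connections tidy
DROP TABLE temp;
```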

And bingo, back down to three seconds or so. This doesn't feel like best practice: my natural inclination is to use as few database queries as possible to get a result, and this violates that rule. But the execution time makes a pretty convincing case for itself. I think the big ugly queries should work correctly; maybe I need to play around with the shared memory config and whatnot. As it stands, though, I'm going with the temporary table method initially. I wouldn't release the single-query method unless I could get good results on my test box first.
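For what it's worth, there's also a middle ground between the giant self-join and the temp-table shuffle: a single pass over device_capabilities that counts how many criteria each device satisfies. A sketch of that formulation (untested against the real WURFL data, so treat it as a starting point rather than a benchmark winner):

```sql
-- one scan of device_capabilities, counting matched criteria per device
SELECT d.id, d.name
FROM devices d
INNER JOIN device_capabilities dc ON d.id = dc.device_id
WHERE (dc.capability_id = 1 AND dc.value LIKE 'android')
   OR (dc.capability_id = 2 AND dc.value LIKE 'htc')
GROUP BY d.id, d.name
HAVING COUNT(DISTINCT dc.capability_id) = 2;  -- must match all criteria
```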

All things considered, if learning Postgres hadn't been one of the key aims of this exercise I would have used MySQL as the backend, and I may yet switch if performance issues continue to plague me.

Let me stress that I'm not writing this to put down Postgres; when you're learning something new, you tend to notice the faults first...

Update 2013/03/13: It turns out that the Postgres community advises against the Entity-Attribute-Value design pattern; the storage layer just isn't optimized to handle it. When you create temporary tables, Postgres keeps them in session-local memory buffers, hence the speed gains.