Friday, October 28, 2005

Extending Propel -- Integrating Zip Code Distance Calculation

For the past day I've been working a lot with zip code distance calculation, and now it's time to integrate my findings into my application. I've chosen to build my application around the Propel framework, so my code will be added to the application classes which Propel so graciously generates skeletons for.

We'll start out with the Event class. We want to get all events within n miles of a particular zipcode. Since this is generalized beyond the level of Event, we will place it within the EventPeer class (for more information, read Propel's documentation).

class EventPeer extends BaseEventPeer {

static function GetWithinRadius( $zip, $maxDistance ) {
$con = Propel::getConnection(self::DATABASE_NAME);

// get a list of columns that we actually care about (sort of hackish, but not really. see Select Using SQL in Propel's user guide)
$c = new Criteria();
self::addSelectColumns( $c );

$sql = "SELECT ".join( ",", $c->getSelectColumns() )." FROM ".self::TABLE_NAME." WHERE ZipDistance(?, ".self::ZIPCODE.") < ?"; $stmt = $con->prepareStatement( $sql );
$stmt->setInt( 1, $zip );
$stmt->setInt( 2, $maxDistance );

$rs = $stmt->executeQuery(ResultSet::FETCHMODE_NUM);

return self::populateObjects($rs);
}

}


Easy enough? yep. We implement UserPeer::GetWithinRadius exactly the same. Now as an exercise to show you how fun Propel can be, here is a function to get all events within n miles of a particular User:

class User extends BaseUser {
function getNearByEvents( $maxDistance ) {
$zip = $this->getZipCode();

return EventPeer::GetWithinRadius( $zip, $maxDistance );
}
}


This is ridiculously simple, isn't it? I'd like to give thanks to the inventors of stored procedures and Propel.

More Zip Code Distance Calculation... Stored Procedures

To take last night's post, Efficiently Calculating Zip Code Radius With SQL, one step farther I have created stored procedures to keep the logic entirely within the database.

Procedure Definitions

/* params = lat1, lon1, lat2, lon2, return = distance */
CREATE OR REPLACE FUNCTION LatLonDistance( float8, float8, float8, float8 ) RETURNS float8
AS 'SELECT SQRT(POW((69.1*($2-$4)*COS($3/57.3)),2)+POW((69.1*($1-$3)),2));
' LANGUAGE 'sql' RETURNS NULL ON NULL INPUT;

/* params = zip1, zip2, return = distance */
CREATE OR REPLACE FUNCTION ZipDistance( int4, int4 ) RETURNS float8
AS 'SELECT LatLonDistance( zd.lat, zd.lon,zd2.lat,zd2.lon )
FROM zipdata zd, zipdata zd2
WHERE zd.zipcode = $1
and zd2.zipcode = $2;
' LANGUAGE 'sql' RETURNS NULL ON NULL INPUT;


Usage
This is just as versitile, but much cleaner than te quries last night:

a. Getting the distance between two users:
SELECT u.username, u2.username, ZipDistance( u.zipcode, u2.zipcode ) AS distance
FROM users u, users u2
WHERE u.username='harlan' AND u2.username='zack';

b. Getting all users within 10 miles of a zipcode (55431):
SELECT u.username
FROM users u
WHERE ZipDistance( 55431, u.zipcode ) <>

Efficiently Calculating Zip Code Radius With SQL

I've been curious about how the popular website MySpace manages to pull off its "search by distance" functionality using such a huge set of data. Sofar I've thought of a few options:

1) Do calculations with every query:

SELECT u.username,
SQRT((POW((69.1*(zd.lon-zd2.lon)*COS(zd2.lat/57.3)),2)+POW((69.1*(zd.lat-zd2.lat)),2))) AS distance
FROM zipdata zd, zipdata zd2
RIGHT JOIN users u ON u.zipcode = zd2.zipcode
WHERE zd.zipcode = 55431
AND (POW((69.1*(zd.lon-zd2.lon)*COS(zd2.lat/57.3)),2)+POW((69.1*(zd.lat-zd2.lat)),2)) > 10

Surprisingly, that is a very fast query. 30ms including transit.

2) Create a table with pre-calculated values selected into it.
a.
see if the desired zipcode has been selected into the zipdistance table. If not, run the same query as above, only remove the distance limitation and insert it into zipdistance:
INSERT INTO zipdistance
SELECT zd.zipcode,
zd2.zipcode,
SQRT((POW((69.1*(zd.lon-zd2.lon)*COS(zd2.lat/57.3)),2)+POW((69.1*(zd.lat-zd2.lat)),2))) AS distance
FROM zipdata zd, zipdata zd2
WHERE zd.zipcode = 55431;
b. use zipdistance for the cross reference:
SELECT u.username, zd.distance FROM zipdistance zd
RIGHT JOIN users u ON zd.zip2 = u.zipcode
WHERE zd.zip1 = 55431
AND zd.distance > 10;
This method also requires 30ms including transit.

Conclusion
While one would think that solution 1 would be unbearably slow, it is actually quite fast when a JOIN operation is performed. My guess is that the optimizer only compares records that it finds in user zipcodes. So really since my dataset is small each solution is the same speed, but as the set of user data grows the database will have to compare more records and thus be slower to calculate (sol 1) than to lookup values in zipdistance (sol 2).

One final thing to keep in mind: if you're using PostgreSQL, remember to VACUUM your tables!

Sources
http://www.sanisoft.com/ziploc/ - zipcode latitude and longitude data and some sample code for getting all zipcodes within radius

Table structure for zipdistance:
CREATE TABLE zipdistance
(
zip1 int4 NOT NULL,
zip2 int4 NOT NULL,
distance float8 NOT NULL
)

Tuesday, October 25, 2005

PHP5's __autoload() rediscovered

Today I had an idea that seemed neat: a self-learning __autoload() function. The idea was that since PHP doesn't enforce the one class per file rule, or even have namespaces for hierarchical directory naming, it is fairly common to find several classes in a single file. This can get to be confusing if one tries to be conservative with the number of files they include.

To solve this problem, I wrote a small function that will traverse a directory tree and return a hash table of className => path.



$classes = array();
getClasses( ".", $classes );

// this is an ultra-simplistic implementation. it
// uses a global just so it doesn't have to reload
// with every iteration. a better implementation
// may be to wrap this all into a singleton class.
function __autoload( $className ) {
global $classes;
if( isset( $classes[$className] ) && file_exists( $classes[$className] ) ) {
require_once( $classes[$className] );
return true;
}

}


function getClasses( $path, &$classes, $fileNamePattern = null ) {
if( is_dir( $path ) ) {
$d = opendir( $path );
while( $fileName = readdir( $d ) ) {
if( preg_match( '/^\.\.?$/', $fileName ) ) { continue; }
// check to see that the filename matches the user's pattern
if( $fileNamePattern && !preg_match( $fileNamePattern, $fileName ) ) { continue; }

getClasses( "$path/$fileName", $classes );
}
closedir( $d );
} elseif( file_exists( $path ) ) {
// one could do this line by line if memory is an issue
$data = file_get_contents( $path );

preg_match_all( "/(class|interface)[\n\r\s]+(\w+)/", $data, $m );

foreach( $m[2] as $className ) {
$classes[$className] = $path;
}
}

}


While this could be made much more versatile with file modification times used
for rescanning files, caching between requests and the like, I have
just kept it simple to illustrate the main idea and let you improve
upon it as you so desire.

Saturday, October 22, 2005

Bad development practices for the sake of product delivery

Note: Please keep in mind that this is a rough draft, it's just what came to my fingertips when I felt motived to type. Please comment on ideas and perhaps suggest structure, rather than mechanics. This is an article I will return to finishing. Thanks!

I've been thinking about my personal development style quite a bit lately, and starting to question it. I've always been one to reserve my progress until there is a deliverable unit ready to ship. It's a very idealistic way of doing things, and also a very pro-consumer and efficient way of doing things; but it doesn't work very well in the real word, or at least in my world of self-motivated independent software development.

A major source of motivation is people being excited by your project and asking about when it will be complete, wanting updates, etc. When nobody cares, other ideas that seem more practical become more and more attractive, and eventually I begin to pursue it, and it becomes the new focus. With a new focus, the old projects are inherently abandoned, and sometimes picked up a few months, or in some cases even years later.

It's a shame that I have abandoned some of the projects that I have over the years, with never so much as a glimpse of them, and seeing projects that do the same thing. Sometimes I wish I could just show them my old work and say 'hey, can I join your team?' but that's just kind of lame.

So anyway, off my personal rant/tangent, in the business world (or at least the part of it that I have been in) I have learned that haste and waste is a much more effective method of delivering software. The more I think about it, the more sense it makes, even from an economic standpoint. Let me explain:

The number one fundamental Microeconomics 101 (and NOT Accounting)is "opportunity cost." Opportunity cost is simply the cost of doing or not doing something. For example, if you call in sick to work, you know that you're down X dollars for the dat, even though you didn't actually pay that money. You are down because you had the opportunity to get it, but decided to forego it in favor of resting and being able to deliver work you can be more proud of tomorrow, or some other reason which is worth more money than your daily wage. So, you have managed to take something of subjective value and put an actual dollar ammount on it.

What's the point of all this? Well, think about TIME; time is money. If you opt to go through a "proper" development process, it will take much more time to bring a product to market than if you were to just throw together a solution that works and roll it out, and fix it later. Sure, it costs more time & money in the long run, but it also buys you the competitive edge, which is equal to more money. So the questins become: how much time does shitty development buy, and what is that time worth? how much money do we save or lose in the long run by developing two version of the product? This would require a cost/benefit analysis, (which of course has a decreasing marginal cost as abaility to make estimates improves. assuming a proper method is employed which allows for each subsequent estimate to take less time, but that is beyond the scope of this article) [get into what this would entail and why we can assume this]. lost my train of thought.

Sunday, October 16, 2005

PyAIM-t as an alternative to aim-transport + jabberd2

Objective:
Get an AIM-transport in place for my new jabberd2 server.

Outcome:
Success. While I didn't end up taking the course of action I expected, I ended up with a better end result. AIM-t for 1.4.3 never did seem to work with buddy lists, but I instead used PyAIM-t after learning that it wasn't ejabberd specific.

Steps:
Well, this one was definitely a lot more of a challenge than mu-conference. I originally planned to try compiling AIM-t through JCR, but decided to explore more options when I found that jcomp.mk was specific to mu-conference; just to see if there was an easier route before trying to learn a whole new set of stuff. Next I started exploring the option of running concurrent jabber servers as I was going to do for mu-conference; but I had this nagging feeling that perhaps that wasn't the best answer. I also lack an understanding of what really happens under the hood of jabber servers; I decided to order two books last night to aide me in future experiments: Jabber Developer's Handbook and Programming Jabber: Extending XML Messaging. I had seen an alternative solution mentioned, PyAIM-t, but every time I had seen it, it was in the context of ejabberd, so I always ignored it. After seeing the hackery involved in running AIM-t, I decided it was worth a shot, so I checked it out; I figured I could port it to jabberd2 when I got my books. Instead I found instructions for running it with a number of different XMPP daemons! I was quite impressed by this, so I got the necessary twisted port and then PyAIM-t distribution. I followed their directions here and was online in no time. It's documented well enough that I don't really even need to put my config changes here!

Closing Remarks:
I'm happy I ended up deviating from the path of JCR or running a native jabberd1.4 server because I learned a little something about routers and the architecture of jabber components. This definitely opens up some more doors in my mind about how to properly utilize XMPP to its fullest extent through jabber; I can't wait 'till my new books come! Lesson of the night: don't write off options unless you're sure they won't work.

mu-conference, jcr and jabberd2

Objective:
To get mu-conference running with jabberd2.

Projects Utilized:
JabberD 2.0.3
JCR (Jabber Component Runtime) 0.2.4

Outcome:
Success. I could have done a couple things: 1) run a jabberd 1.4 server with a service entry, 2) use JCR to compile mu-conference as a stand-alone daemon that uplinks to the router. I opted for plan 2 because it seemed like a cleaner option, however I may still revert to plan 1, depending how JCR with aim-transport turns out.


Steps:
(I typed this once and lost it, and I don't feel like typing a lot again)

Follow the steps here to get JCR set up, mu-conference binary and the base muc-jcr.xml config file in place.

After copying muc-jcr.xml to my etc directory, I made these changes:


<jcr>
...
<!-- the name of the entry in your router.xml -->
<name>muclinker</name>
<!-- the password required by your router.xml -->
<secret>secret</secret>
<!-- the hostname that you'll be using for conference -->
<host>c.im.nullirc.net</host>
<!-- the IP of your router (usually the same as your main jabber server) -->
<ip>127.0.0.1</ip>
<!-- the port of that server -->
<port>5347</port>

<!-- can these be relative? I'm not sure, I haven't tried -->
<!-- ps. don't forget to change permissions! -->
<spool>/var/spool/jabber/c.im.nullirc.net</spool>
<logdir>/var/log/mu-conference</logdir>
<pidfile>/var/jabberd/pid/mu-conference.pid</pidfile>
...
</jcr>


And changes to router.xml:

<alias name='c.im.nullirc.net' target='muclinker'/>

Finally I wanted my muc to run with the jabberd wrapper, so I went ahead and figured out how to add a job... right before I found this article telling me how to do it (under Jabber and MUC Startup); good thing it's self documented code or I'd have spent all night before I found the new article.

The main lesson of the night is CHECK PERMISSIONS IF YOU'VE BEEN SYSADMINING FOR AN EXTENDED PERIOD OF TIME.

Saturday, October 15, 2005

Jabber authentication against an existing database

Objective:
Configure a jabber server which authenticates against an existing userbase and "just works" when a user tries to log to the jabber server in using their credentials.

Projects utilized:
JabberD 2.0.3
PostgreSQL 7.4.7

Outcome:
Success, while I could have reached the objective using my existing jabberd 1.4.3 setup, I opted instead to upgrade to jabberd2. My reasoning for this was mostly whim, 1) I couldn't find a stand-alone copy of xdb_sql because of some issues with a server being hacked, and the author not re-verifying the code. I could have tried stripping it from the 1.4.4 distribution, but I wasn't sure if it was complete and I couldn't find much in the way of documentation about it. 2) I have no prior experience with jabberd2 and wanted it.

jabberd2 includes database backend support out of the box, there is the advantage that all data is now stored in the database where they were in xdb_file before. This was an unexpect, but welcome surprise to me; I had planned at exploring options for relational database storage of the roster and other data after I finished this project.

Steps:
First of all, I built the jabberd2 port on freebsd using the WITH_POSTGRESQL option. Next I followed the instructions here to get the server setup and authenticating against a brand new database. Once I was able to observe the data the server created, I was able to determine that only the authreg table needed to be in my current app's database. So I went to town looking for how to make my custom queries like xdb_sql would allow in jabber1.4, but there was no documentation of the sort for jabberd2. Finally after some googling I found this thread which explained that the sm module is basically set in stone, but the c2s module is as flexible as xdb_sql. After some tinkering, I found that this functionality was indeed possible, though undocumented. In the end, I changed c2s to use my existing database with these settings:


<pgsql>
<!-- Database server host and port -->
<host>localhost</host>
<port>5432</port>

<!-- Database name -->
<dbname>mint2</dbname>

<!-- Database username and password -->
<user>xxxxxx</user>
<pass>xxxxxx</pass>

<table>users</table>

<sql>
<!-- use 32 as a flag for disabled jabber account -->
<select>SELECT "password" FROM "users" WHERE "username" = '%s' AND "realm" = '%s' AND (status & 32) = 0</select>
<delete>UPDATE "users" SET status = status | 32 WHERE "username" = '%s' AND "realm" = '%s'</delete>
</sql>
</pgsql>


I was also required to enable <auto-create/> in the sm module for the active table. An alternative to this would be to create an entry in my registration script OR to setup another handler for the active data. The latter would be the best option IMO, but for now auto-create will more than suffice.

Sunday, October 09, 2005

PHP code profiling

So I've just discovered the xdebug module for PHP. It's fucking sweet. I've barley played with the actual debugger at all yet, but I must say I'm extremely impressed by the code profiling modes. It allows several modes of output to track down slow and hot areas of code.

One can start in mode 4 to see total exection times for functions, then move to mode 6 to follow execution paths for problem functions that were determined to be slow:

Function Summary Profile (sorted by total execution time)
Total Time Taken Avg. Time Taken Number of Calls Function Name
0.2226700783 0.2226700783 1 *ViewSongBlock->fillContext
0.1100060940 0.1100060940 1 {require_once}
0.0959060192 0.0106562244 9 *BasePeer::doSelect
0.0848278999 0.0848278999 1 *CachableBlock->fillContext
0.0846049786 0.0846049786 1 *AccountInfoBlock->fillContext
0.0789210796 0.0789210796 1 *BaseSong->getAlbumSongsJoinAlbum
0.0758068562 0.0758068562 1 {require_once}
0.0700411797 0.0009727942 72 *TemplateApplicatorBlock->applyTemplate
0.0693910122 0.0693910122 1 *BaseSong->getSongCommentsJoinComment


Stack-Dump Profile (sorted by line numbers)
Time Taken Number of Calls Function Name Location
-> 0.2244310379 1 *ViewSongBlock->fillContext /usr/home/mint/public_html/server1/view-controller.php:335
-> 0.0122299194 1 *BaseSongPeer::retrieveByPK /usr/home/mint/public_html/server1/blocks/ViewSongBlock.class.php:27
-> 0.0000281334 1 *Propel::getConnection /usr/home/mint/public_html/server1/propel/build/classes/mint/om/BaseSongPeer.php:457
-> 0.0000720024 1 *Criteria->__construct /usr/home/mint/public_html/server1/propel/build/classes/mint/om/BaseSongPeer.php:460
-> 0.0000178814 1 *Criteria->setDbName /usr/home/mint/pear/lib/propel/util/Criteria.php:160
-> 0.0001721382 1 *Criteria->add /usr/home/mint/public_html/server1/propel/build/classes/mint/om/BaseSongPeer.php:461
-> 0.0000669956 1 *Criterion->__construct /usr/home/mint/pear/lib/propel/util/Criteria.php:557
-> 0.0000119209 1 explode /usr/home/mint/pear/lib/propel/util/Criteria.php:1187
-> 0.0116310120 1 *BaseSongPeer::doSelect /usr/home/mint/public_html/server1/propel/build/classes/mint/om/BaseSongPeer.php:463