ReadWriteWeb has an article about Google potentially using PuSH to get updates into the search index. I knew this idea sounded familiar, so I went hunting through some of my old posts and found Apache Module Idea mod_ping from 2004.
Back then I had thought a lot about searching blog feeds. There were some services that offered blog feed searching, but they were all pretty bad. I wrote a review of the situation at the time in Why Hasn’t Anyone Figured Out How To Do Feed Searches? Keep in mind that Google’s blog search feature wouldn’t be announced for another year.
It seemed odd to me that blog feed search would be so bad given how strongly the blogging software community had embraced the idea of pinging updates. This led me to the idea of some sort of mod_ping for Apache that would do similar pings for any type of website update; it didn’t have to be limited to just blogs. Obviously I never took this idea anywhere (besides writing that post), and to be honest I hadn’t really revisited it much since. Search engines took a different route for update frequency, with features like sitemaps in 2006.
Fast forward to 2010 and we’ve got discussion of Google potentially using PubSubHubbub (PuSH) to subscribe to updates from every single web site out there. This brings up an interesting question, though. Since PuSH focuses on feed formats (RSS & Atom) for pings, what will pings from sites that don’t have feeds look like? Will the ping just contain the entire HTML output of the updated page? What about a diff (unified format of course!) between the new HTML and the previous HTML for a given page?
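To make the diff idea a bit more concrete, here is a minimal sketch in Python of what building such a ping body might look like. This is purely illustrative: PuSH does not define anything like this, and the function name `html_diff_ping` and the `@previous` / `@current` labels are my own assumptions. It only uses the standard library's `difflib.unified_diff`.

```python
import difflib


def html_diff_ping(url, old_html, new_html):
    """Return a unified diff between two versions of a page.

    Hypothetical helper: the idea of using this string as a ping body
    is an assumption, not part of any published protocol.
    """
    diff_lines = difflib.unified_diff(
        old_html.splitlines(keepends=True),
        new_html.splitlines(keepends=True),
        fromfile=url + "@previous",  # made-up label for the old version
        tofile=url + "@current",     # made-up label for the new version
    )
    return "".join(diff_lines)


old_page = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"
new_page = "<html>\n<body>\n<p>Hello, world</p>\n</body>\n</html>\n"

ping_body = html_diff_ping("http://example.com/", old_page, new_page)
print(ping_body)
```

For a small edit to a large page, the diff would be much smaller than the full HTML, but the subscriber would need to already hold the previous version in order to apply it.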
Folks like Brett Slatkin have been thinking about this sort of thing on a deeper level than I ever did, so I’m curious to see where this goes.
5 replies on “mod_ping, Maybe I Should Have Called It PubSubHubbub”
The Google thing sounds fine for slow moving sites. Gets a little freaky for sites that change constantly but I guess that’s where we are headed. I’m guessing many sites will just batch things up and update every so many minutes.
I’m not sure, right now there are probably more questions than answers.
I too had been thinking of this type of thing before Google released theirs, although not by the number of years that you had been; more like at the same time that Google were. I was very pleased that Google did release theirs.
I was doing mine to enable two microblogging servers to talk to the world, and I wanted to abstract the posting mechanism from the microblogging server, so I had given some thought to the amount of information contained within the ping. Using a fat ping to ensure that the data is all contained in the message means that you will not get a potential swamping of the server when the subscribers rush to get the update, but what update should it provide? For me it comes down to the fact that, as a protocol, it is very flexible and does not require this decision to be made in order to use it. Accordingly, I would judge each case on its merits.
Certainly for my purposes it was more a case of sending the entire message, with “attachments” (files, photos, etc.) not having to be present, just the relevant links, so the message was accordingly very small.
It may be that there are further standards or protocols that emerge that use PuSH as the carrier but I do think that with ease of use and deployment this is the type of thing that will be central to the way the web is used in the not too distant future.
How about sending the sha1 hash of the feed and comparing with the stored hash? It’s rather small compared to the entire document.
That would tell you that it had changed, but wouldn’t provide you with the details of what changed.
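As a rough illustration of the hash suggestion and its limitation, here is a small Python sketch using the standard library's `hashlib`. The function names `feed_fingerprint` and `has_changed` are made up for this example, and the feed contents are placeholders.

```python
import hashlib


def feed_fingerprint(feed_bytes):
    """Return the SHA-1 hex digest of the raw feed bytes."""
    return hashlib.sha1(feed_bytes).hexdigest()


def has_changed(stored_hash, feed_bytes):
    """Compare a previously stored digest with the digest of the current feed."""
    return feed_fingerprint(feed_bytes) != stored_hash


# Placeholder feed contents, only for illustration.
old_feed = b"<feed><entry>first post</entry></feed>"
new_feed = b"<feed><entry>first post</entry><entry>second post</entry></feed>"

stored = feed_fingerprint(old_feed)
print(has_changed(stored, old_feed))  # same bytes, same digest
print(has_changed(stored, new_feed))  # different bytes, different digest
```

The digest is a fixed 40 hex characters no matter how large the feed is, which is the appeal. As noted above, though, it only answers whether something changed; the subscriber still has to fetch the feed to find out what.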