MakeItLink – Detecting URLs In Text And Making Them Links

In late October Jeff Atwood wrote about The Problem With URLs, describing the problems of parsing out URLs in text and transforming them into links. Here’s a simple example:

My website is at http://josephscott.org/

Would be changed into:

My website is at <a href='http://josephscott.org/'>http://josephscott.org/</a>

Sounds simple, right? Once you start looking at the valid character set for URLs, things get tricky. I won’t rehash all of the items; go to The Problem With URLs post to see examples of some of the problems.
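To make the trickiness concrete, here is a rough sketch of the naive approach. This is not what WordPress does, and naive_find_urls() is a name made up purely for illustration. It treats everything up to the next whitespace as part of the URL, so surrounding punctuation gets swallowed:

[sourcecode language="php"]
<?php
// Naive sketch: everything after the scheme up to the next whitespace
// is assumed to be part of the URL.
// naive_find_urls() is a hypothetical helper, not a WordPress function.
function naive_find_urls( $text ) {
    preg_match_all( '#https?://\S+#i', $text, $matches );
    return $matches[0];
}

// The closing parenthesis is captured along with the URL:
// the match is "http://josephscott.org/)".
print_r( naive_find_urls( 'My site (http://josephscott.org/) is up.' ) );

// The period that ends the sentence is captured too:
// the match is "http://josephscott.org/about.".
print_r( naive_find_urls( 'Read http://josephscott.org/about. Thanks!' ) );
[/sourcecode]

Simply stripping trailing punctuation is not a fix either, because a URL can legitimately end in a closing parenthesis.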

I knew that WordPress had a make_clickable function (in wp-includes/formatting.php) that did this exact thing. After testing this against some of the problems that Jeff points out it became clear that make_clickable() didn’t handle these edge cases. I made some rather crude tweaks to the WordPress code to fix some of these and opened ticket 8300 with my patches. Then filosofo came along and not only cleaned up my hacks, but reduced the amount of code needed in general. Major kudos to filosofo!

At this point it looks like we’ve got code to make make_clickable() work correctly with problem URLs. I’m going to wait until after WordPress 2.7 is released to push for getting this code committed since we’re trying to get 2.7 wrapped up.

I got to thinking: this bit of code would be really handy to have as a standalone library. So I pulled out the various pieces of code needed to make this work and put them together in a single PHP class, MakeItLink:

[sourcecode language="php"]
class MakeItLink {
    // Callback for transform(): $matches[1] is the optional opening
    // parenthesis, $matches[2] is the URL itself.
    protected static function _link_www( $matches ) {
        $url = $matches[2];
        $url = self::cleanURL( $url );
        if ( empty( $url ) ) {
            return $matches[0];
        }

        return "{$matches[1]}<a href='{$url}'>{$url}</a>";
    }

    public static function cleanURL( $url ) {
        if ( $url == '' ) {
            return $url;
        }

        // Strip characters that are not allowed in a URL
        $url = preg_replace( "|[^a-z0-9-~+_.?#=!&;,/:%@$*'()\\x80-\\xff]|i", '', $url );
        $url = str_replace( array( "%0d", "%0a" ), '', $url );
        $url = str_replace( ";//", "://", $url );

        /* If the URL doesn't appear to contain a scheme, we
         * presume it needs http:// prepended (unless a relative
         * link starting with / or a php file).
         */
        if (
            strpos( $url, ":" ) === false
            && substr( $url, 0, 1 ) != "/"
            && !preg_match( "|^[a-z0-9-]+?\.php|i", $url )
        ) {
            $url = "http://{$url}";
        }

        // Replace ampersands and single quotes with HTML entities
        $url = preg_replace( "|&([^#])(?![a-z]{2,8};)|", "&#038;$1", $url );
        $url = str_replace( "'", "&#039;", $url );

        return $url;
    }

    public static function transform( $text ) {
        // The pattern needs whitespace or ">" in front of the URL
        $text = " {$text}";

        $text = preg_replace_callback(
            '#(?<=[\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff\#$%&~/\-=?@\[\](+]|[.,;:](?![\s<])|(?(1)\)(?![\s<])|\)))*)#is',
            array( 'MakeItLink', '_link_www' ),
            $text
        );

        // Clean up accidental links within links
        $text = preg_replace( '#(<a( [^>]+?>|>))<a [^>]+?>([^>]+?)</a></a>#i', "$1$3</a>", $text );
        $text = trim( $text );

        return $text;
    }
}
[/sourcecode]
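The part that deals with the parentheses problem is the first pattern in transform(). As far as I can tell, it leans on a PCRE conditional subpattern: a closing parenthesis only counts as part of the URL when the URL was not itself opened with one. Here is a minimal sketch for trying that matching on its own. find_urls() is a hypothetical helper, not part of the class, and the pattern is my reading of the one taken from make_clickable():

[sourcecode language="php"]
<?php
// Sketch only: find_urls() is a hypothetical helper for experimenting
// with the URL pattern outside of the MakeItLink class.
function find_urls( $text ) {
    $pattern = '#(?<=[\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff\#$%&~/\-=?@\[\](+]'
        . '|[.,;:](?![\s<])|(?(1)\)(?![\s<])|\)))*)#is';

    // The lookbehind wants whitespace or ">" before the URL, hence the
    // leading space (transform() does the same thing).
    preg_match_all( $pattern, ' ' . $text, $matches );

    // Group 1 is the optional opening parenthesis, group 2 is the URL.
    return $matches[2];
}

// URL wrapped in parentheses: the closing ")" is left out.
print_r( find_urls( 'My site (http://josephscott.org/) is up.' ) );

// Parentheses inside the URL itself are kept.
print_r( find_urls( 'See http://example.com/wiki/PHP_(language) now.' ) );

// A period followed by whitespace is treated as sentence punctuation.
print_r( find_urls( 'Visit http://josephscott.org/. Thanks' ) );
[/sourcecode]

The three calls should give http://josephscott.org/, http://example.com/wiki/PHP_(language) and http://josephscott.org/ respectively. The example.com address is just a placeholder.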

It’s very easy to use: just load up the text you want to search for links and call the transform method:

[sourcecode language="php"]
$text = MakeItLink::transform( $text );
[/sourcecode]

All of this code came out of WordPress, which is licensed under the GPL, so consider the MakeItLink code GPL as well. If you’ve got some improvements, let me know, and make sure that they get back into the original WordPress functions as well.