In late October Jeff Atwood wrote about The Problem With URLs, describing the difficulty of picking URLs out of text and transforming them into links. Here’s a simple example:
My website is at http://josephscott.org/
Would be changed into:
My website is at <a href='http://josephscott.org/'>http://josephscott.org/</a>
Sounds simple, right? Once you start looking at what the valid character set is for URLs, things get tricky. I won’t rehash all of the items; go to The Problem With URLs post to see examples of some of the problems.
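To make one of those problems concrete, here is a minimal sketch of a naive approach. The function name `naive_linkify` and its pattern are made up for illustration; this is not what make_clickable() does.

```php
<?php
// Naive linkifier (illustration only): treat every run of non-whitespace
// characters after http:// or https:// as the URL.
function naive_linkify( $text ) {
    return preg_replace( '#(https?://\S+)#i', "<a href='$1'>$1</a>", $text );
}

echo naive_linkify( 'My site (http://josephscott.org/) is up.' ), "\n";
// The closing paren ends up inside the link:
// My site (<a href='http://josephscott.org/)'>http://josephscott.org/)</a> is up.
```

A closing paren is a legal URL character, so simply excluding it from the pattern would break URLs that really do contain parens. That tension is what makes the problem tricky.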
I knew that WordPress had a make_clickable function (in wp-includes/formatting.php) that did this exact thing. After testing this against some of the problems that Jeff points out it became clear that make_clickable() didn’t handle these edge cases. I made some rather crude tweaks to the WordPress code to fix some of these and opened ticket 8300 with my patches. Then filosofo came along and not only cleaned up my hacks, but reduced the amount of code needed in general. Major kudos to filosofo!
At this point it looks like we’ve got code to make make_clickable() work correctly with problem URLs. I’m going to wait until after WordPress 2.7 is released to push for getting this code committed since we’re trying to get 2.7 wrapped up.
I got to thinking that this bit of code would be really handy to have as a standalone library. So I pulled out the various pieces of code needed to make this work and put them together in a single PHP class: MakeItLink
[sourcecode language="php"]
class MakeItLink {
    // Callback for transform(): wraps the matched URL in an anchor tag.
    protected static function _link_www( $matches ) {
        $url = $matches[2];
        $url = MakeItLink::cleanURL( $url );
        if( empty( $url ) ) {
            return $matches[0];
        }

        return "{$matches[1]}<a href='{$url}'>{$url}</a>";
    }

    public static function cleanURL( $url ) {
        if( $url == '' ) {
            return $url;
        }

        // Strip characters that are not allowed in a URL
        $url = preg_replace( "|[^a-z0-9-~+_.?#=!&;,/:%@$*'()\\x80-\\xff]|i", '', $url );
        $url = str_replace( array( "%0d", "%0a" ), '', $url );
        $url = str_replace( ";//", "://", $url );

        /* If the URL doesn't appear to contain a scheme, we
         * presume it needs http:// prepended (unless a relative
         * link starting with / or a php file).
         */
        if(
            strpos( $url, ":" ) === false
            && substr( $url, 0, 1 ) != "/"
            && !preg_match( "|^[a-z0-9-]+?\.php|i", $url )
        ) {
            $url = "http://{$url}";
        }

        // Replace ampersands and single quotes
        $url = preg_replace( "|&([^#])(?![a-z]{2,8};)|", "&#038;$1", $url );
        $url = str_replace( "'", "&#039;", $url );

        return $url;
    }

    public static function transform( $text ) {
        // Leading space so a URL at the very start still matches the lookbehind
        $text = " {$text}";

        $text = preg_replace_callback(
            '#(?<=[\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff\#$%&~/\-=?@\[\](+]|[.,;:](?![\s<])|(?(1)\)(?![\s<])|\)))*)#is',
            array( 'MakeItLink', '_link_www' ),
            $text
        );

        // Clean up accidental links within links
        $text = preg_replace( '#(<a( [^>]+?>|>))<a [^>]+?>([^>]+?)</a></a>#i', "$1$3</a>", $text );
        $text = trim( $text );

        return $text;
    }
}
[/sourcecode]
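The pattern in transform() is dense, so here is a sketch of the same pattern written with the PCRE `x` modifier, which allows whitespace and comments inside the regex. The helper name `find_urls_sketch` is hypothetical and not part of MakeItLink. It only lists the URLs it finds rather than building links.

```php
<?php
// Hypothetical helper, not part of MakeItLink: the URL pattern from
// transform(), spread out with the x modifier so each piece has a comment.
function find_urls_sketch( $text ) {
    $pattern = <<<'REGEX'
~
(?<=[\s>])                 # must follow whitespace or the end of a tag
(\()?                      # group 1: optional opening paren before the URL
(                          # group 2: the URL itself
  [\w]+?://                #   scheme, e.g. http://
  (?:
    [\w\x80-\xff\#$%&\~/\-=?@\[\](+]   # ordinary URL characters
    | [.,;:](?![\s<])                  # punctuation, unless whitespace or < follows
    | (?(1) \)(?![\s<])                # group 1 matched: ) unless whitespace or < follows
          | \) )                       # group 1 did not match: ) is always allowed
  )*
)
~isx
REGEX;

    // Same leading-space trick as transform(), so a URL at the very start
    // of the text still satisfies the lookbehind.
    preg_match_all( $pattern, " {$text}", $matches );
    return $matches[2];
}

print_r( find_urls_sketch(
    'See (http://josephscott.org/) and http://example.com/wiki/PHP_(language) today.'
) );
```

The conditional `(?(1)...)` is what separates the two cases: a URL wrapped in parens gives up its closing paren when whitespace follows, while a URL that merely contains parens keeps them.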
The class is very easy to use: load up the text you want to search for links and call the transform method:
[sourcecode language="php"]
$text = MakeItLink::transform( $text );
[/sourcecode]
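One behavior worth checking before relying on transform(): the lookaheads in the pattern only reject whitespace or `<`, so when a parenthesized URL is the very last thing in the text, the closing paren appears to be kept as part of the URL. The sketch below reproduces this with the URL pattern on its own. The helper `first_url_sketch` and the `|$` variant are assumptions for experimenting, not something taken from the WordPress ticket.

```php
<?php
// Hypothetical helper for experimenting with the pattern; not part of MakeItLink.
// $lookahead is spliced in right after the closing paren of a wrapped URL.
function first_url_sketch( $text, $lookahead ) {
    $pattern = '#(?<=[\s>])(\()?([\w]+?://(?:[\w\\x80-\\xff\#$%&~/\-=?@\[\](+]'
        . '|[.,;:](?![\s<])|(?(1)\)' . $lookahead . '|\)))*)#is';
    preg_match( $pattern, " {$text}", $m );
    return isset( $m[2] ) ? $m[2] : null;
}

$original = '(?![\s<])';    // as used in transform()
$tweaked  = '(?![\s<]|$)';  // assumption: also stop at the end of the input

// Paren kept, because nothing follows it:
echo first_url_sketch( 'My site (http://josephscott.org)', $original ), "\n";
// Paren dropped with the tweaked lookahead:
echo first_url_sketch( 'My site (http://josephscott.org)', $tweaked ), "\n";
// Paren dropped with the original lookahead, because a space follows it:
echo first_url_sketch( 'My site (http://josephscott.org) period.', $original ), "\n";
```

The same end-of-input reasoning would apply to the punctuation branch, so a trailing period at the very end of the text is likely to be affected in the same way.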
All of this code came out of WordPress, which is licensed under the GPL, so consider the MakeItLink code GPL as well. If you’ve got some improvements, let me know, and make sure that they get back into the original WordPress functions as well.
13 replies on “MakeItLink – Detecting URLs In Text And Making Them Links”
Hooray for regular expressions! 🙂
Funny comment considering your recent posts on regular expressions 🙂
I need to start looking seriously at commenting more regexes when using PHP’s preg_* functions. Ugly-looking regexes can quickly turn into black boxes.
Is this library PHP5 only? I got a bunch of errors testing it out, some of which I banished by stripping off the ‘protected’ and ‘public’ function labels, but even after that I ended up with a preg_replace_callback error I couldn’t resolve.
(And before anyone says “Just upgrade PHP already!”, that’s not always an available option.)
It does have PHP5’isms in it (such as the protected and public keywords), but there really isn’t anything in the code that requires PHP5.
I take that back, it looks like something did change in PHP5 for preg_replace_callback that causes it to break in PHP4.
Sorry, this appears to be completely my fault. Not sure when/where, but I managed to completely botch that regex line. I’ve updated the code in the post with the correct regex line. I tried the corrected version (minus the protected/public keywords) with PHP4 and confirmed that it works now.
My apologies for the confusion.
Perfecto– works great! Thanks much. I may integrate this into my Latest Tweet plugin, depending on how much effort it will take to do so and still support the URL-shortening feature.
Hi,
I copied the code to my clipboard and pasted it into a PHP file. Then I removed the protected/public keywords because I have PHP4. I started testing it, and it does not work for a link wrapped in parens when the URL itself contains no parens: the closing paren is included as part of the link. This is the example I was testing (hopefully it will render properly):
My site (http://josephscott.org)
If it doesn’t render properly it is this case:
(link)
Can you test this with a PHP4 version of MakeItLink? And if it works for you can you post a link to your actual class file?
Hi again,
I have reconstructed the WordPress make_clickable code to try to get this to work, and it still does not work (at least on a PHP4 engine).
Please see MakeItLink2.php and MakeItLink3.php which can be found in this zip:
http://pinchpile.com/makelink.zip
MakeItLink2.php is a class for making links without taking parens into consideration. This works as expected.
MakeItLink3.php is a class that takes parens into account, as coded in the leaner version of the diff for ticket 8300. This code still does not work (on PHP4 at least) for this test case:
My site (http://josephscott.org)
Can you test it on PHP 4/5 and get back to me? Thanks.
Multiple people have indicated that this code is working. Can you confirm that your copied code is the same as what I posted?
Hi,
I tested again and you are partially right. This case:
“My site (http://josephscott.org)”
still did not work but this did:
“My site (http://josephscott.org) period.”
So I added a trailing space character to the string on line 41 and that seemed to fix the problem.
I am having trouble deciphering that regex. Can we not find a solution where line 41 looks like:
“{$text}”
instead of:
” {$text} ”
thx
We welcome patches 🙂
Thanks for sharing this Joseph! Great stuff!
For anyone else who is noticing a problem when copying the code displayed on this page, you may have noticed that some of the HTML entities have been encoded. So a plain copy/paste from above doesn’t work. You have to unencode them. In this script it’s not all that easy to find them all.
I did find another article that has the same snippet available without that problem:
http://stackoverflow.com/questions/1159006/find-urls-replies-and-hashtags-from-tweets