Linkify tweets with regex

Regular expressions are powerful, useful, and — in my opinion — lots of fun! Thanks to the prevalence of Twitter, every web developer will be exposed to regex sooner or later: before outputting tweets in HTML, Twitter names and hyperlinks must be wrapped in anchor tags.

Matching @names

Here's the gist: a match will begin with "@" and the at sign must be followed by one or more word (letter / number / underscore) characters. The @name must either appear at the beginning of the tweet or be preceded by a space. This prevents the regular expression from matching "@example" in "me@example.com".

JavaScript implementation

tweet.replace(/(^|\s)(@\w+)/gm, '$1<a href="http://twitter.com/$2">$2</a>');

It would of course be nicer to write:

tweet.replace(/(?<=(?:^|\s))(@\w+)/gm, '<a href="http://twitter.com/$1">$1</a>');

Unfortunately, JavaScript engines that predate ES2018 do not support lookbehinds in regular expressions, so to stay compatible with them one is forced to capture the preceding whitespace character (if in fact there is one) and spit it out in the replacement string.

PHP implementation

preg_replace('/(^|\s)(@\w+)/m', '$1<a href="http://twitter.com/$2">$2</a>', $tweet);

Python implementation

Python does support lookbehinds, but only fixed-width lookbehinds, so it won't allow (?<=^|\s), whose two alternatives differ in width (^ matches zero characters, \s matches one). No matter.

import re
re.sub(r'(?m)(^|\s)(@\w+)',
        lambda m: m.group(1) + '<a href="http://twitter.com/' + m.group(2) + '">' + m.group(2) + '</a>',
        tweet)

For once, Python's syntax is the least elegant!
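
That said, the lambda is not strictly required: re.sub also accepts a replacement template with backreferences, which brings the Python version closer to the other two. The sketch below is only an illustration. The helper names linkify_mentions and linkify_mentions_lookbehind are made up for this example, and the second variant swaps in a negative lookbehind, (?<!\S), which is fixed-width and therefore acceptable to Python's re module.

```python
import re

# Same pattern as above: group 1 is the start of a line or a whitespace
# character, group 2 is the @name itself.
MENTION = re.compile(r'(?m)(^|\s)(@\w+)')

def linkify_mentions(tweet):
    """Hypothetical helper: wrap each @name in an anchor tag."""
    return MENTION.sub(r'\1<a href="http://twitter.com/\2">\2</a>', tweet)

# Variant with a fixed-width negative lookbehind: "not preceded by a
# non-whitespace character" covers both the start of the string and a
# preceding whitespace character, so nothing extra needs to be captured.
MENTION_LOOKBEHIND = re.compile(r'(?<!\S)(@\w+)')

def linkify_mentions_lookbehind(tweet):
    """Hypothetical helper: same result, using the lookbehind pattern."""
    return MENTION_LOOKBEHIND.sub(r'<a href="http://twitter.com/\1">\1</a>', tweet)

# The email address is left alone; both @names are wrapped.
print(linkify_mentions('@alice mail me@example.com or ping @bob'))
```

Both helpers leave "me@example.com" untouched, because the "@" there is preceded by a non-whitespace character.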

Interestingly, while testing these snippets I found I did not need to specify multi-line mode. Perhaps multi-line mode is assumed? I'd like to know the answer.
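
A likely explanation, offered as a sketch rather than a definitive answer: multi-line mode is not assumed, but it is redundant for this particular pattern. The flag only changes where ^ matches, and any @name at the start of a later line is preceded by a newline, which \s already matches. The snippet below compares the pattern with and without the flag; the sample tweet is made up for illustration.

```python
import re

tweet = '@alice said hi\n@bob replied'  # made-up sample containing a line break

with_flag = re.findall(r'(?m)(^|\s)(@\w+)', tweet)
without_flag = re.findall(r'(^|\s)(@\w+)', tweet)

# In both cases the text captured before "@bob" is the newline itself,
# because \s matches "\n", so the two result lists are identical.
print(with_flag)
print(without_flag)
print(with_flag == without_flag)
```

The same reasoning should carry over to the JavaScript and PHP versions, since \s matches a newline there too.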

Matching hyperlinks

The regular expression involved in matching hyperlinks is more complex. I'll point you to John Gruber's liberal regex for matching URLs as he's clearly put a great deal of thought into what is essentially a single line of code!
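
For readers who want something much simpler than a liberal URL regex, here is a minimal sketch that handles only the common case of http and https links. The pattern and the helper name linkify_urls are assumptions made for this example, not Gruber's regex, and the approach will certainly miss or mangle some URLs. Its one refinement is that trailing punctuation, such as a comma or a full stop, is left outside the link.

```python
import re

# Deliberately simple: "http://" or "https://" followed by non-space
# characters, where the last matched character must not be common
# trailing punctuation, a quote, or a closing bracket.
URL = re.compile(r"""https?://\S*[^\s.,;:!?'")\]]""")

def linkify_urls(tweet):
    """Hypothetical helper: wrap each matched URL in an anchor tag."""
    return URL.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), tweet)

# The comma after the URL stays outside the anchor tag.
print(linkify_urls('I like this URL http://t.co/awiefj, and it likes me.'))
```

A real implementation would also want to HTML-escape the URL before placing it in the href attribute; that step is omitted here to keep the sketch short.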

Comments

John Gruber's regex is too liberal for tweet URLs, because people will sometimes write something like this: "I like this URL http://t.co/awiefj, and it likes me."

His regex will capture the trailing comma, which should not be part of the link. Things get trickier still if the URL is adjacent to ".

Yep, matching URLs in text is something that's impossible to do with 100% accuracy, no matter how many hours you spend fiddling with your regex. I think the best approach is to write something simple which handles the common cases, and not worry about the inevitable failures. I agree that matching the comma in your example is bad; that's a common case I'd like to handle "correctly".
