Combating XSS with HTMLPurifier

25 August 2011


Cross site scripting (XSS) is a pervasive problem facing web applications these days. Cross site scripting, often referred to as an arbitrary HTML injection vulnerability, is a security flaw that lets one user write HTML that is then rendered by other users' browsers. There are many different types of cross site scripting vulnerabilities, but all stem from the same root cause: HTML is a format that mixes data with display instructions. That mixture becomes a problem as soon as attackers gain the ability to inject instructions where only data was expected.

The Problem

In a typical cross site scripting attack an attacker uses some portion of a web application to supply data. The application server then sends this data back as part of a page, and client browsers render it. Web application developers often fail to realize that an attacker could supply HTML instructions rather than data, and that client browsers will render those instructions. Instead of providing a search term, an attacker could enter a snippet of JavaScript code into a search box. If the server echoes that code into the search results page, the attacker can craft code that will execute in other users' browsers.
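The search box scenario can be sketched in a few lines of PHP. This is an illustration only: the function name renderSearchHeading and the sample payload are hypothetical, and the handler is written unsafely on purpose to show how attacker-supplied markup ends up in the page unchanged.

```php
<?php
// Hypothetical search-results handler, written unsafely on purpose.
function renderSearchHeading(string $term): string
{
    // The user-supplied term is concatenated straight into the markup.
    return '<h1>Results for: ' . $term . '</h1>';
}

// A benign search term is rendered as data.
$benign = renderSearchHeading('purifier');

// An attacker-supplied "term" that is really an instruction.
$payload = '<script>alert(document.cookie)</script>';
$attack = renderSearchHeading($payload);

echo $benign, PHP_EOL;
echo $attack, PHP_EOL;
```

The second heading contains the script element exactly as the attacker typed it, so any browser that loads the page will treat it as code rather than as a search term.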

Cross site scripting attacks are often used to exploit weaknesses in browsers or browser plugins. Attackers might embed a link to a malicious PDF that is loaded by the browser's PDF plugin, or a malicious video that is rendered by the video plugin, or they might simply provide data that causes a browser to crash and execute code on end users' machines. A web application that displays attacker-supplied data without filtering it leaves its users open to all of these attacks.

Many developers ask how to prevent XSS vulnerabilities in their applications quickly and easily. The simplest answer is to never trust user-supplied data. This is a great rule of thumb, but it doesn't give developers a toolkit for combating the problem. Instead, the maxim merely forces developers to come up with ad hoc solutions.

Solutions

Rather than attempting to develop a custom defensive mechanism against XSS, it is much more effective to use a supported third-party library. Using open source tools means that developers can adopt well documented libraries without having to code them from scratch (avoiding re-inventing the wheel). HTMLPurifier (http://htmlpurifier.org/) is one such tool that might prove to be a great solution.

HTMLPurifier is available as source code from the project's website, or as a PEAR package. Many Linux distributions also provide HTMLPurifier as a package, which is an ideal solution because the system's package manager can then keep HTMLPurifier up to date. This lets the system pick up updates and revisions without developers having to reinstall code.

Instructions for Use

HTMLPurifier is relatively simple to use. To clean user-supplied data, the application merely needs to create an HTMLPurifier object and then call its purify method on the input, like so:

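// Load HTMLPurifier's autoloader; adjust the path to your installation.
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

// Create a purifier with the default configuration and clean the input.
$purifier = new HTMLPurifier();
$sanitized = $purifier->purify($untrusted_user_input);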

After this call, the $sanitized variable contains the untrusted user input with malicious tags, attributes, and other data stripped out. This is a simple solution to the problem of untrusted input, although it does have some issues.

Conclusion

HTMLPurifier is a sledgehammer in terms of its approach to untrusted data. It simply removes any material that does not meet stringent standards. While it is possible to tweak the tags and attributes allowed by HTMLPurifier, it is important to realize that by default HTMLPurifier will not rewrite input for safer display. For instance, if a piece of content contains HTML that is meant to be displayed as data rather than rendered as instructions, HTMLPurifier will not change the less than and greater than symbols into their HTML entity equivalents (&lt; and &gt;) the way PHP's htmlspecialchars() function does. Instead the offending tags will simply be removed. This can cause issues if HTMLPurifier is being used to format user input for display and you want to modify HTML but not remove it.
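For content where the markup itself is the data, escaping is the better fit. The following is a minimal sketch using only PHP's built-in htmlspecialchars() function; the sample comment text is hypothetical, and HTMLPurifier is not involved here.

```php
<?php
// Hypothetical comment in which the markup is the content itself.
$comment = 'To make text bold, write <b>like this</b>.';

// Escape the HTML special characters so the browser shows the tags as text.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;
// To make text bold, write &lt;b&gt;like this&lt;/b&gt;.
```

Nothing is lost from the input: the browser displays the angle brackets literally instead of interpreting them, which is the behavior a code sample or tutorial comment usually needs.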

Despite this issue, HTMLPurifier is an excellent answer when developers ask about libraries that can be used to prevent cross site scripting. HTMLPurifier is actively maintained, extremely robust, and provides several different options for installation and deployment. While HTMLPurifier may not work for every situation, it is certainly worth adding to your developers' arsenal.