Mitigating XSS in PHP
Causes of XSS
XSS is possible because HTML encodes data and instructions in the same format: plain text. When an HTML page is transmitted over HTTP, the body of the response carries the page's content and its markup together as HTML. As anyone who has written HTML knows, the only thing separating instructions to the browser about the layout and appearance of a page from the text to present on the page is the less-than (<) and greater-than (>) symbols that delineate tags. Tags are HTML elements surrounded by these less-than and greater-than symbols, in the same way that XML delineates content. Thus, the characters that segregate data from instructions can also appear as characters of data for display. Confusion over delineation is further exacerbated by the fact that tags can have attributes, delimited by quotes (single or double, or in some cases no quotes at all) and spaces, which are also displayable characters.
The complexity of segregating instructions to the client browser from content being displayed makes for a fertile attack surface. XSS is a vulnerability that allows an attacker to exploit this confusion, escape the bounds of delineation (much like in SQL Injection attacks) and hijack either data or instructions given to the browser. This enables arbitrary script injection into a web page.
To carry out an exploit, attackers craft malicious user-supplied data that injects attack code. This injection takes two main forms: reflected XSS, a transient attack in which the malicious content is carried within the request (such as within a link), and persistent XSS, in which malicious data is injected into an application's permanent data store (such as a back-end database). Whenever an application displays user-supplied input of any form, from web form POST data to URL data to previously supplied data such as profile information, or even the filenames of uploaded pictures, the potential for an XSS attack exists.
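As a minimal sketch of the problem described above, the following shows untrusted input being concatenated into a page without any encoding. The function name render_greeting_unsafe and the payload are hypothetical, chosen only for illustration; the same string could arrive in a link's query string (reflected) or be read back from a data store (persistent).

```php
<?php
// Hypothetical helper: builds a greeting from untrusted input with no escaping.
function render_greeting_unsafe(string $name): string
{
    return '<p>Hello, ' . $name . '</p>';
}

// Illustrative payload an attacker might supply in place of a name.
$payload = '<script>alert(1)</script>';

// The payload's angle brackets survive intact, so a browser would treat
// the injected text as a script element rather than as data to display.
$html = render_greeting_unsafe($payload);
echo $html, PHP_EOL;
```

Because the delimiting characters pass through unchanged, the browser has no way to tell the injected tag from markup the application intended to send.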
Several functions exist that can help to sanitize data before display on a page. The choice of function appropriate for a specific situation is somewhat dependent on the nature of the data and the intent of the display. Data intended to be included in an HTML tag's attribute will need to be formatted in a different way than text designed to be displayed on a web page. Data that is nested inside other formatting or scripting elements can be particularly difficult to sanitize.
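The difference between display contexts can be sketched with a value that contains quotes but no angle brackets. This is an illustration under stated assumptions, not a complete policy: the input value and the surrounding input tag are hypothetical, and htmlentities is used with explicit flags so the behavior does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical untrusted value: single quotes, but no angle brackets.
$value = "x' onmouseover='alert(1)";

// ENT_NOQUOTES leaves both quote styles untouched. That can be tolerable
// for text between tags, but it is not enough inside an attribute.
$textOnly = htmlentities($value, ENT_NOQUOTES, 'UTF-8');

// ENT_QUOTES converts double and single quotes as well, which is what an
// attribute value needs.
$attrSafe = htmlentities($value, ENT_QUOTES, 'UTF-8');

// In the first tag the value closes the attribute early and adds a handler;
// in the second it stays inside the quoted attribute.
$unsafeTag = "<input type='text' value='" . $textOnly . "'>";
$safeTag   = "<input type='text' value='" . $attrSafe . "'>";

echo $unsafeTag, PHP_EOL;
echo $safeTag, PHP_EOL;
```

Quoting every attribute value and encoding both quote styles keeps the rule simple; unquoted attributes can be broken out of with a space alone.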
In general, the approach of whitelisting known good output is safest. PHP has a broad range of native XSS safety functions that vary in effectiveness. Whenever possible, use the most comprehensive data sanitizing function available. The less effective functions should be used only where the rules for displaying certain data require that scrubbing be relaxed to allow certain characters or elements.
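One way to read the whitelisting advice is to accept only values from a fixed list of known good options and fall back to a default for everything else. The sketch below assumes a hypothetical theme setting; the function name pick_theme and the list of allowed values are illustrative only.

```php
<?php
// Hypothetical whitelist check: only values on the list are ever output.
function pick_theme(string $requested): string
{
    $allowed = ['light', 'dark', 'contrast'];

    // Strict comparison (third argument true) avoids loose type juggling.
    return in_array($requested, $allowed, true) ? $requested : 'light';
}

echo pick_theme('dark'), PHP_EOL;                      // a listed value passes through
echo pick_theme('"><script>alert(1)</script>'), PHP_EOL; // anything else becomes the default
```

Since the output can only ever be one of the listed strings, nothing the attacker supplies reaches the page.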
The htmlentities function (http://us2.php.net/manual/en/function.htmlentities.php) takes a string as its first argument and returns the string with every character that has an HTML entity equivalent replaced by that entity. This means that every less-than symbol will be replaced by '&lt;'. This method is extremely effective in preventing XSS since it translates every delimiting character, from quotes to greater-than symbols, into HTML entities that the browser displays as text rather than interprets as markup. This makes escaping from defined data boundaries nearly impossible. Note that single quotes are converted only when the ENT_QUOTES flag is in effect; it is part of the default flags as of PHP 8.1 and must be passed explicitly in earlier versions. The problem with htmlentities is that it can lead to fairly messy data display, since any HTML included in the data will be displayed literally rather than interpreted. Thus paragraph, anchor, and other tags will show up as text.
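A short sketch of htmlentities applied to a typical payload follows. The input string is illustrative; the ENT_QUOTES flag and the 'UTF-8' charset are passed explicitly here as an assumption about the page's encoding and so that the result does not depend on version-specific defaults.

```php
<?php
// Illustrative untrusted input containing tags and double quotes.
$input = '<script>alert("xss")</script>';

// Convert every character that has an entity equivalent, including both
// quote styles, before placing the value in a page.
$safe = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $safe, PHP_EOL;
// prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

The browser renders the result as the literal text of the tag, which is the messy-but-safe display the paragraph above describes.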
The Perl-compatible regular expression replacement function, preg_replace (http://us2.php.net/manual/en/function.preg-replace.php), is useful for custom data cleaning situations. Regular expressions can be written to remove control characters as needed to prevent XSS attacks. Use of this function grants an extremely fine grain of control but can also lead to errors of oversight.
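As a rough sketch of custom cleaning with preg_replace, the pattern below removes every character outside a small allowed set. The helper name clean_username and the choice of allowed characters are assumptions made for illustration; a real application would pick the set to match its own rules.

```php
<?php
// Hypothetical cleaner: keep only letters, digits, underscore, period, hyphen.
function clean_username(string $raw): string
{
    // The negated character class matches anything NOT on the allowed list.
    // preg_replace returns null on a regex error, so fall back to ''.
    return preg_replace('/[^A-Za-z0-9_.-]/', '', $raw) ?? '';
}

echo clean_username('jane.doe<script>'), PHP_EOL; // prints: jane.doescript
```

Note that the angle brackets are stripped but the word "script" remains. That is harmless here, yet it shows how a hand-written pattern removes only what its author thought to exclude.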
The urlencode function (http://us2.php.net/manual/en/function.urlencode.php) is also extremely useful in preventing XSS attacks, especially in places where links are to be displayed. The urlencode function encodes strings for use in URLs, replacing every non-alphanumeric character except the hyphen, underscore, and period with a percent sign followed by two hex digits (for instance, '<' becomes %3C) and replacing spaces with plus signs; the related rawurlencode function encodes spaces as %20 instead. Because URL-encoded strings have all special and control characters translated, the function effectively prevents XSS, but it can make for messy display in cases where the output string is not used as a link.
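The following sketch shows urlencode applied to a value placed in a link's query string. The path /search.php and the parameter name q are hypothetical; rawurlencode is shown alongside only to illustrate the difference in how spaces are encoded.

```php
<?php
// Illustrative untrusted search term containing markup characters.
$term = 'cats & dogs <script>';

// Encode the value before appending it to the (hypothetical) link target.
$href = '/search.php?q=' . urlencode($term);

echo $href, PHP_EOL;
// prints: /search.php?q=cats+%26+dogs+%3Cscript%3E

// rawurlencode differs in that spaces become %20 rather than plus signs.
echo rawurlencode($term), PHP_EOL;
// prints: cats%20%26%20dogs%20%3Cscript%3E
```

Only the untrusted value is encoded, not the whole URL; encoding the full string would also translate the slashes, question mark, and equals sign that give the link its structure.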