Mitigating XSS in PHP

15 January 2013

Causes of XSS

XSS is possible because HTML encodes data and instructions in the same format (plain text). When a page is transmitted over HTTP, the body of the response carries the content of the message encoded as HTML. As anyone who has tried to write HTML knows, the only thing that separates instructions to the browser about the layout and appearance of a page from the text to present on the page is the less-than (<) and greater-than (>) symbols that delineate tags. Tags are HTML elements surrounded by these symbols, in the same way that XML delineates content. Thus, the characters that segregate data from instructions are also used as characters of data for display. Confusion over delineation is further exacerbated by the fact that tags can have attributes, delimited by spaces and by quotes (single, double, or in some cases none at all), and these are displayed characters as well.

The complexity of segregating instructions to the client browser from content being displayed makes for a fertile attack surface. XSS is a vulnerability that allows an attacker to exploit this confusion, escape the bounds of delineation (much like in SQL Injection attacks) and hijack either data or instructions given to the browser. This enables arbitrary script injection into a web page.

In order to carry out an exploit, attackers craft malicious pieces of user-supplied data to inject attack code. This injection can take two main forms: reflected XSS, a transient attack in which the malicious content is carried within the request (such as within a link), and persistent XSS, in which malicious data is injected into an application's permanent data store (such as a back-end database). Whenever an application displays user-supplied input of any form, from web form POST data to URL data to previously supplied data like profile information, or even the filenames of uploaded pictures, the potential for an XSS attack exists.
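As a rough illustration of the reflected case, the sketch below builds the same greeting twice from a request parameter. The function names and the 'name' parameter are hypothetical, and the $query array merely stands in for $_GET so the example runs on its own. The second version uses htmlspecialchars, one of the functions described in the Protection section.

```php
<?php
// Hypothetical helpers for illustration; $query stands in for $_GET.

// Vulnerable: the user-supplied value is concatenated straight into the markup.
function greet_unsafe(array $query): string
{
    $name = $query['name'] ?? '';
    return '<p>Hello, ' . $name . '</p>';
}

// Safer: the same value is HTML-encoded before it is placed in the page.
function greet_escaped(array $query): string
{
    $name = $query['name'] ?? '';
    return '<p>Hello, ' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';
}

// A payload such as an attacker might place in a link.
$payload = ['name' => '<script>alert(1)</script>'];

echo greet_unsafe($payload), "\n";  // the script tag reaches the browser intact
echo greet_escaped($payload), "\n"; // the script tag is displayed as text
```

In the first call the browser would receive a working script element; in the second it would receive &lt;script&gt; and simply display it.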

Protection

Several functions exist that can help to sanitize data before display on a page. The choice of function appropriate for a specific situation is somewhat dependent on the nature of the data and the intent of the display. Data intended to be included in an HTML tag's attribute will need to be formatted in a different way than text designed to be displayed on a web page. Data that is nested inside other formatting or scripting elements can be particularly difficult to sanitize.
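To make the attribute case concrete, here is a minimal sketch, assuming a hypothetical render_value_attr helper that places a value inside a double-quoted attribute. The payload contains no tags at all, so it only does damage if the quote characters are left unencoded.

```php
<?php
// Hypothetical helper: place a value inside a double-quoted HTML attribute.
// $flags is passed through to htmlspecialchars so both behaviours can be compared.
function render_value_attr(string $value, int $flags): string
{
    return '<input value="' . htmlspecialchars($value, $flags, 'UTF-8') . '">';
}

// No angle brackets here: the attack relies only on closing the attribute early.
$payload = '" onmouseover="alert(1)';

// ENT_NOQUOTES leaves the quotes alone, so the payload adds a new attribute.
echo render_value_attr($payload, ENT_NOQUOTES), "\n";
// prints: <input value="" onmouseover="alert(1)">

// ENT_QUOTES encodes the quotes, so the payload stays inside the value.
echo render_value_attr($payload, ENT_QUOTES), "\n";
// prints: <input value="&quot; onmouseover=&quot;alert(1)">
```

The sketch assumes the attribute itself is always written with quotes; an unquoted attribute can be escaped with a plain space, which none of these flags encode.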

In general, the approach of whitelisting known good output is safest. PHP has a broad range of native XSS safety functions that vary in effectiveness. Whenever possible, use the most comprehensive data sanitizing function. Only where the rules for displaying certain data require that scrubbing be relaxed to allow certain characters or elements should the less effective functions be used.
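One way to picture the whitelisting approach is a value that is only ever output if it matches a short list of known good options. The pick_theme function and its list of theme names below are assumptions made up for this example.

```php
<?php
// Hypothetical example: only values on a fixed list are ever echoed back.
function pick_theme(string $requested): string
{
    $allowed = ['light', 'dark', 'contrast'];

    // Strict comparison; anything not on the list falls back to a safe default.
    return in_array($requested, $allowed, true) ? $requested : 'light';
}

echo '<body class="' . pick_theme('dark') . '">', "\n";        // class="dark"
echo '<body class="' . pick_theme('"><script>') . '">', "\n";  // class="light"
```

Because the output can only ever be one of three fixed strings, there is nothing left for an attacker to inject, which is why this approach is preferred where the data allows it.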

htmlentities()

The htmlentities function (http://us2.php.net/manual/en/function.htmlentities.php) takes a string as its first argument and returns the string with every character that has an HTML entity replaced by that entity. This means that every 'less than' symbol will be replaced by '&lt;'. This method is extremely effective in preventing XSS since it translates the delimiting characters, from quotes to greater than symbols, into entities that the browser displays as text rather than interprets. This makes escaping from defined data boundaries nearly impossible. Note that single quotes are only translated when the ENT_QUOTES flag is in effect; it is part of the default flags from PHP 8.1 onward but must be passed explicitly in earlier versions. The drawback of htmlentities is that it can lead to fairly messy data display, since any HTML included in the data will be shown as literal text rather than interpreted as markup. Thus paragraph, anchor, and other tags will show up as text.
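A minimal sketch of the behaviour described above follows. The encode_for_display wrapper is a hypothetical name; the flags and the UTF-8 charset are passed explicitly as an assumption about the application, so that the result does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical wrapper: encode a value so it is displayed as text.
function encode_for_display(string $value): string
{
    // ENT_QUOTES is passed explicitly so single quotes are encoded on every PHP version.
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

$comment = "<b>It's a caf\u{e9}</b>"; // \u{e9} is the character é in UTF-8

echo encode_for_display($comment), "\n";
// prints: &lt;b&gt;It&#039;s a caf&eacute;&lt;/b&gt;
```

The output shows both effects at once: the bold tags become visible text, and the accented character is replaced by its named entity.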

htmlspecialchars()

Like the htmlentities function, htmlspecialchars (http://us2.php.net/manual/en/function.htmlspecialchars.php) encodes certain characters into their equivalent HTML entities (& becomes &amp; for instance). The difference between htmlspecialchars and htmlentities is that htmlspecialchars translates only a limited subset of characters: single and double quotes, less than and greater than, and ampersand symbols. Single quotes are only translated when the ENT_QUOTES flag is in effect, which is the default from PHP 8.1 onward. This is sufficient for stopping most XSS attacks. Notably, this function does not escape the semicolons used to delimit JavaScript statements, and thus is not a complete solution, but it is appropriate in many places.
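The sketch below shows the limited character set and the effect of the quote flags. The two wrapper names are hypothetical, and UTF-8 input is assumed.

```php
<?php
// Hypothetical wrappers that differ only in how quotes are handled.
function escape_all_quotes(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8'); // encodes " and '
}

function escape_double_only(string $value): string
{
    return htmlspecialchars($value, ENT_COMPAT, 'UTF-8'); // encodes " but not '
}

$input = "Tom & Jerry's <caf\u{e9}>"; // \u{e9} is the character é in UTF-8

echo escape_all_quotes($input), "\n";
// prints: Tom &amp; Jerry&#039;s &lt;café&gt;

echo escape_double_only($input), "\n";
// prints: Tom &amp; Jerry's &lt;café&gt;
```

Unlike htmlentities, the accented character passes through unchanged; only the five markup-significant characters are touched, and the single quote only under ENT_QUOTES.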

strip_tags()

The strip_tags function (http://us2.php.net/manual/en/function.strip-tags.php) finds all HTML tags (opening and closing) and removes them from text. This function is extremely effective for stopping XSS that includes either the script tag or other HTML tags (such as iframe, applet, or object tags). This function doesn't remove control characters like quotes or semicolons, so it will not protect data displayed within JavaScript or HTML tag attributes. Usefully, the strip_tags function can be passed a list of allowable tags, so that if HTML is required in the display this function can remove tags that are not in a whitelist.
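The following sketch shows strip_tags with and without a list of allowable tags. The helper names and the choice of p and b as allowed tags are assumptions for the example. It also illustrates a caveat worth keeping in mind: attributes on an allowed tag are left untouched.

```php
<?php
// Hypothetical helpers around strip_tags.

// Remove every tag, keeping only the text between them.
function plain_text(string $html): string
{
    return strip_tags($html);
}

// Keep a small whitelist of formatting tags and remove everything else.
function basic_markup(string $html): string
{
    return strip_tags($html, '<p><b>');
}

$input = '<p>Hello <b>world</b><script>alert(1)</script></p>';

echo plain_text($input), "\n";   // prints: Hello worldalert(1)
echo basic_markup($input), "\n"; // prints: <p>Hello <b>world</b>alert(1)</p>

// Caveat: attributes on allowed tags survive, including event handlers.
echo basic_markup('<b onmouseover="alert(1)">hi</b>'), "\n";
// prints: <b onmouseover="alert(1)">hi</b>
```

Note that the text inside the removed script element is kept, and that the last call returns its input unchanged, so a whitelist of tags on its own does not make the remaining markup safe.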

preg_replace()

The Perl-compatible regular expression replacement function (http://us2.php.net/manual/en/function.preg-replace.php) is useful for custom data cleaning situations. Regular expressions can be written to remove control characters as needed to prevent XSS attacks. Use of this function grants an extremely fine grain of control but can also lead to errors of oversight.
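As a sketch of both the control and the risk, the first function below keeps only an allowed set of characters, while the second tries to remove one known bad pattern. Both function names and the username rule are hypothetical. The second function shows the kind of oversight the paragraph warns about.

```php
<?php
// Hypothetical cleaner: keep only letters, digits, underscore and hyphen.
function clean_username(string $input): string
{
    // preg_replace returns null on error, so the result is cast to string.
    return (string) preg_replace('/[^A-Za-z0-9_\-]/', '', $input);
}

// Hypothetical blacklist: remove literal <script> tags in a single pass.
function remove_script_tags(string $input): string
{
    return (string) preg_replace('/<script>/i', '', $input);
}

echo clean_username('<script>alert(1)</script>'), "\n";
// prints: scriptalert1script

// Oversight: removing the inner tag reassembles a new one from the pieces.
echo remove_script_tags('<scr<script>ipt>'), "\n";
// prints: <script>
```

Describing what is allowed tends to be more robust than listing what is forbidden, since the second pattern is defeated by a simple nested payload.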

urlencode()

The urlencode function (http://us2.php.net/manual/en/function.urlencode.php) is also extremely useful in preventing XSS attacks, especially in places where links are to be displayed. The urlencode function encodes characters for use in URLs, for instance replacing spaces with plus signs (the related rawurlencode function uses %20 instead). URL-encoded strings have all special and control characters translated, which effectively prevents XSS, but this can make for messy display in cases where the output string is not used as a link.
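A minimal sketch of encoding a value for use in a link follows. The search_link helper and the /search?q= path are hypothetical. The value is URL-encoded where it goes into the query string and HTML-encoded where it is shown as the link text.

```php
<?php
// Hypothetical helper: build a link whose query value comes from user input.
function search_link(string $term): string
{
    $href  = '/search?q=' . urlencode($term);               // safe as a query value
    $label = htmlspecialchars($term, ENT_QUOTES, 'UTF-8');  // safe as displayed text
    return '<a href="' . $href . '">' . $label . '</a>';
}

echo search_link('a b'), "\n";
// prints: <a href="/search?q=a+b">a b</a>

echo urlencode('"><script>alert(1)</script>'), "\n";
// prints: %22%3E%3Cscript%3Ealert%281%29%3C%2Fscript%3E

echo rawurlencode('a b'), "\n";
// prints: a%20b
```

The sketch applies urlencode only to the parameter value, not to the whole URL, since encoding an entire URL would also encode the slashes and question mark that give it structure.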

Conclusion

Preventing XSS is extremely difficult since there are often a multitude of different places where user-supplied data needs to be displayed to application users. The nature of user-supplied data, however, makes it untrusted. When working with untrusted data it is necessary to sanitize the data before display. PHP offers a wide array of functions to accomplish this task, but some may be more suitable than others depending on the display circumstances. In general, try to use broad functions such as htmlentities, but in some instances it may be more appropriate to use functions such as strip_tags or urlencode to ensure that safe and readable output is rendered.

Further Resources

OWASP maintains the XSS Prevention Cheat Sheet (https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet), which is an excellent resource for developers.