mirror of
https://github.com/TracksApp/tracks.git
synced 2026-01-20 16:06:10 +01:00
333 lines
12 KiB
Text
333 lines
12 KiB
Text
= Sanitize
|
|
|
|
Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
|
|
elements and attributes, Sanitize will remove all unacceptable HTML from a
|
|
string.
|
|
|
|
Using a simple configuration syntax, you can tell Sanitize to allow certain
|
|
elements, certain attributes within those elements, and even certain URL
|
|
protocols within attributes that contain URLs. Any HTML elements or attributes
|
|
that you don't explicitly allow will be removed.
|
|
|
|
Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
|
|
of fragile regular expressions, Sanitize has no trouble dealing with malformed
|
|
or maliciously-formed HTML, and will always output valid HTML or XHTML.
|
|
|
|
*Author*:: Ryan Grove (mailto:ryan@wonko.com)
|
|
*Version*:: 1.2.1 (2010-04-20)
|
|
*Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
|
|
*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
|
|
*Website*:: http://github.com/rgrove/sanitize
|
|
|
|
== Requires
|
|
|
|
* Nokogiri ~> 1.4.1
|
|
* libxml2 >= 2.7.2
|
|
|
|
== Installation
|
|
|
|
Latest stable release:
|
|
|
|
gem install sanitize
|
|
|
|
Latest development version:
|
|
|
|
gem install sanitize --pre
|
|
|
|
== Usage
|
|
|
|
If you don't specify any configuration options, Sanitize will use its strictest
|
|
settings by default, which means it will strip all HTML and leave only text
|
|
behind.
|
|
|
|
require 'rubygems'
|
|
require 'sanitize'
|
|
|
|
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
|
|
|
|
Sanitize.clean(html) # => 'foo'
|
|
|
|
== Configuration
|
|
|
|
In addition to the ultra-safe default settings, Sanitize comes with three other
|
|
built-in modes.
|
|
|
|
=== Sanitize::Config::RESTRICTED
|
|
|
|
Allows only very simple inline formatting markup. No links, images, or block
|
|
elements.
|
|
|
|
Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
|
|
|
|
=== Sanitize::Config::BASIC
|
|
|
|
Allows a variety of markup including formatting tags, links, and lists. Images
|
|
and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
|
|
protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
|
|
mitigate SEO spam.
|
|
|
|
Sanitize.clean(html, Sanitize::Config::BASIC)
|
|
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
|
|
|
|
=== Sanitize::Config::RELAXED
|
|
|
|
Allows an even wider variety of markup than BASIC, including images and tables.
|
|
Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
|
|
are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
|
|
added to links.
|
|
|
|
Sanitize.clean(html, Sanitize::Config::RELAXED)
|
|
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
|
|
|
|
=== Custom Configuration
|
|
|
|
If the built-in modes don't meet your needs, you can easily specify a custom
|
|
configuration:
|
|
|
|
Sanitize.clean(html, :elements => ['a', 'span'],
|
|
:attributes => {'a' => ['href', 'title'], 'span' => ['class']},
|
|
:protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
|
|
|
|
==== :add_attributes (Hash)
|
|
|
|
Attributes to add to specific elements. If the attribute already exists, it will
|
|
be replaced with the value specified here. Specify all element names and
|
|
attributes in lowercase.
|
|
|
|
:add_attributes => {
|
|
'a' => {'rel' => 'nofollow'}
|
|
}
|
|
|
|
==== :attributes (Hash)
|
|
|
|
Attributes to allow for specific elements. Specify all element names and
|
|
attributes in lowercase.
|
|
|
|
:attributes => {
|
|
'a' => ['href', 'title'],
|
|
'blockquote' => ['cite'],
|
|
'img' => ['alt', 'src', 'title']
|
|
}
|
|
|
|
If you'd like to allow certain attributes on all elements, use the symbol
|
|
<code>:all</code> instead of an element name.
|
|
|
|
:attributes => {
|
|
:all => ['class'],
|
|
'a' => ['href', 'title']
|
|
}
|
|
|
|
==== :allow_comments (boolean)
|
|
|
|
Whether or not to allow HTML comments. Allowing comments is strongly
|
|
discouraged, since IE allows script execution within conditional comments. The
|
|
default value is <code>false</code>.
|
|
|
|
==== :elements (Array)
|
|
|
|
Array of element names to allow. Specify all names in lowercase.
|
|
|
|
:elements => [
|
|
'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
|
|
'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
|
|
'sup', 'u', 'ul'
|
|
]
|
|
|
|
==== :output (Symbol)
|
|
|
|
Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
|
|
defaulting to <code>:xhtml</code>.
|
|
|
|
==== :output_encoding (String)
|
|
|
|
Character encoding to use for HTML output. Default is <code>'utf-8'</code>.
|
|
|
|
==== :protocols (Hash)
|
|
|
|
URL protocols to allow in specific attributes. If an attribute is listed here
|
|
and contains a protocol other than those specified (or if it contains no
|
|
protocol at all), it will be removed.
|
|
|
|
:protocols => {
|
|
'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
|
|
'img' => {'src' => ['http', 'https']}
|
|
}
|
|
|
|
If you'd like to allow the use of relative URLs which don't have a protocol,
|
|
include the symbol <code>:relative</code> in the protocol array:
|
|
|
|
:protocols => {
|
|
'a' => {'href' => ['http', 'https', :relative]}
|
|
}
|
|
|
|
==== :remove_contents (boolean or Array)
|
|
|
|
If set to +true+, Sanitize will remove the contents of any non-whitelisted
|
|
elements in addition to the elements themselves. By default, Sanitize leaves the
|
|
safe parts of an element's contents behind when the element is removed.
|
|
|
|
If set to an Array of element names, then only the contents of the specified
|
|
elements (when filtered) will be removed, and the contents of all other filtered
|
|
elements will be left behind.
|
|
|
|
The default value is <code>false</code>.
|
|
|
|
==== :transformers
|
|
|
|
See below.
|
|
|
|
=== Transformers
|
|
|
|
Transformers allow you to filter and alter nodes using your own custom logic, on
|
|
top of (or instead of) Sanitize's core filter. A transformer is any object that
|
|
responds to <code>call()</code> (such as a lambda or proc) and returns either
|
|
<code>nil</code> or a Hash containing certain optional response values.
|
|
|
|
To use one or more transformers, pass them to the <code>:transformers</code>
|
|
config setting:
|
|
|
|
Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
|
|
|
|
==== Input
|
|
|
|
Each registered transformer's <code>call()</code> method will be called once for
|
|
each element node in the HTML, and will receive as an argument an environment
|
|
Hash that contains the following items:
|
|
|
|
[<code>:config</code>]
|
|
The current Sanitize configuration Hash.
|
|
|
|
[<code>:node</code>]
|
|
A Nokogiri::XML::Node object representing an HTML element.
|
|
|
|
[<code>:node_name</code>]
|
|
The name of the current HTML node, always lowercase (e.g. "div" or "span").
|
|
|
|
==== Processing
|
|
|
|
Each transformer has full access to the Nokogiri::XML::Node that's passed into
|
|
it and to the rest of the document via the node's <code>document()</code>
|
|
method. Any changes will be reflected instantly in the document and passed on to
|
|
subsequently-called transformers and to Sanitize itself. A transformer may even
|
|
call Sanitize internally to perform custom sanitization if needed.
|
|
|
|
Nodes are passed into transformers in the order in which they're traversed. It's
|
|
important to note that Nokogiri traverses markup from the deepest node upward,
|
|
not from the first node to the last node:
|
|
|
|
html = '<div><span>foo</span></div>'
|
|
transformer = lambda{|env| puts env[:node].name }
|
|
|
|
# Prints "span", then "div".
|
|
Sanitize.clean(html, :transformers => transformer)
|
|
|
|
Transformers have a tremendous amount of power, including the power to
|
|
completely bypass Sanitize's built-in filtering. Be careful!
|
|
|
|
==== Output
|
|
|
|
A transformer may return either +nil+ or a Hash. A return value of +nil+
|
|
indicates that the transformer does not wish to act on the current node in any
|
|
way. A returned Hash may contain the following items, all of which are optional:
|
|
|
|
[<code>:attr_whitelist</code>]
|
|
Array of attribute names to add to the whitelist for the current node, in
|
|
addition to any whitelisted attributes already defined in the current config.
|
|
|
|
[<code>:node</code>]
|
|
A Nokogiri::XML::Node object that should replace the current node. All
|
|
subsequent transformers and Sanitize itself will receive this new node.
|
|
|
|
[<code>:whitelist</code>]
|
|
If _true_, the current node (and only the current node) will be whitelisted,
|
|
regardless of the current Sanitize config.
|
|
|
|
[<code>:whitelist_nodes</code>]
|
|
Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
|
|
document, regardless of the current Sanitize config.
|
|
|
|
==== Example: Transformer to whitelist YouTube video embeds
|
|
|
|
The following example demonstrates how to create a Sanitize transformer that
|
|
will safely whitelist valid YouTube video embeds without having to blindly allow
|
|
other kinds of embedded content, which would be the case if you tried to do this
|
|
by just whitelisting all <code><object></code>, <code><embed></code>, and
|
|
<code><param></code> elements:
|
|
|
|
lambda do |env|
|
|
node = env[:node]
|
|
node_name = env[:node_name]
|
|
parent = node.parent
|
|
|
|
# Since the transformer receives the deepest nodes first, we look for a
|
|
# <param> element or an <embed> element whose parent is an <object>.
|
|
return nil unless (node_name == 'param' || node_name == 'embed') &&
|
|
parent.name.to_s.downcase == 'object'
|
|
|
|
if node_name == 'param'
|
|
# Quick XPath search to find the <param> node that contains the video URL.
|
|
return nil unless movie_node = parent.search('param[@name="movie"]')[0]
|
|
url = movie_node['value']
|
|
else
|
|
# Since this is an <embed>, the video URL is in the "src" attribute. No
|
|
# extra work needed.
|
|
url = node['src']
|
|
end
|
|
|
|
# Verify that the video URL is actually a valid YouTube video URL.
|
|
return nil unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
|
|
|
|
# We're now certain that this is a YouTube embed, but we still need to run
|
|
# it through a special Sanitize step to ensure that no unwanted elements or
|
|
# attributes that don't belong in a YouTube embed can sneak in.
|
|
Sanitize.clean_node!(parent, {
|
|
:elements => ['embed', 'object', 'param'],
|
|
:attributes => {
|
|
'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
|
|
'object' => ['height', 'width'],
|
|
'param' => ['name', 'value']
|
|
}
|
|
})
|
|
|
|
# Now that we're sure that this is a valid YouTube embed and that there are
|
|
# no unwanted elements or attributes hidden inside it, we can tell Sanitize
|
|
# to whitelist the current node (<param> or <embed>) and its parent
|
|
# (<object>).
|
|
{:whitelist_nodes => [node, parent]}
|
|
end
|
|
|
|
== Contributors
|
|
|
|
The following lovely people have contributed to Sanitize in the form of patches
|
|
or ideas that later became code:
|
|
|
|
* Peter Cooper <git@peterc.org>
|
|
* Gabe da Silveira <gabe@websaviour.com>
|
|
* Ryan Grove <ryan@wonko.com>
|
|
* Adam Hooper <adam@adamhooper.com>
|
|
* Mutwin Kraus <mutle@blogage.de>
|
|
* Dev Purkayastha <dev.purkayastha@gmail.com>
|
|
* David Reese <work@whatcould.com>
|
|
* Rafael Souza <me@rafaelss.com>
|
|
* Ben Wanicur <bwanicur@verticalresponse.com>
|
|
|
|
== License
|
|
|
|
Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
|
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy of
|
|
this software and associated documentation files (the 'Software'), to deal in
|
|
the Software without restriction, including without limitation the rights to
|
|
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
|
|
the Software, and to permit persons to whom the Software is furnished to do so,
|
|
subject to the following conditions:
|
|
|
|
The above copyright notice and this permission notice shall be included in all
|
|
copies or substantial portions of the Software.
|
|
|
|
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
|
|
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
|
|
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
|
|
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
|
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|