corigin.com

software news

Sanitation

Posted in Live (July 28, 2007 at 2:54 pm)

It’s amazing how issues float to the top of multiple minds independently.
I’ve been spending a lot of time thinking about how to sanitize to-be-published
data.
Then Rob Sayre wrote Interoperability and XSS Mitigation; XSS stands for
“cross-site scripting”, the main threat that you sanitize to avoid.
Sam Ruby noticed and got active: Interoperability and XSS Mitigation
announced the Sanitization rules wiki-space.
Microsoft’s Joe Cheng is worrying, too.

mod_atom

As of now, mod_atom is, as a pure Atom Store, approaching 1.0 status. It’s
interoperated with every credible client that’s tried. (Further evidence, were
any needed, that the Atom protocol is Really Simple.) Except that it’s not
finished, for one small and one large reason. The small reason is that it
doesn’t yet generate HTML (but that’s not hard). The big reason is that it’s
not safe; I can send it HTML loaded with horrible XSS exploits and it’ll stuff
them into Web-space, ready to wreak havoc on the world.

Feedparser’s Whitelist Approach

What the Sanitization Wiki Page doesn’t spell out is that this logic, derived
originally from Feedparser, is whitelist based. For HTML, it goes through the
data, examines each element and attribute, and lets it survive if it appears on
the “Approved Elements/Attributes” list.
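
To make the whitelist idea concrete, here is a minimal sketch in Python. It is not Feedparser's code or the wiki's rules: the two lists are tiny, hypothetical stand-ins, the input is assumed to be well-formed markup without namespaces, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately short lists; the real approved lists are far longer.
ALLOWED_ELEMENTS = {"div", "p", "a", "em", "strong", "ul", "li"}
# NOTE: a real sanitizer would also vet the URL scheme inside href.
ALLOWED_ATTRIBUTES = {"href", "title"}


def _drop(parent, child):
    """Remove child (and its content) but keep the text that followed it."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def sanitize(elem):
    """Strip, in place, every attribute and descendant element not on the lists.

    The tag of elem itself is not checked; the caller picks a trusted root.
    """
    for name in list(elem.attrib):
        if name not in ALLOWED_ATTRIBUTES:
            del elem.attrib[name]
    for child in list(elem):
        if child.tag in ALLOWED_ELEMENTS:
            sanitize(child)
        else:
            _drop(elem, child)


def sanitize_fragment(markup):
    """Parse a well-formed fragment, sanitize it, and serialize it again."""
    root = ET.fromstring(markup)
    sanitize(root)
    return ET.tostring(root, encoding="unicode")
```

Anything not explicitly approved disappears, so a new kind of attack has to get onto the list before it can get through; that is the appeal over trying to enumerate bad things.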

The same approach is used with MathML and SVG markup; CSS is sanitized by
removing the url() pattern and anything that looks like it might
be hiding something bad.
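
A rough sketch of that CSS cleanup, again in Python and again under assumptions: the property list and the "safe value" character set below are invented for illustration and are stricter and simpler than whatever the wiki rules actually say.

```python
import re

# Hypothetical whitelist of CSS properties for this sketch.
ALLOWED_PROPERTIES = {
    "color", "background-color", "font-weight", "font-style",
    "text-align", "margin", "padding",
}

# Step 1: remove anything shaped like url(...).
URL_PATTERN = re.compile(r"url\s*\([^)]*\)", re.IGNORECASE)

# Step 2: a value survives only if it is built from a conservative character
# set. Parentheses, backslashes, quotes and colons all fail this test, which
# rules out things like expression(...) and escaped keywords.
SAFE_VALUE = re.compile(r"^[\w\s#%.,!-]+$")


def sanitize_css(style):
    """Return a cleaned copy of a CSS declaration list (a style attribute)."""
    style = URL_PATTERN.sub(" ", style)
    kept = []
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        prop, value = prop.strip().lower(), value.strip()
        if not sep or prop not in ALLOWED_PROPERTIES:
            continue  # unknown property or malformed declaration
        if not SAFE_VALUE.match(value):
            continue  # looks like it might be hiding something
        kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

The sketch errs on the side of throwing away legitimate styling (quoted font names would not survive, for example); loosening it safely is the hard part.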

I haven’t seen any pushback against the basic approach, which makes me
happy because it seems very sound to me.

At Microsoft

Check out Joe Cheng’s
AtomPub interop event notes.
He writes “I’m thinking about implementing a web app that takes any AtomPub
endpoint and makes a blog out of it, although I would love it if someone beat
me to it.” So he’ll be looking at the same problems.

During the interop we were talking about sanitizing the payload, and I
described the whitelist approach. Joe pointed out that simply removing
style, both element and attribute, wouldn’t work for his users,
because authoring tools use this to produce nice visuals that there’s no other
obvious way to get.

So I guess that you could look inside style elements and
attributes and do your CSS-cleanup there in situ. Hmm.
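
One way that in-situ cleanup might look, as a Python sketch over a parsed tree. The function names are hypothetical, the input is assumed to be well-formed and namespace-free, and the default cleaner is only a stand-in that strips url(...); a real one would be much pickier.

```python
import re
import xml.etree.ElementTree as ET


def strip_urls(css):
    """Stand-in CSS cleaner for this sketch: it only removes url(...)."""
    return re.sub(r"url\s*\([^)]*\)", "", css, flags=re.IGNORECASE).strip()


def clean_styles_in_place(root, clean_css=strip_urls):
    """Keep style elements and attributes, but run their CSS through clean_css."""
    for elem in root.iter():
        if elem.tag == "style":
            elem.text = clean_css(elem.text or "")
        if "style" in elem.attrib:
            cleaned = clean_css(elem.attrib["style"])
            if cleaned:
                elem.set("style", cleaned)
            else:
                # Nothing safe was left, so drop the attribute entirely.
                del elem.attrib["style"]
```

Passing the cleaner in as a parameter keeps the tree walk separate from the question of which CSS is acceptable, which is the part people are likely to argue about.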

Where to Sanitize?

mod_atom actually has some cleanup code right now. If you post an Atom
entry with text marked type="xhtml", it applies a whitelist
algorithm much as specified above. Which is easy, because the Apache server
includes an XML parser that builds a DOM for you, and it’s straightforward to
run around it checking against the whitelist. The still-unsolved problem is
type="html", because that requires parsing the HTML. Blecch.
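
For a feel of what the type="html" case involves, here is a sketch that uses Python's standard html.parser module as the tag-soup parser. This is not how mod_atom does or would do it; the class name and the short lists are hypothetical. Disallowed tags are dropped but their text is kept, except for script and style, whose content is discarded too.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately short lists for this sketch.
ALLOWED_ELEMENTS = {"p", "a", "em", "strong", "ul", "li"}
ALLOWED_ATTRIBUTES = {"title"}
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit only approved tags and attributes from messy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.dropping = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.dropping += 1
        elif tag in ALLOWED_ELEMENTS and not self.dropping:
            kept = ""
            for name, value in attrs:
                if name in ALLOWED_ATTRIBUTES:
                    safe = escape(value or "", quote=True)
                    kept += f' {name}="{safe}"'
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.dropping = max(0, self.dropping - 1)
        elif tag in ALLOWED_ELEMENTS and not self.dropping:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.dropping:
            # Text was unescaped by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def sanitize_html(text):
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Note that the output is not guaranteed to be well-formed: unbalanced tags in the input stay unbalanced. Repairing that is a large part of why parsing real-world HTML earns the "Blecch".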

Right now, the mod_atom cleanup happens as the data comes in, so the
version in the AtomPub Collection feeds is sanitized. I’m beginning to think
that’s wrong, that the Atom Store part of mod_atom should preserve the data
as-is, as much as possible; presumably, those feeds and entries will be
access-controlled, not world-readable. Then
there should be a separate set of feeds offered to the world for
subscription purposes. They, and the HTML pages, exist only in the sanitized
state.
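
The split might be sketched like this in Python. Everything here is hypothetical (class, method names, the boolean standing in for access control), and html.escape is used as a blunt placeholder for a real sanitizer: it makes markup inert by escaping all of it.

```python
import html
from typing import Callable, Dict, List


class AtomStore:
    """Keeps entries exactly as posted; sanitizes only on the public path."""

    def __init__(self, sanitize: Callable[[str], str]):
        self._sanitize = sanitize
        self._entries: Dict[str, str] = {}

    def post(self, entry_id: str, content: str) -> None:
        # Stored as-is, so the author can always get back what was sent.
        self._entries[entry_id] = content

    def collection_feed(self, authorized: bool) -> List[str]:
        # The raw feed is access-controlled, not world-readable.
        if not authorized:
            raise PermissionError("collection feed requires authorization")
        return list(self._entries.values())

    def public_feed(self) -> List[str]:
        # The world only ever sees sanitized content.
        return [self._sanitize(c) for c in self._entries.values()]


# Usage: the placeholder sanitizer escapes everything.
store = AtomStore(sanitize=html.escape)
store.post("1", "<script>x</script>")
```

A side benefit of sanitizing on the way out is that tightening the rules later fixes old entries too, since the originals were never altered.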

But at this stage we’re just making this up as we go along. It’s really
nice, though, that everyone seems to have realized that the problem is real
and important; and if we can develop a set of Best Common Practices, that’d be
good for everyone.
