What (if any) HTML should we allow in posts?

celtschk · 30 January 2020 13:57

One thing that Markdown allows is the inclusion of arbitrary HTML.

I think it is obvious that we don’t want that, as that would be a major security risk (think of someone adding JavaScript to a post). Thus we need to decide what we do want to allow. I see the following possibilities:

1. Don’t allow any HTML at all.

   This is certainly the safest option, but also the most limiting: either we have something implemented in Markdown, or we don’t have it at all.

2. Allow a hand-curated list of HTML tags, and no attributes.

   This is as safe as the first option (except there’s more opportunity to make mistakes), but requires more development effort. On the other hand, it makes it easy to support features that are not, or not yet, implemented in Markdown, as long as they can be done with attribute-free HTML tags.

3. Allow a hand-curated list of HTML tags and a hand-curated per-tag list of attributes.

   This is the most complex (and the easiest to screw up) option, but also the most flexible one.
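To make the second and third options concrete, here is a minimal sketch of an allow-list filter using only Python's standard library. The tag and attribute choices in `ALLOWED_TAGS`, and the names `AllowListSanitizer` and `sanitize`, are assumptions for illustration, not a proposal for the actual list. It does not balance unclosed tags, and URL-carrying attributes such as `href` or `src` would additionally need a scheme check, which is part of why the per-attribute option is easier to get wrong.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tag name -> attribute names permitted on that tag.
# An empty set means "tag allowed, no attributes" (the second option).
ALLOWED_TAGS = {
    "sup": set(),
    "sub": set(),
    "ins": set(),
    "del": set(),
    "li": set(),
    "ol": {"type", "start"},
}


class AllowListSanitizer(HTMLParser):
    """Re-emit only allow-listed tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag; any text inside it is still emitted, escaped
        kept = [(k, v) for k, v in attrs if k in ALLOWED_TAGS[tag] and v is not None]
        rendered = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped, so it can never turn back into markup.
        self.parts.append(escape(data, quote=False))


def sanitize(fragment):
    parser = AllowListSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


print(sanitize('H<sub onclick="evil()">2</sub>O'))
```

With an empty attribute set for every tag this behaves like the second option; adding per-tag attribute sets turns it into the third.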

I think for the MVP we can simply disallow HTML. But maybe someone considers some HTML feature that’s not available in Markdown essential.

aklin · 30 January 2020 14:02

+1

This is an old problem with many solutions. Vanilla MD has enough features for the majority of content, and as a bonus, you won’t have to do weird formatting on your HTML code sample to make it work.

Turning off user-inserted HTML is the way to go, at least for now.

cwellsx · 30 January 2020 14:17

I don’t know why SE supports it – possibly for people who don’t know markdown? Or for copy-and-pasting?

Looking at SE’s list, all you can do with HTML that you can’t equally do with Markdown seems to be <ins>, <del> and <dl> – also <sub> and <sup>, which people sometimes use e.g. to format/simulate footnotes – and <img> size attributes.

mbomb007 · 30 January 2020 15:18

Could we at least have HTML comments allowed? Still not MVP, but I think sometimes it’s nice to be able to include a note in the post that cannot be seen. This is also something that SE uses for prettify markup for syntax highlighting (e.g. ).

cwellsx · 30 January 2020 15:24

I think you can do that in markdown too, on SE and elsewhere, using the “code fence” method, like this:

```python
print('Hello, world!')
```

mbomb007 · 30 January 2020 15:33

Huh, didn’t know about that. It’s still nice to be able to include HTML comments in a post, though.

I would also like to see literal tabs allowed in code blocks. They already are here (hurray! See below), but SE had this problem.

i=0;j=10
while i < j:
	print(i, '	Tab,    4spaces')

Olin · 30 January 2020 15:40

Personally, I made heavy use of HTML at SE, and only used markdown when absolutely needed. If the right HTML is supported, then markdown isn’t needed at all.

I don’t like markdown because it’s yet another syntax whose details I have to remember for this site, and the documentation is usually not at hand when you actually need it while typing a post.

The HTML constructs I used frequently were <sup>, <sub>, <ol>, <ul>, <a> for external links, and probably a few others.

I also often used the &xxx; constructs, usually for special characters like Ω, ω, Π, Θ, ½, ¾, ≤, ≠, ≥, non-breaking space, etc.

cwellsx · 30 January 2020 15:44

The “effort” in question being, presumably, to select a suitable existing implementation to reuse; plus testing, I suppose.

manassehkatz · 30 January 2020 15:46

IIRC, SE does not support ~~strikethrough~~ which is standard enough in MD in many other places, including (obviously) right here. So instead they use ~~the s tag~~ which then doesn’t work in comments. So ~~ for strikethrough is a must. Plus HTML entities as you mentioned.

<a> should not be needed - you can use [] () in MD and it works great.

Lists also work fine in MD - no need for <ol> etc.

So really all that leaves that I can think of is tables, which has been discussed elsewhere and can be, IMHO, post-MVP.

Olin · 30 January 2020 15:51

That’s part of the point. “Markdown” doesn’t mean one thing as reliably as HTML does. It’s a pain to have to remember the formatting details of every web site. I’d rather not have any markdown at all, just support for a few selected HTML tags, and of course the &xxx; constructs.

On this site, maybe, after having to dig out the documentation while trying to type a post. I don’t see the point of markdown at all if maybe a dozen or so selected HTML tags are supported.

manassehkatz · 30 January 2020 16:02

There are a number of reasons why I think Markdown is great for a site like this (meaning: Q&A, forum, any site where non-programmers will regularly enter content):

* Minimize security issues (both JavaScript and potentially some other issues)
* Much easier to learn than HTML for non-programmers
* A lot less to type (e.g., 2 characters for asterisk space instead of 4 characters for an li tag)
* More natural (some exceptions, but a lot of markdown tags are close to “normal” text)
* Easier to translate to other platforms. My personal issue is PDF generation. Trivial (relatively) with a fixed set of MD commands. Quite messy if you want to translate arbitrary HTML.

So to me, the answer is to extend input options as needed (e.g., MathJax as mentioned by a few people already) but not to open up to arbitrary HTML.

Olin · 30 January 2020 16:22

I understand arbitrary HTML would be bad, but nobody is suggesting that. A dozen well-chosen tags would probably cover it. And, the &xxx; constructs need to be supported.

manassehkatz · 30 January 2020 18:48

Ideally if we could avoid all HTML <tags>, I think that would be best.

As for the &xxx; constructs, those are usually called HTML entities. They do not pose any risks that I am aware of and are easy to parse (if not already parsed by the usual markdown library).
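As a rough illustration of why entities are low-risk, here is a sketch using Python's standard `html` module. The helper name `normalize_entities` is hypothetical, and the sketch assumes it is applied to plain text nodes after Markdown parsing, not to raw post source. It decodes entities to characters and then re-escapes only the markup-significant ones, so an entity-encoded `<` cannot turn into a real tag.

```python
import html


def normalize_entities(text):
    """Decode &xxx; entities, then re-escape &, < and > for safe output."""
    decoded = html.unescape(text)          # "&Omega;" -> "Ω", "&lt;" -> "<"
    return html.escape(decoded, quote=False)  # "<" -> "&lt;" again


print(normalize_entities("R = 10 &Omega;, x &le; &frac12;, &lt;b&gt;"))
# → R = 10 Ω, x ≤ ½, &lt;b&gt;
```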

mbomb007 · 30 January 2020 20:42

Markdown is easier to type, plain and simple; there are no closing tags. # for headers, * for unordered lists. It’s nice.

The only thing I don’t like is trying to escape a backtick in code. If I want to put a single backtick in a code block, I can’t figure it out.

celtschk · 30 January 2020 21:03

In a code block there should be no problem at all:

This is a back tick: `
Here are another two: ``
And `here` they are in the middle, too.

Maybe you meant inline code like this. Well, for that you just write more backticks at the beginning/end. For example, A single backtick like ` can be entered as `` ` ``, and the latter I wrote as ``` `` ` `` ``` (how I wrote the last one is left as exercise for the reader :-)).
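The “more backticks” rule can be written down as a small helper. This is only a sketch, assuming CommonMark-style code spans; `inline_code` is a hypothetical name. It picks a delimiter one backtick longer than the longest run inside the content, and pads with a space when the content itself starts or ends with a backtick.

```python
import re


def inline_code(content):
    """Wrap content in a backtick delimiter that cannot clash with it."""
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    fence = "`" * (longest + 1)
    # A space keeps the content's own backticks from merging with the delimiter.
    pad = " " if content.startswith("`") or content.endswith("`") else ""
    return f"{fence}{pad}{content}{pad}{fence}"


print(inline_code("`"))        # → `` ` ``
print(inline_code("`` ` ``"))  # → ``` `` ` `` ```
```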

pinobatch · 31 January 2020 04:22

Most Markdown processors that I’ve evaluated support lists equivalent to <ol type="1">. Semantically, this means several items in sequence. I haven’t seen wide support for <ol type="A"> (or BBCode [list=a]), which to me indicates a choice among several options, which may or may not be mutually exclusive. I just tested, and Discourse Flavored Markdown doesn’t appear to automatically translate a. and b. or A. and B. or A) and B) to turn them into HTML list items.

Nor have I seen a Markdown processor with a counterpart to <dl>. The only vaguely Markdown-like system I know of that translates something to <dl> is MediaWiki markup, which turns this:

;Linux
:A copylefted operating system kernel by Linus Torvalds et al.
;GNU/Linux
:An operating system consisting of Linux, GNU Coreutils, and two or more other major components of the GNU system.

Into this:

<dl>
<dt>Linux</dt>
<dd>A copylefted operating system kernel by Linus Torvalds et al.</dd>
<dt>GNU/Linux</dt>
<dd>An operating system consisting of Linux, GNU Coreutils, and two or more other major components of the GNU system.</dd>
</dl>

Which gets rendered thus:

Linux: A copylefted operating system kernel by Linus Torvalds et al.
GNU/Linux: An operating system consisting of Linux, GNU Coreutils, and two or more other major components of the GNU system.
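For illustration, the translation above is simple enough to sketch in a few lines of Python. This is not MediaWiki's actual parser, only an assumed minimal reading of the `;term` / `:definition` syntax shown in this post, and `definition_list_to_html` is a hypothetical name.

```python
from html import escape


def definition_list_to_html(markup):
    """Turn ';term' / ':definition' lines into a <dl> element."""
    lines = ["<dl>"]
    for line in markup.splitlines():
        if line.startswith(";"):
            lines.append(f"<dt>{escape(line[1:].strip())}</dt>")
        elif line.startswith(":"):
            lines.append(f"<dd>{escape(line[1:].strip())}</dd>")
        # Any other line is ignored in this sketch.
    lines.append("</dl>")
    return "\n".join(lines)


print(definition_list_to_html(";Linux\n:A copylefted operating system kernel"))
```

Because the generated tags are fixed and all user text is escaped, this kind of Markdown-side extension needs no raw HTML from the poster.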

manassehkatz · 31 January 2020 04:34

Then the question becomes, for purposes of our project, do these other types of lists (or other HTML that does not have a Markdown equivalent) matter?

I would argue that for ordinary Q&A, they do not. There are needs for MathJax or other “extras” for certain communities, but not so much for “extra” HTML in general.

That being said, there may be more need for additional HTML in Blogs, Wikis and other additional post types that we are planning to have.

curiousdannii · 31 January 2020 05:35

I’d suggest we officially use CommonMark, which supports many HTML tags. Committing to CommonMark means we can use conformant implementations in both C# for the server and JS for formatting dynamic things (comments perhaps?) hopefully with identical output.

Corsaka · 31 January 2020 08:16

Why can’t we have both?

Supporting both HTML and Markdown would allow far more flexibility anyway; we can use shortcuts via Markdown, or we can use good ol’ HTML when we’ve forgotten the Markdown or just feel like it.

Corsaka · 31 January 2020 08:19

CommonMark doesn’t seem to do anything to stop ACE (arbitrary code execution). It passes raw HTML through as is.