What (if any) HTML should we allow in posts?

I like Markdown, but it does have downsides. There are a bunch of dialects out there, and plenty of extensions for things like tables. The one I use predominantly is the GitHub dialect. One website uses a particular dialect and set of extensions that another website doesn’t support, so I can see why there is some resentment against Markdown.

A more complete, monolithic set of rendering features is reStructuredText. Given that most people here are coming from SO, I’m not sure going the ReST route would be a good idea. Also, the ReST syntax isn’t particularly beautiful; it is geared towards writing documentation for programs. It normally allows some HTML, but that depends on the interpreter, which can usually switch HTML off. ReST includes support for simple tables and definition lists. Markdown needs extensions for those, a fact which I dislike, too.

What @Olin describes is pretty much what BB-Code offers, except that plenty of things are missing from BB-Code (such as subscript and superscript). Those would be relatively easy to add, should Codidact decide to provide them. The same goes for HTML entities. I almost never use them; I just rely on UTF-8 input, using either my Greek deadkey for Greek letters or Compose for mathematical symbols and other special characters. Standard BB-Code has a bunch of tags that are pretty much useless in a Q/A context with curated content: things like aligning text with [center] … [/center]. It’s also rather verbose.

At the end of the day, there’s no perfect solution to suit everyone, which is why there are so many competing markup languages. I’d suggest sticking to a particular Markdown dialect that is in wide use (such as GitHub Markdown), or reStructuredText, or using something like BB-Code with some tags removed from the standard set and a few added (such as [sub] and [sup]). Since there are so many competing variants already, Codidact might as well create its own that suits its requirements.
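
To make the trimmed BB-Code idea concrete, here is a minimal sketch in Python. The tag set (`b`, `i`, `sub`, `sup`) and the function name `render_bbcode` are assumptions made for illustration only; nobody in this thread has settled on a dialect. It escapes the input first, then turns only the allowed tags into HTML, and it does not check that tags are balanced or properly nested.

```python
import re
from html import escape

# Hypothetical tag set: a trimmed BB-Code dialect plus [sub] and [sup].
BBCODE_TO_HTML = {"b": "strong", "i": "em", "sub": "sub", "sup": "sup"}
TAG_RE = re.compile(r"\[(/?)(\w+)\]")


def render_bbcode(text):
    """Render the allowed BB-Code tags as HTML; everything else stays text."""
    # Escape first so raw HTML in the input can never reach the output.
    escaped = escape(text, quote=False)

    def replace(match):
        closing, name = match.group(1), match.group(2).lower()
        html_name = BBCODE_TO_HTML.get(name)
        if html_name is None:
            # Tags outside the set, such as [center], are left as literal text.
            return match.group(0)
        return f"<{closing}{html_name}>"

    return TAG_RE.sub(replace, escaped)
```

With this sketch, `render_bbcode("H[sub]2[/sub]O")` produces `H<sub>2</sub>O`, while `[center]` is left as plain text because it is not in the allowed set.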

From my viewpoint markdown is sufficient. I have used html tags a few times in order to do some non-standard styling [1], but since support for my favorite tags, ‘marquee’ and ‘blink’, is missing, I did not use html a lot.

But maybe that, a lot of html tags, is not something that we should want.

An advantage of markdown is that it also works as a sort of style guide.

So, any allowed html should just eclipse the possibilities with markdown. On the one hand you want it for the convenience of people who wish to use html (although <b>html tags</b> sometimes make raw body text more difficult to read than **markdown** symbols). On the other hand you want to limit it in order to reduce the variability in styles (and keep the site cleaner, less complex, and easier to read and follow).

1: I used it to write footnotes with smaller text size


People worried about abuses should be aware that .NET has HtmlAgilityPack available, a fast, high-quality HTML parser that makes it trivial to write sanitizers. With HtmlAgilityPack, writing something that will scrub <script> tags, style attributes, and other unwanted bits of HTML from submissions only takes a few lines of code.
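
As an illustration of the allowlist approach, here is a rough sketch in Python using only the standard library's `html.parser`. It is not HtmlAgilityPack and does not mirror its API. The allowed tag set and the names `AllowlistSanitizer` and `sanitize` are assumptions for the example. It drops `<script>` and `<style>` together with their contents, strips every attribute, and escapes all text. It makes no attempt to balance unclosed tags, so treat it as a demonstration of the idea rather than something production-ready.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration; the thread does not settle on one.
ALLOWED_TAGS = {"b", "i", "em", "strong", "sub", "sup", "s", "code", "p", "br"}
VOID_TAGS = {"br"}
# Elements whose text content is dropped entirely, not just the tags.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes (style, onclick, ...) are discarded unconditionally.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text so decoded entities cannot turn back into markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Tags outside the allowlist are removed but their text is kept, so a `<marquee>` element degrades to plain text instead of vanishing.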


Is it possible to add an HTML-to-Markdown conversion for the portions of a post that are not meant to be quoted code blocks? Would that be a good idea?

This is clearly a preference thing. From my viewpoint, a limited subset of HTML is sufficient. I really dislike markdown, in part because there is no one standard across sites, and use HTML whenever possible.

Clearly there are people out there with both preferences. Don’t dismiss one just because it’s not your favorite.

That sounds like a Good Thing. Those would both be very annoying on a Q&A site (and pretty much everywhere else in any civilized corner of the internet).

It’s fair to say that HTML tags aren’t really needed; but if posts from SE are going to be imported to Codidact, it’d be a shame not to render them properly when they do contain tags.


Statistics is one of the more barbarous SE sites—we sorely miss the blink & marquee tags.


Since this thread is tagged “mvp”…

Start with just markdown. As we identify needs not addressed by markdown, address them. One such need is MathJax or something similar. A preference for HTML isn’t sufficient at this point if it’s something you can do in markdown. If it’s a lot harder to do in markdown, that’s a reason to discuss it.

(And we already said CommonMark, though I don’t know that anybody was looking at its HTML support.)


Yes that (importing) sounds like a compelling argument for supporting at least the same elements as SE does.


FYI, when I looked at CommonMark it seemed that it actually includes a limited selection of HTML tags as a part of it, along with the usual Markdown stuff like * for lists. I have seen differing opinions on how good/bad/safe/etc. that is, but it does seem a reasonable way to start.

On the other hand, you won’t have the same subset of HTML across sites, either. There is no standard for which HTML is and isn’t allowed. You’d end up having to look up which subset is allowed where; for instance, one website might allow <table></table> while another might not.

The same set of extensions might not be provided across sites, but if “Markdown” is provided, you can at least expect the minimum as defined in the original publication. Extensions can add to the syntax, but not remove from it. Markdown specifically allows inline HTML except block-level HTML elements.

So technically, Markdown actually implements what you require.


Thanks. I didn’t try adding spaces, since normally they are escaped with backslashes on SE, I think.

Do we really need to reinvent the wheel? I’d say no.

Please no. These tags are obsolete HTML, and they are horribly distracting.

I’d say that if we are going to import posts, we could probably convert the most important HTML markup into markdown during the import process if we needed to. Otherwise, I’d be up for having the core HTML that SE supports, but I definitely would vote against <blink> and <marquee>.
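
A conversion pass during import could look roughly like the sketch below, written in Python. The tag-to-marker mapping and the function names are assumptions for illustration; which tags count as "most important" has not been decided here. It only handles attribute-free inline tags, leaves everything else (such as `<table>`) untouched, and skips lines indented as code blocks. It does not handle edge cases such as whitespace directly inside a tag.

```python
import re

# Hypothetical mapping from simple inline HTML tags to CommonMark markers.
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*", "code": "`"}
TAG_RE = re.compile(r"</?(b|strong|i|em|code)>", re.IGNORECASE)


def convert_line(line):
    # Opening and closing tags map to the same marker in Markdown.
    return TAG_RE.sub(lambda m: MARKERS[m.group(1).lower()], line)


def html_inline_to_markdown(post):
    converted = []
    for line in post.split("\n"):
        # Leave indented code blocks (four spaces or a tab) untouched.
        if line.startswith("    ") or line.startswith("\t"):
            converted.append(line)
        else:
            converted.append(convert_line(line))
    return "\n".join(converted)
```

Anything the mapping does not cover would still need either a renderer that supports the remaining HTML or a manual review step.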


I argued for removing HTML from the CommonMark spec while it was being developed.

The security problem with Markdown is that it is natural to naively assume that it is safe, but this is not the case.

tl;dr: HTML sanitisers are complex and have bugs. Allowing HTML puts your users at risk for very little gain (plugins are the safe way to extend markdown for <sub>, <s> etc)

I think it’s worth noting that although folk didn’t agree that it should be removed from the spec, many, including the developer of markdown-it, seemed to agree with the sentiment:

Better approach is to not accept html tags and to not generate bad markup at all.


Are there still browsers that support those tags anyway?

Sorry - I should admit now I was kidding. I’m sure Martijn was too.

Erm, I believe you kinda misunderstood the sentiment, here.
The point I was making was that, should Codidact decide to settle on BB-Code specifically, you might as well add or remove tags in your BB-Code dialect, since there are no accepted variants in wide use. There’s just “the standard” BB-Code (which is incredibly outdated), and every website and forum has been, and still is, adding and removing BB-Code tags to its heart’s content. So why shouldn’t Codidact do that, too?

Having had a look at the CommonMark spec, it seems it allows the same HTML that regular Markdown does, plus a load of really questionable additions.
This thread is not the right place to ask, but why would you need to allow <script></script> tags? This opens the barn door to all sorts of annoying scripts being run while someone’s reading an article, not to mention the security risk. One day, someone might explain that to me. It also allows <style></style>, and I can think of plenty of funny stuff to do with that. In fact it also allows block-level tags (such as <div>) and even document definitions (<!DOCTYPE>). However, probably the most “shocking” one is <iframe>.

I haven’t looked at the implementation, but I’m assuming certain CommonMark features can be turned off?


I imagine it’s for situations where there’s no user-submitted content, for example when you’re using markdown to write blog posts, maybe on GitHub Pages (I don’t know if <script> is supported there).

Obviously any solution chosen for Codidact would have to ban scripts etc, and I think many/most commonmark libraries would support that.


blink and marquee tags are highlights of the capabilities of html and are only surpassed by the possibilities to embed swf.

They give users all creativity to design their posts, and that is what we really need.

If usenet had had these html options, beyond using their limited *bold*, italic

=headers=

    * items
       ** subitems

> and quotes

then it might still be popular.


I wasn’t kidding. I am making a very serious point there. But yes, it needs to be read in an ironic way.


Counterpoint: In the good ol’ days there was Courier, or a passable 7x9 dot-matrix fixed pitch font. (If you were lucky). And “bold” by double-printing and underline by overprinting with underscores. And we were happy! (Oh yes, we walked through 2 feet of snow to school, up hill both ways.)

But seriously, while the primary output is HTML, it will not be the only output (e.g., PDF - and I tried Wikipedia’s PDF for comparison recently - it works very well.), and including “whatever” HTML in a document can really complicate alternative output formats. Plus there are security and usability issues. Markdown, without any raw HTML inserted, can really be quite powerful, and we are talking about Q&A (and Blogs and similar) here, not a typesetting system or a personal “here’s what I can do” web site.
