HTML Versus XHTML

The argument about whether to use HTML 4.01 or XHTML 1 is one that comes up time and time again. Not so long ago, most people were advising the use of XHTML 1 almost without question, in the belief that it’s little more than a newer implementation of HTML and therefore somehow automatically the ‘better’ option (specifically, XHTML 1 is a reformulation of HTML 4 as an application of XML). However, many people who once recommended using XHTML 1 have since changed their minds on the topic, including some SitePoint authors who made very strong arguments as to why examples in their books should be presented in HTML 4.01 rather than XHTML 1. Others have decided to move on to using HTML5 instead, even though the W3C has not finalized this particular language’s specification. That said, it is clear now that HTML5 is where the Web is headed (XHTML 2, the proposed successor to XHTML 1 and XHTML 1.1, has effectively been canned, with the working group being disbanded at the end of 2009). Until such time as the HTML5 specification is solidified and has strong support across the major browsers, the safest option is to use HTML 4.01 or XHTML 1 (which from this point on we’ll refer to simply as HTML and XHTML without the version numbers). [1]

It seems that we need to clarify what HTML and XHTML are, what their differences are, and why one or the other should be used.

The first thing you should realize is that using HTML is not wrong as long as you specify that you’re using HTML with the appropriate doctype, and the HTML you use is valid for that doctype. If you want to use HTML 4.01, no one can stop you! Ignore anyone who tells you that XHTML is the only way to go, and that using HTML 4.01 is somehow backwards. That said, you should be aware of the differences between HTML and XHTML, as these may affect your choice of markup.

Main Differences Between XHTML and HTML

The following list details the main differences between XHTML and HTML. Most of them are related to syntax differences, although there are some less obvious variations that you may not be aware of:

  • XHTML is more choosy than HTML—there are some elements that absolutely must appear in the XHTML markup, but which may be omitted if you’re using HTML 4 and earlier versions. These elements include the html, head, and body elements (although why you’d want to omit any of them is a mystery to me). In addition, every element you use in XHTML must have both an opening and closing tag (for example, you’d write <p>This is a paragraph</p> in XHTML, but <p>This is all you need in HTML, as no end tag is required).
  • For empty elements—those that hold no content but refer to a resource of some kind, such as an img, link, or meta element—the tag must have a trailing closing slash, like so: <img src="moo.jpg" alt="moo"/>. Evidently, this makes XHTML a little more verbose than HTML, but not to the extent that it has an adverse effect on the page weight.
  • XHTML allows us to indicate any element as being empty—for example, an empty paragraph can be expressed as <p/>—but this isn’t valid when the page is served as text/html. To that end, you should restrict your use of this syntax to elements that are defined to be empty in the HTML specifications.
  • In XHTML, all tags must be written in lowercase. In HTML, you can use capital letters for elements, lowercase letters for attributes, or whatever convention you like!
  • All attributes in XHTML must be contained in quotes (single or double, but usually double), hence <input type=submit name=cmdGo/> would be valid in HTML 4.01, but would be invalid in XHTML. To be valid, it would need to be <input type="submit" name="cmdGo"/>.
  • In XHTML, all attributes must be expressed in attribute-name and attribute-value pairings with quote marks surrounding the attribute value part, like so: class="fuzzy".
  • In HTML, some elements have attributes that do not appear to require a value (for example, the checked attribute for checkbox input elements). I stressed the word “appear” because technically it’s the attribute name that’s omitted, not the value. These are known as Boolean attributes, and in HTML you could specify that a checkbox should be checked simply by typing <input type="checkbox" name="chkNewsletter" checked>. In XHTML, though, you must supply both an attribute name and a value, which results in seemingly needless repetition: <input type="checkbox" name="chkNewsletter" checked="checked"/>.
  • In XHTML, the opening <html> tag requires an xmlns attribute (XML NameSpace) as follows: <html xmlns="http://www.w3.org/1999/xhtml">. However, strangely, if you omit it, the W3C validator doesn’t protest as it should.
  • XHTML requires certain characters to appear as named entities. For example, you can’t use the & character: it must be expressed using an HTML entity "&amp;".
  • In XHTML, languages in the document must be expressed using the xml:lang attribute instead of lang.
  • A MIME type must be declared appropriately in the HTTP headers as "application/xhtml+xml" (this is the best option), "application/xml" (acceptable), or "text/xml" (which isn’t recommended). The MIME type is set as a configuration option on the web server, which is usually Apache or IIS.
  • DTDs don’t support the validation of mixed namespace documents very well.
  • If you use XHTML and set the proper MIME type (see the section below called Serving the Correct MIME Type), you’ll encounter a small snag: Internet Explorer. At the time of writing, this browser—which still holds the lion’s share of the market—is the only one of the browsers tested for this reference that can’t handle a document set with a MIME type of "application/xhtml+xml". When IE encounters a page that contains this HTTP header, it doesn’t render the page on screen, but instead prompts the user to download or save the document.
  • When you’re using XHTML, text encoding should be set within the XML declaration, not in the HTTP headers (although doing the latter is still allowed). A full tutorial on the thorny issues surrounding the ways of setting character encodings is available on the W3C site.
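Several of the syntax rules in this list are XML well-formedness rules, so any XML parser can check them. What follows is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module; the helper name is_well_formed is hypothetical, and the "Fish & chips" text is made up for illustration. Bear in mind that it tests well-formedness only, not validity against an XHTML DTD, so it won't catch rules such as lowercase tag names or the xmlns attribute.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# HTML-style shortcuts that an XML parser rejects.
html_style = [
    "<p>This is all you need in HTML",                         # no closing tag
    "<input type=submit name=cmdGo/>",                         # unquoted attribute values
    '<input type="checkbox" name="chkNewsletter" checked/>',   # attribute without a value
    "<p>Fish & chips</p>",                                     # unescaped ampersand
]

# The same markup written to the XHTML rules.
xhtml_style = [
    "<p>This is a paragraph</p>",
    '<input type="submit" name="cmdGo"/>',
    '<input type="checkbox" name="chkNewsletter" checked="checked"/>',
    "<p>Fish &amp; chips</p>",
]

for snippet in html_style + xhtml_style:
    verdict = "well formed" if is_well_formed(snippet) else "NOT well formed"
    print(f"{verdict}: {snippet}")
```

Every snippet in html_style is reported as not well formed, and every snippet in xhtml_style passes.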

In addition to these points, there are a number of differences between the way that an XHTML document handles scripts and the way it handles style sheets, including:

  • There are requirements for the way in which comments inside scripts should be handled; Lachlan Hunt covers this topic thoroughly in his blog entry, “HTML Comments in Scripts.”
  • document.write() and document.writeln() do not work in XHTML.
  • innerHTML property is also ignored by some user agents.
  • As XHTML is case sensitive, there can be an issue with element and attribute names in DOM methods. For example, onClick and onSubmit are invalid, while onclick and onsubmit are valid.

The list above might discourage some newcomers from learning the XHTML syntax; it does certainly appear that, at the very least, XHTML requires more discipline and thought than HTML! However, one advantage of learning XHTML syntax rather than the looser HTML syntax is that if you stick to the rules I’ve outlined above, you’ll be creating pages that render just as HTML would in the browser, but which also validate as XHTML. The presence of some XHTML-specific attributes (namely the xmlns attribute in the root element, and the use of xml:lang rather than lang) does mean that you can’t simply change the doctype of a valid XHTML document back to an HTML doctype and have the page validate, though. It will contain features that are not understood by, or accepted in, the HTML specifications.

So, there’s nothing wrong with learning the HTML 4.01 syntax to begin with, and progressing to XHTML when you feel more comfortable doing so. The transition from HTML to XHTML doesn’t have to be a massive step, although some of the habits you’ll have picked up while you were marking up HTML documents will need to be unlearned to make this a successful transition. Learning HTML 4.01 is not a bad thing, and it doesn’t make you a lazy coder; it’s just different from XHTML.

Does XHTML Reduce Your Markup Toolset?

You may have heard or read that choosing XHTML means that you can’t use certain presentational elements such as center, font, basefont, or u. But this isn’t strictly true. You may use these elements in XHTML Transitional and Frameset just as you could in HTML Transitional and Frameset—the difference is that they’re not allowed in the Strict versions of these markup languages. Hopefully that’s a myth busted!

If you do opt to use XHTML Strict (or HTML Strict), and you thereby lose these presentational elements, you’ll definitely need to rely on CSS to do the work of prettying things up; this approach also places just a little more emphasis on the use of more structurally orientated elements available in HTML and XHTML. But don’t be led to believe that XHTML is in some way more structural than HTML 4.01. You’re not going to be adding new structural features through your use of XHTML—headings, paragraphs, block quotes, and so on were all present in HTML 4.01.

Regardless of the flavor of markup you choose—HTML or XHTML—you can easily mark up your page using a series of div and span elements, style it entirely in CSS, validate it, and still be left with a document that offers no apparent meaning or structure about the content it contains. In short, the language is only as good as the pair of hands responsible for crafting it, and thus XHTML doesn’t guarantee a better end result!

Opting for HTML for Optimized Page Weights

One possible reason for using HTML 4.01 over XHTML (of any kind or level of strictness) might be that page size is a very important consideration. For example, you may be creating a page that needs to be downloaded over a restricted connection, perhaps to a mobile device of some kind. By using HTML 4.01, you’re able to reduce the markup by not using quote marks and not using closing tags where the spec indicates that they’re optional.

If you’re building your own personal web site, or you’re building a site for an organization that doesn’t have (or expect) huge amounts of traffic, the aim of achieving slightly leaner page weights probably won’t be a strong case for using HTML 4.01. However, if we’re talking about a site that receives a significant amount of traffic, the savings may well add up, so you might need to get your calculator out! For example, if your use of HTML 4.01 means that you can omit 100 bytes of characters from a given document (without those deletions having an adverse effect on the document’s presentation in the browser), and if that document receives one million hits a day, over the course of a month that saving will amount to almost 3GB of bandwidth. Now, this is just a hypothetical scenario, and this is but one page in a web site, but depending on the number of visitors your site attracts, a shaving of markup here, and a corner cut there—all the while ensuring that your page validates as HTML 4.01—may lead you to choose the slightly more svelte HTML rather than XHTML. However, for most site owners the savings will be negligible.
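The arithmetic behind that hypothetical figure can be sketched in a few lines of Python. The function name monthly_savings_bytes and the 30-day month are assumptions made for this illustration; the 100 bytes and one million hits a day come from the scenario above.

```python
def monthly_savings_bytes(bytes_saved_per_hit: int, hits_per_day: int, days: int = 30) -> int:
    """Bandwidth saved over a month, assuming every hit transfers the whole page."""
    return bytes_saved_per_hit * hits_per_day * days


saved = monthly_savings_bytes(100, 1_000_000)
print(f"{saved:,} bytes")                     # 3,000,000,000 bytes
print(f"{saved / 10**9:.2f} GB (decimal)")    # 3.00 GB (decimal)
print(f"{saved / 2**30:.2f} GiB (binary)")    # 2.79 GiB (binary)
```

Counted in binary units the saving comes to about 2.79 GiB, which is where "almost 3GB" comes from.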

Serving the Correct MIME Type

If you intend to create a web page that can be treated as XML and parsed accordingly, you’d probably create that page in XHTML. You may also want to take this course of action for the purposes of including another XML-based technology such as MathML or CML (Chemical Markup Language) in your page. If you do find yourself needing to use those technologies, you’re almost certainly not a “typical” web page author, and as such, most of what follows won’t really concern you too much …

In order for your page to be interpreted as proper XML, you must serve it with a MIME type of "application/xhtml+xml" (normally, web pages are served as "text/html"). Once you do so, you’ll have to be very careful with the coding of your web page. One well-formedness slip-up (for example, an unquoted attribute value, improperly nested opening and closing tags, or an unclosed tag) and your web page won’t render at all. Users will be presented with a fatal parsing error from the browser, which will tell them that the page couldn’t be parsed or understood. It’s very unforgiving!
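One workaround that is sometimes used for browsers that can't handle "application/xhtml+xml" is to choose the MIME type per request, based on the Accept header the browser sends. This technique isn't described in this reference, so treat the following as a rough sketch under stated assumptions: the function name choose_content_type is hypothetical, and the Accept header strings in the usage lines are illustrative rather than captured from real browsers. Remember, too, that a page served as "text/html" is parsed as HTML, not as XML.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_content_type(accept_header: str) -> str:
    """Pick a MIME type for an XHTML page based on the request's Accept header."""
    for media_range in accept_header.split(","):
        parts = [part.strip() for part in media_range.split(";")]
        if parts[0].lower() != XHTML_TYPE:
            continue
        # An explicit quality value of 0 means "not acceptable".
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            return XHTML_TYPE
    # The browser did not explicitly list the XHTML type, so fall back to HTML.
    return HTML_TYPE


# Illustrative Accept headers (assumed values, for demonstration only).
print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(choose_content_type("*/*"))
```

The sketch deliberately ignores wildcards such as "*/*", because a browser that sends only a wildcard hasn't said that it can actually render XHTML.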

Here’s a simple test that you can try for yourself. Create a simple XHTML page using the markup below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>Hello world</h1>
    <p>This is a web page.</p>
  </body>
</html>

Now save this document with the file extension .xhtml, rather than .htm or .html. Next, open the file in Firefox, Opera, or Safari. Is everything looking okay? Now, make a subtle change: amend the closing </h1>—and only the closing </h1>—to use an uppercase H. Refresh the page and see what happens. If everything’s gone to plan, you’ll now be looking at a broken web page, similar to the one shown in Figure 1.

Figure 1. A mismatched tag breaking an XHTML document when served as "application/xhtml+xml"

What’s important about this exercise is that the behavior displayed by these browsers when they open a local file that’s not well formed, and has the .xhtml extension, is exactly the same as the error that they’d present if they encountered a malformed page on the web that was served with the MIME type "application/xhtml+xml". Bear in mind that even if you take the utmost care with your own code, it only takes one poorly formed user comment to do the necessary damage! I’m sure you can see what a tricky problem this can be!
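If you'd rather not rely on a browser, the same experiment can be approximated with any XML parser. Here's a minimal sketch in Python using the standard library's xml.etree.ElementTree; the helper name parse_error is hypothetical, and the exact wording of the error message depends on the parser, so it may not match what a browser displays.

```python
import xml.etree.ElementTree as ET

# The test page from the exercise above.
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>Hello world</h1>
    <p>This is a web page.</p>
  </body>
</html>
"""


def parse_error(document: str):
    """Return the parser's error message, or None if the document is well formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return str(err)
    return None


# Change only the closing </h1> to use an uppercase H.
broken = PAGE.replace("</h1>", "</H1>")

print(parse_error(PAGE))    # None: the original page is well formed
print(parse_error(broken))  # an error message pointing at the mismatched tag
```

As in the browser test, the single mismatched tag is enough for the parser to reject the whole document.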

The argument against using XHTML is basically this: if you’re not using XHTML for the purposes of creating an XML-based web application of some kind, there’s no real reason to use XHTML—you may as well stick to HTML. And if you’re intent on creating a web page that validates as XHTML, but it is served with the "text/html" MIME type, you won’t really reap any kind of benefit either. So if you want to use XHTML, learn it properly and be sure to understand the pitfalls. Otherwise, you may be better off with HTML.

XHTML: Encouraging Good Practices?

I advise people to learn XHTML, not HTML, regardless of whether the web page is going to be treated as an XML web application of some kind or as a simple web page (for more on this thorny topic, see the section entitled Serving the Correct MIME Type). By taking this approach, you’re encouraged to nest elements properly, close all your tags properly, and use quotes around all your attributes. This is my preference, but I’m under no illusion as to the fact that if I serve one of these web pages as "application/xhtml+xml" and it contains even a slight error, all my good work will end with the fatal error mentioned above. That said, should I later wish to incorporate XML features into my pages, I will have a good starting point to work from.

Given that this is a reference, rather than a guide aimed at total beginners, you likely already know a certain amount about HTML and XHTML; you may feel more comfortable taking the same approach, and using the XHTML syntax. If you’re a beginner, however, you may prefer to start with HTML 4.01, but you should still follow the rules for that version of HTML.

Footnotes

1 This is a topic that I’ve had to address, as my beginners’ book on HTML and CSS, Build Your Own Web Site The Right Way Using HTML & CSS, used XHTML, while the SitePoint Forums members argued about which flavor of HTML should be used. This argument prompted more than a few people who bought my book as complete beginners to ask me directly, “Why do you recommend using XHTML while some people say not to use it?”

User-contributed notes

ID: #10 | Contributed by: ScottyDM | Date: Tue, 18 Mar 2008 02:35:53 GMT

When deciding whether to use HTML or XHTML, I believe the most important factor is the one stated by the page's author:

"The argument against using XHTML is basically this: if you are not using XHTML for the purposes of creating an XML-based web app of some kind, there is no real reason for using XHTML, you may as well stick to HTML. And if you are intent on creating a web page that validates as XHTML but it is served with the MIME-type of "text/html", then you’re not really reaping any kind of benefit either."

It'd be great to get away from the half-baked structure of HTML and go to a pure XML solution, but until MIME type "application/xhtml+xml" works on the vast majority of browsers that's not gonna happen for most websites. Meanwhile, pretending that your XHTML pages will render the same in "application/xhtml+xml" as they do in "text/html" isn't useful. For this reason I've switched back from XHTML to HTML, but I use XHTML coding rules when they don't violate HTML rules: such as lower case tags, using quotes around attribute values, closing all tags where possible (but no " />"), etc.

Also, and by experience, CSS doesn't always give the expected results when you don't properly close paired tags such as <p> and </p>. Another reason to follow many of the XHTML rules when coding HTML.

ID: #6 | Contributed by: AutisticCuckoo | Date: Thu, 13 Mar 2008 09:49:41 GMT

To say that "<input type=submit name=cmdGo /> would be valid in HTML 4.01" can be misleading. Yes, it is valid but it doesn't mean the same thing as it would in XHTML. In HTML 4.01 that code is equivalent to <input type=submit name=cmdGo>&gt; (an INPUT element followed by a greater-than character).

Attribute minimisation in SGML and HTML actually means it's the attribute *name* that's omitted. When you use <input checked> instead of <input checked=checked>, it's actually "checked=" that's omitted, not "=checked".

Ampersands must be escaped as a character entity reference or numeric character reference even in HTML (except in element types whose content type is CDATA, like SCRIPT and STYLE). The difference is that HTML UAs have error handling that usually allows sloppy coding, while XML UAs are required to abort at well-formedness errors.

XHTML 1.0 allows both 'lang' and 'xml:lang' attributes, and Appendix C says you should use both when serving the document as text/html. XHTML 1.1, on the other hand, deprecates the 'lang' attribute.

The HTML syntax rules are no less strict or stringent than those of XHTML. Valid HTML can be parsed unambiguously; the optional tags are not really needed. XHTML's syntax rules are more *consistent* and *simple*, but I think it's unfair to describe HTML syntax as "loose". You could just as well call it "efficient" and say XHTML syntax is "bloated".

It should also be clearly stated that serving XHTML as text/html means you are not using XHTML *at all*. You are using HTML with syntax errors, as far as user agents are concerned. In other words, you are *relying* on browser bugs and error handling.

Too many beginners believe they are actually using XHTML, believing the hype about how "modern" and "strict" it is, yet their pages would fail miserably if they were served as XHTML.

XHTML is not "HTML that looks like XML". It's XML with a predefined set of element types, and serving it as anything other than an application of XML renders it completely pointless from a technical point of view.
