HTML Entities and a Much Needed Style Guide

When writing on the Web, it is easy to forget to use correct punctuation. Time for me to change.

HTML and HTML Entities

HTML is the language of the Web. Every Web page you visit uses it, even if JavaScript pulls the content to your browser after the page has loaded (such as on some social media sites).

HTML Entities are as old as the Web, and as new as HTML5. The ampersand, for example, is typed as &amp; in source code because HTML entities begin with an ampersand (&). Escaping is not strictly necessary, because Web browsers have always tried to work out what the code of a page means, but escaping the ampersand means a less forgiving browser will not misinterpret it.

That is why, until now, the most common HTML5 validation issue I have fixed on this site (other than closing tags in the wrong order) has been unescaped ampersands.

HTML entities were created because the original character set of the Web was ASCII. The British Pound symbol, for example, is typed in source code as &pound; and is displayed as £, because ASCII does not have the £ character in its character set.

Leaving aside the space, delete, and the 32 control characters that cannot be displayed on the screen (such as backspace), the characters in the ASCII character set are !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~. In order to write out that list of characters, I had to escape three of them: ampersand (&amp; for &), less than (&lt; for <), and greater than (&gt; for >), because < and > are used in HTML for tags (elements) and & is used to denote HTML entities.
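As a small illustration of the three escapes described above, Python's standard html module performs the same substitutions. This is only a sketch: Python is not something this site uses, and the sample string is made up.

```python
import html

# Escape the three characters that are significant in HTML text.
# quote=False leaves " and ' alone, matching the three escapes described above.
sample = "1 < 2 & 3 > 2"
escaped = html.escape(sample, quote=False)
print(escaped)  # 1 &lt; 2 &amp; 3 &gt; 2

# html.unescape reverses the process, much as a browser does when rendering.
restored = html.unescape(escaped)
print(restored == sample)  # True
```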

Although the British English Apple keyboard I am using to type this does not have a delete key, every single ASCII character in the list above was able to be typed either directly or using an Alt-key combination (e.g. the octothorpe symbol is alt+3: #).

The Web has moved on since ASCII. ISO-8859-1, more commonly called Latin-1, was the document character set of the early HTML specifications (HTML 2.0 onwards), and HTML 4 widened it to Unicode. The British Pound symbol is in the Latin-1 character set, so why have my newer pages switched from just typing the symbol to using the HTML entity? The simple answer is that it is not in the ASCII character set.

Of course, this entire Web site uses HTML5 and the UTF-8 character encoding. That means that if I want the Euro symbol I could just press alt-2 on my keyboard and it would be perfectly valid. On the other hand, I could use &euro; or (as I have been doing) &#8364;, a numeric character reference.

The problem with character sets is that the same character is not stored as the same byte values in every character encoding. If you have ever visited a page that is composed almost entirely of Latin characters and have seen a stray À, Á, Ã, or Â, you have seen a problem with conversions between character sets (or mixed character sets).
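The stray accented capitals can be reproduced in a few lines. The sketch below, in Python purely for illustration, encodes text as UTF-8 and then decodes the same bytes as Windows-1252 (named cp1252 in Python), which is the kind of mismatch described above.

```python
# UTF-8 stores the pound sign as two bytes; Windows-1252 reads each byte
# as a separate character, so one symbol turns into two.
pound_bytes = "£".encode("utf-8")       # b'\xc2\xa3'
misread_pound = pound_bytes.decode("cp1252")
print(misread_pound)  # Â£

misread_e_acute = "é".encode("utf-8").decode("cp1252")
print(misread_e_acute)  # Ã©
```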

The simple, albeit ridiculous, solution is to create Web content using ASCII characters only. Why declare my pages on this site as UTF-8, then, if I'm not going to use the full character set? Because Windows 98 and earlier versions of Windows used the Windows-1252 character set, and Web browsers assume that a document that looks like it only uses Latin characters is intended to be Windows-1252 rather than Latin-1.

Declaring UTF-8 and trying to stick to ASCII characters means that if the odd £ ends up in my source code, old browsers won't make stupid assumptions.

A Style Guide Just For Me

Style guides are not generally something that a single person uses as a reference when writing; rather, they are used by groups of people who share the same opinion on things such as whether Web should be capitalised, or whether e-mail should have a hyphen.

This style guide, however, is going to be different.

I currently use Textastic to write Web pages, which has the option for regex search and replace. This style guide is therefore going to concentrate on creating a list of regular expressions to use after typing up a page so that I can transform what I have typed into something using the "correct" punctuation.

Ampersand: &

Ampersands are fairly easy, although until I can come up with some better regex I have to ensure I don't replace them inside <?php ?> tags.

As well as using the following regex, I also need to check for ampersands in URLs.

Search term: &([ &])
Replacement text: &amp;$1
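One possible refinement, sketched here in Python rather than in Textastic, is to match any ampersand that does not already begin an entity, instead of only those followed by a space or another ampersand. The function name is hypothetical, and Python writes the first captured group as \1 where the replacement text above uses $1. This version also escapes ampersands in URL query strings, but it does nothing to protect <?php ?> blocks.

```python
import re

# An ampersand that does NOT already start a named, decimal, or hex reference.
BARE_AMPERSAND = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)

def escape_bare_ampersands(text: str) -> str:
    """Replace unescaped ampersands with &amp;, leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)

print(escape_bare_ampersands("Fish & chips &amp; peas &#163;5 a.php?x=1&y=2"))
# Fish &amp; chips &amp; peas &#163;5 a.php?x=1&amp;y=2
```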

En dash: –

The en dash is longer than a hyphen, and is used for numerical ranges, such as dates, times, and file sizes. The easiest way to search for these is by recognising that I usually type number dash number unit.

For this regex, I have to be careful not to replace actual dates in ISO 8601 format, as well as code in code/samp blocks, and maths.

Search term: ([0-9])-([0-9])
Replacement text: $1&ndash;$2
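The ISO 8601 problem can be reduced with lookarounds. The following is a minimal Python sketch, not the regex used on this site: it only accepts a hyphen between two runs of digits when neither side touches a further digit or hyphen, so 2014-05-06 is left alone. It still cannot tell a range from a subtraction, and it does not skip code/samp blocks. The function name is hypothetical.

```python
import re

# Digits, a hyphen, digits, with no further digit or hyphen on either side.
# The lookarounds stop the pattern matching inside dates such as 2014-05-06.
NUMERIC_RANGE = re.compile(r"(?<![0-9-])([0-9]+)-([0-9]+)(?![0-9-])")

def en_dash_ranges(text: str) -> str:
    """Replace the hyphen in number-hyphen-number with &ndash;."""
    return NUMERIC_RANGE.sub(r"\1&ndash;\2", text)

print(en_dash_ranges("Files of 10-20 MB, uploaded 2014-05-06, pages 3-5"))
# Files of 10&ndash;20 MB, uploaded 2014-05-06, pages 3&ndash;5
```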

Em dash: —

The em dash is longer than an en dash, and pairs of them can be used in place of pairs of commas to signify more importance or emphasis on the text within the em dashes. If parentheses give less emphasis to the contained text than pairs of commas, then em dashes could be considered the opposite side of commas to parentheses.

When typing, I typically use a hyphen surrounded by spaces in place of the em dash, so the search term (but not the replacement text) shall contain spaces too. The spaces are written as [ ] below so that they are visible.

Search term: [ ]-[ ]
Replacement text: &mdash;

Minus symbol: −

The heading I have just created is the first time I have ever used the minus symbol on a Web page. As I'm on the subject of hyphen-like HTML entities, the minus symbol is the obvious next candidate.

The minus symbol is exactly as its name suggests: the mathematical symbol for minus.

If a hyphen is not part of an ISO date/time, a range of numbers, source code, or hyphenated text containing a numeral (where the hyphen is the correct character), and an em dash is not the appropriate replacement for it, then the minus symbol is most likely the appropriate replacement.

This is tricky, and I can probably do with a much better regex search/replace expression to target what I'm looking for better.

Search term: -
Replacement text: &minus;
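One narrower target that is easy to pin down is the negative number: a hyphen directly before a digit and not directly after a letter, digit, or another hyphen. The Python sketch below handles only that case and is an assumption about what a better expression might look like; subtraction written with spaces (7 - 2) would still collide with the em dash rule and needs checking by eye.

```python
import re

# A hyphen directly before a digit, and not directly after a word
# character or another hyphen.
NEGATIVE_NUMBER = re.compile(r"(?<![\w-])-(?=[0-9])")

def minus_signs(text: str) -> str:
    """Replace the hyphen in front of a negative number with &minus;."""
    return NEGATIVE_NUMBER.sub("&minus;", text)

print(minus_signs("Lows of -5 on 2014-05-06, a well-known 3-5 day cold snap"))
# Lows of &minus;5 on 2014-05-06, a well-known 3-5 day cold snap
```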

Times symbol: ×

While on the subject of maths, I should really stop using asterisks (*) in place of the times symbol when representing multiplication.

Search term: \*
Replacement text: &times;

Divide symbol: ÷

I should also stop using / in place of the division symbol.

Search term: /
Replacement text: &divide;
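Replacing every asterisk and slash would also hit fractions, dates, file paths, and closing tags. A more cautious sketch, in Python and under the assumption that I type a space either side of an operator, only replaces a * or / that sits between two digits with spaces around it. The names are hypothetical.

```python
import re

# An asterisk or slash with a space on each side, sitting between two digits.
# Lookarounds leave the digits unconsumed so chained sums still match.
SPACED_OPERATOR = re.compile(r"(?<=[0-9]) ([*/]) (?=[0-9])")
OPERATOR_ENTITIES = {"*": "&times;", "/": "&divide;"}

def maths_operators(text: str) -> str:
    """Replace spaced * and / between digits with &times; and &divide;."""
    return SPACED_OPERATOR.sub(
        lambda match: f" {OPERATOR_ENTITIES[match.group(1)]} ", text
    )

print(maths_operators("6 * 7 / 2 = 21, but 1/2 and /usr/bin are untouched"))
# 6 &times; 7 &divide; 2 = 21, but 1/2 and /usr/bin are untouched
```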

Halves, Thirds, Quarters, Fifths, Sixths, Eighths: ½ ⅓ ⅔ ¼ ¾ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞

Vulgar fractions have their own HTML Entities.

Search term: 1/2
Replacement text: &frac12;

Search term: 1/3
Replacement text: &frac13;

Search term: 2/3
Replacement text: &frac23;

Search term: 1/4
Replacement text: &frac14;

Search term: 3/4
Replacement text: &frac34;

Search term: 1/5
Replacement text: &frac15;

Search term: 2/5
Replacement text: &frac25;

Search term: 3/5
Replacement text: &frac35;

Search term: 4/5
Replacement text: &frac45;

Search term: 1/6
Replacement text: &frac16;

Search term: 5/6
Replacement text: &frac56;

Search term: 1/8
Replacement text: &frac18;

Search term: 3/8
Replacement text: &frac38;

Search term: 5/8
Replacement text: &frac58;

Search term: 7/8
Replacement text: &frac78;
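Fifteen separate searches could be folded into one pass. The Python sketch below is one way to do that, with hypothetical names: it finds a single digit, a slash, and a single digit that do not touch any other digit or slash (so 11/20 and 1/2/2014 are skipped), and only substitutes the fractions listed above.

```python
import re

# The fractions listed above, each of which has a named entity.
VULGAR_FRACTIONS = {
    "1/2", "1/3", "2/3", "1/4", "3/4", "1/5", "2/5", "3/5",
    "4/5", "1/6", "5/6", "1/8", "3/8", "5/8", "7/8",
}

# One digit, a slash, one digit, not touching any other digit or slash.
SIMPLE_FRACTION = re.compile(r"(?<![0-9/])([0-9])/([0-9])(?![0-9/])")

def fraction_entities(text: str) -> str:
    """Replace listed fractions such as 1/2 with entities such as &frac12;."""
    def replace(match):
        if match.group(0) in VULGAR_FRACTIONS:
            return f"&frac{match.group(1)}{match.group(2)};"
        return match.group(0)  # no named entity, so leave it as typed
    return SIMPLE_FRACTION.sub(replace, text)

print(fraction_entities("Add 1/2 a cup, then 3/4, not 11/20 or 3/7, on 1/2/2014"))
# Add &frac12; a cup, then &frac34;, not 11/20 or 3/7, on 1/2/2014
```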

For other fractions, it is possible to use the superscript digits (for the numerator) and subscript digits (for the denominator), written as numeric character references and separated by the fraction slash character: ⁄ (&frasl;). For example, nine tenths = &#8313;&frasl;&#8321;&#8320; = ⁹⁄₁₀.
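Looking up the code point for each digit by hand is tedious, so here is a small Python sketch that builds the references for any numerator and denominator. The function name is hypothetical; the code points are the Unicode superscript and subscript digits.

```python
import html

# Superscript 1, 2 and 3 live in Latin-1; the rest are in the U+2070 block.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPT_ZERO = 0x2080  # subscript digits run consecutively from U+2080

def fraction_reference(numerator: int, denominator: int) -> str:
    """Build superscript digits, &frasl;, then subscript digits."""
    top = "".join(
        f"&#{ord(SUPERSCRIPT_DIGITS[int(digit)])};" for digit in str(numerator)
    )
    bottom = "".join(
        f"&#{SUBSCRIPT_ZERO + int(digit)};" for digit in str(denominator)
    )
    return f"{top}&frasl;{bottom}"

nine_tenths = fraction_reference(9, 10)
print(nine_tenths)                 # &#8313;&frasl;&#8321;&#8320;
print(html.unescape(nine_tenths))  # ⁹⁄₁₀
```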

Ellipsis: …

Three periods are not an ellipsis.

Search term: \.\.\.
Replacement text: &hellip;

Double Quotes and Single Quotes: “ ” ‘ ’

Quotation marks are one of those things that are not universal. In written text my quotation marks are exactly the same as those I use for typing: ".

The obvious question, though, is when would it be appropriate to use the quotation marks that are commonly said to look like 6s and 9s, also known as smart quotes?

Obviously I should stop using regular/straight quotes for emphasis ("air quotes") if I really want to be semantic about this, but then shouldn't I be using the <q> tag for quotes in HTML5 for actual quotes?

The Apostrophe: '

The good old apostrophe. Despite it being on every keyboard I have ever used, there is a reason I should include it on this page of HTML entity substitutions.

As with the double quote, ampersand, less than, and greater than characters, the apostrophe is used within HTML—it can be used as an alternative to the double quote in HTML tags.

For the most part not using the HTML entity is fine, but there are times, such as for complex alternative image text where someone is quoted using a plural or possessive, where &apos; and &quot; would make things simpler. Likewise when being used in JavaScript.

Whether I replace all apostrophes with &apos; or not is a decision I have not yet reached. As with quotation marks, this requires further thought.

Degrees, Prime, Double Prime, Triple Prime, Quadruple Prime: ° ′ ″ ‴ ⁗

Although falling out of use due to decimalisation, degrees may be used for coordinates; the prime symbol for feet, arcminutes, and minutes; and the double prime symbol for inches, arcseconds, and seconds.

The triple and quadruple prime symbols are highly unlikely to be used anywhere on this site.

Search term: ([0-9])'
Replacement text: $1&prime;

Search term: ([0-9])"
Replacement text: $1&Prime;

Ratio, Therefore, Because: ∶ ∴ ∵

Although I am unlikely to use therefore and because anywhere on this site, there may be times when I need to use the ratio symbol where a colon would be incorrect (for example, screen resolutions).

Search term: ([0-9]):([0-9])
Replacement text: $1&ratio;$2

Not Equal, Approximately: ≠ ≈

I need to refine the following two regexes so that the contents of <?php ?> and <code> blocks are not included.

Search term: !=
Replacement text: &ne;

Search term: ~=
Replacement text: &asymp;
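A single search and replace cannot easily skip whole blocks, but a short script can. The Python sketch below is one possible approach, not something Textastic offers: it splits the page on <?php ?>, <code>, and <samp> blocks, applies a replacement only to the text in between, and joins the pieces back together. The names are hypothetical, and nested or malformed blocks are not handled.

```python
import re

# Regions that must be left exactly as typed.
PROTECTED = re.compile(
    r"(<\?php.*?\?>|<code\b.*?</code>|<samp\b.*?</samp>)",
    re.DOTALL | re.IGNORECASE,
)

def replace_outside_protected(text: str, pattern: str, replacement: str) -> str:
    """Apply a regex replacement everywhere except inside protected blocks."""
    # With one capturing group, re.split alternates ordinary text (even
    # indices) with the protected blocks themselves (odd indices).
    parts = PROTECTED.split(text)
    for index in range(0, len(parts), 2):
        parts[index] = re.sub(pattern, replacement, parts[index])
    return "".join(parts)

page = "2 != 3 <code>if (a != b)</code> <?php if ($a != $b) {} ?> 4 != 5"
print(replace_outside_protected(page, r"!=", "&ne;"))
# 2 &ne; 3 <code>if (a != b)</code> <?php if ($a != $b) {} ?> 4 &ne; 5
```

The same helper could wrap the ampersand and en dash replacements, since they have the same need to avoid code.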

Non-Breaking Space:  

Generally, a non-breaking space (&nbsp;) should be used between a number and its unit so that they do not end up separated by a line break.

The problem with this is that it is difficult to create a regex search term that covers all possible units. MB, MiB, KB, KiB, Mbps, Mb/s, mph, mi/h… I am not quite sure how best to handle this.
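One way to start, sketched in Python with a hypothetical and deliberately incomplete unit list, is to build the search term from a list that can grow over time, rather than trying to write one expression that covers every unit up front.

```python
import re

# Hypothetical starting list; extend it as new units turn up.
UNITS = ["MB", "MiB", "KB", "KiB", "Mbps", "Mb/s", "mph", "mi/h"]

# Longest names first, each escaped so characters like / are taken literally.
unit_pattern = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# A digit, one ordinary space, then a known unit that is not followed
# by a further letter or digit.
NUMBER_UNIT = re.compile(rf"([0-9]) ({unit_pattern})(?![A-Za-z0-9])")

def bind_units(text: str) -> str:
    """Replace the space between a number and a known unit with &nbsp;."""
    return NUMBER_UNIT.sub(r"\1&nbsp;\2", text)

print(bind_units("A 10 MB file at 5 Mb/s takes 16 seconds"))
# A 10&nbsp;MB file at 5&nbsp;Mb/s takes 16 seconds
```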