April 22, 2002
Entity Replacement
A recent discussion here has resulted in a very handy script for replacing often-used characters with their proper entity equivalents. The concept started out over at Francois’s site, was discussed here, and then Dave created a home for his final product, called Entity Replacement, as the newest entry on his gazingus.org website. It should also be noted that the seed for these ideas came from the ALA article The Trouble with Em ’n En. Francois also has a downloadable TextPad macro, Dreamweaver and TextPad optimizations for the use of these entities, and a Numeric Entity browser support table.
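For readers who have not seen the script, here is a minimal sketch of this kind of replacement. It is written in Python rather than JavaScript, and the function name and the rules themselves are assumptions made for illustration, not the rules of Dave's script. It also assumes plain text input: run over real HTML, it would mangle the quotes inside tag attributes.

```python
import re


def replace_entities(text: str) -> str:
    """Swap typewriter punctuation for numeric character references.

    Hypothetical rules for illustration; assumes `text` contains no HTML tags.
    """
    # Double hyphen becomes an em dash (&#8212;).
    text = text.replace("--", "&#8212;")
    # A lone hyphen with whitespace on both sides becomes an en dash (&#8211;).
    text = re.sub(r"(?<=\s)-(?=\s)", "&#8211;", text)
    # Three full stops become an ellipsis (&#8230;).
    text = text.replace("...", "&#8230;")
    # A double quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(^|[\s(\[])"', r"\1&#8220;", text)
    # Every double quote that is left closes.
    text = text.replace('"', "&#8221;")
    # Same logic for single quotes; leftovers are closing quotes or apostrophes.
    text = re.sub(r"(^|[\s(\[])'", r"\1&#8216;", text)
    text = text.replace("'", "&#8217;")
    return text


print(replace_entities('He said "it\'s done" - finally'))
```

The order matters: opening quotes are matched first by their position, so whatever is left can safely be treated as closing quotes or apostrophes.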
Comments
1. April 23, 2002 05:08 PM
2. April 23, 2002 05:25 PM
Dave Posted…
Good points. I was thinking along the same lines today, in that this kind of functionality is best suited for the server side. Among other things, server-side code is guaranteed to work, and users won’t get the feeling that they are having to be “corrected.” The logical end to this kind of idea would be a general-purpose markup sweeper, a fairly tall order to be sure, and one definitely not suited to client-side kludges. The W3C has something called HTML-Tidy, but I’m not sure if it addresses these typography issues. A similar program hooked into MT would be a killer app.
3. April 23, 2002 05:26 PM
Nate Posted…
Good points, Francois. I’m thinking that the optimum would be for the data to be stored sans markup and filtered based on output type – the webpage would get appropriate entities inserted and formatting applied, and the emails would be sent as plain text. BTW – currently only webgraphics authors are getting email updates.
The problem with my idea here is, how do you store formatting data if the text is stored sans any HTML markup? We could extend the entity filter to insert breaks to replace carriage returns, but any emphasis or links would require that HTML be part of the package, I would assume.
I’m guessing that a more complex email output filter could skip the entity replacement, remove other HTML markup, replace HTML links with numbered references, and append a list of the numbered URIs to the post. This would avoid the necessity of clean text entry, but would make the filtering process more cumbersome.
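A rough sketch of such an email output filter, in Python using only the standard library. The class and function names are hypothetical, and MT itself is written in Perl, so this only illustrates the idea: tags are dropped, character references are turned back into plain characters, and each link is replaced by a numbered reference with the URIs listed at the end.

```python
from html.parser import HTMLParser


class EmailTextFilter(HTMLParser):
    """Collect text, drop tags, and number each link target."""

    def __init__(self):
        # convert_charrefs=True turns references such as &#8211; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of the plain-text body
        self.links = []        # link targets, in order of appearance
        self.open_links = []   # stack: reference number (or None) per open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.open_links.append(len(self.links))
            else:
                self.open_links.append(None)
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            number = self.open_links.pop()
            if number is not None:
                self.parts.append(f" [{number}]")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_email_text(html: str) -> str:
    """Return a plain-text version of `html` with a numbered URI list appended."""
    parser = EmailTextFilter()
    parser.feed(html)
    parser.close()
    body = "".join(parser.parts)
    if not parser.links:
        return body
    refs = "\n".join(f"[{i}] {url}" for i, url in enumerate(parser.links, start=1))
    return f"{body}\n\n{refs}"


print(html_to_email_text('See <a href="http://example.com/">this page</a> for more.'))
```

As noted above, this keeps text entry unrestricted at the cost of a heavier filter on the way out.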
4. April 23, 2002 05:27 PM
francois Posted…
More on structured text here. It does involve some ...training on the part of the writer, which would tend to limit the take-up...
5. April 23, 2002 06:28 PM
francois Posted…
Currently, if you have “convert line and paragraph breaks in comments” turned on in the MT config, how do the comments get stored in the MT database? I suspect they’re stored as they’ve been posted, sans HTML, and only get published with HTML added when building the files. This, ideally, is the way the entity converter should work: storing the text as posted, but publishing it to HTML with entities converted. This looks like a job for... perl-hacker. When I get a chance, I’ll try and get some opinions on this on the MT boards.
In the meantime, considering that it is only the webgraphics authors getting email notifications, I would recommend keeping the script. We end up with a typographically superior site, and are constantly reminded of certain useful character entities :) at the cost of only slightly ‘dirtier’ emails.
Dave, re HTML-Tidy: do you by any chance have Homesite, bundled with Dreamweaver? It comes bundled with Tidy, which is used by its “Codesweeper” tool and can be configured in dozens of ways, including HTML to XHTML conversion! This is quite a friendly way of exploring Tidy. Rolling that into MT could be really useful, but it’ll add a lot of complexity to the config. Will have to consider whether it’s really necessary.
6. April 23, 2002 06:40 PM
evan Posted…
Pardon my ignorance, but what is the purpose of all of this? Correct me if I’m wrong, but everything worked fine before, did it not? And I can’t remember exactly what the issue was, nor can I find an exact mention of it, but isn’t there some problem with the validity of character entities greater than 152? Please enlighten me.
7. April 23, 2002 10:49 PM
Nate Posted…
Evan, I don’t blame you; this thread has wound along a funny path which is somewhat tough to follow. It’s not so much that there was a problem that has been fixed, more an optimization for better typography. As is noted in The Trouble with Em ’n En, default HTML text does not use the correct entities for things such as “quotes” and – dashes. To get the proper typographic symbol one has to explicitly enter the numeric entity. The script that Dave wrote takes the hassle out of that by filtering through the text and replacing the appropriate characters using JavaScript and regular expressions. We are now discussing alternate methods for filtering, and debating whether text stored in a CMS such as Movable Type should be kept in pure text form and filtered upon output, or stored after the filter has run through it. I think we’ve come to the conclusion that storing text as cleanly as possible would be ideal.
8. April 24, 2002 02:58 AM
francois Posted…
Evan, I guess you could say the issue is basically a cosmetic one: getting proper quotes and dashes only lends a visual improvement to pages; it is not required for validation or, for that matter, for human comprehension. What spurred me into action was the reminder that these typographical marks are possible on web sites, and the only reason people don’t use them is to avoid the hassle. Laziness, iow.
A validation issue does crop up if you have been using these characters but have used the numerical entities &#129; to &#159;: these will not validate. Many HTML creation tools, including Dreamweaver, insert these invalid characters. My XHTML add-ons for Dreamweaver also fix this problem.
I have no idea why those characters are invalid, and why they were so widely adopted — the alistapart article doesn’t explain. I guess some research would unearth the background story.
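One likely piece of the background, offered as a hedged note rather than the full story: the numbers in that range match positions in the Windows-1252 code page, whereas HTML numeric references are meant to name Unicode (ISO 10646) code points, where 128 to 159 are control characters. Under that assumption, a clean-up pass can be sketched in Python as below. The function name is hypothetical, and including 128 in the range is my assumption (it is the euro sign in Windows-1252).

```python
import re

# Matches decimal numeric character references such as &#150;
NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(html: str) -> str:
    """Rewrite references in the 128-159 range to their Unicode equivalents."""

    def remap(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            try:
                # Read the number as a Windows-1252 byte to find the real character.
                char = bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                # A few positions (129, for example) are unassigned: leave untouched.
                return match.group(0)
            return f"&#{ord(char)};"
        return match.group(0)

    return NUMERIC_REF.sub(remap, html)


print(fix_cp1252_refs("1999&#150;2002 &#147;quoted&#148;"))
```

References outside the range, including ones that are already correct such as &#8211;, pass through unchanged.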
francois Posted…
People on the webgraphics notification list will have seen by now blog comments coming in littered with escaped entities (&#8211; and the like) as a result of this script. If these notifications become part of the plaintext email domain, then these entities are arguably a hindrance.
But then, it can be argued, so is HTML. By giving posters the power to write HTML via the comment box (instead of merely autoconverting linebreaks to HTML post–notification), that HTML also ends up, redundantly, in the email domain.
I’m just trying to make my mind up how things ought to be. If we say that the blog’s principal manifestation (including comments) is the website, and the email notifications are a secondary convenience, then this means of improving the quality of the HTML is a good thing. But if the email component of a blog is an important one (say blog posts, and comments to that post, are intended to work as newsletters), then this is no longer appropriate. Then either HTML-format emails become necessary (preferably not), or another bit of programming is needed that applies the entity replacement only after the email has been sent. I suspect that would require hacking the MT Perl source rather than the JavaScript.
My brother keeps telling me that kludges like these are unnecessary if “Structured Text” conventions are followed. Can anyone shed more light on this? He may well have a point. But my work was born out of the long-standing tradition of pragmatic kludges to get better-looking websites, by anyone, today.