[IMC-bristol] Re: [Imc-bristol-tech] weird characters on posts
Mike Tonks
mike at bettercode.com
Mon Mar 22 07:19:23 PST 2004
Wow - I have been having a headache with this on al-muajaha [mixed Arabic &
English text] for ages - I should have asked you earlier.
Let me try to understand your terms a little, and relate them to my own
environment:
HTTP Header: This is probably output by, and therefore defined by the
settings of, the Apache web server?
HTML: This is defined in the meta tag?
<meta http-equiv="content-type" content="text/html; charset=utf-8">
If so, this is easy to change - there is a single php file for the whole
site.
Database: I'm not quite sure what the settings are for the mysql
database or how to change them, but I'm concerned that they may be
incorrect...
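
To check whether the HTTP header and the meta tag actually disagree for a given page, something like the following Python sketch might help. It is not part of the site's PHP code; the function names are made up for illustration, and the regular expressions only handle the common forms of the Content-Type header and the meta tag shown above.

```python
import re

def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html):
    """Return the charset named in the first <meta> tag that declares one, or None."""
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None

def declarations_conflict(content_type, html):
    """True only when both places declare a charset and the two differ."""
    header_cs = charset_from_header(content_type)
    meta_cs = charset_from_meta(html)
    return header_cs is not None and meta_cs is not None and header_cs != meta_cs

# Example values matching the situation described in the quoted message below.
header = "text/html; charset=iso-8859-1"
page = '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
print(declarations_conflict(header, page))  # True
```

Omitting the charset from one of the two places counts as "no conflict" here, which matches the point in the quoted message that either declaration may be left out.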
I saw the euskalinfo thread. It's mainly punctuation, but e.g. Espana
spelt correctly in Spanish has a special character in it. A similar
thing has happened before - MS Word changes speech marks into weird
back-tick characters which get messed up the same way.
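
To make the symptom concrete, here is a small Python sketch of what happens when text is stored in one encoding and read back as another. The helper name `misread` is invented for this example, and treating the MS Word speech marks as Windows-1252 bytes is an assumption about how they reach the site, not something confirmed from the database.

```python
def misread(text, actual="utf-8", assumed="iso-8859-1"):
    """Encode text with one charset, then decode the same bytes with another."""
    return text.encode(actual).decode(assumed, errors="replace")

# The Spanish spelling with n-tilde (U+00F1), stored as utf-8 but shown as iso-8859-1.
print(misread("Espa\u00f1a"))  # EspaÃ±a

# Curly speech marks (U+201C, U+201D), assumed to arrive as Windows-1252 bytes
# and then read as utf-8: each quote byte is invalid and becomes U+FFFD.
print(misread("\u201cquote\u201d", actual="cp1252", assumed="utf-8"))
```

The first case gives two wrong characters in place of one; the second gives the replacement character, which most browsers draw as a question mark or a box.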
Jamie Lokier wrote:
> Crash wrote:
>
>>looking at a few posts (particularly the Euskalinfo Basque thread) there
>>seems to be a problem with special chars. I notice the character set is
>>utf-8, which I think is quite unusual - should it be changed to iso-8859-1?
>
>
> A bit about utf-8: it's used on multi-language web pages nowadays
> because it's the only reasonable encoding for characters of all
> languages. iso-8859-1 is only good for Western European languages and
> those which use the same characters. That's probably why it's there:
> because the new Dada software is supposed to have better multi-language
> support. All but the most ancient of browsers will have no problem
> with it. I use utf-8 and have never had a problem with it; however,
> you might have issues if you're using Netscape 3 from 1995.
>
> The problem is worse than you describe. The HTTP server says it's
> iso-8859-1, and the HTML says it's utf-8. Browsers are likely to vary
> in which one they use for displaying the page. They _should_ use the
> HTTP one, but the two values are not supposed to be different!
>
> Viewing the thread in Mozilla, it shows in iso-8859-1 which is correct
> for that article. But "View Page Info" shows utf-8.
>
> The same encoding is used for text that's submitted in forms. You can
> imagine, with the conflicting encodings, that some postings are
> getting sent in utf-8 and some in iso-8859-1, according to the
> different software people are using to post.
>
> I looked at the "Euskalinfo (Bristol) about the Madrid 11-3 bombings"
> thread. The main article is in iso-8859-1, but with the HTML labelled
> as utf-8, and the HTTP server saying it's iso-8859-1. It looks ok in
> Mozilla. But look carefully at the comments: at least one of them has
> characters encoded in utf-8, and they don't display properly in
> iso-8859-1. That means *no* setting displays the whole page
> correctly. :-/
>
> There is a problem, and it's on the front page too, not just that
> article. The problem is that HTTP says the page is in iso-8859-1, and
> the HTML (maybe the template?) says it's utf-8. *They're not
> supposed to disagree!*
>
> It's ok to omit the character encoding from either HTTP headers or the
> HTML meta tag, letting the other one do the job by itself.
>
> That's probably resulted in a lot of iso-8859-1 stories being stored,
> so they sort of look fine on most browsers most of the time. Either
> utf-8 or iso-8859-1 is fine, if the software doesn't care and just
> displays text from the database unchanged. But utf-8 is needed if you
> want people to be able to enter characters of every supported
> language, and the software might have requirements if it processes the
> text in any way.
>
> It's fixable. But fixing pages like the Euskalinfo thread where
> there's a mixture of contributions in iso-8859-1 and utf-8 on the same
> page is "interesting". Assuming the database didn't record which
> encoding was used, the usual strategy is to guess: test for utf-8
> validity, and if it fails assume it's iso-8859-1, and convert from
> that to utf-8. It's fortunately quite rare for iso-8859-1 prose to be
> valid utf-8.
>
> -- Jamie
>
> _______________________________________________
> Imc-bristol-tech mailing list
> Imc-bristol-tech at lists.indymedia.org
> http://lists.indymedia.org/mailman/listinfo/imc-bristol-tech
>
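
The guessing strategy described at the end of the quoted message (test for utf-8 validity, otherwise assume iso-8859-1 and convert) can be sketched in Python roughly as follows. This is only an illustration under the assumption that each stored posting is available as raw bytes; `to_utf8` is a made-up name, and the site's own software would need the equivalent in PHP.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid utf-8, else convert it from iso-8859-1."""
    try:
        raw.decode("utf-8")          # validity test only; the result is discarded
        return raw
    except UnicodeDecodeError:
        # Every byte value is defined in Python's iso-8859-1 codec, so this cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")

# The same word as stored by an iso-8859-1 poster and by a utf-8 poster.
latin1_post = b"Espa\xf1a"
utf8_post = b"Espa\xc3\xb1a"
print(to_utf8(latin1_post) == utf8_post)  # True
print(to_utf8(utf8_post) == utf8_post)    # True
```

Plain ASCII passes through untouched, since it is valid in both encodings. The guess goes wrong only when iso-8859-1 text happens to form valid utf-8 sequences, which, as the quoted message says, is quite rare in prose.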