[IMC-bristol] Re: [Imc-bristol-tech] weird characters on posts

Mike Tonks mike at bettercode.com
Mon Mar 22 07:19:23 PST 2004


Wow - I have been having a headache with this on al-muajaha [mixed Arabic & 
English text] for ages - I should have asked you earlier.

Let me try to understand your terms a little, and relate them to my own 
environment:

HTTP header: This is probably output by the Apache web server, and 
therefore defined by its settings?

HTML: This is defined in the meta tag?

<meta http-equiv="content-type" content="text/html; charset=utf-8">

If so, this is easy to change - there is a single php file for the whole 
site.
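
To check what each layer actually declares, the charset in the HTTP 
Content-Type header can be compared with the one in the meta tag.  This 
is only a rough sketch in Python: it assumes the header value and the 
page source have already been fetched by some other means, and the 
function names are made up for illustration.

```python
import re
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str) -> Optional[str]:
    """Return the charset declared in an http-equiv content-type meta tag, or None."""
    meta = re.search(
        r'<meta[^>]*http-equiv\s*=\s*["\']?content-type["\']?[^>]*>',
        html,
        re.IGNORECASE,
    )
    if not meta:
        return None
    # The content attribute has the same "type; charset=..." shape as the header.
    return charset_from_header(meta.group(0))


def declarations_agree(content_type: str, html: str) -> bool:
    """True if the two declarations match, or if at least one is omitted."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    return header is None or meta is None or header == meta


page = '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
print(declarations_agree("text/html; charset=ISO-8859-1", page))  # conflicting
print(declarations_agree("text/html", page))                      # header omits it
```

The regular expressions are deliberately simple and will not cope with 
every way a meta tag can be written.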

Database: I'm not quite sure what the settings are for the MySQL 
database or how to change them, but I'm concerned that they may be 
incorrect...


I saw the Euskalinfo thread.  It's mainly punctuation, but e.g. Espana 
spelt correctly in Spanish has a special character in it.  A similar 
thing has happened before - MS Word changes speech marks into weird 
back-tick characters which mess up the same way.
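
A small Python illustration of how such characters get mangled when the 
bytes are read in the wrong encoding.  It assumes the Word speech marks 
arrive as Windows-1252 curly quotes, which is a guess on my part rather 
than something checked against the site's data.

```python
# "Espana" spelt with its special character (n with tilde, U+00F1).
text = "Espa\u00f1a"

utf8_bytes = text.encode("utf-8")         # the special character takes two bytes
latin1_bytes = text.encode("iso-8859-1")  # the special character takes one byte

# utf-8 bytes displayed as if they were iso-8859-1:
# the two bytes show up as two unrelated characters.
misread_as_latin1 = utf8_bytes.decode("iso-8859-1")

# iso-8859-1 bytes displayed as if they were utf-8:
# the lone byte is not valid utf-8, so it becomes a replacement character.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Assumption: a Word "speech mark" is a curly quote stored as Windows-1252.
curly_bytes = "\u201c".encode("cp1252")
curly_misread = curly_bytes.decode("utf-8", errors="replace")

print(repr(misread_as_latin1))
print(repr(misread_as_utf8))
print(curly_bytes, repr(curly_misread))
```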




Jamie Lokier wrote:
> Crash wrote:
> 
>>Looking at a few posts (particularly the Euskalinfo Basque thread) there 
>>seems to be a problem with special chars. I notice the character set is 
>>utf-8, which I think is quite unusual - should it be changed to iso-8859-1?
> 
> 
> A bit about utf-8: it's used on multi-language web pages nowadays
> because it's the only reasonable encoding for characters of all
> languages.  iso-8859-1 is only good for Western European languages and
> those which use the same characters.  That's probably why it's there:
> because the new Dada software is supposed to have better multi-language
> support.  All but the most ancient of browsers will have no problem
> with it.  I use utf-8 and have never had a problem with it; however,
> you might have issues if you're using Netscape 3 from 1995.
> 
> The problem is worse than you describe.  The HTTP server says it's
> iso-8859-1, and the HTML says it's utf-8.  Browsers are likely to vary
> in which one they use for displaying the page.  They _should_ use the
> HTTP one, but the two values are not supposed to be different!
> 
> Viewing the thread in Mozilla, it shows in iso-8859-1 which is correct
> for that article.  But "View Page Info" shows utf-8.
> 
> The same encoding is used for text that's submitted in forms.  You can
> imagine, with the conflicting encodings, that some postings are
> getting sent in utf-8 and some in iso-8859-1, according to the
> different software people are using to post.
> 
> I looked at the "Euskalinfo (Bristol) about the Madrid 11-3 bombings"
> thread.  The main article is in iso-8859-1, but with the HTML labelled
> as utf-8, and the HTTP server saying it's iso-8859-1.  It looks ok in
> Mozilla.  But look carefully at the comments: at least one of them has
> characters encoded in utf-8, and they don't display properly in
> iso-8859-1.  That means *no* setting displays the whole page
> correctly. :-/
> 
> There is a problem, and it's on the front page too, not just that
> article.  The problem is that HTTP says the page is in iso-8859-1, and
> the HTML (maybe the template?) says it's utf-8.  *They're not
> supposed to disagree!*
> 
> It's ok to omit the character encoding from either the HTTP headers or
> the HTML meta tag, letting the other one do the job by itself.
> 
> The mismatch has probably resulted in a lot of iso-8859-1 stories being
> stored, so they sort of look fine on most browsers most of the time.
> Either utf-8 or iso-8859-1 is fine, if the software doesn't care and
> just displays text from the database unchanged.  But utf-8 is needed if
> you want people to be able to enter characters of every supported
> language, and the software might have requirements if it processes the
> text in any way.
> 
> It's fixable.  But fixing pages like the Euskalinfo thread where
> there's a mixture of contributions in iso-8859-1 and utf-8 on the same
> page is "interesting".  Assuming the database didn't record which
> encoding was used, the usual strategy is to guess: test for utf-8
> validity, and if it fails assume it's iso-8859-1, and convert from
> that to utf-8.  It's fortunately quite rare for iso-8859-1 prose to be
> valid utf-8.
> 
> -- Jamie
> 
> _______________________________________________
> Imc-bristol-tech mailing list
> Imc-bristol-tech at lists.indymedia.org
> http://lists.indymedia.org/mailman/listinfo/imc-bristol-tech
> 
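
The guessing strategy Jamie describes above (test for utf-8 validity, and 
if that fails assume iso-8859-1 and convert) could be sketched in Python 
roughly like this.  The function name is hypothetical, and it assumes 
each stored posting is available as raw bytes; how the rows are read from 
and written back to the database is left out.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as utf-8 bytes, guessing the original encoding."""
    try:
        # Already valid utf-8 (plain ASCII also passes): keep it unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid utf-8: assume iso-8859-1, which accepts any byte value,
        # and re-encode the resulting text as utf-8.
        return raw.decode("iso-8859-1").encode("utf-8")


# Two comments containing the same word, stored in different encodings.
as_utf8 = "Espa\u00f1a".encode("utf-8")
as_latin1 = "Espa\u00f1a".encode("iso-8859-1")

print(to_utf8(as_utf8) == to_utf8(as_latin1))
```

This works per posting, so a single comment that itself mixes both 
encodings would still come out wrong, and the rare iso-8859-1 text that 
happens to be valid utf-8 would be left untouched.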
