Encoding (?) error on some (!) pages



  • Hi. If I read [url=http://www.apaci.maisbarcelos.pt]a certain Portuguese-language page[/url], the text has all the "special" characters (ã, ç, á, and so on) replaced by unknown-character marks (�), e.g. "Certifica��o" instead of "Certificação". The same page's text looks fine in both Firefox and Old Opera.

    Somewhat mysteriously, [url=http://acop.planetaclix.pt/]another page[/url] is fine, except for the title (in the title bar), which reads "Associa��o" instead of "Associação". Incidentally, that gives me a clue to a possible cause: the title is in one HTML page, while the main text is in another, included in a frameset. The frameset has no encoding defined, while the frame (main page) has: [code]<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">[/code] The first page also has no encoding definition.

    I am pretty sure (that is, 90% sure) these � were not so common recently. Does anyone else see the same? Does anyone know why (or why not)?
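    (The symptom described above can be reproduced outside any browser: bytes saved in Windows-1252 but decoded as UTF-8 come back with U+FFFD replacement marks. A minimal Python sketch, purely illustrative:)

    ```python
    # "Certificação" as a Windows-1252 page would store it:
    raw = "Certificação".encode("windows-1252")

    # ç (0xE7) and ã (0xE3) are not valid UTF-8 sequences, so a UTF-8
    # decoder rejects them; with errors="replace" each bad byte becomes
    # U+FFFD, the � mark seen in the browser.
    mojibake = raw.decode("utf-8", errors="replace")
    print(mojibake)  # Certifica��o
    ```

    (Decoding the same bytes as windows-1252 instead returns the original text intact, which is exactly what the fallback-encoding setting discussed below does.)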



  • These character marks are errors that appear when a browser tries to display content in the wrong encoding, either because the server sends no character-encoding information or because the pages are encoded in some legacy Windows encoding.
    Some people use FrontPage or Word to create web pages, and those pages end up with missing or wrong encoding declarations; this results in faulty display on other operating systems, which use Unicode as standard.

    With the Latest Snapshot 1.0.365.3 you can set a default encoding, e.g. Windows-1252, to force the correct display of such broken Windows-made pages.



  • @Gwen-Dragon:

    With the Latest Snapshot 1.0.365.3 you can set a default encoding […]

    Done that, and it works. Thank you. (1.0.357 here, but the preference already exists.) Actually, I was thinking today at work, "maybe there is an option for that…"; after all, that is most of the point of Vivaldi :-)

    Maybe Vivaldi also changed its default encoding lately? (or maybe I did, changing to Unicode?) I don't recall having this problem before.

    I have some - only some - understanding of encodings, so I know it is not easy. But maybe there is some way to make a better guess than sticking to a default? Say, if a page has LOTS of replacement characters - � - it is likely the encoding is wrong. After all, even pages about them only have a few ;-)
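    (The "guess better than a fixed default" idea can be sketched in a few lines of Python. This is an illustrative toy, not how Vivaldi actually works: try strict UTF-8 first, and only fall back to a legacy single-byte encoding when the bytes are provably not UTF-8.)

    ```python
    def decode_with_fallback(raw: bytes, fallback: str = "windows-1252") -> str:
        """Decode as UTF-8 when the bytes allow it; otherwise assume a
        legacy single-byte encoding instead of emitting � marks."""
        try:
            # Strict mode raises on any invalid UTF-8 byte sequence.
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # errors="replace" guards against the few unassigned
            # windows-1252 bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D).
            return raw.decode(fallback, errors="replace")
    ```

    (Both a Windows-1252 page and a UTF-8 page then round-trip correctly, because valid UTF-8 text almost never decodes *incorrectly* as UTF-8, while Windows-1252 accented text almost always fails a strict UTF-8 decode.)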



  • I don't know whether Vivaldi had a default encoding before, and if so, which one it used.

    Vivaldi added a setting in the Latest Snapshot to apply a default encoding when one is missing from a web page. Many users complained about the missing setting because other (old and new) browsers have it.

    But there is really no way for a browser to guess an encoding correctly.
    I have been working on the WWW since 1999 and have been a programmer since the early 1980s; I know that guessing (mixed/bad) encodings is neither easy nor reliable.


  • Moderator

    @Gwen-Dragon:

    I don't know whether Vivaldi had a default encoding before, and if so, which one it used.

    Vivaldi added a setting in the Latest Snapshot to apply a default encoding when one is missing from a web page. Many users complained about the missing setting because other (old and new) browsers have it.

    But there is really no way for a browser to guess an encoding correctly.
    I have been working on the WWW since 1999 and have been a programmer since the early 1980s; I know that guessing (mixed/bad) encodings is neither easy nor reliable.

    Hopefully resurrecting this isn't annoying, but I found this an interesting topic.

    In the case of Vivaldi, it would probably make sense to have not just a configurable default, but a per-tab manual override in the View menu, like most browsers have.

    Musing more generally on the topic of charsets:

    I know it's by no means foolproof (especially on shorter files), and you (Gwen-Dragon) are likely aware, but a program can try [charset detection](https://en.wikipedia.org/wiki/Charset_detection), making guesses based on the presence/absence/frequency of certain byte values or byte combinations. True, many charsets overlap a lot in characteristics, so it makes sense to assume the most frequently used charsets (or some user-set preference order) in cases where multiple ones would be plausible.

    (Aside: Judging an entire file's encoding as a whole falls apart if the file is broken by mixed encodings, but for something like a plain txt file, at least, one could imagine an algorithm that builds up confidence in a particular charset as it reads through the text, then hits a first point where a byte or byte sequence contradicts the high-confidence guess. If the file as a whole can't be reconciled with any one character set, it can then backtrack to that point, assume that everything before it was indeed in the earlier guessed charset, treat everything subsequent independently, and iteratively repeat the process for the remaining portion. This may not be so relevant to HTML files, though, as it would get problematic sorting out where the real encoding boundaries are. Hmm, I suppose it's possible that someone might expose themselves to security risks by overriding a document's declared charset, or by viewing a site that neglected to declare a charset so the user's default is in effect, if it's a site where people can submit arbitrary text comments? That is, if there's a way to get angle brackets to appear in an alternate interpretation of the data. That's avoided when UTF-8 text is rendered as anything like US-ASCII that uses 0x3C to mean "<", though, because the first bit is always 1 in UTF-8 byte sequences other than those encoding the 7-bit ASCII characters, so there's no way to get a value as low as 0x3C.)

    It's not the same thing, but charset detection has parallels to how a browser like Chrome can, in real time, detect that you've landed on a page in a foreign language, so it knows to offer you translation.

    The Mozilla project had a charset detection library; I'm not sure if that's still active.



  • I'm sometimes bothered when browsers showing plain text files (as opposed to HTML) use the fallback charset that's meant for HTML pages. Most text generated today is UTF-8, so text files with non-ASCII characters look bad by default. On the other hand, if you set the fallback charset to UTF-8, many older web pages look bad. The solution would be to allow different defaults for different MIME types / file types.
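    (A per-MIME-type default could be as simple as a lookup keyed on the Content-Type header. The table values below just reflect the preferences from this post, not anything a browser actually ships:)

    ```python
    # Hypothetical per-type fallbacks, consulted only when the response
    # declares no charset of its own.
    FALLBACK_BY_TYPE = {
        "text/plain": "utf-8",        # modern text files
        "text/html": "windows-1252",  # legacy pages without a declaration
    }

    def fallback_charset(content_type: str, default: str = "windows-1252") -> str:
        """Pick a fallback charset from the MIME-type part of a
        Content-Type header ("text/html; ..." -> "text/html")."""
        mime = content_type.split(";")[0].strip().lower()
        return FALLBACK_BY_TYPE.get(mime, default)
    ```

    ```python
    fallback_charset("text/plain")        # "utf-8"
    fallback_charset("Text/HTML")         # "windows-1252"
    fallback_charset("application/json")  # falls back to the global default
    ```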

