Incorrect encoding handling: UTF-8 byte stream misidentified as GBK/GB18030.

Core Issue Description

The system misidentifies the character encoding of the input/output stream. Specifically, UTF-8 encoded text is decoded with the GBK (or GB18030) codec, which produces corrupted text (mojibake).
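
One common way to avoid this kind of misidentification, sketched here under the assumption that the raw bytes are available before any decoding happens, is to try strict UTF-8 first and fall back to GB18030 only when that fails. The helper name `decode_page` is hypothetical and is not taken from Reedle's code.

```python
def decode_page(raw: bytes) -> str:
    """Decode raw bytes, preferring UTF-8 over GB18030 (hypothetical helper)."""
    try:
        # Strict UTF-8 decoding rejects typical GBK/GB18030-encoded Chinese
        # text, so a successful decode is strong evidence of UTF-8 input.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: GB18030 is a superset of GBK, so it covers both codecs.
        return raw.decode("gb18030", errors="replace")


print(decode_page("代译序".encode("utf-8")))
print(decode_page("代译序".encode("gbk")))
```

Trying UTF-8 first matters because the reverse order almost never fails: most UTF-8 byte sequences also happen to be decodable as GBK, which is how the corruption goes unnoticed.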

Technical Breakdown

  • Observed Behavior: Text such as 代译序, stored as UTF-8, is rendered as 浠h瘧搴 when its bytes are decoded as GBK.

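  • Root Cause: Each of these characters occupies 3 bytes in UTF-8, but the GBK decoder consumes the byte stream in 2-byte units. Character boundaries therefore misalign: the 9 bytes of 代译序 are read as four 2-byte GBK characters plus one dangling byte.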

  • Affected Format: Markdown files/strings.
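
The byte-level mechanism described above can be reproduced with a short sketch. It assumes nothing about Reedle's internals and only shows what happens when UTF-8 bytes are handed to a GBK decoder.

```python
text = "代译序"
utf8_bytes = text.encode("utf-8")  # 3 bytes per character, 9 bytes in total

# The bug: the same bytes read with the wrong codec. errors="replace" keeps
# the decode from raising on the unpaired final byte.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Show the raw bytes so the 3-byte vs 2-byte grouping is visible.
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e4 bb a3 e8 af 91 e5 ba 8f
print(garbled)
```

The GBK decoder pairs the bytes as `e4 bb`, `a3 e8`, `af 91`, `e5 ba`, and is left with a lone `8f`, so none of the resulting characters corresponds to an original one.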

Reproduction

Save the following link with the extension:

https://www.marxists.org/chinese/reference-books/guy-debord-1967/00b.htm

Status: Completed
Board: Reedle
Tags: Bug - Reedle Parse
Date: 3 months ago
Author: rizzotho