    yaml: don't encode unprintable strings as binary blobs · 8caf1fff
    Vladimir Davydov authored
    Historically, we have encoded strings that contain invalid or
    non-printable utf-8 sequences in YAML as binary base64 blobs. We did
    that because of limitations/bugs in the YAML encoder, which refused to
    encode invalid utf-8 strings. To work around this issue, we introduced
    the helper utf8_check_printable, which is basically a copy of
    yaml_check_utf8, and treated strings for which it fails as binary data
    (MP_BIN).
    
    This commit updates the YAML submodule to the version where all known
    issues with encoding invalid/unprintable utf-8 strings are fixed and
    removes special treatment of such strings (drops utf8_check_printable).
    Now unprintable or invalid utf-8 sequences are emitted as escaped code
    points, e.g. '\xFF' or '\uFFFF'. This change is a prerequisite for
    introducing the new varbinary type to Lua. Without it, plain strings
    would be implicitly converted to varbinary after a YAML encode/decode
    round trip, which would be confusing.
    
    Closes #8756
    
    NO_DOC=bug fix
    
    (cherry picked from commit 890a821c)
gh-8756-yaml-unprintable-utf8-encoding-fix.md 232 B

bugfix/core

  • Eliminated implicit conversion of unprintable utf-8 strings to binary blobs when encoded in YAML. Now unprintable characters are encoded as escaped utf-8 code points, for example, \x80 or \u200B (gh-8756).
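
The behavior change described above can be illustrated with a small Python sketch. This is not the Tarantool or libyaml code: the function names `encode_old` and `encode_new` are hypothetical, Python's `str.isprintable()` stands in for the real printability check, and the rule that invalid bytes become `\xNN` while unprintable code points become `\uNNNN` is an assumption made for the illustration. Quoting rules for otherwise printable strings are deliberately ignored.

```python
import base64


def _decode_lossless(data: bytes) -> str:
    # Invalid bytes 0x80..0xFF are mapped to lone surrogates U+DC80..U+DCFF,
    # so they can be told apart from valid code points later.
    return data.decode("utf-8", errors="surrogateescape")


def _is_clean(data: bytes) -> bool:
    # True if the input is valid UTF-8 made only of printable characters.
    # Lone surrogates (invalid bytes) are reported as not printable.
    return all(ch.isprintable() for ch in _decode_lossless(data))


def encode_old(data: bytes) -> str:
    """Old behavior (sketch): fall back to a base64 binary blob."""
    if _is_clean(data):
        return data.decode("utf-8")
    return "!!binary " + base64.b64encode(data).decode("ascii")


def encode_new(data: bytes) -> str:
    """New behavior (sketch): keep a string, escape the offending parts."""
    if _is_clean(data):
        return data.decode("utf-8")
    out = []
    for ch in _decode_lossless(data):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid byte: emit the raw byte value.
            out.append("\\x%02X" % (cp - 0xDC00))
        elif ch in '"\\':
            # Characters that are special inside a double-quoted scalar.
            out.append("\\" + ch)
        elif ch.isprintable():
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return '"' + "".join(out) + '"'


if __name__ == "__main__":
    print(encode_old(b"\x80"))                 # !!binary gA==
    print(encode_new(b"\x80"))                 # "\x80"
    print(encode_new("a\u200bb".encode()))     # "a\u200Bb"
```

With the old approach the value comes back from a YAML decoder as binary data rather than as a string, which is the implicit conversion this change removes. With the new approach it stays a double-quoted string scalar.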