UTF-8 BOM is not removed when running on Node.js and input is a file #1344

mojavelinux · 2021-07-02T19:27:20Z

Although there's a test that covers removing the UTF-8 BOM from the AsciiDoc source, this only works on Node.js when the input is a string created in JavaScript. When the AsciiDoc source is read from a file, or the string is created from Buffer.from, the UTF-8 BOM is not removed.

This snippet explains why we're hitting this problem:

const fs = require('fs')

fs.writeFileSync('test.adoc', Buffer.concat([Buffer.from([0xEF, 0xBB, 0xBF]), Buffer.from('= Document Title')]))

const contentsFromFile = fs.readFileSync('test.adoc')
console.log(contentsFromFile.toString().charCodeAt())
console.log(contentsFromFile.toString().charCodeAt(1))
// => 65279
// => 61 (the char code of '=')

const contentsFromString = '\xef\xbb\xbf= Document Title'
console.log(contentsFromString.charCodeAt())
console.log(contentsFromString.charCodeAt(3))
// => 239
// => 61 (the char code of '=')

It appears that when Node.js creates a string by way of a Buffer, such as when reading the contents of a file, it decodes the three-byte UTF-8 BOM (0xEF 0xBB 0xBF) into a single BOM character (char code: 65279, code point: U+FEFF). I have not found any way to disable this behavior. It's basically a quirk of Node.js.
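To make the difference between the two strings concrete, here is a minimal sketch that decodes the same bytes two ways. The helper name `leadingCharCodes` is hypothetical and not part of Asciidoctor.js. It relies on the fact that a JavaScript literal such as '\xef\xbb\xbf' is three separate characters (one per escape), which matches a latin1 decoding of the bytes, whereas the default UTF-8 decoding collapses the three bytes into one character.

```javascript
// Sketch only: compare how the same bytes look after UTF-8 and latin1 decoding.
const bomBytes = Buffer.from([0xef, 0xbb, 0xbf])
const source = Buffer.concat([bomBytes, Buffer.from('= Document Title')])

// Hypothetical helper: char codes of the first `count` UTF-16 code units of a string.
function leadingCharCodes (str, count) {
  return Array.from({ length: Math.min(count, str.length) }, (_, i) => str.charCodeAt(i))
}

// Default decoding used by buffer.toString(): the three BOM bytes become one character.
const asUtf8 = source.toString('utf8')
// One character per byte: equivalent to the literal '\xef\xbb\xbf= Document Title'.
const asLatin1 = source.toString('latin1')

console.log(leadingCharCodes(asUtf8, 2))   // [ 65279, 61 ]
console.log(leadingCharCodes(asLatin1, 4)) // [ 239, 187, 191, 61 ]
```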

I think Asciidoctor.js should detect this alternate BOM and remove the character. (I'm open to changing Asciidoctor Ruby if we determine it's necessary.)

mojavelinux · 2021-07-02T19:30:28Z

What I can't figure out is how to emulate this scenario in Ruby. I can't figure out how to get Ruby to report the character code 65279. If I try to pack it as an 8-bit unsigned integer, this is what I get:

[65279].pack('C*')
# => "\xFF"

Here's what I get if I unpack the BOM:

"\xFE\xFF".unpack('C*')
# => [254, 255]

And if I write it to a file, Ruby always writes the original BOM, "\xEF\xBB\xBF":

File.write '/tmp/bom.txt', [65279].pack('U*'), mode: 'wb'

We may just have to accept that this character code is a weird quirk of Node.js and deal with it as such. There are translations going on that we just can't see.

mojavelinux · 2021-07-02T20:24:59Z

I'm now convinced it's best to deal with this in Asciidoctor.js. Our custom unpack method is an ideal place for it. Here's the logic we could use that would make this work:

`self.charCodeAt() === 65279 ? [239, 187, 191] : #{self[0, 3].bytes.select.with_index {|_, i| i.even? }}`

Let me know if you want me to proceed with a fix.
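The check proposed above can be sketched in plain JavaScript as follows. This is an illustration only, not the code that ended up in Asciidoctor.js: the name `leadingBytes` is hypothetical, and the fallback branch assumes that taking the even-indexed entries of `bytes` amounts to taking the low byte of each of the first three characters.

```javascript
// The standard UTF-8 BOM as three byte values.
const STANDARD_BOM = [0xef, 0xbb, 0xbf]

// Hypothetical helper: report the leading "bytes" that BOM detection would inspect.
function leadingBytes (str) {
  // A string decoded from a Buffer starts with the single character U+FEFF (65279);
  // report the standard three-byte BOM in that case.
  if (str.charCodeAt(0) === 0xfeff) return STANDARD_BOM.slice()
  // Otherwise fall back to the low byte of each of the first three characters
  // (assumption: this mirrors the even-indexed bytes in the expression above).
  const bytes = []
  for (let i = 0; i < Math.min(3, str.length); i++) bytes.push(str.charCodeAt(i) & 0xff)
  return bytes
}

console.log(leadingBytes('\uFEFF= Document Title'))       // [ 239, 187, 191 ]
console.log(leadingBytes('\xef\xbb\xbf= Document Title')) // [ 239, 187, 191 ]
console.log(leadingBytes('= Document Title'))             // [ 61, 32, 68 ]
```

With this shape, both forms of the BOM report the same three bytes, so downstream detection would not need to know how the string was created.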

mojavelinux · 2021-07-02T20:45:31Z

An alternate way to do this is to check whether self[0, 1].bytes is [255, 254]. Either way seems to work.
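One plausible reading of the [255, 254] result is that the character 65279 (U+FEFF) is being reported as its two UTF-16 bytes in little-endian order. That is an assumption about how `bytes` behaves here, not something confirmed in this thread. The sketch below only shows what Node.js itself produces for that character in each encoding.

```javascript
// Sketch: the byte sequences Node.js produces for the single character U+FEFF.
const bom = String.fromCharCode(65279)

// UTF-16 little-endian: low byte first, then high byte.
const utf16le = Array.from(Buffer.from(bom, 'utf16le'))
// UTF-8: the familiar three-byte BOM.
const utf8 = Array.from(Buffer.from(bom, 'utf8'))

console.log(utf16le) // [ 255, 254 ]
console.log(utf8)    // [ 239, 187, 191 ]
```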

mojavelinux · 2021-07-02T20:51:04Z

Here's something that's interesting:

encodeURIComponent(self.charAt())
// => %EF%BB%BF

and

encodeURIComponent(String.fromCharCode(65279))
// => %EF%BB%BF

So it is recognizing 65279 as a UTF-8 BOM.

I think checking the char code is the right strategy (as proposed above). It's the same as this, with a bit less overhead:

self.charAt() === String.fromCharCode(65279)
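For callers who want to work around the problem before passing a string along, the same char code check can be used to drop the character up front. This is a minimal sketch with a hypothetical `stripBom` helper. It is not the fix applied in Asciidoctor.js, and it only handles the single-character form produced when decoding a Buffer.

```javascript
// Hypothetical helper: remove a leading U+FEFF (char code 65279) if present.
function stripBom (str) {
  return str.charCodeAt(0) === 0xfeff ? str.slice(1) : str
}

// Simulate file contents that begin with the UTF-8 BOM bytes.
const fileContents = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from('= Document Title')
])

const withBom = fileContents.toString()
const withoutBom = stripBom(withBom)

console.log(withBom.length)    // 17 (the BOM counts as one character)
console.log(withoutBom.length) // 16
console.log(withoutBom)        // = Document Title
```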

mojavelinux added a commit to mojavelinux/asciidoctor.js that referenced this issue Jul 3, 2021

resolves asciidoctor#1344 look for alternate char code for BOM and return standard BOM

9734bc3

mojavelinux added a commit to mojavelinux/asciidoctor.js that referenced this issue Jul 4, 2021

resolves asciidoctor#1344 look for alternate char code for BOM and return standard BOM

d4e44e4

mojavelinux self-assigned this Jul 4, 2021

mojavelinux mentioned this issue Jul 4, 2021

🐛 resolves #1344 look for alternate char code for BOM and return standard BOM #1345

Merged

ggrossetie closed this as completed in #1345 Jul 4, 2021

ggrossetie pushed a commit that referenced this issue Jul 4, 2021

🐛 resolves #1344 look for alternate char code for BOM and return standard BOM (#1345)

2e6ded9

ggrossetie pushed a commit to ggrossetie/asciidoctor.js that referenced this issue Jul 10, 2021

🐛 look for alternate char code for BOM and return standard BOM

5e71468

Bypass Ruby method calls to make BOM detection more portable. Port asciidoctor#1344 fix.

ggrossetie pushed a commit to ggrossetie/asciidoctor.js that referenced this issue Jul 10, 2021

🐛 look for alternate char code for BOM and return standard BOM

10f4143

Bypass Ruby method calls to make BOM detection more portable. Port asciidoctor#1344 fix.

ggrossetie pushed a commit that referenced this issue Aug 4, 2021

🐛 resolves #1344 look for alternate char code for BOM and return standard BOM (#1345)

050ee19
