Discussion:
FAO BBC: Double-encoded UTF-8 in Programme's JSON.
Ralph Corderoy
2018-03-04 13:39:36 UTC
Hi,

I noticed get_iplayer showing

Rothaí Móra an tSaoil: Series 1

and wondered if it was a bug, but the BBC's JSON has

$ curl -sS https://www.bbc.co.uk/programmes/b09w6dhm.json |
grep -o '"Roth[^"]*"'
"Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
"Rotha\u00c3\u00ad M\u00c3\u00b3ra an tSaoil"
$

and get_iplayer is correctly showing U+00C3 and U+00AD after `Rotha'.

The problem is that the BBC have taken the UTF-8 encoding of the intended
rune, treated each of its bytes as a code point in its own right, and
encoded those as UTF-8 again.
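
As a rough sketch of how this could come about, assuming the bytes were
misread as ISO-8859-1 somewhere along the way (I don't know what the
BBC's pipeline really does, and `double_encode' is just a name made up
for illustration):

```shell
# Hypothetical helper: reproduce the suspected mistake by reading
# bytes as ISO-8859-1 and writing the result out as UTF-8.
double_encode() {
    iconv -f ISO-8859-1 -t UTF-8
}

# \303\255 is octal for c3 ad, the UTF-8 encoding of U+00ED.
# Dump the doubly-encoded result as hex bytes.
printf '\303\255' | double_encode | od -An -tx1
```

That gives the bytes c3 83 c2 ad, which is UTF-8 for U+00C3 U+00AD,
matching the \u00c3\u00ad seen in the JSON.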

$ iconv -f utf-8 -t ucs-2be <<<$'\xc3\xad \xc3\xb3' |
od --endian=big -tx2
0000000 00ed 0020 00f3 000a
0000010
$

Thus the title is meant to be

$ printf 'Rotha\u00ed M\u00f3ra an tSaoil\n'
Rothaí Móra an tSaoil
$
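
Until it is fixed at source, a consumer could undo one round of the
damage along these lines. This is only a sketch: `fix_double_utf8' is a
made-up name, not anything in get_iplayer, and it assumes every
character in the input lies in U+0000..U+00FF, as it does for a
double-encoded string like this one.

```shell
# Hypothetical helper: map each code point U+0000..U+00FF back to the
# single byte of the same value, which recovers the original UTF-8.
fix_double_utf8() {
    iconv -f UTF-8 -t ISO-8859-1
}

# Octal escapes for the doubly-encoded bytes: c3 83 c2 ad is
# U+00C3 U+00AD, and c3 83 c2 b3 is U+00C3 U+00B3.
printf 'Rotha\303\203\302\255 M\303\203\302\263ra an tSaoil\n' |
fix_double_utf8
```

It must not be applied blindly: a title that was encoded correctly would
be mangled into invalid UTF-8, and iconv will give an error on any
character above U+00FF.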

Can a BBC lurker please see if they can stop it happening?  Thanks.
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy