Discussion:
parser error
RS
2017-10-24 19:35:11 UTC
Permalink
Does anyone have any idea what causes a parser error? I thought at first
that it was an AtomicParsley error, but the --verbose output seems to
indicate it didn't get that far. The programme is the editorial version of
the 1941 Hitchcock film Suspicion, b00gmlrx, from hlshd1 (CDN:
akamai_hls_open/10). --no-hq-audio was set. There probably were no
subtitles, but get_iplayer (this is v3.05 in Windows 10) normally just says
so. The resultant .mp4 file can be played in VLC, but MediaInfo shows no
metadata. The ^ in the error message is placed just after the N in the line
above.

INFO: Downloaded: 1689.27 MiB (01:35:20) [573] in 00:03:03 at 73.85 Mibit/s
INFO: Converting to MP4
INFO: Command: "ffmpeg" "-loglevel" "fatal" "-stats" "-y" "-i"
"H:\Dvid17A\Suspicion.hls.ts" "-c:v" "copy" "-c:a" "copy" "-bsf:a"
"aac_adtstoasc" "H:\Dvid17A\Suspicion.partial.mp4"
frame=143024 fps=1417 q=-1.0 Lsize= 1669279kB time=01:35:20.93
bitrate=2390.3kbits/s speed=56.7x
INFO: Command exit code 0 (raw code = 0)
INFO: Downloading subtitles
INFO: Getting URL:
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml
:7: parser error : Char 0x0 out of allowed range
SUSPICION
^
:7: parser error : Premature end of data in tag title line 6
SUSPICION
^
:7: parser error : Premature end of data in tag metadata line 5
SUSPICION
^
:7: parser error : Premature end of data in tag head line 3
SUSPICION
^
:7: parser error : Premature end of data in tag tt line 2
SUSPICION
^
Colin Law
2017-10-24 19:44:44 UTC
Permalink
If you download the xml file the error refers to and look at it, it
can be seen that there are lots of null characters (hence the error
Char 0x0 out of range). The file is corrupt.

Colin
RS
2017-10-24 20:41:54 UTC
Permalink
From: Colin Law
Sent: Tuesday, October 24, 2017 8:44 PM
Post by Colin Law
If you download the xml file the error refers to and look at it, it
can be seen that there are lots of null characters (hence the error
Char 0x0 out of range). The file is corrupt.
Thanks for the explanation. I'm glad I asked because I hadn't realised that
was where subtitles came from. I had assumed there was a ready-made .srt
file to download.

I see from --info there are three subtitle modes. If I try
--subtitles-only --subtitles2
it tells me that is an invalid option.

If I try --subtitles-only --tvmode=subtitles2
it tells me No media streams found.

--info displays metadata, so that must be stored somewhere other than that
corrupt XML file with the subtitles. I'll have to explore some more.
RS
2017-10-24 22:38:14 UTC
Permalink
From: Jeremy Nicoll - ml gip
Sent: Tuesday, October 24, 2017 8:54 PM
The problem doesn't lie with g_ip, but with a corrupted(?) file on the
BBC server.
This is the first time I have seen this error, so it must be rare. For that
reason it is not worth bothering with. If it happened more often,
get_iplayer ought, when it cannot download subtitles successfully, to carry on
and call AtomicParsley to add metadata. Further, if subtitles1 fails it
ought to try subtitles2 and subtitles3, just as it does with hlshd1, hlshd2,
hlshd3 ...
For the benefit of those who use the PVR it ought to set a return code to
indicate that all was not well.
Vangelis forthnet
2017-10-25 05:14:46 UTC
Permalink
Post by RS
The resultant .mp4 file can be played in VLC,
but MediaInfo shows no metadata.
Hello Richard :-)
If you ended up, for whatever reason,
with an untagged file, you can always (re-)tag
post download with the --tag-only switch:

get_iplayer --type=video --pid=b00gmlrx --tag-only --tag-podcast-tv --tag-only-filename="path\to\Suspicion.mp4"

(I assume you renamed the "Suspicion.partial.mp4" to just "Suspicion.mp4")
Post by RS
The programme is the editorial version of
the 1941 Hitchcock film Suspicion, b00gmlrx
pid=b00gmlrx => vpid=b09c79wx (needed later...)
Post by RS
Does anyone have any idea what causes a parser error?
Answered by Colin; some further analysis below...
Post by RS
I'm glad I asked because I hadn't realised
that was where subtitles came from.
I had assumed there was a ready-made .srt
file to download.
On-line media portals (like iPlayer) rarely use the .srt
(subrip text) format, because it's usually incompatible
with their embedded player (Flash based/HTML5 one);
I'm certainly not an expert on this subject, but Flash
based players usually require an XML caption file
(referred to also as DFXP), while HTML5 ones
may use the WebVTT (.vtt) format.

DFXP is a timed-text format that was developed by W3C
(stands for "Distribution Format Exchange Profile"); it is
currently referred to as TTML, read more at:
https://en.wikipedia.org/wiki/Timed_Text_Markup_Language

GiP will use mediaselector URLs (which contain the vpid string)
to retrieve the URIs pointing to the iPlayer ttml files;
PC/iptv-all/apple-ipad-hls mediasets are tried. The URI
you included in your original post will be found, e.g., in

http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/iptv-all/vpid/b09c79wx
(geo-filtered)
in the <media expires="2017-11-21T14:05:00Z" kind="captions" ...>
XML element; this URI is a legacy format (supplier="sis"),
not geo-blocked, and it never expires...
You'll also notice two other URIs for the same subtitles file, these
are the Video Factory flavours; they are served from Akamai/Limelight,
are UK-only and tokenised, with limited lifespans;
but ALL 3 URIs point to the same file!
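
For illustration only, that lookup can be sketched in a few lines of
perl - the element/attribute names are taken from the mediaselector
XML above, the rest is my own assumption and NOT the actual GiP code,
and you'd need a UK IP for the request to succeed:

  use strict;
  use warnings;
  use LWP::UserAgent;
  use XML::LibXML;

  my $vpid = 'b09c79wx';
  my $url  = "http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0"
           . "/mediaset/iptv-all/vpid/$vpid";
  my $resp = LWP::UserAgent->new->get($url);
  die "mediaselector fetch failed: " . $resp->status_line
      unless $resp->is_success;

  my $doc = XML::LibXML->load_xml( string => $resp->decoded_content );
  # local-name() saves having to register the mediaselector namespace
  for my $href ( $doc->findnodes('//*[local-name()="media"][@kind="captions"]'
                               . '//*[local-name()="connection"]/@href') ) {
      print $href->value, "\n";
  }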

GiP fetches the XML subs file (which is referred to as "raw"
in GiP terminology) and then, through a dedicated perl subroutine
("ttml_to_srt", line 6588 of 3.05 script) converts it to .srt;
--subsraw flag will let you also keep the original file...
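
To give a flavour of what that conversion involves, here is a much
simplified sketch - NOT the actual ttml_to_srt subroutine - assuming
the 2006 "ttaf1" namespace and begin/end attributes of the form
HH:MM:SS.mmm, as seen in the BBC files:

  use strict;
  use warnings;
  use XML::LibXML;

  my $ttml = do { local $/; <> };            # slurp the raw TTML file
  my $doc  = XML::LibXML->load_xml( string => $ttml );
  my $xpc  = XML::LibXML::XPathContext->new($doc);
  $xpc->registerNs( tt => 'http://www.w3.org/2006/10/ttaf1' );

  my $index = 1;
  for my $p ( $xpc->findnodes('//tt:body//tt:p') ) {
      my ( $begin, $end ) = map { $p->getAttribute($_) } qw(begin end);
      next unless defined $begin && defined $end;
      ( my $text = $p->textContent ) =~ s/\s+/ /g;
      s/\./,/ for $begin, $end;              # SRT wants a comma before the ms
      printf "%d\n%s --> %s\n%s\n\n", $index++, $begin, $end, $text;
  }

The real subroutine of course handles much more (colours at the very
least), but that is the general shape.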
Post by RS
I see from --info there are three subtitle modes.
I used GiP 3.05 and the following command:

perl get_iplayer-305w.pl --type=tv --pid=b00gmlrx -i --streaminfo >
Streams.txt 2>&1

and yes, there are 3 captions modes identified,
but, alas, I can sure tell there's a bug in the
detection scheme somewhere; no sign of the
legacy format, plus there's duplication, as

subtitles3=subtitles1

==================================
stream: subtitles1
bitrate:
expires: 2017-11-21T14:05:00Z
ext: srt
priority: 20
size: 118212
streamer: http
streamurl: http://vod-sub-uk-live.bbcfmt.hs.llnwd.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a
type: (captions) http stream (CDN: mf_limelight_uk_plain/20)

stream: subtitles2
bitrate:
expires: 2017-11-21T14:05:00Z
ext: srt
priority: 10
size: 118212
streamer: http
streamurl: http://vod-sub-uk-live.akamaized.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?__gda__=1508921411_8042e3b62cef7eb303c0b44d69225c99
type: (captions) http stream (CDN: mf_akamai_uk_plain/10)

stream: subtitles3
bitrate:
expires: 2017-11-21T14:05:00Z
ext: srt
priority: 20
size: 118212
streamer: http
streamurl: http://vod-sub-uk-live.bbcfmt.hs.llnwd.net/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml?s=1508878211&e=1508921411&h=c1d8bb45cd85f418d83103af0ef1979a
type: (captions) http stream (CDN: mf_limelight_uk_plain/20)
==================================

but all three point to the same file!

Now, if you load the legacy URL
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-591e0c64-779b-4f16-9582-bd3bc6c441bd_b09c79wx_1508034118207.xml
in Firefox, you get:

XML Parsing Error: not well-formed
Location: (The URI)
SUSPICION

Right-click -> View Page Source
and you'll be able to view the file contents
and actually visualise the corruption.

With the aid of Fx's Page Source and
a Text Editor, I managed to reconstitute
a proper TTML file, then used SubtitleEdit
to convert to (monochrome) .srt.
If you're in need of it, contact me off-list...
Post by RS
If I try --subtitles-only --tvmode=subtitles2
it tells me No media streams found.
I don't think subtitle mode user selection is supported;
legacy GiP code assumed only one captions mode,
so this could be a new requested feature; I see no
reason for it though; all modes point to the same file,
negligible speed differences between CDNs for such
small files of just a few KBs...
Post by RS
get_iplayer ought, when it cannot download subtitles successfully,
to carry on and call AtomicParsley to add metadata.
While in this case it's not the actual downloading that failed,
but rather the conversion to .srt (e.g. you can fetch the raw
corrupt ttml with --subsraw), I too agree with that.

After another series of tests made, of note is the fact
that every GiP version from 3.00 onwards does fail to
convert this corrupted subtitles file, but, lo-and-behold,
v2.99 does so successfully:
=======================================
get_iplayer v2.99, Copyright (C) 2008-2010 Phil Lewis
This program comes with ABSOLUTELY NO WARRANTY; for details
use --warranty.
This is free software, and you are welcome to redistribute it under certain
conditions; use --conditions for details.

NOTE: A UK TV licence is required to legally access BBC iPlayer TV content

INFO Trying to download PID using type tv
INFO: pid found in cache
Matches:
5276: Suspicion - -, BBC Two, b00gmlrx
WARNING: Could not download programme metadata from
http://www.bbc.co.uk/programmes/b00gmlrx.xml
INFO: Downloading Subtitles to 'D:\Vangelis\iPlayer Recordings/Suspicion_-__b00gmlrx_editorial.srt'
=======================================

Actually, this is not a fluke; prior to 3.00,
GiP would produce monochrome .srt files, so,
without examining the code itself, I suspect
the older TTML parsing code was more forgiving...
Post by RS
Further, if subtitles1 fails it ought to
try subtitles2 and subtitles3
Again, it isn't the actual download that failed,
but the conversion; since all 3 (2 by my tests)
modes point to the same file, conversion of the other
two should fail also; we still don't know at which
stage the corruption took place; I'm presuming
during file generation, not during upload to CDNs (?).

Now, if the actual download failed, then I see
your point as a valid one... I won't pretend I fully
understand the actual perl code, but perl wizards
could enlighten us as to actual content of GiP
subroutines "subtitles_available" & "download_subtitles";
my hunch is GiP already does what you suggest,
as far as downloading is concerned...

Apologies for the length of this post and thanks
to those that stayed to read the end of it...

Kindest regards,
Vangelis.
RS
2017-10-25 23:51:15 UTC
Permalink
From: Vangelis forthnet
Sent: Wednesday, October 25, 2017 6:14 AM
Post by Vangelis forthnet
Post by RS
The resultant .mp4 file can be played in VLC,
but MediaInfo shows no metadata.
Hello Vangelis

Many thanks for a very thorough explanation.
Post by Vangelis forthnet
If you ended up, for whatever reason,
with an untagged file, you can always (re-)tag
get_iplayer --type=video --pid=b00gmlrx --tag-only --tag-podcast-tv --tag-only-filename="path\to\Suspicion.mp4"
Thanks, that's useful. Up to now I have assumed that tagging requires a
massive data collection exercise, so if something goes wrong I have downloaded
afresh (which is not as great a hardship as it used to be now that we have
resuming for HLS and HVF).
Post by Vangelis forthnet
(I assume you renamed the "Suspicion.partial.mp4" to just "Suspicion.mp4")
No, I didn't think of that, so it is something else which is strange. If
downloading subtitles fails (other than because there are no subtitles)
get_iplayer skips tagging by AtomicParsley and renames the partial file as
though nothing is wrong.

I have received an email from someone who told me the Suspicion subtitles
download fine on his XP installation with v3.01. When I use v3.01 I still
get the problem. He also mentioned that there was something in the v3.01
release notes about changes to subtitle handling. I used --subsfmt=default
and the subtitles downloaded without problem. I can't do that in v3.05
because --subsfmt has been removed.
Post by Vangelis forthnet
...
Post by RS
Does anyone have any idea what causes a parser error?
Answered by Colin; some further analysis below...
The corruption he refers to is a few spurious NUL characters in
<head><metadata>. The subtitles themselves are in <body> and they are
intact.
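
If anyone wants to confirm that on a saved copy of the raw file, a
throwaway script along these lines will do it (check_nul.pl and the
filename are just examples, not anything that ships with get_iplayer):

  # check_nul.pl - report which lines of a file contain NUL bytes
  use strict;
  use warnings;
  while (<>) {
      printf "NUL byte(s) on line %d\n", $. if /\x00/;
  }

Run as: perl check_nul.pl Suspicion.ttxt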
Post by Vangelis forthnet
Post by RS
I'm glad I asked because I hadn't realised
that was where subtitles came from.
I had assumed there was a ready-made .srt
file to download.
On-line media portals (like iPlayer) rarely use the .srt
(subrip text) format, because it's usually incompatible
with their embedded player (Flash based/HTML5 one);
I'm certainly not an expert on this subject, but Flash
based players usually require an XML caption file
(referred to also as DFXP), while HTML5 ones
may use the WebVTT (.vtt) format.
DFXP is a timed-text format that was developed by W3C
(stands for "Distribution Format Exchange Profile"); it is
https://en.wikipedia.org/wiki/Timed_Text_Markup_Language
I didn't know any of this, or the paragraphs I have not quoted, so I am
grateful for the explanation. I was aware of --subsraw, but that does not
solve the problem. All the players I have used have accepted .srt files.

I suspect the answer is that XML::LibXML is less tolerant than whatever was
used before (XML::Parser or XML::Simple?).

One good thing to come out of it is that I found a program (which you also
mention and which I have not yet installed) called Subtitle Edit which will
convert XML to SubRip. It will also translate subtitles to other languages
which could be useful with visitors whose mother tongue is not English.
Post by Vangelis forthnet
Post by RS
I see from --info there are three subtitle modes.
perl get_iplayer-305w.pl --type=tv --pid=b00gmlrx -i --streaminfo >
Streams.txt 2>&1
and yes, there are 3 captions modes identified,
but, alas, I can sure tell there's a bug in the
detection scheme somewhere; no sign of the
legacy format, plus there's duplication, as
...
but all three point to the same file!
You got the same as me, with 1 and 3 pointing to Limelight and 2 to Akamai.
Subsequently it changed so there is now only subtitles1 (CDN: sis/10).
Post by Vangelis forthnet
...
With the aid of Fx's Page Source and
a Text Editor, I managed to reconstitute
a proper TTML file, then used SubtitleEdit
to convert to (monochrome) .srt.
If you're in need of it, contact me off-list...
Many thanks for the offer but I now have a file from which I can delete the
spurious characters to create valid XML and feed to Subtitle Edit. I also
have a .srt file which I downloaded with v3.01 and --subsfmt=default.

Best wishes
Richard
Jeremy Nicoll - ml gip
2017-10-26 00:27:05 UTC
Permalink
Post by RS
The corruption he refers to is a few spurious NUL characters in
<head><metadata>. The subtitles themselves are in <body> and they are
intact.
But you're a human looking at the file. XML files have a tightly defined
syntax (defined by a formal grammar called a DTD). When a program tries
to extract data from an XML file it does so using standard code that knows
what the structure of the file is because it has also read the DTD.

Anyway for a program to be able to parse an XML file the parser reads
the file character by character and at every point it knows (from the
grammar definition) exactly what could come next and can classify it
as required.

By definition an XML file is only an XML file if it entirely matches
the grammar that is defined. As soon as a parser finds a character
that makes no sense, the whole file is classed as corrupt, not an XML
file after all.

Much much more at: https://en.wikipedia.org/wiki/XML
--
Jeremy Nicoll - my opinions are my own
RS
2017-10-27 10:33:43 UTC
Permalink
Post by RS
The corruption he refers to is a few spurious NUL characters in
<head><metadata>.  The subtitles themselves are in <body> and they are
intact.
But you're a human looking at the file.  XML files have a tightly defined
syntax (defined by a formal grammar called a DTD).  When a program tries
to extract data from an XML file it does so using standard code that knows
what the structure of the file is because it has also read the DTD.
Anyway for a program to be able to parse an XML file the parser reads
the file character by character and at every point it knows (from the
grammar definition) exactly what could come next and can classify it
as required.
By definition an XML file is only an XML file if it entirely matches
the grammar that is defined.  As soon as a parser finds a character
that makes no sense, the whole file is classed as corrupt, not an XML
file after all.
Much much more at: https://en.wikipedia.org/wiki/XML
I don't agree with you about the approach to parsing. The key exercise
is to match pairs of tags and to associate what is between the matched
pairs with keywords in the tags, but that is not relevant to this
discussion. The Wikipedia article you refer to says in 3.1
"The code point U+0000 (Null) is the only character that is not
permitted in any XML 1.0 or 1.1 document." so you are right to that extent.

That is not the end of the story. The parser has to decide what to do
when it finds an invalid character. It appears (I am guessing) that
XML::LibXML rejects the entire document even to the extent of rejecting
tag content which does not include any invalid character. It also
appears (and again I am guessing) that XML::Simple takes a different
approach and ignores invalid characters. Whether it ignores invalid
characters anywhere in the document or only if, as is the case here,
they are outside the desired tag pair (<body> ... </body>) I am not able
to say on the evidence I have seen.

It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take no
further action. My knowledge of Perl is not sufficient to understand
how get_iplayer.pl interacts with XML::LibXML.
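
From the XML::LibXML documentation it looks as though the calling code
can trap the failure itself rather than letting the raw "parser error"
lines spill out - something along these lines (with the downloaded text
in $ttml), though this is untested and my Perl is shaky:

  use XML::LibXML;

  my $doc = eval { XML::LibXML->load_xml( string => $ttml ) };
  if ( !$doc ) {
      warn "WARNING: subtitles XML could not be parsed: $@";
      # skip the .srt, but still tag the .mp4 and set a non-zero exit code
  }

That sort of trap is what I had in mind above.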

I said that similar errors in subtitles were rare and so not worth
bothering with. That was before I became aware of the v3.02 and v3.03
changes to cease use of XML::Simple and to require version 1.91 of
XML::LibXML. In the past any similar errors will have been masked by
XML::Simple.

Best wishes
Richard
Jeremy Nicoll - ml gip
2017-10-27 12:59:08 UTC
Permalink
Post by RS
Post by RS
The corruption he refers to is a few spurious NUL characters in
<head><metadata>.  The subtitles themselves are in <body> and they are
intact.
But you're a human looking at the file.  XML files have a tightly defined
syntax (defined by a formal grammar called a DTD).  When a program tries
to extract data from an XML file it does so using standard code that knows
what the structure of the file is because it has also read the DTD.
Anyway for a program to be able to parse an XML file the parser reads
the file character by character and at every point it knows (from the
grammar definition) exactly what could come next and can classify it
as required.
By definition an XML file is only an XML file if it entirely matches
the grammar that is defined.  As soon as a parser finds a character
that makes no sense, the whole file is classed as corrupt, not an XML
file after all.
Much much more at: https://en.wikipedia.org/wiki/XML
I don't agree with you about the approach to parsing. The key
exercise is to match pairs of tags and to associate what is between
the matched pairs with keywords in the tags, but that is not relevant
to this discussion. The Wikipedia article you refer to says in 3.1
"The code point U+0000 (Null) is the only character that is not
permitted in any XML 1.0 or 1.1 document." so you are right to that extent.
That is not the end of the story. The parser has to decide what to do
when it finds an invalid character.
The point you seem to be missing is that for XML parsing, the parser
does not have to decide. The XML /standard/ is (however inconvenient
it is) that any error means the parse stops.

Read the wikipedia page's section on

"Well-formedness and error-handling"


What you're really arguing for is for g_ip's author NOT to use an XML
parser to parse possibly badly-formed XML pages.

Maybe some sort of regex-based text extraction could in this specific
case find the text fields in a well-formed or maybe only a little
badly-formed XML document.
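
Something like this, say - a very rough sketch, assuming the downloaded
text is in $ttml and the captions sit inside <p> elements as they do in
the BBC TTML files, and ignoring nested markup and timing entirely:

  my @captions;
  while ( $ttml =~ m{<p\b[^>]*>(.*?)</p>}sg ) {
      ( my $text = $1 ) =~ s/<[^>]+>/ /g;   # strip inner tags like <br/>
      $text =~ s/\s+/ /g;
      push @captions, $text;
  }

Crude, but it wouldn't care about a stray NUL in <head>.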
Post by RS
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
--
Jeremy Nicoll - my opinions are my own
Bernard Peek
2017-10-27 18:06:40 UTC
Permalink
Post by Jeremy Nicoll - ml gip
Post by RS
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser.  It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
PMFJI

I built data transfer standards for the UK's outdoor advertising
industry. I deliberately chose to use XML based standards because it
enabled automatic validation of data files. The standards were quite
specific. All automated systems were required to refuse any files not
compatible with the DTD I had on my web server. Data providers were
expected to prevalidate any files they sent to any other company.

This was my main argument for switching to XML from flat-files.
--
Bernard Peek

***@shrdlu.com
Alex
2017-10-27 19:07:42 UTC
Permalink
I think the real issue is not to fix gIp ('cos it ain't broke), or
break gIp (to introduce a workaround for non-standard data), but to
get the BBC to fix their subtitles. If the poor subs feed can be
shown to cause issues with iPlayer, then the BBC should fix them (I
can't say that they are obligated to do so, as I don't know - but I
suspect that there are requirements in the charter to ensure
accessibility).

Whilst it is a pain when a broken subs file crashes the pvr scheduler
(it'd be nice if it handled it gracefully and skipped the download),
it's easy to fix it (tell gIp not to download the sub for that
particular item - or manually download that item without subs to enter
it into the download library so that no further automated attempt is
made).

Yes, I know, this is /not/ a viable solution for those who depend on
subs, but as mentioned elsewhere in this thread, the subs can be
downloaded using other methods and parsed separately.
Post by Jeremy Nicoll - ml gip
Post by RS
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
PMFJI
I built data transfer standards for the UK's outdoor advertising industry. I
deliberately chose to use XML based standards because it enabled automatic
validation of data files. The standards were quite specific. All automated
systems were required to refuse any files not compatible with the DTD I had
on my web server. Data providers were expected to prevalidate any files they
sent to any other company.
This was my main argument for switching to XML from flat-files.
--
Bernard Peek
--
Alex
RS
2017-10-27 20:47:00 UTC
Permalink
Post by Bernard Peek
Post by Jeremy Nicoll - ml gip
Post by RS
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser.  It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
PMFJI
I built data transfer standards for the UK's outdoor advertising
industry. I deliberately chose to use XML based standards because it
enabled automatic validation of data files. The standards were quite
specific. All automated systems were required to refuse any files not
compatible with the DTD I had on my web server. Data providers were
expected to prevalidate any files they sent to any other company.
This was my main argument for switching to XML from flat-files.
If you are both right about the strictness of the standard, and I have
to defer to your superior knowledge, why does XML::LibXML have options
for recovery and validation? According to
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Parser.pod#PARSER_OPTIONS
and
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Error.pod
it also has a choice of Verbose and Quiet error handlers. Authors can
use their own error handlers, or remove the error handler altogether.
An example given is recovery from a missing closing tag. I have not
seen a definition of fatal error. Is a spurious NUL a fatal error? I
suspect it is less serious than a missing closing tag. It is easy to
recover from; you just ignore it. Subject to what anyone may tell me, I
would have thought non-matching tags would be more likely to be a fatal
error.

It must be remembered that an important function of XML, in contrast to
other markup languages, is that it is human readable as well as machine
readable.

Error recovery must always be appropriate for the importance of
integrity of the data and the probability of errors. I can understand
there are applications where strict compliance is necessary, but
subtitles do not seem to me to be one of them.

Subtitles for this film used to work with XML::Simple. A problem only
occurred with the move to XML::LibXML to support coloured subtitles. Is
it possible to configure XML::LibXML to be as error tolerant as XML::Simple?
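
The parser options page suggests usage along these lines (again with
the raw text in $ttml) - untested on my part, and I have no idea
whether libxml2's recover mode would actually step over a NUL byte:

  use XML::LibXML;

  my $doc = XML::LibXML->load_xml(
      string  => $ttml,
      recover => 2,   # 1 = recover and report errors, 2 = recover silently
  );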

Best wishes
Richard
Jeremy Nicoll - ml gip
2017-10-28 11:02:36 UTC
Permalink
Post by RS
If you are both right about the strictness of the standard, and I have
to defer to your superior knowledge, why does XML::LibXML have options
for recovery and validation? According to
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Parser.pod#PARSER_OPTIONS
and
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Error.pod
it also has a choice of Verbose and Quiet error handlers. Authors can
use their own error handlers, or remove the error handler altogether.
The most obvious reason would be to use XML::LibXML as a validator, before
releasing files you were then certain were properly formed.

I think 'recovery' in this sense merely means the parser returns an
error code; there's nothing to suggest that you can then go on and
make data-extraction calls against the XML file... you'll just keep
getting the error code.
Post by RS
An example given is recovery from a missing closing tag.
- which is no use in this situation when the NUL occurs before any of the
data you're interested in.
Post by RS
I have not
seen a definition of fatal error. Is a spurious NUL a fatal error?
I think so, according to that original Wikipedia article, because it
said that a NUL is one of the only characters that can never be valid
in an XML document.
Post by RS
I suspect it is less serious than a missing closing tag.
Not if the parser, knowing it can NEVER be valid, stops right there.
Post by RS
It is easy to recover from; you just ignore it.
There's no reason to ignore it. By definition, finding one means
that you do not have a valid XML file.
Post by RS
Subject to what anyone may tell me,
I would have thought non-matching tags would be more likely to be a
fatal error.
Well, HTML - which has looser parsing criteria - does manage that sort
of thing. But HTML is not XML.
Post by RS
It must be remembered that an important function of XML, in contrast
to other markup languages, is that it is human readable as well as
machine readable.
OTOH the designers of XML clearly felt that well-formedness was just
as important.
Post by RS
Error recovery must always be appropriate for the importance of
integrity of the data and the probability of errors. I can understand
there are applications where strict compliance is necessary, but
subtitles do not seem to me to be one of them.
Then take that up with the BBC and tell them that their choice of XML
for these files is inappropriate.
Post by RS
Subtitles for this film used to work with XML::Simple. A problem only
occurred with the move to XML::LibXML to support coloured subtitles.
Surely the problem is that this specific XML file is corrupted?

Are you finding that every single XML file is corrupt?
--
Jeremy Nicoll - my opinions are my own
Bernard Peek
2017-10-28 17:12:48 UTC
Permalink
Post by RS
Post by Bernard Peek
Post by Jeremy Nicoll - ml gip
Post by RS
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
PMFJI
I built data transfer standards for the UK's outdoor advertising
industry. I deliberately chose to use XML based standards because it
enabled automatic validation of data files. The standards were quite
specific. All automated systems were required to refuse any files not
compatible with the DTD I had on my web server. Data providers were
expected to prevalidate any files they sent to any other company.
This was my main argument for switching to XML from flat-files.
If you are both right about the strictness of the standard, and I have
to defer to your superior knowledge, why does XML::LibXML have options
for recovery and validation?
If you are particularly masochistic you can write code to recover data
from files that you already know are corrupt. Sometimes you can't just
throw the problem back at the data provider. The nice thing about
failing to validate is that it's a boolean value. It unambiguously
points the finger of blame at the data provider. Whether you can use
that to force them to fix the problem is a political issue not a
technical one.
Post by RS
According to
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Parser.pod#PARSER_OPTIONS
and
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Error.pod
it also has a choice of Verbose and Quiet error handlers.  Authors can
use their own error handlers, or remove the error handler altogether.
An example given is recovery from a missing closing tag.  I have not
seen a definition of fatal error.  Is a spurious NUL a fatal error?  I
suspect it is less serious than a missing closing tag.  It is easy to
recover from; you just ignore it. Subject to what anyone may tell me,
I would have thought non-matching tags would be more likely to be a
fatal error.
It must be remembered that an important function of XML, in contrast
to other markup languages, is that it is human readable as well as
machine readable.
Making XML human-readable was a compromise. The drawback is that it
encourages tinkerers to believe that they can or should attempt to fix
problems when, in most cases, the only sensible thing to do is kick them
back to the provider. What you end up with is multiple people in
different places putting in lots of time fixing someone else's mistakes.
Allowing that to continue is a disservice to other data users and should
be a last resort. Just because something is doable doesn't make doing it
a good idea.
--
Bernard Peek
***@shrdlu.com
Vangelis forthnet
2017-10-27 15:31:45 UTC
Permalink
Post by RS
I have received an email from someone who told me
the Suspicion subtitles download fine on his XP installation with v3.01.
When I use v3.01 I still get the problem.
He also mentioned that there was something in the v3.01
release notes about changes to subtitle handling.
I used --subsfmt=default and the subtitles downloaded without problem.
I can't do that in v3.05 because --subsfmt has been removed.
Hi Richard
Post by RS
of note is the fact that every GiP version
from 3.00 onwards does fail to convert
this corrupted subtitles file, but, lo-and-behold,
2.99 was the last to use (by default) the old XML parsing code;
in 3.00 the new "coloured subtitles" feature was implemented,
introducing new XML parsing code:

https://github.com/get-iplayer/get_iplayer/wiki/release300to309#2-subtitles-now-in-colour
Post by RS
The subtitles conversion in get_iplayer
has been re-implemented using XML::LibXML
Fall-back to the old code was still kept in both 3.00/3.01
via the --subsfmt=default option.
Post by RS
However, that option is deprecated
and will be removed in a future release
That was done in 3.02+

https://github.com/get-iplayer/get_iplayer/wiki/release300to309#changes-in-302
Post by RS
Removed deprecated options: --subsfmt
so if you find programmes whose subtitles
can't be processed with the new implementation,
report them in the forums.
I suspect it's now too late, since the "new" implementation
has been the only one since 3.02+, but it probably wouldn't hurt
to let the maintainer know about this current occurrence...

Best regards,
Vangelis
RS
2017-10-27 20:04:27 UTC
Permalink
Post by Vangelis forthnet
so if you find programmes whose subtitles
can't be processed with the new implementation,
report them in the forums.
I suspect it's now too late, since the "new" implementation
has been the only one since 3.02+, but I suspect it wouldn't hurt
to let the maintainer know about this current occurrence...
I have often suspected that the maintainer does indeed read this list
server. At one time he or she did announce new releases here, but that
seems to have stopped.

I too suspect that it is too late to have XML::Simple and
subsfmt=default restored as an alternative to the coloured subtitles.
As I shall be saying in another reply, I suspect it may be possible to
configure XML::LibXML to be more error tolerant.

Best wishes
Richard
Ralph Corderoy
2017-10-27 21:16:03 UTC
Permalink
Hi Richard,
Post by RS
As I shall be saying in another reply, I suspect it may be possible to
configure XML::LibXML to be more error tolerant.
get_iplayer should fail hard if the XML parser it uses complains, as
there's no good reason for it to expect poor XML and to recover from
errors.

If the BBC haven't already been informed that a particular URL serves
broken XML then that's the first thing to change, including pointing out
the NUL bytes that are causing the problem. I'm sure they'd like to work
out what went wrong, and stop it happening again. And that is better
than having all those who might attempt to use the XML work around it.
https://tools.ietf.org/html/draft-thomson-postel-was-wrong-01
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
Vangelis forthnet
2017-11-03 01:41:43 UTC
Permalink
Post by Ralph Corderoy
If the BBC haven't already been informed
that a particular URL serves broken XML
then that's the first thing to change,
including pointing out the NUL bytes that are causing the problem.
I'm sure they'd like to work out what went wrong,
and stop it happening again.
It looks as though the problem has been fixed upstream!
After navigating to (geo-filtered):

http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/iptv-all/vpid/b09c79wx

all three "connection href"s for service="captions"
load and render perfectly now in Firefox,
without generating an XML Parsing Error...
Does someone from the BBC staff browse
this list, or was it perhaps an in-house find?

Regards,
Vangelis.
RS
2017-11-03 12:19:35 UTC
Permalink
From: Vangelis forthnet
Sent: Friday, November 3, 2017 1:41 AM
Post by Vangelis forthnet
Post by Ralph Corderoy
If the BBC haven't already been informed
that a particular URL serves broken XML
then that's the first thing to change,
including pointing out the NUL bytes that are causing the problem.
I'm sure they'd like to work out what went wrong,
and stop it happening again.
It looks as though the problem has been fixed upstream!
http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/iptv-all/vpid/b09c79wx
all three "connection href"s for service="captions"
load and render perfectly now in Firefox,
without generating an XML Parsing Error...
Does someone from the BBC staff browse
this list, or was it perhaps an in-house find?
That’s interesting. I don’t know whether anyone tried to view Suspicion in
the iPlayer with subtitles to see if that was affected, but it seems more
likely that the BBC would respond to something causing an error in the
iPlayer. The BBC does correct errors. When we had problems with missing
segment errors in HLS, many programmes were corrected a week or so after
broadcast.

What is more interesting is that neither the file you refer to nor the
captions file it links to
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-5df25dc8-d38f-43e5-93a2-38b6c778f852_b09c79wx_1509625417009.xml
are XML files.

As I understand it, an XML file has to begin <xml>, have a link in its
header to the DTD, and end <\xml>.

I may be slightly wrong about that. The problem subtitles file began
<?xml version="1.0" encoding="utf-8"?>

The media selection file you refer to begins
-<mediaSelection>
where - is a dash character I can't copy. Other <media> tags are preceded
by a similar dash character.

The captions file begins
-<tt ttp:timeBase="media" xml:lang="en">
where again - is a dash character.

For both files Firefox displays a banner reading
This XML file does not appear to have any style information associated with
it. The document tree is shown below.

I have been meaning to reply to Ralph and the others who commented. I was
going to do it here, but to avoid making this email any longer I'll do it in
a separate email, except to draw attention to the Wikipedia article on
TimedText_Markup_Language which you have already referred to
https://en.wikipedia.org/wiki/Timed_Text_Markup_Language
and in particular reference 2, WebVTT versus TTML: XML considered harmful
for web captions?
http://www.balisage.net/Proceedings/vol10/html/Tai01/BalisageVol10-Tai01.html

Under the heading "Established industries versus emerging user communities"
it says,

"While XML has been well received and is used in established industries, it
has at least a disputable role on the web. The most prominent areas of
debate are the draconian error handling implemented by XHMTL supporting web
browsers and the growing suppression of XML through JSON as an interchange
format for data on the web."

Best wishes
Richard
Ralph Corderoy
2017-11-03 12:50:12 UTC
Permalink
Hi Richard,
Post by RS
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-5df25dc8-d38f-43e5-93a2-38b6c778f852_b09c79wx_1509625417009.xml
...
Post by RS
I may be slightly wrong about that. The problem subtitles file began
<?xml version="1.0" encoding="utf-8"?>
That's fine. https://www.w3.org/TR/xml/#NT-XMLDecl
Post by RS
-<mediaSelection>
where - is a dash character I can't copy.
...
Post by RS
For both files Firefox
Don't use Firefox to view XML. XML is plain text. Download it to a
file and then use a text editor to view it. Firefox is trying to be
helpful, but fails.
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
RS
2017-11-03 13:34:23 UTC
Permalink
From: Ralph Corderoy
Sent: Friday, November 3, 2017 12:50 PM
Post by Ralph Corderoy
Don't use Firefox to view XML. XML is plain text. Download it to a
file and then use a text editor to view it. Firefox is trying to be
helpful, but fails.
Hi Ralph

Sorry, yes. I have confused myself by looking at one in Brackets and the
other in Firefox. If I look at them both in Brackets they are very similar
apart from the NULs and some letters and digits in the <ttm:title>.

Best wishes
Richard
Vangelis forthnet
2017-11-03 18:59:02 UTC
Permalink
but it seems more likely that the BBC
would respond to something causing an error
in the iPlayer. The BBC does correct errors.
Hi Richard :-)

by opening
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-5df25dc8-d38f-43e5-93a2-38b6c778f852_b09c79wx_1509625417009.xml
I can see it was created on 2/11/2017 at 12:23:23,
so this was just fixed at noon yesterday
(the most recent Suspicion repeat aired on
22/10/2017 @ 13:30, so it took them eleven days
to identify and remedy the problem)...
I have no doubt an iPlayer user did alert them; however,
this is not mentioned as a "Recently Fixed Fault" at:
https://www.bbc.co.uk/iplayer/help/programme-availability/programme-issues
What is more interesting is that neither the file you refer to
nor the captions file it links to (URI snipped) are XML files.
... Well, I'm a complete dunce with regards to XML structure,
but when I see an .xml extension in the URI, I "assume" it points
to an ".xml" file... FWIW, the "belisage" article you referenced says
the existing XML standard for timed text, TTML
This then got me to
https://www.w3.org/TR/ttml1/#content-attribute-id
https://www.w3.org/TR/2005/REC-xml-id-20050909/
https://www.w3.org/TR/ttml1/#content-attribute-lang
https://www.w3.org/TR/ttml1/#content-attribute-space

When viewed in Firefox, as you said, the captions file begins
-<tt ttp:timeBase="media" xml:lang="en">
so xml:lang does imply it's an XML subset...
Of course, Ralph came to the rescue for both
of us: after fetching the file to disk and viewing
it with an editor (BTW, I use PSPad), I do see
the first lines being

<?xml version="1.0" encoding="UTF-8"?>
<tt xmlns="http://www.w3.org/2006/10/ttaf1"
xmlns:ttp="http://www.w3.org/2006/10/ttaf1#parameter"

and it's those bits that Fx omits ;-(
where - is a dash character I can't copy.
Other <media> tags are preceded
by a similar dash character.
This "dash" character you are referring to
is not actually present inside the XML files,
but it's added by Fx; by clicking it, you can
collapse/expand content between matching tags;
clicking turns it into a plus sign and the XML
element's content gets hidden:

-<metadata><ttm:title>
SUSPICION - BRD000000
</ttm:title><ttm:copyright>
Ericsson 2017
</ttm:copyright>
</metadata>

turns into

+<metadata></metadata>

As you can infer, default behaviour
is the expanded state...

Kindest regards,
Vangelis.
RS
2017-11-10 11:43:30 UTC
Permalink
From: Vangelis forthnet
Sent: Friday, November 3, 2017 6:59 PM
Post by Vangelis forthnet
by opening
http://www.bbc.co.uk/iplayer/subtitles/ng/modav/bUnknown-5df25dc8-d38f-43e5-93a2-38b6c778f852_b09c79wx_1509625417009.xml
Created on 2/11/2017 at 12:23:23
so this was just fixed at noon yesterday
(most recent Suspicion repeat aired on
to identify and remedy the problem)...
I have no doubt an iPlayer user did alert them; however,
https://www.bbc.co.uk/iplayer/help/programme-availability/programme-issues
There is a report in the forum
https://squarepenguin.co.uk/forums/showthread.php?tid=1587
of a subtitle file with 3 NULs in <head><metadata> in lines 7 and 10. The
PID is b0074513 and the file is
http://vod-sub-uk-live.bbcfmt.hs.llnwd.net/iplayer/subtitles/ng/modav/bUnknown-6dc3f082-7f2f-4410-818e-4e86a3e2736f_b000tm9z_1509582051026.xml

It seems the person reporting it is outside the UK, so dinky has said he or
she can't investigate, although a link to an online subtitles editor has
been given, and the next release will fail more gracefully.

If there are only one or two a month it is not a serious problem, especially
if they do get corrected eventually.

Best wishes
Richard
Graham Temple Personal
2017-12-12 12:06:55 UTC
Permalink
There are regular failures now in the delivery of subtitles files. get_iplayer
starts to download them but they never convert to SRT and are just deleted.
I followed this thread through, but for someone like me who just uses the PVR
and its settings, this is all way over my head when it comes to working out
other ways of obtaining the subtitles.

I am on the latest version.

Recent examples:

The League of Gentlemen series 1 E4 & E5
Ronny Chieng E7
Strictly Come Dancing E23 (week 12)

None of them have corrected the output since.

The message on the PVR is:

INFO: Downloaded: 390.29 MB (00:29:06) @ 37.62 Mb/s (hvfxsd1) [audio+video]
INFO: Converting to MP4
INFO: Tagging MP4
INFO: Downloading subtitles
ERROR: Failed to load subtitles:
:7: parser error : Char 0x0 out of allowed range
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag title line 6
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag metadata line 5
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag head line 3
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag tt line 2
LEAGUE OF GENTLEMEN 4 - BD313601
^
INFO: Downloading subtitles
ERROR: Failed to load subtitles:
:7: parser error : Char 0x0 out of allowed range
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag title line 6
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag metadata line 5
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag head line 3
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag tt line 2
LEAGUE OF GENTLEMEN 4 - BD313601
^
INFO: Downloading subtitles
ERROR: Failed to load subtitles:
:7: parser error : Char 0x0 out of allowed range
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag title line 6
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag metadata line 5
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag head line 3
LEAGUE OF GENTLEMEN 4 - BD313601
^
:7: parser error : Premature end of data in tag tt line 2
LEAGUE OF GENTLEMEN 4 - BD313601
^
ERROR: Subtitles conversion to SRT failed
ERROR: Use --subtitles-only to re-download
Recording complete

It is still a small % but frequent enough to be annoying if you rely on
subtitles to fully follow the speech.

GT

Graham Temple Personal
2017-12-12 12:34:42 UTC
Permalink
Correction, the SCD and Ronny Chieng ones now download, but not RV. It is
still annoying to keep checking back.

Ralph Corderoy
2017-12-12 14:59:57 UTC
Permalink
Hi Graham,
Post by Graham Temple Personal
:7: parser error : Char 0x0 out of allowed range
...
Post by Graham Temple Personal
It is still a small % but frequent enough to be annoying if you rely
on subtitles to fully follow the speech.
I haven't tried this, and I'm looking at 3.06 rather than 3.07, but if
you find these lines in your get_iplayer script,

sub ttml_to_srt {
my $ttml = shift;

and add after them

$ttml =~ y/\0//d;

then that will delete any ASCII NUL bytes from the obtained URL before
attempting to parse it as XML. Hopefully.
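
The same clean-up can also be applied to a saved copy of the raw file
before handing it to anything else; the filenames here are just
examples:

  perl -pe "tr/\0//d" Suspicion.ttxt > Suspicion_fixed.xml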

But really, report each occurrence to the BBC because they're shipping
invalid XML and they need to find out why they keep doing it and fix the
cause.

(When looking at this, I also noticed a --subsraw option that saves the
URL's content in .../foo.ttxt before attempting to XML-parse it.)
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy